CN111263224B - Video processing method and device and electronic equipment


Info

Publication number: CN111263224B
Authority: CN (China)
Prior art keywords: video, commodity, user, behavior, characteristic
Legal status: Active (granted)
Application number: CN201811457560.4A
Other languages: Chinese (zh)
Other versions: CN111263224A
Inventors: 沈飞, 杨浩, 姜文晖, 赵小伟, 刘扬, 文杰
Current assignee: Alibaba Group Holding Ltd
Original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd

Classifications

    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G06V20/40: Scenes; scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • H04N21/44213: Monitoring of end-user related data


Abstract

Embodiments of the invention provide a video processing method, a video processing device, and electronic equipment. The method comprises the following steps: acquiring a video stream that captures user behavior; searching the video stream for a video segment having a first characteristic; and determining, from the video segment, whether the user exhibits a first predetermined behavior corresponding to the first characteristic. With the method, device, and equipment provided by the embodiments of the invention, the user can be monitored and subsequent settlement processing can be performed according to the result of judging whether the first predetermined behavior exists. For example, assisted settlement or alarm functions can be performed according to whether the user exhibits missed-scan behavior, avoiding or reducing missed scans at checkout, reducing the economic loss of retail stores, and saving manpower and material resources. Because user behavior is analyzed through video processing, the user's shopping and payment processes are not disturbed, the processing efficiency of shopping and payment is effectively improved, and the user experience is improved.

Description

Video processing method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a video processing method and apparatus, and an electronic device.
Background
With the continuous development of the new retail business, improving efficiency and reducing cost in retail stores have become increasingly important; for example, improving users' shopping or settlement efficiency, or improving the efficiency of stocking goods, is a problem to be solved urgently.
For example, the self-service cash register terminal is used more and more widely as a primary means of improving the offline checkout experience and efficiency of users. Self-service cash registers are mostly placed near the store exit, so consumers can scan commodities and pay by themselves, avoiding queues and enjoying great convenience.
In the prior art, users often exhibit intentional or unintentional missed-scan behavior when using a self-service cash register, causing economic loss to retail stores. To address this problem, current self-service cash register terminals confirm the commodities scanned by the user by weighing, so the user must settle accounts by following prescribed steps; this strictly constrains user behavior, lowers settlement efficiency, and degrades the user experience.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video processing method and apparatus, and an electronic device, so as to reduce the cost of retail stores.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a video stream for shooting user behaviors;
determining a movement trajectory of a hand of a user in the video stream;
searching a video segment with a first characteristic in the video stream;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
In a second aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a video stream for shooting user behaviors;
searching a video segment with a first characteristic in the video stream;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
In a third aspect, an embodiment of the present invention provides a video processing method, including:
acquiring an offline video for shooting user behaviors;
searching a video clip with a first characteristic in the offline video;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
In a fourth aspect, an embodiment of the present invention provides a store management method, including:
acquiring a video stream for shooting behaviors of store managers;
searching a video segment with a second characteristic in the video stream;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
In a fifth aspect, an embodiment of the present invention provides a store management method, including:
acquiring an offline video for shooting the behavior of store managers;
searching a video clip with a second characteristic in the offline video;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
In a sixth aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring a video stream for shooting user behaviors;
a detection module for determining a movement trajectory of a user's hand in the video stream;
the searching module is used for searching a video clip with a first characteristic in the video stream;
a determination module configured to determine whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
In a seventh aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring a video stream for shooting user behaviors;
the searching module is used for searching a video clip with a first characteristic in the video stream;
a determining module, configured to determine whether a first predetermined behavior corresponding to the first feature occurs to the user according to the video clip.
In an eighth aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring an offline video for shooting user behaviors;
the searching module is used for searching video clips with first characteristics in the offline video;
a determining module, configured to determine whether a first predetermined behavior corresponding to the first feature occurs to the user according to the video clip.
In a ninth aspect, an embodiment of the present invention provides a store management apparatus, including:
the acquisition module is used for acquiring a video stream for shooting the behavior of store managers;
the searching module is used for searching the video clips with the second characteristics in the video stream;
and the determining module is used for determining whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip.
In a tenth aspect, an embodiment of the present invention provides a store management apparatus, including:
the acquisition module is used for acquiring an offline video for shooting the behavior of store managers;
the searching module is used for searching a video clip with a second characteristic in the offline video;
and the determining module is used for determining whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip.
In an eleventh aspect, an embodiment of the present invention provides an electronic device, including: a first memory and a first processor; the first memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the first processor, implement the video processing method of the first aspect.
In a twelfth aspect, an embodiment of the present invention provides an electronic device, including: a second memory and a second processor; the second memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the second processor, implement the video processing method of the second aspect.
In a thirteenth aspect, an embodiment of the present invention provides an electronic device, including: a third memory and a third processor; the third memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor, implement the video processing method of the third aspect.
In a fourteenth aspect, an embodiment of the present invention provides an electronic device, including: a fourth memory and a fourth processor; the fourth memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the fourth processor, implement the store management method of the fourth aspect.
In a fifteenth aspect, an embodiment of the present invention provides an electronic device, including: a fifth memory and a fifth processor; the fifth memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the fifth processor, implement the store management method of the fifth aspect.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the video processing method according to the first aspect when executed.
An embodiment of the present invention provides a computer storage medium, which is used to store a computer program, and the computer program enables a computer to implement the video processing method according to the second aspect when executed.
An embodiment of the present invention provides a computer storage medium, which is used to store a computer program, and the computer program enables a computer to implement the video processing method according to the third aspect when executed.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the store management method according to the fourth aspect when executed.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the store management method according to the fifth aspect when executed.
The video processing method, video processing device, and electronic equipment provided by the embodiments of the invention can acquire a video stream capturing user behavior, search the video stream for a video segment having a first characteristic, and determine from the video segment whether the user exhibits a first predetermined behavior corresponding to the first characteristic. The user can thus be monitored and subsequent settlement processing performed according to the result of judging whether the first predetermined behavior exists; for example, assisted settlement or alarm functions can be performed according to whether the user exhibits missed-scan behavior, avoiding or reducing missed scans at checkout, reducing the economic loss of retail stores, and saving manpower and material resources. Moreover, because user behavior is analyzed through video processing, the user's shopping and payment processes are not disturbed, the processing efficiency of shopping and payment is effectively improved, and the user experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2 is an interaction schematic diagram of a self-service cash register terminal according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a self-service cash register terminal according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a first embodiment of a video processing method according to the present invention;
fig. 5 is a schematic diagram illustrating a partition of a placing table according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of a second video processing method according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a method for determining missing scan logic in a commodity tracking process according to an embodiment of the present invention;
fig. 8 is a schematic flowchart of a third embodiment of a video processing method according to the present invention;
fig. 9 is a diagram illustrating a merging confidence according to an embodiment of the present invention;
fig. 10 is a schematic flowchart of a fourth embodiment of a video processing method according to the present invention;
fig. 11 is a schematic flowchart of a first store management method according to an embodiment of the present invention;
fig. 12 is a schematic flowchart of a second store management method according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of a first video processing apparatus according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a second video processing apparatus according to an embodiment of the present invention;
fig. 15 is a schematic structural diagram of a third embodiment of a video processing apparatus according to the present invention;
fig. 16 is a schematic structural diagram of a first store management apparatus according to an embodiment of the present invention;
fig. 17 is a schematic structural diagram of a second store management apparatus according to an embodiment of the present invention;
fig. 18 is a schematic structural diagram of a first electronic device according to an embodiment of the present invention;
fig. 19 is a schematic structural diagram of a second electronic device according to an embodiment of the present invention;
fig. 20 is a schematic structural diagram of a third electronic device according to an embodiment of the present invention;
fig. 21 is a schematic structural diagram of a fourth electronic device according to an embodiment of the present invention;
fig. 22 is a schematic structural diagram of a fifth electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and "a plurality of" generally means at least two, without excluding the case of at least one, unless the context clearly dictates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
The word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "upon determining", "in response to determining", "upon detecting (the stated condition or event)", or "in response to detecting (the stated condition or event)", depending on the context.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a commodity or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such commodity or system. Without further limitation, an element preceded by "comprises a" does not exclude the presence of additional identical elements in the commodity or system that comprises the element.
The embodiment of the invention provides a video processing method which can acquire a video stream for shooting user behaviors, search a video segment with a first characteristic in the video stream, and determine whether a first preset behavior corresponding to the first characteristic occurs to a user or not according to the video segment.
The first characteristic and the first predetermined behavior can be set according to actual needs. Optionally, the first predetermined behavior may be any behavior of the user in the store, for example theft during shopping, placing an article in the wrong location, or missed-scan behavior during checkout; accordingly, the first characteristic may be a characteristic indicating that the first predetermined behavior is suspected.
For example, when the first predetermined behavior is theft, the first characteristic may be a characteristic of suspected theft, such as a hand moving from a shelf to within a preset distance of the body. Whenever a characteristic of suspected theft is detected, whether the user is stealing can be judged from the video segment in which the characteristic appears.
There are many ways to determine whether the first predetermined behavior occurs based on the video segment. Optionally, the video segment may be detected by a machine learning model, and it is determined whether the first predetermined behavior occurs in the video segment.
In the embodiment of the invention, the video stream is captured in real time, while methods such as machine learning models may need to process shorter video segments when analyzing specific user behaviors. Therefore, the video segment suspected of containing the first predetermined behavior can first be located in the video stream via the first characteristic, and that segment is then further processed to determine whether the first predetermined behavior occurs.
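As an illustration of this two-stage idea, the following is a minimal sketch (the class and function names are illustrative assumptions, not taken from the patent): a cheap detector proposes candidate segments, and a heavier model is run only on those spans.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

@dataclass
class Segment:
    start: float  # start time in seconds within the stream
    end: float    # end time in seconds
    frames: list = field(default_factory=list)  # frames covering this span

def detect_predetermined_behavior(
        stream: Iterable,
        propose: Callable[[Iterable], List[Segment]],
        classify: Callable[[Segment], bool]) -> List[Segment]:
    """Stage 1: 'propose' cheaply flags segments showing the first
    characteristic; stage 2: 'classify' (e.g. a machine learning model)
    confirms whether the first predetermined behavior actually occurs."""
    candidates = propose(stream)
    return [seg for seg in candidates if classify(seg)]
```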
For convenience of description, the following describes in detail implementation procedures and principles of the embodiments of the present invention by taking the first predetermined behavior as an example of the missed scan behavior.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention. As shown in fig. 1, a user may select commodities to purchase in a store; commodities generally carry identifiers such as barcodes or two-dimensional codes. After selecting the commodities, the user can settle at the self-service cash register terminal, which is provided with a scanning device; the user scans each commodity's identifier with the scanning device to complete settlement.
Fig. 2 is an interaction schematic diagram of a self-service cash register terminal according to an embodiment of the present invention. As shown in fig. 2, after a user scans a commodity at the self-service cash register terminal, the self-service cash register terminal may send a scanning result to the server, and the server may query commodity information corresponding to the scanning result, such as a name and a price of the commodity, and send the commodity information to the self-service cash register terminal, and the commodity information is displayed to the user by the self-service cash register terminal.
After the user finishes scanning all the commodities, the self-service cash register terminal can calculate the settlement price, or the server can generate the settlement price according to the prices of the commodities, discount information, and the like, and send it to the terminal. The terminal displays the settlement price to the user and completes settlement according to the user's payment behavior, finishing the whole self-service settlement flow.
In the whole self-service checkout flow, the self-service cash register terminal can collect video streams when a user scans commodities and judges whether the user has scanning missing behaviors or not according to the video streams.
Fig. 3 is a schematic structural diagram of a self-service cash register terminal according to an embodiment of the present invention. As shown in fig. 3, the self-service checkout terminal may be provided with a display device, a scanning device, a placing table, a camera, and the like.
The display device displays commodity information, the settlement price, and other information such as the final amount to be paid. The placing table is used for placing commodities. The scanning device scans the commodity's identifier, such as a barcode or two-dimensional code. Optionally, the scanning device may be the scanner of a POS (Point of Sale) device, and the POS device may determine the corresponding commodity information according to the scanning result.
The camera is used for shooting self-service checkout behaviors of the user. In the self-service cash register terminal shown in fig. 3, the camera is arranged at the top, and in practical application, the camera may be arranged at any position capable of shooting the user checkout behavior, for example, the camera may be arranged opposite to the user or on the side of the user. When whether the user has the scanning missing behavior is detected by analyzing the video stream, the corresponding detection strategy can be adjusted according to the specific position of the camera.
The embodiment of the invention provides a method for shooting the checkout behavior of a user in the self-service checkout process of the user and processing the shot video stream so as to determine whether the user has the scanning missing behavior. Fig. 1 to fig. 3 show optional application scenarios and structures according to an embodiment of the present invention. It will be understood by those skilled in the art that the specific hardware architecture may be adjusted according to actual needs, as long as the detection of the user's missing scanning behavior through the video stream can be achieved.
For example, the functions of processing the video stream and determining whether the user has missed scanning may be implemented by a self-service cash register terminal, or may be implemented by a server. Optionally, the self-service cash register terminal may send the collected video stream to the server, and the server detects whether a scanning missing behavior occurs and returns a detection result; or the self-service cash register terminal can also send the video stream to other equipment such as a background monitoring terminal of a store for video processing.
The following describes an implementation process of a video processing method according to an embodiment of the present invention with reference to the following method embodiments and accompanying drawings. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 4 is a schematic flowchart of a video processing method according to a first embodiment of the present invention. The main execution body of the method in this embodiment may be any electronic device with a video processing function, and optionally, may be a self-service cash register terminal. As shown in fig. 4, the video processing method in this embodiment may include:
step 401, acquiring a video stream for shooting user behavior.
Step 402, finding a video segment with a first characteristic in the video stream.
Step 403, determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
In the embodiment of the present invention, a self-service cash register terminal is taken as an example for explanation. Those skilled in the art will appreciate that the principles and methods of implementing video processing with other devices are similar to self-service checkout terminals.
The first predetermined behavior may be a missing scanning behavior, and the video segment with the first characteristic may be a video segment with a suspected missing scanning behavior. For convenience of description, in the embodiment of the present invention, a suspected missed scanning behavior is recorded as a suspicious behavior, and a video segment with a first feature may specifically be a video segment with a suspicious behavior.
Specifically, the video stream may be processed in real time to determine whether it contains a video segment with suspicious behavior. In the embodiment of the invention, missed-scan behavior can take various specific forms; table 1 shows an example classification.
TABLE 1 Classification of missed-scan behavior (examples)
[Table 1 appears as an image in the original document; its two categories are described below.]
As shown in table 1, missed-scan behavior can be divided into two major categories: missed scans during code scanning, and direct bagging. A missed scan during code scanning means that the user performs a scanning action but the scan ultimately does not succeed, for subjective or objective reasons: for example, the user intentionally occludes the barcode, or the POS device responds slowly and fails to register the scan in time. Direct bagging means that the user moves the commodity directly to the scanned area without any scanning action.
Fig. 5 is a schematic diagram illustrating the partition of the placing table according to an embodiment of the present invention. As shown in fig. 5, viewed from above, the placing table may be divided into two areas: area A is the to-be-scanned area and area B is the scanned area. Before settlement, the user places the commodities in area A; during settlement, the user picks each commodity up from area A, completes code scanning through the POS device, and then places the scanned commodity in area B.
If the user moves a commodity directly to area B without scanning it, the movement is regarded as direct bagging. The commodity may be moved into area B from area A or from any other area; regardless of its initial position, as long as the commodity is moved into area B from outside area B, the movement can be regarded as direct bagging. That is, in fig. 5, both the movement represented by arrow 1 and the movement represented by arrow 2 can be regarded as direct bagging.
As described above, missed-scan behavior can be divided into missed scans during code scanning and direct bagging. Accordingly, whenever a code-scanning action occurs, or whenever a commodity is moved into the scanned area, a missed scan may have occurred, and the behavior is recorded as suspicious. Whether it is actually a missed scan can be further confirmed in combination with the scanning result of the POS device and/or a machine learning model.
Table 1 merely shows several common missed-scan behaviors as examples. In general, after the POS device successfully scans a commodity's identifier, a scanning result corresponding to the commodity is obtained; if the identifier is not scanned, no scanning result is obtained. It can be understood that an action may be considered suspicious if it should be accompanied by the acquisition of a scanning result (i.e., a scanning result must be obtained when the action occurs; otherwise the scan was missed). For example, when a user performs a code-scanning action, or moves a commodity from the to-be-scanned area to the scanned area, a scanning result should be obtained, otherwise the scan was missed; therefore, the user's code-scanning action, or the action of moving a commodity into the scanned area, may be regarded as a suspicious action.
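As a concrete sketch of the direct-bagging cue described above (coordinate conventions and names are assumptions for illustration), a commodity trajectory is flagged when it crosses into area B from outside:

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) centroid in image coordinates
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def in_box(p: Point, box: Box) -> bool:
    x, y = p
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

def entered_area_b(trajectory: List[Point], area_b: Box) -> bool:
    """True when consecutive centroids cross from outside area B into it,
    regardless of the starting position (arrow 1 or arrow 2 in fig. 5)."""
    return any(not in_box(prev, area_b) and in_box(cur, area_b)
               for prev, cur in zip(trajectory, trajectory[1:]))
```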
There are many methods for detecting video segments of suspicious behavior in a video stream. Alternatively, suspicious behavior in the video stream may be detected by the recognition model. Specifically, the recognition model may be trained through a sample, and suspicious behaviors in the video stream may be searched according to the trained model.
Specifically, the training samples may comprise a plurality of videos, each annotated with the start and end times of any suspicious behavior. The recognition model is trained on these samples; after training, the video to be detected is input into the recognition model, which detects the video segments of suspicious behavior.
Specifically, whether the missing scanning behavior occurs or not can be judged by combining the scanning result of the POS device and/or the machine learning model.
Optionally, if a scanning result is obtained within the start-stop time of the video segment of the suspicious behavior, it is determined that no missing scanning behavior occurs, and if a scanning result is not obtained, it is determined that a missing scanning behavior occurs.
Or, the video clip can be input into a machine learning model, and a detection result of whether the video clip belongs to the missed scanning behavior is obtained. The machine learning model may be trained over a large number of samples.
Or, the POS signal may be combined with the machine learning model, and first, whether a scanning result is obtained within the start-stop time of the video clip is judged, and if the scanning result is obtained, it is determined that no missing scanning occurs; if not, the video clip can be input into the machine learning model and further confirmed by the machine learning model.
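The combined POS-plus-model check might be sketched as follows (a simplified illustration; scan_times stands for the POS scan timestamps and classify for the optional machine learning model):

```python
from typing import Callable, List, Optional

def is_missed_scan(seg_start: float, seg_end: float,
                   scan_times: List[float],
                   classify: Optional[Callable[[float, float], bool]] = None) -> bool:
    """POS-first decision: a scan inside the segment's start-stop time
    rules out a missed scan; otherwise optionally defer to the model."""
    if any(seg_start <= t <= seg_end for t in scan_times):
        return False  # a scanning result was obtained in time
    if classify is None:
        return True   # no scan result and no model: flag directly
    return classify(seg_start, seg_end)  # let the model confirm
```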
The above method requires every suspicious video segment to be accompanied by a scanning result and does not consider the tracking process of the commodity; the logic is simple and easy to implement, but it can generate false alarms. To improve accuracy, the video segments can also be combined with the commodity's tracking process.
Optionally, determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip may include: after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process; and if the scanning result is not obtained, judging whether a first preset behavior occurs according to the video clip with the first characteristic.
Specifically, after the tracking process of a commodity is finished, if a video clip of a suspicious behavior appears in the tracking process, whether a scanning result of the commodity is obtained in the tracking process is judged; and if the scanning result is obtained in the tracking process of one commodity, determining that the scanning missing behavior does not occur in the tracking process of the commodity. Wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
The video segment of the suspicious behavior may be a segment in a complete product tracking process. For example, the moment of picking up the commodity and the moment of putting down the commodity of the user can be judged according to the hand of the user and the moving track of the commodity, and a complete commodity tracking process can be determined according to the moment of picking up the commodity and the moment of putting down the commodity, wherein the starting and ending time of the process is the moment of picking up the commodity and putting down the commodity. During a complete article tracking process, one or more video clips of suspicious activity may appear.
For example, suppose that after the user starts self-checkout, a commodity is picked up at the 0.5th second and moved toward the code scanning device; a code-scanning action is detected from the 1.5th to the 2.0th second; movement from outside area B into area B is detected from the 2.5th to the 3.0th second; and the commodity is detected being put down at the 4.0th second. The tracking process for this commodity then runs from the 0.5th to the 4.0th second, lasting 3.5 seconds, and contains two suspicious segments, each lasting 0.5 seconds: the code-scanning segment from the 1.5th to the 2.0th second, and the segment from the 2.5th to the 3.0th second in which the commodity moves into area B.
If a scanning result is obtained during the tracking process of the commodity, i.e., between the 0.5th and 4.0th second, the commodity is considered not to have been missed. If no scanning result is obtained, a missed scan may be deemed to have occurred, or whether one occurred may be further determined from the two suspicious segments, specifically by analyzing them with a machine learning model.
In summary, if the POS device does not scan the commodity's identifier during the tracking process of a commodity, the user may have missed the scan. In addition, if the identifier is scanned but the commodity information determined from the scan is inconsistent with the commodity information determined from the video stream, a missed scan can also be considered to have occurred; this prevents a cheating user from substituting a fake identifier for the real one and causing loss to the store.
For example, by processing the video stream, it is found that the commodity in the user's hand is a drink. However, if the scanning result is chewing gum, it means that the commodity information determined by the identifier is inconsistent with the commodity information detected by the video stream, and it can be considered that a scanning missing behavior has occurred.
After the user finishes scanning all the commodities, the commodities can be settled according to the user's missed-scan record. Specifically, if the user's missed-scan record satisfies a preset condition, the commodities scanned by the user can be settled, and the user may leave with the commodities after completing payment normally. If the preset condition is not satisfied, settlement of the commodities is not allowed. The preset condition can be set according to actual needs.
In an alternative embodiment, as long as the user is detected to have the scanning missing behavior, the settlement of the article is not allowed, and the settlement can be normally performed only if the user does not have the scanning missing behavior in the whole scanning process.
In another optional implementation, as long as the number of the user's missed scans is less than a certain value, the user is allowed to settle the scanned commodities. This provides fault tolerance for the video processing algorithm, prevents misjudgments from affecting the user's shopping experience, and keeps the shopping process smooth.
Correspondingly, the method in this embodiment may further include: responding to an operation event that the user confirms that the commodity is scanned completely, and counting the times of the user that the scanning missing behavior occurs; and if the times of the user missing scanning behavior are less than the preset times, the commodities scanned by the user are settled. The preset number of times can be set according to actual needs, and for example, can be 4 times.
The operation event that the user confirms that the scanning of the commodities is completed may refer to an operation that the user determines that all the commodities have been scanned through modes of clicking a screen, pressing a key, inputting voice, and the like, for example, a "completion" button may be displayed on the self-service cash register terminal, the user may click the "completion" button when the user finishes scanning all the commodities, and the self-service cash register terminal may settle the commodities scanned by the user in response to the click operation of the user.
And if the times of the user missing scanning behavior are not less than the preset times, the commodity is not allowed to be settled. In addition, a settlement prohibiting interface can be displayed, and/or warning information can be sent to the monitoring terminal.
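A minimal sketch of this settlement gate (the threshold value and the alert callback are illustrative assumptions):

```python
from typing import Callable

PRESET_LIMIT = 4  # example threshold from the text; configurable in practice

def on_scan_complete(missed_scan_count: int,
                     alert: Callable[[str], None]) -> bool:
    """Called when the user confirms all commodities are scanned.
    Returns True if the scanned commodities may be settled."""
    if missed_scan_count < PRESET_LIMIT:
        return True  # settle normally
    # Otherwise block settlement, show the prohibited interface,
    # and send warning information to the monitoring terminal.
    alert("Missed-scan behavior detected; settlement blocked.")
    return False
```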
Specifically, the settlement-prohibited interface prompts the user that settlement cannot proceed. Optionally, it may display messages such as "Missed-scan behavior detected; settlement is unavailable" or "Missed-scan behavior detected; please wait for a clerk to assist."
The monitoring terminal may be a background monitoring terminal and/or an on-site monitoring terminal. The on-site monitoring terminal may be any terminal carried by on-site monitoring personnel, for example a mobile phone or a wearable device such as a watch or smart bracelet; on-site monitoring personnel are staff, such as store clerks, who assist users with self-service checkout in the store. After receiving the warning information, the on-site monitoring terminal can push it to the on-site personnel and prompt them to handle it, for example by displaying or playing "Cash register terminal xx detected a missed scan; please go and handle it."
The background monitoring terminal is used by background monitoring personnel to monitor users' scanning behavior. Background monitoring personnel are staff who monitor in-store video; the background monitoring terminal may be any terminal with a video playing function, such as a mobile phone, tablet, computer, smart TV, or display. After receiving the warning information, the background monitoring terminal displays it to the background personnel, making it convenient to dispatch on-site personnel to handle the situation or to know the usage status of each self-service cash register terminal on the floor.
In practical application, the self-service cash register terminal can collect a video stream of the user's behavior while the user scans commodities, detect the user's behavior from the video stream, and judge whether missed-scan behavior occurred. The user is allowed to pay normally only when the behavior satisfies certain conditions, for example no missed scan occurred or the number of missed scans is smaller than the preset number; otherwise the user's payment can be blocked, preventing the merchant's loss caused by missed scans.
The embodiment of the invention uses the video stream to detect whether the user exhibits missed-scan behavior, which is a clear improvement over the prior-art method of settling with a weighing device.
In the prior art, the self-service cash register terminal is equipped with a weighing device and relies on weight checks for loss prevention: the weight corresponding to the scanned commodity is compared with the weight measured on the weighing device, and an alarm is raised if the weights differ. The weighing machine occupies considerable space, every commodity must be weighed, and the user experience is poor. The video processing method provided by the embodiment of the invention achieves loss prevention through video processing: the user perceives nothing, interference with the user during checkout is reduced, user experience is effectively improved, store space is saved, and the range of application is wider.
In the embodiments of the present invention, the missing scanning behavior is taken as an example for detailed description, and it can be understood by those skilled in the art that the missing scanning behavior may be replaced by any other first predetermined behavior, such as a theft behavior, a behavior of misplacing an article, and the like, and the specific processing procedure may refer to the processing procedure of the missing scanning behavior, and is not described herein again.
In summary, the video processing method provided in this embodiment can obtain a video stream capturing user behavior, search the video stream for a video segment having a first characteristic, and determine from the video segment whether the user exhibits a first predetermined behavior corresponding to the first characteristic. The user can thus be monitored and subsequent settlement processing performed according to the result of judging whether the first predetermined behavior exists; for example, assisted settlement or alarm functions can be performed according to whether the user exhibits missed-scan behavior, avoiding or reducing missed scans at checkout, reducing the economic loss of retail stores, and saving manpower and material resources. Because user behavior is analyzed through video processing, the user's shopping and checkout processes are not disturbed, the processing efficiency of shopping and checkout is effectively improved, and the user experience is improved.
Fig. 6 is a schematic flowchart of a second embodiment of a video processing method according to the present invention. On the basis of the technical solutions provided by the other embodiments, in order to improve the algorithm accuracy, after the video stream is acquired, the present embodiment detects whether there is a video segment with a suspicious behavior in the video stream through multiple detection modes. As shown in fig. 6, the video processing method in this embodiment may include:
step 601, acquiring a video stream for shooting user behaviors.
Step 602, the video streams are respectively input to a plurality of detection modules, and video segments with the first characteristics in the video streams are searched.
Step 603, determining whether a first predetermined behavior corresponding to the first characteristic occurs according to the searched video clip.
Still taking the first predetermined behavior as the miss-scanning behavior and the video segment with the first characteristic as the video segment with the suspicious behavior as an example, different detection modules use different detection methods to search for the video segment with the suspicious behavior in the video stream. The detection module in the embodiment of the invention can be any module capable of detecting the user behavior.
Optionally, the plurality of detection modules may include at least two of: the device comprises a track detection module, an optical flow detection module and a segmentation detection module. The track detection module, the optical flow detection module and the segmentation detection module respectively realize the detection of the user behaviors through methods of hand tracks, optical flows, segmented video streams and the like.
Optionally, inputting the video stream to a trajectory detection module, and searching for a video segment of a suspicious behavior in the video stream may include: detecting position information of hands and/or commodities in each frame of image of the video stream; determining the motion trail of the hand and/or the commodity according to the position information of the hand and/or the commodity in each frame of image; and searching the video clip of the suspicious behavior according to the motion trail of the hand and/or the commodity.
Optionally, inputting the video stream to an optical flow detection module, and searching for a video segment of a suspicious behavior in the video stream may include: detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm; searching a video segment of a suspicious behavior according to the motion trail of the moving target; wherein the moving target comprises a user's hand and/or merchandise.
Optionally, inputting the video stream to a segmentation detection module, and searching for a video segment of a suspicious behavior in the video stream may include: acquiring a video with preset duration in the video stream; and searching the video segments of the suspicious behaviors in the video with the preset duration.
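Pooling candidates from the detection modules can be as simple as the sketch below (the module objects and their detect() interface are assumptions for illustration):

```python
from typing import Iterable, List, Tuple

def find_suspicious_segments(stream, modules: Iterable) -> List[Tuple[float, float]]:
    """Run each detection module (trajectory, optical flow, segmentation)
    over the same stream and pool all candidate (start, end) spans."""
    candidates: List[Tuple[float, float]] = []
    for module in modules:
        candidates.extend(module.detect(stream))  # hypothetical interface
    return sorted(candidates)  # order candidates by start time
```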
Specifically, after the video segments of one or more suspicious behaviors are found through the plurality of modules, whether missing scanning occurs or not can be determined according to the video segments of one or more suspicious behaviors found.
Optionally, after the tracking process of a commodity is finished, if a video clip of a suspicious behavior appears in the tracking process, it may be determined whether a scanning result of the commodity is obtained in the tracking process; if the scanning result is not obtained, judging whether a missing scanning behavior occurs or not according to the video segment of the suspicious behavior; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
The tracking process of the commodity can be determined by the position information of the commodity and the hand in the video stream. Optionally, the position information of the commodity and the hand in the video stream may be detected, the motion tracks of the commodity and the hand may be determined according to the position information of the commodity and the hand, and whether the commodity is held in the hand may be determined according to the motion tracks of the commodity and the hand.
Specifically, if the position of the commodity coincides with or is close to the position of the hand, and the movement locus is similar, the commodity can be considered to be held in the hand. In other alternative implementations, the item may be considered to be held in the hand as long as the area in which the item is located overlaps the area in which the hand is located.
When the commodity is separated from the hand, the hand can be considered to put down the commodity, and the tracking process is finished. Optionally, after it is determined that the product is held in the hand, if the time for detecting an empty hand (i.e. no product is held in the hand) exceeds a preset time, it is determined that the tracking process of the product is finished. If the empty hand is detected, but the preset time is not exceeded, the tracking process is not considered to be finished, misjudgment is prevented, and the detection accuracy is improved.
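A minimal sketch of these two checks, with assumed names and an assumed timeout value:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def held_in_hand(hand: Box, item: Box) -> bool:
    """The commodity counts as held when its box overlaps the hand box."""
    return (hand[0] < item[2] and item[0] < hand[2]
            and hand[1] < item[3] and item[1] < hand[3])

def tracking_finished(empty_hand_seconds: float,
                      preset_seconds: float = 1.0) -> bool:
    """Tracking ends only after the hand stays empty longer than a preset
    time, so a single misdetected empty-hand frame does not end it early."""
    return empty_hand_seconds > preset_seconds
```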
In the above-described method for detecting the commodity tracking process, it may not be necessary to detect which commodity is in the user's hand: as long as the hand is detected to be empty for a sufficiently long time, the user is considered to have put the commodity down, i.e., the tracking process of the previous commodity has ended.
Alternatively, the specific type of the commodity may be detected, for example, whether the commodity is a drink or a chewing gum is detected, and if the commodity is detected to be changed in the hand of the user, the tracking process of the previous commodity is ended.
In the embodiment of the invention, the target tracking algorithm can also be adopted to detect the tracking process of the commodity, the accuracy and the efficiency of different algorithms can be different, and the algorithms can be selected according to requirements in practical application.
After the tracking process of the commodity is determined, if a plurality of suspicious video segments exist within it, whether a missed scan occurred can be judged from the last video segment alone. Alternatively, video segments overlapping the last segment can be found and merged with it, and whether a missed scan occurred is judged from the merged segment.
The last video clip in the embodiment of the present invention refers to a video clip with the end time closest to the end time in the commodity tracking process.
Alternatively, it may be determined by a machine learning model whether the behavior in the video segment belongs to a missed scan behavior.
Fig. 7 is a schematic diagram of a method for determining a missing scan logic in a commodity tracking process according to an embodiment of the present invention. As shown in fig. 7, after the tracking process of a commodity is determined, it may be determined whether a video segment of a suspicious behavior exists in the tracking process, and if the video segment does not exist, it is determined that a missing scanning behavior does not occur in the tracking process.
And if video segments of suspicious behaviors appear in the tracking process, judging whether a scanning result is obtained in the tracking process. And if the scanning result is obtained in the tracking process of one commodity, determining that the scanning missing behavior does not occur in the tracking process of the commodity.
In the embodiment of the invention, as long as a scanning result is obtained at least once during the tracking process, the tracking process can be considered free of missed scans; if no scanning result is obtained at all, the last video clip in the tracking process can be verified through the machine learning model.
Optionally, before verifying the last video clip, if another video clip overlaps the last video clip, the two are merged, the merged clip is input to the machine learning model, and it is determined whether the behavior in the clip is a missed-scan behavior. If the last video clip does not overlap any other video clip, it is input to the machine learning model directly, and it is determined whether the behavior in the clip is a missed-scan behavior.
If the behavior in the last video clip is determined to be the scanning missing behavior, the scanning missing behavior is indicated in the commodity tracking process; and if the behavior in the last video clip is not the scanning missing behavior, determining that the scanning missing behavior does not occur in the commodity tracking process.
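The branching logic of Fig. 7 can be summarized in a short Python sketch; the Segment type and the classify callback (standing in for the machine learning model) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds within the video stream
    end: float

def missed_scan_in_tracking(segments, scan_result_obtained, classify):
    """Decision logic of Fig. 7. `segments` are the suspicious clips in
    one tracking process; `classify(clip)` returns True for a missed scan."""
    if not segments:
        return False              # no suspicious clip: no missed scan
    if scan_result_obtained:
        return False              # a scan result was obtained: no missed scan
    last = max(segments, key=lambda s: s.end)
    for s in segments:            # merge clips overlapping the last one
        if s is not last and s.end > last.start and s.start < last.end:
            last = Segment(min(last.start, s.start), max(last.end, s.end))
    return classify(last)         # machine learning model verdict
```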
The following example illustrates this. The tracking process of a commodity lasts 4 seconds, from second 1.5 to second 5.5 of the video stream, and three video clips of suspicious behavior are detected within those 4 seconds: the first from second 2.0 to second 2.4, the second from second 3.0 to second 3.5, and the third from second 3.3 to second 3.6.
Since video clips of suspicious behavior exist in the tracking process, it can be further determined whether a scanning result was obtained during the tracking process. If a scanning result was obtained between second 1.5 and second 5.5 of the video stream, it can be determined that no missed-scan behavior occurred in the tracking process of the commodity.
If no scanning result was obtained, the video clip of suspicious behavior can be input to a machine learning model for further confirmation. Continuing the example, the last video clip is the third one, and since the second and third clips partially overlap, they can be merged into a video clip from second 3.0 to second 3.6.
The merged clip from second 3.0 to second 3.6 of the video stream is then input into the machine learning model to determine whether the behavior is a missed scan; if so, a missed-scan behavior is considered to have occurred during the commodity tracking process, and otherwise it is not.
When multiple video clips of suspicious behavior exist in the tracking process of the commodity, detecting only the last video clip (or the merged last video clip) improves the processing efficiency of the video stream.
In other alternative embodiments, all video segments of suspicious behaviors in the tracking process may also be input to the machine learning model for detection, so as to improve the accuracy of detection.
For concepts or methods not explicitly described in this embodiment, reference may be made to the descriptions in other embodiments.
In conclusion, the video processing method of this embodiment searches for video clips of suspicious behavior through several detection modules jointly, which effectively improves the accuracy of the algorithm. In addition, when no scanning result is obtained during the tracking process of a commodity, the video clip can be input into the machine learning model for missed-scan detection, so that whether a missed-scan behavior occurred can be confirmed from the retrieved clip, improving both the processing efficiency and the accuracy of video-stream analysis.
In the technical solutions provided by the embodiments of the present invention, a specific implementation of determining through a machine learning model whether the behavior in a video clip is a missed-scan behavior may include: determining, through the machine learning model, the confidence that the video clip of suspicious behavior belongs to a missed-scan behavior; and judging whether a missed-scan behavior occurred according to that confidence.
Specifically, the output of the machine learning model may be the confidence that the input video clip belongs to a missed-scan behavior; if the confidence exceeds a preset threshold, a missed scan is considered to have occurred. For example, with a threshold of 0.6: if inputting the last video clip of the commodity tracking process into the model yields a confidence of 0.3, the behavior in the clip has only a 30% probability of being a missed scan, below the 0.6 threshold, so no missed-scan behavior occurred in the whole tracking process; if the confidence is 0.8, the behavior has an 80% probability of being a missed scan, and a missed-scan behavior can be considered to have occurred.
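A minimal sketch of this thresholding step, using the illustrative values from the text:

```python
THRESHOLD = 0.6  # preset threshold from the example above

def is_missed_scan(confidence, threshold=THRESHOLD):
    """confidence: model output for the last video clip of a tracking process."""
    return confidence > threshold

print(is_missed_scan(0.3))  # False: 30% probability, below the threshold
print(is_missed_scan(0.8))  # True: 80% probability, missed scan assumed
```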
In other optional embodiments, judging whether a missed-scan behavior occurred according to the video clips of suspicious behavior may include: if multiple video clips of suspicious behavior exist in the tracking process, determining through a machine learning model the confidence that each clip belongs to a missed-scan behavior; computing a weighted sum of the confidences of the clips; and determining that a missed-scan behavior occurred if the weighted sum exceeds a preset threshold.
As described above, different detection methods may be used to process the video stream and find the video clips of suspicious behavior in it, and clips found by different detection methods may carry different weights. For example, if clips are found by algorithm A and algorithm B, and algorithm A is the more accurate of the two, clips found by algorithm A may be given a higher weight and clips found by algorithm B a lower one. Of course, the weights may also be set according to other policies; for example, all clips may be given the same weight.
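A sketch of this weighted-sum decision, assuming per-method weights and a threshold (all values illustrative):

```python
WEIGHTS = {"algorithm_A": 0.7, "algorithm_B": 0.3}  # A assumed more accurate
THRESHOLD = 0.6

def weighted_missed_scan(clips):
    """clips: list of (detection_method, confidence) pairs for all
    suspicious video clips found in one tracking process."""
    score = sum(WEIGHTS[method] * conf for method, conf in clips)
    return score > THRESHOLD  # missed scan if the weighted sum is large enough
```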
In the technical solutions provided by the embodiments of the present invention, determining, by a machine learning model, a confidence that a video segment of a suspicious behavior belongs to a missed scan behavior may include: determining a corresponding machine learning model according to the type of the video segment of the suspicious behavior; and inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient of the video clip belonging to the missing scanning behavior.
In an alternative embodiment, the video clips may be classified according to Table 1. Optionally, a video clip may be divided into one of two types, code-scanning missed-scan behavior or direct-bagging behavior, or into finer types: block code scanning, back code scanning, scanning too fast, A to B, other zones to B, etc. When a video clip of suspicious behavior is found in step 602, the type of the found clip may be determined.
Accordingly, machine learning models can also be divided into corresponding types: a model for confirming code-scanning missed-scan behavior and a model for confirming direct-bagging behavior, or, more finely, multiple types. Each machine learning model can be trained on samples of the corresponding type. When a video clip needs to be confirmed as a missed scan or not, the model of the corresponding type can be used to process it.
For example, a plurality of video segments of suspicious behaviors are detected in the tracking process of the commodity, and if the last video segment is a video segment of a code scanning missing scanning behavior, a machine learning model for identifying the code scanning missing scanning behavior can be adopted to determine whether the video segment is the missing scanning behavior; if the last video clip is the video clip of the direct bagging behavior, a machine learning model for identifying the direct bagging behavior can be adopted to determine whether the video clip is the missing scanning behavior.
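A sketch of this per-type model dispatch; the type labels and the model dictionary are hypothetical names for illustration.

```python
def missed_scan_confidence(clip_type, clip_tensor, models):
    """models: dict mapping a clip type (e.g. 'code_scan_missed' or
    'direct_bagging') to the DNN trained on samples of that type."""
    model = models[clip_type]      # pick the model matching the clip's type
    return model(clip_tensor)      # confidence that the clip is a missed scan
```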
The implementation principles of different types of machine learning models may be similar; for example, all may be implemented with a DNN (Deep Neural Network), differing only in their training samples. Training each model on its own type of behavior makes the training more effective and improves the accuracy of detecting whether video clips of different types are missed-scan behaviors.
In other embodiments, other classification methods may be used. For example, video clips can be divided into three types: video clips obtained by the trajectory detection module, video clips obtained by the optical flow detection module, video clips obtained by the segmentation detection module, and the like.
Fig. 8 is a schematic flowchart of a third embodiment of a video processing method according to the present invention. In this embodiment, a video segment that meets the requirement in the video stream is searched by detecting the movement track of the hand. As shown in fig. 8, the video processing method in this embodiment may include:
Step 801, acquiring a video stream for shooting user behaviors.
Step 802, determining a movement track of a user's hand in the video stream.
Specifically, the position information of the hand of the user in each frame image of the video stream may be detected, and the motion trajectory of the hand may be determined according to the position information of the hand in each frame image.
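A minimal sketch of building the trajectory from per-frame detections; detect_hand stands in for whatever per-frame hand detector is used and is an assumption.

```python
def hand_trajectory(frames, detect_hand):
    """detect_hand(image) -> (x, y) centre of the hand, or None if absent.
    Returns the motion trajectory as (frame index, position) pairs."""
    trajectory = []
    for index, image in enumerate(frames):
        position = detect_hand(image)
        if position is not None:
            trajectory.append((index, position))
    return trajectory
```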
Step 803, finding the video segments with the first characteristics in the video stream.
Step 804, determining whether a first preset behavior corresponding to the first characteristic occurs according to the video clip.
Optionally, searching for a video clip with the first characteristic in the video stream may include any one of the following: determining that a video clip with the first characteristic appears if the user's hand enters the code-scanned area from the non-code-scanned area; determining that such a clip appears if the user's hand enters the code-scanned area from the non-code-scanned area and the time interval since it last entered the code-scanned area exceeds a preset interval; or determining that such a clip appears if the user's hand enters the code-scanned area from the non-code-scanned area and the farthest distance between the hand and the code-scanned area since the hand last left it exceeds a preset distance. The following description takes the missed-scan behavior as an example.
Specifically, through the motion track of the hand of the user, it can be determined whether the hand of the user enters the code-scanned area from the non-code-scanned area. The code-scanned area may be the area B in fig. 5, and the code-non-scanned area may refer to any area other than the area B, may be the area a, or may be other areas other than the area a and B.
If the hand of the user enters the code-scanned area from the non-code-scanned area, the video segment with suspicious behavior can be determined. The video segments of the suspicious behavior may be video segments within a period of time before and after a time when the video segments enter the code-scanned area.
Optionally, the video segments of the suspicious behavior may be video segments within a first preset time period before entering the code-scanned area and a second preset time period after entering the code-scanned area.
Assuming that the first preset time period and the second preset time period are both t0, then if the user's hand enters the code-scanned area from the non-code-scanned area at time T, the video within [T − t0, T + t0] can be regarded as a video clip of suspicious behavior. A more intuitive example: if both preset time periods are 1 second and the user's hand enters the code-scanned area at the 15th second of the video stream, the corresponding video clip of suspicious behavior is the video from the 14th to the 16th second.
If the user's hand enters the scanned area from the non-scanned area several times in the video stream, a plurality of corresponding video segments can be found.
Optionally, to avoid false alarms caused by jitter of the user's hand, one suspicious behavior is counted when the hand enters the code-scanned area; if the hand leaves the code-scanned area and then enters again, but the time interval between the two entries is short or the hand moved only a short distance in between, no new suspicious behavior is counted.
That is, when the moving distance and/or moving time of the hand at the boundary of the scanned code region is short, it may be considered as the shaking behavior of the hand, not the behavior of entering the scanned code region for the second time.
In an optional implementation manner, searching for a video segment with suspicious behavior according to the motion trajectory of the hand of the user may include: and if the hand of the user enters the scanned code region from the non-scanned code region and the time interval between the hand of the user and the time interval of entering the scanned code region last time is larger than the preset interval, determining the video segment with suspicious behaviors.
When the fact that the hand of the user enters the code-scanned area from the non-code-scanned area is detected, if the time interval from the last entry of the hand of the user into the code-scanned area is smaller than the preset interval, the current behavior is not considered to belong to suspicious behavior. The preset interval may be 1 second. For example, if it is detected that the hand of the user enters the code-scanned area in 15.5 seconds, then rapidly leaves and re-enters the code-scanned area in 15.8 seconds, the hand of the user can be considered to have a jitter behavior at the boundary of the code-scanned area between 15.5 seconds and 15.8 seconds, and the two entries are only counted as one suspicious behavior, but not two suspicious behaviors.
In another optional implementation, searching for a video clip of suspicious behavior according to the motion trajectory of the user's hand may include: determining a video clip of suspicious behavior if the user's hand enters the code-scanned area from the non-code-scanned area and the farthest distance between the hand and the code-scanned area since the hand last left it exceeds a preset distance.
When the user's hand is detected entering the code-scanned area from the non-code-scanned area, if the hand did not move far after last leaving the area, i.e., the farthest distance between the hand and the code-scanned area stayed below the preset distance, the current behavior is not considered suspicious. The preset distance may be, for example, 5 centimeters. For instance, if the hand is detected entering the code-scanned area at second 10, leaving again, and re-entering at second 12, while from second 10 to second 12 the distance between the hand and the code-scanned area never exceeded 5 centimeters, the hand can be considered to have lingered at the boundary of the code-scanned area, and the two entries count as only one suspicious behavior rather than two.
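A sketch combining the two debouncing rules above (time interval and lingering distance); the constants take the illustrative values from the text, and all names are assumptions.

```python
PRESET_INTERVAL = 1.0   # seconds (illustrative)
PRESET_DISTANCE = 5.0   # centimetres (illustrative)

def counts_as_new_suspicious_entry(entry_time, last_entry_time,
                                   max_distance_since_leaving):
    """Returns True only if this entry into the code-scanned area counts
    as a new suspicious behavior rather than jitter or lingering."""
    if last_entry_time is not None:
        if entry_time - last_entry_time <= PRESET_INTERVAL:
            return False  # re-entered too quickly: boundary jitter
        if max_distance_since_leaving <= PRESET_DISTANCE:
            return False  # never moved far from the area: lingering
    return True
```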
The above method judges whether a video clip of suspicious behavior appears by taking the user's hand as the target. Similarly, a video clip of suspicious behavior may be determined by taking the commodity as the target. The implementation principle and process with the commodity as the target are similar to those with the hand; replacing the hand with the commodity in the method above yields the commodity-targeted implementation.
Further, in order to increase the detection accuracy, the hand of the user and the commodity can be used as a target, and whether a video clip of a suspicious behavior appears or not can be judged together through the track of the hand and the track of the commodity.
Optionally, searching for a video clip with the first characteristic in the video stream may include any one of the following: determining that a video clip with the first characteristic appears if the user's hand and the commodity enter the code-scanned area from the non-code-scanned area; determining that such a clip appears if the hand and the commodity enter the code-scanned area from the non-code-scanned area and the time interval since they last entered it exceeds a preset interval; or determining that such a clip appears if the hand and the commodity enter the code-scanned area from the non-code-scanned area and the farthest distance from the code-scanned area since they last left it exceeds a preset distance. The following description again takes the missed-scan behavior as an example.
The hand and the commodity enter the code scanning area, which may be a hand holding the commodity to enter the code scanning area. The video clips with the first characteristic can be video clips in a first preset time period before the hand-held commodity enters the code scanning area and a second preset time period after the hand-held commodity enters the code scanning area.
Under the condition that whether a video clip of a suspicious behavior appears or not is judged by using a hand and a commodity of a user as a detection target, if only the hand enters a code scanned area and no commodity enters the code scanned area, the suspicious behavior is not considered to appear, and only if the hand and the commodity enter the code scanned area simultaneously, the suspicious behavior is considered to appear.
After the video segments are found, the video segments of the suspicious behaviors can be detected through a machine learning model, and whether the missing scanning behaviors occur or not is determined.
Alternatively, the video segments may be processed by DNN. The DNN recognition rate is high, and whether the video clip belongs to the missed scanning behavior can be accurately determined.
Optionally, after the video segment of the suspicious behavior is acquired, it may be directly determined by using the DNN whether the video segment belongs to the missed scanning behavior, or the method in this embodiment may be used in combination with the methods in the foregoing embodiments. For example, the tracking process of the goods and the obtained scanning result can be combined to comprehensively determine whether to input the video clip into the DNN for further confirmation.
Optionally, in the tracking process of a commodity, video clips of a plurality of suspicious behaviors may be detected, for example, a user may hold the commodity to go in and out of a code scanned area, but as long as a scanning result is obtained at least once in the tracking process of the commodity, it may be considered that a missed scanning behavior does not occur, and if a scanning result is not obtained, the last video clip may be input into the DNN for confirmation.
Of course, it is also possible to detect the video stream by using multiple detection methods simultaneously in combination with other detection methods, and search for video segments of suspicious behaviors therein.
In summary, the video processing method provided in this embodiment can detect the motion trajectory of the user's hand in the video stream, analyze the state of the hand from that trajectory, and thereby determine whether a video clip with the first characteristic appears. Combined with a machine learning model, it can then determine whether the user exhibits a first predetermined behavior, such as a missed-scan behavior. This effectively addresses commodity loss prevention from the visual dimension and improves the processing efficiency of user checkout, without restricting the user's operation behavior, thereby improving the user's operation experience.
In addition to finding the video segment according to the motion track of the hand as described above, the video segment may be found by other methods. For example, the video segment with the first feature in the video stream can be found through the optical flow or through the video with the preset duration.
An embodiment of the present invention further provides a video processing method, including: acquiring a video stream for shooting user behaviors; detecting the motion trail of a moving target in a video stream by adopting an optical flow tracking algorithm; searching a video clip with a first characteristic according to the motion track of the moving target; determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
Among them, the Optical Flow tracking algorithm is an important current method for analyzing moving images. Its concept was first proposed by James J. Gibson in the 1940s: when an object moves, the brightness pattern of its corresponding points on the image moves as well, and this Apparent Motion of the image brightness pattern is the optical flow.
Optionally, in this embodiment, a fast optical flow method based on Dense Inverse Search (DIS) may be adopted to compute the moving targets in the video stream. After a moving target in the video stream is detected by the optical flow tracking algorithm, it can be treated as the user's hand and/or a commodity, so that video clips of suspicious behavior are searched according to the motion trajectory of the hand and/or the commodity.
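As an illustration, OpenCV ships a DIS optical flow implementation; a minimal sketch assuming OpenCV 4.x and a hypothetical input file:

```python
import cv2

cap = cv2.VideoCapture("checkout.mp4")  # hypothetical video stream
dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_FAST)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = dis.calc(prev_gray, gray, None)  # H x W x 2 displacement field
    # flow[..., 1] is the vertical (Y) component used below to model
    # the approach/leave motion of the target relative to the scanner
    prev_gray = gray
```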
Specifically, if it is detected that the moving object leaves after entering the preset area, the video segment with suspicious behavior can be determined. The preset area may be a code scanned area, or the preset area may be an area closer to the scanning device. Taking the latter as an example, the preset region may be a region within a preset range of the scanning device. The scanning device is used for acquiring a corresponding scanning result when a user scans a commodity.
Optionally, the preset area may be an area that is smaller than a preset distance value from the scanning device in the vertical direction. Specifically, the optical flow may be analyzed, and the motion trajectory in the Y direction may be modeled and decomposed into two motion modes, i.e., approaching and leaving, so as to determine whether suspicious behavior exists. When the moving object approaches the scanning device and leaves, a suspicious behavior can be considered to occur.
Of course, approaching and departing in the X direction may also be used to confirm that a suspicious behavior occurs, or the X direction and the Y direction are combined, and if the moving object enters a preset range and departs in the X direction and the Y direction, it is indicated that a suspicious behavior occurs.
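A sketch of decomposing the target's Y trajectory into approach and leave phases relative to the scanning device; thresholds and names are illustrative.

```python
def approach_then_leave(y_positions, y_scanner, near_threshold):
    """y_positions: per-frame Y coordinate of the moving target.
    Returns True once the target has come within near_threshold of the
    scanning device and then moved away again (approach + leave)."""
    was_near = False
    for y in y_positions:
        if abs(y - y_scanner) < near_threshold:
            was_near = True   # approach phase: target is near the scanner
        elif was_near:
            return True       # was near, now away: a leave follows an approach
    return False
```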
In addition, after the moving object is detected to approach and leave, the moving object is identified, if the moving object is a commodity, suspicious behaviors are determined to occur, and if the moving object is a non-commodity, such as a mobile phone or a bag, the suspicious behaviors are not considered to occur.
Correspondingly, searching for a video segment of a suspicious behavior according to the motion trajectory of the moving object may include: if the moving target is detected to leave after entering a preset area, identifying the moving target, and judging whether the moving target comprises a commodity or not; and if the moving target comprises a commodity, determining a video clip with suspicious behaviors.
The start and stop times of suspicious activity may be determined based on the time of entering and/or leaving a preset area. Optionally, the start time of the video segment of the suspicious activity may be a time of entering the preset area, and the end time of the video segment of the suspicious activity may be a time of leaving the preset area.
Alternatively, the adjustment may be performed according to actual needs, for example, the start time of the video segment of the suspicious activity may be N seconds before the time of entering the preset region, and the end time of the video segment of the suspicious activity may be M seconds after the time of leaving the preset region, where N and M are both real numbers.
Optionally, after determining the video segment in which the suspicious behavior occurs, the accurate start time and the accurate end time of the video segment may also be determined according to a machine learning model. In particular, video segments of suspicious behavior may be input to a machine learning model from which accurate start and end times are determined.
Specifically, after the video segment of the suspicious behavior is acquired, whether the video segment belongs to the missed scanning behavior may be determined directly through DNN, or the method in this embodiment may be used in combination with the methods in other embodiments. For example, the tracking process of the goods and the obtained scanning result can be combined to comprehensively determine whether to input the video clip into the DNN for further confirmation.
Optionally, in the tracking process of a commodity, video clips of a plurality of suspicious behaviors may be detected, for example, a user may hold the commodity to go in and out of a code scanned area, but as long as a scanning result is obtained in the tracking process of the commodity, it can be considered that no missing scanning behavior occurs, and if a scanning result is not obtained, the last video clip may be input into the DNN for confirmation.
Of course, it is also possible to detect the video stream by using multiple detection methods simultaneously in combination with other detection methods, and search for video segments of suspicious behaviors therein.
Current mainstream behavior analysis adopts offline analysis: the start-stop time and type of an action in a video can be predicted only after the complete video segment containing the action has been seen, which is unsuitable when real-time early warning is needed. This embodiment adopts a simple, efficient optical-flow-based solution, so that missed-scan actions can be predicted and judged in real time and recognized at the first moment they occur.
In the video processing method of this embodiment, an optical flow tracking algorithm can detect the motion trajectory of a moving target in the video stream, a video clip with the first characteristic can be found from that trajectory, and, combined with a machine learning model, it can be confirmed whether the user exhibits a first predetermined behavior such as a missed scan. This addresses commodity loss prevention from the visual dimension and improves checkout processing efficiency without restricting the user's operation behavior, thereby improving the operation experience; moreover, because clips with the first characteristic can be detected promptly based on optical flow, the requirements of real-time monitoring and early warning are met.
In addition to determining the motion trajectory of the user's hand and/or the merchandise from the video stream as described above, the motion trajectory of the user's hand and/or the merchandise may be obtained in other ways. For example, video segments of suspicious behaviors can be detected through the sensing device, so that the detection accuracy can be effectively improved.
Specifically, an embodiment of the present invention further provides a video processing method, including: acquiring a sensing signal sent by a sensing device; determining the motion track of the hand of the user according to the sensing signal; searching a video clip with a first characteristic in a video stream for shooting user behaviors according to the motion track of the hand; determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
Optionally, searching for a video segment with a first characteristic in a video stream capturing user behavior according to the motion trajectory of the hand may include: judging whether the hand of the user enters a preset area and leaves; and if so, determining that the video clip with the first characteristic appears in the video stream for shooting the user behavior.
Wherein the sensing means may be any type of device capable of detecting the position of a hand. The sensing signal may be any signal capable of representing a change in the position of the hand.
In an alternative implementation, the sensing device may be a distance sensor that can detect the distance between the surrounding obstacle and itself. Accordingly, the sensing signal may be a distance between the hand and the distance sensor. After the distance between the hand and the distance sensor is obtained, the motion track of the hand can be determined according to the distance.
Alternatively, the distance sensor may be provided at a position capable of detecting whether the user's hand enters or leaves the preset area. For example, the distance sensor may be disposed beside the scanning device, and when the hand of the user approaches or leaves the scanning device, the distance between the distance sensor and the user becomes smaller and larger.
In this way, the sensing signal detected by the distance sensor can be used to determine whether the condition triggering a video clip of suspicious behavior is met. Specifically, if the sensing signal changes from greater than the preset value to smaller than it, and then becomes greater than the preset value again, an approach-then-leave process has occurred, and at this point a video clip of suspicious behavior can be considered to have appeared.
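A sketch of this trigger condition on the distance-sensor signal; the preset value is an assumed illustration.

```python
PRESET_DISTANCE = 20.0  # preset value for the sensing signal (illustrative)

def approach_leave_events(distances):
    """distances: stream of readings from the distance sensor.
    Yields the index at which an approach-then-leave cycle completes,
    i.e. the signal drops below the preset value and rises above it again."""
    inside = False
    for i, d in enumerate(distances):
        if not inside and d < PRESET_DISTANCE:
            inside = True    # hand approached: signal fell below the preset value
        elif inside and d > PRESET_DISTANCE:
            inside = False   # hand left: signal rose above it again
            yield i
```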
Similar to the method described above, the start time of the video segment of the suspicious activity may be the time when the hand enters the preset area, and the end time may be the time when the hand leaves the preset area.
In another alternative implementation, the sensing device may be an infrared sensor. The infrared sensor may also detect whether the user is approaching or leaving a preset area, and the video segment of the suspicious behavior in the video stream may be searched according to the sensing signal fed back by the infrared sensor, and specific implementation principles and processes may refer to the foregoing embodiments, which are not described herein again.
The embodiment of the invention also provides a video processing method, which comprises the following steps: acquiring a video with preset duration in a video stream for shooting user behaviors; searching a video clip with a first characteristic in the video with the preset duration; determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
Optionally, after the video stream is obtained, the video stream may be segmented to obtain videos with preset time duration, and the videos with the preset time duration are respectively processed to determine whether a video segment of a suspicious behavior exists in the videos with the preset time duration.
This step can be used to acquire the videos of preset duration. Optionally, acquiring a video of preset duration from the video stream may include: determining the start time of the user's code-scanning checkout; and segmenting the video stream after the start time by a preset duration to obtain videos of that duration. The preset duration may be set according to actual needs, for example 5.2 seconds, i.e., every 5.2 seconds of video in the video stream is processed as one unit.
Specifically, taking the moment the user starts the code-scanning checkout as second 0 of the video stream, seconds 0 to 5.2 form one video, seconds 5.3 to 10.4 the next, seconds 10.5 to 15.6 the one after, and so on, dividing the video stream into multiple videos.
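A minimal sketch of this segmentation, with the 5.2-second duration from the example:

```python
CHUNK = 5.2  # preset duration in seconds, as in the example above

def chunk_boundaries(total_duration, chunk=CHUNK):
    """Yield (begin, end) times of consecutive videos, with time 0 taken
    as the moment the user starts the code-scanning checkout."""
    t = 0.0
    while t < total_duration:
        yield (t, min(t + chunk, total_duration))
        t += chunk
```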
There may be various methods for determining the start time of the user code-scanning checkout. Optionally, a checkout starting instruction input by the user may be obtained, and according to the checkout starting instruction, the start time of checkout by scanning a code by the user is determined.
For example, the user may be provided with an option to start the operation, a key, etc. for the user to select. Alternatively, a "start" button may be displayed on the display device of the self-service checkout terminal, and in response to an operation event that the user clicks the "start" button, the start time of the user code scanning checkout may be determined as the time when the user clicks the "start" button.
Or, the start time of the user code scanning checkout may be determined according to the time when the first scanning result is obtained. Specifically, after the user takes the first commodity and scans the first commodity successfully, the start time of the code scanning and checkout of the user may be determined as the time of obtaining the scanning result of the first commodity. Therefore, the user does not need to click to start manually, and the time for the user to check out the account by scanning the code is saved.
Optionally, the capturing of the video stream may be started after the user starts to scan the code and check out. When the user does not start scanning the commodity, the video acquisition function can not be started temporarily, and the resource consumption is effectively reduced.
After the video stream is acquired, video segments of suspicious behaviors can be searched for each video with preset duration. There are many methods for searching for video segments of suspicious behavior from videos with preset duration. For example, a video segment meeting the requirement can be extracted from a video of a preset duration by a machine learning method.
Optionally, the 3D convolution feature of the video with the preset duration may be extracted, and the video segment of the suspicious behavior in the video with the preset duration may be determined according to the convolution feature.
Specifically, the 3D convolution feature described in this embodiment may be an Inflated 3D ConvNet (I3D) feature, or another 3D convolution feature such as a Pseudo-3D ConvNet feature. Based on the extracted 3D convolution features, suspicious behaviors in the video can be detected and identified using an Action Proposal Network. A normalization operation can then be performed on the detected video clips of suspicious behavior so that the features of all clips have a uniform size, which facilitates subsequent processing; for example, when determining whether a clip of suspicious behavior belongs to a missed-scan behavior, this can be realized through the 3D convolution features.
In other optional implementation manners, after the video segments of the suspicious behaviors are found through the 3D convolution feature, if the number of the video segments is multiple, the video segments of the suspicious behaviors may be merged. For convenience of description, a video segment directly found through the 3D convolution feature is referred to as a sub-segment herein.
Optionally, searching for a video clip of suspicious behavior in the video of preset duration may include: searching for sub-segments of suspicious behavior in the video; if multiple such sub-segments exist, computing for each sub-segment the confidence that a missed-scan behavior exists at each time point within it; and obtaining at least one video clip of suspicious behavior from the confidences of the missed-scan behavior at the individual time points.
Specifically, for each 5.2 seconds of video, the sub-segments of the suspicious behaviors can be searched through the 3D convolution feature, for example, there are 3 sub-segments of the suspicious behaviors in the 5.2 seconds of video, the time lengths of the sub-segments of the suspicious behaviors may be the same or different, and there may be overlapping portions in the sub-segments of the suspicious behaviors.
For each of the 3 sub-segments, the confidence that a missed-scan behavior exists can be computed; at times in the 5.2-second video outside the 3 sub-segments, the confidence can be taken as 0. When computing the confidences of a sub-segment, a curve can be output that represents the confidence at each time point in the sub-segment.
Specifically, for each sub-segment, a confidence curve over its whole duration may be computed; alternatively, the confidences at multiple time points in the sub-segment may be computed and connected into a smooth curve, yielding the confidences corresponding to the sub-segment.
After the confidence corresponding to each sub-segment is determined, all the detected sub-segments of the suspicious behavior can be traversed, the confidence is accumulated in the time dimension, and the final sub-segment of the suspicious behavior is determined according to the accumulated confidence.
Optionally, obtaining the video segment of the at least one suspicious behavior according to the confidence that the missing scanning behavior exists at each time point may include: for each time point, overlapping the corresponding confidence coefficients of the time point in each sub-segment to obtain a combined confidence coefficient corresponding to the time point; searching a time point with the merging confidence coefficient larger than a preset threshold value; and obtaining at least one video segment of the suspicious behavior according to the searched time point.
For example, processing the first 5.2 seconds of the video stream yields several sub-segments, two of which cover second 1.5 of the video stream: seconds 1 to 2 form the sub-segment of the first suspicious behavior, and seconds 1.5 to 3 form the sub-segment of the second. In the first sub-segment, the confidence at second 1.5 is 1; that is, when the first sub-segment is processed, the confidence that second 1.5 belongs to a missed-scan behavior is 1. In the second sub-segment, the confidence at second 1.5 is 0.8. The merging confidence at second 1.5 of the video stream is therefore 1 + 0.8 = 1.8.
Then, video segments of suspicious behavior can be determined according to the combined confidence of the time points. Specifically, a segment with a merging confidence greater than a specified threshold (e.g., 1.0) may be taken as a video segment of suspicious behavior.
Fig. 9 is a schematic diagram of merging confidence according to an embodiment of the present invention. As shown in fig. 9, a segment with a confidence greater than 1.0 may be merged as a video segment of suspicious behavior. For example, if the confidence between the 3.6 th and the 4 th seconds is greater than 1.0, it is considered that suspicious behavior occurs between the 3.6 th and the 4 th seconds.
Further, if the time interval between any two video clips is smaller than the preset time interval, the two video clips are merged. For example, the preset time interval may be 0.25 seconds, and when the interval between two segments is less than 0.25 seconds, the two segments are merged.
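A sketch of the whole accumulate-threshold-merge scheme above, using the 1.0 threshold and the 0.25-second merge gap from the text; the evaluation grid step is an assumption.

```python
import numpy as np

STEP = 0.1        # evaluation grid step in seconds (assumed)
THRESHOLD = 1.0   # merge-confidence threshold from the text
MERGE_GAP = 0.25  # merge clips whose gap is below this (seconds)

def suspicious_clips(duration, subsegments):
    """subsegments: list of (start, end, conf_fn), where conf_fn(t) is the
    confidence of a missed scan at time t inside that sub-segment."""
    t = np.arange(0.0, duration, STEP)
    total = np.zeros_like(t)
    for start, end, conf_fn in subsegments:
        for i, ti in enumerate(t):       # confidence is 0 outside the sub-segment
            if start <= ti <= end:
                total[i] += conf_fn(ti)
    clips, run_start = [], None
    for i, above in enumerate(total > THRESHOLD):
        if above and run_start is None:
            run_start = t[i]             # a run above the threshold begins
        elif not above and run_start is not None:
            clips.append([run_start, t[i]])
            run_start = None
    if run_start is not None:
        clips.append([run_start, t[-1]])
    merged = []
    for clip in clips:                   # merge clips separated by short gaps
        if merged and clip[0] - merged[-1][1] < MERGE_GAP:
            merged[-1][1] = clip[1]
        else:
            merged.append(clip)
    return merged
```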
After the video clips of suspicious behavior are found, each clip can be analyzed through a machine learning model to determine whether a code-scanning action exists in it or whether the commodity moved to the code-scanning area. If so, it is judged whether a scanning result for the commodity was obtained between the start time and the end time of the clip; if no result was obtained, a missed scan is confirmed, and otherwise there is no missed scan.
Or after the tracking process of a commodity is finished, if a video clip of a suspicious behavior appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process; and if the scanning result is not obtained, judging whether a missing scanning behavior occurs or not through a machine learning model according to the video segment of the suspicious behavior.
In practical applications, three modules may be used: the action detection module, the post-processing module and the action verification module realize the functions, the three modules can adopt a cascade structure, and the missing scanning actions can be more finely detected and classified through the cascade structure.
The motion detection module firstly performs motion detection on an input video stream, processes videos every 5.2 seconds, and detects sub-segments of suspicious behaviors from the video stream. Specifically, the motion detection module may determine sub-segments of the suspicious behavior in a 3D convolution manner, and determine, for each sub-segment, a start time, an end time, and a confidence that each intermediate time point belongs to the code scanning motion, where there may be an overlap between the sub-segments.
Then, the post-processing module performs post-processing on the sub-segments detected from the video and connects temporally adjacent segments into complete video clips. Specifically, for each time point, the post-processing module may add up the confidences of that time point across the sub-segments, output the segments whose confidence exceeds 1 (the segments above the middle horizontal line in fig. 9), merge segments that are close to each other, and output the complete video clips.
And finally, the action verification module further confirms each complete video clip, determines whether missing scanning occurs or not, can obtain more accurate missing scanning time clips, and eliminates interference during action detection.
The video processing method can acquire the video with the preset duration in the video stream, search the video clip with the first characteristic in the video with the preset duration, detect the video clip through the machine learning model, determine whether the first preset behavior occurs or not, monitor the user under the condition that the user does not sense the first preset behavior, and has simple logic and easy realization.
In an offline retail scenario, the checkout process of each commodity of a user is very short, so that commodity detection and customer attitude estimation algorithms are required to be capable of achieving real-time accuracy under limited computing resources.
The embodiment of the invention also provides a video processing method which can detect the position of the commodity in the image and the user posture information. The method can comprise the following steps: processing images in the video stream to obtain a semantic feature map corresponding to the images; and detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map.
Specifically, a video stream for capturing user behavior may be obtained, and the video stream may be decoded to obtain a frame-by-frame image. Then, processing can be performed on each frame of image, and a semantic feature map corresponding to each frame of image is determined. Wherein, the image can be any type of image such as RGB image, gray scale image, YUV image, etc.
Optionally, before determining the semantic feature map corresponding to the image, the image may be centered and scale-normalized. Centering means subtracting a mean from the pixel value of each pixel point in the image; scale normalization means dividing each mean-subtracted pixel value by the variance, which aids convergence and improves the training of subsequent models. The mean and variance here refer to the mean and variance of the pixel values of the pixel points across all images in the video samples.
In this embodiment, processing the image in the video stream to obtain the semantic feature map corresponding to the image may include: calculating a characteristic vector corresponding to each pixel point according to the pixel value of each pixel point of the image in the video stream; the semantic feature map corresponding to the image comprises feature vectors corresponding to all pixel points in the image; the feature vector corresponding to the pixel point comprises probability information of the pixel point belonging to each semantic feature.
For a frame of image, the corresponding semantic feature map comprises probability information of each pixel point in the image belonging to each semantic feature.
The semantic features may be any features, such as a person's hand, a person's eye, a commodity, a table, etc. If 128 semantic features are preset, then in this step, for each pixel point, the probability information that the pixel point belongs to each semantic feature may be computed to obtain the feature vector corresponding to that pixel point. The feature vector contains the probability information for every semantic feature; that is, it contains 128 values, each representing the probability information that the pixel point belongs to one semantic feature.
The probability information represents the intensity of the pixel point belonging to the semantic feature, the probability information can be the probability without normalization, and the larger the numerical value is, the larger the probability representing the pixel point belonging to the semantic feature is.
Optionally, a semantic feature map corresponding to each frame image may be extracted by combining a bottom-up channel-level convolution and a group convolution of 1 × 1 with top-down scale pyramid feature fusion.
The calculation amount of the group convolution of the channel-level convolution and 1x1 is lower than that of the common convolution of convolution kernels with the same size, so that the calculation cost of forward convolution operation is low, and the scale pyramid feature fusion can fuse image features with different semantics, so that the feature representation has strong distinguishability.
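A sketch of such a lightweight block in PyTorch, pairing a channel-level (depthwise) convolution with a 1×1 group convolution, plus a top-down pyramid fusion step. The channel and group counts are assumptions, not values from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class LightBlock(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1,
                                   groups=channels)  # channel-level convolution
        self.pointwise = nn.Conv2d(channels, channels, 1,
                                   groups=groups)    # 1x1 group convolution

    def forward(self, x):
        return F.relu(self.pointwise(F.relu(self.depthwise(x))))

def fuse_top_down(coarse, fine):
    """Scale-pyramid fusion: upsample the coarser, more semantic map and
    add it to the finer map so features of different semantics are fused."""
    return fine + F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
```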
And processing each frame of image to obtain a semantic feature map corresponding to each frame of image.
In this embodiment, the position information of the commodity can be determined according to the semantic feature map, and the posture of the user can be estimated at the same time. The position information of the commodity and the user posture estimation can be realized by adopting any target detection method and any posture estimation method.
Optionally, detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map may include the following steps a to d:
and a, predicting the position information of a plurality of candidate objects in the image according to the semantic feature map.
Optionally, predicting the position information of the multiple candidate objects in the image according to the semantic feature map may include: and predicting the position information of a plurality of candidate objects in the image aiming at the feature vector corresponding to each pixel point.
Specifically, for each pixel point, the positions of a plurality of candidate objects around the pixel point can be predicted according to the feature vector corresponding to the pixel point. The number of the candidate objects predicted by each pixel point can be set according to actual needs. For example, for each pixel point, the position information of 15 candidates around the pixel point can be predicted.
The position information of the candidate object may include the coordinates of the center point of the candidate object and the length and width of the rectangular frame in which the candidate object is located.
There are many ways to predict candidate objects from semantic feature maps. Alternatively, the mask RCNN algorithm may be employed to determine candidate objects.
Optionally, after predicting the position information of multiple candidate objects for the feature vector corresponding to each pixel point, before classifying the candidate objects, the position information of multiple candidate objects in the image may be obtained by performing deduplication on the candidate objects predicted according to the feature vector corresponding to each pixel point.
Assuming the image has 800 × 1000 pixel points and 15 candidate objects are predicted per pixel point, there are 800 × 1000 × 15 candidate objects in total, many of them likely duplicates. The candidate objects can therefore be deduplicated by an algorithm and the deduplicated candidates classified. Optionally, deduplication can be implemented with a non-maximum suppression algorithm. For example, if after deduplication only 1000 of the 800 × 1000 × 15 candidate objects remain, those 1000 candidates proceed to the classification step.
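A minimal sketch of deduplication by non-maximum suppression; boxes are (x1, y1, x2, y2) with a score, and the IoU threshold is an assumed value.

```python
IOU_THRESHOLD = 0.5  # assumed overlap threshold

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(candidates):
    """candidates: list of (box, score). Keeps the highest-scoring box of
    every group of heavily overlapping candidates."""
    kept = []
    for box, score in sorted(candidates, key=lambda c: -c[1]):
        if all(iou(box, kept_box) < IOU_THRESHOLD for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```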
B, classifying the candidate objects and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background.
As described above, 1000 de-duplicated candidate objects may be classified, all the candidate objects are classified into three categories, namely, user, commodity and background, and the boundaries of the rectangular frame where the commodity and the user are located may be refined.
There are many methods for classifying the candidate object, and optionally, a machine learning model such as a neural network model may be used to classify the candidate object. After the classification is finished, the rectangular frame where the commodity and the user are located can be refined. The refinement refers to regression of the rectangular frame, namely processing the rectangular frame, so that the rectangular frame where the commodity or the user is located is more accurate.
And c, determining a characteristic vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user.
And d, predicting the posture information of the user according to the characteristic vector corresponding to the area where the user is located.
Specifically, the attitude information of the user can be estimated according to the feature vector corresponding to each pixel point in the region where the user is located. The pose information may include position information for a plurality of key points of the user. For example, for a region identified as a user, the position information of 17 key points of the user's nose, eyes, ears, shoulders, elbows, wrists, pelvis, knees, ankles, etc. can be located by the feature vectors. The position information of the user's hand can be determined from the position information of the key points.
There are various methods of computing the position information of the user's key points. In this embodiment, the position information of the 17 key points may be predicted from the semantic feature map through a convolution network and a deconvolution network. Specifically, the key-point positions can be predicted with 4 convolutions and 2 deconvolutions, which is fast without degrading the result.
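A PyTorch sketch of such a key-point head: 4 convolutions followed by 2 deconvolutions, ending in one heatmap per key point. The channel width is an assumption; 17 matches the key points listed above.

```python
import torch.nn as nn

def keypoint_head(in_channels, num_keypoints=17, width=256):
    layers = []
    for i in range(4):  # 4 convolutions
        layers += [nn.Conv2d(in_channels if i == 0 else width,
                             width, 3, padding=1), nn.ReLU()]
    for _ in range(2):  # 2 deconvolutions, each doubling the resolution
        layers += [nn.ConvTranspose2d(width, width, 4, stride=2,
                                      padding=1), nn.ReLU()]
    layers += [nn.Conv2d(width, num_keypoints, 1)]  # one heatmap per key point
    return nn.Sequential(*layers)
```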
In summary, if a candidate object is considered as a user, the position information of the hand of the user may be further determined according to the gesture of the user; if a candidate object is considered as a commodity, the position information of the commodity can be directly obtained. If a candidate is considered background, no further processing is required on it.
The position information of the commodity may include position information of a polygon frame where the commodity is located; the position information of the hand may include center point coordinates of the hand. The hand position information and the commodity position information can be applied to any flow of the self-service cash registering process.
Optionally, when a tracking process of a commodity needs to be determined, according to the position information of the hand and the position information of the commodity in each image of the video stream, a movement track of the hand and a movement track of the commodity in the video stream may be determined; and determining the tracking process of the commodity according to the movement track of the hand and the movement track of the commodity.
Optionally, when the video segment of the suspicious behavior in the video stream needs to be searched, whether the hands and the commodities enter the code scanned area or not can be determined through the moving tracks of the hands and the commodities, so that whether the video segment of the suspicious behavior appears or not can be determined.
The algorithms used in the embodiments of the present invention may be replaced by any other general algorithm capable of realizing the related functions. For example, when determining the semantic feature map, the semantic feature map of the image may be determined through RCNN (Regions with CNN features), SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once) and the like, and the position information and category of the candidate objects in the image may be detected from the semantic feature map.
RCNN, SSD and YOLO all belong to target detection algorithms, and can be used for learning through large-scale object labeling and predicting the coordinate and category information of a target in an image.
When the pose of the user is estimated according to the semantic feature map, the pose estimation can be realized by adopting an algorithm such as OpenPose and the like or a convolution and deconvolution mode.
The embodiment can realize the positioning of the commodities and the hands of the user based on deep learning, and finds the behaviors of account omission and account non-settlement of the customers from the visual dimension through the analysis of the commodities and the hands, thereby achieving the purpose of visual loss prevention.
Compared with the prior art, the method in the embodiment has the advantages that the semantic feature map is shared by commodity position detection and user gesture detection. Specifically, after the position and the category of the candidate object are obtained through the semantic feature map, for the region determined as the user, the feature vector in the region is selected from the semantic feature map, and the feature vector passes through a shallow full convolution network, so that 17 key points in total, such as the nose, the eyes, the ears, the shoulders, the elbows, the wrists, the pelvis, the knees, the ankles and the like of the customer, can be predicted.
Compared with first processing the image with an object detection algorithm such as SSD or YOLO to obtain the positions of users and commodities, and then processing it again with a pose estimation algorithm such as OpenPose to obtain the user's pose, this method achieves both commodity detection and user pose estimation from the same semantic feature map.
After the positions of the user and the commodity have been detected, pose estimation can be performed without extracting the semantic feature map again; the repeated feature-extraction step is eliminated and the algorithm complexity is reduced. By sharing one semantic feature map, commodity detection and pose estimation can therefore be completed more efficiently, as the sketch below illustrates.
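A hedged sketch of this shared-feature-map flow; `backbone`, `detection_head` and `pose_head` are illustrative stand-ins for the networks above, and the plain-slice crop stands in for RoI pooling — none of these names come from the patent.

```python
def crop(feature_map, box):
    """Select the feature vectors inside a region: a plain slice
    standing in for RoI pooling; feature_map is (C, H, W), box is in
    integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return feature_map[:, y0:y1, x0:x1]

def process_frame(frame, backbone, detection_head, pose_head):
    """The backbone runs exactly once per frame; both the detection
    head and the pose head reuse its output feature map."""
    feature_map = backbone(frame)                  # extracted once
    boxes, classes = detection_head(feature_map)   # candidate objects
    poses = [pose_head(crop(feature_map, box))     # pose reuses the map
             for box, cls in zip(boxes, classes) if cls == "user"]
    return boxes, classes, poses
```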
According to the video processing method, the images in the video stream are processed to obtain their corresponding semantic feature maps, and the position information of the commodities and the pose information of the user in each image are detected from those feature maps. The settlement behavior of customers can then be analyzed from the positions and states of the users' hands and the commodities in the video stream, so that missed-scan and unsettled-payment behaviors are discovered from the visual dimension, visual loss prevention is achieved, and the efficiency of the self-service checkout terminal is improved. In addition, because commodity detection and user pose estimation share one semantic feature map, the video stream can be processed more efficiently, improving algorithm efficiency and user experience.
Fig. 10 is a schematic flowchart of a fourth embodiment of a video processing method according to the present invention. As shown in fig. 10, the video processing method in this embodiment may include:
step 1001, an offline video for shooting user behavior is obtained.
Step 1002, finding a video segment with a first characteristic in the offline video.
Step 1003, determining whether the user has a first predetermined behavior corresponding to the first characteristic according to the video clip.
The implementation principle and process of the method in this embodiment may refer to the foregoing embodiments, and the only difference is that the foregoing embodiments may be used to process a real-time video stream, and the present embodiment may be used to process an offline video.
For parts of this embodiment that are not described in detail, reference may be made to the related description of the foregoing embodiments. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 11 is a schematic flowchart of a first store management method according to an embodiment of the present invention. As shown in fig. 11, the store management method in this embodiment may include:
step 1101, acquiring a video stream for shooting the behavior of store managers;
step 1102, searching for a video segment with a second characteristic in the video stream;
step 1103, determining whether the manager exhibits a second predetermined behavior corresponding to the second characteristic according to the video clip.
Specifically, one or more cameras may be set up in the manager's work area to capture the manager's behavior and send the footage to the server for analysis.
Here the manager may be any person working in the store, such as a service person, a stock picker, and the like. The second predetermined behavior may be any behavior of a store manager, such as shelving goods, arranging shelves or packaging goods, and the second characteristic may be any characteristic of a suspected second predetermined behavior.
For example, the second predetermined behavior may be a shelving behavior, i.e., placing an item on a shelf, and the second characteristic may be a characteristic of suspected shelving, such as a hand moving from a basket holding items toward a shelf. Once a characteristic of suspected shelving is detected, whether the manager actually shelved an item can be judged from the video clip in which the characteristic appears.
There may be many ways to determine whether the second predetermined behavior occurs based on the video clip. Optionally, the video clip may be evaluated by a machine learning model to determine whether the second predetermined behavior occurs in it; a minimal sketch follows.
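For instance (a hedged sketch; `model` stands for any clip-level classifier, which the patent leaves unspecified, and the 0.5 operating point is an assumed value):

```python
def second_behavior_occurred(video_clip, model, threshold=0.5):
    """`model` returns the confidence that the clip shows the second
    predetermined behavior; compare it against a preset threshold."""
    return model(video_clip) > threshold
```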
The specific implementation principle and process of searching for the video clip and judging the predetermined behavior from it are similar to those of the foregoing embodiments; the first predetermined behavior there simply becomes the second predetermined behavior.
Optionally, monitoring information may be sent to a monitoring terminal. The monitoring information may include whether the second predetermined behavior occurred, or how many times the manager performed it; monitoring personnel can then supervise the manager accordingly and intervene, manually or by machine, when the manager's behavior is abnormal.
To sum up, the store management method provided by this embodiment of the present invention can obtain a video stream for shooting the behavior of a store manager, search the video stream for a video clip with a second characteristic, and determine from that clip whether the manager has a second predetermined behavior corresponding to the second characteristic. The manager can then be monitored according to the determination result; for example, whether the manager's work meets the standard can be judged from the number of shelving actions. This effectively reduces the economic loss of a retail store, and because the manager's behavior is analyzed through video processing, the manager's work process is not disturbed and the manager's work efficiency is improved.
Fig. 12 is a schematic flowchart of a second store management method according to an embodiment of the present invention. As shown in fig. 12, the store management method in this embodiment may include:
Step 1201, acquiring an offline video for shooting the behavior of store managers.
Step 1202, searching for a video segment with a second characteristic in the offline video.
Step 1203, determining whether a second predetermined behavior corresponding to the second characteristic appears in the manager according to the video clip.
The implementation principle and process of the method in this embodiment may refer to the store management method provided in the foregoing embodiment; the only difference is that the foregoing embodiment processes a real-time video stream, while this embodiment processes an offline video.
For parts of this embodiment that are not described in detail, reference may be made to the related description of the foregoing embodiments. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
A video processing apparatus according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that each of these video processing apparatuses can be constructed from commercially available hardware components configured through the steps taught in the present solution.
Fig. 13 is a schematic structural diagram of a first video processing apparatus according to an embodiment of the present invention. As shown in fig. 13, the apparatus may include:
the acquiring module 131 is configured to acquire a video stream for capturing a user behavior;
a detection module 132 for determining a movement trajectory of a user's hand in the video stream;
a searching module 133, configured to search for a video segment with a first characteristic in the video stream;
a determining module 134, configured to determine whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
Optionally, the video stream is a video stream obtained by a shooting device; the shooting range of the shooting device comprises a non-code-scanned area for placing commodities whose codes have not been scanned and a code-scanned area for placing commodities whose codes have been scanned.
Optionally, the first characteristic is a characteristic of suspected missed scanning behavior; the first predetermined behavior is a missed scan behavior.
Optionally, the video clips with the first characteristic comprise video clips of code scanning actions and/or video clips of moving the goods to code scanning areas.
Optionally, the search module 133 may specifically be configured to: respectively inputting the video streams to a plurality of detection modules, and searching for video segments with first characteristics in the video streams; wherein different detection modules use different detection methods to search for the video segments with the first characteristics in the video stream.
Optionally, the determining module 134 may specifically include: a first judging unit, configured to, after the tracking process of a commodity is finished, judge whether a scanning result of the commodity was acquired during the tracking process if a video clip with the first characteristic appeared in the tracking process; and a second judging unit, configured to judge, when no scanning result was acquired, whether a first predetermined behavior occurred according to the video clip with the first characteristic; wherein the tracking process of the commodity is the process during which the commodity is held in the hand.
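To make the two-unit decision flow concrete, here is a hedged Python sketch; the `Tracking` record and the timestamp representation are assumptions — only the decision logic (a first-characteristic clip appeared and no scan result arrived during the tracking process) follows the text above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tracking:
    """One tracking process: the time span during which a commodity
    is held in the hand."""
    start: float
    end: float
    suspect_clips: List[Tuple[float, float]] = field(default_factory=list)

def missed_scan_suspected(tracking: Tracking, scan_times: List[float]) -> bool:
    """First judging unit: after the tracking process ends, the second
    unit is only invoked if a first-characteristic clip appeared AND no
    scanning result arrived during the process."""
    if not tracking.suspect_clips:
        return False                    # no first-characteristic clip
    if any(tracking.start <= t <= tracking.end for t in scan_times):
        return False                    # scan result obtained: not missed
    return True                         # hand the clips to the second unit
```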
Optionally, the first determining unit may be further configured to: detecting position information of commodities and hands in the video stream; determining the movement tracks of the commodities and the hands according to the position information of the commodities and the hands; and determining whether the commodity is held in the hand or not according to the movement tracks of the commodity and the hand.
Optionally, the first determining unit may be further configured to: after the commodity is confirmed to be held in the hand, if the time for detecting the empty hand exceeds the preset time, the tracking process of the commodity is confirmed to be finished.
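A tiny sketch of the empty-hand timeout rule just described; the frame rate and the 2.0-second preset time are assumed values, not ones given in the patent.

```python
def tracking_finished(empty_hand_frames: int, fps: float,
                      preset_seconds: float = 2.0) -> bool:
    """After the commodity was held, the tracking process is deemed
    finished once the hand has been detected empty for longer than a
    preset time."""
    return empty_hand_frames / fps > preset_seconds
```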
Optionally, the first determining unit may be further configured to: and if the scanning result is obtained in the tracking process of one commodity, determining that the first preset behavior does not appear in the tracking process of the commodity.
Optionally, the second determining unit may be specifically configured to: and when the scanning result is not obtained, if a plurality of video segments with the first characteristic exist in the tracking process, judging whether a first preset behavior occurs according to the last video segment.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, if a plurality of video clips with the first characteristic exist in the tracking process, searching other video clips with overlapping parts with the last video clip; merging the found video clip with the last video clip; and judging whether a first preset behavior occurs according to the merged video clip.
Optionally, the second determining unit may be specifically configured to: when the scanning result is not obtained, determining the confidence coefficient that the video clip with the first characteristic belongs to the first preset behavior through a machine learning model; and judging whether a first preset behavior occurs according to the confidence coefficient.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, if a plurality of video clips with first characteristics exist in the tracking process, determining the confidence degree of each video clip with the first characteristics belonging to a first preset behavior through a machine learning model; calculating a weighted sum of confidence degrees corresponding to the plurality of video segments with the first characteristic; if the weighted sum is greater than a preset threshold, determining that a first predetermined behavior occurs.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, determining a corresponding machine learning model according to the type of the video clip with the first characteristic; inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient that the video clip belongs to a first preset behavior; and judging whether a first preset behavior occurs according to the confidence.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, if a plurality of video segments with first characteristics exist in the tracking process, determining the confidence degree that each video segment with the first characteristics belongs to a first preset behavior through a machine learning model; calculating a weighted sum of confidence degrees corresponding to the plurality of video segments with the first characteristic; if the weighted sum is larger than a preset threshold value, determining that a first preset action occurs; wherein determining, by the machine learning model, a confidence that the video segment with the first feature belongs to the first predetermined behavior comprises: determining a corresponding machine learning model according to the type of the video clip with the first characteristic; and inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient that the video clip belongs to the first preset behavior.
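Combining the weighted-sum and per-type-model variants described above, a hedged sketch; the weights, the threshold and the model registry are assumed tuning choices.

```python
def first_behavior_occurred(clips, models, weights, threshold):
    """clips: list of (clip_type, clip) pairs with the first
    characteristic found in one tracking process; models: one machine
    learning model per clip type. Returns True if the weighted
    confidence sum exceeds the preset threshold."""
    weighted_sum = 0.0
    for (clip_type, clip), weight in zip(clips, weights):
        confidence = models[clip_type](clip)   # model chosen by clip type
        weighted_sum += weight * confidence
    return weighted_sum > threshold            # first predetermined behavior
```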
Optionally, the determining module 134 may further be configured to: responding to an operation event that the user confirms that the commodity is scanned completely, and counting the times of first preset behaviors of the user; and if the times of the first preset behaviors of the user are less than the preset times, the commodity scanned by the user is settled.
Optionally, the determining module 134 may be further configured to: and if the times of the first preset behavior of the user are not less than the preset times, displaying a settlement prohibiting interface and/or sending warning information to a monitoring terminal.
Optionally, the detecting module 132 may be specifically configured to: detecting position information of a hand of a user in each frame image of the video stream; and determining the motion trail of the hand according to the position information of the hand in each frame of image.
Optionally, the search module 133 may specifically be configured to: and if the hands and the commodities of the user enter the code-scanned area from the non-code-scanned area, determining that the video clip with the first characteristic appears.
Optionally, the search module 133 may specifically be configured to: and if the hands and the commodities of the user enter the scanned code area from the non-scanned code area and the time interval from the last time of entering the scanned code area is greater than a preset interval, determining that the video clip with the first characteristic appears.
Optionally, the search module 133 may specifically be configured to: and if the hands and the commodities of the user enter the code-scanned area from the non-code-scanned area and the farthest distance between the hands and the code-scanned area after leaving the code-scanned area last time is greater than a preset distance, determining that a video clip with the first characteristic appears.
Optionally, the video segments with the first characteristic are video segments within a first preset time period before entering the code-scanned area and a second preset time period after entering the code-scanned area.
Optionally, the apparatus may further include: the semantic processing module is used for processing the images in the video stream to obtain a semantic feature map corresponding to the images; and the gesture detection module is used for detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map.
Optionally, the semantic processing module may be specifically configured to: calculating a characteristic vector corresponding to each pixel point according to the pixel value of each pixel point of the image in the video stream; the semantic feature map corresponding to the image comprises feature vectors corresponding to all pixel points in the image; and the feature vector corresponding to the pixel point comprises probability information of the pixel point belonging to each semantic feature.
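A hedged sketch of this per-pixel semantic feature computation; the patent specifies no network, so the fully convolutional layout, the layer widths and the sigmoid read-out below are all assumptions.

```python
import torch
import torch.nn as nn

class SemanticFeatureExtractor(nn.Module):
    """Maps an image to one feature vector per pixel; each channel of
    the vector is read as the probability that the pixel carries the
    corresponding semantic feature. Depth and widths are illustrative."""
    def __init__(self, num_semantic_features: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_semantic_features, 3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, H, W) -> (N, num_semantic_features, H, W);
        # the channel vector at each (h, w) is that pixel's feature vector
        return torch.sigmoid(self.body(image))
```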
Optionally, the gesture detection module may be specifically configured to: predicting the position information of a plurality of candidate objects in the image according to the semantic feature map; classifying the plurality of candidate objects, and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background; determining a feature vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user; and predicting the posture information of the user according to the feature vector corresponding to the area where the user is located.
Optionally, the gesture detection module may be specifically configured to: predicting the position information of a plurality of candidate objects aiming at the characteristic vector corresponding to each pixel point; carrying out duplicate removal on candidate objects obtained by feature vector prediction corresponding to each pixel point to obtain position information of a plurality of candidate objects in the image; classifying the candidate objects, and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background; determining a feature vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user; and predicting the posture information of the user according to the feature vector corresponding to the area where the user is located.
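The "duplicate removal" step above is not specified further; a common realization is non-maximum suppression, sketched here under that assumption.

```python
def deduplicate(candidates, iou_threshold=0.5):
    """Greedy non-maximum suppression over per-pixel predictions.
    candidates: list of (box, score) with box = (x0, y0, x1, y1)."""
    def iou(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union > 0 else 0.0
    kept = []
    for box, score in sorted(candidates, key=lambda c: -c[1]):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```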
Optionally, the gesture detection module may be further configured to: and determining the position information of the hand according to the posture information of the user.
Optionally, the gesture detection module may be further configured to: determining the movement track of the hand and the movement track of the commodity in the video stream according to the position information of the hand and the position information of the commodity in each image of the video stream; and determining the tracking process of the commodity according to the moving track of the hand and the moving track of the commodity.
The apparatus shown in fig. 13 can execute the scheme provided by the third embodiment of the video processing method, and reference may be made to the related description of the foregoing embodiment for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 14 is a schematic structural diagram of a second video processing apparatus according to an embodiment of the present invention. As shown in fig. 14, the apparatus may include:
an obtaining module 141, configured to obtain a video stream for capturing a user behavior;
a searching module 142, configured to search for a video segment with a first characteristic in the video stream;
a determining module 143, configured to determine whether the user has a first predetermined behavior corresponding to the first feature according to the video segment.
The apparatus shown in fig. 14 can execute the scheme provided by the first embodiment of the video processing method, and reference may be made to the related description of the foregoing embodiment for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 15 is a schematic structural diagram of a third video processing apparatus according to an embodiment of the present invention. As shown in fig. 15, the apparatus may include:
an obtaining module 151, configured to obtain an offline video for shooting user behavior;
a searching module 152, configured to search for a video segment with a first characteristic in the offline video;
a determining module 153, configured to determine whether the user has a first predetermined behavior corresponding to the first feature according to the video segment.
The apparatus shown in fig. 15 can execute the solution provided by the fourth embodiment of the video processing method, and reference may be made to the related description of the foregoing embodiment for a part of this embodiment that is not described in detail. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 16 is a schematic structural diagram of a first store management apparatus according to an embodiment of the present invention. As shown in fig. 16, the apparatus may include:
an obtaining module 161, configured to obtain a video stream for shooting behavior of store managers;
a searching module 162, configured to search for a video segment with a second characteristic in the video stream;
the determining module 163 is configured to determine whether a second predetermined behavior corresponding to the second feature occurs to the manager according to the video clip.
The device shown in fig. 16 may execute the scheme provided by the first embodiment of the store management method, and reference may be made to the related description of the foregoing embodiment for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 17 is a schematic structural diagram of a second store management apparatus according to an embodiment of the present invention. As shown in fig. 17, the apparatus may include:
the acquisition module 171 is configured to acquire an offline video for shooting the behavior of a store manager;
a searching module 172, configured to search for a video segment with a second characteristic in the offline video;
a determining module 173, configured to determine whether a second predetermined behavior corresponding to the second feature occurs to the manager according to the video clip.
The apparatus shown in fig. 17 can execute the solution provided by the second embodiment of the store management method, and reference may be made to the related description of the foregoing embodiments for parts of this embodiment that are not described in detail. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 18 is a schematic structural diagram of a first electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with a video processing function, such as a self-service cash register terminal, a server and the like. As shown in fig. 18, the electronic device may include: a first processor 21 and a first memory 22. Wherein the first memory 22 is used for storing a program for supporting an electronic device to execute the video processing method provided by any one of the foregoing embodiments, and the first processor 21 is configured to execute the program stored in the first memory 22.
The program comprises one or more computer instructions which, when executed by the first processor 21, are capable of carrying out the steps of:
acquiring a video stream for shooting user behaviors;
determining a movement trajectory of a hand of a user in the video stream;
searching a video segment with a first characteristic in the video stream;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
Optionally, the first processor 21 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 8.
The electronic device may further include a first communication interface 23, which is used for the electronic device to communicate with other devices or a communication network.
Fig. 19 is a schematic structural diagram of a second electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with a video processing function, such as a self-service cash register terminal, a server and the like. As shown in fig. 19, the electronic device may include: a second processor 24 and a second memory 25. Wherein the second memory 25 is used for storing a program for supporting an electronic device to execute the video processing method provided by any one of the foregoing embodiments, and the second processor 24 is configured to execute the program stored in the second memory 25.
The program comprises one or more computer instructions which, when executed by the second processor 24, are capable of performing the steps of:
acquiring a video stream for shooting user behaviors;
searching a video segment with a first characteristic in the video stream;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
Optionally, the second processor 24 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 4.
The electronic device may further include a second communication interface 26 for communicating with other devices or a communication network.
Fig. 20 is a schematic structural diagram of a third electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with a video processing function, such as a self-service cash register terminal, a server and the like. As shown in fig. 20, the electronic device may include: a third processor 27 and a third memory 28. Wherein the third memory 28 is used for storing a program for supporting the electronic device to execute the video processing method provided by any one of the foregoing embodiments, and the third processor 27 is configured to execute the program stored in the third memory 28.
The program comprises one or more computer instructions which, when executed by the third processor 27, are capable of performing the steps of:
acquiring an offline video for shooting user behaviors;
searching a video clip with a first characteristic in the offline video;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
Optionally, the third processor 27 is further configured to perform all or part of the steps in the embodiment shown in fig. 10.
The electronic device may further include a third communication interface 29, which is used for the electronic device to communicate with other devices or a communication network.
Fig. 21 is a schematic structural diagram of a fourth electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with store management function, such as a server. As shown in fig. 21, the electronic device may include: a fourth processor 210 and a fourth memory 211. Wherein, the fourth memory 211 is used for storing a program that supports an electronic device to execute the store management method provided by any one of the foregoing embodiments, and the fourth processor 210 is configured to execute the program stored in the fourth memory 211.
The program comprises one or more computer instructions which, when executed by the fourth processor 210, is capable of performing the steps of:
acquiring a video stream for shooting the behavior of store managers;
searching a video segment with a second characteristic in the video stream;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
Optionally, the fourth processor 210 is further configured to perform all or part of the steps in the embodiment shown in fig. 11.
The electronic device may further include a fourth communication interface 212, which is used for the electronic device to communicate with other devices or a communication network.
Fig. 22 is a schematic structural diagram of a fifth electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with store management function, such as a server. As shown in fig. 22, the electronic device may include: a fifth processor 213 and a fifth memory 214. Wherein the fifth memory 214 is used for storing programs that support the electronic device to execute the store management method provided by any one of the foregoing embodiments, and the fifth processor 213 is configured to execute the programs stored in the fifth memory 214.
The program comprises one or more computer instructions which, when executed by the fifth processor 213, are capable of performing the steps of:
acquiring an offline video for shooting the behavior of store managers;
searching a video clip with a second characteristic in the offline video;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
Optionally, the fifth processor 213 is further configured to execute all or part of the steps in the embodiment shown in fig. 12.
The electronic device may further include a fifth communication interface 215 for communicating with other devices or a communication network.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring a video stream for shooting user behaviors;
determining a movement trajectory of a hand of a user in the video stream;
searching a video segment with a first characteristic in the video stream;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
The computer instructions, when executed by a processor, may further cause the processor to perform all or part of the steps involved in the third embodiment of the video processing method.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring a video stream for shooting user behaviors;
searching a video segment with a first characteristic in the video stream;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
The computer instructions, when executed by a processor, may further cause the processor to perform all or a portion of the steps involved in embodiments one through two of the above-described video processing methods.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring an offline video for shooting user behaviors;
searching a video clip with a first characteristic in the offline video;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
The computer instructions, when executed by a processor, may further cause the processor to perform all or part of the steps involved in the fourth embodiment of the video processing method.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring a video stream for shooting the behavior of store managers;
searching a video segment with a second characteristic in the video stream;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
The computer instructions, when executed by a processor, may further cause the processor to perform all or a portion of the steps involved in one of the store management method embodiments described above.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring an offline video for shooting the behavior of store managers;
searching a video clip with a second characteristic in the offline video;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
The computer instructions, when executed by a processor, may further cause the processor to perform all or part of the steps involved in embodiment two of the store management method described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented on a necessary general hardware platform, or by a combination of hardware and software. With this understanding, the essence of the above technical solutions, or the part that contributes beyond the prior art, may be embodied in the form of a computer program product, which may be stored on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer implemented process, such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (39)

1. A video processing method, comprising:
acquiring a video stream for shooting user behaviors;
determining a movement trajectory of a user's hand in the video stream;
searching a video segment with a first characteristic in the video stream;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment;
determining from the video segment whether a first predetermined behavior corresponding to the first feature occurs, comprising:
after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
If the scanning result is not obtained, judging whether a first preset behavior occurs or not according to the video clip with the first characteristic; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
2. The method of claim 1, wherein the video stream is a video stream obtained by a camera;
the shooting range of the shooting device comprises a non-code-scanned area for placing commodities whose codes have not been scanned and a code-scanned area for placing commodities whose codes have been scanned.
3. The method of claim 1, wherein the first characteristic is a characteristic of suspected missed scan behavior; the first predetermined behavior is a missed scan behavior.
4. The method according to claim 1, wherein the video clips with the first characteristic comprise video clips of code scanning actions and/or video clips of moving goods to code scanning areas.
5. The method of claim 1, further comprising:
detecting position information of commodities and hands in the video stream;
determining the movement tracks of the commodity and the hand according to the position information of the commodity and the hand;
and determining whether the commodity is held in the hand according to the movement tracks of the commodity and the hand.
6. The method of claim 5, further comprising:
after the commodity is confirmed to be held in the hand, if the time for detecting the empty hand exceeds the preset time, the tracking process of the commodity is confirmed to be finished.
7. The method of claim 1, further comprising:
and if the scanning result is obtained in the tracking process of one commodity, determining that the first preset behavior does not appear in the tracking process of the commodity.
8. The method of claim 1, wherein determining whether the first predetermined behavior occurs according to the video segment with the first feature comprises:
and if a plurality of video clips with the first characteristic exist in the tracking process, judging whether a first preset behavior occurs according to the last video clip.
9. The method of claim 8, wherein determining whether the first predetermined behavior occurred based on the last video segment comprises:
searching other video clips with overlapped parts with the last video clip;
merging the found video clip with the last video clip;
and judging whether a first preset behavior occurs according to the merged video clip.
10. The method of claim 1, wherein determining whether the first predetermined behavior occurs according to the video segment with the first feature comprises:
determining, by a machine learning model, a confidence that a video segment with a first feature belongs to a first predetermined behavior;
and judging whether a first preset behavior occurs according to the confidence coefficient.
11. The method of claim 1, wherein determining whether the first predetermined behavior occurs based on the video segment with the first feature comprises:
if a plurality of video segments with the first characteristics exist in the tracking process, determining the confidence coefficient that each video segment with the first characteristics belongs to the first preset behavior through a machine learning model;
calculating a weighted sum of the confidence degrees corresponding to the plurality of video segments with the first characteristic;
determining that a first predetermined behavior occurs if the weighted sum is greater than a preset threshold.
12. The method of claim 10 or 11, wherein determining a confidence that the video segment with the first feature belongs to the first predetermined behavior by the machine learning model comprises:
determining a corresponding machine learning model according to the type of the video segment with the first characteristic;
And inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient of the video clip belonging to the first preset behavior.
13. The method of claim 1, further comprising:
responding to an operation event that the user confirms that the commodity is scanned completely, and counting the times of first preset behaviors of the user;
and if the times of the first preset behavior of the user are less than the preset times, the commodities scanned by the user are settled.
14. The method of claim 13, further comprising:
and if the times of the first preset behavior of the user are not less than the preset times, displaying a settlement prohibiting interface and/or sending warning information to a monitoring terminal.
15. The method of claim 1, wherein determining a trajectory of a user's hand in the video stream comprises:
detecting position information of a hand of a user in each frame image of the video stream;
and determining the motion trail of the hand according to the position information of the hand in each frame image.
16. The method of claim 1, wherein finding a video segment in the video stream that has a first characteristic comprises:
And if the hands and the commodities of the user enter the code-scanned area from the non-code-scanned area, determining that the video clip with the first characteristic appears.
17. The method of claim 1, wherein finding a video segment with a first characteristic in the video stream comprises:
and if the hands and the commodities of the user enter the scanned code area from the non-scanned code area and the time interval from the last time of entering the scanned code area is greater than a preset interval, determining that the video clip with the first characteristic appears.
18. The method of claim 1, wherein finding a video segment with a first characteristic in the video stream comprises:
and if the hands and the commodities of the user enter the code-scanned area from the non-code-scanned area and the farthest distance between the hands and the code-scanned area after leaving the code-scanned area last time is greater than a preset distance, determining that a video clip with the first characteristic appears.
19. The method of claim 18, wherein the video segments with the first characteristic are video segments within a first predetermined time period before entering the code-scanned area and a second predetermined time period after entering the code-scanned area.
20. The method of claim 1, further comprising:
Processing images in the video stream to obtain a semantic feature map corresponding to the images;
and detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map.
21. The method of claim 20, wherein processing the image in the video stream to obtain a semantic feature map corresponding to the image comprises:
calculating a characteristic vector corresponding to each pixel point according to the pixel value of each pixel point of the image in the video stream;
the semantic feature map corresponding to the image comprises feature vectors corresponding to all pixel points in the image; and the feature vector corresponding to the pixel point comprises probability information of the pixel point belonging to each semantic feature.
22. The method of claim 20, wherein detecting position information of the commodity and pose information of the user in the image according to the semantic feature map comprises:
predicting the position information of a plurality of candidate objects in the image according to the semantic feature map;
classifying the plurality of candidate objects, and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background;
Determining a feature vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user;
and predicting the posture information of the user according to the feature vector corresponding to the area where the user is located.
23. The method according to claim 22, wherein predicting the position information of the plurality of object candidates in the image according to the semantic feature map comprises:
predicting the position information of a plurality of candidate objects aiming at the characteristic vector corresponding to each pixel point;
and carrying out duplicate removal on the candidate objects obtained by the characteristic vector prediction corresponding to each pixel point to obtain the position information of a plurality of candidate objects in the image.
24. The method of claim 20, further comprising:
and determining the position information of the hand according to the posture information of the user.
25. The method of claim 20, further comprising:
determining the movement track of the hand and the movement track of the commodity in the video stream according to the position information of the hand and the position information of the commodity in each image of the video stream;
and determining the tracking process of the commodity according to the movement track of the hand and the movement track of the commodity.
26. A video processing method, comprising:
Acquiring a video stream for shooting user behaviors;
searching a video segment with a first characteristic in the video stream;
determining whether a first predetermined behavior corresponding to the first characteristic occurs to the user according to the video clip;
determining whether the user has a first predetermined behavior corresponding to the first characteristic according to the video clip, including:
after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a first preset behavior occurs or not according to the video clip with the first characteristic; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
27. A video processing method, comprising:
acquiring an offline video for shooting user behaviors;
searching a video clip with a first characteristic in the offline video;
determining whether a first predetermined behavior corresponding to the first characteristic occurs to the user according to the video clip;
determining whether the user has a first predetermined behavior corresponding to the first feature according to the video clip, including:
After the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a first preset behavior occurs or not according to the video clip with the first characteristic; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
28. A store management method, comprising:
acquiring a video stream for shooting the behavior of store managers;
searching a video segment with a second characteristic in the video stream;
determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip;
after the tracking process of a commodity is finished, if a video clip with a second characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process; if the scanning result is not obtained, judging whether a second preset behavior occurs according to the video clip with the second characteristic; wherein the tracking process of the commodity is a process of holding the commodity in the hand.
29. A store management method, comprising:
acquiring an offline video for shooting the behavior of store managers;
searching a video clip with a second characteristic in the offline video;
determining whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip;
after the tracking process of a commodity is finished, if a video clip with a second characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process; if the scanning result is not obtained, judging whether a second preset behavior occurs according to the video clip with the second characteristic; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
30. A video processing apparatus, comprising:
the acquisition module is used for acquiring a video stream for shooting user behaviors;
a detection module for determining a movement trajectory of a user's hand in the video stream;
the searching module is used for searching a video clip with a first characteristic in the video stream;
a determining module, configured to determine whether a first predetermined behavior corresponding to the first feature occurs according to the video segment;
The determining module comprises a first judging unit, and is used for judging whether a scanning result of the commodity is acquired in the tracking process or not if a video clip with a first characteristic appears in the tracking process after the tracking process of the commodity is finished;
if the scanning result is not obtained, judging whether a first preset behavior occurs according to the video clip with the first characteristic; wherein the tracking process of the commodity is a process of holding the commodity in the hand.
31. A video processing apparatus, comprising:
the acquisition module is used for acquiring a video stream for shooting user behaviors;
the searching module is used for searching a video clip with a first characteristic in the video stream;
a determining module, configured to determine whether a first predetermined behavior corresponding to the first feature occurs to the user according to the video segment;
the determining module comprises a first judging unit, and is used for judging whether a scanning result of the commodity is acquired in the tracking process or not if a video clip with a first characteristic appears in the tracking process after the tracking process of the commodity is finished;
if the scanning result is not obtained, judging whether a first preset behavior occurs according to the video clip with the first characteristic; wherein the tracking process of the commodity is a process of holding the commodity in the hand.
32. A video processing apparatus, comprising:
the acquisition module is used for acquiring an offline video for shooting user behaviors;
the searching module is used for searching video clips with first characteristics in the offline video;
a determining module, configured to determine whether a first predetermined behavior corresponding to the first feature occurs to the user according to the video clip;
the determining module comprises a first judging unit and a second judging unit, wherein the first judging unit is used for judging whether a scanning result of a commodity is acquired in the tracking process or not after the tracking process of the commodity is finished and if a video clip with a first characteristic appears in the tracking process; if the scanning result is not obtained, judging whether a first preset behavior occurs or not according to the video clip with the first characteristic; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
33. An store management apparatus, comprising:
the acquisition module is used for acquiring a video stream for shooting the behavior of store managers;
the searching module is used for searching the video clips with the second characteristics in the video stream;
the determining module is used for determining whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip;
The determining module is specifically configured to, after the tracking process of a commodity is finished, determine whether a scanning result of the commodity is obtained in the tracking process if a video clip with a second characteristic appears in the tracking process; if the scanning result is not obtained, judging whether a second preset behavior occurs according to the video clip with the second characteristic; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
34. A store management apparatus, comprising:
an acquisition module, configured to acquire an offline video capturing behavior of a store manager;
a searching module, configured to search the offline video for a video clip with a second characteristic; and
a determining module, configured to determine, according to the video clip, whether a second predetermined behavior corresponding to the second characteristic occurs to the manager;
wherein the determining module is specifically configured to: after a tracking process of a commodity ends, if a video clip with the second characteristic appears during the tracking process, judge whether a scanning result of the commodity is obtained during the tracking process; and if the scanning result is not obtained, judge, according to the video clip with the second characteristic, whether the second predetermined behavior occurs; wherein the tracking process of the commodity is the process during which the commodity is held in a hand.
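Claims 33 and 34 apply the rule from the earlier sketch unchanged, only keyed to a second characteristic associated with the store manager rather than the shopper. Reusing the hypothetical helpers defined above, that is just a different argument:

```python
# Hypothetical reuse of the earlier sketch for the store-management claims.
track = TrackingProcess(commodity_id="SKU-123")
track.clips.append(VideoClip(start_frame=10, end_frame=55, characteristic="second"))

# No scan_result was recorded for this tracking process, so the rule fires.
alert = determine_predetermined_behavior(track, characteristic="second")
print("second predetermined behavior suspected:", alert)   # -> True
```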
35. An electronic device, comprising a first memory and a first processor, wherein the first memory is configured to store one or more computer instructions, and the one or more computer instructions, when executed by the first processor, implement the video processing method of any one of claims 1 to 25.
36. An electronic device, comprising a second memory and a second processor, wherein the second memory is configured to store one or more computer instructions, and the one or more computer instructions, when executed by the second processor, implement the video processing method of claim 26.
37. An electronic device, comprising a third memory and a third processor, wherein the third memory is configured to store one or more computer instructions, and the one or more computer instructions, when executed by the third processor, implement the video processing method of claim 27.
38. An electronic device, comprising a fourth memory and a fourth processor, wherein the fourth memory is configured to store one or more computer instructions, and the one or more computer instructions, when executed by the fourth processor, implement the store management method of claim 28.
39. An electronic device, comprising a fifth memory and a fifth processor, wherein the fifth memory is configured to store one or more computer instructions, and the one or more computer instructions, when executed by the fifth processor, implement the store management method of claim 29.
CN201811457560.4A 2018-11-30 2018-11-30 Video processing method and device and electronic equipment Active CN111263224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811457560.4A CN111263224B (en) 2018-11-30 2018-11-30 Video processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111263224A CN111263224A (en) 2020-06-09
CN111263224B (en) 2022-07-15

Family

ID=70953616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811457560.4A Active CN111263224B (en) 2018-11-30 2018-11-30 Video processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111263224B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360182B (en) * 2020-09-27 2024-02-27 腾讯科技(深圳)有限公司 Intelligent alarm method, device, equipment and storage medium
CN112419626B (en) * 2020-10-27 2023-05-12 拉卡拉支付股份有限公司 Cash register, cash register system and cash register code scanning counting method
CN112349150B (en) * 2020-11-19 2022-05-20 飞友科技有限公司 Video acquisition method and system for airport flight guarantee time node
JP7318753B2 (en) * 2021-06-30 2023-08-01 富士通株式会社 Information processing program, information processing method, and information processing apparatus
TWI804123B (en) * 2021-12-21 2023-06-01 友達光電股份有限公司 Image recognition method and system
CN115497055B (en) * 2022-11-18 2023-03-24 四川汉唐云分布式存储技术有限公司 Commodity anti-theft detection method and device for unattended shop and storage medium
CN115546900B (en) * 2022-11-25 2023-03-31 浙江莲荷科技有限公司 Risk identification method, device, equipment and storage medium
CN115601686B (en) * 2022-12-09 2023-04-11 浙江莲荷科技有限公司 Method, device and system for confirming delivery of articles

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040107128A1 (en) * 2002-11-07 2004-06-03 International Business Machines Corporation Behavior based life support services
US20080061139A1 (en) * 2006-09-07 2008-03-13 Ncr Corporation Self-checkout terminal including scale with remote reset
JP5130332B2 (en) * 2009-12-11 2013-01-30 東芝テック株式会社 Scanner
US20130278631A1 (en) * 2010-02-28 2013-10-24 Osterhout Group, Inc. 3d positioning of augmented reality information
US8964298B2 (en) * 2010-02-28 2015-02-24 Microsoft Corporation Video display modification based on sensor input for a see-through near-to-eye display
US9536449B2 (en) * 2013-05-23 2017-01-03 Medibotics Llc Smart watch and food utensil for monitoring food consumption
US20160140870A1 (en) * 2013-05-23 2016-05-19 Medibotics Llc Hand-Held Spectroscopic Sensor with Light-Projected Fiducial Marker for Analyzing Food Composition and Quantity
CN103500413A (en) * 2013-10-16 2014-01-08 许泽人 Commodity information definition label, system and consumption information authentication method
CN104463658A (en) * 2014-12-15 2015-03-25 杭州陆港科技有限公司 Commodity information summarizing and clearing method
US11189368B2 (en) * 2014-12-24 2021-11-30 Stephan HEATH Systems, computer media, and methods for using electromagnetic frequency (EMF) identification (ID) devices for monitoring, collection, analysis, use and tracking of personal data, biometric data, medical data, transaction data, electronic payment data, and location data for one or more end user, pet, livestock, dairy cows, cattle or other animals, including use of unmanned surveillance vehicles, satellites or hand-held devices
CN107705129A (en) * 2017-09-15 2018-02-16 泾县麦蓝网络技术服务有限公司 A kind of shopping settlement method and system applied to physical retail store
CN108520409B (en) * 2018-03-28 2021-05-28 深圳正品创想科技有限公司 Rapid checkout method and device and electronic equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881100A (en) * 2012-08-24 2013-01-16 济南纳维信息技术有限公司 Video-analysis-based antitheft monitoring method for physical store
CN104618685A (en) * 2014-12-29 2015-05-13 国家电网公司 Intelligent image analysis method for power supply business hall video monitoring
CN104573669A (en) * 2015-01-27 2015-04-29 中国科学院自动化研究所 Image object detection method
CN105117699A (en) * 2015-08-19 2015-12-02 小米科技有限责任公司 User behavior monitoring method and device
CN105915857A (en) * 2016-06-13 2016-08-31 南京亿猫信息技术有限公司 Monitoring system and monitoring method for supermarket shopping cart
CN106251520A (en) * 2016-07-20 2016-12-21 深圳巧思科技有限公司 A kind of super market checkout system
CN106672042A (en) * 2016-11-16 2017-05-17 南京亿猫信息技术有限公司 Intelligent shopping cart and shopping and sales return process judging method of shopping cart
CN108344442A (en) * 2017-12-30 2018-07-31 广州本元信息科技有限公司 Object state detection and identification method, storage medium and system
CN108389316A (en) * 2018-03-02 2018-08-10 北京京东尚科信息技术有限公司 Automatic vending method, device and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alan Lipton, "Commentary Paper on 'Tracking People in Crowds by a Part Matching Approach'," 2008 IEEE Fifth International Conference on Advanced Video and Signal Based Surveillance, 2008-09-03, full text. *
Yang Chenghai, "Research and Application of a Hand Motion Recognition Method Based on Classification Feature Extraction," Computer Applications and Software, 2011-06-15, full text. *

Also Published As

Publication number Publication date
CN111263224A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111263224B (en) Video processing method and device and electronic equipment
CN111415461B (en) Article identification method and system and electronic equipment
CN110866429B (en) Missing scanning identification method, device, self-service cashing terminal and system
US10943128B2 (en) Constructing shopper carts using video surveillance
EP3745296A1 (en) Image monitoring-based commodity sensing system and commodity sensing method
US11049373B2 (en) Storefront device, storefront management method, and program
US20200364752A1 (en) Storefront device, storefront system, storefront management method, and program
JP6314987B2 (en) In-store customer behavior analysis system, in-store customer behavior analysis method, and in-store customer behavior analysis program
TWI578272B (en) Shelf detection system and method
CN111260685B (en) Video processing method and device and electronic equipment
US20190385173A1 (en) System and method for assessing customer service times
US20210398097A1 (en) Method, a device and a system for checkout
CN112464697A (en) Vision and gravity sensing based commodity and customer matching method and device
CN111222870A (en) Settlement method, device and system
CN110689389A (en) Computer vision-based shopping list automatic maintenance method and device, storage medium and terminal
CN110647825A (en) Method, device and equipment for determining unmanned supermarket articles and storage medium
CN113468914B (en) Method, device and equipment for determining purity of commodity
CN111428743B (en) Commodity identification method, commodity processing device and electronic equipment
US20230005348A1 (en) Fraud detection system and method
CN116471384B (en) Control method and control device of unattended store monitoring system
CN110443946A (en) Vending machine, the recognition methods of type of goods and device
JP2022036983A (en) Self-register system, purchased commodity management method and purchased commodity management program
EP4372653A1 (en) Risk identification method, apparatus, device and storage medium for self-service checkout
CN118097519B (en) Intelligent shopping cart shopping behavior analysis method and system based on commodity track analysis
US20240220957A1 (en) Localizing products within images using image segmentation.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant