CN112286360A - Method and apparatus for operating a mobile device

Info

Publication number
CN112286360A
Authority
CN
China
Prior art keywords
operation gesture
gesture
video
key point
key
Prior art date
Legal status
Pending
Application number
CN202011215818.7A
Other languages
Chinese (zh)
Inventor
杨洋
李东卫
都业广
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202011215818.7A priority Critical patent/CN112286360A/en
Publication of CN112286360A publication Critical patent/CN112286360A/en
Priority to PCT/CN2021/123871 priority patent/WO2022095674A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04847Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/0486Drag-and-drop
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present disclosure disclose methods and apparatus for operating a mobile device. One embodiment of the method comprises: in response to detecting an operation gesture, acquiring a video of the operation gesture; inputting the video of the operation gesture into a pre-trained operation gesture recognition model and determining a type identifier of the operation gesture; determining an operation instruction identifier corresponding to the type identifier of the operation gesture based on a prestored correspondence between operation gesture type identifiers and operation instruction identifiers; determining an operation area indicated by the operation gesture based on the video of the operation gesture; determining an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier; and executing the operation instruction. The method and apparatus simplify the process of operating a mobile device through gestures and are particularly helpful in reducing the difficulty of operating a mobile device for visually impaired users.

Description

Method and apparatus for operating a mobile device
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, in particular to the field of machine learning, and more particularly to a method and an apparatus for operating a mobile device.
Background
At present, mobile devices bring great convenience to people's lives. For people with normal vision, a mobile device is simple and convenient to operate, but for people with visual impairment (such as elderly people with poor eyesight), existing mobile devices and the applications loaded on them still have aspects that are inconvenient to operate.
In the related art, a mobile terminal that adopts a barrier-free operation method usually requires the user to tap the device screen by hand; when the user taps a button or a piece of text, the mobile device announces the function of the tapped button or the content of the text by voice, and then guides the user through the operation by voice.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatuses for operating a mobile device.
In a first aspect, an embodiment of the present disclosure provides a method for operating a mobile device, the method including: in response to detecting an operation gesture, acquiring a video of the operation gesture; inputting the video of the operation gesture into a pre-trained operation gesture recognition model and determining a type identifier of the operation gesture; determining an operation instruction identifier corresponding to the type identifier of the operation gesture based on a prestored correspondence between operation gesture type identifiers and operation instruction identifiers; determining an operation area indicated by the operation gesture based on the video of the operation gesture; determining an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier; and executing the operation instruction.
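A minimal Python sketch of the six steps of this method follows; the callables and the correspondence dictionary are placeholders supplied by the caller and are assumptions of the example, not names used in this disclosure.

```python
from typing import Callable, Dict, Sequence, Tuple

def handle_operation_gesture(
    video_frames: Sequence,
    recognize_type: Callable[[Sequence], str],           # pre-trained recognition model
    gesture_to_instruction: Dict[str, str],              # prestored correspondence
    locate_area: Callable[[Sequence], Tuple[int, int]],  # operation-area estimator
    execute: Callable[[dict], None],                     # instruction executor
) -> dict:
    """Convert one captured operation-gesture video into an executed instruction."""
    type_id = recognize_type(video_frames)               # steps 1-2: acquire video, recognize type
    instruction_id = gesture_to_instruction[type_id]     # step 3: map type to instruction identifier
    operation_area = locate_area(video_frames)           # step 4: operation area indicated by the gesture
    instruction = {"action": instruction_id, "target": operation_area}  # step 5
    execute(instruction)                                  # step 6
    return instruction
```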
In some embodiments, the type identifier is determined as follows: sampling a preset number of key frames from the video of the operation gesture; extracting key points of a hand from each key frame and generating feature information for each key point of the hand in the key frame, wherein the feature information includes a key point identifier and key point coordinates, the key point identifier represents the part of the hand to which the key point corresponds, and the key point coordinates represent the position of the key point relative to the other key points in the key frame; encoding, in the time order of the key frames, the feature information of the key points sharing the same key point identifier across the key frames into a feature vector of that key point; generating a feature matrix based on the feature vectors of the key points; and inputting the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture.
In some embodiments, the operation area is determined as follows: determining the movement track of each key point based on the feature information of the key point in each key frame; and determining the operation area indicated by the operation gesture based on the movement tracks of the key points.
In some embodiments, before the feature matrix is input into the operation gesture recognition model, the method further includes: flattening the feature matrix to obtain a first-order feature matrix; and taking the first-order feature matrix as the feature matrix.
In some embodiments, inputting the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture includes: inputting the feature matrix into the operation gesture recognition model and estimating a confidence for each type identifier to which the operation gesture may correspond; and determining the type identifier of the operation gesture based on the confidences of the type identifiers.
In some embodiments, acquiring the video of the operation gesture in response to detecting the operation gesture includes: in response to detecting a hand motion, capturing a video of the hand motion; and in response to determining that the captured video of the hand motion meets a preset condition, determining the hand motion to be the operation gesture and the video of the hand motion to be the video of the operation gesture.
In some embodiments, the method further comprises: broadcasting the operation instruction by voice.
In a second aspect, embodiments of the present disclosure provide an apparatus for operating a mobile device, the apparatus comprising: a gesture detection unit configured to acquire a video of an operation gesture in response to detecting the operation gesture; an area determining unit configured to determine an operation area indicated by the operation gesture based on the video of the operation gesture; a gesture recognition unit configured to input the video of the operation gesture into a pre-trained operation gesture recognition model and determine a type identifier of the operation gesture; an identifier determining unit configured to determine an operation instruction identifier corresponding to the type identifier of the operation gesture based on a prestored correspondence between operation gesture type identifiers and operation instruction identifiers; an instruction determining unit configured to determine an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier; and an instruction execution unit configured to execute the operation instruction.
In some embodiments, the gesture recognition unit further comprises: a sampling module configured to sample a preset number of key frames from the video of the operation gesture; a feature extraction module configured to extract key points of a hand from each key frame and generate feature information for each key point of the hand in the key frame, wherein the feature information includes a key point identifier and key point coordinates, the key point identifier represents the part of the hand to which the key point corresponds, and the key point coordinates represent the position of the key point relative to the other key points in the key frame; a vector generation module configured to encode, in the time order of the key frames, the feature information of the key points sharing the same key point identifier across the key frames into a feature vector of that key point; a matrix generation module configured to generate a feature matrix based on the feature vectors of the key points; and a gesture recognition module configured to input the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture.
In some embodiments, the region determining unit is further configured to determine the operation region indicated by the operation gesture via: determining the movement track of each key point based on the characteristic information of each key point in each key frame; and determining an operation area indicated by the operation gesture based on the moving track of each key point.
In some embodiments, the gesture recognition unit further comprises a matrix flattening processing module configured to flatten the feature matrix to obtain a first-order feature matrix and to take the first-order feature matrix as the feature matrix.
In some embodiments, the gesture recognition module is further configured to: input the feature matrix into the operation gesture recognition model and estimate a confidence for each type identifier to which the operation gesture may correspond; and determine the type identifier of the operation gesture based on the confidences of the type identifiers.
In some embodiments, the gesture detection unit further comprises: a motion detection module configured to capture video of hand motions in response to detecting the hand motions; the motion determination module is configured to determine the hand motion as the operation gesture in response to determining that the collected video of the hand motion meets the preset condition, and determine the video of the hand motion as the video of the operation gesture.
In some embodiments, the apparatus further comprises: and the voice broadcasting unit is configured to broadcast the operation instruction in voice.
According to the method and apparatus for operating a mobile device provided by the embodiments of the present disclosure, when an operation gesture of a user is detected, a video of the operation gesture is acquired; the operation area indicated by the operation gesture is determined from the video, and the type identifier of the operation gesture is recognized by a machine learning model; the operation instruction represented by the operation gesture is then determined based on the operation area and the type identifier, and the operation instruction is executed. The method and apparatus simplify the process of operating a mobile device through gestures and are particularly helpful in reducing the difficulty of operating a mobile device for visually impaired users.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which some embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for operating a mobile device, according to the present disclosure;
FIG. 3 is a schematic diagram of one scenario of the method for operating the mobile device shown in FIG. 2;
FIG. 4 is a flow diagram of yet another embodiment of a method for operating a mobile device according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for operating a mobile device according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 for a method of operating a mobile device or an apparatus for operating a mobile device to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include mobile terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the mobile terminal device 101, 102, 103 to interact with the server 105 via the network 104, for example, a video of the operation gesture may be sent to the server, and a type identification of the recognized operation gesture may be received from the server.
The mobile terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be electronic devices with an image capture device and communication functions, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and the like. When they are software, they can be installed in the electronic devices listed above and may be implemented, for example, as multiple pieces of software or software modules to provide distributed services, or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server that provides various services, such as a background data server that processes operation gesture videos uploaded by the mobile terminal devices 101, 102, 103. The background data server can recognize the received video of the operation gesture and feed back the recognition result (such as the type identifier of the operation gesture) to the mobile terminal device.
It should be noted that the method for operating the mobile device provided by the embodiment of the present disclosure is generally performed by the mobile terminal device. Accordingly, means for operating the mobile device may be provided in the mobile terminal device 101, 102, 103.
It should be noted that the method for operating a mobile device in the present disclosure may be directly stored in a processor of the mobile terminal device, so that when a user uses the mobile terminal device, the user may directly operate the mobile terminal device through the method for operating a mobile device in the present disclosure to implement various functions of the mobile terminal device. In addition, the method for operating the mobile device in the present disclosure may be embedded in an application program, and when the mobile terminal device loaded with the application program runs the application program, a user may operate the mobile terminal device by the method for operating the mobile device in the present disclosure, so as to implement various functions of the application program. This is not a limitation of the present application.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for operating a mobile device in accordance with the present disclosure is shown. The method for operating a mobile device comprises the following steps:
In step 201, in response to detecting the operation gesture, a video of the operation gesture is acquired.
In this embodiment, the execution body is a mobile device (for example, a smartphone) equipped with an image capture device. When the execution body detects an operation gesture of the user, it can continuously capture the operation gesture through the image capture device to obtain a video of the operation gesture.
In some optional implementations of this embodiment, step 201, that is, acquiring the video of the operation gesture in response to detecting the operation gesture, may include: in response to detecting a hand motion, capturing a video of the hand motion; and in response to determining that the captured video of the hand motion meets a preset condition, determining the hand motion to be the operation gesture and the video of the hand motion to be the video of the operation gesture.
In this implementation, in order to avoid misoperation, when the execution body detects a hand motion of the user, it may compare the detected hand motion with a preset condition to determine whether the hand motion is an operation gesture. As an example, the preset condition may be set, based on experience, as a preset duration: when the length of the captured video of the hand motion reaches the preset duration, the user currently intends to operate, and the hand motion can be determined to be an operation gesture; if the length of the captured video of the hand motion is less than the preset duration, the user has no intention to operate at that moment, and the hand motion is treated as a misoperation rather than an operation gesture.
In a specific example, when the execution body detects a hand motion of the user, a sliding window of a preset length may be generated; the captured video of the hand motion is then placed into the sliding window, and when the sliding window is filled with video, indicating that the preset condition is met, the execution body may determine the hand motion to be an operation gesture.
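A minimal Python sketch of this sliding-window check follows; the window length (in frames) and the hand-detection callable are assumptions of the example, not values fixed by this disclosure.

```python
from collections import deque

WINDOW_FRAMES = 60  # e.g. 2 seconds at 30 fps (assumed preset length)

def collect_operation_gesture(frame_iter, hand_detected):
    """Fill a fixed-length window with frames containing a hand.

    Returns the buffered frames as the operation-gesture video once the window
    is full, or None if the hand motion stops early (treated as a misoperation,
    not an operation gesture).
    """
    window = deque(maxlen=WINDOW_FRAMES)
    for frame in frame_iter:
        if not hand_detected(frame):
            return None          # hand motion ended before the window filled
        window.append(frame)
        if len(window) == WINDOW_FRAMES:
            return list(window)  # preset condition met: this is an operation gesture
    return None
```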
Step 202, inputting the video of the operation gesture into a pre-trained operation gesture recognition model, and determining the type identifier of the operation gesture.
In this embodiment, the type identifier characterizes the action type of the user's hand motion, for example click, slide, drag, or drag and drop. The operation gesture recognition model represents the correspondence between a user's operation gesture and the type identifier of that operation gesture, and is used to recognize, from the video of the operation gesture input by the execution body, the type identifier corresponding to the user's hand motion in the video; the model may be, for example, a convolutional neural network model, a recurrent neural network model, or another supervised deep learning model.
As an example, if the user's hand motion in the video of the operation gesture acquired by the execution body is that the thumb and the index finger, initially in contact, gradually move apart, the operation gesture recognition model may output the type identifier of the operation gesture as drag and drop. For another example, if the user's hand motion in the video is that the index finger is extended while the other four fingers are curled, and the index finger then taps twice in succession, the operation gesture recognition model may output the type identifier of the operation gesture as double click.
In a specific example, the operation gesture recognition model in this embodiment may be obtained as follows. An initial operation gesture recognition model, for example a convolutional neural network, and a corresponding loss function are constructed. Videos of hand motions are then obtained from an open database and labeled with type identifiers to obtain sample videos, which form a training set for training the initial operation gesture recognition model. The sample videos in the training set are then input into the initial operation gesture recognition model, and the parameters of the model are adjusted based on the loss function until the loss function converges, yielding the trained operation gesture recognition model. Afterwards, videos of a preset number of hand motions may be recorded as a test set and input into the trained operation gesture recognition model for P-R (precision-recall) verification; if the verification passes, the accuracy of the trained operation gesture recognition model meets the requirement and the model can be used to recognize the user's operation gestures.
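A minimal Python training sketch follows; a small fully connected network over flattened key point features stands in for the convolutional network mentioned above, and the feature layout, layer sizes, optimizer, learning rate and number of classes are illustrative assumptions rather than details specified in this disclosure.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 5            # e.g. click, double click, slide, drag, drag and drop (assumed)
FEATURE_DIM = 21 * 3 * 16  # 21 key points x (id, x, y) x 16 key frames (assumed)

model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_CLASSES),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(loader, epochs=10):
    """Fit the initial model on labeled sample videos until the loss stops improving."""
    model.train()
    for _ in range(epochs):
        for features, labels in loader:  # features: (batch, FEATURE_DIM) float tensor
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
```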
Step 203, determining an operation instruction identifier corresponding to the type identifier of the operation gesture based on a prestored correspondence between operation gesture type identifiers and operation instruction identifiers.
In this embodiment, the operation instruction identifier characterizes the instruction to be executed by the execution body, that is, the action the user wishes to perform on the mobile device. As an example, a correspondence list between type identifiers and operation instruction identifiers may be constructed in advance on the execution body: for example, the type identifier "drag and drop" corresponds to the operation instruction identifier "zoom in", the type identifier "click" corresponds to the operation instruction identifier "single click", and the type identifier "drag" corresponds to the operation instruction identifier "move". Based on this correspondence list, the execution body can determine the operation instruction identifier corresponding to the user's operation gesture, realizing the conversion from the user's operation gesture to a machine instruction.
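A Python sketch of such a correspondence list follows; the entries mirror the examples above, and the actual mapping is configurable.

```python
# Prestored correspondence between type identifiers and operation instruction identifiers.
GESTURE_TO_INSTRUCTION = {
    "click": "single click",
    "drag": "move",
    "drag and drop": "zoom in",
}

def to_instruction_id(type_id: str) -> str:
    """Convert a recognized gesture type identifier into an operation instruction identifier."""
    return GESTURE_TO_INSTRUCTION[type_id]
```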
Step 204, determining an operation area indicated by the operation gesture based on the video of the operation gesture.
In this embodiment, the execution body may determine the operation area indicated by the operation gesture based on the projection of the user's hand motion onto the operation panel area of the execution body; the operation area may be a point, a line, or a region. For example, the execution body may determine the area of activity of the user's hand and the movement tracks of the key points of the hand from the video of the operation gesture, and thereby determine the operation area indicated by the operation gesture.
In a specific example, the execution body may determine the operation area indicated by the operation gesture using an OpenCV algorithm or the gesture recognition facilities of the iOS system. For example, when the execution body recognizes that the user's palm slides from side to side, it may record the initial and final positions of the palm, determine the projections of these positions onto the operation panel area, and take the two projection points as the start and end points of the drag. For another example, when the execution body recognizes that the user taps with the index finger, it may record the position of the index finger at the end of the tap, determine the projection point of that position on the operation panel area, and take the projection point as the operation area of the user's operation gesture.
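A Python sketch of projecting a recorded hand position onto the operation panel area follows; it assumes key point coordinates normalized to the camera frame and a mirroring front camera, both assumptions of the example rather than details fixed by this disclosure.

```python
def project_to_panel(norm_x: float, norm_y: float,
                     screen_w: int, screen_h: int) -> tuple:
    """Map a normalized hand position to a pixel position on the screen."""
    return (int((1.0 - norm_x) * screen_w),  # mirror horizontally for a front camera
            int(norm_y * screen_h))

# Example: the initial and final palm positions of a side-to-side slide become
# the start and end points of the drag.
start_point = project_to_panel(0.8, 0.5, screen_w=1080, screen_h=2340)
end_point = project_to_panel(0.2, 0.5, screen_w=1080, screen_h=2340)
```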
Step 205, determining an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier.
In this embodiment, the execution body may determine the operation instruction that the user expects it to execute based on the operation instruction identifier obtained in step 203 and the operation area determined in step 204. As an example, if the user's operation instruction identifier is "click" and the operation area is point A on the operation panel, the execution body may determine that the operation instruction represented by the user's operation gesture is: click point A. For another example, if the user's operation instruction identifier is "drag and drop" and the operation area is a curve passing through points B, C and D on the operation panel, where point C is the start point on the curve and points B and D are the end points at its two ends, the execution body may determine that the operation instruction represented by the user's operation gesture is: zoom in on the content of the operation panel centered on point C, with the zoom factor determined by the distances of points B and D from point C.
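A Python sketch of assembling the zoom-in instruction from such an operation area follows; the specific scaling rule is an illustrative assumption, since only its dependence on the distances of points B and D from point C is stated above.

```python
import math

def zoom_instruction(b, c, d, initial_spread=1.0):
    """Return a zoom-in instruction centered on C with a distance-based factor."""
    spread = math.dist(b, c) + math.dist(c, d)  # distances of B and D from the start point C
    return {"action": "zoom in", "center": c, "factor": spread / initial_spread}

print(zoom_instruction(b=(100, 500), c=(300, 500), d=(500, 500), initial_spread=200))
# {'action': 'zoom in', 'center': (300, 500), 'factor': 2.0}
```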
In step 206, the operation instruction is executed.
In this embodiment, the execution body converts the operation expectation of the user into a machine instruction that can be executed by the mobile device based on the operation gesture of the user, thereby implementing the operation of the mobile device by the user.
As an example, the execution body may pass the operation instruction through the interface layers from the outermost layer inward. Each interface layer contains action components that can respond to certain operation instructions; when the first action component that can respond to the current operation instruction is found in an interface layer, the passing of the operation instruction stops and that interface layer executes the corresponding action. In this way, the user can operate the mobile device through hand motions.
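A Python sketch of this layer-by-layer delivery follows; the layer class and its fields are illustrative assumptions.

```python
class InterfaceLayer:
    def __init__(self, name, handled_actions):
        self.name = name
        self.handled_actions = set(handled_actions)

    def can_handle(self, instruction):
        return instruction["action"] in self.handled_actions

def dispatch(instruction, layers):
    """Deliver the instruction to the first layer whose component can respond to it."""
    for layer in layers:  # ordered from outermost to innermost
        if layer.can_handle(instruction):
            print(f"{layer.name} executes {instruction['action']} at {instruction['target']}")
            return layer  # passing stops at the first responding layer
    return None           # no component responded

layers = [InterfaceLayer("dialog", {"single click"}),
          InterfaceLayer("page", {"move", "zoom in"})]
dispatch({"action": "zoom in", "target": (300, 500)}, layers)
```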
With continued reference to fig. 3, fig. 3 is a schematic view of one scenario of the flow of the method shown in fig. 2. In fig. 3, the execution body 301 may be a smartphone with a front camera. When the front camera of the smartphone detects an operation gesture of the user (for example, the click shown in fig. 3), it captures the user's continuous motion to obtain a video of the operation gesture. The video of the operation gesture is input into a pre-trained operation gesture recognition model, which outputs the type identifier of the operation gesture as "click". The smartphone determines that the operation instruction identifier corresponding to the type identifier is "single click" and that the operation area of the operation gesture is the "close page" button at the upper right corner of the screen. The smartphone determines that the operation instruction represented by the operation gesture is "click the close page button" and then executes the operation instruction: the current page is closed. The user thus operates the smartphone through a hand motion.
According to the method and apparatus for operating a mobile device provided by the embodiments of the present disclosure, when an operation gesture of a user is detected, a video of the operation gesture is acquired; the operation area indicated by the operation gesture is determined from the video, and the type identifier of the operation gesture is recognized by a machine learning model; the operation instruction represented by the operation gesture is then determined based on the operation area and the type identifier, and the operation instruction is executed. The method and apparatus simplify the process of operating a mobile device through gestures and are particularly helpful in reducing the difficulty of operating a mobile device for visually impaired users.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for operating a mobile device is shown. The flow 400 of the method for operating a mobile device includes the steps of:
step 401, step 201, in response to detecting the operation gesture, acquiring a video of the operation gesture. This step is similar to the step 201 and will not be described herein again.
Step 402, a preset number of key frames are sampled from the video of the operation gesture.
In this embodiment, the execution body may sample the video of the operation gesture based on a preset sampling policy (for example, a preset time interval) and extract a preset number of key frames from the video, so as to reduce the amount of computation.
Step 403, extracting key points of the hand from each key frame and generating feature information for each key point of the hand in the key frame, where the feature information includes a key point identifier and key point coordinates: the key point identifier represents the part of the hand to which the key point corresponds, and the key point coordinates represent the position of the key point relative to the other key points in the key frame.
In this embodiment, the key points may be the finger joints and the wrist joint, and the key point identifier may be a code for a part of the hand: for example, 0 may represent the wrist joint, 1 the thumb, 2 the index finger, and so on, while A may denote the fingertip, B the first joint, C the second joint, and D the base joint of a finger. A key point identifier of "0" thus indicates that the key point is the wrist joint, and a key point identifier of "3B" indicates that the key point is the first joint of the middle finger. The key point coordinates may be the pixel coordinates of the key point in the key frame; to reduce the interference of palm size with the positional relationship, the key frame may first be scaled to a normalized size, and the pixel coordinates of the key point then taken as the key point coordinates.
Step 404, according to the time sequence of each key frame, the feature information of the key points with the same key point identification in each key frame is encoded into the feature vector of the key point.
In this embodiment, the feature vector of a key point obtained in step 404 couples the position information and the time information of that key point, so that the movement track of the key point can be determined accurately.
Step 405, based on the feature vectors of the key points, a feature matrix is generated.
In this embodiment, the feature matrix is obtained from the feature vectors of the key points and contains the movement track of each key point of the user's hand, so the continuous motion of the user's hand can be represented accurately.
To illustrate the above steps: first, a key frame X^j may be selected from the video of the operation gesture every k frames. A key point extraction algorithm (for example, an OpenCV-based algorithm) may then be used to extract the key points of the hand from each key frame; there may be, for example, 21 key points, including the fingertips, the finger joints and the wrist joint. Next, a feature matrix X = {x_1, x_2, ..., x_21} is constructed based on the key points, where each feature vector x_i in X contains the key point identifier and the key point coordinates of key point i. Finally, the x_i^j corresponding to the individual key frames are spliced, in the time order of the key frames, into x_i, which yields the feature matrix of the operation gesture.
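A Python sketch of this feature matrix construction follows; it assumes an external key point extractor (for example, an OpenCV- or MediaPipe-based detector) returning 21 (identifier, x, y) triples per key frame, which is an assumption of the example and is not implemented here.

```python
import numpy as np

NUM_KEYPOINTS = 21

def build_feature_matrix(video_frames, extract_keypoints, k=5, num_keyframes=16):
    """Sample a key frame every k frames and stack per-key-point feature vectors."""
    keyframes = video_frames[::k][:num_keyframes]
    rows = []
    for kp_id in range(NUM_KEYPOINTS):      # one feature vector x_i per key point
        vector = []
        for frame in keyframes:             # time order of the key frames
            _, x, y = extract_keypoints(frame)[kp_id]
            vector.extend([kp_id, x, y])    # key point identifier and coordinates
        rows.append(vector)
    return np.asarray(rows)                 # feature matrix X, shape (21, 3 * num_keyframes)
```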
Step 406, inputting the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture.
It should be noted that steps 402 to 405 above may be integrated, as a preprocessing module, into the operation gesture recognition model of step 406; this is not limited in the present application.
In some optional implementations of this embodiment, before step 406 is performed, the following steps may also be adopted: flattening the feature matrix to obtain a first-order feature matrix; and taking the first-order feature matrix as the feature matrix. This improves the recognition efficiency of the operation gesture recognition model.
In some optional implementations of the embodiment, the type identifier of the operation gesture may be determined as follows: inputting the characteristic matrix into an operation gesture recognition model, and estimating confidence degrees of various types of identifications corresponding to the operation gesture; and determining the type identification of the operation gesture based on the confidence of each type identification.
In this embodiment, the confidence of a type identifier characterizes the probability that the operation gesture belongs to that type identifier. As an example, after the execution body inputs the video of the operation gesture into the operation gesture recognition model, the model outputs the confidences of the multiple type identifiers to which the operation gesture may correspond, and the type identifier with the highest confidence is determined as the type identifier of the operation gesture.
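A Python sketch of this confidence-based selection follows; the list of type identifiers is an illustrative assumption.

```python
import numpy as np

TYPE_IDS = ["click", "double click", "slide", "drag", "drag and drop"]

def pick_type_id(scores):
    """Turn raw scores into per-type confidences and return the most likely type identifier."""
    exp = np.exp(scores - np.max(scores))
    confidences = exp / exp.sum()          # softmax over the type identifiers
    return TYPE_IDS[int(np.argmax(confidences))], confidences
```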
Step 407, determining an operation area indicated by the operation gesture based on the video of the operation gesture. This step corresponds to the step 204, and is not described herein again.
In some optional implementations of the present embodiment, the operation region indicated by the operation gesture may be determined via the following steps: determining the movement track of each key point based on the characteristic information of each key point in each key frame; and determining an operation area indicated by the operation gesture based on the moving track of each key point.
In this implementation, the execution body may determine the operation area indicated by the operation gesture based on the feature information of the key points. For example, when the execution body recognizes that the key point at the tip of the user's index finger moves from far to near in the direction perpendicular to the operation panel and stays at its end position, while the positions of the other key points remain unchanged, the coordinates of the key point at the tip of the index finger can be determined as the operation area of the operation gesture.
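A Python sketch of this tap case follows; the use of a depth coordinate for the displacement perpendicular to the panel, the fingertip index and the thresholds are all assumptions of the example.

```python
import numpy as np

def tap_operation_area(trajectories, index_tip_id=8,
                       approach_threshold=0.1, still_threshold=0.02):
    """trajectories maps key point identifier -> array of (x, y, z) per key frame."""
    tip = np.asarray(trajectories[index_tip_id], dtype=float)
    approached = (tip[0, 2] - tip[-1, 2]) > approach_threshold  # fingertip moved toward the panel
    others_still = all(
        np.linalg.norm(np.asarray(t, dtype=float)[-1, :2] -
                       np.asarray(t, dtype=float)[0, :2]) < still_threshold
        for kp_id, t in trajectories.items() if kp_id != index_tip_id
    )
    if approached and others_still:
        return tuple(tip[-1, :2])  # (x, y) of the fingertip at the end of the tap
    return None
```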
Step 408, determining an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier. This step corresponds to step 205 and is not described here again.
In step 409, the operation instruction is executed. This step corresponds to the step 206, and is not described herein again.
As can be seen from fig. 4, compared with the embodiment shown in fig. 2, the flow 400 of the method for operating a mobile device in this embodiment highlights the steps of sampling key frames, extracting the feature information of key points from the key frames, and generating a feature matrix based on the feature information of the key points in each key frame. Sampling key frames reduces the amount of data involved in operation gesture recognition, and generating the feature vectors of the key points according to the time order of the key frames allows the continuous motion of the user's hand to be represented accurately, which improves recognition accuracy.
Furthermore, in some optional implementations of the foregoing embodiments, the method for operating a mobile device of the present disclosure may further include: broadcasting the operation instruction by voice. In this way, when the execution body executes the operation instruction, it can inform the user of the action being executed so that the user can take subsequent actions, which further improves the convenience of operation for visually impaired users.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for operating a mobile device, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for operating a mobile device of this embodiment includes: a gesture detection unit 501 configured to acquire a video of an operation gesture in response to detecting the operation gesture; a gesture recognition unit 502 configured to input the video of the operation gesture into a pre-trained operation gesture recognition model and determine a type identifier of the operation gesture; an identifier determining unit 503 configured to determine, based on a prestored correspondence between operation gesture type identifiers and operation instruction identifiers, the operation instruction identifier corresponding to the type identifier of the operation gesture; an area determining unit 504 configured to determine an operation area indicated by the operation gesture based on the video of the operation gesture; an instruction determining unit 505 configured to determine an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier; and an instruction execution unit 506 configured to execute the operation instruction.
In this embodiment, the gesture recognition unit 502 further includes: a sampling module configured to sample a preset number of key frames from the video of the operation gesture; a feature extraction module configured to extract key points of the hand from each key frame and generate feature information for each key point of the hand in the key frame, wherein the feature information includes a key point identifier and key point coordinates, the key point identifier represents the part of the hand to which the key point corresponds, and the key point coordinates represent the position of the key point relative to the other key points in the key frame; a vector generation module configured to encode, in the time order of the key frames, the feature information of the key points sharing the same key point identifier across the key frames into a feature vector of that key point; a matrix generation module configured to generate a feature matrix based on the feature vectors of the key points; and a gesture recognition module configured to input the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture.
In some embodiments, the region determining unit 504 is further configured to determine the operation region indicated by the operation gesture via: determining the movement track of each key point based on the characteristic information of each key point in each key frame; and determining an operation area indicated by the operation gesture based on the moving track of each key point.
In some embodiments, the gesture recognition unit 502 further comprises a matrix flattening processing module configured to flatten the feature matrix to obtain a first-order feature matrix and to take the first-order feature matrix as the feature matrix.
In some embodiments, the gesture recognition module is further configured to: input the feature matrix into the operation gesture recognition model and estimate a confidence for each type identifier to which the operation gesture may correspond; and determine the type identifier of the operation gesture based on the confidences of the type identifiers.
In some embodiments, the gesture detection unit 501 further includes: a motion detection module configured to capture video of hand motions in response to detecting the hand motions; the motion determination module is configured to determine the hand motion as the operation gesture in response to determining that the collected video of the hand motion meets the preset condition, and determine the video of the hand motion as the video of the operation gesture.
In some embodiments, the apparatus 500 further comprises: and the voice broadcasting unit is configured to broadcast the operation instruction in voice.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), and the like. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the use range of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to detecting the operation gesture, acquiring a video of the operation gesture; inputting the video of the operation gesture into a pre-trained operation gesture recognition model, and determining the type identifier of the operation gesture; determining an operation instruction identifier corresponding to the type identifier of the operation gesture based on the corresponding relationship between the type identifier of the prestored operation gesture and the operation instruction identifier; determining an operation area indicated by the operation gesture based on the video of the operation gesture; determining an operation instruction represented by an operation gesture based on the operation area and the operation instruction identification; and executing the operation instruction.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a gesture detection unit, a gesture recognition unit, an identifier determining unit, an area determining unit, and an instruction execution unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the gesture detection unit may also be described as "a unit that acquires a video of an operation gesture in response to detecting the operation gesture".
The foregoing description covers only the preferred embodiments of the present disclosure and the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (16)

1. A method for operating a mobile device, comprising:
in response to detecting an operation gesture, acquiring a video of the operation gesture;
inputting the video of the operation gesture into a pre-trained operation gesture recognition model, and determining the type identifier of the operation gesture;
determining an operation instruction identifier corresponding to the type identifier of the operation gesture based on a prestored correspondence between the type identifier of the operation gesture and the operation instruction identifier;
determining an operation area indicated by the operation gesture based on the video of the operation gesture;
determining an operation instruction represented by the operation gesture based on the operation area and the operation instruction identification;
and executing the operation instruction.
2. The method of claim 1, wherein the type identifier is determined via:
sampling a preset number of key frames from the video of the operation gesture;
extracting key points of a hand from the key frames, and generating feature information of each key point in the key frame, wherein the feature information comprises a key point identifier and key point coordinates, the key point identifier is used for representing the part of the hand to which the key point corresponds, and the key point coordinates are used for representing the position of the key point relative to other key points in the key frame;
according to the time sequence of each key frame, encoding the feature information of the key points with the same key point identification in each key frame into the feature vector of the key point;
generating a feature matrix based on the feature vector of each key point;
and inputting the characteristic matrix into the operation gesture recognition model to obtain the type identification of the operation gesture.
3. The method of claim 2, wherein the operating region is determined via:
determining the movement track of each key point based on the characteristic information of each key point in each key frame;
and determining an operation area indicated by the operation gesture based on the moving track of each key point.
4. The method of claim 2, wherein, prior to inputting the feature matrix into the operation gesture recognition model, the method further comprises: flattening the feature matrix to obtain a first-order feature matrix;
and taking the first-order feature matrix as the feature matrix.
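Flattening a two-dimensional feature matrix into a first-order (one-dimensional) feature matrix, as in claim 4, is an ordinary reshape; a short NumPy sketch follows, with the matrix dimensions chosen purely for illustration.

```python
import numpy as np

feature_matrix = np.random.rand(21, 32).astype(np.float32)  # e.g. 21 key points x 32 features
first_order_matrix = feature_matrix.flatten()                # first-order (1-D) feature matrix, shape (672,)
feature_matrix = first_order_matrix                          # the flattened matrix now serves as the feature matrix
```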
5. The method of claim 2, wherein inputting the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture comprises:
inputting the feature matrix into the operation gesture recognition model, and estimating a confidence for each type identifier that may correspond to the operation gesture;
and determining the type identifier of the operation gesture based on the confidences of the type identifiers.
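Claim 5 amounts to selecting, among the per-type confidences output by the recognition model, the type identifier with the highest confidence; the minimum-confidence threshold in the sketch below is an added assumption, not a claimed feature.

```python
def type_identifier_from_confidences(confidences, min_confidence=0.5):
    """confidences: {type_identifier: confidence} as estimated by the recognition model.
    Returns the identifier with the highest confidence, or None if it falls below the
    (assumed) minimum threshold."""
    best_id, best_conf = max(confidences.items(), key=lambda item: item[1])
    return best_id if best_conf >= min_confidence else None

# Example: type_identifier_from_confidences({"swipe_left": 0.07, "pinch": 0.81, "tap": 0.12}) -> "pinch"
```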
6. The method of claim 1, wherein, in response to detecting an operation gesture, acquiring a video of the operation gesture comprises:
in response to detecting a hand motion, acquiring a video of the hand motion;
and in response to determining that the acquired video of the hand motion meets a preset condition, determining the hand motion as the operation gesture and determining the video of the hand motion as the video of the operation gesture.
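The claims leave the preset condition open; purely as an assumed example, it could require a minimum gesture duration and a minimum fraction of frames in which a hand is detected, as sketched below (all thresholds are illustrative).

```python
def meets_preset_condition(frames, hand_present, fps=30.0,
                           min_seconds=0.5, min_hand_ratio=0.8):
    """frames: captured frames of the hand motion; hand_present: per-frame booleans
    indicating whether a hand was detected. All thresholds are illustrative assumptions."""
    duration_ok = len(frames) / fps >= min_seconds
    ratio_ok = sum(hand_present) / max(len(hand_present), 1) >= min_hand_ratio
    return duration_ok and ratio_ok
```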
7. The method according to any one of claims 1 to 6, further comprising: broadcasting the operation instruction by voice.
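Voice broadcasting of the determined operation instruction could, for instance, be implemented with an off-the-shelf text-to-speech engine; the use of the pyttsx3 package below is an assumption made for illustration only.

```python
import pyttsx3

def broadcast_instruction(instruction_text):
    engine = pyttsx3.init()        # initialise the platform's text-to-speech backend
    engine.say(instruction_text)   # queue the spoken announcement of the instruction
    engine.runAndWait()            # block until playback finishes

# broadcast_instruction("Volume up")
```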
8. An apparatus for operating a mobile device, comprising:
a gesture detection unit configured to acquire a video of an operation gesture in response to detection of the operation gesture;
a gesture recognition unit configured to input the video of the operation gesture into a pre-trained operation gesture recognition model and determine the type identifier of the operation gesture;
an identification determination unit configured to determine an operation instruction identifier corresponding to the type identifier of the operation gesture based on a prestored correspondence between type identifiers of operation gestures and operation instruction identifiers;
an area determination unit configured to determine an operation area indicated by the operation gesture based on the video of the operation gesture;
an instruction determination unit configured to determine an operation instruction represented by the operation gesture based on the operation area and the operation instruction identifier;
an instruction execution unit configured to execute the operation instruction.
9. The apparatus of claim 8, wherein the gesture recognition unit further comprises:
a sampling module configured to sample a preset number of key frames from the video of the operation gesture;
a feature extraction module configured to extract key points of a hand from each key frame and generate feature information of each key point in the key frame, wherein the feature information comprises a key point identifier and key point coordinates, the key point identifier represents the position of the key point on the hand, and the key point coordinates represent the position of the key point in the key frame relative to the other key points;
a vector generation module configured to encode, according to the time order of the key frames, the feature information of key points having the same key point identifier in the key frames into a feature vector of that key point;
a matrix generation module configured to generate a feature matrix based on the feature vector of each key point;
and a gesture recognition module configured to input the feature matrix into the operation gesture recognition model to obtain the type identifier of the operation gesture.
10. The apparatus according to claim 8, wherein the area determination unit is further configured to determine the operation area indicated by the operation gesture via:
determining the movement track of each key point based on the feature information of each key point in each key frame;
and determining the operation area indicated by the operation gesture based on the movement track of each key point.
11. The apparatus of claim 9, wherein the gesture recognition unit further comprises a matrix flattening processing module configured to flatten the feature matrix to obtain a first-order feature matrix and take the first-order feature matrix as the feature matrix.
12. The apparatus of claim 9, wherein the gesture recognition module is further configured to:
input the feature matrix into the operation gesture recognition model, and estimate a confidence for each type identifier that may correspond to the operation gesture;
and determine the type identifier of the operation gesture based on the confidences of the type identifiers.
13. The apparatus of claim 8, wherein the gesture detection unit further comprises:
a motion detection module configured to capture video of a hand motion in response to detecting the hand motion;
a motion determination module configured to determine the hand motion as an operation gesture and determine the video of the hand motion as the video of the operation gesture in response to determining that the captured video of the hand motion satisfies a preset condition.
14. The apparatus according to any one of claims 8 to 13, further comprising: a voice broadcasting unit configured to broadcast the operation instruction by voice.
15. A mobile device, comprising:
an image capture device configured to capture video;
one or more processors;
a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202011215818.7A 2020-11-04 2020-11-04 Method and apparatus for operating a mobile device Pending CN112286360A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011215818.7A CN112286360A (en) 2020-11-04 2020-11-04 Method and apparatus for operating a mobile device
PCT/CN2021/123871 WO2022095674A1 (en) 2020-11-04 2021-10-14 Method and apparatus for operating mobile device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011215818.7A CN112286360A (en) 2020-11-04 2020-11-04 Method and apparatus for operating a mobile device

Publications (1)

Publication Number Publication Date
CN112286360A true CN112286360A (en) 2021-01-29

Family

ID=74352184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011215818.7A Pending CN112286360A (en) 2020-11-04 2020-11-04 Method and apparatus for operating a mobile device

Country Status (2)

Country Link
CN (1) CN112286360A (en)
WO (1) WO2022095674A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286360A (en) * 2020-11-04 2021-01-29 北京沃东天骏信息技术有限公司 Method and apparatus for operating a mobile device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255324A (en) * 2018-09-05 2019-01-22 北京航空航天大学青岛研究院 Gesture processing method, interaction control method and equipment
US20200167556A1 (en) * 2018-11-26 2020-05-28 Accenture Global Solutions Limited Real-time gesture detection and recognition
CN111144367A (en) * 2019-12-31 2020-05-12 重庆百事得大牛机器人有限公司 Auxiliary semantic recognition method based on gesture recognition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095674A1 (en) * 2020-11-04 2022-05-12 北京沃东天骏信息技术有限公司 Method and apparatus for operating mobile device
CN113616239A (en) * 2021-08-13 2021-11-09 北京华医共享医疗科技有限公司 Automatic ultrasonic detection method and system
WO2024078088A1 (en) * 2022-10-14 2024-04-18 支付宝(杭州)信息技术有限公司 Interaction processing method and apparatus

Also Published As

Publication number Publication date
WO2022095674A1 (en) 2022-05-12

Similar Documents

Publication Publication Date Title
WO2021115181A1 (en) Gesture recognition method, gesture control method, apparatuses, medium and terminal device
CN112286360A (en) Method and apparatus for operating a mobile device
CN109993150B (en) Method and device for identifying age
US11436863B2 (en) Method and apparatus for outputting data
EP3862914A1 (en) Video action recognition method, apparatus, and device, and storage medium
CN104808794B (en) lip language input method and system
CN111079554A (en) Method, device, electronic equipment and storage medium for analyzing classroom performance of students
CN112148128B (en) Real-time gesture recognition method and device and man-machine interaction system
CN109614613A (en) The descriptive statement localization method and device of image, electronic equipment and storage medium
CN109871800A (en) A kind of estimation method of human posture, device and storage medium
US11641352B2 (en) Apparatus, method and computer program product for biometric recognition
CN108133197B (en) Method and apparatus for generating information
CN105354560A (en) Fingerprint identification method and device
CN112364799A (en) Gesture recognition method and device
EP4273742A1 (en) Handwriting recognition method and apparatus, electronic device, and medium
CN104036240A (en) Face feature point positioning method and device
CN111401318A (en) Action recognition method and device
KR102094953B1 (en) Method for eye-tracking and terminal for executing the same
CN111160047A (en) Data processing method and device and data processing device
CN111783674A (en) Face recognition method and system based on AR glasses
CN109829431B (en) Method and apparatus for generating information
CN113408417A (en) Formula identification method and device for formula identification
CN113342170A (en) Gesture control method, device, terminal and storage medium
US20120086864A1 (en) Method and Apparatus for Determining Motion
CN111128131B (en) Voice recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination