CN108491808B - Method and device for acquiring information - Google Patents

Method and device for acquiring information

Info

Publication number
CN108491808B
CN108491808B
Authority
CN
China
Prior art keywords
lip
video
sample
corrected
image
Prior art date
Legal status
Active
Application number
CN201810263902.2A
Other languages
Chinese (zh)
Other versions
CN108491808A (en)
Inventor
刘晓乾
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority claimed from application CN201810263902.2A
Publication of application CN108491808A
Application granted
Publication of granted patent CN108491808B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233 Character input methods

Abstract

The embodiments of the present application disclose a method and a device for acquiring information. One embodiment of the method comprises: acquiring angle information and an initial lip motion video from an acquired face video, wherein the angle information is used for representing the angle at which the face in the face video is oriented; correcting the initial lip motion video through the angle information to obtain a corrected lip motion video; and importing the corrected lip motion video into a pre-trained lip recognition model to obtain character information corresponding to the corrected lip motion video, wherein the lip recognition model is used for representing the correspondence between the corrected lip motion video and the character information. The method and the device improve the accuracy of obtaining the character information corresponding to lip language.

Description

Method and device for acquiring information
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for acquiring information.
Background
With the development of information technology, people increasingly exchange information through electronic devices. A user can send information to a recipient through the text input or voice input functions of an information interaction application on an electronic device. For example, a user can enter text through a virtual keyboard provided by the information interaction application and send it to the recipient, or enter voice information through the application's voice input function, which the application can then send to the recipient directly or convert into text before sending.
Disclosure of Invention
The embodiment of the application aims to provide a method and a device for acquiring information.
In a first aspect, an embodiment of the present application provides a method for acquiring information, where the method includes: acquiring angle information and an initial lip motion video from the acquired face video, wherein the angle information is used for representing the angle at which the face in the face video is oriented; correcting the initial lip motion video through the angle information to obtain a corrected lip motion video; and importing the corrected lip motion video into a pre-trained lip recognition model to obtain character information corresponding to the corrected lip motion video, wherein the lip recognition model is used for representing the correspondence between the corrected lip motion video and the character information.
In some embodiments, the modifying the initial lip motion video according to the angle information to obtain a modified lip motion video includes: correcting each frame of image contained in the initial lip motion video through the angle information to obtain a corrected image corresponding to the image; and combining the corrected images to obtain a corrected lip motion video.
In some embodiments, the lip language recognition model includes a convolutional neural network, a recurrent neural network, and a fully-connected layer.
In some embodiments, the importing the corrected lip motion video into a pre-trained lip language recognition model to obtain text information corresponding to the corrected lip motion video includes: inputting the corrected lip motion video into the convolutional neural network to obtain image feature vectors of each frame of image of the corrected lip motion video, wherein the convolutional neural network is used for representing the correspondence between a video and the image feature vectors of each frame of image of the video; inputting the image feature vectors of each frame of image of the corrected lip motion video to the recurrent neural network to obtain the video feature vector of the corrected lip motion video, wherein the recurrent neural network is used for representing the correspondence between the image feature vectors of each frame of image of a video and the video feature vector of the video, and the video feature vector of a video is used for representing the association between the image feature vectors of each frame of image of the video; and inputting the video feature vector of the corrected lip motion video into the fully-connected layer to obtain the text information, wherein the fully-connected layer is used for representing the correspondence between the video feature vector of a video and the text information.
In some embodiments, the lip language recognition model is obtained by training the following steps: obtaining a plurality of sample lip action videos and sample text information corresponding to each sample lip action video in the plurality of sample lip action videos; and training to obtain the lip language recognition model by taking each sample lip action video in the plurality of sample lip action videos as input and taking sample character information corresponding to each sample lip action video in the plurality of sample lip action videos as output.
In some embodiments, the training of the lip recognition model using each of the plurality of sample lip motion videos as an input and using sample text information corresponding to each of the plurality of sample lip motion videos as an output includes: the following training steps are performed: sequentially inputting each sample lip motion video in the plurality of sample lip motion videos to an initial lip language recognition model to obtain predicted character information corresponding to each sample lip motion video in the plurality of sample lip motion videos; and comparing the predicted character information corresponding to each sample lip action video in the plurality of sample lip action videos with the sample character information corresponding to the sample lip action video to obtain the identification accuracy of the initial lip language identification model, determining whether the identification accuracy is greater than a preset accuracy threshold, and if so, taking the initial lip language identification model as the trained lip language identification model.
In some embodiments, the training of the lip recognition model using each of the plurality of sample lip motion videos as an input and using sample text information corresponding to each of the plurality of sample lip motion videos as an output includes: and responding to the condition that the accuracy is not larger than the preset accuracy threshold, adjusting the parameters of the initial lip language recognition model, and continuing to execute the training step.
In a second aspect, an embodiment of the present application provides an apparatus for acquiring information, where the apparatus includes: an information acquisition unit for acquiring angle information and an initial lip motion video from the acquired face video, wherein the angle information is used for representing the angle at which the face in the face video is oriented; a correction unit for correcting the initial lip motion video according to the angle information to obtain a corrected lip motion video; and a character information acquisition unit for importing the corrected lip motion video into a pre-trained lip language recognition model to obtain character information corresponding to the corrected lip motion video, wherein the lip language recognition model is used for representing the correspondence between the corrected lip motion video and the character information.
In some embodiments, the correction unit includes: a corrected image obtaining subunit, configured to correct, for each frame of image included in the initial lip motion video, the image according to the angle information to obtain a corrected image corresponding to the image; and the corrected lip motion video acquisition subunit is used for combining the corrected images to obtain a corrected lip motion video.
In some embodiments, the lip language recognition model includes a convolutional neural network, a recurrent neural network, and a fully-connected layer.
In some embodiments, the text information obtaining unit includes: an image feature vector obtaining subunit, configured to input the corrected lip motion video to the convolutional neural network, so as to obtain an image feature vector of each frame of image of the corrected lip motion video, where the convolutional neural network is used to represent a correspondence between the video and the image feature vector of each frame of image of the video; a video feature vector obtaining subunit, configured to input image feature vectors of each frame of image of the corrected lip motion video to the recurrent neural network, so as to obtain the video feature vector of the corrected lip motion video, where the recurrent neural network is used to represent a correspondence between the image feature vectors of each frame of image of the video and the video feature vector of the video, and the video feature vector of the video is used to represent an association between the image feature vectors of each frame of image of the video; and a character information acquisition subunit, configured to input the video feature vector of the corrected lip motion video into the fully-connected layer to obtain the character information, where the fully-connected layer is used to represent the correspondence between the video feature vector of the video and the character information.
In some embodiments, the apparatus includes a lip recognition model training unit, and the lip recognition model training unit includes: a sample information acquisition subunit, configured to acquire a plurality of sample lip motion videos and sample text information corresponding to each sample lip motion video in the plurality of sample lip motion videos; and a lip recognition model training subunit, configured to take each sample lip motion video in the plurality of sample lip motion videos as input and the sample text information corresponding to each sample lip motion video as output, and train to obtain the lip recognition model.
In some embodiments, the lip language recognition model training subunit includes: a lip language identification model training module, configured to sequentially input each sample lip action video in the multiple sample lip action videos to an initial lip language identification model, so as to obtain predicted text information corresponding to each sample lip action video in the multiple sample lip action videos; and comparing the predicted character information corresponding to each sample lip action video in the plurality of sample lip action videos with the sample character information corresponding to the sample lip action video to obtain the identification accuracy of the initial lip language identification model, determining whether the identification accuracy is greater than a preset accuracy threshold, and if so, taking the initial lip language identification model as the trained lip language identification model.
In some embodiments, the lip recognition model training subunit further includes: and the parameter adjusting module is used for responding to the condition that the accuracy is not greater than the preset accuracy threshold, adjusting the parameters of the initial lip language recognition model and continuing to execute the training step.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory for storing one or more programs; and a camera for capturing images; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method for acquiring information of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method for acquiring information of the first aspect.
According to the method and the device for acquiring information, angle information and an initial lip action video are acquired from an acquired face video; then correcting the initial lip action video through the angle information to obtain a corrected lip action video; and finally, the corrected lip motion video is imported into a pre-trained lip language recognition model to obtain the character information corresponding to the corrected lip motion video, so that the accuracy of obtaining the character information corresponding to the lip language is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for obtaining information according to the present application;
FIG. 3 is a flow diagram of one embodiment of a lip recognition model training method according to the present application;
FIG. 4 is a schematic illustration of an application scenario of a method for obtaining information according to the present application;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for obtaining information according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the method for acquiring information or the apparatus for acquiring information of the embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various image processing applications and information processing applications, such as a visible light camera, a near infrared camera, an image acquisition application, an image processing application, a text input application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display and a camera and supporting information processing, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a server that performs data processing on images or characters transmitted from the terminal apparatuses 101, 102, 103. The server may analyze and process the received data such as the images or the characters, and feed back the processing result (for example, the lip video, the lip images or the characters) to the terminal device.
It should be noted that the method for acquiring information provided in the embodiment of the present application is generally executed by the terminal devices 101, 102, and 103, and accordingly, the apparatus for acquiring information is generally disposed in the terminal devices 101, 102, and 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for obtaining information in accordance with the present application is shown. The method for acquiring information comprises the following steps:
step 201, angle information and an initial lip motion video are obtained from the obtained face video.
In this embodiment, the execution body of the method for acquiring information (for example, the terminal devices 101, 102, 103 shown in fig. 1) may acquire the face video of the user through a wired or wireless connection. The face video is captured by a visible light camera or a near-infrared camera of the execution body while the user performs lip language input. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (Ultra Wideband), and other wireless connections now known or developed in the future.
Existing information interaction applications generally require the user to enter correct pinyin or handwritten characters, which can easily become a barrier for users who have difficulty with such input (such as the elderly, children, or people with disabilities). When information is entered through the voice function of an existing information interaction application, the user has to speak aloud. On the one hand, voice input is easily overheard by others, which may leak information; on the other hand, it is also susceptible to interference from ambient noise and may disturb others.
When information is input through lip language, in this embodiment the face video of the user can first be captured by the visible light camera or near-infrared camera on the execution body. Generally, when a user inputs information by lip language, the visible light camera or near-infrared camera on the execution body (hereinafter collectively referred to as the camera) is held at a position facing the user's face, so that the camera can capture the user's face video. The execution body can then acquire angle information and an initial lip motion video from the face video. When the face video is a three-dimensional video, the execution body can determine the angle information of the face relative to the camera from the face images in the video. For example, the execution body may first determine whether the user's facial features are complete (for example, a face image containing the left and right eyebrows, left and right eyes, left and right cheeks, nose, and mouth) and whether the user's nose is at the midpoint of the frame. The execution body may then construct the user's reference face plane (for example, a plane through three reference points selected on the two eyes and the lips) and a reference face direction (for example, a direction perpendicular to the reference face plane and passing through the user's nose) from these facial features. Finally, the angle information of the face video can be determined from the reference face plane and the reference face direction; the angle information may be the angle between the reference face direction and the horizontal or vertical direction of the camera. When the face video is not a three-dimensional video, the images contained in the face video can first be processed to obtain corresponding three-dimensional images, and the step of obtaining the angle information is then performed. The execution body can also extract an initial lip motion video from the face video, for example frame by frame. The initial lip motion video is the video obtained directly from the face video and can be used to recognize lip language from the lip movements it contains.
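As an illustration of the reference-plane construction described above, the following is a minimal sketch assuming three 3-D facial landmarks (two eye points and one lip point) are already available; the landmark inputs, the +Z camera-axis convention, and the use of the unsigned angle are assumptions made here for illustration, not details specified by this embodiment.

```python
import numpy as np

def estimate_face_angle(left_eye, right_eye, lip_center):
    """Estimate the face orientation angle from three 3-D landmarks (assumed inputs).

    The three points span the reference face plane described above; the plane
    normal serves as the reference face direction. The returned value is the
    angle in degrees between that direction and the camera axis, assumed here
    to be the +Z axis of the camera coordinate system.
    """
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (left_eye, right_eye, lip_center))
    normal = np.cross(p2 - p1, p3 - p1)            # normal of the reference face plane
    normal /= np.linalg.norm(normal)
    camera_axis = np.array([0.0, 0.0, 1.0])        # assumed camera viewing direction
    # The sign of the normal depends on landmark ordering, so use the unsigned angle.
    cos_angle = np.clip(abs(np.dot(normal, camera_axis)), 0.0, 1.0)
    return np.degrees(np.arccos(cos_angle))

# A face looking straight at the camera gives an angle close to 0 degrees:
print(estimate_face_angle([-3, 1, 10], [3, 1, 10], [0, -4, 10]))  # ~0.0
```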
And step 202, correcting the initial lip motion video through the angle information to obtain a corrected lip motion video.
When inputting lip language through the execution body, the user usually tries to keep the face directly facing the camera so that the execution body can capture a good-quality face video. In practice, however, lip language input is usually accompanied by interference such as the user's head shaking or hand trembling, so the face in the captured video is not aligned with the camera, and this interference also appears in the initial lip motion video extracted from the face video. Moreover, the lips occupy a much smaller region of the face video than the face itself, so the interference has a greater influence on recognizing lip movements from the initial lip motion video. Therefore, in this embodiment the initial lip motion video can be corrected using the angle information to obtain a corrected lip motion video, and lip language recognition is then performed on the corrected lip motion video, which can improve the accuracy of lip language recognition. The correction applied to the initial lip motion video may be operations such as enlarging or reducing the images it contains, or other operations, which are not described in detail here.
In some optional implementation manners of this embodiment, the modifying the initial lip motion video according to the angle information to obtain a modified lip motion video may include the following steps:
first, for each frame of image included in the initial lip motion video, the image is corrected by the angle information, and a corrected image corresponding to the image is obtained.
Videos captured by the execution body through a camera can generally be divided into three-dimensional videos and non-three-dimensional videos, and the way the execution body corrects the initial lip motion video differs accordingly. When the acquired face video is a three-dimensional video, a three-dimensional structure can be built for the lip region in the initial lip motion video. An image in which the lips directly face the camera is then selected as a reference lip image, and, taking this reference lip image as a baseline, the distance or angle of the other lip images that do not face the camera is adjusted using the three-dimensional lip structure to obtain a corrected image for each lip image. When the acquired face video is a non-three-dimensional video (for example, an ordinary 2D video), an image in which the lips face the camera can likewise be taken as the reference lip image, and the lip image contained in each frame of the initial lip motion video is then stretched, symmetry-processed, direction-adjusted, or otherwise transformed according to the angle information to obtain a corrected image. The corrected image can be regarded as the image that would be obtained if the lips were aligned with the camera. Other correction methods may also be used and are not described in detail here; a sketch of the non-three-dimensional case is given after the combination step below.
And secondly, combining the corrected images to obtain a corrected lip motion video.
After the corrected images are obtained, they can be recombined in the order of the corresponding frames in the initial lip motion video. Each frame of the corrected lip motion video obtained after this combination is then an image in which the lips directly face the camera, which helps improve the accuracy of subsequent lip language recognition.
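The following is a minimal sketch of the non-three-dimensional branch of the correction described above, assuming a single horizontal (yaw) angle and an OpenCV affine warp; the stretch-by-1/cos(yaw) formula and the clamping value are illustrative assumptions, since this embodiment does not prescribe a specific correction formula.

```python
import cv2
import numpy as np

def correct_lip_frame(frame, yaw_deg):
    """Roughly frontalize one lip frame given a horizontal (yaw) angle in degrees.

    A crude 2-D approximation: a face turned by yaw_deg appears horizontally
    compressed by cos(yaw), so the frame is stretched back by 1 / cos(yaw)
    about its center. Symmetry processing or a full 3-D model, as mentioned
    above, could be layered on top of this.
    """
    yaw = np.radians(yaw_deg)
    scale_x = 1.0 / max(np.cos(yaw), 0.5)          # clamp to avoid extreme stretching
    h, w = frame.shape[:2]
    M = np.float32([[scale_x, 0.0, (1.0 - scale_x) * w / 2.0],  # stretch about the center
                    [0.0,     1.0, 0.0]])
    return cv2.warpAffine(frame, M, (w, h))

# Corrected frames are then recombined in the original frame order, e.g.:
# corrected_video = [correct_lip_frame(f, yaw_deg=15.0) for f in initial_lip_frames]
```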
And 203, importing the corrected lip motion video into a pre-trained lip language recognition model to obtain character information corresponding to the corrected lip motion video.
The execution body can store a lip language recognition model. The lip language identification model is used for representing the corresponding relation between the corrected lip motion video and the character information. Here, the text information is lip language information corresponding to the lip movement. The electronic equipment can train a lip language recognition model which can represent and correct the corresponding relation between the lip action video and the character information in various modes.
In this embodiment, the lip language recognition model may be an artificial neural network, which abstracts the neural network of the human brain from the information processing perspective, establishes a certain simple model, and forms different networks according to different connection modes. An artificial neural network is generally composed of a large number of nodes (or neurons) interconnected with each other, each node representing a specific output function, called an excitation function. Each connection between two nodes represents a weighted value, called a weight (also called a parameter), for the signal passing through the connection. The output of the network is different according to the connection mode, the weight value and the excitation function of the network. The lip language recognition model typically includes a plurality of layers, each layer including a plurality of nodes. Generally, the weights of the nodes in the same layer may be the same, and the weights of the nodes in different layers may be different, so that the parameters of the layers of the lip language recognition model may also be different. Here, the execution body may input the corrected lip motion video from the input side of the lip language recognition model, sequentially perform processing (for example, multiplication, convolution, or the like) of parameters of each layer in the lip language recognition model, and output the corrected lip motion video from the output side of the lip language recognition model, where the information output from the output side is character information corresponding to the corrected lip motion video.
As an example, the execution subject may generate a correspondence table storing the correspondence between a plurality of sample lip motion videos and sample text information based on counting a large number of sample lip motion videos and sample text information, and use the correspondence table as the lip language recognition model. In this way, the execution subject may sequentially compare the corrected lip motion video with the plurality of sample lip motion videos in the correspondence table, and if one sample lip motion video in the correspondence table is the same as or similar to the corrected lip motion video, use the sample text information corresponding to the sample lip motion video in the correspondence table as the text information of the corrected lip motion video.
As another example, the execution subject may first obtain a plurality of sample lip action videos and sample text information corresponding to each sample lip action video in the plurality of sample lip action videos; and then taking each sample lip action video in the plurality of sample lip action videos as input, taking sample text information corresponding to each sample lip action video in the plurality of sample lip action videos as output, and training to obtain the lip language recognition model. Here, the executing subject may take a plurality of sample lip motion videos and play them for those skilled in the art. Those skilled in the art can match sample text information for each sample lip motion video in the multiple sample lip motion videos according to actual lip language content, that is, the sample text information is lip language information.
In some optional implementations of the present embodiment, the lip language recognition model may include a convolutional neural network, a recurrent neural network, and a fully-connected layer. The step of importing the corrected lip motion video into a pre-trained lip language recognition model to obtain text information corresponding to the corrected lip motion video may include the following steps:
firstly, inputting the corrected lip motion video to the convolutional neural network to obtain an image feature vector of each frame of image of the corrected lip motion video.
In this embodiment, the execution body may input the corrected lip motion video into the convolutional neural network to obtain an image feature vector for each frame of the corrected lip motion video. A video is generally composed of multiple frames of images, and the image feature vector of each frame can be used to describe the features of that frame, such as the motion features, structural features, and position features of the lips. An image feature vector can, for example, be built by placing marker points on the image and using the coordinate values of those marker points in a planar or spatial coordinate system, as in the sketch below.
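As one hedged illustration of such a marker-point feature vector, the sketch below flattens a set of 2-D lip landmark coordinates into a fixed-length vector; the number of landmarks, the normalization, and how the landmarks are detected are assumptions, not requirements of this embodiment.

```python
import numpy as np

def lip_landmark_feature(landmarks):
    """Build a per-frame image feature vector from lip marker points.

    `landmarks` is an (N, 2) array of lip landmark coordinates in the image
    plane. Coordinates are centered and scaled so the vector describes lip
    shape and opening rather than absolute position in the frame.
    """
    pts = np.asarray(landmarks, dtype=float)
    pts -= pts.mean(axis=0)                       # remove the lip position in the frame
    scale = np.linalg.norm(pts, axis=1).mean()    # normalize for lip size / camera distance
    return (pts / max(scale, 1e-6)).ravel()       # flatten to a fixed-length feature vector

# Example with 4 hypothetical lip points (two corners, top, bottom):
print(lip_landmark_feature([[10, 50], [40, 50], [25, 42], [25, 58]]))
```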
In this embodiment, the convolutional neural network may be a feedforward neural network whose artificial neurons respond only to units within a local region of the preceding layer, which makes it perform well on large-scale image processing. In general, the basic structure of a convolutional neural network includes two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer and extracts the features of that local receptive field; once a local feature is extracted, its positional relationship to other features is also determined. The other is the feature mapping layer: each computation layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on the same plane share equal weights. Here, the execution body may feed the corrected lip motion video in at the input side of the convolutional neural network, process it successively with the parameters of each layer, and read the result at the output side, where the output is the image feature vector of each frame of the corrected lip motion video.
In this embodiment, the convolutional neural network is used to represent the correspondence between the video and the image feature vectors of each frame image of the video. The electronic device can train a convolutional neural network which can represent the corresponding relation between the video and the image feature vectors of each frame of image of the video in various ways.
As an example, the execution subject may generate a correspondence table storing correspondence relationships between a plurality of sample lip motion videos and image feature vectors of respective frames of images of the sample lip motion videos based on counting the image feature vectors of a large number of sample lip motion videos and respective frames of images of the sample lip motion videos, and use the correspondence table as a convolutional neural network. In this way, the execution subject may sequentially compare the corrected lip motion video with the plurality of sample lip motion videos in the correspondence table, and if one sample lip motion video in the correspondence table is the same as or similar to the corrected lip motion video, the image feature vector of each frame image of the sample lip motion video in the correspondence table may be used as the image feature vector of each frame image of the corrected lip motion video.
As another example, the executing subject may first obtain image feature vectors of each frame image of the sample lip motion video and the sample lip motion video; and then taking the sample lip motion video as input, taking the image feature vector of each frame image of the sample lip motion video as output, and training to obtain a convolutional neural network capable of representing the corresponding relation between the sample lip motion video and the image feature vector of each frame image of the sample lip motion video. In this way, the execution body may input the corrected lip motion video from the input side of the convolutional neural network, sequentially perform processing on parameters of each layer in the convolutional neural network, and output the result from the output side of the convolutional neural network, where the information output by the output side is the image feature vector of each frame of image of the corrected lip motion video.
And secondly, inputting the image feature vector of each frame of image of the corrected lip motion video into the recurrent neural network to obtain the video feature vector of the corrected lip motion video.
In this embodiment, the execution body may input the image feature vector of each frame of the corrected lip motion video into the recurrent neural network to obtain the video feature vector of the corrected lip motion video. The video feature vector can be used to represent the association between the image feature vectors of the individual frames. For example, the video feature vector may capture the dynamic features that emerge when the image feature vectors of several consecutive lip images are combined, such as the habitual lip movements made when pronouncing a specific character.
In this embodiment, the recurrent neural network is an artificial neural network whose nodes are connected by directed edges that form cycles. The essential feature of such a network is that it contains both feedback and feed-forward connections between processing units, so its internal state can exhibit dynamic temporal behavior.
In this embodiment, the recurrent neural network may be used to represent a correspondence between image feature vectors of each frame of image of the video and video feature vectors of the video, and the electronic device may train the recurrent neural network that may represent a correspondence between image feature vectors of each frame of image of the video and video feature vectors of the video in various ways.
As an example, the execution subject may generate a correspondence table storing correspondence between image feature vectors of each frame image of a plurality of sample lip motion videos and video feature vectors of the sample lip motion videos based on counting the image feature vectors of each frame image of a large number of sample lip motion videos and the video feature vectors of the sample lip motion videos, and use the correspondence table as a recurrent neural network. In this way, the execution subject can calculate the euclidean distance between the image feature vector of each frame image of the corrected lip motion video and the image feature vector of each frame image of the plurality of sample lip motion videos in the correspondence table. And if the Euclidean distance between the image feature vector of each frame image of one sample lip motion video in the corresponding relation table and the image feature vector of each frame image of the corrected lip motion video is smaller than a preset distance threshold, taking the video feature vector of the sample lip motion video in the corresponding relation table as the video feature vector of the corrected lip motion video.
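A minimal sketch of this correspondence-table variant is given below, assuming the table stores pairs of per-frame image feature matrices and video feature vectors for sample lip motion videos; the equal-length comparison and the threshold value are illustrative assumptions.

```python
import numpy as np

def lookup_video_feature(frame_features, table, distance_threshold=1.0):
    """Correspondence-table variant of the recurrent-neural-network step.

    frame_features: (T, D) array of image feature vectors for the corrected
    lip motion video. table: list of (sample_frame_features, video_feature)
    pairs built from sample lip motion videos.
    """
    for sample_frames, video_feature in table:
        if sample_frames.shape != frame_features.shape:
            continue                               # only compare videos of equal length
        # Mean Euclidean distance between corresponding frame feature vectors.
        dist = np.linalg.norm(sample_frames - frame_features, axis=1).mean()
        if dist < distance_threshold:
            return video_feature                   # reuse the stored video feature vector
    return None                                    # no sufficiently similar sample found
```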
As another example, the executing subject may first obtain image feature vectors of each frame of image of the sample lip motion video and video feature vectors of the sample lip motion video; and then taking the image feature vector of each frame of image of the sample lip motion video as input, taking the video feature vector of the sample lip motion video as output, and training to obtain a recurrent neural network capable of representing the corresponding relation between the image feature vector of each frame of image of the sample lip motion video and the video feature vector of the sample lip motion video. In this way, the execution body can input the image feature vector of each frame image of the corrected lip motion video from the input side of the recurrent neural network, sequentially process the parameters of each layer in the recurrent neural network, and output the image feature vector from the output side of the recurrent neural network, wherein the information output from the output side is the video feature vector of the corrected lip motion video.
And thirdly, inputting the video feature vector of the corrected lip motion video into the fully-connected layer to obtain the character information.
In this embodiment, the execution body may input the video feature vector of the corrected lip motion video into the fully-connected layer to obtain the text information corresponding to the corrected lip motion video. Different lip motions in the corrected lip motion video can correspond to different text content: the execution body extracts the video feature vector corresponding to the lip motion and then determines the corresponding text information from that vector.
In this embodiment, each node of the fully-connected layer is connected to all nodes of the output layer of the recurrent neural network and integrates the video feature vectors produced by that output layer. Because of this full connectivity, the fully-connected layer typically also has the most parameters. After the parameters of the fully-connected layer apply a linear transformation to the video feature vector, a nonlinear excitation function can be applied to the result, introducing nonlinearity and enhancing the expressive power of the lip language recognition model. The excitation function may be a softmax function, a common excitation function in artificial neural networks that is not described in detail here.
In this embodiment, the full connection layer may be used to represent a corresponding relationship between video feature vectors of a video and text information, and the execution main body may train the full connection layer that may represent a corresponding relationship between video feature vectors of a video and text information in a variety of ways.
As an example, the execution subject may generate a correspondence table storing correspondence between video feature vectors of a plurality of sample lip motion videos and sample text information based on counting the video feature vectors and the sample text information of a large number of sample lip motion videos, and use the correspondence table as a full-connected layer. In this way, the execution subject may calculate euclidean distances between the video feature vector of the corrected lip motion video and the video feature vectors of the plurality of sample lip motion videos in the correspondence table, and if the euclidean distance between the video feature vector of one sample lip motion video in the correspondence table and the video feature vector of the corrected lip motion video is smaller than a preset distance threshold, use the sample text information of the sample lip motion video in the correspondence table as the text information of the corrected lip motion video.
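Putting the three parts together, the following is a hedged PyTorch sketch of a convolutional neural network, recurrent neural network, and fully-connected layer arranged as described in this subsection; the layer sizes, the choice of a GRU, grayscale input frames, and the vocabulary size are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipReadingModel(nn.Module):
    """Sketch of a CNN + recurrent network + fully-connected lip reading model."""

    def __init__(self, vocab_size=1000, feat_dim=256, hidden_dim=512):
        super().__init__()
        # Convolutional neural network: one frame in, one image feature vector out.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(64 * 4 * 4, feat_dim),
        )
        # Recurrent neural network: sequence of image feature vectors -> video feature vector.
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Fully-connected layer with softmax excitation: video feature vector -> text unit.
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video):                      # video: (batch, frames, 1, H, W)
        b, t = video.shape[:2]
        frame_feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        _, last_hidden = self.rnn(frame_feats)     # last hidden state as the video feature
        logits = self.fc(last_hidden[-1])
        return logits.softmax(dim=-1)              # probabilities over candidate text units

# probs = LipReadingModel()(torch.randn(2, 16, 1, 64, 64))  # e.g. 16 corrected lip frames
```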
With further reference to FIG. 3, a flow 300 of one embodiment of a lip recognition model training method according to the present application is shown. In this embodiment, the lip language recognition model may include a convolutional neural network, a recurrent neural network, and a fully-connected layer, and the training method includes the following steps:
step 301, obtaining a plurality of sample lip motion videos and sample text information corresponding to each sample lip motion video in the plurality of sample lip motion videos.
In this embodiment, an executing subject (for example, terminal devices 101, 102, and 103 shown in fig. 1) of the lip language recognition model training method may obtain a plurality of sample lip motion videos and sample text information corresponding to each sample lip motion video in the plurality of sample lip motion videos.
In this embodiment, the execution subject may obtain a plurality of sample lip action videos and play the videos for a person skilled in the art, and the person skilled in the art may configure corresponding sample text information for each sample lip action video in the plurality of sample lip action videos according to actual lip language content in the sample lip action video, that is, the sample text information is lip language information.
Step 302, sequentially inputting each sample lip motion video in the plurality of sample lip motion videos to an initial lip language recognition model, and obtaining predicted text information corresponding to each sample lip motion video in the plurality of sample lip motion videos.
In this embodiment, the execution body may sequentially input each sample lip motion video of the plurality of sample lip motion videos into the initial lip language recognition model to obtain the predicted text information corresponding to each sample lip motion video. Here, the execution body may input each sample lip motion video at the input side of the initial lip language recognition model, process it successively with the parameters of each layer, and read the result at the output side, where the output is the predicted character information corresponding to that sample lip motion video. The initial lip language recognition model can be an untrained lip language recognition model or a lip language recognition model whose training has not been completed; each of its layers is provided with initialization parameters, and these parameters can be continuously adjusted during the training of the lip language recognition model.
In some optional implementations of this embodiment, the initial lip language recognition model is obtained by training through the following steps:
the method comprises the steps of firstly, obtaining a plurality of sample lip language videos and sample lip language information of each sample lip language video in the sample lip language videos.
When the initial lip language recognition model is trained, the training can be realized through the sample lip language video containing typical lip actions and the sample lip language information of the corresponding sample lip language video, so that the learning accuracy of the initial lip language recognition model is improved.
And secondly, training to obtain an initial lip language recognition model by using a machine learning method, taking each sample lip language video in the plurality of sample lip language videos as input and the sample lip language information of each sample lip language video as output.
The machine learning method may extract a motion feature vector (such as the image feature vector mentioned above) from the lip motion in each of the plurality of sample lip language videos. It can then establish the correspondence between the sample lip language information and the motion feature vectors, so that the corresponding sample lip language information can be determined from the lip motion, and the initial lip language recognition model is thereby obtained by training. The initial lip language recognition model may be implemented based on a deep learning model or the like and may have a structure including a convolutional neural network, a recurrent neural network, and a fully-connected layer.
Step 303, comparing the predicted text information corresponding to each sample lip motion video in the plurality of sample lip motion videos with the sample text information corresponding to the sample lip motion video to obtain the recognition accuracy of the initial lip language recognition model.
In this embodiment, the execution body may compare the predicted text information corresponding to each sample lip motion video with the sample text information corresponding to that sample lip motion video to obtain the recognition accuracy of the initial lip language recognition model. Specifically, if the predicted text information for a sample lip motion video is the same as or similar to its sample text information, the recognition by the initial lip language recognition model is counted as correct; if it is different or not similar, the recognition is counted as incorrect. The execution body may then calculate the ratio of the number of correctly recognized characters to the total number of characters in the samples and take this ratio as the recognition accuracy of the initial lip language recognition model.
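A small sketch of the character-level accuracy calculation described in this step, assuming predicted and sample texts are plain strings compared position by position; that comparison rule is an assumption, since the step also allows a looser same-or-similar judgement.

```python
def recognition_accuracy(predicted_texts, sample_texts):
    """Ratio of correctly recognized characters to the total number of sample characters."""
    correct = total = 0
    for predicted, target in zip(predicted_texts, sample_texts):
        total += len(target)
        # Count positions where the predicted character matches the sample character.
        correct += sum(p == t for p, t in zip(predicted, target))
    return correct / max(total, 1)

print(recognition_accuracy(["query weather info"], ["query weather info"]))  # 1.0
```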
Step 304, determining whether the recognition accuracy is greater than a preset accuracy threshold.
In this embodiment, the execution body may compare the recognition accuracy of the initial lip language recognition model with a preset accuracy threshold. If the recognition accuracy is greater than the preset accuracy threshold, step 305 is executed; if the recognition accuracy is not greater than the preset accuracy threshold, step 306 is executed.
And 305, taking the initial lip language recognition model as a trained lip language recognition model.
In this embodiment, when the recognition accuracy of the initial lip language recognition model is greater than the preset accuracy threshold, it indicates that the training of the lip language recognition model is completed. At this time, the execution subject may use the initial lip language recognition model as a trained lip language recognition model.
Step 306, adjusting parameters of the initial lip language recognition model.
In this embodiment, when the recognition accuracy of the initial lip language recognition model is not greater than the preset accuracy threshold, the execution body may adjust the parameters of the initial lip language recognition model and return to step 302, until a lip language recognition model capable of representing the correspondence between sample lip motion videos and sample text information has been trained.
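The training loop of steps 302 to 306 could be sketched as follows, reusing the model sketch given earlier and assuming a negative-log-likelihood loss and an Adam optimizer; none of these training details (loss, optimizer, round limit) are specified by this embodiment.

```python
import torch
import torch.nn as nn

def train_lip_model(model, sample_videos, sample_labels,
                    accuracy_threshold=0.9, max_rounds=100, lr=1e-3):
    """Train until recognition accuracy exceeds the preset threshold (steps 302-306).

    sample_videos: (N, frames, 1, H, W) tensor of corrected sample lip motion videos.
    sample_labels: (N,) tensor of indices of the sample text information.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()                              # the model outputs probabilities
    for _ in range(max_rounds):
        probs = model(sample_videos)                    # step 302: predicted text information
        accuracy = (probs.argmax(dim=-1) == sample_labels).float().mean().item()  # step 303
        if accuracy > accuracy_threshold:               # step 304
            return model                                # step 305: training finished
        optimizer.zero_grad()                           # step 306: adjust the parameters
        loss_fn(torch.log(probs + 1e-9), sample_labels).backward()
        optimizer.step()
    return model
```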
With continued reference to fig. 4, fig. 4 is a schematic diagram of an application scenario of the method for acquiring information according to the present embodiment. In the application scenario of fig. 4, a user who is in a situation where speaking aloud is not convenient may input lip language information through the terminal device 102. The user may start a lip language input application or lip language input function on the terminal device 102 and, within range of its camera, record a face video whose lip language content is "query weather information". The terminal device 102 acquires angle information and an initial lip motion video from the captured face video, corrects the initial lip motion video using the angle information to obtain a corrected lip motion video, and finally imports the corrected lip motion video into the lip language recognition model to obtain the character information "query weather information" corresponding to the corrected lip motion video. After obtaining the character information "query weather information", the terminal device 102 can query the weather information for the current time over the network.
The method provided by the embodiment of the application comprises the steps of firstly obtaining angle information and an initial lip action video from an obtained face video; then correcting the initial lip action video through the angle information to obtain a corrected lip action video; and finally, the corrected lip motion video is imported into a pre-trained lip language recognition model to obtain the character information corresponding to the corrected lip motion video, so that the accuracy of obtaining the character information corresponding to the lip language is improved.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for acquiring information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for acquiring information of the present embodiment may include: an information acquisition unit 501, a correction unit 502, and a character information acquisition unit 503. The information obtaining unit 501 is configured to obtain angle information and an initial lip motion video from an obtained face video, where the angle information is used to represent an angle faced by a face in the face video; the correction unit 502 is configured to correct the initial lip motion video according to the angle information to obtain a corrected lip motion video; the text information obtaining unit 503 is configured to import the corrected lip motion video into a pre-trained lip recognition model, so as to obtain text information corresponding to the corrected lip motion video, where the lip recognition model is used to represent a correspondence between the corrected lip motion video and the text information.
In some optional implementations of this embodiment, the modifying unit 502 may include: a modified image acquisition sub-unit (not shown in the figure) and a modified lip motion video acquisition sub-unit (not shown in the figure). The corrected image acquisition subunit is configured to correct, for each frame of image included in the initial lip motion video, the image according to the angle information to obtain a corrected image corresponding to the image; and the corrected lip motion video acquisition subunit is used for combining the corrected images to obtain a corrected lip motion video.
In some optional implementations of the present embodiment, the lip language recognition model may include a convolutional neural network, a recurrent neural network, and a fully-connected layer.
In some optional implementation manners of this embodiment, the text information obtaining unit 503 may include: an image feature vector acquisition subunit (not shown in the figure), a video feature vector acquisition subunit (not shown in the figure), and a text information acquisition subunit (not shown in the figure). The image feature vector acquisition subunit is configured to input the corrected lip motion video to the convolutional neural network, so as to obtain an image feature vector of each frame of image of the corrected lip motion video, where the convolutional neural network is configured to represent a correspondence between a video and an image feature vector of each frame of image of the video; the video feature vector acquisition subunit is configured to input image feature vectors of each frame of image of the corrected lip motion video to the recurrent neural network, so as to obtain video feature vectors of the corrected lip motion video, where the recurrent neural network is used to represent a correspondence between the image feature vectors of each frame of image of the video and the video feature vectors of the video, and the video feature vectors of the video are used to represent an association between the image feature vectors of each frame of image of the video; and the character information acquisition subunit is used for inputting the video feature vector of the corrected lip motion video into the fully-connected layer to obtain the character information, wherein the fully-connected layer is used for representing the correspondence between the video feature vector of the video and the character information.
In some optional implementations of this embodiment, the apparatus 500 for acquiring information may include a lip recognition model training unit (not shown in the figure), and the lip recognition model training unit may include: a sample information acquisition subunit (not shown in the figure) and a lip recognition model training subunit (not shown in the figure). The sample information acquisition subunit is configured to acquire a plurality of sample lip motion videos and sample text information corresponding to each sample lip motion video in the plurality of sample lip motion videos; and the lip recognition model training subunit is configured to take each sample lip motion video in the plurality of sample lip motion videos as input, take the sample text information corresponding to each sample lip motion video as output, and train to obtain the lip language recognition model.
In some optional implementations of this embodiment, the lip recognition model training subunit may include a lip language recognition model training module (not shown in the figure) configured to sequentially input each sample lip motion video in the plurality of sample lip motion videos into an initial lip language recognition model to obtain predicted text information corresponding to each sample lip motion video; compare the predicted text information corresponding to each sample lip motion video with the sample text information corresponding to that sample lip motion video to obtain the recognition accuracy of the initial lip language recognition model; determine whether the recognition accuracy is greater than a preset accuracy threshold; and, if the recognition accuracy is greater than the preset accuracy threshold, take the initial lip language recognition model as the trained lip language recognition model.
In some optional implementations of this embodiment, the lip recognition model training subunit may further include a parameter adjusting module (not shown in the figure) configured to, in response to the recognition accuracy being not greater than the preset accuracy threshold, adjust the parameters of the initial lip language recognition model and continue to perform the training step.
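A minimal sketch of this train-until-accurate procedure, written against the sketch model given earlier, could look as follows. The data loader of (video, text label) pairs, the cross-entropy loss, the Adam optimizer, and the 0.9 accuracy threshold are all assumptions introduced for the example.

# Illustrative sketch only: evaluate the current model on the sample lip motion
# videos, stop once the recognition accuracy exceeds the preset threshold, and
# otherwise adjust the parameters and continue the training step.
import torch
import torch.nn as nn

def train_until_accurate(model, sample_loader, accuracy_threshold=0.9,
                         lr=1e-3, max_epochs=100):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        # Compare predicted text with sample text to measure recognition accuracy.
        correct, total = 0, 0
        with torch.no_grad():
            for video, text_label in sample_loader:
                correct += (model(video).argmax(dim=1) == text_label).sum().item()
                total += text_label.numel()
        if total and correct / total > accuracy_threshold:
            return model                        # accuracy above the preset threshold
        # Otherwise adjust the model parameters and repeat the training step.
        for video, text_label in sample_loader:
            loss = criterion(model(video), text_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model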
The present embodiment also provides an electronic device, including: one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform the above-described method for obtaining information.
The present embodiment also provides a computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, carries out the above-mentioned method for acquiring information.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as needed. A camera (visible light camera and/or near infrared camera) 612 is also connected to the I/O interface 605 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor including an information acquisition unit, a correction unit, and a text information acquisition unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the text information acquisition unit may also be described as a "unit for obtaining text information through a lip language recognition model".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire angle information and an initial lip motion video from an acquired face video, where the angle information is used to represent the angle at which the face in the face video is oriented; correct the initial lip motion video through the angle information to obtain a corrected lip motion video; and import the corrected lip motion video into a pre-trained lip recognition model to obtain text information corresponding to the corrected lip motion video, where the lip recognition model is used to represent the correspondence between the corrected lip motion video and the text information.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (16)

1. A method for obtaining information, comprising:
acquiring angle information and an initial lip motion video from an acquired face video, wherein the angle information is used for representing the angle at which the face in the face video is oriented, and the angle information is used for indicating the degree to which the face in the face video is not directly facing a camera;
correcting the initial lip motion video through the angle information to obtain a corrected lip motion video, wherein the corrected lip motion video comprises a lip image directly facing the camera;
and importing the corrected lip motion video into a pre-trained lip recognition model to obtain text information corresponding to the corrected lip motion video, wherein the lip recognition model is used for representing the corresponding relation between the corrected lip motion video and the text information.
2. The method of claim 1, wherein the correcting the initial lip motion video through the angle information to obtain a corrected lip motion video comprises:
for each frame of image contained in the initial lip motion video, correcting the image through the angle information to obtain a corrected image corresponding to the image;
and combining the corrected images to obtain a corrected lip motion video.
3. The method of claim 1, wherein the lip language recognition model comprises a convolutional neural network, a recurrent neural network, and a fully connected layer.
4. The method according to claim 3, wherein the importing the corrected lip motion video into a pre-trained lip language recognition model to obtain text information corresponding to the corrected lip motion video comprises:
inputting the corrected lip motion video into the convolutional neural network to obtain image feature vectors of each frame of image of the corrected lip motion video, wherein the convolutional neural network is used for representing the corresponding relation between the video and the image feature vectors of each frame of image of the video;
inputting the image feature vectors of each frame of image of the corrected lip motion video to the recurrent neural network to obtain the video feature vectors of the corrected lip motion video, wherein the recurrent neural network is used for representing the corresponding relation between the image feature vectors of each frame of image of the video and the video feature vectors of the video, and the video feature vectors of the video are used for representing the association relation between the image feature vectors of each frame of image of the video;
and inputting the video feature vector of the corrected lip motion video into the fully connected layer to obtain the text information, wherein the fully connected layer is used for representing the corresponding relation between the video feature vector of the video and the text information.
5. The method of claim 1, wherein the lip recognition model is trained by:
obtaining a plurality of sample lip action videos and sample text information corresponding to each sample lip action video in the plurality of sample lip action videos;
and taking each sample lip action video in the plurality of sample lip action videos as input, taking sample text information corresponding to each sample lip action video in the plurality of sample lip action videos as output, and training to obtain the lip language recognition model.
6. The method according to claim 5, wherein the training of the lip language recognition model by taking each of the plurality of sample lip action videos as an input and the sample text information corresponding to each of the plurality of sample lip action videos as an output comprises:
performing the following training step: sequentially inputting each sample lip action video in the plurality of sample lip action videos into an initial lip language recognition model to obtain predicted text information corresponding to each sample lip action video in the plurality of sample lip action videos; comparing the predicted text information corresponding to each sample lip action video in the plurality of sample lip action videos with the sample text information corresponding to the sample lip action video to obtain a recognition accuracy of the initial lip language recognition model; determining whether the recognition accuracy is greater than a preset accuracy threshold; and, if so, taking the initial lip language recognition model as the trained lip language recognition model.
7. The method according to claim 6, wherein the training of the lip language recognition model by taking each of the plurality of sample lip action videos as an input and the sample text information corresponding to each of the plurality of sample lip action videos as an output further comprises:
in response to the recognition accuracy being not greater than the preset accuracy threshold, adjusting the parameters of the initial lip language recognition model, and continuing to perform the training step.
8. An apparatus for obtaining information, comprising:
the information acquisition unit is used for acquiring angle information and an initial lip motion video from an acquired face video, wherein the angle information is used for representing the angle at which the face in the face video is oriented, and the angle information is used for indicating the degree to which the face in the face video is not directly facing a camera;
the correction unit is used for correcting the initial lip motion video through the angle information to obtain a corrected lip motion video, wherein the corrected lip motion video comprises a lip image directly facing the camera;
and the text information acquisition unit is used for importing the corrected lip motion video into a pre-trained lip language recognition model to obtain text information corresponding to the corrected lip motion video, wherein the lip language recognition model is used for representing the corresponding relation between the corrected lip motion video and the text information.
9. The apparatus of claim 8, wherein the correction unit comprises:
a corrected image obtaining subunit, configured to correct, for each frame of image included in the initial lip motion video, the image according to the angle information, so as to obtain a corrected image corresponding to the image;
and the corrected lip motion video acquisition subunit is used for combining the corrected images to obtain a corrected lip motion video.
10. The apparatus of claim 8, wherein the lip language recognition model comprises a convolutional neural network, a recurrent neural network, and a fully connected layer.
11. The apparatus of claim 10, wherein the text information acquisition unit comprises:
the image feature vector acquisition subunit is configured to input the corrected lip motion video to the convolutional neural network, so as to obtain an image feature vector of each frame of image of the corrected lip motion video, wherein the convolutional neural network is configured to represent a correspondence between a video and the image feature vector of each frame of image of the video;
the video feature vector acquisition subunit is configured to input image feature vectors of each frame of image of the corrected lip motion video to the recurrent neural network to obtain video feature vectors of the corrected lip motion video, wherein the recurrent neural network is configured to represent a correspondence between the image feature vectors of each frame of image of the video and the video feature vectors of the video, and the video feature vectors of the video are configured to represent an association between the image feature vectors of each frame of image of the video;
and the text information acquisition subunit is used for inputting the video feature vector of the corrected lip motion video into the fully connected layer to obtain the text information, wherein the fully connected layer is used for representing the corresponding relation between the video feature vector of the video and the text information.
12. The apparatus of claim 8, wherein the apparatus comprises a lip recognition model training unit, the lip recognition model training unit comprising:
the device comprises a sample information acquisition subunit, a lip motion analysis subunit and a lip motion analysis subunit, wherein the sample information acquisition subunit is used for acquiring a plurality of sample lip motion videos and sample text information corresponding to each sample lip motion video in the plurality of sample lip motion videos;
and a lip recognition model training subunit, configured to take each sample lip action video in the plurality of sample lip action videos as input, take the sample text information corresponding to each sample lip action video in the plurality of sample lip action videos as output, and train to obtain the lip recognition model.
13. The apparatus of claim 12, wherein the lip recognition model training subunit comprises:
a lip language recognition model training module, configured to sequentially input each sample lip action video in the plurality of sample lip action videos into an initial lip language recognition model to obtain predicted text information corresponding to each sample lip action video in the plurality of sample lip action videos; compare the predicted text information corresponding to each sample lip action video in the plurality of sample lip action videos with the sample text information corresponding to the sample lip action video to obtain a recognition accuracy of the initial lip language recognition model; determine whether the recognition accuracy is greater than a preset accuracy threshold; and, if so, take the initial lip language recognition model as the trained lip language recognition model.
14. The apparatus of claim 13, wherein the lip recognition model training subunit further comprises:
and a parameter adjusting module, configured to adjust the parameters of the initial lip language recognition model in response to the recognition accuracy being not greater than the preset accuracy threshold, and continue to perform the training step.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the camera is used for collecting images;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN201810263902.2A 2018-03-28 2018-03-28 Method and device for acquiring information Active CN108491808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810263902.2A CN108491808B (en) 2018-03-28 2018-03-28 Method and device for acquiring information


Publications (2)

Publication Number Publication Date
CN108491808A CN108491808A (en) 2018-09-04
CN108491808B true CN108491808B (en) 2021-11-23

Family

ID=63316495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810263902.2A Active CN108491808B (en) 2018-03-28 2018-03-28 Method and device for acquiring information

Country Status (1)

Country Link
CN (1) CN108491808B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109166120A (en) * 2018-09-11 2019-01-08 百度在线网络技术(北京)有限公司 For obtaining the method and device of information
CN110113319A (en) * 2019-04-16 2019-08-09 深圳壹账通智能科技有限公司 Identity identifying method, device, computer equipment and storage medium
CN110119725B (en) * 2019-05-20 2021-05-25 百度在线网络技术(北京)有限公司 Method and device for detecting signal lamp
CN110415701A (en) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 The recognition methods of lip reading and its device
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN113239902B (en) * 2021-07-08 2021-09-28 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI590098B (en) * 2012-05-09 2017-07-01 劉鴻達 Control system using facial expressions as inputs

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117115A (en) * 2009-12-31 2011-07-06 上海量科电子科技有限公司 System for realizing text entry selection by using lip-language and realization method thereof
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN105787428A (en) * 2016-01-08 2016-07-20 上海交通大学 Method for lip feature-based identity authentication based on sparse coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Lip-Reading Recognition Technology Based on Video Images; Tao Hong (陶宏); China Excellent Master's and Doctoral Dissertations Full-text Database (Master), Information Science and Technology Series; 2005-08-15 (No. 8); Sections 3.2 and 5.3 *

Also Published As

Publication number Publication date
CN108491808A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491808B (en) Method and device for acquiring information
EP3885965B1 (en) Image recognition method based on micro facial expressions, apparatus and related device
US20190080148A1 (en) Method and apparatus for generating image
WO2020006961A1 (en) Image extraction method and device
WO2020258668A1 (en) Facial image generation method and apparatus based on adversarial network model, and nonvolatile readable storage medium and computer device
CN112889108B (en) Speech classification using audiovisual data
CN110349081B (en) Image generation method and device, storage medium and electronic equipment
US20160071024A1 (en) Dynamic hybrid models for multimodal analysis
CN111476871B (en) Method and device for generating video
WO2020078119A1 (en) Method, device and system for simulating user wearing clothing and accessories
CN108197592B (en) Information acquisition method and device
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN111133453A (en) Artificial neural network
WO2023050650A1 (en) Animation video generation method and apparatus, and device and storage medium
CN107832720B (en) Information processing method and device based on artificial intelligence
CN108388889B (en) Method and device for analyzing face image
CN112307947A (en) Method and apparatus for generating information
CN112395979A (en) Image-based health state identification method, device, equipment and storage medium
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN111259698A (en) Method and device for acquiring image
CN112149426B (en) Reading task processing method and related equipment
CN110942033B (en) Method, device, electronic equipment and computer medium for pushing information
CN110545386B (en) Method and apparatus for photographing image
CN111260756B (en) Method and device for transmitting information
CN112070022A (en) Face image recognition method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant