CN113095230A - Method and device for helping blind person to search for articles - Google Patents

Method and device for helping blind person to search for articles Download PDF

Info

Publication number
CN113095230A
CN113095230A
Authority
CN
China
Prior art keywords
article
user
image
item
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110400186.XA
Other languages
Chinese (zh)
Inventor
房云峰 (Fang Yunfeng)
俞益洲 (Yu Yizhou)
李一鸣 (Li Yiming)
乔昕 (Qiao Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202110400186.XA priority Critical patent/CN113095230A/en
Publication of CN113095230A publication Critical patent/CN113095230A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/632 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a method and a device for helping blind people find articles. The method comprises the following steps: acquiring, through a voice module, the article the user wants to find; acquiring a depth image of indoor articles captured by a binocular camera; inputting the image into a recognition model and identifying the article to be found; estimating the position of the article based on its depth image and guiding the user toward the article through the voice module; and, if search error information fed back by the user is received, fine-tuning the recognition model by modifying part of its weight parameters so that the article to be found is correctly recognized. Because the article to be found is obtained through the voice module and the user is guided to it by voice, the invention greatly facilitates the daily life of visually impaired people. The invention can also automatically fine-tune the recognition model according to the error information fed back by the user; since only part of the model's weight parameters are adjusted, the model can be quickly trained to correctly recognize the article being sought.

Description

Method and device for helping blind person to search for articles
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for helping a blind person to search for an article.
Background
Visual information is the most important source humans use to perceive their surroundings; about 80% of the information humans obtain is input through the visual system. According to World Health Organization statistics, there are about 285 million visually impaired people worldwide. Visually impaired people lack normal vision and have difficulty perceiving color and shape. In view of this, many intelligent products for assisting the daily life of the blind have been disclosed. The invention patent with application number 201810534069.0 discloses assistive glasses for the blind that perform visual positioning based on multi-modal data: a small processor processes the multi-modal data acquired by a GNSS receiver and a camera and outputs a positioning result. The glasses can perform positioning under different illumination conditions such as day and night, have a low false-detection rate, a low miss rate, good real-time performance, and good cross-platform performance, and can well meet visually impaired users' need for accurate positioning.
However, most existing assistive products for the blind focus on navigation and on detecting dangerous objects and obstacles (e.g., oncoming cars and puddles, to help the blind walk safely and avoid hazards); they lack the function of finding and locating indoor articles. Moreover, the underlying model is generally trained on a fixed set of training samples, cannot adapt to the specific application environment, and therefore has low detection accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a device for helping the blind to search for an article.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides a method for assisting a blind person in finding an item, comprising:
acquiring an article to be searched by a user through a voice module;
acquiring a depth image of an indoor article photographed by a binocular camera;
inputting the image into an identification model, and identifying an article to be searched;
estimating a location of the item based on the depth image of the item and guiding a user to approach the item through a voice module;
if search error information fed back by the user is received, fine-tuning the identification model by modifying part of the weight parameters so that the article to be found can be correctly identified.
Further, the identification model is a CNN-based item detection network YOLO-V3.
Further, the method of estimating the location of the item comprises:
acquiring a pixel value of the depth image of the article, wherein the pixel value is the distance R between the article and a user;
calculating the distance m, in pixel points, between the article and the central point of the image;
calculating the azimuth deviation alpha of the article relative to the position right in front of the user according to the following formula:
α=arcsin(mr/R)
where r is the actual distance represented by each pixel point.
Further, the method for fine-tuning the recognition model comprises the following steps:
adding the depth image of the indoor article to the training sample set and retraining the recognition model; during training, only the weight and bias parameters of the fully-connected layer that follows the feature-extraction layers of the recognition model are adjusted.
In a second aspect, the present invention provides an apparatus for assisting a blind person in finding an item, comprising:
the name acquisition module is used for acquiring an article to be searched by a user through the voice module;
the image acquisition module is used for acquiring a depth image of the indoor article shot by the binocular camera;
the article identification module is used for inputting the image into an identification model and identifying an article to be searched;
an item location module to estimate a location of the item based on the depth image of the item and guide a user to approach the item through a voice module;
and the model fine-tuning module is used for fine-tuning the identification model by modifying part of the weight parameters if the search error information fed back by the user is received, so that the identification model can correctly identify the searched article.
Further, the identification model in the item identification module is a CNN-based item detection network YOLO-V3.
Further, the method for the item location module to estimate the location of the item comprises:
acquiring a pixel value of the depth image of the article, wherein the pixel value is the distance R between the article and a user; calculating the distance m between the object in the image and the central point of the image according to the following formula:
m = √((x1 − x0)² + (y1 − y0)²)
where (x1, y1) and (x0, y0) are the coordinates of the article in the image and the coordinates of the image center, respectively, both in pixel points;
calculating the azimuth deviation alpha of the article relative to the position right in front of the user according to the following formula:
α=arcsin(mr/R)
in the formula, r is the actual distance represented by each pixel point.
Further, the method for the model fine-tuning module to fine-tune the identification model comprises the following steps:
adding the depth image of the indoor article to the training sample set and retraining the recognition model; during training, only the weight and bias parameters of the fully-connected layer that follows the feature-extraction layers of the recognition model are adjusted.
Compared with the prior art, the invention has the following beneficial effects.
According to the invention, the article the user wants to find is obtained through the voice module, the article can be automatically recognized and its position estimated, and the user can be guided by voice to find the article smoothly, which greatly facilitates the daily life of visually impaired people. The invention can also automatically fine-tune the recognition model according to the search error information fed back by the user; since only part of the model's weight parameters are adjusted, the model can quickly be trained to correctly recognize the article being sought.
Drawings
Fig. 1 is a flowchart of a method for helping a blind person to find an item according to an embodiment of the present invention.
FIG. 2 is a block diagram of a hardware module according to an embodiment of the present invention.
Fig. 3 is a block diagram of an apparatus for assisting the blind in finding an item according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for helping a blind person to find an article according to an embodiment of the present invention, including the following steps:
step 101, acquiring an article to be searched by a user through a voice module;
step 102, acquiring a depth image of an indoor article shot by a binocular camera;
step 103, inputting the image into an identification model, and identifying an article to be searched;
step 104, estimating the position of the article based on the depth image of the article, and guiding the user to approach the article through a voice module;
and step 105, if search error information fed back by the user is received, fine-tuning the identification model by modifying part of the weight parameters so that the identification model can correctly identify the article being sought.
The hardware structure involved in this embodiment (fig. 2) comprises at least a binocular camera, a recognition module, and a voice module. The binocular camera is generally mounted on a pair of glasses for the blind; it captures video images of the surroundings and feeds them to the recognition module. The recognition module may be a separate processor or a cloud server that identifies the article to be located by processing the input video images; a communication module is therefore generally required. The voice module interacts with the user, e.g., receiving the user's voice instruction (which article to find) or feedback information, and guiding the user toward the article by voice. The method described in this embodiment is implemented by a program executed on the processor or the cloud server.
In this embodiment, step 101 is mainly used to obtain, through the voice module, the name of the article the user wants to find. When the user wants to find a certain article, he or she can simply say so to the voice module, e.g., asking for a crutch or a cup. The voice module converts the received sound signal into text and passes the text to the recognition module. Providing a voice module greatly simplifies operation, especially for visually impaired users.
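For illustration only (this is not part of the disclosed embodiment), a minimal sketch of such a voice front end using the open-source Python package speech_recognition; the microphone setup, recognition backend, and language setting are all assumptions:

```python
import speech_recognition as sr  # pip install SpeechRecognition pyaudio

def listen_for_item() -> str:
    """Capture one utterance and return the recognized text, i.e. the article name."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate against room noise
        audio = recognizer.listen(source)
    # Any speech-to-text backend could sit here; Google's web API is one example.
    return recognizer.recognize_google(audio, language="zh-CN")

# e.g. the user says "找水杯" ("find the cup"); the text goes to the recognition module.
```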
In this embodiment, step 102 is mainly used to acquire the video images captured by the binocular camera. The binocular camera consists of left and right cameras that lie in the same plane, have parallel optical axes, and share identical parameters. Depth images can be obtained with a binocular camera. A depth image, also known as a range image, is an image whose pixel values are the distances (depths) from the camera to points in the scene; it directly reflects the geometry of the scene's visible surfaces. The distance between the camera and an object can thus be calculated from the video images captured by the binocular camera.
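As a hedged sketch (the patent does not specify the stereo algorithm), depth can be recovered from a rectified stereo pair with OpenCV's block matcher; the focal length, baseline, and image paths below are placeholder assumptions standing in for the real calibration:

```python
import cv2
import numpy as np

f_px = 700.0       # focal length in pixels (assumed calibration value)
baseline_m = 0.06  # distance between the left and right cameras in meters (assumed)

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder rectified pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Classic block matching; numDisparities must be a multiple of 16, blockSize odd.
stereo = cv2.StereoBM_create(numDisparities=96, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed point -> pixels

# depth (m) = f * B / disparity; non-positive disparities are invalid and masked out.
depth = np.where(disparity > 0, f_px * baseline_m / disparity, 0.0)
```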
In this embodiment, step 103 is mainly used to identify the article. Article identification based on video images belongs to the field of image recognition. Image recognition, i.e., the technique of processing, analyzing, and understanding images with a computer so as to recognize targets and objects in various patterns, is a practical application of deep learning algorithms. Current image recognition technology mainly covers face recognition and commodity (object) recognition: face recognition is applied chiefly to security checks, identity verification, and mobile payment, while commodity recognition is applied chiefly to the retail supply chain, in particular unmanned retail scenarios such as unmanned shelves and smart vending cabinets. In this embodiment, the captured image is input into the recognition model, and the article to be found is recognized from the image through feature extraction.
In this embodiment, step 104 is mainly used to estimate the position of the identified article. The position of an article is typically given by the distance between the article and the user together with the azimuth deviation of the article relative to the user's front, e.g., "60 centimeters ahead and 40 degrees to the right of the current orientation." The binocular camera captures depth images, from which the distance between the article and the user is obtained; the azimuth deviation of the article relative to the user's front is then obtained from the distance between the article and the image center on the image, together with the distance between the article and the user. After the position of the article has been estimated, the voice module guides the user toward the article by voice, e.g., "the crutch is 30 degrees to the left and 50 centimeters ahead of your current orientation."
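A minimal sketch of this guidance step, assuming the distance and azimuth from step 104 are already known and using the open-source pyttsx3 text-to-speech engine (an illustrative choice, not mandated by the patent):

```python
import pyttsx3

def announce_position(item: str, distance_m: float, azimuth_deg: float) -> None:
    """Speak a sentence such as 'the crutch is 30 degrees to the left, 0.5 meters ahead'."""
    side = "right" if azimuth_deg >= 0 else "left"
    sentence = (f"The {item} is {abs(azimuth_deg):.0f} degrees to the {side}, "
                f"{distance_m:.1f} meters ahead of your current orientation.")
    engine = pyttsx3.init()
    engine.say(sentence)
    engine.runAndWait()

announce_position("crutch", 0.5, -30.0)
```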
In this embodiment, step 105 is mainly used to fine-tune the model when an article is misidentified. Recognition errors are not surprising: the recognition model is initially trained on common, generic object features, which may deviate considerably, in color, shape, and so on, from the user's actual environment and belongings, so recognition accuracy is not high at first. The error information is generally fed back by the user through the voice module: after following the voice guidance to the indicated object, if the user finds it is not the article sought, he or she can report the wrong result by voice, e.g., "recognition error." In this embodiment, the recognition model is automatically fine-tuned once such search error information is received. Fine-tuning, as the name implies, is a targeted local adjustment of the model, e.g., modifying only part of the weight parameters. With this adaptive fine-tuning, training time is short and the model quickly learns to correctly recognize the articles in the user's environment, giving the user a better experience.
As an alternative embodiment, the identification model is a CNN-based item detection network YOLO-V3.
This embodiment provides a technical solution for the recognition model. YOLO-V3 is the third version of the YOLO (You Only Look Once) family of target detection algorithms; compared with its predecessors, its accuracy is markedly improved, especially on small targets. The network model of YOLO-V3 uses the Darknet-53 backbone, which consists mainly of a series of 1×1 and 3×3 convolutional layers, each followed by a BN layer and a LeakyReLU layer, for a total of 53 convolutional layers (the final "Connected" layer is likewise implemented as a convolution). Since no fully connected feed-forward layers are used, the network can accept input images of any size. YOLO-V3 also has no pooling layers; instead, the stride of certain convolutions is set to 2, achieving downsampling while passing scale-invariant features to the next layer. In addition, YOLO-V3 adopts ResNet-like and FPN-like structures, both of which help improve detection accuracy.
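For illustration (a sketch, not the patent's actual network code), the repeated Darknet-53 unit described above, convolution plus batch normalization plus LeakyReLU, with a stride-2 convolution taking the place of pooling, can be written in PyTorch as:

```python
import torch
import torch.nn as nn

class DarknetConv(nn.Module):
    """conv -> BN -> LeakyReLU, the basic building block of Darknet-53."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# A 3x3 convolution with stride 2 downsamples in place of a pooling layer:
down = DarknetConv(32, 64, kernel=3, stride=2)
x = torch.randn(1, 32, 416, 416)
print(down(x).shape)  # torch.Size([1, 64, 208, 208])
```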
As an alternative embodiment, the method of estimating the position of the item comprises:
acquiring a pixel value of the depth image of the article, wherein the pixel value is the distance R between the article and a user; calculating the distance m between the object in the image and the central point of the image according to the following formula:
m = √((x1 − x0)² + (y1 − y0)²)
where (x1, y1) and (x0, y0) are the coordinates of the article in the image and the coordinates of the image center, respectively, both in pixel points;
calculating the azimuth deviation alpha of the article relative to the position right in front of the user according to the following formula:
α=arcsin(mr/R)
in the formula, r is the actual distance represented by each pixel point.
This embodiment provides a technical solution for locating the article. As mentioned above, the binocular camera can capture a depth image of the article, whose pixel values represent the distance between the article and the camera, so the distance R between the user and the article is read directly from the depth image. The azimuth deviation (deflection angle) of the article relative to the user's front then follows from simple solid and plane geometry (solving a right triangle): the ratio of the article's lateral offset R1 from the user's forward direction to the distance R equals the sine of the deflection angle, i.e., sin α = R1/R. R1 is the product of the distance m (in pixel points) between the article and the image center and the actual distance r represented by each pixel point, R1 = m·r. And m is obtained from the coordinates of the article in the image and the coordinates of the image center; since image coordinates are measured in pixel points, m is likewise expressed in pixel points.
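A small self-contained sketch of this positioning arithmetic (illustrative; the per-pixel scale r would come from the camera calibration, which the patent does not give):

```python
import math

def estimate_position(R_m, item_xy, center_xy, r_m_per_px):
    """Return (distance R in meters, azimuth alpha in degrees) per the formulas above."""
    x1, y1 = item_xy
    x0, y0 = center_xy
    m = math.hypot(x1 - x0, y1 - y0)                       # m = sqrt((x1-x0)^2 + (y1-y0)^2)
    alpha = math.degrees(math.asin(m * r_m_per_px / R_m))  # alpha = arcsin(m*r/R)
    return R_m, alpha

# e.g. article detected at (380, 240) in a 640x480 image, read as 0.8 m away from
# the depth image, with an assumed scale of 2 mm per pixel point:
print(estimate_position(0.8, (380, 240), (320, 240), r_m_per_px=0.002))
# -> (0.8, 8.6...) i.e. about 8.6 degrees off the user's forward direction
```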
As an alternative embodiment, the method for fine-tuning the recognition model includes:
adding the depth image of the indoor article to the training sample set and retraining the recognition model; during training, only the weight and bias parameters of the fully-connected layer that follows the feature-extraction layers of the recognition model are adjusted.
This embodiment provides a technical solution for fine-tuning the recognition model. As mentioned above, fine-tuning is only a targeted local adjustment of the model, so that the model can quickly "adapt" to the environment. In this embodiment, to make the model "adapt", images of the specific articles in the user's room, such as the cup and crutch in daily use, are taken as training samples; their specific colors, shapes, and so on are used for feature extraction to train the model, which improves recognition accuracy in a targeted way. Local adjustment of the model parameters means that only the weights and biases of the fully-connected layer following feature extraction are corrected, while the parameters of the convolutional feature-extraction layers are left unchanged. This keeps the feature layers stable. Considering that fine-tuning might occasionally be misled in use, a restore-defaults function can be provided in practical applications to recover the original model.
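For illustration only, a hedged PyTorch sketch of this freeze-all-but-the-last-layer fine-tuning; torchvision's ResNet-18 stands in for the backbone (the patent's model is YOLO-V3), and the class count and data are placeholder assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)           # stand-in feature extractor
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = assumed number of article classes

for p in model.parameters():                    # freeze the feature-extraction layers
    p.requires_grad = False
for p in model.fc.parameters():                 # train only the final layer's weight and bias
    p.requires_grad = True

optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)  # lr = 0.01, as in the text below
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)            # stand-in for images of the user's articles
labels = torch.tensor([0, 1, 2, 3])

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()                                 # gradients reach only model.fc
optimizer.step()
```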
In the fully connected layer, the features of the data matrix are stored by rows, and the weights connecting the inputs to a given output are stored by columns; the bias vector is likewise stored by rows. Specifically, the layer can be expressed as:
σ(X)*W+B=Y
where X is the input of the fully-connected layer, σ is the activation function, Y is the output of the fully-connected layer, W is the weight matrix, and B is the bias matrix. In backpropagation, assuming the loss function is L, the derivatives of L with respect to W and B are:
∂L/∂W = σ(X)ᵀ · ∂L/∂Y
∂L/∂B = ∂L/∂Y
This yields the update formulas for W and B:
W* = W − α·∂L/∂W
B* = B − α·∂L/∂B
where W* and B* are the updated values of W and B, respectively, and α is the learning rate, typically 0.01.
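A compact numerical sketch of exactly this update rule (illustrative; the squared-error loss and ReLU activation are assumptions, since the text does not fix them):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))           # batch of 4 input feature vectors, stored by rows
W = rng.standard_normal((8, 3)) * 0.1     # fully-connected weight matrix
B = np.zeros((1, 3))                      # bias, stored by rows
target = rng.standard_normal((4, 3))
alpha = 0.01                              # learning rate, as above

sigma_X = np.maximum(X, 0)                # sigma = ReLU (assumed activation)
Y = sigma_X @ W + B                       # Y = sigma(X) * W + B

dL_dY = 2 * (Y - target) / len(X)         # gradient of an assumed squared-error loss
dL_dW = sigma_X.T @ dL_dY                 # dL/dW = sigma(X)^T * dL/dY
dL_dB = dL_dY.sum(axis=0, keepdims=True)  # dL/dB = dL/dY, summed over the batch

W = W - alpha * dL_dW                     # W* = W - alpha * dL/dW
B = B - alpha * dL_dB                     # B* = B - alpha * dL/dB
```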
Fig. 3 is a schematic composition diagram of an apparatus for assisting the blind in finding an object according to an embodiment of the present invention, the apparatus comprising:
the name acquisition module 11 is used for acquiring an article to be searched by a user through the voice module;
an image acquisition module 12 for acquiring a depth image of an indoor article photographed by a binocular camera;
the article identification module 13 is used for inputting the image into an identification model and identifying an article to be searched;
an item location module 14 for estimating a location of the item based on the depth image of the item and guiding a user to approach the item through a voice module;
and the model fine-tuning module 15 is used for fine-tuning the identification model by modifying part of the weight parameters if the search error information fed back by the user is received, so that the identification model can correctly identify the searched article.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an alternative embodiment, the identification model in the item identification module 13 is a CNN-based item detection network YOLO-V3.
As an alternative embodiment, the method for the item location module to estimate the location of the item comprises:
acquiring a pixel value of the depth image of the article, wherein the pixel value is the distance R between the article and a user; calculating the distance m between the object in the image and the central point of the image according to the following formula:
m = √((x1 − x0)² + (y1 − y0)²)
where (x1, y1) and (x0, y0) are the coordinates of the article in the image and the coordinates of the image center, respectively, both in pixel points;
calculating the azimuth deviation alpha of the article relative to the position right in front of the user according to the following formula:
α=arcsin(mr/R)
in the formula, r is the actual distance represented by each pixel point.
As an alternative embodiment, the method for the model fine tuning module 15 to fine tune the recognition model includes:
adding the depth image of the indoor article to the training sample set and retraining the recognition model; during training, only the weight and bias parameters of the fully-connected layer that follows the feature-extraction layers of the recognition model are adjusted.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method of assisting a blind in finding an item, comprising the steps of:
acquiring an article to be searched by a user through a voice module;
acquiring a depth image of an indoor article photographed by a binocular camera;
inputting the image into an identification model, and identifying an article to be searched;
estimating a location of the item based on the depth image of the item and guiding a user to approach the item through a voice module;
if search error information fed back by a user is received, fine-tuning the identification model by modifying part of the weight parameters so that the article to be found can be correctly identified.
2. The method for helping the blind to find articles as claimed in claim 1, wherein the recognition model is CNN-based article detection network YOLO-V3.
3. The method for assisting the blind in finding an item according to claim 1, wherein the method for estimating the location of the item comprises:
acquiring a pixel value of the depth image of the article, wherein the pixel value is the distance R between the article and a user;
calculating the distance m between the object in the image and the central point of the image according to the following formula:
m = √((x1 − x0)² + (y1 − y0)²)
where (x1, y1) and (x0, y0) are the coordinates of the article in the image and the coordinates of the image center, respectively, both in pixel points;
calculating the azimuth deviation alpha of the article relative to the position right in front of the user according to the following formula:
α=arcsin(mr/R)
in the formula, r is the actual distance represented by each pixel point.
4. The method of assisting the blind in locating items according to claim 1, wherein the method of fine tuning the recognition model comprises:
adding the depth image of the indoor article to the training sample set and retraining the recognition model; during training, only the weight and bias parameters of the fully-connected layer that follows the feature-extraction layers of the recognition model are adjusted.
5. An apparatus for assisting a blind in locating an item, comprising:
in a second aspect, the present invention provides an apparatus for assisting a blind person in finding an item, comprising:
the name acquisition module is used for acquiring an article to be searched by a user through the voice module;
the image acquisition module is used for acquiring a depth image of the indoor article shot by the binocular camera;
the article identification module is used for inputting the image into an identification model and identifying an article to be searched;
an item location module to estimate a location of the item based on the depth image of the item and guide a user to approach the item through a voice module;
and the model fine-tuning module is used for fine-tuning the identification model by modifying part of the weight parameters if the search error information fed back by the user is received, so that the identification model can correctly identify the searched article.
6. The apparatus for helping the blind to find out the article as claimed in claim 5, wherein the identification model in the article identification module is CNN-based article detection network YOLO-V3.
7. The apparatus for helping the blind to find the object as claimed in claim 5, wherein the method for the object location module to estimate the location of the object comprises:
acquiring a pixel value of the depth image of the article, wherein the pixel value is the distance R between the article and a user;
calculating the distance m between the object in the image and the central point of the image according to the following formula:
m = √((x1 − x0)² + (y1 − y0)²)
where (x1, y1) and (x0, y0) are the coordinates of the article in the image and the coordinates of the image center, respectively, both in pixel points;
calculating the azimuth deviation alpha of the article relative to the position right in front of the user according to the following formula:
α=arcsin(mr/R)
in the formula, r is the actual distance represented by each pixel point.
8. The apparatus for helping blind person to find out articles according to claim 5, wherein the method for fine tuning the identification model by the model fine tuning module comprises:
adding the depth image of the indoor article to the training sample set and retraining the recognition model; during training, only the weight and bias parameters of the fully-connected layer that follows the feature-extraction layers of the recognition model are adjusted.
CN202110400186.XA 2021-04-14 2021-04-14 Method and device for helping blind person to search for articles Pending CN113095230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110400186.XA CN113095230A (en) 2021-04-14 2021-04-14 Method and device for helping blind person to search for articles

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110400186.XA CN113095230A (en) 2021-04-14 2021-04-14 Method and device for helping blind person to search for articles

Publications (1)

Publication Number Publication Date
CN113095230A true CN113095230A (en) 2021-07-09

Family

ID=76677626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110400186.XA Pending CN113095230A (en) 2021-04-14 2021-04-14 Method and device for helping blind person to search for articles

Country Status (1)

Country Link
CN (1) CN113095230A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114532923A (en) * 2022-02-11 2022-05-27 珠海格力电器股份有限公司 Health detection method and device, sweeping robot and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107007437A (en) * 2017-03-31 2017-08-04 北京邮电大学 Interactive blind person's householder method and equipment
CN107273933A (en) * 2017-06-27 2017-10-20 北京飞搜科技有限公司 The construction method of picture charge pattern grader a kind of and apply its face tracking methods
CN107480129A (en) * 2017-07-18 2017-12-15 上海斐讯数据通信技术有限公司 A kind of article position recognition methods and the system of view-based access control model identification and speech recognition
CN108875714A (en) * 2018-08-16 2018-11-23 东北大学秦皇岛分校 Blind person is helped to find the system and method for article
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks
CN110109457A (en) * 2019-04-29 2019-08-09 北方民族大学 A kind of intelligent sound blind-guidance robot control method and control system
CN111523545A (en) * 2020-05-06 2020-08-11 青岛联合创智科技有限公司 Article searching method combined with depth information
CN111652838A (en) * 2020-04-16 2020-09-11 上海长征医院 Thyroid nodule positioning and ultrasonic report error correction method based on target detection network
CN111833311A (en) * 2020-06-18 2020-10-27 安徽农业大学 Image identification method based on deep learning and application of image identification method to rice disease identification
CN111840016A (en) * 2020-07-23 2020-10-30 宁波工程学院 Flexible and configurable intelligent navigation device for blind people
CN111983619A (en) * 2020-08-07 2020-11-24 西北工业大学 Underwater acoustic target forward scattering acoustic disturbance positioning method based on transfer learning
CN112233173A (en) * 2020-10-15 2021-01-15 上海海事大学 Method for searching and positioning indoor articles of people with visual impairment
CN112641608A (en) * 2020-12-31 2021-04-13 遵义师范学院 Blind-guiding auxiliary crutch based on CNN

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107007437A (en) * 2017-03-31 2017-08-04 北京邮电大学 Interactive blind person's householder method and equipment
CN107273933A (en) * 2017-06-27 2017-10-20 北京飞搜科技有限公司 The construction method of picture charge pattern grader a kind of and apply its face tracking methods
CN107480129A (en) * 2017-07-18 2017-12-15 上海斐讯数据通信技术有限公司 A kind of article position recognition methods and the system of view-based access control model identification and speech recognition
CN108875714A (en) * 2018-08-16 2018-11-23 东北大学秦皇岛分校 Blind person is helped to find the system and method for article
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks
CN110109457A (en) * 2019-04-29 2019-08-09 北方民族大学 A kind of intelligent sound blind-guidance robot control method and control system
CN111652838A (en) * 2020-04-16 2020-09-11 上海长征医院 Thyroid nodule positioning and ultrasonic report error correction method based on target detection network
CN111523545A (en) * 2020-05-06 2020-08-11 青岛联合创智科技有限公司 Article searching method combined with depth information
CN111833311A (en) * 2020-06-18 2020-10-27 安徽农业大学 Image identification method based on deep learning and application of image identification method to rice disease identification
CN111840016A (en) * 2020-07-23 2020-10-30 宁波工程学院 Flexible and configurable intelligent navigation device for blind people
CN111983619A (en) * 2020-08-07 2020-11-24 西北工业大学 Underwater acoustic target forward scattering acoustic disturbance positioning method based on transfer learning
CN112233173A (en) * 2020-10-15 2021-01-15 上海海事大学 Method for searching and positioning indoor articles of people with visual impairment
CN112641608A (en) * 2020-12-31 2021-04-13 遵义师范学院 Blind-guiding auxiliary crutch based on CNN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAOYU LU et al.: "Image recognition and blind-guiding algorithm based on improved YOLOv3", 2021 International Conference on Advances in Optics and Computational Sciences, 31 January 2021, pages 1-7 *
LIU QUANWEI (刘全伟) et al.: "Implementation of a deep-learning-based walking-assist device for the blind" (一种基于深度学习的盲人助行装置实现), Wireless Internet Technology (无线互联科技), No. 06, 31 March 2020, pages 69-71 *
XU ZHIHUA (许志华) et al.: "Collaborative UAV and Ground LiDAR Observation and Assessment of Geological Disasters and Building Damage" (地灾与建筑损毁的无人机与地面LiDAR协同观测及评估), Beijing Institute of Technology Press, 31 March 2019, pages 86-87 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114532923A (en) * 2022-02-11 2022-05-27 珠海格力电器股份有限公司 Health detection method and device, sweeping robot and storage medium
CN114532923B (en) * 2022-02-11 2023-09-12 珠海格力电器股份有限公司 Health detection method and device, sweeping robot and storage medium

Similar Documents

Publication Publication Date Title
WO2020042419A1 (en) Gait-based identity recognition method and apparatus, and electronic device
US20180137651A1 (en) Hybrid corner and edge-based tracking
CN110458025B (en) Target identification and positioning method based on binocular camera
CN106599994B (en) A kind of gaze estimation method based on depth Recurrent networks
CN109584213B (en) Multi-target number selection tracking method
US10019624B2 (en) Face recognition system and face recognition method
CN109559330B (en) Visual tracking method and device for moving target, electronic equipment and storage medium
CN103679742B (en) Method for tracing object and device
KR102106898B1 (en) Tracking method and system using a database of a person's faces
CN112184757A (en) Method and device for determining motion trail, storage medium and electronic device
CN109035307B (en) Set area target tracking method and system based on natural light binocular vision
CN112989889B (en) Gait recognition method based on gesture guidance
Cordea et al. Real-time 2 (1/2)-D head pose recovery for model-based video-coding
CN113095230A (en) Method and device for helping blind person to search for articles
CN115035546A (en) Three-dimensional human body posture detection method and device and electronic equipment
WO2020007156A1 (en) Human body recognition method and device, as well as storage medium
Saffoury et al. Blind path obstacle detector using smartphone camera and line laser emitter
Abobeah et al. Wearable RGB Camera-based Navigation System for the Visually Impaired.
He et al. Intelligent vehicle pedestrian tracking based on YOLOv3 and DASiamRPN
Zhang et al. Target tracking for mobile robot platforms via object matching and background anti-matching
CN116012942A (en) Sign language teaching method, device, equipment and storage medium
CN108694348B (en) Tracking registration method and device based on natural features
CN111724438B (en) Data processing method and device
Panda et al. Blending of Learning-based Tracking and Object Detection for Monocular Camera-based Target Following
Adu-Boahen et al. Optical flow for robot navigation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210709