CN112381052A - System and method for identifying visually impaired users in real time - Google Patents

System and method for identifying visually impaired users in real time

Info

Publication number
CN112381052A
CN112381052A
Authority
CN
China
Prior art keywords
identification
terminal
real time
visually impaired
image
Prior art date
Legal status
Pending
Application number
CN202011385486.7A
Other languages
Chinese (zh)
Inventor
蕭啟穎
岑仲欣
盧國慶
Current Assignee
Chuangqi Social Technology Co ltd
Original Assignee
Chuangqi Social Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chuangqi Social Technology Co ltd
Priority to CN202011385486.7A
Publication of CN112381052A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The embodiment of the invention discloses a system and a method for identifying images for visually impaired users in real time. The system comprises an identification terminal, which preprocesses the image captured in front of the visually impaired user in real time, continuously compares the preprocessing result with preset digital target values, and outputs the result in voice form when the similarity reaches a threshold; a server, which distributes the images uploaded by the identification terminal to the service terminals and sends the identification results uploaded by the service terminals back to the corresponding identification terminal; and a service terminal, through which a volunteer receives the uploaded image and sends a manual identification result to the server, or provides manual assistance to the visually impaired user in real time. The invention can intelligently identify the images shot by the visually impaired user, and upload and distribute them to corresponding volunteers, who send the identification information back to the user, thereby assisting visually impaired users in distinguishing scenes and objects in real time and comprehensively.

Description

System and method for identifying visually impaired users in real time
Technical Field
The invention relates to the technical field of vision assistance, in particular to a system and a method for identifying visually impaired users in real time.
Background
Typically, visually impaired people receive information via braille, radio, audio books and certain applications that can only read specific information to them. For recognizing images, there are also mobile applications that rely purely on artificial intelligence. However, artificial intelligence is not a universal solution to visual problems. A human can easily pick out the key information in a letter, whereas artificial intelligence typically reads out all of the content; alternatively, the artificial intelligence would have to be trained on many pictures of the same format before it could extract the key information in that particular format. This makes purely AI-based visual assistance less than ideal in practical use. Another reason is that human interaction is crucial for solving complex problems.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a system and a method for identifying visually impaired users in real time to assist visually impaired people in identifying images.
In order to solve the above technical problem, an embodiment of the present invention provides a system for identifying visually impaired users in real time, including:
An identification terminal: preprocesses the image captured in front of the visually impaired user in real time, continuously compares the preprocessing result with preset digital target values, and outputs the result in voice form when the similarity reaches a threshold; if the similarity does not reach the threshold, or the user chooses to enter manual mode, the current image is uploaded to the server;
a server: distributes the images uploaded by the identification terminal to the service terminals, and sends the identification results uploaded by the service terminals to the corresponding identification terminal; alternatively, connects the identification terminal and the service terminal in real time over the network;
a service terminal: through the service terminal, the corresponding volunteer receives the image uploaded by the identification terminal and then sends his or her manual identification result to the server; alternatively, the service terminal is connected to the identification terminal so that the volunteer and the visually impaired user can communicate directly in real time, providing manual help to the visually impaired user in real time.
Further, the identification terminal preprocesses the image using a convolutional neural network model: the image data passes in sequence through the model's first convolutional layer, second convolutional layer, pooling layer, first fully connected layer and second fully connected layer, after which the result is output.
Further, the identification terminal also comprises a character recognition module, which recognizes characters in the image and outputs them in voice form.
Correspondingly, the embodiment of the invention also provides a method for identifying the visually impaired user in real time, which comprises the following steps:
Step 1: the identification terminal preprocesses the image captured in front of the visually impaired user in real time, continuously compares the preprocessing result with preset digital target values, and outputs the result in voice form when the similarity reaches a threshold; if the similarity does not reach the threshold, or the user chooses to enter manual mode, the current image is uploaded to the server;
Step 2: the server distributes the images uploaded by the identification terminals to the service terminals, and sends the identification results uploaded by the service terminals to the corresponding identification terminals; alternatively, the identification terminal and the service terminal are connected in real time over the network;
Step 3: the corresponding volunteer receives the image uploaded by the identification terminal through the service terminal and then sends his or her manual identification result to the server; alternatively, the service terminal is connected to the identification terminal so that the volunteer and the visually impaired user can communicate directly in real time, providing manual help to the visually impaired user in real time.
Further, a convolutional neural network model is used to preprocess the image in step 1: the image data passes in sequence through the model's first convolutional layer, second convolutional layer, pooling layer, first fully connected layer and second fully connected layer, after which the result is output.
Further, step 1 further comprises:
A character recognition substep: recognizing characters in the image and outputting them in voice form.
The invention has the following beneficial effects: it can intelligently identify the images shot by the visually impaired user, and upload and distribute them to corresponding volunteers, who send the identification information back to the visually impaired user, thereby assisting visually impaired users in distinguishing scenes and objects in real time and comprehensively.
Drawings
Fig. 1 is a schematic structural diagram of a system for identifying visually impaired users in real time according to an embodiment of the present invention.
Fig. 2 is a transmission diagram of an embodiment of the invention.
Fig. 3 is a flowchart illustrating a method for identifying a visually impaired user in real time according to an embodiment of the present invention.
FIG. 4 is a model diagram of a convolutional neural network model employed by embodiments of the present invention.
FIG. 5 is a block diagram of a convolutional neural network model employed by embodiments of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application can be combined with each other without conflict, and the present invention is further described in detail with reference to the drawings and specific embodiments.
If directional indications (such as up, down, left, right, front and rear) are provided in the embodiments of the present invention, they are only used to explain the relative positional relationship, movement, etc. of the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
In addition, the descriptions related to "first", "second", etc. in the present invention are only used for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
Referring to fig. 1, a system for identifying visually impaired users in real time according to an embodiment of the present invention includes an identification terminal, a server, and a service terminal.
Identification terminal: preprocesses the image captured in front of the visually impaired user in real time, continuously compares the preprocessing result with preset digital target values, and outputs the result in voice form when the similarity reaches a threshold; if the similarity does not reach the threshold, or the user chooses to enter manual mode, the current image is uploaded to the server. The preset targets cover, for example, people (men/women, the elderly/toddlers/preset relatives and friends), household items (doors/windows/tables/chairs/televisions/sofas/pots/cups), outdoor scenes (cars/buses, stairways/elevators) or larger road signs. The data set of the embodiment can be continuously updated and revised through self-learning to improve accuracy.
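As a minimal sketch of the decision loop just described (the helper names classify_frame, speak and upload_to_server and the 0.85 threshold are illustrative assumptions; the patent discloses no actual code), the identification terminal's behavior could look like this in Python:

    from typing import Tuple

    SIMILARITY_THRESHOLD = 0.85  # assumed cut-off; the patent only states "a threshold"

    def classify_frame(frame) -> Tuple[str, float]:
        """Stand-in for CNN preprocessing plus comparison with the preset digital targets."""
        return "cup", 0.91  # the real system would run the network of figs. 4-5

    def speak(text: str) -> None:
        print(f"[voice output] {text}")  # placeholder for text-to-speech output

    def upload_to_server(frame) -> None:
        print("[upload] image sent to the server for volunteer identification")

    def handle_frame(frame, manual_mode: bool = False) -> None:
        label, similarity = classify_frame(frame)
        if similarity >= SIMILARITY_THRESHOLD and not manual_mode:
            speak(label)             # similarity reached the threshold: announce by voice
        else:
            upload_to_server(frame)  # below threshold or manual mode: hand over to a volunteer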
Server: distributes the images uploaded by the identification terminal to the service terminals, and sends the identification results uploaded by the service terminals to the corresponding identification terminal; alternatively, connects the identification terminal and the service terminal in real time over the network.
Service terminal: through the service terminal, the corresponding volunteer receives the image uploaded by the identification terminal and then sends his or her manual identification result to the server; alternatively, the service terminal is connected to the identification terminal so that the volunteer and the visually impaired user can communicate directly in real time, providing manual help to the visually impaired user in real time.
The invention can help users with vision problems in real time or within the shortest possible time, and can automatically identify images shot by visually impaired users. Referring to fig. 2, the identification terminal of the embodiment of the present invention is installed on the mobile phone of the visually impaired user, and the service terminal is installed on the mobile phone of a volunteer. Moreover, through a community of volunteers, the service provided by the invention can operate twenty-four hours a day, and anyone around the world can download the program and become a volunteer. When a visually impaired user needs help, the user can photograph the scene or object and send it to a volunteer in real time; a volunteer with the service terminal installed receives the notification and can choose whether to respond, and a volunteer who has time can describe the picture to the visually impaired user by voice or text message.
In addition, to provide real-time assistance, embodiments of the present invention also allow the visually impaired user to make a real-time support request: a volunteer who has time contacts the visually impaired user through a form similar to a video conference, can see the scene in front of the visually impaired user's eyes, and assists the user in real time by voice.
The present invention provides greater flexibility in receiving information. For visually impaired users, the invention can put them in contact with volunteers who are willing to provide visual instructions and help facilitate their daily lives. For volunteers, the invention makes it possible to help people anytime and anywhere while accumulating volunteer hours. For advertisers, the invention offers a wide audience, so that their information can reach different people across the country.
As an implementation, the identification terminal uses a convolutional neural network model to preprocess the image: the image data passes in sequence through the model's first convolutional layer, second convolutional layer, pooling layer, first fully connected layer and second fully connected layer, after which the result is output.
Referring to figs. 4 to 5, in the convolutional neural network (CNN) model the image data is decomposed into layers of different pixel sizes, and the network learns the mapping between input and output directly from the input image data, without human intervention in formulating precise feature definitions. The input layer is an array converted from the pixels. The hidden part comprises several convolutional layers and pooling layers (also called subsampling layers) followed by fully connected layers, and can be regarded as a neural network composed of connected neurons; in this embodiment it consists of three groups of convolution-pooling layers plus the fully connected layers. The output layer gives the discrimination result. The data entering the input layer is an array converted from a two-dimensional color image (RGB color mode), whose size is the image resolution multiplied by the number of RGB channels (width × height × 3). Data from the input layer first enters the first convolutional layer. Each convolutional layer has a corresponding filter, which is a matrix of numbers; at the convolutional layer, the input matrix is convolved with the filter matrix to obtain a new matrix, i.e. the feature map.
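The five-layer sequence named in claim 2 (two convolutional layers, a pooling layer and two fully connected layers) can be sketched, for instance, with Keras; the input resolution, filter counts and kernel sizes below are assumptions, since the patent does not specify them:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(128, 128, 3)),             # RGB array: width x height x 3
        layers.Conv2D(32, (3, 3), activation="relu"),  # first convolutional layer
        layers.Conv2D(64, (3, 3), activation="relu"),  # second convolutional layer
        layers.MaxPooling2D(pool_size=(2, 2)),         # pooling layer
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # first fully connected layer
        layers.Dense(1, activation="sigmoid"),         # second fully connected layer
    ])

    # Binary cross entropy is the objective function named later in the description.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()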
The convolutional layer adjusts the filter size, stride and padding mode according to the practical application: the filter size determines the range of each sampling, the stride determines how many pixels each sampling step moves, and the padding mode can be zero padding or discarding padding (to handle the case where the filter does not fit the image size exactly). The nature of the convolution operation allows the feature map to preserve the relationships between pixels in the original image. The feature map is reduced in dimension at the pooling layer, enters the second convolutional layer, and is then pooled again. Pooling can be done in three ways, maximum pooling (max pooling), mean pooling (average pooling) and sum pooling, to compress the image data while preserving the relationships between image pixels.
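The effect of filter size, stride and padding on the feature-map size follows the usual formula (W - F + 2P) / S + 1; the small helper below makes the relationship concrete (the example sizes are illustrative only):

    def conv_output_size(width: int, filter_size: int, stride: int, padding: int) -> int:
        """Output width of a convolution or pooling step: (W - F + 2P) // S + 1."""
        return (width - filter_size + 2 * padding) // stride + 1

    # A 128-pixel-wide input with a 3x3 filter and stride 1:
    print(conv_output_size(128, 3, 1, 1))  # 128 -> zero padding preserves the size
    print(conv_output_size(128, 3, 1, 0))  # 126 -> discarding padding shrinks it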
The activation layer applies a non-linear function to the data; the most common choice is the Rectified Linear Unit (ReLU). Convolutional layers with different filters implement a variety of operations on the image, such as edge detection, contour detection, blurring and sharpening.
After repeated convolution and pooling, the data reaches the fully connected layers, where the image data is classified by an excitation function (e.g. a logistic regression function with loss); the final output indicates the probability that the input image belongs to a certain class.
In evaluation, whether the model is over-fitted can be judged by comparing the accuracy and loss values of the training samples and the validation samples. If the accuracy values in the training and validation stages differ greatly, the model is over-fitted; the lower the accuracy in the training and validation stages, the less satisfactory the image discrimination. The model uses binary cross entropy as the objective function and is updated iteratively with the aim of minimizing the loss value, so a smaller loss value means the trained model fits the data better. The training and validation results of the invention are recorded in a background database, and the test results are shown in the visual analysis of the models below.
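Binary cross entropy for a single sample is -[y log p + (1 - y) log(1 - p)]; the short computation below (with illustrative values only) shows why a smaller loss means a better fit:

    import math

    def binary_cross_entropy(y_true: float, y_pred: float) -> float:
        """Binary cross entropy for one sample."""
        eps = 1e-12  # guard against log(0)
        p = min(max(y_pred, eps), 1 - eps)
        return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

    print(binary_cross_entropy(1.0, 0.95))  # ~0.051: confident correct prediction
    print(binary_cross_entropy(1.0, 0.05))  # ~2.996: confident wrong prediction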
For example, the MS COCO data set may also be employed. The MS COCO data set was released by a Microsoft team in 2014; it collected more than 80,000 images for training, in which 80 common objects in daily life are labeled (for example cats, dogs, airplanes, tables and cars), and it provides about 40,000 test images. The data set was updated again in 2019; although no new object categories were added, the number of images increased to 300,000. The data set is completely open, so the computation models proposed by different groups can be compared objectively against the same data set.
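If the MS COCO data set is used, its annotations can be browsed with the pycocotools package; the annotation-file path below assumes a local download of the 2014 training split:

    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2014.json")  # assumed local path

    # The 80 everyday categories mentioned above include cats, dogs and airplanes.
    cat_ids = coco.getCatIds(catNms=["cat", "dog", "airplane"])
    img_ids = coco.getImgIds(catIds=cat_ids)  # images containing all three classes
    print(len(coco.getCatIds()), "categories;", len(coco.getImgIds()), "images")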
As an implementation, the identification terminal further includes a character recognition module, which recognizes characters in the image and outputs them in voice form.
The visually impaired user can also use the character recognition module of the invention to help themselves. If the user needs to recognize characters, such as in a letter or a building announcement, the user can hold the mobile phone in one hand and the object in the other, take a picture, and invoke the character recognition function; the character recognition module converts the recognized characters into sound for the visually impaired user to listen to.
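The character-recognition-to-speech flow can be approximated with off-the-shelf libraries (pytesseract for OCR and pyttsx3 for offline text-to-speech); the patent does not name the components it uses, so these are stand-in choices:

    from PIL import Image
    import pytesseract
    import pyttsx3

    def read_text_aloud(image_path: str) -> None:
        text = pytesseract.image_to_string(Image.open(image_path))  # recognize characters
        if text.strip():
            engine = pyttsx3.init()
            engine.say(text)       # convert the recognized characters into sound
            engine.runAndWait()

    read_text_aloud("letter.jpg")  # e.g. a letter or a building announcement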
The volunteer support functions and the artificial-intelligence character recognition function complement each other's shortcomings. For simple character recognition tasks, the visually impaired user can choose to solve the problem without depending on a volunteer; but for objects with more complex surface shapes, or for the surrounding environment, the user needs to send the picture to a volunteer, or contact a volunteer through the real-time video function to get assistance in real time.
Referring to fig. 3, a method for identifying visually impaired users in real time according to an embodiment of the present invention includes:
Step 1: the identification terminal preprocesses the image captured in front of the visually impaired user in real time, continuously compares the preprocessing result with preset digital target values, and outputs the result in voice form when the similarity reaches a threshold; if the similarity does not reach the threshold, or the user chooses to enter manual mode, the current image is uploaded to the server;
Step 2: the server distributes the images uploaded by the identification terminals to the service terminals, and sends the identification results uploaded by the service terminals to the corresponding identification terminals; alternatively, the identification terminal and the service terminal are connected in real time over the network;
Step 3: the corresponding volunteer receives the image uploaded by the identification terminal through the service terminal and then sends his or her manual identification result to the server; alternatively, the service terminal is connected to the identification terminal so that the volunteer and the visually impaired user can communicate directly in real time, providing manual help to the visually impaired user in real time.
As an implementation, a convolutional neural network model is used to preprocess the image in step 1: the image data passes in sequence through the model's first convolutional layer, second convolutional layer, pooling layer, first fully connected layer and second fully connected layer, after which the result is output.
As an embodiment, step 1 further comprises:
A character recognition substep: recognizing characters in the image and outputting them in voice form.
The embodiment of the invention provides four assistance methods for the visually impaired user: (1) the visually impaired user can take a picture of the scene in front, and the analysis result of the picture is output in voice form so that the visually impaired person can understand it; (2) for complex situations such as a street environment, the visually impaired user can take a picture and send it to a volunteer, and a volunteer who has time describes the picture to the user; (3) the visually impaired user can also choose to contact a volunteer by video, and a volunteer who has time provides real-time visual assistance; (4) the visually impaired user can photograph the scene in front, and the invention recognizes the characters in the image and outputs them in voice form so that the visually impaired person can understand them.
These four functions of the invention are complementary. Artificial-intelligence character recognition works without volunteers, so the visually impaired user can exert self-help ability; the functions of a volunteer describing a picture and of a volunteer contacting the visually impaired user by video are better suited to complex situations. The invention connects visually impaired people with caring volunteers and contributes to social public welfare. It enables volunteers to guide visually impaired people directly through sound or text, so that visually impaired people can receive text and image information more conveniently.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A system for real-time identification of visually impaired users, comprising:
an identification terminal: preprocessing the image captured in front of the visually impaired user in real time, continuously comparing the preprocessing result with preset digital target values, and outputting the result in voice form when the similarity reaches a threshold; if the similarity does not reach the threshold, or the user chooses to enter manual mode, uploading the current image to a server;
a server: distributing the images uploaded by the identification terminal to the service terminals, and sending the identification results uploaded by the service terminals to the corresponding identification terminal; or connecting the identification terminal and the service terminal in real time over the network;
a service terminal: through which the corresponding volunteer receives the image uploaded by the identification terminal and then sends his or her manual identification result to the server; or which is connected to the identification terminal so that the volunteer and the visually impaired user can communicate directly in real time, providing manual help to the visually impaired user in real time.
2. The system of claim 1, wherein the identification terminal preprocesses the image using a convolutional neural network model, the image data passing in sequence through the model's first convolutional layer, second convolutional layer, pooling layer, first fully connected layer and second fully connected layer before the result is output.
3. The system of claim 1, wherein the identification terminal further comprises a character recognition module that recognizes characters in the image and outputs them in voice form.
4. A method for identifying visually impaired users in real time, characterized by comprising the following steps:
Step 1: the identification terminal preprocesses the image captured in front of the visually impaired user in real time, continuously compares the preprocessing result with preset digital target values, and outputs the result in voice form when the similarity reaches a threshold; if the similarity does not reach the threshold, or the user chooses to enter manual mode, the current image is uploaded to a server;
Step 2: the server distributes the images uploaded by the identification terminals to the service terminals, and sends the identification results uploaded by the service terminals to the corresponding identification terminals; or the identification terminal and the service terminal are connected in real time over the network;
Step 3: the corresponding volunteer receives the image uploaded by the identification terminal through the service terminal and then sends his or her manual identification result to the server; or the service terminal is connected to the identification terminal so that the volunteer and the visually impaired user can communicate directly in real time, providing manual help to the visually impaired user in real time.
5. The method of claim 4, wherein a convolutional neural network model is used to preprocess the image in step 1, the image data passing in sequence through the model's first convolutional layer, second convolutional layer, pooling layer, first fully connected layer and second fully connected layer before the result is output.
6. The method for identifying visually impaired users in real-time as claimed in claim 4, wherein the step 1 further comprises:
A character recognition substep: recognizing characters in the image and outputting them in voice form.
CN202011385486.7A 2020-12-01 2020-12-01 System and method for identifying visually impaired users in real time Pending CN112381052A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011385486.7A CN112381052A (en) 2020-12-01 2020-12-01 System and method for identifying visually impaired users in real time

Publications (1)

Publication Number Publication Date
CN112381052A (en) 2021-02-19

Family

ID=74590157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011385486.7A Pending CN112381052A (en) 2020-12-01 2020-12-01 System and method for identifying visually impaired users in real time

Country Status (1)

Country Link
CN (1) CN112381052A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104983511A (en) * 2015-05-18 2015-10-21 上海交通大学 Voice-helping intelligent glasses system aiming at totally-blind visual handicapped
US20160327405A1 (en) * 2013-12-31 2016-11-10 Unist (Ulsan National Institute Of Science And Technology ) System and method for providing route guidance service for visually impaired people
CN107703648A (en) * 2017-10-24 2018-02-16 福建太尔电子科技股份有限公司 Intelligent glasses with positioning and store function
US20180185232A1 (en) * 2015-06-19 2018-07-05 Ashkon Namdar Wearable navigation system for blind or visually impaired persons with wireless assistance
CN208366352U (en) * 2018-05-17 2019-01-11 中兴健康科技有限公司 A kind of guide equipment
CN109753900A (en) * 2018-12-21 2019-05-14 西安科技大学 A kind of blind person's auxiliary vision system based on CNN/LSTM
CN110414305A (en) * 2019-04-23 2019-11-05 苏州闪驰数控系统集成有限公司 Artificial intelligence convolutional neural networks face identification system
CN110728308A (en) * 2019-09-25 2020-01-24 华南理工大学 Interactive blind guiding system and method based on improved Yolov2 target detection and voice recognition
KR102148331B1 (en) * 2019-02-11 2020-08-26 이찬희 System and method for providing contents for the blind and recording medium storing program to implement the method
CN111643324A (en) * 2020-07-13 2020-09-11 江苏中科智能制造研究院有限公司 Intelligent glasses for blind people

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈彤 (CHEN Tong): "Exploration of library volunteer management practice: an analysis of the current service status of the volunteer service station of the blind people's library at Zhejiang Library", Library Research and Work, no. 04 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination