CN110852171A - Scene description robot system and method for online training - Google Patents
- Publication number
- CN110852171A CN110852171A CN201910974489.5A CN201910974489A CN110852171A CN 110852171 A CN110852171 A CN 110852171A CN 201910974489 A CN201910974489 A CN 201910974489A CN 110852171 A CN110852171 A CN 110852171A
- Authority
- CN
- China
- Prior art keywords
- training
- model
- text
- image
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61H—PHYSICAL THERAPY APPARATUS, e.g. DEVICES FOR LOCATING OR STIMULATING REFLEX POINTS IN THE BODY; ARTIFICIAL RESPIRATION; MASSAGE; BATHING DEVICES FOR SPECIAL THERAPEUTIC OR HYGIENIC PURPOSES OR SPECIFIC PARTS OF THE BODY
- A61H3/00—Appliances for aiding patients or disabled persons to walk about
- A61H3/06—Walking aids for blind persons
- A61H3/061—Walking aids for blind persons with electronic detecting or guiding means
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The embodiments of the application disclose a scene description robot system and method for online training. The method comprises the following steps: A1, receiving new image-text pair data; A2, constructing a training set from the new image-text pair data; A3, training the training-state image-text model on the training set to obtain a trained model; and A4, updating the test-state image-text model used for service according to the trained model. The system comprises a blind-guiding robot and a server. Because the test-state image-text model is updated as the environment changes, the adaptability of the system to real scenes is greatly improved and the prediction quality is maintained.
Description
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a scene description robot system and method for online training.
Background
Existing facilities and equipment that help blind people travel mainly comprise accessible facilities, guide dogs and blind-guiding walking sticks. Because daily life is inconvenient for the blind, guide dogs and blind-guiding walking sticks have gradually become the main travel aids. However, guide dogs are difficult to train and costly, and the detection range of a blind-guiding walking stick is limited. Learning braille also takes time, so blind people urgently need other ways to understand the real world.
Research teams at home and abroad are working on more intelligent and reliable blind-guiding robots. For example, a blind-guiding robot based on embedded technology can identify obstacles and traffic signs; a human-computer interactive blind-guiding robot can sense the external environment through sensors and report it to the blind user by voice; and a scene description system based on CNN-LSTM (CNN, Convolutional Neural Network; LSTM, Long Short-Term Memory network) downloads a trained model into an embedded device to translate images into text.
Most products or systems that help the blind understand the real world are built by training a CNN-LSTM deep learning model on a public image-to-text data set (such as Microsoft COCO), compressing the model, and then programming it into an embedded device. The performance of deep learning depends on the distribution of the data. Images in public data sets are mostly of high quality and show well-defined scenes, whereas in the real life scenes of blind users the images collected by the camera are often blurred or underexposed, the scene content differs greatly from the data set, and results are strongly affected by camera quality, shooting angle and similar factors, so the test-time effect is easily poor. This is a dilemma faced by existing scene description systems on the path to commercialization.
The above background disclosure is only for the purpose of assisting in understanding the inventive concepts and technical solutions of the present application and does not necessarily pertain to the prior art of the present application, and should not be used to assess the novelty and inventive step of the present application in the absence of explicit evidence to suggest that such matter has been disclosed at the filing date of the present application.
Disclosure of Invention
The application provides a scene description robot system and method for online training, which can alleviate the technical problem of poor prediction quality caused by the mismatch between the distribution of the data set used to train the model and the distribution of the data collected in the use environment during image description.
In a first aspect, the present application provides an online training method for a scene description robot, comprising: A1, receiving new image-text pair data; A2, constructing a training set from the new image-text pair data; A3, training the training-state image-text model on the training set to obtain a trained model; and A4, updating the test-state image-text model used for service according to the trained model.
In some preferred embodiments, the a2 includes:
assigning a sample weight to each of the new image-text pairs; wherein each sample weight is negatively correlated with the time elapsed between the upload of the new image-text pair data and the current time;
collecting the new image-text pair data into the training set based on the sample weights.
In some preferred embodiments, the a2 further comprises:
extracting global features of each new image-text pair data;
comparing and ranking the global features of the new image-text pair data against all samples in all data sets, and adding the N most similar and the N least similar samples to the training set; wherein N is an integer.
In some preferred embodiments, training the image-text model of the training state using the training set comprises: and searching the network structure and parameters of the training model by a neural network structure searching method by using the training set.
In some preferred embodiments, the neural network structure search method is the gradient-optimization-based PDARTS network architecture search algorithm.
In some preferred embodiments, the network structure search process is divided into four stages: stage one has eight candidate operations in the search space, with unimportant candidates removed during training; stage two has four candidates; stage three has two; and stage four keeps only the single most important candidate operation.
In some preferred embodiments, different matrices are maintained during the network structure search; each matrix represents the weight of each optional operation in the search space; the matrix maintained in stage one is 12×8, representing 12 paths and 8 candidate operations; the matrix in stage two is 12×4 (12 paths, 4 candidates); in stage three 12×2 (12 paths, 2 candidates); and in stage four 12×1, indicating that only the single most important candidate operation remains.
In a second aspect, the application provides an online-training scenario description robot system, which includes a blind-guiding robot and a server; the server comprises a computer program for executing the above method.
In some preferred embodiments, the blind guiding robot is connected with the server through a wireless network; the specific form of the wireless network comprises a 4G network and a 5G network.
In a third aspect, the present application provides a computer readable storage medium having stored therein program instructions which, when executed by a processor of a computer, cause the processor to perform the above-described method.
Compared with the prior art, the beneficial effects of the embodiment of the application are as follows:
A training set is constructed from the received new image-text pair data, and the training-state image-text model is trained on that set to obtain a trained model. Because the new image-text pairs describe the current user's environment, the trained model adapts to that environment. The test-state image-text model used for service is then updated from the trained model, so that the updated test-state model also adapts to the current user's environment. Since the test-state image-text model is updated as the environment changes, the adaptability of the system to real scenes is greatly improved and the prediction quality is maintained.
Drawings
FIG. 1 is an information interaction diagram depicting a robotic system in a scenario of online training according to an embodiment of the present application;
FIG. 2 illustrates the change of module structure during the iterative discarding of the search space according to one embodiment of the present application;
FIG. 3 illustrates the change of the weight matrices during the iterative discarding of the search space according to one embodiment of the present application;
FIG. 4 is a forgetting curve according to an embodiment of the present application;
FIG. 5 is a flow chart of an online training process according to an embodiment of the present application;
FIG. 6 illustrates a scenario-describing robotic system and logic for its use for online training according to one embodiment of the present application;
fig. 7 is a flowchart illustrating a blind guiding robot service according to an embodiment of the present application.
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantageous effects of the embodiments of the present application clearer, the application is further described in detail below with reference to Figs. 1 to 7 and the embodiments. It should be understood that the specific embodiments described here are merely illustrative and are not intended to limit the application.
This embodiment provides an online-trained scene description robot system (i.e., an adaptive scene description system). Referring to fig. 1, the system includes a blind-guiding robot 1 and a server 2. The system is designed mainly for visually impaired people; the robot can be trained with the assistance of sighted people both before the product is provided to users and during use.
The blind guiding robot 1 is a mobile terminal product, and may be called a client. The blind guiding robot 1 can provide scene service for the visually impaired, and can describe the scene. In the embodiment, the blind guiding robot 1 is an embedded device; the blind guiding robot 1 comprises accessories such as a camera, a microphone and a loudspeaker, and can collect data and transmit the collected data to the server 2. Specifically, the blind guiding robot 1 transmits data to the server 2 through a wireless network such as a 5G network; in other implementations, the blind guiding robot 1 transmits the data to the server 2 through a 4G network or other wireless network.
Table 1 lists the main hardware devices of the blind-guiding robot 1, i.e. the client. The components in this list are the key components for realizing the system's functions. The Raspberry Pi platform was chosen here for its high scalability and low price.
Table 1. List of key client hardware devices
The software system of the blind-guiding robot 1 uses existing open interfaces according to the functions to be implemented. It mainly comprises a voice-interaction program module, a data-exchange program module and the like; other custom functions can also be realized on the client. In the software, operations such as image acquisition, image upload, text download, text reading and voice communication are realized by combining different software function modules. The function modules involved in this embodiment are listed in Table 2 below.
Table 2. Function modules of the blind-guiding robot
In Table 2 above, the speech recognition and synthesis module, the chat conversation module, and the Chinese-English translation module call the Baidu speech synthesis and recognition open interface, the Turing robot open interface, and a translation open interface respectively; the image description algorithm uses the classic NIC image captioning model.
The server 2 is deployed with an image-text model. It is noted that, in the present embodiment, the image-text model is deployed in the server 2, not in the blind guiding robot 1. After receiving the data such as the images or videos uploaded by the blind guiding robot 1, the server 2 processes the data and returns the processing result to the blind guiding robot 1, so that the blind guiding robot 1 provides scene description services for the visually impaired.
The image-text model of this embodiment structurally inherits the CNN-LSTM framework (CNN, Convolutional Neural Networks; LSTM, Long Short-Term Memory network) of the conventional image description algorithm. The image-text model is the core of the algorithm of this embodiment, and the function of converting the image into characters is realized, and the quality of the function implementation depends on the quality of the model.
The image-text model on the server 2 is duplicated: the server 2 is deployed with two image-text models, one in a test state and one in a training state. The test-state image-text model directly serves visually impaired users, while the training-state model is trained regularly or irregularly on the received image-text pairs to update its structure and parameters. At intervals, the training-state model replaces the test-state model; this process is called a model update, i.e. the automatic update mechanism.
Referring to fig. 1, the automatic update mechanism is a scene description robot online training method of the present embodiment, and includes steps a1 through a 4. The execution subject of the method is the server 2.
Step a1, new image-text pair data is received.
In the present embodiment, the updated data of the image-text model in the training state is derived from the image-text pair remotely provided by the sighted person, and is uploaded to the server 2 by the blind guiding robot 1.
And step A2, constructing a training set according to the new image-text data.
Training of the image-text model comprises a deployment phase and an application phase. In the deployment phase, the structure and parameters of the model are trained on a public data set. In the application phase, the data set used to train the model contains two parts: a public data set, referred to here as the initial data, and the image-text sample pairs obtained from the blind-guiding robot 1, referred to here as the client data. The training process of the deployment phase is conventional; the application phase is described below.
The frequency at which the server 2 receives data from the blind-guiding robot 1, i.e. the client, fluctuates, the training process takes time, and the model actually serving users is only updated once training completes, so an update interval exists. The logic of the online training is as follows: on the server side, whenever a new data upload is detected, a training set is constructed from the current new data together with the previous data.
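The server-side loop just described can be sketched in a few lines. This is a minimal illustration under assumed names (`OnlineTrainer`, `on_upload`), not the patent's actual implementation:

```python
# Minimal sketch of the server-side online-training loop: whenever a new
# batch of image-text pairs is detected, a training set is built from the
# new data plus all previously received data. Names are illustrative.
class OnlineTrainer:
    def __init__(self, initial_data):
        # initial (public) data; client data accumulates here over time
        self.data = list(initial_data)

    def on_upload(self, new_pairs):
        """Build the training set for this round, then archive the new pairs."""
        training_set = list(new_pairs) + self.data
        self.data.extend(new_pairs)
        return training_set

trainer = OnlineTrainer(["coco_pair"])
round1 = trainer.on_upload(["pair_a", "pair_b"])
round2 = trainer.on_upload(["pair_c"])
```

Each detected upload thus yields a training set combining the newest samples with everything seen before, matching the "current new data as well as the previous data" rule above.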
And A3, training the image-text model in the training state by using a training set to obtain a trained training model.
The training process is actually a process of searching the network structure and parameters of the image-text model in the training state, and the model obtained by training is suitable for the current training set, i.e. the current training data.
The image-text model is the core of the scene description robot system and takes charge of converting images into characters, and the quality of the model directly determines the quality of picture description. The embodiment provides an on-line training strategy, and the structure and the weight of the model are updated by using the real environment where the user is located.
In this embodiment, online training is realized by a neural network structure search method; only the CNN part of the image-text model undergoes structure search, while both the CNN and LSTM parts undergo weight updates.
In this implementation, the structure search of the CNN part adopts the gradient-optimization-based PDARTS (Progressive Differentiable Architecture Search) algorithm. The model is formed by stacking several modules; between the feature maps in each module there is a candidate search space of 8 possible operations: no connection, 3×3 max pooling, 3×3 average pooling, skip connection, 3×3 depthwise separable convolution, 5×5 depthwise separable convolution, 3×3 dilated convolution, and 5×5 dilated convolution.
In this embodiment, the whole neural network is formed by stacking eight modules, as shown in fig. 2. Nodes 0, 1, 2 and 3 each represent a feature map, and the arrows represent the search space (e.g., convolution, pooling, skip connections). To reduce computation and speed up the search, the network structure search is divided into four stages: stage one (epochs 1-20) has eight candidate operations, with unimportant candidates removed during training; stage two (epochs 21-40) has four; stage three (epochs 41-60) has two; stage four (epochs 61-80) keeps the single most important candidate operation. The test process uses the stage-four network structure. Finally, a model structure adapted to the current training set is obtained.
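The progressive schedule can be sketched as follows. The epoch boundaries (20/40/60/80) come from the text; the operation names follow the standard DARTS search space, which the 8 candidates above correspond to:

```python
# Sketch of the four-stage progressive search schedule: the candidate set
# shrinks 8 -> 4 -> 2 -> 1 across the four stages. Operation names follow
# the standard DARTS search space (an assumption consistent with the text).
CANDIDATE_OPS = [
    "none", "max_pool_3x3", "avg_pool_3x3", "skip_connect",
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
]

def stage_for_epoch(epoch):
    """Map a training epoch in 1..80 to its search stage 1..4."""
    return min((epoch - 1) // 20 + 1, 4)

def ops_kept(stage):
    """Number of candidate operations kept in each stage: 8, 4, 2, 1."""
    return 8 >> (stage - 1)
```

Halving the candidate set at each stage boundary is what lets later stages search a deeper network at the same cost, which is the core PDARTS trade-off.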
During the network structure search, different matrices need to be maintained. Each matrix holds the importance, i.e. the weight, of each optional operation in the search space. Referring to fig. 3, the matrix maintained in stage one is 12×8, representing 12 paths and 8 candidate operations; the matrix in stage two is 12×4 (12 paths, 4 candidates); in stage three 12×2 (12 paths, 2 candidates); and in stage four 12×1, indicating that only the single most important candidate operation remains. The randomly initialized matrices of the first three stages are normalized by rows, and the last stage is normalized by columns.
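The per-stage weight matrices can be sketched in plain Python: 12 paths × K candidates, row normalization for stages one to three and column normalization for stage four, as stated above. The random initialization here is purely illustrative:

```python
import random

def init_stage_matrix(stage, seed=0):
    """Randomly initialize the 12 x K architecture-weight matrix for a stage."""
    k = {1: 8, 2: 4, 3: 2, 4: 1}[stage]   # candidate ops kept per stage
    rnd = random.Random(seed)
    w = [[rnd.random() for _ in range(k)] for _ in range(12)]
    if stage < 4:
        # stages 1-3: normalize each row (one path's weights over its candidates)
        return [[x / sum(row) for x in row] for row in w]
    # stage 4: normalize the single column across the 12 paths
    total = sum(row[0] for row in w)
    return [[row[0] / total] for row in w]
```

Row normalization makes each path's candidate weights a distribution over operations; in stage four only one operation per path remains, so normalization can only be taken across the paths.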
And step A4, updating the image-text model of the test state for the service according to the trained training model.
After the latest network structure and parameters have been found, the test-state image-text model is updated. Specifically, the trained model replaces the previous test-state image-text model to obtain the updated test-state model; alternatively, the network structure and parameters of the previous test-state model are updated according to the trained model. The updated model then serves users directly. Steps A1 to A4 are repeated after each update; their order is flexible, for example step A1 of a new round may run while step A2 of the previous round is still in progress.
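The two-copy update mechanism (step A4) can be sketched as follows; the class and attribute names are illustrative, not from the patent:

```python
import copy

# Sketch of the two-copy update mechanism: a training-state model is
# fine-tuned in the background and then promoted to replace the
# test-state model that serves users (step A4).
class ImageTextModelPair:
    def __init__(self, model):
        self.test = model                  # test state: serves users
        self.train = copy.deepcopy(model)  # training state: updated online

    def promote(self):
        """Replace the test-state model with a copy of the trained one."""
        self.test = copy.deepcopy(self.train)

pair = ImageTextModelPair({"params": 0})
pair.train["params"] = 1   # stand-in for a completed training round
pair.promote()
```

Deep-copying on promotion keeps the serving copy isolated, so the next training round cannot disturb the model currently answering user requests.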
The update process continues as long as image-text pairs are received. The data samples collected by the blind-guiding robot 1, i.e. the client, reflect the real environment of the user and are used to fine-tune the model.
The online training strategy of the model can enable the model to be more adaptive to the environmental scene of the current user.
Referring to fig. 5, the foregoing step a2 is explained.
In the application phase, this embodiment employs an association-contrast search strategy. When constructing the training set, the global features of each newly uploaded sample, i.e. image-text pair, are extracted using a classification network pre-trained on ImageNet. The features of the sample are then compared and ranked against all samples in all data sets (initial data and client data), and the N most similar and the N least similar samples are added to the training set; cosine similarity can be used to compare the feature vectors; N is an integer that can be set manually. This data-set expansion mechanism helps the model memorize the scene by association and contrast. N also determines the ratio of initial data to client data in the training set.
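The association-contrast retrieval can be sketched as below. The feature extractor (the ImageNet-pretrained classification network) is stubbed out, and the sample names and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def associate_contrast(new_feature, pool, n):
    """pool: list of (sample_id, feature). Return the n most similar
    (association) and n least similar (contrast) samples."""
    ranked = sorted(pool, key=lambda s: cosine(new_feature, s[1]), reverse=True)
    return ranked[:n] + ranked[-n:]

pool = [
    ("kitchen", [1.0, 0.0]),
    ("hallway", [0.9, 0.1]),
    ("street",  [0.0, 1.0]),
    ("park",    [-1.0, 0.1]),
]
picked = associate_contrast([1.0, 0.0], pool, 1)
```

With N = 1 and a new feature close to "kitchen", the strategy selects "kitchen" (highest similarity) and "park" (lowest), expanding the training set with both near and far samples.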
Each new image-text pair received from the client is assigned a sample weight once its upload completes. The weight is negatively correlated with the time elapsed since upload: the earlier a sample was uploaded, the smaller its weight. These sample weights are used when sampling the training data; a larger weight means a higher probability of being collected into the training set, quantified by normalizing the weights of all client samples. This strategy adds a forgetting mechanism to the training process, making the model more robust in adapting to a new environment.
FIG. 4 shows the forgetting curve, which is a shifted inverse-proportional function, i.e. w(t) = r / (t − t0) + b, where r, b and t0 are tunable parameters. The trend of the forgetting curve shows that more recently uploaded samples are more likely to be drawn into the training set and leave a stronger impression on the model, while earlier samples have a smaller probability of being recalled by the model and may be completely forgotten after a certain time.
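The weighting and its normalization into sampling probabilities can be sketched as follows. The exact functional form w(t) = r / (t − t0) + b is reconstructed from the text's "translated inverse-proportional function with parameters r, b, t0", so it is an assumption, as are the default parameter values:

```python
# Sketch of the forgetting-curve sample weight (assumed form and defaults).
def forgetting_weight(age, r=1.0, b=0.0, t0=-1.0):
    """Weight of a sample as a function of its age (time since upload)."""
    return r / (age - t0) + b

def sampling_probabilities(ages):
    """Normalize all client-sample weights into sampling probabilities."""
    weights = [forgetting_weight(a) for a in ages]
    total = sum(weights)
    return [w / total for w in weights]

# three client samples: uploaded just now, 1 time unit ago, 9 units ago
probs = sampling_probabilities([0.0, 1.0, 9.0])
```

The newest sample gets the largest probability and the oldest the smallest, implementing the forgetting trend of Fig. 4.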
The timing of a model update is determined by the training process: the model is updated once training completes. For example, if the client uploaded 50 image-text pairs in the current period and N is 10, the training set totals 550 samples, the 80 training epochs take about 8 hours, and the current update interval is therefore 8 hours.
Referring to fig. 6, the scene description robot system in this embodiment has a manual training process and a scene service process.
The manual training process is explained.
First, an initialized image-text model needs to be obtained. The initialized model structurally inherits the CNN-LSTM framework of traditional image description algorithms, is trained on a public image-text data set (e.g., Microsoft COCO), and is deployed on the server 2.
Secondly, for the life scene of the visually impaired person, family members or the product provider can perform initialization training on the product in advance: under good lighting, they describe by voice the scenes or objects visible through the camera of the blind-guiding robot 1. The resulting image-text pairs are uploaded to the server 2 as part of the training data via a 5G network or the like. After this initial training, the blind-guiding robot 1 can provide scene description services for the visually impaired.
A scene service procedure is explained.
The scene service process is the stage that directly serves the visually impaired; the model used here is the updated test-state image-text model (test model for short). In this process, the blind-guiding robot 1, i.e. the client, uploads the acquired image to the server 2; after prediction by the server's test model, the result is returned to the blind-guiding robot 1 and broadcast by voice. In this process, without depending on the visually impaired person, the scene description robot system describes the real scene.
The time of the manual training process and the scene service process is explained.
For the manual training process: generally, after the user receives the blind-guiding robot 1, a sighted person first describes some scenes and objects in the living environment of the visually impaired person, similar to an initial manual entry.
For the scene service process: generally it follows the manual training process, and thereafter the two processes can alternate. The more frequent the manual training and the more data uploaded to the server 2, the more familiar the updated test model becomes with the living environment of the visually impaired person.
The use of the blind guiding robot 1 is explained. Referring to fig. 7, when the device is turned on, the mode to enter can be selected by voice, and each mode has a corresponding trigger word (keyword). In the online training mode, family members of the visually impaired person can assist in training the robot; in the scene service mode, the blind guiding robot 1 provides services for the visually impaired person, describing the real scene and broadcasting the description by voice.
Compared with traditional scene description robots, the method and system of the present embodiment have the advantage of performing scene description in an adaptive, online-training manner.
This embodiment adopts a model-device separation strategy: the image-text model is separated from the blind guiding robot 1 and moved from the embedded device to the server, which reduces the computational cost on the embedded device. The client connects to the server via 4G/5G technology, so the functions of the client device remain simple, the scalability of the model is greatly improved, conditions are provided for updating the model, and the system gains Internet-of-Things characteristics. Under this online training strategy, the client uploads image-text sample pairs, which are used to train and improve the model, thereby improving the model's robustness to the environment and scenes in which the user lives. The model used for scene description is not fixed but changes continuously at a certain update frequency. This alleviates, to a certain extent, the technical problems of traditional scene description robots applied to real scenes: slow prediction, low prediction accuracy, and poor adaptability. A traditional scene description robot transplants an image-text translator (image-text model) trained on an open dataset into an embedded device; the structure and parameters of the model are then fixed, and the data used to train the model and the image data collected in the user's real scene are likely not from the same distribution, which is an important cause of poor prediction performance. The present embodiment addresses these problems by means of user feedback and intelligent processing.
This embodiment adopts an online training strategy for the scene description robot and applies a network structure search technique to the online training process, so that not only the parameters but also the structure of the model are updated during online training, which helps find a network structure and parameters better suited to the current data environment. The image-text model in this embodiment is kept in an online training state, where online training means the following: initially, a CNN-LSTM model pre-trained on an open dataset is used; after deployment, the user's environmental data is collected and added as records to a new dataset, the background model is continuously trained on these records, and at regular intervals the model on the server is updated to the model trained on the user's own scene data. Thus, the image-text model used in this embodiment keeps changing and continuously adapts to the real environment in which the user lives; because the model's structure and parameters are updated according to the environment, its adaptability to real scenes is greatly improved.
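The two-model scheme implied above (a training-state model continuously updated with user data, and a test-state model that serves users and is refreshed at a fixed update frequency) can be sketched as follows. This is a toy illustration under assumed names; `train_step` stands in for a real gradient update.

```python
# Sketch of the two-model online training scheme: the training-state model
# is updated as user image-text pairs arrive, and the test-state model
# serving users is refreshed from it at a fixed update interval.
import copy

class OnlineTrainer:
    def __init__(self, init_params, update_every=3):
        self.train_model = dict(init_params)               # training-state model
        self.test_model = copy.deepcopy(self.train_model)  # serving (test-state) model
        self.update_every = update_every
        self.steps = 0

    def train_step(self, image_text_pair):
        # Placeholder for a real gradient step on the new sample.
        self.train_model["seen"] = self.train_model.get("seen", 0) + 1
        self.steps += 1
        if self.steps % self.update_every == 0:
            # Periodically promote the trained model to serve users.
            self.test_model = copy.deepcopy(self.train_model)

trainer = OnlineTrainer({"seen": 0}, update_every=3)
for pair in [("img1", "a dog"), ("img2", "a door"), ("img3", "a cup")]:
    trainer.train_step(pair)
```

The key design point is that prediction always uses the last promoted test model, so the serving path is never blocked by ongoing training.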
When constructing the training set, the present embodiment employs a training data selection strategy based on an association-contrast and forgetting mechanism that simulates human cognitive rules. Through this bionic learning strategy, the trained model gains flexibility and adaptability, the scene description robot becomes more humanized, and it begins to exhibit human-like characteristics of remembering and forgetting scenes.
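The selection strategy (detailed further in claims 2 and 3) can be sketched as follows: newer uploads receive larger sample weights (the forgetting mechanism), and for each new sample the N most similar and N least similar stored samples, ranked by global-feature similarity, join the training set (association and contrast). All values and names here are illustrative.

```python
# Toy sketch of the association-contrast / forgetting selection strategy.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sample_weight(upload_time, now):
    # Weight decreases as the interval since upload grows (forgetting).
    return 1.0 / (1.0 + (now - upload_time))

def select_neighbors(new_feat, dataset, n=1):
    # dataset: list of (sample_id, global_feature) pairs.
    ranked = sorted(dataset, key=lambda s: cosine(new_feat, s[1]), reverse=True)
    return ranked[:n] + ranked[-n:]  # N most and N least similar

dataset = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
picked = select_neighbors([1.0, 0.05], dataset, n=1)
```

Here the new feature is closest to sample "a" and farthest from sample "c", so both are selected: the similar sample reinforces association, the dissimilar one provides contrast.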
This embodiment adopts a scene description robot framework and usage scheme in which sighted people assist in training on behalf of the visually impaired person: the scene description model is optimized with the knowledge of sighted helpers and automatically refined to provide better service for the visually impaired person. The data used in this embodiment comes from the user's actual living environment, which alleviates the poor prediction performance in real use caused by the training set and test set belonging to different distributions.
Those skilled in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed, performs the processes of the method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
The foregoing is a further detailed description of the present application with reference to specific/preferred embodiments and is not intended to limit the present application to that particular description. A person skilled in the art to which the present application pertains may make several alternatives or modifications to the described embodiments without departing from the concept of the present application, and such alternatives or modifications should be considered as falling within the scope of the present application.
Claims (10)
1. An online training method for a scene description robot, characterized by comprising:
a1, receiving new image-text pair data;
a2, constructing a training set according to the new image-text data;
a3, training the image-text model in the training state by using the training set to obtain a trained training model;
and A4, updating the image-text model of the test state for the service according to the trained training model.
2. The on-line training method of claim 1, wherein the a2 comprises:
assigning a sample weight to each item of the new image-text pair data; wherein the sample weight is inversely related to the interval between the upload time of the new image-text pair data and the current time;
collecting the new image-text pair data into the training set based on the sample weights.
3. The on-line training method of claim 2, wherein said a2 further comprises:
extracting global features of each new image-text pair data;
comparing and ranking the global features of the new image-text pair data against all samples in the existing data sets, and adding the N samples with the highest similarity and the N samples with the lowest similarity to the training set; wherein N is an integer.
4. The online training method of claim 1, wherein the training of the training-state image-text model using the training set comprises: searching for the network structure and parameters of the training model by a neural network structure search method using the training set.
5. The online training method of claim 4, wherein: the neural network structure search method is the PDARTS network structure search algorithm based on gradient optimization.
6. The online training method of claim 5, wherein the network structure search process is divided into four stages: in stage one, the search space contains eight candidate operations, and unimportant candidate operations are removed during training; in stage two, the search space contains four candidate operations; in stage three, two; and stage four retains the single most important candidate operation.
7. The online training method of claim 6, wherein: a different matrix is maintained at each stage of the network structure search; the matrix represents the weight of each optional operation in the search space; the matrix maintained in stage one is 12×8, representing 12 paths and 8 candidate operations; the matrix maintained in stage two is 12×4, representing 12 paths and 4 candidate operations; the matrix maintained in stage three is 12×2, representing 12 paths and 2 candidate operations; and the matrix maintained in stage four is 12×1, indicating that a single most important candidate operation finally remains.
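The four-stage shrinking of the weight matrix described in claims 6 and 7 can be illustrated with the following toy sketch. This is an interpretation in the spirit of progressive architecture search, not the PDARTS algorithm itself: a 12×8 weight matrix (12 paths, 8 candidate operations) is halved stage by stage to 12×4, 12×2, and finally 12×1, keeping the highest-weight candidate operations on each path.

```python
# Illustrative four-stage progressive pruning of the candidate-operation
# weight matrix: 12x8 -> 12x4 -> 12x2 -> 12x1.
import random

def prune_half(weight_rows):
    """For each path (row), keep the better half of candidate operations."""
    pruned = []
    for row in weight_rows:
        keep = max(1, len(row) // 2)
        # In practice the identity of each kept operation is tracked as well;
        # here only the surviving weights are kept for brevity.
        pruned.append(sorted(row, reverse=True)[:keep])
    return pruned

random.seed(0)
weights = [[random.random() for _ in range(8)] for _ in range(12)]  # stage 1: 12x8
best_per_path = [max(row) for row in weights]
for _ in range(3):  # stages 2-4: 12x4 -> 12x2 -> 12x1
    weights = prune_half(weights)
```

Because each stage keeps the top half of every row, the single operation surviving on each path is the one with the highest initial weight.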
8. An online-training scene description robot system, characterized by comprising: a blind guiding robot and a server; wherein the server comprises a computer program for executing the method according to any one of claims 1 to 7.
9. The scenario-describing robotic system of claim 8, wherein: the blind guiding robot is connected with the server through a wireless network; the specific form of the wireless network comprises a 4G network and a 5G network.
10. A computer-readable storage medium characterized by: the computer-readable storage medium has stored therein program instructions which, when executed by a processor of a computer, cause the processor to carry out the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910974489.5A CN110852171A (en) | 2019-10-14 | 2019-10-14 | Scene description robot system and method for online training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110852171A true CN110852171A (en) | 2020-02-28 |
Family
ID=69596450
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910974489.5A Pending CN110852171A (en) | 2019-10-14 | 2019-10-14 | Scene description robot system and method for online training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110852171A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112600906A (en) * | 2020-12-09 | 2021-04-02 | 中国科学院深圳先进技术研究院 | Resource allocation method and device for online scene and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to image description |
CN108681433A (en) * | 2018-05-04 | 2018-10-19 | 南京信息工程大学 | A kind of sampling selection method for data de-duplication |
CN109299341A (en) * | 2018-10-29 | 2019-02-01 | 山东师范大学 | One kind confrontation cross-module state search method dictionary-based learning and system |
CN109345302A (en) * | 2018-09-27 | 2019-02-15 | 腾讯科技(深圳)有限公司 | Machine learning model training method, device, storage medium and computer equipment |
CN109710787A (en) * | 2018-12-30 | 2019-05-03 | 陕西师范大学 | Image Description Methods based on deep learning |
CN109753900A (en) * | 2018-12-21 | 2019-05-14 | 西安科技大学 | A kind of blind person's auxiliary vision system based on CNN/LSTM |
CN110288097A (en) * | 2019-07-01 | 2019-09-27 | 腾讯科技(深圳)有限公司 | A kind of method and relevant apparatus of model training |
CN110297503A (en) * | 2019-07-08 | 2019-10-01 | 中国电子科技集团公司第二十九研究所 | A kind of method of more unmanned systems collaboratively searching danger sources |
Non-Patent Citations (1)
Title |
---|
XIN CHEN等: ""Progressive Differentiable Architecture Search:Bridging the Depth Gap between Search and Evaluation"", 《ARXIV》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110555390B (en) | Pedestrian re-identification method, device and medium based on semi-supervised training mode | |
Yamato et al. | Recognizing human action in time-sequential images using hidden Markov model. | |
CN106845411B (en) | Video description generation method based on deep learning and probability map model | |
CN106897372B (en) | Voice query method and device | |
CN108922559A (en) | Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming | |
CN111709493B (en) | Object classification method, training device, object classification equipment and storage medium | |
CN113936275A (en) | Unsupervised domain adaptive semantic segmentation method based on region feature alignment | |
CN113705811A (en) | Model training method, device, computer program product and equipment | |
CN115953643A (en) | Knowledge distillation-based model training method and device and electronic equipment | |
CN113177538A (en) | Video cycle identification method and device, computer equipment and storage medium | |
CN111694977A (en) | Vehicle image retrieval method based on data enhancement | |
CN114037055A (en) | Data processing system, method, device, equipment and storage medium | |
CN111242176A (en) | Computer vision task processing method and device and electronic system | |
CN110852171A (en) | Scene description robot system and method for online training | |
CN114943937A (en) | Pedestrian re-identification method and device, storage medium and electronic equipment | |
CN110633688A (en) | Training method and device of translation model and sign language video translation method and device | |
CN117372782A (en) | Small sample image classification method based on frequency domain analysis | |
CN114091668A (en) | Neural network pruning method and system based on micro-decision maker and knowledge distillation | |
CN112948709B (en) | Continuous interest point real-time recommendation method driven by influence perception | |
CN111461228B (en) | Image recommendation method and device and storage medium | |
CN114566184A (en) | Audio recognition method and related device | |
CN113761152A (en) | Question-answer model training method, device, equipment and storage medium | |
CN113569867A (en) | Image processing method and device, computer equipment and storage medium | |
CN113255695A (en) | Feature extraction method and system for target re-identification | |
CN116416212B (en) | Training method of road surface damage detection neural network and road surface damage detection neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200228 |