CN110852171A - Scene description robot system and method for online training - Google Patents
- Publication number
- CN110852171A CN110852171A CN201910974489.5A CN201910974489A CN110852171A CN 110852171 A CN110852171 A CN 110852171A CN 201910974489 A CN201910974489 A CN 201910974489A CN 110852171 A CN110852171 A CN 110852171A
- Authority
- CN
- China
- Prior art keywords
- training
- model
- text
- image
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61H—PHYSICAL THERAPY APPARATUS, e.g. DEVICES FOR LOCATING OR STIMULATING REFLEX POINTS IN THE BODY; ARTIFICIAL RESPIRATION; MASSAGE; BATHING DEVICES FOR SPECIAL THERAPEUTIC OR HYGIENIC PURPOSES OR SPECIFIC PARTS OF THE BODY
- A61H3/00—Appliances for aiding patients or disabled persons to walk about
- A61H3/06—Walking aids for blind persons
- A61H3/061—Walking aids for blind persons with electronic detecting or guiding means
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The embodiments of the application disclose a scene description robot system and method for online training. The method comprises the following steps: A1, receiving new image-text pair data; A2, constructing a training set from the new image-text pair data; A3, training the training-state image-text model on the training set to obtain a trained model; and A4, updating the test-state image-text model used for service according to the trained model. The system comprises a blind-guiding robot and a server. Because the test-state image-text model is updated as the environment changes, the adaptability of the system to real scenes is greatly improved and the prediction quality is maintained.
Description
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a scene description robot system and method for online training.
Background
Existing facilities and equipment that help blind people travel mainly comprise accessible facilities, guide dogs and blind-guiding walking sticks. Because daily life is inconvenient for the blind, guide dogs and blind-guiding walking sticks have gradually become the main travel aids. However, guide dogs are difficult to train and costly, and the detection range of a blind-guiding walking stick is limited. Learning braille also takes time, so blind people urgently need other ways to understand the real world.
Research teams at home and abroad are working on more intelligent and reliable blind-guiding robots. For example, a blind-guiding robot based on embedded technology can identify obstacles and traffic signs; a human-computer interactive blind-guiding robot can sense the external environment through sensors and report it to the blind user by voice; and a scene description system based on CNN-LSTM (CNN, Convolutional Neural Network; LSTM, Long Short-Term Memory network) downloads a trained model into an embedded device to translate images into text.
Most products or systems that help the blind understand the real world are built by training a CNN-LSTM deep learning model on a public image-to-text data set (such as Microsoft COCO), compressing the model, and then programming it into an embedded device. The performance of deep learning depends on the distribution of the data. Images in public data sets are mostly of high quality and show well-defined scenes, whereas in the real life scenes of blind users the images collected by the camera are often blurred or underexposed, the scene content differs greatly from the data set, and results are strongly affected by camera quality, shooting angle and similar factors, so the test-time effect is easily poor. This is a dilemma faced by existing scene description systems on the path to commercialization.
The above background disclosure is only for the purpose of assisting in understanding the inventive concepts and technical solutions of the present application and does not necessarily pertain to the prior art of the present application, and should not be used to assess the novelty and inventive step of the present application in the absence of explicit evidence to suggest that such matter has been disclosed at the filing date of the present application.
Disclosure of Invention
The application provides a scene description robot system and method for online training, which can alleviate the technical problem of poor prediction quality caused by the mismatch between the distribution of the data set used to train the model and the distribution of the data collected in the use environment during image description.
In a first aspect, the present application provides an online training method for a scene description robot, comprising: A1, receiving new image-text pair data; A2, constructing a training set from the new image-text pair data; A3, training the training-state image-text model on the training set to obtain a trained model; and A4, updating the test-state image-text model used for service according to the trained model.
In some preferred embodiments, the a2 includes:
assigning a sample weight to each of the new image-text pairs; wherein each sample weight is negatively correlated with the time elapsed between the upload of the new image-text pair data and the current time;
collecting the new image-text pair data into the training set based on the sample weights.
In some preferred embodiments, the a2 further comprises:
extracting global features of each new image-text pair data;
comparing and ranking the global features of the new image-text pair data against all samples in all data sets, and adding the N most similar and the N least similar samples to the training set; wherein N is an integer.
In some preferred embodiments, training the image-text model of the training state using the training set comprises: and searching the network structure and parameters of the training model by a neural network structure searching method by using the training set.
In some preferred embodiments, the neural network structure search method is the gradient-optimization-based PDARTS network architecture search algorithm.
In some preferred embodiments, the network structure search process is divided into four stages: stage one has eight candidate operations in the search space, with unimportant candidates removed during training; stage two has four candidates; stage three has two; and stage four keeps only the single most important candidate operation.
In some preferred embodiments, different matrices are maintained during the network structure search; each matrix represents the weight of each optional operation in the search space; the matrix maintained in stage one is 12×8, representing 12 paths and 8 candidate operations; the matrix in stage two is 12×4 (12 paths, 4 candidates); in stage three 12×2 (12 paths, 2 candidates); and in stage four 12×1, indicating that only the single most important candidate operation remains.
In a second aspect, the application provides an online-training scenario description robot system, which includes a blind-guiding robot and a server; the server comprises a computer program for executing the above method.
In some preferred embodiments, the blind guiding robot is connected with the server through a wireless network; the specific form of the wireless network comprises a 4G network and a 5G network.
In a third aspect, the present application provides a computer readable storage medium having stored therein program instructions which, when executed by a processor of a computer, cause the processor to perform the above-described method.
Compared with the prior art, the beneficial effects of the embodiment of the application are as follows:
A training set is constructed from the received new image-text pair data, and the training-state image-text model is trained on that set to obtain a trained model. Because the new image-text pairs describe the current user's environment, the trained model adapts to that environment. The test-state image-text model used for service is then updated from the trained model, so that the updated test-state model also adapts to the current user's environment. Since the test-state image-text model is updated as the environment changes, the adaptability of the system to real scenes is greatly improved and the prediction quality is maintained.
Drawings
FIG. 1 is an information interaction diagram depicting a robotic system in a scenario of online training according to an embodiment of the present application;
FIG. 2 illustrates the change of module structure during the iterative discarding of the search space according to one embodiment of the present application;
FIG. 3 illustrates the change of the weight matrices during the iterative discarding of the search space according to one embodiment of the present application;
FIG. 4 is a forgetting curve according to an embodiment of the present application;
FIG. 5 is a flow chart of an online training process according to an embodiment of the present application;
FIG. 6 illustrates a scenario-describing robotic system and logic for its use for online training according to one embodiment of the present application;
fig. 7 is a flowchart illustrating a blind guiding robot service according to an embodiment of the present application.
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantageous effects of the embodiments of the present application clearer, the application is further described in detail below with reference to Figs. 1 to 7 and the embodiments. It should be understood that the specific embodiments described here are merely illustrative and are not intended to limit the application.
This embodiment provides an online-trained scene description robot system (i.e., an adaptive scene description system). Referring to fig. 1, the system includes a blind-guiding robot 1 and a server 2. The system is designed mainly for visually impaired people; the robot can be trained with the assistance of sighted people both before the product is provided to users and during use.
The blind guiding robot 1 is a mobile terminal product, and may be called a client. The blind guiding robot 1 can provide scene service for the visually impaired, and can describe the scene. In the embodiment, the blind guiding robot 1 is an embedded device; the blind guiding robot 1 comprises accessories such as a camera, a microphone and a loudspeaker, and can collect data and transmit the collected data to the server 2. Specifically, the blind guiding robot 1 transmits data to the server 2 through a wireless network such as a 5G network; in other implementations, the blind guiding robot 1 transmits the data to the server 2 through a 4G network or other wireless network.
Table 1 lists the main hardware devices of the blind-guiding robot 1, i.e. the client. The components in this list are the key components for realizing the system's functions. The Raspberry Pi platform was chosen here for its high scalability and low price.
Table 1. List of key client hardware devices
The software system of the blind-guiding robot 1 uses existing open interfaces according to the functions to be implemented. It mainly comprises a voice-interaction program module, a data-exchange program module and the like; other custom functions can also be realized on the client. In the software, operations such as image acquisition, image upload, text download, text reading and voice communication are realized by combining different software function modules. The function modules involved in this embodiment are listed in Table 2 below.
Table 2. Function modules of the blind-guiding robot
In Table 2 above, the speech recognition and synthesis module, the chat conversation module, and the Chinese-English translation module call the Baidu speech synthesis and recognition open interface, the Turing robot open interface, and a translation open interface respectively; the image description algorithm uses the classic NIC image captioning model.
The server 2 is deployed with an image-text model. It is noted that, in the present embodiment, the image-text model is deployed in the server 2, not in the blind guiding robot 1. After receiving the data such as the images or videos uploaded by the blind guiding robot 1, the server 2 processes the data and returns the processing result to the blind guiding robot 1, so that the blind guiding robot 1 provides scene description services for the visually impaired.
The image-text model of this embodiment structurally inherits the CNN-LSTM framework (CNN, Convolutional Neural Networks; LSTM, Long Short-Term Memory network) of the conventional image description algorithm. The image-text model is the core of the algorithm of this embodiment, and the function of converting the image into characters is realized, and the quality of the function implementation depends on the quality of the model.
The image-text model on the server 2 is duplicated: the server 2 is deployed with two image-text models, one in a test state and one in a training state. The test-state image-text model directly serves visually impaired users, while the training-state model is trained regularly or irregularly on the received image-text pairs to update its structure and parameters. At intervals, the training-state model replaces the test-state model; this process is called a model update, i.e. the automatic update mechanism.
Referring to fig. 1, the automatic update mechanism is a scene description robot online training method of the present embodiment, and includes steps a1 through a 4. The execution subject of the method is the server 2.
Step a1, new image-text pair data is received.
In the present embodiment, the updated data of the image-text model in the training state is derived from the image-text pair remotely provided by the sighted person, and is uploaded to the server 2 by the blind guiding robot 1.
And step A2, constructing a training set according to the new image-text data.
Training of the image-text model comprises a deployment phase and an application phase. In the deployment phase, the structure and parameters of the model are trained on a public data set. In the application phase, the data set used to train the model contains two parts: a public data set, referred to here as the initial data, and the image-text sample pairs obtained from the blind-guiding robot 1, referred to here as the client data. The training process of the deployment phase is conventional; the application phase is described below.
The frequency at which the server 2 receives data from the blind-guiding robot 1, i.e. the client, fluctuates, the training process takes time, and the model actually serving users is only updated once training completes, so an update interval exists. The logic of the online training is as follows: on the server side, whenever a new data upload is detected, a training set is constructed from the current new data together with the previous data.
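The server-side loop just described can be sketched in a few lines. This is a minimal illustration under assumed names (`OnlineTrainer`, `on_upload`), not the patent's actual implementation:

```python
# Minimal sketch of the server-side online-training loop: whenever a new
# batch of image-text pairs is detected, a training set is built from the
# new data plus all previously received data. Names are illustrative.
class OnlineTrainer:
    def __init__(self, initial_data):
        # initial (public) data; client data accumulates here over time
        self.data = list(initial_data)

    def on_upload(self, new_pairs):
        """Build the training set for this round, then archive the new pairs."""
        training_set = list(new_pairs) + self.data
        self.data.extend(new_pairs)
        return training_set

trainer = OnlineTrainer(["coco_pair"])
round1 = trainer.on_upload(["pair_a", "pair_b"])
round2 = trainer.on_upload(["pair_c"])
```

Each detected upload thus yields a training set combining the newest samples with everything seen before, matching the "current new data as well as the previous data" rule above.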
And A3, training the image-text model in the training state by using a training set to obtain a trained training model.
The training process is actually a process of searching the network structure and parameters of the image-text model in the training state, and the model obtained by training is suitable for the current training set, i.e. the current training data.
The image-text model is the core of the scene description robot system and takes charge of converting images into characters, and the quality of the model directly determines the quality of picture description. The embodiment provides an on-line training strategy, and the structure and the weight of the model are updated by using the real environment where the user is located.
In this embodiment, online training is realized by a neural network structure search method; only the CNN part of the image-text model undergoes structure search, while both the CNN and LSTM parts undergo weight updates.
In this implementation, the structure search of the CNN part adopts the gradient-optimization-based PDARTS (Progressive Differentiable Architecture Search) algorithm. The model is formed by stacking several modules; between the feature maps in each module there is a candidate search space of 8 possible operations: no connection, 3×3 max pooling, 3×3 average pooling, skip connection, 3×3 depthwise separable convolution, 5×5 depthwise separable convolution, 3×3 dilated convolution, and 5×5 dilated convolution.
In this embodiment, the whole neural network is formed by stacking eight modules, as shown in fig. 2. Nodes 0, 1, 2 and 3 each represent a feature map, and the arrows represent the search space (e.g., convolution, pooling, skip connections). To reduce computation and speed up the search, the network structure search is divided into four stages: stage one (epochs 1-20) has eight candidate operations, with unimportant candidates removed during training; stage two (epochs 21-40) has four; stage three (epochs 41-60) has two; stage four (epochs 61-80) keeps the single most important candidate operation. The test process uses the stage-four network structure. Finally, a model structure adapted to the current training set is obtained.
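The progressive schedule can be sketched as follows. The epoch boundaries (20/40/60/80) come from the text; the operation names follow the standard DARTS search space, which the 8 candidates above correspond to:

```python
# Sketch of the four-stage progressive search schedule: the candidate set
# shrinks 8 -> 4 -> 2 -> 1 across the four stages. Operation names follow
# the standard DARTS search space (an assumption consistent with the text).
CANDIDATE_OPS = [
    "none", "max_pool_3x3", "avg_pool_3x3", "skip_connect",
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
]

def stage_for_epoch(epoch):
    """Map a training epoch in 1..80 to its search stage 1..4."""
    return min((epoch - 1) // 20 + 1, 4)

def ops_kept(stage):
    """Number of candidate operations kept in each stage: 8, 4, 2, 1."""
    return 8 >> (stage - 1)
```

Halving the candidate set at each stage boundary is what lets later stages search a deeper network at the same cost, which is the core PDARTS trade-off.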
During the network structure search, different matrices need to be maintained. Each matrix holds the importance, i.e. the weight, of each optional operation in the search space. Referring to fig. 3, the matrix maintained in stage one is 12×8, representing 12 paths and 8 candidate operations; the matrix in stage two is 12×4 (12 paths, 4 candidates); in stage three 12×2 (12 paths, 2 candidates); and in stage four 12×1, indicating that only the single most important candidate operation remains. The randomly initialized matrices of the first three stages are normalized by rows, and the last stage is normalized by columns.
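The per-stage weight matrices can be sketched in plain Python: 12 paths × K candidates, row normalization for stages one to three and column normalization for stage four, as stated above. The random initialization here is purely illustrative:

```python
import random

def init_stage_matrix(stage, seed=0):
    """Randomly initialize the 12 x K architecture-weight matrix for a stage."""
    k = {1: 8, 2: 4, 3: 2, 4: 1}[stage]   # candidate ops kept per stage
    rnd = random.Random(seed)
    w = [[rnd.random() for _ in range(k)] for _ in range(12)]
    if stage < 4:
        # stages 1-3: normalize each row (one path's weights over its candidates)
        return [[x / sum(row) for x in row] for row in w]
    # stage 4: normalize the single column across the 12 paths
    total = sum(row[0] for row in w)
    return [[row[0] / total] for row in w]
```

Row normalization makes each path's candidate weights a distribution over operations; in stage four only one operation per path remains, so normalization can only be taken across the paths.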
And step A4, updating the image-text model of the test state for the service according to the trained training model.
After the latest network structure and parameters have been found, the test-state image-text model is updated. Specifically, the trained model replaces the previous test-state image-text model to obtain the updated test-state model; alternatively, the network structure and parameters of the previous test-state model are updated according to the trained model. The updated model then serves users directly. Steps A1 to A4 are repeated after each update; their order is flexible, for example step A1 of a new round may run while step A2 of the previous round is still in progress.
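The two-copy update mechanism (step A4) can be sketched as follows; the class and attribute names are illustrative, not from the patent:

```python
import copy

# Sketch of the two-copy update mechanism: a training-state model is
# fine-tuned in the background and then promoted to replace the
# test-state model that serves users (step A4).
class ImageTextModelPair:
    def __init__(self, model):
        self.test = model                  # test state: serves users
        self.train = copy.deepcopy(model)  # training state: updated online

    def promote(self):
        """Replace the test-state model with a copy of the trained one."""
        self.test = copy.deepcopy(self.train)

pair = ImageTextModelPair({"params": 0})
pair.train["params"] = 1   # stand-in for a completed training round
pair.promote()
```

Deep-copying on promotion keeps the serving copy isolated, so the next training round cannot disturb the model currently answering user requests.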
The update process continues as long as image-text pairs are received. The data samples collected by the blind-guiding robot 1, i.e. the client, reflect the real environment of the user and are used to fine-tune the model.
The online training strategy of the model can enable the model to be more adaptive to the environmental scene of the current user.
Referring to fig. 5, the foregoing step a2 is explained.
In the application phase, this embodiment employs an association-contrast search strategy. When constructing the training set, the global features of each newly uploaded sample, i.e. image-text pair, are extracted using a classification network pre-trained on ImageNet. The features of the sample are then compared and ranked against all samples in all data sets (initial data and client data), and the N most similar and the N least similar samples are added to the training set; cosine similarity can be used to compare the feature vectors; N is an integer that can be set manually. This data-set expansion mechanism helps the model memorize the scene by association and contrast. N also determines the ratio of initial data to client data in the training set.
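The association-contrast retrieval can be sketched as below. The feature extractor (the ImageNet-pretrained classification network) is stubbed out, and the sample names and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def associate_contrast(new_feature, pool, n):
    """pool: list of (sample_id, feature). Return the n most similar
    (association) and n least similar (contrast) samples."""
    ranked = sorted(pool, key=lambda s: cosine(new_feature, s[1]), reverse=True)
    return ranked[:n] + ranked[-n:]

pool = [
    ("kitchen", [1.0, 0.0]),
    ("hallway", [0.9, 0.1]),
    ("street",  [0.0, 1.0]),
    ("park",    [-1.0, 0.1]),
]
picked = associate_contrast([1.0, 0.0], pool, 1)
```

With N = 1 and a new feature close to "kitchen", the strategy selects "kitchen" (highest similarity) and "park" (lowest), expanding the training set with both near and far samples.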
Each new image-text pair received from the client is assigned a sample weight once its upload completes. The weight is negatively correlated with the time elapsed since upload: the earlier a sample was uploaded, the smaller its weight. These sample weights are used when sampling the training data; a larger weight means a higher probability of being collected into the training set, quantified by normalizing the weights of all client samples. This strategy adds a forgetting mechanism to the training process, making the model more robust in adapting to a new environment.
FIG. 4 shows the forgetting curve, which is a shifted inverse-proportional function, i.e. w(t) = r / (t − t0) + b, where r, b and t0 are tunable parameters. The trend of the forgetting curve shows that more recently uploaded samples are more likely to be drawn into the training set and leave a stronger impression on the model, while earlier samples have a smaller probability of being recalled by the model and may be completely forgotten after a certain time.
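The weighting and its normalization into sampling probabilities can be sketched as follows. The exact functional form w(t) = r / (t − t0) + b is reconstructed from the text's "translated inverse-proportional function with parameters r, b, t0", so it is an assumption, as are the default parameter values:

```python
# Sketch of the forgetting-curve sample weight (assumed form and defaults).
def forgetting_weight(age, r=1.0, b=0.0, t0=-1.0):
    """Weight of a sample as a function of its age (time since upload)."""
    return r / (age - t0) + b

def sampling_probabilities(ages):
    """Normalize all client-sample weights into sampling probabilities."""
    weights = [forgetting_weight(a) for a in ages]
    total = sum(weights)
    return [w / total for w in weights]

# three client samples: uploaded just now, 1 time unit ago, 9 units ago
probs = sampling_probabilities([0.0, 1.0, 9.0])
```

The newest sample gets the largest probability and the oldest the smallest, implementing the forgetting trend of Fig. 4.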
The timing of a model update is determined by the training process: the model is updated once training completes. For example, if the client uploaded 50 image-text pairs in the current period and N is 10, the training set totals 550 samples, the 80 training epochs take about 8 hours, and the current update interval is therefore 8 hours.
Referring to fig. 6, the scene description robot system in this embodiment has a manual training process and a scene service process.
The manual training process is explained.
First, an initialized image-text model needs to be obtained. The initialized model structurally inherits the CNN-LSTM framework of traditional image description algorithms, is trained on a public image-text data set (e.g., Microsoft COCO), and is deployed on the server 2.
Secondly, for the life scene of the visually impaired person, family members or the product provider can perform initialization training on the product in advance: under good lighting, they describe by voice the scenes or objects visible through the camera of the blind-guiding robot 1. The resulting image-text pairs are uploaded to the server 2 as part of the training data via a 5G network or the like. After this initial training, the blind-guiding robot 1 can provide scene description services for the visually impaired.
A scene service procedure is explained.
The scene service process is the stage that directly serves the visually impaired; the model used here is the updated test-state image-text model (test model for short). In this process, the blind-guiding robot 1, i.e. the client, uploads the acquired image to the server 2; after prediction by the server's test model, the result is returned to the blind-guiding robot 1 and broadcast by voice. In this process, without depending on the visually impaired person, the scene description robot system describes the real scene.
The time of the manual training process and the scene service process is explained.
For the manual training process: generally, after the user receives the blind-guiding robot 1, a sighted person first describes some scenes and objects in the living environment of the visually impaired person, similar to an initial manual entry.
For the scene service process: generally it follows the manual training process, and thereafter the two processes can alternate. The more frequent the manual training and the more data uploaded to the server 2, the more familiar the updated test model becomes with the living environment of the visually impaired person.
The use of the blind guiding robot 1 is explained. Referring to fig. 7, when the device is turned on, the mode to enter can be selected by voice, and each mode has a corresponding trigger word (keyword). In the online training mode, family members of the visually impaired person can assist in training the robot; in the scene service mode, the blind guiding robot 1 provides services for the visually impaired person, describing the real scene and broadcasting the description by voice.
Compared with traditional scene description robots, the method and system of the present embodiment have the advantage of performing scene description in an adaptive, online-training manner.
This embodiment adopts a model-device separation strategy: the image-text model is separated from the blind guiding robot 1 and moved from the embedded device to the server, which reduces the computational cost on the embedded device. The client connects to the server via 4G/5G technology, so the functions of the client device remain simple, the scalability of the model is greatly improved, conditions are provided for updating the model, and the system gains Internet-of-Things characteristics. Under this online training strategy, the client uploads image-text sample pairs, which are used to train and improve the model, thereby improving the model's robustness to the environment and scenes in which the user lives. The model used for scene description is not fixed but changes continuously at a certain update frequency. This alleviates, to a certain extent, the technical problems of traditional scene description robots applied to real scenes: slow prediction, low prediction accuracy, and poor adaptability. A traditional scene description robot transplants an image-text translator (image-text model) trained on an open dataset into an embedded device; the structure and parameters of the model are then fixed, and the data used to train the model and the image data collected in the user's real scene are likely not from the same distribution, which is an important cause of poor prediction performance. The present embodiment addresses these problems by means of user feedback and intelligent processing.
This embodiment adopts an online training strategy for the scene description robot and applies a network structure search technique to the online training process, so that not only the parameters but also the structure of the model are updated during online training, which helps find a network structure and parameters better suited to the current data environment. The image-text model in this embodiment is kept in an online training state, where online training means the following: initially, a CNN-LSTM model pre-trained on an open dataset is used; after deployment, the user's environmental data is collected and added as records to a new dataset, the background model is continuously trained on these records, and at regular intervals the model on the server is updated to the model trained on the user's own scene data. Thus, the image-text model used in this embodiment keeps changing and continuously adapts to the real environment in which the user lives; because the model's structure and parameters are updated according to the environment, its adaptability to real scenes is greatly improved.
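The two-model scheme implied above (a training-state model continuously updated with user data, and a test-state model that serves users and is refreshed at a fixed update frequency) can be sketched as follows. This is a toy illustration under assumed names; `train_step` stands in for a real gradient update.

```python
# Sketch of the two-model online training scheme: the training-state model
# is updated as user image-text pairs arrive, and the test-state model
# serving users is refreshed from it at a fixed update interval.
import copy

class OnlineTrainer:
    def __init__(self, init_params, update_every=3):
        self.train_model = dict(init_params)               # training-state model
        self.test_model = copy.deepcopy(self.train_model)  # serving (test-state) model
        self.update_every = update_every
        self.steps = 0

    def train_step(self, image_text_pair):
        # Placeholder for a real gradient step on the new sample.
        self.train_model["seen"] = self.train_model.get("seen", 0) + 1
        self.steps += 1
        if self.steps % self.update_every == 0:
            # Periodically promote the trained model to serve users.
            self.test_model = copy.deepcopy(self.train_model)

trainer = OnlineTrainer({"seen": 0}, update_every=3)
for pair in [("img1", "a dog"), ("img2", "a door"), ("img3", "a cup")]:
    trainer.train_step(pair)
```

The key design point is that prediction always uses the last promoted test model, so the serving path is never blocked by ongoing training.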
When constructing the training set, the present embodiment employs a training data selection strategy based on an association-contrast and forgetting mechanism that simulates human cognitive rules. Through this bionic learning strategy, the trained model gains flexibility and adaptability, the scene description robot becomes more humanized, and it begins to exhibit human-like characteristics of remembering and forgetting scenes.
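The selection strategy (detailed further in claims 2 and 3) can be sketched as follows: newer uploads receive larger sample weights (the forgetting mechanism), and for each new sample the N most similar and N least similar stored samples, ranked by global-feature similarity, join the training set (association and contrast). All values and names here are illustrative.

```python
# Toy sketch of the association-contrast / forgetting selection strategy.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sample_weight(upload_time, now):
    # Weight decreases as the interval since upload grows (forgetting).
    return 1.0 / (1.0 + (now - upload_time))

def select_neighbors(new_feat, dataset, n=1):
    # dataset: list of (sample_id, global_feature) pairs.
    ranked = sorted(dataset, key=lambda s: cosine(new_feat, s[1]), reverse=True)
    return ranked[:n] + ranked[-n:]  # N most and N least similar

dataset = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
picked = select_neighbors([1.0, 0.05], dataset, n=1)
```

Here the new feature is closest to sample "a" and farthest from sample "c", so both are selected: the similar sample reinforces association, the dissimilar one provides contrast.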
This embodiment adopts a scene description robot framework and usage scheme in which sighted people assist in training on behalf of the visually impaired person: the scene description model is optimized with the knowledge of sighted helpers and automatically refined to provide better service for the visually impaired person. The data used in this embodiment comes from the user's actual living environment, which alleviates the poor prediction performance in real use caused by the training set and test set belonging to different distributions.
Those skilled in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed, performs the processes of the method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
The foregoing is a further detailed description of the present application with reference to specific/preferred embodiments and is not intended to limit the present application to that particular description. A person skilled in the art to which the present application pertains may make several alternatives or modifications to the described embodiments without departing from the concept of the present application, and such alternatives or modifications should be considered as falling within the scope of the present application.
Claims (10)
1. An online training method for a scene description robot, characterized by comprising:
a1, receiving new image-text pair data;
a2, constructing a training set according to the new image-text data;
a3, training the image-text model in the training state by using the training set to obtain a trained training model;
and A4, updating the image-text model of the test state for the service according to the trained training model.
2. The on-line training method of claim 1, wherein the a2 comprises:
assigning a sample weight to each item of the new image-text pair data; wherein the sample weight is inversely related to the interval between the upload time of the new image-text pair data and the current time;
collecting the new image-text pair data into the training set based on the sample weights.
3. The on-line training method of claim 2, wherein said a2 further comprises:
extracting global features of each new image-text pair data;
comparing and ranking the global features of the new image-text pair data against all samples in the existing data sets, and adding the N samples with the highest similarity and the N samples with the lowest similarity to the training set; wherein N is an integer.
4. The online training method of claim 1, wherein the training of the training-state image-text model using the training set comprises: searching for the network structure and parameters of the training model by a neural network structure search method using the training set.
5. The online training method of claim 4, wherein: the neural network structure search method is the PDARTS network structure search algorithm based on gradient optimization.
6. The online training method of claim 5, wherein the network structure search process is divided into four stages: in stage one, the search space contains eight candidate operations, and unimportant candidate operations are removed during training; in stage two, the search space contains four candidate operations; in stage three, two; and stage four retains the single most important candidate operation.
7. The online training method of claim 6, wherein: a different matrix is maintained at each stage of the network structure search; the matrix represents the weight of each optional operation in the search space; the matrix maintained in stage one is 12×8, representing 12 paths and 8 candidate operations; the matrix maintained in stage two is 12×4, representing 12 paths and 4 candidate operations; the matrix maintained in stage three is 12×2, representing 12 paths and 2 candidate operations; and the matrix maintained in stage four is 12×1, indicating that a single most important candidate operation finally remains.
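The four-stage shrinking of the weight matrix described in claims 6 and 7 can be illustrated with the following toy sketch. This is an interpretation in the spirit of progressive architecture search, not the PDARTS algorithm itself: a 12×8 weight matrix (12 paths, 8 candidate operations) is halved stage by stage to 12×4, 12×2, and finally 12×1, keeping the highest-weight candidate operations on each path.

```python
# Illustrative four-stage progressive pruning of the candidate-operation
# weight matrix: 12x8 -> 12x4 -> 12x2 -> 12x1.
import random

def prune_half(weight_rows):
    """For each path (row), keep the better half of candidate operations."""
    pruned = []
    for row in weight_rows:
        keep = max(1, len(row) // 2)
        # In practice the identity of each kept operation is tracked as well;
        # here only the surviving weights are kept for brevity.
        pruned.append(sorted(row, reverse=True)[:keep])
    return pruned

random.seed(0)
weights = [[random.random() for _ in range(8)] for _ in range(12)]  # stage 1: 12x8
best_per_path = [max(row) for row in weights]
for _ in range(3):  # stages 2-4: 12x4 -> 12x2 -> 12x1
    weights = prune_half(weights)
```

Because each stage keeps the top half of every row, the single operation surviving on each path is the one with the highest initial weight.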
8. An online-training scene description robot system, characterized by comprising: a blind guiding robot and a server; wherein the server comprises a computer program for executing the method according to any one of claims 1 to 7.
9. The scenario-describing robotic system of claim 8, wherein: the blind guiding robot is connected with the server through a wireless network; the specific form of the wireless network comprises a 4G network and a 5G network.
10. A computer-readable storage medium characterized by: the computer-readable storage medium has stored therein program instructions which, when executed by a processor of a computer, cause the processor to carry out the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910974489.5A CN110852171A (en) | 2019-10-14 | 2019-10-14 | Scene description robot system and method for online training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110852171A true CN110852171A (en) | 2020-02-28 |
Family
ID=69596450
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910974489.5A Pending CN110852171A (en) | 2019-10-14 | 2019-10-14 | Scene description robot system and method for online training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110852171A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112600906A (en) * | 2020-12-09 | 2021-04-02 | 中国科学院深圳先进技术研究院 | Resource allocation method and device for online scene and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to image description |
CN108681433A (en) * | 2018-05-04 | 2018-10-19 | 南京信息工程大学 | A kind of sampling selection method for data de-duplication |
CN109299341A (en) * | 2018-10-29 | 2019-02-01 | 山东师范大学 | One kind confrontation cross-module state search method dictionary-based learning and system |
CN109345302A (en) * | 2018-09-27 | 2019-02-15 | 腾讯科技(深圳)有限公司 | Machine learning model training method, device, storage medium and computer equipment |
CN109710787A (en) * | 2018-12-30 | 2019-05-03 | 陕西师范大学 | Image Description Methods based on deep learning |
CN109753900A (en) * | 2018-12-21 | 2019-05-14 | 西安科技大学 | A kind of blind person's auxiliary vision system based on CNN/LSTM |
CN110288097A (en) * | 2019-07-01 | 2019-09-27 | 腾讯科技(深圳)有限公司 | A kind of method and relevant apparatus of model training |
CN110297503A (en) * | 2019-07-08 | 2019-10-01 | 中国电子科技集团公司第二十九研究所 | A kind of method of more unmanned systems collaboratively searching danger sources |
Non-Patent Citations (1)
Title |
---|
XIN CHEN等: ""Progressive Differentiable Architecture Search:Bridging the Depth Gap between Search and Evaluation"", 《ARXIV》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110555390B (en) | Pedestrian re-identification method, device and medium based on semi-supervised training mode | |
Yamato et al. | Recognizing human action in time-sequential images using hidden Markov model. | |
CN106845411B (en) | Video description generation method based on deep learning and probability map model | |
CN106897372B (en) | Voice query method and device | |
CN108922559A (en) | Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming | |
CN111709493B (en) | Object classification method, training device, object classification equipment and storage medium | |
CN113936275A (en) | Unsupervised domain adaptive semantic segmentation method based on region feature alignment | |
CN113705811A (en) | Model training method, device, computer program product and equipment | |
CN115953643A (en) | Knowledge distillation-based model training method and device and electronic equipment | |
CN113177538A (en) | Video cycle identification method and device, computer equipment and storage medium | |
CN111694977A (en) | Vehicle image retrieval method based on data enhancement | |
CN114037055A (en) | Data processing system, method, device, equipment and storage medium | |
CN111242176A (en) | Computer vision task processing method and device and electronic system | |
CN110852171A (en) | Scene description robot system and method for online training | |
CN114943937A (en) | Pedestrian re-identification method and device, storage medium and electronic equipment | |
CN110633688A (en) | Training method and device of translation model and sign language video translation method and device | |
CN117372782A (en) | Small sample image classification method based on frequency domain analysis | |
CN114091668A (en) | Neural network pruning method and system based on micro-decision maker and knowledge distillation | |
CN112948709B (en) | Continuous interest point real-time recommendation method driven by influence perception | |
CN111461228B (en) | Image recommendation method and device and storage medium | |
CN114566184A (en) | Audio recognition method and related device | |
CN113761152A (en) | Question-answer model training method, device, equipment and storage medium | |
CN113569867A (en) | Image processing method and device, computer equipment and storage medium | |
CN113255695A (en) | Feature extraction method and system for target re-identification | |
CN116416212B (en) | Training method of road surface damage detection neural network and road surface damage detection neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200228 |