CN116781688A

CN116781688A - Internet of things terminal remote upgrading method and device based on reinforcement learning

Info

Publication number: CN116781688A
Application number: CN202310850254.1A
Authority: CN
Inventors: 刘树波; 吴钧诚; 蔡朝晖; 涂国庆
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2023-07-12
Filing date: 2023-07-12
Publication date: 2023-09-19

Abstract

The application discloses a reinforcement learning-based remote upgrading method and device for an Internet of Things terminal, and relates to the technical field of the Internet of Things. The method comprises: sending an upgrade request to the Internet of Things terminal through a gateway to notify the terminal to perform a remote upgrade, the upgrade request comprising the address of an update server and the latest deadline of the remote upgrade; dividing the Flash area of the Internet of Things terminal into an application software program area and an upgrade bootstrap program area; after the terminal receives the upgrade request, switching from the application software program to the upgrade bootstrap program by switching the interrupt vector table, and making a distributed autonomous upgrade decision based on a reinforcement learning algorithm; and, based on the FTP protocol, communicating between the terminal and the update server to download the files required for the update, updating the Q table, and updating the application software firmware. The application can ensure the reliability of remote upgrading.

Description

Internet of things terminal remote upgrading method and device based on reinforcement learning

Technical Field

The application relates to the technical field of the Internet of things, in particular to a method and a device for remotely upgrading an Internet of things terminal based on reinforcement learning.

Background

With the development of Internet of Things technology, the number of Internet of Things sensing devices such as sensors has grown explosively, generating a large amount of real-time data. In a sensor network, the terminal devices report data to a gateway, and the data is managed through the gateway. As actual requirements change and security problems arise, the firmware of Internet of Things terminal devices often needs to be updated by remote upgrade. During a remote upgrade, the terminal obtains the new firmware and then loads it into memory to complete the update. However, updating all terminal devices at the same time often exceeds the processing capacity of the gateway. When the number of update requests handled by the gateway exceeds its load capacity, response timeouts and update failures easily occur, wasting time and energy.

At present, the conventional remote upgrade scheme for Internet of Things devices numbers the devices and has the gateway send update requests in batches, so as to limit the number of devices being upgraded at the same time. This approach is simple to manage, but it offers a low degree of automation, tends to perform poorly under complex environmental conditions, has poor real-time behavior, cannot react to changes in the network environment, and cannot make full use of the gateway's resources.

Disclosure of Invention

Aiming at the defects in the prior art, the application aims to provide a reinforcement learning-based remote upgrading method and device for an Internet of things terminal, which can ensure the reliability of remote upgrading.

In order to achieve the above purpose, the application provides a reinforcement learning-based remote upgrading method for an internet of things terminal, which specifically comprises the following steps:

an upgrade request is sent to the Internet of things terminal through the gateway to inform the Internet of things terminal of remote upgrade, wherein the upgrade request comprises an address of an update server and the latest deadline of remote upgrade;

dividing a Flash area of the terminal of the Internet of things into an application software program area and an upgrade bootstrap program area;

after receiving the upgrade request, the Internet of Things terminal switches from the application software program to the upgrade bootstrap program by switching the interrupt vector table, and makes a distributed autonomous upgrade decision based on a reinforcement learning algorithm;

based on the FTP protocol, the Internet of things terminal communicates with the update server to download the files required for updating, update the Q table and update the application software firmware.

On the basis of the above technical solution, the Internet of Things terminal runs the application software program during normal operation; after the upgrade request sent by the gateway is received, the application software program switches the interrupt vector table to transfer control to the upgrade bootstrap program.

Based on the technical scheme, the distributed autonomous upgrade decision based on the reinforcement learning algorithm comprises the following specific steps:

determining how many times the Internet of Things terminal has been upgraded: if this is the first upgrade, the terminal randomly initializes a time slot number as the initial state and initializes the Q table; otherwise, it uses the time slot number adopted in the previous upgrade as the state and continues to use the existing Q table;

selecting an action according to an ε-greedy strategy: with probability ε an action is selected at random, and with probability 1-ε the action with the largest Q value in the current state is selected by looking up the Q table; the time slot number is then modified according to the selected action to obtain a new state;

and calculating the starting time of the remote upgrade according to the latest deadline of the remote upgrade in the upgrade request sent by the gateway and the time slot number, and starting the remote upgrade after the waiting time is reached.

On the basis of the above technical solution, when the Q table is initialized, it is assigned values according to the requirements of the application scenario; the Q value of the action that decreases the time slot number can be raised in every time slot state so that the upgrade is completed in an earlier time slot.

On the basis of the technical scheme, the actions comprise reducing the time slot number, maintaining the time slot number and increasing the time slot number.

Based on the technical scheme, the communication between the internet of things terminal and the update server based on the FTP protocol is used for downloading the files required for updating, and the method specifically comprises the following steps:

the method comprises the steps that communication is carried out between an internet of things terminal and an update server through an FTP protocol, the internet of things terminal downloads an application software firmware description file and analyzes file content, and the application software firmware description file comprises a firmware version number, a firmware file size, an author and a modification date;

the Internet of things terminal compares the original firmware version number with the firmware version number obtained by analyzing the application software firmware description file to judge whether new application software firmware needs to be downloaded or not:

if yes, the terminal of the Internet of things downloads new application software firmware, verifies the integrity and then switches to an application software program;

if not, switching to the application software program directly.

Based on the above technical solution, the updating of the Q table is performed, where an update formula adopted for performing the updating of the Q table is:

Q(s,a) = Q(s,a) + α(r + γ*max(Q(s′,a′)) - Q(s,a))

wherein Q(s,a) represents the Q value of action a in state s, s represents the current state, a represents the selected action, α represents the learning rate, r represents the reward function, γ represents the discount factor, s′ represents the next state after action a is performed, a′ represents an action selected in state s′, and Q(s′,a′) represents the Q value obtained by action a′ in state s′.

Based on the technical scheme, before updating the application software firmware, the method further comprises the following steps:

judging whether the application software firmware is successfully downloaded or not:

if yes, updating the application software firmware;

if not, switching from the upgrade guiding program to the application software program, reporting the update failure to the gateway and ending the update.

Based on the technical scheme, the updating of the application software firmware comprises the following specific steps:

saving the interrupt vector table of the upgrade bootstrap program, setting the upgrade flag, disabling interrupts, and erasing the application software program code area;

reading and parsing the downloaded application software firmware 512 bytes at a time, wherein the application software firmware comprises address identifiers and machine code in ASCII form;

the upgrade bootstrap program extracts the address identifier and stores it in the address variable, converts the machine code from ASCII form into hexadecimal machine code, and stores it in the buffer;

continuing to read the application software firmware until it has been read completely; when the end-of-file marker (EOF) is read, writing the data in the buffer to the Flash location pointed to by the address variable, setting the upgrade flag to completed, writing the data in the additional interrupt vector table buffer block by block into the interrupt vector table of the main controller, overwriting it, and re-enabling interrupts;

the Internet of Things terminal restarts the main controller, switches to the application software program, and reports the new firmware version number to the gateway;

when the buffer is full, it is processed as follows: if the address variable points into the interrupt vector table, the data in the buffer is written into the additional interrupt vector table buffer according to its offset; otherwise, the content of the buffer is written block by block to the Flash location pointed to by the address variable, and the address variable is advanced so that it points to the write address of the next machine code.

The application provides a reinforcement learning-based remote upgrading device for an internet of things terminal, which comprises the following components:

a sending module, used for sending an upgrade request to the Internet of Things terminal through the gateway to notify the terminal to perform a remote upgrade, wherein the upgrade request comprises the address of the update server and the latest deadline of the remote upgrade;

the division module is used for dividing a Flash area of the terminal of the Internet of things into an application software program area and an upgrade bootstrap program area;

the decision module is used for driving the Internet of things terminal to switch the interrupt vector table from the application software program to the upgrade guide program after receiving the upgrade request, and carrying out distributed autonomous upgrade decision based on the reinforcement learning algorithm;

and the updating module is used for driving the communication between the internet of things terminal and the updating server based on the FTP protocol so as to download the files required for updating, update the Q table and update the application software firmware.

Compared with the prior art, the application has the advantages that:

(1) A reinforcement learning method is used to make distributed autonomous update decisions on resource-constrained Internet of Things devices, achieving stable, reliable and intelligent remote upgrading of those devices;

(2) The reinforcement learning algorithm makes the decision in a distributed, autonomous manner, so the terminals do not communicate with each other; this reduces the communication load between the terminals and the gateway, avoids excessive competition for resources among the terminals, shortens the time required for updating, and reduces wasted energy;

(3) The remote updating mechanism uses the memory of the internet of things module, can realize remote updating on the main controller equipment with limited memory resources, and ensures the reliability of remote updating through the checking mechanism of the software firmware and the description file.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for remotely upgrading an internet of things terminal based on reinforcement learning in an embodiment of the application;

Fig. 2 is an overall flow diagram of the reinforcement learning-based remote upgrading method for an Internet of Things terminal.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application.

Referring to fig. 1, the method for remotely upgrading the internet of things terminal based on reinforcement learning provided by the embodiment of the application specifically includes the following steps:

S1: sending an upgrade request to the Internet of Things terminal through the gateway to notify the terminal to perform a remote upgrade, wherein the upgrade request comprises the address of the update server and the latest deadline of the remote upgrade;

S2: dividing the Flash area of the Internet of Things terminal into an application software program area and an upgrade bootstrap program area;

S3: after receiving the upgrade request, the Internet of Things terminal switches from the application software program to the upgrade bootstrap program by switching the interrupt vector table, and makes a distributed autonomous upgrade decision based on a reinforcement learning algorithm;

In the application, the Internet of Things terminal runs the application software program during normal operation; after the upgrade request sent by the gateway is received, the application software program switches the interrupt vector table to transfer control to the upgrade bootstrap program, which then starts the distributed autonomous upgrade decision.

In the application, the distributed autonomous upgrade decision is made based on the reinforcement learning algorithm, and the specific steps include:

S301: determining how many times the Internet of Things terminal has been upgraded: if this is the first upgrade, the terminal randomly initializes a time slot number as the initial state and initializes the Q table; otherwise, it uses the time slot number adopted in the previous upgrade as the state and continues to use the existing Q table. The reinforcement learning algorithm is the Q-learning algorithm, which records an action value (Q).

When the Q table is initialized, it is assigned values according to the requirements of the application scenario. The Q value of the action that decreases the time slot number can be raised in every time slot state so that the upgrade is completed in an earlier time slot, which gives the terminal a tendency to start the remote upgrade earlier.

S302: selecting an action according to an ε-greedy strategy: with probability ε an action is selected at random, and with probability 1-ε the action with the largest Q value in the current state is selected by looking up the Q table; the time slot number is then modified according to the selected action to obtain a new state;

the actions include decreasing the slot number, maintaining the slot number, and increasing the slot number. I.e. the new state s' is obtained by modifying the time slot number according to the selected action a.

S303: calculating the start time of the remote upgrade from the latest deadline of the remote upgrade in the upgrade request sent by the gateway and the time slot number, and starting the remote upgrade once the waiting time has elapsed.
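The source does not give the formula that maps a time slot number to a start time. The sketch below shows one possible mapping, assuming (hypothetically) that the interval between the arrival of the upgrade request and the latest deadline is split into `num_slots` equal slots and that the terminal starts at the beginning of its slot. The function name and all parameters are illustrative assumptions.

```python
def upgrade_start_time(request_time, deadline, slot, num_slots):
    """Return the start of the given slot, in the same time unit as the inputs.

    Assumption: slots are equal in length and numbered 0 .. num_slots - 1,
    covering the window from request_time up to deadline.
    """
    if not 0 <= slot < num_slots:
        raise ValueError("slot out of range")
    if deadline <= request_time:
        raise ValueError("deadline must be after the request time")
    slot_length = (deadline - request_time) / num_slots
    return request_time + slot * slot_length


# Example: an 800-second window split into 8 slots; slot 3 starts 300 s in.
start = upgrade_start_time(request_time=0, deadline=800, slot=3, num_slots=8)
```

With this mapping, a lower slot number always means an earlier start, which matches the text's statement that decreasing the slot number makes the terminal upgrade earlier.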

S4: based on FTP (File Transfer Protocol), the Internet of Things terminal communicates with the update server to download the files required for the update, updates the Q table, and updates the application software firmware.

In the application, based on the FTP protocol, the terminal of the Internet of things communicates with the update server to download the files required for updating, and the method specifically comprises the following steps:

S401: the Internet of Things terminal communicates with the update server through the FTP protocol, downloads the application software firmware description file and parses its content. The description file contains required information such as the firmware version number and the firmware file size, and may also contain optional information such as the author and the modification date.

S402: the Internet of things terminal compares the original firmware version number with the firmware version number obtained by analyzing the application software firmware description file to judge whether new application software firmware needs to be downloaded or not:

if not, switching to the application software program directly.

If yes, the Internet of Things terminal downloads the application software firmware using the Internet of Things module and stores it in the memory of the Internet of Things module. After the download completes, integrity is verified against the related information in the description file; if verification fails, the download is retried until a preset maximum number of attempts is reached.
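The version check and the bounded retry described above can be sketched as follows. The source does not specify the layout of the description file, so the `key=value` line format, the key name `version`, and the rule "update when the versions differ" are all assumptions made for illustration; `download` and `verify` are hypothetical caller-supplied functions.

```python
def parse_description(text):
    """Parse 'key=value' lines into a dict (hypothetical file layout)."""
    info = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip()
    return info


def needs_update(local_version, description):
    """True when the server's version differs from the locally stored one."""
    return description["version"] != local_version


def download_with_retries(download, verify, max_attempts):
    """Call download() until verify(data) passes or attempts run out.

    Returns (data, failed_attempts); data is None if every attempt failed.
    """
    for failed in range(max_attempts):
        data = download()
        if verify(data):
            return data, failed
    return None, max_attempts
```

Returning the number of failed attempts alongside the data lets the caller reuse it later, for example as an input to the reward function.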

In the application, the Q table is updated, wherein an update formula adopted for updating the Q table is as follows:

Q(s,a) = Q(s,a) + α(r + γ*max(Q(s′,a′)) - Q(s,a))

wherein Q(s,a) represents the Q value of action a in state s, s represents the current state, a represents the selected action, and α represents the learning rate. r represents the reward function, which is determined by the download outcome: if the download succeeds, the reward should be positive; otherwise, it can be set to different negative values according to the number of failed retries, which reflects the intensity of competition in the current state. γ represents the discount factor, which can be set according to the actual application scenario.
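A minimal sketch of this update rule and of a reward of the kind described. The Q table is assumed to be a list of rows, one per time slot, each holding one Q value per action. The numeric constants (`success_reward`, `retry_penalty`, and the default `alpha` and `gamma`) are illustrative assumptions, since the source leaves them to the application scenario.

```python
def reward(success, failed_retries, success_reward=1.0, retry_penalty=0.5):
    """Positive on success; otherwise more negative the more retries failed."""
    if success:
        return success_reward
    return -retry_penalty * max(failed_retries, 1)


def update_q(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) in place."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Example: 4 slots, 3 actions (decrease, keep, increase), all Q values zero.
q = [[0.0, 0.0, 0.0] for _ in range(4)]
update_q(q, s=2, a=0, r=reward(True, 0), s_next=1)
```

Scaling the penalty by the number of failed retries is one way to let heavier contention in a slot push its Q values down faster.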

In the application, before the application software firmware is updated, the method further comprises judging whether the application software firmware was downloaded successfully:

if yes, updating the application software firmware;

if not, switching from the upgrade guiding program to the application software program, reporting the update failure to the gateway and ending the update. The main controller used in the embodiment of the application is an MSP430F5438A chip.

In the application, the updating of the application software firmware is carried out, and the specific steps comprise:

S411: saving the interrupt vector table of the upgrade bootstrap program, setting the upgrade flag, disabling interrupts, and erasing the application software program code area;

S412: reading and parsing the downloaded application software firmware 512 bytes at a time, wherein the application software firmware comprises address identifiers and machine code in ASCII form;

S413: the upgrade bootstrap program extracts the address identifier and stores it in the address variable, converts the machine code from ASCII form into hexadecimal machine code, and stores it in the buffer;

S414: continuing to read the application software firmware until it has been read completely; when the end-of-file marker (EOF) is read, writing the data in the buffer to the Flash location pointed to by the address variable, setting the upgrade flag to completed, writing the data in the additional interrupt vector table buffer block by block into the interrupt vector table of the main controller, overwriting it, and re-enabling interrupts;

S415: the Internet of Things terminal restarts the main controller, switches to the application software program, and reports the new firmware version number to the gateway.
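The parsing and staged writing in steps S412 to S414 can be illustrated with a host-side simulation. The source does not name the firmware file format, so the sketch assumes a layout loosely modelled on the TI-TXT hex format: a line beginning with `@` carries a hexadecimal address, other lines carry space-separated hexadecimal bytes, and a line containing `q` marks the end. Flash is modelled as a Python dict, the whole text is parsed in one pass rather than in 512-byte reads, and `vector_base` is a hypothetical parameter marking where the interrupt vector table begins.

```python
def load_firmware(text, flash, vector_base, block_size=512):
    """Write ASCII firmware text into flash (a dict: address -> byte).

    Bytes addressed at or above vector_base are staged in a separate
    vector buffer and copied into flash last, after all other code.
    """
    address = 0
    buffer = []    # decoded bytes waiting to be written
    vectors = {}   # staged interrupt-vector bytes

    def flush():
        nonlocal address
        for offset, byte in enumerate(buffer):
            addr = address + offset
            target = vectors if addr >= vector_base else flash
            target[addr] = byte
        address += len(buffer)   # point at the next write address
        buffer.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "q":              # end marker (assumed format)
            break
        if line.startswith("@"):     # address identifier
            flush()
            address = int(line[1:], 16)
            continue
        for token in line.split():   # ASCII hex -> byte value
            buffer.append(int(token, 16))
            if len(buffer) == block_size:
                flush()
    flush()
    flash.update(vectors)            # vector table is overwritten last
    return flash
```

Staging the vector table separately keeps the bootstrap program's own vectors valid while the application area is only partly written, which appears to be the intent of the additional interrupt vector table buffer in the text.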

The overall flow diagram of the reinforcement learning-based remote upgrading method for the Internet of Things terminal is shown in Fig. 2.

The remote upgrading method for the terminal of the Internet of things mainly comprises three parts, namely a gateway, the terminal of the Internet of things and an updating server.

The gateway is used for initiating an upgrading request of remote upgrading, directly communicating with an Internet of things terminal running an application software program, and providing information such as an address of an updating server, the latest deadline of remote upgrading and the like for the Internet of things terminal;

the internet of things terminal carries out distributed autonomous upgrade decision through the reinforcement learning algorithm, and the phenomenon that multiple terminals excessively compete for limited resources of an update server is avoided. After making a decision through the reinforcement learning algorithm, the terminal waits for the updating time of the decision to come and then communicates with the updating server. The terminals do not communicate in the whole process, and the environment is only learned by the reinforcement learning algorithm so as to coordinate the relationship between the terminals and the update server; the internet of things terminal downloads files required for updating from the update server through an FTP protocol;

the update server provides FTP service for the terminal of the Internet of things, and the terminal of the Internet of things downloads the application software firmware and the application software firmware description file.

In the application, the Internet of Things terminal makes the distributed autonomous upgrade decision using the Q-learning reinforcement learning algorithm. The time is divided into a plurality of time slots according to the latest deadline of the remote upgrade provided by the gateway; the randomly initialized time slot number serves as the state of the model; decreasing, maintaining and increasing the time slot number serve as the actions of the model; and a suitable action policy is obtained through iterative training of the model. The terminal selects an action and updates the state (the time slot number) through the reinforcement learning algorithm, observes how the remote upgrade performs in the new state, and modifies, in the Q table, the expected cumulative reward obtainable by taking that action in the corresponding state.

In the initialization stage, the Q table can be assigned flexibly according to actual needs. If the time the Internet of Things terminal spends waiting for its time slot is to be reduced, the Q value of the action that decreases the time slot number can be raised in every state when the Q table is initialized.
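A minimal sketch of such a biased initialization. The action ordering (index 0 = decrease, 1 = maintain, 2 = increase), the all-zero baseline, and the `decrease_bias` parameter are assumptions chosen for illustration; the source only states that the Q value of the decrease action can be raised.

```python
# Hypothetical action encoding used throughout this sketch.
DECREASE, KEEP, INCREASE = 0, 1, 2


def init_q_table(num_slots, decrease_bias=0.0):
    """Return a num_slots x 3 table of Q values.

    All entries start at 0.0; the DECREASE column is set to
    decrease_bias so that earlier slots look more attractive at first.
    """
    q = [[0.0, 0.0, 0.0] for _ in range(num_slots)]
    for row in q:
        row[DECREASE] = decrease_bias
    return q


q = init_q_table(num_slots=8, decrease_bias=0.5)
```

A table of this size (8 rows of 3 values) is small enough to keep in Flash between upgrades, which is what allows later upgrades to continue from the learned values.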

In the iteration stage, actions are selected according to an epsilon-greedy algorithm, the actions are randomly selected according to epsilon probability, and the action with the largest expected cumulative rewards available in the current state is selected according to 1-epsilon probability.
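The selection rule can be sketched as below. The action encoding (index 0 = decrease, 1 = maintain, 2 = increase) and the clamping of the slot number to the valid range are assumptions; the source does not say how the boundaries of the slot range are handled.

```python
import random

ACTIONS = (-1, 0, +1)  # decrease, maintain, increase the slot number


def choose_action(q_row, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_row[a])


def next_slot(slot, action, num_slots):
    """Apply the action and clamp the result to [0, num_slots - 1]."""
    return min(max(slot + ACTIONS[action], 0), num_slots - 1)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when simulating many terminals.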

In the application, the Internet of Things terminals do not need to communicate with each other, there is no management relationship between the terminals and the update server, and the update server only needs to provide the FTP service. After receiving the update request from the gateway, the terminal switches from the application software program to the upgrade bootstrap program to perform the update. The terminal compares the application firmware description file obtained from the update server with the locally stored application firmware information and decides by itself whether an update is required. If the application firmware needs to be updated, it is downloaded; after the download completes, its integrity can be verified using information in the description file, such as the file size and an MD5 checksum. The firmware is written into Flash only after it passes integrity verification.
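A size-plus-MD5 check of the kind mentioned above might look like the following. The function name and the choice to compare the digest as a case-insensitive hexadecimal string are assumptions; the source only lists file size and MD5 as examples of usable information.

```python
import hashlib


def verify_firmware(data, expected_size, expected_md5):
    """Return True when both the byte length and the MD5 digest match."""
    if len(data) != expected_size:
        return False
    return hashlib.md5(data).hexdigest() == expected_md5.lower()
```

Checking the size first is cheap and rejects truncated downloads before any hashing is done.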

In the application, the outcome of the terminal's remote upgrade in the current time slot is observed and used as the input of the reward function. According to the principle of reinforcement learning, an agent obtains a suitable action policy by learning through interaction with the environment. The Internet of Things terminal observes the outcome of the remote upgrade, including whether the update succeeded, and infers the competition from the other terminals and the remaining resources of the update server from information such as the number of attempts; from these observations it obtains an action policy and a time slot number suited to itself.

According to the application, the terminal of the Internet of things works in the application software program in daily life, and is switched to the upgrade bootstrap program after receiving the upgrade request sent by the gateway. The upgrade bootstrap is responsible for distributed decision making and remote upgrade application software firmware downloading, parsing and loading. The downloaded firmware is stored in a memory area of the internet of things module. When the firmware is loaded, the upgrade bootstrap program reads 512 bytes of firmware each time, and gradually covers the application software program after analysis; after all the reading is completed, the upgrade bootstrap program is switched back to the application software program, and the version number is reported to the gateway by the application software program.

In the reinforcement learning-based remote upgrading method for the Internet of Things terminal, the terminal decides its update time autonomously and is thereby able to learn the environment.

The main controller of the terminal of the Internet of things comprises two programs, namely an upgrade guide program and an application software program. Both programs exist in Flash (memory) at the same time. In the normal working mode, the main controller runs the application software program, and is switched to the upgrade guide program for remote upgrade only after receiving the upgrade request sent by the gateway. After switching to the upgrade guide program, firstly obtaining an ideal upgrade time slot number according to the reinforcement learning algorithm, and then waiting for starting upgrade to a specific time slot; FTP communication is carried out between the internet of things module and the update server, new version application program firmware and application program firmware description files are downloaded, the integrity is checked and then the new version application program firmware and the application program firmware description files are loaded into Flash, and old application programs are covered; and updating the reinforcement learning model according to the environment feedback, switching to a new application software program, reporting a new version number to the gateway by the new application program, and ending the updating.

In the application, after the terminal runs the upgrade bootstrap program for the first time, an initial time slot number is randomly generated and stored in Flash. The number is then maintained by the upgrade bootstrap program through a reinforcement learning algorithm.

In the application, the reinforcement learning algorithm used is a reinforcement learning algorithm based on Qlearning. The algorithm uses a Q table to store a Q value for each state and action, the Q value representing the expected cumulative rewards that can be achieved by taking an action in a given state. In the iterative process of Q learning, the terminal of the Internet of things communicates with the update server, obtains the reward signal and the new state according to the update condition, and updates the Q value of the corresponding state and action in the Q table. The update formula of the Q table is:

Q(s,a) = Q(s,a) + α(r + γ*max(Q(s′,a′)) - Q(s,a)).

In the present application, the time slot number is selected as the state, and there are three actions: decreasing, maintaining, and increasing the time slot number. At each update, the terminal selects an action using an ε-greedy policy, that is, it selects a random action with probability ε and, with probability 1-ε, queries the Q table and selects the action with the highest Q value. After making the decision, the terminal immediately modifies the time slot number according to the selected action, waits to observe the outcome of the update, and then updates the Q table according to the formula using that outcome as feedback. As the Internet of Things terminal keeps communicating with the update server and the Q table is updated iteratively, the terminal gradually learns the optimal action policy and settles on a stable time slot number.

In the application, the internet of things terminal and the update server communicate in an FTP mode, and the downloaded data are stored in a memory of the internet of things module. Two files are stored on the update server: application firmware and application firmware description file. The application firmware description file contains basic information such as firmware size, version number and the like, and can be used for verifying the integrity of the downloaded application firmware. Firstly, downloading an application software firmware description file by a terminal, and judging whether the version number of the local application program firmware is the latest version or not; and after the update is confirmed, the application software firmware is downloaded, and after the downloading is finished, whether the information in the application software firmware is consistent with the information in the application software firmware description file is verified by comparing. After the integrity verification is passed, the terminal reads the memory in the Internet of things module and writes the application software firmware code into Flash, updates the local application program firmware version number after the completion, and finally updates the interrupt vector table and restarts to realize the switching from the upgrading bootstrap program to the application software program.
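For illustration only, the download step can be sketched on a host machine with Python's standard `ftplib`; on the actual terminal the Internet of Things module performs the FTP transfer. The helper name `fetch_file` and the file names in the usage comment are hypothetical.

```python
import io
from ftplib import FTP


def fetch_file(ftp, remote_name):
    """Download one file over an open FTP connection and return its bytes."""
    buffer = io.BytesIO()
    ftp.retrbinary(f"RETR {remote_name}", buffer.write)
    return buffer.getvalue()


# Usage sketch (host, credentials and file names are placeholders):
#
#     with FTP("update.example.com") as ftp:
#         ftp.login("user", "password")
#         description = fetch_file(ftp, "firmware.desc")
#         # ...compare versions, then only if an update is needed:
#         firmware = fetch_file(ftp, "firmware.txt")
```

Fetching the small description file first mirrors the order in the text, so the larger firmware image is transferred only when the version check calls for it.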

The application adopts the Q-learning reinforcement learning method and uses the outcome of each update as feedback to continuously adjust the time slot number. This avoids excessive competition among multiple terminals for limited update server resources, realizes intelligent distributed autonomous decision making, and allows remote upgrades to be carried out stably and reliably. The reinforcement learning-based remote upgrading method for an Internet of things terminal provided by the application addresses the technical problem of providing a remote upgrading mechanism that is easy to implement, efficient, and reliable. The technical scheme comprises distributed autonomous decision making based on Q-learning reinforcement learning and a remote update mechanism between the terminal and the update server.

The algorithm flow of the Q-learning autonomous update decision comprises the following steps:

(1) Randomly initializing a time slot number as an initial state, and initializing a Q table;

(2) According to the epsilon-greedy policy, selecting a random action with probability epsilon, and with probability 1 - epsilon querying the Q table to select the action with the highest Q value; then updating the state, namely the time slot number, according to the selected action;

(3) After waiting for the selected time slot to arrive, carrying out the remote update, taking the observed result of the remote update as the input of the reward function, and updating the Q table by the following formula:

Q(s,a) = Q(s,a) + A(r + γ·max(Q(s′,a′)) − Q(s,a)).

(4) If the updating is successful, ending the decision algorithm and switching to the application software program; otherwise, jumping to the step (2).
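Steps (1) to (4) can be sketched as a single loop. This is only an illustration under stated assumptions: `try_update` is a hypothetical callback that stands in for the remote update and returns True on success, the reward values +1 and -1 stand in for the reward function, `lr` plays the role of the coefficient A in the formula, and the clamping of the time slot number and the `max_rounds` cap are additions made so that the example is safe to run.

```python
import random

ACTIONS = (-1, 0, +1)  # decrease, keep, increase the time slot number


def q_update(q, s, a, r, s_next, lr, gamma):
    """Q(s,a) = Q(s,a) + A * (r + gamma * max Q(s',a') - Q(s,a)), with A = lr."""
    q[s][a] += lr * (r + gamma * max(q[s_next]) - q[s][a])


def decide_and_upgrade(try_update, num_slots, lr=0.5, gamma=0.9,
                       epsilon=0.1, max_rounds=1000, rng=None):
    """Run steps (1)-(4); return (final slot or None, Q table)."""
    rng = rng or random.Random()
    # Step (1): random initial time slot number and zero-initialized Q table.
    q = [[0.0] * len(ACTIONS) for _ in range(num_slots)]
    s = rng.randrange(num_slots)
    for _ in range(max_rounds):
        # Step (2): epsilon-greedy action, then move to the new slot number.
        if rng.random() < epsilon:
            a = rng.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
        s_next = min(max(s + ACTIONS[a], 0), num_slots - 1)
        # Step (3): attempt the update in that slot and learn from the result.
        ok = try_update(s_next)
        q_update(q, s, a, 1.0 if ok else -1.0, s_next, lr, gamma)
        s = s_next
        # Step (4): stop on success, otherwise go back to step (2).
        if ok:
            return s, q
    return None, q
```

For example, `decide_and_upgrade(lambda slot: slot == 3, num_slots=5)` simulates a server that only accepts this terminal in time slot 3. On a real terminal the Q table would be kept across upgrade rounds rather than rebuilt on every call.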

In a possible implementation manner, the embodiment of the present application further provides a non-transitory computer readable storage medium located in a PLC (Programmable Logic Controller). A computer program is stored on the readable storage medium, and the program, when executed by a processor, implements the steps of the reinforcement learning-based remote upgrading method for an Internet of things terminal described above.

The storage media may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The embodiment of the application provides a reinforcement learning-based remote upgrading device for an Internet of things terminal, which comprises a sending module, a dividing module, a decision module and an updating module.

The sending module is used for sending an upgrade request to the Internet of things terminal through the gateway so as to notify the terminal to perform a remote upgrade, wherein the upgrade request comprises the address of the update server and the latest deadline of the remote upgrade. The dividing module is used for dividing the Flash area of the Internet of things terminal into an application software program area and an upgrade bootstrap program area. The decision module is used for driving the Internet of things terminal, after it receives the upgrade request, to switch from the application software program to the upgrade bootstrap program by switching the interrupt vector table, and to make a distributed autonomous upgrade decision based on the reinforcement learning algorithm. The updating module is used for driving the communication between the Internet of things terminal and the update server based on the FTP protocol so as to download the files required for updating, update the Q table, and update the application software firmware.
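The upgrade request handled by the sending module carries two fields, which can be sketched as a small data structure. The field names, the use of a Unix timestamp for the deadline, the fixed slot length `slot_seconds`, and the helper `usable_slots` are assumptions made for illustration; in particular, the sketch assumes that the deadline bounds how many time slots the decision module can choose from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeRequest:
    server_address: str  # address of the update server
    deadline: float      # latest deadline, here a Unix timestamp in seconds


def usable_slots(request, now, slot_seconds):
    """Number of whole time slots that still fit before the deadline."""
    remaining = request.deadline - now
    return max(0, int(remaining // slot_seconds))
```

A terminal that receives the request at time `now` could use the returned count as the size of its state space, and could skip the upgrade when the count is zero.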

The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A reinforcement learning-based remote upgrading method for an Internet of things terminal, characterized by comprising the following steps:

2. The method for remotely upgrading the internet of things terminal based on reinforcement learning as set forth in claim 1, wherein: the Internet of things terminal runs the application software program during normal operation and, after receiving the upgrade request sent by the gateway, switches from the application software program to the upgrade bootstrap program by switching the interrupt vector table.

3. The method for remotely upgrading an internet of things terminal based on reinforcement learning according to claim 1, wherein the method for making a distributed autonomous upgrade decision based on reinforcement learning algorithm comprises the following specific steps:

4. The method for remotely upgrading an internet of things terminal based on reinforcement learning as set forth in claim 3, wherein:

5. The method for remotely upgrading an internet of things terminal based on reinforcement learning as set forth in claim 3, wherein: the actions include decreasing the slot number, maintaining the slot number, and increasing the slot number.

6. The method for remotely upgrading an internet of things terminal based on reinforcement learning as set forth in claim 3, wherein the FTP protocol-based communication between the internet of things terminal and the update server is performed to download the file required for updating, and the specific steps include:

if not, switching to the application software program directly.

7. The method for remotely upgrading an internet of things terminal based on reinforcement learning as set forth in claim 6, wherein the updating of the Q table is performed by adopting an updating formula:

Q(s,a) = Q(s,a) + A(r + γ·max(Q(s′,a′)) − Q(s,a))

8. The reinforcement learning-based remote upgrading method for an internet of things terminal of claim 7, further comprising, before the updating of the application software firmware:

if yes, updating the application software firmware;

9. The method for remotely upgrading an internet of things terminal based on reinforcement learning as set forth in claim 8, wherein the updating of the application software firmware comprises the specific steps of:

10. A reinforcement learning-based remote upgrading device for an Internet of things terminal, characterized by comprising: