WO2023080891A1 - System and method of parallel data processing of a dual microcontroller unit

System and method of parallel data processing of a dual microcontroller unit

Info

Publication number
WO2023080891A1
Authority
WO
WIPO (PCT)
Prior art keywords
mcu
program
data
data set
training
Prior art date
Application number
PCT/US2021/057997
Other languages
English (en)
Inventor
Da Qi REN
Original Assignee
Zeku, Inc.
Priority date
Filing date
Publication date
Application filed by Zeku, Inc. filed Critical Zeku, Inc.
Priority to PCT/US2021/057997
Publication of WO2023080891A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package

Definitions

  • Embodiments of the present disclosure relate to a system and method of data processing.
  • a system-on-chip may include a static random access memory (SRAM) configured to obtain a data set.
  • the SoC may include a first NOR flash configured to maintain a first program.
  • the SoC may include a first microcontroller unit (MCU) coupled to the first NOR flash via a first address link and a first data link.
  • the first MCU may be configured to locate the first program via the first address link.
  • the first MCU may be configured to read the first program via the first data link.
  • the first MCU may be configured to obtain a first portion of the data set from the SRAM.
  • the first MCU may be configured to perform first processing of the first portion of the data set based on the first program.
  • the SoC may include a second NOR flash configured to maintain a second program.
  • the SoC may include a second MCU coupled to the second NOR flash via a second address link and a second data link.
  • the second MCU may be configured to locate the second program via the second address link.
  • the second MCU may be configured to read the second program via the second data link.
  • the second MCU may be configured to obtain a second portion of the data set from the SRAM.
  • the second MCU may be configured to perform second processing of the second portion of the data set based on the second program.
  • an SoC may include a first MCU.
  • the first MCU may be configured to obtain a program from a first NOR flash.
  • the first MCU may be configured to obtain a first portion of a data set from an SRAM.
  • the first MCU may be configured to perform first processing of the first portion of the data set based on the program.
  • the SoC may include a second MCU.
  • the second MCU may be configured to obtain the program from a second NOR flash.
  • the second MCU may be configured to obtain a second portion of the data set from the SRAM.
  • the second MCU may be configured to perform second processing of the second portion of the data set based on the program.
  • the first NOR flash and the second NOR flash may be different.
  • the first processing and the second processing may be performed in parallel.
  • a method of parallel MCU processing may include obtaining, by a first MCU, a program from a first NOR flash.
  • the method may include obtaining, by the first MCU, a first portion of a data set from an SRAM.
  • the method may include performing, by the first MCU, first processing of the first portion of the data set based on the program.
  • the method may include obtaining, by a second MCU, the program from a second NOR flash.
  • the method may include obtaining, by the second MCU, a second portion of the data set from the SRAM.
  • the method may include performing, by the second MCU, second processing of the second portion of the data set based on the program.
  • the first NOR flash and the second NOR flash may be different.
  • the first processing and the second processing may be performed in parallel.
  • FIG. 1 illustrates a block diagram of a system-on-chip (SoC), according to certain embodiments of the present disclosure.
  • SoC system-on-chip
  • FIG. 2 illustrates a block diagram of the SoC of FIG. 1 in which a pair of microcontroller units (MCUs) perform parallel recognition processing of different portions of a sensor data set, according to certain embodiments of the present disclosure.
  • MCUs microcontroller units
  • FIG. 3 illustrates a block diagram of the SoC of FIG. 1 in which a pair of MCUs perform parallel recognition processing of different portions of a training data set, according to certain embodiments of the present disclosure.
  • FIG. 4 illustrates a block diagram of the SoC of FIG. 1 in which one MCU performs recognition processing of a sensor data set and another MCU performs recognition processing of a training data set in parallel, according to certain embodiments of the present disclosure.
  • FIG. 5 illustrates pseudocode that may be used to implement a token-based process flow, according to certain embodiments of the present disclosure.
  • FIG. 6 illustrates the token-based process flow that may be implemented using the pseudocode of FIG. 5, according to certain embodiments of the present disclosure.
  • FIG. 7 illustrates a tensor flow deployment technique, according to certain embodiments of the present disclosure.
  • FIG. 8 illustrates a block diagram of a tensor flow deployment technique that may be used to update an inference model, according to certain embodiments of the present disclosure.
  • FIG. 9A illustrates a mapping technique that may be implemented by a pair of MCUs of the SoC of FIG. 1 for long short-term memory (LSTM) for speech recognition, according to certain aspects of the present disclosure.
  • LSTM long short-term memory
  • FIG. 9B illustrates a block diagram of the pair of MCUs configured to implement the mapping technique of FIG. 9A, according to certain embodiments of the present disclosure.
  • FIG. 10 illustrates a method for implementing a dual-MCU parallel processing technique, according to certain embodiments of the present disclosure.
  • FIG. 11 illustrates a wireless network, according to certain embodiments of the present disclosure.
  • FIG. 12 illustrates a block diagram of a node, according to certain embodiments of the present disclosure.
  • references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc. indicate that one or more embodiments described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • terminology may be understood at least in part from usage in context.
  • the term “one or more” as used herein, depending at least in part upon context may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
  • terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
  • the terms “based on,” “based upon,” and terms with similar meaning may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
  • various neural network frameworks, e.g., Caffe2, TensorFlow Lite, advanced reduced instruction set computer (RISC) machines (ARM) neural network (NN), etc.
  • RISC reduced instruction set computer
  • ARM advanced reduced instruction set computer (RISC) machines
  • NN neural network
  • the optimized code running on the MCU can perform AI functions related to voice, vision (image), and anomaly detection.
  • These models may be downloaded to the MCU, which may run inferences that optimize the neural network.
  • These AI toolsets also provide code examples of AI applications based on neural networks.
  • Co-processors such as the ARM Cortex-M33 make use of popular application programming interfaces (APIs), such as the Cortex microcontroller software interface standard digital signal processing (CMSIS-DSP) API, to simplify code portability, which tightly couples the MCU and the co-processor to speed up AI functions (e.g., co-processing and matrix operations).
  • APIs application programming interfaces
  • CMSIS-DSP cortex microcontroller software interface standard-digital signal processing
  • Although the machine learning accelerator described above may achieve a high energy efficiency ratio through a dedicated design, it also limits the scope of the application. Moreover, this machine learning accelerator can only be used to accelerate part of the machine learning algorithm, without consideration of versatility. Thus, one drawback of using such machine learning accelerators relates to the difficulty in designing a single machine learning accelerator that can cover multiple applications. Another drawback relates to the optimized code running on the MCU to perform inference to optimize neural networks. The AI execution model conversion tool can run optimized neural network inferences on low-cost, low-power MCUs. However, using this approach, the MCU is unable to participate in the training of the machine learning model and remains idle for an undesirable amount of time, thereby wasting resources. Yet another drawback relates to the lack of collecting local samples, so that personalized analysis cannot be achieved. Still further, the computing power of such devices is comparatively small.
  • the present disclosure provides a dual-MCU parallel processing scheme.
  • the present dual-MCU parallel processing scheme flexibly uses a pair of MCUs, so that each MCU skips the HOLD function and directly communicates data through an off-MCU SRAM. By skipping the HOLD function, the latency associated with communicating data between MCUs may be reduced.
  • both MCUs may apply an inference program to a sensor data set that is divided between the two MCUs.
  • both MCUs may apply a training program to a training data set that is divided between the two MCUs.
  • one MCU may apply an inference program to a sensor data set while the other MCU applies a training program to a training data set.
  • the MCU performing inference may update the training data set in real-time with inference information derived from the sensor data set.
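  • By way of a minimal, hedged illustration (not part of the original disclosure), the division of one shared data set between two MCUs in the scheme described above might be expressed in C as follows, where the names and the half-and-half split are assumptions:

        /* Illustrative sketch: splitting one SRAM-resident data set so that
         * MCU0 and MCU1 can each process their own portion in parallel.     */
        #include <stddef.h>
        #include <stdint.h>

        typedef struct {
            const uint8_t *base;   /* start of this MCU's portion in SRAM    */
            size_t         len;    /* number of bytes assigned to this MCU   */
        } portion_t;

        /* MCU0 takes the first half of the data set, MCU1 takes the rest.   */
        static void split_data_set(const uint8_t *sram_data, size_t total,
                                   portion_t *mcu0, portion_t *mcu1)
        {
            size_t half = total / 2u;
            mcu0->base = sram_data;         mcu0->len = half;
            mcu1->base = sram_data + half;  mcu1->len = total - half;
        }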
  • data set may include raw data, processed data, a null data set, a single piece of data, a plurality of data, truth data, image data, digital signal processor (DSP) data, image sensor data, audio data, video data, etc.
  • DSP digital signal processor
  • FIG. 1 illustrates a block diagram of a system 100 having an SoC 102, according to some embodiments of the present disclosure.
  • SoC 102 may include a plurality of functional units. These functional units may include one or more of, e.g., a RAM 106, a first MCU 108a (referred to hereinafter as “MCU0 108a”), a second MCU 108b (referred to hereinafter as “MCU1 108b”), an SRAM 110, a first NOR flash 112a coupled to MCU0 108a, a second NOR flash 112b coupled to MCU1 108b, one or more sensors 114, and/or a central processing unit (CPU) 116.
  • CPU central processing unit
  • System 100 may be applied or integrated into various systems and apparatuses capable of high-speed data processing, such as computers and wireless communication devices.
  • system 100 may be part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic device having high-speed data processing capability.
  • VR virtual reality
  • AR augmented reality
  • SoC 102 may serve as an application processor (AP) and/or a baseband processor (BP) that imports data and instructions from RAM 106, executes the instructions to perform various mathematical and logical calculations on the data, and exports the calculation results for further processing and transmission over cellular networks.
  • AP application processor
  • BP baseband processor
  • SRAM 110 may receive and store data of different types from various sources via communication channels (e.g., bus 104).
  • SRAM 110 may receive and store digital imaging data captured by a camera (e.g., sensor(s) 114) of the wireless communication device, voice data transmitted via cellular networks, such as a phone call from another user, or text data input by the user of the system through an interactive input device, such as a touch panel, a keyboard, or the like.
  • SRAM 110 may receive and store training data that may be used to train an inference program stored in first NOR flash 112a and second NOR flash 112b.
  • Each of first NOR flash 112a and second NOR flash 112b may receive and store an inference program and a training program that may be accessed by MCU0 108a and MCU1 108b, respectively.
  • RAM 106 may receive and store computer instructions to be loaded to MCU0 108a and/or MCU1 108b for data processing, such as instructions associated with regular mode threads and/or interrupt service threads. Such instructions may be in the form of an instruction set, which contains discrete instructions that teach the microprocessor or other functional components of the microcontroller chip to perform one or more of the following types of operations — data handling and memory operations, arithmetic and logic operations, control flow operations, co-processor operations, etc.
  • RAM 106 may be provided as a standalone component in or attached to the apparatus, such as a hard drive, a Flash drive, a solid-state drive (SSD), or the like. Other types of memory compatible with the present disclosure may also be conceived. It is understood that RAM 106 may not be the only component capable of storing data and instructions. SRAM 110 may also store data and instructions and, unlike RAM 106, may have direct access to MCU0 108a and MCU1 108b.
  • Bus 104 functions as a highway that allows data to move between various nodes, e.g., memory, microprocessor, transceiver, user interface, or other sub-components in system 100, according to some embodiments.
  • Bus 104 can be serial or parallel.
  • Bus 104 can also be implemented by hardware (such as electrical wires, optical fiber, etc.). It is understood that bus 104 can have sufficient bandwidth for storing and loading a large amount of data (e.g., vectors) between RAM 106 and other functional units without delay to the data processing by MCU0 108a and MCU1 108b.
  • SoC designs may integrate one or more components for computation and processing on an integrated-circuit (IC) substrate.
  • An SoC design is an ideal design choice because of its compact area and small power consumption.
  • MCU0 108a, MCU1 108b, and SRAM 110 are integrated into SoC 102. It is understood that in some examples, MCU0 108a, MCU1 108b, and SRAM 110 may not be integrated on the same chip, but instead on separate chips.
  • two MCUs are illustrated in the dual- MCU architecture of system 100, it is understood that more than two MCUs may be included in SoC 102 without departing from the scope of the present disclosure.
  • MCU0 108a and MCU1 108b may include any suitable specialized processor which can handle a specific operation in an embedded system.
  • MCU0 108a and MCU1 108b may be configured as, e.g., a CPU, a graphic processing unit (GPU), a digital signal processor (DSP), a tensor processing unit (TPU), a vision processing unit (VPU), a neural processing unit (NPU), a synergistic processing unit (SPU), a physics processing unit (PPU), or an image signal processor (ISP).
  • each MCU handles a specific operation associated with inference operations (e.g., image recognition, speech recognition, etc.) and/or inference model training operations.
  • inference operations e.g., image recognition, speech recognition, etc.
  • SRAM 110 may be configured to receive and store sensor data.
  • the sensor data includes image data captured by a camera of sensor(s) 114.
  • the sensor data may include any other type of data captured by any other type of sensor.
  • This data may include voice data captured by a microphone or video camera, video data captured by a video camera, etc.
  • MCU0 108a and MCU1 108b may apply an inference model (stored in first NOR flash 112a and second NOR flash 112b) to the sensor data to perform inference operations.
  • inference operations include, e.g., image recognition, speech recognition, anomaly recognition, etc.
  • SRAM 110 may be configured to receive and store training data.
  • MCU0 108a and MCU1 108b may apply a training model (stored in first NOR flash 112a and second NOR flash 112b) to the training data to generate an initial or updated inference model.
  • SRAM 110 may receive the training data from RAM 106, from the cloud, from another device coupled to SoC 102, or from an external device, for example.
  • MCU0 108a and MCU1 108b may be configured to perform various parallel inference and/or training operations, as described below.
  • MCU0 108a and MCU1 108b may be configured to perform inference operations, concurrently.
  • MCU0 108a may locate an inference model in first NOR flash 112a via address line 140a and read the inference model via data line 150a.
  • MCU1 108b may locate the inference model in second NOR flash 112b via address line 140b and read the inference model via data line 150b.
  • MCU0 108a may locate a first portion of the data set via address line 120a and read the first portion of the data set via data line 130a.
  • MCU1 108b may locate a second portion of the data set via address line 120b and read the second portion of the data set via data line 130b. Then, each of MCU0 108a and MCU1 108b may apply the inference model to their respective data portion to perform, e.g., image or voice recognition of the data. Additional details of the dual-MCU parallel inference operation are provided below in connection with FIG. 2.
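  • The locate-read-apply sequence described above might look like the following hedged C sketch; NOR_FLASH_BASE, MODEL_OFFSET, and apply_model() are illustrative assumptions, not details taken from the disclosure:

        /* Sketch of one MCU locating its program through the memory-mapped
         * NOR flash address link, reading it over the data link, and applying
         * it to this MCU's portion of the data set in SRAM.                  */
        #include <stdint.h>

        #define NOR_FLASH_BASE  0x08000000u   /* assumed flash window         */
        #define MODEL_OFFSET    0x0000u       /* assumed model location       */
        #define MODEL_SIZE      4096u

        /* Hypothetical interpreter that runs the stored model on the data.   */
        int apply_model(const uint8_t *model, uint32_t model_len,
                        const uint8_t *data, uint32_t data_len);

        int mcu_infer_portion(const uint8_t *portion, uint32_t len)
        {
            /* Locate the program via the address link (memory-mapped read).  */
            const uint8_t *model = (const uint8_t *)(NOR_FLASH_BASE + MODEL_OFFSET);
            /* Read it via the data link and apply it to this MCU's portion.  */
            return apply_model(model, MODEL_SIZE, portion, len);
        }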
  • MCU0 108a and MCU1 108b may be configured to both perform training operations, concurrently.
  • MCU0 108a may locate a training model in first NOR flash 112a via address line 140a and read the training model via data line 150a.
  • MCU1 108b may locate the training model in second NOR flash 112b and read the training model via data line 150b.
  • MCU0 108a may locate a first portion of the data set via address line 120a and read the first portion of the data set via data line 130a.
  • MCU1 108b may locate a second portion of the data set via address line 120b and read the second portion of the data set via data line 130b. Then, each of MCU0 108a and MCU1 108b may apply the training model to their respective data portion to generate and/or update an inference model. Additional details of the dual-MCU parallel training operation are provided below in connection with FIG. 3.
  • MCU0 108a may be configured to perform inference operations
  • MCU1 108b may be configured to perform training operations, concurrently.
  • the inference data identified by MCU0 108a may be used to update the training data used by MCU1 108b for the training operations.
  • MCU0 108a may locate an inference model in first NOR flash 112a via address line 140a and read the inference model via data line 150a
  • MCU1 108b may locate the training model in second NOR flash 112b and read the training model via data line 150b
  • MCU0 108a may locate sensor data in SRAM 110 via address line 120a and read the sensor data via data line 130a.
  • MCU1 108b may locate a training data set via address line 120b and read the training data set via data line 130b. As MCU0 108a identifies image or speech data from the sensor data, the sensor data and correlated image or speech data may be added to the training data set used by MCU1 108b. This can happen in real-time.
  • a label such as “correct” or “incorrect” may be identified by MCU0 108a and accompany the sensor data and image/speech data based on user input. The label may be generated based on truth data associated with the image. That way, the training model may update the inference model in real-time based on the accuracy of the information identified by the inference operations being concurrently performed by MCU0 108a. Additional details of the dual-MCU parallel inference/training operation are provided below in connection with FIG. 4.
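  • A hedged sketch of how MCU0 108a might append a labeled inference result to the training data region in SRAM 110 is shown below; the structure layout and field names are assumptions for illustration only:

        /* Sketch: MCU0 appends each inference result, together with a
         * correctness label, to the training data region consumed by MCU1.  */
        #include <stdint.h>

        #define MAX_SAMPLES 64u

        typedef struct {
            uint32_t sensor_offset;   /* where the raw sensor data lives in SRAM */
            uint32_t sensor_len;
            int32_t  inferred_class;  /* output of the inference model           */
            uint8_t  label_correct;   /* 1 = "correct", 0 = "incorrect"          */
        } training_sample_t;

        typedef struct {
            volatile uint32_t  count;              /* written by MCU0, read by MCU1 */
            training_sample_t  samples[MAX_SAMPLES];
        } training_region_t;

        /* MCU0 side: add one labeled result so MCU1 can retrain in real time. */
        static int append_sample(training_region_t *region, training_sample_t s)
        {
            if (region->count >= MAX_SAMPLES)
                return -1;                      /* region full                 */
            region->samples[region->count] = s;
            region->count = region->count + 1;  /* publish after payload is in */
            return 0;
        }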
  • system 100 may use a message passing interface (MPI), which supports dual-MCU parallel operations for distributed applications.
  • MPI message passing interface
  • the MPI of system 100 may include lightweight MPI (LMPI), which is a distributed standard for embedded systems in which each process node (e.g., an MCU) is not powerful enough to run the complete MPI standard.
  • LMPI lightweight MPI
  • MCU0 108a and MCU1 108b may communicate with CPU 116.
  • CPU 116 may be powerful enough to perform MPI calculations, run the complete MPI standard, and connect to the fast network, e.g., using the pseudocode 500 depicted in FIG. 5.
  • the distributed system of system 100 may include two types of nodes: 1) a CPU node (CPU 116) and 2) an MCU client/process node (MCU0 108a and/or MCU1 108b).
  • the CPU node starts the MPI
  • the MCU node runs the LMPI process.
  • the program run by MCU0 108a and MCU1 108b may be stored in first NOR flash 112a and second NOR flash 112b, respectively.
  • CPU 116 may send control messages via bus 104 to MCU0 108a and/or MCU1 108b, which may then read the program (e.g., inference model, training model, etc.) from its respective NOR flash.
  • CPU 116 may select a processing mode for each of the MCU0 108a and the MCU1 108b.
  • CPU 116 may communicate with MCU0 108a via bus 104 and configure the MCU0 108a to perform the first processing based on a first processing mode (e.g., training mode or inference mode) associated with the first program (e.g., training model or inference model).
  • a first processing mode e.g., training mode or inference mode
  • CPU 116 may communicate with MCU1 108b and configure the MCU1 108b to perform the second processing based on a second processing mode (e.g., training mode or inference mode) associated with the second program (e.g., training model or inference model).
  • a second processing mode e.g., training mode or inference mode
  • CPU 116 may send first instructions associated with the first program to first NOR flash 112a while the MCU0 108a performs the first processing, and send second instructions associated with the second program to the second NOR flash 112b while the MCU1 108b performs the second processing. In this way, the operations may proceed in synchronized fashion between the MCUs and NOR flashes.
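  • For illustration only, the CPU-side control described above might resemble the following C sketch; the message layout and the bus_send() primitive are assumptions (the actual pseudocode appears in FIG. 5 and is not reproduced here):

        /* Sketch of CPU 116 selecting a processing mode for each MCU and
         * sending control messages over bus 104.                             */
        #include <stdint.h>

        typedef enum { MODE_INFERENCE = 0, MODE_TRAINING = 1 } mcu_mode_t;

        typedef struct {
            uint8_t    dest_mcu;     /* 0 for MCU0 108a, 1 for MCU1 108b      */
            mcu_mode_t mode;         /* processing mode to configure          */
            uint32_t   data_offset;  /* portion of the data set in SRAM       */
            uint32_t   data_len;
        } ctrl_msg_t;

        int bus_send(const ctrl_msg_t *msg);  /* assumed bus primitive        */

        void cpu_configure(mcu_mode_t mode0, mcu_mode_t mode1, uint32_t total)
        {
            ctrl_msg_t m0 = { 0, mode0, 0u,         total / 2u };
            ctrl_msg_t m1 = { 1, mode1, total / 2u, total - total / 2u };
            bus_send(&m0);   /* MCU0 then reads its program from NOR flash 112a */
            bus_send(&m1);   /* MCU1 then reads its program from NOR flash 112b */
        }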
  • the first program and the second program may be different programs.
  • the first program and the second program may be parts of the same program stored by both first NOR flash 112a and second NOR flash 112b.
  • MCU0 108a and/or MCU1 108b may send information associated with the training and/or inference models to CPU 116.
  • CPU 116 may be configured to modify the training and/or inference models and send the modified models to first NOR flash 112a and/or second NOR flash 112b. In some embodiments, CPU 116 may be configured to modify the programs or models located at one or more of first NOR flash 112a and/or second NOR flash 112b.
  • Still further, to implement the various dual-MCU operations described above, MCU0 108a and MCU1 108b may communicate with one another via SRAM 110 using a token-based mechanism, an example of which is depicted in FIG. 6.
  • MCU0 108a may draw part of the address space of SRAM 110 as a dedicated channel for data transmission (e.g., high-end address space is available) with MCU1 108b; at the same time, this address space may be mapped to the same size SRAM address space on the side of MCU1 108b that receives the data, and which MCU1 108b can directly access.
  • the address lines 120a, 120b may be arranged according to implementation.
  • MCU0 108a or MCU1 108b reads and writes data to this address via data lines 130a or 130b.
  • MCU0 108a and/or MCU1 108b contact each other to prepare for data transmission.
  • the data lines 130a, 130b on both sides may be separated by a switch gate circuit, and MCU0 108a may be directly mapped to the SRAM address used by MCU1 108b through the switch gate circuit, and vice versa.
  • MCU0 108a prepares to read or write data on the MCU1-side of SRAM 110
  • a read and write strobe signal of MCU0 108a is connected to the read and write terminal of MCU1 108b through the switch.
  • these switch gate circuits may be opened. In this way, MCU0 108a can read and write data in/to the MCU1 address space of SRAM 110, or vice versa.
  • MCU0 108a hands over bus control during operations by communicating with MCU1 108b via bus 104.
  • MCU1 108b may hand over the bus control to MCUO 108a.
  • Each MCU has an idle state working mode that can be awakened by an interrupt.
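  • A minimal sketch of such a token-based exchange over the dedicated SRAM window is given below, assuming an illustrative address and a simple ownership flag; the actual process flow is shown in FIG. 6:

        /* Sketch of a token-passing channel in shared SRAM: the owner of the
         * token writes the payload and then hands the token to the other MCU. */
        #include <stdint.h>

        typedef struct {
            volatile uint32_t token;      /* 0 = owned by MCU0, 1 = owned by MCU1 */
            volatile uint32_t length;     /* number of valid payload bytes        */
            volatile uint8_t  payload[256];
        } shared_channel_t;

        /* Assumed location of the dedicated high-end SRAM window.              */
        #define SHARED_CHANNEL ((volatile shared_channel_t *)0x2003F000u)

        static void channel_send(volatile shared_channel_t *ch, uint32_t self_id,
                                 const uint8_t *data, uint32_t len)
        {
            while (ch->token != self_id) {
                /* idle; in hardware the MCU could sleep until awakened by an
                 * interrupt, per the idle working mode described above.        */
            }
            for (uint32_t i = 0; i < len && i < sizeof ch->payload; i++)
                ch->payload[i] = data[i];
            ch->length = len;
            ch->token  = self_id ^ 1u;    /* hand the channel to the other MCU   */
        }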
  • system 100 may include two or more CPUs that communicate using the following techniques. For example, rather than communicating via I/O ports to transmit data, each CPU may use part of the address space of RAM 106 and/or SRAM 110 (to which both CPUs are connected via address lines and data lines). This may increase the transmission speed and use fewer system resources than communicating via I/O ports and/or bus 104. Dual-CPU communication may be achieved using shared memory at CPU 116.
  • First NOR flash 112a and second NOR flash 112b may be used to store the startup code and application code of the embedded system for AI inference and training.
  • NOR flash, as opposed to other types of memory, may provide larger memory storage capacity for the larger code link library generated by AI applications.
  • Today's embedded systems with abundant application software may be updated online from time to time in order to apply security fixes and add new functions. This irregular update function may be easily written to NOR flash.
  • the present dual-MCU system achieves AI operations that may be applied flexibly to various applications. Moreover, parallel inference, training, and/or inference/training operations may be performed by the dual-MCU system, which reduces the processing time for performing inference and/or training operations at SoC 102. Still further, the dual-MCU system can update a training data set in real-time so that the inference model may be adaptable based on the most recent information. Additional details associated with the dual-MCU system of SoC 102 are provided below in connection with FIGs. 2-12.
  • FIG. 2 illustrates a block diagram 200 of SoC 102 of FIG. 1 in which MCUO 108a and MCU1 108b perform parallel recognition processing of different portions of sensor data 210, according to certain embodiments of the present disclosure.
  • MCUO 108a and MCU1 108b may be configured to both perform inference operations, concurrently.
  • MCU0 108a may locate an inference model 212 in first NOR flash 112a via address line 140a and read the inference model 212 via data line 150a.
  • MCU1 108b may locate the inference model 212 in second NOR flash 112b and read the inference model 212 via data line 150b.
  • MCU0 108a may locate a first portion of the sensor data 210-1 via address line 120a and read the first portion of the data set via data line 130a.
  • MCU1 108b may locate a second portion of the sensor data 210-2 via address line 120b and read the second portion of the sensor data 210-2 via data line 130b.
  • each of MCU0 108a and MCU1 108b may apply the inference model 212 to their respective data portion to perform, e.g., image or voice recognition of sensor data 210.
  • a user may take a picture of a stop sign in France using a camera (sensor 114).
  • the sensor data 210 may include an image of the stop sign including writing in the French language.
  • the user may wish to have the sign's wording (e.g., “Arret”) translated.
  • inference model 212 may be able to identify the French word (e.g., “Arret”) and translate it to English (e.g., “Stop”).
  • Sensor data 210 may be divided into different letters and/or words that are then processed by MCU0 108a and MCU1 108b to identify words and/or context using inference model 212.
  • Other implementations of the dual-MCU inference operations are contemplated.
  • inference model 212 may be an object detection model that is applied to each frame of a video.
  • each of MCU0 108a and MCU1 108b may apply the object detection model to different frames of the video.
  • parallelizing ML inferences across all available MCUs (e.g., MCU0 108a, MCU1 108b, and/or additional MCUs not depicted in FIG. 2) on system 100 offers the potential to reduce the overall inference time.
  • the parallel inference operations of the present disclosure are performed using data parallelism.
  • This approach differs from model parallelism, which would entail splitting inference model 212 into different parts and loading respective model parts into MCU0 108a or MCU1 108b.
  • Loading the neural networks of inference model 212 often takes a significant amount of time. For example, loading times of a few hundred milliseconds to multiple seconds are typical.
  • the present disclosure proposes performing initialization once per process/thread and reusing the process/thread at the same MCU to perform inference.
  • Data transformation and the forward pass through the neural network are the two most computationally intensive steps.
  • certain applications may utilize one hundred percent of one core of one MCU while underutilizing the other MCU.
  • a process/thread may be assigned to MCU0 108a, while other process/threads may be assigned to MCU1 108b.
  • MCU0 108a may apply a first thread of the training program to the training data, while MCU1 108b applies a second thread of the training program to the training data.
  • MCU0 108a may be configured to access the first thread from first NOR flash 112a, while MCU1 108b may be configured to access the second thread from second NOR flash 112b.
  • the code stored in the NOR flash may specify the MCU identification (ID) onto which inference model 212 should be loaded. For example, if two MCUs and two processes run inference/learning in parallel, the code should explicitly assign one process to MCU0 108a and the other to MCU1 108b. In some embodiments, different inputs can be parallelized for simultaneous processing by different threads.
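  • The explicit assignment of processes to MCU IDs described above might be sketched as follows; load_model_once() and run_on_mcu() are hypothetical helpers, and the point is only that initialization happens once per MCU and the same model is then reused for different inputs (data parallelism rather than model parallelism):

        /* Sketch of data parallelism with explicit MCU assignment.          */
        #include <stdint.h>

        #define MCU0_ID 0u
        #define MCU1_ID 1u

        void load_model_once(uint32_t mcu_id);                               /* assumed */
        void run_on_mcu(uint32_t mcu_id, const uint8_t *in, uint32_t len);   /* assumed */

        void parallel_inference(const uint8_t *data, uint32_t len)
        {
            /* Initialization happens once per MCU; the loaded model is reused. */
            load_model_once(MCU0_ID);
            load_model_once(MCU1_ID);

            /* Data parallelism: the same model, different halves of the input. */
            run_on_mcu(MCU0_ID, data,           len / 2);
            run_on_mcu(MCU1_ID, data + len / 2, len - len / 2);
        }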
  • FIG. 3 illustrates a block diagram 300 of SoC 102 of FIG. 1 in which MCU0 108a and MCU1 108b perform parallel recognition processing of different portions of training data 310, according to certain embodiments of the present disclosure.
  • MCU0 108a and MCU1 108b may be configured to both perform training operations, concurrently.
  • MCU0 108a may locate a training model 312 in first NOR flash 112a via address line 140a and read the training model via data line 150a.
  • MCU1 108b may locate the training model 312 in second NOR flash 112b and read the training model via data line 150b.
  • MCU0 108a may locate a first portion of the training data 310-1 via address line 120a and read the first portion of the training data 310-1 via data line 130a.
  • Training data 310 may include inference model data collected over time when an inference model is run.
  • the inference model data may be stored locally, e.g., at SRAM 110 and/or RAM 106, or on a cloud-based server or external device.
  • training data 310 may include the sensor data processed by the inference model, the output of the inference model, and any accuracy data associated with the output.
  • the accuracy data may include a label or tag such as, e.g., “correct” or “incorrect.”
  • when the output of the inference model 212 (from FIG. 2) is displayed, a user may interact with the device (e.g., a touch screen, keypad, etc.) to indicate whether the output data is accurate and/or useful.
  • the inference model data may be used to train training model 312 to improve the accuracy and/or robustness of the generated inference model 212.
  • training data 310 may be downloaded from the cloud or received from another device.
  • training data 310 includes multiple images of digits, each of which contains M identical instances.
  • An MCU uses each pair of images to mutually predict each pair of M instances to minimize the total loss for training.
  • MCU0 108a uses the first frame of “0” as a query, each subsequent frame of “0” is regarded as an independent global detection, and the detection with the highest score is directly obtained as the result. This way, cumulative error caused by relying on adjacent frames may be avoided.
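  • A hedged sketch of this query-based matching is shown below, with a hypothetical score() similarity function; frame layout and return convention are assumptions:

        /* Sketch: the first frame is the query, every later frame is scored
         * independently, and the highest-scoring frame is taken as the result
         * so errors do not accumulate across adjacent frames.                 */
        #include <stddef.h>
        #include <stdint.h>

        float score(const uint8_t *query, const uint8_t *candidate, size_t len); /* assumed */

        size_t best_match(const uint8_t *frames, size_t frame_len, size_t num_frames)
        {
            if (num_frames < 2)
                return 0;                            /* nothing to match        */
            const uint8_t *query = frames;           /* first frame is the query */
            size_t best_idx = 1;
            float  best     = -1.0f;
            for (size_t i = 1; i < num_frames; i++) {/* each frame scored alone  */
                float s = score(query, frames + i * frame_len, frame_len);
                if (s > best) { best = s; best_idx = i; }
            }
            return best_idx;
        }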
  • FIG. 4 illustrates a block diagram 400 of SoC 102 of FIG. 1 in which MCU0 108a performs inference operations of sensor data 210 and MCU1 108b performs training operations of training data 310 in parallel, according to certain embodiments of the present disclosure.
  • MCU0 108a may be configured to perform inference operations and MCU1 108b may be configured to perform training operations, concurrently.
  • sensor data 210 identified by MCU0 108a may be used to update training data 310 used by MCU1 108b for the training operations.
  • MCU0 108a may locate inference model 212 in first NOR flash 112a via address line 140a and read the inference model via data line 150a
  • MCU1 108b may locate training model 312 in second NOR flash 112b and read the training model via data line 150b.
  • MCU0 108a may locate sensor data 210 in SRAM 110 via address line 120a and read the sensor data via data line 130a.
  • MCU1 108b may locate training data 310 via address line 120b and read the training data set via data line 130b.
  • recognition data 410 e.g., image, context, speech, etc.
  • sensor data 210 may be added to training data 310, such that training data 310 is updated in real-time.
  • a label such as “correct” or “incorrect” may accompany the sensor data 210 and recognition data 410 based on user input, as described above in connection with FIG. 3. That way, training model 312 may update inference model 212 in real-time based on the accuracy of the information identified by the inference operations being concurrently performed by MCU0 108a.
  • Parallel inference/training operations described in connection with FIG. 4 may be performed using the token-based process flow 600 depicted in FIG. 6.
  • FIG. 7 illustrates a tensor flow deployment technique 700 to generate a program/model that is stored in first NOR flash 112a and/or second NOR flash 112b of FIG. 1, according to certain embodiments of the present disclosure.
  • neural network source code and assembly code may be compiled into respective .o files, which may be linked into a program file.
  • the program file may then be stored in a NOR flash.
  • the program format may be copied into the NOR flash and easily accessed. That is, the code part and data part are stored separately, so that the code and data can be accessed in a certain sequence.
  • certain optimizations may be considered.
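  • One possible flash image layout consistent with this description is sketched below; the header fields and magic value are assumptions for illustration, not the patent's actual format:

        /* Sketch of a NOR flash program image with separately stored code and
         * data sections located behind a small header.                        */
        #include <stdint.h>

        typedef struct {
            uint32_t magic;        /* identifies a valid program image          */
            uint32_t code_offset;  /* where the executable code section starts  */
            uint32_t code_size;
            uint32_t data_offset;  /* where the model/data section starts       */
            uint32_t data_size;
        } flash_image_header_t;

        /* Locate the code and data sections of the image stored at flash_base. */
        static int locate_sections(const uint8_t *flash_base,
                                   const uint8_t **code, uint32_t *code_size,
                                   const uint8_t **data, uint32_t *data_size)
        {
            const flash_image_header_t *hdr = (const flash_image_header_t *)flash_base;
            if (hdr->magic != 0x4D435530u)           /* arbitrary marker        */
                return -1;
            *code = flash_base + hdr->code_offset;   *code_size = hdr->code_size;
            *data = flash_base + hdr->data_offset;   *data_size = hdr->data_size;
            return 0;
        }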
  • FIG. 8 illustrates a block diagram of a tensor flow deployment technique 800 that may be used to update an inference model, according to certain embodiments of the present disclosure.
  • the training engine 802 of MCU0 108a may update or generate a new inference model, which is input into CPU 116.
  • CPU 116 may obtain (at 801) the pre-trained model (e.g., from the training engine 802 of MCU0 108a), make optimizations (at 803), convert (at 805) the TensorFlow model to TensorFlow Lite, and convert (at 807) the TensorFlow Lite model to the runtime parameters.
  • the model may be input into NOR flash and/or inference engine 804 of MCU1 108b.
  • FIG. 9A illustrates a mapping technique 900 that may be implemented by MCU0 108a and MCU1 108b for long short-term memory (LSTM) speech recognition, according to certain aspects of the present disclosure.
  • FIG. 9B illustrates a block diagram 901 of MCU0 108a and MCU1 108b implementing the mapping technique of FIG. 9A, according to certain embodiments of the present disclosure. FIGs. 9A and 9B will be described together.
  • LSTM is a deep learning system that avoids the vanishing gradient problem.
  • LSTM is normally augmented by recurrent gates called “forget gates.”
  • LSTM prevents backpropagated errors from vanishing or exploding. Instead, errors can flow backwards through unlimited numbers of virtual layers unfolded in space. That is, LSTM can learn tasks that require memories of events that happened thousands or even millions of discrete time steps earlier.
  • Problem-specific LSTM-like topologies can be evolved using MCU0 108a and MCU1 108b. LSTM works even given long delays between significant events and can handle signals that mix low and high frequency components.
  • Certain applications may use a stack of LSTM recurrent neural networks (RNNs) and train them by Connectionist Temporal Classification (CTC) to find an RNN weight matrix that maximizes the probability of the label sequences in a training set, given the corresponding input sequences.
  • RNNs LSTM recurrent neural networks
  • CTC Connectionist Temporal Classification
  • LSTM can learn to recognize context-sensitive languages, unlike previous models based on hidden Markov models (HMM) and similar concepts.
  • MCU0 108a may perform process A
  • MCU1 108b may perform process B to implement LSTM.
  • the output of process A may be input into SRAM 110, which is then read by MCU1 108b as the input for process B.
  • the output for process B may be input into SRAM 110, which is then read by MCU0 108a as the next input for process A, and so on.
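  • A minimal sketch of that ping-pong exchange through SRAM 110 is shown below; process_a(), process_b(), and the ready flags are illustrative assumptions rather than the disclosed implementation:

        /* Sketch of the A/B exchange of FIGs. 9A-9B: process A on MCU0 writes
         * its output into SRAM, process B on MCU1 reads it as input, and B's
         * output is written back for A's next step.                           */
        #include <stdint.h>

        #define VEC_LEN 64u

        typedef struct {
            volatile uint32_t ready_for_b;   /* set by MCU0 after writing a_out */
            volatile uint32_t ready_for_a;   /* set by MCU1 after writing b_out */
            float a_out[VEC_LEN];            /* output of process A, input to B */
            float b_out[VEC_LEN];            /* output of process B, input to A */
        } lstm_exchange_t;

        void process_a(const float *in, float *out);  /* assumed LSTM sub-step on MCU0 */
        void process_b(const float *in, float *out);  /* assumed LSTM sub-step on MCU1 */

        /* MCU0 side of one iteration: run A, publish to SRAM, wait for B.     */
        void mcu0_step(lstm_exchange_t *x)
        {
            process_a(x->b_out, x->a_out);   /* previous B output feeds A      */
            x->ready_for_b = 1u;
            while (!x->ready_for_a) { /* spin or sleep until MCU1 finishes B   */ }
            x->ready_for_a = 0u;
        }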
  • FIG. 10 illustrates a flowchart of a method 1000 of implementing a dual-MCU parallel processing technique, according to embodiments of the disclosure.
  • Method 1000 may be performed by an apparatus, e.g., system 100, SoC 102, MCU0 108a, MCU1 108b, SRAM 110, first NOR flash 112a, second NOR flash 112b, sensor(s) 114, and/or CPU 116.
  • Method 1000 may include steps 1002-1012 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 10.
  • the apparatus may obtain a program from a first NOR flash.
  • MCU0 108a may locate an inference model 212 or a training model 312 in first NOR flash 112a via address line 140a and read the inference model 212 or training model 312 via data line 150a.
  • the apparatus may obtain a first portion of a data set from an SRAM.
  • MCU0 108a may locate a first portion of the sensor data 210-1 or training data 310-1 via address line 120a and read the first portion of the sensor data 210-1 or training data 310-1 via data line 130a.
  • the apparatus may perform first processing of the first portion of the data set based on the program. For example, referring to FIGs. 1-4, MCU0 108a applies the inference model 212 and/or training model 312 to the first portion of the sensor data 210-1 or training data 310-1.
  • At 1008, the apparatus may obtain the program from a second NOR flash. For example, referring to FIGs. 2-4, MCU1 108b may locate the inference model 212 and/or training model 312 in second NOR flash 112b and read the inference model 212 and/or training model 312 via data line 150b.
  • the apparatus may obtain a second portion of the data set from the SRAM.
  • MCU1 108b may locate a second portion of the sensor data 210-2 and/or training data 310-2 via address line 120b and read the second portion of the sensor data 210-2 and/or training data 310-2 via data line 130b.
  • the apparatus may perform second processing of the second portion of the data set based on the program. For example, referring to FIGs. 1-4, MCU1 108b may apply the inference model 212 and/or training model 312 to the second portion of the sensor data 210-2 or training data 310-2.
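  • Purely for illustration, steps 1002-1012 could be coordinated per MCU as in the following sketch, where the helper functions wrapping the NOR flash and SRAM accesses are hypothetical:

        /* Sketch of method 1000: each MCU obtains its program from its own
         * NOR flash, obtains its portion of the data set from SRAM, and then
         * processes that portion based on the program.                        */
        #include <stdint.h>

        const uint8_t *nor_read_program(uint32_t mcu_id, uint32_t *len);      /* assumed */
        const uint8_t *sram_read_portion(uint32_t mcu_id, uint32_t *len);     /* assumed */
        void process_portion(const uint8_t *prog, uint32_t prog_len,
                             const uint8_t *data, uint32_t data_len);         /* assumed */

        void mcu_run(uint32_t mcu_id)
        {
            uint32_t prog_len, data_len;

            /* 1002 / 1008: obtain the program from this MCU's NOR flash.       */
            const uint8_t *prog = nor_read_program(mcu_id, &prog_len);

            /* 1004 / 1010: obtain this MCU's portion of the data set from SRAM. */
            const uint8_t *data = sram_read_portion(mcu_id, &data_len);

            /* 1006 / 1012: process the portion based on the program.            */
            process_portion(prog, prog_len, data, data_len);
        }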
  • FIG. 11 illustrates a wireless network 1100, in which certain aspects of the present disclosure may be implemented, according to some embodiments of the present disclosure.
  • wireless network 1100 may include a network of nodes, such as a user equipment (UE) 1102, an access node 1104, and a core network element 1106.
  • User equipment 1102 may be any terminal device, such as a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, or any other device capable of receiving, processing, and transmitting information, such as any member of a vehicle to everything (V2X) network, a cluster network, a smart grid node, or an Internet-of-Things (IoT) node.
  • V2X vehicle to everything
  • Access node 1104 may be a device that communicates with user equipment 1102, such as a wireless access point, a base station (BS), a Node B, an enhanced Node B (eNodeB or eNB), a next-generation NodeB (gNodeB or gNB), a cluster master node, or the like. Access node 1104 may have a wired connection to user equipment 1102, a wireless connection to user equipment 1102, or any combination thereof. Access node 1104 may be connected to user equipment 1102 by multiple connections, and user equipment 1102 may be connected to other access nodes in addition to access node 1104. Access node 1104 may also be connected to other UEs. It is understood that access node 1104 is illustrated by a radio tower by way of illustration and not by way of limitation.
  • Core network element 1106 may serve access node 1104 and user equipment 1102 to provide core network services.
  • core network element 1106 may include a home subscriber server (HSS), a mobility management entity (MME), a serving gateway (SGW), or a packet data network gateway (PGW).
  • HSS home subscriber server
  • MME mobility management entity
  • SGW serving gateway
  • PGW packet data network gateway
  • EPC evolved packet core
  • LTE Long-Term Evolution
  • core network element 1106 includes an access and mobility management function (AMF) device, a session management function (SMF) device, or a user plane function (UPF) device, of a core network for the new radio (NR) system.
  • AMF access and mobility management function
  • SMF session management function
  • UPF user plane function
  • Core network element 1106 is shown as a set of rack-mounted servers by way of illustration and not by way of limitation.
  • Core network element 1106 may connect with a large network, such as the Internet 1108, or another internet protocol (IP) network, to communicate packet data over any distance.
  • IP internet protocol
  • data from user equipment 1102 may be communicated to other UEs connected to other access points, including, for example, a computer 1110 connected to Internet 1108, for example, using a wired connection or a wireless connection, or to a tablet 1112 wirelessly connected to Internet 1108 via a router 1114.
  • computer 1110 and tablet 1112 provide additional examples of possible UEs
  • router 1114 provides an example of another possible access node.
  • a generic example of a rack-mounted server is provided as an illustration of core network element 1106.
  • database servers such as a database 1116
  • security and authentication servers such as an authentication server 1118.
  • Database 1116 may, for example, manage data related to user subscription to network services.
  • a home location register (HLR) is an example of a standardized database of subscriber information for a cellular network.
  • authentication server 1118 may handle authentication of users, sessions, and so on.
  • an authentication server function (AUSF) device may be the specific entity to perform user equipment authentication.
  • a single server rack may handle multiple such functions, such that the connections between core network element 1106, authentication server 1118, and database 1116, may be local connections within a single rack.
  • Each element in FIG. 11 may be considered a node of wireless network 1100. More detail regarding the possible implementation of a node is provided by way of example in the description of a node 1200 in FIG. 12.
  • Node 1200 may be configured as user equipment 1102, access node 1104, or core network element 1106 in FIG. 11.
  • node 1200 may also be configured as computer 1110, router 1114, tablet 1112, database 1116, or authentication server 1118 in FIG. 11.
  • node 1200 may include a processor 1202, a memory 1204, and a transceiver 1206. These components are shown as connected to one another by a bus, but other connection types are also permitted.
  • node 1200 When node 1200 is user equipment 1102, additional components may also be included, such as a user interface (UI), sensors, and the like. Similarly, node 1200 may be implemented as a blade in a server system when node 1200 is configured as core network element 1106. Other implementations are also possible.
  • UI user interface
  • Transceiver 1206 may include any suitable device for sending and/or receiving data.
  • Node 1200 may include one or more transceivers, although only one transceiver 1206 is shown for simplicity of illustration.
  • An antenna 1208 is shown as a possible communication mechanism for node 1200. Multiple antennas and/or arrays of antennas may be utilized for receiving multiple spatially multiplexed data streams.
  • examples of node 1200 may communicate using wired techniques rather than (or in addition to) wireless techniques.
  • access node 1104 may communicate wirelessly to user equipment 1102 and may communicate by a wired connection (for example, by optical or coaxial cable) to core network element 1106.
  • Other communication hardware such as a network interface card (NIC), may be included as well.
  • NIC network interface card
  • node 1200 may include processor 1202. Although only one processor is shown, it is understood that multiple processors can be included.
  • Processor 1202 may include microprocessors, microcontroller units (MCUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functions described throughout the present disclosure.
  • Processor 1202 may be a hardware device having one or more processing cores.
  • Processor 1202 may execute software.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
  • Software can include computer instructions written in an interpreted language, a compiled language, or machine code. Other techniques for instructing hardware are also permitted under the broad category of software.
  • node 1200 may also include memory 1204. Although only one memory is shown, it is understood that multiple memories can be included. Memory 1204 can broadly include both memory and storage.
  • memory 1204 may include random-access memory (RAM), read-only memory (ROM), static RAM (SRAM), dynamic RAM (DRAM), ferroelectric RAM (FRAM), electrically erasable programmable ROM (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, Flash drive, solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions that can be accessed and executed by processor 1202.
  • RAM random-access memory
  • ROM read-only memory
  • SRAM static RAM
  • DRAM dynamic RAM
  • FRAM ferroelectric RAM
  • EEPROM electrically erasable programmable ROM
  • CD-ROM compact disc read-only memory
  • HDD hard disk drive
  • SSD solid-state drive
  • memory 1204 may be embodied by any computer-readable medium, such as a non-transitory computer-readable medium.
  • Processor 1202, memory 1204, and transceiver 1206 may be implemented in various forms in node 1200 for performing wireless communication functions.
  • processor 1202, memory 1204, and transceiver 1206 of node 1200 are implemented (e.g., integrated) on one or more system-on-chips (SoCs).
  • SoCs system-on-chips
  • processor 1202 and memory 1204 may be integrated on an application processor (AP) SoC (sometimes known as a “host,” referred to herein as a “host chip”) that handles application processing in an operating system (OS) environment, including generating raw data to be transmitted.
  • AP application processor
  • OS operating system
  • processor 1202 and memory 1204 may be integrated on a baseband processor (BP) SoC (sometimes known as a “modem,” referred to herein as a “baseband chip”) that converts the raw data, e.g., from the host chip, to signals that can be used to modulate the carrier frequency for transmission, and vice versa, which can run a real-time operating system (RTOS).
  • BP baseband processor
  • processor 1202 and transceiver 1206 (and memory 1204 in some cases) may be integrated on a radio frequency (RF) SoC (sometimes known as a “transceiver,” referred to herein as an “RF chip”) that transmits and receives RF signals with antenna 1208.
  • RF radio frequency
  • processor 1202 may be implemented as system 100 as described above to provide low-latency, high-precision parallel data processing.
  • the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium.
  • Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as node 1200 in FIG. 12.
  • such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, HDD, such as magnetic disk storage or other magnetic storage devices, Flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer.
  • Disk and disc include CD, laser disc, optical disc, digital video disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • an SoC is provided.
  • the SoC may include an SRAM configured to obtain a data set.
  • the SoC may include a first NOR flash configured to maintain a first program.
  • the SoC may include a first MCU coupled to the first NOR flash via a first address link and a first data link.
  • the first MCU may be configured to locate the first program via the first address link.
  • the first MCU may be configured to read the first program via the first data link.
  • the first MCU may be configured to obtain a first portion of the data set from the SRAM.
  • the first MCU may be configured to perform first processing of the first portion of the data set based on the first program.
  • the SoC may include a second NOR flash configured to maintain a second program.
  • the SoC may include a second MCU coupled to the second NOR flash via a second address link and a second data link.
  • the second MCU may be configured to locate the second program via the second address link.
  • the second MCU may be configured to read the second program via the second data link.
  • the second MCU may be configured to obtain a second portion of the data set from the SRAM.
  • the second MCU may be configured to perform second processing of the second portion of the data set based on the second program.
  • the SoC may include a CPU.
  • the CPU may be configured to select a processing mode for each of the first MCU and the second MCU.
  • the CPU may be configured to configure the first MCU to perform the first processing based on a first processing mode associated with the first program.
  • the CPU may be configured to configure the second MCU to perform the second processing based on a second processing mode associated with the second program.
  • the CPU may be configured to send first instructions associated with the first program to the first NOR flash while the first MCU performs the first processing.
  • the CPU may be configured to send second instructions associated with the second program to the second NOR flash while the second MCU performs the second processing.
  • the first MCU may perform the first processing of the first portion of the data set and the second MCU may perform the second processing of the second portion of the data set in parallel.
  • the CPU may be further configured to modify one or more of the first program of the first NOR flash or the second program of the second NOR flash.
  • the first NOR flash and the second NOR flash may communicate with the CPU via an SoC bus.
  • the first program comprises a first inference model.
  • the second program may include a second inference model.
  • the data set may include sensor data.
  • the first MCU may be configured to perform the first processing of the first portion of the data set by applying the first inference model to a first portion of the sensor data to identify first inference information as an output of the first inference model.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by applying the second inference model to a second portion of the sensor data to identify second inference information as an output of the second inference model.
  • the first MCU may apply the first inference model to the first portion of the sensor data and the second MCU may apply the second inference model to the second portion of the sensor data concurrently.
  • the first portion of the sensor data and the second portion of the sensor data may include different portions of the sensor data.
  • the first program may include a first training model.
  • the second program may include a second training model.
  • the data set may include training data.
  • the first MCU may be configured to perform the first processing of the first portion of the data set by applying the first training model to a first portion of the training data to generate a first inference model.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by applying the second training model to a second portion of the training data to generate a second inference model.
  • the first MCU may apply the first training model to the first portion of the training data and the second MCU may apply the second training model to the second portion of the training data concurrently.
  • the first portion of the training data and the second portion of the training data may include different portions of the training data.
  • the first program and the second program may include a same training model or a same inference model.
  • the first MCU may be configured to perform the first processing of the first portion of the data set by applying a first thread of the same training model or the same inference model to the data set.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by applying a second thread of the same training model or the same inference model to the data set.
  • the first program may include an inference model.
  • the second program may include a training model.
  • the data set may include sensor data and training data.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by applying the training model to the training data to generate the inference program.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by receiving inference results associated with the sensor data and the inference program from the first MCU.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by obtaining truth data associated with the inference results.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by generating label information associated with an accuracy of the inference results by comparing the inference results with the truth data.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by inputting the inference results and the label information into the training data.
  • the second MCU may be configured to perform the second processing of the second portion of the data set by applying the training model to the training data, the inference results, and the label information to generate an updated inference model (a minimal sketch of this inference-plus-retraining loop follows this list).
  • the first MCU may be configured to perform the first processing of the first portion of the data set by applying the inference model to the sensor data to generate the inference information as an output of the inference model.
  • the first MCU may apply the inference model to the sensor data and the second MCU may apply the training model to the training data concurrently.
  • the sensor data may be obtained from one or more of a sensor or a DSP.
  • the first MCU and the second MCU may communicate via the SRAM (a minimal SRAM-mailbox sketch follows this list).
  • a first address space in the SRAM may be mapped to read operations by the first MCU and write operations by the second MCU.
  • a second address space in the SRAM may be mapped to read operations by the second MCU and write operations by the first MCU.
  • the first address space and the second address space may be different.
  • an SoC may include a first MCU.
  • the first MCU may be configured to obtain a program from a first NOR flash.
  • the first MCU may be configured to obtain a first portion of a data set from an SRAM.
  • the first MCU may be configured to perform first processing of the first portion of the data set based on the program.
  • the SoC may include a second MCU.
  • the second MCU may be configured to obtain the program from a second NOR flash.
  • the second MCU may be configured to obtain a second portion of the data set from the SRAM.
  • the second MCU may be configured to perform second processing of the second portion of the data set based on the program.
  • the first NOR flash and the second NOR flash may be different.
  • the first processing and the second processing may be performed in parallel.
  • the first MCU may be further configured to locate the first program via a first address link with the first NOR flash.
  • the first MCU may be further configured to read the first program via a first data link with the first NOR flash.
  • the first MCU may be further configured to locate the first portion of the data set via a second address link with the SRAM.
  • the first MCU may be further configured to read the first portion of the data set via a second data link with the SRAM.
  • the first MCU may be further configured to write first processed data to the SRAM via an SRAM cache.
  • the first MCU may read the first portion of the data set directly from the SRAM.
  • the second MCU may be further configured to locate the second program via a third address link with the second NOR flash.
  • the second MCU may be further configured to read the second program via a third data link with the second NOR flash.
  • the second MCU may be further configured to locate the second portion of the data set via a fourth address link with the SRAM.
  • the second MCU may be further configured to read the second portion of the data set via a fourth data link with the SRAM.
  • the second MCU may be further configured to write second processed data to the SRAM via an SRAM cache.
  • the second MCU may read the second portion of the data set directly from the SRAM.
  • the first program and the second program may be separate programs.
  • the first program may include an inference program.
  • the second program may include a training program.
  • the first program and the second program may be parts of a single program.
  • the first program may be associated with an inference program part of the single program.
  • the second program may be associated with a training program part of the single program.
  • a method of parallel MCU processing may include obtaining, by a first MCU, a program from a first NOR flash.
  • the method may include obtaining, by the first MCU, a first portion of a data set from an SRAM.
  • the method may include performing, by the first MCU, first processing of the first portion of the data set based on the program.
  • the method may include obtaining, by a second MCU, the program from a second NOR flash.
  • the method may include obtaining, by the second MCU, a second portion of the data set from the SRAM.
  • the method may include performing, by the second MCU, second processing of the second portion of the data set based on the program.
  • the first NOR flash and the second NOR flash may be different.
  • the first processing and the second processing may be performed in parallel.
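The bullets above, including the method bullets that close the list, describe a data-parallel arrangement in which each MCU fetches its own program from a dedicated NOR flash and processes a disjoint portion of a data set held in a shared SRAM, with both portions processed at the same time. The following C sketch only emulates that split on a host: two POSIX threads stand in for the two MCUs, an ordinary array stands in for the SRAM data set, and function pointers stand in for the programs read from NOR flash. The names (mcu_task, run_mcu, SRAM_WORDS) and the doubling/negating "programs" are illustrative assumptions, not anything defined by this application.

```c
#include <pthread.h>
#include <stdio.h>

#define SRAM_WORDS 8                                      /* size of the shared data set */

static int sram[SRAM_WORDS] = {1, 2, 3, 4, 5, 6, 7, 8};   /* stands in for the SRAM-held data set */

/* Each "program" would normally be fetched from a per-MCU NOR flash;
 * here a plain function pointer plays that role. */
typedef int (*mcu_program_t)(int word);

static int program_a(int word) { return word * 2; }       /* hypothetical first program  */
static int program_b(int word) { return -word;    }       /* hypothetical second program */

struct mcu_task {
    mcu_program_t program;   /* what this MCU read from its NOR flash          */
    int begin, end;          /* the portion of the data set assigned to it     */
};

/* One MCU's processing loop: read its portion from "SRAM", apply its program,
 * and write the processed words back in place. */
static void *run_mcu(void *arg)
{
    struct mcu_task *t = arg;
    for (int i = t->begin; i < t->end; i++)
        sram[i] = t->program(sram[i]);
    return NULL;
}

int main(void)
{
    /* The first MCU gets the first half of the data set, the second MCU the rest. */
    struct mcu_task first  = { program_a, 0,              SRAM_WORDS / 2 };
    struct mcu_task second = { program_b, SRAM_WORDS / 2, SRAM_WORDS     };

    pthread_t mcu1, mcu2;
    pthread_create(&mcu1, NULL, run_mcu, &first);   /* both portions are processed ... */
    pthread_create(&mcu2, NULL, run_mcu, &second);  /* ... at the same time            */
    pthread_join(mcu1, NULL);
    pthread_join(mcu2, NULL);

    for (int i = 0; i < SRAM_WORDS; i++)
        printf("%d ", sram[i]);
    printf("\n");
    return 0;
}
```

On the SoC itself each MCU would execute the program fetched over its own address and data links, so no threading library would be involved; the host threads here only make the parallel split of the data set observable on a workstation.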
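Several bullets above describe a pipeline in which the first MCU applies an inference model to sensor data while the second MCU compares the inference results with truth data, derives label information, folds the labeled results back into the training data, and retrains to produce an updated inference model. The single-file C sketch below walks through that loop with a deliberately tiny stand-in model (a one-parameter linear predictor) so the data flow stays visible; the model, the learning rate, and the single-threaded simulation of the two MCUs are all assumptions made for illustration, since the application does not specify any particular model or training algorithm.

```c
#include <stdio.h>

/* Hypothetical one-parameter "inference model": y = w * x.
 * A scalar linear model keeps the retraining loop visible in a few lines. */
static double infer(double w, double x) { return w * x; }

int main(void)
{
    double w = 0.5;                                  /* current inference model parameter      */
    const double sensor[4] = {1.0, 2.0, 3.0, 4.0};   /* sensor data seen by the first MCU      */
    const double truth[4]  = {2.0, 4.0, 6.0, 8.0};   /* truth data obtained by the second MCU  */
    const double rate = 0.05;                        /* assumed learning rate                  */

    for (int epoch = 0; epoch < 50; epoch++) {
        double grad = 0.0;
        for (int i = 0; i < 4; i++) {
            /* "First MCU": run inference on the sensor data. */
            double result = infer(w, sensor[i]);

            /* "Second MCU": compare the inference result with the truth data;
             * the signed error plays the role of the label information. */
            double label = result - truth[i];

            /* Accumulate the gradient contributed by this labeled sample. */
            grad += label * sensor[i];
        }
        /* Apply the "training model": update the inference model parameter. */
        w -= rate * grad / 4.0;
    }
    printf("updated inference model parameter w = %f\n", w);
    return 0;
}
```

Here the prediction error stands in for the label information and the gradient step stands in for the training model; a real deployment would substitute whatever inference and training programs are actually stored in the two NOR flashes.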
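The bullets above also describe inter-MCU communication through the SRAM, with one address space mapped for writes by the second MCU and reads by the first, and a different address space mapped the other way around. The C sketch below models those two regions as a pair of one-way mailboxes; the region size, the helper names (mailbox_write, mailbox_read), and the use of static buffers in place of fixed SRAM address ranges are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAILBOX_BYTES 64   /* assumed size of each one-way region */

/* Two disjoint buffers standing in for the two mapped SRAM address spaces:
 * region_1to2 is written by the first MCU and read by the second,
 * region_2to1 is written by the second MCU and read by the first. */
static uint8_t region_1to2[MAILBOX_BYTES];
static uint8_t region_2to1[MAILBOX_BYTES];

/* Copy a message into a one-way region (the writer side of a mapping). */
static void mailbox_write(uint8_t *region, const void *msg, size_t len)
{
    if (len > MAILBOX_BYTES)
        len = MAILBOX_BYTES;
    memcpy(region, msg, len);
}

/* Copy a message out of a one-way region (the reader side of a mapping). */
static void mailbox_read(const uint8_t *region, void *msg, size_t len)
{
    if (len > MAILBOX_BYTES)
        len = MAILBOX_BYTES;
    memcpy(msg, region, len);
}

int main(void)
{
    char out[MAILBOX_BYTES];

    /* The first MCU posts inference results for the second MCU ... */
    mailbox_write(region_1to2, "inference results", 18);
    /* ... and the second MCU posts label information back. */
    mailbox_write(region_2to1, "label information", 18);

    mailbox_read(region_1to2, out, sizeof out);
    printf("second MCU read: %s\n", out);
    mailbox_read(region_2to1, out, sizeof out);
    printf("first MCU read:  %s\n", out);
    return 0;
}
```

On the SoC the two regions would be fixed, non-overlapping SRAM address ranges, and the design would add whatever handshaking it needs (for example a sequence counter or an interrupt) so a reader never consumes a half-written message; that synchronization detail is outside what the bullets above specify.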

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)

Abstract

According to one aspect of the disclosure, a system-on-chip (SoC) is provided. The SoC may include a first microcontroller unit (MCU) coupled to a first NOR flash and a second MCU coupled to a second NOR flash. Both MCUs may be coupled to a static random access memory (SRAM). Using divided portions of a data set read from the SRAM and programs read from the first and second NOR flashes, the first MCU and the second MCU may perform parallel processing of the portions of the data set to carry out inference and/or training.
PCT/US2021/057997 2021-11-04 2021-11-04 System and method of dual microcontroller unit parallel data processing WO2023080891A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/057997 WO2023080891A1 (fr) System and method of dual microcontroller unit parallel data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/057997 WO2023080891A1 (fr) System and method of dual microcontroller unit parallel data processing

Publications (1)

Publication Number Publication Date
WO2023080891A1 true WO2023080891A1 (fr) 2023-05-11

Family

ID=86241649

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/057997 WO2023080891A1 (fr) System and method of dual microcontroller unit parallel data processing

Country Status (1)

Country Link
WO (1) WO2023080891A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160306752A1 (en) * 2013-07-29 2016-10-20 Intel Corporation Execution-aware memory protection
US20190258251A1 (en) * 2017-11-10 2019-08-22 Nvidia Corporation Systems and methods for safe and reliable autonomous vehicles
US20200017114A1 (en) * 2019-09-23 2020-01-16 Intel Corporation Independent safety monitoring of an automated driving system

Similar Documents

Publication Publication Date Title
Zhou et al. Edge intelligence: Paving the last mile of artificial intelligence with edge computing
US20230153620A1 (en) Dynamic processing element array expansion
US11960843B2 (en) Multi-module and multi-task machine learning system based on an ensemble of datasets
JP6974270B2 (ja) Intelligent high-bandwidth memory system and logic die therefor
Jiang et al. Accelerating mobile applications at the network edge with software-programmable FPGAs
US20230095092A1 (en) Denoising diffusion generative adversarial networks
US11676021B1 (en) Multi-model training pipeline in distributed systems
CN110852254A (zh) Face key point tracking method, medium, apparatus, and computing device
US11948352B2 (en) Speculative training using partial gradients update
KR20220164570A (ko) Edge server with deep learning accelerator and random access memory
CN111242273B (zh) Neural network model training method and electronic device
US20230082536A1 (en) Fast retraining of fully fused neural transceiver components
US20230410487A1 (en) Online learning method and system for action recognition
US20220101539A1 (en) Sparse optical flow estimation
US20220300418A1 (en) Maximizing resource bandwidth with efficient temporal arbitration
Valery et al. CPU/GPU collaboration techniques for transfer learning on mobile devices
US11308396B2 (en) Neural network layer-by-layer debugging
US11467946B1 (en) Breakpoints in neural network accelerator
US10846201B1 (en) Performance debug for networks
CN111832291B (zh) Method and apparatus for generating an entity recognition model, electronic device, and storage medium
US10783004B1 (en) Method, apparatus, and electronic device for improving parallel performance of CPU
WO2023080891A1 (fr) System and method of dual microcontroller unit parallel data processing
US20220391781A1 (en) Architecture-agnostic federated learning system
CN115098262A (zh) Multi-neural-network task processing method and apparatus
US11531578B1 (en) Profiling and debugging for remote neural network execution

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21963456

Country of ref document: EP

Kind code of ref document: A1