WO2020043057A1 - Picture processing method, task data processing method and apparatus - Google Patents

Picture processing method, task data processing method and apparatus

Info

Publication number
WO2020043057A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
task data
task
text
sub
Prior art date
Application number
PCT/CN2019/102587
Other languages
English (en)
French (fr)
Inventor
辛遥
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority to EP19854472.8A (published as EP3846079A4)
Publication of WO2020043057A1
Priority to US17/010,812 (published as US20200401829A1)


Classifications

    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 18/2163: Design or setup of recognition systems or techniques; partitioning the feature space
    • G06F 18/2415: Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 9/3818: Concurrent instruction execution; decoding for concurrent execution
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06V 10/242: Aligning, centring, orientation detection or correction of the image by image rotation, e.g. by 90 degrees
    • G06V 10/426: Global feature extraction by analysis of the whole pattern; graphical representations
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 10/955: Hardware or software architectures specially adapted for image or video understanding, using specific electronic processors
    • G06V 20/63: Scene text, e.g. street names
    • G06V 30/153: Segmentation of character regions using recognition of characters or words
    • G06V 30/19153: Design or setup of recognition systems or techniques, using rules for classification or partitioning the feature space
    • G06F 2209/486: Indexing scheme relating to G06F 9/48; scheduler internals
    • G06F 2209/5018: Indexing scheme relating to G06F 9/50; thread allocation
    • G06N 20/00: Machine learning
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing, and in particular, to a picture processing method, a task data processing method, and a device.
  • CPU: Central Processing Unit.
  • GPU: Graphics Processing Unit.
  • the CPU processes tasks serially; that is, it must wait for the previous task data to be processed and its execution result obtained before it can continue to execute the next task data.
  • as a result, the efficiency of task data processing is low.
  • processing through the GPU, by contrast, is relatively costly and consumes considerable power.
  • a picture processing method is implemented by processing units that each execute the subtask of a corresponding substructure in a machine learning model, at least some of the processing units including field-programmable gate array (FPGA) units; the method includes:
  • the text in the text box feature map is recognized to obtain a text recognition result.
  • the above picture processing method determines candidate text boxes at arbitrary angles according to the text features of the picture to be processed, so candidate text boxes at different angles can be identified. Each candidate text box is pooled, and candidate text boxes of different sizes are projected onto a fixed-size feature map to obtain the text box feature map of each candidate text box, which improves the adaptability of processing candidate text boxes: candidate text boxes of different sizes and different angles can be processed, and the text of each candidate text box can be identified by recognizing the text in its text box feature map.
  • because data can be processed in parallel in the above picture processing method, cost and power consumption can be reduced while the accuracy and efficiency of text recognition in the picture to be processed are improved.
  • a method for processing task data includes:
  • the subtasks of the corresponding substructures are sequentially executed by the processing units corresponding to the substructures; at least some of the processing units include field-programmable gate array (FPGA) units;
  • during the processing of each processing unit, when a processing unit is in an idle state, the subtask corresponding to the next task data is executed in parallel.
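  • As an illustration of this pipelined scheduling, the sketch below (a minimal model under assumptions, not the patented implementation) treats each processing unit as a lock-guarded pipeline stage and each task data as a thread task that walks the stages in substructure order; the unit count, function names, and data are hypothetical.

```python
import threading

# Stand-ins for the processing units (e.g. FPGA units 1..3). Each unit runs
# one subtask at a time, so it is guarded by a lock; a thread task that
# finds a unit busy simply waits until the unit is released.
UNITS = [threading.Lock() for _ in range(3)]

def run_subtask(unit_id: int, data: str) -> str:
    # Placeholder for dispatching one substructure's subtask to the unit.
    return f"{data}->sub{unit_id}"

def thread_task(task_id: int, data: str, results: dict) -> None:
    # Walk the units in substructure order. While this thread occupies
    # unit k, another thread task may already be using unit k-1, which is
    # what lets subtasks of different task data run in parallel.
    for unit_id, lock in enumerate(UNITS):
        with lock:
            data = run_subtask(unit_id, data)
    results[task_id] = data  # result output by the last substructure

results: dict = {}
threads = [threading.Thread(target=thread_task, args=(i, f"task{i}", results))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # one task execution result per task data
```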
  • a method for processing task data, applied to a distributed server host, includes:
  • the distributed server slave is configured to put the task data into a thread pool and, when processing the task data, obtain multiple task data from the thread pool; for each task data, the subtasks of the corresponding substructures are sequentially executed by the processing units corresponding to the substructures in the order of the substructures in the machine learning model; at least some of the processing units include field-programmable gate array (FPGA) units; during the processing of each processing unit, when a processing unit is in an idle state, the subtask corresponding to the next task data is executed in parallel.
  • a task data processing method applied to a distributed server slave includes:
  • the task data is placed in a thread pool
  • the subtasks of the corresponding substructures are sequentially executed by the processing units corresponding to the substructures; at least some of the processing units include field-programmable gate array (FPGA) units; during the processing of each processing unit, when a processing unit is in an idle state, the subtask corresponding to the next task data is executed in parallel.
  • a task data processing device includes a task scheduling unit and a field-programmable gate array (FPGA) unit, the task scheduling unit being connected to the FPGA unit;
  • the task scheduling unit is configured to obtain multiple task data and, for each task data, have the processing units corresponding to the substructures sequentially execute the subtasks of the corresponding substructures in the order of the substructures in the machine learning model; at least some of the processing units include FPGA units; during the processing of each processing unit, when a processing unit is in an idle state, the subtask corresponding to the next task data is executed in parallel.
  • the above task data processing method and device execute the subtasks corresponding to substructures in a machine learning model through FPGA units; when multiple task data are acquired, the FPGA units execute the subtasks of different task data in parallel, so the subtasks corresponding to each task data can be processed in parallel, improving the processing efficiency of multiple task data.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the picture processing method described above.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the task data processing method described above.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the task data processing method applied to the distributed server host described above.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the task data processing method applied to the distributed server slave described above.
  • a computer device includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the picture processing method described above.
  • a computer device includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the task data processing method described above.
  • a distributed server host includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the task data processing method described above.
  • a distributed server slave includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the task data processing method described above.
  • FIG. 1 is an application scenario diagram of a task data processing method in an embodiment;
  • FIG. 2 is a schematic diagram of an internal structure of a computer device in an embodiment;
  • FIG. 3 is a block diagram of a task data processing apparatus according to an embodiment;
  • FIG. 4 is a schematic diagram of an internal structure of a task data processing device in an embodiment;
  • FIG. 5 is a schematic flowchart of a task data processing method according to an embodiment;
  • FIG. 6 is a schematic diagram of encapsulation of task data in an embodiment;
  • FIG. 7 is a schematic diagram of parallel execution of multi-threaded tasks in an embodiment;
  • FIG. 8 is a timing diagram of parallel execution of multi-threaded tasks in one embodiment;
  • FIG. 9 is a timing diagram of parallel execution of multi-threaded tasks in an embodiment;
  • FIG. 10 is a schematic diagram of a CPU and an FPGA unit processing tasks in parallel in one embodiment;
  • FIG. 11 is an application environment diagram of a task data processing method in another embodiment;
  • FIG. 12 is an internal environment diagram of a distributed server slave in an embodiment;
  • FIG. 13 is a schematic flowchart of a task data processing method according to an embodiment;
  • FIG. 14 is a software architecture diagram of a task data processing method in an embodiment;
  • FIG. 15 is a schematic flowchart of steps for processing image processing task data by each substructure in an embodiment;
  • FIG. 16 is a schematic flowchart of a step of obtaining a classification result in an embodiment;
  • FIG. 17 is a schematic flowchart of obtaining an image processing result in an embodiment;
  • FIG. 18 is a schematic flowchart of a picture processing method according to an embodiment;
  • FIG. 20 is a schematic diagram of a text recognition result corresponding to an application scenario;
  • FIG. 21 is a schematic diagram of a text recognition result corresponding to another application scenario;
  • FIG. 22 is a schematic diagram of a text recognition result corresponding to another application scenario;
  • FIG. 23 is a schematic diagram of a text recognition result corresponding to another application scenario;
  • FIG. 24 is a schematic diagram of a text recognition result corresponding to another application scenario.
  • FIG. 1 is an application scenario diagram of a task data processing method in an embodiment.
  • the application scenario includes a CPU 110, a board interface 120, and a task data processing device 130.
  • the CPU 110 communicates with the task data processing device 130 through the board interface 120.
  • the board interface 120 and the CPU 110 are integrated on the motherboard of the computer equipment.
  • the board interface 120 may be a board slot on the motherboard, and the task data processing device 130 can communicate with the CPU 110 by inserting the board slot.
  • At least one FPGA (Field-Programmable Gate Array) unit is integrated in the task data processing device 130.
  • FIG. 2 is a schematic diagram of the internal structure of a computer device integrated with the CPU 110 and the board interface 120 in FIG. 1.
  • the computer device includes a CPU 110, a memory, a network interface, and a board interface 120 connected through a system bus.
  • the board interface 120 is connected to the task data processing device 130.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device can store an operating system and a computer program.
  • the CPU 110 can execute the task data processing method described below.
  • the task data processing device 130 and the CPU 110 of the computer device provide computing and control capabilities, supporting the operation of the entire computer device and the task data processing device 130.
  • a computer program may be stored in the internal memory, and when the computer program is executed by the CPU 110, the CPU 110 may execute a task data processing method described below.
  • the network interface of the computer device is used for network communication.
  • the computer equipment may be a distributed server slave.
  • the board interface 120 may be a PCIE 3x8 interface.
  • FIG. 2 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in FIG. 2, combine some components, or arrange the components differently.
  • a task data processing device 130 is provided, and the device specifically includes a task data acquisition module 132, a task data processing module 134, and an execution result acquisition module 136.
  • the task data acquisition module 132 is configured to acquire a plurality of task data.
  • a task data processing module 134 is configured to, for each task data, execute the subtasks of the corresponding substructures through the processing units corresponding to the substructures in the order of the substructures in the machine learning model; at least some of the processing units include FPGA units; when a processing unit is in an idle state, the subtask corresponding to the next task data is executed.
  • an execution result acquisition module 136 is configured to obtain the corresponding task execution result after the subtasks of every substructure have been executed for each task data.
  • the task data processing apparatus 130 provided in this application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 2.
  • the memory of the computer device may store the program modules constituting the task data processing apparatus 130, for example, the task data acquisition module 132, the task data processing module 134, and the execution result acquisition module 136 shown in FIG. 3.
  • the computer program constituted by each program module causes the CPU 110 to execute the steps in the task data processing method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 2 may obtain multiple task data through the task data acquisition module 132 in the task data processing apparatus 130 shown in FIG. 3.
  • the computer device can use the task data processing module 134 to execute, for each task data, the subtasks of the corresponding substructures through the processing units corresponding to the substructures in the order of the substructures in the machine learning model; at least some of the processing units include FPGA units; during the processing of each processing unit, when a processing unit is in an idle state, the subtask corresponding to the next task data is executed.
  • the computer device can obtain, through the execution result acquisition module 136, the corresponding task execution result after the subtasks of every substructure have been executed for each task data.
  • the task data processing apparatus 130 includes a task scheduling unit and one or more FPGA units, and the task scheduling unit is connected to each FPGA unit.
  • the task data processing apparatus 130 includes a task scheduling unit and four FPGA units as an example for description.
  • the task scheduling unit is used to obtain multiple task data and, for each task data, have the processing units corresponding to the substructures sequentially execute the subtasks of the corresponding substructures in the order of the substructures in the machine learning model; at least some of the processing units include one or more FPGA units.
  • the FPGA unit is used to execute the subtask corresponding to the next task data when it is in an idle state during processing, and, after the subtasks of every substructure have been executed for a task data, to obtain and output the corresponding task execution result.
  • the task data processing apparatus 130 further includes: a register and a memory; the memory is connected to the FPGA unit; and the register is connected to the task scheduling unit.
  • the task scheduling unit is also used to read the processing unit call data from the register, so that, for each task data, the processing units corresponding to the substructures are scheduled in turn, according to the processing unit call data and the order of the substructures in the machine learning model, to execute the subtasks of the corresponding substructures.
  • the FPGA unit is also used to read task data written by the CPU from the memory.
  • when a processing unit is in an idle state, the subtask corresponding to the next task data is executed; after the subtasks of every substructure have been completed for a task data, the corresponding task execution result is obtained and stored in the memory.
  • the task data processing apparatus 130 further includes: a bus controller; the bus controller is connected to the memory and each FPGA unit, respectively. Each FPGA unit stores the task execution result into the memory through the bus controller.
  • the memory may be a DDR4 memory.
  • the task data processing device 130 is connected to the CPU 110 through the board interface 120; the processing unit further includes a CPU 110.
  • a method for processing task data is provided.
  • the task data processing method can be applied to the task data processing apparatus 130 in FIG. 1 described above. This embodiment mainly uses the method applied to the task data processing apparatus 130 in FIG. 1 as an example for illustration. Referring to FIG. 5, the task data processing method specifically includes the following steps:
  • the task data is data corresponding to a task to be processed.
  • the CPU 110 obtains task data, sends the task data to the task data processing device 130, and the task data processing device 130 stores the received task data.
  • the task data processing device 130 reads a plurality of task data from the stored task data.
  • the CPU 110 reads task data from the memory of the computer device, and sends the read task data to the task data processing device 130.
  • the task data processing device 130 receives task data sent by the CPU 110 and stores the task data in a memory.
  • the CPU 110 sends a task execution instruction to the task data processing device 130; the task data processing device 130 receives the task execution instruction, determines to process the task data, and reads multiple task data from the memory according to the task execution instruction.
  • when the task data processing device 130 caches task data, a double-buffered ping-pong operation is used so that data reading and calculation are performed simultaneously, reducing the waiting time between the two stages and improving processing efficiency.
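  • A minimal sketch of such a double-buffered ping-pong operation, assuming two in-memory buffers and illustrative reader/compute callables (none of these names come from the patent): while one buffer is being computed on, the other is being filled, and the two swap roles each round.

```python
import threading

class PingPong:
    """Two buffers alternate between being filled and being computed on."""
    def __init__(self):
        self.buffers = [None, None]
        self.fill_idx = 0          # buffer the reader writes this round

    def swap(self):
        self.fill_idx ^= 1

ROUNDS = 4
pp = PingPong()
# The barrier's action swaps the buffers once both sides finish a round.
barrier = threading.Barrier(2, action=pp.swap)

def reader():
    for r in range(ROUNDS):
        pp.buffers[pp.fill_idx] = [f"chunk{r}"]  # stand-in for a data read
        barrier.wait()

def computer():
    for _ in range(ROUNDS):
        barrier.wait()                        # buffers have just swapped
        batch = pp.buffers[pp.fill_idx ^ 1]   # compute on the filled buffer
        print("computing on", batch)          # overlaps with the next read

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=computer)
t1.start(); t2.start(); t1.join(); t2.join()
```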
  • FIG. 6 is a schematic diagram of encapsulation of task data in one embodiment.
  • after the CPU 110 obtains the task data, it encapsulates the task data into FPGA thread tasks.
  • FPGA thread tasks include task data and FPGA execution instructions.
  • FPGA execution instructions include write instructions, read instructions, and start instructions.
  • the FPGA execution instruction is used to call the FPGA unit to process task data.
  • the CPU 110 places the encapsulated FPGA thread task into a thread pool, and the task data processing device 130 obtains the FPGA thread task from the thread pool and reads the task data from the FPGA thread task.
  • the task data processing device 130 may also read the FPGA execution instruction from the FPGA thread task.
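  • The encapsulation can be pictured as a small record pairing the task data with its write/start/read instructions, submitted to a thread pool from which the device pulls work; the field names and the ThreadPoolExecutor stand-in below are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class FpgaThreadTask:
    """One unit of work handed to the task data processing device."""
    task_data: bytes
    # FPGA execution instructions as described above (names illustrative).
    instructions: tuple = ("write", "start", "read")

def run_fpga_thread_task(task: FpgaThreadTask) -> str:
    # A real device would issue these instructions to an FPGA unit; here we
    # only trace the order: write task data, start the kernel, read results.
    trace = "+".join(task.instructions)
    return f"{trace} on {len(task.task_data)} bytes"

pool = ThreadPoolExecutor(max_workers=4)      # the thread pool the CPU fills
futures = [pool.submit(run_fpga_thread_task, FpgaThreadTask(b"img%d" % i))
           for i in range(3)]
print([f.result() for f in futures])
pool.shutdown()
```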
  • the subtasks of the corresponding substructures are executed by the processing units corresponding to the substructures in sequence; at least some of the processing units include FPGA units; when a processing unit detects that it is in an idle state, the next task data is fetched through its thread, and the subtask corresponding to that task data is executed on the processing unit corresponding to the thread.
  • the task data processing apparatus 130 includes a plurality of processing units.
  • the multiple processing units include an FPGA unit and a CPU unit.
  • the machine learning model is a pre-trained data model for processing task data; each substructure in the machine learning model has a corresponding processing unit.
  • the processing unit is configured to execute a subtask corresponding to a substructure in a corresponding machine learning model.
  • the task data processing device 130 inputs a plurality of task data to a processing unit corresponding to the machine learning model, and processes the task data through the processing unit of the machine learning model.
  • for the multiple task data, the task data processing device 130 executes the corresponding subtasks through the processing units in the order of the substructures in the machine learning model; during the processing of each processing unit, it detects whether the previous task data has completed its subtask in the current substructure and whether the current task data has completed its subtask in the previous substructure; when both conditions are detected, that is, when the processing unit corresponding to the current substructure is in an idle state, the subtask of the current task data in the current substructure is started.
  • the task data processing device 130 inputs the task data into the processing unit corresponding to the first substructure in the machine learning model, and the processing unit corresponding to the first substructure executes the subtask corresponding to the first substructure according to the task data to obtain the first subtask data.
  • the task data processing device 130 inputs the first subtask data into the processing unit corresponding to the second substructure in the machine learning model, and the processing unit corresponding to the second substructure executes the subtask corresponding to the second substructure according to the first subtask data to obtain the second subtask data.
  • the task data processing device inputs the second sub-task data into the third sub-structure in the machine learning model until the task execution result output by the last sub-structure in the machine learning model is obtained.
  • the task data processing device 130 inputs the previous task data into the processing unit corresponding to the first substructure in the machine learning model, and the processing unit corresponding to the first substructure executes the subtask corresponding to the first substructure according to the previous task data to obtain the first subtask data corresponding to the previous task data.
  • the task data processing device 130 then inputs the current task data into the processing unit corresponding to the first substructure in the machine learning model, which executes the subtask corresponding to the first substructure according to the current task data; at the same time, the device inputs the first subtask data corresponding to the previous task data into the processing unit corresponding to the second substructure, which executes the subtask corresponding to the second substructure according to that data to obtain the second subtask data corresponding to the previous task data.
  • after the processing unit corresponding to the first substructure completes the subtask of the first substructure according to the current task data, obtaining the first subtask data corresponding to the current task data, and the second subtask data corresponding to the previous task data has been obtained, the task data processing device 130 inputs the first subtask data corresponding to the current task data into the processing unit corresponding to the second substructure, which executes the subtask corresponding to the second substructure according to that data to obtain the second subtask data corresponding to the current task data; meanwhile, the device inputs the second subtask data corresponding to the previous task data into the processing unit corresponding to the third substructure, and so on, until the last substructure in the machine learning model outputs the task execution result corresponding to the previous task data and then the task execution result corresponding to the current task data.
  • the steps of S504 include: for each task data, the task data processing device 130 reads the processing unit call data from the register, the processing unit call data having been written into the register by the CPU 110; and, according to the processing unit call data and the order of the substructures in the machine learning model, calls the processing unit corresponding to each substructure in turn to execute the subtasks of the corresponding substructures.
  • the processing unit call data is data required by the task data processing device 130 to call the processing unit.
  • the processing unit call data may include the processing unit identification, and may also include instructions used to call the processing unit.
  • the instructions used to call the processing unit may include at least one of a unit write instruction, a unit read instruction, and a unit execution instruction.
  • the processing unit call data corresponding to each task data is written into the register by the CPU 110.
  • the task data processing device 130 reads the processing unit call data corresponding to each task data from the register, extracts the processing unit identifier from the call data, and, using the processing units corresponding to the extracted identifiers, calls the processing unit corresponding to each substructure in turn, in the order of the substructures in the machine learning model, to execute the subtasks of the corresponding substructures.
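  • A sketch of how such processing unit call data might be decoded once read from the register; the word layout assumed here (unit identifier, opcode count, then the write/read/execute opcodes) is purely an illustrative encoding, not the patented format.

```python
from dataclasses import dataclass

@dataclass
class UnitCall:
    unit_id: int      # which processing unit (FPGA unit or CPU) to call
    opcodes: list     # unit write / read / execute instructions

def parse_call_data(register_words):
    """Assumed layout: [unit_id, n_opcodes, opcode, ...] repeated."""
    calls, i = [], 0
    while i < len(register_words):
        unit_id, n = register_words[i], register_words[i + 1]
        calls.append(UnitCall(unit_id, register_words[i + 2:i + 2 + n]))
        i += 2 + n
    return calls

# e.g. unit 0 gets write+execute+read, unit 1 gets execute only
words = [0, 3, 0x10, 0x20, 0x30, 1, 1, 0x20]
for call in parse_call_data(words):
    print(f"schedule unit {call.unit_id} with opcodes {call.opcodes}")
```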
  • the step of S504 further includes: when no suitable processing unit is in an idle state, waiting for the processing unit corresponding to the subtask of the current substructure to be released. For example, during the processing of each processing unit, when the previous task data has not yet completed the subtask of the current substructure while the current task data has completed the subtask of the previous substructure, the device waits for the processing unit corresponding to the subtask of the current substructure to be released; once it is released, that processing unit is called to execute the subtask of the current task data in the current substructure.
  • FIG. 7 is a schematic diagram of a multi-threaded task executed in parallel in one embodiment.
  • the task data processing device 130 reads thread task 1, thread task 2, and thread task 3 from the thread pool, and thread task 1, thread task 2, and thread task 3 are connected in order.
  • the output of FPGA unit 1 is used as the input of FPGA unit 2, and the output of FPGA unit 2 is used as the input of FPGA unit 3; that is, FPGA unit 1, FPGA unit 2, and FPGA unit 3 are connected in sequence. Each thread task can call each FPGA unit individually, so that different FPGA units can run different thread tasks at the same time, improving throughput.
  • FIGS. 8 and 9 are timing diagrams of parallel execution of multi-threaded tasks in one embodiment.
  • the thread tasks corresponding to the task data need to execute the corresponding subtasks through FPGA unit 1, FPGA unit 2, and FPGA unit 3 in sequence to obtain the task execution results of the task data.
  • the thread task 1 acquires the task data 1 and calls the FPGA unit 1 to execute the subtask 1 of the task data 1.
  • when FPGA unit 1 finishes executing subtask 1 of thread task 1, thread task 1 calls FPGA unit 2 to execute subtask 2, while thread task 2 obtains task data 2 and calls FPGA unit 1 to execute subtask 1 of thread task 2.
  • next, thread task 1 calls FPGA unit 3 to execute subtask 3, thread task 2 calls FPGA unit 2 to execute subtask 2, and thread task 3 acquires task data 3 and calls FPGA unit 1 to execute subtask 1. Then thread task 2 calls FPGA unit 3 to execute subtask 3 and thread task 3 calls FPGA unit 2 to execute subtask 2, at which point thread task 1 can obtain task data 4 and call FPGA unit 1 again to execute subtask 1, and so on, until every thread task obtains its task execution result.
  • the number of thread tasks can be set to n, where n is a positive integer.
  • the plurality of processing units may include a CPU 110 and an FPGA unit.
  • FIG. 10 is a schematic diagram of the CPU 110 and FPGA units processing tasks in parallel in one embodiment. As shown in FIG. 10, thread task 1, thread task 2, and thread task 3 call the processing units in the same order. Thread task 1 calls FPGA unit 1; after releasing FPGA unit 1, thread task 1 calls the CPU 110 while thread task 2 calls FPGA unit 1. When thread task 1 releases the CPU 110, it calls FPGA unit 2; when thread task 1 has released the CPU 110 and thread task 2 has released FPGA unit 1, thread task 2 calls the CPU 110 and thread task 3 calls FPGA unit 1. When thread task 1 releases FPGA unit 2, it calls FPGA unit 3; when thread task 1 has released FPGA unit 2 and thread task 2 has released the CPU 110, thread task 2 calls FPGA unit 2; when thread task 2 has released the CPU 110 and thread task 3 has released FPGA unit 1, thread task 3 calls the CPU 110. When thread task 1 releases FPGA unit 3, it waits for FPGA unit 1 to be released by thread task 3, and once FPGA unit 1 is released, it calls FPGA unit 1 again. Parallel processing of the thread tasks is thereby ensured until each of the thread tasks processed in parallel obtains its corresponding task execution result.
  • for each task data, after the task data processing device 130 detects that the subtasks of every substructure in the machine learning model have been executed, it obtains the task execution result output by the processing unit corresponding to the last substructure, thereby obtaining the task execution result corresponding to each task data.
  • the sub-tasks of the corresponding sub-structure are sequentially executed by the processing unit corresponding to each sub-structure.
  • the processing unit corresponds to a sub-structure of a machine learning model, and at least part of the processing unit includes an FPGA unit.
  • when the previous task data has completed its subtask in the current substructure and the current task data has completed its subtask in the previous substructure, the current task data starts to execute its subtask in the current substructure.
  • each processing unit processes sub-tasks of multiple task data in parallel, so that a machine learning model can process multiple task data in parallel in a low-cost and low-power structure, thereby improving the processing efficiency of task data.
  • FIG. 11 is an application environment diagram of a task data processing method in an embodiment.
  • Figure 11 includes a terminal, a distributed server host, and a distributed server slave.
  • the terminal is connected to the distributed server host through the network.
  • the distributed server host is connected to one or more distributed server slaves through the network.
  • the distributed server slave is provided with a thread pool and a processing unit scheduler.
  • the board interface of the distributed server slave is connected with a task data processing device 130, and the task data processing device 130 is provided with an FPGA unit.
  • the distributed server slave processes task data in real time by executing the processing unit scheduler: when executing the scheduler, it reads task data from the thread tasks in the thread pool and executes the thread tasks according to the task data.
  • FIG. 12 is an internal environment diagram of a distributed server slave in an embodiment.
  • the distributed server slave is provided with a thread pool and a processing unit scheduler.
  • the distributed server slave carries out the task data processing method in real time by executing the processing unit scheduler.
  • when the processing unit scheduler is executed, task data is obtained from the thread tasks in the thread pool, and the FPGA units and the CPU 110 are scheduled, in the order of the subtasks in the scheduler, to execute the corresponding subtasks according to the task data; the processing unit scheduler can process multiple thread tasks in parallel. After processing by the scheduler, the task execution results of the thread tasks are obtained and returned to the corresponding thread tasks, and the distributed server slave returns them to the distributed server host.
  • the processing unit scheduler includes n subtasks, where n is a positive integer.
  • a method for processing task data is provided, which is applied to a distributed server host.
  • the method includes the following:
  • the distributed server host receives the task data sent by the terminal; determines the distributed server slave address allocated for the task data; and sends the task data to the distributed server slave according to the allocated distributed server slave address.
  • the task data may be image processing task data.
  • the distributed server host may allocate a distributed server slave to the task data according to the working status of each distributed server slave; correspondingly, the step of determining the distributed server slave address allocated for the task data may be: according to the working status of each distributed server slave, the distributed server host selects a distributed server slave in the idle state from the distributed server slaves and determines the address of the selected slave.
  • the distributed server host may also assign a distributed server slave to the task data according to the type of the task data; correspondingly, the step of determining the distributed server slave address assigned to the task data may be: according to the type of the task data, the distributed server host selects a distributed server slave that handles that type from the distributed server slaves and determines the address of the selected slave.
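  • Both allocation strategies can be sketched together as below; the slave records, addresses, and task types are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Slave:
    address: str
    busy: bool
    task_types: set   # task types this slave is configured to handle

def pick_slave(slaves, task_type=None):
    """Idle-state strategy when task_type is None, type-based otherwise."""
    for s in slaves:
        if s.busy:
            continue
        if task_type is None or task_type in s.task_types:
            return s.address
    return None  # no suitable slave; the caller may queue or retry

slaves = [Slave("10.0.0.2", True,  {"image"}),
          Slave("10.0.0.3", False, {"image", "text"})]
print(pick_slave(slaves))           # -> 10.0.0.3 (first idle slave)
print(pick_slave(slaves, "image"))  # -> 10.0.0.3 (idle and right type)
```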
  • the distributed server slave puts task data into the thread pool and obtains multiple task data from the thread pool; for each task data, in the order of the substructures in the machine learning model, the processing unit corresponding to each substructure executes the subtask of the corresponding substructure in turn; at least some of the processing units include FPGA units; during the processing of each processing unit, when a processing unit is in an idle state, the subtask corresponding to the next task data is executed.
  • the machine learning model may be an image processing model.
  • when the distributed server slave receives the task data sent by the distributed server host, it puts the task data into the thread pool; when processing the task data, the distributed server slave obtains multiple task data from the thread pool.
  • the distributed server host may instruct the distributed server slave to process the task data; correspondingly, the distributed server slave obtains multiple task data from the thread pool only when it receives the FPGA execution instruction sent by the distributed server host.
  • the subtasks of the corresponding substructures can be executed by FPGA units: the distributed server slave executes the subtask of each substructure through the FPGA unit corresponding to that substructure, in the order of the substructures in the machine learning model; during the processing of the FPGA units, when an FPGA unit is in an idle state, the subtask corresponding to the next task data is executed.
  • the FPGA unit obtains the corresponding task execution result and stores it in the memory; the CPU 110 reads the task execution result from the memory and returns it to the distributed server host.
  • the distributed server host receives the task execution result returned from the distributed server slave, and sends the returned task execution result to the terminal.
  • the task data processing device 130 obtains multiple image processing task data. For each image processing task data, in the order of the substructures in the image processing model, the processing units corresponding to the substructures sequentially execute the image processing subtasks of the corresponding substructures. During the processing of each processing unit, after the previous image processing task data has finished the image processing subtask of the current substructure and the current image processing task data has finished the image processing subtask of the previous substructure, the current image processing task data starts to execute the image processing subtask in the current substructure.
  • each processing unit processes the image processing subtasks of multiple image processing task data in parallel, so that the image processing model can process multiple image processing task data in parallel in a low-cost, low-power structure, thereby improving the processing efficiency of image processing task data.
  • the task data is image processing task data; the machine learning model is an image processing model; and the task execution result is an image processing result.
  • the image processing result may be an image recognition result; the image recognition result may be a text recognized from an image.
  • an embodiment of the present application provides a system for processing image data.
  • the system for processing image data includes an access layer connected to a terminal, a distributed server host and one or more distributed server slaves located at the system layer, and an algorithm layer.
  • the distributed server slave is connected to the algorithm layer through an interface (such as API (Application Programming Interface)).
  • the thread pool is set in the distributed server slave.
  • the algorithm layer sets up a machine learning model.
  • the processing unit includes a CPU and an FPGA unit.
  • the terminal accesses the distributed server host through the access layer.
  • the distributed server host interacts with the distributed server slaves at the system level.
  • the distributed server slave calls the algorithm in the caffe-FPGA.so file of the algorithm layer through the API interface (board interface), and calls the FPGA units and the CPU to process the task data according to the algorithm in the caffe-FPGA.so file.
  • the thread pool of the slave server of the distributed server includes multiple thread tasks, and each thread task calls FPGA and CPU to process the thread tasks in parallel according to caffe-FPGA.so.
  • the embodiment of the present application is a software architecture for caffe-based OCR (Optical Character Recognition) scene text detection FPGA acceleration.
  • caffe is modified: classes supporting FPGA unit calls and parallel calls to multiple FPGA units are added so that caffe supports a multi-thread concurrency mechanism;
  • caffe is packaged into a caffe-FPGA.so file, and an API is added to support the algorithm.
  • caffe-FPGA.so is hosted under a distributed server architecture to schedule the FPGA units, thereby realizing parallel processing of thread tasks by the FPGA units.
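  • Hosting such a shared library usually means loading it and invoking its exported API. The sketch below uses Python's ctypes with a wholly hypothetical symbol name and signature, since the patent does not publish the caffe-FPGA.so interface.

```python
import ctypes

# Path and exported symbol are assumptions for illustration only; the real
# caffe-FPGA.so API added by the authors is not described in this text.
lib = ctypes.CDLL("./caffe-FPGA.so")

# Suppose the added API exposed one call that runs a thread task end to end.
lib.run_ocr_task.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
lib.run_ocr_task.restype = ctypes.c_int

with open("scene.jpg", "rb") as f:
    image = f.read()
status = lib.run_ocr_task(image, len(image))
print("task returned", status)
```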
  • the machine learning model includes a convolutional layer, a Region Proposal Network (RPN), a pooling layer, a fully connected layer, a first classification layer, and the like.
  • the output of the convolutional layer is connected to the input of the RPN; the output of the RPN is connected to the input of the pooling layer; the output of the pooling layer is connected to the input of the fully connected layer; and the output of the fully connected layer is connected to the input of the first classification layer.
  • the first classification layer is used to output an image processing result.
  • the RPN further includes an RPN convolution layer, a second classification layer, a candidate region determination layer (Proposals), and an NMS (Non Maximum Suppression) layer.
  • the output of the convolutional layer is connected to the input of the RPN convolutional layer; the output of the RPN convolutional layer is connected to the input of the second classification layer; the output of the second classification layer is connected to the input of the candidate region determination layer; the output of the candidate region determination layer is connected to the input of the NMS layer; and the output of the NMS layer is connected to the input of the pooling layer.
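  • The wiring just described can be written down as an ordered stage list, which is also the order in which the scheduler walks the substructures. The unit assignments below follow the later embodiment where it states them (convolution, classification, NMS, and pooling on FPGA units; candidate region selection on the CPU) and are otherwise assumptions; the dispatch itself is a placeholder.

```python
# Substructure order per the description above; each entry is
# (substructure name, processing unit assumed to execute it).
PIPELINE = [
    ("convolution",           "FPGA"),
    ("rpn_convolution",       "FPGA"),   # unit not stated; assumed
    ("second_classification", "FPGA"),   # unit not stated; assumed
    ("proposals",             "CPU"),    # candidate region selection
    ("nms",                   "FPGA"),
    ("pooling",               "FPGA"),
    ("fully_connected",       "FPGA"),   # unit not stated; assumed
    ("first_classification",  "FPGA"),
]

def run_pipeline(picture, stages=PIPELINE):
    data = picture
    for name, unit in stages:
        # Placeholder: dispatch `data` to the unit handling this stage.
        data = {"stage": name, "unit": unit, "payload": data}
    return data  # result output by the first classification layer

print(run_pipeline("picture to be processed")["stage"])
```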
  • the image processing subtasks of the corresponding substructures are sequentially executed by the processing units corresponding to the substructures; this includes, for each substructure, steps for processing the image processing task data, which specifically include:
  • the image processing task data is input to an FPGA unit corresponding to a convolution layer substructure in the image processing model, and a convolution result is obtained.
  • the image processing model is a data model trained in advance to process image processing task data according to image data.
  • the image processing model includes a convolutional layer substructure. According to the order of the substructures in the image processing model, the convolutional layer substructure can be the first substructure in the image processing model.
  • the processing unit corresponding to the convolution layer substructure is an FPGA unit.
  • the task data processing device 130 reads the processing unit call data from the register, extracts the processing unit identifier from the call data, determines the FPGA unit corresponding to the convolutional layer substructure in the image processing model according to the identifier, and sends a task execution notification to the FPGA unit corresponding to the convolutional layer substructure.
  • when the FPGA unit corresponding to the convolutional layer substructure receives the task execution notification, it reads the image processing task data from the memory and performs the convolution processing corresponding to the convolutional layer substructure on the image processing task data to obtain the convolution result of the image processing task data.
  • the FPGA unit corresponding to the convolutional layer substructure reads the model parameters of the image processing model from the memory and configures itself according to the read model parameters, so that it performs convolution processing on the image processing task data based on the model parameters to obtain the convolution result of the image processing task data.
  • the convolution result is stored in the memory.
  • the CPU 110 writes the model parameters to the memory through a DMA write operation over PCIE (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard).
  • the convolution result is sent to the CPU 110, and the CPU 110 performs the task of selecting the candidate region corresponding to the candidate network substructure in the image processing model to obtain the region selection result.
  • the image processing model includes a candidate network substructure; according to the order of the substructures in the image processing model, the candidate network substructure can be the second substructure in the image processing model.
  • the processing unit corresponding to the candidate network substructure is a CPU unit.
  • the subtask corresponding to the candidate network substructure is a candidate region selection task.
  • the candidate region selection task is used to select a candidate region to be processed in an image corresponding to the image processing task data.
  • the candidate region to be processed may be a region including the text.
  • when the task data processing device 130 detects that the FPGA unit corresponding to the convolutional layer substructure has stored the convolution result in the memory, it sends a candidate region selection task execution notification to the CPU 110. After receiving the notification, the CPU 110 reads the convolution result from the memory, executes the candidate region selection task corresponding to the candidate network substructure according to the convolution result, obtains the region selection result, and stores the region selection result in the memory.
  • the CPU 110 reads the model parameters of the image processing model from the memory, configures the candidate network substructure according to the model parameters, and executes the candidate region selection task based on the convolution result through the configured candidate network substructure to obtain the region selection result.
  • S1506: perform classification processing on the region selection result through the FPGA unit corresponding to the classification substructure in the image processing model to obtain the classification result.
  • the image data processing model includes a classification substructure.
  • the classification substructure may be the third substructure in the image processing model.
  • after the task data processing device 130 detects that the CPU 110 has stored the region selection result in the memory, it sends a task execution notification to the FPGA unit corresponding to the classification substructure.
  • the FPGA unit corresponding to the classification substructure receives the task execution notification, it reads the region selection result from the memory, performs classification processing on the read region selection result, obtains the classification result, and stores the classification result in the memory.
  • the FPGA unit corresponding to the classification substructure reads model parameters of the image processing model from the memory, configures the classification substructure according to the model parameters, and performs classification processing on the region selection structure through the classification substructure to obtain the classification result.
  • S1508: The CPU 110 determines the task execution result corresponding to the image processing task data according to the classification result.
  • the task data processing device 130 sends a task result determination notification to the CPU 110 when detecting that the FPGA unit corresponding to the classification substructure stores the classification result in the memory.
  • when the CPU 110 receives the task result determination notification, it reads the classification result from the memory and extracts the task execution result corresponding to the image processing task data according to the classification result.
  • the task execution result corresponding to the image processing task data can be the image recognition result.
  • In one embodiment, S1506 further includes the step of obtaining the classification result, which specifically includes the following:
  • S1602: Invoke the FPGA unit corresponding to the non-maximum suppression substructure in the image processing model to perform non-maximum suppression processing on the region selection result, obtaining the non-maximum suppression result.
  • the image processing model also includes a non-maximum suppression substructure.
  • the processing unit corresponding to the non-maximum suppression sub-structure is an FPGA unit.
  • the sub-task corresponding to the non-maximum suppression sub-structure is a non-maximum suppression processing task, and the non-maximum suppression result is a processing result corresponding to the non-maximum suppression processing task.
  • When the task data processing device 130 detects that the CPU 110 has stored the region selection result in the memory, it sends a task execution notification to the FPGA unit corresponding to the non-maximum suppression substructure.
  • When the FPGA unit corresponding to the non-maximum suppression substructure receives the task execution notification, it reads the region selection result from the memory and performs non-maximum suppression processing on it, obtaining the non-maximum suppression result and storing it in the memory.
  • the FPGA unit corresponding to the non-maximum suppression substructure reads the model parameters of the image processing model from the memory, configures the non-maximum suppression substructure according to the model parameters, and performs non-maximum suppression processing on the region selection result through the configured substructure to obtain the non-maximum suppression result.
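For reference, greedy non-maximum suppression over axis-aligned boxes works as sketched below. This is the standard NumPy formulation, not the patent's FPGA implementation; the rotated-box variant used later for arbitrary-angle text boxes would additionally compute IoU between rotated rectangles.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that survive suppression.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop every box that overlaps box i at or above the threshold.
        order = order[1:][iou < iou_threshold]
    return keep
```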
  • S1604: Perform pooling processing on the non-maximum suppression result through the FPGA unit corresponding to the pooling layer substructure in the image processing model to obtain the pooling result.
  • the image data processing model also includes a pooling layer substructure, and the processing unit corresponding to the pooling layer substructure is an FPGA unit.
  • the subtask corresponding to the pooling layer substructure is the pooling layer subtask, and the processing result corresponding to the pooling layer subtask is the pooling result.
  • When the task data processing device 130 detects that the FPGA unit corresponding to the non-maximum suppression substructure has stored the non-maximum suppression result in the memory, it sends a task execution notification to the FPGA unit corresponding to the pooling layer substructure.
  • When the FPGA unit corresponding to the pooling layer substructure receives the task execution notification, it reads the non-maximum suppression result from the memory and executes the pooling layer subtask on it, that is, it performs pooling processing on the non-maximum suppression result to obtain the pooling result, and stores the pooling result in the memory.
  • the FPGA unit corresponding to the pooling layer substructure reads the model parameters of the image processing model from the memory, configures the pooling layer substructure according to the model parameters, and performs pooling processing on the non-maximum suppression result through the configured substructure to obtain the pooling result.
  • S1606: Input the pooling result into the FPGA unit corresponding to the fully connected layer substructure in the image processing model to obtain the classification result.
  • the image processing model also includes a fully connected layer substructure, and the processing unit corresponding to the fully connected layer substructure is an FPGA unit.
  • the sub-task corresponding to the fully-connected layer substructure is a fully-connected processing task, and the processing result corresponding to the fully-connected processing task is a classification result.
  • When the task data processing device 130 detects that the FPGA unit corresponding to the pooling layer substructure has stored the pooling result in the memory, it sends a task execution notification to the FPGA unit corresponding to the fully connected layer substructure.
  • When the FPGA unit corresponding to the fully connected layer substructure receives the task execution notification, it reads the pooling result from the memory, executes the fully connected processing task according to the pooling result, obtains the classification result, and stores the classification result in the memory.
  • the FPGA unit corresponding to the fully connected layer substructure reads the model parameters of the image processing model from the memory, configures the fully connected layer substructure according to the read parameters, and performs fully connected processing on the pooling result through the configured substructure to obtain the classification result.
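The fully connected step that turns pooled features into a classification reduces to a matrix product followed by softmax. A minimal NumPy sketch (dimensions and names are illustrative assumptions, not the patent's configuration):

```python
import numpy as np

def fully_connected_classify(pooled, weights, bias):
    """pooled: (N, D) pooled features; weights: (D, C); bias: (C,).
    Returns per-box class probabilities and the argmax class index."""
    logits = pooled @ weights + bias
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax over classes
    return probs, probs.argmax(axis=1)

# Toy usage: 3 text boxes, 128-dim pooled features, 10 text classes.
rng = np.random.default_rng(0)
pooled = rng.normal(size=(3, 128))
W, b = rng.normal(size=(128, 10)), np.zeros(10)
probs, labels = fully_connected_classify(pooled, W, b)
print(labels)   # one predicted class per box
```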
  • the image processing result may be an image recognition result.
  • the image recognition result may be an OCR recognition result or an image target recognition result.
  • the distributed server slave obtains the picture to be processed and encapsulates it into a thread task.
  • the thread task calls the FPGA unit corresponding to the basic convolution layer, and performs convolution processing on the pictures to be processed to obtain text features.
  • the thread task inputs the text feature into the candidate region generation network.
  • the thread task calls the FPGA unit corresponding to the RPN convolution to perform RPN convolution processing on the text features, obtaining the convolution result corresponding to the text features.
  • the thread task calls the FPGA unit corresponding to classification, and performs classification processing on the convolution result corresponding to the text features to obtain the classification result.
  • the thread task calls the CPU 110 to determine candidate text boxes according to the classification result and to perform regression adjustment on the determined candidate text boxes.
  • Candidate text boxes at arbitrary angles are obtained; the FPGA unit corresponding to non-maximum suppression is then called to process the overlapping candidate text boxes, obtaining non-overlapping candidate text boxes at arbitrary angles.
  • the thread task calls the FPGA unit corresponding to rotated region-of-interest pooling to pool the arbitrary-angle candidate text boxes output by the candidate region generation network.
  • the arbitrary-angle candidate text boxes are rotated and adjusted.
  • the rotation-adjusted candidate text boxes are projected onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box.
  • the thread task calls the FPGA unit corresponding to the recognition result output to recognize the text in the text box feature map, and outputs the text recognition result.
  • different processing units have different input/output parallelism: the FPGA unit corresponding to the convolution layer uses a parallelism of 32 inputs and 32 outputs; the FPGA unit corresponding to classification uses 16 inputs and 64 outputs; the FPGA unit corresponding to the RPN convolution uses 8 inputs and 8 outputs, which improves processing efficiency.
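Read as (input lanes, output lanes), these parallelism figures fix how many multiply-accumulates each engine can issue per cycle. A small illustrative table in Python (the engine names are descriptive labels, not identifiers from the patent):

```python
# Hypothetical per-engine parallelism table based on the figures above:
# (input lanes, output lanes) per clock cycle.
PARALLELISM = {
    "base_convolution": (32, 32),
    "classification":   (16, 64),
    "rpn_convolution":  (8, 8),
}

def macs_per_cycle(engine):
    cin, cout = PARALLELISM[engine]
    return cin * cout   # one multiply-accumulate per (input lane, output lane) pair

for name in PARALLELISM:
    print(f"{name}: {macs_per_cycle(name)} MACs/cycle")
# base_convolution: 1024, classification: 1024, rpn_convolution: 64
```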
  • a picture processing method is provided to implement the above-mentioned OCR recognition.
  • the method specifically includes the following:
  • the picture to be processed is a picture to be processed for text recognition.
  • Text recognition can be performed by using OCR (Optical Character Recognition) technology to recognize text in a picture.
  • when the terminal needs to recognize the picture to be processed, it sends a text recognition request to the CPU 110; the CPU 110 receives the text recognition request and obtains the picture to be processed according to the request.
  • the text recognition request carries a picture to be processed.
  • the step of the CPU 110 obtaining the picture to be processed according to the text recognition request may be: the CPU 110 obtains the picture to be processed from the text recognition request.
  • the text recognition request carries a picture identifier of a picture to be processed.
  • correspondingly, the step of the CPU 110 obtaining the picture to be processed according to the text recognition request may be: the CPU 110 parses the text recognition request, extracts the picture identifier from it, and reads the picture to be processed from the memory according to the picture identifier.
  • the picture identifier may be a storage address of the picture in the memory.
  • the picture to be processed can be of any size; the picture processing method of this application can therefore support pictures to be processed of different sizes, configuring adaptively for each size, with support for pictures up to 1024×1024.
  • the text feature is a feature representing text in a picture to be processed.
  • when the CPU 110 obtains the picture to be processed, it performs convolution processing on the picture and extracts the text features in it through the convolution processing.
  • the steps of S1804 include the following: the CPU 110 inputs the picture to be processed into the convolution layer; and performs convolution processing on the picture to be processed according to the convolution kernel of the convolution layer to obtain the text features of the picture to be processed.
  • the CPU 110 inputs the picture to be processed into the convolutional layer of the machine learning model, performs convolution processing on the picture to be processed through the convolution kernel of the convolution layer, and obtains text features of the picture to be processed through the convolution processing.
  • the machine learning model may be a picture processing model for performing text recognition processing on a picture to be processed.
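As a reference point, the convolution that extracts features reduces to sliding a kernel over the image and summing elementwise products. A minimal single-channel NumPy sketch (illustrative only; the real model uses many stacked, multi-channel convolution layers):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Valid 2-D convolution of one channel; returns the feature map."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # one output activation
    return out

# Toy usage: a 3x3 vertical-edge kernel over a random grayscale "picture".
img = np.random.default_rng(1).random((8, 8))
k = np.array([[-1.0, 0.0, 1.0]] * 3)
features = conv2d_single(img, k)
print(features.shape)   # (6, 6)
```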
  • S1806: Determine candidate text boxes at arbitrary angles in the picture to be processed according to the text features.
  • the candidate text box is an area box including text in the picture to be processed.
  • when the CPU 110 extracts the text features of the picture to be processed, it determines candidate text boxes containing text in the picture according to the text features.
  • a candidate text box may be at an arbitrary angle: a horizontal, vertical, or inclined angle.
  • when the CPU 110 extracts text features from the picture to be processed, it inputs them into the candidate region generation network (RPN, Region Proposal Network) of the machine learning model, which determines the candidate text boxes at arbitrary angles in the picture according to the text features.
  • the candidate region generation network can be a rotation candidate region generation network (RRPN, Rotation RPN).
  • the RRPN algorithm improves accuracy. Because the RRPN algorithm has a complicated flow and runs slowly on the CPU 110, in the embodiments of this application the accelerator architecture covers the most time-consuming parts of the algorithm, greatly improving overall computing efficiency: compared with the CPU 110 software version, it achieves a more than tenfold speedup, its throughput is 1.4 times that of the GPU, and cost is reduced to 30%.
  • S1808: Pool the candidate text boxes, projecting each candidate text box onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box.
  • the CPU 110 pools the arbitrary-angle candidate text boxes and, through the pooling processing, projects them onto a fixed-size feature map to obtain same-size text box feature maps corresponding to the candidate text boxes.
  • the CPU 110 inputs the arbitrary-angle candidate text boxes into the pooling layer of the machine learning model, which projects each candidate text box onto a fixed-size feature map to obtain the fixed-size text box feature map corresponding to each candidate text box.
  • the pooling layer may be a Rotation Region of Interest (RROI) pooling layer.
  • S1808 further includes: inputting each candidate text box into the pooling layer; determining the projection parameters of each candidate text box according to the fixed size of a preset feature map; and projecting each candidate text box onto a fixed-size feature map according to the projection parameters to obtain the text box feature map corresponding to each candidate text box.
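The projection parameters for a rotated candidate box can be derived from its center, size, and angle: each output bin of the fixed-size map corresponds to one step along the box's rotated width and height axes. A geometric sketch in Python (the names, box encoding, and degree convention are assumptions for illustration):

```python
import numpy as np

def rroi_projection(box, out_h, out_w):
    """box = (cx, cy, w, h, angle_deg): a rotated candidate text box.
    Returns the box's top-left corner plus per-bin step vectors that map the
    fixed out_h x out_w grid back onto the rotated box."""
    cx, cy, w, h, angle = box
    t = np.deg2rad(angle)
    # Unit vectors along the box's width and height after rotation.
    wx, wy = np.cos(t), np.sin(t)
    hx, hy = -np.sin(t), np.cos(t)
    step_w = (w / out_w) * np.array([wx, wy])   # one bin along the width
    step_h = (h / out_h) * np.array([hx, hy])   # one bin along the height
    top_left = (np.array([cx, cy])
                - (w / 2) * np.array([wx, wy])
                - (h / 2) * np.array([hx, hy]))
    return top_left, step_w, step_h

def bin_center(top_left, step_w, step_h, i, j):
    """Image-space sampling point for output bin (i, j)."""
    return top_left + (j + 0.5) * step_w + (i + 0.5) * step_h

# Toy usage: project a 30-degree box onto a 7x7 feature map.
tl, sw, sh = rroi_projection((50.0, 40.0, 28.0, 14.0, 30.0), 7, 7)
print(bin_center(tl, sw, sh, 0, 0))   # sampling point of the first bin
```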
  • the CPU 110 recognizes the text in each text box feature map, and obtains the text recognition result corresponding to each text box feature map through recognition.
  • the CPU 110 inputs each text box feature map into an output layer of a machine learning model, and performs OCR recognition on the text box feature map through the output layer to obtain a text recognition result corresponding to each text box feature map.
  • candidate text boxes at arbitrary angles in the picture to be processed are determined, so candidate text boxes at different angles can be recognized.
  • pooling the candidate text boxes and projecting them onto fixed-size feature maps yields the text box feature map of each candidate text box; recognizing the text in the text box feature maps then gives the text recognition result of each candidate text box, improving the accuracy and efficiency of text recognition in the picture to be processed.
  • the picture processing method further includes the following: inputting the picture to be processed into the machine learning model through at least one thread, and executing the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units include FPGA units.
  • when each subtask is executed, one of the following steps is performed: extracting the text features in the picture to be processed; determining candidate text boxes at arbitrary angles in the picture according to the text features; pooling each candidate text box and projecting each onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box; recognizing the text in the text box feature maps to obtain the text recognition result.
  • the subtasks of the picture processing method are carried out by the processing units, so the subtasks of multiple pictures to be processed can be handled in parallel, and some of the subtasks are carried out by the FPGA units among the processing units, which improves the efficiency of text recognition in the pictures to be processed.
  • S1806 specifically includes the following: inputting the text features into the candidate region generation network; performing convolution processing on the text features through the candidate region convolution layer in that network to obtain the text feature convolution result; determining the position information of each candidate text box in the picture to be processed according to the text feature convolution result; and performing non-maximum suppression processing on the position information of each candidate text box to obtain candidate text boxes at arbitrary angles.
  • the CPU 110 inputs the text features into the candidate region generation network and performs convolution processing on them through the candidate region convolution layer in that network.
  • the text feature convolution result is obtained through the convolution processing; according to it, the candidate text boxes are determined in the picture to be processed and their position information obtained, and the CPU 110 performs non-maximum suppression processing on the position information of each candidate text box to obtain candidate text boxes at arbitrary angles.
  • performing non-maximum suppression processing on the position information of each candidate text box to obtain candidate text boxes at arbitrary angles includes: determining the arbitrary-angle candidate text boxes in the picture to be processed according to the position information of the candidate text boxes; determining the overlapping candidate text boxes; and performing non-maximum suppression processing on the overlapping candidate text boxes to obtain non-overlapping candidate text boxes at arbitrary angles.
  • the CPU 110 determines the arbitrary-angle candidate text boxes in the picture to be processed according to the position information of the candidate text boxes, filters the overlapping boxes from among them, and performs non-maximum suppression processing on the overlapping candidate text boxes to obtain non-overlapping candidate text boxes at arbitrary angles.
  • the candidate region generation network can determine candidate text boxes at arbitrary angles in the picture to be processed, and performing non-maximum suppression processing on them yields non-overlapping candidate text boxes at arbitrary angles, improving the accuracy of determining candidate text boxes.
  • the machine learning model includes a fully connected layer connected to the pooling layer; S1810 specifically includes the following: inputting the text box feature map into the fully connected layer; determining the probability value corresponding to each text classification from the text feature map; and selecting the text classification corresponding to the maximum probability value as the text recognition result of the text feature map.
  • after the CPU 110 obtains the text box feature maps corresponding to the candidate text boxes, it inputs them into the fully connected layer of the machine learning model, which processes each text box feature map to obtain the probability value corresponding to each text classification.
  • the maximum probability value is determined among the probability values, and the text classification corresponding to the maximum probability value is selected as the text recognition result of the text feature map.
  • the multiple processing units include FPGA units and a CPU; there are multiple pictures to be processed. The current picture to be processed is input into the FPGA unit corresponding to the convolution layer for processing to obtain its text features; the text features are input into the CPU corresponding to the candidate region generation network for processing to determine candidate text boxes at arbitrary angles; the text box feature map corresponding to each candidate text box is determined through the FPGA unit corresponding to the pooling layer according to the arbitrary-angle candidate text boxes; and the text in the text box feature maps is recognized through the FPGA unit corresponding to the recognition result layer to obtain the text recognition result. During the processing of each processing unit, when a processing unit is idle, the subtask corresponding to the next picture to be processed is executed in parallel.
  • subtasks corresponding to the substructures in the machine learning model are executed by some of the FPGA units and the CPU; when multiple pictures to be processed are obtained, the FPGA units and the CPU execute the subtasks corresponding to each picture in parallel, so that the subtasks corresponding to the pictures to be processed can be processed in parallel, improving the processing efficiency of multiple pictures to be processed.
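The stage-parallel behavior described here can be mimicked in software with one worker per processing unit and queues between stages, so a unit that finishes picture i immediately starts picture i+1. An illustrative Python sketch (a thread-based simulation, not the FPGA scheduler itself; stage names and functions are assumptions):

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """One processing unit: consume a task, run its subtask, pass it on."""
    def run():
        while True:
            task = inbox.get()
            if task is None:          # shutdown sentinel
                outbox.put(None)
                return
            outbox.put(fn(task))
    threading.Thread(target=run, daemon=True).start()

convolve = lambda t: t + ["conv"]    # stand-in for the FPGA convolution unit
propose  = lambda t: t + ["rpn"]     # stand-in for the CPU proposal step
classify = lambda t: t + ["cls"]     # stand-in for the FPGA classification unit

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stage(convolve, q0, q1)
stage(propose, q1, q2)
stage(classify, q2, q3)

for i in range(3):                   # three pictures in flight at once
    q0.put([f"pic{i}"])
q0.put(None)
while (result := q3.get()) is not None:
    print(result)                    # e.g. ['pic0', 'conv', 'rpn', 'cls']
```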
  • FIG. 19 is a schematic flowchart of text recognition in an embodiment.
  • basic convolution processing of the to-be-processed picture 1902 yields its text features, which are processed through the candidate region generation network (RPN) to obtain candidate text boxes at arbitrary angles in the to-be-processed picture 1902.
  • the rotated region-of-interest pooling layer adjusts the arbitrary-angle candidate text boxes to obtain a fixed-size text box feature map corresponding to each candidate text box.
  • the recognition result output is used to identify the text box feature map and output a text recognition result 1904.
  • the white box in the text recognition result 1904 is the recognized text.
  • FIG. 20 is a text recognition result obtained by recognizing the text in an advertisement picture; the black boxes contain the recognized text.
  • FIG. 21 is a text recognition result obtained by recognizing the text in a natural scene picture; the white boxes contain the recognized text.
  • FIGS. 22 and 23 are text recognition results obtained by recognizing the text in game pictures; the black boxes contain the recognized text.
  • FIG. 24 is a text recognition result obtained by recognizing the text in a bank card picture; the digits in the gray boxes are the recognized text.
  • subtasks corresponding to the substructures in the machine learning model are executed by some of the FPGA units, and when multiple pieces of task data are obtained, the FPGA units execute the subtasks corresponding to each piece of task data in parallel, so that those subtasks can be processed in parallel, which improves the processing efficiency of multiple pieces of task data.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Advance Control (AREA)

Abstract

This application relates to a picture processing method, a task data processing method, and an apparatus. The method is implemented by processing units corresponding to the substructures of a machine learning model executing the corresponding subtasks, at least some of the processing units including FPGA units, and includes: obtaining a picture to be processed; extracting text features from the picture to be processed; determining candidate text boxes at arbitrary angles in the picture according to the text features; performing rotated region-of-interest pooling on each candidate text box and projecting each candidate text box onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box; and recognizing the text in the text box feature maps to obtain a text recognition result. Through the FPGA architecture, this application can process data in parallel to implement the above picture processing method, improving the accuracy and efficiency of text recognition in pictures to be processed while reducing cost and power consumption.

Description

Picture processing method, task data processing method and apparatus
This application claims priority to Chinese Patent Application No. 201810980841.1, filed on August 27, 2018 and entitled "Picture processing method, task data processing method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of data processing, and in particular to a picture processing method, a task data processing method, and an apparatus.
Background
With the rapid development of computing technology, more and more data needs to be processed by computers, and the rapid growth of data volume places ever higher demands on data processing efficiency. In the field of scene text recognition, for example, text detection is the precondition of scene text recognition; the problem to be solved is how to accurately locate and recognize text in cluttered, highly varied, and complex scene pictures. Because of the complexity of backgrounds, the variability of illumination, and the unpredictability of fonts, text detection faces great challenges.
In terms of hardware, for example, picture data is usually processed and text detection performed by a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The CPU processes serially: it must wait for the previous task data to be processed and its execution result obtained before proceeding to the next task data, which, relative to the large volume of task data, makes task data processing inefficient. Processing by GPU, in turn, is costly and consumes a great deal of power.
Summary
On this basis, it is necessary to provide a picture processing method, a task data processing method, and an apparatus to address the problems of conventional methods.
A picture processing method, implemented by processing units corresponding to the substructures of a machine learning model executing the corresponding subtasks, at least some of the processing units including field-programmable gate array (FPGA) units, the method including:
obtaining a picture to be processed;
extracting text features from the picture to be processed;
determining candidate text boxes at arbitrary angles in the picture to be processed according to the text features;
performing rotated region-of-interest pooling on each candidate text box, and projecting each candidate text box onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box;
recognizing the text in the text box feature maps to obtain a text recognition result.
In the above picture processing method, candidate text boxes at arbitrary angles in the picture to be processed are determined according to its text features, so candidate text boxes at different angles can be recognized. Pooling each candidate text box and projecting candidate text boxes of different sizes onto fixed-size feature maps yields the text box feature map of each candidate text box, which improves the adaptability of candidate text box processing: candidate text boxes of different sizes and angles can be handled, and recognizing the text in the text box feature maps gives the text recognition result of each candidate text box. Meanwhile, through the FPGA architecture, data can be processed in parallel to implement the above picture processing method, improving the accuracy and efficiency of text recognition in pictures to be processed while reducing cost and power consumption.
A task data processing method, the method including:
obtaining multiple pieces of task data;
for each piece of task data, executing the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units including field-programmable gate array (FPGA) units;
during the processing of each processing unit, when the processing unit is idle, executing the subtask corresponding to the next piece of task data in parallel.
A task data processing method, applied to a distributed server master, the method including:
receiving task data sent by a terminal;
determining the distributed server slave address allocated for the task data;
sending the task data to a distributed server slave according to the allocated distributed server slave address;
the distributed server slave being configured to put the task data into a thread pool and, when processing task data, obtain multiple pieces of task data from the thread pool; for each piece of task data, execute the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure, at least some of the processing units including field-programmable gate array (FPGA) units; and during the processing of each processing unit, when the processing unit is idle, execute the subtask corresponding to the next piece of task data in parallel.
A task data processing method, applied to a distributed server slave, the method including:
when task data sent by a distributed server master is received, putting the task data into a thread pool;
when processing task data, obtaining multiple pieces of task data from the thread pool;
for each piece of task data, executing the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure, at least some of the processing units including field-programmable gate array (FPGA) units; during the processing of each processing unit, when the processing unit is idle, executing the subtask corresponding to the next piece of task data in parallel.
A task data processing apparatus, the apparatus including a task scheduling unit and field-programmable gate array (FPGA) units, the task scheduling unit being connected to the FPGA units;
the task scheduling unit being configured to obtain multiple pieces of task data; for each piece of task data, execute the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure, at least some of the processing units including FPGA units; and during the processing of each processing unit, when the processing unit is idle, execute the subtask corresponding to the next piece of task data in parallel.
In the above task data processing method and apparatus, the subtasks corresponding to the substructures of the machine learning model are executed by some of the FPGA units, and when multiple pieces of task data are obtained, the FPGA units execute the subtasks of each piece of task data in parallel, so that the subtasks corresponding to the pieces of task data can be processed in parallel, improving the processing efficiency of multiple pieces of task data.
A computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the above picture processing method.
A computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the above task data processing method.
A computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the above task data processing method.
A computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the above task data processing method.
A computer device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above picture processing method.
A computer device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above task data processing method.
A distributed server master including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above task data processing method.
A distributed server slave including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above task data processing method.
Brief Description of the Drawings
FIG. 1 is a diagram of an application scenario of a task data processing method in one embodiment;
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment;
FIG. 3 is a block diagram of a task data processing apparatus in one embodiment;
FIG. 4 is a schematic diagram of the internal structure of a task data processing apparatus in one embodiment;
FIG. 5 is a schematic flowchart of a task data processing method in one embodiment;
FIG. 6 is a schematic diagram of the encapsulation of task data in one embodiment;
FIG. 7 is a schematic diagram of the parallel execution of multi-thread tasks in one embodiment;
FIG. 8 is a timing diagram of the parallel execution of multi-thread tasks in one embodiment;
FIG. 9 is a timing diagram of the parallel execution of multi-thread tasks in one embodiment;
FIG. 10 is a schematic diagram of a CPU and FPGA units processing tasks in parallel in one embodiment;
FIG. 11 is a diagram of the application environment of a task data processing method in another embodiment;
FIG. 12 is a diagram of the internal environment of a distributed server slave in one embodiment;
FIG. 13 is a schematic flowchart of a task data processing method in one embodiment;
FIG. 14 is a software architecture diagram of a task data processing method in one embodiment;
FIG. 15 is a schematic flowchart of the steps in which each substructure processes image processing task data in one embodiment;
FIG. 16 is a schematic flowchart of the step of obtaining a classification result in one embodiment;
FIG. 17 is a schematic flowchart of obtaining an image processing result in one embodiment;
FIG. 18 is a schematic flowchart of a picture processing method in one embodiment;
FIG. 19 is a schematic flowchart of text recognition in one embodiment;
FIG. 20 is a schematic diagram of the text recognition result for one application scenario;
FIG. 21 is a schematic diagram of the text recognition result for another application scenario;
FIG. 22 is a schematic diagram of the text recognition result for another application scenario;
FIG. 23 is a schematic diagram of the text recognition result for another application scenario;
FIG. 24 is a schematic diagram of the text recognition result for another application scenario.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application, not to limit it.
FIG. 1 is a diagram of an application scenario of a task data processing method in one embodiment. Referring to FIG. 1, the application scenario includes a CPU 110, a board interface 120, and a task data processing apparatus 130. The CPU 110 communicates with the task data processing apparatus 130 through the board interface 120. The board interface 120 and the CPU 110 are integrated on the motherboard of a computer device; the board interface 120 may be a board slot on the motherboard, and the task data processing apparatus 130 can communicate with the CPU 110 once inserted into the board slot. At least one FPGA (Field-Programmable Gate Array) unit is integrated in the task data processing apparatus 130.
FIG. 2 is a schematic diagram of the internal structure of the computer device integrating the CPU 110 and the board interface 120 of FIG. 1. Referring to FIG. 2, the computer device includes the CPU 110, a memory, a network interface, and the board interface 120 connected through a system bus; the board interface 120 is connected to the task data processing apparatus 130. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device can store an operating system and a computer program which, when executed, can cause the CPU 110 to perform the task data processing method described below. The task data processing apparatus 130 and the CPU 110 of the computer device provide computing and control capabilities and support the operation of the entire computer device and the task data processing apparatus 130. The internal memory can store a computer program which, when executed by the CPU 110, can cause the CPU 110 to perform the task data processing method described below. The network interface of the computer device is used for network communication. The computer device may be a distributed server slave. The board interface 120 may be a PCIE gen3 x8 interface.
Those skilled in the art can understand that the structure shown in FIG. 2 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in FIG. 2, combine certain components, or have a different arrangement of components.
As shown in FIG. 3, in one embodiment, a task data processing apparatus 130 is provided, which specifically includes a task data obtaining module 132, a task data processing module 134, and an execution result obtaining module 136.
The task data obtaining module 132 is configured to obtain multiple pieces of task data.
The task data processing module 134 is configured to, for each piece of task data, execute the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units include FPGA units; when a processing unit is idle, the subtask corresponding to the next piece of task data is executed.
The execution result obtaining module 136 is configured to obtain the corresponding task execution result after the subtasks of each substructure have been executed for each piece of task data.
In one embodiment, the task data processing apparatus 130 provided in this application may be implemented in the form of a computer program, which can run on a computer device as shown in FIG. 2. The memory of the computer device can store the program modules constituting the task data processing apparatus 130, such as the task data obtaining module 132, the task data processing module 134, and the execution result obtaining module 136 shown in FIG. 3. The computer program constituted by the program modules causes the CPU 110 to perform the steps of the task data processing methods of the embodiments of this application described in this specification.
For example, the computer device shown in FIG. 2 can obtain multiple pieces of task data through the task data obtaining module 132 of the task data processing apparatus 130 shown in FIG. 3. Through the task data processing module 134, for each piece of task data, the computer device can execute the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units include FPGA units; during the processing of each processing unit, when the processing unit is idle, the subtask corresponding to the next piece of task data is executed. Through the execution result obtaining module 136, the computer device can obtain the corresponding task execution result after the subtasks of each substructure have been executed for each piece of task data.
As shown in FIG. 4, in another embodiment, the task data processing apparatus 130 includes a task scheduling unit and one or more FPGA units, the task scheduling unit being connected to each FPGA unit. FIG. 4 uses a task data processing apparatus 130 including a task scheduling unit and four FPGA units as an example.
The task scheduling unit is configured to obtain multiple pieces of task data and, for each piece of task data, schedule the processing unit corresponding to each substructure of the machine learning model, in the order of the substructures, to execute the subtask of that substructure in turn; at least some of the processing units include one or more FPGA units.
The FPGA unit is configured to, during the processing of each processing unit, execute the subtask corresponding to the next piece of task data when the processing unit is idle, and to obtain and output the corresponding task execution result after the subtasks of each substructure have been executed for each piece of task data.
Continuing with FIG. 4, in one embodiment, the task data processing apparatus 130 further includes a register and a memory; the memory is connected to the FPGA units, and the register is connected to the task scheduling unit.
The task scheduling unit is further configured to read processing unit invocation data from the register so as to, for each piece of task data and according to the processing unit invocation data, schedule the processing unit corresponding to each substructure of the machine learning model, in the order of the substructures, to execute the subtask of that substructure in turn.
The FPGA unit is further configured to read task data written by the CPU 110 from the memory; during the processing of each processing unit, when the processing unit is idle, execute the subtask corresponding to the next piece of task data; after the subtasks of each substructure have been executed for each piece of task data, obtain the corresponding task execution result; and store the task execution result in the memory.
Continuing with FIG. 4, in one embodiment, the task data processing apparatus 130 further includes a bus controller connected to the memory and to each FPGA unit. Each FPGA unit stores task execution results in the memory through the bus controller. The memory may be a DDR4 memory.
In one embodiment, the task data processing apparatus 130 is connected to the CPU 110 through the board interface 120; the processing units further include the CPU 110.
As shown in FIG. 5, in one embodiment, a task data processing method is provided. The method can be applied to the task data processing apparatus 130 in FIG. 1 above, and this embodiment mainly uses that application as an example. Referring to FIG. 5, the task data processing method specifically includes the following steps:
S502: Obtain multiple pieces of task data.
Task data is the data corresponding to a task to be processed.
In this step, the CPU 110 obtains the task data and sends it to the task data processing apparatus 130, which stores the received task data. When processing task data, the task data processing apparatus 130 reads multiple pieces of task data from the stored task data.
In one embodiment, the CPU 110 reads task data from the memory of the computer device and sends it to the task data processing apparatus 130, which receives the task data and stores it in the memory. When processing task data, the CPU 110 sends a task execution instruction to the task data processing apparatus 130; the apparatus receives the instruction, determines to process the task data, and reads multiple pieces of task data from the memory according to the instruction.
It should be noted that when caching task data, the task data processing apparatus 130 uses a double-buffer ping-pong operation so that data read-in and computation proceed simultaneously, reducing the mutual waiting latency of the two parts and improving processing efficiency.
FIG. 6 illustrates this with an example; FIG. 6 is a schematic diagram of the encapsulation of task data in one embodiment. After obtaining task data, the CPU 110 encapsulates it into an FPGA thread task. The FPGA thread task includes the task data and FPGA execution instructions; the FPGA execution instructions include a write instruction, a read instruction, and a start instruction, and are used to invoke FPGA units to process the task data. The CPU 110 puts the encapsulated FPGA thread task into the thread pool; the task data processing apparatus 130 obtains the FPGA thread task from the thread pool and reads the task data from it. The task data processing apparatus 130 can also read the FPGA execution instructions from the FPGA thread task.
S504: For each piece of task data, execute the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units include FPGA units. When a processing unit detects that it is idle, it fetches the next piece of task data through a thread and executes the subtask corresponding to the next piece of task data in the processing unit corresponding to the thread.
It should be noted that the task data processing apparatus 130 includes multiple processing units. In this embodiment, the processing units include FPGA units and CPU units. The machine learning model is a pre-trained data model for task data processing. Each substructure of the machine learning model has a corresponding processing unit, which is used to execute the subtask corresponding to that substructure.
The task data processing apparatus 130 inputs the multiple pieces of task data into the processing units corresponding to the machine learning model and processes the task data through those processing units.
In one embodiment, for each piece of task data, the task data processing apparatus 130 executes the subtasks corresponding to the task data through the processing units in the order of the substructures in the machine learning model. During the processing of each processing unit, it detects whether the subtask of the previous task data at the current substructure has finished and whether the subtask of the current task data at the previous substructure has finished. When it detects that both have finished, that is, when the processing unit corresponding to the current substructure is idle, it starts executing the subtask of the current task data at the current substructure.
In one embodiment, for each piece of task data, the task data processing apparatus 130 inputs the task data into the processing unit corresponding to the first substructure of the machine learning model, which executes the subtask of the first substructure on the task data to obtain first subtask data. The apparatus then inputs the first subtask data into the processing unit corresponding to the second substructure, which executes the subtask of the second substructure on it to obtain second subtask data. The apparatus then inputs the second subtask data into the third substructure of the machine learning model, and so on, until the task execution result output by the last substructure of the machine learning model is obtained.
In one embodiment, during the processing of each processing unit, the task data processing apparatus 130 inputs the previous task data into the processing unit corresponding to the first substructure of the machine learning model, which executes the subtask of the first substructure on the previous task data to obtain the first subtask data corresponding to the previous task data.
After the processing unit corresponding to the first substructure has executed the subtask of the first substructure on the previous task data, the task data processing apparatus 130 inputs the current task data into that processing unit, which executes the subtask of the first substructure on the current task data; at the same time, the apparatus inputs the first subtask data corresponding to the previous task data into the processing unit corresponding to the second substructure, which executes the subtask of the second substructure on it to obtain the second subtask data corresponding to the previous task data.
After the processing unit corresponding to the first substructure has executed the subtask of the first substructure on the current task data, obtaining the first subtask data corresponding to the current task data, and after the second subtask data corresponding to the previous task data has been obtained, the task data processing apparatus 130 inputs the first subtask data corresponding to the current task data into the processing unit corresponding to the second substructure, which executes the subtask of the second substructure on it to obtain the second subtask data corresponding to the current task data; at the same time, the apparatus inputs the second subtask data corresponding to the previous task data into the processing unit corresponding to the third substructure, and so on, until the last substructure of the machine learning model outputs the task execution result corresponding to the previous task data and the task execution result corresponding to the current task data.
In one embodiment, S504 includes: for each piece of task data, the task data processing apparatus 130 reads processing unit invocation data from the register; the processing unit invocation data is written into the register by the CPU 110; according to the processing unit invocation data, the processing units corresponding to the substructures are invoked in turn, in the order of the substructures in the machine learning model, to execute the subtasks of the corresponding substructures.
The processing unit invocation data is the data the task data processing apparatus 130 needs in order to invoke the processing units. It may include processing unit identifiers and may also include the instructions used to invoke the processing units, which may include at least one of a unit write instruction, a unit read instruction, and a unit execution instruction.
In this step, the CPU 110 writes the processing unit invocation data corresponding to each piece of task data into the register. The task data processing apparatus 130 reads the processing unit invocation data corresponding to each piece of task data from the register, extracts the processing unit identifiers from it, and, according to the processing units corresponding to the extracted identifiers, invokes the processing units corresponding to the substructures in turn, in the order of the substructures in the machine learning model, to execute the subtasks of the corresponding substructures.
In one embodiment, S504 further includes: when no processing unit is idle, waiting for the processing unit corresponding to the subtask of the current substructure to be released. For example, during the processing of each processing unit, if the subtask of the previous task data at the current substructure has not finished while the subtask of the current task data at the previous substructure has finished, the apparatus waits for the processing unit corresponding to the subtask of the current substructure to be released. After that processing unit is released, it is invoked to execute the subtask of the current task data at the current substructure.
Please refer to FIG. 7, which is a schematic diagram of the parallel execution of multi-thread tasks in one embodiment. The task data processing apparatus 130 reads thread task 1, thread task 2, and thread task 3 from the thread pool; they are connected in order. When processing them through FPGA units, the output of FPGA unit 1 serves as the input of FPGA unit 2, and the output of FPGA unit 2 serves as the input of FPGA unit 3; that is, FPGA units 1, 2, and 3 form a pipeline, with each FPGA unit executing a different subtask. Each thread task can invoke each FPGA unit independently, so different FPGA units can run different thread tasks at the same time, improving throughput.
FIG. 8 and FIG. 9 are timing diagrams of the parallel execution of multi-thread tasks in one embodiment. The thread task corresponding to a piece of task data must execute the corresponding subtasks through FPGA unit 1, FPGA unit 2, and FPGA unit 3 in turn before the task execution result of that task data is obtained. Referring to FIG. 8 and FIG. 9, when FPGA units 1, 2, and 3 are idle, thread task 1 obtains task data 1 and invokes FPGA unit 1 to execute subtask 1 of task data 1. When FPGA unit 1 has finished subtask 1 of thread task 1, thread task 1 invokes FPGA unit 2 to execute subtask 2, while thread task 2 obtains task data 2 and invokes FPGA unit 1 to execute subtask 1 of thread task 2. When FPGA unit 2 has finished subtask 2 of thread task 1 and FPGA unit 1 has finished subtask 1 of thread task 2, thread task 1 invokes FPGA unit 3 to execute subtask 3, thread task 2 invokes FPGA unit 2 to execute subtask 2, and thread task 3 obtains task data 3 and invokes FPGA unit 1 to execute subtask 1. When FPGA unit 3 has finished subtask 3 of thread task 1 and FPGA unit 2 has finished subtask 2 of thread task 2, thread task 2 invokes FPGA unit 3 to execute subtask 3; meanwhile, when FPGA unit 1 has finished subtask 1 of thread task 3, thread task 3 invokes FPGA unit 2 to execute subtask 2, and thread task 1 can obtain task data 4 and invoke FPGA unit 1 to execute subtask 1, and so on, until the task execution results corresponding to all task data obtained by the thread tasks invoking the FPGA units are produced. The number of thread tasks can be set to n, where n is a positive integer.
In one embodiment, the processing units may include the CPU 110 and FPGA units. FIG. 10 is a schematic diagram of the CPU 110 and FPGA units processing tasks in parallel in one embodiment. As shown in FIG. 10, thread tasks 1, 2, and 3 invoke the processing units in the same order: thread task 1 invokes FPGA unit 1; after thread task 1 releases FPGA unit 1, thread task 1 invokes the CPU 110 and thread task 2 invokes FPGA unit 1; after thread task 1 releases the CPU, thread task 1 invokes FPGA unit 2; after thread task 1 releases the CPU 110 and thread task 2 releases FPGA unit 1, thread task 2 invokes the CPU 110 and thread task 3 invokes FPGA unit 1; after thread task 1 releases FPGA unit 2, thread task 1 invokes FPGA unit 3; after thread task 1 releases FPGA unit 2 and thread task 2 releases the CPU 110, thread task 2 invokes FPGA unit 2; after thread task 2 releases the CPU 110 and thread task 3 releases FPGA unit 1, thread task 3 invokes the CPU 110; after thread task 1 releases FPGA unit 3, it waits for FPGA unit 1 to be released by thread task 3, and once FPGA unit 1 is released, thread task 1 invokes FPGA unit 1 again, thereby ensuring the parallel processing of the thread tasks until each parallel thread task obtains its task execution result.
S506: After the subtasks of each substructure have been executed for each piece of task data, obtain the corresponding task execution result.
For each piece of task data, after detecting that the subtasks of all substructures of the machine learning model have been executed, the task data processing apparatus 130 obtains the task execution result output by the processing unit corresponding to the last substructure, thereby obtaining the task execution result corresponding to each piece of task data.
In this embodiment, after multiple pieces of task data are obtained, for each piece of task data the subtask of each substructure is executed in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; each processing unit corresponds to one substructure of the machine learning model, and at least some of the processing units include FPGA units. During the processing of each processing unit, after the subtask of the previous task data at the current substructure has finished and the subtask of the current task data at the previous substructure has finished, execution of the subtask of the current task data at the current substructure begins; the processing units process the subtasks of multiple pieces of task data in parallel, so that the machine learning model can process multiple pieces of task data in parallel in a low-cost, low-power structure, improving task data processing efficiency.
FIG. 11 is a diagram of the application environment of a task data processing method in one embodiment. FIG. 11 includes a terminal, a distributed server master, and distributed server slaves; the terminal is connected to the master through a network, and the master is connected to the slaves through a network; there may be one or more distributed server slaves. A thread pool and a processing unit scheduler are provided in the distributed server slave, whose board interface is connected to a task data processing apparatus 130 in which FPGA units are provided. The distributed server slave implements the task data processing method by executing the processing unit scheduler; when executing the scheduler, it reads task data from the thread tasks in the thread pool and executes the thread tasks according to the task data.
FIG. 12 is a diagram of the internal environment of a distributed server slave in one embodiment. A thread pool and a processing unit scheduler are provided in the distributed server slave, which implements the task data processing method by executing the scheduler: it obtains task data from the thread tasks in the thread pool and schedules the FPGA units and the CPU 110 to execute the corresponding subtasks according to the task data, in the order of the subtasks in the unit scheduler. The processing unit scheduler can process multiple thread tasks in parallel; after processing, the task execution results of the multiple thread tasks are obtained, returned to the corresponding thread tasks, and returned to the distributed server master through the distributed server slave. The processing unit scheduler includes n subtasks, where n is a positive integer.
As shown in FIG. 13, in one embodiment, a task data processing method applied to a distributed server master is provided, including the following:
S1302: The distributed server master receives task data sent by the terminal, determines the distributed server slave address allocated for the task data, and sends the task data to the distributed server slave according to the allocated address.
The task data may be image processing task data. In one embodiment, the distributed server master may allocate a slave for the task data according to the working state of each distributed server slave; correspondingly, the step of determining the allocated slave address may be: the master selects, according to the working state of each distributed server slave, a slave that is idle, and determines the address of the selected slave.
In another embodiment, the distributed server master may allocate a slave for the task data according to the type of the task data; correspondingly, the step of determining the allocated slave address may be: the master selects, according to the type of the task data, a slave used to process that type from the distributed server slaves, and determines the address of the selected slave.
S1304: The distributed server slave puts the task data into the thread pool and obtains multiple pieces of task data from the thread pool; for each piece of task data, the subtask of each substructure is executed in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units include FPGA units; during the processing of each processing unit, when the processing unit is idle, the subtask corresponding to the next piece of task data is executed.
When the task data is image processing task data, the machine learning model may be an image processing model; correspondingly, for each piece of image processing task data, the image processing subtask of each substructure is executed in turn, in the order of the substructures in the image processing model, through the processing unit corresponding to that substructure.
When the distributed server slave receives the task data sent by the master, it puts the task data into the thread pool; when processing task data, it obtains multiple pieces of task data from the thread pool.
The distributed server master may instruct the slave to process the task data; correspondingly, the slave obtains multiple pieces of task data from the thread pool only when it receives the FPGA execution instruction sent by the master.
It should be noted that in this step the subtasks of the corresponding substructures may be executed by FPGA units; correspondingly, for each piece of task data, the distributed server slave executes the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the FPGA unit corresponding to that substructure; during the processing of an FPGA unit, when the FPGA unit is idle, the subtask corresponding to the next piece of task data is executed.
It should also be noted that when a processing unit is not idle, the subtask corresponding to the next piece of task data is executed through that processing unit only after it is released, or another processing unit is invoked to execute that subtask.
S1306: After the subtasks of each substructure have been executed for each piece of task data, the distributed server slave obtains the corresponding task execution result and returns it to the distributed server master.
When an FPGA unit obtains the corresponding task execution result, it stores the result in the memory; the CPU 110 reads the task execution result from the memory and returns it to the distributed server master.
S1308: The distributed server master receives the task execution result returned by the slave and sends it to the terminal.
The task data processing apparatus 130 obtains multiple pieces of image processing task data; for each piece, the image processing subtask of each substructure is executed in turn, in the order of the substructures in the image processing model, through the processing unit corresponding to that substructure. During the processing of each processing unit, after the image processing subtask of the previous image processing task data at the current substructure has finished and the image processing subtask of the current image processing task data at the previous substructure has finished, execution of the image processing subtask of the current image processing task data at the current substructure begins. The processing units process the image processing subtasks of multiple pieces of image processing task data in parallel, so that the image processing model can process multiple pieces of image processing task data in parallel in a low-cost, low-power structure, improving processing efficiency.
As shown in FIG. 14, in yet another embodiment, the task data is image processing task data, the machine learning model is an image processing model, and the task execution result is an image processing result. The image processing result may be an image recognition result, such as text recognized from an image. Referring to FIG. 14, an embodiment of this application provides a system for processing image data, which includes an access layer connected to the terminal, a distributed server master, one or more distributed server slaves at the system layer, and an algorithm layer. The distributed server slave is connected to the algorithm layer through an interface (such as an API, Application Programming Interface). The thread pool is provided in the distributed server slave, and the algorithm layer is provided with the machine learning model. The processing units include CPU and FPGA units. The terminal accesses the distributed server master through the access layer, and the master exchanges data with the distributed server slaves at the system layer. The slave invokes the algorithms in the caffe-FPGA.so file of the algorithm layer through the API interface (board interface), and invokes the FPGA units and the CPU according to those algorithms to process the task data. The thread pool of the slave contains multiple thread tasks, each of which invokes the FPGA and CPU according to caffe-FPGA.so so that the thread tasks are processed in parallel.
The embodiment of this application is a software architecture for a caffe-based OCR (Optical Character Recognition) scene text detection FPGA accelerator. Caffe is modified by adding classes that support FPGA unit invocation and the parallel invocation of multiple FPGA units, so that it supports a multi-thread concurrency mechanism; at the same time, caffe is encapsulated into the caffe-FPGA.so file, with APIs added to support the algorithm. caffe-FPGA.so runs under the distributed server architecture to schedule the FPGA units, thereby achieving parallel processing of thread tasks by the FPGA units.
In this embodiment, the machine learning model includes a convolution layer, an RPN (Region Proposal Network), a pooling layer, a fully connected layer, a first classification layer, and so on. The output of the convolution layer is connected to the input of the RPN, the output of the RPN to the input of the pooling layer, the output of the pooling layer to the input of the fully connected layer, and the output of the fully connected layer to the input of the first classification layer. The first classification layer is used to output the image processing result.
The RPN further includes an RPN convolution layer, a second classification layer, a candidate region determination layer (Proposals), and an NMS (Non Maximum Suppression) layer. Correspondingly, the output of the convolution layer is connected to the input of the RPN convolution layer, the output of the RPN convolution layer to the input of the second classification layer, the output of the second classification layer to the input of the candidate region determination layer, the output of the candidate region determination layer to the input of the NMS layer, and the output of the NMS layer to the input of the pooling layer.
In the machine learning model, except for the candidate region determination layer in the RPN and the final recognition result output part, which are processed by the CPU 110, all other parts are implemented on FPGA units, so that the parts with heavy data processing loads can be processed in parallel by FPGA units while the parts with light loads remain on the CPU 110, improving data processing efficiency while reducing cost.
As shown in FIG. 15, in one embodiment, executing, for each piece of image processing task data, the image processing subtask of each substructure in turn, in the order of the substructures in the image processing model, through the processing unit corresponding to that substructure, includes the steps in which each substructure processes the image processing task data, specifically including:
S1502: Input the image processing task data into the FPGA unit corresponding to the convolution layer substructure in the image processing model to obtain the convolution result.
The image processing model is a data model pre-trained on image data for processing image processing task data. It includes a convolution layer substructure which, in the order of the substructures in the image processing model, may be the first substructure; the processing unit corresponding to the convolution layer substructure is an FPGA unit.
The task data processing apparatus 130 reads the processing unit invocation data from the register, extracts the processing unit identifier from it, determines the FPGA unit corresponding to the convolution layer substructure according to the identifier, and sends a task execution notification to that FPGA unit. When the FPGA unit corresponding to the convolution layer substructure receives the notification, it reads the image processing task data from the memory and performs the convolution processing corresponding to the convolution layer substructure on it, obtaining the convolution result of the image processing task data.
In one embodiment, the FPGA unit corresponding to the convolution layer substructure reads the model parameters of the image processing model from the memory and configures itself according to the read parameters, so that it performs convolution processing on the image processing task data based on the model parameters to obtain the convolution result.
In one embodiment, when the FPGA unit corresponding to the convolution layer substructure obtains the convolution result of the image processing task, it stores the convolution result in the memory.
It should be noted that during initialization the CPU 110 writes the model parameters into the memory through a PCIE (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) DMA write operation.
S1504: Send the convolution result to the CPU 110, and execute, through the CPU 110, the candidate region selection task corresponding to the candidate network substructure in the image processing model to obtain the region selection result.
The image processing model includes a candidate network substructure which, in the order of the substructures in the image processing model, may be the second substructure; its corresponding processing unit is a CPU unit. The subtask corresponding to the candidate network substructure is the candidate region selection task, which is used to select candidate regions to be processed in the image corresponding to the image processing task data.
When the image processing model is used to recognize text in an image, the candidate region to be processed in this step may be a region containing text.
After detecting that the FPGA unit corresponding to the convolution layer substructure has stored the convolution result in the memory, the task data processing apparatus 130 sends a candidate region selection task execution notification to the CPU 110. After receiving the notification, the CPU 110 reads the convolution result from the memory, executes the candidate region selection task corresponding to the candidate network substructure according to the convolution result, obtains the region selection result, and stores it in the memory.
In one embodiment, the CPU 110 reads the model parameters of the image processing model from the memory, configures the candidate network substructure according to them, and executes the candidate region selection task based on the convolution result through the configured candidate network substructure to obtain the region selection result.
S1506: Perform classification processing on the region selection result through the FPGA unit corresponding to the classification substructure in the image processing model to obtain the classification result.
The image data processing model includes a classification substructure which, in the order of the substructures, may be the third substructure of the image processing model.
After detecting that the CPU 110 has stored the region selection result in the memory, the task data processing apparatus 130 sends a task execution notification to the FPGA unit corresponding to the classification substructure. When that FPGA unit receives the notification, it reads the region selection result from the memory, performs classification processing on it, obtains the classification result, and stores the classification result in the memory.
In one embodiment, the FPGA unit corresponding to the classification substructure reads the model parameters of the image processing model from the memory, configures the classification substructure according to them, and performs classification processing on the region selection result through the classification substructure to obtain the classification result.
S1508: Determine, through the CPU 110 and according to the classification result, the task execution result corresponding to the image processing task data.
When detecting that the FPGA unit corresponding to the classification substructure has stored the classification result in the memory, the task data processing apparatus 130 sends a task result determination notification to the CPU 110. When the CPU 110 receives the notification, it reads the classification result from the memory and extracts the task execution result corresponding to the image processing task data from it. For example, the task execution result corresponding to the image processing task data may be an image recognition result.
As shown in FIG. 16, in one embodiment, S1506 further includes the step of obtaining the classification result, which specifically includes the following:
S1602: Invoke the FPGA unit corresponding to the non-maximum suppression substructure in the image processing model to perform non-maximum suppression processing on the region selection result, obtaining the non-maximum suppression result.
The image processing model also includes a non-maximum suppression substructure, whose corresponding processing unit is an FPGA unit. The subtask corresponding to the non-maximum suppression substructure is the non-maximum suppression processing task, and the non-maximum suppression result is the processing result corresponding to that task.
When the task data processing apparatus 130 detects that the CPU 110 has stored the region selection result in the memory, it sends a task execution notification to the FPGA unit corresponding to the non-maximum suppression substructure. When that FPGA unit receives the notification, it reads the region selection result from the memory, performs non-maximum suppression processing on it, obtains the non-maximum suppression result, and stores the result in the memory.
In one embodiment, the FPGA unit corresponding to the non-maximum suppression substructure reads the model parameters of the image processing model from the memory, configures the non-maximum suppression substructure according to them, and performs non-maximum suppression processing on the region selection result through the configured substructure to obtain the non-maximum suppression result.
S1604: Perform pooling processing on the non-maximum suppression result through the FPGA unit corresponding to the pooling layer substructure in the image processing model to obtain the pooling result.
The image data processing model also includes a pooling layer substructure, whose corresponding processing unit is an FPGA unit; the subtask corresponding to the pooling layer substructure is the pooling layer subtask, and the processing result corresponding to that subtask is the pooling result.
When the task data processing apparatus 130 detects that the FPGA unit corresponding to the non-maximum suppression substructure has stored the non-maximum suppression result in the memory, it sends a task execution notification to the FPGA unit corresponding to the pooling layer substructure. When that FPGA unit receives the notification, it reads the non-maximum suppression result from the memory and executes the pooling layer subtask on it, that is, it performs pooling processing on the non-maximum suppression result, obtains the pooling result, and stores it in the memory.
In one embodiment, the FPGA unit corresponding to the pooling layer substructure reads the model parameters of the image processing model from the memory, configures the pooling layer substructure according to them, and performs pooling processing on the non-maximum suppression result through the configured substructure to obtain the pooling result.
S1606: Input the pooling result into the FPGA unit corresponding to the fully connected layer substructure in the image processing model to obtain the classification result.
The image processing model also includes a fully connected layer substructure, whose corresponding processing unit is an FPGA unit; the subtask corresponding to it is the fully connected processing task, and the processing result corresponding to that task is the classification result.
When the task data processing apparatus 130 detects that the FPGA unit corresponding to the pooling layer substructure has stored the pooling result in the memory, it sends a task execution notification to the FPGA unit corresponding to the fully connected layer substructure. When that FPGA unit receives the notification, it reads the pooling result from the memory, executes the fully connected processing task according to the pooling result, obtains the classification result, and stores it in the memory.
In one embodiment, the FPGA unit corresponding to the fully connected layer substructure reads the model parameters of the image processing model from the memory, configures the fully connected layer substructure according to the read parameters, and performs fully connected processing on the pooling result through the configured substructure to obtain the classification result.
In one embodiment, the image processing result may be an image recognition result, which may be an OCR recognition result or an image target recognition result.
Referring to FIG. 17, when text recognition such as OCR recognition is required, the distributed server slave obtains the picture to be processed and encapsulates it into a thread task. The thread task invokes the FPGA unit corresponding to the basic convolution layer to perform convolution processing on the picture, obtaining text features. When the text features are obtained, the thread task inputs them into the candidate region generation network; in that network, the thread task invokes the FPGA unit corresponding to the RPN convolution to perform RPN convolution processing on the text features, obtaining the convolution result corresponding to the text features. The thread task invokes the FPGA unit corresponding to classification to classify the convolution result and obtain the classification result; it invokes the CPU 110 to determine candidate text boxes according to the classification result and to perform regression adjustment on them, obtaining candidate text boxes at arbitrary angles; it then invokes the FPGA unit corresponding to non-maximum suppression to process the overlapping candidate text boxes, obtaining non-overlapping candidate text boxes at arbitrary angles. The thread task invokes the FPGA unit corresponding to rotated region-of-interest pooling to pool the arbitrary-angle candidate text boxes output by the candidate region generation network, rotates and adjusts them, and projects the adjusted candidate text boxes onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box. The thread task invokes the FPGA unit corresponding to the recognition result output to recognize the text in the text box feature maps and output the text recognition result.
In this embodiment, except for the Proposal (candidate region determination) layer in the RPN and the final recognition result output part, all other parts are implemented on FPGA. The parts with heavy data processing loads can thus be implemented in parallel on lower-cost FPGAs, while the remaining parts are processed by the CPU, reducing cost while maintaining processing efficiency.
It should be noted that different processing units have different input/output parallelism: the FPGA corresponding to the convolution layer uses a parallelism of 32 inputs and 32 outputs; the FPGA corresponding to classification uses 16 inputs and 64 outputs; the FPGA unit corresponding to the RPN convolution uses 8 inputs and 8 outputs, which improves processing efficiency.
As shown in FIG. 18, in one embodiment, a picture processing method is provided to implement the above OCR recognition. The method specifically includes the following:
S1802: Obtain the picture to be processed.
The picture to be processed is a picture awaiting text recognition processing. Text recognition may use OCR (Optical Character Recognition) technology to recognize the text in a picture. When the terminal needs to recognize the picture to be processed, it sends a text recognition request to the CPU 110; the CPU 110 receives the request and obtains the picture to be processed according to it.
In one embodiment, the text recognition request carries the picture to be processed. Correspondingly, the step of the CPU 110 obtaining the picture according to the request may be: the CPU 110 obtains the picture from the text recognition request.
In another embodiment, the text recognition request carries the picture identifier of the picture to be processed. Correspondingly, the step of the CPU 110 obtaining the picture according to the request may be: the CPU 110 parses the request, extracts the picture identifier from it, and reads the picture to be processed from the memory according to the identifier. The picture identifier may be the storage address of the picture in the memory.
It should be noted that the picture to be processed may be of any size; the picture processing method of this application can therefore support pictures of different sizes, configuring adaptively for each size, with support for pictures up to 1024×1024.
S1804: Extract the text features from the picture to be processed.
Text features are features representing the text in the picture to be processed. When the CPU 110 obtains the picture, it performs convolution processing on it and extracts the text features through the convolution processing.
In one embodiment, S1804 includes the following: the CPU 110 inputs the picture into the convolution layer, and convolution processing is performed on the picture according to the convolution kernels of the convolution layer to obtain the text features of the picture.
The CPU 110 inputs the picture into the convolution layer of the machine learning model and performs convolution processing on it through the convolution kernels of that layer, obtaining the text features of the picture. The machine learning model may be a picture processing model for performing text recognition processing on pictures to be processed.
S1806: Determine candidate text boxes at arbitrary angles in the picture to be processed according to the text features.
A candidate text box is a region box containing text in the picture to be processed. When the CPU 110 extracts the text features of the picture, it determines candidate text boxes containing text in the picture according to the features. A candidate text box may be at an arbitrary angle: a horizontal, vertical, or inclined angle.
In one embodiment, when the CPU 110 extracts text features from the picture to be processed, it inputs them into the candidate region generation network (RPN, Region Proposal Network) of the machine learning model, which determines the candidate text boxes at arbitrary angles in the picture according to the text features. The candidate region generation network may be a rotation candidate region generation network (RRPN, Rotation RPN).
In the embodiments of this disclosure, the RRPN algorithm improves accuracy. Because the RRPN algorithm has a complicated flow and runs slowly on the CPU 110, in the embodiments of this application the accelerator architecture covers the most time-consuming parts of the algorithm, greatly improving overall computing efficiency: compared with the CPU 110 software version, a more than tenfold improvement is achieved, throughput is 1.4 times that of the GPU, and cost is reduced to 30%.
S1808: Pool each candidate text box, projecting each onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box.
The CPU 110 pools the arbitrary-angle candidate text boxes and, through the pooling processing, projects them onto a fixed-size feature map, obtaining same-size text box feature maps corresponding to the candidate text boxes.
In one embodiment, the CPU 110 inputs the arbitrary-angle candidate text boxes into the pooling layer of the machine learning model, which projects each candidate text box onto a fixed-size feature map, yielding the fixed-size text box feature map corresponding to each candidate text box. The pooling layer may be a rotated region-of-interest (RROI, Rotation ROI) pooling layer.
In one embodiment, S1808 further includes: inputting each candidate text box into the pooling layer; determining the projection parameters of each candidate text box according to the fixed size of a preset feature map; and projecting each candidate text box onto a fixed-size feature map according to the projection parameters, obtaining the text box feature map corresponding to each candidate text box.
S1810: Recognize the text in the text box feature maps to obtain the text recognition result.
The CPU 110 recognizes the text in each text box feature map and obtains through recognition the text recognition result corresponding to each text box feature map.
In one embodiment, the CPU 110 inputs each text box feature map into the output layer of the machine learning model, which performs OCR recognition on it, obtaining the text recognition result corresponding to each text box feature map.
In this embodiment, candidate text boxes at arbitrary angles in the picture to be processed are determined according to its text features, so candidate text boxes at different angles can be recognized. Pooling the candidate text boxes and projecting them onto fixed-size feature maps yields the text box feature map of each candidate text box, improving the adaptability of candidate text box processing: candidate text boxes of different sizes and angles can be handled, and recognizing the text in the text box feature maps gives the text recognition result of each candidate text box, improving the accuracy and efficiency of text recognition in the picture to be processed.
In one embodiment, the picture processing method further includes: inputting the picture to be processed into the machine learning model through at least one thread, and executing the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units include FPGA units.
When each subtask is executed, one of the following steps is performed: extracting the text features of the picture to be processed; determining candidate text boxes at arbitrary angles in the picture according to the text features; pooling each candidate text box and projecting each onto a fixed-size feature map to obtain the corresponding text box feature map; recognizing the text in the text box feature maps to obtain the text recognition result.
In this embodiment, the subtasks of the picture processing method are carried out by the processing units, so the subtasks of multiple pictures to be processed can be handled in parallel, and some subtasks are carried out by the FPGA units among the processing units, improving the efficiency of text recognition in pictures to be processed.
In one embodiment, S1806 specifically includes: inputting the text features into the candidate region generation network; performing convolution processing on the text features through the candidate region convolution layer in that network to obtain the text feature convolution result; determining the position information of each candidate text box in the picture to be processed according to the convolution result; and performing non-maximum suppression processing on the position information of the candidate text boxes to obtain candidate text boxes at arbitrary angles.
The CPU 110 inputs the text features into the candidate region generation network and performs convolution processing on them through the candidate region convolution layer of that network, obtaining the text feature convolution result; according to the convolution result, the candidate text boxes are determined in the picture to be processed and their position information obtained; the CPU 110 then performs non-maximum suppression processing on the position information of each candidate text box to obtain candidate text boxes at arbitrary angles.
In one embodiment, performing non-maximum suppression processing on the position information of the candidate text boxes to obtain candidate text boxes at arbitrary angles includes: determining the arbitrary-angle candidate text boxes in the picture to be processed according to the position information; determining the overlapping candidate text boxes; and performing non-maximum suppression processing on the overlapping boxes to obtain non-overlapping candidate text boxes at arbitrary angles.
The CPU 110 determines the arbitrary-angle candidate text boxes in the picture according to their position information, filters the overlapping boxes from among them, and performs non-maximum suppression processing on the overlapping boxes, obtaining non-overlapping candidate text boxes at arbitrary angles.
In this embodiment, the candidate region generation network can determine candidate text boxes at arbitrary angles in the picture to be processed, and performing non-maximum suppression on them yields non-overlapping candidate text boxes at arbitrary angles, improving the accuracy of candidate text box determination.
In one embodiment, the machine learning model includes a fully connected layer connected to the pooling layer; S1810 specifically includes: inputting the text box feature map into the fully connected layer; determining the probability value corresponding to each text classification from the text feature map; and selecting the text classification corresponding to the maximum probability value as the text recognition result of the text feature map.
After the CPU 110 obtains the text box feature maps corresponding to the candidate text boxes, it inputs them into the fully connected layer of the machine learning model, which processes each map to obtain the probability value corresponding to each text classification; the maximum probability value is determined among them, and the text classification corresponding to it is selected as the text recognition result of the text feature map.
In one embodiment, the multiple processing units include FPGA units and a CPU, and there are multiple pictures to be processed. The current picture to be processed is input into the FPGA unit corresponding to the convolution layer for processing to obtain its text features; the text features are input into the CPU corresponding to the candidate region generation network for processing to determine candidate text boxes at arbitrary angles; the text box feature map corresponding to each candidate text box is determined through the FPGA unit corresponding to the pooling layer according to the arbitrary-angle candidate text boxes; and the text in the text box feature maps is recognized through the FPGA unit corresponding to the recognition result layer to obtain the text recognition result. During the processing of each processing unit, when a processing unit is idle, the subtask corresponding to the next picture to be processed is executed in parallel.
In this embodiment, the subtasks corresponding to the substructures of the machine learning model are executed by some of the FPGA units and the CPU, and when multiple pictures to be processed are obtained, the FPGA units and the CPU execute the subtasks of each picture in parallel, so that the subtasks corresponding to the pictures can be processed in parallel, improving the processing efficiency of multiple pictures to be processed.
FIG. 19 is a schematic flowchart of text recognition in one embodiment. Basic convolution processing of the picture to be processed 1902 yields its text features, which are processed through the candidate region generation network (RPN) to obtain candidate text boxes at arbitrary angles in the picture 1902; the rotated region-of-interest pooling layer adjusts the arbitrary-angle candidate text boxes to obtain the fixed-size text box feature map corresponding to each candidate text box; the recognition result output recognizes the text box feature maps and outputs the text recognition result 1904. The white boxes in the text recognition result 1904 contain the recognized text.
FIGS. 20-24 are schematic diagrams of text recognition results for various application scenarios. FIG. 20 is the text recognition result of recognizing the text in an advertisement picture, with the recognized text in black boxes. FIG. 21 is the result for a natural scene picture, with the recognized text in white boxes. FIGS. 22 and 23 are the results for game pictures, with the recognized text in black boxes. FIG. 24 is the result for a bank card picture, with the recognized digits in gray boxes.
In this embodiment, the subtasks corresponding to the substructures of the machine learning model are executed by some of the FPGA units, and when multiple pieces of task data are obtained, the FPGA units execute the subtasks of each piece in parallel, so that the subtasks corresponding to the pieces of task data can be processed in parallel, improving the processing efficiency of multiple pieces of task data.
Those of ordinary skill in the art can understand that all or part of the flows of the above embodiment methods can be completed by a computer program instructing the related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the above method embodiments. Any reference to the memory, storage, database, or other media used in the embodiments of this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be pointed out that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (24)

  1. A picture processing method, characterized in that it is implemented by processing units corresponding to the substructures of a machine learning model executing the corresponding subtasks, at least some of the processing units comprising field-programmable gate array (FPGA) units; the method comprising:
    obtaining a picture to be processed;
    extracting text features from the picture to be processed;
    determining candidate text boxes at arbitrary angles in the picture to be processed according to the text features;
    performing rotated region-of-interest pooling on each of the candidate text boxes, and projecting each candidate text box onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box;
    recognizing the text in the text box feature maps to obtain a text recognition result.
  2. The method according to claim 1, characterized in that the machine learning model comprises a convolution layer, and extracting the text features from the picture to be processed comprises:
    inputting the picture to be processed into the convolution layer;
    performing convolution processing on the picture to be processed according to the convolution kernels of the convolution layer to obtain the text features of the picture to be processed.
  3. The method according to claim 2, characterized in that the machine learning model comprises a candidate region generation network connected to the convolution layer, and determining candidate text boxes at arbitrary angles in the picture to be processed according to the text features comprises:
    inputting the text features into the candidate region generation network;
    performing convolution processing on the text features through the candidate region convolution layer in the candidate region generation network to obtain a text feature convolution result;
    determining the position information of each candidate text box in the picture to be processed according to the text feature convolution result;
    performing non-maximum suppression processing on the position information of each candidate text box to obtain candidate text boxes at arbitrary angles.
  4. The method according to claim 3, characterized in that performing non-maximum suppression processing on the position information of each candidate text box to obtain candidate text boxes at arbitrary angles comprises:
    determining the arbitrary-angle candidate text boxes in the picture to be processed according to the position information of the candidate text boxes;
    determining overlapping candidate text boxes;
    performing non-maximum suppression processing on the overlapping candidate text boxes to obtain non-overlapping candidate text boxes at arbitrary angles.
  5. The method according to claim 3, characterized in that the machine learning model comprises a pooling layer connected in turn to the candidate region generation network; performing rotated region-of-interest pooling on each candidate text box and projecting each candidate text box onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box comprises:
    inputting each candidate text box into the pooling layer;
    determining the projection parameters of each candidate text box according to the fixed size of a preset feature map; projecting each candidate text box onto a fixed-size feature map according to the projection parameters to obtain the text box feature map corresponding to each candidate text box.
  6. The method according to claim 5, characterized in that the machine learning model comprises a fully connected layer connected to the pooling layer; recognizing the text in the text box feature map to obtain a text recognition result comprises:
    inputting the text box feature map into the fully connected layer;
    determining the probability value corresponding to each text classification from the text feature map;
    selecting the text classification corresponding to the maximum probability value as the text recognition result of the text feature map.
  7. The method according to claim 1, characterized in that the processing units comprise FPGA units and a CPU, and the picture to be processed comprises multiple pictures to be processed;
    extracting the text features from the picture to be processed comprises: inputting the current picture to be processed into the FPGA unit corresponding to the convolution layer for processing to obtain the text features of the picture to be processed;
    determining candidate text boxes at arbitrary angles in the picture to be processed according to the text features comprises: inputting the text features into the CPU corresponding to the candidate region generation network for processing to determine candidate text boxes at arbitrary angles;
    performing rotated region-of-interest pooling on each candidate text box and projecting each candidate text box onto a fixed-size feature map to obtain the text box feature map corresponding to each candidate text box comprises: determining, through the FPGA unit corresponding to the pooling layer and according to the arbitrary-angle candidate text boxes, the text box feature map corresponding to each candidate text box;
    recognizing the text in the text box feature map to obtain a text recognition result comprises: recognizing, through the FPGA unit corresponding to the recognition result layer, the text in the text box feature maps to obtain the text recognition result;
    wherein, during the processing of each processing unit, when a processing unit is idle, the subtask corresponding to the next picture to be processed is executed in parallel.
  8. A task data processing method, characterized in that the method comprises:
    obtaining multiple pieces of task data;
    for each piece of task data, executing the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure; at least some of the processing units comprising field-programmable gate array (FPGA) units;
    during the processing of each processing unit, when the processing unit is idle, executing the subtask corresponding to the next piece of task data in parallel.
  9. The method according to claim 8, characterized in that, for each piece of task data, executing the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure comprises:
    for each piece of task data, reading processing unit invocation data from a register, the processing unit invocation data being written into the register by a central processing unit (CPU);
    according to the processing unit invocation data, invoking the processing units corresponding to the substructures in turn, in the order of the substructures in the machine learning model, to execute the subtasks of the corresponding substructures.
  10. The method according to claim 8, characterized in that the task data is image processing task data, the machine learning model is an image processing model, and the task execution result is an image processing result;
    for each piece of task data, executing the subtask of each substructure in turn, in the order of the substructures in the machine learning model, through the processing unit corresponding to that substructure comprises:
    for each piece of image processing task data, executing the image processing subtask of each substructure in turn, in the order of the substructures in the image processing model, through the processing unit corresponding to that substructure.
  11. The method according to claim 10, characterized in that the processing units further comprise a CPU, and for each piece of image processing task data, executing the image processing subtask of each substructure in turn, in the order of the substructures in the image processing model, through the processing unit corresponding to that substructure comprises:
    inputting the image processing task data into the FPGA unit corresponding to the convolution layer substructure in the image processing model to obtain a convolution result;
    sending the convolution result to the CPU, and executing through the CPU the candidate region selection task corresponding to the candidate network substructure in the image processing model to obtain a region selection result;
    performing classification processing on the region selection result through the FPGA unit corresponding to the classification substructure in the image processing model to obtain a classification result;
    determining, through the CPU and according to the classification result, the task execution result corresponding to the image processing task data.
  12. The method according to claim 11, characterized in that performing classification processing on the region selection result through the FPGA unit corresponding to the classification substructure in the image processing model to obtain a classification result comprises:
    invoking the FPGA unit corresponding to the non-maximum suppression substructure in the image processing model to perform non-maximum suppression processing on the region selection result to obtain a non-maximum suppression result;
    performing pooling processing on the non-maximum suppression result through the FPGA unit corresponding to the pooling layer substructure in the image processing model to obtain a pooling result;
    inputting the pooling result into the FPGA unit corresponding to the fully connected layer substructure in the image processing model to obtain the classification result.
  13. The method according to claim 8, characterized in that the method further comprises:
    during the processing of each processing unit, when the subtask of the previous task data at the current substructure has not finished and the subtask of the current task data at the previous substructure has finished, waiting for the processing unit corresponding to the subtask of the current substructure to be released;
    after the processing unit corresponding to the subtask of the current substructure is released, invoking it to execute the subtask of the current task data at the current substructure.
  14. A task data processing method, characterized in that it is applied to a distributed server master, the method comprising:
    receiving task data sent by a terminal;
    determining the distributed server slave address allocated for the task data, and sending the task data to the corresponding distributed server slave;
    the distributed server slave being configured to put the task data into a thread pool and, when processing task data, obtain multiple pieces of task data from the thread pool; for each piece of task data, execute the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure, at least some of the processing units comprising field-programmable gate array (FPGA) units; and during the processing of each processing unit, when the processing unit is idle, execute the subtask corresponding to the next piece of task data in parallel.
  15. A task data processing method, characterized in that it is applied to a distributed server slave, the method comprising:
    when task data sent by a distributed server master is received, putting the task data into a thread pool;
    when processing task data, obtaining multiple pieces of task data from the thread pool;
    for each piece of task data, executing the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure, at least some of the processing units comprising field-programmable gate array (FPGA) units; during the processing of each processing unit, when the processing unit is idle, executing the subtask corresponding to the next piece of task data in parallel.
  16. A task data processing apparatus, characterized in that the apparatus comprises at least a task scheduling unit and field-programmable gate array (FPGA) units connected to each other; the task scheduling unit is configured to obtain multiple pieces of task data; for each piece of task data, execute the subtask of each substructure in turn, in the order of the substructures in a machine learning model, through the processing unit corresponding to that substructure, at least some of the processing units comprising the FPGA units; and during the processing of each processing unit, when the processing unit is idle, execute the subtask corresponding to the next piece of task data in parallel.
  17. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the picture processing method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the task data processing method according to any one of claims 9 to 13.
  19. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the task data processing method according to claim 14.
  20. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the task data processing method according to claim 15.
  21. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the picture processing method according to any one of claims 1 to 8.
  22. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the task data processing method according to any one of claims 9 to 13.
  23. A distributed server master, characterized in that it comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the task data processing method according to claim 14.
  24. A distributed server slave, characterized in that it comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the task data processing method according to claim 15.
PCT/CN2019/102587 2018-08-27 2019-08-26 图片处理方法、任务数据处理方法和装置 WO2020043057A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19854472.8A EP3846079A4 (en) 2018-08-27 2019-08-26 IMAGE PROCESSING METHOD, AND TASK DATA PROCESSING METHOD AND DEVICE
US17/010,812 US20200401829A1 (en) 2018-08-27 2020-09-02 Picture processing method, and task data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810980841.1 2018-08-27
CN201810980841.1A CN109325494B (zh) 2018-08-27 2018-08-27 图片处理方法、任务数据处理方法和装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/010,812 Continuation US20200401829A1 (en) 2018-08-27 2020-09-02 Picture processing method, and task data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2020043057A1 true WO2020043057A1 (zh) 2020-03-05

Family

ID=65264635

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102587 WO2020043057A1 (zh) 2018-08-27 2019-08-26 图片处理方法、任务数据处理方法和装置

Country Status (4)

Country Link
US (1) US20200401829A1 (zh)
EP (1) EP3846079A4 (zh)
CN (1) CN109325494B (zh)
WO (1) WO2020043057A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881778A (zh) * 2020-07-08 2020-11-03 泰康保险集团股份有限公司 文本检测的方法、装置、设备和计算机可读介质
US20200401829A1 (en) * 2018-08-27 2020-12-24 Tencent Technology (Shenzhen) Company Limited Picture processing method, and task data processing method and apparatus

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084240A (zh) * 2019-04-24 2019-08-02 网易(杭州)网络有限公司 一种文字提取系统、方法、介质和计算设备
CN110458164A (zh) * 2019-08-07 2019-11-15 深圳市商汤科技有限公司 图像处理方法、装置、设备及计算机可读存储介质
CN110689475A (zh) * 2019-09-10 2020-01-14 浪潮电子信息产业股份有限公司 一种图像数据处理方法、系统、电子设备及存储介质
CN111860506B (zh) * 2020-07-24 2024-03-29 北京百度网讯科技有限公司 识别文字的方法和装置
CN111985465A (zh) * 2020-08-17 2020-11-24 中移(杭州)信息技术有限公司 文本识别方法、装置、设备及存储介质
CN112819684B (zh) * 2021-03-02 2022-07-26 成都视海芯图微电子有限公司 一种面向图像文本识别的加速装置
CN113010469B (zh) * 2021-03-18 2023-05-26 恒睿(重庆)人工智能技术研究院有限公司 图像特征提取方法、装置以及计算机可读存储介质
CN112990182B (zh) * 2021-05-10 2021-09-21 北京轻松筹信息技术有限公司 筹款信息审核方法、系统及电子设备
US11961317B2 (en) * 2021-11-24 2024-04-16 Oracle Financial Services Software Limited Extracting textual information from image documents

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102594891A (zh) * 2012-02-17 2012-07-18 中国科学院计算技术研究所 用于处理远程过程调用请求的方法及系统
CN104866460A (zh) * 2015-06-04 2015-08-26 电子科技大学 一种基于SoC的容错自适应可重构系统与方法
CN108229299A (zh) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 证件的识别方法和装置、电子设备、计算机存储介质
CN108229303A (zh) * 2017-11-14 2018-06-29 北京市商汤科技开发有限公司 检测识别和检测识别网络的训练方法及装置、设备、介质
US10032072B1 (en) * 2016-06-21 2018-07-24 A9.Com, Inc. Text recognition and localization with deep learning
CN109325494A (zh) * 2018-08-27 2019-02-12 腾讯科技(深圳)有限公司 图片处理方法、任务数据处理方法和装置

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122964B (zh) * 2011-03-31 2014-07-02 西安电子科技大学 一种基于fpga的高速rs编译码器实现方法
CN102497411B (zh) * 2011-12-08 2014-01-15 南京大学 面向密集运算的层次化异构多核片上网络架构
CN103279390B (zh) * 2012-08-21 2016-09-28 中国科学院信息工程研究所 一种面向小作业优化的并行处理系统
CN102903115B (zh) * 2012-10-12 2016-01-20 中国科学院深圳先进技术研究院 一种管状物体中心线的提取方法
CN103593323A (zh) * 2013-11-07 2014-02-19 浪潮电子信息产业股份有限公司 一种MapReduce任务资源配置参数的机器学习方法
CN103996186B (zh) * 2014-04-29 2017-03-15 小米科技有限责任公司 图像裁剪方法及装置
CN104299168B (zh) * 2014-09-16 2019-04-02 华北电力大学 一种架空输电线路巡检飞行机器人的视点优选方法
CN106326909A (zh) * 2015-07-01 2017-01-11 株式会社日立制作所 图像识别方法和图像识别装置
CN105956608A (zh) * 2016-04-21 2016-09-21 恩泊泰(天津)科技有限公司 一种基于深度学习的目标定位、分类算法
US11106467B2 (en) * 2016-04-28 2021-08-31 Microsoft Technology Licensing, Llc Incremental scheduler for out-of-order block ISA processors
US20180150256A1 (en) * 2016-11-29 2018-05-31 Intel Corporation Technologies for data deduplication in disaggregated architectures
CN106845530B (zh) * 2016-12-30 2018-09-11 百度在线网络技术(北京)有限公司 字符检测方法和装置
US11853244B2 (en) * 2017-01-26 2023-12-26 Wisconsin Alumni Research Foundation Reconfigurable computer accelerator providing stream processor and dataflow processor
CN107133616B (zh) * 2017-04-02 2020-08-28 南京汇川图像视觉技术有限公司 一种基于深度学习的无分割字符定位与识别方法
US11341368B2 (en) * 2017-04-07 2022-05-24 Intel Corporation Methods and systems for advanced and augmented training of deep neural networks using synthetic data and innovative generative networks
CN107832123A (zh) * 2017-07-13 2018-03-23 华中科技大学 基于人工智能的技术预见方法
CN107391429A (zh) * 2017-08-07 2017-11-24 胡明建 一种cpu+gpu+fpga的设计方法
US20190044809A1 (en) * 2017-08-30 2019-02-07 Intel Corporation Technologies for managing a flexible host interface of a network interface controller
JP7287823B2 (ja) * 2018-09-07 2023-06-06 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ 情報処理方法及び情報処理システム
US10970599B2 (en) * 2018-11-15 2021-04-06 Adobe Inc. Learning copy space using regression and segmentation neural networks
US11501477B2 (en) * 2021-03-22 2022-11-15 Adobe Inc. Customizing font bounding boxes for variable fonts
WO2023018477A1 (en) * 2021-08-12 2023-02-16 Ascenium, Inc. Parallel processing architecture using distributed register files

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102594891A (zh) * 2012-02-17 2012-07-18 Institute of Computing Technology, Chinese Academy of Sciences Method and system for processing remote procedure call requests
CN104866460A (zh) * 2015-06-04 2015-08-26 University of Electronic Science and Technology of China SoC-based fault-tolerant adaptive reconfigurable system and method
US10032072B1 (en) * 2016-06-21 2018-07-24 A9.Com, Inc. Text recognition and localization with deep learning
CN108229299A (zh) * 2017-10-31 2018-06-29 Beijing SenseTime Technology Development Co., Ltd. Certificate recognition method and apparatus, electronic device, and computer storage medium
CN108229303A (zh) * 2017-11-14 2018-06-29 Beijing SenseTime Technology Development Co., Ltd. Detection and recognition method, and training method and apparatus for detection and recognition networks, device, and medium
CN109325494A (zh) * 2018-08-27 2019-02-12 Tencent Technology (Shenzhen) Company Limited Picture processing method, and task data processing method and apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200401829A1 (en) * 2018-08-27 2020-12-24 Tencent Technology (Shenzhen) Company Limited Picture processing method, and task data processing method and apparatus
CN111881778A (zh) * 2020-07-08 2020-11-03 Taikang Insurance Group Co., Ltd. Text detection method, apparatus, device, and computer-readable medium
CN111881778B (zh) * 2020-07-08 2023-12-05 Taikang Insurance Group Co., Ltd. Text detection method, apparatus, device, and computer-readable medium

Also Published As

Publication number Publication date
CN109325494A (zh) 2019-02-12
CN109325494B (zh) 2021-09-17
EP3846079A1 (en) 2021-07-07
EP3846079A4 (en) 2021-10-27
US20200401829A1 (en) 2020-12-24

Similar Documents

Publication Publication Date Title
WO2020043057A1 (zh) Picture processing method, and task data processing method and apparatus
US11961286B2 (en) Performing object detection in an image
Yang et al. Re-thinking CNN frameworks for time-sensitive autonomous-driving applications: Addressing an industrial challenge
KR20210014686A (ko) Method, apparatus, and system for implementing multi-core parallelism on the TEE side
CN110765288B (zh) Image information synchronization method, apparatus, and system, and storage medium
WO2020125062A1 (zh) Image fusion method and related apparatus
US11175919B1 (en) Synchronization of concurrent computation engines
US11880715B2 (en) Method and system for opportunistic load balancing in neural networks using metadata
US20210158131A1 (en) Hierarchical partitioning of operators
WO2021238702A1 (zh) Task scheduling method, computing device, and storage medium
US9672063B2 (en) Scheduling, interpreting and rasterising tasks in a multi-threaded raster image processor
US20140152700A1 (en) Method, apparatus and system for determining a merged intermediate representation of a page
US20200279359A1 (en) Inspection apparatus, inspection method, and non-volatile storage medium
US9553761B2 (en) Dynamic server to server configuration and initialization
CN109656868B (zh) Method for transferring memory data between a CPU and a GPU
US11562554B1 (en) Workload reduction for non-maximum suppression operation
US11941519B2 (en) Machine learning training platform
US10922146B1 (en) Synchronization of concurrent computation engines
Lee et al. Accelerating a computer vision algorithm on a mobile SoC using CPU-GPU co-processing: a case study on face detection
US11847507B1 (en) DMA synchronization using alternating semaphores
CN115586955A (zh) Command execution method and apparatus, computer device, and storage medium
Gaowei et al. The face detection system based on GPU+CPU desktop cluster
US20210103852A1 (en) Resource based workload allocation for machine learning workloads
WO2024108907A1 (zh) Data processing method and apparatus, AI chip, electronic device, and storage medium
US11205242B2 (en) Memory error recovery for complex page RIP

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19854472; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019854472; Country of ref document: EP; Effective date: 20210329)