CN109298884B - Universal character operation accelerated processing hardware device and control method - Google Patents

Universal character operation accelerated processing hardware device and control method

Info

Publication number
CN109298884B
Authority
CN
China
Prior art keywords
data
character string
character
string
comparator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810995831.5A
Other languages
Chinese (zh)
Other versions
CN109298884A (en)
Inventor
李文明
叶笑春
范东睿
王达
张浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Ruixin Integrated Circuit Technology Co ltd
Original Assignee
Beijing Zhongke Ruixin Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Ruixin Technology Group Co ltd
Priority to CN201810995831.5A
Publication of CN109298884A
Application granted
Publication of CN109298884B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134 Register stacks; shift registers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/30138 Extension of register space, e.g. register cache

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a universal character string acceleration processing hardware device and a control method. It relates to a universal hardware acceleration architecture, based on a 3D memory computing mechanism, for the string operations that are ubiquitous in big data applications. The string hardware acceleration architecture comprises a string comparison acceleration structure and a string write operation acceleration structure, and can accelerate the string comparison and string position interchange operations that are common in big data applications. The technical solution of the invention can significantly improve the execution efficiency of string operations in big data applications and reduce the power consumption of the overall application execution.

Description

Universal character operation accelerated processing hardware device and control method
Technical Field
The invention relates to the technical fields of big data applications and hardware acceleration, and in particular to a universal hardware acceleration device for string processing, together with its control method. The device can be integrated with an existing general-purpose processor and uses a hardware acceleration structure to handle the basic character operations that are ubiquitous in big data processing.
Background
In big data processing applications, character operations are the most basic type of operation, and almost all high-level languages provide a basic function library for them. Character operations have become especially important now that applications such as search engines and social networks occupy a growing part of people's lives. Examples include web services, cloud computing, network packet security inspection, database queries, lexical and syntactic analysis in natural language processing, and DNA sequence alignment and protein amino acid sequence alignment in biological computing. Character operations have penetrated every aspect of big data applications.
As the volume of data processed by big data applications grows sharply, general-purpose processors handle such workloads inefficiently. The main reason is that traditional general-purpose processors were originally designed for scientific computing applications, which emphasize complex operations on data over data access. Rapidly developing big data applications, by contrast, are dominated by data access, while the computation itself tends to be simple. As a result, current general-purpose processors are inefficient and consume excessive power when processing big data applications. More specifically, on the one hand, the complex computational pipelines of modern high-performance processors are excessive for character operations; on the other hand, data must travel a long path from memory to the computing unit (including the on-chip network and each level of cache), and because big data applications consist largely of data movement, this further lowers efficiency and raises energy consumption. In summary, using a high-performance processor with a complex pipeline to process big data character operations, which are computationally simple but involve huge data volumes, leads to low processing efficiency and wastes a large amount of power. With advances in integrated circuit technology, through-silicon via (TSV) technology has enabled 3D memory such as HBM and HMC. In 3D memory, multiple RAM layers are stacked using through-silicon vias, as shown in fig. 1. Each RAM layer is divided into multiple storage areas, the storage areas aligned in the vertical direction form a Vault structure 105, and each Vault structure supports independent read access.
3D memory can tightly couple simple computing units with storage units, which significantly increases memory access bandwidth and shortens data transfer time; this approach is called processing-in-memory (PIM). How to use PIM technology to realize an efficient, low-power string processing structure has therefore become an urgent problem.
Disclosure of Invention
To solve the above problems in the prior art, the present invention provides a universal character string acceleration processing hardware device and a control method thereof. Specifically, the present invention provides the following technical solutions:
in one aspect, the present invention provides a universal character string acceleration processing hardware device, wherein the hardware device is integrated in a 3D memory and comprises a PIM enable control unit, a simple processor core, and a character string acceleration structure integrated on a logic layer of the 3D memory;
the character string acceleration structure and the simple processor core are connected with the main processor;
at the main processor end, a character string operation that needs to use the device calls an interface function, the interface function wakes up the PIM enable control unit, and the PIM enable control unit sends control information of the character string operation to be performed to a controller of the 3D memory;
the simple processor core parses the control information and, according to where the data is located, sends the control information to the character string acceleration structure corresponding to the Vault in which the data resides;
and after receiving the control information sent by the simple processor core, the character string acceleration structure executes the character string operation and returns the operation result to the simple processor core.
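The routing of control information to the Vault that holds the data can be sketched in software. The following is a minimal illustration only: the Vault count, the interleaving granularity, the `ControlInfo` fields, and the splitting of a request that spans several Vaults are all assumptions made for the example and are not specified by the invention.

```python
from dataclasses import dataclass

NUM_VAULTS = 32     # assumption: number of Vaults in the modelled 3D memory
BLOCK_BYTES = 256   # assumption: address-interleaving granularity per Vault


@dataclass(frozen=True)
class ControlInfo:
    """Hypothetical control message: operation type plus address range."""
    op: str       # "compare" or "interchange"
    start: int    # first byte address of the string data
    length: int   # number of bytes to process


def vault_of(addr: int) -> int:
    """Map a byte address to a Vault index (hypothetical block interleaving)."""
    return (addr // BLOCK_BYTES) % NUM_VAULTS


def dispatch(info: ControlInfo) -> list:
    """Split one request into (vault, sub-request) pairs, one per block touched."""
    out = []
    addr, end = info.start, info.start + info.length
    while addr < end:
        block_end = (addr // BLOCK_BYTES + 1) * BLOCK_BYTES
        chunk_end = min(end, block_end)
        out.append((vault_of(addr), ControlInfo(info.op, addr, chunk_end - addr)))
        addr = chunk_end
    return out
```

In this sketch a request whose address range crosses a block boundary yields one sub-request per Vault, each of which would be handled by the acceleration structure attached to that Vault.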
Preferably, the string acceleration structure has fixed access to the memory granule belonging to each Vault.
Preferably, the simple processor core parses the control information, and when the operation type corresponding to the control information is a character comparison operation, sends the control information to the character string acceleration structure; the character string acceleration structure parses the control information in a decoder and controller, sends a data read request to the memory slice of the Vault in which it is located, stores the read data in a cache, and performs a prefetch operation on the slice of the 3D memory to obtain prefetched data;
a comparison operation is performed on the read data and the prefetched data, and a result is returned.
Preferably, after the pre-fetching operation is performed on the 3D memory slices, the read data stored in the cache is sent to a shift operation register, and the pre-fetched data is stored in the cache.
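The hand-off between the cache and the shift operation register can be modelled beat by beat. The sketch below is an assumption-laden software analogy, not the hardware design: each item of `words` stands for one memory read, and the function name `pipeline` is hypothetical.

```python
def pipeline(words):
    """Yield (beat, shift_register, cache) for each beat of the hand-off.

    On every beat the word held in the cache moves to the shift register,
    and the next prefetched word (or None at the end) takes its place.
    """
    it = iter(words)
    cache = next(it, None)        # the first read fills the cache
    beat = 0
    while cache is not None:
        shift_register = cache    # cache contents move to the shift register
        cache = next(it, None)    # prefetched data is stored in the cache
        beat += 1
        yield beat, shift_register, cache
```

Because the prefetch overlaps with the shift and compare stages, the shift register receives a new word on every beat as long as the memory keeps up.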
Preferably, the comparison operation is performed in character comparators; there are three character comparators: character comparator 1, character comparator 2, and character comparator 3, wherein comparator 1 compares the character string contents, and comparator 2 and comparator 3 detect whether a character string has ended.
Preferably, the return result is returned to the simple processor core.
Preferably, the simple processor core parses the control information, and when the operation type corresponding to the control information is a character string interchange operation, sends the control information to the character string acceleration structure; the character string acceleration structure parses the control information in a decoder and controller, sends a data read request to the memory slice of the Vault in which it is located, stores the read data in a cache, and performs a prefetch operation on the slice of the 3D memory to obtain prefetched data;
a comparison operation is performed on the read data and the prefetched data, and the compared data is written back to a target address.
Preferably, after the pre-fetching operation is performed on the 3D memory slices, the read data stored in the cache is sent to a shift operation register, and the pre-fetched data is stored in the cache.
Preferably, the comparison operation is performed in character comparators; there are two character comparators: character comparator 4 and character comparator 5, which detect whether the character string has ended.
Preferably, the shift operation register is used for performing a shift operation on data.
In addition, the invention also provides a universal character string acceleration processing method, which is applied to a universal character string acceleration processing hardware device, wherein the device comprises a PIM (processing-in-memory) enable control unit, a simple processor core, and a character string acceleration structure integrated on the logic layer of the 3D memory; the character string acceleration structure and the simple processor core are connected with the main processor; the method comprises the following steps:
step 1, waking up the PIM enable control unit through an interface function;
step 2, the PIM enable control unit receives a character string control command and sends it directly to a controller of the 3D memory, and the controller sends the character string control command to the simple processor core;
step 3, the simple processor core parses the character string control command and generates an instruction control message;
step 4, the character string acceleration structure sends a data read request to the 3D memory based on its parsing of the control message;
step 5, storing the data read in step 4 in a cache, and performing the corresponding operation on the read character string based on the parsing of the control message;
step 6, judging whether the operation in step 5 meets an end condition, and finishing the operation when the condition is met.
Preferably, the step 4 further comprises: the result of parsing the control message comprises a character string comparison operation and a character string interchange operation.
Preferably, when the result of the parsing is a string comparison operation, the step 5 further includes:
step 501, respectively storing the read data in a first cache and a second cache, simultaneously executing a pre-fetching operation on the fragments of the 3D memory, transmitting the data in the first cache and the second cache to the first shift operation register and the second shift operation register in the next beat, and continuously storing the pre-fetched data in the first cache and the second cache;
step 502, after the shift alignment in the shift operation register, data comparison is performed in a character comparator, and the character string comparison operation is completed.
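The shift alignment of step 502 can be illustrated with a small software model. This is a sketch under stated assumptions: the 8-byte access width, the zero padding past the end of memory, and the helper names `read_aligned` and `aligned_window` are all hypothetical and chosen only for the example.

```python
WORD_BYTES = 8  # assumption: bytes returned by one aligned memory access


def read_aligned(mem: bytes, word_index: int) -> bytes:
    """Return one aligned word, zero-padded past the end of memory."""
    word = mem[word_index * WORD_BYTES:(word_index + 1) * WORD_BYTES]
    return word.ljust(WORD_BYTES, b"\x00")


def aligned_window(mem: bytes, addr: int) -> bytes:
    """Return WORD_BYTES bytes starting at an arbitrary byte address."""
    index, offset = divmod(addr, WORD_BYTES)
    current = read_aligned(mem, index)          # word held in the cache
    prefetched = read_aligned(mem, index + 1)   # word fetched ahead of time
    # Shift: drop `offset` leading bytes and refill from the prefetched word.
    return (current + prefetched)[offset:offset + WORD_BYTES]
```

For a string that starts three bytes into an aligned word, the model discards the three leading bytes and fills the tail from the prefetched word, so the comparator always sees a full window that begins at the first character of interest.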
Preferably, in step 502, the character comparator at least includes a character comparator for comparing the contents of the character string and a character comparator for detecting whether the character string is over.
Preferably, after step 502, the method further comprises:
step 503, executing an equal-length comparison operation according to the length of the shorter character string data, wherein the part of the character string not compared in this operation enters the next comparison operation; when the length is not specified, detecting a data end symbol, wherein the data end symbol comprises an EOF (end of file) symbol and a character string end symbol.
Preferably, when the result of the parsing is a string comparison operation, the step 6 is followed by:
step 601, returning the result of the comparison operation to the pipeline;
step 602, for a comparison that has not completed, continuing to the next comparison operation until the comparison is finished, and writing back the result to the simple processor core.
Preferably, the comparison result comprises greater than, equal to, or less than, and/or the matched position, and/or the number of matches.
Preferably, when the result of the parsing is a string interchange operation, the step 5 further includes:
step 511, storing the read data in a third cache, simultaneously executing a pre-fetching operation on the fragments of the 3D memory, transmitting the data in the third cache to a third shift operation register in the next beat, and continuously storing the pre-fetched data in the third cache;
step 512, after the shift processing is performed in the shift operation register, performing an interchange operation of data.
Preferably, the step 512 further comprises:
step 5121, calculating the length of the remaining data by using the source address and the current data address;
step 5122, when the length is not specified, after the shift processing, the data is compared in the comparator to determine whether the character string to be written is finished; and/or detecting a data end symbol to judge whether the character string to be written is ended.
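Steps 5121 and 5122 can be sketched as a chunked write loop. The example below is illustrative only and rests on several assumptions: memory is modelled as a `bytearray`, the chunk size is 8 bytes, source and destination regions do not overlap, and the function names `remaining` and `stream_write` are hypothetical.

```python
CHUNK = 8  # assumption: bytes handled per operation


def remaining(src: int, cur: int, length: int) -> int:
    """Bytes still to be written, from the source and current addresses."""
    return length - (cur - src)


def stream_write(mem: bytearray, src: int, dst: int, length=None) -> int:
    """Copy a string from src to dst chunk by chunk; return the bytes written.

    With length=None the copy stops after the first "\0" terminator.
    Overlapping source and destination regions are not handled.
    """
    cur = src
    while True:
        if length is None:
            chunk = bytes(mem[cur:cur + CHUNK])
            end = chunk.find(b"\x00")        # end-of-string detection
            if end != -1:
                chunk = chunk[:end + 1]      # keep the terminator itself
            last = end != -1 or not chunk
        else:
            left = remaining(src, cur, length)
            chunk = bytes(mem[cur:cur + min(CHUNK, left)])
            last = left <= CHUNK or not chunk
        offset = cur - src
        mem[dst + offset:dst + offset + len(chunk)] = chunk
        cur += len(chunk)
        if last:
            return cur - src
```

Each pass writes only the valid bytes of the current chunk, so the write proceeds in a streaming fashion whether the end is given by an explicit length or by the terminator.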
It should be noted that the two embodiments below describe separate execution processes for the string comparison operation and the string interchange operation, but the modules and components involved, such as caches one, two, and three, the decoder, and the shift operation register, can all be reused across these execution processes. In the description, reusable components carry different labels only to distinguish the execution processes and make them easier to understand and explain; the different labels do not mean that the components cannot be reused. Unless this specification explicitly states that two components are completely independent modules/structures that cannot be multiplexed with each other, those skilled in the art can reasonably multiplex the modules/structures based on conventional technology and the technical idea of the present invention, so as to save computing resources.
Compared with the prior art, the method and the device have the advantages that the data calling flow in the character string processing operation is effectively simplified, the operation resources are saved, the character string processing speed is effectively improved, and the power consumption of the whole application execution is effectively reduced.
Drawings
FIG. 1 is a schematic diagram of a 3D storage technology-based design of a string processing acceleration architecture according to an embodiment of the present invention;
FIG. 2 is a logical block diagram of a string processing acceleration architecture according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an accelerated structure of string comparison operations in an accelerated character processing structure according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an accelerated structure of a string writing operation in an accelerated character processing structure according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating an embodiment of a character acceleration process.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the two embodiments below describe separate execution processes for the string comparison operation and the string interchange operation, but the hardware structures involved, such as the cache, the decoder, and the shift operation register, can all be reused across these execution processes. In the examples, reusable components carry different labels only to distinguish the execution processes and make them easier to understand and explain; the different labels do not mean that the components cannot be reused. Unless this specification explicitly states that components are completely independent modules/structures that cannot be reused, those skilled in the art can reasonably multiplex the modules/structures based on conventional technology and the technical idea of the present invention, so as to save computing resources.
Example 1
In a specific embodiment, the hardware device for accelerating processing of a universal character string according to the present invention is based on a 3D memory, and the hardware acceleration processing structure is integrated inside the 3D memory to accelerate processing of the character string. The hardware acceleration processing apparatus is divided into a character string comparison acceleration structure and a character string interchange acceleration structure. The acceleration structure directly reads data from the memory to process the character string and writes the result back.
In a specific implementation, a hardware control structure for the acceleration processing structure is added at the main processor end, and a function for calling the acceleration processing structure is provided in the programming interface for the user to choose to use. A simple processor core must be added at the 3D memory end to execute control functions, which include receiving the execution command sent from the main processor end, parsing the command, controlling each character acceleration structure to execute the corresponding operation, collecting the execution result, and returning it to the main processor end.
In addition, the user can specify the length of the character strings to be compared, or the comparison acceleration structure can automatically determine whether the comparison has finished from the string end identifier "\0" or the file end identifier "EOF".
The character string comparator reads data directly from the 3D memory slice to which it belongs, and the comparison can proceed in a streaming fashion at the data read rate, without waiting for the entire character string to arrive before comparing.
The character string interchange operation acceleration structure reads character string data from the 3D memory slice to which it belongs; the valid part of the string read in the current operation is written to the destination address in the current operation, and the remainder is written in subsequent operations, thereby realizing a streaming string write operation.
The core technical solution of the invention is described below with a more specific embodiment, with reference to the drawings:
As shown in fig. 1, a structure for accelerating character operations is added to a conventional processor and a 3D memory (for example, the HMC memory from Micron), comprising a simple processor core 103 (simplified core) and a character string acceleration structure 104 integrated on a logic layer 106. The whole HMC memory uses 3D technology to stack N layers of memory granules (8 layers in the figure), and all of the storage is divided vertically into multiple Vault structures 105; in the example shown, each Vault structure comprises 8 layers of stacked memory granules. The string acceleration processing structure (String ACC) proposed by the present invention is arranged at the logic layer 106, and each one has fixed access to the stacked memory granules belonging to its Vault.
Fig. 2 is a logical structure diagram of the character operation acceleration processing hardware device proposed in the present invention, in which the dark gray blocks are the added control units: a PIM enable control unit (PIM Enable Unit, PEU) 203, a simple processor core (simple core) 209, and a string acceleration processing structure (String ACC) 212. First, on the host processor 201 side, the user program must be modified so that string operations that need to use String ACC 212 are executed by calling a special interface function. Calling the string processing interface function wakes up the PEU 203, which sends control information, such as the execution type and the address range of the string operation, directly to the HMC controller 207, bypassing the cache levels (L1, L2, LLC). The string operation control information is then transmitted to the simple core 209 added in the HMC memory 208. The simple core parses the control information and, according to which Vault 211 holds the data, sends the operation control information to the String ACC 212 of that Vault. The String ACC 212 starts to execute the string operation after receiving the operation sent from the simple core 209, and returns the operation result to the simple core 209.
Fig. 3 shows the string comparison acceleration device String ACC. The simple core 301 parses a message received from the main processor and, when the operation type is a character comparison operation, sends a control message 302 to the String ACC device. The string comparison accelerator parses the control information of the received command in the decoder & controller 305 and issues a data read request to the DRAM memory slice 303 of the Vault 304 in which it resides. The data read back is buffered in caches 1 and 2 while a prefetch operation is performed on the DRAM slice. The data in the caches can be shifted as required, mainly for alignment: the required data, for example a single 8-bit character, is not necessarily aligned to one memory access, because the memory typically reads several address-aligned bytes into the cache. Corresponding shift processing is therefore needed to obtain the required content. After the shift alignment operation, a data comparison operation is performed. An equal-length comparison is executed according to the length of the shorter string data, and the part of the string not processed this time enters the next comparison operation. There are three comparators in the comparison stage, which compare string1 with string2, string1 with "\0" or EOF, and string2 with "\0" or EOF, respectively. The first comparator compares the contents of the strings, and the latter two comparators each detect whether one of the strings has ended. If the comparison has ended, the final comparison result (">" or "<"), or the matched position pos, or the match counter is returned to the simple core 301.
Whether the comparison has finished may also be judged from the string lengths len1 and len2, that is, whether the comparison length specified by the program has been reached. In this case the result is likewise returned to the simple core 301. A comparison that has not completed continues to the next comparison operation until the end condition is met. Because data is read from the DRAM slice 303 and is limited by the access bandwidth, the whole string comparison is executed in a pipelined, streaming fashion, that is, data is processed as it arrives rather than after the whole string has been read, which improves execution efficiency.
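The behaviour of the three comparators can be approximated by a short software model. The sketch below is an illustration under stated assumptions rather than the hardware design: a byte-wise loop stands in for comparators that would operate on a whole chunk at once, the end of each buffer is treated as EOF by appending a sentinel, and the function name `stream_compare` and the 8-byte chunk size are hypothetical.

```python
CHUNK = 8  # assumption: bytes compared per operation


def stream_compare(s1: bytes, s2: bytes, length=None):
    """Compare two strings chunk by chunk; return (relation, pos).

    relation is "<", "=", or ">"; pos is the index of the first mismatch,
    or the index at which the comparison ended.
    """
    s1, s2 = s1 + b"\x00", s2 + b"\x00"   # sentinel models EOF at buffer end
    limit = min(len(s1), len(s2))
    if length is not None:
        limit = min(limit, length)         # program-specified comparison length
    pos = 0
    while pos < limit:
        stop = min(pos + CHUNK, limit)     # one streamed chunk per iteration
        for x, y in zip(s1[pos:stop], s2[pos:stop]):
            differ = x != y                # comparator 1: string contents
            end1 = x == 0                  # comparator 2: string1 has ended
            end2 = y == 0                  # comparator 3: string2 has ended
            if differ:
                return (">" if x > y else "<"), pos
            if end1 and end2:
                return "=", pos
            pos += 1
    return "=", pos
```

A string that ends first compares as smaller, matching the usual lexicographic convention, and a specified length simply caps how many bytes are examined.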
Fig. 4 shows the string interchange operation acceleration structure. This structure is unified with the string comparison acceleration device of fig. 3, and many operations can reuse the same hardware resources; only the logic control and functional implementation differ. As shown, the character operation request is first sent from the host to the simple core 401 in the HMC memory, which determines the operation type (character comparison or character interchange) and then sends a corresponding control message 402 to the acceleration processing structure. The string interchange accelerator parses the control information of the received command in the decoder & controller 405 and issues a data read request to the DRAM memory slice 403 of the Vault 404 in which it is located. The data read back is stored in cache 406 while a prefetch operation is performed on the DRAM slice. The data in the cache can be shifted as required, mainly for alignment: the required data, for example a single 8-bit character, is not necessarily aligned to one memory access, because the memory typically reads several address-aligned bytes into the cache. Corresponding shift processing is therefore needed to obtain the required content. After the shifted data has been selected, the data is written back to the target address. The data length len is optional; if len is not specified, the operation ends at the string end identifier "\0", and the two comparators 409 and 410 are each used to determine whether the end condition is satisfied.
FIG. 5 is a flow chart of the character acceleration process. The user starts the String ACC acceleration structure by using the string acceleration processing interface 501 in the program. First, when the user calls the interface function, the PIM enable unit PEU 502 is activated, so that the character processing command bypasses the cache hierarchy and is sent directly to the simple core 503 added to the HMC memory. The simple core parses the character processing command, generates a string processing instruction, and sends it to the String ACC acceleration processing structure, which distinguishes between the character comparison and character interchange operations according to the instruction. For a string comparison operation 505, after receiving the instruction the acceleration processing structure first parses the address of the string to be processed and executes the memory access operation 507 at the parsed address. Because of alignment, the first address of the string is not guaranteed to be the first byte of the data returned by the access, so the corresponding shift operation 509 is required. The comparator then compares the data 511. After the comparison, the comparison result is accumulated 515 and, at the same time, whether the comparison has finished is judged 513. If it has finished, the result is written back to the simple core 517; if not, memory access and data comparison continue 515. For a string write operation, a similar sequence is executed: the address is first parsed and a memory access operation 508 is initiated, the returned data is shifted 510, and the write operation 512 is then performed.
Each write operation must determine whether the operation has ended 514. If not, the length of the string already processed is recorded and the memory access operation continues 516; if so, the operation completes 518.
Example 2
In another specific embodiment, the core technical solution of the present invention is described by two specific types of string operation execution flows.
As shown in fig. 2, the present invention is based on a conventional processor system and 3D memory HMC, and adds a PIM enable unit PEU 203, a simple core 209, and a string acceleration processing structure 212 to perform string processing by directly requesting data from a DRAM slice 213. The acceleration structure 212 is integrated with the Vault controller and performs string operations by reading data from the DRAM slices it controls.
The specific operation steps of each functional module are as follows:
Fig. 3 shows the apparatus for accelerating the character string comparison operation. The character comparison operation proceeds as follows:
Step 1: when programming, the user identifies the code that involves character string comparison and interchange operations and implements it with a dedicated interface function, which enables the PEU structure and in turn activates the character string acceleration processing structure.
Step 2: after receiving the operation command sent from the processing core 202, the PIM enable unit PEU directly sends the string control command to the HMC controller via the cache hierarchy, and further to the simple core 209 in the HMC memory.
Step 3: the simple core 301 parses the command sent by the PIM enabling unit PEU in the main processor and generates an instruction control message 302 that invokes the character acceleration processing structure String ACC.
Step 4: the decode and control unit 305 of the string acceleration structure parses the received instruction and issues data read requests to the DRAM slices 303 accordingly.
Step 5: the read data is cached in the caches 308 and 309 while a prefetch operation is performed on the DRAM slices; in the next beat, the data in caches 308 and 309 is sent to the shift operation registers 306 and 307, and the prefetched data continues to be placed in the caches 308 and 309.
Step 6: the data in the caches is shifted in the registers 306 and 307 as required, mainly for alignment: the required data is not necessarily what a single aligned memory access returns. For example, to read one character (8 bits), the memory typically reads several address-aligned bytes on-chip, so a corresponding shift is needed to extract the required content.
Step 7: after the shift alignment operation, the data comparison is performed in the character comparators 310, 311, and 312. The comparison stage has three comparators, which respectively compare string1 with string2, string1 with "\0" or EOF, and string2 with "\0" or EOF. The first comparator compares the contents of the strings, and the latter two comparators each detect whether one of the strings has ended.
Step 8: the comparison is executed over equal lengths according to the length of the shorter string's data, and the part of the string not compared this time enters the next comparison operation. If the length is not specified, it is checked whether EOF is specified, and the end is determined by whether EOF is encountered in the comparison. If neither is specified, the end is determined by "\0".
Step 9: if the comparison has ended, the result return operation is performed on the final comparison result (">", "=", or "<"), the matched position pos, or the match count counter, and the result is returned to the pipeline Regs.
Step 10: operations whose comparison is not complete proceed to the next comparison operation until an end condition is met, and the result is then written back to the simple core 301. Because the rate at which data is read from memory is limited by the memory access bandwidth, the whole string comparison can be executed as a pipelined data flow: data is processed as soon as it arrives, without waiting, which improves execution efficiency.
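The comparison steps above can be sketched as a small software model in Python. This is only an illustration under stated assumptions: the chunk size `CHUNK`, the function names, and the encoding of the result as -1, 0 or 1 are hypothetical, only the "\0" end identifier is modelled (not EOF, the matched position, or the match count), and prefetching and shifting are replaced by plain slicing.

```python
CHUNK = 8        # assumed number of bytes compared per beat (not given in the text)
TERMINATOR = 0   # the "\0" string end identifier


def first_index(flags):
    """Return the index of the first true flag, or None if there is none."""
    for i, flag in enumerate(flags):
        if flag:
            return i
    return None


def compare_beat(a: bytes, b: bytes):
    """Model one beat of the three comparators.

    Returns (result, done). result is -1, 0 or 1 for "<", "=" or ">";
    done is True once the outcome is decided.
    """
    n = min(len(a), len(b))
    diff = first_index(a[i] != b[i] for i in range(n))        # comparator 1: string1 vs string2
    end1 = first_index(a[i] == TERMINATOR for i in range(n))  # comparator 2: string1 vs "\0"
    end2 = first_index(b[i] == TERMINATOR for i in range(n))  # comparator 3: string2 vs "\0"
    ends = [e for e in (end1, end2) if e is not None]
    stop = min(ends) if ends else None
    if diff is not None and (stop is None or diff <= stop):
        return (-1 if a[diff] < b[diff] else 1), True
    if stop is not None:
        return 0, True   # both strings ended together with no difference
    return 0, False      # undecided, the next chunk is needed


def string_compare(mem: bytes, addr1: int, addr2: int, length=None) -> int:
    """Compare two strings stored in mem chunk by chunk."""
    offset = 0
    while True:
        n = CHUNK if length is None else min(CHUNK, length - offset)
        a = mem[addr1 + offset:addr1 + offset + n]
        b = mem[addr2 + offset:addr2 + offset + n]
        if n <= 0 or not a or not b:
            return 0     # length exhausted, or ran off the modelled memory
        result, done = compare_beat(a, b)
        if done:
            return result
        offset += n
```

Each call to `compare_beat` corresponds to one pass through steps 7 to 9, and the loop in `string_compare` corresponds to step 10: a chunk is processed as soon as it is available, and the loop stops at the first beat that decides the result.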
Fig. 4 shows the acceleration structure for the string interchange operation, which proceeds as follows:
Step 1: when programming, the user identifies the code that involves character string comparison and interchange operations and implements it with a dedicated interface function, which enables the PEU structure and in turn activates the character string acceleration processing structure.
Step 2: after receiving the operation command sent from the processing core 202, the PIM enable unit PEU directly sends the string control command to the HMC controller via the cache hierarchy, and further to the simple core 209 in the HMC memory.
Step 3: the simple core 401 parses the command sent by the PIM enabling unit PEU in the host processor and generates an instruction control message 402 that invokes the character acceleration processing structure String ACC.
Step 4: the decoding and control unit 405 of the string acceleration structure parses the received instruction and issues data read requests to the DRAM slices 403 accordingly.
Step 5: the read data is buffered in the cache 406; in the next beat, the data in cache 406 is sent to the shift operation register 408, and the DRAM slice prefetch continues to read the following data into the cache 406.
Step 6: the data in the cache is shifted in the register 408 as required, mainly because the required data does not necessarily begin at the first address returned by the DRAM slice, so a corresponding shift is needed to extract the required content.
Step 7: after the shift operation, the valid data undergoes a comparison in the comparator 409, the main purpose of which is to determine whether the character string to be written this time has ended (for the case where the length len is not specified).
Step 8: at the same time, the length of the remaining data is calculated 410 using the source address and the current data address. If the length is not specified, the end is determined by "\0".
Step 9: if the operation has not ended, the interchange of the remaining data continues in the same way.
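The write path of these steps can be sketched in Python as follows. The sketch rests on an interpretation that is an assumption, namely that the operation reads a string from a source address and writes it to a target address; the names `string_write`, `CHUNK` and `TERMINATOR` are hypothetical, overlapping source and target regions are not handled, and the hardware cache, prefetch and shift stages are reduced to slicing.

```python
CHUNK = 8        # assumed number of bytes moved per beat (not given in the text)
TERMINATOR = 0   # the "\0" string end identifier


def string_write(mem: bytearray, src: int, dst: int, length=None) -> int:
    """Copy a string from src to dst chunk by chunk; return the number of bytes written."""
    written = 0
    while True:
        cur = src + written                      # current data address
        if length is not None:
            remaining = length - (cur - src)     # step 8: source address vs current address
            if remaining <= 0:
                return written
            n = min(CHUNK, remaining)
        else:
            n = CHUNK
        chunk = bytes(mem[cur:cur + n])
        if not chunk:
            return written                       # ran off the modelled memory
        if length is None and TERMINATOR in chunk:
            end = chunk.index(TERMINATOR) + 1    # step 7: end identifier found, keep it
            mem[dst + written:dst + written + end] = chunk[:end]
            return written + end
        mem[dst + written:dst + written + len(chunk)] = chunk
        written += len(chunk)                    # step 9: continue with the remaining data
```

When `length` is given, the loop ends once the remaining length reaches zero; otherwise it ends at the first chunk that contains the end identifier, which mirrors the two end conditions described for 409 and 410.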
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (19)

1. A general string acceleration processing hardware device, characterized in that the device is integrated in a 3D memory and comprises a PIM (processing-in-memory) enabling control unit, a simple processor core and a string acceleration structure integrated on a logic layer of the 3D memory;
the character string acceleration structure and the simple processor core are connected with the main processor;
at the main processor end, an interface function is called for each character string operation that is to use the device, wherein the interface function wakes up the PIM enabling control unit, and the PIM enabling control unit sends the control information of the required character string operation to a controller of the 3D memory;
the simple processor core parses the control information and, according to where the data is located, sends the control information to the character string acceleration structure of the Vault in which that location lies;
and after receiving the control information sent by the simple processor core, the character string acceleration structure executes character string operation and returns an operation result to the simple processor core.
2. The apparatus of claim 1, wherein the string acceleration structure has fixed access to the memory granule belonging to each Vault.
3. The apparatus according to claim 1, wherein the simple processor core parses the control information, and when the operation type corresponding to the control information is a character comparison operation, the control information is sent to the string acceleration structure, and the string acceleration structure parses the control information in a decoder and controller, sends a data read request to a memory slice of the Vault in which the string acceleration structure is located, stores the read data in a cache, and performs a prefetch operation on the slice of the 3D memory to obtain prefetched data;
and comparing the read data with the prefetched data, and returning a result.
4. The apparatus of claim 3, wherein after performing the prefetch operation on the 3D memory slice, the read data stored in the cache is sent to a shift register, and the prefetched data is stored in the cache.
5. The apparatus of claim 3, wherein the comparison operation is performed in character comparators; there are three character comparators: a character comparator 1, a character comparator 2 and a character comparator 3, wherein the character comparator 1 is used for comparing character string contents, and the comparator 2 and the comparator 3 are used for detecting whether a character string has ended.
6. The apparatus of claim 3, wherein the return result is returned to the simple processor core.
7. The apparatus according to claim 1, wherein the simple processor core parses the control information, and when the operation type corresponding to the control information is a string interchange operation, the control information is sent to the string acceleration structure, and the string acceleration structure parses the control information in a decoder and controller, sends a data read request to a memory slice of the Vault in which the string acceleration structure is located, stores the read data in a cache, and performs a prefetch operation on the slice of the 3D memory to obtain prefetched data;
and comparing the read data with the prefetched data, and performing write-back operation on the compared data to a target address.
8. The apparatus of claim 7, wherein after performing the prefetch operation on the 3D memory slice, the read data stored in the cache is sent to a shift register, and the prefetched data is stored in the cache.
9. The apparatus of claim 7, wherein the comparison operation is performed in character comparators; there are two character comparators: a character comparator 4 and a character comparator 5, wherein the comparator 4 and the comparator 5 are used for detecting whether the character string has ended.
10. The apparatus of claim 4 or 8, wherein the shift operation register is configured to perform a shift operation on data.
11. A universal character string acceleration processing method, applied to a universal character string acceleration processing hardware device, characterized in that the device comprises a PIM (processing-in-memory) enabling control unit, a simple processor core and a character string acceleration structure integrated on a 3D memory logic layer; the character string acceleration structure and the simple processor core are connected with the main processor; the method comprises the following steps:
step 1, awakening the PIM enabling control unit through an interface function;
step 2, the PIM enabling control unit receives a character string control command and directly sends the character string control command to a controller of the 3D memory, and the controller sends the character string control command to the simple processor core;
step 3, the simple processor core parses the character string control command and generates an instruction control message;
step 4, the character string acceleration structure sends a data reading request to the 3D memory based on the analysis of the control message;
step 5, storing the data read in the step 4 in a cache, and performing corresponding operation on the read character string based on the analysis of the control message;
step 6, judging whether the operation in the step 5 meets a judgment condition, and finishing the operation when the judgment condition is met.
12. The method of claim 11, wherein in the step 4, the result of the analysis of the control message comprises a character string comparison operation and a character string interchange operation.
13. The method according to claim 12, wherein when the result of the parsing is a string comparison operation, the step 5 further comprises:
step 501, respectively storing the read data in a first cache and a second cache, simultaneously executing a pre-fetching operation on the fragments of the 3D memory, transmitting the data in the first cache and the second cache to the first shift operation register and the second shift operation register in the next beat, and continuously storing the pre-fetched data in the first cache and the second cache;
step 502, after the shift alignment in the shift operation register, data comparison is performed in a character comparator, and the character string comparison operation is completed.
14. The method according to claim 13, wherein in step 502, the character comparator comprises at least a character comparator for comparing the contents of the character string and a character comparator for detecting whether the character string is over.
15. The method of claim 13, wherein step 502 is further followed by:
step 503, when the equal-length comparison operation is executed according to the length of the data of the shortest character string, the part of the character string not compared this time enters the next comparison operation; when the length is not specified, detecting a data end symbol, wherein the data end symbol comprises an EOF (end of file) and a character string end symbol.
16. The method according to claim 12, wherein when the result of the parsing is a string comparison operation, the step 6 is followed by further comprising:
step 601, performing the result return operation on the comparison result and returning the result to the pipeline;
step 602, continuing to enter the next comparison operation for the operation whose comparison is not completed until the comparison is finished, and writing back the result to the simple processor core.
17. The method of claim 16, wherein the comparison result comprises greater than, equal to, or less than, and/or the matched position, and/or the number of matches.
18. The method according to claim 12, wherein when the result of the parsing is a string interchange operation, the step 5 further comprises:
step 511, storing the read data in a third cache, simultaneously executing a pre-fetching operation on the fragments of the 3D memory, transmitting the data in the third cache to a third shift operation register in the next beat, and continuously storing the pre-fetched data in the third cache;
step 512, after the shift processing is performed in the shift operation register, performing an interchange operation of data.
19. The method of claim 18, wherein the step 512 further comprises:
step 5121, calculating the length of the remaining data by using the source address and the current data address;
step 5122, when the length is not specified, after the shift processing, the data is compared in the comparator to determine whether the character string to be written is finished; and/or detecting a data end symbol to judge whether the character string to be written is ended.
CN201810995831.5A 2018-08-29 2018-08-29 Universal character operation accelerated processing hardware device and control method Active CN109298884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810995831.5A CN109298884B (en) 2018-08-29 2018-08-29 Universal character operation accelerated processing hardware device and control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810995831.5A CN109298884B (en) 2018-08-29 2018-08-29 Universal character operation accelerated processing hardware device and control method

Publications (2)

Publication Number Publication Date
CN109298884A CN109298884A (en) 2019-02-01
CN109298884B true CN109298884B (en) 2021-05-25

Family

ID=65165922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810995831.5A Active CN109298884B (en) 2018-08-29 2018-08-29 Universal character operation accelerated processing hardware device and control method

Country Status (1)

Country Link
CN (1) CN109298884B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090079B (en) * 2021-11-16 2023-04-21 海光信息技术股份有限公司 String operation method, string operation device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105487812A (en) * 2014-10-01 2016-04-13 三星电子株式会社 Method for supporting in-memory processing and memory module
CN106445472A (en) * 2016-08-16 2017-02-22 中国科学院计算技术研究所 Character operation acceleration method and apparatus, chip, and processor
CN107301455A (en) * 2017-05-05 2017-10-27 中国科学院计算技术研究所 Mixing cube storage system and speed-up computation method for convolutional neural networks
CN107590533A (en) * 2017-08-29 2018-01-16 中国科学院计算技术研究所 A kind of compression set for deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430182B2 (en) * 2015-01-12 2019-10-01 Microsoft Technology Licensing, Llc Enhanced compression, encoding, and naming for resource strings

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CasHMC: A Cycle-Accurate Simulator for Hybrid Memory Cube; Dong-Ik Jeon et al.; IEEE Computer Architecture Letters; IEEE; 20160816; Vol. 16, No. 1; pp. 10-13 *

Also Published As

Publication number Publication date
CN109298884A (en) 2019-02-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100095 room 135, 1st floor, building 15, Chuangke Town, Wenquan Town, Haidian District, Beijing

Applicant after: Beijing Zhongke Ruixin Technology Group Co.,Ltd.

Address before: 1 wensong Road, Zhongguancun environmental protection park, Beiqing Road, Haidian District, Beijing 100095

Applicant before: SMARTCORE (BEIJING) Co.,Ltd.

GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A general character operation accelerated processing hardware device and control method

Effective date of registration: 20210811

Granted publication date: 20210525

Pledgee: Zhongxin Suzhou Industrial Park Venture Capital Co.,Ltd.

Pledgor: Beijing Zhongke Ruixin Technology Group Co.,Ltd.

Registration number: Y2021990000709

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20220517

Granted publication date: 20210525

Pledgee: Zhongxin Suzhou Industrial Park Venture Capital Co.,Ltd.

Pledgor: Beijing Zhongke Ruixin Technology Group Co.,Ltd.

Registration number: Y2021990000709

TR01 Transfer of patent right

Effective date of registration: 20230714

Address after: 215125 11-303, creative industrial park, No. 328, Xinghu street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee after: Suzhou Ruixin integrated circuit technology Co.,Ltd.

Address before: 100095 room 135, 1st floor, building 15, Chuangke Town, Wenquan Town, Haidian District, Beijing

Patentee before: Beijing Zhongke Ruixin Technology Group Co.,Ltd.
