CN115878863B - Data searching method and data searching device - Google Patents

Data searching method and data searching device

Info

Publication number
CN115878863B
CN115878863B CN202211530394.2A
Authority
CN
China
Prior art keywords
data
output result
clock cycle
searching
clock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211530394.2A
Other languages
Chinese (zh)
Other versions
CN115878863A (en)
Inventor
杨飞
王文华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Flyslice Technologies Co ltd
Original Assignee
Hangzhou Flyslice Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Flyslice Technologies Co ltd filed Critical Hangzhou Flyslice Technologies Co ltd
Priority to CN202211530394.2A priority Critical patent/CN115878863B/en
Publication of CN115878863A publication Critical patent/CN115878863A/en
Application granted granted Critical
Publication of CN115878863B publication Critical patent/CN115878863B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data searching method and device. Received input data is divided into segments, each segment is looked up independently using a small amount of resources, the lookups are pipelined over a plurality of clock cycles, and finally the lookup of the whole data is completed. The data searching method can reduce circuit power consumption, reduce resource consumption, and save cost.

Description

Data searching method and data searching device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data searching method and a data searching device.
Background
With the rapid development of internet technology, data applications and data traffic have proliferated, and more and more network equipment is in use. The main implementations of such network equipment are ASIC chips, network processors (NPs), and FPGAs. Thanks to its excellent programmability, an FPGA can realize a specific scheme for specific requirements, giving it high flexibility and unmatched advantages.
Content Addressable Memory (CAM), also known as associative memory, is a memory technology. A CAM is addressed by content, and fast matching is realized by a hardware circuit. Its parallel processing characteristic makes the CAM popular in the field of data sorting, and it is widely applied in Ethernet address lookup, data compression, pattern recognition, caching, high-speed data processing, data security, data encryption, and the like.
However, CAM-based search is mainly realized with dedicated CAM chips, and the fast lookup of a CAM comes at the expense of hardware resources. Realizing efficient data lookup therefore inevitably causes large circuit power consumption, and the approach suffers from high price, high power consumption, and inflexibility in application.
Therefore, how to find a data searching method with lower cost and lower resource consumption is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the application aim to provide a data searching method that consumes fewer resources, so as to solve the technical problem of the high resource consumption of data searching in the prior art.
A data lookup method, comprising: receiving input data and dividing the input data into a plurality of pieces of block data by bits; performing lookups according to the plurality of pieces of block data within P clock cycles, where P is an integer greater than 1; in each of the P clock cycles, performing a lookup according to at least one piece of block data to obtain a first output result of the at least one piece of block data; in the first of the P clock cycles, processing the first output results to obtain a second output result; in each clock cycle from the second to the (P-1)-th, after receiving the second output result of the previous clock cycle, obtaining the second output result of the current clock cycle according to the first output results of the current clock cycle and the second output result of the previous clock cycle; and in the P-th of the P clock cycles, after receiving the second output result of the (P-1)-th clock cycle, obtaining the lookup result according to the first output results of the P-th clock cycle and the second output result of the (P-1)-th clock cycle.
In some embodiments, each of the block data corresponds to a respective one of the first output results.
In some embodiments, processing the first output results to obtain the second output result specifically includes: performing an AND operation on the plurality of first output results to obtain the second output result.
In some embodiments, obtaining the second output result of the current clock cycle according to the first output results of the current clock cycle and the second output result of the previous clock cycle specifically includes: performing an AND operation on the first output results of the current clock cycle, and then ANDing the result with the second output result of the previous clock cycle to obtain the second output result of the current clock cycle.
In some embodiments, obtaining the lookup result from the first output results of the P-th clock cycle and the second output result of the (P-1)-th clock cycle specifically includes: performing an AND operation on the first output results of the P-th clock cycle, and then ANDing the result with the second output result of the (P-1)-th clock cycle to obtain the lookup result.
In some embodiments, performing lookups according to the plurality of pieces of block data within P clock cycles specifically includes: in each of the P clock cycles, performing a lookup according to at least one piece of block data belonging to a different bit interval.
In some embodiments, the input data has a length of N bits and is divided into Q shares by bits, where each share is one piece of block data and Q is an integer greater than 1; in each of the P clock cycles, Q/P first output results are obtained by looking up in the corresponding Q/P RAMs according to Q/P shares of the input data belonging to different bit intervals.
In some embodiments, each RAM has a size of 2^(N/Q) × 2^M bits, where M is the address width of the data lookup, and the length of the lookup result is 2^M bits.
In some embodiments, M is less than N.
In some embodiments, M = 7 and N = 160.
In some embodiments, the maximum value of P is Q.
In some embodiments, N = 160, Q = 20, P = 4; or N = 160, Q = 40, P = 40; or N = 160, Q = 40, P = 5.
In some embodiments, looking up in the corresponding Q/P RAMs according to the Q/P shares of the input data belonging to different bit intervals to obtain Q/P first output results specifically includes: looking up in a first group of Q/P RAMs according to the Q/P shares of data in the low-bit interval, and looking up in a second group of Q/P RAMs according to the Q/P shares of data in the high-bit interval, where the first group of Q/P RAMs and the second group of Q/P RAMs are independent of each other.
In some embodiments, the input data includes a network packet header.
A data searching device, comprising a RAM module and a logic operation module, which are configured to: receive input data and divide the input data into a plurality of pieces of block data by bits; perform lookups according to the plurality of pieces of block data within P clock cycles, where P is an integer greater than 1; in each of the P clock cycles, perform a lookup according to at least one piece of block data to obtain a first output result of the at least one piece of block data; in the first of the P clock cycles, process the first output results to obtain a second output result; in each clock cycle from the second to the (P-1)-th, after receiving the second output result of the previous clock cycle, obtain the second output result of the current clock cycle according to the first output results of the current clock cycle and the second output result of the previous clock cycle; and in the P-th of the P clock cycles, after receiving the second output result of the (P-1)-th clock cycle, obtain the lookup result according to the first output results of the P-th clock cycle and the second output result of the (P-1)-th clock cycle.
According to the data searching method and the data searching device, the plurality of pieces of block data, divided by bits from the input data, are looked up sequentially over P clock cycles. In each of the P clock cycles, a lookup is performed according to at least one piece of block data, correspondingly obtaining the first output result of that block data; within each cycle, a second output result is obtained either from the first output results alone, or from the first output results of the current cycle together with the second output result of the previous cycle; the lookup result is finally obtained in the P-th clock cycle. Dividing the input data into pieces of block data by bits reduces the RAM resources required, and spreading one lookup over multiple clock cycles, with each cycle's second output result carried forward to the next cycle's processing, reduces the combinational logic resources needed in any single clock cycle. This lowers the probability of timing violations and improves data lookup performance on the basis of lower resource consumption.
Drawings
For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a data lookup method according to one embodiment of the present invention;
FIG. 2 is a flow chart of a data searching method according to another embodiment of the present invention;
FIG. 3 illustrates an exemplary RAM partitioning diagram employed by the data lookup method of the FIG. 2 embodiment of the present invention;
FIG. 4 is a schematic diagram of a process for implementing a data lookup method according to an embodiment of the present invention using the RAM partition shown in FIG. 3;
FIG. 5 is a schematic diagram illustrating a process of pipelining over multiple clock cycles for the data lookup method of the embodiment of FIG. 2 in accordance with the present invention;
FIG. 6 is a schematic diagram of a data searching apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a functional module of a data lookup method applied to an access control list according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a data searching process of the data searching method applied to the access control list according to the embodiment of the present invention.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the above description of the figures are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. The terms "coupled" and "connected," as used herein, are intended to encompass both direct and indirect coupling, unless otherwise indicated.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The invention will be described in further detail below with reference to the drawings by means of specific embodiments. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of the features may be omitted, or replaced by other elements, materials, or methods in different situations. In some instances, some operations associated with the present application have not been shown or described in the specification to avoid obscuring the core portions of the present application, and may not be necessary for a person skilled in the art to describe in detail the relevant operations based on the description herein and the general knowledge of one skilled in the art.
Content-addressable memory (CAM) is a special memory that is accessed by data content rather than by address, which makes lookup faster than address-based reads and writes; the cost of this efficient data lookup is hardware resources. In Ethernet applications, the header data of a data frame at the data link layer is 160 bits, and the number of required address entries (indexes) is generally 128 or more, so with the data lookup method used by a conventional CAM chip the required storage space is 128 × 2^160 bits, demanding enormous resources and circuit power consumption at excessive cost.
In order to solve the above shortcomings, the application provides a data searching method, which realizes a CAM searching function based on a RAM (Random Access Memory ) in an FPGA platform and reduces power consumption by searching data in a block and time sharing manner.
The CAM data lookup function realized in an FPGA is mainly built from the RAM inside the FPGA. A simple example uses 2 bits of data to look up a 3-bit address, with a bitmap pre-stored in the RAM, realized in this case by 3-to-8 decoding. The internal RAM partitioning is shown in Table 1:
TABLE 1 RAM partition example (rows: data values; columns: addresses; 1 = hit)
          addr0  addr1  addr2  addr3  addr4  addr5  addr6  addr7
  data 0    0      0      0      0      1      0      0      0
  data 1    0      0      0      1      0      0      0      0
  data 2    0      0      1      0      0      0      0      0
  data 3    0      1      0      0      0      0      0      0
It can be seen from Table 1 that the RAM is partitioned like a grid matrix: each element stores a 1-bit binary number (1 or 0), and each element corresponds to a specific combination of address and data. A 0 indicates a "miss" of the corresponding data, i.e., the data is not stored at the corresponding address; if every address for a data value reads 0, that data is not stored at all. A 1 indicates a data "hit", i.e., the data is present at the corresponding address.
Thus, the address corresponding to data 0 is 4 (data 0 corresponds one-to-one to address 4), the address corresponding to data 1 is 3, the address corresponding to data 2 is 2, and the address corresponding to data 3 is 1. The depth of the RAM is 2 to the power of the bit width of the data to be looked up, while the width of the RAM is 2 to the power of the width of the lookup address. For an m × n data lookup (m is the data width, n is the address width), the required RAM size is 2^m × 2^n bits. In this example the data width is m = 2 and the address width is n = 3, so the required RAM size is 2^2 × 2^3 bits (32 bits).
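The bitmap-RAM idea above can be sketched in software. The following is a minimal Python model (an illustrative assumption, not the patent's hardware implementation): each possible data value indexes one RAM row, and the row is a 2^n-bit bitmap whose bit a is set exactly when that data value is stored at address a.

```python
# Minimal software model of the bitmap RAM of Table 1.
# Depth = 2**m rows (one per data value); each row = 2**n-bit bitmap.

M_DATA = 2   # data width in bits  -> 2**2 = 4 RAM rows
N_ADDR = 3   # address width       -> 2**3 = 8-bit bitmap per row

def build_ram(entries):
    """entries: dict mapping a data value to its stored address."""
    ram = [0] * (1 << M_DATA)
    for data, addr in entries.items():
        ram[data] |= 1 << addr         # set the 'hit' bit for (data, addr)
    return ram

def lookup(ram, data):
    """Return the hit bitmap for a data value (0 means a total miss)."""
    return ram[data]

# The mapping of Table 1: data 0 -> addr 4, 1 -> 3, 2 -> 2, 3 -> 1.
ram = build_ram({0: 4, 1: 3, 2: 2, 3: 1})
assert lookup(ram, 0) == 1 << 4
assert lookup(ram, 3) == 1 << 1
# Total storage: 2**m rows x 2**n bits per row = 32 bits, as in the text.
assert len(ram) * (1 << N_ADDR) == 32
```

Reading a row and testing its bits replaces the parallel compare of a hardware CAM; the storage grows as 2^m × 2^n, which is why the segmented scheme below becomes necessary for wide keys.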
Similarly, if 10 bits of data are used to look up a 7-bit address (128 addresses), the RAM size used is: 2^7 × 2^10 = 128 × 1024 bits (128 Kbits).
In one embodiment of the present invention, after input data (the data to be looked up, such as a search keyword) is received, it is divided into a plurality of pieces of block data; each piece of data is looked up independently in its own RAM, the lookups in all the RAMs are completed within one clock cycle, and finally the lookup results are combined by a bitwise AND operation ("&"). Referring to Fig. 1, a schematic diagram of a data searching method according to an embodiment of the present application is shown. The 160 bits of input data din[159:0] are divided into 16 shares: din[9:0], din[19:10], ..., din[159:150]. Each 10-bit share is used to look up a 7-bit address, so each share needs a RAM of 128 × 1024 specification. Performing a bitwise AND on the 128-bit indexes looked up from the 16 RAMs yields the entry address corresponding to the 160 bits of data, i.e., the lookup Result of the input data.
Compared with the conventional data lookup method, the segmented lookup of this embodiment needs 16 RAMs of 128 × 1024 specification, i.e., 2 Mbits of RAM in total, so the consumed RAM resources are greatly reduced.
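The single-cycle segmented lookup of Fig. 1 can be sketched as follows (again a hypothetical Python model; the key value and entry index used are made up for illustration). A 160-bit key is cut into 16 ten-bit slices, each slice reads its own 128 × 1024 RAM, and the sixteen 128-bit bitmaps are ANDed; a bit survives the AND only if every slice of the query matches the corresponding slice of the stored key, so the result is an exact match.

```python
# Software model of Fig. 1: 16 slice RAMs, bitwise AND of 128-bit bitmaps.
NUM_SLICES, SLICE_BITS, NUM_ENTRIES = 16, 10, 128

def program_key(rams, key, entry):
    """Store a 160-bit key at entry index 'entry' in all 16 slice RAMs."""
    for i in range(NUM_SLICES):
        slice_val = (key >> (i * SLICE_BITS)) & ((1 << SLICE_BITS) - 1)
        rams[i][slice_val] |= 1 << entry

def lookup(rams, key):
    """AND the 128-bit bitmaps of all slices; a set bit marks a hit entry."""
    result = (1 << NUM_ENTRIES) - 1
    for i in range(NUM_SLICES):
        slice_val = (key >> (i * SLICE_BITS)) & ((1 << SLICE_BITS) - 1)
        result &= rams[i][slice_val]
    return result

rams = [[0] * (1 << SLICE_BITS) for _ in range(NUM_SLICES)]
program_key(rams, key=0xABCDEF, entry=5)   # illustrative key and entry
assert lookup(rams, 0xABCDEF) == 1 << 5    # the stored key hits entry 5
assert lookup(rams, 0xABCDE0) == 0         # one differing slice -> miss
```

In hardware the 16 RAM reads and the AND tree all happen in parallel within one cycle, which is exactly the combinational depth the pipelined scheme of Fig. 2 later breaks up.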
Fig. 2 is a flow chart of a data searching method according to another embodiment of the invention. The method comprises the following steps:
step S1: receiving input data, dividing the input data into a plurality of block data according to bits;
the input data may be data of length N and may be divided into Q shares, in one example each share may be one piece of block data, in other examples more than 2 shares of data may be considered as one piece of block data. Different block data belong to different bit intervals, the block data can be divided from low order to high order, and the bit number of each block data can be equal.
Step S2: searching according to the plurality of block data in P clock cycles, wherein P is an integer greater than 1;
in each of the P clock cycles, a lookup may be performed according to at least one piece of block data belonging to a different bit interval; in some examples the lookups may proceed sequentially from the low bit interval to the high bit interval.
Step S3: searching according to at least one piece of block data in each clock cycle of the P clock cycles to obtain a first output result of the at least one piece of block data;
in one example, the first output result is obtained by looking up in a RAM according to the block data, where each piece of block data corresponds to one first output result. Each piece of block data can be looked up in an independent RAM to obtain its first output result. In a single clock cycle, lookups can be performed in several independent RAMs according to several pieces of block data, with one piece of block data per independent RAM, to obtain the corresponding first output results.
For example, the input data has a length of N bits and is divided into Q shares by bits, each share being one piece of block data, with Q an integer greater than 1. In each of the P clock cycles, lookups are performed in Q/P RAMs according to Q/P shares of the input data belonging to different bit intervals, yielding Q/P first output results; the Q/P shares correspond one-to-one to the Q/P RAMs, i.e., each share is looked up in its own RAM.
Step S4: in the first of the P clock cycles, processing the first output results to obtain a second output result;
as one example, the plurality of first output results obtained in the first clock cycle are ANDed to obtain the second output result. Because the AND operations needed within a single clock cycle cover only the lookup results of part of the block data, the operation time they occupy is relatively reduced, which can improve the reliability of the system.
Step S5: in each clock cycle from the second to the (P-1)-th, after receiving the second output result of the previous clock cycle, obtaining the second output result of the current clock cycle according to the first output results of the current clock cycle and the second output result of the previous clock cycle;
and performing AND operation according to a plurality of first output results of the current clock cycle in other clock cycles except the first clock cycle and the last clock cycle, and performing AND operation with a second output result of the previous clock cycle to obtain a second output result of the current clock cycle.
Step S6: in the P-th of the P clock cycles, after receiving the second output result of the (P-1)-th clock cycle, obtaining the lookup result according to the first output results of the P-th clock cycle and the second output result of the (P-1)-th clock cycle.
in the last of the P clock cycles, an AND operation is performed on the plurality of first output results of the P-th clock cycle, and the result is then ANDed with the second output result of the (P-1)-th clock cycle to obtain the lookup result of the input data.
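Steps S1 to S6 can be modeled behaviorally as follows (a Python sketch under the assumption that Q divides P evenly and that each RAM read returns a bitmap, not RTL): Q block lookups are spread over P clock cycles, and a running AND register carries the partial second output result from one cycle to the next.

```python
# Behavioral model of steps S1-S6: Q lookups over P cycles with a
# running AND register holding the second output result between cycles.

def pipelined_lookup(rams, blocks, P):
    """rams[i][blocks[i]] is the first output (bitmap) for block i."""
    Q = len(blocks)
    per_cycle = Q // P
    second_output = None                       # running AND across cycles
    for cycle in range(P):                     # clock cycles 1..P
        lo = cycle * per_cycle
        firsts = [rams[i][blocks[i]] for i in range(lo, lo + per_cycle)]
        partial = firsts[0]
        for f in firsts[1:]:                   # AND within the cycle (S3)
            partial &= f
        if second_output is None:              # first cycle (S4)
            second_output = partial
        else:                                  # later cycles (S5, S6)
            second_output &= partial
    return second_output                       # cycle P yields the result

# Toy check: Q=4 blocks, P=2 cycles, 8 entries, one key stored at entry 3.
rams = [[0] * 4 for _ in range(4)]
blocks = [1, 2, 3, 0]                          # slice values of the key
for i, b in enumerate(blocks):
    rams[i][b] |= 1 << 3
assert pipelined_lookup(rams, blocks, P=2) == 1 << 3
```

Note that only `per_cycle` bitmaps plus the carried register are ANDed in any one iteration, mirroring how the hardware bounds the per-cycle combinational depth.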
According to the above data lookup method, the plurality of pieces of block data, divided by bits from the input data, are looked up sequentially over P clock cycles. In each of the P clock cycles a lookup is performed according to at least one piece of block data, correspondingly obtaining its first output result; in each cycle a second output result is obtained either from the first output results alone, or from the first output results of the current cycle together with the second output result of the previous cycle; and the lookup result is finally obtained in the P-th clock cycle. Dividing the input data into pieces of block data by bits reduces the required RAM resources, and completing one lookup over multiple clock cycles, with each cycle's second output result processed successively into the next cycle's lookup, reduces the combinational logic resources required in a single clock cycle. This lowers the probability of timing violations, so that data can be looked up with lower resource consumption and lower cost while maintaining lookup performance. Specific details of the method are set forth further below in connection with the following figures.
Referring to Fig. 3, a diagram of the RAM partitioning used by an example of the data lookup method of the embodiment of Fig. 2 is shown. Let the input data Input A be N bits and the number of required address entries be 2^M; in the figure N is 160 and 2^M is taken as 128, and the N bits of data are used to look up an M-bit address. The method can be realized on an FPGA. The input data Input A may be divided into P portions by bits, for example from the low bits to the high bits; in this example P is taken as 4, and the input data din[159:0] is divided into 4 portions: a first portion din[39:0], a second portion din[79:40], a third portion din[119:80], and a fourth portion din[159:120]. The FPGA includes P groups of RAMs, in this example a first group RAM PART1, a second group RAM PART2, a third group RAM PART3, and a fourth group RAM PART4. Each portion of the input data is used to look up in one group of RAMs, i.e., a lookup is performed in one group of RAMs according to N/P bits of data: the first portion din[39:0] is looked up in the first group RAM PART1, the second portion din[79:40] in the second group RAM PART2, the third portion din[119:80] in the third group RAM PART3, and the fourth portion din[159:120] in the fourth group RAM PART4. The 2^M-bit entries (first output results) found in the P groups of RAMs are combined by a bitwise AND ("&") operation to obtain the entry address corresponding to the N bits of data, namely the lookup Result A.
In the present embodiment, the lookups in the groups of RAMs according to the portions of the input data are completed in P different clock cycles, respectively. In this way, a lookup is performed according to N/P bits of data in each clock cycle, and the entries (first output results) obtained in a given clock cycle are bitwise ANDed with the output result (second output result) produced in the previous clock cycle to obtain the second output result of the current cycle. That is, the bitwise AND of all entries obtained over all P clock cycles need not be completed within a single clock cycle; each cycle only needs to AND the entries (first output results) it obtains to form a second output result, plus at most one further AND with the second output result of the previous cycle. Through successive bitwise AND operations over the P clock cycles, the entry address corresponding to the N bits of data, i.e., the lookup result, is obtained. The number of bitwise AND operations performed in each clock cycle is thereby reduced, and the processing clock rate of the FPGA can be improved.
Referring to Fig. 4, a process diagram of implementing the data lookup method of an embodiment of the present invention with the RAM partitioning shown in Fig. 3 is shown. Based on the RAM partitioning and lookup method corresponding to Fig. 3, each RAM group further includes S independent RAMs, and each portion of the input data Input A is further divided into S shares by bits; this can be understood as dividing the input data Input A into P × S pieces of block data, with S first output results obtained in each clock cycle by lookups in S RAMs according to S pieces of block data. S = 5 in this example, where for the input data: the first portion din[39:0] comprises 5 shares, namely din[7:0], din[15:8], din[23:16], din[31:24], and din[39:32]; the second portion din[79:40] comprises din[47:40], din[55:48], din[63:56], din[71:64], and din[79:72]; the third portion din[119:80] comprises din[87:80], din[95:88], din[103:96], din[111:104], and din[119:112]; the fourth portion din[159:120] comprises din[127:120], din[135:128], din[143:136], din[151:144], and din[159:152]. Each share comprises N/(S × P) bits of data, i.e., 8 bits per share in this example (160/20 = 8). The N bits of input data are thus divided into Q shares in total, with Q = S × P = 5 × 4 = 20.
As for the 4 groups of RAM: the first group RAM PART1 comprises 5 independent RAMs, namely RAM1, RAM2, RAM3, RAM4, and RAM5; the second group RAM PART2 comprises RAM6, RAM7, RAM8, RAM9, and RAM10; the third group RAM PART3 comprises RAM11, RAM12, RAM13, RAM14, and RAM15; and the fourth group RAM PART4 comprises RAM16, RAM17, RAM18, RAM19, and RAM20. Each independent RAM, i.e., each of RAM1 through RAM20, only needs a size of (number of entry addresses) × 2^(number of bits per share), in this case 128 × 2^8, i.e., a 128 × 256 bit (32 Kbits) RAM, and the RAM resources required as a whole are 20 × 32 Kbits, i.e., 640 Kbits. Relative to the 2 Mbits of RAM required by the embodiment of Fig. 1, the RAM required by the data lookup method of this embodiment is further reduced, making the data lookup method of the present invention particularly suitable for large-scale computing systems.
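The resource figures quoted above can be checked directly from the stated RAM specifications (a simple arithmetic sketch, assuming 1 Kbit = 1024 bits and 1 Mbit = 1024 Kbits):

```python
# Check of the RAM totals: Fig. 1's 16 wide RAMs vs Fig. 4's 20 narrow RAMs.
entries = 128                      # 2**M address entries (M = 7)

fig1_bits = 16 * entries * 2**10   # 16 RAMs of 128 x 1024 bits (10-bit slices)
fig4_bits = 20 * entries * 2**8    # 20 RAMs of 128 x 256 bits  (8-bit slices)

assert fig1_bits == 2 * 1024 * 1024   # 2 Mbits, as stated for Fig. 1
assert fig4_bits == 640 * 1024        # 640 Kbits, as stated for Fig. 4
```

The saving comes entirely from the narrower slices: each RAM's depth is 2^(bits per share), so shrinking a share from 10 bits to 8 bits shrinks each RAM by a factor of 4, which more than offsets the increase from 16 to 20 RAMs.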
According to the 5 block data of each part of the input data, searching is respectively completed in the five independent RAMs of one group of RAMs, and 5 entries of 128 bits (namely 2^M bits) each are respectively obtained; the 5 128-bit entries are then bitwise ANDed into the entry corresponding to the 40 (i.e. N/P) bits of data of that part of the input data. Specifically, according to the 5 block data of the first part of the input data (namely the first to fifth block data, also referred to as the first S block data of the input data), searching is completed in the first RAM RAM1, the second RAM RAM2, the third RAM RAM3, the fourth RAM RAM4 and the fifth RAM RAM5 of the first group of RAMs during the first clock cycle T1 to obtain a first group of S (5 in this example) 128-bit search entries, and the 5 128-bit entries are bitwise ANDed to obtain one 128-bit group entry Result1 corresponding to the first part of the input data (the first S block data of the input data), which serves as the output entry of the first clock cycle T1. In a single clock cycle, only 5 128-bit entries need to be bitwise ANDed; compared with the 16 128-bit entries that need to be bitwise ANDed in the prior art, the FPGA combinational logic resources required by the operation are obviously reduced, so that the possibility of FPGA timing violations is greatly reduced. During the second clock cycle T2, according to the second 5 block data of the input data (i.e. the sixth to tenth block data, also referred to as the second S block data of the input data), searching is performed in the second group of RAMs to obtain a second group of 5 128-bit search entries; the second group of 5 128-bit entries are bitwise ANDed and the result is then bitwise ANDed with the group entry Result1 obtained in the previous cycle T1 (i.e. the output entry of the previous clock cycle), to obtain a group entry corresponding to the first two parts of the input data (i.e. the first part and the second part, or the first 2S block data of the input data), referred to as the second group entry Result2. During the second clock cycle T2, 5+1, i.e. a total of 6, 128-bit entries need to be bitwise ANDed, which is still significantly reduced compared with the 16 128-bit entries required in a single clock cycle in the prior art. And so on: the bitwise AND operations required in the third clock cycle T3 and the fourth clock cycle T4 each involve 6 128-bit entries, and finally, in the fourth clock cycle T4, the group entry Result4 corresponding to all of the input data (i.e. the first to fourth parts, or the 4S block data of the input data) is obtained; Result4 can also serve as the entry address Result A corresponding to the input data (160 bits of data), that is, the search result of the input data. Of course, in some alternative embodiments, it is also possible to search in one RAM according to one block datum in each clock cycle, which corresponds to each part comprising only one block datum; in this case the number of processing clock cycles required increases. For example, in the foregoing embodiment, if the search is instead performed in one 128 × 256 RAM according to one block datum (8 bits) per clock cycle, a total of 20 clock cycles are required to obtain the required entry address.
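The per-cycle AND accumulation described above can be sketched in a few lines of Python (a behavioral model, not the FPGA implementation; toy 8-bit entries stand in for the 128-bit ones):

```python
from functools import reduce

def cycle_output(entries, prev=None):
    """Bitwise-AND the S entries looked up in one clock cycle, then AND
    the result with the previous cycle's output entry (if any)."""
    acc = reduce(lambda a, b: a & b, entries)
    return acc if prev is None else acc & prev

# Toy 8-bit "entries" stand in for the 128-bit search entries.
result1 = cycle_output([0b11111110, 0b11111101, 0b11101111])    # cycle T1
result2 = cycle_output([0b11111011, 0b11011111], prev=result1)  # cycle T2
print(bin(result1))  # 0b11101100
print(bin(result2))  # 0b11001000
```

Each cycle ANDs only its own small group of entries plus one carried-over entry, which is the source of the combinational-logic saving described above.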
In combination with the above specific processes of RAM division and data searching, each step of the data searching method shown in fig. 2 is described below, taking 160 bits of input data as an example:
S1, receiving input data with a length of N bits, wherein the input data is divided into Q block data according to bits, Q and N are integers, and N is greater than Q;
In step S1, the content to be searched, for example 160 bits of input data (N=160), is received, and the input data is divided into 20 block data (Q=20) from low order to high order. S2, completing the search of the Q (Q=20) block data contained in the input data in P (P=4) consecutive clock cycles;
S3: in each of the P clock cycles, searching in a group of Q/P (Q/P=5) independent RAMs according to the Q/P (Q/P=5) block data in different bit intervals of the input data respectively, to obtain Q/P first output results, wherein P is an integer greater than or equal to 2;
In 4 consecutive clock cycles, each clock cycle completes the search of S (S=Q/P=5) block data: for each block datum, the search is performed in a corresponding independent RAM according to that block datum; in a single clock cycle, searching is performed simultaneously in 5 independent RAMs according to 5 block data, each search yielding one first output result (search entry), so that a group of 5 128-bit search entries is obtained in each clock cycle. The required specification of a single RAM is 2^(N/Q) × 2^M bits, wherein M is an integer less than N, M represents the address width of the data lookup, and 2^M correspondingly indicates the number of address entries needed for the lookup. In some embodiments, M is less than N, and may even be much less than N. In addition, the Q block data can be searched in at most Q clock cycles, in which case P=Q. Optionally, from the first clock cycle to the P-th clock cycle, the search of the N-bit data is completed sequentially from low order to high order; for example, the low-order Q/P block data are searched simultaneously in the first group of Q/P RAMs, the next higher-order Q/P block data are searched simultaneously in the second group of Q/P RAMs, and the first group of Q/P RAMs and the second group of Q/P RAMs are independent of each other.
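One common way to populate such RAMs, sketched below, is to store in each entry a per-rule match bit vector: the patent does not fix the entry semantics at this step, so the bit-vector interpretation and the (value, mask) rule format are assumptions for illustration only.

```python
def build_ram(rules, lo, width):
    """One independent RAM covering input bits [lo, lo + width).
    Entry at address a is a bit vector whose bit r is set iff rule r
    matches block value a.  Interpreting entries as per-rule match
    vectors is an assumption for illustration; rules are hypothetical
    (value, mask) pairs over the full input width."""
    block_mask = (1 << width) - 1
    table = []
    for a in range(2 ** width):
        vec = 0
        for r, (value, mask) in enumerate(rules):
            v = (value >> lo) & block_mask
            m = (mask >> lo) & block_mask
            if (a & m) == (v & m):
                vec |= 1 << r
        table.append(vec)
    return table

# Two toy rules over a 16-bit key, split into two 8-bit blocks.
rules = [(0x1234, 0xFFFF),   # rule 0: exact match on 0x1234
         (0x1200, 0xFF00)]   # rule 1: match any key 0x12xx
ram_lo = build_ram(rules, 0, 8)
ram_hi = build_ram(rules, 8, 8)

key = 0x1234
matches = ram_lo[key & 0xFF] & ram_hi[key >> 8]
print(bin(matches))  # 0b11 -> both rules match 0x1234
```

ANDing the per-block vectors leaves exactly the rules whose every block matches, which is why the bitwise AND across clock cycles yields the search result for the whole key.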
S4, in the first clock cycle of the P (P=4) clock cycles, processing the Q/P (Q/P=5) first output results obtained in the first clock cycle to obtain a second output result of the first clock cycle;
During the first clock cycle of the 4 clock cycles, the second output result of the first clock cycle can be obtained by performing an AND operation on the first group of 5 first output results obtained by searching according to the block data din[39:0]; specifically, after the 5 128-bit first output results are bitwise ANDed, one 128-bit entry corresponding to the first 5 block data of the input data is obtained as the second output result of the first clock cycle T1.
S5, in the second to (P-1)-th clock cycles of the P (P=4) clock cycles, obtaining a second output result of the current clock cycle according to the first output results of the current clock cycle and the second output result of the previous clock cycle;
During each clock cycle after the first clock cycle, in addition to the 5 first output results of the current clock cycle obtained by searching in the RAMs according to the input data, the second output result of the previous clock cycle is received; the 5 first output results of the current clock cycle are bitwise ANDed, and the result is then bitwise ANDed with the second output result of the previous clock cycle to obtain the second output result of the current clock cycle. Because the first output results of the current clock cycle are 5 128-bit entries and the second output result of the previous clock cycle is one 128-bit entry, the number of bitwise AND operands in the current clock cycle is one more than in the first clock cycle, i.e. the bitwise AND of 6 128-bit entries needs to be completed; but the amount of operation is still significantly reduced compared with the FPGA combinational logic resources required in the embodiment of fig. 1.
S6, in the last clock cycle of the P clock cycles, after receiving the second output result of the (P-1)-th clock cycle, obtaining a search result according to the S first output results of the last clock cycle and the second output result of the (P-1)-th clock cycle.
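Steps S1 to S6 can be condensed into a short behavioral model (a software sketch under the assumptions of the example above, not the FPGA implementation; the toy RAM contents are hypothetical):

```python
from functools import reduce

def data_search(key, rams, n, q, p):
    """Behavioral sketch of steps S1-S6: split the n-bit key into q
    block data, process them in p 'clock cycles' of s = q // p blocks
    each, AND-accumulating one second output result per cycle."""
    s = q // p
    w = n // q                                                        # bits per block datum
    blocks = [(key >> (i * w)) & ((1 << w) - 1) for i in range(q)]    # S1
    second = None
    for cyc in range(p):                                              # S2
        firsts = [rams[cyc * s + j][blocks[cyc * s + j]]              # S3
                  for j in range(s)]
        acc = reduce(lambda a, b: a & b, firsts)                      # S4
        second = acc if second is None else second & acc              # S5/S6
    return second                                                     # the search result

# Toy setup: n=16, q=4 nibble-wide blocks, p=2 cycles; each hypothetical
# RAM entry is 1 when the address equals the matching nibble of 0x1234.
rams = [[1 if a == ((0x1234 >> (i * 4)) & 0xF) else 0 for a in range(16)]
        for i in range(4)]
print(data_search(0x1234, rams, 16, 4, 2))  # 1 -> hit
print(data_search(0x1235, rams, 16, 4, 2))  # 0 -> miss
```

Any mismatching block zeroes the accumulated result, so only a key matching in every block survives all p cycles.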
Referring to fig. 5, which is a process diagram of the data searching method of fig. 2 according to an embodiment of the present invention. The figure shows a number of successive clock cycles during which each group of RAMs is looked up in a pipelined fashion according to a number of different input data. Specifically, the data search starts in the first clock cycle T1: during the first clock cycle T1, searching is performed in the first group of RAMs according to the first partial data Input A_PART1 of the first input data Input A to obtain the entry (first output result) corresponding to the first partial data; if the first group of RAMs comprises a plurality of independent RAMs and the first partial data Input A_PART1 correspondingly comprises a plurality of block data, then during the first clock cycle T1 the entry corresponding to the first partial data is obtained by bitwise ANDing the entries found in each corresponding RAM according to each block datum (refer to the process of obtaining the entry Result1 described above). During the next, second clock cycle T2, Result2 is found in the second group of RAMs according to the second partial data Input A_PART2 of the first input data as described above, while Result1 of the second input data Input B is simultaneously found in the first group of RAMs according to the first partial data Input B_PART1 of the second input data Input B.
Thus, after four clock cycles, a search result for a different input datum is obtained in each clock cycle: for example, the search result of the first input data Input A, that is, the entry address Result A, is obtained in the fourth clock cycle T4; the search result of the second input data Input B, that is, the entry address Result B, is obtained in the fifth clock cycle T5; the search result of the third input data Input C, that is, the entry address Result C, is obtained in the sixth clock cycle T6; the search result of the fourth input data Input D, that is, the entry address Result D, is obtained in the seventh clock cycle T7; and so on. After the initial delay of four clock cycles, search results are obtained in real time.
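The pipeline schedule above can be stated as a one-line rule (a sketch; "Input k" labels are illustrative): input k enters the first group of RAMs in cycle k and its result appears in cycle k + P - 1.

```python
def completion_cycles(num_inputs, p=4):
    """Pipelined schedule sketch: input k (1-based) enters the first
    group of RAMs in cycle k and its final result appears in cycle
    k + p - 1, so after the initial p-cycle latency one search result
    is produced per clock cycle."""
    return {f"Input {k}": k + p - 1 for k in range(1, num_inputs + 1)}

print(completion_cycles(4))
# {'Input 1': 4, 'Input 2': 5, 'Input 3': 6, 'Input 4': 7}
```

This reproduces the T4/T5/T6/T7 completion cycles of Result A through Result D listed above.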
In some other embodiments, the division of the input data and the division of the RAM may be performed in other ways, depending on the RAM resources and the delay required, as compared with the previous embodiments. Referring to Table 2 (input data division and RAM division), taking input data of 160 bits (N=160) looking up a 7-bit (M=7) address as an example, scheme 1 is the division scheme adopted in the previous embodiment, wherein Q represents the number of block data into which the input data is divided, P represents the number of parts (i.e. clock cycles) into which the search is divided, S represents the number of block data in each part, P determines the number of clock cycles Td needed by the search (which also represents the delay of the search), RAM_s represents the size of the independent RAM corresponding to each block datum, RAM_t represents the total amount of RAM required, and T_& represents the maximum number of entries that need to be bitwise ANDed in one clock cycle.
TABLE 2

            Q    P    S    Td    RAM_s        RAM_t     T_&
Example 1   20   4    5    4     128*256bit   640Kbit   6
Example 2   40   40   1    40    128*16bit    80Kbit    2
Example 3   40   8    5    8     128*16bit    80Kbit    6
As in Example 2, if the 160 bits are divided into 40 parts for searching, with each 4-bit part used to look up a 7-bit address, the RAM specification used is 128 × 16 bits, and the RAM resources need only 128 × 16 × 40 bits, namely 80 Kbits, greatly reducing the RAM resources consumed. Meanwhile, the 160-bit data search is divided over 40 clock cycles, so that each clock cycle searches 4 bits and only the bitwise AND of 2 128-bit entries is needed in a single clock cycle, thereby greatly improving the achievable processing clock frequency. By realizing the 160-bit data search in a pipelined manner, after a delay of 40 clock cycles one datum can finally be searched per clock cycle, and the overall search performance can be greatly improved.
As in Example 3, if the 160 bits are likewise divided into 40 parts, with each 4-bit part used to look up a 7-bit address and a RAM specification of 128 × 16 bits, the RAM resources again need only 128 × 16 × 40 bits, namely 80 Kbits, greatly reducing the RAM resources consumed. Meanwhile, the 160-bit data search is divided over 8 clock cycles, so that each clock cycle searches 20 bits and only the bitwise AND of 5 or 6 128-bit entries needs to be realized in a single clock cycle, improving the achievable processing clock frequency to a certain extent.
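The rows of Table 2 all follow from the same few formulas, which can be collected in a small calculator (a sketch; the dictionary keys mirror the column names of Table 2):

```python
def scheme(n, m, q, p):
    """Derived quantities for one division scheme, following the
    formulas in the text: s = q // p blocks per cycle, latency Td = p
    cycles, per-RAM size 2**(n // q) addresses of 2**m bits each, and
    at most s + 1 entries bitwise-ANDed in one clock cycle."""
    s = q // p
    ram_s = (2 ** m) * (2 ** (n // q))   # bits per independent RAM
    return {"S": s, "Td": p, "RAM_s": ram_s,
            "RAM_t": q * ram_s, "T_and": s + 1}

for q, p in [(20, 4), (40, 40), (40, 8)]:   # Examples 1-3 of Table 2
    print(scheme(160, 7, q, p))
```

Running it reproduces the 640 Kbit / 80 Kbit totals and the T_& column, making the RAM-versus-latency trade-off between the three examples explicit.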
Still other embodiments will not be described in detail. By dividing the input data by bits, performing the searches in different clock cycles, and then performing bitwise AND operations to obtain the final search result, the invention can flexibly and rapidly search data according to the available RAM resources, while also taking the processing clock cycles into account and improving overall performance.
Referring to fig. 6, a schematic diagram of a data searching apparatus 10 according to an embodiment of the invention is shown. The data searching apparatus comprises a RAM module 101 and a logic operation module 103. The RAM module 101 may be divided according to the examples in the foregoing embodiments; the logic operation module 103 comprises a logic array for implementing the bitwise AND operations. The data searching apparatus 10 may be implemented by an FPGA. The data searching apparatus 10 may further comprise a controller, a comparator, a register, and the like. The controller may be used to control the receiving and dividing of the input data; the comparator may be used to compare each block datum of the input data with the stored contents of each RAM; and the register may be used to hold intermediate operation results, such as the output entry obtained during the first clock cycle, so that the next cycle can continue to use it in the bitwise AND operations.
In a possible implementation manner, the data searching method provided by the invention can be applied to access control lists. Referring to fig. 7 and fig. 8 together, fig. 7 is a schematic diagram of the functional modules of the data searching method applied to an access control list according to an embodiment of the present invention, and fig. 8 is a schematic diagram of the data searching process of the data searching method applied to an access control list according to an embodiment of the present invention. An access control list (Access Control List, ACL) can limit network traffic and improve network performance; it performs entry matching on the header of a network data packet through one or more rules so as to filter the whole network data packet, wherein the entry matching function (i.e. the data searching function) can be realized in hardware to accelerate the matching speed. The FPGA-based access control list circuit 20 may implement the entry matching function (i.e. the data searching function), and the circuit 20 comprises a network data packet receiving module 200, a network data packet buffering module 202, a network data packet read control module 203, an ACL detection module 205, a network data packet processing module 206, and a network data packet transmitting module. The network data packet receiving module 200 receives network data packets from the network side through the data interface RGMII (Reduced Gigabit Media Independent Interface). The network data packet buffering module 202 is configured to store the received network data packets so as to cooperate with the pipelined processing of the data search, buffering them according to the data division in the data searching method and the corresponding per-clock-cycle searches; the network data packet buffering module may be implemented using a small-capacity RAM. The network data packet read control module 203 reads the different block data of a network data packet in successive clock cycles.
The ACL detection module 205 searches the RAMs according to the different block data and outputs the final search result. The network data packet processing module 206 obtains the network data packet through the network data packet read control module 203 on the one hand, and obtains the search result from the ACL detection module 205 on the other hand, and discards or forwards the network data packet according to the search result. The RAM contents are configured by the CPU 210 via a data interface or bus, such as PCIE (Peripheral Component Interconnect Express).
Still taking 160-bit data as an example, the 160-bit data in the ACL application scenario of this example refers to the header of a network data packet. The network data packet buffering module 202 stores the received network data packets, and the buffering can follow a certain format requirement, for example, the packet header stored separately and the packet data stored separately. When the packet header (which can be regarded as the input data of the data searching method) is stored, different storage methods are set according to the design of the data searching method. For example, referring to the foregoing example of the data searching method, the 160-bit data is searched over 4 clock cycles, so the packet header is stored as 4 pieces of 40 bits each, specifically the block data 40bit_0[39:0], 40bit_1[79:40], 40bit_2[119:80] and 40bit_3[159:120] in different bit intervals from low to high. The network data packet read control module 203 extracts the corresponding 160-bit packet header as follows: in the first clock cycle, the first block data 40bit_0(1) of the 1st network data packet header is read out; in the second clock cycle, the first block data 40bit_0(2) of the 2nd network data packet header is read out, and at the same time the second block data 40bit_1(1) of the 1st network data packet header is read out; in the third clock cycle, the first block data 40bit_0(3) of the 3rd network data packet header, the second block data 40bit_1(2) of the 2nd network data packet header, and the third block data 40bit_2(1) of the 1st network data packet header are read out; in the fourth clock cycle, the first block data 40bit_0(4) of the 4th network data packet header, the second block data 40bit_1(3) of the 3rd network data packet header, the third block data 40bit_2(2) of the 2nd network data packet header, and the fourth block data 40bit_3(1) of the 1st network data packet header are read out; in the fifth clock cycle, the first block data 40bit_0(5) of the 5th network data packet header, the second block data 40bit_1(4) of the 4th network data packet header, the third block data 40bit_2(3) of the 3rd network data packet header, and the fourth block data 40bit_3(2) of the 2nd network data packet header are read out; and so on. Referring to fig. 8 again, taking two network data packets as an example, the ACL detection module 205 searches in the packet header RAM0 (corresponding to lookup RAM0 in fig. 7) according to the first block data 40bit_0(1) of the 1st network data packet header in the first clock cycle to obtain at least one first output result. If, according to the foregoing embodiment, the first block data is searched in 5 independent RAMs simultaneously, that is, the first block data 40bit_0[39:0] is further divided into 5 block data 8bit_0[7:0], 8bit_0[15:8], 8bit_0[23:16], 8bit_0[31:24] and 8bit_0[39:32], then 5 first output results are correspondingly obtained, and the 5 first output results are bitwise ANDed in the first clock cycle to obtain the second output result of the first clock cycle. In the next, second clock cycle, searching is performed in the packet header RAM1 (lookup RAM1 in fig. 7) according to the second block data 40bit_1(1) of the 1st network data packet header to obtain the first output results of the second clock cycle, and the second output result of the second clock cycle is obtained according to the first output results of the second clock cycle (i.e. the current clock cycle) and the second output result of the first clock cycle (i.e. the previous clock cycle). Taking again the example in which the second block data 40bit_1(1) is further divided into 5 block data searched in 5 independent RAMs respectively, the second clock cycle still yields 5 first output results, which are bitwise ANDed and then ANDed with the second output result of the first clock cycle to obtain the second output result of the second clock cycle. Also in the second clock cycle, according to the pipelined mode of operation, the first output results and the second output result of the first clock cycle of the 2nd input data are obtained by searching in the packet header RAM0 according to the first block data 40bit_0(2) of the 2nd network data packet; for the details, reference can be made to the processing of the 1st network data packet, which is not repeated here. And so on: in the following clock cycles, searches are performed in a pipelined manner in the packet header RAM2 and the packet header RAM3 according to the third block data 40bit_2(1) and the fourth block data 40bit_3(1) of the 1st network data packet, respectively obtaining the second output result of the third clock cycle and the second output result of the fourth clock cycle for the 1st input data (equivalent to the 1st network data packet); and searches are performed in the packet header RAM1, the packet header RAM2 and the packet header RAM3 according to the second block data 40bit_1(2), the third block data 40bit_2(2) and the fourth block data 40bit_3(2) of the 2nd network data packet, respectively obtaining the second output results of the second to fourth clock cycles of the 2nd input data (equivalent to the 2nd network data packet). The packet header RAM0, RAM1, RAM2 and RAM3 are the RAMs used for the network packet header lookups.
Meanwhile, for each network data packet, the network data packet read control module 203 reads the data portion of the network data packet in the clock cycle following the reading of the fourth block data 40bit_3 of the corresponding network data packet header. For example, for the 1st network data packet, the corresponding packet data data1 is read from the data RAM in the 5th clock cycle, the corresponding header block data 40bit_0, 40bit_1, 40bit_2 and 40bit_3 are delayed through register stages, and finally the read packet data data1 is spliced with them into a complete network data packet; according to the detection result of the ACL detection module, it is then determined whether to discard or forward the network data packet. If the packet is forwarded, it is sent to the network data packet transmitting module for forwarding according to the content of the packet header. Similarly, for the 2nd network data packet, the corresponding packet data data2 is read from the data RAM in the sixth clock cycle and processed in the same way, which is not repeated here.
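The staggered read schedule above follows a simple pattern that can be sketched as follows (illustrative labels only; the schedule rule, block j of packet k read in cycle k + j with the data portion in cycle k + P, is taken from the walkthrough above):

```python
def read_schedule(num_packets, p=4):
    """Sketch of the read-control staggering: header block j (0-based)
    of packet k (1-based) is read in clock cycle k + j, and the packet's
    data portion follows in cycle k + p."""
    sched = {}
    for k in range(1, num_packets + 1):
        for j in range(p):
            sched.setdefault(k + j, []).append(f"40bit_{j}({k})")
        sched.setdefault(k + p, []).append(f"data{k}")
    return sched

sched = read_schedule(2)
print(sched[1])  # ['40bit_0(1)']
print(sched[2])  # ['40bit_1(1)', '40bit_0(2)']
print(sched[5])  # ['data1', '40bit_3(2)']
```

Cycle 5 shows the overlap described in the text: packet 1's data portion is read in the same cycle as the last header block of packet 2.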
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
Reference is made to various exemplary embodiments herein. However, those skilled in the art will recognize that changes and modifications may be made to the exemplary embodiments without departing from the scope herein. For example, the various operational steps and components used to perform the operational steps may be implemented in different ways (e.g., one or more steps may be deleted, modified, or combined into other steps) depending on the particular application or taking into account any number of cost functions associated with the operation of the system.
While the principles herein have been shown in various embodiments, many modifications of structure, arrangement, proportions, elements, materials, and components, which are particularly adapted to specific environments and operative requirements, may be used without departing from the principles and scope of the present disclosure. The above modifications and other changes or modifications are intended to be included within the scope of this document.
The foregoing detailed description has been described with reference to various embodiments. However, those skilled in the art will recognize that various modifications and changes may be made without departing from the scope of the present disclosure. Accordingly, the present disclosure is to be considered as illustrative and not restrictive in character, and all such modifications are intended to be included within the scope thereof. Also, advantages, other advantages, and solutions to problems have been described above with regard to various embodiments. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, system, article, or apparatus. Furthermore, the term "couple" and any other variants thereof are used herein to refer to physical connections, electrical connections, magnetic connections, optical connections, communication connections, functional connections, and/or any other connection.
Those skilled in the art will recognize that many changes may be made to the details of the above-described embodiments without departing from the underlying principles of the invention. Accordingly, the scope of the invention should be determined from the following claims.

Claims (14)

1. A method for searching data, comprising:
receiving input data, dividing the input data into a plurality of block data according to bits;
searching according to the plurality of block data in P clock cycles, wherein P is an integer greater than 1;
searching according to at least one piece of block data in each clock cycle of the P clock cycles to obtain a first output result of the at least one piece of block data, wherein each piece of block data corresponds to one first output result respectively;
in the first clock period of the P clock periods, processing according to the first output result to obtain a second output result;
in each clock cycle from the second clock cycle to the P-1 clock cycle, after receiving the second output result of the previous clock cycle, obtaining the second output result of the current clock cycle according to the first output result of the current clock cycle and the second output result of the previous clock cycle;
And in the P clock period of the P clock periods, after receiving the second output result of the P-1 clock period, obtaining a search result according to the first output result of the P clock period and the second output result of the P-1 clock period.
2. The data searching method of claim 1, wherein the processing according to the first output result to obtain a second output result comprises: and performing AND operation according to the first output results to obtain the second output result.
3. The data searching method of claim 1, wherein the obtaining the second output result of the current clock cycle based on the first output result of the current clock cycle and the second output result of the previous clock cycle comprises: and performing AND operation according to the first output results of the current clock cycle, and performing AND operation with the second output result of the previous clock cycle to obtain the second output result of the current clock cycle.
4. The data lookup method as claimed in claim 1 wherein said deriving a lookup result based on a first output result of a P-th clock cycle and a second output result of a P-1 th clock cycle comprises: and performing AND operation according to the first output results of the P clock period, and performing AND operation with the second output results of the P-1 clock period to obtain the search result.
5. The data lookup method as claimed in any one of claims 1 to 4 wherein said performing a lookup from said plurality of partitioned data in P clock cycles comprises: and searching according to at least one block data belonging to different bit intervals in P clock cycles.
6. The data searching method of claim 1, wherein the input data is data with a length of N bits, the input data is divided into Q parts by bits, wherein each part is a block data, Q is an integer greater than 1, and each of the P clock cycles is searched in the corresponding Q/P RAMs according to the Q/P parts of the input data belonging to different bit intervals, respectively, to obtain Q/P first output results.
7. The data lookup method as claimed in claim 6 wherein the specification of each of said RAMs is 2^(N/Q) × 2^M bits, where M is the address width of the data lookup and the length of the lookup result is 2^M bits.
8. The data lookup method as claimed in claim 7 wherein said M is less than said N.
9. The data lookup method as claimed in claim 7 wherein m=7 and n=160.
10. The data lookup method as claimed in claim 6 wherein the maximum value of P is Q.
11. The data lookup method as claimed in claim 6 wherein n=160, q=20, p=4; or n=160, q=40, p=40; or n=160, q=40, p=5.
12. The data searching method of claim 6, wherein searching in the corresponding Q/P RAMs based on the Q/P copies of the input data belonging to different bit intervals, respectively, to obtain Q/P of the first output results comprises:
and searching in the first group of Q/P RAMs according to the Q/P data of the low-order interval, and searching in the second group of Q/P RAMs according to the Q/P data of the high-order interval, wherein the first group of Q/P RAMs and the second group of Q/P RAMs are mutually independent.
13. The data lookup method as claimed in claim 1 wherein the input data comprises a network data packet header.
14. A data searching apparatus comprising a RAM module and a logic operation module, the RAM module and the logic operation module being configured to perform the data searching method of any one of claims 1-13.
CN202211530394.2A 2022-12-01 2022-12-01 Data searching method and data searching device Active CN115878863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211530394.2A CN115878863B (en) 2022-12-01 2022-12-01 Data searching method and data searching device


Publications (2)

Publication Number Publication Date
CN115878863A CN115878863A (en) 2023-03-31
CN115878863B CN115878863B (en) 2023-12-19

Family

ID=85765288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211530394.2A Active CN115878863B (en) 2022-12-01 2022-12-01 Data searching method and data searching device

Country Status (1)

Country Link
CN (1) CN115878863B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122270A (en) * 2011-02-23 2011-07-13 华为技术有限公司 Method and device for searching data in memory and memory
CN102663051A (en) * 2012-03-29 2012-09-12 浪潮(北京)电子信息产业有限公司 Method and system for searching content addressable memory
CN102937969A (en) * 2012-10-12 2013-02-20 浪潮电子信息产业股份有限公司 Method for quickly searching CAM (Central Address Memory)
CN105515997A (en) * 2015-12-07 2016-04-20 刘航天 BF_TCAM (Bloom Filter-Ternary Content Addressable Memory)-based high-efficiency range matching method for realizing zero range expansion
CN106656817A (en) * 2016-12-29 2017-05-10 盛科网络(苏州)有限公司 Method for reducing lookup conflicts caused by read-write of TCAM
CN108337172A (en) * 2018-01-30 2018-07-27 长沙理工大学 Extensive OpenFlow flow table classification storage architecture and acceleration lookup method
CN108875064A (en) * 2018-07-03 2018-11-23 湖南新实网络科技有限公司 OpenFlow multidimensional data matched and searched method based on FPGA
CN110442570A (en) * 2019-06-06 2019-11-12 北京左江科技股份有限公司 A kind of BitMap high speed fuzzy search method
CN114330656A (en) * 2021-12-24 2022-04-12 杭州菲数科技有限公司 Convolution operation hardware accelerator and data processing method
CN114911728A (en) * 2021-02-07 2022-08-16 华为技术有限公司 Data searching method and device and integrated circuit
CN115146769A (en) * 2022-07-20 2022-10-04 杭州颐达软件科技有限公司 Digital circuit module for calculating tanh function based on range addressable lookup table

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183551A1 (en) * 2015-05-14 2016-11-17 Walleye Software, LLC Query task processing based on memory allocation and performance criteria
JP6888787B2 (en) * 2016-05-13 2021-06-16 国立大学法人東北大学 Memory device and memory system

Also Published As

Publication number Publication date
CN115878863A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN108875064B (en) OpenFlow multidimensional data matching search method based on FPGA
US6792502B1 (en) Microprocessor having a content addressable memory (CAM) device as a functional unit therein and method of operation
CN1327674C (en) Double stack compatible router searching device supporting access control listing function on core routers
Le et al. A memory-efficient and modular approach for large-scale string pattern matching
JP2014182810A (en) Parallel apparatus for high-speed, highly compressed lz77 tokenization and huffman encoding for deflate compression
US20190052553A1 (en) Architectures and methods for deep packet inspection using alphabet and bitmap-based compression
Lee et al. Bundle-updatable SRAM-based TCAM design for openflow-compliant packet processor
Faezipour et al. Wire-speed TCAM-based architectures for multimatch packet classification
Wang et al. Memory-based architecture for multicharacter Aho–Corasick string matching
CN115878863B (en) Data searching method and data searching device
CN110324204A (en) High-speed regular expression matching engine and method implemented in FPGA (field programmable Gate array)
Trinh et al. Algorithmic TCAM on FPGA with data collision approach
US7889530B2 (en) Reconfigurable content-addressable memory
Le et al. An FPGA-based information detection hardware system employing multi-match content addressable memory
Hilgurt A Survey on Hardware Solutions for Signature-Based Security Systems.
CN109086815B (en) Floating point number discretization method in decision tree model based on FPGA
Jiang et al. A fast regular expression matching engine for NIDS applying prediction scheme
Lin et al. Pipelined parallel ac-based approach for multi-string matching
Tang et al. A real-time updatable FPGA-based architecture for fast regular expression matching
Soni et al. FPGA implementation of content addressable memory based information detection system
Le et al. A memory-efficient and modular approach for string matching on fpgas
Song et al. Fast Update Algorithm with Reorder Mechanism for SRAM-Based Longest Prefix Matching on FPGA
Tang et al. RICS‐DFA: a space and time‐efficient signature matching algorithm with Reduced Input Character Set
CN112994886B (en) Hardware for generating TCAM search keywords and implementation method
Chen et al. A multi-character transition string matching architecture based on Aho-Corasick algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant