CN117633550A - Blocking mode matching method and medium based on FM-index algorithm - Google Patents

Publication number
CN117633550A
Authority
CN
China
Prior art keywords
character
index
pattern matching
description information
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311661786.7A
Other languages
Chinese (zh)
Inventor
马熙弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202311661786.7A
Publication of CN117633550A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a block pattern matching method and medium based on the FM-index algorithm. The method comprises: acquiring, in real time, the character text to be pattern-matched, and acquiring device description information of a target terminal device in order to calculate a number of character strings; evenly dividing the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result, and determining at least one piece of index file description information corresponding to each thread; performing pattern matching on the index files in the index file description information through each target thread to obtain target thread processing sub-results; and synchronizing through a signal mechanism, aggregating the target thread processing sub-results to obtain a pattern matching result, and feeding the pattern matching result back to a user. This addresses the low efficiency of character text pattern matching and the large device memory it requires, improves the accuracy and efficiency of pattern matching, and reduces memory pressure on the device.

Description

Blocking mode matching method and medium based on FM-index algorithm
Technical Field
The invention relates to the technical field of data processing, in particular to a blocking mode matching method and medium based on an FM-index algorithm.
Background
String matching is a very common requirement in computing. In the business interaction scenarios of banks and other financial institutions, messages are the general means of communication, and over the years a large volume of message data has accumulated. Searching or compiling statistics on this data requires extracting the message fields, storing them in a database, and building indexes so that character strings can be matched.
In the course of implementing the present invention, the inventors found the following drawbacks in the prior art: a commonly used string matching method is the KMP (Knuth-Morris-Pratt) algorithm, which scans the text from beginning to end with time complexity O(n); that is, the matching time is proportional to the length of the text being searched. When matching in a file of 10GB or more, the matching time may reach tens of seconds or even minutes, and the device also needs a relatively large storage capacity.
Disclosure of Invention
The invention provides a block pattern matching method and medium based on an FM-index algorithm, so as to improve the accuracy and efficiency of text character pattern matching.
According to an aspect of the present invention, there is provided a blocking pattern matching method based on an FM-index algorithm, including:
acquiring, in real time, character text to be pattern-matched, and acquiring device description information of a target terminal device, so as to calculate a number of character strings;
evenly dividing the character text to be pattern-matched into blocks by an FM-index algorithm according to the number of character strings to obtain a character text allocation result, and determining at least one piece of index file description information corresponding to each thread in the character text allocation result;
performing pattern matching on the index files in each piece of index file description information through each target thread to obtain target thread processing sub-results;
and synchronizing through a preset signal mechanism, aggregating the target thread processing sub-results to obtain a pattern matching result corresponding to the character text to be pattern-matched, and feeding the pattern matching result back to a user.
According to another aspect of the present invention, there is provided a blocking pattern matching apparatus based on an FM-index algorithm, including:
a character string number calculation module, configured to acquire, in real time, character text to be pattern-matched, and to acquire device description information of a target terminal device, so as to calculate a number of character strings;
an index file description information determining module, configured to evenly divide the character text to be pattern-matched into blocks by an FM-index algorithm according to the number of character strings to obtain a character text allocation result, and to determine at least one piece of index file description information corresponding to each thread in the character text allocation result;
a target thread processing sub-result determining module, configured to perform pattern matching on the index files in each piece of index file description information through each target thread to obtain target thread processing sub-results;
and a pattern matching result feedback module, configured to synchronize through a preset signal mechanism, aggregate the target thread processing sub-results to obtain a pattern matching result corresponding to the character text to be pattern-matched, and feed the pattern matching result back to a user.
According to another aspect of the present invention, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements a block pattern matching method based on an FM-index algorithm according to any embodiment of the present invention when executing the computer program.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement a blocking pattern matching method based on an FM-index algorithm according to any one of the embodiments of the present invention when executed.
According to the technical scheme of the embodiments of the invention, character text to be pattern-matched is acquired in real time, and device description information of a target terminal device is acquired in order to calculate a number of character strings; the character text to be pattern-matched is evenly divided into blocks by an FM-index algorithm according to the number of character strings to obtain a character text allocation result, and at least one piece of index file description information corresponding to each thread in the character text allocation result is determined; pattern matching is performed on the index files in each piece of index file description information through each target thread to obtain target thread processing sub-results; and, synchronizing through a preset signal mechanism, the target thread processing sub-results are aggregated to obtain a pattern matching result corresponding to the character text to be pattern-matched, which is fed back to a user. This addresses the low efficiency of character text pattern matching and the large device memory it requires, improves the accuracy and efficiency of pattern matching, and reduces memory pressure on the device.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a block pattern matching method based on an FM-index algorithm according to a first embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a block pattern matching device based on an FM-index algorithm according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "target," "current," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a block pattern matching method based on an FM-index algorithm according to an embodiment of the present invention, where the method may be performed by a block pattern matching device based on the FM-index algorithm, and the block pattern matching device based on the FM-index algorithm may be implemented in hardware and/or software.
Accordingly, as shown in fig. 1, the method includes:
S110, acquiring, in real time, character text to be pattern-matched, and acquiring device description information of a target terminal device, so as to calculate the number of character strings.
The character text to be pattern-matched is the text on which pattern matching is to be performed; it comprises a plurality of characters.
The device description information is information describing the terminal device. It may include processor multithreading information and current memory description information, and the character text is processed according to these parameters. The number of character strings is the number of character strings (blocks) that the current device can process; its value is determined by the device description information.
Optionally, the device description information includes: processor multithreading information and current memory description information; and acquiring the device description information of the target terminal device to calculate the number of character strings includes: determining the number of threads the device can execute simultaneously according to the processor multithreading information; determining the number of index construction rounds according to the current memory description information; and calculating the number of character strings from the number of simultaneously executing threads and the number of index construction rounds.
The processor multithreading information describes the hardware parameters of the central processor in the terminal device. The current memory description information may be the size of the memory currently available in the terminal device. The number of simultaneously executing threads is the number of threads the target device can run at the same time to process text. The number of index construction rounds is the number of rounds over which the index of the character text is built.
For example, if the processor multithreading information indicates that the central processor of the current terminal device can run 16 threads processing character text simultaneously, the number of simultaneously executing threads is 16. If the current memory description information of the terminal device is 512GB, and one file index is assumed to require 8GB, the number of index construction rounds can be determined to be 64 (512 / 8). The number of character strings is then 16×64 = 1024.
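The arithmetic of this example can be sketched as follows. This is a minimal illustration only: the function name `block_count` and its parameters are hypothetical, and the rule for deriving the number of rounds (memory size divided by the assumed size of one file index) simply mirrors the worked example above.

```python
def block_count(hw_threads: int, memory_gb: int, index_gb_per_file: int) -> tuple[int, int, int]:
    """Return (threads, rounds, number of character strings) as in the worked example.

    hw_threads        -- threads the processor can run simultaneously (assumed input)
    memory_gb         -- current memory size in GB (assumed input)
    index_gb_per_file -- assumed memory needed for one file index, in GB
    """
    rounds = memory_gb // index_gb_per_file   # 512 // 8 = 64 in the example
    return hw_threads, rounds, hw_threads * rounds


threads, rounds, blocks = block_count(16, 512, 8)
print(threads, rounds, blocks)  # 16 64 1024
```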
S120, evenly dividing the character text to be pattern-matched into blocks by an FM-index algorithm according to the number of character strings to obtain a character text allocation result, and determining at least one piece of index file description information corresponding to each thread in the character text allocation result.
The character text allocation result is obtained by dividing the character text into blocks; it comprises a plurality of blocks of character text. The index file description information describes the files of each index construction round.
Continuing the previous example, since the number of character strings is 16×64, the character text to be pattern-matched can be evenly divided into blocks by the FM-index algorithm to obtain the character text allocation result. Because each block in the allocation result must be assigned to a thread, and each thread performs pattern matching on the text assigned to it, the index file description information corresponding to each thread must be obtained. Since the number of index construction rounds is 64, there may be 64 pieces of index file description information.
Optionally, the index file description information includes: the index files and the number of index files; and determining the at least one piece of index file description information corresponding to each thread in the character text allocation result includes: acquiring the number of index construction rounds corresponding to each character text allocation sub-result in the character text allocation result; and determining, according to the number of index construction rounds, the index files and the number of index files corresponding to each thread.
Continuing the previous example, since the number of index construction rounds is 64, there may be 64 pieces of index file description information and hence 64 index files; that is, the index files and their number are determined by the number of index construction rounds in the character text allocation result.
Optionally, evenly dividing the character text to be pattern-matched into blocks by an FM-index algorithm according to the number of character strings to obtain a character text allocation result includes: judging whether the character text to be pattern-matched is a gene reference text; if not, numbering the character text to be pattern-matched to obtain a character numbering result; taking the remainder of the character numbering result with respect to the number of character strings to obtain a remainder result; and evenly dividing the character text to be pattern-matched into blocks according to the remainder result to obtain the character text allocation result.
The character numbering result is the result of numbering each character in the character text to be pattern-matched.
In this embodiment, when the character text to be pattern-matched is determined not to be a gene reference text (it may then be a read-set text, which is assumed here), the read-set text must be numbered, so that each character has a corresponding number.
After the character numbering result is determined, the remainder of each number with respect to the number of character strings is computed to obtain the remainder result. Each read-set text is then appended to the end of the character string selected by its remainder, so that the read-set texts are distributed evenly over the character strings. A read-set text must not be truncated here: each block is an independent text, and a truncated read-set text could not be matched correctly.
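The remainder-based allocation can be illustrated with a short sketch. It assumes that each read-set entry (rather than each individual character) receives one sequential number, and that an entry goes to the block whose index equals that number modulo the number of character strings; the function name `allocate_reads` is hypothetical.

```python
def allocate_reads(reads: list[str], num_blocks: int) -> list[list[str]]:
    """Distribute whole read-set texts over num_blocks blocks by number modulo num_blocks."""
    blocks: list[list[str]] = [[] for _ in range(num_blocks)]
    for number, read in enumerate(reads):
        # The read is appended whole; it is never cut across two blocks.
        blocks[number % num_blocks].append(read)
    return blocks


reads = ["ACGT", "TTGA", "CCAT", "GGTA", "ATAT"]
print(allocate_reads(reads, 2))
# [['ACGT', 'CCAT', 'ATAT'], ['TTGA', 'GGTA']]
```

Because consecutive numbers cycle through the blocks, the block sizes differ by at most one entry.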
Optionally, after judging whether the character text to be pattern-matched is a gene reference text, the method further includes: if so, computing sequence length statistics for the character text to be pattern-matched; judging whether, according to the sequence length statistics, the text can be evenly allocated over the number of character strings; and if not, truncating the character text to be pattern-matched and adding character text description information to the index file description information; wherein the character text description information includes at least one of: the character text sequence name, the character text sequence length, and the truncation position in the original character text sequence.
The sequence length statistics are the result of counting the sequence lengths in the gene reference text.
Specifically, dividing a gene reference text into blocks is more complicated, because the base sequences in it differ in length, ranging from several KB to hundreds of MB. For matching to be as efficient as possible, the base sequences should be distributed as evenly as possible over the threads. The lengths of the base sequences are therefore counted and analyzed to see whether they can be divided reasonably evenly among the threads; if they cannot, truncating the long base sequences must be considered.
When a sequence is truncated, an overlap sized according to the maximum pattern length is added at the truncation position, so that pattern matching works correctly across the cut. In addition, the index file description information should record, for each block, the character text sequence name (here the base sequence name), the character text sequence length (here the base sequence length), and the truncation position in the original character text sequence, so that when results are output they can be converted to positions in the original sequence, matches spanning two sequences can be excluded, and duplicates caused by the overlap can be removed.
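A minimal sketch of truncation with overlap is given below. It rests on stated assumptions: the overlap is taken as the maximum pattern length minus one, a plain substring search stands in for the index query, and the names `Chunk`, `split_with_overlap` and `find_all` are hypothetical. The recorded fields correspond to the sequence name, sequence length and truncation position mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    seq_name: str    # name of the original sequence
    seq_length: int  # length of the original sequence
    start: int       # truncation position in the original sequence
    text: str        # chunk content, including the overlap


def split_with_overlap(name: str, seq: str, chunk_len: int, max_pattern_len: int) -> list[Chunk]:
    """Cut seq every chunk_len characters, extending each chunk by an overlap."""
    overlap = max_pattern_len - 1  # assumed: enough for a pattern starting just before the cut
    chunks = []
    for start in range(0, len(seq), chunk_len):
        end = min(start + chunk_len + overlap, len(seq))
        chunks.append(Chunk(name, len(seq), start, seq[start:end]))
    return chunks


def find_all(chunks: list[Chunk], pattern: str) -> list[tuple[str, int]]:
    """Search every chunk, convert hits to original positions, drop overlap duplicates."""
    hits = set()
    for c in chunks:
        pos = c.text.find(pattern)
        while pos != -1:
            hits.add((c.seq_name, c.start + pos))  # the set removes repeats from overlaps
            pos = c.text.find(pattern, pos + 1)
    return sorted(hits)


chunks = split_with_overlap("chr_demo", "ACGTACGTAC", chunk_len=4, max_pattern_len=3)
print(find_all(chunks, "TAC"))  # [('chr_demo', 3), ('chr_demo', 7)]
```

The occurrence of "TAC" at position 3 straddles the first cut (after position 3) and is still found, because the first chunk extends two characters past the cut.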
S130, respectively carrying out pattern matching processing on the index files in the index file description information through each target thread to obtain target thread processing sub-results.
The target thread processing sub-result is the processing result obtained by each thread.
Continuing the previous example, the 16 threads of the target terminal device each perform pattern matching on their corresponding index files, so that each thread produces its own target thread processing sub-result, giving 16 sub-results in total.
Optionally, performing pattern matching on the index files in each piece of index file description information through each target thread to obtain target thread processing sub-results includes: the target threads run in parallel, each performing pattern matching on its index files serially, to obtain the target thread processing sub-result corresponding to each target thread.
In this embodiment, suppose that among the 16 threads, thread 1 is assigned 3 string blocks and thread 2 is assigned 5. Thread 1 and thread 2 run at the same time: thread 1 processes its 3 string blocks one after another, and thread 2 processes its 5 string blocks one after another. This yields one target thread processing sub-result for thread 1 and one for thread 2.
Specifically, having the target threads run in parallel, each performing pattern matching on its index files serially, to obtain the target thread processing sub-result corresponding to each target thread includes: obtaining the index files corresponding to each target thread; and having the target threads, in parallel, each perform pattern matching serially on its corresponding index files, and obtaining the target thread processing sub-result of each target thread once that thread's pattern matching is complete.
Continuing the previous example, once each of the 16 threads has processed its string blocks and obtained its target thread processing sub-result, the sub-results can be aggregated.
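The "parallel across threads, serial within a thread" arrangement can be sketched as below. This is an illustration under assumptions: `match_block` is a hypothetical stand-in that counts non-overlapping occurrences with `str.count` instead of querying a real index file, and the block contents are invented for the demonstration.

```python
import threading


def match_block(block: str, pattern: str) -> int:
    """Stand-in for querying one index file: counts non-overlapping occurrences."""
    return block.count(pattern)


def worker(thread_id: int, blocks: list[str], pattern: str, results: dict[int, int]) -> None:
    total = 0
    for block in blocks:  # blocks of one thread are processed one after another
        total += match_block(block, pattern)
    results[thread_id] = total  # each thread writes only its own key


def run(assignment: dict[int, list[str]], pattern: str) -> dict[int, int]:
    results: dict[int, int] = {}
    threads = [
        threading.Thread(target=worker, args=(tid, blocks, pattern, results))
        for tid, blocks in assignment.items()
    ]
    for t in threads:  # all threads run at the same time
        t.start()
    for t in threads:
        t.join()
    return results


# Thread 1 holds 3 string blocks and thread 2 holds 5, as in the example above.
assignment = {1: ["abcab", "cab", "bb"], 2: ["ab", "xx", "abab", "b", "aab"]}
sub_results = run(assignment, "ab")
print(sub_results[1], sub_results[2])  # 3 4
```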
S140, synchronizing through a preset signal mechanism, aggregating the target thread processing sub-results to obtain a pattern matching result corresponding to the character text to be pattern-matched, and feeding the pattern matching result back to a user.
The pattern matching result is the result of aggregating the target thread processing sub-results.
In this embodiment, once the pattern matching result is obtained it is fed back to the user, who can then obtain the analysis information corresponding to the character text to be pattern-matched.
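The text does not specify the preset signal mechanism. As one possible illustration, the sketch below uses a `threading.Event` per worker as the signal: each thread sets its event when its sub-result is ready, and the aggregating code waits for every event before summing. The function name `run_with_signals` and the use of `str.count` as the matching step are assumptions made for the example.

```python
import threading


def run_with_signals(assignment: dict[int, list[str]], pattern: str) -> int:
    sub_results: dict[int, int] = {}
    done = {tid: threading.Event() for tid in assignment}

    def worker(tid: int, blocks: list[str]) -> None:
        sub_results[tid] = sum(block.count(pattern) for block in blocks)
        done[tid].set()  # signal: this thread's sub-result is ready

    for tid, blocks in assignment.items():
        threading.Thread(target=worker, args=(tid, blocks)).start()

    for event in done.values():  # synchronize: block until every thread has signalled
        event.wait()

    return sum(sub_results.values())  # aggregate the sub-results


print(run_with_signals({0: ["abab"], 1: ["ab", "b"]}, "ab"))  # 3
```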
For example, if block index construction uses n threads, that is, builds indexes for n blocks simultaneously, and runs in m index construction rounds, the number of index blocks (the number of character strings) is n×m. The performance of the block index is analyzed theoretically as follows:
1) Space occupation when using the index: the space occupied when using the index equals the size of the index files. Blocking has almost no effect on this: without compression, each index is still about 1.1 times the size of its original text, and each block index additionally stores less than 1KB of information in total, which has little impact.
2) Time efficiency when using the index: a search with the index consists of two parts. The first performs LF mapping repeatedly on the BWT text to arrive at the final [sp, ep] interval. The second, for each result in the interval, uses the simplified bwt_array to find its position in the original text and excludes matches that span different read-set texts. The time complexity of a match is therefore linear in the number of occurrences of the pattern T and in the length j of T (the pattern T is the text being searched for).
With a split across threads, the total number of LF mappings increases, but they run simultaneously, so the elapsed time is unchanged; the total number of occurrences of T is unchanged, so result processing is spread over the threads. Hence for a pattern that occurs only once the time efficiency is unchanged, while for a pattern that occurs many times it improves to a degree that depends on how the occurrences are distributed. With a split into rounds, the number of LF mappings increases and they run one after another, so the running time grows linearly with the number of rounds m, while the time for processing results is unchanged.
3) Space occupation when constructing the index: when splitting only across threads, the peak occupation of the arrays and strings is unchanged and the recursion stack grows somewhat; the peak memory occupation becomes $5i + n\log_2(i/n)$. When splitting into rounds, peak memory occupation drops substantially, to $i + (4/m)i + \log_2(i/m)$. The core idea for reducing memory occupation is therefore to build the index of a long text in several passes. Here i denotes the length of the character text to be pattern-matched.
4) Time efficiency when constructing the index: in practice, the time for building the index is dominated by reading the file and sorting the matrix. The total time for reading the file is bounded by disk read bandwidth and is not shortened by splitting into rounds or threads, whereas the time for sorting the matrix can be. In the ideal case the sorting time complexity is $i \log_2 i \, \log_{|k|} i$, where |k| is the size of the character set.
With multiple threads, the sorting time becomes $(i/n)\log_2(i/n)\log_{|k|}(i/n)$; with a split into rounds the sorting time also changes (the expression is not reproduced in this text). In both cases the time spent sorting is shortened to some degree, which reduces the total time.
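The two-part search described in item 2) can be illustrated with a small, unoptimized FM-index sketch. It is written under assumptions: a full suffix array is kept in place of the simplified bwt_array, the interval is half-open [sp, ep) rather than closed, the text is assumed not to contain the sentinel "$", and the function names `build_fm_index` and `locate` are hypothetical.

```python
def build_fm_index(text: str):
    """Build a naive FM-index: suffix array, C table and occurrence table."""
    text += "$"  # sentinel, assumed to sort before every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, C, occ


def locate(pattern: str, sa, C, occ) -> list[int]:
    """Backward search: one LF-mapping step per pattern character, then position lookup."""
    sp, ep = 0, len(sa)  # half-open interval [sp, ep) over the sorted suffixes
    for c in reversed(pattern):
        if c not in C:
            return []
        sp = C[c] + occ[c][sp]
        ep = C[c] + occ[c][ep]
        if sp >= ep:
            return []
    return sorted(sa[k] for k in range(sp, ep))  # positions in the original text


sa, C, occ = build_fm_index("banana")
print(locate("ana", sa, C, occ))  # [1, 3]
```

The loop runs once per pattern character and the final lookup once per occurrence, which matches the linear dependence on the pattern length j and on the number of occurrences stated in item 2).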
In summary, when indexing with the blocked FM-index algorithm, the number of threads should be as high as possible both when constructing and when using the index, since this effectively improves time efficiency. When memory cannot accommodate the peak occupation during index construction, building the index in rounds should be considered; increasing the number of rounds m does not noticeably reduce construction time, but it noticeably increases the time for pattern matching with the index.
The total number of blocks must be the same when constructing and when using the index, but the split between threads and rounds may differ: the index can be constructed with few threads and many rounds to avoid running out of memory, at the cost of a noticeably longer construction time, and used with many threads and few rounds to improve memory and thread utilization and time efficiency. In practice, the number of rounds m used for construction should be large enough for index construction to run smoothly, and the number of threads n used for querying should be as large as possible so that CPU utilization is high enough. m and n are then chosen as a trade-off between the time efficiency of constructing and of using the index.
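As a rough illustration of choosing the number of rounds from a memory budget, the sketch below evaluates the round-split peak expression from item 3), i + (4/m)i + log2(i/m), and returns the smallest m that fits. The units are left abstract because the text does not state them, and the names `peak_memory_rounds` and `min_rounds` as well as the search limit are hypothetical.

```python
import math
from typing import Optional


def peak_memory_rounds(i: int, m: int) -> float:
    """Peak memory estimate when the index is built in m rounds (expression from item 3)."""
    return i + (4 / m) * i + math.log2(i / m)


def min_rounds(i: int, budget: float, max_rounds: int = 1024) -> Optional[int]:
    """Smallest number of rounds whose estimated peak fits the budget, or None."""
    for m in range(1, max_rounds + 1):
        if peak_memory_rounds(i, m) <= budget:
            return m
    return None  # the budget is below what any tested round count can reach


print(min_rounds(1000, 1500))  # 9
```

Since the estimate never drops below i, a budget smaller than the text length cannot be met by adding rounds alone.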
According to the technical scheme of the embodiments of the invention, character text to be pattern-matched is acquired in real time, and device description information of a target terminal device is acquired in order to calculate a number of character strings; the character text to be pattern-matched is evenly divided into blocks by an FM-index algorithm according to the number of character strings to obtain a character text allocation result, and at least one piece of index file description information corresponding to each thread in the character text allocation result is determined; pattern matching is performed on the index files in each piece of index file description information through each target thread to obtain target thread processing sub-results; and, synchronizing through a preset signal mechanism, the target thread processing sub-results are aggregated to obtain a pattern matching result corresponding to the character text to be pattern-matched, which is fed back to a user. This addresses the low efficiency of character text pattern matching and the large device memory it requires, improves the accuracy and efficiency of pattern matching, and reduces memory pressure on the device.
Example II
Fig. 2 is a schematic structural diagram of a block pattern matching device based on the FM-index algorithm according to a second embodiment of the present invention. The device may be implemented in software and/or hardware and may be configured in a terminal device to carry out the block pattern matching method based on the FM-index algorithm. As shown in fig. 2, the device includes a character string number calculation module 210, an index file description information determination module 220, a target thread processing sub-result determination module 230 and a pattern matching result feedback module 240.
The character string number calculation module 210 is configured to acquire, in real time, the character text to be pattern-matched and to acquire device description information of a target terminal device, so as to calculate the number of character strings;
the index file description information determining module 220 is configured to evenly divide the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result, and to determine at least one piece of index file description information corresponding to each thread in the character text allocation result;
the target thread processing sub-result determining module 230 is configured to perform, through each target thread, pattern matching on the index files in the index file description information to obtain target thread processing sub-results;
and the pattern matching result feedback module 240 is configured to synchronize through a preset signal mechanism, aggregate the target thread processing sub-results to obtain the pattern matching result corresponding to the character text to be pattern-matched, and feed the pattern matching result back to the user.
According to the technical solution of this embodiment, the character text to be pattern-matched is acquired in real time, and the device description information of the target terminal device is acquired to calculate the number of character strings; according to the number of character strings, the character text to be pattern-matched is evenly divided into blocks by the FM-index algorithm to obtain a character text allocation result, and at least one piece of index file description information corresponding to each thread in the character text allocation result is determined; each target thread performs pattern matching on the index files in its index file description information to obtain a target thread processing sub-result; and the target thread processing sub-results are synchronized through a preset signal mechanism and aggregated to obtain the pattern matching result corresponding to the character text to be pattern-matched, which is fed back to the user. This addresses the low efficiency of character text pattern matching and the large device memory it requires, improves the accuracy and efficiency of character text pattern matching, and reduces memory pressure on the device.
On the basis of the above embodiments, the device description information includes processor multithreading information and current memory description information.
On the basis of the above embodiments, the character string number calculation module 210 may be specifically configured to: determine the number of threads the device can execute simultaneously according to the processor multithreading information; determine the number of index construction rounds according to the current memory description information; and calculate the number of character strings according to the number of simultaneously executed threads and the number of index construction rounds.
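The calculation described above might look like the following sketch. It assumes that the number of character strings is the product of the thread count and the round count, and that building an index needs roughly `memory_factor` times the size of the text being indexed; that multiplier, the function names and the example sizes are illustrative assumptions, not values from the patent.

```python
import math
import os


def simultaneous_threads() -> int:
    """Number of threads the device can run at once (falls back to 1)."""
    return os.cpu_count() or 1


def construction_rounds(text_bytes: int, free_memory_bytes: int,
                        memory_factor: float = 10.0) -> int:
    """Smallest round count n whose per-round working set fits in free memory.

    With m threads and n rounds, each round indexes m blocks of size
    text_bytes / (m * n) at once, so the peak need is about
    memory_factor * text_bytes / n regardless of m.
    memory_factor is an assumed, hypothetical multiplier.
    """
    needed = text_bytes * memory_factor
    return max(1, math.ceil(needed / free_memory_bytes))


def string_count(threads: int, rounds: int) -> int:
    """Total number of blocks (character strings), assumed to be m * n."""
    return threads * rounds


# Example: a 4 GiB text on a device with 16 GiB of free memory and 8 threads.
rounds = construction_rounds(4 * 2**30, 16 * 2**30)
total = string_count(8, rounds)
```

In a real deployment the thread count would come from `simultaneous_threads()` and the free memory from the operating system; fixed numbers are used here only to keep the example deterministic.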
On the basis of the above embodiments, the index file description information determining module 220 may be specifically configured to: judge whether the character text to be pattern-matched is a gene reference text and, if not, number the character text to be pattern-matched to obtain a character numbering result; perform a remainder operation on the character numbering result with respect to the number of character strings to obtain a remainder processing result; and evenly divide the character text to be pattern-matched into blocks according to the remainder processing result to obtain the character text allocation result.
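One plausible reading of the numbering and remainder steps is a round-robin assignment: each text receives a sequential number, and the remainder of that number divided by the number of character strings selects its block. The sketch below follows that reading; the function name is hypothetical.

```python
def allocate_by_remainder(texts: list[str], block_count: int) -> list[list[str]]:
    """Assign text number i to block i % block_count (round-robin)."""
    blocks: list[list[str]] = [[] for _ in range(block_count)]
    for number, text in enumerate(texts):
        blocks[number % block_count].append(text)
    return blocks


allocation = allocate_by_remainder(["a", "b", "c", "d", "e"], 2)
```

This keeps the number of texts per block within one of each other, although block sizes in characters still depend on the individual text lengths.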
On the basis of the above embodiments, the index file description information determining module 220 may be further specifically configured to: if the character text to be pattern-matched is a gene reference text, compute sequence length statistics for the character text to be pattern-matched; judge, according to the sequence length statistics, whether the sequences can be divided evenly among the number of character strings and, if not, truncate the character text to be pattern-matched and add character text description information to the index file description information; wherein the character text description information includes at least one of: the character text sequence name, the character text sequence length, and the truncation position in the original character text sequence.
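The truncation step can be sketched as follows. The sketch cuts named sequences into blocks of near-equal total length and records, for every piece, the sequence name, the piece length and the truncation position in the original sequence. The `Segment` type, the greedy fill strategy and the sample sequences are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str    # character text sequence name
    length: int  # length of this piece
    offset: int  # truncation position in the original sequence
    text: str


def split_evenly(sequences: dict[str, str], block_count: int) -> list[list[Segment]]:
    """Greedily fill block_count blocks, truncating sequences at block limits."""
    total = sum(len(seq) for seq in sequences.values())
    target = -(-total // block_count)  # ceiling division
    blocks: list[list[Segment]] = [[] for _ in range(block_count)]
    current, room = 0, target
    for name, seq in sequences.items():
        offset = 0
        while offset < len(seq):
            take = min(room, len(seq) - offset)
            blocks[current].append(
                Segment(name, take, offset, seq[offset:offset + take]))
            offset += take
            room -= take
            # Move on to the next block once this one is full.
            if room == 0 and current < block_count - 1:
                current += 1
                room = target
    return blocks


blocks = split_evenly({"chr1": "ACGTAC", "chr2": "GGTT"}, 3)
```

Note that this sketch does not handle patterns that straddle a truncation point; the recorded offsets only make it possible to map a match inside a piece back to its position in the original sequence.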
On the basis of the above embodiments, the index file description information includes the index files and the number of index files; the index file description information determining module 220 may be specifically configured to: acquire the number of index construction rounds corresponding to each character text allocation sub-result in the character text allocation result; and determine, according to the number of index construction rounds, the index files and the number of index files corresponding to each thread.
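As a rough illustration, if each thread handles one block per round, a thread ends up with as many index files as there are rounds. The file naming scheme and the contiguous assignment below are hypothetical choices, not details from the patent.

```python
def index_files_per_thread(threads: int, rounds: int) -> dict[int, list[str]]:
    """Map each thread id to the index files it owns (one file per round)."""
    return {
        t: [f"block_{t * rounds + r}.fmi" for r in range(rounds)]
        for t in range(threads)
    }


assignment = index_files_per_thread(threads=2, rounds=3)
```

Because the block numbers run from 0 to threads * rounds - 1, the same set of files can be reassigned with a different (threads, rounds) split as long as the product is unchanged.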
On the basis of the above embodiments, the target thread processing sub-result determining module 230 may be specifically configured to: have the target threads run in parallel, each performing serial pattern matching on its index files, and obtain the target thread processing sub-result corresponding to each target thread.
On the basis of the above embodiments, the target thread processing sub-result determining module 230 may be specifically configured to: obtain the index files corresponding to each target thread; and have the target threads run in parallel, each performing serial pattern matching on its corresponding index files, and, when every target thread has completed its pattern matching, obtain the target thread processing sub-result corresponding to each target thread.
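The parallel-over-threads, serial-within-thread scheme can be sketched with a toy in-memory FM-index. The index construction below is a naive textbook version (suffixes are sorted directly), in-memory indexes stand in for index files, and a semaphore stands in for the "preset signal mechanism"; all of these are assumptions for illustration rather than the patent's implementation.

```python
import threading


def build_fm_index(text: str):
    """Build a naive FM-index: C table, Occ table and BWT length."""
    text += "\0"  # sentinel, smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts: dict[str, int] = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (key == ch)
    return c_table, occ, len(bwt)


def count_matches(index, pattern: str) -> int:
    """Backward search: count occurrences of pattern in the indexed text."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def match_blocks(indexes_per_thread, pattern: str) -> int:
    """Each thread scans its own indexes serially; results are summed."""
    sub_results = [0] * len(indexes_per_thread)
    done = threading.Semaphore(0)  # stand-in for the signal mechanism

    def worker(thread_id, indexes):
        sub_results[thread_id] = sum(count_matches(ix, pattern) for ix in indexes)
        done.release()  # signal that this thread's sub-result is ready

    for thread_id, indexes in enumerate(indexes_per_thread):
        threading.Thread(target=worker, args=(thread_id, indexes)).start()
    for _ in indexes_per_thread:
        done.acquire()  # wait until every thread has signalled
    return sum(sub_results)


indexes = [[build_fm_index("banana"), build_fm_index("bandana")],
           [build_fm_index("cabana")]]
total_matches = match_blocks(indexes, "ana")
```

In CPython the global interpreter lock limits CPU parallelism for pure-Python code, so this sketch illustrates the control flow rather than the achievable speed-up.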
The block mode matching device based on the FM-index algorithm provided by the embodiment of the invention can execute the block mode matching method based on the FM-index algorithm provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example III
Fig. 3 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement a third embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 3, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the block pattern matching method based on the FM-index algorithm.
In some embodiments, a block pattern matching method based on the FM-index algorithm may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the block pattern matching method based on the FM-index algorithm described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform a block pattern matching method based on the FM-index algorithm in any other suitable way (e.g. by means of firmware).
The method comprises the following steps: acquiring, in real time, the character text to be pattern-matched, and acquiring device description information of a target terminal device to calculate the number of character strings; evenly dividing the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result, and determining at least one piece of index file description information corresponding to each thread in the character text allocation result; performing, through each target thread, pattern matching on the index files in the index file description information to obtain target thread processing sub-results; and synchronizing through a preset signal mechanism, aggregating the target thread processing sub-results to obtain the pattern matching result corresponding to the character text to be pattern-matched, and feeding the pattern matching result back to the user.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Example IV
A fourth embodiment of the present invention also provides a computer-readable storage medium containing computer-readable instructions which, when executed by a computer processor, perform a block pattern matching method based on the FM-index algorithm, the method comprising: acquiring, in real time, the character text to be pattern-matched, and acquiring device description information of a target terminal device to calculate the number of character strings; evenly dividing the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result, and determining at least one piece of index file description information corresponding to each thread in the character text allocation result; performing, through each target thread, pattern matching on the index files in the index file description information to obtain target thread processing sub-results; and synchronizing through a preset signal mechanism, aggregating the target thread processing sub-results to obtain the pattern matching result corresponding to the character text to be pattern-matched, and feeding the pattern matching result back to the user.
Of course, the computer-readable storage medium provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform the related operations in the block pattern matching method based on the FM-index algorithm provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the block pattern matching device based on the FM-index algorithm, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Claims (10)

1. A blocking pattern matching method based on the FM-index algorithm, characterized by comprising the following steps:
acquiring, in real time, the character text to be pattern-matched, and acquiring device description information of a target terminal device to calculate the number of character strings;
evenly dividing the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result, and determining at least one piece of index file description information corresponding to each thread in the character text allocation result;
performing, through each target thread, pattern matching on the index files in the index file description information to obtain target thread processing sub-results;
and synchronizing through a preset signal mechanism, aggregating the target thread processing sub-results to obtain the pattern matching result corresponding to the character text to be pattern-matched, and feeding the pattern matching result back to the user.
2. The method of claim 1, wherein the device description information comprises: processor multithreading information and current memory description information;
the acquiring device description information of the target terminal device to calculate the number of character strings comprises:
determining the number of threads the device can execute simultaneously according to the processor multithreading information;
determining the number of index construction rounds according to the current memory description information;
and calculating the number of character strings according to the number of simultaneously executed threads and the number of index construction rounds.
3. The method according to claim 2, wherein the evenly dividing the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result comprises:
judging whether the character text to be pattern-matched is a gene reference text and, if not, numbering the character text to be pattern-matched to obtain a character numbering result;
performing a remainder operation on the character numbering result with respect to the number of character strings to obtain a remainder processing result;
and evenly dividing the character text to be pattern-matched into blocks according to the remainder processing result to obtain the character text allocation result.
4. The method according to claim 3, further comprising, after the judging whether the character text to be pattern-matched is a gene reference text:
if yes, computing sequence length statistics for the character text to be pattern-matched;
judging, according to the sequence length statistics, whether the sequences can be divided evenly among the number of character strings and, if not, truncating the character text to be pattern-matched and adding character text description information to the index file description information;
wherein the character text description information includes at least one of: the character text sequence name, the character text sequence length, and the truncation position in the original character text sequence.
5. The method of claim 4, wherein the index file description information comprises: the index files and the number of index files;
the determining at least one piece of index file description information corresponding to each thread in the character text allocation result comprises:
acquiring the number of index construction rounds corresponding to each character text allocation sub-result in the character text allocation result;
and determining, according to the number of index construction rounds, the index files and the number of index files corresponding to each thread.
6. The method of claim 5, wherein the performing, through each target thread, pattern matching on the index files in the index file description information to obtain target thread processing sub-results comprises:
running the target threads in parallel, each performing serial pattern matching on its index files, and obtaining the target thread processing sub-result corresponding to each target thread.
7. The method of claim 6, wherein the running the target threads in parallel, each performing serial pattern matching on its index files, and obtaining the target thread processing sub-result corresponding to each target thread comprises:
obtaining the index files corresponding to each target thread;
and running the target threads in parallel, each performing serial pattern matching on its corresponding index files, and, when every target thread has completed its pattern matching, obtaining the target thread processing sub-result corresponding to each target thread.
8. A blocking pattern matching device based on the FM-index algorithm, comprising:
a character string number calculation module, configured to acquire, in real time, the character text to be pattern-matched and to acquire device description information of a target terminal device to calculate the number of character strings;
an index file description information determining module, configured to evenly divide the character text to be pattern-matched into blocks by the FM-index algorithm according to the number of character strings to obtain a character text allocation result, and to determine at least one piece of index file description information corresponding to each thread in the character text allocation result;
a target thread processing sub-result determining module, configured to perform, through each target thread, pattern matching on the index files in the index file description information to obtain target thread processing sub-results;
and a pattern matching result feedback module, configured to synchronize through a preset signal mechanism, aggregate the target thread processing sub-results to obtain the pattern matching result corresponding to the character text to be pattern-matched, and feed the pattern matching result back to the user.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the FM-index algorithm based blocking pattern matching method according to any one of claims 1-7 when executing the computer program.
10. A computer readable storage medium storing computer instructions for causing a processor to implement the FM-index algorithm based blocking pattern matching method according to any one of claims 1-7 when executed.
CN202311661786.7A 2023-12-05 2023-12-05 Blocking mode matching method and medium based on FM-index algorithm Pending CN117633550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311661786.7A CN117633550A (en) 2023-12-05 2023-12-05 Blocking mode matching method and medium based on FM-index algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311661786.7A CN117633550A (en) 2023-12-05 2023-12-05 Blocking mode matching method and medium based on FM-index algorithm

Publications (1)

Publication Number Publication Date
CN117633550A (en) 2024-03-01

Family

ID=90018021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311661786.7A Pending CN117633550A (en) 2023-12-05 2023-12-05 Blocking mode matching method and medium based on FM-index algorithm

Country Status (1)

Country Link
CN (1) CN117633550A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination