CN115687560A - Mass keyword searching method based on finite automaton - Google Patents

Publication number: CN115687560A
Authority: CN (China)
Prior art keywords: character, identity, linked list, query, node
Legal status: Pending
Application number: CN202211370688.3A
Other languages: Chinese (zh)
Inventors: 陈汝龙, 柴玉倩, 孙勤
Current Assignee: Qichacha Technology Co ltd
Original Assignee: Qichacha Technology Co ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a mass keyword searching method based on a finite automaton. The method comprises performing the following single-character query: acquiring a single character from the information to be searched and converting the character into an identity; inputting the identity into a deterministic finite automaton, wherein the automaton comprises a multilayer structure, each layer comprises at least one linked list, each linked list is composed of child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order; querying the identity in a linked list according to a binary search algorithm; and acquiring the position of the character in the finite automaton. After the query of the single character is finished, the next character in the information to be searched is taken out and queried, wherein the starting position of the next character query is the position of the previous character in the finite automaton. The method can reduce memory usage and improve query speed.

Description

Mass keyword searching method based on finite automaton
Technical Field
The application relates to the technical field of data processing, in particular to a mass keyword searching method based on a finite automaton.
Background
With the development of internet technology, clients that communicate with the outside world, such as communication software, email, and short messages, are increasingly common in users' daily lives. To prevent users from using keywords that are harmful to public security while using the internet, the text entered by the user at the client is generally screened against a keyword library to determine whether the text contains keywords.
In the related art, a lexicon is built into a tree structure according to a finite automaton algorithm, and an efficient search is performed in the tree structure according to the input text. However, the lexicon is typically large, and a large lexicon consumes a large amount of memory.
Disclosure of Invention
Therefore, it is necessary to provide a mass keyword searching method based on a finite automaton that converts characters into identities, stores the identities belonging to the same parent node in the same linked list to form the finite automaton, and uses binary search, thereby improving search performance, reducing memory usage, and improving query speed.
In a first aspect, the application provides a method for searching massive keywords based on a finite automaton. The method comprises the following steps:
performing the following single-character query:
acquiring a single character from the information to be searched, and converting the character into an identity, wherein the character and the identity correspond to each other;
inputting the identity into a deterministic finite automaton, wherein the automaton comprises a multilayer structure, each layer comprises at least one linked list, each linked list is composed of child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order;
querying the identity in a linked list according to a binary search algorithm, wherein the linked list comprises at least one identity;
acquiring the position of the character in the finite automaton;
and after the query of the single character is finished, taking out the next character in the information to be searched and executing the character query, wherein the starting position of the next character query is the position of the previous character in the finite automaton, until the last character in the information to be searched has been queried, thereby obtaining the keyword in the information to be searched.
In one embodiment, the converting the character into the identity further includes:
querying whether the character has a corresponding identity, and if not, creating an identity for the character, wherein the identities are arranged in order.
In one embodiment, if a node in the linked list is not a parent node of other child nodes, the node which is not a parent node of other child nodes in the linked list is labeled to obtain labeling information of the node, wherein the labeling information includes the position of the node in the finite automaton.
In one embodiment, the obtaining the position of the character in the finite automaton further comprises:
judging whether a new identity exists, and if so, finding the parent node of the new identity and adding the new identity to a linked list according to a binary search method, wherein the linked list comprises the child nodes of the parent node.
In one embodiment, the extracting the next character in the information to be searched to perform the character query further includes:
if the next character is not a child node of the position of the previous character in the finite automaton, re-executing the character query on the next character to obtain the position of the next character in the finite automaton.
In a second aspect, the present application further provides a device for searching for a large number of keywords based on a finite automaton, where the device includes:
the following single character query is performed:
the conversion module is used for acquiring a single character in the information to be searched and converting the character into an identity, wherein the character corresponds to the identity;
the query module is used for inputting the identity into the finite automaton, wherein the finite automaton comprises a multilayer structure, each layer comprises at least one linked list, each linked list is composed of child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order; and for querying the identity in a linked list according to a binary search algorithm, wherein the linked list comprises at least one identity;
the confirmation module is used for acquiring the position of the character in the finite automaton;
and the keyword acquisition module is used for taking out the next character in the information to be searched and executing the character query after the query of the single character is finished, wherein the starting position of the next character query is the position of the previous character in the finite automaton, until the last character in the information to be searched has been queried, thereby obtaining the keyword in the information to be searched.
In one embodiment, the converting the character into the identity further includes:
the creating module is used for querying whether the character has a corresponding identity, and if not, creating an identity for the character, wherein the identities are arranged in order.
In one embodiment, if a node in the linked list is not a parent node of other child nodes, the node which is not a parent node of other child nodes in the linked list is labeled to obtain labeling information of the node, wherein the labeling information includes the position of the node in the finite automaton.
In one embodiment, the obtaining the position of the character in the finite automaton further comprises:
judging whether a new identity exists, and if so, finding the parent node of the new identity and adding the new identity to a linked list according to a binary search method, wherein the linked list comprises the child nodes of the parent node.
In one embodiment, the extracting the next character in the information to be searched to perform the character query further includes:
the judging module is used for re-executing the character query on the next character to obtain the position of the next character in the finite automaton if the next character is not a child node of the position of the previous character in the finite automaton.
In a third aspect, the present disclosure also provides a computer device. The computer equipment comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the mass keyword searching method based on the finite automaton when executing the computer program.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a finite automaton-based method for searching for a large number of keywords.
In a fifth aspect, the present disclosure also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of a finite automaton-based method for searching for a large number of keywords.
The mass keyword searching method based on the finite automaton at least has the following beneficial effects:
the embodiments provided by the disclosure convert the characters in a lexicon into identities, store the identities belonging to the same parent node in the same linked list to form a finite automaton, and use binary search, thereby improving search performance, reducing memory usage, and improving query speed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the conventional technologies, the drawings used in the description of the embodiments or the conventional technologies are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of an application environment of a method for searching for a large number of keywords based on a finite automaton in one embodiment;
FIG. 2 is a flowchart illustrating a method for searching for a large number of keywords based on a finite automaton in an embodiment;
FIG. 3 is a block diagram of a finite automaton in one embodiment;
FIG. 4 is a flowchart illustrating a method for searching for a large number of keywords based on a finite automaton in an embodiment;
FIG. 5 is a flowchart illustrating a method for searching for a large number of keywords based on a finite automaton in an embodiment;
FIG. 6 is a block diagram illustrating a massive keyword searching apparatus based on finite automata according to an embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating an exemplary embodiment of a device for searching for a large number of keywords based on a finite automaton;
FIG. 8 is a block diagram illustrating a massive keyword searching apparatus based on finite automata according to an embodiment of the present invention;
FIG. 9 is a diagram showing an internal structure of a computer device in one embodiment;
FIG. 10 is an internal block diagram of a server in one embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in processes, methods, articles, or apparatus that include the recited elements is not excluded. For example, if the terms first, second, etc. are used to denote names, they do not denote any particular order.
The embodiment of the disclosure provides a method for searching a large number of keywords based on a finite automaton, which can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or other network server. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In some embodiments of the present disclosure, as shown in fig. 2, a method for searching a massive keyword based on finite automata is provided, and an example in which the method is applied to the server in fig. 1 to process an identity is described. It is understood that the method can be applied to a server, and can also be applied to a system comprising a terminal and a server, and is realized through the interaction of the terminal and the server. In a specific embodiment, the method may include the steps of:
S20: the following single-character query is performed:
S202: acquiring a single character from the information to be searched, and converting the character into an identity, wherein the character and the identity correspond to each other.
The information to be searched comprises at least one character and can be divided into single characters that are searched one at a time. If one of the characters is not found, the information to be searched does not contain the keyword, so the remaining characters need not be searched, which improves search efficiency. The identity may be a code, a number, or the like. Each character corresponds to exactly one identity, which prevents mismatched queries.
S204: inputting the identity into a deterministic finite automaton, wherein the automaton comprises a multilayer structure, each layer comprises at least one linked list, each linked list is composed of child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order.
FIG. 3 is a block diagram of a finite automaton in one embodiment. A deterministic finite automaton (DFA) is an automaton that implements state transitions: it moves to the next state according to a predetermined transition function. The finite automaton comprises a multilayer structure, which can be constructed as a tree. A linked list is a storage structure that is non-contiguous and non-sequential in physical storage; the logical order of its data elements is given by the order of the pointer links. A linked list overcomes the drawback of an array, whose size must be known in advance, and makes full use of the computer's memory space for flexible dynamic memory management. A parent node has at least one child node directly below it. In the tree structure, a child node is the root of a subtree of the current node.
In some embodiments of the present disclosure, custom linked lists may be constructed. Each linked list may contain the child nodes belonging to the same parent node; the child nodes may be represented by identities and arranged in the linked list in order, which facilitates subsequent queries. The custom linked lists together form a deterministic finite automaton, which may comprise multiple layers of identities, and each layer may comprise multiple linked lists.
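The layered layout described above can be pictured with a minimal Python sketch. The `Node` class, the `layers` helper, and the use of identity 0 for the head node are assumptions made for illustration, and a Python list stands in for each per-parent linked list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    char_id: int  # identity of the character this node represents
    # Children of this node, kept sorted by char_id (the per-parent "linked list").
    children: List["Node"] = field(default_factory=list)


# Head node (identity 0 is a hypothetical placeholder), two layers below it.
head = Node(char_id=0, children=[
    Node(1, children=[Node(2), Node(5)]),
    Node(4),
])


def layers(root):
    """Return, for each layer, the sorted child-identity lists grouped by parent."""
    level, out = [root], []
    while level:
        lists = [[c.char_id for c in n.children] for n in level if n.children]
        if lists:
            out.append(lists)
        level = [c for n in level for c in n.children]
    return out


print(layers(head))
```

Each inner list holds the children of one parent, so a layer with several parents contains several lists.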
S206: querying the identity in a linked list according to a binary search algorithm, wherein the linked list comprises at least one identity.
A linked list may contain multiple identities. If a single character corresponds to the last identity in the linked list, a sequential query would have to proceed from the first identity to the last. In some embodiments of the present disclosure, the identity is instead queried in the linked list using a binary search algorithm: the middle identity is compared first, so that the region containing the identity to be found is located quickly, which improves query efficiency.
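As a rough illustration of this step, the sketch below runs a binary search over one sorted list of child identities. The function name `find_child` is hypothetical. Because binary search relies on indexed access, the sketch assumes the "linked list" is held in an array-backed Python list.

```python
def find_child(child_ids, target):
    """Binary search over a sorted list of identities.

    Returns the index of `target`, or -1 if it is not present.
    """
    lo, hi = 0, len(child_ids) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # compare the middle identity first
        if child_ids[mid] == target:
            return mid
        if child_ids[mid] < target:
            lo = mid + 1              # target lies in the upper half
        else:
            hi = mid - 1              # target lies in the lower half
    return -1


siblings = [2, 5, 9, 14]              # identities of nodes sharing one parent
print(find_child(siblings, 9), find_child(siblings, 3))
```

The number of comparisons grows with the logarithm of the list length rather than with the length itself.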
S208: acquiring the position of the character in the finite automaton.
The finite automaton may comprise a multilayer structure, and determining the position of the queried character in the finite automaton speeds up the query for the next character.
S210: after the query of the single character is finished, taking out the next character in the information to be searched and executing the character query, wherein the starting position of the next character query is the position of the previous character in the finite automaton, until the last character in the information to be searched has been queried, thereby obtaining the keyword in the information to be searched.
In some embodiments of the present disclosure, because each linked list is composed of child nodes belonging to the same parent node, once the query of a single character is finished and the position of the character in the finite automaton is determined, the next query need not start again from the head node. The position of the previous character in the finite automaton can be used directly as the starting position of the next character query, which increases the query speed. If all characters find their corresponding identities in the finite automaton, the keyword in the information to be searched is obtained.
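The idea of continuing from the previous character's position can be sketched as follows. The tuple layout `(sorted child identities, child nodes)` and the names `step` and `follow` are assumptions for illustration; the standard-library `bisect_left` performs the binary search.

```python
from bisect import bisect_left


def step(node, cid):
    """Return the child of `node` whose identity is `cid`, or None."""
    ids, kids = node
    i = bisect_left(ids, cid)
    return kids[i] if i < len(ids) and ids[i] == cid else None


# Hand-built automaton for the identity sequences [1, 2] and [1, 3].
leaf_a = ([], [])
leaf_b = ([], [])
middle = ([2, 3], [leaf_a, leaf_b])
root = ([1], [middle])


def follow(root, id_seq):
    """True if every identity is found along a single branch."""
    node = root
    for cid in id_seq:
        node = step(node, cid)  # start from the previous character's position
        if node is None:
            return False
    return True


print(follow(root, [1, 3]), follow(root, [1, 4]))
```

Each call to `step` searches only the children of the current node, so the head node is consulted once per match attempt.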
In the above mass keyword searching method based on a finite automaton, characters are converted into identities, the identities belonging to the same parent node are stored in the same linked list, and the linked lists together form the finite automaton. Using binary search improves search performance, reduces memory usage, and improves query speed.
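One way such an automaton could be built is sketched below: each keyword, already converted to a sequence of identities, is inserted so that every child list stays sorted. The dictionary layout and the names `new_node` and `add_keyword` are hypothetical, and Python lists again stand in for the linked lists.

```python
from bisect import bisect_left


def new_node():
    # "ids" and "kids" are parallel lists: kids[i] is the node for ids[i].
    return {"ids": [], "kids": [], "end": False}


def add_keyword(root, id_seq):
    """Insert one keyword, given as a sequence of identities."""
    node = root
    for cid in id_seq:
        i = bisect_left(node["ids"], cid)       # binary search for the slot
        if i == len(node["ids"]) or node["ids"][i] != cid:
            node["ids"].insert(i, cid)          # keep the child list sorted
            node["kids"].insert(i, new_node())
        node = node["kids"][i]
    node["end"] = True                          # last character of a keyword


root = new_node()
add_keyword(root, [3, 1])
add_keyword(root, [2, 5])
add_keyword(root, [3, 4])
print(root["ids"], root["kids"][1]["ids"])
```

Keywords that share a prefix share nodes, so the identity 3 is stored once even though two keywords start with it.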
As shown in fig. 4, in some embodiments of the present disclosure, the converting the character into an identity further includes:
S402: querying whether the character has a corresponding identity, and if not, creating an identity for the character, wherein the identities are arranged in order.
The character to be searched may not yet have a corresponding identity. In that case a new identity is created before conversion and added to the set that stores the identities, so that the identity need not be created again when the same character reappears. The identities are arranged in order in the set, which improves query efficiency.
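A minimal sketch of this lookup-or-create step is shown below. The class name `IdTable`, its methods, and the choice of consecutive integers starting at 1 are assumptions for illustration; the text only requires that each character maps to one identity and that identities are ordered.

```python
class IdTable:
    """Hypothetical character-to-identity table."""

    def __init__(self):
        self._ids = {}

    def get_or_create(self, ch):
        # Create the next identity in sequence only if the character is new.
        if ch not in self._ids:
            self._ids[ch] = len(self._ids) + 1
        return self._ids[ch]

    def lookup(self, ch):
        # Return the identity, or None if the character has never been seen.
        return self._ids.get(ch)


table = IdTable()
ids = [table.get_or_create(c) for c in "abca"]
print(ids)
```

The repeated character "a" reuses the identity created on its first appearance.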
In some embodiments of the present disclosure, if a node in the linked list is not a parent node of other child nodes, a node that is not a parent node of other child nodes in the linked list is labeled to obtain labeling information of the node, where the labeling information includes a position of the node in the finite automaton.
The finite automaton may comprise a multilayer structure containing multiple linked lists, each formed by child nodes belonging to the same parent node. If a node on a branch of the finite automaton is not the parent node of any other node, the node is labeled. The labeling may consist of setting a tag or a color. For example, suppose the character corresponding to such a node on one branch is "healthy". After the identity corresponding to "healthy" is found, the query for the next character in the information to be searched no longer uses the position of "healthy" in the finite automaton as its starting position; instead it starts directly from the head node. Not all identities need to be traversed, which saves memory.
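The labeling of childless nodes can be sketched as a traversal that records each such node's position. Representing the position as the path of identities from the head node is an assumption, as are the tuple layout and the name `label_leaves`.

```python
def label_leaves(node, path=()):
    """Yield the position (path of identities) of every node without children."""
    ids, kids = node
    if not ids and path:
        yield path
    for cid, kid in zip(ids, kids):
        yield from label_leaves(kid, path + (cid,))


# Each node is (sorted child identities, child nodes in the same order).
root = ([1, 4], [([2, 3], [([], []), ([], [])]), ([], [])])
labels = list(label_leaves(root))
print(labels)
```

A search that reaches one of these labeled positions knows there is nothing further to match on that branch and can return to the head node.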
As shown in fig. 5, in some embodiments of the present disclosure, the extracting a next character in the information to be searched to perform a character query further includes:
S502: if the next character is not a child node of the position of the previous character in the finite automaton, re-executing the character query on the next character to obtain the position of the next character in the finite automaton.
When a query is executed, the starting position of the next character query is the position of the previous character in the finite automaton. If the next character is not a child node of that position, the previous character and the next character do not together form a keyword, so the query for the next character is executed again from the initial starting position. Not all identities need to be traversed, which saves memory.
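Putting the restart rule into a scanning loop might look like the sketch below. The dictionary layout and the names `make_node`, `add`, `child`, and `scan` are hypothetical, and the sketch assumes the restart begins at the head node and that no keyword is a prefix of another, so the `end` flag coincides with a childless node.

```python
from bisect import bisect_left


def make_node():
    return {"ids": [], "kids": [], "end": False}


def add(root, seq):
    node = root
    for cid in seq:
        i = bisect_left(node["ids"], cid)
        if i == len(node["ids"]) or node["ids"][i] != cid:
            node["ids"].insert(i, cid)
            node["kids"].insert(i, make_node())
        node = node["kids"][i]
    node["end"] = True


def child(node, cid):
    i = bisect_left(node["ids"], cid)
    if i < len(node["ids"]) and node["ids"][i] == cid:
        return node["kids"][i]
    return None


def scan(root, text_ids):
    """Return (start, end) index pairs of keywords found in `text_ids`."""
    hits, node, start = [], root, 0
    for pos, cid in enumerate(text_ids):
        nxt = child(node, cid)
        if nxt is None and node is not root:
            # Not a child of the previous position: query again from the head.
            start = pos
            nxt = child(root, cid)
        if nxt is None:
            node, start = root, pos + 1
            continue
        node = nxt
        if node["end"]:
            hits.append((start, pos))
            node, start = root, pos + 1  # labeled node: next query starts at head
    return hits


root = make_node()
add(root, [1, 2])
add(root, [3])
print(scan(root, [1, 1, 2, 4, 3]))
```

This simple restart re-queries only the current character, so, unlike the failure links of an Aho-Corasick automaton, it can miss a keyword that overlaps an abandoned partial match.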
It should be understood that, although the steps in the flowcharts related to the embodiments as described above are sequentially displayed as indicated by arrows, the steps are not necessarily performed sequentially as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the execution order of the steps or stages is not necessarily sequential, but may be rotated or alternated with other steps or at least a part of the steps or stages in other steps.
Based on the same inventive concept, the embodiment of the disclosure also provides a device for searching the mass keywords of the finite automaton, which is used for realizing the method for searching the mass keywords of the finite automaton. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so that specific limitations in the following embodiment of the device for searching for the mass keywords of the finite automaton can be referred to the limitations of the method for searching for the mass keywords of the finite automaton, and are not described herein again.
The apparatus may include systems (including distributed systems), software (applications), modules, components, servers, clients, etc. that use the methods described in embodiments of the present specification in conjunction with any necessary apparatus to implement the hardware. Based on the same innovative concept, the embodiments of the present disclosure provide an apparatus in one or more embodiments as described in the following embodiments. Since the implementation scheme of the apparatus for solving the problem is similar to that of the method, the specific implementation of the apparatus in the embodiment of the present specification may refer to the implementation of the foregoing method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
In one embodiment, as shown in fig. 6, a massive keyword search apparatus of a finite automaton is provided, and the apparatus may be the aforementioned server, or a module, component, device, unit, etc. integrated in the server. The apparatus may include:
the following single character query 60 is performed:
a conversion module 602, configured to obtain a single character in information to be searched, and convert the character into an identity, where the character and the identity correspond to each other;
a query module 604, configured to input the identity into a finite automaton, where the finite automaton includes a multilayer structure, each layer includes at least one linked list, each linked list is formed by child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order; and to query the identity in a linked list according to a binary search algorithm, where the linked list includes at least one identity;
a confirmation module 606, configured to obtain the position of the character in the finite automaton;
and a keyword acquisition module 608, configured to take out the next character in the information to be searched and execute the character query after the query of the single character is completed, where the starting position of the next character query is the position of the previous character in the finite automaton, until the last character in the information to be searched has been queried, thereby obtaining the keyword in the information to be searched.
As shown in fig. 7, in an embodiment, the converting the character into an identity further includes:
a creating module 702, configured to query whether the character has a corresponding identity, and if not, create an identity for the character, where the identities are arranged in order.
In one embodiment, if a node in the linked list is not a parent node of other child nodes, the node which is not a parent node of other child nodes in the linked list is labeled to obtain labeling information of the node, wherein the labeling information includes the position of the node in the finite automaton.
In one embodiment, the obtaining the position of the character in the finite automaton further comprises:
judging whether a new identity exists, and if so, finding the parent node of the new identity and adding the new identity to a linked list according to a binary search method, wherein the linked list comprises the child nodes of the parent node.
As shown in fig. 8, in an embodiment, the extracting a next character in the information to be searched to perform a character query further includes:
a determining module 802, configured to, if the next character is not a child node of the position of the previous character in the finite automaton, re-execute the character query on the next character to obtain the position of the next character in the finite automaton.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The modules in the device for searching for massive keywords based on the finite automaton can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server; its internal structure may be as shown in Fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores the identities. The network interface communicates with an external terminal over a network connection. When executed by the processor, the computer program implements the finite-automaton-based mass keyword searching method.
In one embodiment, a computer device is provided, which may be a terminal; its internal structure may be as shown in Fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. When executed by the processor, the computer program implements the finite-automaton-based mass keyword searching method. The display screen may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen; a key, trackball, or touchpad arranged on the housing of the computer device; or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the configurations shown in Figs. 9 and 10 are merely block diagrams of the parts of the structure relevant to the present disclosure and do not limit the computer devices to which the present disclosure may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the method of any of the embodiments of the present disclosure.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of any of the embodiments of the present disclosure.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, databases, or other media used in the embodiments provided by the present disclosure may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided by the present disclosure may include at least one of relational and non-relational databases. A non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided by the present disclosure may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present disclosure, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present disclosure. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the disclosure, and all such changes and modifications fall within the scope of the disclosure. Therefore, the protection scope of the present disclosure is defined by the appended claims.

Claims (13)

1. A mass keyword searching method based on a finite automaton, characterized by comprising:
performing the following single-character query:
acquiring a single character from the information to be searched and converting the character into an identity, wherein the character corresponds to the identity;
inputting the identity into a deterministic finite automaton, wherein the deterministic finite automaton comprises a multilayer structure, each layer comprises at least one linked list, each linked list consists of the child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order;
querying the identity in a linked list according to a binary search algorithm, wherein the linked list comprises at least one identity;
acquiring the position of the character in the finite automaton;
and after the query of the single character is finished, taking the next character from the information to be searched and executing the character query on it, wherein the starting position of the next character query is the position of the previous character in the finite automaton, until the query of the last character in the information to be searched is finished, thereby obtaining the keywords in the information to be searched.
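As an illustration only, and not as part of the claims, the single-character query of claim 1 might be sketched in Python as follows. The names `Node` and `query_char` are hypothetical. The sketch also assumes that the ordered child "linked list" is held in an indexable array, since binary search needs random access; the claim itself does not specify the storage layout.

```python
from bisect import bisect_left


class Node:
    """One automaton position; its children are kept sorted by identity."""

    def __init__(self):
        self.child_ids = []   # sorted identities of the child nodes
        self.children = []    # child nodes, parallel to child_ids


def query_char(node, identity):
    """Binary-search `identity` among the children of `node`.

    Returns the matching child (the new position in the automaton),
    or None when `identity` is not a child of `node`.
    """
    i = bisect_left(node.child_ids, identity)
    if i < len(node.child_ids) and node.child_ids[i] == identity:
        return node.children[i]
    return None
```

Under these assumptions each lookup costs O(log k) comparisons for a node with k children, and the node returned becomes the starting position for the next character.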
2. The method of claim 1, wherein converting the character into an identity further comprises:
querying whether the character has a corresponding identity and, if not, creating an identity for the character, the identities being arranged in order.
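One possible reading of this step is sketched below, for illustration only. The class name `IdentityTable` and the choice of consecutive integers starting at 1, assigned in the order characters are first seen, are assumptions; the claim only requires that each character has an identity and that the identities are ordered.

```python
class IdentityTable:
    """Maps each character to an integer identity, creating one on demand."""

    def __init__(self):
        self._ids = {}  # character -> identity

    def get_or_create(self, ch):
        """Return the identity of `ch`, assigning the next integer if new."""
        ident = self._ids.get(ch)
        if ident is None:
            # Assumption: identities are consecutive integers starting at 1.
            ident = len(self._ids) + 1
            self._ids[ch] = ident
        return ident
```

Because the identities are plain integers, they can be compared directly when a sorted child list is searched.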
3. The method of claim 1, wherein if a node in the linked list is not a parent node of any child node, the node is labeled to obtain labeling information of the node, the labeling information including the position of the node in the finite automaton.
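The labeling of childless nodes could be sketched as follows. This is a hypothetical illustration: the claim does not say how a node's "position" is encoded, so the sketch assumes it is the tuple of identities on the path from the root, and the names `Node` and `label_leaves` are invented for the example.

```python
class Node:
    """Automaton node with children sorted by identity."""

    def __init__(self):
        self.child_ids = []   # sorted identities of the child nodes
        self.children = []    # child nodes, parallel to child_ids
        self.label = None     # set only on nodes that have no children


def label_leaves(node, path=()):
    """Label every childless node with its path of identities from the root."""
    if not node.children:
        node.label = path     # assumption: position is recorded as the path
        return
    for ident, child in zip(node.child_ids, node.children):
        label_leaves(child, path + (ident,))
```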
4. The method of claim 1, wherein acquiring the position of the character in the finite automaton further comprises:
determining whether a new identity exists and, if so, finding the parent node of the new identity and adding the new identity to a linked list according to a binary search method, wherein the linked list comprises the child nodes of the parent node.
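A minimal sketch of such an ordered insertion is given below, assuming the children of a parent are held in two parallel sorted arrays. The helper name `add_child` is hypothetical, and the array layout stands in for the claimed linked list.

```python
from bisect import bisect_left


class Node:
    """Automaton node with children sorted by identity."""

    def __init__(self):
        self.child_ids = []   # sorted identities of the child nodes
        self.children = []    # child nodes, parallel to child_ids


def add_child(parent, identity):
    """Insert `identity` under `parent` at its sorted position.

    Returns the existing child if the identity is already present,
    otherwise the newly created child node.
    """
    i = bisect_left(parent.child_ids, identity)   # binary search for the slot
    if i < len(parent.child_ids) and parent.child_ids[i] == identity:
        return parent.children[i]
    child = Node()
    parent.child_ids.insert(i, identity)
    parent.children.insert(i, child)
    return child
```

Inserting at the index found by the binary search keeps the child list sorted, so later queries can keep using binary search without re-sorting.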
5. The method of claim 1, wherein, after taking the next character from the information to be searched and executing the character query, the method further comprises:
if the next character is not a child node of the position of the previous character in the finite automaton, re-executing the character query on the next character to obtain the position of the next character in the finite automaton.
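Putting the steps together, a full scan with this restart rule might look like the sketch below. It is illustrative only and rests on several assumptions: "re-executing the character query" is read as retrying the lookup from the root, a character's Unicode code point stands in for its identity, and the names `Node`, `find_child`, `add_keyword`, and `search` are hypothetical.

```python
from bisect import bisect_left


class Node:
    def __init__(self):
        self.child_ids = []   # sorted identities of the child nodes
        self.children = []    # child nodes, parallel to child_ids
        self.keyword = None   # set when a keyword ends at this node


def find_child(node, identity):
    """Binary-search `identity` among the children of `node`."""
    i = bisect_left(node.child_ids, identity)
    if i < len(node.child_ids) and node.child_ids[i] == identity:
        return node.children[i]
    return None


def add_keyword(root, word):
    """Insert `word`, keeping every child list sorted by identity."""
    node = root
    for ch in word:
        ident = ord(ch)   # assumption: code point used as the identity
        i = bisect_left(node.child_ids, ident)
        if i == len(node.child_ids) or node.child_ids[i] != ident:
            node.child_ids.insert(i, ident)
            node.children.insert(i, Node())
        node = node.children[i]
    node.keyword = word


def search(root, text):
    """Scan `text` once and return the keywords found, in order."""
    found = []
    node = root
    for ch in text:
        ident = ord(ch)
        nxt = find_child(node, ident)
        if nxt is None and node is not root:
            # Not a child of the previous position: retry from the root.
            nxt = find_child(root, ident)
        node = nxt if nxt is not None else root
        if node.keyword is not None:
            found.append(node.keyword)
    return found
```

For example, with the keywords "abc" and "bd" loaded through `add_keyword`, `search(root, "xabcbd")` returns both keywords. Note that a plain restart from the root can miss a keyword that begins inside a partially matched prefix; the claim does not state how such overlaps are handled.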
6. A mass keyword searching device based on a finite automaton, characterized by comprising:
the following single-character query is performed:
a conversion module, configured to acquire a single character from the information to be searched and convert the character into an identity, wherein the character corresponds to the identity;
a query module, configured to input the identity into the finite automaton, wherein the finite automaton comprises a multilayer structure, each layer comprises at least one linked list, each linked list consists of the child nodes belonging to the same parent node, the child nodes are represented by identities, and the identities are arranged in order; and to query the identity in a linked list according to a binary search algorithm, wherein the linked list comprises at least one identity;
a confirmation module, configured to acquire the position of the character in the finite automaton;
and a keyword acquisition module, configured to take the next character from the information to be searched and execute the character query on it after the query of the single character is finished, wherein the starting position of the next character query is the position of the previous character in the finite automaton, until the query of the last character in the information to be searched is finished, thereby obtaining the keywords in the information to be searched.
7. The device of claim 6, wherein converting the character into an identity further comprises:
a creating module, configured to query whether the character has a corresponding identity and, if not, to create an identity for the character, the identities being arranged in order.
8. The device of claim 6, wherein if a node in the linked list is not a parent node of any child node, the node is labeled to obtain labeling information of the node, the labeling information including the position of the node in the finite automaton.
9. The device of claim 6, wherein acquiring the position of the character in the finite automaton further comprises:
determining whether a new identity exists and, if so, finding the parent node of the new identity and adding the new identity to a linked list according to a binary search method, wherein the linked list comprises the child nodes of the parent node.
10. The device of claim 6, wherein, after taking the next character from the information to be searched and executing the character query, the device further comprises:
a judging module, configured to re-execute the character query on the next character to obtain the position of the next character in the finite automaton if the next character is not a child node of the position of the previous character in the finite automaton.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 5.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
13. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 5 when executed by a processor.
CN202211370688.3A 2022-11-03 2022-11-03 Mass keyword searching method based on finite automaton Pending CN115687560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211370688.3A CN115687560A (en) 2022-11-03 2022-11-03 Mass keyword searching method based on finite automaton

Publications (1)

Publication Number Publication Date
CN115687560A true CN115687560A (en) 2023-02-03

Family

ID=85047708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211370688.3A Pending CN115687560A (en) 2022-11-03 2022-11-03 Mass keyword searching method based on finite automaton

Country Status (1)

Country Link
CN (1) CN115687560A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008136A (en) * 2014-05-07 2014-08-27 中国科学院信息工程研究所 Method and device for text searching
CN110222143A (en) * 2019-05-31 2019-09-10 北京小米移动软件有限公司 Character string matching method, device, storage medium and electronic equipment
CN113157904A (en) * 2021-03-30 2021-07-23 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm
CN113555069A (en) * 2021-07-22 2021-10-26 杭州叙简科技股份有限公司 Chemical name retrieval and extraction method and device based on AC automaton
CN115204170A (en) * 2022-07-20 2022-10-18 上海亘岩网络科技有限公司 Sensitive word filtering method and related equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303405A (en) * 2023-05-12 2023-06-23 深圳竹云科技股份有限公司 Data duplicate checking method and device and computer equipment
CN116303405B (en) * 2023-05-12 2023-11-10 深圳竹云科技股份有限公司 Data duplicate checking method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination