CN108021569A - Construction of an AC automaton, Chinese multi-pattern matching method, and related apparatus - Google Patents
- Publication number
- CN108021569A (application CN201610943520.5A)
- Authority
- CN
- China
- Prior art keywords
- node
- layer
- direct descendent
- values
- chinese character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and apparatus for constructing an AC automaton, the AC automaton itself, and a Chinese multi-pattern matching method and apparatus. From all the Chinese characters contained in the acquired pattern keywords, a layered node set composed of those characters is determined. For the nodes in each layer of the node set, a sorting strategy is applied: nodes are processed in descending order of the number of direct child nodes they contain, so that the base value of the node with the most direct child nodes, and the check values of those child nodes, are determined first. This significantly reduces conflicts when searching for base values, keeps the arrays from growing too quickly, reduces data sparsity, and improves memory usage. A failure pointer is then established for each node of the resulting double-array trie to generate the AC automaton, which greatly improves space utilization.
Description
Technical Field
The present invention relates to the field of information technology, and in particular to a method and apparatus for constructing an AC automaton, an AC automaton, and a Chinese multi-pattern matching method and apparatus.
Background
The Aho-Corasick algorithm, devised at Bell Labs in 1975, is one of the best-known multi-pattern matching algorithms. Its underlying data structure is the Aho-Corasick automaton (AC automaton for short). By exploiting the common prefixes of strings, it shortens query time and minimizes unnecessary string comparisons. It is therefore widely used for counting, sorting, and storing large numbers of strings, and search engines frequently use it for word-frequency statistics over text.
In general, an AC automaton relies on a dictionary tree with a tree structure, also known as a word search tree (trie), which is a variant of the hash tree. The basic properties of a dictionary tree are usually as follows: the root node contains no character, and every node other than the root contains exactly one character; concatenating the characters along the path from the root to a given node yields the string corresponding to that node; and the characters contained in the child nodes of any one node are all different.
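The three properties above can be illustrated with a minimal sketch. The names `TrieNode`, `build_trie`, and `contains` are hypothetical and chosen only for this illustration; the sketch stores children in a plain dictionary and is not the double-array structure discussed later.

```python
class TrieNode:
    """One node of a dictionary tree; the root holds no character."""

    def __init__(self):
        self.children = {}   # character -> TrieNode, so sibling characters are distinct
        self.is_end = False  # True if a keyword ends at this node


def build_trie(keywords):
    """Insert every keyword; shared prefixes reuse the same nodes."""
    root = TrieNode()
    for word in keywords:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def contains(root, word):
    """Follow the path spelled by `word` from the root."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


root = build_trie(["he", "her", "his"])
print(contains(root, "her"), contains(root, "h"))
```

Because "he", "her", and "his" share the prefix "h", the root in this example has a single child.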
However, the biggest problem with a traditional dictionary tree is its large space consumption. If the dictionary tree has n first-layer nodes and each first-layer node may contain m possible characters, the occupied space is n × m. For Chinese, the total number of Chinese characters currently reaches about 80,000, so once a dictionary tree exceeds 1,000 nodes, the memory of an existing computer may be insufficient. Therefore, if an AC automaton built on such a traditional dictionary tree is to be used for Chinese multi-pattern matching, the space consumption of the dictionary tree must be optimized so that the computer can run normally and efficiently.
To address the space consumption problem, existing technical solutions take several approaches. The first trades time for space: the array that stores all child nodes of a node is replaced with a linked list. Although this improves space utilization, it lowers the search efficiency of the AC automaton during Chinese multi-pattern matching. In the worst case, finding a child node, which previously took a single comparison, now takes up to 80,000 comparisons.
The second approach uses word frequency to optimize space: child nodes are stored in arrays for frequently used pattern keywords and in linked lists for rarely used ones. Although this optimizes space overall, it still cannot reduce space consumption when the number of high-frequency pattern keywords is itself large, and it also sacrifices query performance for the non-high-frequency pattern keywords.
The third approach builds the dictionary tree from Chinese pinyin instead of characters. Although this saves space, it introduces other problems. A single pinyin string may correspond to several Chinese words; for example, "tianliang" may correspond to words meaning "daybreak", "Tian Liang", "staggering amount", "it is cool", and "conscience". This violates the rule that the failure pointer of a node in an AC automaton can point to only one node, so the AC automaton's ability to match all pattern keywords in a single traversal can no longer be used.
In addition, the prior art can replace the traditional dictionary tree with a double-array trie. A double-array trie is another representation of the traditional dictionary tree structure: it uses two arrays, the offset (base) array and the relation (check) array, to represent the hierarchy of the dictionary tree, which saves considerable space and memory and lowers space complexity. However, a double-array trie is built by traversing the nodes layer by layer. If, within a layer, an earlier node contains few child nodes and a later node contains many, conflicts are highly likely when the later node is processed, and this in turn affects the space consumed during Chinese multi-pattern matching by an AC automaton built on that double-array trie. For example, as shown in Table 1, when the node "angstrom" is processed first, its direct child node "and" is placed at the state whose subscript is 7. When the node "Ah" is processed afterwards, the positions of its direct child nodes "root", "glue", and "drawing" may coincide with the position of "and", which produces a conflict. The direct child nodes "root", "glue", and "drawing" can then only be placed at states whose subscripts are greater than 7. A double-array trie built this way is sparse, which seriously reduces space utilization.
Table 1
In view of this, how to optimize array space and improve space utilization when building an AC automaton on top of a double-array trie is a technical problem that those skilled in the art urgently need to solve.
Summary of the Invention
Embodiments of the present invention provide a method and apparatus for constructing an AC automaton, an AC automaton, and a Chinese multi-pattern matching method and apparatus, in order to optimize array space and improve space utilization when an AC automaton is built on top of a double-array trie.
An embodiment of the present invention provides a method for constructing a multi-pattern matching AC automaton, including:
determining, from all the Chinese characters contained in the acquired pattern keywords, a layered node set composed of those Chinese characters;
for the nodes contained in each layer of the determined node set, determining the offset (base) values and relation (check) values corresponding to the nodes of that layer, processing the nodes within each layer in descending order of the number of direct child nodes they contain, and establishing a double-array trie composed of the base values and check values;
establishing a failure pointer for each node in the double-array trie to generate the AC automaton.
In a possible implementation, in the above method for constructing a multi-pattern matching AC automaton provided by the embodiment of the present invention, determining the base values and check values corresponding to the nodes of each layer, in descending order of the number of direct child nodes the nodes contain, specifically includes:
numbering all the Chinese characters contained in the pattern keywords according to a specified encoding;
building, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length;
performing the following process on the nodes of each layer of the node set, in order from the first-layer nodes connected to the root node down to the leaf nodes:
determining whether any node in the layer has a direct child node;
when it is determined that some node in the layer has a direct child node, then for the nodes in that layer that have direct child nodes, in descending order of the number of direct child nodes they contain, determining the base value corresponding to each node and the check value corresponding to each of its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes;
when it is determined that no node in the layer has a direct child node, setting the base value corresponding to each node located at the end of a pattern keyword to a negative value.
In a possible implementation, in the above method for constructing a multi-pattern matching AC automaton provided by the embodiment of the present invention, determining the base value corresponding to each node and the check values corresponding to its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes specifically includes:
determining the subscript corresponding to each node from its Chinese character number;
determining, from the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to the node and the check values corresponding to its direct child nodes.
In a possible implementation, in the above method for constructing a multi-pattern matching AC automaton provided by the embodiment of the present invention, determining the subscript corresponding to each node from its Chinese character number specifically includes:
for each node among the first-layer nodes, taking the Chinese character number of the node as its subscript;
for the direct child nodes in other layers that belong to the same node, determining that the subscript i of each direct child node and its Chinese character number a_n satisfy:
i = k + a_n
where k is the smallest positive integer satisfying:
base[k + a_1] = base[k + a_2] = ... = base[k + a_j] = check[k + a_1] = check[k + a_2] = ... = check[k + a_j] = initial value
Here j is the number of direct child nodes, and n = 1, 2, ..., j indexes them.
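The search for k can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the initial value is assumed to be 0 (`FREE`), slots beyond the current array length are treated as free, and the names `is_free` and `find_offset` are hypothetical.

```python
FREE = 0  # assumed initial value of both arrays


def is_free(base, check, i):
    """A slot is free when base[i] and check[i] both hold the initial value."""
    return i >= len(base) or (base[i] == FREE and check[i] == FREE)


def find_offset(base, check, child_codes):
    """Return the smallest positive k such that every slot k + a_n is free."""
    k = 1
    while not all(is_free(base, check, k + a) for a in child_codes):
        k += 1
    return k


base = [FREE] * 10
check = [FREE] * 10
check[3] = 1  # slot 3 already belongs to some other child node
print(find_offset(base, check, [1, 2]))  # k=1 and k=2 both hit slot 3, so k=3
```

The more slots are already occupied by small sibling groups, the further k has to advance for a large group, which is the conflict that the ordering by child count is meant to reduce.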
In a possible implementation, in the above method for constructing a multi-pattern matching AC automaton provided by the embodiment of the present invention, determining, from the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to the node and the check values corresponding to its direct child nodes specifically includes:
determining that the base value corresponding to each node is the value k;
determining that the check value corresponding to each direct child node of a node is the subscript of that node.
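Given these two rules, moving from a node to one of its direct child nodes reduces to one addition and one comparison. The sketch below is illustrative only: the function name `step`, the use of `abs()` to read through the negative end-of-keyword marking, and the small hand-written arrays are assumptions made for this example.

```python
def step(base, check, state, code):
    """Return the child subscript reached from `state` via character number `code`,
    or None when `state` has no such direct child node."""
    target = abs(base[state]) + code  # abs() reads through an end-of-keyword marking
    if target < len(check) and check[target] == state:
        return target
    return None


# Hand-written arrays for the keywords "ab", "ac", "ad", "bc" with the assumed
# character numbers a=1, b=2, c=3, d=4; -1 in check marks a first-layer node.
base = [0, 1, 3, -3, -4, -5, -6]
check = [0, -1, -1, 1, 1, 1, 2]

print(step(base, check, 1, 2))  # node "a" followed by "b" -> subscript 3
print(step(base, check, 2, 2))  # node "b" has no child "b" -> None
```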
In a possible implementation, in the above method for constructing a multi-pattern matching AC automaton provided by the embodiment of the present invention, setting the base value corresponding to each node located at the end of a pattern keyword to a negative value specifically includes:
for a node at the end of a pattern keyword whose base value is still the initial value, changing the base value to the negative of the node's subscript;
for a node at the end of a pattern keyword whose base value is not the initial value, changing the base value to the negative of that base value.
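The two cases can be written as a short sketch. The initial value is assumed to be 0, and the function name `mark_keyword_end` is hypothetical.

```python
FREE = 0  # assumed initial value of the base array


def mark_keyword_end(base, index):
    """Mark the node at subscript `index` as the end of a pattern keyword."""
    if base[index] == FREE:
        base[index] = -index        # node without children: negative of its subscript
    else:
        base[index] = -base[index]  # node with children: negative of its base value


base = [0, 0, 5, 0]
mark_keyword_end(base, 1)  # base value still initial
mark_keyword_end(base, 2)  # base value already assigned
print(base)
```

Either way the sign alone tells the matcher that a keyword ends at this node, while the absolute value of an assigned base value remains available as the offset to its child nodes.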
In a possible implementation, in the above method for constructing a multi-pattern matching AC automaton provided by the embodiment of the present invention, numbering all the Chinese characters contained in the pattern keywords according to a specified encoding specifically includes:
numbering the Chinese characters contained in the pattern keywords using one of the following: the Chinese character coded character set for information interchange (GB2312), the Chinese Internal Code Specification (GBK), Big5 (BIG5), or the 8-bit Unicode Transformation Format (UTF-8).
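One way to derive such a number is sketched below. It assumes that the number is simply the character's encoded bytes read as a big-endian integer; the source does not specify how a code is turned into a number, and the function name `char_number` is hypothetical.

```python
def char_number(ch, encoding="gb2312"):
    """Number a character by reading its encoded bytes as one big-endian integer.

    This mapping is an assumption for illustration only.
    """
    return int.from_bytes(ch.encode(encoding), "big")


print(hex(char_number("中", "gb2312")))  # two-byte GB2312 code
print(hex(char_number("中", "utf-8")))   # three-byte UTF-8 code
```

Whichever encoding is chosen, the same encoding has to be used both when building the automaton and when scanning text, since the numbers feed directly into the subscript calculation.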
An embodiment of the present invention further provides an AC automaton, built using the above method for constructing an AC automaton provided by the embodiment of the present invention.
An embodiment of the present invention further provides a Chinese multi-pattern matching method, including:
scanning the text to be processed from beginning to end using the above AC automaton provided by the embodiment of the present invention, and counting the number of times each pattern keyword occurs in the text to be processed.
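The single-pass scan and count can be sketched with a textbook Aho-Corasick automaton. For brevity this sketch stores transitions in dictionaries rather than in the double-array trie described in this document, so it only illustrates the role of the failure pointers; the names `build_automaton` and `count_matches` are hypothetical.

```python
from collections import Counter, deque


def build_automaton(keywords):
    """Build goto transitions, failure pointers, and per-state keyword outputs."""
    goto, fail, output = [{}], [0], [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    queue = deque(goto[0].values())  # first-layer nodes fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def count_matches(text, keywords):
    """Scan `text` once and count every occurrence of every keyword."""
    goto, fail, output = build_automaton(keywords)
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(output[state])
    return counts


print(count_matches("ushers", ["he", "she", "his", "hers"]))
```

Each node has exactly one failure pointer, which is why the text needs to be traversed only once regardless of how many pattern keywords there are.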
An embodiment of the present invention further provides an apparatus for constructing an AC automaton, including:
a determining module, configured to determine, from all the Chinese characters contained in the acquired pattern keywords, a layered node set composed of those Chinese characters;
an establishing module, configured to determine, for the nodes contained in each layer of the determined node set, the offset (base) values and relation (check) values corresponding to the nodes of that layer, processing the nodes within each layer in descending order of the number of direct child nodes they contain, and to establish a double-array trie composed of the base values and check values;
a generating module, configured to establish a failure pointer for each node in the double-array trie to generate the AC automaton.
In a possible implementation, in the above apparatus for constructing an AC automaton provided by the embodiment of the present invention, the establishing module is specifically configured to: number all the Chinese characters contained in the pattern keywords according to a specified encoding; build, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length; and perform the following process on the nodes of each layer of the node set, in order from the first-layer nodes connected to the root node down to the leaf nodes: determine whether any node in the layer has a direct child node; when it is determined that some node in the layer has a direct child node, then for the nodes in that layer that have direct child nodes, in descending order of the number of direct child nodes they contain, determine the base value corresponding to each node and the check values corresponding to its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes; and when it is determined that no node in the layer has a direct child node, set the base value corresponding to each node located at the end of a pattern keyword to a negative value.
In a possible implementation, in the above apparatus for constructing an AC automaton provided by the embodiment of the present invention, the establishing module is specifically configured to determine the subscript corresponding to each node from its Chinese character number, and to determine, from the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to the node and the check values corresponding to its direct child nodes.
In a possible implementation, in the above apparatus for constructing an AC automaton provided by the embodiment of the present invention, the establishing module is specifically configured to: for each node among the first-layer nodes, take the Chinese character number of the node as its subscript; and for the direct child nodes in other layers that belong to the same node, determine that the subscript i of each direct child node and its Chinese character number a_n satisfy:
i = k + a_n
where k is the smallest positive integer satisfying:
base[k + a_1] = base[k + a_2] = ... = base[k + a_j] = check[k + a_1] = check[k + a_2] = ... = check[k + a_j] = initial value
Here j is the number of direct child nodes, and n = 1, 2, ..., j indexes them.
In a possible implementation, in the above apparatus for constructing an AC automaton provided by the embodiment of the present invention, the establishing module is specifically configured to determine that the base value corresponding to each node is the value k, and that the check value corresponding to each direct child node of a node is the subscript of that node.
In a possible implementation, in the above apparatus for constructing an AC automaton provided by the embodiment of the present invention, the establishing module is specifically configured to: for a node at the end of a pattern keyword whose base value is still the initial value, change the base value to the negative of the node's subscript; and for a node at the end of a pattern keyword whose base value is not the initial value, change the base value to the negative of that base value.
In a possible implementation, in the above apparatus for constructing an AC automaton provided by the embodiment of the present invention, the establishing module is specifically configured to number all the Chinese characters contained in the pattern keywords using one of the following: the Chinese character coded character set for information interchange (GB2312), the Chinese Internal Code Specification (GBK), Big5 (BIG5), or the 8-bit Unicode Transformation Format (UTF-8).
An embodiment of the present invention further provides a Chinese multi-pattern matching apparatus, including:
a scanning module, configured to scan the text to be processed from beginning to end using the above AC automaton provided by the embodiment of the present invention;
a counting module, configured to count the number of times each pattern keyword occurs in the text to be processed.
The beneficial effects of the present invention are as follows:
In the method and apparatus for constructing an AC automaton, the AC automaton, and the Chinese multi-pattern matching method and apparatus provided by the embodiments of the present invention, a layered node set composed of all the Chinese characters contained in the acquired pattern keywords is determined; for the nodes contained in each layer of the determined node set, the offset (base) values and relation (check) values corresponding to the nodes of that layer are determined in descending order of the number of direct child nodes the nodes contain, and a double-array trie composed of the base values and check values is established; and a failure pointer is established for each node in the double-array trie to generate the AC automaton. Thus, during construction of the AC automaton, first, when the double-array trie composed of base values and check values is established, a sorting strategy is applied: within each layer, nodes are processed in descending order of the number of direct child nodes they contain, so that the base value of the node with the most direct child nodes, and the check values of those child nodes, are determined first. This significantly reduces conflicts when searching for base values, keeps the arrays from growing too quickly, reduces data sparsity, and improves memory usage. Second, a failure pointer is established for each node of the resulting double-array trie to generate the AC automaton, which greatly improves space utilization.
Brief Description of the Drawings
Fig. 1 is a first flowchart of a method for constructing an AC automaton provided in an embodiment of the present invention;
Fig. 2 is a second flowchart of a method for constructing an AC automaton provided in an embodiment of the present invention;
Fig. 3 is a flowchart of a Chinese multi-pattern matching method provided in an embodiment of the present invention;
Fig. 4 is a flowchart of Embodiment 1 provided in an embodiment of the present invention;
Fig. 5 is a structural diagram of the node set in Embodiment 1 provided in an embodiment of the present invention;
Figs. 6a and 6b are structural diagrams of the simple dictionary trees in Embodiment 2 provided in an embodiment of the present invention;
Fig. 7 is a flowchart of Embodiment 2 provided in an embodiment of the present invention;
Fig. 8 is a flowchart of Embodiment 3 provided in an embodiment of the present invention;
Figs. 9a to 9f are schematic diagrams of the Chinese multi-pattern matching paths in Embodiment 3 provided in an embodiment of the present invention;
Fig. 10 is a structural diagram of an apparatus for constructing an AC automaton provided in an embodiment of the present invention;
Fig. 11 is a structural diagram of a Chinese multi-pattern matching apparatus provided in an embodiment of the present invention.
Detailed Description
Specific implementations of the method and apparatus for constructing an AC automaton, the AC automaton, and the Chinese multi-pattern matching method and apparatus provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that a double-array trie consists only of the base values and check values. The subscripts and states that appear in the description below are used only to illustrate clearly how the double-array trie is built; they do not exist in the actual double-array trie. Likewise, in an actual implementation of the AC automaton the node set exists in array form; the tree representation of the node set in the structural diagrams below is used only to make the construction process of the AC automaton easier to understand.
An embodiment of the present invention provides a method for constructing an AC automaton which, as shown in Fig. 1, may include the following steps:
S101: determine, from all the Chinese characters contained in the acquired pattern keywords, a layered node set composed of those Chinese characters;
S102: for the nodes contained in each layer of the determined node set, determine the offset (base) values and relation (check) values corresponding to the nodes of that layer, processing the nodes within each layer in descending order of the number of direct child nodes they contain, and establish a double-array trie composed of the base values and check values;
S103: establish a failure pointer for each node in the double-array trie to generate the AC automaton.
In the method for constructing an AC automaton provided by the embodiment of the present invention, during construction, first, when the double-array trie composed of base values and check values is established, a sorting strategy is applied: within each layer, nodes are processed in descending order of the number of direct child nodes they contain, so that the base value of the node with the most direct child nodes, and the check values of those child nodes, are determined first. This significantly reduces conflicts when searching for base values, keeps the arrays from growing too quickly, reduces data sparsity, and improves memory usage. Second, a failure pointer is established for each node of the resulting double-array trie to generate the AC automaton, which greatly improves space utilization.
In a specific implementation, in order to establish the double-array trie composed of the base values and check values, step S102 of the above method for constructing an AC automaton provided by the embodiment of the present invention, in which the base values and check values corresponding to the nodes of each layer are determined in descending order of the number of direct child nodes the nodes contain, may, as shown in Fig. 2, specifically include the following steps:
S201: number all the Chinese characters contained in the pattern keywords according to a specified encoding;
S202: build, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length;
S203: in order from the first-layer nodes connected to the root node down to the leaf nodes, determine for each layer of the node set whether any node in the layer has a direct child node; if so, perform step S204; if not, perform step S205;
S204: for the nodes in the layer that have direct child nodes, in descending order of the number of direct child nodes they contain, determine the base value corresponding to each node and the check values corresponding to its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes; once the base values of the nodes in the layer and the check values of their direct child nodes have been determined, return to step S203;
S205: set the base value corresponding to each node located at the end of a pattern keyword to a negative value.
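Steps S201 to S205 can be tied together in a compact sketch. It is an illustration under several assumptions that are not stated in the source: the initial value is 0, the character numbers are supplied as a small dictionary `codes`, first-layer slots are reserved with a sentinel check value of -1 so that they are not mistaken for free slots, and the array length `size` is assumed large enough. The name `build_double_array` is hypothetical.

```python
from collections import defaultdict

FREE = 0        # assumed initial value of both arrays
ROOT_MARK = -1  # hypothetical sentinel that reserves first-layer slots


def build_double_array(keywords, codes, size=64):
    """Build base/check arrays layer by layer, largest sibling groups first."""
    base, check = [FREE] * size, [FREE] * size             # S202
    children = defaultdict(set)                            # prefix -> next characters
    for word in keywords:
        for i, ch in enumerate(word):
            children[word[:i]].add(ch)
    index = {}                                             # prefix -> subscript
    for ch in children[""]:
        index[ch] = codes[ch]                              # first layer: subscript = number
        check[codes[ch]] = ROOT_MARK
    layer = sorted(index)
    while layer:                                           # S203: one pass per layer
        parents = [p for p in layer if children[p]]
        parents.sort(key=lambda p: len(children[p]), reverse=True)
        next_layer = []
        for p in parents:                                  # S204
            kids = sorted(children[p], key=codes.get)
            k = 1
            while any(base[k + codes[c]] != FREE or check[k + codes[c]] != FREE
                      for c in kids):
                k += 1
            base[index[p]] = k
            for c in kids:
                index[p + c] = k + codes[c]
                check[k + codes[c]] = index[p]
                next_layer.append(p + c)
        layer = next_layer
    for word in set(keywords):                             # S205
        i = index[word]
        base[i] = -i if base[i] == FREE else -abs(base[i])
    return base, check, index


codes = {"a": 1, "b": 2, "c": 3, "d": 4}  # S201, assumed numbering
base, check, index = build_double_array(["ab", "ac", "ad", "bc"], codes)
print(base[:7], check[:7])
```

In this example the node "a" has three direct child nodes and is placed before "b", which has one, so the larger group takes the lowest free run of slots and the single child of "b" fills the next free slot.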
Specifically, to number all the Chinese characters contained in the pattern keywords, step S201 of the above method for constructing an AC automaton provided by the embodiment of the present invention may use any one of the following, without limitation: the Chinese character coded character set for information interchange (GB2312), the Chinese Internal Code Specification (GBK), Big5 (BIG5), or the 8-bit Unicode Transformation Format (UTF-8).
It should be noted that steps S203 and S204 form a loop. Starting from the first-layer nodes connected to the root node and proceeding to the leaf nodes, it is determined for each layer whether any node in the layer has a direct child node. If some node in the layer has a direct child node, step S204 is performed; if no node in the layer has a direct child node, the leaf nodes have been reached, the loop ends, and step S205 is performed.
Specifically, when step S203 of the above method for constructing an AC automaton provided by the embodiment of the present invention determines that some node in a layer has a direct child node, the nodes in that layer fall into one of two cases: either some nodes contain direct child nodes and some do not, or all nodes in the layer contain direct child nodes.
For a node that contains direct child nodes, step S204 is performed.
For a node that contains no direct child node, the base value and check value corresponding to the node are kept at the initial value.
Further, nodes in the same layer that contain different numbers of direct child nodes are processed by step S204 in descending order of the number of direct child nodes they contain; for nodes that contain the same number of direct child nodes, the order in which step S204 is performed is not limited here.
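The ordering rule can be expressed as a one-line sort. In this sketch the layer is assumed to be given as a mapping from each node to the list of its direct child nodes; the function name `processing_order` and the node labels are hypothetical.

```python
def processing_order(layer_children):
    """Return the nodes that have direct child nodes, most children first.

    Nodes with equal child counts keep their original relative order,
    which is one acceptable choice since ties are unrestricted.
    """
    with_children = [node for node, kids in layer_children.items() if kids]
    return sorted(with_children, key=lambda node: len(layer_children[node]), reverse=True)


layer = {"A": ["x"], "B": ["x", "y", "z"], "C": []}
print(processing_order(layer))  # "B" first, then "A"; "C" is skipped
```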
Specifically, step S204 of the above method for constructing an AC automaton provided by the embodiment of the present invention, in which the base value corresponding to each node and the check values corresponding to its direct child nodes are determined from the Chinese character number of the node and the Chinese character numbers of its direct child nodes, may specifically include the following steps:
determining the subscript corresponding to each node from its Chinese character number;
determining, from the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to the node and the check values corresponding to its direct child nodes.
Further, in the above construction method for an AC automaton provided in the embodiment of the present invention, the subscript corresponding to each node may be determined from its Chinese character number as follows:
for each node in the first layer, the Chinese character number of the node is used as its subscript;
for the direct child nodes that belong to the same node in the other layers, the subscript I of each direct child node and its Chinese character number a_n satisfy the following relation:
I = k + a_n
where k is the smallest positive integer that satisfies the following condition:
base[k + a_1] = base[k + a_2] = ... = base[k + a_j]
= check[k + a_1] = check[k + a_2] = ... = check[k + a_j] = initial value
where j is the number of direct child nodes and n = 1, 2, ..., j.
Further, in the above construction method for an AC automaton provided in the embodiment of the present invention, to determine the base value corresponding to each node and the check values corresponding to its direct child nodes, the base value corresponding to each node is set to the value k, and the check value corresponding to each direct child node of a node is set to the subscript of that node.
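The search for k and the assignment of base and check values described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `find_k`, the use of dictionaries as sparse arrays, and 0 as the initial value are assumptions.

```python
def find_k(child_numbers, base, check, initial=0):
    """Smallest positive k with base[k + a] == check[k + a] == initial for every child number a."""
    k = 1
    while any(base.get(k + a, initial) != initial or check.get(k + a, initial) != initial
              for a in child_numbers):
        k += 1
    return k


# Empty arrays: children numbered 4, 5 and 6 fit at k = 1 (subscripts 5, 6 and 7).
base, check = {}, {}
k = find_k([4, 5, 6], base, check)
parent = 2                      # subscript of the parent node
base[parent] = k                # the parent's base value is k
for a in (4, 5, 6):
    check[k + a] = parent       # each child's check value is the parent's subscript

# Subscript 7 is now occupied, so a child numbered 6 under another parent needs k = 2.
k2 = find_k([6], base, check)
```

A first-layer node that has not been processed yet still has both values at the initial value, so a practical implementation would probably need to reserve those subscripts separately; the text above does not say how this is handled.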
In a specific implementation, step S205 of the above construction method for an AC automaton provided in the embodiment of the present invention, which sets the base values corresponding to the nodes located at the ends of all the pattern keywords to negative values, may be carried out as follows:
for a node located at the end of a pattern keyword whose base value is the initial value, the base value is changed to the negative of the subscript of the node;
for a node located at the end of a pattern keyword whose base value is not the initial value, the base value is changed to the negative of that base value.
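The two cases of step S205 can be written as one small helper. This is a sketch under the assumption that the arrays are stored as dictionaries and that the initial value is 0; `mark_pattern_end` is a hypothetical name.

```python
def mark_pattern_end(base, subscript, initial=0):
    """Make base[subscript] negative to flag the end of a pattern keyword."""
    value = base.get(subscript, initial)
    # Initial value: store the negative subscript; otherwise negate the existing base value.
    base[subscript] = -subscript if value == initial else -value


base = {6: 0, 10: 1}
mark_pattern_end(base, 6)    # base value was the initial value
mark_pattern_end(base, 10)   # base value was already set to k = 1
```

After the two calls, `base[6]` holds the negative of the subscript and `base[10]` holds the negative of the previous base value.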
It should be noted that by each base values and each check values structure even numbers group dictionary tree in, when base values with
When check values are initial value, illustrate the position for sky, it is unoccupied;When base values are negative, illustrate the position correspondence
Node is leafy node, without direct descendent, for the ending of a pattern keyword, therefore, passes through the even numbers group word of structure
Allusion quotation tree, can intuitively represent the position where the ending of exit pattern keyword;In addition, because being designated as the straight of the node under node
The check values of child node are connect, so by the even numbers group dictionary tree of structure, can clearly represent the layer where each node
The number for the direct descendent that secondary and each node includes.
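How these properties can be read back from the two arrays is sketched below. The base and check values are the ones derived step by step in embodiment one later in this document; the dictionary representation and the helper names are assumptions for illustration.

```python
# Values derived in embodiment one; a missing key stands for the initial value 0.
BASE = {1: -1, 2: 1, 3: 1, 5: 1, 6: -6, 7: 1, 8: -8, 9: -9, 10: -1, 11: -11}
CHECK = {5: 2, 6: 2, 7: 2, 8: 3, 9: 5, 10: 7, 11: 10}


def is_empty(i):
    """A position is unoccupied when base and check are both the initial value."""
    return BASE.get(i, 0) == 0 and CHECK.get(i, 0) == 0


def is_pattern_end(i):
    """A negative base value marks the end of a pattern keyword."""
    return BASE.get(i, 0) < 0


def children(i):
    """Direct child nodes are the positions whose check value equals i."""
    return sorted(j for j, parent in CHECK.items() if parent == i)


def layer_of(i):
    """First-layer nodes keep check at the initial value; each check link adds one layer."""
    parent = CHECK.get(i, 0)
    return 1 if parent == 0 else 1 + layer_of(parent)
```

For example, `children(2)` lists the three positions below subscript 2, and `layer_of(11)` follows the check values 10, 7 and 2 back to the first layer.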
In a specific implementation, step S103 of the above construction method for an AC automaton provided in the embodiment of the present invention, which establishes a failure pointer for each node in the double-array trie, may use any existing way of establishing failure pointers; see embodiment two for a specific example, which is not repeated here.
An embodiment of the present invention further provides an AC automaton built with the above construction method for an AC automaton provided in the embodiment of the present invention; repeated details are omitted.
An embodiment of the present invention further provides a Chinese multi-pattern matching method which, as shown in Fig. 3, may include the following steps:
S301, scanning the text to be processed from beginning to end using the AC automaton;
S302, counting the number of times each pattern keyword occurs in the text to be processed.
The Chinese multi-pattern matching method provided in the embodiments of the present invention is described in detail below through several specific embodiments.
Embodiment one:
Taking the 6 pattern keywords "", "Egypt", "donkey-hide gelatin", "Argentina", "Arab" and "Arabic" as an example, building the double-array trie, as shown in Fig. 4, may include the following steps:
S401, according to the 6 pattern keywords obtained, determining the node set with a layer structure formed by the 10 Chinese characters they contain, as shown in Fig. 5;
S402, assigning Chinese character numbers to the 10 Chinese characters contained in the 6 pattern keywords according to the GBK numbering;
Specifically, the numbering results for the 10 Chinese characters are shown in Table 2.
Table 2
Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Chinese character | "" | Ah | Angstrom | Root | Glue | Draw | And | Court | Primary | People |
S403, according to the 10 nodes contained in the node set, building an initialized base array and an initialized check array, each of length 19;
Specifically, the initialized base array and check array of length 19 are shown in Table 3.
Table 3
Subscript | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
base | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
check | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
state |
S404, determining whether any node in the first layer of the node set has direct child nodes; if so, performing step S405; if not, performing step S410;
Specifically, for the first-layer nodes "", "Ah" and "Angstrom" in the node set, it is determined whether any of these three nodes has direct child nodes. Since the two nodes "Ah" and "Angstrom" are found to have direct child nodes, step S405 is performed.
S405, determining the Chinese character number of each node as its subscript;
Specifically, according to the Chinese character numbers of the first-layer nodes "", "Ah" and "Angstrom" shown in Table 2, the subscripts of these three nodes are 1, 2 and 3, respectively.
S406, for the nodes that have direct child nodes, determining, in descending order of the number of direct child nodes, the relation between the subscript and the Chinese character number of each node's direct child nodes;
Specifically, among the three first-layer nodes "", "Ah" and "Angstrom", the nodes "Ah" and "Angstrom" have three direct child nodes and one direct child node, respectively. Therefore, in descending order of the number of direct child nodes, node "Ah" is processed first and node "Angstrom" second.
Specifically, the direct child nodes of "Ah" are "Root", "Glue" and "Draw", whose numbers according to Table 2 are 4, 5 and 6. From the relation between the subscript I of each direct child node and its Chinese character number a_n, the subscripts of the three direct child nodes satisfy:
I_Root = k + 4, I_Glue = k + 5, I_Draw = k + 6
where k is the smallest positive integer that satisfies the following condition:
base[k + 4] = base[k + 5] = base[k + 6] = check[k + 4] = check[k + 5] = check[k + 6] = initial value
Therefore k = 1, and the subscripts of the three direct child nodes "Root", "Glue" and "Draw" of "Ah" are 5, 6 and 7, respectively.
Similarly, the direct child node of node "Angstrom" is "And", whose number according to Table 2 is 7. From the relation between the subscript I of each direct child node and its Chinese character number a_n, the subscript I_And of this direct child node satisfies:
I_And = k + 7
where k is the smallest positive integer that satisfies the following condition:
base[k + 7] = check[k + 7] = initial value
Therefore k = 1, and the subscript I_And of the direct child node "And" of "Angstrom" is 8.
S407, determining the base value corresponding to each node as k;
Specifically, the base values of the first-layer nodes "Ah" and "Angstrom" are both 1; the results are shown in Table 4.
Table 4
Subscript | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
base | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
check | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
state | "" | Ah | Angstrom |
S408, determining the check value corresponding to each node's direct child nodes as the subscript of that node;
Specifically, the check values corresponding to the three direct child nodes "Root", "Glue" and "Draw" of the first-layer node "Ah" are all 2, and the check value corresponding to the direct child node "And" of "Angstrom" is 3; the results are shown in Table 5.
Table 5
S409, in order from the second layer of nodes to the leaf nodes, determining whether any node in the layer has direct child nodes; if so, returning to step S406; if not, performing step S410;
Specifically, for the second-layer nodes "Root", "Glue", "Draw" and "And" in the node set, "Root" has the direct child node "Court" and "Draw" has the direct child node "Primary", while "Glue" and "And" are the ends of the pattern keywords "donkey-hide gelatin" and "Egypt". Therefore, step S406 can be performed for nodes "Root" and "Draw".
Specifically, according to the above results, the subscripts of "Root" and "Draw" are 5 and 7, respectively; because nodes "Root" and "Draw" each have 1 direct child node, the order in which the two nodes are processed is not limited.
Specifically, according to the numbering results shown in Table 2, the number of the direct child node "Court" of "Root" is 8. From the relation between the subscript I of each direct child node and its Chinese character number a_n, the subscript I_Court of "Court" satisfies:
I_Court = k + 8
where k is the smallest positive integer that satisfies the following condition:
base[k + 8] = check[k + 8] = initial value
Therefore k = 1, and the subscript I_Court of "Court" is 9;
it follows that the base value corresponding to node "Root" is 1 and the check value corresponding to its direct child node "Court" is 5.
Similarly, according to the numbering results shown in Table 2, the number of the direct child node "Primary" of "Draw" is 9. From the relation between the subscript I of each direct child node and its Chinese character number a_n, the subscript I_Primary of "Primary" satisfies:
I_Primary = k + 9
where k is the smallest positive integer that satisfies the following condition:
base[k + 9] = check[k + 9] = initial value
Therefore k = 1, and the subscript I_Primary of node "Primary" is 10;
it follows that the base value corresponding to node "Draw" is 1 and the check value corresponding to its direct child node "Primary" is 7.
Similarly, for the third-layer nodes "Court" and "Primary" in the node set, only node "Primary" has a direct child node, "People". Since the subscript of "Primary" is 10 and the number of "People" is 10, the same calculation as above, which is not repeated here, gives a base value of 1 for node "Primary" and a check value of 10 for its direct child node "People"; the results are shown in Table 6.
Table 6
S410, changing the base value of each node located at the end of a pattern keyword whose base value is the initial value to the negative of the subscript of the node; changing the base value of each node located at the end of a pattern keyword whose base value is not the initial value to the negative of that base value.
Specifically, according to the results shown in Table 6, the nodes "", "Glue", "And", "Court" and "People", which are located at the ends of pattern keywords, have subscripts 1, 6, 8, 9 and 11, respectively, and their base values are the initial value. The base values of these nodes are therefore changed to the negatives of their subscripts, that is, the modified base values of "", "Glue", "And", "Court" and "People" are -1, -6, -8, -9 and -11, respectively;
Meanwhile, the node "Primary", which is also located at the end of a pattern keyword, has subscript 10 and base value 1. The base value of node "Primary" is therefore changed to the negative of its base value, that is, -1, as shown in Table 7.
Table 7
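The whole of embodiment one can be replayed with a short sketch, which may help in checking the values stated in steps S405 to S410. It is an illustration under several assumptions, not the patented implementation: pattern keywords are given directly as tuples of the character numbers from Table 2, the arrays are sparse dictionaries with 0 as the initial value, and the `reserved` set (which keeps first-layer subscripts from being reused) and all names are hypothetical.

```python
def build_double_array(patterns):
    """Build sparse base/check tables from patterns given as character-number tuples."""
    def new_node():
        return {"kids": {}, "end": False}

    root = new_node()
    for pattern in patterns:                      # ordinary trie first
        node = root
        for number in pattern:
            node = node["kids"].setdefault(number, new_node())
        node["end"] = True

    base, check = {}, {}                          # a missing key means the initial value 0
    reserved = set(root["kids"])                  # first layer: subscript = character number
    layer = list(root["kids"].items())            # (subscript, node) pairs
    ends = []
    while layer:
        ends += [sub for sub, node in layer if node["end"]]
        next_layer = []
        # Ordering strategy: nodes with more direct child nodes are placed first.
        for sub, node in sorted(layer, key=lambda item: -len(item[1]["kids"])):
            numbers = sorted(node["kids"])
            if not numbers:
                continue                          # base and check stay at the initial value
            k = 1
            while any(base.get(k + a, 0) or check.get(k + a, 0) or k + a in reserved
                      for a in numbers):
                k += 1
            base[sub] = k                         # S407
            for a in numbers:
                check[k + a] = sub                # S408
                next_layer.append((k + a, node["kids"][a]))
        layer = next_layer

    for sub in ends:                              # S410: negative base marks a keyword end
        base[sub] = -base[sub] if base.get(sub, 0) else -sub
    return base, check


# Character numbers from Table 2: 2-4-8 is "Argentina", 2-6-9-10 is "Arabic", and so on.
patterns = [(1,), (3, 7), (2, 5), (2, 4, 8), (2, 6, 9), (2, 6, 9, 10)]
base, check = build_double_array(patterns)
for sub in sorted(set(base) | set(check)):
    print(sub, base.get(sub, 0), check.get(sub, 0))
```

Each printed row gives a subscript with its base and check value, which can be compared against the values worked out in the text above.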
Embodiment two:
Taking the simple trie shown in Fig. 6a, built from the two pattern keywords to be matched, "Slender West Lake" and "West Lake", as an example, building the failure pointers and generating the AC automaton, as shown in Fig. 7, may include the following steps:
S501, pointing the failure pointers of the first-layer nodes connected to the root node root at the root node root;
S502, from the second layer of nodes to the leaf nodes, for each node, backtracking from the node pointed to by the failure pointer of the node's parent and determining whether there is a node with the same Chinese character as this node; if so, performing step S503; if not, performing step S504;
S503, pointing the failure pointer of the node at the node that has the same Chinese character;
S504, pointing the failure pointer of the node at the root node root.
Specifically, the simple trie shown in Fig. 6a contains two branches, "Slender West Lake" and "West Lake". First, the failure pointers of the first-layer nodes "thin" and "west", which are connected to the root node root, are pointed at the root node root. Next, for the second-layer nodes "west" and "lake": backtracking from the failure pointer of "thin", the parent of the left node "west", finds that the node of the second branch under the root node root is also "west", so the failure pointer of the left node "west" is pointed at the right node "west". For the right second-layer node "lake", however, backtracking from the failure pointer of its parent "west" finds no node "lake" connected to the root node root, so the failure pointer of the right second-layer node "lake" is pointed at the root node root. For the left third-layer node "lake", backtracking in the same way from the failure pointer of its parent "west" finds the same Chinese character "lake", so the failure pointer of the left third-layer node "lake" is pointed at the right node "lake"; the result is shown in Fig. 6b.
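Steps S501 to S504 can be sketched with the usual breadth-first construction of failure pointers on an ordinary trie. The sketch assumes that the two keywords rendered here as "Slender West Lake" and "West Lake" are the strings 瘦西湖 and 西湖; the `Node` class and function names are hypothetical, and the double-array storage is left out for brevity.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}   # Chinese character -> child Node
        self.fail = None     # failure pointer


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
    return root


def add_failure_pointers(root):
    queue = deque()
    for child in root.children.values():      # S501: first layer points at root
        child.fail = root
        queue.append(child)
    while queue:                               # S502: proceed layer by layer
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail                      # start from the parent's failure pointer
            while f is not root and ch not in f.children:
                f = f.fail
            child.fail = f.children.get(ch, root)   # S503 if found, otherwise S504
            queue.append(child)


root = build_trie(["瘦西湖", "西湖"])
add_failure_pointers(root)
left_west = root.children["瘦"].children["西"]
right_west = root.children["西"]
```

With these two keywords, `left_west.fail` is `right_west`, the left "lake" node fails to the right "lake" node, and the right "lake" node fails to the root, which agrees with the description of Fig. 6b.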
Embodiment three:
Taking the text to be detected "Slender West Lake is not the thin West Lake" and the AC automaton built from the two pattern keywords "Slender West Lake" and "West Lake" as an example, Chinese multi-pattern matching, as shown in Fig. 8, may include the following steps:
S601, scanning the text to be detected starting from the root node root, according to the AC automaton built;
S602, determining whether the current Chinese character of the scanned text to be detected matches the current Chinese character in the AC automaton; if so, performing step S603; if not, performing step S606;
S603, determining whether the node of the current Chinese character in the AC automaton is a leaf node; if so, performing step S604; if not, performing step S605;
S604, recording all the Chinese characters matched on the path from the root node root to the current node;
S605, continuing to scan the next Chinese character in the text to be detected, and returning to step S602;
S606, continuing the match at the node pointed to by the failure pointer of the node of the current Chinese character;
S607, when the failure pointer points at the root node root, skipping the current Chinese character in the scanned text to be detected, scanning the next Chinese character, and returning to step S602.
Specifically, for the text to be detected "Slender West Lake is not the thin West Lake" and the AC automaton built from the two pattern keywords "Slender West Lake" and "West Lake", matching starts at the root node root. The node matching the first Chinese character of the text, "thin", is found, and matching continues along the direct child nodes of that node, following path 1-2-3 shown in Fig. 9a, which finds "thin", "west" and "lake". Since "lake" is a leaf node, all the Chinese characters matched on the path from the root node root to the leaf node are recorded as "Slender West Lake" and "West Lake". Matching then continues with the character "no" in the text. The leaf node "lake" has no direct child nodes, so the failure pointer is followed to the root node root, path 4-5 shown in Fig. 9b; the root node root has no child node with the Chinese character "no" either, so the match fails. The character "no" is skipped and matching continues with the following character "yes"; the root node root has no child node "yes" either, so "yes" is skipped as well and matching continues with the following character "thin". A branch containing "thin" is found under the root node root, so matching proceeds down that branch, path 6 shown in Fig. 9c. When the character "" after "thin" is matched, the "thin" node has no branch "" below it, so the match fails and the failure pointer of the "thin" node is followed to the root node root, path 7 shown in Fig. 9d. No branch "" is found at the root node root either, so "" is skipped and matching continues with the character "west"; a branch "west" is found at the root node root, path 8 shown in Fig. 9e. Matching continues along the AC automaton with the character "lake", which is found below the "west" node, path 9 shown in Fig. 9f. The "lake" node is a leaf node, so all the Chinese characters matched on the path from the root node root to the leaf node are recorded as "West Lake". At this point, matching of the text to be detected "Slender West Lake is not the thin West Lake" is complete: the text only needs to be scanned once to match the two words "Slender West Lake" and "West Lake", and the numbers of times "Slender West Lake" and "West Lake" occur in the text to be detected are counted as 1 and 2, respectively.
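The single-pass scan and the final counts can be reproduced with a compact sketch of a standard Aho-Corasick matcher. Several things here are assumptions: the keywords are taken to be 瘦西湖 and 西湖 and the text to be 瘦西湖不是瘦的西湖 (a back-translation of the English rendering used above), states are stored in plain lists and dictionaries instead of the double-array trie, and the function names are hypothetical.

```python
from collections import Counter, deque


def build_automaton(words):
    goto, fail, out = [{}], [0], [[]]          # state 0 is the root node
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    queue = deque(goto[0].values())            # first-layer states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]     # inherit keywords ending at the failure target
            queue.append(t)
    return goto, fail, out


def count_matches(text, words):
    goto, fail, out = build_automaton(words)
    counts, s = Counter(), 0
    for ch in text:                            # one pass over the text
        while s and ch not in goto[s]:
            s = fail[s]                        # follow failure pointers on a mismatch
        s = goto[s].get(ch, 0)                 # stay at the root if nothing matches
        counts.update(out[s])
    return counts


counts = count_matches("瘦西湖不是瘦的西湖", ["瘦西湖", "西湖"])
print(counts["瘦西湖"], counts["西湖"])  # 1 2
```

Reaching the end of the longer keyword also records the shorter one, because the shorter keyword ends at the node its failure pointer leads to; this is why the first three characters contribute to both counts.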
Based on the same inventive concept, an embodiment of the present invention further provides a construction device for an AC automaton. Since the principle by which the device solves the problem is similar to that of the foregoing construction method for an AC automaton, the implementation of the device may refer to the implementation of the method; repeated details are omitted.
Specifically, a construction device for an AC automaton provided in an embodiment of the present invention, as shown in Fig. 10, may include:
a determining module 10, configured to determine, according to all the Chinese characters contained in all the pattern keywords obtained, a node set with a layer structure formed by all the Chinese characters;
an establishing module 20, configured to determine, according to the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes that the nodes in each layer have, the offset (base) value and the relation (check) value corresponding to the nodes contained in each layer, and to establish a double-array trie formed by the base values and the check values;
a generating module 30, configured to establish a failure pointer for each node in the double-array trie and generate the AC automaton.
Further, in the above construction device for an AC automaton provided in the embodiment of the present invention, the establishing module 20 is specifically configured to: assign Chinese character numbers, by a specified encoding, to all the Chinese characters contained in all the pattern keywords; build, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length; and perform the following process on each layer of nodes in the node set, in order from the first layer of nodes connected to the root node to the leaf nodes: determine whether any node in the layer has direct child nodes; when it is determined that a node in the layer has direct child nodes, for the nodes in the layer that have direct child nodes, determine, in descending order of the number of direct child nodes that the nodes in the layer have, the base value corresponding to each node and the check value corresponding to each of its direct child nodes according to the Chinese character number of each node and the Chinese character numbers of its direct child nodes; when it is determined that no node in the layer has direct child nodes, set the base values corresponding to the nodes located at the ends of all the pattern keywords to negative values.
Further, in the above construction device for an AC automaton provided in the embodiment of the present invention, the establishing module 20 is specifically configured to: determine the subscript corresponding to each node according to the Chinese character number of the node; and determine the base value corresponding to each node and the check value corresponding to each of its direct child nodes according to the subscript of the node and the Chinese character numbers of its direct child nodes.
Further, in the above construction device for an AC automaton provided in the embodiment of the present invention, the establishing module 20 is specifically configured to: for each node in the first layer, use the Chinese character number of the node as its subscript; and for the direct child nodes that belong to the same node in the other layers, determine that the subscript I of each direct child node and its Chinese character number a_n satisfy the following relation:
I = k + a_n
where k is the smallest positive integer that satisfies the following condition:
base[k + a_1] = base[k + a_2] = ... = base[k + a_j]
= check[k + a_1] = check[k + a_2] = ... = check[k + a_j] = initial value
where j is the number of direct child nodes and n = 1, 2, ..., j.
Further, in the above construction device for an AC automaton provided in the embodiment of the present invention, the establishing module 20 is specifically configured to: set the base value corresponding to each node to the value k; and set the check value corresponding to each direct child node of a node to the subscript of that node.
Further, in the above construction device for an AC automaton provided in the embodiment of the present invention, the establishing module 20 is specifically configured to: change the base value of each node located at the end of a pattern keyword whose base value is the initial value to the negative of the subscript of the node; and change the base value of each node located at the end of a pattern keyword whose base value is not the initial value to the negative of that base value.
Further, in the above construction device for an AC automaton provided in the embodiment of the present invention, the establishing module 20 is specifically configured to assign Chinese character numbers to all the Chinese characters contained in all the pattern keywords using one of the following encodings: the Chinese character coded character set for information interchange GB2312, the Chinese internal code specification GBK, Big5 (BIG5), or the 8-bit universal transformation format UTF-8.
Based on the same inventive concept, an embodiment of the present invention further provides a Chinese multi-pattern matching device. Since the principle by which the device solves the problem is similar to that of the foregoing Chinese multi-pattern matching method, the implementation of the device may refer to the implementation of the method; repeated details are omitted.
Specifically, a Chinese multi-pattern matching device provided in an embodiment of the present invention, as shown in Fig. 11, may include:
a scanning module 100, configured to scan the text to be processed from beginning to end using the above AC automaton provided in the embodiment of the present invention;
a counting module 200, configured to count the number of times each pattern keyword occurs in the text to be processed.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, the device (system) and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
According to the construction method for an AC automaton, its device, the AC automaton, the Chinese multi-pattern matching method and its device provided in the embodiments of the present invention, a node set with a layer structure formed by all the Chinese characters is determined according to all the Chinese characters contained in all the pattern keywords obtained; according to the nodes contained in each layer of the determined node set, and in descending order of the number of direct child nodes that the nodes in each layer have, the offset (base) value and the relation (check) value corresponding to the nodes contained in each layer are determined, and a double-array trie formed by the base values and check values is established; a failure pointer is established for each node in the double-array trie to generate the AC automaton. Thus, in building the AC automaton, first, when the double-array trie formed by base values and check values is established, an ordering strategy is used: within each layer, in descending order of the number of direct child nodes, the base value of the node with the most direct child nodes and the check values of its direct child nodes are determined first. This can significantly reduce conflicts when searching for base values, keep the arrays from growing too fast, reduce data sparsity, and improve memory usage. Second, a failure pointer is established for each node of the double-array trie to generate the AC automaton, which optimizes space utilization to a large extent.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.
Claims (17)
- 1. A construction method for a multi-pattern-matching AC automaton, characterized by comprising:
  determining, according to all the Chinese characters contained in all the pattern keywords obtained, a node set with a layer structure formed by all the Chinese characters;
  determining, according to the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes that the nodes in each layer have, the offset (base) value and the relation (check) value corresponding to the nodes contained in each layer, and establishing a double-array trie formed by the base values and the check values;
  establishing a failure pointer for each node in the double-array trie to generate the AC automaton.
- 2. The construction method according to claim 1, characterized in that determining, according to the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes that the nodes in each layer have, the offset (base) value and the relation (check) value corresponding to the nodes contained in each layer specifically comprises:
  assigning Chinese character numbers, by a specified encoding, to all the Chinese characters contained in all the pattern keywords;
  building, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length;
  performing the following process on each layer of nodes in the node set, in order from the first layer of nodes connected to the root node to the leaf nodes:
  determining whether any node in the layer has direct child nodes;
  when it is determined that a node in the layer has direct child nodes, for the nodes in the layer that have direct child nodes, determining, in descending order of the number of direct child nodes that the nodes in the layer have, the base value corresponding to each node and the check value corresponding to each of its direct child nodes according to the Chinese character number of each node and the Chinese character numbers of its direct child nodes;
  when it is determined that no node in the layer has direct child nodes, setting the base values corresponding to the nodes located at the ends of all the pattern keywords to negative values.
- 3. The construction method of claim 2, characterized in that determining the base value corresponding to each node and the check values corresponding to the direct child nodes of each node from the Chinese character number of the node and the Chinese character numbers of its direct child nodes specifically comprises: determining the subscript corresponding to each node from its Chinese character number; and determining the base value corresponding to each node and the check values corresponding to its direct child nodes from the subscript of the node and the Chinese character numbers of its direct child nodes.
- 4. The construction method of claim 3, characterized in that determining the subscript corresponding to each node from its Chinese character number specifically comprises: for each node among the first-layer nodes, taking the node's Chinese character number as its subscript; and for the direct child nodes that belong to the same node in the other layers, determining the subscript $I$ of each direct child node from its Chinese character number $a_n$ such that $I = k + a_n$, where $k$ is the smallest positive integer satisfying $\mathrm{base}[k+a_1] = \mathrm{base}[k+a_2] = \dots = \mathrm{base}[k+a_j] = \mathrm{check}[k+a_1] = \mathrm{check}[k+a_2] = \dots = \mathrm{check}[k+a_j] = \text{initial value}$, and $n$ indexes the direct child nodes, $n = 1, 2, \dots, j$.
- 5. The construction method of claim 4, characterized in that determining the base value corresponding to each node and the check values corresponding to its direct child nodes from the subscript of the node and the Chinese character numbers of its direct child nodes specifically comprises: setting the base value corresponding to each node to the value of k; and setting the check value corresponding to each direct child node of a node to the subscript of that node.
- 6. The construction method of any one of claims 2 to 5, characterized in that setting the base values corresponding to the nodes located at the ends of the pattern keywords to negative values specifically comprises: for a node located at the end of a pattern keyword whose base value is the initial value, changing that base value to the negative of the node's subscript; and for a node located at the end of a pattern keyword whose base value is not the initial value, changing that base value to the negative of the node's current base value.
- 7. The construction method of any one of claims 2 to 5, characterized in that numbering all the Chinese characters contained in all the pattern keywords according to a specified encoding specifically comprises: numbering them according to one of the following: the Chinese Coded Character Set for Information Interchange (GB2312), the Chinese Internal Code Specification (GBK), Big5, or the 8-bit Unicode Transformation Format (UTF-8).
- 8. An AC automaton, characterized by being constructed using the method for constructing an AC automaton of any one of claims 1 to 7.
- 9. A Chinese multi-pattern matching method, characterized by comprising: scanning the text to be processed from beginning to end using the AC automaton of claim 8, and counting the number of times all the pattern keywords occur in the text to be processed.
- 10. A device for constructing an AC automaton, characterized by comprising: a determining module, configured to determine, from all the Chinese characters contained in all the acquired pattern keywords, a layered node set composed of those Chinese characters; an establishing module, configured to determine, for the nodes contained in each layer of the determined node set, the offset (base) value and the relation (check) value corresponding to the nodes of that layer, taking the nodes within each layer in descending order of the number of direct child nodes they contain, and to build a double-array trie composed of the base values and the check values; and a generating module, configured to establish a failure pointer for each node in the double-array trie to generate the AC automaton.
- 11. The construction device of claim 10, characterized in that the establishing module is specifically configured to: number all the Chinese characters contained in all the pattern keywords according to a specified encoding; construct, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length; and, in order from the first-layer nodes connected to the root node down to the leaf nodes, perform the following process on the nodes of each layer of the node set: determine whether any node in the layer has direct child nodes; when a node in the layer has direct child nodes, then for the nodes in that layer that have direct child nodes, taken in descending order of the number of direct child nodes they contain, determine the base value corresponding to each node and the check values corresponding to the direct child nodes of each node from the Chinese character number of the node and the Chinese character numbers of its direct child nodes; and when no node in the layer has direct child nodes, set the base values corresponding to the nodes located at the ends of the pattern keywords to negative values.
- 12. The construction device of claim 11, characterized in that the establishing module is specifically configured to: determine the subscript corresponding to each node from its Chinese character number; and determine the base value corresponding to each node and the check values corresponding to its direct child nodes from the subscript of the node and the Chinese character numbers of its direct child nodes.
- 13. The construction device of claim 12, characterized in that the establishing module is specifically configured to: for each node among the first-layer nodes, take the node's Chinese character number as its subscript; and for the direct child nodes that belong to the same node in the other layers, determine the subscript $I$ of each direct child node from its Chinese character number $a_n$ such that $I = k + a_n$, where $k$ is the smallest positive integer satisfying $\mathrm{base}[k+a_1] = \mathrm{base}[k+a_2] = \dots = \mathrm{base}[k+a_j] = \mathrm{check}[k+a_1] = \mathrm{check}[k+a_2] = \dots = \mathrm{check}[k+a_j] = \text{initial value}$, and $n$ indexes the direct child nodes, $n = 1, 2, \dots, j$.
- 14. The construction device of claim 13, characterized in that the establishing module is specifically configured to: set the base value corresponding to each node to the value of k; and set the check value corresponding to each direct child node of a node to the subscript of that node.
- 15. The construction device of any one of claims 11 to 14, characterized in that the establishing module is specifically configured to: change the base value of a node located at the end of a pattern keyword, when that base value is the initial value, to the negative of the node's subscript; and change the base value of a node located at the end of a pattern keyword, when that base value is not the initial value, to the negative of that base value.
- 16. The construction device of any one of claims 11 to 14, characterized in that the establishing module is specifically configured to number all the Chinese characters contained in all the pattern keywords according to one of the following: the Chinese Coded Character Set for Information Interchange (GB2312), the Chinese Internal Code Specification (GBK), Big5, or the 8-bit Unicode Transformation Format (UTF-8).
- 17. A Chinese multi-pattern matching device, characterized by comprising: a scanning module, configured to scan the text to be processed from beginning to end using the AC automaton of claim 8; and a statistics module, configured to count the number of times all the pattern keywords occur in the text to be processed.
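
The construction and matching steps recited in the claims can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and it rests on several assumptions that do not come from the claims: the names `build_automaton`, `step` and `count_matches` are hypothetical; characters are numbered densely from 1 rather than by a GB2312, GBK, Big5 or UTF-8 code; `None` stands in for the "initial value"; the root sits at subscript 0; and a slot is treated as free when its check entry is unset, which in this sketch is equivalent to both entries being unset. A helper dictionary trie is kept only to compute the failure pointers of the AC (Aho-Corasick) automaton, which are then stored by double-array subscript.

```python
from collections import deque


def build_automaton(patterns):
    # Assumption: number characters densely from 1 instead of by GB2312/GBK/Big5/UTF-8 code.
    code = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(patterns))))}

    # Helper trie: children[s] maps a character to a child id; node 0 is the root.
    children, ends = [{}], [False]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in children[s]:
                children.append({})
                ends.append(False)
                children[s][ch] = len(children) - 1
            s = children[s][ch]
        ends[s] = True

    # None plays the role of the "initial value"; subscript 0 is reserved for the root.
    base, check = [None] * (len(code) + 1), [None] * (len(code) + 1)
    check[0] = 0
    pos = {0: 0}  # helper-trie node id -> double-array subscript

    def fits(k, codes):
        return all(k + a >= len(check) or check[k + a] is None for a in codes)

    layer = [0]
    while layer:
        # Within a layer, place the nodes with the most direct children first.
        layer.sort(key=lambda s: len(children[s]), reverse=True)
        next_layer = []
        for s in layer:
            codes = [code[ch] for ch in children[s]]
            if not codes:
                continue
            k = 0 if s == 0 else 1  # first-layer subscript equals the character number
            while not fits(k, codes):
                k += 1
            grow = k + max(codes) + 1 - len(check)
            if grow > 0:
                base.extend([None] * grow)
                check.extend([None] * grow)
            base[pos[s]] = k
            for ch, c in children[s].items():
                pos[c] = k + code[ch]
                check[pos[c]] = pos[s]
                next_layer.append(c)
        layer = next_layer

    # Mark keyword ends with a negative base value (as in claim 6).
    for s, is_end in enumerate(ends):
        if is_end:
            i = pos[s]
            base[i] = -i if base[i] is None else -base[i]

    # Failure pointers via breadth-first search, stored by double-array subscript.
    f = [0] * len(children)
    queue = deque(children[0].values())
    while queue:
        s = queue.popleft()
        for ch, c in children[s].items():
            t = f[s]
            while t and ch not in children[t]:
                t = f[t]
            f[c] = children[t].get(ch, 0)
            queue.append(c)
    fail = [0] * len(check)
    for s, i in pos.items():
        fail[i] = pos[f[s]]
    return code, base, check, fail


def step(base, check, s, a):
    # Follow the transition for character number a from subscript s, if it exists.
    t = abs(base[s]) + a
    return t if t < len(check) and check[t] == s else None


def count_matches(text, code, base, check, fail):
    s, total = 0, 0
    for ch in text:
        a = code.get(ch)
        if a is None:  # character outside every keyword: restart at the root
            s = 0
            continue
        while s and step(base, check, s, a) is None:
            s = fail[s]
        t = step(base, check, s, a)
        s = 0 if t is None else t
        u = s
        while u:  # count every keyword that ends at this position
            if base[u] < 0:
                total += 1
            u = fail[u]
    return total


code, base, check, fail = build_automaton(["中国", "国人", "中国人", "人"])
print(count_matches("中国人民", code, base, check, fail))  # 4
```

In this sketch the descending sort only changes which node claims its slots first; the intent described in the abstract is that nodes with many children, which are the hardest to place, search for a base value while the arrays are still mostly empty.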
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610943520.5A CN108021569A (en) | 2016-11-01 | 2016-11-01 | The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108021569A (en) | 2018-05-11 |
Family
ID=62070753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610943520.5A Pending CN108021569A (en) | 2016-11-01 | 2016-11-01 | The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108021569A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786962A (en) * | 2005-12-21 | 2006-06-14 | 中国科学院计算技术研究所 | Method for managing and searching dictionary with perfect even numbers group TRIE Tree |
US20070192286A1 (en) * | 2004-07-26 | 2007-08-16 | Sourcefire, Inc. | Methods and systems for multi-pattern searching |
CN102193914A (en) * | 2011-05-26 | 2011-09-21 | 中国科学院计算技术研究所 | Computer aided translation method and system |
CN103198079A (en) * | 2012-01-06 | 2013-07-10 | 北大方正集团有限公司 | Related search implementation method and device |
CN105183788A (en) * | 2015-08-20 | 2015-12-23 | 及时标讯网络信息技术(北京)有限公司 | Operation method for Chinese AC automatic machine based on retrieval of keyword dictionary tree |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109524068A (en) * | 2018-10-16 | 2019-03-26 | 东华大学 | A kind of disease symptoms extracting method based on AC automatic machine |
CN109933656A (en) * | 2019-03-15 | 2019-06-25 | 深圳市赛为智能股份有限公司 | Public sentiment polarity prediction technique, device, computer equipment and storage medium |
WO2020186627A1 (en) * | 2019-03-15 | 2020-09-24 | 深圳市赛为智能股份有限公司 | Public opinion polarity prediction method and apparatus, computer device, and storage medium |
CN109933656B (en) * | 2019-03-15 | 2023-08-15 | 深圳市赛为智能股份有限公司 | Public opinion polarity prediction method, public opinion polarity prediction device, computer equipment and storage medium |
CN113065419A (en) * | 2021-03-18 | 2021-07-02 | 哈尔滨工业大学 | Pattern matching algorithm and system based on flow high-frequency content |
CN113065419B (en) * | 2021-03-18 | 2022-05-24 | 哈尔滨工业大学 | Pattern matching algorithm and system based on flow high-frequency content |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180511 |