CN108021569A - Construction method for AC automata, Chinese multi-pattern matching method, and related apparatus - Google Patents

Construction method for AC automata, Chinese multi-pattern matching method, and related apparatus

Info

Publication number
CN108021569A
Authority
CN
China
Prior art keywords
node
layer
direct descendent
values
chinese character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610943520.5A
Other languages
Chinese (zh)
Inventor
刘江锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Co Ltd
Priority to CN201610943520.5A
Publication of CN108021569A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a construction method for an AC automaton, a corresponding construction device, an AC automaton, and a Chinese multi-pattern matching method and device. From all the Chinese characters contained in the acquired pattern keywords, a layered node set made up of those characters is determined. For the nodes in each layer of the node set, a sorting strategy is applied: nodes are processed in descending order of the number of direct child nodes they contain, so that the base value of the node with the most direct child nodes, and the check values of those child nodes, are determined first. This significantly reduces conflicts when searching for base values, keeps the arrays from growing too quickly, reduces data sparsity, and improves memory usage. A failure pointer is then established for each node of the resulting double-array trie to generate the AC automaton, which improves space utilization to a large extent.

Description

Construction method for AC automata, Chinese multi-pattern matching method, and related apparatus
Technical field
The present invention relates to the field of information technology, and in particular to a construction method for an AC automaton, a corresponding construction device, an AC automaton, and a Chinese multi-pattern matching method and device.
Background
The Aho-Corasick algorithm, developed at Bell Labs in 1975, is one of the best-known multi-pattern matching algorithms. Its underlying data structure is the Aho-Corasick automaton (AC automaton for short). By exploiting the common prefixes of strings, it reduces query time and keeps pointless string comparisons to a minimum. It is therefore widely used for counting, sorting, and storing large numbers of strings, and search engines in particular often use it for text word-frequency statistics.
In general, an AC automaton is built on a tree-structured dictionary tree (trie), also known as a word search tree, which is a variant of the hash tree. A trie has the following basic properties: the root node contains no character, and every other node contains exactly one character; concatenating the characters along the path from the root node to a given node yields the string corresponding to that node; and the characters contained in the child nodes of any one node are all different.
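The three trie properties listed above can be seen in a minimal sketch. The class and function names here are illustrative assumptions, and a Python dict stands in for the per-node child table; the source does not prescribe any particular implementation.

```python
class TrieNode:
    """One trie node. The root holds no character; every other node is
    reached through exactly one character key in its parent's table."""
    def __init__(self):
        self.children = {}    # character -> TrieNode; keys are unique per node
        self.is_end = False   # True if a keyword ends at this node


def insert(root, word):
    """Add a keyword, reusing any prefix that is already stored."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True


def strings(node, prefix=""):
    """Yield each stored keyword by joining characters along root-to-node paths."""
    if node.is_end:
        yield prefix
    for ch, child in node.children.items():
        yield from strings(child, prefix + ch)


root = TrieNode()
for w in ["中国", "中文", "国人"]:   # sample keywords, chosen only for illustration
    insert(root, w)
print(sorted(strings(root)))
```

The two keywords starting with "中" share a single first-layer node, which is the common-prefix saving the paragraph describes.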
However, the biggest problem in building a traditional trie is its large space consumption. If the trie has n first-layer nodes and each of them may be followed by any of m characters, the space occupied is n × m. For Chinese, where the total number of characters is currently as high as 80,000, a trie with more than 1,000 nodes may exhaust the memory of existing computers. Therefore, if an AC automaton built on a traditional trie is used for Chinese multi-pattern matching, the space consumption of the trie must be optimized so that the computer can run normally and efficiently.
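As a rough illustration of the n × m estimate, the following sketch computes the size of a dense child table. The 8-byte slot size and the function name are assumptions made for the example; the source gives only the node count and the character count.

```python
def dense_trie_bytes(num_nodes, alphabet_size, slot_bytes=8):
    """Memory for a trie in which every node reserves one slot per possible
    character. slot_bytes is an assumed pointer size, not a figure from the source."""
    return num_nodes * alphabet_size * slot_bytes


# 1,000 nodes, 80,000 possible Chinese characters per node
total = dense_trie_bytes(1000, 80_000)
print(total / 10**6, "MB")
```

Under these assumptions the table already needs 640 MB for 1,000 nodes, almost all of it empty slots.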
To address the space consumption problem, existing technical solutions include the following. The first scheme trades time for space: the array that stores all child nodes of a node is replaced with a linked list. Although this improves space utilization, it reduces the search efficiency of the AC automaton during Chinese multi-pattern matching. In the worst case, looking up a child node used to take a single comparison and now takes 80,000.
The second scheme uses word frequency to optimize space: for frequently used pattern keywords, child nodes are stored in arrays, and for rarely used pattern keywords, child nodes are stored in linked lists. Although this optimizes space overall, it still cannot reduce the space consumed by the considerable number of high-frequency pattern keywords, and it also sacrifices query performance for the non-high-frequency pattern keywords.
The third scheme builds the trie from pinyin (the Chinese phonetic alphabet) instead of Chinese characters. Although this saves space, it introduces another problem: one pinyin string may correspond to several Chinese words. For example, "tianliang" may correspond to "daybreak", "Tian Liang", "staggering amount", "it is cool", "conscience", and other words. This violates the rule that the failure pointer of a node in an AC automaton can point to only one node, so the ability of the AC automaton to match all pattern keywords in a single traversal cannot be used.
In addition, the prior art can replace the traditional trie with a double-array trie. A double-array trie is another representation of the traditional trie structure: it represents the hierarchical structure of the trie with two arrays, an offset (base) array and a relation (check) array, which greatly reduces space and memory usage and lowers space complexity. However, nodes are traversed layer by layer when a double-array trie is built. If an earlier node in a layer has few child nodes and a later node has many, conflicts are very likely when the later node is processed, which in turn affects the space consumed during Chinese multi-pattern matching by an AC automaton built on that double-array trie. For example, as shown in Table 1, when the node "angstrom" is processed first, its direct child node "and" is placed at the state with subscript 7. When the node "Ah" is processed next, the positions of its direct child nodes "root", "glue", and "drawing" may coincide with the position of "and", which produces a conflict. The direct child nodes "root", "glue", and "drawing" can then only be placed at states whose subscripts are greater than 7. A double-array trie built this way is sparse, which seriously reduces space utilization.
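To make the role of the two arrays concrete, here is a minimal sketch of a child lookup in the conventional double-array formulation: the child of a node sits at the parent's base value plus the character number, and the check array records which parent owns that slot. The function name, the array contents, and the use of 0 as the initial value are assumptions for illustration only.

```python
def transition(base, check, parent, code):
    """Return the subscript of the child reached from `parent` on character
    number `code`, or None if the slot is not owned by `parent`."""
    child = base[parent] + code
    if 0 <= child < len(check) and check[child] == parent:
        return child
    return None


# Hypothetical arrays: the node at subscript 1 has base value 4 and one
# child with character number 3, so that child lives at subscript 4 + 3 = 7.
base = [0] * 10
check = [0] * 10
base[1] = 4
check[7] = 1

print(transition(base, check, 1, 3))   # slot 7 is owned by node 1
print(transition(base, check, 1, 2))   # slot 6 is free, so there is no such child
```

A conflict of the kind described above arises when two parents would need the same slot; the later parent must then pick a larger base value, which leaves gaps in the arrays.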
Table 1
Accordingly, how to optimize array space and improve space utilization when building an AC automaton on the basis of a double-array trie is a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
Embodiments of the present invention provide a construction method for an AC automaton, a corresponding construction device, an AC automaton, and a Chinese multi-pattern matching method and device, in order to optimize array space and improve space utilization when building an AC automaton on the basis of a double-array trie.
An embodiment of the present invention provides a construction method for a multi-pattern matching AC automaton, including:
determining, from all the Chinese characters contained in all acquired pattern keywords, a layered node set made up of those Chinese characters;
for the nodes contained in each layer of the determined node set, determining the offset (base) values and relation (check) values of the nodes of that layer, processing the nodes within each layer in descending order of the number of direct child nodes they contain, and establishing a double-array trie made up of the base values and check values;
establishing a failure pointer for each node in the double-array trie to generate the AC automaton.
In one possible implementation of the above construction method, determining the base values and check values of the nodes of each layer, in descending order of the number of direct child nodes within each layer, specifically includes:
numbering all the Chinese characters contained in all pattern keywords according to a specified encoding;
constructing, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length;
performing the following process on each layer of nodes in the node set, in order from the first-layer nodes connected to the root node down to the leaf nodes:
determining whether any node in the same layer has a direct child node;
when it is determined that some node in the layer has a direct child node, then for the nodes in that layer that have direct child nodes, taken in descending order of the number of direct child nodes they contain, determining the base value of each node and the check values of its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes;
when it is determined that no node in the layer has a direct child node, setting the base values of the nodes located at the end of any pattern keyword to negative values.
In one possible implementation of the above construction method, determining the base value of each node and the check values of its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes specifically includes:
determining the subscript of each node from its Chinese character number;
determining the base value of each node and the check values of its direct child nodes from the subscript of the node and the Chinese character numbers of its direct child nodes.
In one possible implementation of the above construction method, determining the subscript of each node from its Chinese character number specifically includes:
for each node in the first layer, taking the Chinese character number of the node as its subscript;
for the direct child nodes of the same node in the other layers, determining that the subscript $I$ and the Chinese character number $a_n$ of each direct child node satisfy the following relation:
$I = k + a_n$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+a_1] = \mathrm{base}[k+a_2] = \dots = \mathrm{base}[k+a_j] = \mathrm{check}[k+a_1] = \mathrm{check}[k+a_2] = \dots = \mathrm{check}[k+a_j] = \text{initial value}$
Here $n$ indexes the direct child nodes, $n = 1, 2, \dots, j$, and $j$ is the number of direct child nodes.
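The search for k can be sketched as a simple linear scan. This is only an illustration under stated assumptions: the initial value is taken to be 0, slots beyond the current array length are treated as free (the arrays would be extended before use), and the function name is hypothetical.

```python
def find_min_offset(base, check, child_codes, initial=0):
    """Return the smallest positive integer k such that, for every child
    character number a, both base[k + a] and check[k + a] still hold the
    initial value. Slots past the end of the arrays count as free."""
    k = 1
    while True:
        free = all(
            k + a >= len(base)
            or (base[k + a] == initial and check[k + a] == initial)
            for a in child_codes
        )
        if free:
            return k
        k += 1


base = [0] * 10
check = [0] * 10
check[3] = 1                      # slot 3 already belongs to another node
print(find_min_offset(base, check, [2, 4]))
```

With k = 1 the first child would land on the occupied slot 3, so the scan moves on and settles on k = 2, where slots 4 and 6 are both free.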
In one possible implementation of the above construction method, determining the base value of each node and the check values of its direct child nodes from the subscript of the node and the Chinese character numbers of its direct child nodes specifically includes:
setting the base value of each node to the value of k;
setting the check value of each direct child node of a node to the subscript of that node.
In one possible implementation of the above construction method, setting the base values of the nodes located at the end of a pattern keyword to negative values specifically includes:
for a node at the end of a pattern keyword whose base value is still the initial value, changing the base value to the negative of the node's subscript;
for a node at the end of a pattern keyword whose base value is not the initial value, changing the base value to the negative of that base value.
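The two cases for keyword-ending nodes can be sketched as follows. The function name and the choice of 0 as the initial value are assumptions for illustration.

```python
def mark_keyword_end(base, subscript, initial=0):
    """Flag the node at `subscript` as the end of a pattern keyword by
    making its base value negative, following the two cases above."""
    if base[subscript] == initial:
        # The node has no children, so no base value was assigned yet.
        base[subscript] = -subscript
    else:
        # The node has children; keep the offset but flip its sign.
        base[subscript] = -base[subscript]


base = [0, 0, 0, 5, 0, 0]
mark_keyword_end(base, 2)   # base value still initial
mark_keyword_end(base, 3)   # base value already set to 5
print(base)
```

Because the sign alone carries the flag, a lookup that continues through such a node would use the absolute value of the base entry as the offset.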
In one possible implementation of the above construction method, numbering all the Chinese characters contained in all pattern keywords according to a specified encoding specifically includes:
numbering the Chinese characters according to one of the following: the Chinese character coded character set for information interchange (GB2312), the Chinese Internal Code Specification (GBK), the Big5 code (BIG5), or the 8-bit Unicode Transformation Format (UTF-8).
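One way to derive a number from an encoding is to read the encoded bytes as a single integer. The source names the encodings but does not say how the number is obtained from them, so this mapping and the function name are assumptions made for illustration.

```python
def char_number(ch, encoding="gbk"):
    """Number a character by reading its encoded bytes as one big-endian
    integer. This mapping is an assumption; the source only names the encodings."""
    return int.from_bytes(ch.encode(encoding), "big")


for enc in ("gb2312", "gbk", "big5", "utf-8"):
    print(enc, hex(char_number("中", enc)))
```

An implementation would probably remap these raw values to a compact range, since numbers this large would stretch the base and check arrays.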
An embodiment of the present invention also provides an AC automaton built with the above construction method.
An embodiment of the present invention also provides a Chinese multi-pattern matching method, including:
scanning the text to be processed from beginning to end using the above AC automaton, and counting the number of occurrences of all pattern keywords in the text to be processed.
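The single-pass scan and count can be sketched with a textbook Aho-Corasick automaton. Note that this sketch stores transitions in Python dicts rather than in the double-array (base/check) form that the document describes, and all names and sample keywords are illustrative assumptions.

```python
from collections import Counter, deque


def build_automaton(patterns):
    """Build goto, failure, and output tables (dict-based, not double-array)."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: first-layer nodes keep failure pointer 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit keywords ending at the suffix
            queue.append(t)
    return goto, fail, out


def count_matches(text, patterns):
    """Scan the text once from start to end and count keyword occurrences."""
    goto, fail, out = build_automaton(patterns)
    counts, s = Counter(), 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        counts.update(out[s])
    return counts


print(count_matches("中国人", ["中国", "国人", "中国人"]))
```

Each of the three sample keywords is found once, even though they overlap, because the failure pointers let the scan continue without backing up in the text.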
An embodiment of the present invention also provides a construction device for an AC automaton, including:
a determining module, configured to determine, from all the Chinese characters contained in all acquired pattern keywords, a layered node set made up of those Chinese characters;
an establishing module, configured to determine, for the nodes contained in each layer of the determined node set, the offset (base) values and relation (check) values of the nodes of that layer, processing the nodes within each layer in descending order of the number of direct child nodes they contain, and to establish a double-array trie made up of the base values and check values;
a generating module, configured to establish a failure pointer for each node in the double-array trie and generate the AC automaton.
In one possible implementation of the above construction device, the establishing module is specifically configured to: number all the Chinese characters contained in all pattern keywords according to a specified encoding; construct, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length; and perform the following process on each layer of nodes in the node set, in order from the first-layer nodes connected to the root node down to the leaf nodes: determine whether any node in the same layer has a direct child node; when it is determined that some node in the layer has a direct child node, then for the nodes in that layer that have direct child nodes, taken in descending order of the number of direct child nodes they contain, determine the base value of each node and the check values of its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes; and when it is determined that no node in the layer has a direct child node, set the base values of the nodes located at the end of any pattern keyword to negative values.
In one possible implementation of the above construction device, the establishing module is specifically configured to: determine the subscript of each node from its Chinese character number; and determine the base value of each node and the check values of its direct child nodes from the subscript of the node and the Chinese character numbers of its direct child nodes.
In one possible implementation of the above construction device, the establishing module is specifically configured to: for each node in the first layer, take the Chinese character number of the node as its subscript; and for the direct child nodes of the same node in the other layers, determine that the subscript $I$ and the Chinese character number $a_n$ of each direct child node satisfy the following relation:
$I = k + a_n$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+a_1] = \mathrm{base}[k+a_2] = \dots = \mathrm{base}[k+a_j] = \mathrm{check}[k+a_1] = \mathrm{check}[k+a_2] = \dots = \mathrm{check}[k+a_j] = \text{initial value}$
Here $n$ indexes the direct child nodes, $n = 1, 2, \dots, j$, and $j$ is the number of direct child nodes.
In one possible implementation of the above construction device, the establishing module is specifically configured to: set the base value of each node to the value of k; and set the check value of each direct child node of a node to the subscript of that node.
In one possible implementation of the above construction device, the establishing module is specifically configured to: for a node at the end of a pattern keyword whose base value is still the initial value, change the base value to the negative of the node's subscript; and for a node at the end of a pattern keyword whose base value is not the initial value, change the base value to the negative of that base value.
In one possible implementation of the above construction device, the establishing module is specifically configured to number all the Chinese characters contained in all pattern keywords according to one of the following: the Chinese character coded character set for information interchange (GB2312), the Chinese Internal Code Specification (GBK), the Big5 code (BIG5), or the 8-bit Unicode Transformation Format (UTF-8).
An embodiment of the present invention also provides a Chinese multi-pattern matching device, including:
a scanning module, configured to scan the text to be processed from beginning to end using the above AC automaton;
a statistics module, configured to count the number of occurrences of all pattern keywords in the text to be processed.
The beneficial effects of the present invention are as follows:
Embodiments of the present invention provide a construction method for an AC automaton, a corresponding construction device, an AC automaton, and a Chinese multi-pattern matching method and device. From all the Chinese characters contained in all acquired pattern keywords, a layered node set made up of those characters is determined. For the nodes contained in each layer of the determined node set, the offset (base) values and relation (check) values of the nodes are determined in descending order of the number of direct child nodes they contain, and a double-array trie made up of the base values and check values is established. A failure pointer is then established for each node in the double-array trie to generate the AC automaton. Thus, when the AC automaton is built, the first step establishes the double-array trie with a sorting strategy: within each layer, nodes are taken in descending order of the number of direct child nodes, so that the base value of the node with the most direct child nodes, and the check values of those child nodes, are determined first. This significantly reduces conflicts when searching for base values, keeps the arrays from growing too quickly, reduces data sparsity, and improves memory usage. The second step establishes a failure pointer for each node of the double-array trie to generate the AC automaton, which improves space utilization to a large extent.
Brief description of the drawings
Fig. 1 is the first flowchart of the construction method for an AC automaton provided in an embodiment of the present invention;
Fig. 2 is the second flowchart of the construction method for an AC automaton provided in an embodiment of the present invention;
Fig. 3 is a flowchart of the Chinese multi-pattern matching method provided in an embodiment of the present invention;
Fig. 4 is a flowchart of Embodiment 1 provided in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the node set in Embodiment 1 provided in an embodiment of the present invention;
Figs. 6a and 6b are schematic structural diagrams of the simple trie in Embodiment 2 provided in an embodiment of the present invention;
Fig. 7 is a flowchart of Embodiment 2 provided in an embodiment of the present invention;
Fig. 8 is a flowchart of Embodiment 3 provided in an embodiment of the present invention;
Figs. 9a to 9f are schematic diagrams of the Chinese multi-pattern matching path in Embodiment 3 provided in an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of the construction device for an AC automaton provided in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the Chinese multi-pattern matching device provided in an embodiment of the present invention.
Detailed description
The construction method for an AC automaton, the corresponding construction device, the AC automaton, and the Chinese multi-pattern matching method and device provided in embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that a double-array trie consists only of the base values and check values. The subscripts and states that appear with the double-array trie in the following description serve only to illustrate its construction process clearly and do not exist in an actual double-array trie. Likewise, in an actual implementation of the AC automaton the node set exists in array form; the tree-shaped schematic diagrams of the node set below are used only to make the construction process of the AC automaton easier to understand.
An embodiment of the present invention provides a construction method for an AC automaton which, as shown in Fig. 1, may include the following steps:
S101: determine, from all the Chinese characters contained in all acquired pattern keywords, a layered node set made up of those Chinese characters;
S102: for the nodes contained in each layer of the determined node set, determine the offset (base) values and relation (check) values of the nodes of that layer, processing the nodes within each layer in descending order of the number of direct child nodes they contain, and establish a double-array trie made up of the base values and check values;
S103: establish a failure pointer for each node in the double-array trie to generate the AC automaton.
In this construction method, the first step of building the AC automaton establishes the double-array trie made up of base values and check values with a sorting strategy: within each layer, nodes are taken in descending order of the number of direct child nodes, so that the base value of the node with the most direct child nodes, and the check values of those child nodes, are determined first. This significantly reduces conflicts when searching for base values, keeps the arrays from growing too quickly, reduces data sparsity, and improves memory usage. The second step establishes a failure pointer for each node of the double-array trie to generate the AC automaton, which improves space utilization to a large extent.
In a specific implementation, in order to establish the double-array trie made up of the base values and check values, step S102 of the above construction method, which determines the base values and check values of the nodes of each layer in descending order of the number of direct child nodes within each layer, may specifically include the following steps, as shown in Fig. 2:
S201: number all the Chinese characters contained in all pattern keywords according to a specified encoding;
S202: construct, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length;
S203: in order from the first-layer nodes connected to the root node down to the leaf nodes, determine for each layer of nodes in the node set whether any node in the same layer has a direct child node; if so, perform step S204; if not, perform step S205;
S204: for the nodes in the same layer that have direct child nodes, taken in descending order of the number of direct child nodes they contain, determine the base value of each node and the check values of its direct child nodes from the Chinese character number of the node and the Chinese character numbers of its direct child nodes; once the base values of the nodes in the layer and the check values of their direct child nodes have been determined, return to step S203;
S205: set the base values of the nodes located at the end of any pattern keyword to negative values.
Specifically, to number all the Chinese characters contained in all pattern keywords, step S201 of the above construction method may use the following:
for example, any one of the Chinese character coded character set for information interchange (GB2312), the Chinese Internal Code Specification (GBK), the Big5 code (BIG5), or the 8-bit Unicode Transformation Format (UTF-8); no limitation is imposed here.
It should be noted that steps S203 to S204 form a loop. Starting from the first-layer nodes connected to the root node and continuing down to the leaf nodes, it is determined for each layer whether any node in that layer has a direct child node. If any node in the layer has a direct child node, step S204 is performed; if no node in the layer has a direct child node, the leaf nodes have been reached, the loop ends, and step S205 is performed.
Specifically, when step S203 of the above construction method of the AC automaton provided by the embodiment of the present invention determines that some node in the same layer has a direct child node, two cases are possible for the nodes in that layer: either some nodes have direct child nodes and others do not, or all nodes in the layer have direct child nodes;
For a node that has direct child nodes, step S204 is performed;
For a node that has no direct child nodes, its base value and check value are kept at the initial value.
Further, nodes with different numbers of direct child nodes are processed by step S204 in descending order of the number of direct child nodes they contain; for nodes with the same number of direct child nodes, the order in which step S204 is performed is not limited here.
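The layer-by-layer loop and the ordering rule above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: nodes are identified by their prefix (a tuple of characters), the function name `layers_in_processing_order` is hypothetical, and the sample keywords are arbitrary ASCII strings chosen so that the sort visibly reorders a layer.

```python
from collections import defaultdict


def layers_in_processing_order(keywords):
    """Return, for each layer, the nodes that have direct child nodes,
    ordered from most children to fewest, with their sorted children."""
    children = defaultdict(set)
    for word in keywords:
        for i in range(len(word)):
            children[tuple(word[:i])].add(word[i])

    layer = [(ch,) for ch in sorted(children[()])]
    result = []
    while layer:
        with_children = [node for node in layer if node in children]
        if not with_children:
            break  # no node in this layer has a child: the loop ends
        # list.sort is stable, so nodes with equal counts keep their order
        with_children.sort(key=lambda node: len(children[node]), reverse=True)
        result.append([(node, sorted(children[node])) for node in with_children])
        layer = [node + (ch,) for node in layer
                 for ch in sorted(children.get(node, ()))]
    return result


print(layers_in_processing_order(["ab", "ba", "bb", "bc"]))
```

In this example the first-layer node `b` (three children) is handled before `a` (one child), which is the order step S204 asks for.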
Specifically, in step S204 of the above construction method of the AC automaton provided by the embodiment of the present invention, determining the base value of each node and the check value of each of its direct child nodes from the Chinese character number of each node and the Chinese character numbers of its direct child nodes may specifically include the following steps:
determining the subscript of each node from its Chinese character number;
determining, from the subscript of each node and the Chinese character numbers of its direct child nodes, the base value of each node and the check value of each of its direct child nodes.
Further, to determine the subscript of each node from its Chinese character number, the above construction method of the AC automaton provided by the embodiment of the present invention may proceed as follows:
For each node in the first layer, the Chinese character number of the node is taken as its subscript;
For the direct child nodes in the other layers that belong to the same node, the subscript $I$ of each direct child node and its Chinese character number $a_n$ satisfy the following relation:
$I = k + a_n$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+a_1] = \mathrm{base}[k+a_2] = \dots = \mathrm{base}[k+a_j] = \mathrm{check}[k+a_1] = \mathrm{check}[k+a_2] = \dots = \mathrm{check}[k+a_j] = \text{initial value}$
where $n$ indexes the direct child nodes, $n = 1, 2, \dots, j$, and $j$ is the number of direct child nodes.
Further, to determine the base value of each node and the check value of each of its direct child nodes, the above construction method of the AC automaton provided by the embodiment of the present invention takes the value k as the base value of each node and takes the subscript of each node as the check value of each of its direct child nodes.
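The search for k and the assignment of base and check values can be sketched as follows. This is a minimal sketch, assuming 1-indexed lists with an initial value of 0; the name `place_children` and the automatic growth of the lists are assumptions made for illustration.

```python
def place_children(base, check, parent, child_numbers, initial=0):
    """Find the smallest positive k such that every slot k + a is still
    at the initial value, then set base[parent] = k and
    check[k + a] = parent for each child number a.
    Returns the subscripts assigned to the children."""
    k = 1
    while True:
        slots = [k + a for a in child_numbers]
        while len(base) <= max(slots):  # grow the arrays if needed (assumption)
            base.append(initial)
            check.append(initial)
        if all(base[s] == initial and check[s] == initial for s in slots):
            break
        k += 1
    base[parent] = k
    for s in slots:
        check[s] = parent
    return slots


# Slot 0 is unused so that subscripts start at 1.
base, check = [0] * 20, [0] * 20
print(place_children(base, check, 2, [4, 5, 6]))  # children numbered 4, 5, 6
print(place_children(base, check, 3, [7]))        # child numbered 7
```

With these inputs both parents end up with k = 1, which agrees with the subscripts 5, 6, 7 and 8 derived in Embodiment one.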
In a specific implementation, setting the base values of the nodes at the end of each pattern keyword to negative values in step S205 of the above construction method of the AC automaton provided by the embodiment of the present invention may specifically be done as follows:
For a node at the end of a pattern keyword whose base value is the initial value, the base value is changed to the negative of the node's subscript;
For a node at the end of a pattern keyword whose base value is not the initial value, the base value is changed to the negative of that base value.
It should be noted that, in the double-array trie built from the base values and check values, a position whose base value and check value are both the initial value is empty and unoccupied. A negative base value indicates that the node at that position is the end of a pattern keyword, typically a leaf node without direct child nodes, so the double-array trie directly shows where each pattern keyword ends. In addition, because the subscript of a node is the check value of its direct child nodes, the double-array trie clearly shows the layer of each node and the number of direct child nodes each node contains.
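How these properties can be read off the two arrays is sketched below. The array contents are the final values that the text of Embodiment one states for subscripts 1 to 11; the helper names are hypothetical, and slot 0 is padding so that subscripts start at 1.

```python
# Final values from the worked example in Embodiment one (index 0 unused).
BASE = [0, -1, 1, 1, 0, 1, -6, 1, -8, -9, -1, -11]
CHECK = [0, 0, 0, 0, 0, 2, 2, 2, 3, 5, 7, 10]


def is_empty(i):
    """A slot is unoccupied when base and check are both the initial value."""
    return BASE[i] == 0 and CHECK[i] == 0


def is_keyword_end(i):
    """A negative base value marks the end of a pattern keyword."""
    return BASE[i] < 0


def children_of(i):
    """Direct child nodes are the slots whose check value is i."""
    return [s for s in range(1, len(CHECK)) if CHECK[s] == i]


def layer_of(i):
    """Follow check values upwards; first-layer nodes have check 0."""
    depth = 1
    while CHECK[i] != 0:
        i = CHECK[i]
        depth += 1
    return depth


print(children_of(2), layer_of(11), is_keyword_end(10), is_empty(4))
```

Subscript 10 is a keyword end that still has a child (subscript 11), which is why its base value is the negated k rather than the negated subscript.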
In a specific implementation, step S103 of the above construction method of the AC automaton provided by the embodiment of the present invention, which establishes a failure pointer for each node in the double-array trie, may use any existing way of establishing failure pointers; for a specific example see Embodiment two, which is not repeated here.
The embodiment of the present invention further provides an AC automaton built by the above construction method of the AC automaton provided by the embodiment of the present invention; repeated details are omitted.
The embodiment of the present invention further provides a Chinese multi-pattern matching method which, as shown in Fig. 3, may include the following steps:
S301: scan the text to be processed from beginning to end using the AC automaton;
S302: count the number of times each pattern keyword occurs in the text to be processed.
The Chinese multi-pattern matching method provided by the embodiment of the present invention is described in detail below through several specific embodiments.
Embodiment one:
Taking 6 pattern keywords, "", "Egypt", "donkey-hide gelatin", "Argentina", "Arab" and "Arabic", as an example, building the double-array trie, as shown in Fig. 4, may include the following steps:
S401: from the 6 pattern keywords obtained, determine the node set with a layer structure formed by the 10 Chinese characters they contain, as shown in Fig. 5;
S402: number the 10 Chinese characters contained in the 6 pattern keywords according to the GBK encoding;
Specifically, the numbering results for the 10 Chinese characters are shown in Table 2.
Table 2
Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Chinese character | "" | Ah | angstrom | root | glue | drawing | and | court of a feudal ruler | primary | people
S403: from the 10 nodes contained in the node set, build an initialized base array and an initialized check array of length 19;
Specifically, the initialized base array and check array of length 19 are shown in Table 3.
Table 3
Subscript 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
base 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
check 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
state
S404: determine whether any node in the first layer of the node set has a direct child node; if so, perform step S405; if not, perform step S410;
Specifically, for the first-layer nodes "", "Ah" and "angstrom" in the node set, it is determined whether any of these three nodes has a direct child node. The two nodes "Ah" and "angstrom" are found to have direct child nodes, so step S405 is performed.
S405: take the Chinese character number of each node as its subscript;
Specifically, from the Chinese character numbers of the first-layer nodes "", "Ah" and "angstrom" shown in Table 1, the subscripts of these three nodes are 1, 2 and 3, respectively.
S406: for the nodes that have direct child nodes, in descending order of the number of direct child nodes, determine the relation between the subscript and the Chinese character number of each node's direct child nodes;
Specifically, among the three first-layer nodes "", "Ah" and "angstrom", the nodes "Ah" and "angstrom" have three direct child nodes and one direct child node, respectively. Therefore, in descending order of the number of direct child nodes, the node "Ah" is processed first and the node "angstrom" second.
Specifically, the direct child nodes of "Ah" are "root", "glue" and "drawing". According to the numbering results shown in Table 1, the numbers of "root", "glue" and "drawing" are 4, 5 and 6. From the relation between the subscript $I$ of each direct child node and its Chinese character number $a_n$, the subscripts of the three direct child nodes satisfy:
$I_{\text{root}} = k + 4,\quad I_{\text{glue}} = k + 5,\quad I_{\text{drawing}} = k + 6$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+4] = \mathrm{base}[k+5] = \mathrm{base}[k+6] = \mathrm{check}[k+4] = \mathrm{check}[k+5] = \mathrm{check}[k+6] = \text{initial value}$
Therefore $k = 1$, and the subscripts of the three direct child nodes "root", "glue" and "drawing" of "Ah" are 5, 6 and 7, respectively.
Similarly, the direct child node of the node "angstrom" is "and". According to the numbering results in Table 1, the number of "and" is 7. From the relation between the subscript $I$ of each direct child node and its Chinese character number $a_n$, the subscript of this direct child node satisfies:
$I_{\text{and}} = k + 7$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+7] = \mathrm{check}[k+7] = \text{initial value}$
Therefore $k = 1$, and the subscript $I_{\text{and}}$ of the direct child node "and" of "angstrom" is 8.
S407: take k as the base value of each node;
Specifically, the base values of the first-layer nodes "Ah" and "angstrom" are both 1; the results are shown in Table 4.
Table 4
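Subscript | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19
base | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
check | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
state |  | Ah | angstrom |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 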
S408: take the subscript of each node as the check value of each of its direct child nodes;
Specifically, the check values of the three direct child nodes "root", "glue" and "drawing" of the first-layer node "Ah" are all 2, and the check value of the direct child node "and" of "angstrom" is 3; the results are shown in Table 5.
Table 5
S409: proceeding from the second-layer nodes to the leaf nodes, determine whether any node in the same layer has a direct child node; if so, return to step S406; if not, perform step S410;
Specifically, for the second-layer nodes "root", "glue", "drawing" and "and" in the node set, "root" has the direct child node "court of a feudal ruler" and "drawing" has the direct child node "primary", while "glue" and "and" are the ends of the pattern keywords "donkey-hide gelatin" and "Egypt". Step S406 can therefore be performed for the nodes "root" and "drawing".
Specifically, according to the results above, the subscripts of "root" and "drawing" are 5 and 7, respectively; because the nodes "root" and "drawing" each have one direct child node, the order in which the two nodes are processed is not limited.
Specifically, according to the numbering results shown in Table 1, the number of "court of a feudal ruler", the direct child node of "root", is 8. From the relation between the subscript $I$ of each direct child node and its Chinese character number $a_n$, the subscript of "court of a feudal ruler" satisfies:
$I_{\text{court}} = k + 8$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+8] = \mathrm{check}[k+8] = \text{initial value}$
Therefore $k = 1$, and the subscript $I_{\text{court}}$ of "court of a feudal ruler" is 9;
It follows that the base value of the node "root" is 1 and the check value of the direct child node "court of a feudal ruler" is 5.
Similarly, according to the numbering results shown in Table 1, the number of "primary", the direct child node of "drawing", is 9. From the relation between the subscript $I$ of each direct child node and its Chinese character number $a_n$, the subscript of "primary" satisfies:
$I_{\text{primary}} = k + 9$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+9] = \mathrm{check}[k+9] = \text{initial value}$
Therefore $k = 1$, and the subscript $I_{\text{primary}}$ of the node "primary" is 10;
It follows that the base value of the node "drawing" is 1 and the check value of the direct child node "primary" is 7.
Similarly, for the third-layer nodes "court of a feudal ruler" and "primary" in the node set, only the node "primary" has a direct child node, "people". Since the subscript of "primary" is 10 and the number of "people" is 10, the same calculation as above (not repeated here) gives a base value of 1 for the node "primary" and a check value of 10 for the direct child node "people"; the results are shown in Table 6.
Table 6
S410: for each node at the end of a pattern keyword whose base value is the initial value, change the base value to the negative of the node's subscript; for each node at the end of a pattern keyword whose base value is not the initial value, change the base value to the negative of that base value.
Specifically, according to the results shown in Table 6, the nodes at the end of the pattern keywords, "", "glue", "and", "court of a feudal ruler" and "people", have subscripts 1, 6, 8, 9 and 11, respectively, and their base values are all the initial value. The base values of these nodes are therefore changed to the negatives of their subscripts, i.e. the modified base values of the nodes "", "glue", "and", "court of a feudal ruler" and "people" are -1, -6, -8, -9 and -11, respectively;
Meanwhile, the node "primary", which is also at the end of a pattern keyword, has subscript 10 and base value 1. Its base value is therefore changed to the negative of that base value, i.e. the base value of the node "primary" becomes -1, as shown in Table 7.
Table 7
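The whole of steps S403 to S410 can be sketched in a few lines and checked against the values stated in this embodiment. This is an illustrative sketch, not the patented implementation: characters are represented by their numbers from Table 2, nodes by their prefix tuples, and the name `build_double_array` is hypothetical. The source tests that base and check are both at the initial value; the sketch tracks the assigned subscripts in a set instead, which also protects first-layer nodes whose values are still 0 (an added safeguard). It assumes `length` is large enough, as the fixed length 19 is here.

```python
def build_double_array(keywords, length):
    """keywords: sequences of character numbers.
    Returns 1-indexed base and check lists (slot 0 is unused)."""
    base = [0] * (length + 1)
    check = [0] * (length + 1)
    children = {}
    for word in keywords:
        for i in range(len(word)):
            children.setdefault(tuple(word[:i]), set()).add(word[i])

    # First layer: the subscript of a node is its character number.
    index = {(a,): a for a in children.get((), ())}
    taken = set(index.values())
    layer = sorted(index)
    while layer:
        parents = [node for node in layer if node in children]
        # Most direct child nodes first; the sort is stable for ties.
        parents.sort(key=lambda node: len(children[node]), reverse=True)
        layer = []
        for node in parents:
            numbers = sorted(children[node])
            k = 1
            while any(k + a in taken for a in numbers):
                k += 1
            base[index[node]] = k              # base value = k
            for a in numbers:
                check[k + a] = index[node]     # check value = parent subscript
                index[node + (a,)] = k + a
                taken.add(k + a)
                layer.append(node + (a,))

    # Mark keyword ends with negative base values.
    for word in {tuple(w) for w in keywords}:
        i = index[word]
        base[i] = -i if base[i] == 0 else -base[i]
    return base, check


# "", Egypt, donkey-hide gelatin, Argentina, Arab, Arabic as number sequences
KEYWORDS = [(1,), (3, 7), (2, 5), (2, 4, 8), (2, 6, 9), (2, 6, 9, 10)]
base, check = build_double_array(KEYWORDS, 19)
print("base ", base[1:12])
print("check", check[1:12])
```

Only subscripts 1 to 11 of the 19 slots are used in this example; the remaining slots stay at the initial value.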
Embodiment two:
Taking the simple dictionary tree shown in Fig. 6a, built from the two pattern keywords to be matched, "Slender West Lake" and "West Lake", as an example, building the failure pointers and generating the AC automaton, as shown in Fig. 7, may include the following steps:
S501: point the failure pointers of the first-layer nodes connected to the root node root at the root node root;
S502: from the second-layer nodes down to the leaf nodes, for each node, trace back from the node pointed to by the failure pointer of the node's parent and determine whether a Chinese character identical to that of the node exists; if so, perform step S503; if not, perform step S504;
S503: point the failure pointer of the node at the node that has the same Chinese character;
S504: point the failure pointer of the node at the root node root.
Specifically, the simple dictionary tree shown in Fig. 6a contains two branches, "Slender West Lake" and "West Lake". First, the failure pointers of the first-layer nodes "thin" and "west", which are connected to the root node root, are pointed at the root node root. Next, for the second-layer nodes "west" and "lake": tracing back from the failure pointer of "thin", the parent of the left node "west", the corresponding node of the second branch of the root node root is found to be "west" as well, so the failure pointer of the left node "west" is pointed at the right node "west". For the right second-layer node "lake", however, tracing back from the failure pointer of its parent "west" finds no node connected to the root node root whose Chinese character is "lake", so the failure pointer of the right second-layer node "lake" is pointed at the root node root. For the left third-layer node "lake", the same method is applied: tracing back from the failure pointer of its parent "west" finds the identical Chinese character "lake", so the failure pointer of the left third-layer node "lake" is pointed at the right node "lake". The result is shown in Fig. 6b.
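Steps S501 to S504 correspond to the usual breadth-first construction of Aho-Corasick failure pointers, which can be sketched as follows. The sketch works on a plain dictionary tree rather than on the double array, identifies each node by its prefix tuple, and uses the English glosses "thin", "west" and "lake" as stand-ins for the Chinese characters; the name `build_failure_pointers` is hypothetical.

```python
from collections import deque


def build_failure_pointers(keywords):
    """Nodes are prefixes (tuples); () is the root.
    Returns (children, fail), where fail maps a node to its failure target."""
    children = {(): {}}
    for word in keywords:
        node = ()
        for ch in word:
            nxt = node + (ch,)
            children[node][ch] = nxt
            children.setdefault(nxt, {})
            node = nxt

    fail = {}
    queue = deque()
    for child in children[()].values():   # S501: first layer points at root
        fail[child] = ()
        queue.append(child)
    while queue:                           # breadth-first, i.e. layer by layer
        node = queue.popleft()
        for ch, child in children[node].items():
            f = fail[node]                 # S502: start at the parent's failure pointer
            while f != () and ch not in children[f]:
                f = fail[f]
            # S503 if a node with the same character exists there, else S504
            fail[child] = children[f].get(ch, ())
            queue.append(child)
    return children, fail


_, fail = build_failure_pointers([("thin", "west", "lake"), ("west", "lake")])
for node, target in fail.items():
    print(node, "->", target or "root")
```

The printed targets match the description above: the left "west" points at the right "west", the left "lake" points at the right "lake", and every other node points at the root.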
Embodiment three:
Taking the text to be detected, "Slender West Lake is not the thin West Lake", and the AC automaton built from the two pattern keywords "Slender West Lake" and "West Lake" as an example, Chinese multi-pattern matching, as shown in Fig. 8, may include the following steps:
S601: using the AC automaton that has been built, scan the text to be detected starting from the root node root;
S602: determine whether the current Chinese character of the scanned text to be detected matches the current Chinese character in the AC automaton; if so, perform step S603; if not, perform step S606;
S603: determine whether the node of the current Chinese character in the AC automaton is a leaf node; if so, perform step S604; if not, perform step S605;
S604: record all the Chinese characters matched on the path from the root node root to the current node;
S605: continue with the next Chinese character of the text to be detected and return to step S602;
S606: continue matching at the node pointed to by the failure pointer of the node of the current Chinese character;
S607: when the failure pointer points at the root node root, skip the current Chinese character of the scanned text to be detected, scan the next Chinese character, and return to step S602.
Specifically, for the text to be detected, "Slender West Lake is not the thin West Lake", and the AC automaton built from the two pattern keywords "Slender West Lake" and "West Lake", matching starts from the root node root. The node identical to the first Chinese character of the text to be detected, "thin", is found, and matching against the text to be detected continues along the direct child nodes of that node, following path 1-2-3 shown in Fig. 9a and finding "thin", "west" and "lake". "lake" turns out to be a leaf node, so all the Chinese characters matched on the path from the root node root to the leaf node are recorded as "Slender West Lake" and "West Lake". Matching then continues with the character "no" in the text to be detected. The leaf node "lake" has no direct child nodes, so the root node root is reached through the failure pointers, following path 4-5 shown in Fig. 9b; no node with the Chinese character "no" exists below the root node root either, so the match fails. The character "no" is skipped and matching continues with the following character "Yes"; no node with the Chinese character "Yes" exists below the root node root, so "Yes" is skipped as well and matching continues with the following character "thin". A branch containing "thin" is found below the root node root, so matching proceeds down that branch, following path 6 shown in Fig. 9c. When the character "" after "thin" is matched, the node "thin" is found to have no "" branch below it, so the match fails and the root node root is reached through the failure pointer of the node "thin", following path 7 shown in Fig. 9d. No branch with the Chinese character "" is found at the root node root either, so "" is skipped and matching continues with the character "west"; a branch with the Chinese character "west" is found at the root node root, following path 8 shown in Fig. 9e. Matching continues along the AC automaton with the character "lake", which is found below the node "west", following path 9 shown in Fig. 9f. The node "lake" is found to be a leaf node, so all the Chinese characters matched on the path from the root node root to the leaf node are recorded as "West Lake". At this point matching of the text to be detected, "Slender West Lake is not the thin West Lake", is complete. The text to be detected needs to be scanned only once to match the two words "Slender West Lake" and "West Lake", and the numbers of times "Slender West Lake" and "West Lake" occur in the text to be detected are counted as 1 and 2, respectively.
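The single-pass scan and the occurrence counts of this embodiment can be reproduced with a compact Aho-Corasick matcher, sketched below. It is a generic textbook formulation on a plain dictionary tree, not the double-array automaton of the patent; the name `count_keywords` is hypothetical, the English glosses stand in for the Chinese characters, and `"?"` is a placeholder for the character that the extracted text renders as "".

```python
from collections import Counter, deque


def count_keywords(keywords, text):
    """Scan text once and count the occurrences of every keyword."""
    goto, fail, out = [{}], [0], [[]]          # node 0 is the root
    for word in keywords:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(tuple(word))

    queue = deque(goto[0].values())            # first layer: fail stays at root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]  # inherit shorter keywords
            queue.append(child)

    counts = Counter()
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]                  # S606: follow failure pointers
        node = goto[node].get(ch, 0)           # no branch at root: skip (S607)
        counts.update(out[node])               # S604: record matched keywords
    return counts


SLENDER_WEST_LAKE = ("thin", "west", "lake")
WEST_LAKE = ("west", "lake")
TEXT = ["thin", "west", "lake", "no", "yes", "thin", "?", "west", "lake"]
print(count_keywords([SLENDER_WEST_LAKE, WEST_LAKE], TEXT))
```

Merging the output list of the failure target into each node is what lets the first "lake" report both "Slender West Lake" and "West Lake" in one step.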
Based on the same inventive concept, the embodiment of the present invention further provides a construction device for an AC automaton. Because the principle by which the device solves the problem is similar to that of the foregoing construction method of the AC automaton, the implementation of the device may refer to the implementation of the method; repeated details are omitted.
Specifically, the construction device for an AC automaton provided by the embodiment of the present invention, as shown in Fig. 10, may specifically include:
a determining module 10, configured to determine, from all the Chinese characters contained in all the obtained pattern keywords, a node set with a layer structure formed by all the Chinese characters;
an establishing module 20, configured to determine, from the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes that the nodes in each layer contain, the offset base values and relation check values of the nodes contained in each layer, and to establish a double-array trie formed by the base values and check values;
a generation module 30, configured to establish a failure pointer for each node in the double-array trie and generate the AC automaton.
Further, in the above construction device of the AC automaton provided by the embodiment of the present invention, the establishing module 20 is specifically configured to: number all the Chinese characters contained in all the pattern keywords according to a specified encoding; build, from the number of nodes in the node set, an initialized base array and an initialized check array of a set length; and, in order from the first-layer nodes connected to the root node down to the leaf nodes, perform the following process on each layer of nodes in the node set: determine whether any node in the same layer has a direct child node; when it is determined that some node in the layer has a direct child node, for the nodes in the layer that have direct child nodes, in descending order of the number of direct child nodes they contain, determine the base value of each node and the check value of each of its direct child nodes from the Chinese character number of each node and the Chinese character numbers of its direct child nodes; and when it is determined that no node in the layer has a direct child node, set the base values of the nodes at the end of each pattern keyword to negative values.
Further, in the above construction device of the AC automaton provided by the embodiment of the present invention, the establishing module 20 is specifically configured to determine the subscript of each node from its Chinese character number, and to determine, from the subscript of each node and the Chinese character numbers of its direct child nodes, the base value of each node and the check value of each of its direct child nodes.
Further, in the above construction device of the AC automaton provided by the embodiment of the present invention, the establishing module 20 is specifically configured to take, for each node in the first layer, the Chinese character number of the node as its subscript, and to determine, for the direct child nodes in the other layers that belong to the same node, that the subscript $I$ of each direct child node and its Chinese character number $a_n$ satisfy the following relation:
$I = k + a_n$
where $k$ is the smallest positive integer that satisfies the following condition:
$\mathrm{base}[k+a_1] = \mathrm{base}[k+a_2] = \dots = \mathrm{base}[k+a_j] = \mathrm{check}[k+a_1] = \mathrm{check}[k+a_2] = \dots = \mathrm{check}[k+a_j] = \text{initial value}$
where $n$ indexes the direct child nodes, $n = 1, 2, \dots, j$, and $j$ is the number of direct child nodes.
Further, in the above construction device of the AC automaton provided by the embodiment of the present invention, the establishing module 20 is specifically configured to take the value k as the base value of each node and to take the subscript of each node as the check value of each of its direct child nodes.
Further, in the above construction device of the AC automaton provided by the embodiment of the present invention, the establishing module 20 is specifically configured to change the base value of a node at the end of a pattern keyword to the negative of the node's subscript when that base value is the initial value, and to change it to the negative of that base value when it is not the initial value.
Further, in the above construction device of the AC automaton provided by the embodiment of the present invention, the establishing module 20 is specifically configured to number all the Chinese characters contained in all the pattern keywords using any of the following encodings: the Chinese Character Set Code for Information Interchange (GB2312), the Chinese Internal Code Specification (GBK), Big5 (BIG5), or the 8-bit Unicode Transformation Format (UTF-8).
Based on the same inventive concept, the embodiment of the present invention further provides a Chinese multi-pattern matching device. Because the principle by which the device solves the problem is similar to that of the foregoing Chinese multi-pattern matching method, the implementation of the device may refer to the implementation of the method; repeated details are omitted.
Specifically, the Chinese multi-pattern matching device provided by the embodiment of the present invention, as shown in Fig. 11, may specifically include:
a scan module 100, configured to scan the text to be processed from beginning to end using the above AC automaton provided by the embodiment of the present invention;
a statistical module 200, configured to count the number of times each pattern keyword occurs in the text to be processed.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, the device (system) and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
The construction method of an AC automaton, its device, the AC automaton, the Chinese multi-pattern matching method and its device provided by the embodiments of the present invention determine, from all the Chinese characters contained in all the obtained pattern keywords, a node set with a layer structure formed by all the Chinese characters; determine, from the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes that the nodes in each layer contain, the offset base values and relation check values of the nodes contained in each layer, and establish a double-array trie formed by the base values and check values; and establish a failure pointer for each node in the double-array trie to generate the AC automaton. Thus, in building the AC automaton, first, when the double-array trie formed by the base values and check values is established, a sorting strategy is used: within each layer, in descending order of the number of direct child nodes contained, the base value of the node with the most direct child nodes and the check values of its direct child nodes are determined first, which can significantly reduce conflicts when searching for base values, keep the arrays from growing too fast, reduce data sparsity, and improve memory utilization. Second, failure pointers are established for each node of the double-array trie to generate the AC automaton, which optimizes space utilization to a large extent.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (17)

  1. A method for constructing a multi-pattern matching AC automaton, characterized by comprising:
    determining, according to all the Chinese characters contained in all the acquired pattern keywords, a node set with a layer structure composed of all those Chinese characters;
    determining, according to the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes of the nodes in each layer, the offset (base) value and the relation (check) value corresponding to the nodes contained in each layer, and establishing a double-array trie composed of the base values and the check values;
    establishing a failure pointer for each node in the double-array trie to generate the AC automaton.
  2. The construction method according to claim 1, characterized in that determining, according to the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes of the nodes in each layer, the base value and the check value corresponding to the nodes contained in each layer specifically comprises:
    numbering all the Chinese characters contained in all the pattern keywords according to a specified encoding;
    constructing, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length;
    performing the following process on the nodes of each layer in the node set, in order from the first-layer nodes connected to the root node down to the leaf nodes:
    determining whether any node in the layer has a direct child node;
    when it is determined that a node in the layer has a direct child node, for the nodes in that layer that have direct child nodes, determining, in descending order of the number of direct child nodes of the nodes in that layer and according to the Chinese character number of each node and the Chinese character numbers of its direct child nodes, the base value corresponding to each node and the check value corresponding to each of its direct child nodes;
    when it is determined that no node in the layer has a direct child node, setting the base values corresponding to the nodes located at the ends of all the pattern keywords to negative values.
  3. The construction method according to claim 2, characterized in that determining, according to the Chinese character number of each node and the Chinese character numbers of its direct child nodes, the base value corresponding to each node and the check value corresponding to each of its direct child nodes specifically comprises:
    determining the subscript corresponding to each node according to the Chinese character number of the node;
    determining, according to the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to each node and the check value corresponding to each of its direct child nodes.
  4. The construction method according to claim 3, characterized in that determining the subscript corresponding to each node according to the Chinese character number of the node specifically comprises:
    for each node among the first-layer nodes, determining that the Chinese character number of the node is its subscript;
    for the direct child nodes in the other layers that belong to the same node, determining that the subscript I of each direct child node and its Chinese character number a_n satisfy the following relation:
    I = k + a_n
    and determining that k is the smallest positive integer satisfying the following condition:
    base[k + a_1] = base[k + a_2] = ... = base[k + a_j]
    = check[k + a_1] = check[k + a_2] = ... = check[k + a_j] = initial value
    where j is the number of direct child nodes and n = 1, 2, ..., j indexes them.
  5. The construction method according to claim 4, characterized in that determining, according to the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to each node and the check value corresponding to each of its direct child nodes specifically comprises:
    determining that the base value corresponding to each node is the value k;
    determining that the check value corresponding to each direct child node of a node is the subscript of that node.
  6. The construction method according to any one of claims 2-5, characterized in that setting the base values corresponding to the nodes located at the ends of all the pattern keywords to negative values specifically comprises:
    for a node located at the end of a pattern keyword whose base value is the initial value, changing that base value to the negative of the node's subscript;
    for a node located at the end of a pattern keyword whose base value is not the initial value, changing that base value to the negative of the node's base value.
  7. The construction method according to any one of claims 2-5, characterized in that numbering all the Chinese characters contained in all the pattern keywords according to a specified encoding specifically comprises:
    numbering all the Chinese characters contained in all the pattern keywords using one of the following: the Chinese Coded Character Set for Information Interchange GB2312, the Chinese Internal Code Specification GBK, the Big5 code BIG5, or the 8-bit Unicode Transformation Format UTF-8.
  8. An AC automaton, characterized in that it is constructed using the method for constructing an AC automaton according to any one of claims 1-7.
  9. A Chinese multi-pattern matching method, characterized by comprising:
    scanning the text to be processed from beginning to end using the AC automaton according to claim 8, and counting the number of occurrences of all the pattern keywords in the text to be processed.
  10. A device for constructing an AC automaton, characterized by comprising:
    a determining module, configured to determine, according to all the Chinese characters contained in all the acquired pattern keywords, a node set with a layer structure composed of all those Chinese characters;
    an establishing module, configured to determine, according to the nodes contained in each layer of the determined node set and in descending order of the number of direct child nodes of the nodes in each layer, the offset (base) value and the relation (check) value corresponding to the nodes contained in each layer, and to establish a double-array trie composed of the base values and the check values;
    a generating module, configured to establish a failure pointer for each node in the double-array trie to generate the AC automaton.
  11. The construction device according to claim 10, characterized in that the establishing module is specifically configured to: number all the Chinese characters contained in all the pattern keywords according to a specified encoding; construct, according to the number of nodes in the node set, an initialized base array and an initialized check array of a set length; and perform the following process on the nodes of each layer in the node set, in order from the first-layer nodes connected to the root node down to the leaf nodes: determine whether any node in the layer has a direct child node; when it is determined that a node in the layer has a direct child node, for the nodes in that layer that have direct child nodes, determine, in descending order of the number of direct child nodes of the nodes in that layer and according to the Chinese character number of each node and the Chinese character numbers of its direct child nodes, the base value corresponding to each node and the check value corresponding to each of its direct child nodes; and when it is determined that no node in the layer has a direct child node, set the base values corresponding to the nodes located at the ends of all the pattern keywords to negative values.
  12. The construction device according to claim 11, characterized in that the establishing module is specifically configured to: determine the subscript corresponding to each node according to the Chinese character number of the node; and determine, according to the subscript corresponding to each node and the Chinese character numbers of its direct child nodes, the base value corresponding to each node and the check value corresponding to each of its direct child nodes.
  13. The construction device according to claim 12, characterized in that the establishing module is specifically configured to: for each node among the first-layer nodes, determine that the Chinese character number of the node is its subscript; and for the direct child nodes in the other layers that belong to the same node, determine that the subscript I of each direct child node and its Chinese character number a_n satisfy the following relation:
    I = k + a_n
    and determine that k is the smallest positive integer satisfying the following condition:
    base[k + a_1] = base[k + a_2] = ... = base[k + a_j]
    = check[k + a_1] = check[k + a_2] = ... = check[k + a_j] = initial value
    where j is the number of direct child nodes and n = 1, 2, ..., j indexes them.
  14. The construction device according to claim 13, characterized in that the establishing module is specifically configured to: determine that the base value corresponding to each node is the value k; and determine that the check value corresponding to each direct child node of a node is the subscript of that node.
  15. The construction device according to any one of claims 11-14, characterized in that the establishing module is specifically configured to: for a node located at the end of a pattern keyword whose base value is the initial value, change that base value to the negative of the node's subscript; and for a node located at the end of a pattern keyword whose base value is not the initial value, change that base value to the negative of the node's base value.
  16. The construction device according to any one of claims 11-14, characterized in that the establishing module is specifically configured to number all the Chinese characters contained in all the pattern keywords using one of the following: the Chinese Coded Character Set for Information Interchange GB2312, the Chinese Internal Code Specification GBK, the Big5 code BIG5, or the 8-bit Unicode Transformation Format UTF-8.
  17. A Chinese multi-pattern matching device, characterized by comprising:
    a scanning module, configured to scan the text to be processed from beginning to end using the AC automaton according to claim 8;
    a counting module, configured to count the number of occurrences of all the pattern keywords in the text to be processed.
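The failure-pointer, scanning and counting steps recited in the claims can be sketched as follows. This is a simplified illustration, not the claimed implementation: it stores transitions in ordinary dictionaries rather than in base and check arrays, assumes the keywords are distinct, and the names `build_automaton` and `count_keywords` are hypothetical.

```python
from collections import Counter, deque


def build_automaton(keywords):
    """Build goto tables, failure pointers and output lists for the keywords."""
    goto, fail, out = [{}], [0], [[]]
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(word)

    # Breadth-first pass: first-layer nodes fail to the root; deeper nodes follow
    # their parent's failure chain until a node with a matching transition is found.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit keywords ending at the suffix
            queue.append(t)
    return goto, fail, out


def count_keywords(text, keywords):
    """Scan text once from beginning to end and count each keyword's occurrences."""
    goto, fail, out = build_automaton(keywords)
    counts = Counter()
    s = 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        counts.update(out[s])
    return counts
```

For example, `count_keywords("中国人爱中国", ["中国", "中国人", "国人", "人"])` reports overlapping matches as well, because each state inherits the keywords that end at its failure target.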
CN201610943520.5A 2016-11-01 2016-11-01 The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus Pending CN108021569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610943520.5A CN108021569A (en) 2016-11-01 2016-11-01 The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610943520.5A CN108021569A (en) 2016-11-01 2016-11-01 The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus

Publications (1)

Publication Number Publication Date
CN108021569A true CN108021569A (en) 2018-05-11

Family

ID=62070753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610943520.5A Pending CN108021569A (en) 2016-11-01 2016-11-01 The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus

Country Status (1)

Country Link
CN (1) CN108021569A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109524068A (en) * 2018-10-16 2019-03-26 东华大学 A kind of disease symptoms extracting method based on AC automatic machine
CN109933656A (en) * 2019-03-15 2019-06-25 深圳市赛为智能股份有限公司 Public sentiment polarity prediction technique, device, computer equipment and storage medium
CN113065419A (en) * 2021-03-18 2021-07-02 哈尔滨工业大学 Pattern matching algorithm and system based on flow high-frequency content

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786962A (en) * 2005-12-21 2006-06-14 中国科学院计算技术研究所 Method for managing and searching dictionary with perfect even numbers group TRIE Tree
US20070192286A1 (en) * 2004-07-26 2007-08-16 Sourcefire, Inc. Methods and systems for multi-pattern searching
CN102193914A (en) * 2011-05-26 2011-09-21 中国科学院计算技术研究所 Computer aided translation method and system
CN103198079A (en) * 2012-01-06 2013-07-10 北大方正集团有限公司 Related search implementation method and device
CN105183788A (en) * 2015-08-20 2015-12-23 及时标讯网络信息技术(北京)有限公司 Operation method for Chinese AC automatic machine based on retrieval of keyword dictionary tree

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070192286A1 (en) * 2004-07-26 2007-08-16 Sourcefire, Inc. Methods and systems for multi-pattern searching
CN1786962A (en) * 2005-12-21 2006-06-14 中国科学院计算技术研究所 Method for managing and searching dictionary with perfect even numbers group TRIE Tree
CN102193914A (en) * 2011-05-26 2011-09-21 中国科学院计算技术研究所 Computer aided translation method and system
CN103198079A (en) * 2012-01-06 2013-07-10 北大方正集团有限公司 Related search implementation method and device
CN105183788A (en) * 2015-08-20 2015-12-23 及时标讯网络信息技术(北京)有限公司 Operation method for Chinese AC automatic machine based on retrieval of keyword dictionary tree

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109524068A (en) * 2018-10-16 2019-03-26 东华大学 A kind of disease symptoms extracting method based on AC automatic machine
CN109933656A (en) * 2019-03-15 2019-06-25 深圳市赛为智能股份有限公司 Public sentiment polarity prediction technique, device, computer equipment and storage medium
WO2020186627A1 (en) * 2019-03-15 2020-09-24 深圳市赛为智能股份有限公司 Public opinion polarity prediction method and apparatus, computer device, and storage medium
CN109933656B (en) * 2019-03-15 2023-08-15 深圳市赛为智能股份有限公司 Public opinion polarity prediction method, public opinion polarity prediction device, computer equipment and storage medium
CN113065419A (en) * 2021-03-18 2021-07-02 哈尔滨工业大学 Pattern matching algorithm and system based on flow high-frequency content
CN113065419B (en) * 2021-03-18 2022-05-24 哈尔滨工业大学 Pattern matching algorithm and system based on flow high-frequency content

Similar Documents

Publication Publication Date Title
Bahmani et al. Efficient distributed locality sensitive hashing
Song et al. RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning
CN104679778B (en) A kind of generation method and device of search result
US9390134B2 (en) Regular expression matching method and system, and searching device
CN104462260B (en) A kind of community search method in social networks based on k- cores
CN108920720A (en) The large-scale image search method accelerated based on depth Hash and GPU
US20170242855A1 (en) Fast, scalable dictionary construction and maintenance
CN103377237B (en) The neighbor search method of high dimensional data and fast approximate image searching method
CN105138647A (en) Travel network cell division method based on Simhash algorithm
CN108021569A (en) The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus
CN103514236A (en) Retrieval condition error correction prompt processing method based on Pinyin in retrieval application
CN102148746A (en) Message classification method and system
EP2544414A1 (en) Method and device for storing routing table entry
Kang et al. Flow rounding
Chen et al. Metric similarity joins using MapReduce
de Berg et al. A framework for ETH-tight algorithms and lower bounds in geometric intersection graphs
Ferragina et al. On the bit-complexity of Lempel--Ziv compression
CN106874425A (en) Real time critical word approximate search algorithm based on Storm
CN107180079A (en) The image search method of index is combined with Hash based on convolutional neural networks and tree
Serra et al. Large-scale sparse structural node representation
CN108228896B (en) A kind of missing data complementing method and device based on density
Ghosh et al. A user-guided innovization-based evolutionary algorithm framework for practical multi-objective optimization problems
CN106940711A (en) A kind of URL detection methods and detection means
CN109408517A (en) Multidimensional search method, apparatus, equipment and the readable storage medium storing program for executing of rule
Eppstein et al. Approximate greedy clustering and distance selection for graph metrics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180511

RJ01 Rejection of invention patent application after publication