CN108304384A - Word-breaking method and apparatus - Google Patents
Word-breaking method and apparatus
- Publication number
- CN108304384A (CN201810086623.3A)
- Authority
- CN
- China
- Prior art keywords
- permutation
- word
- combination
- combination word
- masterplate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The object of the present invention is to provide a word-breaking method and apparatus. In the invention, each node of a dictionary tree stores one word. A template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node. Each obtained permutation-combination word is matched against the template permutation-combination words in the dictionary tree, and the consistently matching template permutation-combination words are obtained, so that the template combination words corresponding to the commodity description information can be extracted from the dictionary tree quickly and accurately.
Description
Technical field
The present invention relates to the computer field, and in particular to a word-breaking method and apparatus.
Background art
Existing word-breaking schemes are slow and not accurate enough.
Summary of the invention
An object of the present invention is to provide a word-breaking method and apparatus that solve the problems of existing word-breaking schemes, namely slow word-breaking speed and insufficient accuracy.
According to one aspect of the invention, a word-breaking method is provided, the method comprising:
obtaining a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation-combination words from the single words obtained by splitting, each permutation-combination word comprising one or more of the words arranged in one of one or more orders;
matching each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the consistently matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node.
Further, in the above method, the dictionary tree is a double-array dictionary tree (Double-Array Trie).
Further, in the above method, before matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree, the method further comprises:
obtaining a dictionary in which template permutation-combination words are recorded;
storing each template permutation-combination word of the dictionary, in lexicographic order, into the corresponding branch of the dictionary tree.
Further, in the above method, storing each template permutation-combination word of the dictionary into the corresponding branch of the dictionary tree comprises:
if the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, choosing only one template permutation-combination word from each group and storing it in the dictionary tree as the main permutation-combination word, the template permutation-combination words of the group that were not chosen being the secondary permutation-combination words;
establishing the correspondence between the main permutation-combination word and the secondary permutation-combination words.
Further, in the above method, matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the consistently matching permutation-combination words comprises:
according to the correspondence between main and secondary permutation-combination words, retaining the main permutation-combination words among the obtained permutation-combination words and deleting the corresponding secondary permutation-combination words, to obtain the filtered main permutation-combination words;
matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the consistently matching main permutation-combination words;
according to the correspondence between main and secondary permutation-combination words, obtaining the secondary permutation-combination words that consistently match the dictionary tree.
Further, in the above method, matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the consistently matching main permutation-combination words comprises:
grouping the main permutation-combination words that share the same first word into one group, and performing the following iteration on each group of main permutation-combination words:
searching the dictionary tree for a node that starts with the first word of the main permutation-combination words of the current group;
if there is none, moving on to the next group of main permutation-combination words and restarting the iteration;
if there is one, searching the dictionary tree, starting from the node of the current first word, for the main permutation-combination words that consistently match those of the group, and adding them to the result set.
Further, in the above method, after matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree and obtaining the consistently matching permutation-combination words, the method further comprises:
setting the correspondence between each template permutation-combination word and a search-volume score;
determining, according to the correspondence, the scores of the consistently matching permutation-combination words.
Further, in the above method, when the template permutation-combination words are prohibited words, after matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree and obtaining the consistently matching permutation-combination words, the method further comprises:
deleting the consistently matching permutation-combination words from the phrase input by the user.
According to another aspect of the present invention, a word-breaking device is also provided, the device comprising:
an acquisition device, for obtaining a phrase input by a user;
a splitting device, for splitting the phrase into single words;
a combining device, for obtaining a plurality of permutation-combination words from the single words obtained by splitting, each permutation-combination word comprising one or more of the words arranged in one of one or more orders;
a matching device, for matching each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the consistently matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node.
Further, in the above device, the dictionary tree is a double-array dictionary tree (Double-Array Trie).
Further, the above device comprises a dictionary tree generating device, for obtaining a dictionary in which template permutation-combination words are recorded, and for storing each template permutation-combination word of the dictionary, in lexicographic order, into the corresponding branch of the dictionary tree.
Further, in the above device, the dictionary tree generating device is configured, if the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, to choose only one template permutation-combination word from each group and store it in the dictionary tree as the main permutation-combination word, the template permutation-combination words of the group that were not chosen being the secondary permutation-combination words; and to establish the correspondence between the main permutation-combination word and the secondary permutation-combination words.
Further, in the above device, the matching device is configured to retain, according to the correspondence between main and secondary permutation-combination words, the main permutation-combination words among the obtained permutation-combination words and delete the corresponding secondary permutation-combination words, obtaining the filtered main permutation-combination words; to match each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree, obtaining the consistently matching main permutation-combination words; and to obtain, according to the correspondence between main and secondary permutation-combination words, the secondary permutation-combination words that consistently match the dictionary tree.
Further, in the above device, the matching device is configured to group the main permutation-combination words that share the same first word into one group and to perform the following iteration on each group: searching the dictionary tree for a node that starts with the first word of the main permutation-combination words of the current group; if there is none, moving on to the next group of main permutation-combination words and restarting the iteration; if there is one, searching the dictionary tree, starting from the node of the current first word, for the main permutation-combination words that consistently match those of the group, and adding them to the result set.
Further, the above device also comprises a scoring device, for setting, after the consistently matching permutation-combination words are obtained, the correspondence between each template permutation-combination word and a search-volume score, and for determining, according to the correspondence, the scores of the consistently matching permutation-combination words.
Further, the above device also comprises a deleting device, for deleting, when the template permutation-combination words are prohibited words and after the consistently matching permutation-combination words are obtained, the consistently matching permutation-combination words from the phrase input by the user.
According to another aspect of the application, a computing-based device is also provided, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain a phrase input by a user;
split the phrase into single words;
obtain a plurality of permutation-combination words from the single words obtained by splitting, each permutation-combination word comprising one or more of the words arranged in one of one or more orders;
match each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the consistently matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node.
According to another aspect of the application, a computer-readable storage medium is also provided, on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
obtain a phrase input by a user;
split the phrase into single words;
obtain a plurality of permutation-combination words from the single words obtained by splitting, each permutation-combination word comprising one or more of the words arranged in one of one or more orders;
match each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the consistently matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node.
Compared with the prior art, in the present invention each node of the dictionary tree stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node. Each obtained permutation-combination word is matched against the template permutation-combination words in the dictionary tree, and the consistently matching template permutation-combination words are obtained, so that the template combination words corresponding to the commodity description information can be extracted from the dictionary tree quickly and accurately.
Description of the drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawing:
Fig. 1 shows a flow chart of a word-breaking method according to an embodiment of the invention.
The same or similar reference numerals in the drawing denote the same or similar components.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawing.
In a typical configuration of this application, the terminal, the device of the service network and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include, among computer-readable media, volatile memory, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As shown in Fig. 1, a word-breaking method comprises:
Step S1, obtaining a phrase input by a user, for example abc;
Step S2, splitting the phrase into single words, for example splitting the above abc into a, b, c;
Step S3, obtaining a plurality of permutation-combination words from the single words obtained by splitting, each permutation-combination word comprising one or more of the words arranged in one of one or more orders; for example, the permutation-combination words that can be formed from the above a, b, c are a, b, c, ab, ac, bc, abc, ba, ca, cb, acb, bac, bca, cab, cba;
Step S6, matching each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the consistently matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node; for example, a, b, c, ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, cba each correspond to a branch in the dictionary tree.
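Steps S1 to S3 can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function name permutation_words is hypothetical, and the sketch assumes that each character of the phrase counts as one "single word".

```python
from itertools import permutations


def permutation_words(phrase):
    """Return every ordered arrangement of 1..n of the phrase's characters.

    Assumption: one character == one "single word" (step S2).
    """
    units = list(phrase)                      # step S2: split into single words
    words = []
    for k in range(1, len(units) + 1):        # step S3: lengths 1..n
        for arrangement in permutations(units, k):
            words.append("".join(arrangement))
    return words


words = permutation_words("abc")
print(len(words))  # 3 + 6 + 6 arrangements
```

The order in which the words are produced differs from the listing in step S3, but the set of words is the same. If a phrase contains repeated characters, this sketch produces duplicate words, which a caller may want to remove.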
Here, the dictionary tree, also known as a word lookup tree or Trie, is a tree structure and a variant of the hash tree. Its advantage is that it uses the common prefixes of words to reduce query time and minimizes unnecessary word comparisons, so that its query efficiency is higher than that of a hash tree.
In this embodiment, each node of the dictionary tree stores one word, and a template permutation-combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes at adjacent levels of the same branch, taken in order from the upper-level node to the lower-level node. By matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree, the consistently matching template permutation-combination words can be obtained quickly and accurately. For example, the permutation-combination words to be matched are: a, b, c, ab, ac, bc, abc, ba, ca, cb, acb, bac, bca, cab, cba,
and the template permutation-combination words in the dictionary tree are: ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, cba.
The matched template permutation-combination words, that is, the words present both among the permutation-combination words to be matched and among the template permutation-combination words in the dictionary tree, are then: ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, cba. The template combination words corresponding to the commodity description information can thus be extracted from the dictionary tree quickly and accurately.
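The matching in step S6 can be illustrated with an ordinary pointer-based dictionary tree. The sketch below is an assumption-laden illustration: the class names TrieNode and Trie are hypothetical, each node holds one character, and the word lists are taken from the example above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one child per next character
        self.is_word = False  # True if the path from the root spells a template word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


templates = ["ab", "ac", "bc", "abc", "ba", "ca", "acb", "bac", "bca", "cab", "cba"]
candidates = ["a", "b", "c", "ab", "ac", "bc", "abc", "ba", "ca", "cb",
              "acb", "bac", "bca", "cab", "cba"]

trie = Trie(templates)
matched = [w for w in candidates if trie.contains(w)]
print(matched)
```

Words that share a prefix, such as ab, abc and acb, share the nodes of that prefix, which is where the saving in comparisons comes from.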
In one embodiment of the word-breaking method of the present invention, the dictionary tree is a double-array dictionary tree. The double-array dictionary tree (Double-Array Trie) is a compressed form of the dictionary tree (Trie) structure that represents the Trie with only two linear arrays. The structure combines the high retrieval efficiency of the digital search tree (Digital Search Tree) with the compact space usage of a linked Trie representation. In essence, a double-array Trie is a deterministic finite automaton (DFA): each node represents a state of the automaton, state transitions are made according to the input variable, and a query is complete when an end state is reached or no transition is possible. The relations between the characters contained in all keys of the double array are expressed by simple addition, which not only increases retrieval speed but also eliminates the large number of pointers used in a linked structure, saving storage space.
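The two-array idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the structure used by the patent: the names build_double_array and da_contains are hypothetical, end states are kept in a separate set rather than encoded in the arrays, and the base values are found by a naive linear search. A transition from state s on a character with code c goes to t = base[s] + c and is valid only if check[t] == s.

```python
END = object()  # sentinel key marking "a word ends here" in the temporary trie


def build_double_array(words):
    # 1. Build an ordinary nested-dict trie.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    # 2. Give every character a positive integer code.
    codes = {ch: i + 1 for i, ch in enumerate(sorted({ch for w in words for ch in w}))}
    base, check, terminal = [0], [0], set()   # state 0 is the root

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)                  # -1 marks a free slot

    # 3. Assign states breadth-first.
    queue = [(root, 0)]
    while queue:
        node, state = queue.pop(0)
        if END in node:
            terminal.add(state)
        kids = [(codes[ch], child) for ch, child in node.items() if ch is not END]
        if not kids:
            continue
        b = 1
        while True:                           # smallest base with all target slots free
            ensure(b + max(c for c, _ in kids) + 1)
            if all(check[b + c] == -1 for c, _ in kids):
                break
            b += 1
        base[state] = b
        for c, child in kids:
            check[b + c] = state
            queue.append((child, b + c))
    return base, check, terminal, codes


def da_contains(da, word):
    base, check, terminal, codes = da
    state = 0
    for ch in word:
        c = codes.get(ch)
        if c is None:
            return False
        t = base[state] + c                   # transition by simple addition
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return state in terminal


da = build_double_array(["ab", "abc", "ba"])
print(da_contains(da, "abc"), da_contains(da, "bac"))
```

Lookup needs only integer additions and array reads, with no child pointers to follow, which is the property the paragraph above describes.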
As shown in Fig. 1, in one embodiment of the word-breaking method of the present invention, before step S6, matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree, the method further comprises:
Step S4, obtaining a dictionary in which template permutation-combination words are recorded;
Step S5, storing each template permutation-combination word of the dictionary, in lexicographic order, into the corresponding branch of the dictionary tree.
Here, by converting the dictionary into a dictionary tree, the matching template permutation-combination words can subsequently be obtained more quickly by querying the dictionary tree.
In one embodiment of the word-breaking method of the present invention, step S5, storing each template permutation-combination word of the dictionary into the corresponding branch of the dictionary tree, comprises:
Step S51, if the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, choosing only one template permutation-combination word from each group and storing it in the dictionary tree as the main permutation-combination word, the template permutation-combination words of the group that were not chosen being the secondary permutation-combination words; for example, abc, acb and bac form one group of approximate template permutation-combination words, of which abc is stored in the dictionary tree as the main permutation-combination word, while acb and bac, as secondary permutation-combination words, are not stored in the dictionary tree, and abc corresponds to acb and bac;
Step S52, establishing the correspondence between the main permutation-combination word and the secondary permutation-combination words.
Here, because only one template permutation-combination word of each group of approximate template permutation-combination words is stored in the dictionary tree as the main permutation-combination word, the redundancy of the data stored in the dictionary tree is reduced and the matching of each combination against the dictionary tree is accelerated.
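Steps S51 and S52 can be sketched as a grouping of the dictionary by the sorted characters of each entry. The function name group_templates is hypothetical, and picking the alphabetically first member of each group as the main word is an assumption made for this sketch; the text only requires that exactly one word per group be chosen.

```python
from collections import defaultdict


def group_templates(templates):
    """Map each main word to the list of its secondary words."""
    groups = defaultdict(list)
    for word in templates:
        # Words made of the same characters share the same sorted key.
        groups["".join(sorted(word))].append(word)
    mapping = {}
    for members in groups.values():
        main, *secondary = sorted(members)   # assumption: first in sorted order is the main word
        mapping[main] = secondary
    return mapping


mapping = group_templates(["bac", "abc", "acb", "ab"])
print(mapping)
```

Only the keys of the resulting mapping would be stored in the dictionary tree; the values give the correspondence established in step S52.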
In one embodiment of the word-breaking method of the present invention, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, after only one template permutation-combination word of each group has been chosen and stored in the dictionary tree as the main permutation-combination word, the template permutation-combination words not chosen have been taken as the secondary permutation-combination words, and the correspondence between the main and secondary permutation-combination words has been established,
step S6, matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the consistently matching permutation-combination words, comprises:
Step S61, according to the correspondence between main and secondary permutation-combination words, retaining the main permutation-combination words among the obtained permutation-combination words and deleting the corresponding secondary permutation-combination words, to obtain the filtered main permutation-combination words;
Step S62, matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the consistently matching main permutation-combination words;
Step S63, according to the correspondence between main and secondary permutation-combination words, obtaining the secondary permutation-combination words that consistently match the dictionary tree.
For example, suppose the dictionary is as follows:
a;
ab;
abc;
ba;
bac;
abcd.
The phrase input by the user is cba. Splitting the phrase into single words gives A(3,1) + A(3,2) + A(3,3) = 3 + 6 + 6 = 15 possible permutation-combination words, as follows:
a;
b;
c;
ab;
ac;
bc;
ba;
ca;
cb;
abc;
acb;
cab;
bac;
bca;
cba.
The permutation-combination words of cba that occur in the dictionary need to be found. They are:
a;
ab;
abc;
ba;
bac.
In this application, the dictionary may first be sorted by ASCII code. If the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, only one template permutation-combination word is chosen from each group and stored in the dictionary tree as the main permutation-combination word. That is, the original dictionary is de-duplicated to obtain a new dictionary, and the dictionary tree is constructed from the new dictionary, as follows:
a[a];
ab[ab, ba];
abc[abc, bac];
abcd[abcd].
When the phrase cba input by the user is matched against the dictionary tree, cba becomes abc after sorting by ASCII code. Traversing the above dictionary tree with abc yields the main permutation-combination words a, ab and abc. Then, through the correspondence between main and secondary permutation-combination words, a yields one combination, a; ab yields two combinations, ab and ba; and abc yields two combinations, abc and bac.
The final matching result is therefore:
a;
ab;
ba;
abc;
bac.
In one embodiment of the word-breaking method of the present invention, step S62, matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the consistently matching main permutation-combination words, comprises:
grouping the main permutation-combination words that share the same first word into one group, and performing the following iteration on each group of main permutation-combination words:
Step S621, searching the dictionary tree for a node that starts with the first word of the main permutation-combination words of the current group;
if there is none, step S622, moving on to the next group of main permutation-combination words and restarting the iteration;
if there is one, step S623, searching the dictionary tree, starting from the node of the current first word, for the main permutation-combination words that consistently match those of the group, and adding them to the result set.
For example, suppose a function is needed to score a user's commodity title. The commodity title (at most 30 Chinese characters) must be split into characters, and all permutation-combination words that appear in the dictionary (about 3,000,000 entries) must be found. The number of possible permutation-combination words of a commodity title is A(30,1) + A(30,2) + ... + A(30,30), which is a very large amount of data (on the order of 10 to the 33rd power).
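The size of this search space can be checked with a few lines of arithmetic. The sketch below uses math.perm from the Python standard library (available from Python 3.8), where math.perm(n, k) is the number of ordered arrangements A(n, k); the function name arrangement_count is hypothetical.

```python
import math


def arrangement_count(n):
    """A(n,1) + A(n,2) + ... + A(n,n): ordered arrangements of 1..n of n distinct items."""
    return sum(math.perm(n, k) for k in range(1, n + 1))


small = arrangement_count(3)    # the cba example: 3 + 6 + 6
large = arrangement_count(30)   # a 30-character title
print(small, f"{large:.2e}")
```

For n = 30 the sum is roughly e times 30!, which is why enumerating every arrangement and looking each one up is not practical and the dictionary-tree traversal described next is used instead.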
The algorithm is designed as follows:
All template permutation-combination words in the dictionary are stripped of spaces, converted to lower case and de-duplicated to serve as main permutation-combination words; at the same time a map is constructed to store the main permutation-combination words and the secondary permutation-combination words, where one main permutation-combination word may correspond to several secondary permutation-combination words, and a dictionary tree is constructed from all the main permutation-combination words;
After the commodity title has been stripped of spaces, converted to lower case, de-duplicated and sorted by ASCII code, giving for example abcd, the dictionary tree is walked in order to find all main permutation-combination words that may occur. Taking abcd as an example, the steps are:
1.0 Check whether the dictionary tree has a node starting with a. If not, go to 2.0; if so, go to 1.1.
1.1 First determine whether a is a leaf node. If so, go to 1.1.1; otherwise go to 1.2.
1.1.1 Add the word in the leaf node to the result set to be returned, then go to 2.0.
1.2 Among the nodes starting with a, check in turn whether nodes starting with ab, ac, ad ... exist.
1.2.1 Add the consistently matching main permutation-combination words to the result set to be returned; after all nodes starting with a have been matched, go to 2.0.
2.0 Check whether there is a node starting with b ...
After all main permutation-combination words have been found, all the corresponding secondary permutation-combination words are looked up in the map, and the consistently matching main permutation-combination words together with their corresponding secondary permutation-combination words form the final matching result.
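The algorithm above can be sketched end to end as follows. This is one possible reading of the steps, with several assumptions: the helper names (normalize, build_trie, find_main_words, match_title) are hypothetical; the sorted form of each template is used as its main word, as in the a[a], ab[ab, ba] example; repeated characters in the title are dropped; and a word is recorded wherever a path ends on a stored main word, whether or not that node is a leaf.

```python
END = "__end__"  # multi-character key, so it cannot collide with single-character edges


def normalize(text):
    # Strip spaces, lower-case, drop repeated characters, sort by code point.
    return "".join(sorted(set(text.replace(" ", "").lower())))


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_main_words(trie, chars):
    """Walk the trie with the sorted title characters, trying each later
    character as the next step (a, then ab, ac, ad, ..., then b, ...)."""
    found = []

    def walk(node, start, prefix):
        for i in range(start, len(chars)):
            child = node.get(chars[i])
            if child is None:
                continue                     # no branch for this character: prune
            word = prefix + chars[i]
            if END in child:
                found.append(word)
            walk(child, i + 1, word)

    walk(trie, 0, "")
    return found


def match_title(templates, title):
    groups = {}                              # main word -> original template words
    for t in templates:
        groups.setdefault("".join(sorted(t)), []).append(t)
    trie = build_trie(groups)
    result = []
    for main in find_main_words(trie, normalize(title)):
        result.extend(groups[main])          # expand main word to its whole group
    return result


result = match_title(["a", "ab", "abc", "ba", "bac", "abcd"], "cba")
print(result)
```

Because a branch is abandoned as soon as the dictionary tree has no matching child, only arrangements that can still lead to a dictionary entry are explored, rather than all ordered arrangements of the title's characters.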
Here, in this embodiment, the main permutation-combination words that share the same first word are grouped into one group, and the dictionary tree is searched, starting from the node of the current first word, for the main permutation-combination words that consistently match those of the group, which further increases the search speed.
In one embodiment of the word-breaking method of the present invention, after step S6, in which each obtained permutation and combination word is matched against the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further includes:
Setting a correspondence between each template permutation and combination word and a search-volume score;
Determining, according to the correspondence, the score of each matching permutation and combination word.
For example, the correspondence between each template permutation and combination word and its search-volume score is as follows: a: 8 points, b: 1 point, c: 10 points, ab: 7 points, ac: 5 points, bc: 9 points, abc: 4 points, ba: 3 points, ca: 0 points, cb: 2 points;
The matching permutation and combination words are: ab, ac, bc, abc;
The scores of the matching permutation and combination words are: ab: 7 points, ac: 5 points, bc: 9 points, abc: 4 points.
Here, the phrase input by the user may be commodity description information. From the scores of the matching permutation and combination words, or from the sum of those scores, the search volume of the commodity description information can be predicted, and the commodity description information can then be modified so that its search volume meets the expected requirement.
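The scoring example above can be reproduced with a few lines of Python. The score table and the matched words are taken from the example; the function name `score_matches` and the use of a plain dict for the correspondence are assumptions made for illustration.

```python
# Correspondence between template words and search-volume scores (from the example).
scores = {
    "a": 8, "b": 1, "c": 10,
    "ab": 7, "ac": 5, "bc": 9, "abc": 4,
    "ba": 3, "ca": 0, "cb": 2,
}


def score_matches(matched, score_table):
    """Look up the score of each matching permutation and combination word."""
    return {word: score_table[word] for word in matched}


matched = ["ab", "ac", "bc", "abc"]
per_word = score_matches(matched, scores)
total = sum(per_word.values())  # one possible aggregate: the sum of the scores
```

How the per-word scores or their sum map to a predicted search volume is not specified in the text, so the sum is shown here only as one possible aggregate.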
In one embodiment of the word-breaking method of the present invention, when the template permutation and combination words are prohibited words, after step S6, in which each obtained permutation and combination word is matched against the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further includes:
Deleting the matching permutation and combination words from the phrase input by the user.
Here, the phrase input by the user may be commodity description information. Each obtained permutation and combination word is matched against the prohibited words in the dictionary tree to obtain the matching permutation and combination words, which can subsequently be filtered out of the commodity description information so that the information is corrected automatically.
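A minimal Python sketch of the deletion step follows. It assumes that the matched prohibited words are given as tuples of lower-case tokens and that deleting a multi-word prohibited word means deleting each of its tokens from the phrase; both points, like the function name `remove_prohibited` and the sample phrase, are illustrative assumptions rather than details from the source.

```python
def remove_prohibited(phrase, matched_words):
    """Delete every token that belongs to a matched prohibited word."""
    banned = {token for word in matched_words for token in word}
    kept = [t for t in phrase.split() if t.lower() not in banned]
    return " ".join(kept)


cleaned = remove_prohibited("best Replica watch sale", [("replica",)])
```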
According to another aspect of the application, a word-breaking device is also provided, the device including:
An acquisition device, for obtaining a phrase input by a user;
A splitting device, for splitting the phrase into single words;
A combination device, for obtaining multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
A matching device, for matching each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
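The splitting and combination steps can be sketched in Python as below. This is a hedged illustration: splitting on whitespace and the function names `split_phrase` and `permutation_combination_words` are assumptions, and `itertools.permutations` is used as one straightforward way to enumerate every ordering of every subset size.

```python
from itertools import permutations


def split_phrase(phrase):
    """Split the phrase into single words (whitespace splitting is an assumption)."""
    return phrase.split()


def permutation_combination_words(words):
    """Enumerate every ordered arrangement of 1..n of the given words."""
    result = []
    for size in range(1, len(words) + 1):
        result.extend(permutations(words, size))
    return result


words = split_phrase("a b c")
candidates = permutation_combination_words(words)
```

For three words this yields 3 + 6 + 6 = 15 candidates; the count grows factorially with the number of words.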
In one embodiment of the word-breaking device of the present invention, the dictionary tree is a double-array dictionary tree.
In one embodiment of the word-breaking device of the present invention, the device further includes a dictionary tree generating device, for obtaining a dictionary in which template permutation and combination words are recorded, and for storing each template permutation and combination word in the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
In one embodiment of the word-breaking device of the present invention, the dictionary tree generating device is configured so that, if the dictionary contains approximate template permutation and combination words, each group of approximate template permutation and combination words being a group of template permutation and combination words composed of the same words in different orders, only one template permutation and combination word is selected from each group and stored in the dictionary tree as the main permutation and combination word, and the template permutation and combination words in the group that are not selected serve as secondary permutation and combination words; and to establish a correspondence between the main permutation and combination word and its secondary permutation and combination words.
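The separation into main and secondary words can be sketched in Python as follows. The source does not say which member of a group is chosen as the main word; picking the lexicographically smallest ordering is an assumption made here, as are the function name `split_main_secondary` and the dict used for the correspondence.

```python
from collections import defaultdict


def split_main_secondary(template_words):
    """Map each chosen main word to the other orderings of the same words."""
    groups = defaultdict(list)
    for word in template_words:
        # Words made of the same tokens in any order share one sorted key.
        groups[tuple(sorted(word))].append(tuple(word))

    mapping = {}
    for members in groups.values():
        members = sorted(set(members))
        # Assumption: the smallest ordering is taken as the main word.
        mapping[members[0]] = members[1:]
    return mapping


dictionary = [("a", "b"), ("b", "a"), ("c",), ("b", "c"), ("c", "b")]
main_to_secondary = split_main_secondary(dictionary)
```

Only the keys of `main_to_secondary` would then be stored in the dictionary tree, which keeps the tree smaller than storing every ordering.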
In one embodiment of the word-breaking device of the present invention, the matching device is configured to: retain, according to the correspondence between main and secondary permutation and combination words, the main permutation and combination words among the obtained permutation and combination words and delete the corresponding secondary permutation and combination words, obtaining filtered main permutation and combination words; match each filtered main permutation and combination word against the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words; and obtain, according to the correspondence between main and secondary permutation and combination words, the secondary permutation and combination words that match the dictionary tree.
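The three stages of this matching (filter, match, expand) can be sketched in Python as below. For brevity a plain set of stored main words stands in for the dictionary tree; that substitution, the function name `match_with_main_words`, and the sample data are assumptions made for illustration.

```python
def match_with_main_words(candidates, secondary_map, stored_main_words):
    """Filter out secondary words, match main words, then recover secondaries."""
    secondary_words = {s for subs in secondary_map.values() for s in subs}

    # 1. Keep only candidates that are not known secondary orderings.
    filtered = [c for c in candidates if c not in secondary_words]

    # 2. Match against the stored main words (a set stands in for the tree here).
    matched_main = [c for c in filtered if c in stored_main_words]

    # 3. Recover the secondary words through the correspondence.
    matched_secondary = [s for m in matched_main for s in secondary_map.get(m, [])]
    return matched_main, matched_secondary


candidates = [("a",), ("b",), ("a", "b"), ("b", "a")]
secondary_map = {("a", "b"): [("b", "a")]}
stored = {("a", "b"), ("b",)}
mains, secondaries = match_with_main_words(candidates, secondary_map, stored)
```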
In one embodiment of the word-breaking device of the present invention, the matching device is configured to place the main permutation and combination words that share the same first word in one group and to perform the following iteration on each group of main permutation and combination words: look up whether the dictionary tree has a node beginning with the first word of the main permutation and combination words of the current group; if not, restart the iteration with the next group of main permutation and combination words; if so, starting from the node beginning with the current first word, search the dictionary tree for main permutation and combination words matching those of the group, and add them to the result set.
In one embodiment of the word-breaking device of the present invention, the device further includes a scoring device, for setting, after the matching permutation and combination words are obtained, a correspondence between each template permutation and combination word and a search-volume score, and for determining, according to the correspondence, the score of each matching permutation and combination word.
In one embodiment of the word-breaking device of the present invention, the device further includes a deleting device, for deleting, when the template permutation and combination words are prohibited words and after the matching permutation and combination words are obtained, the matching permutation and combination words from the phrase input by the user.
According to another aspect of the application, a computing-based device is also provided, including:
A processor; and
A memory arranged to store computer-executable instructions which, when executed, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
According to another aspect of the application, a computer-readable storage medium is also provided, on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
For details of the device and computer-readable storage medium embodiments of the present invention, reference may be made to the corresponding method embodiments, which are not repeated here.
In conclusion, in the present invention each node in the dictionary tree stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node. Each obtained permutation and combination word is matched against the template permutation and combination words in the dictionary tree to obtain the matching template permutation and combination words, so that the template permutation and combination words corresponding to commodity description information can be extracted from the dictionary tree quickly and accurately.
Obviously, those skilled in the art can make various modifications and variations to the application without departing from the spirit and scope of the application. Thus, if these modifications and variations fall within the scope of the claims of the application and their equivalent technologies, the application is also intended to include them.
It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware; for example, it may be implemented using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to realize the steps or functions described above. Similarly, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example, RAM memory, a magnetic or optical drive, a floppy disk, or similar devices. In addition, some steps or functions of the present invention may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform each step or function.
In addition, a part of the present invention may be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted by broadcast or by a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the present invention.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whichever point of view, the embodiments are to be considered illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the present invention. No reference sign in the claims should be construed as limiting the claim concerned. In addition, it is clear that the word "comprising" does not exclude other units or steps, and that the singular does not exclude the plural. Multiple units or devices stated in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Claims (18)
1. A word-breaking method, wherein the method includes:
Obtaining a phrase input by a user;
Splitting the phrase into single words;
Obtaining multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Matching each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
2. The method according to claim 1, wherein the dictionary tree is a double-array dictionary tree.
3. The method according to claim 1, wherein, before each obtained permutation and combination word is matched against the template permutation and combination words in the dictionary tree, the method further includes:
Obtaining a dictionary in which template permutation and combination words are recorded;
Storing each template permutation and combination word in the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
4. The method according to claim 3, wherein storing each template permutation and combination word in the dictionary into the corresponding branch of the dictionary tree includes:
If the dictionary contains approximate template permutation and combination words, each group of approximate template permutation and combination words being a group of template permutation and combination words composed of the same words in different orders, selecting only one template permutation and combination word from each group and storing it in the dictionary tree as the main permutation and combination word, the template permutation and combination words in the group that are not selected serving as secondary permutation and combination words;
Establishing a correspondence between the main permutation and combination word and its secondary permutation and combination words.
5. The method according to claim 4, wherein matching each obtained permutation and combination word against the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words includes:
Retaining, according to the correspondence between main and secondary permutation and combination words, the main permutation and combination words among the obtained permutation and combination words, and deleting the corresponding secondary permutation and combination words, to obtain filtered main permutation and combination words;
Matching each filtered main permutation and combination word against the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words;
Obtaining, according to the correspondence between main and secondary permutation and combination words, the secondary permutation and combination words that match the dictionary tree.
6. The method according to claim 5, wherein matching each filtered main permutation and combination word against the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words includes:
Placing the main permutation and combination words that share the same first word in one group, and performing the following iteration on each group of main permutation and combination words:
Looking up whether the dictionary tree has a node beginning with the first word of the main permutation and combination words of the current group;
If not, restarting the iteration with the next group of main permutation and combination words;
If so, starting from the node beginning with the current first word, searching the dictionary tree for main permutation and combination words matching those of the group, and adding them to the result set.
7. The method according to any one of claims 1 to 6, wherein, after each obtained permutation and combination word is matched against the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further includes:
Setting a correspondence between each template permutation and combination word and a search-volume score;
Determining, according to the correspondence, the score of each matching permutation and combination word.
8. The method according to any one of claims 1 to 6, wherein, when the template permutation and combination words are prohibited words, after each obtained permutation and combination word is matched against the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further includes:
Deleting the matching permutation and combination words from the phrase input by the user.
9. A word-breaking device, wherein the device includes:
An acquisition device, for obtaining a phrase input by a user;
A splitting device, for splitting the phrase into single words;
A combination device, for obtaining multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
A matching device, for matching each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
10. The device according to claim 9, wherein the dictionary tree is a double-array dictionary tree.
11. The device according to claim 9, further including a dictionary tree generating device, for obtaining a dictionary in which template permutation and combination words are recorded, and for storing each template permutation and combination word in the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
12. The device according to claim 11, wherein the dictionary tree generating device is configured so that, if the dictionary contains approximate template permutation and combination words, each group of approximate template permutation and combination words being a group of template permutation and combination words composed of the same words in different orders, only one template permutation and combination word is selected from each group and stored in the dictionary tree as the main permutation and combination word, and the template permutation and combination words in the group that are not selected serve as secondary permutation and combination words; and to establish a correspondence between the main permutation and combination word and its secondary permutation and combination words.
13. The device according to claim 12, wherein the matching device is configured to: retain, according to the correspondence between main and secondary permutation and combination words, the main permutation and combination words among the obtained permutation and combination words and delete the corresponding secondary permutation and combination words, obtaining filtered main permutation and combination words; match each filtered main permutation and combination word against the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words; and obtain, according to the correspondence between main and secondary permutation and combination words, the secondary permutation and combination words that match the dictionary tree.
14. The device according to claim 13, wherein the matching device is configured to place the main permutation and combination words that share the same first word in one group and to perform the following iteration on each group of main permutation and combination words: look up whether the dictionary tree has a node beginning with the first word of the main permutation and combination words of the current group; if not, restart the iteration with the next group of main permutation and combination words; if so, starting from the node beginning with the current first word, search the dictionary tree for main permutation and combination words matching those of the group, and add them to the result set.
15. The device according to any one of claims 9 to 14, further including a scoring device, for setting, after the matching permutation and combination words are obtained, a correspondence between each template permutation and combination word and a search-volume score, and for determining, according to the correspondence, the score of each matching permutation and combination word.
16. The device according to any one of claims 9 to 14, further including a deleting device, for deleting, when the template permutation and combination words are prohibited words and after the matching permutation and combination words are obtained, the matching permutation and combination words from the phrase input by the user.
17. A computing-based device, including:
A processor; and
A memory arranged to store computer-executable instructions which, when executed, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
18. A computer-readable storage medium on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node in the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node in the dictionary tree or by two or more level-adjacent non-root nodes in the same branch, taken in order from upper-layer node to lower-layer node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810086623.3A CN108304384B (en) | 2018-01-29 | 2018-01-29 | Word splitting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810086623.3A CN108304384B (en) | 2018-01-29 | 2018-01-29 | Word splitting method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108304384A (en) | 2018-07-20 |
CN108304384B (en) | 2021-08-27 |
Family
ID=62866739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810086623.3A Active CN108304384B (en) | 2018-01-29 | 2018-01-29 | Word splitting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108304384B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062898A (en) * | 2018-07-27 | 2018-12-21 | 汉能移动能源控股集团有限公司 | Characteristic word duplication eliminating method, device and equipment and storage medium thereof |
CN111310452A (en) * | 2018-12-12 | 2020-06-19 | 北京京东尚科信息技术有限公司 | Word segmentation method and device |
CN113569027A (en) * | 2021-07-27 | 2021-10-29 | 北京百度网讯科技有限公司 | Document title processing method and device and electronic equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110055233A1 (en) * | 2009-08-25 | 2011-03-03 | Lutz Weber | Methods, Computer Systems, Software and Storage Media for Handling Many Data Elements for Search and Annotation |
US20130159318A1 (en) * | 2011-12-16 | 2013-06-20 | Microsoft Corporation | Rule-Based Generation of Candidate String Transformations |
CN103514287A (en) * | 2013-09-29 | 2014-01-15 | 深圳市龙视传媒有限公司 | Index tree building method, Chinese vocabulary searching method and related device |
CN103914569A (en) * | 2014-04-24 | 2014-07-09 | 百度在线网络技术(北京)有限公司 | Input prompt method and device and dictionary tree model establishing method and device |
CN105917327A (en) * | 2013-12-11 | 2016-08-31 | 触摸式有限公司 | System and method for inputting text into electronic devices |
CN106649286A (en) * | 2016-10-15 | 2017-05-10 | 语联网(武汉)信息技术有限公司 | Method for conducting term matching on basis of double-array lexicographic tree |
CN107357911A (en) * | 2017-07-18 | 2017-11-17 | 北京新美互通科技有限公司 | A kind of text entry method and device |
Non-Patent Citations (2)
Title |
---|
JIANDONG LI: ""Enhanced KStore With the Use of Dictionary and Trie for Retail Business Data"", 《PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA ANALYSIS (ICBDA)》 * |
田思虑 等: ""一种改进的基于二元统计的HMM分词算法"", 《计算机与数字工程》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062898A (en) * | 2018-07-27 | 2018-12-21 | 汉能移动能源控股集团有限公司 | Characteristic word duplication eliminating method, device and equipment and storage medium thereof |
CN111310452A (en) * | 2018-12-12 | 2020-06-19 | 北京京东尚科信息技术有限公司 | Word segmentation method and device |
CN113569027A (en) * | 2021-07-27 | 2021-10-29 | 北京百度网讯科技有限公司 | Document title processing method and device and electronic equipment |
CN113569027B (en) * | 2021-07-27 | 2024-02-13 | 北京百度网讯科技有限公司 | Document title processing method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108304384B (en) | 2021-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2750053B1 (en) | Data storage program, data retrieval program, data retrieval apparatus, data storage method and data retrieval method | |
CN110019647B (en) | Keyword searching method and device and search engine | |
CN101673307B (en) | Space data index method and system | |
CN107153647B (en) | Method, apparatus, system and computer program product for data compression | |
EP2045731A1 (en) | Automatic generation of ontologies using word affinities | |
CN107357843B (en) | Massive network data searching method based on data stream structure | |
CN108304384A (en) | Word-breaking method and apparatus | |
CN106980656B (en) | A kind of searching method based on two-value code dictionary tree | |
CN107368527B (en) | Multi-attribute index method based on data stream | |
CN109902142B (en) | Character string fuzzy matching and query method based on edit distance | |
KR100651743B1 (en) | Method of generating and searching tcam entry, and apparatus thereof | |
JP2009512099A (en) | Method and apparatus for restartable hashing in a try | |
US20150248448A1 (en) | Online radix tree compression with key sequence skip | |
Kempa et al. | Dynamic suffix array with polylogarithmic queries and updates | |
Jansson et al. | Linked dynamic tries with applications to LZ-compression in sublinear time and space | |
US10372736B2 (en) | Generating and implementing local search engines over large databases | |
Pandey et al. | A comparison and selection on basic type of searching algorithm in data structure | |
CN110457398A (en) | Block data storage method and device | |
Kanda et al. | Dynamic path-decomposed tries | |
CN109803022B (en) | Digital resource sharing system and service method thereof | |
US9396286B2 (en) | Lookup with key sequence skip for radix trees | |
CN112256821A (en) | Method, device, equipment and storage medium for complementing Chinese address | |
US8554696B2 (en) | Efficient computation of ontology affinity matrices | |
US20150248449A1 (en) | Online compression for limited sequence length radix tree | |
Akagi et al. | Grammar index by induced suffix sorting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||