CN108304384A - Word-breaking method and apparatus - Google Patents

Word-breaking method and apparatus

Info

Publication number
CN108304384A
Authority
CN
China
Prior art keywords
permutation
word
combination
combination word
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810086623.3A
Other languages
Chinese (zh)
Other versions
CN108304384B (en)
Inventor
扈贵谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai M-Share Software Technology Co Ltd
Original Assignee
Shanghai M-Share Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai M-Share Software Technology Co Ltd
Priority to CN201810086623.3A
Publication of CN108304384A
Application granted
Publication of CN108304384B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a word-breaking method and apparatus. Each node of a dictionary tree other than the root node stores one word. A single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, read in order from the upper node to the lower node. Each permutation-combination word obtained from the input is matched against the template permutation-combination words in the dictionary tree to obtain the matching template permutation-combination words, so that the template word combinations corresponding to commodity description information can be extracted from the dictionary tree quickly and accurately.

Description

Word-breaking method and apparatus
Technical field
The present invention relates to the field of computers, and more particularly to a word-breaking method and apparatus.
Background art
Existing word-breaking schemes are slow and insufficiently accurate.
Summary of the invention
An object of the present invention is to provide a word-breaking method and apparatus that solve the problems of slow speed and insufficient accuracy in existing word-breaking schemes.
According to one aspect of the present invention, a word-breaking method is provided, the method comprising:
obtaining a phrase input by a user;
splitting the phrase into single words;
obtaining multiple permutation-combination words of the single words obtained by the splitting, each permutation-combination word comprising one or more words arranged in one of one or more orders;
matching each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node.
Further, in the above method, the dictionary tree is a double-array dictionary tree (Double-Array Trie).
Further, in the above method, before matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree, the method further comprises:
obtaining a dictionary in which template permutation-combination words are recorded;
storing each template permutation-combination word in the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
Further, in the above method, storing each template permutation-combination word in the dictionary into the corresponding branch of the dictionary tree comprises:
if the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, selecting only one template permutation-combination word from each group as the main permutation-combination word and storing it in the dictionary tree, the template permutation-combination words of the group that are not selected serving as secondary permutation-combination words;
establishing the correspondence between the main permutation-combination word and the secondary permutation-combination words.
Further, in the above method, matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching permutation-combination words comprises:
according to the correspondence between the main and secondary permutation-combination words, retaining the main permutation-combination words among the obtained permutation-combination words and deleting the corresponding secondary permutation-combination words, to obtain the filtered main permutation-combination words;
matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching main permutation-combination words;
according to the correspondence between the main and secondary permutation-combination words, obtaining the secondary permutation-combination words that match the dictionary tree.
Further, in the above method, matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching main permutation-combination words comprises:
grouping the main permutation-combination words so that those with the same first word form one group, and performing the following iteration on each group:
searching the dictionary tree for a node that begins with the first word of the main permutation-combination words of the current group;
if there is none, taking the next group of main permutation-combination words and restarting the iteration;
if there is one, starting from the node that begins with the current first word, searching the dictionary tree for the main permutation-combination words that match those of the group, and adding them to the result set.
Further, in the above method, after matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching permutation-combination words, the method further comprises:
setting a correspondence between each template permutation-combination word and a search-volume score;
determining, according to the correspondence, the score of each matching permutation-combination word.
Further, in the above method, when the template permutation-combination words are prohibited words, after matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching permutation-combination words, the method further comprises:
deleting the matching permutation-combination words from the phrase input by the user.
According to another aspect of the present invention, a word-breaking apparatus is further provided, the apparatus comprising:
an obtaining device, configured to obtain a phrase input by a user;
a splitting device, configured to split the phrase into single words;
a combining device, configured to obtain multiple permutation-combination words of the single words obtained by the splitting, each permutation-combination word comprising one or more words arranged in one of one or more orders;
a matching device, configured to match each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node.
Further, in the above apparatus, the dictionary tree is a double-array dictionary tree (Double-Array Trie).
Further, the above apparatus further comprises a dictionary tree generating device, configured to obtain a dictionary in which template permutation-combination words are recorded, and to store each template permutation-combination word in the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
Further, in the above apparatus, the dictionary tree generating device is configured to: if the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, select only one template permutation-combination word from each group as the main permutation-combination word and store it in the dictionary tree, the template permutation-combination words of the group that are not selected serving as secondary permutation-combination words; and establish the correspondence between the main permutation-combination word and the secondary permutation-combination words.
Further, in the above apparatus, the matching device is configured to: according to the correspondence between the main and secondary permutation-combination words, retain the main permutation-combination words among the obtained permutation-combination words and delete the corresponding secondary permutation-combination words, to obtain the filtered main permutation-combination words; match each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching main permutation-combination words; and, according to the correspondence between the main and secondary permutation-combination words, obtain the secondary permutation-combination words that match the dictionary tree.
Further, in the above apparatus, the matching device is configured to group the main permutation-combination words so that those with the same first word form one group, and to perform the following iteration on each group: search the dictionary tree for a node that begins with the first word of the main permutation-combination words of the current group; if there is none, take the next group of main permutation-combination words and restart the iteration; if there is one, start from the node that begins with the current first word, search the dictionary tree for the main permutation-combination words that match those of the group, and add them to the result set.
Further, the above apparatus further comprises a scoring device, configured to, after the matching permutation-combination words are obtained, set a correspondence between each template permutation-combination word and a search-volume score, and determine, according to the correspondence, the score of each matching permutation-combination word.
Further, the above apparatus further comprises a deleting device, configured to, when the template permutation-combination words are prohibited words, delete the matching permutation-combination words from the phrase input by the user after the matching permutation-combination words are obtained.
According to another aspect of the present application, a computing-based device is further provided, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain a phrase input by a user;
split the phrase into single words;
obtain multiple permutation-combination words of the single words obtained by the splitting, each permutation-combination word comprising one or more words arranged in one of one or more orders;
match each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node.
According to another aspect of the present application, a computer-readable storage medium is further provided, on which computer-executable instructions are stored, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
obtain a phrase input by a user;
split the phrase into single words;
obtain multiple permutation-combination words of the single words obtained by the splitting, each permutation-combination word comprising one or more words arranged in one of one or more orders;
match each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node.
Compared with the prior art, in the present invention each node of the dictionary tree other than the root node stores one word; a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node; and each obtained permutation-combination word is matched against the template permutation-combination words in the dictionary tree to obtain the matching template permutation-combination words, so that the template word combinations corresponding to commodity description information can be extracted from the dictionary tree quickly and accurately.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 shows a flowchart of a word-breaking method according to an embodiment of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network and the trusted party each include one or more processors (CPU), an input/output interface, a network interface and memory.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As shown in Fig. 1, a word-breaking method includes:
Step S1: obtain a phrase input by a user, for example abc;
Step S2: split the phrase into single words, for example split the above abc into a, b, c;
Step S3: obtain multiple permutation-combination words of the single words obtained by the splitting, each permutation-combination word comprising one or more words arranged in one of one or more orders; for example, the permutation-combination words that can be formed from the above a, b, c are a, b, c, ab, ac, bc, abc, ba, ca, cb, acb, bac, bca, cab, cba;
Step S6: match each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching permutation-combination words, wherein each node of the dictionary tree other than the root node stores one word, and a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node; for example, a, b, c, ab, ac, bc, abc, ba, ca, acb, bac, bca, cab and cba are each a branch in the dictionary tree.
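Steps S1 to S3 can be illustrated with a minimal sketch. It assumes that each character of the phrase is one "single word", which matches the abc example but is not stated as a general rule in the text; the function name `permutation_words` is hypothetical.

```python
from itertools import permutations

def permutation_words(phrase):
    """Return every ordered arrangement of 1..n of the phrase's single words."""
    words = list(phrase)  # assumption: one character = one single word
    result = []
    for k in range(1, len(words) + 1):
        # permutations(words, k) yields every ordered selection of k words
        for arrangement in permutations(words, k):
            result.append("".join(arrangement))
    return result

candidates = permutation_words("abc")
print(len(candidates), candidates)
```

For a phrase with repeated words this sketch would emit duplicates; wrapping the result in a set would remove them.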
Here, a dictionary tree, also known as a word lookup tree or Trie, is a tree structure and a variant of the hash tree. Its advantage is that it uses the common prefixes of words to reduce query time and minimizes meaningless word comparisons, so its query efficiency is higher than that of a hash tree.
In this embodiment, each node of the dictionary tree other than the root node stores one word; a single node of the dictionary tree forms a corresponding template permutation-combination word, or two or more non-root nodes at adjacent levels of the same branch form a corresponding template permutation-combination word, in order from the upper node to the lower node; and each obtained permutation-combination word is matched against the template permutation-combination words in the dictionary tree, so that the matching template permutation-combination words can be obtained quickly and accurately. For example, the permutation-combination words to be matched are: a, b, c, ab, ac, bc, abc, ba, ca, cb, acb, bac, bca, cab, cba;
the template permutation-combination words in the dictionary tree are: ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, cba;
the matching template permutation-combination words, that is, those present both among the permutation-combination words to be matched and among the template permutation-combination words in the dictionary tree, are then: ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, cba, so that the template word combinations corresponding to the commodity description information can be extracted from the dictionary tree quickly and accurately.
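The matching in this example can be sketched with an ordinary pointer-based trie. This is an illustrative sketch only: the class and function names (`TrieNode`, `build_trie`, `in_trie`) are hypothetical, each character is assumed to be one word, and the candidate list is the set of 15 arrangements of a, b, c from step S3.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # word stored on the edge -> child node
        self.is_word = False  # True if the path from the root ends a template word

def build_trie(template_words):
    root = TrieNode()
    for word in sorted(template_words):  # insert in lexicographic order
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def in_trie(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word

templates = ["ab", "ac", "bc", "abc", "ba", "ca", "acb", "bac", "bca", "cab", "cba"]
candidates = ["a", "b", "c", "ab", "ac", "bc", "abc", "ba", "ca", "cb",
              "acb", "bac", "bca", "cab", "cba"]

root = build_trie(templates)
matched = [w for w in candidates if in_trie(root, w)]
print(matched)
```

Candidates such as a, b, c and cb are prefixes of stored words or absent altogether, so they are rejected because no template word ends at the node they reach.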
In one embodiment of the word-breaking method of the present invention, the dictionary tree is a double-array dictionary tree. The double-array dictionary tree (Double-Array Trie) is a compressed form of the dictionary tree (Trie) structure that represents the Trie with only two linear arrays. The structure combines the efficient retrieval time of the digital search tree (Digital Search Tree) with the compact space of the linked Trie representation. In essence, a double-array Trie is a deterministic finite automaton (DFA): each node represents one state of the automaton, state transitions are made according to the input variable, and a query is complete when an end state is reached or no transition is possible. The relationships between the characters contained in all keys of the double array are expressed by simple addition, which not only increases retrieval speed but also eliminates the large number of pointers used in a linked structure, saving storage space.
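The two-array idea can be sketched as follows. This is a simplified illustration, not the patent's implementation: the arrays `base` and `check`, the reserved end-of-word code `END`, and the first-fit placement strategy are all assumptions made for the example. A transition from state `s` on code `c` goes to `t = base[s] + c` and is valid only if `check[t] == s`.

```python
from collections import deque

END = 1  # reserved code meaning "a template word ends here" (assumption)

def build_double_array(words):
    # Give every distinct character a small integer code (2, 3, ...).
    alphabet = sorted({ch for w in words for ch in w})
    code = {ch: i + 2 for i, ch in enumerate(alphabet)}
    # Build an ordinary nested-dict trie first, then pack it into two arrays.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(code[ch], {})
        node[END] = {}
    base, check = [0, 0], [-1, 0]  # index 0 unused; state 1 is the root
    queue = deque([(1, root)])
    while queue:
        state, node = queue.popleft()
        if not node:
            continue
        b = 1
        while True:  # first-fit: smallest offset whose target slots are all free
            top = b + max(node)
            while len(check) <= top:
                base.append(0)
                check.append(-1)
            if all(check[b + c] == -1 for c in node):
                break
            b += 1
        base[state] = b
        for c, child in node.items():
            check[b + c] = state  # slot b + c now belongs to a child of `state`
            queue.append((b + c, child))
    return base, check, code

def contains(base, check, code, word):
    state = 1
    for c in [code.get(ch) for ch in word] + [END]:
        if c is None:
            return False
        t = base[state] + c  # transition by simple addition
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return True

templates = ["ab", "ac", "bc", "abc", "ba", "ca", "acb", "bac", "bca", "cab", "cba"]
base, check, code = build_double_array(templates)
print(contains(base, check, code, "abc"), contains(base, check, code, "cb"))
```

Production double-array tries typically use more elaborate free-slot management; the first-fit search here is chosen only to keep the sketch short.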
As shown in Fig. 1, in one embodiment of the word-breaking method of the present invention, before step S6 of matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree, the method further includes:
Step S4: obtain a dictionary in which template permutation-combination words are recorded;
Step S5: store each template permutation-combination word in the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
Here, by converting the dictionary into a dictionary tree, the matching template permutation-combination words can subsequently be obtained more quickly by querying the dictionary tree.
In one embodiment of the word-breaking method of the present invention, step S5 of storing each template permutation-combination word in the dictionary into the corresponding branch of the dictionary tree includes:
Step S51: if the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, select only one template permutation-combination word from each group as the main permutation-combination word and store it in the dictionary tree, the template permutation-combination words of the group that are not selected serving as secondary permutation-combination words. For example, abc, acb and bac are one group of approximate template permutation-combination words; abc is stored in the dictionary tree as the main permutation-combination word, acb and bac are secondary permutation-combination words and are not stored in the dictionary tree, and abc corresponds to acb and bac;
Step S52: establish the correspondence between the main permutation-combination word and the secondary permutation-combination words.
Here, by selecting only one template permutation-combination word from each group of approximate template permutation-combination words as the main permutation-combination word stored in the dictionary tree, the redundancy of the data stored in the dictionary tree can be reduced and the matching of each combination against the dictionary tree can be accelerated.
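Steps S51 and S52 can be sketched as a grouping of the dictionary by the multiset of words each entry contains. The text does not say which member of a group becomes the main permutation-combination word; this sketch assumes it is the first member in ASCII order, which reproduces the abc example. The function name `group_templates` is hypothetical.

```python
from collections import defaultdict

def group_templates(dictionary):
    """Map each main permutation-combination word to its secondary words."""
    groups = defaultdict(list)
    for word in sorted(set(dictionary)):
        # Entries made of the same words in a different order share this key.
        groups["".join(sorted(word))].append(word)
    main_to_secondary = {}
    for members in groups.values():
        main, *secondary = members  # assumption: first in ASCII order is the main word
        main_to_secondary[main] = secondary
    return main_to_secondary

mapping = group_templates(["abc", "acb", "bac"])
print(mapping)
```

Only the keys of the resulting mapping would be inserted into the dictionary tree; the values are kept aside as the correspondence of step S52.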
In one embodiment of the word-breaking method of the present invention, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, only one template permutation-combination word of each group has been selected as the main permutation-combination word and stored in the dictionary tree, the unselected template permutation-combination words of the group serve as secondary permutation-combination words, and the correspondence between the main and secondary permutation-combination words has been established,
step S6 of matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching permutation-combination words includes:
Step S61: according to the correspondence between the main and secondary permutation-combination words, retain the main permutation-combination words among the obtained permutation-combination words and delete the corresponding secondary permutation-combination words, to obtain the filtered main permutation-combination words;
Step S62: match each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching main permutation-combination words;
Step S63: according to the correspondence between the main and secondary permutation-combination words, obtain the secondary permutation-combination words that match the dictionary tree.
For example, suppose the dictionary is as follows:
a;
ab;
abc;
ba;
bac;
abcd。
The phrase input by the user is cba. Splitting the phrase into single words gives A(3,1) + A(3,2) + A(3,3) = 3 + 6 + 6 = 15 possible permutation-combination words, as follows:
a;
b;
c;
ab;
ac;
bc;
ba;
ca;
cb;
abc;
acb;
cab;
bac;
bca;
cba。
The permutation-combination words of cba that appear in the dictionary need to be found; they are as follows:
a;
ab;
abc;
ba;
bac。
The present application first sorts the dictionary by ASCII code. If the dictionary contains approximate template permutation-combination words, where each group of approximate template permutation-combination words is a group of template permutation-combination words composed of the same words in different orders, only one template permutation-combination word of each group is selected as the main permutation-combination word and stored in the dictionary tree. That is, the original dictionary is de-duplicated to obtain a new dictionary, and the dictionary tree is constructed from the new dictionary, as follows (each main permutation-combination word is followed by its group in square brackets):
a[a];
ab[ab, ba];
abc[abc, bac];
abcd[abcd]。
When the phrase cba input by the user is matched against the dictionary tree, cba sorted by ASCII code becomes abc. Traversing the above dictionary tree with abc yields the main permutation-combination words a, ab and abc. Then, through the correspondence between main and secondary permutation-combination words, a yields one combination, a; ab yields two combinations, ab and ba; and abc yields two combinations, abc and bac.
The final matching result is therefore:
a;
ab;
ba;
abc;
bac。
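The worked example can be reproduced end to end with a short sketch of steps S61 to S63. Several things here are assumptions made for illustration: each character is one word, the main word of each group is its first member in ASCII order, and a plain Python set stands in for the dictionary tree lookup. The function name `match` is hypothetical.

```python
from itertools import permutations

dictionary = ["a", "ab", "abc", "ba", "bac", "abcd"]

# De-duplicate the ASCII-sorted dictionary into main words and their secondary words.
main_to_secondary = {}
main_for_key = {}
for word in sorted(dictionary):
    key = "".join(sorted(word))  # same words, any order -> same key
    if key in main_for_key:
        main_to_secondary[main_for_key[key]].append(word)
    else:
        main_for_key[key] = word
        main_to_secondary[word] = []

secondary_words = {s for group in main_to_secondary.values() for s in group}
main_words = set(main_to_secondary)  # stands in for the dictionary tree

def match(phrase):
    candidates = {"".join(p)
                  for k in range(1, len(phrase) + 1)
                  for p in permutations(phrase, k)}
    filtered = candidates - secondary_words            # S61: drop secondary words
    matched_main = sorted(filtered & main_words,       # S62: match the main words
                          key=lambda w: (len(w), w))
    result = []
    for main in matched_main:                          # S63: restore secondary words
        result.append(main)
        result.extend(main_to_secondary[main])
    return result

print(match("cba"))
```

The entry abcd never matches because the phrase cba contains no d, so its main word is not among the candidates.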
In one embodiment of the word-breaking method of the present invention, step S62 of matching each filtered main permutation-combination word against the template permutation-combination words in the dictionary tree to obtain the matching main permutation-combination words includes:
grouping the main permutation-combination words so that those with the same first word form one group, and performing the following iteration on each group:
Step S621: search the dictionary tree for a node that begins with the first word of the main permutation-combination words of the current group;
Step S622: if there is none, take the next group of main permutation-combination words and restart the iteration;
Step S623: if there is one, start from the node that begins with the current first word, search the dictionary tree for the main permutation-combination words that match those of the group, and add them to the result set.
For example, suppose a function is needed that scores a user's commodity title: the commodity title (at most 30 Chinese characters) is split by character, and all permutation-combinations that appear in the dictionary (about 3,000,000 entries) must be found.
The number of possible permutation-combinations of each commodity title is A(30,1) + A(30,2) + ... + A(30,30), which is very large (on the order of 10^33).
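The size of this search space can be checked numerically. The sketch below uses `math.perm` from the Python standard library (available from Python 3.8); the function name `permutation_count` is hypothetical.

```python
import math

def permutation_count(n):
    """Total number of ordered arrangements of 1..n items out of n: A(n,1) + ... + A(n,n)."""
    return sum(math.perm(n, k) for k in range(1, n + 1))

print(permutation_count(3))            # the cba example
print(f"{permutation_count(30):.2e}")  # a 30-character title
```

Enumerating that many candidates directly is infeasible, which motivates walking the dictionary tree instead.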
Algorithm for design is as follows:
Space is gone to switch to small letter duplicate removal as main permutation and combination word all masterplate permutation and combination words in dictionary, while structure It makes a map and stores main permutation and combination word and secondary permutation and combination word, a main permutation and combination word may correspond to multiple secondary arrangement groups Word is closed, while a dictionary tree is constructed according to all main permutation and combination words;
The commodity title is stripped of spaces, converted to lowercase, deduplicated and sorted by ASCII code, giving for example abcd. The dictionary tree is then walked in order to find all main permutation and combination words that may occur. Taking abcd as an example, the steps are as follows:
1.0 Look up whether the dictionary tree has a node starting with a;
If there is, go to 1.1; if there is not, go to 2.0;
1.1 First judge whether a is a leaf node; if it is, go to 1.1.1, otherwise go to 1.2;
1.1.1 Add the word in the leaf node to the returned result set, then go to 2.0;
1.2 Under the node starting with a, look in turn for whether nodes starting with ab, ac, ad exist ...
1.2.1 Add each matching main permutation and combination word to the returned result set; after all nodes starting with a have been matched, go to 2.0;
2.0 Check whether there is a node starting with b ...
After all main permutation and combination words have been found, the map is searched for all of their corresponding secondary permutation and combination words, so that the matching main permutation and combination words together with their corresponding secondary permutation and combination words form the final matching result.
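The walk over the dictionary tree and the final lookup in the map can be sketched as below. This is a simplified sketch, not the patent's exact procedure: instead of the leaf-node test in step 1.1 it records a word at every node that ends a main word and then keeps descending, and the helper names (`build_trie`, `find_main_words`, `expand_with_secondary`, `END`) are hypothetical.

```python
END = ""  # marker key for "a main word ends at this node"

def build_trie(main_words):
    """Nested-dict dictionary tree with one character per node."""
    root = {}
    for word in main_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def find_main_words(title, root):
    """Return every main word whose characters all occur in the title."""
    # Same normalisation as the dictionary side: no spaces, lowercase,
    # deduplicated, sorted by character code.
    chars = sorted(set(title.replace(" ", "").lower()))
    results = []

    def walk(node, start):
        if END in node:
            results.append(node[END])
        # Only characters later in the sorted order can extend the path,
        # so a missing child prunes that whole family of arrangements.
        for i in range(start, len(chars)):
            child = node.get(chars[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return results

def expand_with_secondary(main_hits, main_to_secondary):
    """Attach the secondary words stored in the map to each matched main word."""
    return {m: main_to_secondary.get(m, []) for m in main_hits}

root = build_trie(["ab", "ac", "bc", "abc", "ad", "xy"])
hits = find_main_words("d c b a", root)
final = expand_with_secondary(hits, {"ab": ["ba"], "bc": ["cb"]})
```

The number of nodes visited depends on how many dictionary branches the title's characters actually reach, not on the number of possible arrangements of the title.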
Here, the present embodiment groups the main permutation and combination words that share the same entry word and, starting from the node that begins with the current entry word, searches the dictionary tree for the main permutation and combination words that match those of the group, which can further speed up the search.
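The grouping by entry word can be illustrated as follows. This is a hedged sketch that assumes the candidate main words are already available as a list; the names `build_trie`, `match_by_entry_word` and `END` are hypothetical.

```python
from itertools import groupby

END = ""  # marker key for "a main word ends at this node"

def build_trie(main_words):
    """Nested-dict dictionary tree with one character per node."""
    root = {}
    for word in main_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def match_by_entry_word(candidates, root):
    """Group candidates by first character and match each group in turn."""
    results = []
    ordered = sorted({w for w in candidates if w})
    for lead, group in groupby(ordered, key=lambda w: w[0]):
        subtree = root.get(lead)
        if subtree is None:
            continue  # no node starts with this entry word: skip the whole group
        for word in group:
            node = subtree
            for ch in word[1:]:
                node = node.get(ch)
                if node is None:
                    break
            else:
                # The path exists; accept only if a main word ends here.
                if END in node:
                    results.append(word)
    return results

root = build_trie(["ab", "abc", "bc"])
matched = match_by_entry_word(["ab", "ac", "abc", "bc", "ca", "cb"], root)
```

In this example the group starting with c is rejected after a single lookup at the root, without testing ca or cb individually.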
In one embodiment of the word-breaking method of the present invention, after step S6, in which each obtained permutation and combination word is matched against the masterplate permutation and combination words in the dictionary tree and the matching permutation and combination words are obtained, the method further includes:
Setting the correspondence between each masterplate permutation and combination word and a search volume score;
Determining, according to the correspondence, the score of each matching permutation and combination word.
For example, the correspondence between each masterplate permutation and combination word and its search volume score is as follows: a: 8 points, b: 1 point, c: 10 points, ab: 7 points, ac: 5 points, bc: 9 points, abc: 4 points, ba: 3 points, ca: 0 points, cb: 2 points;
The matching permutation and combination words are: ab, ac, bc, abc;
The scores of the matching permutation and combination words are: ab: 7 points, ac: 5 points, bc: 9 points, abc: 4 points.
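The scoring example above can be reproduced with a plain lookup table. The sketch below uses the scores and matched words from the example; the names `SCORES` and `score_matches` are hypothetical.

```python
# Search volume score of each masterplate permutation and combination word
# (values taken from the example above).
SCORES = {"a": 8, "b": 1, "c": 10, "ab": 7, "ac": 5,
          "bc": 9, "abc": 4, "ba": 3, "ca": 0, "cb": 2}

def score_matches(matched, scores):
    """Return the score of every matched word that has a score entry."""
    return {word: scores[word] for word in matched if word in scores}

matched = ["ab", "ac", "bc", "abc"]
per_word = score_matches(matched, SCORES)
total = sum(per_word.values())  # 7 + 5 + 9 + 4 = 25
```

Either the per-word scores or their sum can then be used as the estimate of search volume.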
Here, the phrase input by the user can be product description information. From the scores, or the sum of the scores, of the matching permutation and combination words, the search volume of that product description information can be predicted, and product description information whose predicted search volume falls short of the expected requirement can then be modified.
In one embodiment of the word-breaking method of the present invention, when the masterplate permutation and combination words are violated (prohibited) words,
after step S6, in which each obtained permutation and combination word is matched against the masterplate permutation and combination words in the dictionary tree and the matching permutation and combination words are obtained, the method further includes:
Deleting the matching permutation and combination words from the phrase input by the user.
Here, the phrase input by the user can be product description information. Each obtained permutation and combination word is matched against the violated words in the dictionary tree to obtain the matching permutation and combination words, which can subsequently be filtered out of the product description information, so that the product description information is corrected automatically.
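One possible way to carry out the deletion is sketched below. The patent does not specify how a matched word is located in the phrase, so this sketch assumes that only contiguous occurrences are removed and that each word is removed in a single pass; the function name `remove_prohibited` is hypothetical.

```python
def remove_prohibited(phrase, matched_words):
    """Delete every contiguous occurrence of each matched word from the phrase."""
    # Longest words first, so that "abc" is removed before its prefix "ab".
    for word in sorted(set(matched_words), key=len, reverse=True):
        if word:
            phrase = phrase.replace(word, "")
    return phrase

cleaned = remove_prohibited("xabcy", ["ab", "abc"])
```

Removing longer words first avoids leaving behind a fragment of a longer violated word after a shorter one has been deleted.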
According to another aspect of the application, a word-breaking equipment is also provided, the equipment including:
An acquisition device, for obtaining a phrase input by a user;
A splitting device, for splitting the phrase into single words;
A combination device, for obtaining multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
A matching device, for matching each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtaining the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
In one embodiment of the word-breaking equipment of the present invention, the dictionary tree is a double-array dictionary tree.
In one embodiment of the word-breaking equipment of the present invention, the equipment further includes a dictionary tree generating device, for obtaining a dictionary in which masterplate permutation and combination words are recorded, and for storing each masterplate permutation and combination word of the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
In one embodiment of the word-breaking equipment of the present invention, the dictionary tree generating device is used, if the dictionary contains approximate masterplate permutation and combination words, each group of approximate masterplate permutation and combination words being a group of masterplate permutation and combination words composed of the same words in different orders, to choose only one masterplate permutation and combination word from each group as the main permutation and combination word and store it in the dictionary tree, the masterplate permutation and combination words of the group that are not chosen serving as secondary permutation and combination words; and to establish the correspondence between the main permutation and combination word and the secondary permutation and combination words.
In one embodiment of the word-breaking equipment of the present invention, the matching device is used to retain, according to the correspondence between the main permutation and combination words and the secondary permutation and combination words, the main permutation and combination words among the obtained permutation and combination words and delete the corresponding secondary permutation and combination words, obtaining filtered main permutation and combination words; to match each filtered main permutation and combination word against the masterplate permutation and combination words in the dictionary tree, obtaining the matching main permutation and combination words; and to obtain, according to the same correspondence, the secondary permutation and combination words that match the dictionary tree.
In one embodiment of the word-breaking equipment of the present invention, the matching device is used to group the main permutation and combination words that share the same entry word and to carry out the following iteration on each group: look up whether the dictionary tree has a node starting with the entry word of the current group; if not, take the next group of main permutation and combination words and start the iteration again; if so, starting from the node that begins with the current entry word, search the dictionary tree for the main permutation and combination words that match those of the group and add them to the result set.
In one embodiment of the word-breaking equipment of the present invention, the equipment further includes a scoring device, for setting, after the matching permutation and combination words have been obtained, the correspondence between each masterplate permutation and combination word and a search volume score, and for determining, according to the correspondence, the score of each matching permutation and combination word.
In one embodiment of the word-breaking equipment of the present invention, the equipment further includes a deleting device, for deleting, when the masterplate permutation and combination words are violated (prohibited) words and after the matching permutation and combination words have been obtained, the matching permutation and combination words from the phrase input by the user.
According to another aspect of the application, a computing-based equipment is also provided, including:
A processor; and
A memory arranged to store computer executable instructions which, when executed, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
According to another aspect of the application, a computer readable storage medium is also provided, on which computer executable instructions are stored which, when executed by a processor, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
For the details of each embodiment of the equipment and of the computer readable storage medium of the present invention, refer to the corresponding embodiments of the method; they are not repeated here.
In conclusion, in the present invention each node of the dictionary tree stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node. Each obtained permutation and combination word is matched against the masterplate permutation and combination words in the dictionary tree to obtain the matching masterplate permutation and combination words, so that the masterplate permutation and combination words corresponding to the product description information can be extracted from the dictionary tree quickly and accurately.
Obviously, those skilled in the art can make various modifications and variations to the application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the application and their equivalent technologies, the application is also intended to include them.
It should be noted that the present invention can be implemented in software and/or in a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the invention can be executed by a processor to realize the steps or functions described above. Similarly, the software program of the invention (including relevant data structures) can be stored in a computer readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk or a similar device. In addition, some steps or functions of the present invention may be implemented in hardware, for example as a circuit that cooperates with a processor to execute each step or function.
In addition, part of the present invention can be applied as a computer program product, such as computer program instructions which, when executed by a computer, can through the operation of the computer invoke or provide the method and/or technical solution according to the present invention. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted through broadcast or a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention includes a device that comprises a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions of the foregoing embodiments of the present invention.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whichever point of view, the embodiments are to be considered illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all variations that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in the present invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a device claim can also be realized by one unit or device through software or hardware. Words such as first and second are used to indicate names and do not indicate any particular order.

Claims (18)

1. A word-breaking method, wherein the method includes:
Obtaining a phrase input by a user;
Splitting the phrase into single words;
Obtaining multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Matching each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtaining the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
2. The method according to claim 1, wherein the dictionary tree is a double-array dictionary tree.
3. The method according to claim 1, wherein, before each obtained permutation and combination word is matched against the masterplate permutation and combination words in the dictionary tree, the method further includes:
Obtaining a dictionary in which masterplate permutation and combination words are recorded;
Storing each masterplate permutation and combination word of the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
4. The method according to claim 3, wherein storing each masterplate permutation and combination word of the dictionary into the corresponding branch of the dictionary tree includes:
If the dictionary contains approximate masterplate permutation and combination words, each group of approximate masterplate permutation and combination words being a group of masterplate permutation and combination words composed of the same words in different orders, choosing only one masterplate permutation and combination word from each group as the main permutation and combination word and storing it in the dictionary tree, the masterplate permutation and combination words of the group that are not chosen serving as secondary permutation and combination words;
Establishing the correspondence between the main permutation and combination word and the secondary permutation and combination words.
5. The method according to claim 4, wherein matching each obtained permutation and combination word against the masterplate permutation and combination words in the dictionary tree and obtaining the matching permutation and combination words includes:
According to the correspondence between the main permutation and combination words and the secondary permutation and combination words, retaining the main permutation and combination words among the obtained permutation and combination words and deleting the corresponding secondary permutation and combination words, to obtain filtered main permutation and combination words;
Matching each filtered main permutation and combination word against the masterplate permutation and combination words in the dictionary tree, to obtain the matching main permutation and combination words;
According to the correspondence between the main permutation and combination words and the secondary permutation and combination words, obtaining the secondary permutation and combination words that match the dictionary tree.
6. The method according to claim 5, wherein matching each filtered main permutation and combination word against the masterplate permutation and combination words in the dictionary tree and obtaining the matching main permutation and combination words includes:
Grouping the main permutation and combination words that share the same entry word, and carrying out the following iteration on each group:
Looking up whether the dictionary tree has a node starting with the entry word of the main permutation and combination words of the current group;
If not, taking the next group of main permutation and combination words and starting the iteration again;
If so, starting from the node that begins with the current entry word, searching the dictionary tree for the main permutation and combination words that match those of the group, and adding them to the result set.
7. The method according to any one of claims 1 to 6, wherein, after each obtained permutation and combination word is matched against the masterplate permutation and combination words in the dictionary tree and the matching permutation and combination words are obtained, the method further includes:
Setting the correspondence between each masterplate permutation and combination word and a search volume score;
Determining, according to the correspondence, the score of each matching permutation and combination word.
8. The method according to any one of claims 1 to 6, wherein, when the masterplate permutation and combination words are violated (prohibited) words, after each obtained permutation and combination word is matched against the masterplate permutation and combination words in the dictionary tree and the matching permutation and combination words are obtained, the method further includes:
Deleting the matching permutation and combination words from the phrase input by the user.
9. A word-breaking equipment, wherein the equipment includes:
An acquisition device, for obtaining a phrase input by a user;
A splitting device, for splitting the phrase into single words;
A combination device, for obtaining multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
A matching device, for matching each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtaining the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
10. The equipment according to claim 9, wherein the dictionary tree is a double-array dictionary tree.
11. The equipment according to claim 9, further including a dictionary tree generating device, for obtaining a dictionary in which masterplate permutation and combination words are recorded, and for storing each masterplate permutation and combination word of the dictionary into the corresponding branch of the dictionary tree in lexicographic order.
12. The equipment according to claim 11, wherein the dictionary tree generating device is used, if the dictionary contains approximate masterplate permutation and combination words, each group of approximate masterplate permutation and combination words being a group of masterplate permutation and combination words composed of the same words in different orders, to choose only one masterplate permutation and combination word from each group as the main permutation and combination word and store it in the dictionary tree, the masterplate permutation and combination words of the group that are not chosen serving as secondary permutation and combination words; and to establish the correspondence between the main permutation and combination word and the secondary permutation and combination words.
13. The equipment according to claim 12, wherein the matching device is used to retain, according to the correspondence between the main permutation and combination words and the secondary permutation and combination words, the main permutation and combination words among the obtained permutation and combination words and delete the corresponding secondary permutation and combination words, obtaining filtered main permutation and combination words; to match each filtered main permutation and combination word against the masterplate permutation and combination words in the dictionary tree, obtaining the matching main permutation and combination words; and to obtain, according to the same correspondence, the secondary permutation and combination words that match the dictionary tree.
14. The equipment according to claim 13, wherein the matching device is used to group the main permutation and combination words that share the same entry word and to carry out the following iteration on each group: look up whether the dictionary tree has a node starting with the entry word of the main permutation and combination words of the current group; if not, take the next group of main permutation and combination words and start the iteration again; if so, starting from the node that begins with the current entry word, search the dictionary tree for the main permutation and combination words that match those of the group and add them to the result set.
15. The equipment according to any one of claims 9 to 14, further including a scoring device, for setting, after the matching permutation and combination words have been obtained, the correspondence between each masterplate permutation and combination word and a search volume score, and for determining, according to the correspondence, the score of each matching permutation and combination word.
16. The equipment according to any one of claims 9 to 14, further including a deleting device, for deleting, when the masterplate permutation and combination words are violated (prohibited) words and after the matching permutation and combination words have been obtained, the matching permutation and combination words from the phrase input by the user.
17. A computing-based equipment, including:
A processor; and
A memory arranged to store computer executable instructions which, when executed, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
18. A computer readable storage medium on which computer executable instructions are stored, wherein the computer executable instructions, when executed by a processor, cause the processor to:
Obtain a phrase input by a user;
Split the phrase into single words;
Obtain multiple permutation and combination words of the single words obtained by splitting, each permutation and combination word including one or more words arranged in one or more orders;
Match each obtained permutation and combination word against the masterplate permutation and combination words in a dictionary tree and obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a masterplate permutation and combination word is formed either by a single node of the dictionary tree or by two or more hierarchically adjacent non-root nodes on the same branch, taken in order from the upper-layer node to the lower-layer node.
CN201810086623.3A 2018-01-29 2018-01-29 Word splitting method and device Active CN108304384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810086623.3A CN108304384B (en) 2018-01-29 2018-01-29 Word splitting method and device

Publications (2)

Publication Number Publication Date
CN108304384A true CN108304384A (en) 2018-07-20
CN108304384B CN108304384B (en) 2021-08-27

Family

ID=62866739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810086623.3A Active CN108304384B (en) 2018-01-29 2018-01-29 Word splitting method and device

Country Status (1)

Country Link
CN (1) CN108304384B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055233A1 (en) * 2009-08-25 2011-03-03 Lutz Weber Methods, Computer Systems, Software and Storage Media for Handling Many Data Elements for Search and Annotation
US20130159318A1 (en) * 2011-12-16 2013-06-20 Microsoft Corporation Rule-Based Generation of Candidate String Transformations
CN103514287A (en) * 2013-09-29 2014-01-15 深圳市龙视传媒有限公司 Index tree building method, Chinese vocabulary searching method and related device
CN103914569A (en) * 2014-04-24 2014-07-09 百度在线网络技术(北京)有限公司 Input prompt method and device and dictionary tree model establishing method and device
CN105917327A (en) * 2013-12-11 2016-08-31 触摸式有限公司 System and method for inputting text into electronic devices
CN106649286A (en) * 2016-10-15 2017-05-10 语联网(武汉)信息技术有限公司 Method for conducting term matching on basis of double-array lexicographic tree
CN107357911A (en) * 2017-07-18 2017-11-17 北京新美互通科技有限公司 A kind of text entry method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANDONG LI: ""Enhanced KStore With the Use of Dictionary and Trie for Retail Business Data"", 《PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA ANALYSIS (ICBDA)》 *
田思虑 等: ""一种改进的基于二元统计的HMM分词算法"", 《计算机与数字工程》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062898A (en) * 2018-07-27 2018-12-21 汉能移动能源控股集团有限公司 Characteristic word duplication eliminating method, device and equipment and storage medium thereof
CN111310452A (en) * 2018-12-12 2020-06-19 北京京东尚科信息技术有限公司 Word segmentation method and device
CN113569027A (en) * 2021-07-27 2021-10-29 北京百度网讯科技有限公司 Document title processing method and device and electronic equipment
CN113569027B (en) * 2021-07-27 2024-02-13 北京百度网讯科技有限公司 Document title processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN108304384B (en) 2021-08-27

Similar Documents

Publication Publication Date Title
EP2750053B1 (en) Data storage program, data retrieval program, data retrieval apparatus, data storage method and data retrieval method
CN110019647B (en) Keyword searching method and device and search engine
CN101673307B (en) Space data index method and system
CN107153647B (en) Method, apparatus, system and computer program product for data compression
EP2045731A1 (en) Automatic generation of ontologies using word affinities
CN107357843B (en) Massive network data searching method based on data stream structure
CN108304384A (en) Word-breaking method and apparatus
CN106980656B (en) A kind of searching method based on two-value code dictionary tree
CN107368527B (en) Multi-attribute index method based on data stream
CN109902142B (en) Character string fuzzy matching and query method based on edit distance
KR100651743B1 (en) Method of generating and searching tcam entry, and apparatus thereof
JP2009512099A (en) Method and apparatus for restartable hashing in a try
US20150248448A1 (en) Online radix tree compression with key sequence skip
Kempa et al. Dynamic suffix array with polylogarithmic queries and updates
Jansson et al. Linked dynamic tries with applications to LZ-compression in sublinear time and space
US10372736B2 (en) Generating and implementing local search engines over large databases
Pandey et al. A comparison and selection on basic type of searching algorithm in data structure
CN110457398A (en) Block data storage method and device
Kanda et al. Dynamic path-decomposed tries
CN109803022B (en) Digital resource sharing system and service method thereof
US9396286B2 (en) Lookup with key sequence skip for radix trees
CN112256821A (en) Method, device, equipment and storage medium for complementing Chinese address
US8554696B2 (en) Efficient computation of ontology affinity matrices
US20150248449A1 (en) Online compression for limited sequence length radix tree
Akagi et al. Grammar index by induced suffix sorting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant