CN108304384B - Word splitting method and device - Google Patents


Info

Publication number
CN108304384B
Authority
CN
China
Prior art keywords
combination
word
permutation
words
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810086623.3A
Other languages
Chinese (zh)
Other versions
CN108304384A (en)
Inventor
扈贵谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mingxuan Software Technology Co ltd
Original Assignee
Shanghai Mingxuan Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a word splitting method and device. Each node in a dictionary tree other than the root node stores one word. A template permutation-combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper node to the lower node. Each permutation-combination word obtained from the input is matched against the template permutation-combination words in the dictionary tree to obtain those that match, so that the combination words corresponding to the template words of commodity description information can be extracted from the dictionary tree quickly and accurately.

Description

Word splitting method and device
Technical Field
The invention relates to the field of computers, in particular to a word segmentation method and device.
Background
Existing word splitting schemes are slow and split words inaccurately.
Disclosure of Invention
The invention aims to provide a word segmentation method and device, which can solve the problems of low word segmentation speed and inaccurate word segmentation of the existing word segmentation scheme.
According to an aspect of the present invention, there is provided a word segmentation method, including:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
and matching each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the permutation and combination words that match, wherein each node other than the root node in the dictionary tree stores one word, and a template permutation and combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper-layer node to the lower-layer node.
Further, in the above method, the dictionary tree is a double-array dictionary tree.
Further, before matching each obtained permutation and combination word with a template permutation and combination word in a dictionary tree, the method further includes:
acquiring a word bank recorded with template arrangement combination words;
and storing each template arrangement combination word in the word stock into each corresponding branch of the dictionary tree according to the lexical order.
Further, in the above method, storing each template permutation-combination word in the lexicon into the corresponding branch of the dictionary tree includes:
if the lexicon contains similar template permutation-combination words, where each group of similar template permutation-combination words consists of template words composed of the same words in different orders, selecting only one template word from each group as the main permutation-combination word to be stored in the dictionary tree, and treating each unselected template word in the group as an auxiliary permutation-combination word;
and establishing a correspondence between the main permutation-combination words and the auxiliary permutation-combination words.
Further, in the above method, matching each obtained permutation and combination word with a template permutation and combination word in a dictionary tree to obtain a permutation and combination word with a consistent match, the method includes:
according to the corresponding relation between the main permutation combination words and the auxiliary permutation combination words, the main permutation combination words are reserved in the obtained permutation combination words, and the corresponding auxiliary permutation combination words are deleted to obtain filtered main permutation combination words;
matching each filtered main permutation combination word with the template permutation combination word in the dictionary tree to obtain main permutation combination words which are matched consistently;
and obtaining the auxiliary permutation-combination words that correspond to the matched main permutation-combination words, according to the correspondence between the main and auxiliary permutation-combination words.
Further, in the above method, matching each filtered main arrangement combined word with a template arrangement combined word in a dictionary tree to obtain a main arrangement combined word with a consistent match, includes:
grouping the main permutation-combination words that share the same head word, and performing the following iteration on each group:
checking whether the dictionary tree contains a node beginning with the head word of the main permutation-combination words of the current group;
if not, taking the next group of main permutation-combination words and restarting the iteration;
if so, taking the current head word as the starting node, searching the dictionary tree for the words that match the main permutation-combination words of the group, and adding them to a result set.
Further, in the above method, after matching each obtained permutation and combination word with a template permutation and combination word in a dictionary tree and obtaining a permutation and combination word with a consistent match, the method further includes:
setting a corresponding relation between each template arrangement combination word and the search quantity score;
and determining the scores of the matched and consistent permutation and combination words according to the corresponding relation.
Further, in the above method, when the template permutation-combination word is a forbidden word, after matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree and obtaining the permutation-combination words that match, the method further includes:
deleting the matched permutation-combination words from the phrase input by the user.
According to another aspect of the present invention, there is also provided a word segmentation apparatus, including:
the acquisition device is used for acquiring the phrases input by the user;
splitting means for splitting the phrase into individual words;
a combining device, configured to obtain multiple permutation and combination words of each single word obtained by splitting, where the permutation and combination words include one or more words arranged in one or more orders;
and a matching device, configured to match each obtained permutation and combination word against the template permutation and combination words in the dictionary tree to obtain the permutation and combination words that match, wherein each node other than the root node in the dictionary tree stores one word, and a template permutation and combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper-layer node to the lower-layer node.
Further, in the above apparatus, the dictionary tree is a double-array dictionary tree.
Further, in the above apparatus, the dictionary tree generating device is configured to obtain a lexicon in which template arrangement combination words are recorded; and storing each template arrangement combination word in the word stock into each corresponding branch of the dictionary tree according to the lexical order.
Further, in the above apparatus, the dictionary tree generating device is configured to select only one template permutation combination word from a group of similar template permutation combination words as a main permutation combination word to be stored in the dictionary tree if the lexicon has similar template permutation combination words, and each template permutation combination word not selected from the group of similar template permutation combination words is used as each secondary permutation combination word; and establishing a corresponding relation between the main permutation combination words and the auxiliary permutation combination words.
Further, in the above apparatus, the matching device is configured to, according to a correspondence between the primary arrangement combined word and the secondary arrangement combined word, retain the primary arrangement combined word in each obtained arrangement combined word, and delete the corresponding secondary arrangement combined word, so as to obtain a filtered primary arrangement combined word; matching each filtered main permutation combination word with the template permutation combination word in the dictionary tree to obtain main permutation combination words which are matched consistently; and acquiring the secondary arrangement combined words which are matched with the dictionary tree in a consistent way according to the corresponding relation between the primary arrangement combined words and the secondary arrangement combined words.
Further, in the above apparatus, the matching device is configured to group the main permutation-combination words that share the same head word, and to perform the following iteration on each group: checking whether the dictionary tree contains a node beginning with the head word of the main permutation-combination words of the current group; if not, taking the next group of main permutation-combination words and restarting the iteration; if so, taking the current head word as the starting node, searching the dictionary tree for the words that match the main permutation-combination words of the group, and adding them to a result set.
Further, the device further comprises a scoring device, configured to set a corresponding relationship between each template permutation combination word and the score of the search amount after the permutation combination words which are matched consistently are obtained; and determining the scores of the matched and consistent permutation and combination words according to the corresponding relation.
Further, the above apparatus further includes a deleting device configured to, when the template permutation-combination word is a forbidden word, delete the matched permutation-combination words from the phrase input by the user after the matched permutation-combination words are obtained.
According to another aspect of the present application, there is also provided a computing-based device comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
and matching each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the permutation and combination words that match, wherein each node other than the root node in the dictionary tree stores one word, and a template permutation and combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper-layer node to the lower-layer node.
According to another aspect of the present application, there is also provided a computer-readable storage medium having stored thereon computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
and matching each obtained permutation and combination word against the template permutation and combination words in a dictionary tree to obtain the permutation and combination words that match, wherein each node other than the root node in the dictionary tree stores one word, and a template permutation and combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper-layer node to the lower-layer node.
Compared with the prior art, in the present invention each node other than the root node in the dictionary tree stores one word, and a template permutation-combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper node to the lower node. Each obtained permutation-combination word is matched against the template permutation-combination words in the dictionary tree to obtain those that match, so that the combination words corresponding to the template words of commodity description information can be extracted from the dictionary tree quickly and accurately.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 shows a flow chart of a word segmentation method according to an embodiment of the invention.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As shown in fig. 1, a word segmentation method includes:
step S1, acquiring a phrase input by a user, such as abc;
step S2, splitting the phrase into single words, for example, splitting abc into a, b, and c;
step S3, obtaining a plurality of permutation-combination words from the single words obtained by splitting, where each permutation-combination word comprises one or more of the words arranged in some order; for example, the permutation-combination words obtainable from a, b, and c are a, b, c, ab, ac, bc, abc, ba, ca, cb, acb, bac, bca, cab, and cba;
step S6, matching each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the permutation-combination words that match, wherein each node other than the root node in the dictionary tree stores one word, and a template permutation-combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper node to the lower node; for example, a, b, c, ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, and cba each correspond to a branch in the dictionary tree.
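Steps S2 and S3 can be illustrated with a minimal Python sketch. It assumes that each character of the phrase counts as one single word, and the function name `permutation_combination_words` is hypothetical, not taken from the patent.

```python
from itertools import permutations

def permutation_combination_words(words):
    """Return every ordered arrangement of 1..len(words) of the given words."""
    result = []
    for k in range(1, len(words) + 1):
        for arrangement in permutations(words, k):
            result.append("".join(arrangement))
    return result

phrase = "abc"
single_words = list(phrase)   # step S2 (assumption: one character = one word)
candidates = permutation_combination_words(single_words)   # step S3
print(len(candidates))  # 15
print(candidates)
```

If the phrase contains repeated words, this sketch produces repeated candidates; removing duplicates first is one way to avoid that.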
Here, the dictionary tree, also called a word lookup tree or Trie, is a tree structure and a variant of the hash tree. Its advantage is that it uses the common prefixes of words to reduce query time, minimizing unnecessary word comparisons, so its query efficiency is higher than that of a hash tree.
In this embodiment, each node other than the root node in the dictionary tree stores one word, and a template permutation-combination word is formed either by a single node or by two or more hierarchically adjacent non-root nodes in the same branch, taken in order from the upper-layer node to the lower-layer node. By matching each obtained permutation-combination word against the template permutation-combination words in the dictionary tree, the matching template permutation-combination words can be obtained quickly and accurately. For example, if the permutation-combination words to be matched are: a, b, c, ab, ac, bc, abc, ba, ca, cb, acb, bac, bca, cab, cba,
and the template permutation-combination words in the dictionary tree are: ab, ac, bc, abc, ba, ca, acb, bac, bca, cab, cba,
then the matching template permutation-combination words are those that appear both among the words to be matched and in the dictionary tree, such as ab. In this way the template words corresponding to the commodity description information can be extracted from the dictionary tree quickly and accurately.
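The matching step can be sketched with a small word-level trie in Python. This is an illustration under stated assumptions rather than the patented implementation: the class names `TrieNode` and `WordTrie` are hypothetical, each character is treated as one word, and the `is_template` flag marking where a template word ends is an addition the text does not describe.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # word -> TrieNode
        self.is_template = False  # assumption: marks the end of a template word

class WordTrie:
    def __init__(self):
        self.root = TrieNode()    # the root node stores no word

    def insert(self, words):
        node = self.root
        for w in words:
            node = node.children.setdefault(w, TrieNode())
        node.is_template = True

    def contains(self, words):
        node = self.root
        for w in words:
            node = node.children.get(w)
            if node is None:
                return False
        return node.is_template

templates = ["ab", "ac", "bc", "abc", "ba", "ca", "acb", "bac", "bca", "cab", "cba"]
trie = WordTrie()
for t in templates:
    trie.insert(list(t))

candidates = ["a", "b", "c", "ab", "cb", "abc"]
matched = [c for c in candidates if trie.contains(list(c))]
print(matched)  # ['ab', 'abc']
```

Here cb is rejected even though the path c, b exists (it is a prefix of cba), because no template word ends at that node.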
In an embodiment of the word segmentation method, the dictionary tree is a double-array dictionary tree (Double-Array Trie). A double-array Trie is a compressed form of the Trie structure that represents the Trie with only two linear arrays, combining the fast retrieval of a digital search tree with the compact space usage of a linked Trie. In essence, a double-array Trie is a deterministic finite automaton (DFA): each node represents a state of the automaton, transitions between states are made according to the input variable, and a query completes when an end state is reached or no further transition is possible. The relations between the characters contained in all keys of the double array are expressed by simple addition, which both improves retrieval speed and eliminates the many pointers used in a linked structure, saving storage space.
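The addition-based transition of a double-array Trie can be sketched as follows. This is a naive, unoptimized illustration and not the construction used by the patent: the names `build_double_array` and `contains`, the symbol encoding, and the separate `terminal` set for end states are all assumptions made for brevity. A transition from state `s` on a symbol with code `c` goes to `t = base[s] + c` and is valid only when `check[t] == s`.

```python
def build_double_array(keys):
    """Naive construction of base/check arrays (illustrative only)."""
    # Hypothetical encoding: every distinct symbol gets a positive integer code.
    symbols = sorted({ch for key in keys for ch in key})
    codes = {ch: i + 1 for i, ch in enumerate(symbols)}

    # Build an ordinary nested-dict trie first; the "" entry marks the end of a key.
    trie = {}
    for key in keys:
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = True

    base, check = [0], [0]   # state 0 is the root; a check value of -1 marks a free slot
    terminal = set()         # end states, kept outside the two arrays for brevity

    def grow(n):
        while len(base) <= n:
            base.append(0)
            check.append(-1)

    queue = [(0, trie)]
    while queue:
        state, node = queue.pop(0)
        if "" in node:
            terminal.add(state)
        children = [(codes[ch], child) for ch, child in node.items() if ch != ""]
        if not children:
            continue
        b = 1
        while True:          # smallest offset whose target slots are all free
            grow(b + len(codes))
            if all(check[b + c] == -1 for c, _ in children):
                break
            b += 1
        base[state] = b
        for c, child in children:
            check[b + c] = state
            queue.append((b + c, child))
    return base, check, codes, terminal

def contains(key, base, check, codes, terminal):
    state = 0
    for ch in key:
        code = codes.get(ch)
        if code is None:
            return False
        nxt = base[state] + code          # the "simple addition" transition
        if nxt >= len(check) or check[nxt] != state:
            return False
        state = nxt
    return state in terminal

da = build_double_array(["a", "ab", "abc", "ba", "bac"])
print(contains("bac", *da), contains("ac", *da))  # True False
```

Production double-array implementations usually encode end states inside the arrays and use smarter slot searches; the sketch only shows why lookups need no pointers.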
As shown in fig. 1, in an embodiment of the word segmentation method according to the present invention, before the step S6 matches each obtained permutation-combination word with a template permutation-combination word in a dictionary tree, the method further includes:
step S4, acquiring a word bank recorded with template arrangement combination words;
and step S5, storing each template arrangement combination word in the word stock into each corresponding branch of the dictionary tree according to the lexical order.
Here, converting the lexicon into a dictionary tree and then querying the dictionary tree speeds up obtaining the matching template permutation-combination words.
In an embodiment of the word segmentation method of the present invention, in step S5, storing each template arrangement compound word in the thesaurus into each corresponding branch of the dictionary tree includes:
step S51, if the lexicon contains similar template permutation-combination words, where each group of similar template permutation-combination words consists of template words composed of the same words in different orders, only one template word from each group is selected as the main permutation-combination word and stored in the dictionary tree, and each unselected template word in the group is treated as an auxiliary permutation-combination word; for example, abc, acb and bac form a group of similar template permutation-combination words, of which abc is stored in the dictionary tree as the main permutation-combination word, while acb and bac are auxiliary permutation-combination words that are not stored in the dictionary tree, and abc corresponds to acb and bac;
and step S52, establishing the corresponding relation between the main arrangement combined words and the auxiliary arrangement combined words.
Here, storing only one template word from each group of similar template permutation-combination words in the dictionary tree as the main permutation-combination word reduces redundancy in the dictionary tree and speeds up matching each combination against the dictionary tree.
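Steps S51 and S52 can be sketched in Python as a grouping by constituent words. The function name `split_main_auxiliary` is hypothetical, each character is treated as one word, and choosing the lexically smallest member of each group as the main word is an assumption; the text only requires that exactly one member be chosen.

```python
from collections import defaultdict

def split_main_auxiliary(lexicon):
    """Map each main word to the auxiliary words made of the same words."""
    groups = defaultdict(list)
    for word in lexicon:
        key = tuple(sorted(word))     # same constituent words, order ignored
        groups[key].append(word)
    main_to_auxiliary = {}
    for variants in groups.values():
        variants = sorted(set(variants))
        main, auxiliary = variants[0], variants[1:]   # assumption: smallest is main
        main_to_auxiliary[main] = auxiliary
    return main_to_auxiliary

mapping = split_main_auxiliary(["abc", "acb", "bac", "ab"])
print(mapping)  # {'abc': ['acb', 'bac'], 'ab': []}
```

Only the keys of `mapping` would be inserted into the dictionary tree; the values are kept aside so that auxiliary words can be recovered after a main word matches.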
In an embodiment of the word segmentation method of the present invention, where each group of similar template permutation-combination words consists of template words composed of the same words in different orders, only one template word from each group has been selected as the main permutation-combination word and stored in the dictionary tree, each unselected template word in the group is treated as an auxiliary permutation-combination word, and the correspondence between the main and auxiliary permutation-combination words has been established,
step S6, matching each obtained permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the permutation-combination words that match, includes:
step S61, according to the corresponding relation between the main arrangement combination words and the auxiliary arrangement combination words, keeping the main arrangement combination words in each obtained arrangement combination word, and deleting the corresponding auxiliary arrangement combination words to obtain filtered main arrangement combination words;
step S62, matching each filtered main arrangement combined word with the template arrangement combined word in the dictionary tree to obtain main arrangement combined words which are matched consistently;
and step S63, obtaining the auxiliary permutation-combination words that correspond to the matched main permutation-combination words, according to the correspondence between the main and auxiliary permutation-combination words.
For example, the following lexicon is available:
a;
ab;
abc;
ba;
bac;
abcd.
The phrase input by the user is cba. The phrase is split into single words, and the number of all possible permutation-combination words is A(3, 1) + A(3, 2) + A(3, 3) = 3 + 6 + 6 = 15, specifically as follows:
a;
b;
c;
ab;
ac;
bc;
ba;
ca;
cb;
abc;
acb;
cab;
bac;
bca;
cba.
The permutation-combination words of cba that appear in the lexicon, which are the ones to be found, are as follows:
a;
ab;
abc;
ba;
bac.
The lexicon can first be sorted by ASCII code. If the lexicon contains similar template permutation-combination words, each group consisting of template words composed of the same words in different orders, only one template word from each group is stored in the dictionary tree as the main permutation-combination word. That is, the original lexicon is deduplicated to obtain a new lexicon, and the dictionary tree is built from the new lexicon, as follows:
a[a];
ab[ab,ba];
abc[abc,bac];
abcd[abcd].
When the phrase input by the user is cba, it is first sorted by ASCII code to give abc before being matched against the dictionary tree. Traversing the dictionary tree with abc yields the main permutation-combination words a, ab and abc. The correspondence between the main and auxiliary permutation-combination words is then applied: a yields the single combination a, ab yields the two combinations ab and ba, and abc yields the two combinations abc and bac.
The final match results are thus obtained as:
a;
ab;
ba;
abc;
bac.
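The worked example above can be reproduced with a short Python sketch. It assumes that the main form of an entry is its characters sorted by ASCII code, and it replaces the dictionary tree with a plain dictionary keyed by that main form; the names `variants` and `match` are hypothetical.

```python
from collections import defaultdict

lexicon = ["a", "ab", "abc", "ba", "bac", "abcd"]

# Main (sorted) form -> all lexicon entries built from the same characters.
variants = defaultdict(list)
for entry in lexicon:
    variants["".join(sorted(entry))].append(entry)

def match(phrase):
    key = "".join(sorted(phrase))            # "cba" -> "abc"
    results = []
    for end in range(1, len(key) + 1):       # prefixes met while walking the branch
        results.extend(variants.get(key[:end], []))
    return results

print(match("cba"))  # ['a', 'ab', 'ba', 'abc', 'bac']
```

The sketch only follows the prefixes a, ab and abc of the sorted phrase, which is what this example exercises. Finding main words that skip characters of the sorted phrase needs a fuller traversal of the tree.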
In an embodiment of the word segmentation method of the present invention, step S62, matching each filtered main permutation-combination word against the template permutation-combination words in a dictionary tree to obtain the main permutation-combination words that match, includes:
grouping the main permutation-combination words that share the same head word, and performing the following iteration on each group:
step S621, checking whether the dictionary tree contains a node beginning with the head word of the main permutation-combination words of the current group;
step S622, if not, taking the next group of main permutation-combination words and restarting the iteration;
step S623, if so, taking the current head word as the starting node, searching the dictionary tree for the words that match the main permutation-combination words of the group, and adding them to a result set.
For example, suppose a function for scoring a user's commodity title is required: the commodity title (up to 30 Chinese characters) must be split into characters, and all permutations and combinations of those characters that appear in the lexicon (about 3,000,000 entries) must then be found.
The number of possible permutations and combinations for each commodity title is A(30, 1) + A(30, 2) + ... + A(30, 30), which is a very large amount of data (about 10 to the power of 33).
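The size of this search space can be checked with a few lines of Python, assuming all characters of the title are distinct. The function name `arrangement_count` is hypothetical; `math.perm(n, k)` computes A(n, k) and requires Python 3.8 or later.

```python
from math import perm

def arrangement_count(n):
    """Sum of A(n, k) for k = 1..n: ordered selections of 1 to n distinct items."""
    return sum(perm(n, k) for k in range(1, n + 1))

print(arrangement_count(3))    # 15, as in the cba example
total = arrangement_count(30)
print(f"{total:.2e}")          # order of magnitude for a 30-character title
```

Enumerating that many candidates and testing each against the lexicon is infeasible, which is why the candidates are not generated explicitly in the algorithm that follows.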
The algorithm is designed as follows:
Spaces are removed from every template permutation-combination word in the lexicon, the words are converted to lower case and deduplicated, and the results serve as the main permutation-combination words. A map is constructed to store the main permutation-combination words and their auxiliary permutation-combination words (one main word may correspond to several auxiliary words), and a dictionary tree is built from all the main permutation-combination words;
The commodity title has its spaces removed, is converted to lower case, deduplicated, and sorted by ASCII code. Supposing the result is abcd, the dictionary tree is walked in order to find all main permutation-combination words that may appear. Taking abcd as an example, the steps are:
1.0 check whether the dictionary tree contains a node beginning with a;
if so, go to 1.1; if not, go to 2.0;
1.1 first determine whether a is a leaf node; if so, go to 1.1.1, otherwise go to 1.2;
1.1.1 add the word in the leaf node to the returned result set, then go to 2.0;
1.2 starting from a, check in turn whether nodes beginning with ab, ac, ad, … exist;
1.2.1 add the matched main permutation-combination words to the returned result set; once all nodes beginning with a have been matched, go to 2.0;
2.0 check whether the dictionary tree contains a node beginning with b.
After all the main permutation and combination words have been found, all of their corresponding auxiliary permutation and combination words are looked up in the map, so that the matching main permutation and combination words and their corresponding auxiliary permutation and combination words together form the final matching result.
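By way of illustration only, the walk described above can be sketched as follows. This is not the patented implementation: it assumes that word bank entries are normalized in the same way as the title (spaces removed, lower-cased, de-duplicated and sorted), it marks the end of a word with a flag on the node rather than testing for a leaf, and all names (`canonical`, `TrieNode`, `build_index`, `find_matches`) are hypothetical.

```python
from collections import defaultdict

def canonical(text):
    """Strip spaces, lower-case, de-duplicate characters, sort by code point."""
    return "".join(sorted(set(text.replace(" ", "").lower())))

class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # main word ending at this node, if any

def build_index(lexicon):
    """Build the trie of main words and the map main word -> original entries."""
    root = TrieNode()
    variants = defaultdict(list)
    for entry in lexicon:
        key = canonical(entry)
        if not key:
            continue
        variants[key].append(entry)
        node = root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.word = key
    return root, variants

def find_matches(title, root, variants):
    """Return every main word whose characters all occur in the title,
    together with the word bank entries that map to it."""
    chars = canonical(title)
    found = []

    def walk(node, start):
        if node.word is not None:
            found.append(node.word)
        # Only later characters can follow, because both sides are sorted.
        for i in range(start, len(chars)):
            child = node.children.get(chars[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return {key: variants[key] for key in found}
```

Because both the title and the stored words are sorted, the walk only descends into branches that actually exist in the dictionary tree, so its cost depends on the number of stored words reachable from the title rather than on the number of possible arrangements.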
Here, in this embodiment, the main permutation and combination words that share the same initial word are placed in one group, and the dictionary tree is searched, with the current initial word as the starting node, for main permutation and combination words matching those in the group, which can further increase the search speed.
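By way of illustration only, the grouping by initial word can be sketched as follows. The sketch assumes a plain nested-dictionary trie with a hypothetical `"$end"` marker for complete words; the names `build_trie` and `match_by_group` are introduced here and are not part of the original text.

```python
from collections import defaultdict

END = "$end"  # hypothetical marker for "a template word ends here"

def build_trie(words):
    """Build a nested-dict trie; each key is one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def match_by_group(candidates, root):
    """Group candidates by first character and skip a whole group when the
    trie has no branch for that character."""
    groups = defaultdict(list)
    for word in candidates:
        if word:
            groups[word[0]].append(word)

    results = []
    for head, group in groups.items():
        start = root.get(head)
        if start is None:
            continue  # no node begins with this initial: move to next group
        for word in group:
            node = start
            for ch in word[1:]:
                node = node.get(ch)
                if node is None:
                    break
            else:
                # Reached only if every character was found in the trie.
                if node.get(END):
                    results.append(word)
    return results
```

The benefit is that a single lookup of the initial word rules out every candidate in a group at once, instead of one failed lookup per candidate.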
In an embodiment of the word splitting method of the present invention, after step S6 of matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further includes:
setting a correspondence between each template permutation and combination word and a search volume score;
and determining the score of each matching permutation and combination word according to the correspondence.
For example, the correspondence between each template permutation and combination word and its search volume score is as follows: a: 8 points, b: 1 point, c: 10 points, ab: 7 points, ac: 5 points, bc: 9 points, abc: 4 points, ba: 3 points, ca: 0 points, cb: 2 points;
the matching permutation and combination words are: ab, ac, bc, abc;
the scores of the matching permutation and combination words are: ab: 7 points, ac: 5 points, bc: 9 points, abc: 4 points.
The phrase input by the user may be commodity description information. From the individual scores, or the sum of the scores, of the matching permutation and combination words, the search volume of the commodity description information can be predicted, and the commodity description information can then be modified according to whether its search volume meets the expected requirement.
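By way of illustration only, the score lookup and the score sum for the example above can be sketched as follows. The names `SEARCH_SCORES` and `score_matches` are hypothetical; the values are those of the example.

```python
# Correspondence between template words and search volume scores,
# taken from the example in the text.
SEARCH_SCORES = {
    "a": 8, "b": 1, "c": 10,
    "ab": 7, "ac": 5, "bc": 9, "abc": 4,
    "ba": 3, "ca": 0, "cb": 2,
}

def score_matches(matched, scores):
    """Return the per-word scores of the matched words and their sum."""
    per_word = {word: scores[word] for word in matched if word in scores}
    return per_word, sum(per_word.values())

per_word, total = score_matches(["ab", "ac", "bc", "abc"], SEARCH_SCORES)
print(per_word, total)
```

For the matched words ab, ac, bc and abc the sum is 7 + 5 + 9 + 4 = 25.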
In an embodiment of the word splitting method of the present invention, when the template permutation and combination words are forbidden words,
after step S6 of matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further includes:
deleting the matching permutation and combination words from the phrase input by the user.
The phrase input by the user may be commodity description information. Each obtained permutation and combination word is matched with the forbidden words in the dictionary tree, and once the matching permutation and combination words are obtained, they can be filtered out of the commodity description information so that it is corrected automatically.
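By way of illustration only, the deletion step can be sketched as follows. The text does not specify how a matched word is located in the phrase, so this sketch assumes that a forbidden word is removed only where it occurs as a contiguous substring; the name `remove_forbidden` is hypothetical.

```python
def remove_forbidden(phrase, matched_forbidden):
    """Delete each matched forbidden word from the phrase.

    Assumption: only contiguous occurrences are removed. Longer words are
    removed first so that a short word does not break up a longer one.
    """
    cleaned = phrase
    for word in sorted(set(matched_forbidden), key=len, reverse=True):
        if word:
            cleaned = cleaned.replace(word, "")
    return cleaned

print(remove_forbidden("abcde", ["bc"]))
```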
According to another aspect of the present application, there is also provided a word splitting apparatus, including:
an acquisition device, configured to acquire a phrase input by a user;
a splitting device, configured to split the phrase into single words;
a combining device, configured to obtain a plurality of permutation and combination words from the single words obtained by splitting, where each permutation and combination word includes one or more words arranged in one or more orders;
and a matching device, configured to match each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node.
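By way of illustration only, the four devices can be sketched as four small functions. This is a brute-force sketch that enumerates every arrangement and is therefore only practical for short phrases; it assumes that each character is one "single word", and all names (`split_phrase`, `permutation_words`, `build_trie`, `in_trie`, `match_phrase`) are hypothetical.

```python
from itertools import permutations

END = "$end"  # hypothetical marker for "a template word ends here"

def split_phrase(phrase):
    """Splitting step: one unit per non-space character."""
    return [ch for ch in phrase if not ch.isspace()]

def permutation_words(units):
    """Combining step: every ordered arrangement of 1..n units, de-duplicated."""
    seen = set()
    for k in range(1, len(units) + 1):
        for combo in permutations(units, k):
            word = "".join(combo)
            if word not in seen:
                seen.add(word)
                yield word

def build_trie(template_words):
    """Each non-root node stores one character; a path from an upper node
    to a lower node spells a template word."""
    root = {}
    for word in template_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return node.get(END, False)

def match_phrase(phrase, template_words):
    """Matching step: keep the arrangements that are template words."""
    trie = build_trie(template_words)
    return [w for w in permutation_words(split_phrase(phrase)) if in_trie(trie, w)]

print(match_phrase("abc", ["ab", "ac", "bc", "abc", "xy"]))
```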
In an embodiment of the word splitting apparatus of the present invention, the dictionary tree is a double-array dictionary tree.
In an embodiment of the word splitting apparatus of the present invention, the apparatus further includes a dictionary tree generation device, configured to acquire a word bank in which template permutation and combination words are recorded, and to store each template permutation and combination word in the word bank into the corresponding branch of the dictionary tree according to the lexical order.
In an embodiment of the word splitting apparatus of the present invention, the dictionary tree generation device is configured to: if the word bank contains similar template permutation and combination words, where each group of similar template permutation and combination words is a group of template permutation and combination words composed of the same words in different orders, select only one template permutation and combination word from the group as the main permutation and combination word to be stored in the dictionary tree, and take each unselected template permutation and combination word in the group as an auxiliary permutation and combination word; and establish a correspondence between the main permutation and combination word and the auxiliary permutation and combination words.
In an embodiment of the word splitting apparatus of the present invention, the matching device is configured to: according to the correspondence between the main permutation and combination words and the auxiliary permutation and combination words, retain the main permutation and combination words among the obtained permutation and combination words and delete the corresponding auxiliary permutation and combination words, to obtain filtered main permutation and combination words; match each filtered main permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words; and obtain the matching auxiliary permutation and combination words according to the correspondence between the main permutation and combination words and the auxiliary permutation and combination words.
In an embodiment of the word splitting apparatus of the present invention, the matching device is configured to place the main permutation and combination words that share the same initial word in one group, and to perform the following iteration on each group: check whether the dictionary tree has a node beginning with the initial word of the main permutation and combination words of the current group; if not, take the next group of main permutation and combination words and restart the iteration; if so, take the current initial word as the starting node, search the dictionary tree for main permutation and combination words matching those in the group, and add them to the result set.
In an embodiment of the word splitting apparatus of the present invention, the apparatus further includes a scoring device, configured to, after the matching permutation and combination words are obtained, set a correspondence between each template permutation and combination word and a search volume score, and determine the score of each matching permutation and combination word according to the correspondence.
In an embodiment of the word splitting apparatus of the present invention, the apparatus further includes a deleting device, configured to, when the template permutation and combination words are forbidden words, delete the matching permutation and combination words from the phrase input by the user after they are obtained.
According to another aspect of the application, there is also provided a computing-based device comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
and matching each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node.
According to another aspect of the present application, there is also provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
and matching each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node.
For details of the embodiments of the apparatus and the computer-readable storage medium of the present invention, reference may be made to the embodiments of the method section, which are not described herein again.
In summary, in the present invention, each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node. Each obtained permutation and combination word is matched with the template permutation and combination words in the dictionary tree to obtain the matching ones, so that the permutation and combination words of the commodity description information that correspond to template words can be found through the dictionary tree quickly and accurately.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware, for example, as an Application Specific Integrated Circuit (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Also, the software programs (including associated data structures) of the present invention can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Further, some of the steps or functions of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present invention can be applied as a computer program product, such as computer program instructions, which when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of the computer. Program instructions which invoke the methods of the present invention may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the invention herein comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or solution according to embodiments of the invention as described above.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (16)

1. A word splitting method, wherein the method comprises:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
matching each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node;
wherein, when the template permutation and combination words are forbidden words, after matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further comprises:
deleting the matching permutation and combination words from the phrase input by the user.
2. The method of claim 1, wherein the dictionary tree is a double-array dictionary tree.
3. The method of claim 1, wherein, before matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree, the method further comprises:
acquiring a word bank in which template permutation and combination words are recorded;
and storing each template permutation and combination word in the word bank into the corresponding branch of the dictionary tree according to the lexical order.
4. The method of claim 3, wherein storing each template permutation and combination word in the word bank into the corresponding branch of the dictionary tree comprises:
if the word bank contains similar template permutation and combination words, where each group of similar template permutation and combination words is a group of template permutation and combination words composed of the same words in different orders, selecting only one template permutation and combination word from the group as the main permutation and combination word to be stored in the dictionary tree, and taking each unselected template permutation and combination word in the group as an auxiliary permutation and combination word;
and establishing a correspondence between the main permutation and combination word and the auxiliary permutation and combination words.
5. The method of claim 4, wherein matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words comprises:
according to the correspondence between the main permutation and combination words and the auxiliary permutation and combination words, retaining the main permutation and combination words among the obtained permutation and combination words and deleting the corresponding auxiliary permutation and combination words, to obtain filtered main permutation and combination words;
matching each filtered main permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words;
and obtaining the matching auxiliary permutation and combination words according to the correspondence between the main permutation and combination words and the auxiliary permutation and combination words.
6. The method of claim 5, wherein matching each filtered main permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words comprises:
placing the main permutation and combination words that share the same initial word in one group, and performing the following iteration on each group:
checking whether the dictionary tree has a node beginning with the initial word of the main permutation and combination words of the current group,
if not, taking the next group of main permutation and combination words and restarting the iteration;
if so, taking the current initial word as the starting node, searching the dictionary tree for main permutation and combination words matching those in the group, and adding them to the result set.
7. The method according to any one of claims 1 to 6, wherein, after matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the method further comprises:
setting a correspondence between each template permutation and combination word and a search volume score;
and determining the score of each matching permutation and combination word according to the correspondence.
8. A word splitting apparatus, wherein the apparatus comprises:
an acquisition device, configured to acquire a phrase input by a user;
a splitting device, configured to split the phrase into single words;
a combining device, configured to obtain a plurality of permutation and combination words from the single words obtained by splitting, where each permutation and combination word includes one or more words arranged in one or more orders;
a matching device, configured to match each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node;
and a deleting device, configured to, when the template permutation and combination words are forbidden words, delete the matching permutation and combination words from the phrase input by the user after they are obtained.
9. The apparatus of claim 8, wherein the dictionary tree is a double-array dictionary tree.
10. The apparatus according to claim 8, further comprising a dictionary tree generation device, configured to acquire a word bank in which template permutation and combination words are recorded, and to store each template permutation and combination word in the word bank into the corresponding branch of the dictionary tree according to the lexical order.
11. The apparatus according to claim 10, wherein the dictionary tree generation device is configured to: if the word bank contains similar template permutation and combination words, where each group of similar template permutation and combination words is a group of template permutation and combination words composed of the same words in different orders, select only one template permutation and combination word from the group as the main permutation and combination word to be stored in the dictionary tree, and take each unselected template permutation and combination word in the group as an auxiliary permutation and combination word; and establish a correspondence between the main permutation and combination word and the auxiliary permutation and combination words.
12. The apparatus according to claim 11, wherein the matching device is configured to: according to the correspondence between the main permutation and combination words and the auxiliary permutation and combination words, retain the main permutation and combination words among the obtained permutation and combination words and delete the corresponding auxiliary permutation and combination words, to obtain filtered main permutation and combination words; match each filtered main permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching main permutation and combination words; and obtain the matching auxiliary permutation and combination words according to the correspondence between the main permutation and combination words and the auxiliary permutation and combination words.
13. The apparatus according to claim 12, wherein the matching device is configured to place the main permutation and combination words that share the same initial word in one group, and to perform the following iteration on each group: check whether the dictionary tree has a node beginning with the initial word of the main permutation and combination words of the current group; if not, take the next group of main permutation and combination words and restart the iteration; if so, take the current initial word as the starting node, search the dictionary tree for main permutation and combination words matching those in the group, and add them to the result set.
14. The apparatus according to any one of claims 8 to 13, further comprising a scoring device, configured to, after the matching permutation and combination words are obtained, set a correspondence between each template permutation and combination word and a search volume score, and determine the score of each matching permutation and combination word according to the correspondence.
15. A computing-based device, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
matching each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node;
wherein, when the template permutation and combination words are forbidden words, after matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the instructions further cause the processor to perform:
deleting the matching permutation and combination words from the phrase input by the user.
16. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed by a processor, cause the processor to:
acquiring a phrase input by a user;
splitting the phrase into single words;
obtaining a plurality of permutation and combination words of each single word obtained by splitting, wherein the permutation and combination words comprise one or more words arranged according to one or more sequences;
matching each obtained permutation and combination word with the template permutation and combination words in a dictionary tree to obtain the matching permutation and combination words, wherein each node of the dictionary tree other than the root node stores one word, and a template permutation and combination word is formed either by a single node of the dictionary tree or by two or more non-root nodes that are hierarchically adjacent in the same branch, taken in order from the upper-layer node to the lower-layer node;
wherein, when the template permutation and combination words are forbidden words, after matching each obtained permutation and combination word with the template permutation and combination words in the dictionary tree to obtain the matching permutation and combination words, the instructions further cause the processor to perform:
deleting the matching permutation and combination words from the phrase input by the user.
CN201810086623.3A 2018-01-29 2018-01-29 Word splitting method and device Active CN108304384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810086623.3A CN108304384B (en) 2018-01-29 2018-01-29 Word splitting method and device

Publications (2)

Publication Number Publication Date
CN108304384A CN108304384A (en) 2018-07-20
CN108304384B true CN108304384B (en) 2021-08-27

Family

ID=62866739

Country Status (1)

Country Link
CN (1) CN108304384B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant