CN111310450A

CN111310450A - Character string word segmentation method, device, equipment and storage medium

Info

Publication number: CN111310450A
Application number: CN202010208159.8A
Authority: CN
Inventors: 陈旭明; 林楚荣; 朱祖恩; 程莹; 赵伟
Original assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Current assignee: China Construction Bank Corp
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2020-06-19
Anticipated expiration: 2040-03-23
Also published as: CN111310450B

Abstract

The embodiments of the invention disclose a character string word segmentation method, apparatus, device and storage medium. The method comprises: reading the first character from a target character string, and searching, by hash lookup, a pre-constructed word segmentation dictionary tree for a first root node corresponding to the first character of the target character string; sequentially reading subsequent characters from the target character string, and acquiring, according to the child nodes associated with the first root node, the character state of the child node corresponding to each subsequent character; and judging, according to the character state, whether to split the target character string, so as to obtain a word segmentation result of the target character string. The technical scheme provided by the embodiments of the invention solves the problem that direct search, which looks up the first character of the character string in the word stock entry by entry and then repeatedly checks whether the words formed from that first character appear in the word stock, increases word segmentation complexity; it thereby reduces the complexity of character string word segmentation and saves word segmentation time.

Description

Character string word segmentation method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to computer technology, in particular to a character string word segmentation method, a device, equipment and a storage medium.

Background

Currently, word segmentation is required in many fields to obtain useful information, for example, in the field of logistics, address strings are split to obtain correct addresses of mail recipients.

In the prior art, direct search is often used: the first character of the character string is looked up sequentially in the word stock, so the number of lookups grows with the size of the word stock; it is then repeatedly judged whether the words formed from that first character appear in the word stock, which further increases word segmentation complexity.

Disclosure of Invention

The embodiment of the invention provides a character string word segmentation method, a device, equipment and a storage medium, which are used for reducing the complexity of character string word segmentation and saving word segmentation time.

In a first aspect, an embodiment of the present invention provides a method for segmenting a character string, where the method includes:

reading the first character from a target character string, and searching, by hash lookup, a pre-constructed word segmentation dictionary tree for a first root node corresponding to the first character of the target character string;

sequentially reading subsequent characters from the target character string, and acquiring the character state of the child node corresponding to the subsequent characters according to the child node associated with the first root node;

and judging whether the target character string is split or not according to the character state so as to obtain a word segmentation result of the target character string.

In a second aspect, an embodiment of the present invention further provides a character string word segmentation apparatus, where the apparatus includes:

a first node searching module, used for reading the first character from the target character string and searching, by hash lookup, a pre-constructed word segmentation dictionary tree for a first root node corresponding to the first character of the target character string;

the first character state acquisition module is used for sequentially reading subsequent characters from the target character string and acquiring the character states of the child nodes corresponding to the subsequent characters according to the child nodes associated with the first root node;

and a first word segmentation result acquisition module, used for judging whether to split the target character string according to the character state, so as to acquire the word segmentation result of the target character string.

In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the character string word segmentation method as described above.

In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the character string word segmentation method as described above.

In the embodiments of the invention, the first character is read from a target character string, and a pre-constructed word segmentation dictionary tree is searched for a first root node corresponding to the first character of the target character string; subsequent characters are read sequentially from the target character string, and the character state of the child node corresponding to each subsequent character is acquired according to the child nodes associated with the first root node; and whether to split the target character string is judged according to the character state, so as to obtain the word segmentation result of the target character string. This solves the problem that direct search, which looks up the first character of the character string in the word stock entry by entry and then judges whether the words formed from that first character appear in the word stock, increases word segmentation complexity, and achieves the effects of reducing the complexity of character string word segmentation and saving word segmentation time.

Drawings

Fig. 1 is a flowchart of a character string word segmentation method according to the first embodiment of the present invention;

fig. 2 is another flowchart of the character string word segmentation method according to the first embodiment of the present invention;

fig. 3 is a flow chart of a construction of a segmentation dictionary tree according to a second embodiment of the present invention;

fig. 4 is a schematic structural diagram of a character string segmentation apparatus according to a third embodiment of the present application;

fig. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a character string segmentation method according to an embodiment of the present invention, where this embodiment is applicable to a case where a target character string is split, and the method may be executed by a character string segmentation apparatus according to an embodiment of the present invention, and the apparatus may be implemented in a software and/or hardware manner. Referring to fig. 1, the method for segmenting a character string provided by this embodiment includes:

step 110, reading the first character from the target character string, and searching a first root node corresponding to the first character of the target character string from a pre-constructed word segmentation dictionary tree.

The target character string may be any character string that needs to be split, such as an address character string; this embodiment does not limit it. The address character string may be a long character string such as "Guangdong Province, Shenzhen City, Nanshan District, Aster Avenue, Community A, Building B, Room C" (written without separators in the original Chinese). The first character of the target character string is the first character in the character string, such as "Guang" (广) in the address character string above.

The word segmentation dictionary tree is used to determine whether a specific character string exists. For example, when an address character string is segmented, one or more pre-constructed word segmentation dictionary trees are searched to determine whether the character string contains entries, such as "Guangdong Province" and "Shenzhen City", that can be split off independently. The word segmentation dictionary tree may be a dictionary tree, also called a word lookup tree or Trie, which is a storage structure for dictionaries. Each word in the dictionary corresponds to a path from a root node to a target node, and the characters on the edges of that path, joined in order, form the word.

Hash lookup is a method of locating a data element by computing its storage address. Combined with the word segmentation dictionary tree, each node in the tree is associated with a hash table, that is, hash lookup is used to determine whether a given node exists in the word segmentation dictionary tree. A hash table is a data structure that accesses a memory location directly according to a key: a function of the key value maps the queried data to a position in the table, where the record is accessed. This speeds up searching, reduces the complexity of character string word segmentation, and saves word segmentation time.
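The combination of a dictionary tree with per-node hash tables can be sketched as follows. This is a minimal illustration, not the patented implementation: the names TrieNode, insert and find_path are hypothetical, a Python dict stands in for the hash table, and the three entries are illustrative only.

```python
from typing import Dict, Optional


class TrieNode:
    """One character of an entry; `children` is the node's hash table."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


def insert(roots: Dict[str, TrieNode], entry: str) -> None:
    """Insert one entry, descending one dict lookup per character."""
    level = roots
    for ch in entry:
        level = level.setdefault(ch, TrieNode()).children


def find_path(roots: Dict[str, TrieNode], text: str) -> Optional[TrieNode]:
    """Follow `text` from its first character; return None on the first miss."""
    level = roots
    node = None
    for ch in text:
        node = level.get(ch)  # average O(1), independent of dictionary size
        if node is None:
            return None
        level = node.children
    return node


roots: Dict[str, TrieNode] = {}
for entry in ("广东省", "广西省", "深圳市"):  # illustrative entries (assumption)
    insert(roots, entry)

guang = find_path(roots, "广")             # root node for the first character
print(sorted(guang.children))              # child characters stored under 广
print(find_path(roots, "广西市") is None)  # lookup fails at the third character
```

Because each step is a keyed lookup rather than a scan of the word stock, the cost of finding a root or child node in this sketch does not grow with the number of entries.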

The first root node corresponding to the first character of the target character string is a root node in the word segmentation dictionary tree, and may be the root node of a sub-tree; this embodiment does not limit it. For example, when the first character is "Guang" (广), the word segmentation dictionary tree is searched for a root node associated with "Guang".

Step 120, sequentially reading subsequent characters from the target character string, and acquiring the character states of the child nodes corresponding to the subsequent characters according to the child nodes associated with the first root node.

Subsequent characters are read sequentially from the target character string, that is, the characters of the string are read one by one from front to back, for example "Dong" (东) and then "Sheng" (省). The child nodes associated with the first root node are the child nodes that follow the first root node in the word segmentation dictionary tree; for example, the child nodes of "Guang" (广) may be "Xi" (西) or "Dong" (东). The target character string is thus matched against the word segmentation dictionary tree level by level, from the root node downwards, to obtain a search result.

If the target character string is "Guangxi City", no corresponding child node exists when "City" (市) is read; the search result is then a failure and the search stops. If the target character string is "Guangdong Province" and the search succeeds when "Dong" (东) is read, the character state of "Dong" is acquired. The character state is the state preset for each character when the word segmentation dictionary tree is constructed.

In this embodiment, optionally, the character state includes:

a continuation state, an extension state, and a termination state.

Among all characters of all entries that begin with a given first character: if a character is not the last character of an entry, its state is called the continuation state, for example "Ren" (人) and "Min" (民) in "Renmin Guangchang" (人民广场, People's Square); if a character is the last character of an entry but the entry can also serve as the prefix of a longer entry, its state is called the extension state, for example "Jie" (街) in "Xietang Jie" (斜塘街, Xietang Street), to which "Dao" (道) can be appended to form "Xietang Jiedao" (斜塘街道); if a character is the last character of an entry and the entry cannot serve as the prefix of a longer entry, its state is called the termination state, for example "Sheng" (省, province) and "Shi" (市, city). The advantage of this arrangement is that the position of each character within its entry is distinguished, which improves word segmentation accuracy.
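The three character states can be illustrated with a small sketch that derives a state directly from a set of entries. The names CharState and classify and the sample entries are assumptions made for illustration; the patent itself stores the state on each node of the dictionary tree rather than recomputing it.

```python
from enum import Enum
from typing import Set


class CharState(Enum):
    CONTINUATION = "continuation"  # not the last character of an entry
    EXTENSION = "extension"        # ends an entry that also prefixes a longer one
    TERMINATION = "termination"    # ends an entry that prefixes no longer entry


def classify(prefix: str, entries: Set[str]) -> CharState:
    """State of the last character of `prefix`.

    `prefix` is assumed to be a prefix of (or equal to) at least one entry.
    """
    is_entry = prefix in entries
    has_longer = any(e != prefix and e.startswith(prefix) for e in entries)
    if not is_entry:
        return CharState.CONTINUATION
    return CharState.EXTENSION if has_longer else CharState.TERMINATION


entries = {"人民广场", "斜塘街", "斜塘街道", "广东省"}  # illustrative dictionary
for prefix in ("人民", "斜塘街", "斜塘街道", "广东省"):
    print(prefix, classify(prefix, entries).value)
```

Scanning all entries per query is only for clarity here; in a dictionary tree the same information is available from whether the node ends an entry and whether it has children.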

Step 130, judging whether to split the target character string according to the character state, so as to obtain a word segmentation result of the target character string.

The manner of splitting the target character string is determined according to the character state corresponding to each character in the target character string, thereby obtaining the word segmentation result of the target character string. For example, "Guangdong Province, Shenzhen City, Nanshan District, Aster Avenue, Community A, Building B, Room C" is split into "Guangdong Province", "Shenzhen City", "Nanshan District", "Aster Avenue", "Community A, Building B, Room C" and so on; the character string from the community onwards can be split further, and the manner of splitting is not unique.

In this embodiment, optionally, determining whether to split the target character string according to a character state to obtain a word segmentation result of the target character string, where the determining includes:

if the character state is the termination state, splitting the first character in the target character string to the character in the termination state to form a character splitting result;

judging whether the target character string is split completely;

if yes, determining the character splitting result as the word segmentation result.

If the character state is the termination state, the span from the first character of the target character string to the character in the termination state is split off. For example, if "Sheng" (省) is a termination-state character, the character string from "Guang" (广) to "Sheng" (省) is split from the target character string; if no character follows "Guangdong Province", splitting is complete and "Guangdong Province" is the splitting result. The advantage of this arrangement is that entries are split from the target character string completely and correctly, which improves word segmentation correctness.

In this embodiment, optionally, after judging whether the target character string is split completely, the method further includes:

if not, reading the first character from the split target character string, and searching, by hash lookup, a pre-constructed word segmentation dictionary tree for a second root node corresponding to the first character of the split target character string;

sequentially reading subsequent characters from the split target character string, and acquiring the character state of the child node corresponding to the subsequent characters according to the child node associated with the second root node;

and judging whether the split target character string is split or not according to the character state so as to obtain a word segmentation result of the split target character string.

For example, if characters follow "Guangdong Province", the target character string has not been split completely. "Guangdong Province" is then split from the target character string, and the remaining character string "Shenzhen City, Nanshan District, Aster Avenue, Community A, Building B, Room C" serves as the new target character string. "Shen" (深) is now the first character, and hash lookup is used to find whether a root node associated with "Shen" exists in the word segmentation dictionary tree. The subsequent steps are the same as described in this embodiment and are repeated until the target character string has been split completely, so as to obtain the word segmentation result of the target character string. The advantage of this arrangement is that all splittable character strings in the target character string are split off in turn, which improves word segmentation accuracy and efficiency.

In this embodiment, optionally, the determining whether to split the target character string to obtain the word segmentation result of the target character string includes:

if the character state is the continuation state, reading a subsequent character from the target character string;

if the character state is the extension state, marking the character in the target character string, and reading a subsequent character from the target character string.

In the technical scheme provided by this embodiment, the first character is read from a target character string, and a pre-constructed word segmentation dictionary tree is searched by hash lookup for a first root node corresponding to the first character of the target character string; subsequent characters are read sequentially from the target character string, and the character state of the child node corresponding to each subsequent character is acquired according to the child nodes associated with the first root node; and whether to split the target character string is judged according to the character state, so as to obtain the word segmentation result of the target character string. This solves the problem that direct search, which looks up the first character of the character string in the word stock entry by entry and then judges whether the words formed from that first character appear in the word stock, increases word segmentation complexity, and achieves the effects of reducing the complexity of character string word segmentation and saving word segmentation time.

Fig. 2 is a flowchart of a character string segmentation method according to an embodiment of the present invention, and as shown in fig. 2, the character string segmentation method includes:

step 210, reading the first character from the target character string, and searching a root node corresponding to the first character of the target character string from a pre-constructed word segmentation dictionary tree.

Step 220, sequentially reading subsequent characters from the target character string, and judging whether each subsequent character is found among the child nodes. If the subsequent character is not found among the child nodes, go to step 280; if it is found, go to step 230.

Step 230, acquiring the character state of the child node corresponding to the subsequent character according to the child nodes associated with the root node; if the character state is the continuation state, go to step 220; if the character state is the extension state, go to step 240; if the character state is the termination state, go to step 250.

Step 240, marking the character in the target character string, and going to step 220.

Step 250, judging whether the target character string has been split completely; if so, go to step 260; otherwise, go to step 270.

Step 260, acquiring the splitting result.

Step 270, splitting off the word to obtain the split target character string, and returning to step 210.

Step 280, judging whether the state of the previous character is the extension state; if so, go to step 250; if not, go to step 290.

For example, in "Huanghai Street entrance", no corresponding child node is found when the character for "entrance" is read; it is then judged whether the character for "street" is in the extension state, and if so, "Huanghai Street" is split off.

Step 290, indexing to the previous mark, obtaining the target character string at the mark, and returning to step 210.

The mark may be an extension mark, or any other problem mark added during actual operation; this embodiment does not limit it. Searching again from the marked position prevents errors introduced during searching and improves word segmentation accuracy. Optionally, if the number of searches exceeds a preset number and a splitting result still cannot be obtained, the splittable part of the character string is split and the unsplittable part is deleted to obtain the final splitting result; alternatively, an error notification is issued. This embodiment does not limit this.
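The flow of steps 210 to 290 can be sketched as a single loop. This is one possible reading of the flowchart, under stated assumptions: Node, build and segment are hypothetical names, the state is derived from an end-of-entry flag rather than stored, step 290 is interpreted as splitting at the most recent extension mark, and a character that matches no entry is emitted on its own (the text instead mentions deleting the unsplittable part or issuing an error notification).

```python
from typing import Dict, List, Optional

CONT, EXT, TERM = "continuation", "extension", "termination"


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}  # hash table of child nodes
        self.is_end = False                     # True if an entry ends here

    @property
    def state(self) -> str:
        if not self.is_end:
            return CONT
        return EXT if self.children else TERM


def build(entries: List[str]) -> Node:
    top = Node()  # sentinel whose children are the root nodes
    for entry in entries:
        node = top
        for ch in entry:
            node = node.children.setdefault(ch, Node())
        node.is_end = True
    return top


def segment(text: str, top: Node) -> List[str]:
    words: List[str] = []
    start = 0
    while start < len(text):                    # step 250: anything left to split?
        node, i = top, start
        mark: Optional[int] = None              # end index of the last extension-state match
        cut: Optional[int] = None
        while i < len(text):
            child = node.children.get(text[i])  # steps 210/220: hash lookup
            if child is None:
                break                           # step 280: no child node found
            node, i = child, i + 1
            if node.state == TERM:              # step 230 -> 250
                cut = i
                break
            if node.state == EXT:               # step 240: mark and continue
                mark = i
        if cut is None:
            cut = mark                          # steps 280/290: fall back to the mark
        if cut is None:
            cut = start + 1                     # assumption: emit the unmatched character
        words.append(text[start:cut])           # step 270: split off the word
        start = cut
    return words


top = build(["广东省", "深圳市", "南山区", "黄海街", "黄海街道"])  # illustrative entries
print(segment("广东省深圳市南山区", top))
print(segment("黄海街口", top))
```

In the second call the lookup for the last character fails, so the sketch falls back to the mark left on the extension-state character and splits off the shorter entry, mirroring the "Huanghai Street entrance" example.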

Example two

Fig. 3 is a flowchart for constructing a word segmentation dictionary tree according to the second embodiment of the present invention; this embodiment supplements the above technical solution with the construction process of the word segmentation dictionary tree. Compared with the above solution, this solution is specifically optimized in that the construction process of the word segmentation dictionary tree comprises the following steps:

reading a first character from the entry character string to be used as a root node of the word segmentation dictionary tree;

sequentially reading subsequent characters from the entry character string, and sequentially judging whether sub-nodes corresponding to the subsequent characters exist in the word segmentation dictionary tree or not;

if not, the characters in the entry character string are used as child nodes to be sequentially inserted into the word segmentation dictionary tree, and the character state of the inserted characters is determined.

Specifically, the flow chart of constructing the word segmentation dictionary tree is shown in fig. 3:

Step 310, reading the first character from the entry character string to be used as a root node of the word segmentation dictionary tree.

The entry character string is a standard entry that has already been determined, such as "Shanxi Province". In this case the first character "Shan" is used as a root node of the word segmentation dictionary tree, which may also be the root node of a sub-tree in the word segmentation dictionary tree, and the character state of the root node is defined as the continuation state.

Step 320, sequentially reading subsequent characters from the entry character string, and sequentially judging whether child nodes corresponding to the subsequent characters exist in the word segmentation dictionary tree.

For example, when the entry "Guangxi Province" is inserted and "Guangdong Province" already exists, the nodes for "Guang" (广) and "Sheng" (省) already exist. If the child node already exists, no insertion is performed and the next character is read. Step 320 is repeated until all characters of the entry character string have been inserted into the word segmentation dictionary tree.

Step 330, if not, inserting the characters in the entry character string as child nodes into the word segmentation dictionary tree in sequence, and determining the character state of the inserted characters.

If the child node corresponding to a subsequent character does not exist, the characters of the entry character string are inserted in turn into the word segmentation dictionary tree as child nodes, and the character state of each inserted node is determined; for example, after "Xi" (西) is inserted under "Guang" (广), the character state of "Xi" is determined as the extension state. Optionally, if a new node is inserted after a node whose character state is the termination state, that node's termination state is changed to the extension state.

If the newly inserted node is a last node in the word segmentation dictionary tree, its character state is determined to be the termination state.

Steps 310 to 330 are repeated until all entries have been inserted into the word segmentation dictionary tree.

On the basis of the above embodiment, constructing the word segmentation dictionary tree merges the common prefixes of the entry character strings, which saves storage space, reduces the time for subsequent character string searches, reduces the complexity of character string word segmentation, and saves word segmentation time.
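The construction steps 310 to 330 can be sketched as an incremental insert that updates character states as entries arrive. The names Node, insert and state_of are hypothetical, and the state assignment follows the definitions given in the first embodiment (an assumption where the text above is ambiguous): intermediate characters stay in the continuation state, the last character of an entry is in the termination state, and a termination-state node that later gains a child becomes an extension-state node.

```python
from typing import Dict

CONT, EXT, TERM = "continuation", "extension", "termination"


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.state = CONT  # new nodes start in the continuation state


def insert(top: Node, entry: str) -> None:
    """Insert one entry below the sentinel `top`, updating character states."""
    if not entry:
        return
    node = top
    for ch in entry:
        child = node.children.get(ch)  # step 320: does the child node exist?
        if child is None:              # step 330: insert the missing node
            child = Node()
            node.children[ch] = child
            if node.state == TERM:     # a longer entry now passes through it
                node.state = EXT
        node = child
    # The last character ends an entry: extension if longer entries exist.
    node.state = EXT if node.children else TERM


def state_of(top: Node, path: str) -> str:
    """Return the stored state of the node reached by following `path`."""
    node = top
    for ch in path:
        node = node.children[ch]
    return node.state


top = Node()  # sentinel whose children are the root nodes
for entry in ("斜塘街", "斜塘街道", "广东省", "广西省"):  # illustrative entries
    insert(top, entry)

for path in ("广", "广东省", "斜塘街", "斜塘街道"):
    print(path, state_of(top, path))
```

Because "广东省" and "广西省" share the node for their first character, the common prefix is stored once, and the resulting states do not depend on the order in which the entries are inserted.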

Example three

Fig. 4 is a schematic structural diagram of a character string segmentation apparatus according to a third embodiment of the present application. The device can execute the character string word segmentation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

As shown in fig. 4, a character string word segmentation apparatus includes:

a first node searching module 410, configured to read an initial character from a target character string, and search, from a pre-constructed word segmentation dictionary tree, a first root node corresponding to the initial character of the target character string;

a first character state obtaining module 420, configured to sequentially read subsequent characters from the target character string, and obtain, according to a child node associated with the first root node, a character state of a child node corresponding to the subsequent character;

a first word segmentation result obtaining module 430, configured to determine whether to split the target character string according to the character state, so as to obtain a word segmentation result of the target character string.

On the basis of the above technical solutions, optionally, the apparatus includes a dictionary tree building module, where the dictionary tree building module includes:

the first character reading unit is used for reading a first character from the entry character string to be used as a root node of the word segmentation dictionary tree;

a child node judging unit, configured to sequentially read subsequent characters from the entry character string, and sequentially judge whether a child node corresponding to the subsequent character already exists in the word segmentation dictionary tree;

and a character state determining unit, configured to, if the child node judging unit judges that the child node does not exist, sequentially insert the characters in the entry character string into the word segmentation dictionary tree as child nodes, and determine the character state of the inserted characters.

On the basis of the above technical solutions, optionally, the character state includes:

a continuation state, an extension state, and a termination state.

On the basis of the above technical solutions, optionally, the first segmentation result obtaining module 430 includes:

the character splitting unit is used for splitting the first character in the target character string to the character in the termination state to form a character splitting result if the character state is the termination state;

the splitting judgment unit is used for judging whether the target character string is split completely;

and a word segmentation result determining unit, configured to determine the character splitting result as the word segmentation result if the splitting judgment unit judges that splitting is complete.

On the basis of the above technical solutions, optionally, the apparatus further includes:

a second node searching module, configured to, if the splitting judgment unit judges that splitting is not complete, read the first character from the split target character string, and search a pre-constructed word segmentation dictionary tree for a second root node corresponding to the first character of the split target character string;

a second character state obtaining module, configured to sequentially read subsequent characters from the split target character string, and obtain, according to a child node associated with the second root node, a character state of a child node corresponding to the subsequent character;

and the second word segmentation result acquisition module is used for judging whether the split target character string is split or not according to the character state so as to acquire the word segmentation result of the split target character string.

Example four

Fig. 5 is a schematic structural diagram of a device according to the fourth embodiment of the present invention. As shown in fig. 5, the device includes a processor 50, a memory 51, an input device 52, and an output device 53; the number of processors 50 in the device may be one or more, and one processor 50 is taken as an example in fig. 5; the processor 50, the memory 51, the input device 52 and the output device 53 in the device may be connected by a bus or by other means, and connection by a bus is taken as an example in fig. 5.

The memory 51 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the character string segmentation method in the embodiment of the present invention. The processor 50 executes various functional applications of the device and data processing by executing software programs, instructions and modules stored in the memory 51, that is, implements the above-described character string segmentation method.

The memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 51 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 51 may further include memory located remotely from the processor 50, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Example five

An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for segmenting a character string, the method including:

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the character string segmentation method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the above character string segmentation apparatus, each included unit and module are only divided according to functional logic, but are not limited to the above division, as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A character string word segmentation method is characterized by comprising the following steps:

2. The method of claim 1, wherein the construction process of the segmentation dictionary tree comprises:

3. The method of claim 1 or 2, wherein the character state comprises:

a continuation state, an extension state, and a termination state.

4. The method of claim 3, wherein determining whether to split the target character string according to the character status to obtain a word segmentation result of the target character string comprises:

judging whether the target character string is split completely;

5. The method of claim 4, after judging whether the target character string is split completely, further comprising:

6. A character string word segmentation apparatus, comprising:

7. The apparatus of claim 6, wherein the dictionary tree construction module comprises:

8. The apparatus of claim 6 or 7, wherein the character state comprises:

a continuation state, an extension state, and a termination state.

9. An apparatus, characterized in that the apparatus comprises:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the character string word segmentation method as recited in any of claims 1-5.

10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the character string word segmentation method according to any one of claims 1-5.