
CN111310450B - Character string word segmentation method, device, equipment and storage medium

Info

Publication number: CN111310450B
Application number: CN202010208159.8A
Authority: CN
Inventors: 陈旭明; 林楚荣; 朱祖恩; 程莹; 赵伟
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2023-07-14
Anticipated expiration: 2040-03-23
Also published as: CN111310450A

Abstract

The embodiment of the invention discloses a character string word segmentation method, device, equipment and storage medium. The method comprises: reading the first character from a target character string, and finding by hash lookup, in a pre-constructed word segmentation dictionary tree, a first root node corresponding to that first character; sequentially reading subsequent characters from the target character string, and acquiring the character states of the child nodes corresponding to the subsequent characters according to the child nodes associated with the first root node; and judging, according to the character states, whether to split the target character string, so as to obtain a word segmentation result of the target character string. The technical scheme provided by the embodiment addresses the problem that word segmentation complexity increases when the first character of a character string is looked up by direct search in a word stock and the word stock is then checked for words formed from that first character, and thereby reduces the complexity of character string word segmentation and saves word segmentation time.

Description

Character string word segmentation method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to a computer technology, in particular to a character string word segmentation method, a device, equipment and a storage medium.

Background

Currently, in many fields, word segmentation must be performed on character strings to obtain useful information. For example, in the field of logistics, an address character string is split to obtain the correct sending and receiving addresses.

In the prior art, a direct search is often used to look up the first character of a character string in the word stock, so the number of lookups is positively correlated with the size of the word stock. The word stock must then be checked repeatedly to determine whether words formed from that first character appear in it, which further increases the complexity of word segmentation.

Disclosure of Invention

The embodiment of the invention provides a character string word segmentation method, a device, equipment and a storage medium, which are used for realizing the effects of reducing the complexity of character string word segmentation and saving word segmentation time.

In a first aspect, an embodiment of the present invention provides a method for word segmentation of a character string, where the method includes:

reading a first character from a target character string, and finding by hash lookup, in a pre-constructed word segmentation dictionary tree, a first root node corresponding to the first character of the target character string;

sequentially reading subsequent characters from the target character string, and acquiring character states of the sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the first root node;

and judging whether to split the target character string according to the character state so as to obtain a word segmentation result of the target character string.

In a second aspect, an embodiment of the present invention further provides a device for word segmentation of a character string, where the device includes:

the first node searching module is used for reading the first character from the target character string and carrying out hash searching on a first root node corresponding to the first character of the target character string from a pre-constructed word segmentation dictionary tree;

the first character state acquisition module is used for sequentially reading subsequent characters from the target character string and acquiring character states of the sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the first root node;

and the first word segmentation result acquisition module is used for judging whether to split the target character string according to the character state so as to acquire the word segmentation result of the target character string.

In a third aspect, an embodiment of the present invention further provides an apparatus, including:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the string word segmentation method as described above.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a character string word segmentation method as described above.

According to the embodiment of the invention, the first character is read from the target character string, and the first root node corresponding to that first character is found by hash lookup in the pre-constructed word segmentation dictionary tree; subsequent characters are read sequentially from the target character string, and the character states of the child nodes corresponding to the subsequent characters are acquired according to the child nodes associated with the first root node; and whether to split the target character string is judged according to the character states, so as to obtain the word segmentation result of the target character string. This solves the problem that word segmentation complexity increases when the first character of a character string is looked up by direct search in a word stock and the word stock is then checked for words formed from that first character, and achieves the effects of reducing the complexity of character string word segmentation and saving word segmentation time.

Drawings

FIG. 1 is a flowchart of a method for word segmentation of a character string according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a method for word segmentation of a character string according to a first embodiment of the present invention;

FIG. 3 is a flowchart for constructing a word segmentation dictionary tree according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a character string word segmentation device according to a third embodiment of the present application;

FIG. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Example 1

Fig. 1 is a flowchart of a character string word segmentation method according to an embodiment of the present invention, where the embodiment is applicable to a case of splitting a target character string, the method may be performed by a character string word segmentation device according to an embodiment of the present invention, and the device may be implemented by software and/or hardware. Referring to fig. 1, the method for word segmentation of a character string provided in this embodiment includes:

Step 110, reading a first character from a target character string, and finding by hash lookup, in a pre-constructed word segmentation dictionary tree, a first root node corresponding to the first character of the target character string.

The target character string is a character string that needs to be split, for example an address character string; this embodiment does not limit it. The address character string may be a long character string such as "Guangdong Province, Shenzhen City, Nanshan District, Xueyuan Avenue, Community A, Building B, Room C". The first character of the target character string is the first character in the string, such as "Guang" (广, literally "wide") in this address character string.

The word segmentation dictionary tree is used to determine whether a specific character string exists. For example, when the address character string is segmented, one or more pre-constructed word segmentation dictionary trees can be searched to determine whether entries that can be split off from the character string independently, such as "Guangdong Province" or "Shenzhen City", exist. The word segmentation dictionary tree may be a dictionary tree, also called a word search tree or Trie, which is a way of storing a dictionary. Each word in the dictionary is a path from the root node to a target node, and the characters on the edges along that path are concatenated to form the word.

Hash lookup is a method of searching by computing the storage address of a data element. In combination with the word segmentation dictionary tree, each node in the tree is associated with a hash table; that is, hash lookup is used to determine whether a given node exists in the word segmentation dictionary tree. A hash table is a data structure that accesses a memory storage location directly by key: a function of the key value is computed to map the queried data to a position in the table, where the record is accessed.
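As an illustration of associating a hash table with each node, the following minimal Python sketch stores every node's children in a built-in `dict` (itself a hash table), so checking whether a character has a node is a single keyed access rather than a scan of the word stock. The names `Node`, `insert` and `find_root`, the layout of `roots`, and the sample entries are assumptions made for illustration and do not come from the patent.

```python
class Node:
    """One node of the dictionary tree with its associated hash table."""

    def __init__(self):
        # A dict is a hash table: character -> child Node.
        self.children = {}


def insert(roots, entry):
    """Add one entry; `roots` is the hash table of root nodes (assumed layout)."""
    node = roots.setdefault(entry[0], Node())
    for ch in entry[1:]:
        node = node.children.setdefault(ch, Node())


def find_root(roots, text):
    """Hash-look up the root node for the first character of `text`, or None."""
    return roots.get(text[0]) if text else None


roots = {}
for entry in ("广东省", "广西", "深圳市"):  # illustrative entries only
    insert(roots, entry)

root = find_root(roots, "广东省深圳市")
print(sorted(root.children))        # characters that may follow "广"
print(find_root(roots, "北京市"))    # no root node exists for "北"
```

Each lookup costs one hash computation regardless of how many entries the tree holds, which is the property the description relies on.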

The first root node corresponding to the first character of the target character string is a root node in the word segmentation dictionary tree, and may be the root node of a subtree; this embodiment does not limit it. For example, when the first character is "Guang", the word segmentation dictionary tree is searched for a root node associated with "Guang".

Step 120, sequentially reading subsequent characters from the target character string, and acquiring character states of the child nodes corresponding to the subsequent characters according to the child nodes associated with the first root node.

Sequentially reading subsequent characters from the target character string means reading the characters of the string one after another in order, for example "dong" (east) and then "sheng" (province). The child nodes associated with the first root node are the nodes that follow the first root node in the word segmentation dictionary tree; for example, the child node of "Guang" may be "xi" (west) or "dong" (east). The target character string is matched against the word segmentation dictionary tree downwards from the root node, one character at a time, to obtain a lookup result.

If the target character string is "Guangxi city", no corresponding child node exists when "city" is looked up; the lookup result is a failure and the search stops. If the target character string is "Guangdong province", the lookup succeeds when "dong" is looked up, and the character state of "dong" is acquired. The character state is the state preset for each character when the word segmentation dictionary tree is constructed.
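The success and failure cases above can be sketched in Python as a walk down a small hand-written tree. The nested layout (character mapped to a pair of state and children), the function name `walk`, and the choice of states for the sample nodes are assumptions for illustration only.

```python
# Hypothetical hand-written tree: character -> (state, children).
TREE = {
    "广": ("continuation", {
        "东": ("continuation", {"省": ("termination", {})}),
        "西": ("termination", {}),
    }),
}


def walk(text, tree):
    """Follow `text` downwards from the root level.

    Returns the preset state of the last character reached, or None as
    soon as a character has no corresponding child node (lookup failure).
    """
    level, state = tree, None
    for ch in text:
        if ch not in level:      # hash lookup among the current children
            return None
        state, level = level[ch]
    return state


print(walk("广东", TREE))    # state preset for "东"
print(walk("广西市", TREE))  # "市" has no node under "西", so the lookup fails
```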

In this embodiment, optionally, the character status includes:

a continuation state, an extension state, and a termination state.

Within the set of all characters of all entries that begin with a given character: if a character is not the last character of an entry, it is said to be in the continuation state, such as the first two characters of "People's Square"; if a character is the last character of an entry but that entry can also serve as a prefix of a longer entry, it is said to be in the extended state, such as "street" in "Cross-pond Street", to which a further character can be appended to form a longer entry; if a character is the last character of an entry and that entry cannot serve as a prefix of a longer entry, it is said to be in the termination state, such as "province" or "city". The advantage of this arrangement is that the position of each character within its entry is distinguished, which improves word segmentation accuracy.
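The three definitions can be restated as a small Python sketch that derives a character's state directly from a set of entries. This is only a reading of the definitions, not the patent's construction procedure; the names `CharState` and `state_of_prefix` and the sample entries are hypothetical.

```python
from enum import Enum


class CharState(Enum):
    CONTINUATION = "continuation"  # not the last character of an entry
    EXTENDED = "extended"          # ends an entry that prefixes a longer entry
    TERMINATION = "termination"    # ends an entry that prefixes nothing longer


def state_of_prefix(prefix, entries):
    """State of the last character of `prefix`, judged against all entries."""
    is_entry = prefix in entries
    has_longer = any(e != prefix and e.startswith(prefix) for e in entries)
    if is_entry and has_longer:
        return CharState.EXTENDED
    if is_entry:
        return CharState.TERMINATION
    if has_longer:
        return CharState.CONTINUATION
    return None  # the prefix does not occur in any entry


entries = {"人民广场", "南山", "南山区", "广东省"}  # illustrative entries only
for prefix in ("人民", "南山", "南山区", "广东省"):
    print(prefix, state_of_prefix(prefix, entries).value)
```

Scanning every entry per query is done here only to keep the definitions visible; a dictionary tree stores the state on each node so that no scan is needed.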

Step 130, judging whether to split the target character string according to the character state so as to obtain a word segmentation result of the target character string.

The way the target character string is split is determined from the character state of each character in it, which yields the word segmentation result of the target character string. For example, "Guangdong Province, Shenzhen City, Nanshan District, Xueyuan Avenue, Community A, Building B, Room C" is split into "Guangdong Province", "Shenzhen City", "Nanshan District", "Xueyuan Avenue", "Community A, Building B, Room C" and so on; the part from the community onwards can be split further, so the way of splitting is not unique.

In this embodiment, optionally, according to the character state, determining whether to split the target character string to obtain a word segmentation result of the target character string includes:

if the character state is the termination state, splitting off the characters from the first character of the target character string through the character in the termination state, to form a character splitting result;

judging whether the splitting of the target character string is finished or not;

if yes, determining the character splitting result as the word segmentation result.

If the character state is the termination state, the span from the first character of the target character string through the character in the termination state is split off. For example, if "province" is in the termination state, the characters from "Guang" through "province" are split from the target character string; if no character follows "Guangdong Province", splitting is finished and "Guangdong Province" is the splitting result. The advantage of this arrangement is that the entry is split from the target character string completely and correctly, which improves the accuracy of word segmentation.

In this embodiment, optionally, after judging whether the splitting of the target string is finished, the method further includes:

if not, reading the first character from the split target character string, and carrying out hash searching on a second root node corresponding to the first character of the split target character string from a pre-constructed word segmentation dictionary tree;

sequentially reading subsequent characters from the split target character string, and acquiring character states of the sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the second root node;

and judging whether to split the split target character string according to the character state so as to obtain a word segmentation result of the split target character string.

For example, if characters follow "Guangdong Province", the splitting of the target character string is not finished. "Guangdong Province" is split from the target character string, and the remaining string "Shenzhen City, Nanshan District, Xueyuan Avenue, Community A, Building B, Room C" is taken as the new target character string, whose first character is "Shen" (深, literally "deep"); a hash lookup determines whether a root node associated with "Shen" exists in the word segmentation dictionary tree. The subsequent steps are the same as described in this embodiment and are repeated until the splitting of the target character string is finished, which yields the word segmentation result of the target character string. The advantage is that all splittable entries in the target character string are split off in sequence, which improves the accuracy and efficiency of word segmentation.
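The repeated split-and-restart procedure can be sketched in Python as follows. This is a simplified illustration that handles only the continuation and termination states and assumes no entry is a prefix of another; the helper names `build` and `segment`, the list-based node layout, and the sample entries are assumptions, not part of the patent.

```python
CONT, TERM = "continuation", "termination"


def build(entries):
    """Nested dicts: character -> [state, children]. Assumes prefix-free entries."""
    roots = {}
    for entry in entries:
        level = roots
        for i, ch in enumerate(entry):
            node = level.setdefault(ch, [CONT, {}])
            if i == len(entry) - 1:
                node[0] = TERM          # last character of the entry
            level = node[1]
    return roots


def segment(text, roots):
    """Split `text` at every termination state, restarting at the root each time."""
    result, start = [], 0
    while start < len(text):            # splitting is not finished yet
        level, pos = roots, start
        while True:
            if pos >= len(text) or text[pos] not in level:
                raise ValueError(f"no entry matches at offset {start}")
            state, children = level[text[pos]]
            pos += 1
            if state == TERM:           # split from the first character to here
                result.append(text[start:pos])
                start = pos             # the remainder becomes the new target
                break
            level = children            # continuation: read the next character
    return result


roots = build(["广东省", "深圳市", "南山区"])  # illustrative entries only
print(segment("广东省深圳市南山区", roots))
```

Raising an exception on a failed lookup is a choice made for this sketch; the description only states that the search stops.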

In this embodiment, optionally, determining whether to split the target character string to obtain a word segmentation result of the target character string includes:

if the character state is the continuation state, reading the subsequent character from the target character string;

if the character state is the extended state, the character is marked in the target character string, and the subsequent character is read from the target character string.

According to the technical scheme provided by this embodiment, the first character is read from the target character string, and the first root node corresponding to that first character is found by hash lookup in the pre-constructed word segmentation dictionary tree; subsequent characters are read sequentially from the target character string, and the character states of the child nodes corresponding to the subsequent characters are acquired according to the child nodes associated with the first root node; and whether to split the target character string is judged according to the character states, so as to obtain the word segmentation result of the target character string. This solves the problem that word segmentation complexity increases when the first character of a character string is looked up by direct search in a word stock and the word stock is then checked for words formed from that first character, and achieves the effects of reducing the complexity of character string word segmentation and saving word segmentation time.

Fig. 2 is a flowchart of a character string word segmentation method according to an embodiment of the present invention, where, as shown in fig. 2, the character string word segmentation method includes:

Step 210, reading the first character from the target character string, and finding by hash lookup, in the pre-constructed word segmentation dictionary tree, the root node corresponding to the first character of the target character string.

Step 220, reading the subsequent characters from the target character string in turn, and judging whether the subsequent character is found among the child nodes. If the subsequent character is not found among the child nodes, go to step 280; if it is found, go to step 230.

Step 230, acquiring the character state of the child node corresponding to the subsequent character according to the child node associated with the root node; if the character state is the continuation state, go to step 220; if the character state is the extended state, go to step 240; if the character state is the termination state, go to step 250.

Step 240, marking the character in the target character string and returning to step 220.

Step 250, judging whether the splitting of the target character string is finished, if yes, turning to step 260; if not, go to step 270.

Step 260, obtaining the splitting result.

Step 270, splitting off the word to obtain the split target character string, and returning to step 210.

Step 280, judging whether the state of the preceding character is the extended state; if so, go to step 250; if not, go to step 290.

For example, for "Yellow Sea Street" followed by "port": when "port" is read and no corresponding child node is found, it is judged whether "street" is in the extended state; if "street" is in the extended state, "Yellow Sea Street" is split off.

Step 290, indexing back to the previous mark, obtaining the target character string at the mark, and returning to step 210.

The mark may be an extension mark, or any other problem mark added during actual operation; this embodiment does not limit it. Searching again from the character at the mark guards against errors produced during lookup and improves word segmentation accuracy. Optionally, if the number of retries exceeds a preset number and a splitting result still cannot be obtained, the portion that has already been split is output and the portion that cannot be split is deleted to obtain the final splitting result; alternatively, an error notification is issued. This embodiment does not limit this.
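One possible reading of the flow of steps 210 to 290 is sketched below in Python. It is an interpretation under stated assumptions, not the patented implementation: reaching the end of the string is treated like a failed lookup, a failure with no earlier mark raises an error (standing in for the optional error notification), and the names `build` and `segment`, the node layout, and the sample entries are hypothetical.

```python
CONT, EXT, TERM = "continuation", "extended", "termination"


def build(entries):
    """Compact stand-in for the tree: character -> {"state", "children"}."""
    roots = {}
    for entry in entries:
        level = roots
        for i, ch in enumerate(entry):
            node = level.setdefault(ch, {"state": CONT, "children": {}})
            if i == len(entry) - 1:
                node["state"] = EXT if node["children"] else TERM
            elif node["state"] == TERM:
                node["state"] = EXT     # an entry end gained a successor
            level = node["children"]
    return roots


def segment(text, roots):
    result, start = [], 0
    while start < len(text):                        # step 210: restart at a root
        level, pos, mark = roots, start, None
        while True:
            # Step 220: read the next character and look it up by hash.
            node = level.get(text[pos]) if pos < len(text) else None
            if node is None:
                if mark is None:
                    raise ValueError(f"cannot segment at offset {start}")
                # Step 280: if the preceding character was extended, mark == pos
                # and the split happens right here. Step 290: otherwise fall
                # back to the earlier mark and resume from there.
                pos = mark
                break
            pos += 1
            if node["state"] == TERM:               # step 230 -> step 250
                break
            if node["state"] == EXT:                # step 240: remember the mark
                mark = pos
            level = node["children"]
        result.append(text[start:pos])              # step 270: split off the word
        start = pos
    return result                                   # step 260


roots = build(["黄海街", "黄海街道", "口岸"])       # illustrative entries only
print(segment("黄海街口岸", roots))
```

Recording only the most recent mark is a simplification; the description also allows other kinds of marks and a retry limit, which this sketch omits.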

Example two

Fig. 3 is a flowchart for constructing a word segmentation dictionary tree according to a second embodiment of the present invention. This embodiment supplements the foregoing technical scheme with a description of the construction process of the word segmentation dictionary tree. Compared with the foregoing scheme, this scheme is specifically optimized in that the construction process of the word segmentation dictionary tree comprises the following steps:

reading first characters from the entry character strings and taking the first characters as root nodes of the word segmentation dictionary tree;

sequentially reading subsequent characters from the entry character string, and sequentially judging whether child nodes corresponding to the subsequent characters exist in the word segmentation dictionary tree or not;

if not, sequentially inserting the characters in the entry character string into the word segmentation dictionary tree as child nodes, and determining the character state of the inserted characters.

Specifically, the construction flow chart of the word segmentation dictionary tree is shown in fig. 3:

step 310, reading the first character from the entry character string as the root node of the word segmentation dictionary tree.

The entry character string is a standard entry that has already been determined, for example "Shaanxi province". Here the first character "Shan" is used as a root node of the word segmentation dictionary tree, which may also be the root node of a subtree in the tree, and the character state corresponding to the root node is defined as the continuation state.

Step 320, sequentially reading subsequent characters from the entry character string, and sequentially judging whether child nodes corresponding to the subsequent characters exist in the word segmentation dictionary tree.

For example, when the entry "Guangxi province" is inserted and "Guangdong province" already exists, the "Guangdong" and "province" nodes both already exist. If a child node already exists, no insertion is performed and the next character is read. Step 320 is repeated until all characters in the entry character string have been inserted into the word segmentation dictionary tree.

Step 330, if not, sequentially inserting the characters in the entry character string into the word segmentation dictionary tree as child nodes, and determining the character state of the inserted characters.

If no child node corresponding to a subsequent character exists, the characters of the entry character string are inserted into the word segmentation dictionary tree in sequence as child nodes, and the character state of each inserted node is determined; for example, "xi" (west) is inserted under the existing "Guang" node, and the character state of "xi" is determined to be the extended state. Optionally, if a new node is inserted after a node whose character state is the termination state, that node's termination state is changed to the extended state.

And if the newly inserted node is the last node in the word segmentation dictionary tree, determining the character state of the node as a termination state.

Steps 310-330 are repeated until all entries are inserted into the word segmentation dictionary tree.
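The construction procedure of steps 310 to 330 might be sketched in Python as follows. The `Node` dataclass, the function names, and the rule that the last character of an entry becomes extended when it already has children are assumptions chosen to make the result independent of insertion order; the sample entries are illustrative.

```python
from dataclasses import dataclass, field

CONT, EXT, TERM = "continuation", "extended", "termination"


@dataclass
class Node:
    state: str = CONT                        # step 310: a new root starts as continuation
    children: dict = field(default_factory=dict)


def insert_entry(roots, entry):
    """Insert one entry character string into the tree rooted at `roots`."""
    if not entry:
        return
    level, node = roots, None
    for ch in entry:
        if ch not in level:                  # step 320: does the child node exist?
            if node is not None and node.state == TERM:
                node.state = EXT             # termination node gained a successor
            level[ch] = Node()               # step 330: insert the missing node
        node = level[ch]
        level = node.children
    # The last character ends an entry: extended if longer entries pass through it.
    node.state = EXT if node.children else TERM


def build(entries):
    roots = {}
    for entry in entries:                    # repeat until all entries are inserted
        insert_entry(roots, entry)
    return roots


roots = build(["南山", "南山区", "广东省"])   # illustrative entries only
shan = roots["南"].children["山"]
print(roots["南"].state, shan.state, shan.children["区"].state)
```

Inserting "南山" first leaves its last node in the termination state, and the later insertion of "南山区" changes that node to the extended state, matching the optional rule described above.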

On the basis of the above embodiment, constructing the word segmentation dictionary tree merges the common prefixes of the entry character strings, which saves storage space, reduces the time needed to look up subsequent characters, lowers the complexity of character string word segmentation, and saves word segmentation time.

Example III

Fig. 4 is a schematic structural diagram of a character string word segmentation device according to a third embodiment of the present application. The device can execute the character string word segmentation method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

As shown in fig. 4, a character string word segmentation apparatus, the apparatus includes:

a first node searching module 410, configured to read a first character from a target character string, and find by hash lookup, in a pre-constructed word segmentation dictionary tree, a first root node corresponding to the first character of the target character string;

a first character state obtaining module 420, configured to sequentially read subsequent characters from the target character string, and obtain character states of sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the first root node;

a first word segmentation result obtaining module 430, configured to judge whether to split the target character string according to the character state, so as to obtain the word segmentation result of the target character string.

On the basis of the above technical solutions, optionally, the apparatus includes a dictionary tree construction module, where the dictionary tree construction module includes:

the first character reading unit is used for reading the first character from the entry character string and used as a root node of the word segmentation dictionary tree;

the child node judging unit is used for sequentially reading subsequent characters from the entry character string and sequentially judging whether child nodes corresponding to the subsequent characters exist in the word segmentation dictionary tree or not;

a character state determining unit, configured to, if the child node judging unit judges that no such child node exists, sequentially insert the characters in the entry character string into the word segmentation dictionary tree as child nodes and determine the character state of the inserted characters.

On the basis of the above technical solutions, optionally, the character state includes:

a continuation state, an extension state, and a termination state.

Based on the above technical solutions, optionally, the first word segmentation result obtaining module 430 includes:

a character splitting unit, configured to, if the character state is the termination state, split off the characters from the first character of the target character string through the character in the termination state, so as to form a character splitting result;

a splitting judging unit, configured to judge whether the splitting of the target character string is finished;

and a word segmentation result determining unit, configured to determine the character splitting result as the word segmentation result if the splitting judging unit judges that the splitting is finished.

On the basis of the above technical solutions, optionally, the apparatus further includes:

a second node searching module, configured to, after the splitting judging unit judges that the splitting is not finished, read the first character from the split target character string, and find by hash lookup, in the pre-constructed word segmentation dictionary tree, a second root node corresponding to the first character of the split target character string;

the second character state acquisition module is used for sequentially reading subsequent characters from the split target character string and acquiring character states of the sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the second root node;

and the second word segmentation result acquisition module is used for judging whether to split the split target character string according to the character state so as to acquire the word segmentation result of the split target character string.

Example IV

Fig. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention. As shown in fig. 5, the apparatus includes a processor 50, a memory 51, an input device 52 and an output device 53. The number of processors 50 in the apparatus may be one or more; one processor 50 is taken as an example in fig. 5. The processor 50, the memory 51, the input device 52 and the output device 53 in the apparatus may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 5.

The memory 51 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the word segmentation method of a character string in the embodiment of the present invention. The processor 50 executes various functional applications of the apparatus and data processing, i.e., implements the above-described character string word segmentation method, by running software programs, instructions, and modules stored in the memory 51.

The memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 51 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 51 may further include memory located remotely from processor 50, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Example five

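A fifth embodiment of the present invention also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a character string word segmentation method, the method comprising:

reading a first character from a target character string, and finding by hash lookup, in a pre-constructed word segmentation dictionary tree, a first root node corresponding to the first character of the target character string;

sequentially reading subsequent characters from the target character string, and acquiring character states of the child nodes corresponding to the subsequent characters according to the child nodes associated with the first root node;

and judging whether to split the target character string according to the character state so as to obtain a word segmentation result of the target character string.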

Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform the related operations in the character string word segmentation method provided in any embodiment of the present invention.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.

It should be noted that, in the embodiment of the character string word segmentation device, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Note that the above describes only preferred embodiments of the present invention and the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made by those skilled in the art without departing from the scope of protection of the invention. Therefore, while the invention has been described in connection with the above embodiments, it is not limited to those embodiments and may be embodied in many other equivalent forms without departing from its spirit; its scope is set forth in the following claims.

Claims

1. A method for word segmentation of a character string, comprising:

sequentially reading subsequent characters from the target character string, and acquiring character states of the sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the first root node; wherein the character state includes an extended state; the extended state is a character state in which the character is the last character of an entry, but the entry can also serve as a prefix of a longer word;

judging whether to split the target character string according to the character state so as to obtain a word segmentation result of the target character string;

if the character state is the extended state, marking the character in the target character string;

if the subsequent character is not found among the child nodes, judging whether the previous state of the character is the extended state;

if yes, judging whether the splitting of the target character string is finished, and determining the word segmentation result according to the result of that judgment;

if not, indexing to the previous mark, acquiring the character at the mark in the target character string, and re-reading the target character string with the character at the mark as the starting point.

2. The method of claim 1, wherein the process of constructing the word segmentation dictionary tree comprises:

3. The method according to claim 1 or 2, wherein the character state further comprises:

a continuation state and a termination state.

4. The method of claim 3, wherein judging whether to split the target character string according to the character state so as to obtain the word segmentation result of the target character string comprises:

5. The method of claim 4, further comprising, after judging whether the splitting of the target character string is finished:

6. A character string word segmentation apparatus, comprising:

the first character state acquisition module is used for sequentially reading subsequent characters from the target character string and acquiring character states of the sub-nodes corresponding to the subsequent characters according to the sub-nodes associated with the first root node; wherein the character state includes an extended state; the extended state is a character state in which the character is the last character of an entry, but the entry can also serve as a prefix of a longer word;

the first word segmentation result acquisition module is used for judging whether to split the target character string according to the character state so as to acquire a word segmentation result of the target character string;

the first word segmentation result acquisition module is further used for marking the character in the target character string if the character state is the extended state;

7. The apparatus of claim 6, wherein the dictionary tree construction module comprises:

8. The apparatus of claim 6 or 7, wherein the character state further comprises:

a continuation state and a termination state.

9. An apparatus, the apparatus comprising:

one or more processors;

storage means for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the character string word segmentation method as recited in any one of claims 1-5.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the character string word segmentation method according to any one of claims 1-5.