CN111967257A - Word segmentation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111967257A
Authority
CN
China
Prior art keywords
word segmentation
segmentation result
lemma
target
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010652918.XA
Other languages
Chinese (zh)
Inventor
周苏建
周效军
周冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Abstract

Embodiments of the invention provide a word segmentation method, a word segmentation device, electronic equipment and a storage medium. During word segmentation back-filling of a first word segmentation result of a search sentence, a target lemma to be re-segmented is obtained from the first word segmentation result and is re-segmented to obtain a second word segmentation result. A final word segmentation result for the search sentence is then determined according to the first word segmentation result and the second word segmentation result. Because the second word segmentation result contains the lemmas obtained by re-segmenting the target lemma, the query stage can also search on those lemmas, which increases the probability of finding matching data and brings the search results closer to what the user expects.

Description

Word segmentation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a word segmentation method, device, electronic device, and storage medium.
Background
A word segmenter splits a sentence into individual lemmas, each of which consists of one or more characters. Word segmenters are used in search scenarios: when a search engine receives a search sentence, it obtains the lemmas produced by segmenting the search sentence and then searches according to those lemmas for information related to the search sentence (e.g., videos, pictures, articles).
The existing Chinese word segmenter provides five segmenters based on different segmentation principles: the BaseAnalysis basic segmenter, the DicAnalysis user-defined dictionary segmenter, the IndexAnalysis index segmenter, the ToAnalysis standard segmenter and the NlpAnalysis natural language segmenter. To balance performance and effect, a production environment generally uses IndexAnalysis in the index stage (the stage in which an index is created for the data) and ToAnalysis in the query stage (the stage in which a search is performed according to a search sentence).
However, for keywords such as idioms and lines of verse, the query stage recognizes a whole multi-character idiom or verse as a single lemma. (The existing query stage performs coarse segmentation by shortest path, plans the optimal path with a hidden Markov model and the Viterbi algorithm, and takes the farthest leaf node as the lemma.) The resulting lemma is easily too long, so no matching content can be found. For example, suppose there is a resource named "Dilongtianjie". The query stage segments the search term "Dilongjie" as a single lemma, but the index created for the data contains no lemma "Dilongjie", so the data that almost completely matches the search term "Dilongjie" cannot be found.
Thus, the word segmentation result obtained for a search sentence in the conventional query stage may contain lemmas for which no matching data can be found, so the search results do not meet the user's expectations.
Disclosure of Invention
Embodiments of the invention provide a word segmentation method, a word segmentation device, electronic equipment and a storage medium, to solve the problem in the prior art that the word segmentation result obtained for a search sentence in the query stage contains lemmas for which no matching data can be found, so that the search results do not meet the user's expectations.
In view of the above technical problems, in a first aspect, an embodiment of the present invention provides a word segmentation method, including:
acquiring a target word element from a first word segmentation result of a search statement;
segmenting the target word elements to obtain a second segmentation result;
and determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
Optionally, the segmenting the target lemma and obtaining a second segmentation result includes:
taking a branch in a dictionary tree with each character in the target lemma as a root node as a target branch, and acquiring the second word segmentation result from the lemma at each node of the target branch;
wherein the dictionary tree includes branches created with each character of the search sentence as a root node.
Optionally, the obtaining the second word segmentation result from the lemma at each node of the target branch includes:
when the target lemma is segmented for the first time, acquiring a first branch from the target branches and taking the lemma at a first node of the first branch as the first lemma of this segmentation, the first branch being the branch whose root node is the first character of the target lemma;
when the target lemma is not being segmented for the first time, acquiring a second branch from the target branches and taking the lemma at a second node of the second branch as a second lemma of this segmentation, until the first character after the previous lemma is no longer in the target lemma, the second branch being the branch whose root node is the first character after the previous lemma;
and taking the first lemma and the second lemma as the second word segmentation result.
Optionally, the first node is adjacent to the node at which the target lemma is located and is closer to the root node of the first branch than that node;
the second node is the node farthest from the root node of the second branch among the nodes whose lemma consists only of characters contained in the target lemma.
Optionally, the obtaining a target lemma from a first segmentation result of the search statement includes:
if the number of lemmas in the first word segmentation result is less than or equal to a preset number, and/or a lemma longer than a preset length exists in the first word segmentation result, acquiring the lemmas longer than the preset length as the target lemmas.
Optionally, the determining a final segmentation result according to the first segmentation result and the second segmentation result includes:
if the lemmas in the second word segmentation result together contain every character of the target lemma, taking the second word segmentation result, together with the first word segmentation result from which the target lemma has been deleted, as the final word segmentation result.
Optionally, before the obtaining the target lemma from the first segmentation result of the search statement, the method further includes:
acquiring an original word segmentation result of the search sentence, and cleaning the original word segmentation result according to the part of speech of each lemma in it and/or the character type of the characters contained in each lemma, to obtain the first word segmentation result.
In a second aspect, an embodiment of the present invention provides a word segmentation apparatus, including:
the acquisition module is used for acquiring a target word element from a first word segmentation result of the search statement;
the word segmentation module is used for segmenting the target word elements to obtain a second word segmentation result;
and the determining module is used for determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the word segmentation method described above when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the word segmentation method described in any one of the above.
With the word segmentation method, word segmentation device, electronic equipment and storage medium provided by the embodiments of the invention, during word segmentation back-filling of a first word segmentation result of a search sentence, a target lemma to be re-segmented is obtained from the first word segmentation result and is re-segmented to obtain a second word segmentation result. A final word segmentation result for the search sentence is then determined according to the first word segmentation result and the second word segmentation result. Because the second word segmentation result contains the lemmas obtained by re-segmenting the target lemma, the query stage can also search on those lemmas, which increases the probability of finding matching data and brings the search results closer to what the user expects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a word segmentation method provided by an embodiment of the invention;
FIG. 2 is a block diagram of a word segmentation apparatus according to another embodiment of the present invention;
FIG. 3 is a schematic physical structure diagram of an electronic device according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a word segmentation method, which is used for splitting a sentence into word elements and searching data related to the sentence according to the split word elements. The word segmentation method can be executed by any device, and particularly can be executed by a word segmentation device in a certain application program of the device. For example, after receiving a search sentence, a search engine of the device performs word segmentation on the search sentence through a word segmentation device to obtain a word segmentation result, and the search engine queries data related to the search sentence according to the word segmentation result.
Fig. 1 is a schematic flow chart of the word segmentation method provided in this embodiment, and referring to fig. 1, the method includes:
Step 101: acquire a target lemma from the first word segmentation result of the search sentence.
The first word segmentation result comprises at least one lemma obtained by segmenting the search sentence. A lemma in the first word segmentation result may be too long, in which case data meeting the user's expectations cannot be found from the first word segmentation result alone. In the method provided in this embodiment, the target lemma in the first word segmentation result is therefore re-segmented (the word segmentation back-filling process) to improve the probability of finding the expected data. Specifically, word segmentation back-filling re-segments certain lemmas in the first word segmentation result and obtains the final word segmentation result from the newly obtained lemmas and the first word segmentation result.
Step 102: segment the target lemma to obtain a second word segmentation result.
The second word segmentation result is the result obtained by segmenting the target lemma. For example, if the target lemma obtained from the first word segmentation result is the idiom "long before long", re-segmenting it yields a second word segmentation result made up of the shorter lemmas contained in that idiom.
Step 103: determine a final word segmentation result according to the first word segmentation result and the second word segmentation result.
The final word segmentation result is determined from both results. For example, it may be the union of the first word segmentation result and the second word segmentation result, or the union of the second word segmentation result and the first word segmentation result with the target lemma removed; this embodiment does not limit the choice.
In the word segmentation method of this embodiment, during word segmentation back-filling of the first word segmentation result of a search sentence, a target lemma to be re-segmented is obtained from the first word segmentation result and is re-segmented to obtain a second word segmentation result. A final word segmentation result for the search sentence is then determined according to the first word segmentation result and the second word segmentation result. Because the second word segmentation result contains the lemmas obtained by re-segmenting the target lemma, the query stage can also search on those lemmas, which increases the probability of finding matching data and brings the search results closer to what the user expects.
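Steps 101 to 103 can be sketched as a small pipeline. The following is a minimal illustration, not the patented implementation: the class name `BackfillPipeline`, the `isTarget` predicate, the `resegment` function and the `keepTarget` flag are all hypothetical names introduced here, and the Latin-letter strings stand in for Chinese lemmas.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

class BackfillPipeline {
    // firstResult: lemmas of the first word segmentation result.
    // isTarget:    hypothetical rule deciding which lemmas are re-segmented (step 101).
    // resegment:   hypothetical re-segmentation of one target lemma (step 102).
    // keepTarget:  true  -> union of first and second results;
    //              false -> second result plus first result without the target (step 103).
    static Set<String> segment(List<String> firstResult,
                               Predicate<String> isTarget,
                               Function<String, List<String>> resegment,
                               boolean keepTarget) {
        Set<String> finalResult = new LinkedHashSet<>(); // a set prevents duplicate lemmas
        for (String lemma : firstResult) {
            if (!isTarget.test(lemma)) {
                finalResult.add(lemma);
                continue;
            }
            if (keepTarget) {
                finalResult.add(lemma);
            }
            finalResult.addAll(resegment.apply(lemma));
        }
        return finalResult;
    }
}
```

Passing `keepTarget = true` gives the plain union mentioned for step 103; `false` gives the variant in which the target lemma is dropped.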
Regarding to performing re-segmentation on the target lemma, further, on the basis of the foregoing embodiments, the obtaining a second segmentation result by segmenting the target lemma includes:
taking a branch in a dictionary tree with each character in the target lemma as a root node as a target branch, and acquiring the second word segmentation result from the lemma at each node of the target branch;
wherein the dictionary tree includes branches created with each character of the search sentence as a root node.
The search sentence is segmented using a dictionary tree, which includes a branch rooted at each character of the search sentence. For example, the dictionary tree built for the search sentence "Gunn village folk story" includes the following branches. In the branch whose root node is the character "Gunn", the lemmas at the nodes are, in order, "Gunn" and "Gunn village". In the branch whose root node is the character "village", the lemmas at the nodes are, in order, "village" and "villager". In the branch whose root node is the character "civil", the lemmas at the nodes are, in order, "civil", "folk" and "folk story". In the branch whose root node is the character "between", the only lemma is "between". In the branch whose root node is the character "event", the lemmas at the nodes are, in order, "event" and "story".
As can be seen, the dictionary tree built from the search sentence in this embodiment contains the lemmas into which the characters can reasonably be grouped, including the lemmas needed to re-segment the target lemma. The nodes of the dictionary tree therefore allow the target lemma to be segmented quickly, which improves the efficiency of re-segmentation.
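The branch structure described above can be sketched with a small trie. This is an illustrative sketch only: `BranchTable`, `Node`, `buildTrie` and `branches` are hypothetical names, the dictionary contents are made up, and Latin letters stand in for Chinese characters (the sentence "ABCDEF" mirrors the six-character example, with "CDEF" playing the role of "folk story").

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BranchTable {
    // Minimal trie node: children keyed by character; isWord marks the end of a dictionary word.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    static Node buildTrie(List<String> dictionary) {
        Node root = new Node();
        for (String word : dictionary) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.isWord = true;
        }
        return root;
    }

    // For every offset in the sentence, collect the lemmas on the branch rooted at that
    // character, ordered from the root (the single character) outward.
    static Map<Integer, List<String>> branches(String sentence, Node trie) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (int start = 0; start < sentence.length(); start++) {
            List<String> lemmas = new ArrayList<>();
            lemmas.add(sentence.substring(start, start + 1)); // root node: the character itself
            Node cur = trie;
            for (int end = start; end < sentence.length(); end++) {
                cur = cur.children.get(sentence.charAt(end));
                if (cur == null) {
                    break; // no dictionary word continues along this path
                }
                if (cur.isWord && end > start) {
                    lemmas.add(sentence.substring(start, end + 1));
                }
            }
            result.put(start, lemmas);
        }
        return result;
    }
}
```

With the hypothetical dictionary {"AB", "BC", "CD", "CDEF", "EF"}, the branch at offset 2 holds "C", "CD" and "CDEF", and the branch at offset 3 holds only "D", matching the shape of the "civil" and "between" branches in the example.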
For a specific implementation process of determining a second word segmentation result according to a dictionary tree, further, on the basis of the foregoing embodiments, the obtaining the second word segmentation result from the lemma at each node of the target branch includes:
when the target lemma is segmented for the first time, acquiring a first branch from the target branches and taking the lemma at a first node of the first branch as the first lemma of this segmentation, the first branch being the branch whose root node is the first character of the target lemma;
when the target lemma is not being segmented for the first time, acquiring a second branch from the target branches and taking the lemma at a second node of the second branch as a second lemma of this segmentation, until the first character after the previous lemma is no longer in the target lemma, the second branch being the branch whose root node is the first character after the previous lemma;
and taking the first lemma and the second lemma as the second word segmentation result.
It can be understood that the first lemma and the second lemma that are divided may be lemmas that cannot be re-divided, or lemmas that can be further re-divided, which is not specifically limited in this embodiment.
The first lemma is the lemma obtained the first time the target lemma is segmented, and is determined from the lemmas at the nodes of the branch whose root node is the first character of the target lemma. For example, for the target lemma "folk story", the branch whose root node is the character "civil" has the lemmas "civil", "folk" and "folk story" at its nodes, in that order. The first lemma may be determined from the nodes corresponding to "folk" and "folk story".
The second lemma is a lemma obtained when the target lemma is segmented on any pass other than the first. For example, if the first lemma obtained from "folk story" is "folk", the second branch is the branch whose root node is the character "event". In that branch the lemmas at the nodes are, in order, "event" and "story", so the second lemma of this pass can be determined from the node corresponding to "story".
In this embodiment, the first lemma and the second lemmas are determined in a loop over the branches rooted at the characters of the target lemma. On every pass after the first, the second branch is chosen according to the lemma determined on the previous pass: its root node is the first character after the previous lemma. This prevents the resulting lemmas from overlapping in characters and ensures that the target lemma is segmented accurately.
Further, on the basis of the foregoing embodiments, the first node is adjacent to the node at which the target lemma is located and is closer to the root node of the first branch than that node;
the second node is the node farthest from the root node of the second branch among the nodes whose lemma consists only of characters contained in the target lemma.
For example, for the target lemma "folk story", the first lemma is selected from the first branch ("civil", "folk", "folk story"): it is the lemma at the first node, i.e. "folk", which is adjacent to the node of the target lemma "folk story" and closer to the root node. On the second pass, in the second branch ("event", "story"), the lemma at the node farthest from the root node, i.e. "story", is taken as the second lemma. Since no character of the target lemma "folk story" follows "story", segmentation of the target lemma ends, and the second word segmentation result is "folk" and "story".
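The selection of the first and second nodes can be sketched as follows. This is a simplified sketch under stated assumptions: the dictionary is a plain set rather than a dictionary tree, the single character (the root node) is used as a fallback when no longer dictionary word applies, and `ReSegmenter` and `resegment` are hypothetical names. Latin letters again stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ReSegmenter {
    // Re-segments one target lemma.
    // First pass: choose the longest dictionary word that is a proper prefix of the target
    // (the node adjacent to the target's node and closer to the root).
    // Later passes: choose the longest dictionary word that starts right after the previous
    // lemma and stays inside the target (the node farthest from the root).
    static List<String> resegment(String target, Set<String> dictionary) {
        List<String> lemmas = new ArrayList<>();
        int pos = 0;
        while (pos < target.length()) {
            // On the first pass the whole target itself is excluded.
            int maxEnd = (pos == 0) ? target.length() - 1 : target.length();
            String best = target.substring(pos, pos + 1); // fallback: the root character
            for (int end = pos + 2; end <= maxEnd; end++) {
                String candidate = target.substring(pos, end);
                if (dictionary.contains(candidate)) {
                    best = candidate; // later hits are longer, i.e. farther from the root
                }
            }
            lemmas.add(best);
            pos += best.length(); // next root: first character after the previous lemma
        }
        return lemmas;
    }
}
```

For the hypothetical target "CDEF" with dictionary words "CD", "CDEF" and "EF", the first pass picks "CD" and the second pass picks "EF", the same shape as "folk story" yielding "folk" and "story".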
As another example, Table 1 lists the lemmas at each node of each branch created for the target lemma "willow-dim". Referring to Table 1, the algorithm for re-segmenting the target lemma "willow-dim" includes the following:
// Obtain the lemmas in the user dictionary from the user dictionary tree
// according to the query string, recording the following information:
String temp = null;   // lemma returned on each pass
int offo = 0;         // offset of the current lemma
int lastOffo = 0;     // start offset of the last accepted lemma
int nextOffo = 0;     // start offset of the next lemma
// Loop over the lemmas (temp) that the dictionary tree contains
while (...) {
    offo = word.offo;
    // First condition: controls the tail element of the queue
    // Second condition: ensures that lemmas do not cross
    if (offo + temp.length() <= nextOffo || (lastOffo != offo && offo < nextOffo)) {
        continue;
    }
    lastOffo = offo;
    nextOffo = offo + temp.length();
    map.put(String.valueOf(offo), temp);
}
Table 1. Lemmas at each node of each branch created for the target lemma "willow-dim"
[Table 1 appears as an image in the original publication: Figure BDA0002575649110000081]
In the process of re-segmenting "willow-dim", the character following the previous lemma is used as the root node of the second branch determined on the current pass (this is what the check offo + temp.length() <= nextOffo, together with the no-crossing check, enforces). In this way, words whose meaning differs from that of "willow-dim" are filtered out. Specifically, as shown in Table 1, the candidate lemmas include "dark willow", "dark flower" and "bright flower". "dark willow" has offo 0 and length 2, so nextOffo becomes 2; "dark flower" has offo 1 and length 2, starts before nextOffo, and is therefore filtered out.
Specifically, the back-filled lemmas may replace the original lemma when together they match it completely. When the lemmas are returned from the dictionary by the Viterbi algorithm, each carries its subscript; if comparing and adding the subscripts shows that the lemmas together cover exactly the length of the original lemma, the original lemma is considered replaceable. The original lemma is then removed, which prevents the mm parameter matching rule from producing query results that do not meet expectations.
According to the embodiment, the lemma for performing word segmentation on a certain target lemma each time can be quickly determined through the nodes in the dictionary tree and the determination of the first node and the second node in the branch, so that word segmentation on the target lemma is realized.
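The loop condition from the algorithm above can be exercised on its own. The sketch below assumes the candidates arrive ordered by start offset and uses an integer-keyed `TreeMap` in place of the string-keyed map; the `OverlapFilter` and `Word` classes are hypothetical, while `offo`, `lastOffo` and `nextOffo` follow the names used in the algorithm.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OverlapFilter {
    // A candidate lemma and its start offset within the target lemma.
    static final class Word {
        final String text;
        final int offo;
        Word(String text, int offo) {
            this.text = text;
            this.offo = offo;
        }
    }

    // Keeps, per start offset, the last accepted candidate and drops candidates that
    // end inside the covered span or that cross the previously accepted lemma.
    static Map<Integer, String> filter(List<Word> candidates) {
        Map<Integer, String> kept = new TreeMap<>();
        int lastOffo = 0; // start offset of the last accepted lemma
        int nextOffo = 0; // end offset (exclusive) of the last accepted lemma
        for (Word word : candidates) {
            String temp = word.text;
            int offo = word.offo;
            if (offo + temp.length() <= nextOffo || (lastOffo != offo && offo < nextOffo)) {
                continue;
            }
            lastOffo = offo;
            nextOffo = offo + temp.length();
            kept.put(offo, temp); // a longer lemma at the same offset overwrites a shorter one
        }
        return kept;
    }
}
```

With three hypothetical two-character candidates at offsets 0, 1 and 2 ("AB", "BC", "CD"), the middle one crosses the first and is dropped, which mirrors how "dark flower" is filtered out in the example.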
Further, on the basis of the foregoing embodiments, the obtaining a target lemma from the first segmentation result of the search statement includes:
if the number of lemmas in the first word segmentation result is less than or equal to a preset number, and/or a lemma longer than a preset length exists in the first word segmentation result, acquiring the lemmas longer than the preset length as the target lemmas.
The lemma length is the number of characters in the lemma; for example, the lemma length of "longest" is 4. The preset number and the preset length are set in advance; for example, the preset number is 2 and the preset length is 4 characters.
If the number of lemmas in the first word segmentation result is less than or equal to the preset number, and/or a lemma longer than the preset length exists in the first word segmentation result, the first word segmentation result needs to be re-segmented. Each lemma longer than the preset length is then taken from the first word segmentation result as a target lemma and segmented to obtain a second word segmentation result.
For example, for the search sentence "Gunn village folk story", the word segmentation process comprises the following steps:
Step 1: determine whether the first word segmentation result of "Gunn village folk story" needs word segmentation back-filling;
Step 2: clean and filter the lemmas;
Step 3: perform secondary segmentation on the lemmas that meet the conditions, using the hidden Markov model and the Viterbi algorithm provided by ansj.
In step 1, "Gunn village folk story" is segmented into two lemmas, "Gunn village" and "folk story", which are stored in a Set to prevent duplication.
Since the first word segmentation result of the search sentence contains only two lemmas, "Gunn village" and "folk story", it needs word segmentation back-filling. The lemma "folk story", whose lemma length is greater than 4, is taken from the first word segmentation result and re-segmented.
In this embodiment, whether back-filling is needed is decided from the number of lemmas in the first word segmentation result and from their lemma lengths. The lemmas that need back-filling are screened out, and after back-filling, data that meets the user's expectations can be found with the search sentence.
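The screening rule can be written as a short function. This is a sketch only: the claim's "and/or" leaves the combination open, so the hypothetical `requireBoth` flag selects between the two readings; `TargetSelector` and `selectTargets` are likewise names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

class TargetSelector {
    // Returns the lemmas to re-segment, or an empty list when back-filling is not triggered.
    // requireBoth = false reads the rule as "or"; requireBoth = true reads it as "and".
    static List<String> selectTargets(List<String> firstResult, int presetNumber,
                                      int presetLength, boolean requireBoth) {
        boolean fewLemmas = firstResult.size() <= presetNumber;
        boolean hasLongLemma = false;
        for (String lemma : firstResult) {
            if (lemma.length() > presetLength) {
                hasLongLemma = true;
            }
        }
        boolean triggered = requireBoth ? (fewLemmas && hasLongLemma)
                                        : (fewLemmas || hasLongLemma);
        List<String> targets = new ArrayList<>();
        if (!triggered) {
            return targets;
        }
        for (String lemma : firstResult) {
            if (lemma.length() > presetLength) {
                targets.add(lemma); // only lemmas longer than the preset length are re-segmented
            }
        }
        return targets;
    }
}
```

Note that under the "or" reading the returned targets depend only on the length test, since a result with few lemmas but no long lemma yields no targets.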
Regarding the output of the final word segmentation result, further, on the basis of the foregoing embodiments, the determining a final word segmentation result according to the first word segmentation result and the second word segmentation result includes:
if the lemmas in the second word segmentation result together contain every character of the target lemma, taking the second word segmentation result, together with the first word segmentation result from which the target lemma has been deleted, as the final word segmentation result.
It will be understood that the second word segmentation result together with the first word segmentation result, without deleting the target lemma, may also be used as the final word segmentation result.
In this embodiment, all lemmas of the second and first word segmentation results may be output as the final word segmentation result (for step 3 above, the final word segmentation result would be "Gunn village", "folk story", "folk", "story"). Alternatively, the re-segmented target lemma may be deleted first and the remaining lemmas output as the final word segmentation result (for step 3 above, "Gunn village", "folk", "story").
This embodiment can thus output the word segmentation result in the form the user requires, which makes the output form of the word segmentation result more flexible.
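The replacement rule can be sketched as below. The coverage test used here, that the re-segmented lemmas concatenate exactly to the target, is an assumption chosen to satisfy both the "contains every character" wording and the subscript-sum check described earlier; `FinalResult` and `merge` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FinalResult {
    // Merges the first and second word segmentation results.
    // The target lemma is dropped only when the second result covers it completely.
    static List<String> merge(List<String> firstResult, String target, List<String> secondResult) {
        // Assumed coverage test: the new lemmas, joined in order, reproduce the target.
        boolean fullyCovered = String.join("", secondResult).equals(target);
        Set<String> merged = new LinkedHashSet<>(); // preserves order, removes duplicates
        for (String lemma : firstResult) {
            if (fullyCovered && lemma.equals(target)) {
                continue; // the back-filled lemmas replace the original lemma
            }
            merged.add(lemma);
        }
        merged.addAll(secondResult);
        return new ArrayList<>(merged);
    }
}
```

When the second result covers only part of the target, the target lemma stays in the output, so nothing that could still match is lost.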
Further, on the basis of the foregoing embodiments, before the obtaining the target lemma from the first segmentation result of the search statement, the method further includes:
acquiring an original word segmentation result obtained by segmenting the search sentence, and cleaning the original word segmentation result according to the part of speech of each lemma in it and/or the character type of the characters contained in each lemma, to obtain the first word segmentation result.
Specifically, cleaning according to part of speech includes removing lemmas tagged as person name (nr), place name (nrf), quantifier (mq) and the like, according to the part of speech of each lemma in the original word segmentation result.
Specifically, cleaning according to character type includes using a regular expression to remove lemmas made up of characters of a preset character type, so that combinations of Chinese characters or Arabic numerals such as "season 2" and "third set" are filtered out and do not affect the word segmentation process.
By cleaning the lemmas, this embodiment filters out specific characters and nouns so that they do not affect the decision on whether to re-segment, reduces the number of irrelevant lemmas, and improves the efficiency of re-segmentation.
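The cleaning step can be sketched as a filter over tagged lemmas. Everything concrete here is an assumption: the `Term` class is a hypothetical stand-in for whatever the segmenter returns, and the regular expression matches English placeholders such as "season 2" rather than the Chinese season or episode markers a real deployment would target. Only the tags nr, nrf and mq come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class LemmaCleaner {
    // Hypothetical lemma with its part-of-speech tag.
    static final class Term {
        final String text;
        final String pos;
        Term(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    // Part-of-speech tags removed during cleaning (taken from the description).
    static final Set<String> DROPPED_POS = Set.of("nr", "nrf", "mq");

    // Assumed placeholder pattern for season/episode markers.
    static final Pattern DROPPED_TEXT = Pattern.compile("(?i)(season|episode)\\s*\\d+");

    // Returns the text of the lemmas that survive both cleaning rules.
    static List<String> clean(List<Term> original) {
        List<String> cleaned = new ArrayList<>();
        for (Term term : original) {
            if (DROPPED_POS.contains(term.pos)) {
                continue; // removed by part of speech
            }
            if (DROPPED_TEXT.matcher(term.text).matches()) {
                continue; // removed by character-type pattern
            }
            cleaned.add(term.text);
        }
        return cleaned;
    }
}
```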
It should be noted that the word segmentation and anaplerosis are not to be considered as more words are better, and the effect is mainly considered. The word segmentation method provided by the application combines the leaf nodes obtained again according to the hidden Markov model and the viterbi algorithm as the back-filling result, can ensure the accuracy and optimality of the word segmentation, and removes the completely back-filled lemmas in the original lemmas, so that the word segmentation effect is good, and the test in the production environment at present has good reverberation. Specifically, the embodiment is based on ANSJ optimization, and can perform secondary complementation on the longest token obtained according to the hidden markov model and the viterbi algorithm, so that the resource of "longest lasting and longest lasting" does not exist, but the resource of "longest lasting and longest lasting" and "longest lasting prefecture XX" related to "longest lasting and longest lasting" can be satisfied; compared with an IK word segmentation device, the method has the advantages that part-of-speech analysis is added, secondary complementation is performed on the result after intelligent word segmentation based on ansj, and the accuracy is much stronger than that of the IK word segmentation device.
In addition, the word segmentation method provided by the application has been verified experimentally. Tests were run on 3.5 million records of actual data from the video database, and 8,900 test results were optimized, indicating that the word segmentation result is accurate and the recall effect is good.
Fig. 2 is a block diagram of the structure of the word segmentation apparatus provided in this embodiment. Referring to Fig. 2, the word segmentation apparatus includes an obtaining module 201, a word segmentation module 202, and a determining module 203, wherein:
the obtaining module 201 is configured to obtain a target lemma from a first word segmentation result of a search sentence;
the word segmentation module 202 is configured to segment the target lemma to obtain a second word segmentation result;
and the determining module 203 is configured to determine a final word segmentation result according to the first word segmentation result and the second word segmentation result.
The word segmentation device provided in this embodiment is suitable for the word segmentation methods provided in the above embodiments, and will not be described herein again.
The embodiment of the invention provides a word segmentation device which, in the process of segmenting and back-filling the first word segmentation result of a search sentence, acquires from the first word segmentation result a target lemma to be re-segmented, and segments the target lemma again to obtain a second word segmentation result. A final word segmentation result for the search sentence is then determined according to the first word segmentation result and the second word segmentation result. Because the second word segmentation result contains the lemmas obtained by re-segmenting the target lemma, the query stage can also search according to these lemmas, which increases the probability of finding matching data and improves how well the search results match the user's expectation.
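As a rough illustration of how the three modules fit together, the sketch below wires three pluggable callables into one device. The class name `WordSegmentationDevice`, the callable signatures, and the simplification to a single target lemma per run are assumptions made for this example and are not taken from the patent.

```python
from typing import Callable, List, Optional


class WordSegmentationDevice:
    """Wires together stand-ins for modules 201, 202 and 203."""

    def __init__(
        self,
        obtain: Callable[[List[str]], Optional[str]],
        segment: Callable[[str], List[str]],
        determine: Callable[[List[str], str, List[str]], List[str]],
    ) -> None:
        self.obtain = obtain        # obtaining module 201
        self.segment = segment      # word segmentation module 202
        self.determine = determine  # determining module 203

    def run(self, first_result: List[str]) -> List[str]:
        target = self.obtain(first_result)
        if target is None:
            # Nothing to re-segment: keep the first result unchanged.
            return list(first_result)
        second_result = self.segment(target)
        return self.determine(first_result, target, second_result)
```

Keeping the modules as injected callables mirrors the description's separation of obtaining, re-segmenting, and determining, and lets each step be replaced independently.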
Optionally, the segmenting the target lemma and obtaining a second segmentation result includes:
taking a branch in a dictionary tree with each character in the target lemma as a root node as a target branch, and acquiring the second word segmentation result from the lemma at each node of the target branch;
wherein the dictionary tree includes branches created with each character of the search sentence as a root node.
Optionally, the obtaining the second word segmentation result from the lemma at each node of the target branch includes:
when the target lemma is segmented for the first time, acquiring a first branch from the target branch, and taking the lemma at a first node in the first branch as the first lemma of the first segmentation; the first branch is the branch taking the first character of the target lemma as its root node;
when the target lemma is not segmented for the first time, acquiring a second branch from the target branch, and taking the lemma at a second node in the second branch as a second lemma of this segmentation, until the first character after the previous lemma is not in the target lemma; the second branch is the branch taking the first character after the previous lemma as its root node;
and taking the first lemma and the second lemma as the second word segmentation result.
Optionally, the first node is adjacent to the node where the target lemma is located, and is closer to the root node of the first branch than the node where the target lemma is located;
each character in the lemma at the second node is contained in the target lemma and is farthest from the root node of the second branch.
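The dictionary-tree re-segmentation described above can be sketched as below, under several stated assumptions. The trie is built from a plain word list rather than from the search sentence; the "first node" is read as the longest dictionary word that is a proper prefix of the target lemma; the "second node" is read as the longest dictionary word that starts right after the previous lemma and stays inside the target lemma; and a single character is emitted when no dictionary word matches. The names `END`, `build_trie`, `longest_word_from`, and `resegment` are hypothetical.

```python
END = "$end"  # hypothetical marker for "a dictionary word ends at this node"


def build_trie(words):
    """Build a nested-dict trie; each word's first character roots a branch."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_word_from(trie, text, start, max_len):
    """Longest dictionary word text[start:start+k] with k <= max_len.

    Falls back to a single character when no word matches (assumption).
    """
    node = trie
    best = 0
    for offset in range(max_len):
        ch = text[start + offset]
        if ch not in node:
            break
        node = node[ch]
        if END in node:
            best = offset + 1
    return text[start:start + max(best, 1)]


def resegment(trie, target):
    """Split the target lemma into shorter dictionary lemmas."""
    result = []
    pos = 0
    while pos < len(target):
        remaining = len(target) - pos
        # On the first pass, exclude the target lemma itself (proper prefix only).
        limit = max(remaining - 1, 1) if pos == 0 else remaining
        lemma = longest_word_from(trie, target, pos, limit)
        result.append(lemma)
        pos += len(lemma)
    return result
```

The loop stops once the position after the previous lemma falls outside the target lemma, which corresponds to the stopping condition stated in the text.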
Optionally, the obtaining a target lemma from a first segmentation result of the search statement includes:
and if the number of the lemmas in the first word segmentation result is less than or equal to a preset number and/or the lemmas with the length larger than the preset length exist in the first word segmentation result, acquiring the lemmas with the length larger than the preset length as the target lemmas.
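The selection condition above can be illustrated with a short sketch. The threshold values, the function name `select_target_lemmas`, and the `require_both` switch (which chooses between the "and" and "or" readings of the condition) are assumptions for illustration only.

```python
def select_target_lemmas(first_result, preset_number=2, preset_length=3,
                         require_both=True):
    """Return the lemmas of the first result that should be re-segmented."""
    long_lemmas = [lemma for lemma in first_result if len(lemma) > preset_length]
    few_lemmas = len(first_result) <= preset_number
    if require_both:
        triggered = few_lemmas and bool(long_lemmas)
    else:
        triggered = few_lemmas or bool(long_lemmas)
    # Only lemmas longer than the preset length are ever returned as targets.
    return long_lemmas if triggered else []
```

Under the "or" reading the lemma-count check has no effect on the output, since the targets are always the long lemmas; the "and" reading restricts re-segmentation to short first results.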
Optionally, the determining a final segmentation result according to the first segmentation result and the second segmentation result includes:
and if the lemma in the second word segmentation result contains each character of the target lemma, taking the second word segmentation result and the first word segmentation result with the target lemma deleted as the final word segmentation result.
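A minimal sketch of this merge step follows. The coverage test is simplified to a comparison of character sets, and the behaviour when coverage fails (keeping the target lemma and appending the new lemmas) is an assumption, since the text only specifies the covered case. The function name `determine_final_result` is hypothetical.

```python
def determine_final_result(first_result, target, second_result):
    """Merge the first and second segmentation results."""
    # Simplified coverage check: every character of the target appears
    # somewhere in the second word segmentation result.
    covered = set(target) <= set("".join(second_result))
    if covered:
        # The target lemma is fully back-filled, so it is deleted.
        kept = [lemma for lemma in first_result if lemma != target]
    else:
        # Assumption: keep the target when it is only partly covered.
        kept = list(first_result)
    return kept + [lemma for lemma in second_result if lemma not in kept]
```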
Optionally, before the obtaining the target lemma from the first segmentation result of the search statement, the method further includes:
and acquiring an original word segmentation result of the search sentence, and cleaning the original word segmentation result according to the part of speech of each lemma in the original word segmentation result and/or the character type of the characters contained in each lemma, to obtain the first word segmentation result.
Fig. 3 illustrates a physical structure diagram of an electronic device. As shown in Fig. 3, the electronic device may include a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with each other through the communication bus 304. The processor 301 may call logic instructions in the memory 303 to perform the following method: acquiring a target lemma from a first word segmentation result of a search sentence; segmenting the target lemma to obtain a second word segmentation result; and determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example: acquiring a target lemma from a first word segmentation result of a search sentence; segmenting the target lemma to obtain a second word segmentation result; and determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
In another aspect, an embodiment of the present invention further provides a non-transitory readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the word segmentation method provided in the foregoing embodiments, for example: acquiring a target lemma from a first word segmentation result of a search sentence; segmenting the target lemma to obtain a second word segmentation result; and determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of word segmentation, comprising:
acquiring a target lemma from a first word segmentation result of a search sentence;
segmenting the target lemma to obtain a second word segmentation result;
and determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
2. The word segmentation method according to claim 1, wherein the segmenting the target lemma and obtaining a second segmentation result comprises:
taking a branch in a dictionary tree with each character in the target lemma as a root node as a target branch, and acquiring the second word segmentation result from the lemma at each node of the target branch;
wherein the dictionary tree includes branches created with each character of the search sentence as a root node.
3. The word segmentation method according to claim 2, wherein the obtaining the second word segmentation result from the lemma at each node of the target branch comprises:
when the target lemma is segmented for the first time, acquiring a first branch from the target branch, and taking the lemma at a first node in the first branch as the first lemma of the first segmentation; the first branch is the branch taking the first character of the target lemma as its root node;
when the target lemma is not segmented for the first time, acquiring a second branch from the target branch, and taking the lemma at a second node in the second branch as a second lemma of this segmentation, until the first character after the previous lemma is not in the target lemma; the second branch is the branch taking the first character after the previous lemma as its root node;
and taking the first lemma and the second lemma as the second word segmentation result.
4. The word segmentation method according to claim 3,
the first node is adjacent to the node where the target lemma is located and is closer to the root node of the first branch than the node where the target lemma is located;
each character in the lemma at the second node is contained in the target lemma and is farthest from the root node of the second branch.
5. The word segmentation method according to claim 1, wherein the obtaining of the target lemma from the first word segmentation result of the search sentence comprises:
and if the number of the lemmas in the first word segmentation result is less than or equal to a preset number and/or the lemmas with the length larger than the preset length exist in the first word segmentation result, acquiring the lemmas with the length larger than the preset length as the target lemmas.
6. The word segmentation method according to claim 1, wherein the determining a final word segmentation result according to the first word segmentation result and the second word segmentation result comprises:
and if the lemma in the second word segmentation result contains each character of the target lemma, taking the second word segmentation result and the first word segmentation result with the target lemma deleted as the final word segmentation result.
7. The word segmentation method according to claim 1, wherein before the obtaining the target lemma from the first word segmentation result of the search sentence, the method further comprises:
and acquiring an original word segmentation result of the search sentence, and cleaning the original word segmentation result according to the part of speech of each lemma in the original word segmentation result and/or the character type of the characters contained in each lemma, to obtain the first word segmentation result.
8. A word segmentation device, comprising:
the obtaining module is used for acquiring a target lemma from a first word segmentation result of a search sentence;
the word segmentation module is used for segmenting the target lemma to obtain a second word segmentation result;
and the determining module is used for determining a final word segmentation result according to the first word segmentation result and the second word segmentation result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the word segmentation method as claimed in any one of claims 1 to 7 are implemented by the processor when executing the program.
10. A non-transitory readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the word segmentation method according to any one of claims 1 to 7.
CN202010652918.XA 2020-07-08 2020-07-08 Word segmentation method and device, electronic equipment and storage medium Pending CN111967257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010652918.XA CN111967257A (en) 2020-07-08 2020-07-08 Word segmentation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111967257A true CN111967257A (en) 2020-11-20

Family

ID=73361872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010652918.XA Pending CN111967257A (en) 2020-07-08 2020-07-08 Word segmentation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111967257A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800427A (en) * 2018-12-28 2019-05-24 北京金山安全软件有限公司 Word segmentation method, word segmentation device, word segmentation terminal and computer readable storage medium
CN110704719A (en) * 2019-09-29 2020-01-17 北京金堤科技有限公司 Enterprise search text word segmentation method and device
CN111126048A (en) * 2019-12-25 2020-05-08 腾讯科技(深圳)有限公司 Candidate synonym determination method, device, server and storage medium
CN111310450A (en) * 2020-03-23 2020-06-19 中国建设银行股份有限公司 Character string word segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination