CN109800427B - Word segmentation method, device, terminal and computer readable storage medium - Google Patents

Word segmentation method, device, terminal and computer readable storage medium

Info

Publication number
CN109800427B
Authority
CN
China
Prior art keywords
word
word segmentation
words
stock
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811619990.1A
Other languages
Chinese (zh)
Other versions
CN109800427A (en)
Inventor
许晏铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Internet Security Software Co Ltd
Original Assignee
Beijing Kingsoft Internet Security Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Internet Security Software Co Ltd
Priority to CN201811619990.1A
Publication of CN109800427A
Application granted
Publication of CN109800427B
Legal status: Active
Anticipated expiration

Abstract

Embodiments of the invention provide a word segmentation method, a word segmentation device, a terminal and a computer readable storage medium. The method comprises: determining text information to be segmented; segmenting the text information according to a preset matching algorithm, a word stock in a pre-constructed word segmentation model and a word index table corresponding to the word stock, wherein the words in the word stock are ordered according to the number of characters contained in each word, and the word index table is used for indexing the position, in the word stock, of the words of each character count; and obtaining a word segmentation result of the text information. In this way, while the text information is segmented with the preset matching algorithm, the word index table can be used to determine the position interval corresponding to the character count of the word to be queried, and the word is then searched for only within that interval. Traversal of the entire word stock is thereby avoided and the search time is shortened, so that the word segmentation speed is improved.

Description

Word segmentation method, device, terminal and computer readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a word segmentation method, a word segmentation device, a terminal, and a computer readable storage medium.
Background
Because Chinese is written as a continuous sequence of characters, there is usually no explicit delimiter between the words of a sentence. An electronic device therefore often needs to segment a Chinese character sequence into individual words so that it can understand, from the word segmentation result obtained, the semantics that the character sequence expresses.
Currently, commonly used word segmentation algorithms include dictionary-based word segmentation algorithms. Such an algorithm matches the Chinese character string to be segmented, according to a certain matching strategy, against the entries of a preset dictionary containing a large number of words. If a string is found in the dictionary, the match is successful, that is, a word is identified. Commonly used matching strategies include the forward maximum matching method and the bidirectional matching method.
In the process of realizing the invention, the inventor found that the word segmentation speed of the above word segmentation method is still relatively low and cannot meet the requirement of rapid word segmentation.
Disclosure of Invention
An object of an embodiment of the present invention is to provide a word segmentation method, a device, a terminal, and a computer readable storage medium, so as to increase a search speed of a word, thereby increasing a word segmentation speed. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a word segmentation method, where the method may include:
determining text information to be segmented;
word segmentation is carried out on the text information according to a preset matching algorithm, a word stock in a word segmentation model constructed in advance and a word index table corresponding to the word stock; wherein words in the word stock are ordered according to the number of characters contained in each word; the word index table is used for: indexing the position of words of each character number in a word stock;
and obtaining word segmentation results of the text information.
Optionally, in the embodiment of the present invention, the step of determining the text information to be segmented may include:
detecting whether the information in the input box is updated;
when the information in the input box is updated, obtaining the information in the input box as target information;
and selecting a preset number of characters from the target information in a right-to-left selection mode to obtain the text information to be segmented.
Optionally, in an embodiment of the present invention, the preset matching algorithm may include: inverse longest match algorithm.
Optionally, in the embodiment of the present invention, before the step of segmenting the text information according to the preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, the method may further include:
acquiring an original corpus and a word segmentation word stock;
segmenting the original corpus by using the word segmentation word stock to obtain a word segmentation result of the original corpus;
counting the word frequency of each word in the word segmentation word stock according to the word segmentation result of the original corpus;
training to obtain a word segmentation model according to the word segmentation result of the original corpus and the word frequencies obtained by statistics; the word segmentation model comprises a word stock, wherein each word is recorded in the word stock;
and sorting the words in the word stock according to the number of characters contained in each word.
Optionally, in an embodiment of the present invention, the word index table includes a first sub index table and a second sub index table;
accordingly, after the step of sorting the words in the word stock according to the number of characters contained in each word, the method may further include:
constructing a first sub-index table for recording the initial position information of words of each character number in a word stock;
a second sub-index table is constructed for recording the size of the memory space occupied by words of each character number.
Optionally, in an embodiment of the present invention, the word segmentation model may include: an N-tuple model.
Optionally, in the embodiment of the present invention, after the step of obtaining the word segmentation result of the text information, the method may further include:
Searching N-element relations from the N-element group model according to the word segmentation result and the word index table of the text information to obtain a plurality of predicted words corresponding to the word segmentation result of the text information;
and displaying a preset number of the predicted words in descending order of the occurrence probability of each predicted word.
Optionally, in the embodiment of the present invention, the step of displaying the preset number of predicted words according to the order of occurrence probability of each predicted word from high to low may include:
determining a pinyin character string input by a user;
determining predicted words meeting spelling rules of the Pinyin character strings in the predicted words as target predicted words;
and displaying each target predicted word on a recommended word display interface of the input method according to the sequence of the occurrence probability of each target predicted word from high to low.
In a second aspect, an embodiment of the present invention further provides a word segmentation apparatus, where the apparatus may include:
the first determining module is used for determining text information to be segmented;
the first word segmentation module is used for segmenting the text information according to a preset matching algorithm, a word stock in a pre-constructed word segmentation model and a word index table corresponding to the word stock; wherein words in the word stock are ordered according to the number of characters contained in each word; the word index table is used for: indexing the position of words of each character number in the word stock;
the obtaining module is used for obtaining the word segmentation result of the text information.
Optionally, in an embodiment of the present invention, the first determining module may include:
a detection unit for detecting whether the information in the input box is updated;
an obtaining unit configured to obtain information in the input frame as target information when the information in the input frame is updated;
the selecting unit is used for selecting a preset number of characters from the target information in a right-to-left selection mode to obtain the text information to be segmented.
Optionally, in an embodiment of the present invention, the preset matching algorithm may include: inverse longest match algorithm.
Optionally, in an embodiment of the present invention, the apparatus may further include:
the first acquisition module is used for acquiring an original corpus and a word segmentation word stock before the text information is segmented according to the preset matching algorithm, the word stock in the pre-constructed word segmentation model and the word index table corresponding to the word stock;
the second word segmentation module is used for segmenting the original corpus by utilizing the word segmentation word stock to obtain a word segmentation result of the original corpus;
the statistics module is used for counting the word frequency of each word in the word segmentation word stock according to the word segmentation result of the original corpus;
the training module is used for training to obtain a word segmentation model according to the word segmentation result of the original corpus and the word frequencies obtained through statistics; the word segmentation model comprises a word stock, wherein each word is recorded in the word stock;
and the ordering module is used for sorting the words in the word stock according to the number of characters contained in each word.
Optionally, in an embodiment of the present invention, the word index table includes a first sub index table and a second sub index table; the apparatus may further include:
the first construction module is used for constructing a first sub-index table for recording the initial position information of words with each character number in the word stock after sequencing each word in the word stock according to the character number contained in the word;
and the second construction module is used for constructing a second sub-index table for recording the storage space size occupied by the words of each character number.
Optionally, in an embodiment of the present invention, the word segmentation model may include: an N-tuple model.
Optionally, in an embodiment of the present invention, the apparatus may further include:
the searching module is used for searching N-element relations from the N-element group model according to the word segmentation result and the word index table of the text information after the word segmentation result of the text information is obtained, and obtaining a plurality of predicted words corresponding to the word segmentation result of the text information;
the display module is used for displaying a preset number of the predicted words in descending order of the occurrence probability of each predicted word.
Optionally, in an embodiment of the present invention, the display module may include:
a first determining unit for determining a pinyin character string input by a user;
a second determining unit, configured to determine, as a target predicted word, a predicted word that satisfies a spelling rule of the pinyin string among the predicted words;
and the display unit is used for displaying each target predicted word on a recommended word display interface of the input method according to the sequence from high to low of the occurrence probability of each target predicted word.
In a third aspect, an embodiment of the present invention further provides a terminal, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for implementing the method steps of any one of the word segmentation methods described above when executing the program stored in the memory.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having a computer program stored therein, the computer program, when executed by a processor, implementing the method steps of any one of the above-described word segmentation methods.
In a fifth aspect, embodiments of the present invention provide a computer program product, which when run on a terminal, causes the terminal to perform: the method steps of any of the word segmentation methods described above.
In the embodiment of the invention, the text information to be segmented can be determined. Then, the text information is segmented according to a preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, and the word segmentation result of the text information is obtained. The words in the word stock are ordered according to the number of characters contained in each word, that is, words with the same number of characters are gathered together, and the position in the word stock of the words of each character count can be indexed through the word index table. Therefore, while the text information is segmented with the preset matching algorithm, the word index table can be used to determine the position interval corresponding to the character count of the word to be queried, and the word is then searched for only within that interval. Traversal of the entire word stock is thereby avoided and the search time is shortened, so that the word segmentation speed is improved.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a word segmentation method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
In order to solve the technical problem of low word segmentation speed in the prior art, the embodiment of the invention provides a word segmentation method, a word segmentation device, a terminal and a computer readable storage medium.
Word segmentation refers to the process of segmenting a continuous character sequence into a sequence of words according to a certain specification. Specifically, Chinese word segmentation refers to the process of segmenting a Chinese character sequence into individual words according to a certain specification.
For example, the Chinese character sequence "I love China" can be split by Chinese word segmentation into "I", "love" and "China". The character sequence "I love China" can also be called text information, and "I", "love" and "China" obtained by word segmentation are three separate words.
The word segmentation method provided by the embodiment of the invention is first described below.
The word segmentation method provided by the embodiment of the invention is applied to terminals including but not limited to computers, mobile phones and intelligent watches.
Referring to fig. 1, the word segmentation method provided by the embodiment of the present invention may include the following steps:
S101: determining text information to be segmented;
It will be appreciated that in many text processing scenarios the terminal often needs to segment text information so as to understand, from the word segmentation result obtained, the semantics that the text information expresses. The text information may be Chinese text information or English text information, but is not limited thereto.
For example, in a scenario where a user inputs text using a Chinese input method, after the user inputs text information in an input box, the terminal often needs to segment the text information in the input box, so as to understand the semantics of the text information according to the word segmentation result obtained, and further predict the next word that the user wants to input.
In the embodiment of the present invention, the manner in which the terminal determines the text information to be segmented may specifically be:
The terminal detects whether the information in the input box has been updated. When the information in the input box is updated, the terminal may obtain the information in the input box as target information. Then, a preset number of characters can be selected from the target information in a right-to-left selection mode, so that the text information to be segmented is obtained.
Therefore, it is not necessary to use all the text information in the input box as the text information to be segmented, that is, not all the text information in the input box needs to be segmented, and the word segmentation speed is thereby improved.
In addition, when the terminal predicts the next word that the user wants to input, the rightmost words in the input box have the greatest influence on the prediction result, so the right-to-left selection mode can also improve the accuracy of the prediction result.
A person skilled in the art can set the value of the preset number according to the actual situation. For example, the preset number may be 20, that is, 20 characters.
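The right-to-left selection described above can be illustrated with a minimal Python sketch. The function name `select_text_to_segment` and its parameter names are hypothetical and do not appear in the embodiment; only the default of 20 characters is taken from the example above.

```python
def select_text_to_segment(target_info: str, preset_number: int = 20) -> str:
    """Return at most `preset_number` characters taken from the right end.

    `target_info` stands for the current content of the input box. Only its
    tail is kept, because the rightmost words matter most for prediction.
    """
    if preset_number <= 0:
        return ""
    # Negative slicing keeps the last `preset_number` characters; shorter
    # inputs are returned unchanged.
    return target_info[-preset_number:]


if __name__ == "__main__":
    print(select_text_to_segment("abcdefghijklmnopqrstuvwxyz", 20))
```

Because only a bounded tail of the input is passed on, the cost of each segmentation pass under this sketch does not grow with the total length of the text in the input box.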
S102: word segmentation is carried out on the text information according to a preset matching algorithm, a word stock in a word segmentation model constructed in advance and a word index table corresponding to the word stock; wherein words in the word stock are ordered according to the number of characters contained in each word; the word index table is used for: indexing the position of words of each character number in a word stock;
S103: obtaining a word segmentation result of the text information.
In the input method scenario, the unigram word stock in the word segmentation model stores all the unigrams in the word segmentation model loaded for the input method engine. The unigram word stock is stored in the memory of the terminal and may include, for example, 600,000 words.
Specifically, the unigram word stock may be constructed by segmenting a large amount of original corpus based on a word segmentation word stock and counting word frequencies. Segmenting the original corpus with the word segmentation word stock relies on word segmentation techniques from the field of natural language processing, for example segmentation using an HMM (Hidden Markov Model).
The words in the unigram word stock may include entity nouns and most high-frequency phrases, such as "what is going to", so that daily chat scenarios can be covered. Therefore, when word segmentation is performed based on the unigram word stock, the segmentation is accurate and the intention of the user can be accurately identified.
The words recorded in the word stock are ordered according to the number of characters each word contains, that is, words with the same number of characters are clustered together. In this way, the memory layout of the word stock can be optimized.
Moreover, the terminal can index the position of the words of each character count in the word stock through the word index table. In this way, while segmenting text information with the preset matching algorithm, the terminal can use the word index table to determine the position interval corresponding to the character count of the word to be queried, and then search for the word only within that interval. This avoids traversing the entire word stock in every search, shortens the search time and improves the word segmentation speed.
The words in the word stock may be arranged in ascending order of character count or in descending order of character count; either is reasonable.
For example, suppose the words "you", "china", "me", "like" and "web page" are recorded in the word stock, where, counted in the original Chinese, "you" and "me" each contain one character and "china", "like" and "web page" each contain two characters. Then, when the words in the word stock are arranged in ascending order of character count, the ordering is: "you", "me", "china", "like" and "web page".
Of course, words with the same number of characters, such as "china", "like" and "web page", may further be ordered among themselves, for example alphabetically, which is also reasonable.
After the words in the word stock are sorted, each word can be assigned an index, the Word ID. Word IDs correspond to words one to one, and the Word ID of a word may be the sequence number of that word in the word stock. Thus, once a word is determined, its Word ID can be determined, and once a Word ID is determined, the corresponding word can be determined.
In addition, the word index table may specifically include a first sub-index table and a second sub-index table. Each sub-index table may be stored as a tuple.
The first sub-index table is used for recording the starting position, in the word stock, of the words of each character count. For example, assuming that there are 100 words of 5 characters, the first sub-index table records the position in the word stock of the first 5-character word.
The second sub-index table is used for recording the size of the storage space occupied by the words of each character count. For example, the 100 words of 5 characters may occupy a storage space of 3 data blocks.
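The sorted word stock and the two sub-index tables can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the Chinese sample words are assumed back-translations of the "you / me / china / like / web page" example, dictionaries are used in place of tuples for readability, and the second sub-index table records a word count rather than a number of data blocks.

```python
def build_word_stock(words):
    """Sort words by character count (ascending), then lexicographically.

    The position of a word in the returned list plays the role of its Word ID.
    """
    return sorted(set(words), key=lambda w: (len(w), w))


def build_index_tables(word_stock):
    """Return (first_sub_index, second_sub_index) for a sorted word stock.

    first_sub_index[n]  -> position of the first n-character word
    second_sub_index[n] -> number of n-character words (an assumed stand-in
                           for the storage size mentioned in the text)
    """
    starts, sizes = {}, {}
    for pos, word in enumerate(word_stock):
        n = len(word)
        starts.setdefault(n, pos)       # keep only the first position seen
        sizes[n] = sizes.get(n, 0) + 1
    return starts, sizes


if __name__ == "__main__":
    # Assumed Chinese originals of: you, china, me, like, web page
    stock = build_word_stock(["你", "中国", "我", "喜欢", "网页"])
    starts, sizes = build_index_tables(stock)
    print(stock, starts, sizes)
```

With these two tables, the interval holding all n-character words is `[starts[n], starts[n] + sizes[n])`, so a lookup never needs to touch words of a different length.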
In addition, the preset matching algorithm may include an inverse longest match algorithm (also called reverse longest matching). That is, the text is scanned from right to left and, at each position, the longest candidate string that exists in the word stock is taken as a word. With this algorithm a more accurate word segmentation result can be obtained, and the segmentation is faster than with the machine-learning-based and neural-network-based word segmentation algorithms adopted in the prior art.
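The following Python sketch shows one way inverse longest matching could be combined with the word index table. All function names are hypothetical. The binary search inside each length interval is an assumption of this sketch; the embodiment only states that the word is searched for within the interval. Single characters that are not in the word stock are emitted as one-character words, which is also an assumption.

```python
from bisect import bisect_left


def build_index(words):
    """Sort by (length, text) and record start position and count per length."""
    stock = sorted(set(words), key=lambda w: (len(w), w))
    starts, sizes = {}, {}
    for pos, w in enumerate(stock):
        starts.setdefault(len(w), pos)
        sizes[len(w)] = sizes.get(len(w), 0) + 1
    return stock, starts, sizes


def contains(stock, starts, sizes, candidate):
    """Look up `candidate` only inside the interval for its character count."""
    n = len(candidate)
    if n not in starts:
        return False
    lo, hi = starts[n], starts[n] + sizes[n]
    i = bisect_left(stock, candidate, lo, hi)   # interval is sorted by text
    return i < hi and stock[i] == candidate


def inverse_longest_match(text, stock, starts, sizes):
    """Segment `text` from right to left, preferring the longest match."""
    max_len = max(starts) if starts else 1
    result = []
    end = len(text)
    while end > 0:
        for n in range(min(max_len, end), 0, -1):
            candidate = text[end - n:end]
            # Fall back to a single character when nothing longer matches.
            if n == 1 or contains(stock, starts, sizes, candidate):
                result.append(candidate)
                end -= n
                break
    result.reverse()
    return result


if __name__ == "__main__":
    stock, starts, sizes = build_index(["我", "爱", "中国", "超级", "喜欢"])
    print(inverse_longest_match("我爱中国", stock, starts, sizes))
```

Each dictionary probe in this sketch inspects only words of the candidate's length, which is the effect the word index table is intended to achieve.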
The word segmentation model may include an N-tuple model, namely an N-Gram model. The N-Gram model is based on the assumption that the occurrence of the N-th word is related only to the preceding N-1 words and not to any other word. Accordingly, the probability of a whole sentence is equal to the product of the conditional occurrence probabilities of its words, and the probability of each word can be obtained by statistics over the corpus.
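The assumption above can be written out as the standard N-Gram factorization. The notation is introduced here only for illustration and is not taken from the embodiment: $w_i$ denotes the i-th word of a sentence of $m$ words, and $C(\cdot)$ denotes a count in the segmented corpus.

```latex
P(w_1, w_2, \dots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-N+1}, \dots, w_{i-1}\right)

% For N = 2 (bigram), the usual maximum-likelihood estimate from counts is:
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```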
In the embodiment of the present invention, the value of N may be 2 or 3, which is not limited to this. For the sake of clear layout, a training method of the segmentation model will be described later.
In addition, after the word segmentation result of the text information is obtained, the terminal can search the N-tuple model for N-element relations according to the word segmentation result of the text information and the word index table, so as to obtain a plurality of predicted words corresponding to the word segmentation result of the text information. Then, a preset number of the predicted words can be displayed in descending order of the occurrence probability of each predicted word. Thus, the next word that the user wants to input can be predicted in combination with the N-Gram model.
The manner in which the predicted word is obtained will be described below with reference to specific examples.
Step one: when a user inputs the pinyin "wochaoji" through a Chinese input method, the terminal can obtain, according to the pinyin word-formation rule, the following predicted words corresponding to "wochaoji": super, noisy, shoveling and nest;
Step two: when the user selects the predicted word "I super", the terminal can display "I super" in the input box and segment the text information "I super" in the input box to obtain the word segmentation result "I/super";
Step three: the terminal uses the two-layer relation "I/super" to search for the subsequent N-element relation, and obtains and outputs the following predicted words: like, fun, smoldering, beautiful, smoldering and comfortable;
Step four: when the user selects the predicted word "like", the terminal can segment the text information "I super like" in the current input box to obtain the word segmentation result "I/super/like", and search the N-tuple model for the subsequent N-element relation by using the three-layer relation "I/super/like" and the word index table, so as to obtain and output the following predicted words: such, children, eating, they, the person and the sentence;
Step five: when none of the predicted words is the one the user wants and the user continues to input the pinyin "mail", the terminal can obtain, according to the pinyin word-formation rule, the following predicted words corresponding to "mail": michael, mike, shell, pellet, and buy;
Step six: when the user selects the predicted word "Michael", the terminal can segment the text information "I super like Michael" in the input box to obtain the word segmentation result "I/super/like/Michael"; then, the terminal can search the N-tuple model for the subsequent N-element relation by using the word segmentation result and the word index table, so as to obtain and output the following predicted words: jackson, euler, bode and Shu Mahe; when the user selects the predicted word "jackson", the input is completed.
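The lookup performed in steps three, four and six can be sketched as follows. This is a toy illustration, not the embodiment's data structure: `predict_next` and the nested-dictionary layout are hypothetical, the counts are invented, and words are stored directly rather than through Word IDs and the word index table.

```python
def predict_next(segmented, ngram_counts, n=2, top_k=3):
    """Rank candidate next words by estimated probability, highest first.

    segmented    -- word segmentation result, e.g. ["I", "super"]
    ngram_counts -- maps a context tuple of n-1 words to {next_word: count}
    """
    context = tuple(segmented[-(n - 1):]) if n > 1 else ()
    followers = ngram_counts.get(context, {})
    total = sum(followers.values())
    if total == 0:
        return []
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / total) for word, count in ranked[:top_k]]


if __name__ == "__main__":
    # Invented counts for the context "super" (bigram model, N = 2).
    toy_counts = {("super",): {"like": 5, "fun": 3, "beautiful": 2, "comfortable": 1}}
    for word, prob in predict_next(["I", "super"], toy_counts, n=2, top_k=3):
        print(word, round(prob, 3))
```

With N = 3 the same function would use the last two words of the segmentation result as the context, which corresponds to the "three-layer relation" mentioned in step four.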
It can be understood that when the terminal continuously performs word segmentation operation in the process of user input, the word segmentation method provided by the embodiment of the invention can greatly improve the query speed of words in the word segmentation process, thereby improving the word segmentation speed and further improving the input efficiency.
In addition, consider the case shown in step five, where none of the predicted words is the one the user wants and the user continues to input pinyin. Besides obtaining the predicted words corresponding to the input pinyin according to the pinyin word-formation rule, the predicted words corresponding to the input pinyin can also be obtained in the following way:
searching the N-tuple model for N-element relations according to the word segmentation result of the text information in the input box, so as to obtain a plurality of predicted words corresponding to the word segmentation result; then, determining, among the predicted words, those that satisfy the spelling rule of the pinyin character string as target predicted words; and then displaying the target predicted words on the recommended word display interface of the input method in descending order of their occurrence probability. This is also reasonable.
For example, when the text information in the input box is "prose", the terminal may segment "prose" to obtain a word segmentation result. Then, the word segmentation result can be used to search for the subsequent binary relation. When the user then inputs the pinyin "xs", the predicted words can be obtained by considering both the spelling rule of the pinyin and the predicted words found according to the binary relation: novels, reality, display, frightening, washing and mr. Therefore, the output predicted words are more in line with the user's expectation, and the user experience is improved.
By comparison, the predicted words obtained for the pinyin "xs" input by the user according to the pinyin word-formation rule alone are: hours, reality, display, frighten, wash and mr.
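Filtering the N-Gram candidates by the typed pinyin can be sketched as below. Everything here is illustrative: the matching rule (the typed string must be a concatenation of one non-empty prefix per syllable, so "xs" matches "xiao shuo") is an assumed simplification of the "spelling rule", and the candidate list, its pinyin and its probabilities are invented toy data.

```python
def matches_pinyin(syllables, typed):
    """True if `typed` is built from one non-empty prefix of each syllable."""
    if not syllables:
        return typed == ""
    first, rest = syllables[0], syllables[1:]
    for k in range(1, len(first) + 1):
        if typed.startswith(first[:k]) and matches_pinyin(rest, typed[k:]):
            return True
    return False


def target_predicted_words(candidates, typed):
    """Keep candidates whose pinyin fits `typed`, highest probability first.

    candidates -- list of (word, syllables, probability) tuples
    """
    kept = [c for c in candidates if matches_pinyin(c[1], typed)]
    kept.sort(key=lambda c: -c[2])
    return [word for word, _, _ in kept]


if __name__ == "__main__":
    # Invented candidates that an N-Gram lookup might return after "prose".
    toy_candidates = [
        ("poetry", ["shi", "ge"], 0.25),
        ("mr.", ["xian", "sheng"], 0.05),
        ("novels", ["xiao", "shuo"], 0.30),
        ("reality", ["xian", "shi"], 0.10),
    ]
    print(target_predicted_words(toy_candidates, "xs"))
```

In this sketch the N-Gram ranking decides the order and the pinyin only acts as a filter, which mirrors the way the combined method above places a context-appropriate word first.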
In the embodiment of the invention, the text information to be segmented can be determined. Then, the text information is segmented according to a preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, and the word segmentation result of the text information is obtained. The words in the word stock are ordered according to the number of characters contained in each word, that is, words with the same number of characters are gathered together, and the position in the word stock of the words of each character count can be indexed through the word index table. Therefore, while the text information is segmented with the preset matching algorithm, the word index table can be used to determine the position interval corresponding to the character count of the word to be queried, and the word is then searched for only within that interval. Traversal of the entire word stock is thereby avoided and the search time is shortened, so that the word segmentation speed is improved.
The following describes how the word segmentation model provided by the embodiment of the invention is constructed.
Step one: before the text information is segmented according to the preset matching algorithm, the word stock in the pre-constructed word segmentation model and the word index table corresponding to the word stock, an original corpus and a word segmentation word stock are obtained.
A large amount of original corpus can be obtained from the network. After the original corpus is obtained, the non-text information in it can be removed to obtain the processed original corpus. The non-text information includes symbol characters and numeric characters, but is of course not limited thereto.
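A minimal sketch of this cleaning step, assuming that "symbol characters and numeric characters" can be identified by their Unicode general category. The function name `strip_non_text` is hypothetical and the embodiment does not specify how the removal is implemented.

```python
import unicodedata


def strip_non_text(raw):
    """Drop digits, punctuation and symbols; keep letters, ideographs and spaces."""
    kept = []
    for ch in raw:
        category = unicodedata.category(ch)  # e.g. "Lo", "Nd", "Po", "Sm"
        # Assumption: categories N* (numbers), P* (punctuation), S* (symbols)
        # stand in for the "non-text information" of the description.
        if category[0] in ("N", "P", "S"):
            continue
        kept.append(ch)
    return "".join(kept)


print(strip_non_text("a1b+c"))
```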
The word segmentation word stock may store words commonly used in the input method and high-frequency words collected in advance by technicians, but is not limited thereto.
Step two: the original corpus is segmented using the word segmentation word stock to obtain a word segmentation result of the original corpus, and the word frequency of each word in the word segmentation word stock is counted according to that word segmentation result.
Step three: a word segmentation model is trained according to the word segmentation result of the original corpus and the counted word frequencies; the word segmentation model comprises a word stock in which each word of the word segmentation word stock is recorded.
Specifically, an N-gram model can be trained from the word segmentation result of the original corpus and the counted word frequencies, using the N-gram algorithm.
Step four: each word in the word stock is sorted according to the number of characters it contains.
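Steps two and three can be sketched as below for the case N = 2. This is an illustration under assumptions: the corpus is taken to be already segmented, probabilities are plain maximum-likelihood estimates without smoothing, and the names `train_bigram`, `corpus`, `freq` and `probs` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(segmented_sentences):
    """Return (word frequencies, bigram conditional probabilities)."""
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for words in segmented_sentences:
        unigram_counts.update(words)  # word frequency statistics
        for prev, nxt in zip(words, words[1:]):
            bigram_counts[prev][nxt] += 1  # count adjacent word pairs
    bigram_probs = {}
    for prev, followers in bigram_counts.items():
        total = sum(followers.values())
        bigram_probs[prev] = {w: c / total for w, c in followers.items()}
    return unigram_counts, bigram_probs


# Toy segmented corpus; English glosses stand in for Chinese words.
corpus = [["prose", "novels"], ["prose", "novels"], ["prose", "poetry"]]
freq, probs = train_bigram(corpus)
print(freq["prose"], probs["prose"])
```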
After each word in the word stock is ordered according to the number of characters contained in the word, the words with the same number of characters in the word stock can be gathered together, so that the optimization of the storage structure of the word stock is realized.
In addition, the word index table may include a first sub-index table and a second sub-index table. After step four, a first sub-index table recording the start position information of the words of each character number in the word stock may be constructed, together with a second sub-index table recording the size of the storage space occupied by the words of each character number. In this way, when a word is subsequently queried, the word stock does not need to be traversed: it is only necessary to find the position interval in the word stock that corresponds to the number of characters of the word to be queried, and then to search for the word within that interval. The search time for a word is thereby shortened and the word segmentation speed is improved.
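The length-ordered word stock and its two sub-index tables can be sketched as follows. Several points are assumptions made for illustration: the word stock is held as a Python list, the second sub-index table stores an entry count rather than a byte size, words of equal length are additionally kept in lexicographic order so that a binary search can be used inside the interval, and the names `build_word_stock` and `contains` are hypothetical.

```python
from bisect import bisect_left


def build_word_stock(words):
    """Sort words by character count and build the two sub-index tables."""
    stock = sorted(set(words), key=lambda w: (len(w), w))
    start_index = {}  # first sub-index table: character count -> start position
    size_index = {}   # second sub-index table: character count -> number of entries
    for pos, word in enumerate(stock):
        n = len(word)
        start_index.setdefault(n, pos)
        size_index[n] = size_index.get(n, 0) + 1
    return stock, start_index, size_index


def contains(stock, start_index, size_index, word):
    """Look up word only inside the interval for its character count."""
    n = len(word)
    if n not in start_index:
        return False
    lo = start_index[n]
    hi = lo + size_index[n]
    i = bisect_left(stock, word, lo, hi)  # binary search restricted to [lo, hi)
    return i < hi and stock[i] == word


stock, starts, sizes = build_word_stock(["ab", "a", "abc", "b", "cd"])
print(stock, starts, sizes)
print(contains(stock, starts, sizes, "cd"))
```

A query for a two-character word inspects only the two-character interval, which is the saving the description attributes to the index table.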
In summary, with the word segmentation method provided by the embodiment of the invention, when text information is segmented using the preset matching algorithm, the position interval corresponding to the number of characters of the word to be queried can be determined through the word index table, and the word is then searched for only within that position interval. Traversing the entire word stock is thus avoided and the search time is shortened, which improves the word segmentation speed.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a word segmentation device. Referring to Fig. 2, the device may include:
a first determining module 201, configured to determine text information to be segmented;
the first word segmentation module 202 is configured to segment the text information according to a preset matching algorithm, a word stock in a pre-constructed word segmentation model, and a word index table corresponding to the word stock; wherein words in the word stock are ordered according to the number of characters contained in each word; the word index table is used for: indexing the position of words of each character number in the word stock;
and the obtaining module 203 is configured to obtain a word segmentation result of the text information.
By applying the device provided by the embodiment of the invention, the text information to be segmented can be determined. The text information is then segmented according to a preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, and the word segmentation result of the text information is obtained. The words in the word stock are ordered according to the number of characters contained in each word, that is, words with the same number of characters are gathered together in the word stock, and the position of the words of each character number in the word stock can be indexed through the word index table. Therefore, when the text information is segmented using the preset matching algorithm, the position interval corresponding to the number of characters of the word to be queried can be determined through the word index table, and the word is then searched for only within that position interval. Traversing the entire word stock is thus avoided and the search time is shortened, which improves the word segmentation speed.
Optionally, in an embodiment of the present invention, the first determining module 201 may include:
a detection unit for detecting whether the information in the input box is updated;
an obtaining unit configured to obtain information in the input box as target information when the information in the input box is updated;
the selecting unit is used for selecting a preset number of character sequences in the target information according to a right-to-left selecting mode to obtain the text information to be segmented.
Optionally, in an embodiment of the present invention, the preset matching algorithm may include: inverse longest match algorithm.
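The right-to-left selection and the inverse (reverse) longest match can be sketched together as below. This is a hedged illustration: the lexicon is a plain Python set rather than the indexed word stock of the embodiment, unknown single characters are emitted as one-character words, and the names `select_tail`, `reverse_longest_match` and `max_len` are hypothetical.

```python
def select_tail(target, preset_count):
    """Take the last preset_count characters, i.e. select from right to left."""
    return target[-preset_count:] if preset_count > 0 else ""


def reverse_longest_match(text, lexicon, max_len):
    """Segment text by scanning from its end and preferring the longest lexicon word."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is always accepted so the loop cannot stall.
            if size == 1 or piece in lexicon:
                words.append(piece)
                end -= size
                break
    words.reverse()  # pieces were collected right to left
    return words


lexicon = {"研究", "研究生", "生命", "起源"}
text = select_tail("我们研究生命起源", 6)
print(reverse_longest_match(text, lexicon, max_len=3))  # ['研究', '生命', '起源']
```

Scanning from the right avoids the split "研究生 / 命 / 起源" that a left-to-right longest match would produce on this input.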
Optionally, in an embodiment of the present invention, the device may further include:
the first acquisition module is used for acquiring an original corpus and a word segmentation word stock before the text information is segmented according to a preset matching algorithm, a word stock in a pre-constructed word segmentation model and a word index table corresponding to the word stock;
the second word segmentation module is used for segmenting the original corpus by utilizing the word segmentation word stock to obtain a word segmentation result of the original corpus;
the statistics module is used for counting the word frequency of each word in the word segmentation word stock according to the word segmentation result of the original corpus;
the training module is used for training to obtain a word segmentation model according to the word segmentation result of the original corpus and the word frequency obtained through statistics; the word segmentation model comprises a word stock, wherein each word is recorded in the word stock;
and the ordering module is used for ordering each word in the word stock according to the number of characters contained in the word.
Optionally, in an embodiment of the present invention, the word index table includes a first sub index table and a second sub index table; accordingly, the apparatus may further include:
the first construction module is used for constructing a first sub-index table for recording the initial position information of words with each character number in the word stock after sequencing each word in the word stock according to the character number contained in the word;
and the second construction module is used for constructing a second sub-index table for recording the storage space size occupied by the words of each character number.
Optionally, in an embodiment of the present invention, the word segmentation model may include: an N-gram model.
Optionally, in an embodiment of the present invention, the device may further include:
the searching module is used for searching for N-gram relations in the N-gram model according to the word segmentation result of the text information and the word index table after the word segmentation result of the text information is obtained, so as to obtain a plurality of predicted words corresponding to the word segmentation result of the text information;
the display module is used for displaying a preset number of the predicted words in descending order of the occurrence probability of each predicted word.
Optionally, in an embodiment of the present invention, the display module may include:
a first determining unit for determining a pinyin character string input by a user;
a second determining unit, configured to determine, as a target predicted word, a predicted word that satisfies a spelling rule of the pinyin string among the predicted words;
and the display unit is used for displaying each target predicted word on a recommended word display interface of the input method according to the sequence from high to low of the occurrence probability of each target predicted word.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a terminal. Referring to Fig. 3, the terminal includes a processor 301, a communication interface 302, a memory 303, and a communication bus 304, where the processor 301, the communication interface 302, and the memory 303 communicate with each other through the communication bus 304;
a memory 303 for storing a computer program;
the processor 301 is configured to implement the method steps of any of the word segmentation methods described above when executing the program stored in the memory 303.
In the embodiment of the invention, the terminal can determine the text information to be segmented. The terminal can then segment the text information according to a preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, and obtain the word segmentation result of the text information. The words in the word stock are ordered according to the number of characters contained in each word, that is, words with the same number of characters are gathered together in the word stock, and the position of the words of each character number in the word stock can be indexed through the word index table. Therefore, when the text information is segmented using the preset matching algorithm, the position interval corresponding to the number of characters of the word to be queried can be determined through the word index table, and the word is then searched for only within that position interval. Traversing the entire word stock is thus avoided and the search time is shortened, which improves the word segmentation speed.
Corresponding to the above method embodiments, the present invention further provides a computer readable storage medium, in which a computer program is stored, where the computer program, when executed by a processor of a terminal, implements the method steps of any of the above word segmentation methods.
After the computer program stored in the computer readable storage medium provided by the embodiment of the invention is executed by the processor of the terminal, the terminal can determine the text information to be segmented. The text information is then segmented according to a preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, and the word segmentation result of the text information is obtained. The words in the word stock are ordered according to the number of characters contained in each word, that is, words with the same number of characters are gathered together in the word stock, and the position of the words of each character number in the word stock can be indexed through the word index table. Therefore, when the text information is segmented using the preset matching algorithm, the position interval corresponding to the number of characters of the word to be queried can be determined through the word index table, and the word is then searched for only within that position interval. Traversing the entire word stock is thus avoided and the search time is shortened, which improves the word segmentation speed.
Corresponding to the above method embodiments, the present invention also provides a computer program product, which, when run on a terminal, causes the terminal to perform: the method steps of any of the word segmentation methods described above.
After the computer program provided by the embodiment of the invention is executed by the processor of the terminal, the terminal can determine the text information to be segmented. The text information is then segmented according to a preset matching algorithm, the word stock in the pre-constructed word segmentation model, and the word index table corresponding to the word stock, and the word segmentation result of the text information is obtained. The words in the word stock are ordered according to the number of characters contained in each word, that is, words with the same number of characters are gathered together in the word stock, and the position of the words of each character number in the word stock can be indexed through the word index table. Therefore, when the text information is segmented using the preset matching algorithm, the position interval corresponding to the number of characters of the word to be queried can be determined through the word index table, and the word is then searched for only within that position interval. Traversing the entire word stock is thus avoided and the search time is shortened, which improves the word segmentation speed.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, terminal and computer readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (14)

1. A method of word segmentation, the method comprising:
determining text information to be segmented according to text information input by a user;
segmenting the text information according to a preset matching algorithm, a word stock in a pre-constructed word segmentation model, and a word index table corresponding to the word stock; wherein the words in the word stock are ordered according to the number of characters contained in each word, and words with the same number of characters are gathered together in the word stock; the word index table is used for indexing the position of the words of each character number in the word stock; and the word segmentation model comprises an N-gram model;
obtaining a word segmentation result of the text information;
searching for N-gram relations in the N-gram model according to the word segmentation result of the text information and the word index table, to obtain a plurality of predicted words corresponding to the word segmentation result of the text information;
and displaying a preset number of the predicted words in descending order of the occurrence probability of each predicted word.
2. The method of claim 1, wherein the step of determining text information to be segmented comprises:
detecting whether the information in the input box is updated;
when the information in the input box is updated, obtaining the information in the input box as target information;
and selecting a preset number of character sequences in the target information according to a right-to-left selection mode to obtain the text information to be segmented.
3. The method of claim 1, wherein the preset matching algorithm comprises: inverse longest match algorithm.
4. The method according to claim 1, further comprising, before the step of word segmentation of the text information according to a preset matching algorithm, a word stock in a pre-built word segmentation model, and a word index table corresponding to the word stock:
acquiring an original corpus and a word segmentation word stock;
performing word segmentation on the original corpus by using the word segmentation word stock to obtain a word segmentation result of the original corpus;
counting word frequencies of all words in the word segmentation word stock according to word segmentation results of the original corpus;
training a word segmentation model according to the word segmentation result of the original corpus and the word frequencies obtained through statistics; the word segmentation model comprises a word stock, wherein each word is recorded in the word stock;
and sorting the words in the word stock according to the number of characters contained in the words.
5. The method of claim 4, wherein the word index table comprises a first sub-index table and a second sub-index table;
after the step of ordering the words in the word stock according to the number of characters contained in the words, the method further comprises:
constructing a first sub-index table for recording the start position information of the words of each character number in the word stock;
and constructing a second sub-index table for recording the storage space size occupied by the words of each character number.
6. The method of claim 1, wherein the step of displaying a preset number of the predicted words in descending order of the occurrence probability of each predicted word comprises:
determining a pinyin character string input by a user;
determining, among the predicted words, the predicted words that satisfy the spelling rule of the pinyin character string as target predicted words;
and displaying each target predicted word on a recommended word display interface of the input method in descending order of the occurrence probability of each target predicted word.
7. A word segmentation apparatus, the apparatus comprising:
the first determining module is used for determining text information to be segmented according to the text information input by the user;
the first word segmentation module is used for segmenting the text information according to a preset matching algorithm, a word stock in a pre-built word segmentation model and a word index table corresponding to the word stock; wherein the words in the word stock are ordered according to the number of characters contained in each word, and words with the same number of characters are gathered together in the word stock; the word index table is used for indexing the position of the words of each character number in the word stock; and the word segmentation model comprises an N-gram model;
the obtaining module is used for obtaining word segmentation results of the text information;
the searching module is used for searching for N-gram relations in the N-gram model according to the word segmentation result of the text information and the word index table after the word segmentation result of the text information is obtained, so as to obtain a plurality of predicted words corresponding to the word segmentation result of the text information;
the display module is used for displaying a preset number of the predicted words in descending order of the occurrence probability of each predicted word.
8. The apparatus of claim 7, wherein the first determining module comprises:
a detection unit for detecting whether the information in the input box is updated;
an obtaining unit configured to obtain information in an input box as target information when the information in the input box is updated;
and the selecting unit is used for selecting a preset number of character sequences in the target information according to a right-to-left selecting mode to obtain the text information to be segmented.
9. The apparatus of claim 8, wherein the preset matching algorithm comprises: inverse longest match algorithm.
10. The apparatus as recited in claim 8, further comprising:
the first acquisition module is used for acquiring an original corpus and a word segmentation word stock before the text information is segmented according to a preset matching algorithm, a word stock in a pre-built word segmentation model and a word index table corresponding to the word stock;
the second word segmentation module is used for segmenting the original corpus by utilizing the word segmentation word stock to obtain a word segmentation result of the original corpus;
the statistics module is used for counting the word frequency of each word in the word segmentation word stock according to the word segmentation result of the original corpus;
the training module is used for training to obtain a word segmentation model according to the word segmentation result of the original corpus and the word frequency obtained through statistics; the word segmentation model comprises a word stock, wherein each word is recorded in the word stock;
and the ordering module is used for ordering the words in the word stock according to the number of characters contained in the words.
11. The apparatus of claim 10, wherein the word index table comprises a first sub-index table and a second sub-index table; the apparatus further comprises:
the first construction module is used for constructing a first sub-index table for recording the start position information of the words of each character number in the word stock after the words in the word stock are sorted according to the number of characters contained in the words;
and the second construction module is used for constructing a second sub-index table for recording the storage space size occupied by the words of each character number.
12. The apparatus of claim 7, wherein the display module comprises:
a first determining unit for determining a pinyin character string input by a user;
a second determining unit, configured to determine, as a target predicted word, a predicted word that satisfies a spelling rule of the pinyin string among the predicted words;
and the display unit is used for displaying each target predicted word on a recommended word display interface of the input method according to the sequence from high to low of the occurrence probability of each target predicted word.
13. A terminal, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1-6 when executing a program stored on a memory.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
CN201811619990.1A 2018-12-28 2018-12-28 Word segmentation method, device, terminal and computer readable storage medium Active CN109800427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811619990.1A CN109800427B (en) 2018-12-28 2018-12-28 Word segmentation method, device, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811619990.1A CN109800427B (en) 2018-12-28 2018-12-28 Word segmentation method, device, terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109800427A CN109800427A (en) 2019-05-24
CN109800427B true CN109800427B (en) 2023-09-22

Family

ID=66557861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811619990.1A Active CN109800427B (en) 2018-12-28 2018-12-28 Word segmentation method, device, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109800427B (en)

Also Published As

Publication number Publication date
CN109800427A (en) 2019-05-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant