CN109800408A - Dictionary data storage method and device, segmenting method and device based on dictionary - Google Patents
- Publication number
- CN109800408A (application CN201711136379.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- information
- lead
- dictionary
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to a dictionary data storage method and device, a dictionary-based word segmentation method and device, a computer device, and a storage medium. The dictionary data storage method includes: obtaining a dictionary; analyzing the dictionary to obtain the word information corresponding to each lead-in (the first character of a word); storing the word information corresponding to each lead-in successively in order of word length; establishing the correspondence between each word length of a lead-in and the storage location of the word information to obtain a concordance (word index); and, according to the concordance, establishing the relationship between each lead-in and the storage location of its concordance information to obtain a lead-in index. With this method, a query can be completed by loading into memory only the word information that matches the lead-in and word length of the word being looked up. The entire dictionary does not have to be loaded, which reduces the memory footprint of word segmentation while keeping queries fast.
Description
Technical field
This application relates to the technical field of information processing, and in particular to a dictionary data storage method and device, a dictionary-based word segmentation method and device, a computer device, and a storage medium.
Background art
Word segmentation is the foundation of text mining: successfully segmenting an input text allows a computer to identify the meaning of its sentences automatically. Traditional word segmentation methods include algorithms based on string matching, which require a sufficiently large segmentation dictionary.
To ensure string lookup performance during segmentation, traditional string-matching algorithms generally load the entire dictionary into memory and organize it in a form suited to string lookup, such as a prefix tree (trie). With such an in-memory data structure, it is possible to check quickly whether a candidate string exists in the dictionary. Loading the dictionary into memory does allow the many string lookups issued during segmentation to be answered quickly. In practice, however, a dictionary file that yields good segmentation accuracy generally has a scale of roughly 4M to 30M, so loading the whole dictionary occupies a considerable amount of memory.
Summary of the invention
In view of this, to address the excessive memory occupied by word segmentation, it is necessary to provide a dictionary data storage method and device, a dictionary-based word segmentation method and device, a computer device, and a storage medium.
A dictionary data storage method, comprising:
obtaining a dictionary;
analyzing the dictionary to obtain the word information corresponding to each lead-in;
storing the word information corresponding to each lead-in successively in order of word length;
establishing the correspondence between each word length of a lead-in and the storage location of the word information to obtain a concordance; and
according to the concordance, establishing the relationship between each lead-in and the storage location of its concordance information to obtain a lead-in index.
A dictionary-based word segmentation method, comprising:
obtaining text information to be segmented;
splitting the text information according to a preset text matching algorithm to obtain each matching field and the lead-in of each matching field;
reading the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of their concordance information, obtaining the storage location of the concordance information corresponding to the lead-in of the matching field;
reading the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information;
in the concordance, obtaining the storage location of the word information corresponding to the word length of the matching field according to the correspondence between word lengths and the storage locations of word information;
loading the word information corresponding to that word length from the dictionary into memory according to the storage location of the word information;
matching the matching field against each word in the word information to obtain the matching result of the matching field; and
based on the preset text matching algorithm, obtaining the segmentation result from the matching results of the matching fields.
A dictionary data storage device, comprising: a dictionary obtaining module, a dictionary analysis module, a storage module, a concordance establishing module, and a lead-in index establishing module;
the dictionary obtaining module is configured to obtain a dictionary;
the dictionary analysis module is configured to analyze the dictionary to obtain the word information corresponding to each lead-in;
the storage module is configured to store the word information corresponding to each lead-in successively in order of word length;
the concordance establishing module is configured to establish the correspondence between each word length of a lead-in and the storage location of the word information to obtain a concordance;
the lead-in index establishing module is configured to establish, according to the concordance, the relationship between each lead-in and the storage location of its concordance information to obtain a lead-in index.
A dictionary-based word segmentation device, comprising: a text obtaining module, a splitting module, a searching module, a reading module, a loading module, a matching module, and a segmentation module;
the text obtaining module is configured to obtain text information to be segmented;
the splitting module is configured to split the text information according to a preset text matching algorithm to obtain each matching field and the lead-in of each matching field;
the searching module is configured to read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of their concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field;
the reading module is configured to read the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information;
the searching module is further configured to obtain, in the concordance, the storage location of the word information corresponding to the word length of the matching field according to the correspondence between word lengths and the storage locations of word information;
the loading module is configured to load the word information corresponding to that word length from the dictionary into memory according to the storage location of the word information;
the matching module is configured to match the matching field against each word in the word information to obtain the matching result of the matching field;
the segmentation module is configured to obtain, based on the preset text matching algorithm, the segmentation result from the matching results of the matching fields.
A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above dictionary data storage method or dictionary-based word segmentation method.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above dictionary data storage method or dictionary-based word segmentation method.
The above dictionary data storage method and device do not store all the words of a lead-in together in arbitrary order. Instead, the word information of each lead-in is stored successively in order of word length, and the correspondence between each word length of the lead-in and the storage location of the word information is recorded in a concordance. A lead-in index is then built from the concordance, recording the location of each lead-in's concordance information. When the dictionary is queried, the storage location of the lead-in's index information is determined from the lead-in, and the storage location of the word information for the required word length is then determined within that index information. Only the word information for that word length needs to be loaded into memory to complete the query. The entire dictionary does not have to be loaded, which reduces the memory footprint of word segmentation while keeping queries fast.
The above dictionary-based word segmentation method and device determine the storage location of a lead-in's index information from the lead-in, and then determine, within that index information, the storage location of the word information for the required word length. Only the word information for that word length needs to be loaded into memory to complete the query. The entire dictionary does not have to be loaded, which reduces the memory footprint of word segmentation while keeping queries fast.
Brief description of the drawings
Fig. 1 is a flow chart of the dictionary data storage method in one embodiment;
Fig. 2 is a flow chart of the step of storing the word information corresponding to each lead-in successively in order of word length in one embodiment;
Fig. 3 is a schematic diagram of the storage structure of the dictionary data in one embodiment;
Fig. 4 is a flow chart of the dictionary-based word segmentation method in one embodiment;
Fig. 5 is a flow chart of the step of performing a binary search on the word information to obtain the matching result of a matching field in one embodiment;
Fig. 6 is a flow chart of the dictionary-based word segmentation method in another embodiment;
Fig. 7 is an architecture diagram of the word segmentation system in one embodiment;
Fig. 8 is a structural block diagram of the dictionary data storage device in one embodiment;
Fig. 9 is a structural block diagram of the dictionary-based word segmentation device in one embodiment;
Fig. 10 is a structural block diagram of the computer device in one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it.
As shown in Fig. 1, one embodiment provides a dictionary data storage method, which specifically includes the following steps:
S102: Obtain a dictionary.
Here, the dictionary is a collection of words compiled from a large amount of corpus material.
S104: Analyze the dictionary to obtain the word information corresponding to each lead-in.
Here, the lead-in is the first character of a word. Taking the word "mankind" (人类) as an example, its first character "people" (人) is the lead-in of the word.
Specifically, the dictionary is traversed and the lead-in of each word is analyzed to obtain the word information corresponding to each lead-in. The word information may be any one or more of the word content, the word length, the word frequency information, and the additional information of the word. For example, for the lead-in "people", traversing the dictionary yields all words that begin with "people" together with their word information, including the word information of words such as "mankind", "life", and "artificial satellite".
The word length is the length of a word, that is, the number of characters it contains. For the word "mankind" (人类), the word length is 2.
The additional information of a word may be its part-of-speech tag; parts of speech include time words, nouns, adjectives, place words, and so on.
In a specific embodiment, the step of analyzing the dictionary to obtain the word information corresponding to each lead-in includes: analyzing the dictionary in character set encoding order to obtain the word information of each lead-in.
A character set encoding is the encoding used when a collection of characters (typically from a few dozen to tens of thousands) is packaged into a single file; an external program can use this encoding to retrieve a specified character from the character set file. Common Chinese character set encodings include GB2312, GBK, and Unicode.
Taking the GBK character set encoding as an example: following the order of the encoding, the character corresponding to each code is looked up in the dictionary in turn to check whether any word has that character as its lead-in; if so, all the word information corresponding to that lead-in is retrieved. Traversing the character set encoding in order yields the word information for every character that serves as a lead-in.
In this embodiment, a fixed-length character encoding is used, which makes character offsets easy to calculate and speeds up lookup.
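As a rough illustration of the traversal in S104, the following Python sketch groups a small in-memory word list by lead-in and orders the groups by the GBK code of the lead-in. The function name `group_by_lead_in` and the sample words are assumptions made for illustration only; the application does not prescribe an implementation.

```python
from collections import defaultdict


def group_by_lead_in(words):
    """Group words by their first character (lead-in), ordered by GBK code."""
    groups = defaultdict(list)
    for word in words:
        if word:  # skip empty strings
            groups[word[0]].append(word)
    # Sorting by the GBK bytes of the lead-in mimics traversal in character set order.
    ordered = sorted(groups.items(), key=lambda item: item[0].encode("gbk"))
    return dict(ordered)


# Hypothetical sample words; three share the lead-in 人 and one starts with 生.
groups = group_by_lead_in(["人类", "生活", "人生", "人造卫星"])
```

Each value in `groups` is the list of words for one lead-in, which is the input the later storage steps operate on.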
S106: Store the word information corresponding to each lead-in successively in order of word length.
The words of one lead-in come in several word lengths, such as two characters, three characters, and so on. Taking the lead-in "people" as an example, "mankind" is a two-character word and "artificial satellite" is a four-character word. To keep the dictionary content loaded into memory as small as possible, the word information of words with the same lead-in and the same word length can be placed together. The word information of the relevant word length can then be loaded into memory according to the word length and lead-in of the matching field, which avoids the large memory consumption caused by loading the entire dictionary.
Specifically, within all the word information of each lead-in, entries are stored successively in ascending or descending order of word length. After storage, the word information of words with the same lead-in and the same word length occupies contiguous storage locations. It should be understood that the same ordering should be used for every lead-in. For example, for each lead-in, the word information of two-character words is stored first, then that of three-character words, and so on, until the word information of the longest words of that lead-in has been stored. Based on practical segmentation experience, the maximum word length can be limited to 8. In a specific embodiment, the maximum word length can also be adjusted as required.
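The ordering by word length can be sketched as follows. This is a minimal illustration, assuming ascending order, a minimum stored length of 2, and the maximum length of 8 mentioned above; the name `order_by_length` is hypothetical.

```python
from itertools import groupby

MAX_WORD_LEN = 8  # upper bound suggested in the text; adjustable


def order_by_length(words, max_len=MAX_WORD_LEN):
    """Return {word_length: [words]} for one lead-in, shortest length first."""
    kept = [w for w in words if 2 <= len(w) <= max_len]
    kept.sort(key=len)  # stable sort: words of equal length keep their input order
    return {n: list(group) for n, group in groupby(kept, key=len)}


blocks = order_by_length(["人造卫星", "人类", "人生"])
```

Because words of equal length end up adjacent, each value of `blocks` corresponds to one contiguous region in the stored file.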
Fig. 2 is a flow chart, in one embodiment, of the step of storing the word information corresponding to each lead-in successively in order of word length. As shown in Fig. 2, this step includes S202 to S204.
S202: Store the word information corresponding to each lead-in successively in character set encoding order.
Specifically, following the character set encoding order, all the word information of one lead-in is stored before that of the next lead-in. For example, with the GBK Chinese character set, for the lead-in corresponding to each character code in order, the dictionary is traversed to obtain the words of that lead-in and their word information. Because lead-ins are stored successively in character set encoding order, character offsets are easy to calculate and lookup is faster.
S204: For the word information of each lead-in, store the words of the same word length successively in order of their character codes.
The words of one lead-in come in several word lengths, such as two characters, three characters, and so on. To keep the dictionary content loaded into memory as small as possible, the word information of words with the same lead-in and the same word length can be placed together. The word information of the relevant word length can then be loaded into memory according to the word length and lead-in of the matching field, which avoids the large memory consumption caused by loading the entire dictionary.
The word information of words of the same word length is in turn stored in order of the character codes of the words. Specifically, two-character words are stored in order of the GBK Chinese character code of their second character, and three-character words are stored in order of the GBK Chinese character code of their third character. By analogy, among the words of the same lead-in, the word information of each word length is stored successively in order of character code.
With this storage scheme, the word information of words with the same lead-in and the same word length is stored in character code order, and every record among the word information of one word length has a fixed data length, which provides good support for binary search. Binary search is an efficient search method that requires the data to be stored as a sequential list sorted by key.
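To make the fixed-length record idea concrete, here is a small sketch that packs the words of one lead-in and one word length into a sorted block. The record layout (remaining characters in GBK followed by a 4-byte frequency) is an assumption for illustration, as is sorting on all remaining characters rather than a single one; the sketch also assumes every character encodes to two bytes in GBK.

```python
import struct

FREQ = struct.Struct("<I")  # hypothetical 4-byte little-endian word frequency


def pack_block(words, freq):
    """Pack words sharing one lead-in and one length into sorted fixed-length records."""
    lead_in = words[0][0]
    # The lead-in is implied by the block, so only the remaining characters are stored.
    rests = sorted(w[1:].encode("gbk") for w in words)
    return b"".join(r + FREQ.pack(freq[lead_in + r.decode("gbk")]) for r in rests)


block = pack_block(["人生", "人类"], {"人生": 5, "人类": 9})
record_size = 2 * (2 - 1) + FREQ.size  # 2 bytes per remaining character + frequency
```

Since every record in `block` has the same size, the i-th record starts at byte `i * record_size`, which is what makes binary search over the block possible.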
S108: Establish the correspondence between each word length of a lead-in and the storage location of the word information to obtain a concordance.
Here, the storage location of the word information consists of the offset and the data length of the string that holds the word information. The offset is the starting storage location of the word information of words with the same word length, that is, the position where the first word information record of that word length is stored. Since every stored entry has the same number of characters and a fixed-length character encoding is used, every word information record has a fixed length, so the data length of the word information can be calculated from the number of words of that word length under the lead-in.
Specifically, the concordance of a lead-in should include the correspondence between word length 2 and its storage location, the correspondence between word length 3 and its storage location, and so on: it contains, for every word length of the lead-in, the storage location of the corresponding word information.
The concordance as a whole should include the correspondence between every word length of every lead-in and the storage location of the word information. The storage location of the word information can be determined from its offset and data length, and the corresponding word information can then be obtained from that storage location.
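A concordance for one lead-in can be sketched as a mapping from word length to an (offset, data length) pair. The function below is an illustrative assumption, not the applicant's code: it takes already packed blocks and lays them out back to back starting at a hypothetical `base_offset`.

```python
def build_concordance(blocks, base_offset=0):
    """blocks: {word_length: packed bytes}. Returns {word_length: (offset, data_length)}."""
    concordance, offset = {}, base_offset
    for word_length in sorted(blocks):  # ascending word length, as in the example above
        data_length = len(blocks[word_length])
        concordance[word_length] = (offset, data_length)
        offset += data_length
    return concordance


concordance = build_concordance({4: b"y" * 10, 2: b"x" * 12}, base_offset=100)
```

Given an entry such as `(100, 12)`, a reader seeks to byte 100 and reads 12 bytes to obtain exactly the records for that word length.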
S110: According to the concordance, establish the positional relationship between each lead-in and its concordance information to obtain a lead-in index.
Specifically, the concordance includes the correspondence between every word length of every lead-in and the storage location of the word information. According to the position in the concordance of the entries for each lead-in's word lengths, the positional relationship between each lead-in and its concordance information is established, yielding the lead-in index.
Specifically, the storage location of the concordance information consists of its offset and data length. The offset is the starting storage location of the concordance entries for the word lengths of one lead-in, that is, the position where the first concordance entry of the lead-in is stored. The data length is the total length of the index information for all word lengths of that lead-in. It should be understood that, in the index, the data describing each storage location has a fixed length, so the data length of all the concordance information of a lead-in is easily calculated from the number of word lengths that the lead-in has.
The storage location of the concordance information can be determined from its offset and data length; then, according to the word length of the word being looked up, the storage location of the word information for that word length is determined within the index information.
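The step from concordance to lead-in index can be sketched in the same style. The fixed-size entry layout `ENTRY` (one byte for the word length plus two 4-byte fields) is a hypothetical choice used only to show how the data length follows from the number of word lengths.

```python
import struct

# Hypothetical concordance entry: word length, offset, data length.
ENTRY = struct.Struct("<BII")


def build_lead_in_index(concordances):
    """concordances: {lead_in: {word_length: (offset, data_length)}} in storage order.

    Returns {lead_in: (offset, data_length)} locating each lead-in's concordance entries.
    """
    index, position = {}, 0
    for lead_in, entries in concordances.items():
        size = len(entries) * ENTRY.size  # fixed-size entries: count times entry size
        index[lead_in] = (position, size)
        position += size
    return index


lead_in_index = build_lead_in_index({
    "人": {2: (0, 12), 4: (12, 10)},
    "生": {2: (22, 6)},
})
```

A lookup first consults `lead_in_index` to find where a lead-in's concordance entries live, then scans those few entries for the required word length.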
The storage structure of the dictionary data in one embodiment is shown in Fig. 3. When the dictionary data is organized, a multi-level index, consisting of the lead-in index and the concordance, is built over all the dictionary data (format: word string, word frequency, extension information such as part of speech). The overall file structure is divided into a lead-in index area and a per-word-length word information area (containing the concordance and the word information).
The lead-in index contains, for every character of the GBK Chinese character set, the start position and length of the corresponding per-word-length word information area. Its entries are arranged in order of GBK Chinese character code and are divided into the GB2312 area (0xB0A1-0xF7FE, 6763 Chinese characters in total), the GBK/3 area (0x8140-0xA0FE, 6080 Chinese characters in total), and the GBK/4 area (0xAA40-0xFEA0, 8160 Chinese characters in total). The lead-in index area therefore has 6763 + 6080 + 8160 = 21003 entries. The lead-in index records, for each lead-in, the storage location of its concordance information.
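Because the lead-in index has one fixed-size entry per GBK Chinese character, the position of an entry can be computed directly from the character code. The sketch below is one possible mapping under stated assumptions: the three areas are laid out in the order listed above, the five unassigned positions 0xD7FA-0xD7FE of the GB2312 area are skipped (which is how 72 rows of 94 positions give 6763 characters), the trail byte 0x7F is excluded, and `ENTRY_SIZE` is a hypothetical 8 bytes.

```python
GB2312_COUNT, GBK3_COUNT, GBK4_COUNT = 6763, 6080, 8160
ENTRY_SIZE = 8  # hypothetical: 4-byte offset + 4-byte length per lead-in


def gbk_slot(code):
    """Map a two-byte GBK code to its slot in the lead-in index, or None."""
    if len(code) != 2:
        return None
    hi, lo = code
    if 0xB0 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE:  # GB2312 area
        if hi == 0xD7 and lo >= 0xFA:
            return None  # five unassigned positions at the end of row 0xD7
        slot = (hi - 0xB0) * 94 + (lo - 0xA1)
        return slot - 5 if hi > 0xD7 else slot
    if 0x81 <= hi <= 0xA0 and 0x40 <= lo <= 0xFE and lo != 0x7F:  # GBK/3 area
        slot = (hi - 0x81) * 190 + (lo - 0x40) - (lo > 0x7F)
        return GB2312_COUNT + slot
    if 0xAA <= hi <= 0xFE and 0x40 <= lo <= 0xA0 and lo != 0x7F:  # GBK/4 area
        slot = (hi - 0xAA) * 96 + (lo - 0x40) - (lo > 0x7F)
        return GB2312_COUNT + GBK3_COUNT + slot
    return None


def index_entry_offset(ch):
    """Byte offset of the lead-in index entry for character ch, or None."""
    slot = gbk_slot(ch.encode("gbk"))
    return None if slot is None else slot * ENTRY_SIZE
```

With this mapping the last slot is 21002, matching the 21003 entries computed in the text, and no search is needed to find a lead-in's entry.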
The concordance contains the correspondence between each word length of each lead-in and the storage location of the word information. As shown in Fig. 3, the concordance of one lead-in includes an entry for word length 2, an entry for word length 3, and so on. The storage location referenced by a concordance entry holds the word information of the words with the same lead-in and the same word length, including the remaining characters, the word frequency, and the additional information.
When the dictionary is queried, the lead-in index is used to determine, from the lead-in of the matching field, the storage location of the lead-in's index information; then, according to the word length of the matching field, the storage location of the word information for that word length is determined within the index information. Only the word information for that word length needs to be loaded into memory to complete the query.
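The query path described above can be sketched end to end on a toy dictionary. Everything here is an illustrative assumption: the record layout, the use of `io.BytesIO` in place of a dictionary file, and keeping the lead-in index and concordance as one nested Python dict instead of storing them in the file. Characters are assumed to encode to two bytes in GBK.

```python
import io
import struct

FREQ = struct.Struct("<I")  # hypothetical record payload: 4-byte word frequency


def build(words):
    """words: {word: frequency}. Returns (concordance, data) for a toy dictionary."""
    groups = {}
    for word in words:
        groups.setdefault((word[0], len(word)), []).append(word)
    data, concordance = io.BytesIO(), {}
    for lead_in, n in sorted(groups, key=lambda k: (k[0].encode("gbk"), k[1])):
        rests = sorted(w[1:].encode("gbk") for w in groups[(lead_in, n)])
        block = b"".join(r + FREQ.pack(words[lead_in + r.decode("gbk")]) for r in rests)
        concordance.setdefault(lead_in, {})[n] = (data.tell(), len(block))
        data.write(block)
    return concordance, data


def lookup(word, concordance, data):
    """Return the stored frequency of word, or None if it is not in the dictionary."""
    entry = concordance.get(word[0], {}).get(len(word))
    if entry is None:
        return None
    offset, length = entry
    data.seek(offset)
    block = data.read(length)  # only the block for this lead-in and length is read
    key = word[1:].encode("gbk")
    size = len(key) + FREQ.size
    for start in range(0, len(block), size):
        if block[start:start + len(key)] == key:
            return FREQ.unpack_from(block, start + len(key))[0]
    return None


concordance, data = build({"人类": 9, "人生": 5, "人造卫星": 3, "生活": 7})
```

For example, `lookup("人类", concordance, data)` reads only the 12-byte block of two-character words starting with 人; the linear scan inside the block could be replaced by a binary search because the records are sorted and fixed-length.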
The above dictionary data storage method does not store all the words of a lead-in together in arbitrary order. Instead, the word information of each lead-in is stored successively in order of word length, and the correspondence between each word length of the lead-in and the storage location of the word information is recorded in a concordance. A lead-in index is then built from the concordance, recording the location of each lead-in's concordance information. When the dictionary is queried, the storage location of the lead-in's index information is determined from the lead-in, and the storage location of the word information for the required word length is then determined within that index information. Only the word information for that word length needs to be loaded into memory to complete the query. The entire dictionary does not have to be loaded, which reduces the memory footprint of word segmentation while keeping queries fast.
Fig. 4 is a flow chart of the dictionary-based word segmentation method of one embodiment. As shown in Fig. 4, the method includes the following steps:
S402: Obtain text information to be segmented.
Word segmentation is the basis of text mining and is commonly applied in fields such as natural language processing, search engines, and intelligent recommendation. The text information to be segmented is the text mining object of the concrete application scenario, for example the text entered by a user in a search engine, or the text corresponding to the speech to be recognized in a speech synthesis system.
S404: Split the text information according to a preset text matching algorithm to obtain each matching field and the lead-in of each matching field.
Here, a text matching algorithm is an algorithm that matches the matching fields split out of the text information against the words in the dictionary. It needs to split the text information into matching fields; a matching field is a candidate word obtained by splitting the text information and is used for matching against the words of the dictionary. A text matching algorithm thus comprises a splitting step and a matching step, and different text matching algorithms differ in both how they split and how they match.
Common text matching algorithms include the forward maximum matching algorithm, the reverse maximum matching algorithm, the bidirectional maximum matching algorithm, and the maximum word frequency matching algorithm.
In the forward maximum matching algorithm, m characters of the Chinese sentence to be split are taken from left to right as the matching field, where m is the number of characters in the longest entry of the machine dictionary. The dictionary is searched for a match. If the match succeeds, the matching field is split off as a word. If the match fails, the last character of the matching field is removed, the remaining string becomes the new matching field, and matching is attempted again. This process is repeated until all words have been split off.
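The forward maximum matching procedure can be sketched as follows. For brevity the dictionary is a plain Python `set` rather than the indexed file described earlier, and falling back to a single character when nothing matches is an assumption the text does not spell out.

```python
def forward_max_match(text, dictionary, max_len):
    """Segment text from left to right, always trying the longest candidate first."""
    result, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            # A single character is accepted even if it is not in the dictionary.
            if n == 1 or piece in dictionary:
                result.append(piece)
                i += n
                break
    return result


words = forward_max_match("人造卫星属于人类", {"人造卫星", "属于", "人类", "人造"}, 4)
```

Here `max_len` plays the role of m, the length of the longest dictionary entry.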
In the reverse maximum matching algorithm, m characters of the Chinese sentence to be split are taken from right to left as the matching field, where m is the number of characters in the longest entry of the machine dictionary. The dictionary is searched for a match. If the match succeeds, the matching field is split off as a word. If the match fails, the first character of the matching field is removed, the remaining string becomes the new matching field, and matching is attempted again. This process is repeated until all words have been split off.
The bidirectional maximum matching algorithm compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation.
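The reverse and bidirectional variants can be sketched in the same style. The text does not say how the two results are compared, so the rule used here (prefer fewer words, then fewer single-character words, then the reverse result) is a commonly used heuristic adopted purely as an assumption.

```python
def forward_max_match(text, dictionary, max_len):
    """Left-to-right maximum matching with a single-character fallback."""
    result, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in dictionary:
                result.append(text[i:i + n])
                i += n
                break
    return result


def backward_max_match(text, dictionary, max_len):
    """Right-to-left maximum matching; a failed match drops the first character."""
    result, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if n == 1 or text[j - n:j] in dictionary:
                result.append(text[j - n:j])
                j -= n
                break
    return result[::-1]  # pieces were collected from the end of the text


def bidirectional_max_match(text, dictionary, max_len):
    """Pick one of the two segmentations using an assumed tie-breaking heuristic."""
    forward = forward_max_match(text, dictionary, max_len)
    backward = backward_max_match(text, dictionary, max_len)

    def cost(words):
        return (len(words), sum(len(w) == 1 for w in words))

    return forward if cost(forward) < cost(backward) else backward


vocab = {"研究", "研究生", "生命", "起源"}
chosen = bidirectional_max_match("研究生命的起源", vocab, 3)
```

In this example the forward pass greedily takes 研究生 and leaves 命 stranded, while the reverse pass yields one fewer single-character piece, so the reverse result is chosen.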
In the maximum word frequency matching algorithm, the text information is split in multiple different ways, the words of each split are matched against the dictionary, the total word frequency of each split is computed, and the split with the largest total word frequency is selected as the segmentation result.
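The maximum word frequency idea can be sketched with dynamic programming, which gives the same answer as enumerating every split whose pieces are all dictionary words and keeping the one with the largest total frequency. The function name, the in-memory `freq` mapping, and the sample frequencies are assumptions for illustration.

```python
def best_by_frequency(text, freq, max_len=8):
    """Return the split of text into dictionary words with the largest total frequency.

    freq maps each dictionary word to its frequency. Returns None if no split exists.
    """
    # best[i] holds (total frequency, words) for the best split of text[i:].
    best = {len(text): (0, [])}
    for i in range(len(text) - 1, -1, -1):
        candidates = []
        for n in range(1, min(max_len, len(text) - i) + 1):
            piece = text[i:i + n]
            if piece in freq and i + n in best:
                total, tail = best[i + n]
                candidates.append((total + freq[piece], [piece] + tail))
        if candidates:
            best[i] = max(candidates, key=lambda c: c[0])
    return best[0][1] if 0 in best else None


split = best_by_frequency("研究生命", {"研究": 50, "研究生": 20, "生命": 40, "命": 5})
```

The two valid splits here have totals 50 + 40 = 90 and 20 + 5 = 25, so the first one is selected.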
S406: Read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of their concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field.
Here, the lead-in index is the pre-established relationship between the lead-ins of the dictionary and the storage locations of their concordance information. Specifically, the lead-in of the matching field is looked up in the lead-in index to determine the storage location of the concordance for that lead-in. In this embodiment, the storage location of the concordance information consists of the offset and the data length of the lead-in's concordance. The offset is the starting storage location of the concordance entries for the word lengths of one lead-in, that is, the position where the first concordance entry of the lead-in is stored. The data length is the total length of the index information for all word lengths of that lead-in. It should be understood that, in the lead-in index, the data describing each storage location has a fixed length, so the data length of all the concordance information of a lead-in is easily calculated from the number of word lengths that the lead-in has.
S408: according to the storage location of concordance information, the corresponding concordance of lead-in is read.
S410: it in concordance, according to the corresponding relationship of word word length and the storage location of word information, is matched
The storage location of the corresponding word information of the word length of field.
The concordance contains, for every lead-in, the correspondence between each word length and the storage location of the word information. The storage location of the word information consists of an offset and a data length. The offset is the starting storage location of the word information for words of the same word length, that is, the position at which the first word information record of that word length is stored. Because every stored record has the same number of characters and a fixed-length character encoding is used, every word information record has a fixed length, so the data length of the word information can be calculated from the number of words of that word length under the lead-in.
Specifically, the concordance of a lead-in includes the correspondence between the lead-in's 2-character word length and its storage location, the correspondence between the 3-character word length and its storage location, and so on, covering every word length of the lead-in.
The storage location of the word information can be determined from its offset and data length, and the corresponding word information can then be obtained from that storage location.
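The offset and data length bookkeeping described above can be illustrated with a minimal sketch. The record layout (fixed-width characters followed by a 4-byte word frequency), the 4-byte character width, and all function names are assumptions made for illustration; the application does not specify them.

```python
CHAR_BYTES = 4   # assumed width of one character in a fixed-length encoding
FREQ_BYTES = 4   # assumed width of the word frequency stored with each word


def record_size(word_len: int) -> int:
    """Size in bytes of one word information record for a given word length."""
    return word_len * CHAR_BYTES + FREQ_BYTES


def build_concordance(counts_by_length: dict[int, int], start_offset: int = 0) -> dict[int, tuple[int, int]]:
    """Map each word length to (offset, data length), with blocks laid out back to back."""
    concordance = {}
    offset = start_offset
    for word_len in sorted(counts_by_length):
        # Fixed-length records: data length = record size * number of words.
        length = record_size(word_len) * counts_by_length[word_len]
        concordance[word_len] = (offset, length)
        offset += length
    return concordance


# Hypothetical lead-in with three 2-character words and two 3-character words.
concordance = build_concordance({2: 3, 3: 2})
```

Because every record of a given word length has the same size, the data length never needs to be stored per word; it follows from the word count alone.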
S412: According to the storage location of the word information, load the word information corresponding to the word length from the dictionary into memory.
In this embodiment, only the word information corresponding to the word length of the matching field needs to be loaded into memory; the entire dictionary does not need to be loaded, which reduces the memory occupied by word segmentation. The word information includes any one or more of the word character string, the word frequency, and extension information (such as part of speech).
The word frequency is the number of times a word occurs in the corpus. Dictionary-based methods assume that the higher the word frequency of a word, the higher its probability, and the larger its weight, in segmentation prediction.
S414: Match the matching field against each word in the word information to obtain the matching result of the matching field.
Specifically, the matching field may be compared with each word information record in turn to obtain its matching result. To improve search efficiency, a binary search may also be used within the word information.
S416: Based on the preset text matching algorithm, obtain the word segmentation result according to the matching results of the split matching fields.
Specifically, the above matching process is applied to each matching field produced by the preset text matching algorithm, and the word segmentation result is obtained from the matching results of those matching fields. Taking the forward maximum matching algorithm as an example, m characters of the Chinese sentence to be segmented are taken from left to right as the matching field, where m is the number of characters in the longest entry of the machine dictionary. The dictionary is searched for the matching field. If the match succeeds, the matching field is segmented out as a word. If the match fails, the last character of the matching field is removed, the remaining character string becomes the new matching field, and matching is performed again; this process is repeated until all words have been segmented. Taking the maximum word frequency matching algorithm as an example, the text information is split arbitrarily in multiple ways, the words of each split are matched against the dictionary, the total word frequency of each split is calculated, and the split with the largest total word frequency is selected as the word segmentation result.
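The forward maximum matching procedure described above can be sketched as follows. This is a simplified illustration, not the implementation of the application: the dictionary is a hypothetical in-memory set rather than the indexed dictionary file, the sample sentence is an assumed example, and emitting an unmatched single character as its own word is a common convention assumed here.

```python
def forward_maximum_match(text: str, dictionary: set[str]) -> list[str]:
    """Segment text left to right, always trying the longest candidate first."""
    # m: number of characters in the longest dictionary entry.
    m = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        size = min(m, len(text) - i)
        # Drop the last character of the matching field until it matches.
        while size > 1 and text[i:i + size] not in dictionary:
            size -= 1
        # Assumption: a single character is emitted even if it is not in the dictionary.
        words.append(text[i:i + size])
        i += size
    return words


toy_dictionary = {"今天", "天气", "不错"}   # hypothetical entries
segmented = forward_maximum_match("今天天气不错", toy_dictionary)
```

In the method of this application, the membership test `text[i:i + size] not in dictionary` would be replaced by the index lookup and block search of steps S406 to S414.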
With the above dictionary-based segmenting method, the storage location of the lead-in's index information can be determined from the lead-in, and the storage location of the word information for a given word length can then be determined in the index information from the word length. Only the word information corresponding to the word length of the word needs to be loaded into memory to complete the query; the entire dictionary does not have to be loaded. This reduces the memory occupied by word segmentation and gives a faster query speed.
In another embodiment, the step of matching the matching field against each word in the word information to obtain the matching result of the matching field includes: performing a binary search on the word information to obtain the matching result of the matching field. Fig. 5 is a flow chart, for one embodiment, of the step of performing a binary search on the word information to obtain the matching result of the matching field. As shown in Fig. 5, this step includes the following steps:
S502: Obtain the data block of the word information.
Specifically, after the word information is loaded into memory, the position of the in-memory data block holding the word information is obtained.
S504: Determine the middle position of the data block to be searched.
The word information is the data for one word length of a lead-in, and its records are stored in the order of the character encoding of the words of that word length. The data length of the word information record of each word is therefore fixed, and the records are arranged in character encoding order. The middle position of the data block is determined from the data length of the word information.
S506: Determine whether the word at the middle position is equal to the matching field, or whether the end position of the search has been reached.
Specifically, the character encoding of the matching field is compared with the character encoding of the word at the middle position. If they are equal, the matching field has a corresponding word in the dictionary, and the matching field is a word to be split out of the text information. If the end position has been reached, or the word at the middle position is equal to the matching field, step S510 is executed: output the search result. If the word at the middle position is equal to the matching field, the word information of the found word is returned. If the end position has been reached, a result indicating that no data was found is returned.
If the word at the middle position is not equal to the matching field and the end position has not been reached, step S508 is executed.
S508: Redetermine the middle position of the data block to be searched.
Specifically, if the comparison shows that the character encoding of the matching field is less than that of the word at the middle position, the binary search continues in the first half, before the middle position, and the middle position is redetermined within the first half.
If the comparison shows that the character encoding of the matching field is greater than that of the word at the middle position, the binary search continues in the second half, after the middle position, and the middle position is redetermined within the second half.
The binary search then continues from the redetermined middle position. Each comparison halves the search range, and the process continues until the search succeeds or fails.
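Steps S502 to S510 can be sketched as a binary search over a block of fixed-length records. The sketch assumes UTF-32 big-endian as the fixed-length character encoding (so that byte order matches code point order) and a 4-byte word frequency after each word; the record layout, the sample words, and the function name are illustrative assumptions.

```python
import struct

ENC = "utf-32-be"   # assumed: 4 bytes per character, byte order equals code point order


def find_record(block: bytes, record_size: int, field: str) -> int:
    """Return the index of the record whose word equals field, or -1 if absent."""
    key = field.encode(ENC)
    low, high = 0, len(block) // record_size - 1
    while low <= high:                      # search range not yet exhausted
        mid = (low + high) // 2             # S504 / S508: middle position
        start = mid * record_size
        word = block[start:start + len(key)]
        if word == key:                     # S506: word at middle equals the field
            return mid
        if key < word:
            high = mid - 1                  # continue in the first half
        else:
            low = mid + 1                   # continue in the second half
    return -1                               # end position reached, no data found


# Hypothetical block: two-character words sharing one lead-in, sorted by encoding.
words = {"天上": 80, "天气": 400, "天空": 120}
block = b"".join(w.encode(ENC) + struct.pack(">I", f) for w, f in sorted(words.items()))
record_size = 2 * 4 + 4
index = find_record(block, record_size, "天气")
frequency = struct.unpack_from(">I", block, index * record_size + 8)[0]
```

Because every record has the same size, the middle record is located by arithmetic alone, with no need to scan for delimiters.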
Based on the dictionary data structure shown in Fig. 3, a query of the dictionary for a word can be reduced to 2 file read operations plus (log(n)+2) memory read operations, where n is the number of word entries that share the lead-in of the queried word and is usually no more than 40. The memory used by each query depends only on the data size of the words under the current word's lead-in (usually no more than 40 entries). In other words, this query engine is both fast enough and small enough in memory footprint, and it requires no dictionary preloading.
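As a worked instance of this cost estimate, the short sketch below evaluates it for the typical upper bound of 40 entries. Taking the logarithm to base 2 and rounding it up are assumptions, chosen because the search halves the range at each step; the function name is illustrative.

```python
import math


def query_cost(n: int) -> tuple[int, int]:
    """Return (file reads, memory reads) for one lookup among n entries."""
    file_reads = 2
    # Assumption: base-2 logarithm, rounded up to a whole number of comparisons.
    memory_reads = math.ceil(math.log2(n)) + 2
    return file_reads, memory_reads


cost = query_cost(40)   # 2 file reads and 8 memory reads under these assumptions
```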
In a specific embodiment, the preset text matching algorithm is the maximum word frequency algorithm. The step of splitting the text information according to the preset text matching algorithm to obtain the split matching fields and the lead-in of each matching field includes: according to the maximum word frequency matching algorithm, splitting the text information arbitrarily in the forward direction, and obtaining the matching fields of each split and the lead-in of each matching field.
The word information includes the word frequency. The step of obtaining the word segmentation result from the matching results of the split matching fields based on the preset text matching algorithm includes: calculating the total word frequency of each split from the word frequencies of all of its matching fields, and taking the matching fields of the split with the maximum total word frequency as the word segmentation result of the text information.
Fig. 6 is a flow chart of the dictionary-based segmenting method of one embodiment. As shown in Fig. 6, the method includes the following steps:
S602: Obtain the text information to be segmented.
S604: According to the maximum word frequency matching algorithm, split the text information arbitrarily in the forward direction, and obtain the matching fields of each split and the lead-in of each matching field.
The forward direction is the direction in which the text information is input; specifically, for a piece of text information, it is the left-to-right direction.
Taking the text information "今天天气不错" ("the weather is nice today") as an example, splitting the text information arbitrarily in the forward direction may yield the following splits:
今/天天气不错;
今天/天气不/错;
今天/天/气/不/错;
……
From each split, the matching fields and the lead-in of each matching field are obtained. A matching field is a candidate word obtained by splitting the text information, to be matched against the words of the dictionary. Taking the split "今天/天气不/错" as an example, the matching fields are "今天" ("today"), "天气不" and "错". The lead-in is the first character of each matching field.
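The arbitrary forward splitting described above can be sketched by choosing every possible set of cut positions between characters. The function name is illustrative, and the sample sentence 今天天气不错 is an assumed rendering of the example; a text of n characters has 2^(n-1) splits, so a practical implementation would presumably prune candidates, which the text does not detail.

```python
from itertools import combinations


def all_splits(text: str):
    """Yield every way of cutting text into consecutive non-empty matching fields."""
    n = len(text)
    for k in range(n):                                # k = number of cut positions
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]


splits = list(all_splits("今天天气不错"))
# Lead-in (first character) of each matching field in one particular split.
fields = ["今天", "天气不", "错"]
lead_ins = [field[0] for field in fields]
```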
S606: Read the lead-in index of the dictionary and, according to the relationship between the lead-in and the storage location of the corresponding concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field.
Specifically, for each matching field of each split, the storage location of the concordance information corresponding to the lead-in of the matching field is obtained from the pre-established lead-in index, according to the relationship between the lead-in and the storage location of the corresponding concordance information.
The lead-in index records the positional relationship between each lead-in and its concordance information. The concordance contains, for every lead-in, the correspondence between each word length and the storage location of the word information. The storage location of the concordance consists of the offset and data length of the concordance information. The offset is the starting storage location of the concordance entries for the word lengths of the same lead-in, that is, the position at which the first concordance entry of the lead-in is stored. The data length is the length of the index information for all word lengths of one lead-in.
S608: Read the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information.
S610: In the concordance, obtain the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information.
The concordance records the correspondence between each word length of the lead-in and the storage location of the word information.
The storage location of the concordance information can be determined from its offset and data length; the storage location of the word information for the word length of the word being looked up is then determined in the index information. The storage location of the word information consists of the offset and data length of the character strings of the word information. The offset is the starting storage location of the word information for words of the same word length, that is, the position at which the first word information record of that word length is stored. Because every stored record has the same number of characters and a fixed-length character encoding is used, every word information record has a fixed length, so the data length of the word information can be calculated from the number of words of that word length under the lead-in. The storage location of the word information can be determined from its offset and data length.
Specifically, the storage location of the word information corresponding to the word length of the matching field is looked up in the index relationship. For example, if the current matching field is "天气" ("weather"), the lead-in is "天" ("day") and the word length is two characters. The storage location of the concordance for "天" is found in the lead-in index, and the concordance for "天" is read from that location. The concordance for "天" includes the entry for two-character words, the entry for three-character words, and so on; it records the correspondence between each word length under "天" and the storage location of the word information. In this embodiment, "天气" has a word length of two characters, so the storage location of the word information for two-character words beginning with "天" is looked up in the concordance.
S612: According to the storage location of the word information, load the word information corresponding to the word length from the dictionary into memory.
In this embodiment, the entire dictionary is no longer loaded into memory; only the word information for one word length of the relevant lead-in is loaded. Specifically, the word information for two-character words beginning with "天" ("day") is loaded into memory, which reduces the memory occupied during word segmentation.
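The lookup chain of steps S606 to S612 can be sketched end to end. Everything about the layout here is an assumption for illustration: the lead-in index is held as an in-memory mapping, each concordance entry is packed as (word length, offset, data length), each word information record is the UTF-32 big-endian word followed by a 4-byte frequency, and `io.BytesIO` objects stand in for the dictionary files.

```python
import io
import struct

ENC = "utf-32-be"                  # assumed fixed-length character encoding
ENTRY = struct.Struct(">HII")      # assumed concordance entry: word length, offset, data length
FREQ = struct.Struct(">I")         # assumed 4-byte word frequency


def build(words: dict[str, int]):
    """Build (lead-in index, concordance bytes, word information bytes)."""
    groups: dict[str, dict[int, list[str]]] = {}
    for word in words:
        groups.setdefault(word[0], {}).setdefault(len(word), []).append(word)
    info, conc, lead_index = bytearray(), bytearray(), {}
    for lead in sorted(groups):
        start = len(conc)
        for word_len in sorted(groups[lead]):
            offset = len(info)
            for word in sorted(groups[lead][word_len]):
                info += word.encode(ENC) + FREQ.pack(words[word])
            conc += ENTRY.pack(word_len, offset, len(info) - offset)
        lead_index[lead] = (start, len(conc) - start)
    return lead_index, bytes(conc), bytes(info)


def load_word_block(field: str, lead_index, conc_file, info_file) -> bytes:
    """Load only the word information block for the field's lead-in and word length."""
    if field[0] not in lead_index:
        return b""
    start, length = lead_index[field[0]]              # S606: locate the concordance
    conc_file.seek(start)
    entries = conc_file.read(length)                  # S608: file read 1
    for word_len, offset, size in ENTRY.iter_unpack(entries):
        if word_len == len(field):                    # S610: entry for this word length
            info_file.seek(offset)
            return info_file.read(size)               # S612: file read 2
    return b""


# Hypothetical dictionary contents and frequencies.
words = {"天气": 400, "天空": 120, "天上": 80, "天安门": 60, "今天": 500}
lead_index, conc, info = build(words)
block = load_word_block("天气", lead_index, io.BytesIO(conc), io.BytesIO(info))
```

In this toy dictionary the loaded block holds only the three two-character words beginning with "天" (36 bytes out of 64), which is the memory saving the embodiment describes.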
S614: Perform a binary search on the word information to obtain the matching result of the matching field.
The word information for the same word length of the same lead-in is stored in character encoding order, and every record of that word length has a fixed data length, which makes the data well suited to binary search.
S616: Determine whether all matching fields of every split have been matched. If so, execute step S618; if not, return to step S606 to look up the next matching field of the current split, or the matching fields of the next split.
S618: Calculate the total word frequency of each split from the word frequencies of all of its matching fields.
S620: Take the matching fields of the split with the maximum total word frequency as the word segmentation result of the text information.
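Steps S618 and S620 can be sketched as a sum and a maximum over candidate splits. The frequencies below are hypothetical, and counting a matching field that is absent from the dictionary as frequency 0 is an assumption; the text does not say how unmatched fields are scored.

```python
def total_frequency(fields: list[str], word_freq: dict[str, int]) -> int:
    """S618: total word frequency of one split (unmatched fields count as 0, an assumption)."""
    return sum(word_freq.get(field, 0) for field in fields)


def best_split(candidates: list[list[str]], word_freq: dict[str, int]) -> list[str]:
    """S620: the split whose matching fields have the maximum total word frequency."""
    return max(candidates, key=lambda fields: total_frequency(fields, word_freq))


word_freq = {"今天": 500, "天气": 400, "不错": 300, "天": 50, "今": 10}   # hypothetical
candidates = [
    ["今", "天天气不错"],
    ["今天", "天气不", "错"],
    ["今天", "天", "气", "不", "错"],
    ["今天", "天气", "不错"],
]
result = best_split(candidates, word_freq)
```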
The above dictionary-based segmenting method combines good segmentation speed with a small memory footprint and has almost no start-up loading time, which makes it particularly suitable for application environments that are sensitive to performance and memory, such as mobile terminals.
The architecture of the word segmentation system of one embodiment is shown in Fig. 7. The system includes a dictionary data storage device and a dictionary-based segmenting device. The dictionary data storage device processes the original dictionary data into dictionary data with a multi-level index file. The multi-level index includes the lead-in index and the concordance. The lead-in index records the relationship between each lead-in and the storage location of its concordance information, and the concordance records the correspondence between each word length of a lead-in and the storage location of the word information.
The dictionary-based segmenting device obtains the input text. For example, a user enters the Chinese text "今天天气不错" ("the weather is nice today") in the input box of a mobile terminal; the dictionary-based segmenting device is called to split it, and the split words are matched against the dictionary data with the multi-level index file. Specifically, the storage location of the lead-in's index information can be determined from the lead-in, and the storage location of the word information for a given word length can then be determined in the index information from the word length. Only the word information corresponding to the word length of the word needs to be loaded into memory to complete the query; the entire dictionary does not have to be loaded. This reduces the memory occupied by word segmentation and gives a faster query speed.
After each split matching field has been matched against the dictionary, the word frequency of each matching field is obtained and, according to the maximum word frequency matching method, the matching fields of the split with the maximum total word frequency are taken as the word segmentation result of the text information. Specifically, the input "今天天气不错" ("the weather is nice today") is segmented as "今天/天气/不错" ("today/weather/nice"); part-of-speech tags may even be added: 今天 (time word), 天气 (noun), 不错 (adjective). The result is passed to the speech synthesis back end for operations such as pinyin annotation and audio generation, and the synthesized speech is finally played.
Fig. 8 is a structural block diagram of a dictionary data storage device of one embodiment. As shown in Fig. 8, the device includes: a dictionary obtaining module 802, a dictionary analysis module 804, a storage module 806, a concordance establishing module 808, and a lead-in index establishing module 810.
The dictionary obtaining module 802 is configured to obtain a dictionary.
The dictionary analysis module 804 is configured to analyze the dictionary to obtain the word information corresponding to each lead-in.
The storage module 806 is configured to store the word information corresponding to each lead-in in order of word length.
The concordance establishing module 808 is configured to establish the correspondence between each word length of a lead-in and the storage location of the word information, to obtain the concordance.
The lead-in index establishing module 810 is configured to establish, according to the concordance, the relationship between each lead-in and the storage location of its concordance information, to obtain the lead-in index.
Specifically, the dictionary analysis module is configured to analyze the dictionary in character set encoding order to obtain the word information of each lead-in.
Specifically, the storage module 806 is configured to store the word information corresponding to each lead-in in character set encoding order and, within the word information of a lead-in, to store the words of the same word length in the order of their character encoding.
The above dictionary data storage device does not store all the words of a lead-in together in no particular order. Instead, it stores the word information corresponding to each lead-in in order of word length, establishes the correspondence between each word length of the lead-in and the storage location of the word information to obtain the concordance, and then, according to the concordance, establishes the positional relationship between each lead-in and its concordance information to obtain the lead-in index. When the dictionary is queried, the storage location of the lead-in's index information can therefore be determined from the lead-in, and the storage location of the word information for a given word length can then be determined in the index information from the word length. Only the word information corresponding to the word length of the word needs to be loaded into memory to complete the query; the entire dictionary does not have to be loaded. This reduces the memory occupied by word segmentation and gives a faster query speed.
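The storage order described above (by lead-in in encoding order, then by word length, then by the encoding of the words of the same length) can be sketched with a single sort key. The sketch relies on Python comparing strings by code point, which stands in for the character set encoding order of the text; the word list and function name are hypothetical.

```python
def storage_order(words: list[str]) -> list[str]:
    """Order words by lead-in, then word length, then character encoding."""
    # Python compares str values by code point, used here as the encoding order.
    return sorted(words, key=lambda word: (word[0], len(word), word))


ordered = storage_order(["天空", "今天", "天安门", "天气", "天上", "今后"])
```

With this ordering, the words of one lead-in and one word length form a contiguous, sorted run, which is what allows a single (offset, data length) pair per word length and a binary search within the run.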
Fig. 9 is a structural block diagram of a dictionary-based segmenting device of one embodiment. As shown in Fig. 9, the dictionary-based segmenting device includes: a text obtaining module 902, a splitting module 904, a searching module 906, a reading module 908, a loading module 910, a matching module 912, and a word segmentation module 914.
The text obtaining module 902 is configured to obtain the text information to be segmented.
The splitting module 904 is configured to split the text information according to the preset text matching algorithm, to obtain the split matching fields and the lead-in of each matching field.
The searching module 906 is configured to read the lead-in index of the dictionary and, according to the relationship between the lead-in and the storage location of the corresponding concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field.
The reading module 908 is configured to read the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information.
The searching module 906 is further configured to obtain, in the concordance, the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information.
The loading module 910 is configured to load, according to the storage location of the word information, the word information corresponding to the word length from the dictionary into memory.
The matching module 912 is configured to match the matching field against each word in the word information, to obtain the matching result of the matching field.
The word segmentation module 914 is configured to obtain, based on the preset text matching algorithm, the word segmentation result according to the matching results of the split matching fields.
With the above dictionary-based segmenting device, the storage location of the lead-in's index information can be determined from the lead-in, and the storage location of the word information for a given word length can then be determined in the index information from the word length. Only the word information corresponding to the word length of the word needs to be loaded into memory to complete the query; the entire dictionary does not have to be loaded. This reduces the memory occupied by word segmentation and gives a faster query speed.
In another embodiment, the matching module is configured to perform a binary search on the word information to obtain the matching result of the matching field.
In another embodiment, the splitting module is configured to split the text information arbitrarily in the forward direction according to the maximum word frequency matching algorithm, and to obtain the matching fields of each split and the lead-in of each matching field.
The word information includes the word frequency. The word segmentation module 914 includes a word frequency computing module and a segmentation determining module.
The word frequency computing module is configured to calculate the total word frequency of each split from the word frequencies of all of its matching fields.
The segmentation determining module is configured to take the matching fields of the split with the maximum total word frequency as the word segmentation result of the text information.
A computer device includes a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the dictionary data storage method or the dictionary-based segmenting method of the above embodiments.
Fig. 10 shows the internal structure of the computer device in one embodiment. As shown in Fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the dictionary data storage method or the dictionary-based segmenting method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the dictionary data storage method or the dictionary-based segmenting method. Those skilled in the art will understand that the structure shown in Fig. 10 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the dictionary data storage device provided by this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in Fig. 10. The memory of the computer device may store the program modules that make up the dictionary data storage device, for example, the dictionary obtaining module, the dictionary analysis module, and the storage module shown in Fig. 8. The computer program formed by these program modules causes the processor to perform the steps of the dictionary data storage method of the embodiments of this application described in this specification.
For example, the computer device shown in Fig. 10 may perform the step of obtaining a dictionary through the dictionary obtaining module of the dictionary data storage device shown in Fig. 8. The computer device may perform, through the dictionary analysis module, the step of analyzing the dictionary to obtain the word information corresponding to each lead-in. The computer device may perform, through the storage module, the step of storing the word information corresponding to each lead-in in order of word length.
In another embodiment, the dictionary-based segmenting device provided by this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in Fig. 10. The memory of the computer device may store the program modules that make up the dictionary-based segmenting device, for example, the text obtaining module, the splitting module, and the searching module shown in Fig. 9. The computer program formed by these program modules causes the processor to perform the steps of the dictionary-based segmenting method of the embodiments of this application described in this specification.
For example, the computer device shown in Fig. 10 may perform the step of obtaining the text information to be segmented through the text obtaining module of the dictionary-based segmenting device shown in Fig. 9. The computer device may perform, through the splitting module, the step of splitting the text information according to the preset text matching algorithm to obtain the split matching fields and the lead-in of each matching field. The computer device may perform, through the searching module, the step of obtaining, in the pre-established lead-in index, the storage location of the concordance information corresponding to the lead-in of the matching field, according to the relationship between the lead-in and the storage location of the corresponding concordance information.
A storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the dictionary data storage method or the dictionary-based segmenting method of the above embodiments.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features of the above embodiments has been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. The protection scope of this patent shall therefore be subject to the appended claims.
Claims (14)
1. A dictionary data storage method, comprising:
obtaining a dictionary;
analyzing the dictionary to obtain the word information corresponding to each lead-in;
storing the word information corresponding to each lead-in in order of word length;
establishing the correspondence between each word length of a lead-in and the storage location of the word information, to obtain a concordance;
establishing, according to the concordance, the relationship between each lead-in and the storage location of the corresponding concordance information, to obtain a lead-in index.
2. The method according to claim 1, characterized in that the step of analyzing the dictionary to obtain the word information corresponding to each lead-in comprises:
analyzing the dictionary in character set encoding order to obtain the word information of each lead-in.
3. The method according to claim 1 or 2, characterized in that the step of storing the word information corresponding to each lead-in in order of word length comprises:
storing the word information corresponding to each lead-in in character set encoding order;
storing, within the word information of a lead-in, the words of the same word length in the order of their character encoding.
4. A dictionary-based word segmentation method, comprising:
obtaining text information to be segmented;
splitting the text information according to a preset text matching algorithm, to obtain the matching fields of the split and the lead-in of each matching field;
reading the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of the corresponding concordance information, obtaining the storage location of the concordance information corresponding to the lead-in of the matching field;
reading the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information;
in the concordance, obtaining the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word lengths and storage locations of word information;
loading into memory, according to the storage location of the word information, the word information in the dictionary that corresponds to the word length;
matching the matching field against each word in the word information, to obtain the matching result of the matching field;
obtaining a word segmentation result from the matching results of the matching fields of the split, based on the preset text matching algorithm.
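The lookup chain of claim 4 (lead-in index, then concordance, then a single word-information block) might look like the sketch below. The function name `lookup`, the JSON encoding, and the tiny hand-built store are assumptions made for illustration; `io.BytesIO` stands in for the dictionary files so that `seek`/`read` show which bytes would actually be loaded.

```python
import io
import json


def lookup(field, lead_in_index, concordance_file, data_file):
    """Return the word information for `field`, or None if it is not a word."""
    if not field:
        return None
    # 1. Lead-in index: lead-in -> storage location of its concordance.
    loc = lead_in_index.get(field[0])
    if loc is None:
        return None
    # 2. Read only the concordance of this lead-in.
    concordance_file.seek(loc[0])
    concordance = json.loads(concordance_file.read(loc[1]))
    # 3. Concordance: word length -> storage location of the word information.
    block_loc = concordance.get(str(len(field)))
    if block_loc is None:
        return None
    # 4. Load only the block for this lead-in and this word length.
    data_file.seek(block_loc[0])
    block = json.loads(data_file.read(block_loc[1]))
    # 5. Match the field against each word in the block.
    for word, freq in block:
        if word == field:
            return {"word": word, "freq": freq}
    return None


# Tiny hand-built store with a single lead-in (hypothetical layout).
blocks = {1: [["中", 3]], 2: [["中国", 10], ["中间", 4]]}
data_file = io.BytesIO()
concordance = {}
for length, words in blocks.items():
    raw = json.dumps(words, ensure_ascii=False).encode("utf-8")
    concordance[str(length)] = [data_file.tell(), len(raw)]
    data_file.write(raw)
raw_concordance = json.dumps(concordance).encode("utf-8")
concordance_file = io.BytesIO(raw_concordance)
lead_in_index = {"中": (0, len(raw_concordance))}

print(lookup("中国", lead_in_index, concordance_file, data_file))
# prints {'word': '中国', 'freq': 10}
```

A field whose lead-in or word length is absent is rejected before any word-information block is read, which is where the memory saving described in the abstract comes from.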
5. The method according to claim 4, wherein the step of matching the matching field against each word in the word information, to obtain the matching result of the matching field, comprises:
performing a binary search on the word information to obtain the matching result of the matching field.
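Binary search is applicable here because, under claim 3, words of the same length are stored in character-code order. A minimal sketch, assuming a loaded block is a list of `(word, frequency)` pairs sorted by word (the name `find_word` and the pair layout are hypothetical):

```python
def find_word(block, field):
    """Binary-search a block of (word, freq) pairs sorted by word."""
    lo, hi = 0, len(block)
    while lo < hi:
        mid = (lo + hi) // 2
        if block[mid][0] < field:  # Python compares strings by code point
            lo = mid + 1
        else:
            hi = mid
    if lo < len(block) and block[lo][0] == field:
        return block[lo]
    return None


# All words share the lead-in "中" and the word length 2.
block = sorted([("中国", 10), ("中间", 4), ("中文", 6)])
print(find_word(block, "中文"))  # prints ('中文', 6)
print(find_word(block, "中华"))  # prints None
```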
6. The method according to claim 4, wherein the step of splitting the text information according to a preset text matching algorithm, to obtain the matching fields of the split and the lead-in of each matching field, comprises: according to a maximum word frequency matching algorithm, splitting the text information arbitrarily in the forward direction, and obtaining the matching fields corresponding to each way of splitting and the lead-in of each matching field;
the word information includes a word frequency; and the step of obtaining the word segmentation result from the matching results of the matching fields of the split, based on the preset text matching algorithm, comprises:
calculating the total word frequency of each way of splitting from the word frequencies of all matching fields of that way of splitting;
taking the matching fields of the way of splitting with the maximum total word frequency as the word segmentation result of the text information.
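The selection rule of claim 6 can be illustrated as below. Several points are assumptions rather than statements from the claims: the total word frequency is computed as a plain sum, a field that is not in the dictionary contributes 0, and `lookup_freq` is a hypothetical callback standing in for the dictionary lookup. Enumerating every split is exponential in the text length, so this sketch only suits short inputs.

```python
def all_splits(text):
    """Yield every way to cut `text` into consecutive non-empty fields."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        for rest in all_splits(text[i:]):
            yield [text[:i]] + rest


def segment_max_frequency(text, lookup_freq):
    """Return the split whose fields have the largest total word frequency.

    `lookup_freq(field)` returns the word frequency of `field`, or None when
    the field is not a dictionary word (counted as 0 here, an assumption).
    """
    best, best_total = [], -1
    for split in all_splits(text):
        total = sum(lookup_freq(field) or 0 for field in split)
        if total > best_total:
            best, best_total = split, total
    return best


# Made-up frequencies for illustration.
freqs = {"中国": 10, "人民": 7, "中国人": 5, "民": 1}
print(segment_max_frequency("中国人民", freqs.get))  # prints ['中国', '人民']
```

In this example the split 中国/人民 totals 17, ahead of 中国/人/民 (11) and 中国人/民 (6).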
7. A dictionary data storage device, comprising: a dictionary obtaining module, a dictionary analysis module, a storage module, a concordance establishing module, and a lead-in index establishing module;
the dictionary obtaining module is configured to obtain a dictionary;
the dictionary analysis module is configured to analyze the dictionary to obtain the word information corresponding to each lead-in;
the storage module is configured to store the word information corresponding to each lead-in sequentially, in order of word length;
the concordance establishing module is configured to establish a correspondence between each word length of a lead-in and the storage location of the corresponding word information, to obtain a concordance;
the lead-in index establishing module is configured to establish, according to the concordance, a relationship between each lead-in and the storage location of its corresponding concordance information, to obtain a lead-in index.
8. The device according to claim 7, wherein the dictionary analysis module is configured to analyze the dictionary in character set encoding order to obtain the word information of each lead-in.
9. The device according to claim 7 or 8, wherein the storage module is configured to store the word information corresponding to each lead-in sequentially, in character set encoding order, and, for the word information of each lead-in, to store the words of the same word length sequentially, in order of their character codes.
10. A dictionary-based word segmentation device, comprising: a text obtaining module, a splitting module, a searching module, a reading module, a loading module, a matching module, and a word segmentation module;
the text obtaining module is configured to obtain text information to be segmented;
the splitting module is configured to split the text information according to a preset text matching algorithm, to obtain the matching fields of the split and the lead-in of each matching field;
the searching module is configured to read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of the corresponding concordance information, to obtain the storage location of the concordance information corresponding to the lead-in of the matching field;
the reading module is configured to read the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information;
the searching module is further configured to obtain, in the concordance, the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word lengths and storage locations of word information;
the loading module is configured to load into memory, according to the storage location of the word information, the word information in the dictionary that corresponds to the word length;
the matching module is configured to match the matching field against each word in the word information, to obtain the matching result of the matching field;
the word segmentation module is configured to obtain a word segmentation result from the matching results of the matching fields of the split, based on the preset text matching algorithm.
11. The device according to claim 10, wherein the matching module is configured to perform a binary search on the word information to obtain the matching result of the matching field.
12. The device according to claim 10, wherein the splitting module is configured to split the text information arbitrarily in the forward direction according to a maximum word frequency matching algorithm, and to obtain the matching fields corresponding to each way of splitting and the lead-in of each matching field;
the word information includes a word frequency; the word segmentation module comprises a word frequency calculation module and a segmentation determining module;
the word frequency calculation module is configured to calculate the total word frequency of each way of splitting from the word frequencies of all matching fields of that way of splitting;
the segmentation determining module is configured to take the matching fields of the way of splitting with the maximum total word frequency as the word segmentation result of the text information.
13. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the dictionary data storage method according to any one of claims 1 to 3 or of the dictionary-based word segmentation method according to any one of claims 4 to 6.
14. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the dictionary data storage method according to any one of claims 1 to 3 or of the dictionary-based word segmentation method according to any one of claims 4 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711136379.9A CN109800408B (en) | 2017-11-16 | 2017-11-16 | Dictionary data storage method and device, and dictionary-based word segmentation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711136379.9A CN109800408B (en) | 2017-11-16 | 2017-11-16 | Dictionary data storage method and device, and dictionary-based word segmentation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109800408A (en) | 2019-05-24 |
CN109800408B (en) | 2023-05-26 |
Family
ID=66555376
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711136379.9A (Active) CN109800408B (en) | 2017-11-16 | 2017-11-16 | Dictionary data storage method and device, and dictionary-based word segmentation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109800408B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6879951B1 (en) * | 1999-07-29 | 2005-04-12 | Matsushita Electric Industrial Co., Ltd. | Chinese word segmentation apparatus |
JP2001229162A (en) * | 2000-02-15 | 2001-08-24 | Matsushita Electric Ind Co Ltd | Method and device for automatically proofreading chinese document |
US20060167680A1 (en) * | 2005-01-25 | 2006-07-27 | Nokia Corporation | System and method for optimizing run-time memory usage for a lexicon |
WO2009113869A1 (en) * | 2008-03-12 | 2009-09-17 | Lumex As | A word length indexed dictionary for use in an optical character recognition (ocr) system. |
US20110103713A1 (en) * | 2008-03-12 | 2011-05-05 | Hans Christian Meyer | Word length indexed dictionary for use in an optical character recognition (ocr) system |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110489380A (en) * | 2019-08-14 | 2019-11-22 | 腾讯科技(深圳)有限公司 | A kind of data processing method, device and equipment |
CN110489380B (en) * | 2019-08-14 | 2024-02-13 | 腾讯科技(深圳)有限公司 | Data processing method, device and equipment |
CN111090992A (en) * | 2019-12-13 | 2020-05-01 | 厦门市美亚柏科信息股份有限公司 | Text preprocessing method and device and storage medium |
CN111090992B (en) * | 2019-12-13 | 2022-12-06 | 厦门市美亚柏科信息股份有限公司 | Text preprocessing method and device and storage medium |
CN112101025A (en) * | 2020-11-13 | 2020-12-18 | 北京世纪好未来教育科技有限公司 | Pinyin marking method and device, electronic equipment and storage medium |
CN112307753A (en) * | 2020-12-29 | 2021-02-02 | 启业云大数据(南京)有限公司 | Word segmentation method supporting large word stock, computer readable storage medium and system |
CN113626651A (en) * | 2021-08-04 | 2021-11-09 | 上海金仕达成括信息科技有限公司 | Data matching method and device |
Also Published As
Publication number | Publication date |
---|---|
CN109800408B (en) | 2023-05-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109800408A (en) | Dictionary data storage method and device, segmenting method and device based on dictionary | |
US8171029B2 (en) | Automatic generation of ontologies using word affinities | |
US10289717B2 (en) | Semantic search apparatus and method using mobile terminal | |
CN110321408B (en) | Searching method and device based on knowledge graph, computer equipment and storage medium | |
CN106033416A (en) | A string processing method and device | |
CN109614627B (en) | Text punctuation prediction method and device, computer equipment and storage medium | |
CN112800769B (en) | Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium | |
KR20090065130A (en) | Indexing and searching method for high-demensional data using signature file and the system thereof | |
WO2020155749A1 (en) | Method and apparatus for constructing personal knowledge graph, computer device, and storage medium | |
KR20150107595A (en) | Information processing system and information processing method for character input prediction | |
US20160147867A1 (en) | Information matching apparatus, information matching method, and computer readable storage medium having stored information matching program | |
CN105843960A (en) | Semantic tree based indexing method and system | |
KR101379128B1 (en) | Dictionary generation device, dictionary generation method, and computer readable recording medium storing the dictionary generation program | |
CN105404677A (en) | Tree structure based retrieval method | |
CN110222015B (en) | File data reading and querying method and device and readable storage medium | |
CN108595437B (en) | Text query error correction method and device, computer equipment and storage medium | |
CN109446336B (en) | News screening method, device, computer equipment and storage medium | |
CN110046219A (en) | A kind of Chinese word cutting method based on hash algorithm | |
CN108304384B (en) | Word splitting method and device | |
CN111382570A (en) | Text entity recognition method and device, computer equipment and storage medium | |
CN116226681B (en) | Text similarity judging method and device, computer equipment and storage medium | |
CN112765976A (en) | Text similarity calculation method, device and equipment and storage medium | |
CN108776705B (en) | Text full-text accurate query method, device, equipment and readable medium | |
CN110795617A (en) | Error correction method and related device for search terms | |
CN114003685B (en) | Word segmentation position index construction method and device, and document retrieval method and device |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |