CN109800408A - Dictionary data storage method and device, and dictionary-based word segmentation method and device

Info

Publication number: CN109800408A
Application number: CN201711136379.9A
Authority: CN
Inventors: 易成; 王新亮; 李斌; 卢鲤; 李新辉
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2019-05-24
Anticipated expiration: 2037-11-16
Also published as: CN109800408B

Abstract

This application relates to a dictionary data storage method and device, a dictionary-based word segmentation method and device, a computer device, and a storage medium. The dictionary data storage method includes: obtaining a dictionary; analyzing the dictionary to obtain the word information corresponding to each lead-in (the first character of a word); storing the word information corresponding to each lead-in successively in order of word length; establishing the correspondence between each word length of a lead-in and the storage location of the word information, to obtain a concordance (word index); and, according to the concordance, establishing the relationship between each lead-in and the storage location of its concordance information, to obtain a lead-in index. With this method, a query can be completed by loading into memory only the word information for the word length of the word being looked up, without loading the entire dictionary, which reduces the memory footprint of word segmentation and speeds up queries.

Description

Dictionary data storage method and device, and dictionary-based word segmentation method and device

Technical field

This application relates to the technical field of information processing, and in particular to a dictionary data storage method and device, a dictionary-based word segmentation method and device, a computer device, and a storage medium.

Background

Word segmentation is the foundation of text mining. Successfully segmenting an input text allows a computer to identify the meaning of its sentences automatically. Traditional segmentation methods include algorithms based on string matching, which require a sufficiently large segmentation dictionary.

To ensure string lookup performance during segmentation, traditional string-matching segmentation algorithms generally load the entire dictionary into memory and organize it in a form suited to string lookup, such as a prefix tree (trie). With such an in-memory data structure, it is possible to check quickly whether a string to be segmented exists in the dictionary. Loading the dictionary into memory does allow fast responses to the many string lookups that segmentation requires. In practice, however, a dictionary file generally has a size of around 4M to 30M in order to achieve good segmentation accuracy, so loading the whole dictionary occupies a large amount of memory.

Summary of the invention

In view of this, to address the excessive memory occupied by word segmentation, it is necessary to provide a dictionary data storage method and device, a dictionary-based word segmentation method and device, a computer device, and a storage medium.

A dictionary data storage method, comprising:

obtaining a dictionary;

analyzing the dictionary to obtain the word information corresponding to each lead-in;

storing the word information corresponding to each lead-in successively in order of word length;

establishing the correspondence between each word length of a lead-in and the storage location of the word information, to obtain a concordance; and

according to the concordance, establishing the relationship between each lead-in and the storage location of its concordance information, to obtain a lead-in index.

A dictionary-based word segmentation method, comprising:

obtaining text information to be segmented;

splitting the text information according to a preset text matching algorithm, to obtain each matching field and the lead-in of each matching field;

reading the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of their concordance information, obtaining the storage location of the concordance information corresponding to the lead-in of the matching field;

reading the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information;

in the concordance, obtaining the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information;

loading the word information corresponding to that word length from the dictionary into memory, according to the storage location of the word information;

matching the matching field against each word in the word information, to obtain the matching result of the matching field; and

obtaining a segmentation result from the matching results of the matching fields, based on the preset text matching algorithm.

A dictionary data storage device, comprising: a dictionary obtaining module, a dictionary analysis module, a storage module, a concordance establishing module, and a lead-in index establishing module;

the dictionary obtaining module is configured to obtain a dictionary;

the dictionary analysis module is configured to analyze the dictionary to obtain the word information corresponding to each lead-in;

the storage module is configured to store the word information corresponding to each lead-in successively in order of word length;

the concordance establishing module is configured to establish the correspondence between each word length of a lead-in and the storage location of the word information, to obtain a concordance; and

the lead-in index establishing module is configured to establish, according to the concordance, the relationship between each lead-in and the storage location of its concordance information, to obtain a lead-in index.

A dictionary-based word segmentation device, comprising: a text obtaining module, a splitting module, a searching module, a reading module, a loading module, a matching module, and a segmentation module;

the text obtaining module is configured to obtain text information to be segmented;

the splitting module is configured to split the text information according to a preset text matching algorithm, to obtain each matching field and the lead-in of each matching field;

the searching module is configured to read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of their concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field;

the reading module is configured to read the concordance corresponding to the lead-in of the matching field according to the storage location of the concordance information;

the searching module is further configured to obtain, in the concordance, the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information;

the loading module is configured to load the word information corresponding to that word length from the dictionary into memory, according to the storage location of the word information;

the matching module is configured to match the matching field against each word in the word information, to obtain the matching result of the matching field; and

the segmentation module is configured to obtain a segmentation result from the matching results of the matching fields, based on the preset text matching algorithm.

A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above dictionary data storage method or dictionary-based word segmentation method.

A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above dictionary data storage method or dictionary-based word segmentation method.

The above dictionary data storage method and device do not store all the words of a lead-in together in no particular order. Instead, the word information corresponding to each lead-in is stored successively in order of word length; the correspondence between each word length of the lead-in and the storage location of the word information is established to obtain a concordance; and, according to the concordance, the positional relationship between each lead-in and its concordance information is established to obtain a lead-in index. During a dictionary query, the storage location of the lead-in's index information can therefore be determined from the lead-in, and the storage location of the word information for a given word length can then be determined within that index information. A query can thus be completed by loading into memory only the word information for the word length of the word being looked up, without loading the entire dictionary, which reduces the memory footprint of word segmentation and speeds up queries.

The above dictionary-based word segmentation method and device can determine the storage location of the lead-in's index information from the lead-in, and then determine within that index information the storage location of the word information for a given word length. A query can thus be completed by loading into memory only the word information for the word length of the word being looked up, without loading the entire dictionary, which reduces the memory footprint of word segmentation and speeds up queries.

Brief description of the drawings

Fig. 1 is a flow chart of the dictionary data storage method in one embodiment;

Fig. 2 is a flow chart of the step of storing the word information corresponding to each lead-in successively in order of word length, in one embodiment;

Fig. 3 is a schematic diagram of the storage structure of the dictionary data in one embodiment;

Fig. 4 is a flow chart of the dictionary-based word segmentation method in one embodiment;

Fig. 5 is a flow chart of the step of performing a binary search on the word information to obtain the matching result of a matching field, in one embodiment;

Fig. 6 is a flow chart of the dictionary-based word segmentation method in another embodiment;

Fig. 7 is an architecture diagram of the word segmentation system in one embodiment;

Fig. 8 is a structural block diagram of the dictionary data storage device in one embodiment;

Fig. 9 is a structural block diagram of the dictionary-based word segmentation device in one embodiment;

Fig. 10 is a structural block diagram of the computer device in one embodiment.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it.

In one embodiment, a dictionary data storage method is provided. As shown in Fig. 1, the dictionary data storage method specifically includes the following steps:

S102: Obtain a dictionary.

Here, a dictionary is a collection of words obtained from statistics over a large corpus.

S104: Analyze the dictionary to obtain the word information corresponding to each lead-in.

Here, the lead-in is the first character of a word. Taking the word "人类" (mankind) as an example, its first character "人" (person) is the lead-in of the word.

Specifically, the dictionary is traversed and the lead-in of each word is analyzed, to obtain the word information corresponding to each lead-in. The word information may be any one or more of the word content, the word length, the word frequency, and additional information about the word. For example, for the lead-in "人", the dictionary is traversed to obtain all words that begin with "人" together with their word information, including the word information of words such as "人类" (mankind), "人生" (life), and "人造卫星" (artificial satellite).
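The traversal in S104 can be sketched as a simple grouping pass. The following is a minimal illustration only: the toy dictionary, its frequencies and part-of-speech tags, and the function name group_by_lead_in are hypothetical and do not come from the application.

```python
from collections import defaultdict

# Hypothetical toy dictionary: (word, word frequency, part-of-speech tag).
# The frequencies and tags are invented for illustration.
DICTIONARY = [
    ("人类", 120, "n"),
    ("人生", 95, "n"),
    ("人造卫星", 12, "n"),
    ("大学", 80, "n"),
]

def group_by_lead_in(entries):
    """Map each lead-in (first character) to the word information of its words."""
    groups = defaultdict(list)
    for word, freq, pos in entries:
        groups[word[0]].append(
            {"word": word, "length": len(word), "freq": freq, "pos": pos}
        )
    return dict(groups)

groups = group_by_lead_in(DICTIONARY)
print([info["word"] for info in groups["人"]])
```

Each entry keeps the four kinds of word information named above (content, length, frequency, additional information), so later steps can order and pack them.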

The word length is the length of a word, determined by the number of characters the word contains. For the word "人类" (mankind), the word length is 2.

The additional information of a word may be its part-of-speech tag; parts of speech include time words, nouns, adjectives, place names, and so on.

In a specific embodiment, the step of analyzing the dictionary to obtain the word information corresponding to each lead-in includes: analyzing the dictionary in character set encoding order to obtain the word information of each lead-in.

A character set encoding is the encoding used to consolidate a set of characters (typically from tens to tens of thousands) into a single file; an external program can use this encoding to retrieve a specified character from the character set file. Common Chinese character set encodings include GB2312, GBK, and Unicode.

Taking the GBK character set encoding as an example, the characters are visited in encoding order. For each character, the dictionary is searched for words that have the current character as their lead-in; if any exist, all the word information corresponding to that lead-in is collected. Traversing the character set encoding in order yields the word information corresponding to each character as a lead-in.

In this embodiment, a fixed-length character encoding is used, which makes character offsets easy to calculate and speeds up lookups.

S106: Store the word information corresponding to each lead-in successively in order of word length.

The words corresponding to a lead-in have several different lengths, such as two characters, three characters, and so on. For the lead-in "人", "人类" (mankind) is a two-character word and "人造卫星" (artificial satellite) is a four-character word. To minimize the dictionary content loaded into memory, the word information of words with the same lead-in and the same word length can be stored together. Then, given the word length and lead-in of a matching field, only the word information of the corresponding length is loaded into memory, which avoids the heavy memory use caused by loading the entire dictionary.

Specifically, within all the word information of each lead-in, storage proceeds in ascending or descending order of word length. After storage, the word information of words with the same lead-in and the same word length occupies contiguous storage locations. It can be understood that the same storage order should be used for the word information of all lead-ins. For example, within the word information of each lead-in, the word information of the 2-character words is stored first, then that of the 3-character words, and so on, until the word information of the longest words of that lead-in has been stored. Based on practical segmentation experience, the maximum word length can be limited to 8. In a specific embodiment, the maximum word length can also be adjusted as needed.
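The ordering described above can be illustrated with a short sketch, assuming ascending order and the cap of 8 mentioned in the text. The function name order_by_length and the sample words are hypothetical.

```python
MAX_WORD_LENGTH = 8  # cap suggested in the text; adjustable per embodiment

def order_by_length(words, max_len=MAX_WORD_LENGTH):
    """Return the words of one lead-in in ascending order of word length.

    Words longer than max_len are dropped (an assumption about how the cap
    is applied). sorted() is stable, so words of equal length keep their
    relative order and end up adjacent to each other.
    """
    kept = [w for w in words if len(w) <= max_len]
    return sorted(kept, key=len)

ordered = order_by_length(["人造卫星", "人类", "人民币", "人生"])
print(ordered)
```

Because equal-length words become adjacent, each (lead-in, word length) pair corresponds to one contiguous block, which is what the later index steps rely on.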

Fig. 2 is a flow chart, for one embodiment, of the step of storing the word information corresponding to each lead-in successively in order of word length. As shown in Fig. 2, the step includes S202 to S204.

S202: Store the word information corresponding to each lead-in successively, in character set encoding order.

Specifically, following the character set encoding order, after all the word information of one lead-in has been stored, the word information of the next lead-in is stored. Taking the GBK Chinese character set as an example, for the lead-in corresponding to each character code, in encoding order, the dictionary is traversed to obtain the words of that lead-in and their word information. Because the lead-ins are stored in character set encoding order, character offsets are easy to calculate and lookups are faster.

S204: For the word information of each lead-in, store the words of the same word length successively in order of their character codes.

The words corresponding to a lead-in have several different lengths, such as two characters, three characters, and so on. To minimize the dictionary content loaded into memory, the word information of words with the same lead-in and the same word length can be stored together. Then, given the word length and lead-in of a matching field, only the word information of the corresponding length is loaded into memory, which avoids the heavy memory use caused by loading the entire dictionary.

The word information of words with the same word length is in turn stored in order of the character codes of the words. Specifically, two-character words are stored in order of the GBK code of their second character, and three-character words are stored in order of the GBK code of their third character. By analogy, the word information of words with the same lead-in and the same word length is stored successively in order of the character codes of the words.

With this storage scheme, the word information of words with the same lead-in and the same length is stored in character code order, and every record within such a block has a fixed data length, which provides good support for binary search. Binary search is an efficient lookup method that requires the entries to be stored sequentially and sorted by key.
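A minimal sketch of how one such block might be packed follows. The record layout is an assumption for illustration, not the layout of the application: each record holds the GBK bytes of the characters after the lead-in followed by a 4-byte word frequency, and the records are sorted by those GBK bytes. Every character is assumed to encode to two bytes in GBK. The name pack_records is hypothetical.

```python
import struct

def pack_records(words_with_freq, length):
    """Pack words that share a lead-in and a word length into fixed-size records.

    Assumed record layout: GBK bytes of the characters after the lead-in
    (2 bytes each), then the word frequency as a little-endian uint32.
    """
    rows = sorted(words_with_freq, key=lambda wf: wf[0][1:].encode("gbk"))
    record_size = 2 * (length - 1) + 4
    blob = b"".join(
        word[1:].encode("gbk") + struct.pack("<I", freq) for word, freq in rows
    )
    # Every record has the same size, so the block length is a simple product.
    assert len(blob) == record_size * len(rows)
    return blob, record_size

blob, size = pack_records([("人生", 95), ("人类", 120)], length=2)
print(size, len(blob))
```

Fixed-size, sorted records mean the i-th record starts at byte i * record_size, which is the property binary search needs.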

S108: Establish the correspondence between each word length of a lead-in and the storage location of the word information, to obtain a concordance.

Here, the storage location of word information includes the offset and the data length of the byte string that holds the word information. The offset is the starting storage location of the word information of the words of a given length, that is, the position at which the first word information record for that word length is stored. Because all records in such a block contain the same number of characters and a fixed-length character encoding is used, every word information record has a fixed length, so the data length of the word information can be calculated from the number of words of that length under the lead-in.

Specifically, the concordance of a lead-in should include the correspondence between word length 2 and a storage location, the correspondence between word length 3 and a storage location, and so on; that is, the concordance of a lead-in includes the correspondence between every word length of that lead-in and the storage location of the word information.

The concordance as a whole should include, for all lead-ins, the correspondence between every word length and the storage location of the word information. The storage location of word information can be determined from its offset and data length, and the word information can then be read from that storage location.
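The offset and data length of each block can be derived from word counts alone, as sketched below. The record size formula (two bytes per remaining character plus a 4-byte frequency) and the names record_size and build_concordance are assumptions for illustration.

```python
def record_size(length):
    """Assumed record size: 2 bytes per character after the lead-in, plus a
    4-byte word frequency."""
    return 2 * (length - 1) + 4

def build_concordance(counts_by_length, base_offset=0):
    """Map each word length to the (offset, data length) of its block.

    counts_by_length: {word length: number of words of that length} for one
    lead-in. Blocks are laid out back to back in ascending length order.
    """
    concordance = {}
    offset = base_offset
    for length in sorted(counts_by_length):
        data_length = counts_by_length[length] * record_size(length)
        concordance[length] = (offset, data_length)
        offset += data_length
    return concordance

concordance = build_concordance({2: 3, 4: 1})
print(concordance)
```

With three 2-character words and one 4-character word, the first block spans 18 bytes starting at offset 0 and the second spans 10 bytes starting at offset 18.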

S110: According to the concordance, establish the positional relationship between each lead-in and its concordance information, to obtain a lead-in index.

Specifically, the concordance includes, for all lead-ins, the correspondence between every word length and the storage location of the word information. Based on where the word lengths of each lead-in sit within the concordance, the positional relationship between each lead-in and its concordance information is established, yielding the lead-in index.

Specifically, the storage location of a concordance includes the offset and the data length of the concordance information. The offset is the starting storage location of the concordance entries for all word lengths of one lead-in, that is, the position at which the first concordance entry of that lead-in is stored. The data length is the data length of the index information for all word lengths of that lead-in. It can be understood that, in the index, the data describing each storage location has a fixed length, so the data length of all the concordance information of a lead-in can be calculated easily from the number of word lengths that lead-in has.

The storage location of the concordance information can be determined from its offset and data length; then, according to the word length of the word being looked up, the storage location of the word information for that word length is determined within the index information.

The storage structure of the dictionary data in one embodiment is shown in Fig. 3. When the dictionary data is organized, a multi-level index, comprising the lead-in index and the concordance, is built over all the dictionary data in the dictionary (format: word string, word frequency, extension information such as part of speech). The overall file structure is divided into a lead-in index area and a per-word-length information area (which includes the concordance and the word information).

The lead-in index contains, for every character in the GBK Chinese character set, the starting position and length of the per-word-length information area of the corresponding lead-in. The lead-in index is arranged in order of GBK Chinese character code and is divided into the GB2312 area (0xB0A1-0xF7FE, 6763 Chinese characters in total), the GBK/3 area (0x8140-0xA0FE, 6080 Chinese characters in total), and the GBK/4 area (0xAA40-0xFEA0, 8160 Chinese characters in total). The number of entries in the lead-in index area is therefore 6763 + 6080 + 8160 = 21003. The lead-in index records each lead-in and the storage location of its concordance information.
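The entry count above can be checked from the byte ranges. The sketch below relies on two facts about GBK that the text does not spell out: the trail byte 0x7F is never used, and the GB2312 Chinese character block has 72 x 94 = 6768 code positions, of which 6763 are assigned. The helper name count_codes is hypothetical.

```python
def count_codes(lead_first, lead_last, trail_ranges):
    """Count two-byte codes: (number of lead bytes) x (number of trail bytes)."""
    trails = sum(hi - lo + 1 for lo, hi in trail_ranges)
    return (lead_last - lead_first + 1) * trails

# GBK/3: lead bytes 0x81-0xA0, trail bytes 0x40-0xFE excluding 0x7F.
gbk3 = count_codes(0x81, 0xA0, [(0x40, 0x7E), (0x80, 0xFE)])
# GBK/4: lead bytes 0xAA-0xFE, trail bytes 0x40-0xA0 excluding 0x7F.
gbk4 = count_codes(0xAA, 0xFE, [(0x40, 0x7E), (0x80, 0xA0)])
# GB2312 Chinese characters: 6763 assigned out of 72 x 94 code positions.
gb2312_hanzi = 6763

total = gb2312_hanzi + gbk3 + gbk4
print(gbk3, gbk4, total)
```

The counts agree with the figures in the text: 32 x 190 = 6080, 85 x 96 = 8160, and 21003 entries in total.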

The concordance contains, for each lead-in, the correspondence between each word length and the storage location of the word information. As shown in Fig. 3, the concordance of one lead-in includes a concordance entry for word length 2, a concordance entry for word length 3, and so on. The storage location referenced by a concordance entry holds the word information of the words that share that lead-in and that length, including the remaining characters of each word, its word frequency, and its additional information.

When the dictionary is queried, the lead-in index is used to determine the storage location of the lead-in's index information from the lead-in of the matching field; then, according to the word length of the matching field, the storage location of the word information for that word length is determined within the index information. A query can thus be completed by loading into memory only the word information for the word length of the word being looked up.
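The two-level lookup can be sketched end to end as follows. This is a simplified illustration under several assumptions: the lead-in index and the concordances are kept as Python dictionaries rather than packed index areas, each record is the GBK bytes of the remaining characters plus a 4-byte frequency, and an io.BytesIO object stands in for the dictionary file. All names (build, lookup, record_size) and the sample frequencies are hypothetical.

```python
import io
import struct

FREQ = struct.Struct("<I")  # assumed 4-byte word frequency field

def record_size(length):
    return 2 * (length - 1) + FREQ.size

def build(entries):
    """Return (lead_in_index, concordances, data) for (word, freq) entries."""
    grouped = {}
    for word, freq in entries:
        grouped.setdefault(word[0], {}).setdefault(len(word), []).append((word, freq))
    data = bytearray()
    lead_in_index, concordances = {}, []
    for lead_in in sorted(grouped, key=lambda c: c.encode("gbk")):
        concordance = {}
        for length in sorted(grouped[lead_in]):
            rows = sorted(grouped[lead_in][length],
                          key=lambda wf: wf[0][1:].encode("gbk"))
            start = len(data)
            for word, freq in rows:
                data += word[1:].encode("gbk") + FREQ.pack(freq)
            concordance[length] = (start, len(data) - start)
        lead_in_index[lead_in] = len(concordances)
        concordances.append(concordance)
    return lead_in_index, concordances, bytes(data)

def lookup(word, lead_in_index, concordances, stream):
    """Return the word frequency, or None if the word is not in the dictionary."""
    slot = lead_in_index.get(word[0])             # level 1: lead-in index
    if slot is None:
        return None
    location = concordances[slot].get(len(word))  # level 2: concordance
    if location is None:
        return None
    offset, data_length = location
    stream.seek(offset)
    block = stream.read(data_length)              # only this block is loaded
    size = record_size(len(word))
    key = word[1:].encode("gbk")
    for i in range(0, len(block), size):          # linear scan for brevity
        if block[i:i + len(key)] == key:
            return FREQ.unpack_from(block, i + len(key))[0]
    return None

index, concs, data = build([("人类", 120), ("人生", 95), ("人造卫星", 12), ("大学", 80)])
stream = io.BytesIO(data)
print(lookup("人类", index, concs, stream), lookup("人工", index, concs, stream))
```

Only the block for the matching lead-in and word length is read from the stream; the rest of the data is never touched during a query. The linear scan inside the block could be replaced by a binary search, since the records are sorted and of fixed size.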

The above dictionary data storage method does not store all the words of a lead-in together in no particular order. Instead, the word information corresponding to each lead-in is stored successively in order of word length; the correspondence between each word length of the lead-in and the storage location of the word information is established to obtain a concordance; and, according to the concordance, the positional relationship between each lead-in and its concordance information is established to obtain a lead-in index. During a dictionary query, the storage location of the lead-in's index information can therefore be determined from the lead-in, and the storage location of the word information for a given word length can then be determined within that index information. A query can thus be completed by loading into memory only the word information for the word length of the word being looked up, without loading the entire dictionary, which reduces the memory footprint of word segmentation and speeds up queries.

Fig. 4 is a flow chart of the dictionary-based word segmentation method of one embodiment. As shown in Fig. 4, the method includes the following steps:

S402: Obtain text information to be segmented.

Word segmentation is the basis of text mining and is commonly applied in fields such as natural language processing, search engines, and intelligent recommendation. The text information to be segmented is the text-mining object in the application scenario of the specific field, for example the text entered by a user in a search engine, or the text corresponding to speech that needs to be recognized in a speech synthesis system.

S404: Split the text information according to a preset text matching algorithm, to obtain each matching field and the lead-in of each matching field.

Here, a text matching algorithm is an algorithm that matches the matching fields split out of the text information against the words in the dictionary. A text matching algorithm needs to split the text information into matching fields. A matching field is a candidate word obtained by splitting the text information, which is then matched against the words of the dictionary. A text matching algorithm comprises two steps, splitting and matching, and different text matching algorithms differ in both how they split and how they match.

Common text matching algorithms include the forward maximum matching algorithm, the reverse maximum matching algorithm, the bidirectional maximum matching algorithm, and the maximum word frequency matching algorithm.

The forward maximum matching algorithm takes m characters of the Chinese sentence to be segmented, from left to right, as the matching field, where m is the number of characters in the longest entry of the machine dictionary. The dictionary is searched for a match. If the match succeeds, the matching field is cut off as a word. If the match fails, the last character of the matching field is removed and the remaining string is used as the new matching field, which is matched again. This process is repeated until all words have been cut off.
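The forward maximum matching procedure can be sketched as below. The dictionary lookup is abstracted as a callable in_dictionary, which in this application would be served by the lead-in index and concordance; here a plain Python set stands in for it. Emitting a single character when nothing longer matches is an assumption, since the text does not say how unmatched characters are handled. The sample words are hypothetical.

```python
def forward_max_match(sentence, in_dictionary, max_len):
    """Segment sentence left to right, always trying the longest field first."""
    words, i = [], 0
    while i < len(sentence):
        size = min(max_len, len(sentence) - i)
        # Drop the last character of the matching field until it matches.
        while size > 1 and not in_dictionary(sentence[i:i + size]):
            size -= 1
        words.append(sentence[i:i + size])
        i += size
    return words

WORDS = {"研究", "研究生", "生命", "起源"}
result = forward_max_match("研究生命起源", WORDS.__contains__, max_len=3)
print(result)
```

On this input the forward pass greedily takes "研究生" first, which leaves "命" unmatched; this kind of ambiguity is why the forward result is often compared with a reverse pass.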

The reverse maximum matching algorithm takes m characters of the Chinese sentence to be segmented, from right to left, as the matching field, where m is the number of characters in the longest entry of the machine dictionary. The dictionary is searched for a match. If the match succeeds, the matching field is cut off as a word. If the match fails, the first character of the matching field is removed and the remaining string is used as the new matching field, which is matched again. This process is repeated until all words have been cut off.
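A corresponding sketch of the reverse pass follows, under the same assumptions as for the forward pass: in_dictionary is a stand-in for the indexed dictionary lookup, a single character is emitted when nothing longer matches, and the sample words are hypothetical.

```python
def reverse_max_match(sentence, in_dictionary, max_len):
    """Segment sentence right to left, always trying the longest field first."""
    words, end = [], len(sentence)
    while end > 0:
        size = min(max_len, end)
        # Drop the first character of the matching field until it matches.
        while size > 1 and not in_dictionary(sentence[end - size:end]):
            size -= 1
        words.append(sentence[end - size:end])
        end -= size
    words.reverse()  # words were collected from the end of the sentence
    return words

WORDS = {"研究", "研究生", "生命", "起源"}
result = reverse_max_match("研究生命起源", WORDS.__contains__, max_len=3)
print(result)
```

For the same sentence, the reverse pass yields three dictionary words with no leftover single character.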

The bidirectional maximum matching algorithm compares the segmentation result obtained by forward maximum matching with the result obtained by reverse maximum matching, to determine the correct segmentation.

The maximum word frequency matching algorithm splits the text information arbitrarily in a variety of ways, matches the words in each split against the dictionary, computes the total word frequency of each split, and selects the split with the largest total word frequency as the segmentation result.
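A minimal sketch of maximum word frequency matching follows. Instead of literally enumerating every split, it uses memoized recursion (dynamic programming), which gives the same best total; this substitution is a choice made for the sketch. Treating an out-of-dictionary single character as a piece with frequency 0 is an assumption, and the frequencies are invented.

```python
from functools import lru_cache

def max_frequency_split(sentence, freq, max_len):
    """Return (words, total frequency) of the split with the largest total."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(sentence):
            return 0, ()
        candidates = []
        for size in range(1, min(max_len, len(sentence) - i) + 1):
            piece = sentence[i:i + size]
            if size > 1 and piece not in freq:
                continue  # multi-character pieces must be dictionary words
            rest_total, rest_words = best(i + size)
            candidates.append((freq.get(piece, 0) + rest_total,
                               (piece,) + rest_words))
        return max(candidates, key=lambda c: c[0])

    total, words = best(0)
    return list(words), total

FREQ = {"研究": 50, "研究生": 20, "生命": 40, "起源": 30}
words, total = max_frequency_split("研究生命起源", FREQ, max_len=3)
print(words, total)
```

The word frequencies would come from the word information loaded for each matching field, so this algorithm needs the frequency field of a record and not only a yes/no match.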

S406: Read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of their concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field.

Here, the lead-in index is the pre-established relationship between the lead-ins of the dictionary and the storage locations of their concordance information. Specifically, the lead-in of the matching field is looked up in the lead-in index to determine the storage location of the concordance corresponding to that lead-in. In this embodiment, the storage location of the concordance information includes the offset and the data length of the lead-in's concordance. The offset is the starting storage location of the concordance entries for all word lengths of one lead-in, that is, the position at which the first concordance entry of that lead-in is stored. The data length is the data length of the index information for all word lengths of that lead-in. It can be understood that, in the lead-in index, the data describing each storage location has a fixed length, so the data length of all the concordance information of a lead-in can be calculated easily from the number of word lengths that lead-in has.

S408: Read the concordance corresponding to the lead-in according to the storage location of the concordance information.

S410: In the concordance, obtain the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information.

The concordance includes, for all lead-ins, the correspondence between every word length and the storage location of the word information. The storage location of word information includes the offset and the data length of the word information. The offset is the starting storage location of the word information of the words of a given length, that is, the position at which the first word information record for that word length is stored. Because all records in such a block contain the same number of characters and a fixed-length character encoding is used, every word information record has a fixed length, so the data length of the word information can be calculated from the number of words of that length under the lead-in.

S412: According to the storage location of the word information, load the word information corresponding to the word length from the dictionary into memory.

In this embodiment, only the word information corresponding to the word length of the matching field needs to be loaded into memory; the entire dictionary does not, which reduces the memory used by segmentation. The word information includes any one or more of the word string, the word frequency, and extension information (such as part of speech).

Here, word frequency is the number of times a word occurs in the corpus. Dictionary-based methods assume that the higher a word's frequency, the higher its probability in segmentation prediction and the greater its weight.

S414: Match the matching field against each word in the word information, to obtain the matching result of the matching field.

Specifically, the matching field can be compared with each word information record in turn to obtain its matching result. To improve lookup efficiency, the match can also be performed by binary search within the word information.
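Because the loaded block consists of fixed-size records sorted by key, a binary search can index records directly by position. The sketch below assumes each record begins with the GBK bytes of the characters after the lead-in, followed by a 4-byte payload; this layout and the name binary_search_block are hypothetical.

```python
def binary_search_block(block, record_size, key):
    """Return the index of the record whose leading bytes equal key, or -1.

    block: bytes of back-to-back fixed-size records sorted by their leading
    key bytes. Python compares bytes objects lexicographically by byte value,
    which matches ordering by GBK code.
    """
    lo, hi = 0, len(block) // record_size - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = mid * record_size
        current = block[start:start + len(key)]
        if current == key:
            return mid
        if current < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Three two-character words with lead-in "人": keys are the second characters.
keys = sorted(ch.encode("gbk") for ch in ["类", "生", "民"])
block = b"".join(k + b"\x00\x00\x00\x00" for k in keys)
position = binary_search_block(block, 6, "生".encode("gbk"))
print(position)
```

A hit returns the record position, from which the word frequency and extension information of the matched word can be read at a fixed offset inside the record.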

S416: Based on the preset text matching algorithm, obtain the word segmentation result from the matching results of the split matching fields.

Specifically, under the preset text matching algorithm, the above matching process is applied to each split matching field to obtain its matching result, and the word segmentation result is derived from these matching results. Taking the forward maximum matching algorithm as an example, the first m characters of the Chinese sentence to be segmented are taken, from left to right, as the matching field, where m is the number of characters in the longest entry of the machine dictionary. The dictionary is then searched for a match. If the match succeeds, the matching field is cut out as a word. If the match fails, the last character of the matching field is removed and the remaining string becomes the new matching field, which is matched again; this process repeats until all words have been cut out. Taking the maximum word frequency matching algorithm as an example, the text information is split arbitrarily in multiple ways, the words in each split are matched against the dictionary, the total word frequency of each split is computed, and the split with the largest total word frequency is selected as the word segmentation result.
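The forward maximum matching procedure described above can be sketched as follows. This is an illustrative sketch only: the in-memory set stands in for the dictionary lookup, the function name is hypothetical, and emitting an unmatched single character as its own word is an assumption made so that the loop always advances.

```python
def forward_maximum_match(text, dictionary, max_len):
    """Segment text left to right, always trying the longest candidate first."""
    words = []
    i = 0
    while i < len(text):
        # Take up to max_len characters as the initial matching field.
        size = min(max_len, len(text) - i)
        # Remove the last character until the field is found in the dictionary.
        # Assumption: a single character is emitted as-is if nothing matches.
        while size > 1 and text[i:i + size] not in dictionary:
            size -= 1
        words.append(text[i:i + size])
        i += size
    return words


# Toy dictionary; max_len is the length of its longest entry.
demo_dictionary = {"ab", "abc", "d"}
print(forward_maximum_match("abcd", demo_dictionary, max_len=3))
```

With the toy dictionary, the field "abc" is tried first and matches, so the shorter entry "ab" is never selected.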

With the above dictionary-based segmenting method, the storage location of a lead-in's index information is determined from the lead-in, and the storage location of the word information for a given word length is then determined within that index information from the word length. Only the word information for the word length of the word being queried needs to be loaded into memory to complete the query; the entire dictionary need not be loaded, which reduces the memory usage of word segmentation and gives faster queries.

In another embodiment, the step of matching the matching field against each word in the word information to obtain the matching result of the matching field comprises: performing a binary search on the word information to obtain the matching result of the matching field. Fig. 5 is a flow chart, in one embodiment, of the step of performing a binary search on the word information to obtain the matching result of the matching field. As shown in Fig. 5, the step comprises the following steps:

S502: the data block of word information is obtained.

Specifically, after the word information is loaded into memory, the location of the in-memory data block holding the word information is obtained.

S504: Determine the middle position of the data block to be searched.

The word information is the data for one word length of one lead-in, and its words are stored in the order of their character encodings. The data length of each word's information is therefore fixed, and the records are arranged in encoding order. The middle position of the data block is determined from the data length of the word information.

S506: Determine whether the word at the middle position equals the matching field, or whether the end position has been reached.

Specifically, the character encoding of the matching field is compared with that of the word at the middle position. If they are equal, the matching field has a corresponding word in the dictionary and is a word to be split out of the text information. If the end position has been reached, or the word at the middle position equals the matching field, step S510 is executed: output the search result. If the word at the middle position equals the matching field, the word information of the found word is returned. If the end position has been reached, a result indicating that no data was found is returned.

If the word at the middle position does not equal the matching field and the end position has not been reached, step S508 is executed.

S508: Redetermine the middle position of the data block to be searched.

Specifically, if the comparison shows that the character encoding of the matching field is less than that of the word at the middle position, the binary search continues in the first half, before the middle position, and the middle position is redetermined within the first half.

If the comparison shows that the character encoding of the matching field is greater than that of the word at the middle position, the binary search continues in the second half, after the middle position, and the middle position is redetermined within the second half.

The binary search continues from the redetermined middle position. Each comparison halves the search range, and the process continues until the search succeeds or fails.
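Steps S502 to S510 can be sketched as a binary search over fixed-length records held in one in-memory data block. The record layout used here (the encoded word in the leading bytes, followed by a 4-byte word frequency, with UTF-16-BE as the fixed-length encoding) is an assumption for illustration, as are the function and variable names.

```python
import struct


def binary_search_block(block: bytes, record_size: int, key: bytes):
    """Return the index of the record whose leading bytes equal key, else None.

    block holds fixed-length records sorted by their encoded word; the word
    occupies the first len(key) bytes of each record (an assumed layout).
    """
    low, high = 0, len(block) // record_size - 1
    while low <= high:                      # search range not yet exhausted
        mid = (low + high) // 2             # middle position of the range
        start = mid * record_size
        word = block[start:start + len(key)]
        if word == key:
            return mid                      # match found
        if key < word:
            high = mid - 1                  # continue in the first half
        else:
            low = mid + 1                   # continue in the second half
    return None                             # end position reached, no match


# Build a toy block: three two-character words sharing one lead-in,
# sorted by encoding, each followed by a 4-byte frequency.
words = sorted(["天气", "天空", "天才"], key=lambda w: w.encode("utf-16-be"))
block = b"".join(w.encode("utf-16-be") + struct.pack(">I", 10) for w in words)
print(binary_search_block(block, 8, "天气".encode("utf-16-be")))
```

Because every record has the same size, the middle record is located by multiplication alone, which is what makes the sorted fixed-length layout suitable for binary search.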

Based on the dictionary data structure shown in Fig. 3, querying the dictionary for a word reduces to 2 file read operations plus (log(n)+2) memory read operations, where n is the number of word entries that share the lead-in of the queried word and usually does not exceed 40. The memory consumed by each query depends only on the data size of the words under the current lead-in (usually no more than 40 entries). In other words, this query engine is both fast enough and small enough in memory usage, and it requires no dictionary preloading.

In a specific embodiment, the preset text matching algorithm is the maximum word frequency matching algorithm. The step of splitting the text information according to the preset text matching algorithm to obtain the split matching fields and the lead-in of each matching field comprises: according to the maximum word frequency matching algorithm, splitting the text information arbitrarily in the forward direction, and obtaining the matching fields of each split and the lead-in of each matching field.

The word information includes word frequency. The step of obtaining the word segmentation result from the matching results of the split matching fields, based on the preset text matching algorithm, comprises: calculating the total word frequency of each split from the word frequencies of all its matching fields, and taking the matching fields of the split with the maximum total word frequency as the word segmentation result of the text information.

Fig. 6 is a flow chart of the dictionary-based segmenting method of one embodiment. As shown in Fig. 6, the method comprises the following steps:

S602: text information to be segmented is obtained.

S604: According to the maximum word frequency matching algorithm, split the text information arbitrarily in the forward direction, and obtain the matching fields of each split and the lead-in of each matching field.

Forward refers to the direction in which the text information is input; specifically, the left-to-right direction of a piece of text information.

Taking the text information "today, weather was pretty good" (今天天气不错) as an example, splitting it arbitrarily in the forward direction may yield the following splits:

The present / day weather is pretty good;

Today / weather not / wrong;

Today / day / gas / not / wrong;

……

From the various splits, the matching fields of each split and the lead-in of each matching field are obtained. A matching field is a word obtained by splitting the text information, which is then matched against the dictionary. Taking the split "today/weather not/wrong" as an example, the matching fields obtained are "today", "weather not" and "wrong". The lead-in is the first character of each matching field.
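The arbitrary forward splitting described above can be sketched as a recursive enumeration of every way to cut the text into consecutive fields. The function name is hypothetical and the sketch makes no attempt to prune splits; a text of n characters has 2^(n-1) such splits, so this is only practical for short inputs.

```python
def all_splits(text):
    """Yield every way of cutting text into consecutive, non-empty fields."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]                     # first matching field
        for rest in all_splits(text[end:]):   # every split of the remainder
            yield [head] + rest


for fields in all_splits("abc"):
    lead_ins = [field[0] for field in fields]  # lead-in = first character
    print(fields, lead_ins)
```

Each yielded list is one split, and the lead-in of every matching field is simply its first character, which is the key used for the index lookup in the following steps.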

S606: Read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of the corresponding concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field.

Specifically, for each matching field of each split, the storage location of the concordance information corresponding to the lead-in of the matching field is obtained from the pre-established lead-in index, according to the relationship between lead-ins and the storage locations of the corresponding concordance information.

The lead-in index records the positional relationship between each lead-in and its concordance information. The concordance contains, for every lead-in, the correspondence between each word length and the storage location of the word information. The storage location of the concordance information consists of an offset and a data length. The offset is the starting storage position of the concordance entries for the word lengths of one lead-in, that is, the position of that lead-in's first concordance entry. The data length is the length of the index information for all word lengths of one lead-in.

S608: According to the storage location of the concordance information, read the concordance corresponding to the lead-in of the matching field.

S610: In the concordance, obtain the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information.

The concordance records the correspondence between each word length under a lead-in and the storage location of the word information.

The storage location of the concordance information is determined from its offset and data length; the storage location of the word information for the word length of the word being looked up is then determined within the index information. The storage location of the word information consists of the offset and data length of the corresponding character data. The offset is the starting storage position of the word information for words of the same word length, that is, the position of the first word-information record for that word length. Because every stored record holds the same number of characters and a fixed-length character encoding is used, every record has a fixed length, so the data length of the word information can be calculated from the number of words of that word length under each lead-in. The storage location of the word information is determined from its offset and data length.

Specifically, the storage location of the word information for the word length of the matching field is looked up in the index relationship. For example, if the current matching field is "weather" (天气), its lead-in is "day" (天) and its word length is two characters. Using the lead-in "day", the storage location of the concordance for "day" is found in the lead-in index, and the concordance for "day" is read from that location. The concordance for "day" contains the index entry for two-character words, the index entry for three-character words, and so on; it records the correspondence between each word length under "day" and the storage location of the word information. In this embodiment, "weather" is a two-character word, so the storage location of the word information for two-character words beginning with "day" is looked up in the concordance.
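The two-level lookup in steps S606 to S610 can be sketched with plain dictionaries standing in for the lead-in index and the concordance. All names, offsets, and lengths below are hypothetical illustration values; in the described method both index levels are stored data that are read by offset and data length.

```python
# Hypothetical in-memory stand-ins for the two index levels.
# Lead-in index: lead-in -> (offset, data length) of that lead-in's concordance.
lead_in_index = {"天": (0, 20)}

# Concordances, keyed here by their storage location for brevity:
# word length -> (offset, data length) of the word-information block.
concordances = {(0, 20): {2: (1024, 24), 3: (1048, 20)}}


def locate_word_info(field):
    """Return (offset, data_length) of the block that may contain field."""
    location = lead_in_index.get(field[0])      # S606: look up the lead-in
    if location is None:
        return None                             # no word starts with this lead-in
    concordance = concordances[location]        # S608: read the concordance
    return concordance.get(len(field))          # S610: select by word length


print(locate_word_info("天气"))
```

A field whose lead-in or word length has no entry is rejected before any word information is read, so unmatched fields cost only the index lookups.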

S612: According to the storage location of the word information, load the word information corresponding to the word length from the dictionary into memory.

In this embodiment, the entire dictionary is no longer loaded into memory; only the word information for one word length of the relevant lead-in is loaded. Specifically, the word information for two-character words beginning with "day" is loaded into memory, which reduces memory usage during word segmentation.
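Loading only one block can be sketched as a seek followed by a bounded read. The in-memory `io.BytesIO` object below stands in for the dictionary file, and the byte contents, offset, and length are made-up illustration values.

```python
import io


def load_word_info(dictionary_file, offset, data_length):
    """Read only the bytes of one word-information block into memory."""
    dictionary_file.seek(offset)
    return dictionary_file.read(data_length)


# An in-memory file stands in for the dictionary file on disk.
dictionary_file = io.BytesIO(b"HEADER--" + b"AAAABBBBCCCC" + b"rest of dictionary")
loaded_block = load_word_info(dictionary_file, offset=8, data_length=12)
print(loaded_block)
```

The same function works unchanged on a file opened with `open(path, "rb")`, in which case the rest of the dictionary file is never read into memory.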

S614: Perform a binary search on the word information to obtain the matching result of the matching field.

The word information for words of the same word length under the same lead-in is stored in the order of character encoding, and every record in it has a fixed data length, which provides good support for binary search.

S616: Determine whether all matching fields of every split have been matched. If so, execute step S618; if not, return to step S606 to look up the next matching field of the current split, or the matching fields of the next split.

S618: Calculate the total word frequency of each split from the word frequencies of all its matching fields.

S620: Take the matching fields of the split with the maximum total word frequency as the word segmentation result of the text information.
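Steps S618 and S620 can be sketched as follows, with a plain dictionary of word frequencies standing in for the dictionary lookup. The frequencies are invented, the function names are hypothetical, and discarding any split that contains an unmatched field is an assumption, since the text does not say how such splits are scored.

```python
def all_splits(text):
    """Yield every way of cutting text into consecutive, non-empty fields."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        for rest in all_splits(text[end:]):
            yield [text[:end]] + rest


def max_frequency_split(text, word_freq):
    """Return (fields, total frequency) of the split with the largest total."""
    best_fields, best_total = None, -1
    for fields in all_splits(text):
        # Assumption: a split containing an unmatched field is discarded.
        if not all(field in word_freq for field in fields):
            continue
        total = sum(word_freq[field] for field in fields)   # S618
        if total > best_total:                              # S620
            best_fields, best_total = fields, total
    return best_fields, best_total


# Invented frequencies for illustration only.
word_freq = {"今天": 120, "天气": 90, "不错": 60, "今": 5, "天": 8}
print(max_frequency_split("今天天气不错", word_freq))
```

With these invented frequencies, only two splits consist entirely of dictionary words, and the one with the larger total is returned.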

The above dictionary-based segmenting method combines good segmentation speed with low memory usage and has almost no start-up loading time, making it particularly suitable for application environments that are sensitive to performance and memory, such as mobile terminals.

The architecture of the word segmentation system of one embodiment is shown in Fig. 7. The system comprises a dictionary data storage device and a dictionary-based segmenting device. The dictionary data storage device processes the original dictionary data into dictionary data with multi-level index files. The multi-level index comprises the lead-in index and the concordance. The lead-in index records the relationship between each lead-in and the storage location of its concordance information, and the concordance records the correspondence between each word length under a lead-in and the storage location of the word information.

The dictionary-based segmenting device obtains the input text. For example, a user enters the Chinese text "today, weather was pretty good" in an input box of a mobile terminal; the dictionary-based segmenting device is called to split it, and the split words are matched against the dictionary data with multi-level index files. Specifically, the storage location of a lead-in's index information is determined from the lead-in, and the storage location of the word information for a given word length is then determined within that index information from the word length. Only the word information for the word length of the word being queried needs to be loaded into memory to complete the query; the entire dictionary need not be loaded, which reduces the memory usage of word segmentation and gives faster queries.

After each split matching field has been matched against the dictionary, its word frequency is obtained, and according to the maximum word frequency matching method, the matching fields of the split with the maximum total word frequency are taken as the word segmentation result of the text information. Specifically, the input "today, weather was pretty good" is segmented as "today/weather/good" (part-of-speech tags can even be added: today (time word), weather (noun), good (adjective)) and handed to the speech synthesis back end for operations such as pinyin annotation and audio generation, after which the synthesized speech is played.

Fig. 8 is a structural block diagram of a dictionary data storage device of one embodiment. As shown in Fig. 8, the device comprises: a dictionary obtaining module 802, a dictionary analysis module 804, a memory module 806, a concordance establishing module 808, and a lead-in index establishing module 810.

The dictionary obtaining module 802 is configured to obtain a dictionary.

The dictionary analysis module 804 is configured to analyze the dictionary to obtain the word information corresponding to each lead-in.

The memory module 806 is configured to store the word information corresponding to each lead-in successively in order of word length.

The concordance establishing module 808 is configured to establish the correspondence between each word length of a lead-in and the storage location of the word information, obtaining the concordance.

The lead-in index establishing module 810 is configured to establish, according to the concordance, the relationship between each lead-in and the storage location of the corresponding concordance information, obtaining the lead-in index.

Specifically, the dictionary analysis module 804 is configured to analyze the dictionary in character set encoding order to obtain the word information of each lead-in.

Specifically, the memory module 806 is configured to store the word information corresponding to each lead-in successively in character set encoding order and, within the word information of each lead-in, to store the words of the same word length successively in the order of their character encodings.

The above dictionary data storage device does not store all the words of a lead-in together in no particular order. Instead, it stores the word information corresponding to each lead-in successively in order of word length, establishes the correspondence between each word length of the lead-in and the storage location of the word information to obtain the concordance, and then, according to the concordance, establishes the positional relationship between each lead-in and its concordance information to obtain the lead-in index. During a dictionary query, the storage location of a lead-in's index information is determined from the lead-in, and the storage location of the word information for a given word length is then determined within that index information from the word length. Only the word information for the word length of the word being queried needs to be loaded into memory to complete the query; the entire dictionary need not be loaded, which reduces the memory usage of word segmentation and gives faster queries.
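The storage procedure summarized above can be sketched as follows. This is an illustrative sketch under assumed formats: UTF-16-BE as the fixed-length encoding (valid for characters in the Basic Multilingual Plane), a 4-byte word frequency per record, and a 10-byte concordance entry packed as `>HII`. None of these layouts, nor the function names, come from the application.

```python
import struct
from itertools import groupby


def encode(word):
    # Assumed fixed-length encoding: 2 bytes per character (BMP only).
    return word.encode("utf-16-be")


def build_dictionary(entries):
    """Build word data, concordance data and lead-in index from {word: frequency}."""
    # Order by lead-in encoding, then word length, then word encoding.
    ordered = sorted(entries, key=lambda w: (encode(w[0]), len(w), encode(w)))
    word_data = bytearray()
    concordance_data = bytearray()
    lead_in_index = {}
    for lead_in, words in groupby(ordered, key=lambda w: w[0]):
        concordance_offset = len(concordance_data)
        for length, same_length in groupby(words, key=len):
            offset = len(word_data)
            for word in same_length:
                # One fixed-length record: encoded word + 4-byte frequency.
                word_data += encode(word) + struct.pack(">I", entries[word])
            # Concordance entry: word length -> (offset, data length).
            concordance_data += struct.pack(
                ">HII", length, offset, len(word_data) - offset)
        # Lead-in index entry: lead-in -> location of its concordance.
        lead_in_index[lead_in] = (
            concordance_offset, len(concordance_data) - concordance_offset)
    return bytes(word_data), bytes(concordance_data), lead_in_index


word_data, concordance_data, lead_in_index = build_dictionary(
    {"天气": 90, "天空": 40, "天": 8, "今天": 120})
print(lead_in_index)
```

Sorting once by lead-in, word length, and encoding produces contiguous blocks, so each concordance entry only needs an offset and a data length to describe all words of one length under one lead-in.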

Fig. 9 is a structural block diagram of a dictionary-based segmenting device of one embodiment. As shown in Fig. 9, the dictionary-based segmenting device comprises: a text obtaining module 902, a splitting module 904, a searching module 906, a reading module 908, a loading module 910, a matching module 912, and a word segmentation module 914.

The text obtaining module 902 is configured to obtain text information to be segmented.

The splitting module 904 is configured to split the text information according to a preset text matching algorithm, obtaining the split matching fields and the lead-in of each matching field.

The searching module 906 is configured to read the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of the corresponding concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field.

The reading module 908 is configured to read, according to the storage location of the concordance information, the concordance corresponding to the lead-in of the matching field.

The searching module 906 is further configured to obtain, in the concordance, the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information.

The loading module 910 is configured to load, according to the storage location of the word information, the word information corresponding to the word length from the dictionary into memory.

The matching module 912 is configured to match the matching field against each word in the word information to obtain the matching result of the matching field.

The word segmentation module 914 is configured to obtain the word segmentation result from the matching results of the split matching fields, based on the preset text matching algorithm.

With the above dictionary-based segmenting device, the storage location of a lead-in's index information is determined from the lead-in, and the storage location of the word information for a given word length is then determined within that index information from the word length. Only the word information for the word length of the word being queried needs to be loaded into memory to complete the query; the entire dictionary need not be loaded, which reduces the memory usage of word segmentation and gives faster queries.

In another embodiment, the matching module 912 is configured to perform a binary search on the word information to obtain the matching result of the matching field.

In another embodiment, the splitting module 904 is configured to split the text information arbitrarily in the forward direction according to the maximum word frequency matching algorithm, and to obtain the matching fields of each split and the lead-in of each matching field.

The word information includes word frequency. The word segmentation module 914 comprises a word frequency computing module and a segmentation determining module.

The word frequency computing module is configured to calculate the total word frequency of each split from the word frequencies of all its matching fields.

The segmentation determining module is configured to take the matching fields of the split with the maximum total word frequency as the word segmentation result of the text information.

A computer device comprises a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the dictionary data storage method or the dictionary-based segmenting method of the above embodiments.

Figure 10 shows the internal structure of the computer device in one embodiment. As shown in Figure 10, the computer device comprises a processor, a memory, and a network interface connected by a system bus. The memory comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the dictionary data storage method or the dictionary-based segmenting method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the dictionary data storage method or the dictionary-based segmenting method. Those skilled in the art will understand that the structure shown in Figure 10 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

In one embodiment, the dictionary data storage device provided by this application can be implemented in the form of a computer program that runs on the computer device shown in Figure 10. The memory of the computer device can store the program modules that make up the dictionary data storage device, for example the dictionary obtaining module, the dictionary analysis module, and the memory module shown in Fig. 8. The computer program formed by these program modules causes the processor to perform the steps of the dictionary data storage method of the embodiments of this application described in this specification.

For example, the computer device shown in Fig. 10 can perform the step of obtaining a dictionary through the dictionary obtaining module of the dictionary data storage device shown in Fig. 8. The computer device can perform, through the dictionary analysis module, the step of analyzing the dictionary to obtain the word information corresponding to each lead-in. The computer device can perform, through the memory module, the step of storing the word information corresponding to each lead-in successively in order of word length.

In another embodiment, the dictionary-based segmenting device provided by this application can be implemented in the form of a computer program that runs on the computer device shown in Figure 10. The memory of the computer device can store the program modules that make up the dictionary-based segmenting device, for example the text obtaining module, the splitting module, and the searching module shown in Fig. 9. The computer program formed by these program modules causes the processor to perform the steps of the dictionary-based segmenting method of the embodiments of this application described in this specification.

For example, the computer device shown in Fig. 10 can perform the step of obtaining text information to be segmented through the text obtaining module of the dictionary-based segmenting device shown in Fig. 9. The computer device can perform, through the splitting module, the step of splitting the text information according to a preset text matching algorithm to obtain the split matching fields and the lead-in of each matching field. The computer device can perform, through the searching module, the step of obtaining, in the pre-established lead-in index, the storage location of the concordance information corresponding to the lead-in of the matching field, according to the relationship between lead-ins and the storage locations of the corresponding concordance information.

A storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the dictionary data storage method or the dictionary-based segmenting method of the above embodiments.

Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A dictionary data storage method, comprising:

Obtain dictionary；

According to the concordance, establishing the relationship between each lead-in and the storage location of the corresponding concordance information, obtaining a lead-in index.

2. The method according to claim 1, wherein the step of analyzing the dictionary to obtain the word information corresponding to each lead-in comprises:

The dictionary is analyzed according to character set encoding sequence, obtains the word information of each lead-in.

3. The method according to claim 1 or 2, wherein the step of storing the word information corresponding to each lead-in successively in order of word length comprises:

According to character set encoding sequence, successively each lead-in corresponding word information is stored；

To each word information of lead-in, successively stored according to the sequence of each word character code of identical word word length.

4. A segmenting method based on a dictionary, comprising:

Obtain text information to be segmented；

According to preset text matches algorithm, the text information is split, obtains each matching field and each matching field of fractionation Lead-in；

Reading the lead-in index of the dictionary and, according to the relationship between lead-ins and the storage locations of the corresponding concordance information, obtaining the storage location of the concordance information corresponding to the lead-in of the matching field；

In the concordance, according to the corresponding relationship of word word length and the storage location of word information, the matching is obtained The storage location of the corresponding word information of the word length of field；

The matching field is matched with each word in the word information, obtains the matching knot of the matching field Fruit；

5. The method according to claim 4, wherein the step of matching the matching field against each word in the word information to obtain the matching result of the matching field comprises:

Binary chop is carried out to the word information, obtains the matching result of the matching field.

6. The method according to claim 4, wherein the step of splitting the text information according to the preset text matching algorithm to obtain the split matching fields and the lead-in of each matching field comprises: according to the maximum word frequency matching algorithm, splitting the text information arbitrarily in the forward direction, and obtaining the matching fields of each split and the lead-in of each matching field；

The word information includes word frequency；Based on preset text matches algorithm, according to the matching knot of each matching field of fractionation Fruit, the step of obtaining word segmentation result, comprising:

The word frequency for the whole matching fields for splitting mode according to every kind calculates the corresponding total word frequency of every kind of fractionation mode；

Taking the matching fields of the splitting mode with the maximum total word frequency as the word segmentation result of the text information.
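The selection rule of claim 6 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the names `all_splits` and `segment` and the frequency table are hypothetical, and the sketch assumes that a splitting mode containing a field absent from the dictionary is discarded, which the claim does not state.

```python
def all_splits(text):
    """Yield every way to cut `text` into consecutive non-empty fields."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        for rest in all_splits(text[end:]):
            yield [text[:end]] + rest

def segment(text, freq):
    """Return the splitting mode with the largest total word frequency.

    Assumption: a splitting mode is kept only if every field is in `freq`.
    Returns None when no splitting mode qualifies.
    """
    best, best_total = None, -1
    for fields in all_splits(text):
        if all(f in freq for f in fields):
            total = sum(freq[f] for f in fields)
            if total > best_total:
                best, best_total = fields, total
    return best

# Hypothetical word frequencies, for illustration only.
freq = {"研究": 50, "研究生": 20, "生命": 40, "研": 3, "究": 2, "生": 8, "命": 5}
result = segment("研究生命", freq)  # ["研究", "生命"], total frequency 90
```

Enumerating every splitting mode grows exponentially with the text length, so this form is only suitable for short inputs such as a single clause.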

7. A dictionary data storage device, comprising: a dictionary obtaining module, a dictionary analysis module, a memory module, a concordance establishing module, and a lead-in index establishing module;

The dictionary obtaining module is configured to obtain a dictionary;

The memory module is configured to store the word information corresponding to each lead-in successively, in order of word length;

The concordance establishing module is configured to establish the correspondence between each word length of a lead-in and the storage location of the word information, obtaining a concordance;

The lead-in index establishing module is configured to establish, according to the concordance, the relationship between each lead-in and the storage location of its corresponding concordance information, obtaining a lead-in index.

8. The device according to claim 7, characterized in that the dictionary analysis module is configured to analyze the dictionary in character set encoding order, to obtain the word information of each lead-in.

9. The device according to claim 7 or 8, characterized in that the memory module is configured to store the word information corresponding to each lead-in successively in character set encoding order and, within the word information of each lead-in, to store the words of the same word length successively in order of their character codes.
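The storage order and the two indexes described in claims 7 to 9 can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration: the function name `build_storage`, the tab-separated record format, the use of UTF-8 code point order as the character set encoding order, and the choice to keep the concordance as an in-memory list rather than a stored structure.

```python
from collections import defaultdict

def build_storage(dictionary):
    """Lay out {word: frequency} as one byte string plus two indexes.

    Returns (data, concordance, lead_in_index) where
      concordance[i]  maps word length -> (offset, size) of a block in data,
      lead_in_index   maps lead-in -> position i of its concordance entry.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for word, freq in dictionary.items():
        groups[word[0]][len(word)].append((word, freq))

    data = bytearray()
    concordance = []
    lead_in_index = {}
    for lead_in in sorted(groups):                      # lead-ins in code order
        lengths = {}
        for length in sorted(groups[lead_in]):          # ascending word length
            start = len(data)
            for word, freq in sorted(groups[lead_in][length]):  # words in code order
                data += f"{word}\t{freq}\n".encode("utf-8")
            lengths[length] = (start, len(data) - start)
        lead_in_index[lead_in] = len(concordance)
        concordance.append(lengths)
    return bytes(data), concordance, lead_in_index

# Hypothetical dictionary, for illustration only.
dictionary = {"中国": 120, "中文": 80, "中国人": 60, "人民": 90, "中": 10}
data, concordance, lead_in_index = build_storage(dictionary)
```

With this layout, all words that share a lead-in and a word length occupy one contiguous, sorted block, so a reader needs only the block's offset and size to load it.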

10. A dictionary-based word segmentation device, comprising: a text obtaining module, a splitting module, a searching module, a read module, a loading module, a matching module, and a word segmentation module;

The text obtaining module is configured to obtain text information to be segmented;

The splitting module is configured to split the text information according to a preset text matching algorithm, obtaining each matching field of the split and the lead-in of each matching field;

The searching module is configured to read the lead-in index of the dictionary and, according to the relationship between each lead-in and the storage location of its corresponding concordance information, obtain the storage location of the concordance information corresponding to the lead-in of the matching field;

The read module is configured to read, according to the storage location of the concordance information, the concordance corresponding to the lead-in of the matching field;

The searching module is further configured to obtain, in the concordance, the storage location of the word information corresponding to the word length of the matching field, according to the correspondence between word length and the storage location of word information;

The loading module is configured to load into memory, according to the storage location of the word information, the word information in the dictionary that corresponds to the word length;

The matching module is configured to match the matching field against each word in the word information, obtaining the matching result of the matching field;

The word segmentation module is configured to obtain a word segmentation result, based on the preset text matching algorithm, according to the matching results of the matching fields of the split.
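The lookup chain of claim 10 (lead-in index, then concordance, then a single block of word information) can be sketched as follows. The layout is hypothetical: `io.BytesIO` stands in for the dictionary file, the record format is an assumed tab-separated line per word, and, as a simplification, the concordance is held in memory instead of being read from its own storage location.

```python
import io
from bisect import bisect_left

# Hypothetical blocks keyed by (lead-in, word length); each block is sorted by word.
blocks = {("中", 2): [("中国", 120), ("中文", 80)],
          ("中", 3): [("中国人", 60)],
          ("人", 2): [("人民", 90)]}

raw, concordance, lead_in_index = bytearray(), [], {}
for (lead_in, length), entries in blocks.items():
    if lead_in not in lead_in_index:
        lead_in_index[lead_in] = len(concordance)
        concordance.append({})
    start = len(raw)
    for word, freq in entries:
        raw += f"{word}\t{freq}\n".encode("utf-8")
    concordance[lead_in_index[lead_in]][length] = (start, len(raw) - start)
dictionary_file = io.BytesIO(bytes(raw))  # stand-in for the dictionary on disk

def match_field(field):
    """Return the frequency of `field`, loading only the one block that can hold it."""
    slot = lead_in_index.get(field[0])             # lead-in -> concordance entry
    if slot is None:
        return None
    location = concordance[slot].get(len(field))   # word length -> block location
    if location is None:
        return None
    offset, size = location
    dictionary_file.seek(offset)                   # load just this block
    lines = dictionary_file.read(size).decode("utf-8").splitlines()
    entries = [tuple(line.split("\t")) for line in lines]
    i = bisect_left(entries, (field,))             # binary search within the block
    if i < len(entries) and entries[i][0] == field:
        return int(entries[i][1])
    return None
```

A field whose lead-in or word length has no entry is rejected before any word information is read, and a successful lookup reads only `size` bytes rather than the whole dictionary.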

11. The device according to claim 10, characterized in that the matching module is configured to perform a binary search on the word information, obtaining the matching result of the matching field.

12. The device according to claim 10, characterized in that the splitting module is configured to split the text information arbitrarily in the forward direction according to a maximum word frequency matching algorithm, and to obtain the matching fields corresponding to each splitting mode and the lead-in of each matching field;

The word information includes word frequency; the word segmentation module includes a word frequency computing module and a segmentation determining module;

The word frequency computing module is configured to calculate the total word frequency of each splitting mode from the word frequencies of all the matching fields of that splitting mode;

The segmentation determining module is configured to take the matching fields of the splitting mode with the maximum total word frequency as the word segmentation result of the text information.

13. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the dictionary data storage method according to any one of claims 1 to 3 or of the dictionary-based word segmentation method according to any one of claims 4 to 6.

14. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the dictionary data storage method according to any one of claims 1 to 3 or of the dictionary-based word segmentation method according to any one of claims 4 to 6.