CN106021230A - Word segmentation method and word segmentation apparatus - Google Patents


Info

Publication number
CN106021230A
Authority
CN
China
Prior art keywords
entropy
candidate phrase
word
length
carry out
Prior art date
Legal status
Granted
Application number
CN201610339657.XA
Other languages
Chinese (zh)
Other versions
CN106021230B (en)
Inventor
高云翔
Current Assignee
Wireless Living (hangzhou) Mdt Infotech Ltd
Original Assignee
Wireless Living (hangzhou) Mdt Infotech Ltd
Priority date
Filing date
Publication date
Application filed by Wireless Living (hangzhou) Mdt Infotech Ltd filed Critical Wireless Living (hangzhou) Mdt Infotech Ltd
Priority to CN201610339657.XA priority Critical patent/CN106021230B/en
Publication of CN106021230A publication Critical patent/CN106021230A/en
Application granted granted Critical
Publication of CN106021230B publication Critical patent/CN106021230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmentation method and a word segmentation apparatus, and relates to the technical field of data mining. The method comprises the steps of: combining adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases; calculating the left entropy and the right entropy of each candidate phrase; and determining, from the left entropy and the right entropy of a candidate phrase, the probability that the candidate phrase is a word, and segmenting the to-be-processed document according to that probability. The method and apparatus can segment a document without a lexicon and handle neologisms and uncommon words better, and thus segment more accurately.

Description

Word segmentation method and apparatus
Technical field
The present invention relates to the field of data mining technology, and in particular to a word segmentation method and apparatus.
Background technology
Traditional word segmentation methods are dictionary-based: the general approach is maximum matching between the words in a dictionary and the document (forward maximum matching, reverse maximum matching, bidirectional maximum matching). Such methods are suitable for segmenting conventional documents, but on e-commerce platforms the product descriptions contain large numbers of brand words, function words, neologisms and other uncommon words, and traditional segmentation methods cannot handle such vocabulary.
In addition, traditional segmentation methods simply use a greedy idea, performing forward or reverse maximum matching, and the matching result is not globally optimal. They are fast, but the quality is poor. In some application domains, segmentation speed is not critical, while segmentation quality is subject to higher requirements.
Furthermore, traditional entropy-based dictionary-construction methods are of limited effect: information entropy alone cannot represent the probability that a string is a word.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a word segmentation method and apparatus that overcome, or at least partially solve, the problems described above.
The present invention provides a word segmentation method, comprising:
combining adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases;
calculating the left entropy and right entropy of each candidate phrase;
determining, from the left entropy and right entropy of a candidate phrase, the probability that the candidate phrase is a word, and segmenting the to-be-processed document according to that probability.
In one embodiment, after calculating the left and right entropy of all candidate phrases, the method may further comprise:
performing a correction operation on the left and right entropy of all the candidate phrases.
In one embodiment, the correction operation on the left and right entropy of all the candidate phrases may comprise:
applying a string-offset correction to the left and right entropy of all the candidate phrases;
standardizing the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, segmenting the to-be-processed document according to the probability may comprise:
segmenting the to-be-processed document according to the probability using a dynamic programming algorithm.
In one embodiment, the string-offset correction of the left and right entropy of a candidate phrase follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
The present invention also provides a word segmentation apparatus, comprising:
a combining module, configured to combine adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases;
a calculating module, configured to calculate the left entropy and right entropy of each candidate phrase;
a segmentation module, configured to determine, from the left entropy and right entropy of a candidate phrase, the probability that the candidate phrase is a word, and to segment the to-be-processed document according to that probability.
In one embodiment, the apparatus may further comprise:
a correction module, configured to perform a correction operation on the left and right entropy of all the candidate phrases.
In one embodiment, the correction module may comprise:
a first correction submodule, configured to apply a string-offset correction to the left and right entropy of all the candidate phrases;
a second correction submodule, configured to standardize the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, the segmentation module may comprise:
a segmentation submodule, configured to segment the to-be-processed document, according to the left and right entropy of the candidate phrases, using a dynamic programming algorithm.
In one embodiment, the string-offset correction of the left and right entropy of a candidate phrase follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
The technical solutions provided by embodiments of the invention may have the following beneficial effects:
adjacent characters in the to-be-processed document are combined in all possible ways to obtain candidate phrases, the left and right entropy of each candidate phrase is calculated, and the document is segmented according to the left and right entropy of the candidate phrases. The document can thus be segmented without a dictionary, uncommon words and neologisms are handled better, and segmentation is more accurate.
Other features and advantages of the invention will be set out in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention may be realized and obtained by the structures particularly pointed out in the written description, the claims and the accompanying drawings.
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
Accompanying drawing explanation
The accompanying drawings provide a further understanding of the invention and constitute a part of the description; together with the embodiments, they serve to explain the invention and are not to be construed as limiting it. In the drawings:
Fig. 1 is a flow chart of a word segmentation method in an embodiment of the invention;
Fig. 2 is a flow chart of another word segmentation method in an embodiment of the invention;
Fig. 3 is a flow chart of step S14 of a word segmentation method in an embodiment of the invention;
Fig. 4 is a flow chart of another word segmentation method in an embodiment of the invention;
Fig. 5 is a block diagram of a word segmentation apparatus in an embodiment of the invention;
Fig. 6 is a block diagram of another word segmentation apparatus in an embodiment of the invention;
Fig. 7 is a block diagram of the correction module 54 of a word segmentation apparatus in an embodiment of the invention;
Fig. 8 is a block diagram of the segmentation module 53 of a word segmentation apparatus in an embodiment of the invention.
Detailed description of the invention
Preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described herein are intended only to illustrate and explain the invention, not to limit it.
In the prior art, traditional segmentation methods cannot handle the brand words, function words, neologisms and other uncommon words found in the product descriptions of e-commerce platforms. For example, in a product title such as "foreign-trade original POLLIWALKS clogs children's shoes garden shoes children's sandals slippers genuine", the uncommon term for "clogs" (洞洞鞋) is not contained in common dictionaries.
Fig. 1 shows a flow chart of a word segmentation method in an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps S11-S13:
Step S11: combine adjacent characters in the to-be-processed document in all possible ways to obtain candidate phrases.
The probability that a string is a word is estimated from the to-be-processed document in an unsupervised manner. That is, all adjacent characters in the document are combined, enumerating every possible candidate phrase. In this step, a maximum word length should be set first.
For example, assume the maximum word length is five and the to-be-processed document is "儿童凉鞋拖鞋正品" ("children's sandals slippers genuine"). All possible phrases are enumerated (see Table 1):
Table 1: possible phrases in the to-be-processed document

儿童          童凉          凉鞋          鞋拖
儿童凉        童凉鞋        凉鞋拖        拖鞋
儿童凉鞋      童凉鞋拖      凉鞋拖鞋      拖鞋正
儿童凉鞋拖    童凉鞋拖鞋    凉鞋拖鞋正    拖鞋正品

The 16 phrases enumerated in Table 1 are taken as candidate phrases. Only some of these 16 phrases are real "words". A human reader recognizes them at a glance, but a computer does not; with a large corpus and suitable algorithmic rules, the computer can also be made to identify which phrases are real words.
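The enumeration in step S11 can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions, not the patent's implementation: the function name and the default maximum length are the author's choices, and a full enumeration also includes a few substrings that Table 1 omits.

```python
def enumerate_candidates(text, max_len=5):
    """Combine adjacent characters into all substrings of length 2..max_len."""
    candidates = []
    for start in range(len(text)):
        for length in range(2, max_len + 1):
            if start + length <= len(text):
                candidates.append(text[start:start + length])
    return candidates

# For the 8-character example title this yields every length-2..5 substring,
# a superset of the 16 phrases shown in Table 1.
cands = enumerate_candidates("儿童凉鞋拖鞋正品")
```

Each candidate is then scored by the entropy-based measures described below; the enumeration itself makes no judgment about wordhood.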
Step S12: calculate the left entropy and right entropy of each candidate phrase.
All possible phrases were taken as candidate phrases in step S11; in this step, the left and right entropy of each of them is calculated.
For example, for each of the 16 candidate phrases above, compute its left entropy and right entropy, e.g. the left and right entropy of the candidate phrase "童凉鞋拖".
Information entropy is a measure of randomness: the larger the entropy, the more random the distribution. If a state occurs with probability p, its information content on occurring is -log(p); the expected information content over all states is the total entropy:

entropy = -Σ P(i) * log P(i)
Suppose the characters appearing immediately to the right of the phrase "儿童" ("child") in a corpus are 服 ("clothing"), 乐 ("fun") and 食 ("food") — e.g. in documents beginning "儿童服装…", "儿童乐园…", "儿童食品…" — with counts 2, 1 and 1. Its right entropy is then:

right entropy of "儿童" = -(2/4)*log(2/4) - (1/4)*log(1/4) - (1/4)*log(1/4) > 0

If the three documents instead begin "连衣裙红色…" ("one-piece dress, red…"), "连衣裙包邮…" ("one-piece dress, free shipping…") and "连衣裙代购…" ("one-piece dress, purchasing agent…"), then only "裙" appears to the right of "连衣", three times, and its right entropy becomes:

right entropy of "连衣" = -log(3/3) * (3/3) = 0

It can be seen that the more random the characters appearing to the right of a phrase, the larger its right entropy; by the same reasoning, the left entropy behaves identically. "儿童" is a real word, so the characters to its right are relatively random and its entropy is high; "连衣" tends not to occur as a word on its own, so the character to its right is essentially fixed and its entropy is small.
In other words, the more random the characters on both sides of a phrase, the more likely the phrase is a word: the larger a phrase's left and right entropy, the higher the probability that it is actually a word, because the characters surrounding a real word tend to be random.
To formalize the above: assume the phrase x1..n is an n-tuple of n characters:

x1..n = x1 x2 … xn

and let χr denote the set of all characters that appear to the right of x1..n. The right entropy RE (Right Entropy) of the phrase is then:

RE(x1..n) = -Σ_{x∈χr} P(x | x1..n) * log P(x | x1..n)

Similarly, with χl the set of all characters that appear to the left of x1..n, the left entropy LE (Left Entropy) is:

LE(x1..n) = -Σ_{x∈χl} P(x | x1..n) * log P(x | x1..n)
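Under the definitions above, the left and right entropy can be computed directly from corpus counts. The sketch below is illustrative only — the function name and the naive substring scan are assumptions, not the patent's implementation:

```python
from collections import Counter
from math import log

def side_entropy(corpus, phrase, side="right"):
    """RE/LE of `phrase`: entropy of the character distribution beside it."""
    neighbors = Counter()
    for doc in corpus:
        start = doc.find(phrase)
        while start != -1:
            if side == "right" and start + len(phrase) < len(doc):
                neighbors[doc[start + len(phrase)]] += 1   # character just after
            elif side == "left" and start > 0:
                neighbors[doc[start - 1]] += 1             # character just before
            start = doc.find(phrase, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in neighbors.values())

# The '连衣' example from the description:
docs = ["连衣裙红色", "连衣裙包邮", "连衣裙代购"]
re_lianyi = side_entropy(docs, "连衣")       # 0: only '裙' ever follows it
re_lianyiqun = side_entropy(docs, "连衣裙")  # log(3): three distinct followers
```

The two computed values reproduce the contrast drawn in the text: a fixed right context gives zero entropy, while three equally likely right characters give log(3).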
In one embodiment, as shown in Fig. 2, the method may further comprise, after step S12, a step S14:
Step S14: perform a correction operation on the left and right entropy of all candidate phrases.
Since raw information entropy alone cannot adequately represent the probability that a phrase is a word, in this embodiment the entropy is optimized in two steps before the word probability is computed.
In step S12 the left and right entropy of each candidate phrase were calculated, but in practical computation it turns out that the raw left and right entropy of a phrase are not sufficient to measure its word probability. The entropy is therefore smoothed in two steps, as follows.
In one embodiment, as shown in Fig. 3, step S14 may be implemented as the following steps S141-S142:
Step S141: apply a string-offset correction to the left and right entropy of all candidate phrases.
In one embodiment, the string-offset correction follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
For example, the right entropy of "连衣裙" is large while the right entropy of "连衣" is very small, so their difference is a large value that well represents the probability that "连衣裙" is actually a word.
This first smoothing step yields the corrected entropies VRE(x1..n) and VLE(x1..n).
In this embodiment, the string-offset correction of the entropy makes the corrected data a better representation of word probability.
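The first smoothing step is a direct translation of the two formulas. A small sketch, under the assumption that raw RE/LE values are already available in dictionaries keyed by phrase (the toy numbers below are illustrative, echoing the "连衣裙" discussion):

```python
def string_offset_correct(phrase, re_vals, le_vals):
    """VRE = RE(x1..n) - RE(x1..n-1); VLE = LE(x1..n) - LE(x2..n)."""
    vre = re_vals[phrase] - re_vals[phrase[:-1]]  # drop last char: the prefix
    vle = le_vals[phrase] - le_vals[phrase[1:]]   # drop first char: the suffix
    return vle, vre

# Toy numbers: RE('连衣裙') is large, RE('连衣') is 0, so VRE('连衣裙') stays large.
re_vals = {"连衣裙": 1.10, "连衣": 0.0}
le_vals = {"连衣裙": 0.95, "衣裙": 0.20}
vle, vre = string_offset_correct("连衣裙", re_vals, le_vals)
```

A real word keeps a large corrected value because its shortened forms have much lower entropy; a non-word such as "连衣" is penalized.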
Step S142: standardize the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
After the first smoothing step, it is observed that phrases with fewer characters tend to have systematically higher VLE and VRE, so a further standardization step is needed. The phrases are first grouped by length; the VLE and VRE of each phrase then have the mean of the phrase's length group subtracted and are divided by the group's standard deviation, yielding nVLE and nVRE. After this operation, the values within each length group follow a standard normal distribution.
Finally, the probability a(x1..n) that a phrase is a word is defined as:

a(x1..n) = min( nVLE(x1..n), nVRE(x1..n) )

In this embodiment, because phrases with fewer characters have systematically higher VLE and VRE, the values are normalized by length group; after the two correction steps, the new data better represents word probability.
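The second step — grouping by phrase length and z-scoring within each group — and the final score a(x) can be sketched as follows. Function names are illustrative; the use of the population standard deviation (`pstdev`) is an assumption, since the patent does not specify which variant it intends:

```python
from statistics import mean, pstdev

def standardize_by_length(values):
    """Z-score each phrase's value within its own length group (nVRE / nVLE)."""
    groups = {}
    for phrase, v in values.items():
        groups.setdefault(len(phrase), []).append(v)
    stats = {n: (mean(vs), pstdev(vs)) for n, vs in groups.items()}
    out = {}
    for phrase, v in values.items():
        mu, sigma = stats[len(phrase)]
        out[phrase] = (v - mu) / sigma if sigma else 0.0  # guard degenerate groups
    return out

def word_score(nvle, nvre):
    """a(x1..n) = min(nVLE(x1..n), nVRE(x1..n))."""
    return min(nvle, nvre)
```

Taking the minimum of the two standardized entropies means a phrase only scores well when both its left and right contexts are random, matching the intuition that a real word is free on both sides.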
Step S13: determine, from the left and right entropy of each candidate phrase, the probability that the candidate phrase is a word, and segment the to-be-processed document according to that probability.
In one embodiment, as shown in Fig. 4, step S13 may be implemented as step S131:
Step S131: segment the to-be-processed document, according to the left and right entropy of the candidate phrases, using a dynamic programming algorithm.
In this step, the document is segmented by dynamic programming. Let the to-be-processed document be S1..n. For any splitting scheme, define its profit as the sum of the word probabilities of all phrases in the split. Suppose S1..n is split as x1..3 x3..5 x5..6 … xn-2..n; the profit of this splitting scheme is then:

Profit(S1..n) = a(x1..3) + a(x3..5) + a(x5..6) + … + a(xn-2..n)

Dynamic programming finds the splitting scheme of maximum profit among all splitting schemes; direct brute-force enumeration is infeasible, because the total number of splitting schemes is as large as 2^n (where n is the document length). This is a fairly standard dynamic programming problem and is not described in further detail here.
In this embodiment, once the word probabilities have been computed, the document is segmented by dynamic programming, which attains the global optimum and yields better segmentation than forward or backward maximum matching.
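The dynamic program alluded to above runs in O(n · max_len) time: best[i] holds the maximum profit of segmenting the suffix starting at position i. A sketch under stated assumptions — the scoring function is passed in, and standard half-open slice indices are used rather than the patent's overlapping index notation:

```python
def segment(text, score, max_len=5):
    """Split `text` to maximize the summed word scores, by dynamic programming."""
    n = len(text)
    # best[i] = (max profit of text[i:], end index of the first word chosen)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (score(text[i:j]) + best[j][0], j)
            for j in range(i + 1, min(i + max_len, n) + 1)
        )
    # Walk the stored split points to recover the segmentation.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words
```

With a score function that rewards "ab" and "cd" and penalizes other multi-character strings, `segment("abcd", score)` recovers the split ["ab", "cd"], illustrating how the global optimum differs from a greedy left-to-right match.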
In the above method of this embodiment of the invention, adjacent characters in the to-be-processed document are combined in all possible ways to obtain candidate phrases; the left and right entropy of each candidate phrase is calculated; and the document is segmented according to the left and right entropy of the candidate phrases. The document can thus be segmented without a dictionary, uncommon words and neologisms are handled better, and segmentation is more accurate.
Based on the same inventive concept, an embodiment of the invention further provides a word segmentation apparatus. Since the principle by which this apparatus solves the problem is similar to that of the foregoing segmentation method, its implementation may refer to the implementation of the method, and repeated details are not described again.
Fig. 5 shows a block diagram of a word segmentation apparatus in an embodiment of the invention. As shown in Fig. 5, the apparatus comprises:
a combining module 51, configured to combine adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases;
a calculating module 52, configured to calculate the left entropy and right entropy of each candidate phrase;
a segmentation module 53, configured to determine, from the left and right entropy of a candidate phrase, the probability that the candidate phrase is a word, and to segment the to-be-processed document according to that probability.
In the above apparatus of this embodiment of the invention, adjacent characters in the to-be-processed document are combined in all possible ways to obtain candidate phrases; the left and right entropy of each candidate phrase is calculated; and the document is segmented according to the left and right entropy of the candidate phrases. The document can thus be segmented without a dictionary, uncommon words and neologisms are handled better, and segmentation is more accurate.
In one embodiment, as shown in Fig. 6, the apparatus may further comprise:
a correction module 54, configured to perform a correction operation on the left and right entropy of all candidate phrases.
In one embodiment, as shown in Fig. 7, the correction module 54 may comprise:
a first correction submodule 541, configured to apply a string-offset correction to the left and right entropy of all candidate phrases;
a second correction submodule 542, configured to standardize the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, as shown in Fig. 8, the segmentation module 53 may comprise:
a segmentation submodule 531, configured to segment the to-be-processed document, according to the left and right entropy of the candidate phrases, using a dynamic programming algorithm.
In one embodiment, the string-offset correction of the left and right entropy of a candidate phrase follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
Those skilled in the art will appreciate that embodiments of the invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The invention is described with reference to flow charts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flow charts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific way, such that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device, the instruction device realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a sequence of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, insofar as these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to encompass them as well.

Claims (12)

1. A word segmentation method, characterized by comprising:
combining adjacent words in a document to be processed in arbitrary combinations to obtain candidate phrases;
calculating a left entropy and a right entropy of all candidate phrases respectively; and
determining, according to the left entropy and the right entropy of the candidate phrases, a probability that each candidate phrase is a word, and performing word segmentation on the document to be processed according to the probability.
2. The method of claim 1, characterized in that, after calculating the left entropy and the right entropy of all candidate phrases respectively, the method further comprises:
performing a correction operation on the left entropy and the right entropy of all candidate phrases.
3. The method of claim 2, characterized in that performing the correction operation on the left entropy and the right entropy of all candidate phrases comprises:
performing word-string offset correction on the left entropy and the right entropy of all candidate phrases; and
performing data standardization on the corrected left entropy and right entropy of all candidate phrases according to the length of each candidate phrase.
4. The method of any one of claims 1-3, characterized in that performing word segmentation on the document to be processed according to the probability comprises:
performing word segmentation on the document to be processed according to the probability using a dynamic programming algorithm.
5. The method of claim 3, characterized in that word-string offset correction is performed on the left entropy and the right entropy of a candidate phrase according to the following formulas:
VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)
wherein RE(x1..n) is the right entropy of a candidate phrase of length n words, RE(x1..n-1) is the right entropy of the candidate phrase of length n-1 words, and VRE(x1..n) is the result of smoothing the right entropy of the candidate phrase of length n words; LE(x1..n) is the left entropy of a candidate phrase of length n words, LE(x2..n) is the left entropy of the candidate phrase of length n-1 words, and VLE(x1..n) is the result of smoothing the left entropy of the candidate phrase of length n words.
6. The method of claim 3, characterized in that data standardization is performed on the corrected left entropy and right entropy of a candidate phrase according to the following formulas:
nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation
wherein VRE(x1..n) is the result of smoothing the right entropy of a candidate phrase of length n words, RV(x1..n) is the mean value of the right entropy of all candidate phrases of length n words, and nVRE(x1..n) is the new data obtained by performing data standardization on VRE(x1..n); VLE(x1..n) is the result of smoothing the left entropy of a candidate phrase of length n words, LV(x1..n) is the mean value of the left entropy of all candidate phrases of length n words, and nVLE(x1..n) is the new data obtained by performing data standardization on VLE(x1..n).
7. A word segmentation apparatus, characterized by comprising:
a combination module, configured to combine adjacent words in a document to be processed in arbitrary combinations to obtain candidate phrases;
a calculation module, configured to calculate a left entropy and a right entropy of all candidate phrases respectively; and
a word segmentation module, configured to determine, according to the left entropy and the right entropy of the candidate phrases, a probability that each candidate phrase is a word, and to perform word segmentation on the document to be processed according to the probability.
8. The apparatus of claim 7, characterized in that the apparatus further comprises:
a correction module, configured to perform a correction operation on the left entropy and the right entropy of all candidate phrases.
9. The apparatus of claim 8, characterized in that the correction module comprises:
a first correction submodule, configured to perform word-string offset correction on the left entropy and the right entropy of all candidate phrases; and
a second correction submodule, configured to perform data standardization on the corrected left entropy and right entropy of all candidate phrases according to the length of each candidate phrase.
10. The apparatus of any one of claims 7-9, characterized in that the word segmentation module comprises:
a word segmentation submodule, configured to perform word segmentation on the document to be processed using a dynamic programming algorithm according to the left entropy and the right entropy of the candidate phrases.
11. The apparatus of claim 9, characterized in that word-string offset correction is performed on the left entropy and the right entropy of a candidate phrase according to the following formulas:
VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)
wherein RE(x1..n) is the right entropy of a candidate phrase of length n words, RE(x1..n-1) is the right entropy of the candidate phrase of length n-1 words, and VRE(x1..n) is the result of smoothing the right entropy of the candidate phrase of length n words; LE(x1..n) is the left entropy of a candidate phrase of length n words, LE(x2..n) is the left entropy of the candidate phrase of length n-1 words, and VLE(x1..n) is the result of smoothing the left entropy of the candidate phrase of length n words.
12. The apparatus of claim 9, characterized in that data standardization is performed on the corrected left entropy and right entropy of a candidate phrase according to the following formulas:
nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation
wherein VRE(x1..n) is the result of smoothing the right entropy of a candidate phrase of length n words, RV(x1..n) is the mean value of the right entropy of all candidate phrases of length n words, and nVRE(x1..n) is the new data obtained by performing data standardization on VRE(x1..n); VLE(x1..n) is the result of smoothing the left entropy of a candidate phrase of length n words, LV(x1..n) is the mean value of the left entropy of all candidate phrases of length n words, and nVLE(x1..n) is the new data obtained by performing data standardization on VLE(x1..n).
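Taken together, claims 1-6 describe an unsupervised pipeline: enumerate candidate phrases from adjacent units, score each by the entropy of its left and right neighbors, smooth those entropies by word-string offset correction, z-score them within each phrase length, and segment the document by dynamic programming. The following Python sketch illustrates that pipeline; it is an illustration under simplifying assumptions, not the patented implementation. Candidates here are character n-grams, the word score used for segmentation is simply a sum of the two corrected entropies, and all function names are invented for this example.

```python
import math
from collections import Counter, defaultdict

def neighbor_entropy(counts):
    """Shannon entropy of a neighbor-character frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def candidate_entropies(text, max_len=4):
    """Enumerate every substring up to max_len as a candidate phrase and
    compute its left/right entropy from the characters adjacent to each
    occurrence (claims 1-2)."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n > len(text):
                break
            cand = text[i:i + n]
            if i > 0:
                left[cand][text[i - 1]] += 1
            if i + n < len(text):
                right[cand][text[i + n]] += 1
    return ({c: neighbor_entropy(d) for c, d in left.items()},
            {c: neighbor_entropy(d) for c, d in right.items()})

def offset_correct(le, re):
    """Word-string offset correction (claim 5):
    VRE(x1..n) = RE(x1..n) - RE(x1..n-1); VLE(x1..n) = LE(x1..n) - LE(x2..n)."""
    vle, vre = {}, {}
    for cand in set(le) | set(re):
        if len(cand) < 2:
            continue
        vre[cand] = re.get(cand, 0.0) - re.get(cand[:-1], 0.0)
        vle[cand] = le.get(cand, 0.0) - le.get(cand[1:], 0.0)
    return vle, vre

def standardize_by_length(scores):
    """Z-score each corrected entropy against all candidates of the same
    length (claim 6); a zero standard deviation falls back to 1."""
    by_len = defaultdict(list)
    for cand, v in scores.items():
        by_len[len(cand)].append(v)
    stats = {}
    for n, vals in by_len.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[n] = (mean, std or 1.0)
    return {c: (v - stats[len(c)][0]) / stats[len(c)][1]
            for c, v in scores.items()}

def segment(text, word_score, max_len=4):
    """Dynamic-programming segmentation (claim 4): choose the split of the
    document that maximizes the summed word scores; unknown single
    characters score 0."""
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for i in range(1, len(text) + 1):
        for n in range(1, min(max_len, i) + 1):
            s = best[i - n][0] + word_score.get(text[i - n:i], 0.0)
            if s > best[i][0]:
                best[i] = (s, i - n)
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]
```

In practice the left and right entropies would be combined into a single word probability before the dynamic-programming step; how the patent weights the two values is not specified in the claims, so the additive combination above is only one plausible choice.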
CN201610339657.XA 2016-05-19 2016-05-19 Word segmentation method and apparatus Active CN106021230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610339657.XA CN106021230B (en) 2016-05-19 2016-05-19 Word segmentation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610339657.XA CN106021230B (en) 2016-05-19 2016-05-19 Word segmentation method and apparatus

Publications (2)

Publication Number Publication Date
CN106021230A true CN106021230A (en) 2016-10-12
CN106021230B CN106021230B (en) 2018-11-23

Family

ID=57096747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610339657.XA Active CN106021230B (en) Word segmentation method and apparatus

Country Status (1)

Country Link
CN (1) CN106021230B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415773A (en) * 2018-02-27 2018-08-17 天津大学 An efficient hardware/software partitioning method based on a fusion algorithm
WO2019023911A1 (en) * 2017-07-31 2019-02-07 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting text
TWI665567B (en) * 2018-09-26 2019-07-11 華碩電腦股份有限公司 Semantic processing method, electronic device, and non-transitory computer readable storage medium
CN110955748A (en) * 2018-09-26 2020-04-03 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
CN111898010A (en) * 2020-07-10 2020-11-06 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
TWI709927B (en) * 2017-12-06 2020-11-11 開曼群島商創新先進技術有限公司 Method and device for determining target user group
CN115034211A (en) * 2022-05-19 2022-09-09 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622341A (en) * 2012-04-20 2012-08-01 北京邮电大学 Domain ontology concept automatic-acquisition method based on Bootstrapping technology
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN104102658A (en) * 2013-04-09 2014-10-15 腾讯科技(深圳)有限公司 Method and device for mining text contents


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
G. CHAU et al.: "One Channel Subvocal Speech Phrases Recognition Using Cumulative Residual Entropy and Support Vector Machines", IEEE Latin America Transactions *
ZHANG LIBANG et al.: "Word Segmentation of Chinese Electronic Medical Records Based on Unsupervised Learning", Intelligent Computer and Applications *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019023911A1 (en) * 2017-07-31 2019-02-07 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting text
CN110998589B (en) * 2017-07-31 2023-06-27 北京嘀嘀无限科技发展有限公司 System and method for segmenting text
CN110998589A (en) * 2017-07-31 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for segmenting text
TWI713870B (en) * 2017-07-31 2020-12-21 大陸商北京嘀嘀無限科技發展有限公司 System and method for segmenting a text
TWI709927B (en) * 2017-12-06 2020-11-11 開曼群島商創新先進技術有限公司 Method and device for determining target user group
CN108415773A (en) * 2018-02-27 2018-08-17 天津大学 An efficient hardware/software partitioning method based on a fusion algorithm
CN110955748A (en) * 2018-09-26 2020-04-03 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
CN110955748B (en) * 2018-09-26 2022-10-28 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
TWI665567B (en) * 2018-09-26 2019-07-11 華碩電腦股份有限公司 Semantic processing method, electronic device, and non-transitory computer readable storage medium
CN111898010A (en) * 2020-07-10 2020-11-06 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN111898010B (en) * 2020-07-10 2024-09-13 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN115034211A (en) * 2022-05-19 2022-09-09 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium
CN115034211B (en) * 2022-05-19 2023-04-18 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106021230B (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN106021230A (en) Word segmentation method and word segmentation apparatus
McDonald et al. Structured models for fine-to-coarse sentiment analysis
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
Balage Filho et al. NILC_USP: An improved hybrid system for sentiment analysis in twitter messages
WO2020187168A1 (en) Resume pushing method and apparatus, and task pushing method and apparatus
Bravo-Marquez et al. Positive, negative, or neutral: Learning an expanded opinion lexicon from emoticon-annotated tweets
Bartoli et al. An author verification approach based on differential features
WO2013118435A1 (en) Semantic similarity level computation method, system and program
US9262400B2 (en) Non-transitory computer readable medium and information processing apparatus and method for classifying multilingual documents
Agarwal et al. Frame semantic tree kernels for social network extraction from text
CN104142912A (en) Accurate corpus category marking method and device
Bilgin et al. Sentiment analysis with term weighting and word vectors
Zhai et al. Identifying evaluative sentences in online discussions
Shen et al. semipqa: A study on product question answering over semi-structured data
Nabil et al. Cufe at semeval-2016 task 4: A gated recurrent model for sentiment classification
CN106980898A (en) Deep learning system and its application method
CN107301426A (en) A kind of multi-tag clustering method of shoe sole print image
Sudhakaran et al. Classifying product reviews from balanced datasets for Sentiment Analysis and Opinion Mining
Agirre et al. Ubc: Cubes for english semantic textual similarity and supervised approaches for interpretable sts
El-Alami et al. Word sense representation based-method for Arabic text categorization
CN110990537A (en) Sentence similarity calculation method based on edge information and semantic information
Sulistyono et al. Sentiment Analysis on Social Media (Twitter) about Vaccine-19 Using Support Vector Machine Algorithm
JP2016152032A (en) Difficulty estimation model learning device, and device, method and program for estimating difficulty
CN109189932B (en) Text classification method and device and computer-readable storage medium
JP6368633B2 (en) Term meaning learning device, term meaning judging device, method, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant