CN106021230A - Word segmentation method and word segmentation apparatus - Google Patents


Info

Publication number
CN106021230A
Authority
CN
China
Prior art keywords
entropy
candidate phrase
word
length
carry out
Prior art date
Legal status
Granted
Application number
CN201610339657.XA
Other languages
Chinese (zh)
Other versions
CN106021230B (en)
Inventor
高云翔
Current Assignee
Wireless Living (hangzhou) Mdt Infotech Ltd
Original Assignee
Wireless Living (hangzhou) Mdt Infotech Ltd
Priority date
Filing date
Publication date
Application filed by Wireless Living (hangzhou) Mdt Infotech Ltd filed Critical Wireless Living (hangzhou) Mdt Infotech Ltd
Priority to CN201610339657.XA priority Critical patent/CN106021230B/en
Publication of CN106021230A publication Critical patent/CN106021230A/en
Application granted granted Critical
Publication of CN106021230B publication Critical patent/CN106021230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmentation method and a word segmentation apparatus, and relates to the technical field of data mining. The method comprises the steps of: combining adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases; calculating the left entropy and the right entropy of each candidate phrase; and determining, from the left entropy and the right entropy of a candidate phrase, the probability that the candidate phrase is a word, and segmenting the to-be-processed document according to that probability. The method and apparatus can segment a document without a lexicon and handle neologisms and uncommon words better, and thus segment more accurately.

Description

Word segmentation method and apparatus
Technical field
The present invention relates to the field of data mining technology, and in particular to a word segmentation method and apparatus.
Background technology
Traditional word segmentation methods are dictionary-based: the general approach is maximum matching between the words in a dictionary and the document (forward maximum matching, reverse maximum matching, bidirectional maximum matching). Such methods are suitable for segmenting conventional documents, but on e-commerce platforms the product descriptions contain large numbers of brand words, function words, neologisms and other uncommon words, and traditional segmentation methods cannot handle such vocabulary.
In addition, traditional segmentation methods simply use a greedy idea, performing forward or reverse maximum matching, and the matching result is not globally optimal. They are fast, but the quality is poor. In some application domains, segmentation speed is not critical, while segmentation quality is subject to higher requirements.
Furthermore, traditional entropy-based dictionary-construction methods are of limited effect: information entropy alone cannot represent the probability that a string is a word.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a word segmentation method and apparatus that overcome, or at least partially solve, the problems described above.
The present invention provides a word segmentation method, comprising:
combining adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases;
calculating the left entropy and right entropy of each candidate phrase;
determining, from the left entropy and right entropy of a candidate phrase, the probability that the candidate phrase is a word, and segmenting the to-be-processed document according to that probability.
In one embodiment, after calculating the left and right entropy of all candidate phrases, the method may further comprise:
performing a correction operation on the left and right entropy of all the candidate phrases.
In one embodiment, the correction operation on the left and right entropy of all the candidate phrases may comprise:
applying a string-offset correction to the left and right entropy of all the candidate phrases;
standardizing the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, segmenting the to-be-processed document according to the probability may comprise:
segmenting the to-be-processed document according to the probability using a dynamic programming algorithm.
In one embodiment, the string-offset correction of the left and right entropy of a candidate phrase follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
The present invention also provides a word segmentation apparatus, comprising:
a combining module, configured to combine adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases;
a calculating module, configured to calculate the left entropy and right entropy of each candidate phrase;
a segmentation module, configured to determine, from the left entropy and right entropy of a candidate phrase, the probability that the candidate phrase is a word, and to segment the to-be-processed document according to that probability.
In one embodiment, the apparatus may further comprise:
a correction module, configured to perform a correction operation on the left and right entropy of all the candidate phrases.
In one embodiment, the correction module may comprise:
a first correction submodule, configured to apply a string-offset correction to the left and right entropy of all the candidate phrases;
a second correction submodule, configured to standardize the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, the segmentation module may comprise:
a segmentation submodule, configured to segment the to-be-processed document, according to the left and right entropy of the candidate phrases, using a dynamic programming algorithm.
In one embodiment, the string-offset correction of the left and right entropy of a candidate phrase follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
The technical solutions provided by embodiments of the invention may have the following beneficial effects:
adjacent characters in the to-be-processed document are combined in all possible ways to obtain candidate phrases, the left and right entropy of each candidate phrase is calculated, and the document is segmented according to the left and right entropy of the candidate phrases. The document can thus be segmented without a dictionary, uncommon words and neologisms are handled better, and segmentation is more accurate.
Other features and advantages of the invention will be set out in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention may be realized and obtained by the structures particularly pointed out in the written description, the claims and the accompanying drawings.
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
Accompanying drawing explanation
The accompanying drawings provide a further understanding of the invention and constitute a part of the description; together with the embodiments, they serve to explain the invention and are not to be construed as limiting it. In the drawings:
Fig. 1 is a flow chart of a word segmentation method in an embodiment of the invention;
Fig. 2 is a flow chart of another word segmentation method in an embodiment of the invention;
Fig. 3 is a flow chart of step S14 of a word segmentation method in an embodiment of the invention;
Fig. 4 is a flow chart of another word segmentation method in an embodiment of the invention;
Fig. 5 is a block diagram of a word segmentation apparatus in an embodiment of the invention;
Fig. 6 is a block diagram of another word segmentation apparatus in an embodiment of the invention;
Fig. 7 is a block diagram of the correction module 54 of a word segmentation apparatus in an embodiment of the invention;
Fig. 8 is a block diagram of the segmentation module 53 of a word segmentation apparatus in an embodiment of the invention.
Detailed description of the invention
Preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described herein are intended only to illustrate and explain the invention, not to limit it.
In the prior art, traditional segmentation methods cannot handle the brand words, function words, neologisms and other uncommon words found in the product descriptions of e-commerce platforms. For example, in a product title such as "foreign-trade original POLLIWALKS clogs children's shoes garden shoes children's sandals slippers genuine", the uncommon term for "clogs" (洞洞鞋) is not contained in common dictionaries.
Fig. 1 shows a flow chart of a word segmentation method in an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps S11-S13:
Step S11: combine adjacent characters in the to-be-processed document in all possible ways to obtain candidate phrases.
The probability that a string is a word is estimated from the to-be-processed document in an unsupervised manner. That is, all adjacent characters in the document are combined, enumerating every possible candidate phrase. In this step, a maximum word length should be set first.
For example, assume the maximum word length is five and the to-be-processed document is "儿童凉鞋拖鞋正品" ("children's sandals slippers genuine"). All possible phrases are enumerated (see Table 1):
Table 1: possible phrases in the to-be-processed document

儿童          童凉          凉鞋          鞋拖
儿童凉        童凉鞋        凉鞋拖        拖鞋
儿童凉鞋      童凉鞋拖      凉鞋拖鞋      拖鞋正
儿童凉鞋拖    童凉鞋拖鞋    凉鞋拖鞋正    拖鞋正品

The 16 phrases enumerated in Table 1 are taken as candidate phrases. Only some of these 16 phrases are real "words". A human reader recognizes them at a glance, but a computer does not; with a large corpus and suitable algorithmic rules, the computer can also be made to identify which phrases are real words.
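The enumeration in step S11 can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions, not the patent's implementation: the function name and the default maximum length are the author's choices, and a full enumeration also includes a few substrings that Table 1 omits.

```python
def enumerate_candidates(text, max_len=5):
    """Combine adjacent characters into all substrings of length 2..max_len."""
    candidates = []
    for start in range(len(text)):
        for length in range(2, max_len + 1):
            if start + length <= len(text):
                candidates.append(text[start:start + length])
    return candidates

# For the 8-character example title this yields every length-2..5 substring,
# a superset of the 16 phrases shown in Table 1.
cands = enumerate_candidates("儿童凉鞋拖鞋正品")
```

Each candidate is then scored by the entropy-based measures described below; the enumeration itself makes no judgment about wordhood.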
Step S12: calculate the left entropy and right entropy of each candidate phrase.
All possible phrases were taken as candidate phrases in step S11; in this step, the left and right entropy of each of them is calculated.
For example, for each of the 16 candidate phrases above, compute its left entropy and right entropy, e.g. the left and right entropy of the candidate phrase "童凉鞋拖".
Information entropy is a measure of randomness: the larger the entropy, the more random the distribution. If a state occurs with probability p, its information content on occurring is -log(p); the expected information content over all states is the total entropy:

entropy = -Σ P(i) * log P(i)
Suppose the characters appearing immediately to the right of the phrase "儿童" ("child") in a corpus are 服 ("clothing"), 乐 ("fun") and 食 ("food") — e.g. in documents beginning "儿童服装…", "儿童乐园…", "儿童食品…" — with counts 2, 1 and 1. Its right entropy is then:

right entropy of "儿童" = -(2/4)*log(2/4) - (1/4)*log(1/4) - (1/4)*log(1/4) > 0

If the three documents instead begin "连衣裙红色…" ("one-piece dress, red…"), "连衣裙包邮…" ("one-piece dress, free shipping…") and "连衣裙代购…" ("one-piece dress, purchasing agent…"), then only "裙" appears to the right of "连衣", three times, and its right entropy becomes:

right entropy of "连衣" = -log(3/3) * (3/3) = 0

It can be seen that the more random the characters appearing to the right of a phrase, the larger its right entropy; by the same reasoning, the left entropy behaves identically. "儿童" is a real word, so the characters to its right are relatively random and its entropy is high; "连衣" tends not to occur as a word on its own, so the character to its right is essentially fixed and its entropy is small.
In other words, the more random the characters on both sides of a phrase, the more likely the phrase is a word: the larger a phrase's left and right entropy, the higher the probability that it is actually a word, because the characters surrounding a real word tend to be random.
To formalize the above: assume the phrase x1..n is an n-tuple of n characters:

x1..n = x1 x2 … xn

and let χr denote the set of all characters that appear to the right of x1..n. The right entropy RE (Right Entropy) of the phrase is then:

RE(x1..n) = -Σ_{x∈χr} P(x | x1..n) * log P(x | x1..n)

Similarly, with χl the set of all characters that appear to the left of x1..n, the left entropy LE (Left Entropy) is:

LE(x1..n) = -Σ_{x∈χl} P(x | x1..n) * log P(x | x1..n)
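Under the definitions above, the left and right entropy can be computed directly from corpus counts. The sketch below is illustrative only — the function name and the naive substring scan are assumptions, not the patent's implementation:

```python
from collections import Counter
from math import log

def side_entropy(corpus, phrase, side="right"):
    """RE/LE of `phrase`: entropy of the character distribution beside it."""
    neighbors = Counter()
    for doc in corpus:
        start = doc.find(phrase)
        while start != -1:
            if side == "right" and start + len(phrase) < len(doc):
                neighbors[doc[start + len(phrase)]] += 1   # character just after
            elif side == "left" and start > 0:
                neighbors[doc[start - 1]] += 1             # character just before
            start = doc.find(phrase, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in neighbors.values())

# The '连衣' example from the description:
docs = ["连衣裙红色", "连衣裙包邮", "连衣裙代购"]
re_lianyi = side_entropy(docs, "连衣")       # 0: only '裙' ever follows it
re_lianyiqun = side_entropy(docs, "连衣裙")  # log(3): three distinct followers
```

The two computed values reproduce the contrast drawn in the text: a fixed right context gives zero entropy, while three equally likely right characters give log(3).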
In one embodiment, as shown in Fig. 2, the method may further comprise, after step S12, a step S14:
Step S14: perform a correction operation on the left and right entropy of all candidate phrases.
Since raw information entropy alone cannot adequately represent the probability that a phrase is a word, in this embodiment the entropy is optimized in two steps before the word probability is computed.
In step S12 the left and right entropy of each candidate phrase were calculated, but in practical computation it turns out that the raw left and right entropy of a phrase are not sufficient to measure its word probability. The entropy is therefore smoothed in two steps, as follows.
In one embodiment, as shown in Fig. 3, step S14 may be implemented as the following steps S141-S142:
Step S141: apply a string-offset correction to the left and right entropy of all candidate phrases.
In one embodiment, the string-offset correction follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
For example, the right entropy of "连衣裙" is large while the right entropy of "连衣" is very small, so their difference is a large value that well represents the probability that "连衣裙" is actually a word.
This first smoothing step yields the corrected entropies VRE(x1..n) and VLE(x1..n).
In this embodiment, the string-offset correction of the entropy makes the corrected data a better representation of word probability.
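The first smoothing step is a direct translation of the two formulas. A small sketch, under the assumption that raw RE/LE values are already available in dictionaries keyed by phrase (the toy numbers below are illustrative, echoing the "连衣裙" discussion):

```python
def string_offset_correct(phrase, re_vals, le_vals):
    """VRE = RE(x1..n) - RE(x1..n-1); VLE = LE(x1..n) - LE(x2..n)."""
    vre = re_vals[phrase] - re_vals[phrase[:-1]]  # drop last char: the prefix
    vle = le_vals[phrase] - le_vals[phrase[1:]]   # drop first char: the suffix
    return vle, vre

# Toy numbers: RE('连衣裙') is large, RE('连衣') is 0, so VRE('连衣裙') stays large.
re_vals = {"连衣裙": 1.10, "连衣": 0.0}
le_vals = {"连衣裙": 0.95, "衣裙": 0.20}
vle, vre = string_offset_correct("连衣裙", re_vals, le_vals)
```

A real word keeps a large corrected value because its shortened forms have much lower entropy; a non-word such as "连衣" is penalized.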
Step S142: standardize the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
After the first smoothing step, it is observed that phrases with fewer characters tend to have systematically higher VLE and VRE, so a further standardization step is needed. The phrases are first grouped by length; the VLE and VRE of each phrase then have the mean of the phrase's length group subtracted and are divided by the group's standard deviation, yielding nVLE and nVRE. After this operation, the values within each length group follow a standard normal distribution.
Finally, the probability a(x1..n) that a phrase is a word is defined as:

a(x1..n) = min( nVLE(x1..n), nVRE(x1..n) )

In this embodiment, because phrases with fewer characters have systematically higher VLE and VRE, the values are normalized by length group; after the two correction steps, the new data better represents word probability.
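The second step — grouping by phrase length and z-scoring within each group — and the final score a(x) can be sketched as follows. Function names are illustrative; the use of the population standard deviation (`pstdev`) is an assumption, since the patent does not specify which variant it intends:

```python
from statistics import mean, pstdev

def standardize_by_length(values):
    """Z-score each phrase's value within its own length group (nVRE / nVLE)."""
    groups = {}
    for phrase, v in values.items():
        groups.setdefault(len(phrase), []).append(v)
    stats = {n: (mean(vs), pstdev(vs)) for n, vs in groups.items()}
    out = {}
    for phrase, v in values.items():
        mu, sigma = stats[len(phrase)]
        out[phrase] = (v - mu) / sigma if sigma else 0.0  # guard degenerate groups
    return out

def word_score(nvle, nvre):
    """a(x1..n) = min(nVLE(x1..n), nVRE(x1..n))."""
    return min(nvle, nvre)
```

Taking the minimum of the two standardized entropies means a phrase only scores well when both its left and right contexts are random, matching the intuition that a real word is free on both sides.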
Step S13: determine, from the left and right entropy of each candidate phrase, the probability that the candidate phrase is a word, and segment the to-be-processed document according to that probability.
In one embodiment, as shown in Fig. 4, step S13 may be implemented as step S131:
Step S131: segment the to-be-processed document, according to the left and right entropy of the candidate phrases, using a dynamic programming algorithm.
In this step, the document is segmented by dynamic programming. Let the to-be-processed document be S1..n. For any splitting scheme, define its profit as the sum of the word probabilities of all phrases in the split. Suppose S1..n is split as x1..3 x3..5 x5..6 … xn-2..n; the profit of this splitting scheme is then:

Profit(S1..n) = a(x1..3) + a(x3..5) + a(x5..6) + … + a(xn-2..n)

Dynamic programming finds the splitting scheme of maximum profit among all splitting schemes; direct brute-force enumeration is infeasible, because the total number of splitting schemes is as large as 2^n (where n is the document length). This is a fairly standard dynamic programming problem and is not described in further detail here.
In this embodiment, once the word probabilities have been computed, the document is segmented by dynamic programming, which attains the global optimum and yields better segmentation than forward or backward maximum matching.
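The dynamic program alluded to above runs in O(n · max_len) time: best[i] holds the maximum profit of segmenting the suffix starting at position i. A sketch under stated assumptions — the scoring function is passed in, and standard half-open slice indices are used rather than the patent's overlapping index notation:

```python
def segment(text, score, max_len=5):
    """Split `text` to maximize the summed word scores, by dynamic programming."""
    n = len(text)
    # best[i] = (max profit of text[i:], end index of the first word chosen)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (score(text[i:j]) + best[j][0], j)
            for j in range(i + 1, min(i + max_len, n) + 1)
        )
    # Walk the stored split points to recover the segmentation.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words
```

With a score function that rewards "ab" and "cd" and penalizes other multi-character strings, `segment("abcd", score)` recovers the split ["ab", "cd"], illustrating how the global optimum differs from a greedy left-to-right match.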
In the above method of this embodiment of the invention, adjacent characters in the to-be-processed document are combined in all possible ways to obtain candidate phrases; the left and right entropy of each candidate phrase is calculated; and the document is segmented according to the left and right entropy of the candidate phrases. The document can thus be segmented without a dictionary, uncommon words and neologisms are handled better, and segmentation is more accurate.
Based on the same inventive concept, an embodiment of the invention further provides a word segmentation apparatus. Since the principle by which this apparatus solves the problem is similar to that of the foregoing segmentation method, its implementation may refer to the implementation of the method, and repeated details are not described again.
Fig. 5 shows a block diagram of a word segmentation apparatus in an embodiment of the invention. As shown in Fig. 5, the apparatus comprises:
a combining module 51, configured to combine adjacent characters in a to-be-processed document in all possible ways to obtain candidate phrases;
a calculating module 52, configured to calculate the left entropy and right entropy of each candidate phrase;
a segmentation module 53, configured to determine, from the left and right entropy of a candidate phrase, the probability that the candidate phrase is a word, and to segment the to-be-processed document according to that probability.
In the above apparatus of this embodiment of the invention, adjacent characters in the to-be-processed document are combined in all possible ways to obtain candidate phrases; the left and right entropy of each candidate phrase is calculated; and the document is segmented according to the left and right entropy of the candidate phrases. The document can thus be segmented without a dictionary, uncommon words and neologisms are handled better, and segmentation is more accurate.
In one embodiment, as shown in Fig. 6, the apparatus may further comprise:
a correction module 54, configured to perform a correction operation on the left and right entropy of all candidate phrases.
In one embodiment, as shown in Fig. 7, the correction module 54 may comprise:
a first correction submodule 541, configured to apply a string-offset correction to the left and right entropy of all candidate phrases;
a second correction submodule 542, configured to standardize the corrected left and right entropy of all candidate phrases according to candidate-phrase length.
In one embodiment, as shown in Fig. 8, the segmentation module 53 may comprise:
a segmentation submodule 531, configured to segment the to-be-processed document, according to the left and right entropy of the candidate phrases, using a dynamic programming algorithm.
In one embodiment, the string-offset correction of the left and right entropy of a candidate phrase follows:

VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)

where RE(x1..n) is the right entropy of a candidate phrase of length n, RE(x1..n-1) is the right entropy of its length-(n-1) prefix, and VRE(x1..n) is the result of smoothing the right entropy of the length-n candidate phrase; LE(x1..n) is the left entropy of a candidate phrase of length n, LE(x2..n) is the left entropy of its length-(n-1) suffix, and VLE(x1..n) is the result of smoothing the left entropy of the length-n candidate phrase.
In one embodiment, the corrected left and right entropy of a candidate phrase are standardized according to:

nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation

where VRE(x1..n) is the smoothed right entropy of a candidate phrase of length n, RV(x1..n) is the mean of the smoothed right entropy over all candidate phrases of length n, and nVRE(x1..n) is the new value obtained by standardizing VRE(x1..n); VLE(x1..n) is the smoothed left entropy of a candidate phrase of length n, LV(x1..n) is the mean of the smoothed left entropy over all candidate phrases of length n, and nVLE(x1..n) is the new value obtained by standardizing VLE(x1..n); the standard deviation is taken over the same length group.
Those skilled in the art will appreciate that embodiments of the invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The invention is described with reference to flow charts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flow charts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific way, such that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device, the instruction device realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a sequence of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, insofar as these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to encompass them as well.

Claims (12)

1. A word segmentation method, characterized by comprising:
combining adjacent words in a document to be processed in arbitrary combinations to obtain candidate phrases;
calculating a left entropy and a right entropy of all candidate phrases respectively; and
determining, according to the left entropy and the right entropy of the candidate phrases, a probability that each candidate phrase is a word, and performing word segmentation on the document to be processed according to the probability.
2. The method of claim 1, characterized in that, after calculating the left entropy and the right entropy of all candidate phrases respectively, the method further comprises:
performing a correction operation on the left entropy and the right entropy of all candidate phrases.
3. The method of claim 2, characterized in that performing the correction operation on the left entropy and the right entropy of all candidate phrases comprises:
performing word-string offset correction on the left entropy and the right entropy of all candidate phrases; and
performing data standardization on the corrected left entropy and right entropy of all candidate phrases according to the length of each candidate phrase.
4. The method of any one of claims 1-3, characterized in that performing word segmentation on the document to be processed according to the probability comprises:
performing word segmentation on the document to be processed according to the probability using a dynamic programming algorithm.
5. The method of claim 3, characterized in that word-string offset correction is performed on the left entropy and the right entropy of a candidate phrase according to the following formulas:
VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)
wherein RE(x1..n) is the right entropy of a candidate phrase of length n words, RE(x1..n-1) is the right entropy of the candidate phrase of length n-1 words, and VRE(x1..n) is the result of smoothing the right entropy of the candidate phrase of length n words; LE(x1..n) is the left entropy of a candidate phrase of length n words, LE(x2..n) is the left entropy of the candidate phrase of length n-1 words, and VLE(x1..n) is the result of smoothing the left entropy of the candidate phrase of length n words.
6. The method of claim 3, characterized in that data standardization is performed on the corrected left entropy and right entropy of a candidate phrase according to the following formulas:
nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation
wherein VRE(x1..n) is the result of smoothing the right entropy of a candidate phrase of length n words, RV(x1..n) is the mean value of the right entropy of all candidate phrases of length n words, and nVRE(x1..n) is the new data obtained by performing data standardization on VRE(x1..n); VLE(x1..n) is the result of smoothing the left entropy of a candidate phrase of length n words, LV(x1..n) is the mean value of the left entropy of all candidate phrases of length n words, and nVLE(x1..n) is the new data obtained by performing data standardization on VLE(x1..n).
7. A word segmentation apparatus, characterized by comprising:
a combination module, configured to combine adjacent words in a document to be processed in arbitrary combinations to obtain candidate phrases;
a calculation module, configured to calculate a left entropy and a right entropy of all candidate phrases respectively; and
a word segmentation module, configured to determine, according to the left entropy and the right entropy of the candidate phrases, a probability that each candidate phrase is a word, and to perform word segmentation on the document to be processed according to the probability.
8. The apparatus of claim 7, characterized in that the apparatus further comprises:
a correction module, configured to perform a correction operation on the left entropy and the right entropy of all candidate phrases.
9. The apparatus of claim 8, characterized in that the correction module comprises:
a first correction submodule, configured to perform word-string offset correction on the left entropy and the right entropy of all candidate phrases; and
a second correction submodule, configured to perform data standardization on the corrected left entropy and right entropy of all candidate phrases according to the length of each candidate phrase.
10. The apparatus of any one of claims 7-9, characterized in that the word segmentation module comprises:
a word segmentation submodule, configured to perform word segmentation on the document to be processed using a dynamic programming algorithm according to the left entropy and the right entropy of the candidate phrases.
11. The apparatus of claim 9, characterized in that word-string offset correction is performed on the left entropy and the right entropy of a candidate phrase according to the following formulas:
VRE(x1..n) = RE(x1..n) - RE(x1..n-1)
VLE(x1..n) = LE(x1..n) - LE(x2..n)
wherein RE(x1..n) is the right entropy of a candidate phrase of length n words, RE(x1..n-1) is the right entropy of the candidate phrase of length n-1 words, and VRE(x1..n) is the result of smoothing the right entropy of the candidate phrase of length n words; LE(x1..n) is the left entropy of a candidate phrase of length n words, LE(x2..n) is the left entropy of the candidate phrase of length n-1 words, and VLE(x1..n) is the result of smoothing the left entropy of the candidate phrase of length n words.
12. The apparatus of claim 9, characterized in that data standardization is performed on the corrected left entropy and right entropy of a candidate phrase according to the following formulas:
nVRE(x1..n) = (VRE(x1..n) - RV(x1..n)) / standard deviation
nVLE(x1..n) = (VLE(x1..n) - LV(x1..n)) / standard deviation
wherein VRE(x1..n) is the result of smoothing the right entropy of a candidate phrase of length n words, RV(x1..n) is the mean value of the right entropy of all candidate phrases of length n words, and nVRE(x1..n) is the new data obtained by performing data standardization on VRE(x1..n); VLE(x1..n) is the result of smoothing the left entropy of a candidate phrase of length n words, LV(x1..n) is the mean value of the left entropy of all candidate phrases of length n words, and nVLE(x1..n) is the new data obtained by performing data standardization on VLE(x1..n).
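Taken together, claims 1-6 describe an unsupervised pipeline: enumerate candidate phrases from adjacent units, score each by the entropy of its left and right neighbors, smooth those entropies by word-string offset correction, z-score them within each phrase length, and segment the document by dynamic programming. The following Python sketch illustrates that pipeline; it is an illustration under simplifying assumptions, not the patented implementation. Candidates here are character n-grams, the word score used for segmentation is simply a sum of the two corrected entropies, and all function names are invented for this example.

```python
import math
from collections import Counter, defaultdict

def neighbor_entropy(counts):
    """Shannon entropy of a neighbor-character frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def candidate_entropies(text, max_len=4):
    """Enumerate every substring up to max_len as a candidate phrase and
    compute its left/right entropy from the characters adjacent to each
    occurrence (claims 1-2)."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n > len(text):
                break
            cand = text[i:i + n]
            if i > 0:
                left[cand][text[i - 1]] += 1
            if i + n < len(text):
                right[cand][text[i + n]] += 1
    return ({c: neighbor_entropy(d) for c, d in left.items()},
            {c: neighbor_entropy(d) for c, d in right.items()})

def offset_correct(le, re):
    """Word-string offset correction (claim 5):
    VRE(x1..n) = RE(x1..n) - RE(x1..n-1); VLE(x1..n) = LE(x1..n) - LE(x2..n)."""
    vle, vre = {}, {}
    for cand in set(le) | set(re):
        if len(cand) < 2:
            continue
        vre[cand] = re.get(cand, 0.0) - re.get(cand[:-1], 0.0)
        vle[cand] = le.get(cand, 0.0) - le.get(cand[1:], 0.0)
    return vle, vre

def standardize_by_length(scores):
    """Z-score each corrected entropy against all candidates of the same
    length (claim 6); a zero standard deviation falls back to 1."""
    by_len = defaultdict(list)
    for cand, v in scores.items():
        by_len[len(cand)].append(v)
    stats = {}
    for n, vals in by_len.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[n] = (mean, std or 1.0)
    return {c: (v - stats[len(c)][0]) / stats[len(c)][1]
            for c, v in scores.items()}

def segment(text, word_score, max_len=4):
    """Dynamic-programming segmentation (claim 4): choose the split of the
    document that maximizes the summed word scores; unknown single
    characters score 0."""
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for i in range(1, len(text) + 1):
        for n in range(1, min(max_len, i) + 1):
            s = best[i - n][0] + word_score.get(text[i - n:i], 0.0)
            if s > best[i][0]:
                best[i] = (s, i - n)
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]
```

In practice the left and right entropies would be combined into a single word probability before the dynamic-programming step; how the patent weights the two values is not specified in the claims, so the additive combination above is only one plausible choice.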
CN201610339657.XA 2016-05-19 2016-05-19 Word segmentation method and apparatus Active CN106021230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610339657.XA CN106021230B (en) 2016-05-19 2016-05-19 Word segmentation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610339657.XA CN106021230B (en) 2016-05-19 2016-05-19 Word segmentation method and apparatus

Publications (2)

Publication Number Publication Date
CN106021230A true CN106021230A (en) 2016-10-12
CN106021230B CN106021230B (en) 2018-11-23

Family

ID=57096747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610339657.XA Active CN106021230B (en) Word segmentation method and apparatus

Country Status (1)

Country Link
CN (1) CN106021230B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415773A (en) * 2018-02-27 2018-08-17 天津大学 An efficient hardware/software partitioning method based on a fusion algorithm
WO2019023911A1 (en) * 2017-07-31 2019-02-07 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting text
TWI665567B (en) * 2018-09-26 2019-07-11 華碩電腦股份有限公司 Semantic processing method, electronic device, and non-transitory computer readable storage medium
CN110955748A (en) * 2018-09-26 2020-04-03 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
CN111898010A (en) * 2020-07-10 2020-11-06 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
TWI709927B (en) * 2017-12-06 2020-11-11 開曼群島商創新先進技術有限公司 Method and device for determining target user group
CN115034211A (en) * 2022-05-19 2022-09-09 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622341A (en) * 2012-04-20 2012-08-01 北京邮电大学 Domain ontology concept automatic-acquisition method based on Bootstrapping technology
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN104102658A (en) * 2013-04-09 2014-10-15 腾讯科技(深圳)有限公司 Method and device for mining text contents


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
G. CHAU et al.: "One Channel Subvocal Speech Phrases Recognition Using Cumulative Residual Entropy and Support Vector Machines", IEEE Latin America Transactions *
ZHANG LIBANG et al.: "Word Segmentation of Chinese Electronic Medical Records Based on Unsupervised Learning", Intelligent Computer and Applications *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019023911A1 (en) * 2017-07-31 2019-02-07 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting text
CN110998589B (en) * 2017-07-31 2023-06-27 北京嘀嘀无限科技发展有限公司 System and method for segmenting text
CN110998589A (en) * 2017-07-31 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for segmenting text
TWI713870B (en) * 2017-07-31 2020-12-21 大陸商北京嘀嘀無限科技發展有限公司 System and method for segmenting a text
TWI709927B (en) * 2017-12-06 2020-11-11 開曼群島商創新先進技術有限公司 Method and device for determining target user group
CN108415773A (en) * 2018-02-27 2018-08-17 天津大学 An efficient hardware/software partitioning method based on a fusion algorithm
CN110955748A (en) * 2018-09-26 2020-04-03 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
CN110955748B (en) * 2018-09-26 2022-10-28 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
TWI665567B (en) * 2018-09-26 2019-07-11 華碩電腦股份有限公司 Semantic processing method, electronic device, and non-transitory computer readable storage medium
CN111898010A (en) * 2020-07-10 2020-11-06 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN111898010B (en) * 2020-07-10 2024-09-13 时趣互动(北京)科技有限公司 New keyword mining method and device and electronic equipment
CN115034211A (en) * 2022-05-19 2022-09-09 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium
CN115034211B (en) * 2022-05-19 2023-04-18 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106021230B (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN106021230A (en) Word segmentation method and word segmentation apparatus
McDonald et al. Structured models for fine-to-coarse sentiment analysis
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
Balage Filho et al. NILC_USP: An improved hybrid system for sentiment analysis in twitter messages
WO2020187168A1 (en) Resume pushing method and apparatus, and task pushing method and apparatus
Bravo-Marquez et al. Positive, negative, or neutral: Learning an expanded opinion lexicon from emoticon-annotated tweets
Bartoli et al. An author verification approach based on differential features
WO2013118435A1 (en) Semantic similarity level computation method, system and program
US9262400B2 (en) Non-transitory computer readable medium and information processing apparatus and method for classifying multilingual documents
Agarwal et al. Frame semantic tree kernels for social network extraction from text
CN104142912A (en) Accurate corpus category marking method and device
Bilgin et al. Sentiment analysis with term weighting and word vectors
Zhai et al. Identifying evaluative sentences in online discussions
Shen et al. semipqa: A study on product question answering over semi-structured data
Nabil et al. Cufe at semeval-2016 task 4: A gated recurrent model for sentiment classification
CN106980898A (en) Deep learning system and its application method
CN107301426A (en) A kind of multi-tag clustering method of shoe sole print image
Sudhakaran et al. Classifying product reviews from balanced datasets for Sentiment Analysis and Opinion Mining
Agirre et al. Ubc: Cubes for english semantic textual similarity and supervised approaches for interpretable sts
El-Alami et al. Word sense representation based-method for Arabic text categorization
CN110990537A (en) Sentence similarity calculation method based on edge information and semantic information
Sulistyono et al. Sentiment Analysis on Social Media (Twitter) about Vaccine-19 Using Support Vector Machine Algorithm
JP2016152032A (en) Difficulty estimation model learning device, and device, method and program for estimating difficulty
CN109189932B (en) Text classification method and device and computer-readable storage medium
JP6368633B2 (en) Term meaning learning device, term meaning judging device, method, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant