CN106156002A - Method and system for selecting a word segmentation dictionary - Google Patents

Publication number
CN106156002A
Authority
CN
China
Prior art keywords
dictionary
word segmentation
word segmentation dictionary
text
evaluation value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610512054.5A
Other languages
Chinese (zh)
Inventor
张喆琳
冀利刚
张立宁
余婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Holding Beijing Co Ltd
LeTV Cloud Computing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Holding Beijing Co Ltd, LeTV Cloud Computing Co Ltd
Priority to CN201610512054.5A
Publication of CN106156002A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a method for selecting a word segmentation dictionary, relating to the field of information technology. The method includes: providing an evaluation processing device; importing multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple word segmentation dictionaries; and choosing the largest of the evaluation values and taking the word segmentation dictionary corresponding to it as the word segmentation dictionary to be selected. Embodiments of the present invention also provide a system for selecting a word segmentation dictionary. This solves the prior-art problem that a word segmentation dictionary cannot be selected directly and accurately. Compared with the prior art, the approach is more accurate, more convenient and faster, requires no long-term collection of user behavior statistics, and gives targeted search clusters a method that can be verified.

Description

Method and system for selecting a word segmentation dictionary
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and a system for selecting a word segmentation dictionary.
Background
The ability to segment search text into words is a key factor in the quality of Chinese retrieval in a search engine; accurate and effective word segmentation is crucial to improving search results and user satisfaction. The commonly used segmentation methods are dictionary-based and improve segmentation by adding custom dictionaries as corrections, so the vocabulary size of the dictionary affects search results to a large extent.
Further, the inventors found that if search results can be precisely targeted, the user's search experience is improved. At present, the search system first segments the query entered by the user with a word segmenter and only then performs the search. Accurate segmentation is therefore a key condition for search. Typically, a segmenter segments text using a dictionary combined with a new-word recognition algorithm. New-word recognition often cannot avoid producing ambiguous words, nor accurately find new words such as film and TV drama titles, so the quality of the dictionary becomes the main factor affecting segmentation.
However, there is currently no direct, effective method for evaluating the quality of a dictionary. The existing online quantitative evaluation method rests on the assumption that segmentation accuracy is positively correlated with retrieval performance. It tests retrieval performance, computes the users' "first-page click rate" and "page-turning rate" to assess segmentation accuracy, and from that evaluates the dictionary so that a better word segmentation dictionary can be selected. This method, however, requires collecting user behavior statistics after going online; the test period is long, and if the results are poor there is a risk of losing some users.
In addition, the "precision" and "recall" metrics widely used in information retrieval can be computed offline, but they hinge on setting a "relevance threshold" and require standard documents for the calculation. For a targeted search cluster (such as a video website), no standard documents are available for comparison, so dictionary selection is again difficult.
Neither method can directly and accurately judge dictionaries for selection.
Summary of the invention
To solve at least one of the above technical problems in the prior art, embodiments of the present invention provide a method and a system for selecting a word segmentation dictionary.
In one aspect, embodiments of the present invention provide a method for selecting a word segmentation dictionary, including:
providing an evaluation processing device;
importing multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple word segmentation dictionaries;
choosing the largest evaluation value from the multiple evaluation values, and taking the word segmentation dictionary corresponding to the largest evaluation value as the word segmentation dictionary to be selected.
In another aspect, embodiments of the present invention provide a system for selecting a word segmentation dictionary, including:
an evaluation processing device;
a dictionary import module, configured to import multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple word segmentation dictionaries;
a selection module, configured to choose the largest evaluation value from the multiple evaluation values and take the word segmentation dictionary corresponding to the largest evaluation value as the word segmentation dictionary to be selected.
The method and system provided by embodiments of the present invention group the words of a dictionary into categories by word frequency and use the degree to which the word frequencies are evenly distributed across those categories as the evaluation value for judging the dictionary. This solves the prior-art problem that a word segmentation dictionary cannot be selected directly and accurately. Compared with the prior art, the approach is more accurate, more convenient and faster, requires no long-term collection of user behavior statistics, and gives targeted search clusters a method that can be verified.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for selecting a word segmentation dictionary according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an embodiment of a sub-flow in Fig. 1;
Fig. 3 is a schematic diagram of an embodiment of a sub-flow in Fig. 2;
Fig. 4 is a flowchart of an embodiment of another optional method in the embodiments of the present invention;
Fig. 5 is a flowchart of an embodiment of yet another optional method in the embodiments of the present invention;
Fig. 6 is a structural diagram of a system for selecting a word segmentation dictionary according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a specific embodiment of a particular module in Fig. 6;
Fig. 8 is a structural diagram of another system for selecting a word segmentation dictionary in the embodiments of the present invention;
Fig. 9 is a structural diagram of yet another system for selecting a word segmentation dictionary in the embodiments of the present invention;
Fig. 10 is a structural diagram of a user device provided by an embodiment of the present invention.
Detailed description
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 shows a flowchart of a method for selecting a word segmentation dictionary according to an embodiment of the present invention. The method may include the following steps:
S11: provide an evaluation processing device;
S12: import multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple word segmentation dictionaries;
S13: choose the largest evaluation value from the multiple evaluation values, and take the word segmentation dictionary corresponding to the largest evaluation value as the word segmentation dictionary to be selected.
In this embodiment, an evaluation processing device is provided first. Multiple word segmentation dictionaries are then imported into it, the device evaluates them and generates an evaluation value for each dictionary, and the dictionary with the largest evaluation value is chosen as the word segmentation dictionary to be selected.
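Steps S12 and S13 amount to scoring every candidate and taking the maximum. The following is a minimal sketch of that selection logic only; the function name `select_dictionary`, the `evaluate` callable and the placeholder scorer are illustrative assumptions, not part of the source.

```python
from typing import Callable, Dict, Set, Tuple


def select_dictionary(
    dictionaries: Dict[str, Set[str]],
    evaluate: Callable[[Set[str]], float],
) -> Tuple[str, float]:
    """Score every candidate dictionary and return the name and score of the best one."""
    if not dictionaries:
        raise ValueError("at least one candidate dictionary is required")
    # S12: one evaluation value per imported dictionary.
    scores = {name: evaluate(words) for name, words in dictionaries.items()}
    # S13: the dictionary with the largest evaluation value is selected.
    best = max(scores, key=scores.get)
    return best, scores[best]


# Placeholder scorer for demonstration only (hypothetical): a real evaluator
# would segment a test text and measure how evenly word frequencies spread.
def placeholder_score(words: Set[str]) -> float:
    return len(words) / 10


candidates = {"dict_a": {"w1", "w2"}, "dict_b": {"w1", "w2", "w3"}}
best_name, best_score = select_dictionary(candidates, placeholder_score)
print(best_name, best_score)
```

Passing the evaluator in as a callable keeps the selection step independent of how the evaluation value is computed.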
Fig. 2 is a schematic diagram of an embodiment of a sub-flow in Fig. 1. As shown in Fig. 2, step S11 in Fig. 1 may include:
S110: segment a test text using the word segmentation dictionary;
S111: count the word frequency of each word after segmentation with the word segmentation dictionary;
S112: based on the word frequency of each word and the number of words in the whole word segmentation dictionary, determine the distribution-uniformity value of the word frequencies across the categories into which the dictionary is divided, and use this value as the evaluation value, where words with the same word frequency belong to the same category.
In this embodiment, after the multiple word segmentation dictionaries are imported, the test text is segmented and the word frequency of each word in the dictionary is counted. From the word frequencies and the number of words in the whole dictionary, the distribution-uniformity value of the word frequencies across the categories is then determined, and the next step proceeds from it.
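Steps S110 and S111 can be sketched as follows. The source does not say which segmentation algorithm the segmenter uses, so forward maximum matching is assumed here purely for illustration; the function names and the toy strings are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching (an assumed algorithm): at each position take
    the longest dictionary word, falling back to a single character."""
    if not dictionary:
        return list(text)
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def count_usage(text: str, dictionary: Set[str]) -> Dict[str, int]:
    """Number of times each dictionary word is used; unused words keep a count of 0."""
    counts = Counter({word: 0 for word in dictionary})
    counts.update(t for t in segment(text, dictionary) if t in dictionary)
    return dict(counts)


# Toy data chosen for illustration only.
toy_dictionary = {"天下", "天下第一", "天下第一剑", "墨家"}
tokens = segment("天下第一剑和墨家", toy_dictionary)
usage = count_usage("天下第一剑和墨家", toy_dictionary)
print(tokens)
print(usage)
```

Keeping zero counts in the result matters for the next step, because words that are never used form their own frequency category.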
Fig. 3 is a schematic diagram of an embodiment of a sub-flow in Fig. 2. As shown in Fig. 3, step S112 in Fig. 2 may include:
S1120: based on the word frequency of each word, divide the word frequencies into categories, where words with the same word frequency belong to the same category;
S1121: determine the number of words corresponding to the word frequency of each category, and determine the proportion of each such number in the number of words in the whole word segmentation dictionary;
S1122: from the determined proportions and the number of categories after division, determine the distribution-uniformity value of the word frequencies across the categories of the word segmentation dictionary.
In this embodiment, to determine the distribution-uniformity value of the word frequencies across the categories of the word segmentation dictionary, the determined proportions and the number of categories are fed into an information entropy generator, which computes
$$H(a) = -\sum_{i=1}^{n} P_i \log_n P_i$$
For example, let the total number of words in the word segmentation dictionary be m. The test text is segmented with the dictionary and, based on the word frequency of each word, the word frequencies are divided into categories, words with the same word frequency belonging to the same category. This yields n categories and the word frequency set a = (a_1, a_2, ..., a_i, ..., a_n), where the probability P_i (i = 1, 2, 3, ..., n) of the i-th category equals the number of words with word frequency a_i divided by the number of words m in the whole word segmentation dictionary.
Let the uncertainty of the usage of word frequency a_i be $H(a_i) = -\log_n P_i$.
The entropy of the probability system formed by the n categories of word frequency is then the sum of these uncertainties weighted by P_i, which gives the formula for H(a) above.
From this formula, the entropy corresponding to the word segmentation dictionary is obtained.
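As a brief aside that is not stated in the source, a standard property of entropy helps in reading the evaluation value: because the logarithm is taken to base n, the value is normalized to the interval from 0 to 1, assuming the P_i sum to 1.

```latex
% Case 1: every frequency category holds the same share of the dictionary.
P_i = \frac{1}{n} \;\Rightarrow\;
H(a) = -\sum_{i=1}^{n} \frac{1}{n}\log_n\frac{1}{n}
     = \sum_{i=1}^{n} \frac{1}{n}\cdot 1 = 1

% Case 2: one category dominates, P_k \to 1 and all other P_i \to 0.
H(a) \to 0

% In general, for n \ge 2 and \sum_i P_i = 1:
0 \le H(a) \le 1
```

Under this reading, an evaluation value close to 1 means the dictionary's words are spread evenly over the observed frequency categories, which is what the selection step rewards.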
A concrete application of the above formula is illustrated below.
Assume the word segmentation dictionary contains the following 20 words (m = 20): east emperor too, prime minister, China, under, son, Wei Zhuan, the Mohist School, senior general, all over the world, the first under heaven, the first under heaven sword, daybreak, last, office, Li Si, Chu Guo, hardly realize, the state of Qin, Emperor Qin, the bright moon during Qin.
Test text is the description text of the movie and television play classification captured in one's power: the bright moon during Qin, Warring states latter stages, and Jing Ke assassinates King Qin mistake Lose sacrifice.The first under heaven swordsman's Gagne is held in the palm by Jing Ke, and the sub-Jing Tianming escorting Jing Ke hides Emperor Qin's chase.In state of Qin border Waning moon paddy, Gagne one people beats back 300 cavalries of the state of Qin, and Emperor Qin is furious, and life prime minister Li Si must root out two people.Li Si is in negative and positive Look for the fellow disciple brother Wei Zhuan of Gagne under the guide of family east emperor too, defended that village sword-play is preeminent to be occupy under Gagne the most all the time, For winning the name of the first under heaven sword, Wei Zhuan has promised the requirement of Li Si, hardly realizes the circle oneself having fallen into eastern emperor too Set.Originally daybreak is when birth, is the most just stealthily planted down " universe unit jade very " by geomancer, " universe unit jade is very " be related to one huge Big conspiracy.On escape road, Gagne and daybreak have got to know Mohist School crowd master-hand, and senior general Chu Guo descendant Xiang Shaoyu and Mohist School maiden high Month, a group traveling together's entrance under the leading of destiny is described as the Mohist School office city of the most last a piece of pure land.Office hides in absolutely in city Between hero peak, ridge, assembled the wisdom that the Mohist School is deep, be the fort that all over the world all anti-Qin force is last, it be also the disciple Mohist School After refuge.TV play, the bright moon during Qin, ancient costume, Lu Yi, Chen Yanxi, Jiang Jingfu, high definition video is acute, watches Qin Shiming online The moon the 27th collects.
In the segmentation result, the number of times each word in the dictionary is used is as follows: ghost party: 0, east emperor too: 2, prime minister: 1, under: 1, Wei Zhuan: 3, the Mohist School: 5, senior general: 1, all over the world: 1, the first under heaven: 1, the first under heaven sword: 1, daybreak: 3, last: 3, office: 2, Li Si: 3, Chu Guo: 1, hardly realize: 1, the state of Qin: 2, Emperor Qin: 2, the bright moon during Qin: 3, 27: 1.
The number of word frequency categories is therefore n = 5, and the word frequency set is a = {0, 1, 2, 3, 5}.
The probability of word frequency 0 is 1/20, of 1 is 9/20, of 2 is 4/20, of 3 is 5/20 and of 5 is 1/20, so by the formula the dictionary entropy is:
$$H(a) = -\left(\frac{1}{20}\log_5\frac{1}{20} + \frac{9}{20}\log_5\frac{9}{20} + \frac{4}{20}\log_5\frac{4}{20} + \frac{5}{20}\log_5\frac{5}{20} + \frac{1}{20}\log_5\frac{1}{20}\right) = 0.824737$$
As this example shows, the method of the present invention can determine the entropy of each of multiple word segmentation dictionaries and select the dictionary with the maximum entropy as the word segmentation dictionary to be selected.
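The worked example can be checked numerically. The sketch below recomputes the entropy from the usage counts listed in the segmentation result; the function name `dictionary_entropy` and the handling of a single category are assumptions, and the keys are simply the word labels as they appear in this text.

```python
import math
from collections import Counter
from typing import Dict


def dictionary_entropy(usage: Dict[str, int]) -> float:
    """Entropy of the word-frequency categories, with logarithm base n."""
    m = len(usage)                            # words in the whole dictionary
    category_sizes = Counter(usage.values())  # frequency value -> number of words
    n = len(category_sizes)                   # number of frequency categories
    if n < 2:
        return 0.0  # assumption: base-1 logarithm is undefined, treat as 0
    return -sum((size / m) * math.log(size / m, n)
                for size in category_sizes.values())


usage = {
    "ghost party": 0, "east emperor too": 2, "prime minister": 1, "under": 1,
    "Wei Zhuan": 3, "the Mohist School": 5, "senior general": 1,
    "all over the world": 1, "the first under heaven": 1,
    "the first under heaven sword": 1, "daybreak": 3, "last": 3, "office": 2,
    "Li Si": 3, "Chu Guo": 1, "hardly realize": 1, "the state of Qin": 2,
    "Emperor Qin": 2, "the bright moon during Qin": 3, "27": 1,
}
entropy = dictionary_entropy(usage)
print(f"m={len(usage)}, H(a)={entropy:.6f}")  # m=20, H(a)=0.824737
```

The category sizes derived from these counts are 1, 9, 4, 5 and 1 for frequencies 0, 1, 2, 3 and 5, which reproduces the value 0.824737 given in the example.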
Fig. 4 shows a flowchart of an optional embodiment of the method shown in Fig. 1. Before step S11, part of the text content is captured from the text content of a content library to generate a test text. The specific implementation is:
S10: randomly capture part of the text content from the text content of the content library to generate a test text.
In this embodiment, part of the text content is captured at random from the content library to generate the test text. With a good test text, the evaluation in step S11 is faster and more effective. The test text may be captured once, which ensures efficiency, or more than once, which ensures the accuracy of the test text. When capturing more than once, the amount of text captured each time is smaller than the amount captured in a single pass, so that the accuracy of the test text improves without losing efficiency.
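Step S10 can be sketched as random sampling without replacement from a list of text items. The function name, the `sample_size` and `rounds` parameters, and the representation of the content library as a list of strings are all assumptions made for illustration.

```python
import random
from typing import List, Optional


def build_test_text(
    content_library: List[str],
    sample_size: int,
    rounds: int = 1,
    seed: Optional[int] = None,
) -> List[str]:
    """Randomly capture `sample_size` items per round and return them in capture order."""
    rng = random.Random(seed)
    size = min(sample_size, len(content_library))
    pieces: List[str] = []
    for _ in range(rounds):
        # Each round samples without replacement within the round.
        pieces.extend(rng.sample(content_library, size))
    return pieces


library = [f"description text {i}" for i in range(10)]
single_pass = build_test_text(library, sample_size=4, rounds=1, seed=0)
multi_pass = build_test_text(library, sample_size=2, rounds=3, seed=0)
print(len(single_pass), len(multi_pass))
```

Here the multi-pass call uses a smaller `sample_size` per round than the single-pass call, mirroring the statement that each of several captures takes less text than a single capture.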
Fig. 5 shows a flowchart of an optional embodiment of the method shown in Fig. 1. After step S13, the method proceeds to step S14: optimize the word segmentation dictionary, which includes, for example:
according to a deletion instruction, further deleting at least one word from the word segmentation dictionary to be selected to generate an updated word segmentation dictionary;
importing the updated word segmentation dictionary into the evaluation processing device to generate an updated evaluation value;
comparing the updated evaluation value with the evaluation value corresponding to the word segmentation dictionary to be selected;
if the updated evaluation value is larger, taking the updated word segmentation dictionary as the selected dictionary;
if the updated evaluation value is smaller, taking the word segmentation dictionary to be selected as the selected dictionary.
Continuing the example above: without affecting the segmentation result, the number "27" and the word "ghost party", whose word frequency is 0, can be deleted, and the more precise segmentation is kept by retaining "the first under heaven sword" and deleting "the first under heaven".
The dictionary then contains m = 17 words, with the following usage counts:
east emperor too: 2, prime minister: 1, under: 1, Wei Zhuan: 3, the Mohist School: 5, senior general: 1, all over the world: 1, the first under heaven sword: 1, daybreak: 3, last: 3, office: 2, Li Si: 3, Chu Guo: 1, hardly realize: 1, the state of Qin: 2, Emperor Qin: 2, the bright moon during Qin: 3.
Result: the number of word frequency categories is n = 4, and the word frequency set is a = {1, 2, 3, 5}.
The probability of word frequency 1 is 6/17, of 2 is 5/17, of 3 is 5/17 and of 5 is 1/17.
The new dictionary entropy is:
$$H(a) = -\left(\frac{6}{17}\log_4\frac{6}{17} + \frac{5}{17}\log_4\frac{5}{17} + \frac{5}{17}\log_4\frac{5}{17} + \frac{1}{17}\log_4\frac{1}{17}\right) = 0.904642$$
The revised dictionary therefore has a larger entropy. For this test text, the revised dictionary preserves the segmentation quality while also reducing storage space.
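The keep-or-revert decision of the optimization step can be sketched as follows. This is a minimal illustration on toy counts rather than the example data; it assumes, as the text does, that deleting the chosen words does not change how the remaining words are used, and it keeps the original dictionary on a tie, a case the source does not address. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Set


def entropy(usage: Dict[str, int]) -> float:
    """Evaluation value: entropy of word-frequency categories with base-n logarithm."""
    sizes = Counter(usage.values())
    m, n = len(usage), len(sizes)
    if n < 2:
        return 0.0  # assumption for the degenerate single-category case
    return -sum(s / m * math.log(s / m, n) for s in sizes.values())


def refine(usage: Dict[str, int], deletions: Set[str]) -> Dict[str, int]:
    """Delete words, re-evaluate, and keep whichever dictionary scores higher."""
    updated = {w: c for w, c in usage.items() if w not in deletions}
    if entropy(updated) > entropy(usage):
        return updated   # updated evaluation value is larger: adopt the update
    return usage         # otherwise keep the dictionary that was selected


toy_usage = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 0}
kept = refine(toy_usage, {"c", "e"})   # deletion raises the evaluation value
reverted = refine(toy_usage, {"d"})    # deletion lowers it, so it is discarded
print(sorted(kept), sorted(reverted))
```

In the first call the update is adopted and in the second it is rejected, which matches the two branches of step S14.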
Fig. 6 is a structural diagram of a system for selecting a word segmentation dictionary according to the present invention. As shown in Fig. 6, the system may include: an evaluation processing device 12, a dictionary import module 13 and a selection module 14, where:
the dictionary import module 13 is configured to import multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple word segmentation dictionaries;
the selection module 14 is configured to choose the largest evaluation value from the multiple evaluation values and take the word segmentation dictionary corresponding to the largest evaluation value as the word segmentation dictionary to be selected.
As shown in Fig. 7, the evaluation processing device 12 may include: a segmenter 120, a counter 121 and an evaluation value generator 122.
The segmenter 120 is configured to segment a test text using the word segmentation dictionary.
The counter 121 is configured to count the word frequency of each word after segmentation with the word segmentation dictionary.
The evaluation value generator 122 is configured to determine, based on the word frequency of each word and the number of words in the whole word segmentation dictionary, the distribution-uniformity value of the word frequencies across the categories of the dictionary, and to use this value as the evaluation value, where words with the same word frequency belong to the same category.
The evaluation value generator 122 is configured to:
divide the word frequencies into categories based on the word frequency of each word, where words with the same word frequency belong to the same category;
determine the number of words corresponding to the word frequency of each category, and determine the proportion of each such number in the number of words in the whole word segmentation dictionary;
determine, from the determined proportions and the number of categories after division, the distribution-uniformity value of the word frequencies across the categories of the word segmentation dictionary.
In this embodiment, the determined proportions and the number of categories after division are fed into an information entropy generator to obtain the entropy corresponding to the word segmentation dictionary, which is provided to the selection module 14 as the evaluation value.
Fig. 8 shows a structural diagram of another optional system based on the system shown in Fig. 6. As shown in Fig. 8, this system for selecting a word segmentation dictionary may include: a test text generation module 11, an evaluation processing device 12, a dictionary import module 13 and a selection module 14, where:
the test text generation module 11 is configured to, before the evaluation processing device is provided, randomly capture part of the text content from the text content of a content library to generate a test text. When the number of captures is one, the captured part is a first portion of text content; when the number of captures is more than one, the captured part is a second portion of text content; the first portion of text content is larger than the second portion of text content.
In this embodiment, before the evaluation is performed, the test text generation module 11 randomly captures part of the text content from the content library and provides the test text for the evaluation triggered by the dictionary import module.
Fig. 9 shows a structural diagram of yet another optional system based on the system shown in Fig. 6. As shown in Fig. 9, this system for selecting a word segmentation dictionary may include: a test text generation module 11, an evaluation processing device 12, a dictionary import module 13, a selection module 14 and a dictionary optimization module 15, where:
the dictionary optimization module 15 is configured to, after the selection module has chosen the word segmentation dictionary with the largest evaluation value as the dictionary to be selected, further delete at least one word from the word segmentation dictionary to be selected according to a deletion instruction to generate an updated word segmentation dictionary;
import the updated word segmentation dictionary into the evaluation processing device to generate an updated evaluation value;
compare the updated evaluation value with the evaluation value corresponding to the word segmentation dictionary to be selected;
if the updated evaluation value is larger, take the updated word segmentation dictionary as the selected dictionary, and if the updated evaluation value is smaller, take the word segmentation dictionary to be selected as the selected dictionary.
In this embodiment, the dictionary optimization module 15 preserves the accuracy of the selected dictionary and, because deleting entries reduces the number of words in the dictionary, also improves the dictionary's efficiency.
Fig. 10 is a structural diagram of a user device 1200 provided by an embodiment of the present application. The specific embodiments of the present application do not limit the specific implementation of the user device 1200. As shown in Fig. 10, the user device 1200 may include:
a processor 1210, a communications interface 1220, a memory 1230 and a communication bus 1240, where:
the processor 1210, the communications interface 1220 and the memory 1230 communicate with one another through the communication bus 1240.
The communications interface 1220 is used for communicating with network elements such as clients.
The processor 1210 is used for executing a program 1232, and specifically can perform the relevant steps in the method embodiments above.
Specifically, the program 1232 may include program code, and the program code includes computer operation instructions.
The processor 1210 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. a participle dictionary system of selection, including:
One assessment processing means is set;
Multiple participle dictionaries are imported assessment processing means, generates the multiple assessed values corresponding with the plurality of participle dictionary;
Choose from the plurality of assessed value maximum assessed value, and using participle dictionary corresponding for the assessed value of described maximum as Participle dictionary to be selected.
Method the most according to claim 1, wherein, described assessment processing means is used for:
Utilize participle dictionary that test text is carried out participle;
Add up the word frequency number of each vocabulary after described participle dictionary participle;
Vocabulary quantity in word frequency number based on each vocabulary and whole participle dictionary, after determining that described participle dictionary divides The degree value that is evenly distributed of the word frequency number under of all categories, using the described degree value that is evenly distributed as assessed value, wherein, word frequency number phase With for same classification.
3. The method according to claim 2, wherein determining, based on the word frequency of each vocabulary item and the number of vocabulary items in the whole participle dictionary, the distribution uniformity value of the word frequencies across the categories into which the participle dictionary is divided comprises:
dividing the word frequencies into categories based on the word frequency of each vocabulary item, wherein identical word frequencies belong to the same category;
determining the number of vocabulary items corresponding to the word frequency of each category, and determining the proportion of each such number to the number of vocabulary items in the whole participle dictionary; and
determining, according to the determined proportions and the number of categories after division, the distribution uniformity value of the word frequencies across the categories into which the participle dictionary is divided.
4. The method according to claim 3, wherein determining, according to the determined proportions and the number of categories after division, the distribution uniformity value of the word frequencies across the categories into which the participle dictionary is divided comprises:
importing the determined proportions and the number of categories after division into an information entropy generator to obtain an entropy value corresponding to the participle dictionary.
5. The method according to claim 1, further comprising, before providing the assessment processing means: randomly grabbing a portion of the text content in a content library to generate a test text, wherein
when the number of grabs is one, the grabbed portion is a first portion of the text content;
when the number of grabs is more than one, the grabbed portion is a second portion of the text content; and
the first portion of the text content is larger than the second portion of the text content.
6. The method according to any one of claims 1 to 5, further comprising, after taking the participle dictionary corresponding to the maximum assessed value as the participle dictionary to be selected:
deleting, according to a deletion instruction, at least one vocabulary item from the participle dictionary to be selected, to generate an updated participle dictionary;
importing the updated participle dictionary into the assessment processing means to generate an updated assessed value;
comparing the updated assessed value with the assessed value corresponding to the participle dictionary to be selected;
if the updated assessed value is larger, taking the updated participle dictionary as the selected dictionary; and
if the updated assessed value is smaller, taking the participle dictionary to be selected as the selected dictionary.
7. A participle dictionary selection system, comprising:
an assessment processing means;
a dictionary import module, configured to import multiple participle dictionaries into the assessment processing means to generate multiple assessed values corresponding to the multiple participle dictionaries; and
a selection module, configured to choose the maximum assessed value from the multiple assessed values and to take the participle dictionary corresponding to the maximum assessed value as the participle dictionary to be selected.
8. The system according to claim 7, wherein the assessment processing means comprises:
a segmenter, configured to segment a test text using a participle dictionary;
a counter, configured to count the word frequency of each vocabulary item obtained after segmentation with the participle dictionary; and
an assessed value generator, configured to determine, based on the word frequency of each vocabulary item and the number of vocabulary items in the whole participle dictionary, a distribution uniformity value of the word frequencies across the categories into which the participle dictionary is divided, and to take the distribution uniformity value as the assessed value, wherein identical word frequencies belong to the same category.
9. The system according to claim 8, wherein the assessed value generator is configured to:
divide the word frequencies into categories based on the word frequency of each vocabulary item, wherein identical word frequencies belong to the same category;
determine the number of vocabulary items corresponding to the word frequency of each category, and determine the proportion of each such number to the number of vocabulary items in the whole participle dictionary; and
determine, according to the determined proportions and the number of categories after division, the distribution uniformity value of the word frequencies across the categories into which the participle dictionary is divided.
10. The system according to claim 9, wherein the assessed value generator is configured to:
import the determined proportions and the number of categories after division into an information entropy generator to obtain an entropy value corresponding to the participle dictionary.
11. The system according to claim 7, further comprising a test text generation module, configured to:
before the dictionary import module imports the multiple participle dictionaries into the assessment processing means, randomly grab a portion of the text content in a content library to generate a test text, wherein
when the number of grabs is one, the grabbed portion is a first portion of the text content;
when the number of grabs is more than one, the grabbed portion is a second portion of the text content; and
the first portion of the text content is larger than the second portion of the text content.
12. The system according to any one of claims 7 to 11, further comprising a dictionary optimization module, configured to:
after the selection module takes the participle dictionary corresponding to the maximum assessed value as the participle dictionary to be selected, delete, according to a deletion instruction, at least one vocabulary item from the participle dictionary to be selected, to generate an updated participle dictionary;
import the updated participle dictionary into the assessment processing means to generate an updated assessed value;
compare the updated assessed value with the assessed value corresponding to the participle dictionary to be selected;
if the updated assessed value is larger, take the updated participle dictionary as the selected dictionary; and
if the updated assessed value is smaller, take the participle dictionary to be selected as the selected dictionary.
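
The evaluation and selection recited in claims 1 to 4, together with the optimization step of claims 6 and 12, can be illustrated with a minimal Python sketch. It is not the implementation of the application. All function names (`segment`, `assessed_value`, `select_dictionary`, `prune`) are hypothetical. The segmenter is assumed to use forward maximum matching, which the claims do not specify. The information entropy generator is assumed to compute Shannon entropy with a base-2 logarithm. Dictionary words that never occur in the test text are assumed to form a category with word frequency 0, so that the proportions sum to 1.

```python
import math
from collections import Counter


def segment(text, dictionary):
    """Forward maximum matching (an assumed segmenter, not taken from the claims).

    Characters that match no dictionary word become single-character tokens.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def assessed_value(dictionary, test_text):
    """Entropy of the word-frequency categories induced by one dictionary."""
    if not dictionary:
        return 0.0
    token_counts = Counter(segment(test_text, dictionary))
    # Word frequency of every dictionary word (0 if the segmenter never produced it).
    freq_per_word = [token_counts.get(w, 0) for w in dictionary]
    # Words with identical word frequency belong to the same category.
    words_per_category = Counter(freq_per_word)
    total = len(dictionary)
    # Proportion of each category relative to the whole dictionary, fed into
    # Shannon entropy as the assumed measure of distribution uniformity.
    return sum((n / total) * math.log2(total / n)
               for n in words_per_category.values())


def select_dictionary(dictionaries, test_text):
    """Return the dictionary with the maximum assessed value (claim 1)."""
    return max(dictionaries, key=lambda d: assessed_value(d, test_text))


def prune(dictionary, words_to_delete, test_text):
    """Keep the updated dictionary only if its assessed value is larger (claim 6)."""
    updated = set(dictionary) - set(words_to_delete)
    if assessed_value(updated, test_text) > assessed_value(dictionary, test_text):
        return updated
    return set(dictionary)
```

Each dictionary is passed as a set of strings, for example `select_dictionary([dict_a, dict_b], test_text)`. When two dictionaries have equal assessed values, `max` returns the first one, which is a choice of this sketch and not something the claims address.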
CN201610512054.5A 2016-06-30 2016-06-30 The system of selection of participle dictionary and system Pending CN106156002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610512054.5A CN106156002A (en) 2016-06-30 2016-06-30 The system of selection of participle dictionary and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610512054.5A CN106156002A (en) 2016-06-30 2016-06-30 The system of selection of participle dictionary and system

Publications (1)

Publication Number Publication Date
CN106156002A true CN106156002A (en) 2016-11-23

Family

ID=57350982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610512054.5A Pending CN106156002A (en) 2016-06-30 2016-06-30 The system of selection of participle dictionary and system

Country Status (1)

Country Link
CN (1) CN106156002A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255956A (en) * 2017-12-21 2018-07-06 北京声智科技有限公司 The method and system of dictionary are adaptively obtained based on historical data and machine learning
CN109522298A (en) * 2018-08-29 2019-03-26 云南电网有限责任公司信息中心 Data cleaning method for CIM
CN111178070A (en) * 2019-12-25 2020-05-19 平安医疗健康管理股份有限公司 Word sequence obtaining method and device based on word segmentation and computer equipment
CN112765975A (en) * 2020-12-25 2021-05-07 北京百度网讯科技有限公司 Word segmentation ambiguity processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246473A (en) * 2008-03-28 2008-08-20 腾讯科技(深圳)有限公司 Segmentation system evaluating method and segmentation evaluating system
CN101710326A (en) * 2009-12-03 2010-05-19 腾讯科技(深圳)有限公司 Word stock substitution method, device and input method system
CN103458462A (en) * 2012-06-04 2013-12-18 电信科学技术研究院 Cell selection method and equipment
CN104966090A (en) * 2015-07-21 2015-10-07 公安部第三研究所 Visual word generation and evaluation system and method for realizing image comprehension

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255956A (en) * 2017-12-21 2018-07-06 北京声智科技有限公司 The method and system of dictionary are adaptively obtained based on historical data and machine learning
CN109522298A (en) * 2018-08-29 2019-03-26 云南电网有限责任公司信息中心 Data cleaning method for CIM
CN111178070A (en) * 2019-12-25 2020-05-19 平安医疗健康管理股份有限公司 Word sequence obtaining method and device based on word segmentation and computer equipment
CN111178070B (en) * 2019-12-25 2022-11-25 深圳平安医疗健康科技服务有限公司 Word sequence obtaining method and device based on word segmentation and computer equipment
CN112765975A (en) * 2020-12-25 2021-05-07 北京百度网讯科技有限公司 Word segmentation ambiguity processing method, device, equipment and medium
CN112765975B (en) * 2020-12-25 2023-08-04 北京百度网讯科技有限公司 Word segmentation disambiguation processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN105787025B (en) Network platform public account classification method and device
CN106156002A (en) The system of selection of participle dictionary and system
CN106528894B (en) The method and device of label information is set
CN111581092B (en) Simulation test data generation method, computer equipment and storage medium
CN111352907A (en) Method and device for analyzing pipeline file, computer equipment and storage medium
CN102402619A (en) Search method and device
CN106874253A (en) Recognize the method and device of sensitive information
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN104317891B (en) A kind of method and device that label is marked to the page
CN103902535A (en) Method, device and system for obtaining associational word
AU2018452738B2 (en) Binning for nonlinear modeling
CN110674247A (en) Barrage information intercepting method and device, storage medium and equipment
CN111510368B (en) Family group identification method, device, equipment and computer readable storage medium
CN105678625A (en) Method and equipment for determining identity information of user
CN105678129A (en) Method and device for determining user identity information
CN110032727A (en) Risk Identification Method and device
CN109558528A (en) Article method for pushing, device, computer readable storage medium and server
CN106257449A (en) A kind of information determines method and apparatus
CN110532528B (en) Book similarity calculation method based on random walk and electronic equipment
CN110008352B (en) Entity discovery method and device
CN110110119B (en) Image retrieval method, device and computer readable storage medium
Leeb Random numbers for computer simulation
CN108509571A (en) A kind of webpage information data excavation universal method
CN105260467B (en) A kind of SMS classified method and device
CN106611059A (en) Method and device for recommending multi-media files

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161123
