CN106156002A - Method and system for selecting a word segmentation dictionary - Google Patents
- Publication number
- CN106156002A (application number CN201610512054.5A)
- Authority
- CN
- China
- Prior art keywords
- dictionary
- word segmentation
- word segmentation dictionary
- text
- evaluation value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the present invention provide a method for selecting a word segmentation dictionary and relate to the field of information technology. The method includes: setting up an evaluation processing device; importing multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the dictionaries; and choosing the maximum evaluation value from the multiple evaluation values and taking the dictionary corresponding to that maximum as the word segmentation dictionary to be selected. Embodiments of the present invention also provide a system for selecting a word segmentation dictionary. This solves the prior-art problem that a word segmentation dictionary cannot be selected directly and accurately. Compared with the prior art, the approach is more accurate, more convenient, and faster, requires no long-term collection of user behavior statistics, and also gives domain-specific search clusters a method that can be verified.
Description
Technical field
Embodiments of the present invention relate to the field of information technology, and in particular to a method and system for selecting a word segmentation dictionary.
Background
The ability to segment search text into words is a key factor in the quality of Chinese-language retrieval in a search engine. Accurate, effective word segmentation is essential for improving search results and user satisfaction. The segmentation methods in common use are dictionary-based and improve segmentation by adding custom dictionaries as corrections, so the vocabulary size of the dictionary affects search results to a large extent.
Further, the inventors found that if search results can be positioned precisely, the user's search experience improves. At present, a search system first segments the query entered by the user with a segmenter and only then performs the search, so accurate segmentation is a key precondition of search. A segmenter usually segments text using a dictionary combined with a new-word recognition algorithm. New-word recognition often cannot avoid producing ambiguous words, nor can it reliably find new words such as the titles of films and TV series, so the quality of the dictionary is the main factor affecting segmentation.
However, there is currently no direct, effective method for evaluating dictionary quality. The existing online quantitative evaluation method rests on the assumption that segmentation accuracy is positively correlated with retrieval performance. It tests retrieval performance by computing users' "first-page click-through rate" and "page-turn rate", uses these to assess segmentation accuracy and, from that, dictionary quality, and so selects the better-performing word segmentation dictionary. This method, however, requires the dictionary to go live before user behavior can be collected. The test period is long, and if the dictionary performs poorly there is a risk of losing some users.
The "precision" and "recall" metrics widely used in information retrieval can be computed offline, but they hinge on the setting of a "relevance threshold" and require reference documents for the calculation. For a domain-specific search cluster (such as a video website), no reference documents are available for comparison, so these metrics are also hard to use for dictionary selection.
Neither method can evaluate a dictionary directly and accurately for the purpose of selection.
Summary of the invention
To solve at least one of the above technical problems in the prior art, embodiments of the present invention provide a method and system for selecting a word segmentation dictionary.
In one aspect, embodiments of the present invention provide a method for selecting a word segmentation dictionary, including:
setting up an evaluation processing device;
importing multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple dictionaries;
choosing the maximum evaluation value from the multiple evaluation values, and taking the dictionary corresponding to the maximum evaluation value as the word segmentation dictionary to be selected.
In another aspect, embodiments of the present invention provide a system for selecting a word segmentation dictionary, including:
an evaluation processing device;
a dictionary import module, configured to import multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple dictionaries;
a selection module, configured to choose the maximum evaluation value from the multiple evaluation values and take the dictionary corresponding to the maximum evaluation value as the word segmentation dictionary to be selected.
The method and system provided by embodiments of the present invention use, as the evaluation value for judging a dictionary, a value measuring how evenly the word-frequency counts are distributed across the classes into which the dictionary is divided. This solves the prior-art problem that a word segmentation dictionary cannot be selected directly and accurately. Compared with the prior art, the approach is more accurate, more convenient, and faster, requires no long-term collection of user behavior statistics, and also gives domain-specific search clusters a method that can be verified.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for selecting a word segmentation dictionary according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an embodiment of a sub-process in Fig. 1;
Fig. 3 is a schematic diagram of an embodiment of a sub-process in Fig. 2;
Fig. 4 is a flowchart of another optional embodiment of the method according to an embodiment of the present invention;
Fig. 5 is a flowchart of yet another optional embodiment of the method according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a system for selecting a word segmentation dictionary according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a specific embodiment of a particular module in Fig. 6;
Fig. 8 is a schematic structural diagram of another system for selecting a word segmentation dictionary according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of yet another system for selecting a word segmentation dictionary according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a user device according to an embodiment of the present invention.
Detailed description of the invention
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 shows a flowchart of a method for selecting a word segmentation dictionary according to an embodiment of the present invention. The method may include the following steps:
S11: set up an evaluation processing device;
S12: import multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple dictionaries;
S13: choose the maximum evaluation value from the multiple evaluation values, and take the dictionary corresponding to the maximum evaluation value as the word segmentation dictionary to be selected.
In this embodiment, an evaluation processing device is set up first. Multiple word segmentation dictionaries are then imported into it, the device evaluates them and generates an evaluation value for each dictionary, and the dictionary with the largest evaluation value is chosen as the word segmentation dictionary to be selected.
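Steps S12 and S13 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_dictionary`, the representation of a dictionary as a set of strings, and the `evaluate` callback standing in for the evaluation processing device are all assumptions made for the example.

```python
from typing import Callable, Dict, Set


def select_dictionary(
    dictionaries: Dict[str, Set[str]],
    evaluate: Callable[[Set[str]], float],
) -> str:
    """Return the name of the dictionary with the largest evaluation value.

    `evaluate` is a hypothetical stand-in for the evaluation processing
    device: it maps one dictionary to one evaluation value (step S12).
    """
    scores = {name: evaluate(words) for name, words in dictionaries.items()}
    # Step S13: choose the dictionary whose evaluation value is the maximum.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    candidates = {"small": {"x"}, "large": {"x", "y", "z"}}
    # Dictionary size is used here only as a placeholder evaluation value.
    print(select_dictionary(candidates, evaluate=len))
```

When two dictionaries tie, this sketch returns the one that was imported first; the source does not say how ties should be handled.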
Fig. 2 is a schematic diagram of an embodiment of the sub-process in Fig. 1. As shown in Fig. 2, step S11 in Fig. 1 may include:
S110: segment the test text using the word segmentation dictionary;
S111: count the word frequency of each word of the dictionary after segmentation;
S112: based on the word frequency of each word and the number of words in the entire dictionary, determine a value measuring how evenly the word frequencies are distributed across the classes into which the dictionary is divided, and use this distribution uniformity value as the evaluation value, where words with the same word frequency belong to the same class.
In this embodiment, after the multiple dictionaries are imported, the test text is segmented and the word frequency of each dictionary word after segmentation is counted. From these word frequencies and the number of words in the entire dictionary, the distribution uniformity value of the word frequencies across the classes of the dictionary is determined, so that the next step can be carried out.
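Steps S110 and S111 can be sketched as below. The source does not specify a segmentation algorithm, so forward maximum matching is used purely as an assumed example; `forward_max_match` and `count_dictionary_usage` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Segment `text` greedily, preferring the longest dictionary word.

    Characters that match no dictionary word are emitted one at a time.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def count_dictionary_usage(text: str, dictionary: Set[str]) -> Dict[str, int]:
    """Step S111: word frequency of every dictionary word, including zeros."""
    seen = Counter(forward_max_match(text, dictionary))
    return {word: seen.get(word, 0) for word in dictionary}


if __name__ == "__main__":
    print(count_dictionary_usage("abcdab", {"ab", "abc", "d"}))
```

Keeping zero counts matters for the later steps, because words that are never used form a word-frequency class of their own.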
Fig. 3 is a schematic diagram of an embodiment of the sub-process in Fig. 2. As shown in Fig. 3, step S112 in Fig. 2 may include:
S1120: based on the word frequency of each word, divide the words into word-frequency classes, where words with the same word frequency belong to the same class;
S1121: determine the number of words in each word-frequency class, and determine the proportion of each such number to the number of words in the entire dictionary;
S1122: from the determined proportions and the number of classes after division, determine the distribution uniformity value of the word frequencies across the classes of the dictionary.
In this embodiment, to determine the distribution uniformity value of the word frequencies across the classes of the dictionary, the determined proportions and the number of classes after division are fed into an information entropy generator.
For example, let the total number of words in the dictionary be m. The test text is segmented with the dictionary, and the words are divided into classes by word frequency, where words with the same word frequency belong to the same class. This yields n classes and the word-frequency set $a = (a_1, a_2, \ldots, a_i, \ldots, a_n)$. The probability $P_i$ ($i = 1, 2, 3, \ldots, n$) of the i-th word-frequency class equals the number of words whose word frequency is $a_i$ divided by the number of words m in the entire dictionary.
Let the uncertainty of the usage rate of word frequency $a_i$ be $H(a_i) = -\log_n P_i$.
The entropy of the probability system formed by the n word-frequency classes is then $H = \sum_{i=1}^{n} P_i H(a_i) = -\sum_{i=1}^{n} P_i \log_n P_i$.
This formula gives the entropy corresponding to the dictionary.
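The entropy defined above can be computed with a few lines of Python. This is a sketch under stated assumptions: `normalized_entropy` is a hypothetical name, the input is the list of class proportions P_i, and returning 0.0 when there is only one class is a choice made here because a logarithm to base 1 is undefined; the source does not address that case.

```python
import math
from typing import Sequence


def normalized_entropy(proportions: Sequence[float]) -> float:
    """Entropy of the class proportions P_i, using logarithms to base n.

    n is the number of word-frequency classes, so the result lies in [0, 1]
    and equals 1 when all classes are equally large.
    """
    n = len(proportions)
    if n < 2:
        # Assumption: a single class carries no uncertainty.
        return 0.0
    return -sum(p * math.log(p, n) for p in proportions if p > 0)


if __name__ == "__main__":
    print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))
    print(normalized_entropy([0.7, 0.1, 0.1, 0.1]))
```

Because the base of the logarithm is n, dictionaries that produce different numbers of classes are scored on the same 0 to 1 scale.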
A concrete application of the above formula is illustrated below.
Assume the dictionary contains: east emperor too, prime minister, China, under, son, Wei Zhuan, the Mohist School, senior general, all over the world, the first under heaven, the first under heaven sword, daybreak, last, office, Li Si, Chu Guo, hardly realize, the state of Qin, Emperor Qin, the bright moon during Qin, for a total of 20 words (m = 20).
The test text is a crawled description of a film and TV title: The bright moon during Qin. In the last years of the Warring States period, Jing Ke fails in his attempt to assassinate the King of Qin and dies. Entrusted by Jing Ke, Gai Nie, the first swordsman under heaven, escorts Jing Ke's son Jing Tianming to evade Emperor Qin's pursuit. At Waning Moon Valley on the border of the state of Qin, Gai Nie alone beats back 300 cavalry of the state of Qin. Emperor Qin is furious and orders prime minister Li Si to eliminate the two. Under the guidance of east emperor too of the Yin-Yang school, Li Si finds Wei Zhuan, Gai Nie's fellow disciple, whose outstanding swordsmanship has always ranked just under Gai Nie's. To win the name of the first under heaven sword, Wei Zhuan agrees to Li Si's request, hardly realizing that he has fallen into a trap set by east emperor too. It turns out that daybreak, at birth, was secretly implanted by a geomancer with the "universe unit jade very", which is tied to a huge conspiracy. On the road of escape, Gai Nie and daybreak get to know the masters of the Mohist School, as well as Xiang Shaoyu, descendant of a senior general of Chu Guo, and the Mohist School girl Gao Yue. Led by fate, the group enters the Mohist School office city, said to be the last pure land all over the world. The office city is hidden among sheer ridges and peaks, gathers the deep wisdom of the Mohist School, is the last fortress of all anti-Qin forces all over the world, and is also the last refuge of the Mohist School disciples. TV series, The bright moon during Qin, costume drama, Lu Yi, Chen Yanxi, Jiang Jingfu, HD video, watch The bright moon during Qin episode 27 online.
In the segmentation result, the number of times each dictionary word is used is as follows: ghost party: 0, east emperor too: 2, prime minister: 1, under: 1, Wei Zhuan: 3, the Mohist School: 5, senior general: 1, all over the world: 1, the first under heaven: 1, the first under heaven sword: 1, daybreak: 3, last: 3, office: 2, Li Si: 3, Chu Guo: 1, hardly realize: 1, the state of Qin: 2, Emperor Qin: 2, the bright moon during Qin: 3, 27: 1.
This gives n = 5 word-frequency classes and the word-frequency set a = {0, 1, 2, 3, 5}.
The probability of class 0 is 1/20, of class 1 is 9/20, of class 2 is 4/20, of class 3 is 5/20, and of class 5 is 1/20. Applying the formula $H = -\sum_{i=1}^{n} P_i \log_n P_i$ with n = 5 gives the dictionary entropy:
$H = -\left(\tfrac{1}{20}\log_5\tfrac{1}{20} + \tfrac{9}{20}\log_5\tfrac{9}{20} + \tfrac{4}{20}\log_5\tfrac{4}{20} + \tfrac{5}{20}\log_5\tfrac{5}{20} + \tfrac{1}{20}\log_5\tfrac{1}{20}\right) \approx 0.825$
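The arithmetic of this worked example can be checked with a short script. It is only a verification sketch: the list `usage` copies the 20 usage counts in the order they are listed in the text, and the variable names are illustrative.

```python
import math
from collections import Counter

# Usage counts of the 20 dictionary entries, in the order listed in the text.
usage = [0, 2, 1, 1, 3, 5, 1, 1, 1, 1, 3, 3, 2, 3, 1, 1, 2, 2, 3, 1]

m = len(usage)            # number of words in the dictionary
classes = Counter(usage)  # word frequency -> number of words with that frequency
n = len(classes)          # number of word-frequency classes

# Entropy with logarithms to base n, as in the formula above.
entropy = -sum((size / m) * math.log(size / m, n) for size in classes.values())

print(m, n, sorted(classes.items()))
print(round(entropy, 3))  # 0.825
```

The class sizes printed by the script are 1, 9, 4, 5, and 1 for word frequencies 0, 1, 2, 3, and 5, matching the probabilities stated in the text.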
As this example shows, the method of the present invention can determine the entropy of each of multiple word segmentation dictionaries and select the dictionary with the maximum entropy as the word segmentation dictionary to be selected.
Fig. 4 shows a flowchart of an optional embodiment of the method shown in Fig. 1. Before step S11, part of the text content is grabbed from the text content of a content library to generate the test text. The specific implementation is:
S10: randomly grab part of the text content from the text content in the content library to generate the test text.
In this embodiment, part of the text content is randomly grabbed from the content library to generate the test text. With a good test text, the evaluation in step S11 is faster and more effective. The test text may be grabbed once, which ensures efficiency, or grabbed multiple times, which ensures the accuracy of the test text. When grabbing multiple times, the amount of text grabbed is smaller than the amount grabbed in a single grab, so that the accuracy of the test text improves without loss of efficiency.
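Step S10 can be sketched as follows, assuming the content library is available as a list of text entries. The name `build_test_text`, the `sample_size` and `rounds` parameters, and the optional `seed` are all hypothetical; the source only states that text is grabbed at random, once or several times.

```python
import random
from typing import List, Optional


def build_test_text(
    content_library: List[str],
    sample_size: int,
    rounds: int = 1,
    seed: Optional[int] = None,
) -> str:
    """Randomly grab `sample_size` entries per round and join them.

    One large round favors efficiency; several smaller rounds spread the
    sample over more of the library.
    """
    rng = random.Random(seed)
    pieces: List[str] = []
    for _ in range(rounds):
        k = min(sample_size, len(content_library))
        pieces.extend(rng.sample(content_library, k))
    return "\n".join(pieces)


if __name__ == "__main__":
    library = [f"description {i}" for i in range(100)]
    once = build_test_text(library, sample_size=20, rounds=1, seed=1)
    several = build_test_text(library, sample_size=5, rounds=3, seed=1)
    print(len(once.splitlines()), len(several.splitlines()))
```

In the usage example, the single grab takes 20 entries while each of the three repeated grabs takes 5, mirroring the statement that each repeated grab is smaller than a single grab.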
Fig. 5 shows a flowchart of an optional embodiment of the method shown in Fig. 1. After step S13, a further step S14 is performed: optimizing the word segmentation dictionary, which includes, for example:
according to a deletion instruction, deleting at least one word from the word segmentation dictionary to be selected, generating an updated word segmentation dictionary;
importing the updated dictionary into the evaluation processing device to generate an updated evaluation value;
comparing the updated evaluation value with the evaluation value corresponding to the word segmentation dictionary to be selected;
if the updated evaluation value is larger, taking the updated dictionary as the selected dictionary;
if the updated evaluation value is smaller, taking the word segmentation dictionary to be selected as the selected dictionary.
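The optimization step S14 can be sketched as below. The names `optimize_dictionary`, `to_delete`, and `evaluate` are hypothetical, and `evaluate` again stands in for the evaluation processing device. The source does not say what happens when the two evaluation values are equal; this sketch keeps the original dictionary in that case.

```python
from typing import Callable, Iterable, Set


def optimize_dictionary(
    candidate: Set[str],
    to_delete: Iterable[str],
    evaluate: Callable[[Set[str]], float],
) -> Set[str]:
    """Delete the requested words, re-evaluate, and keep the better dictionary."""
    updated = candidate - set(to_delete)
    if evaluate(updated) > evaluate(candidate):
        return updated
    # Assumption: on a tie or a lower value, keep the dictionary to be selected.
    return candidate


if __name__ == "__main__":
    # Placeholder evaluation: penalize a dictionary that contains "27".
    score = lambda words: 0.5 if "27" in words else 1.0
    print(sorted(optimize_dictionary({"Li Si", "27"}, ["27"], score)))
```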
Continuing with the example above: without affecting the segmentation result, the numeral "27" and the word "ghost party", whose word frequency is 0, can be deleted, and the most accurate segmentation is kept: "the first under heaven sword" is retained and "the first under heaven" is deleted.
The usage counts for the resulting dictionary (m = 17) are as follows:
east emperor too: 2, prime minister: 1, under: 1, Wei Zhuan: 3, the Mohist School: 5, senior general: 1, all over the world: 1, the first under heaven sword: 1, daybreak: 3, last: 3, office: 2, Li Si: 3, Chu Guo: 1, hardly realize: 1, the state of Qin: 2, Emperor Qin: 2, the bright moon during Qin: 3.
Result: n = 4 word-frequency classes, word-frequency set a = {1, 2, 3, 5}.
The probability of class 1 is 6/17, of class 2 is 5/17, of class 3 is 5/17, and of class 5 is 1/17.
With these probabilities and n = 4, the new dictionary entropy is:
$H' = -\left(\tfrac{6}{17}\log_4\tfrac{6}{17} + \tfrac{5}{17}\log_4\tfrac{5}{17} + \tfrac{5}{17}\log_4\tfrac{5}{17} + \tfrac{1}{17}\log_4\tfrac{1}{17}\right) \approx 0.905$
The entropy of the revised dictionary is therefore larger than that of the original dictionary. For the selected test text, the revised dictionary preserves segmentation quality while reducing storage space.
Fig. 6 is a schematic structural diagram of a system for selecting a word segmentation dictionary according to the present invention. As shown in Fig. 6, the system may include: an evaluation processing device 12, a dictionary import module 13, and a selection module 14, where:
the dictionary import module 13 is configured to import multiple word segmentation dictionaries into the evaluation processing device to generate multiple evaluation values corresponding to the multiple dictionaries;
the selection module 14 is configured to choose the maximum evaluation value from the multiple evaluation values and take the dictionary corresponding to the maximum evaluation value as the word segmentation dictionary to be selected.
As shown in Fig. 7, the evaluation processing device 12 may include: a segmenter 120, a counter 121, and an evaluation value generator 122.
The segmenter 120 is configured to segment the test text using the word segmentation dictionary.
The counter 121 is configured to count the word frequency of each word of the dictionary after segmentation.
The evaluation value generator 122 is configured to determine, based on the word frequency of each word and the number of words in the entire dictionary, the distribution uniformity value of the word frequencies across the classes into which the dictionary is divided, and to use this distribution uniformity value as the evaluation value, where words with the same word frequency belong to the same class.
The evaluation value generator 122 is configured to:
divide the words into word-frequency classes based on the word frequency of each word, where words with the same word frequency belong to the same class;
determine the number of words in each word-frequency class, and determine the proportion of each such number to the number of words in the entire dictionary;
from the determined proportions and the number of classes after division, determine the distribution uniformity value of the word frequencies across the classes of the dictionary.
In this embodiment, the determined proportions and the number of classes after division are fed into an information entropy generator, which yields the entropy corresponding to the dictionary. The evaluation value is provided to the selection module 14.
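The module structure of Figs. 6 and 7 can be mirrored in code as a rough sketch. The class `EvaluationDevice`, the function `import_and_select`, and the injectable `segment` callable are assumptions for illustration; the whitespace segmenter in the usage example is a toy stand-in and is not suitable for Chinese text.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Segmenter = Callable[[str, Set[str]], List[str]]


@dataclass
class EvaluationDevice:
    """Hypothetical counterpart of evaluation processing device 12."""
    segment: Segmenter  # segmenter 120
    test_text: str

    def count(self, dictionary: Set[str]) -> Dict[str, int]:
        """Counter 121: word frequency of every dictionary word."""
        seen = Counter(self.segment(self.test_text, dictionary))
        return {word: seen.get(word, 0) for word in dictionary}

    def evaluate(self, dictionary: Set[str]) -> float:
        """Evaluation value generator 122: entropy over word-frequency classes."""
        usage = self.count(dictionary)
        classes = Counter(usage.values())
        m, n = len(usage), len(classes)
        if n < 2:
            return 0.0  # assumption: a single class scores zero
        return -sum((c / m) * math.log(c / m, n) for c in classes.values())


def import_and_select(device: EvaluationDevice, dictionaries: List[Set[str]]) -> Set[str]:
    """Dictionary import module 13 plus selection module 14."""
    return max(dictionaries, key=device.evaluate)


if __name__ == "__main__":
    toy_segmenter = lambda text, dictionary: text.split()
    device = EvaluationDevice(segment=toy_segmenter, test_text="a a b c")
    print(sorted(import_and_select(device, [{"a", "x", "y"}, {"a", "b"}])))
```

Passing the segmenter in as a callable keeps the evaluation logic independent of whichever segmentation algorithm a deployment actually uses.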
Fig. 8 shows a structural diagram of another optional system based on the system shown in Fig. 6. As shown in Fig. 8, this system for selecting a word segmentation dictionary may include: a test text generation module 11, an evaluation processing device 12, a dictionary import module 13, and a selection module 14, where:
the test text generation module 11 is configured to, before the evaluation processing device is set up, randomly grab part of the text content from the text content of a content library to generate the test text, where, when the text is grabbed once, the grabbed part is a first portion of text content; when the text is grabbed multiple times, the grabbed part is a second portion of text content; and the first portion of text content is larger than the second portion of text content.
In this embodiment, before the evaluation, the test text generation module 11 randomly grabs part of the text content from the content library and so provides the test text for the evaluation triggered by the dictionary import module.
Fig. 9 shows a structural diagram of yet another optional system based on the system shown in Fig. 6. As shown in Fig. 9, this system for selecting a word segmentation dictionary may include: a test text generation module 11, an evaluation processing device 12, a dictionary import module 13, a selection module 14, and a dictionary optimization module 15, where:
the dictionary optimization module 15 is configured to, after the selection module has chosen the dictionary with the maximum evaluation value as the dictionary to be selected:
according to a deletion instruction, delete at least one word from the word segmentation dictionary to be selected, generating an updated word segmentation dictionary;
import the updated dictionary into the evaluation processing device to generate an updated evaluation value;
compare the updated evaluation value with the evaluation value corresponding to the word segmentation dictionary to be selected;
if the updated evaluation value is larger, take the updated dictionary as the selected dictionary, and if the updated evaluation value is smaller, take the word segmentation dictionary to be selected as the selected dictionary.
In this embodiment, the dictionary optimization module 15 preserves the accuracy of the selected dictionary while optimizing it, and because deleting entries reduces the size of the dictionary, it also makes the dictionary more efficient.
Fig. 10 is a schematic structural diagram of a user device 1200 according to an embodiment of the present application. The specific embodiments of the present application do not limit the specific implementation of the user device 1200. As shown in Fig. 10, the user device 1200 may include:
a processor 1210, a communications interface 1220, a memory 1230, and a communication bus 1240, where:
the processor 1210, the communications interface 1220, and the memory 1230 communicate with one another over the communication bus 1240.
The communications interface 1220 is used to communicate with network elements such as clients.
The processor 1210 is used to execute a program 1232, and specifically may perform the relevant steps of the method embodiments above.
Specifically, the program 1232 may include program code, and the program code includes computer operation instructions.
The processor 1210 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the embodiments above, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not take the essence of the corresponding technical solutions outside the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (12)
1. a participle dictionary system of selection, including:
One assessment processing means is set;
Multiple participle dictionaries are imported assessment processing means, generates the multiple assessed values corresponding with the plurality of participle dictionary;
Choose from the plurality of assessed value maximum assessed value, and using participle dictionary corresponding for the assessed value of described maximum as
Participle dictionary to be selected.
Method the most according to claim 1, wherein, described assessment processing means is used for:
Utilize participle dictionary that test text is carried out participle;
Add up the word frequency number of each vocabulary after described participle dictionary participle;
Vocabulary quantity in word frequency number based on each vocabulary and whole participle dictionary, after determining that described participle dictionary divides
The degree value that is evenly distributed of the word frequency number under of all categories, using the described degree value that is evenly distributed as assessed value, wherein, word frequency number phase
With for same classification.
Method the most according to claim 2, wherein, in described word frequency number based on each vocabulary and whole participle dictionary
Vocabulary quantity, determine described participle dictionary divide after of all categories under the degree value that is evenly distributed of word frequency number include:
Word frequency number based on each vocabulary, divides the classification of word frequency number, wherein, word frequency number identical for same classification;
Determine the quantity of the vocabulary corresponding to word frequency number under each classification, and determine that each quantity accounts for the word in whole participle dictionary
The proportion of remittance quantity;
Proportion determined by according to and the quantity of classification after dividing, determine described participle dictionary divide after of all categories under word
The degree value that is evenly distributed of frequency.
Method the most according to claim 3, wherein, proportion determined by described basis and the quantity of the classification after division,
Determine described participle dictionary divide after of all categories under the degree value that is evenly distributed of word frequency number include:
Proportion determined by by and the quantity import information entropy maker of the classification after division, it is thus achieved that corresponding with described participle dictionary
Entropy.
5. The method according to claim 1, further comprising, before setting up an assessment processing device: randomly grabbing a portion of the text content from the text content of a content library to generate a test text, wherein,
when the number of grabs is one, the portion of text content is a first portion of text content;
when the number of grabs is more than one, the portion of text content is a second portion of text content;
the first portion of text content is larger than the second portion of text content.
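One possible reading of this sampling step is sketched below. The content library is modeled as a list of strings, and the fractions `single_fraction` and `multi_fraction` are hypothetical parameters; the text only requires that the portion taken in a single grab be larger than the portion taken in each of several grabs.

```python
import random


def build_test_text(content_library, grabs=1, single_fraction=0.5,
                    multi_fraction=0.1, seed=None):
    """Randomly grab part of the library's text content as test text."""
    if grabs < 1:
        raise ValueError("grabs must be at least 1")
    if not multi_fraction < single_fraction:
        raise ValueError("a single grab must take the larger portion")
    rng = random.Random(seed)
    # One grab takes the larger (first) portion; repeated grabs each
    # take the smaller (second) portion.
    fraction = single_fraction if grabs == 1 else multi_fraction
    size = min(len(content_library),
               max(1, int(len(content_library) * fraction)))
    # Each grab is an independent random sample without replacement.
    return [rng.sample(content_library, size) for _ in range(grabs)]
```

The function returns one list of texts per grab, so a caller can score a dictionary on each sample separately or concatenate them.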
6. The method according to any one of claims 1-5, further comprising, after taking the word segmentation dictionary corresponding to the maximum assessed value as the word segmentation dictionary to be selected:
deleting, according to a deletion instruction, at least one word from the word segmentation dictionary to be selected, to generate an updated word segmentation dictionary;
importing the updated word segmentation dictionary into the assessment processing device to generate an updated assessed value;
comparing the updated assessed value with the assessed value corresponding to the word segmentation dictionary to be selected;
if the updated assessed value is larger, taking the updated word segmentation dictionary as the selected dictionary;
if the updated assessed value is smaller, taking the word segmentation dictionary to be selected as the selected dictionary.
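The delete-and-compare step above can be sketched as a small helper. The name `refine_dictionary` and the `evaluate` callable (standing in for the assessment processing device) are hypothetical, and the tie-breaking rule is an assumption because the text does not cover equal values.

```python
def refine_dictionary(candidate, words_to_delete, evaluate):
    """Keep a pruned dictionary only if it scores higher than the candidate.

    candidate: set of words in the dictionary to be selected.
    words_to_delete: words named by the deletion instruction.
    evaluate: hypothetical callable mapping a dictionary to its assessed value.
    """
    updated = set(candidate) - set(words_to_delete)
    baseline_score = evaluate(candidate)
    updated_score = evaluate(updated)
    # Assumption: on a tie this sketch keeps the original candidate.
    if updated_score > baseline_score:
        return updated, updated_score
    return set(candidate), baseline_score
```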
7. A word segmentation dictionary selection system, comprising:
an assessment processing device;
a dictionary import module, configured to import a plurality of word segmentation dictionaries into the assessment processing device and generate a plurality of assessed values corresponding to the plurality of word segmentation dictionaries;
a selection module, configured to select the maximum assessed value from the plurality of assessed values and take the word segmentation dictionary corresponding to the maximum assessed value as the word segmentation dictionary to be selected.
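The import and selection modules above amount to scoring every candidate and taking the maximum. A minimal sketch follows, with `select_dictionary` and `evaluate` as hypothetical names; any scoring function, such as an entropy-based one, could be passed in.

```python
def select_dictionary(dictionaries, evaluate):
    """Return the best dictionary's name together with all assessed values.

    dictionaries: mapping name -> dictionary (any object evaluate accepts).
    evaluate: hypothetical callable standing in for the assessment device.
    """
    if not dictionaries:
        raise ValueError("at least one dictionary is required")
    # One assessed value per imported dictionary.
    scores = {name: evaluate(d) for name, d in dictionaries.items()}
    # The dictionary with the maximum assessed value is the one to select.
    best = max(scores, key=scores.get)
    return best, scores
```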
8. The system according to claim 7, wherein the assessment processing device comprises:
a segmenter, configured to segment a test text using a word segmentation dictionary;
a counter, configured to count the word frequency of each word obtained after segmentation with the word segmentation dictionary;
an assessed value generator, configured to determine, based on the word frequency of each word and the number of words in the entire word segmentation dictionary, a distribution uniformity value of the word frequencies across the categories into which the word segmentation dictionary is divided, and to use the distribution uniformity value as the assessed value, wherein identical word frequencies belong to the same category.
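The segmenter and counter can be illustrated with a toy example. The text does not specify a segmentation algorithm, so this sketch swaps in forward maximum matching, a common dictionary-based technique, purely as an assumption; `segment`, `count_word_frequencies`, and `max_len` are hypothetical names.

```python
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionary:
                match = piece
                break
        if match is None:
            # No dictionary word starts here; skip one character.
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens


def count_word_frequencies(text, dictionary, max_len=4):
    """Word frequency of each dictionary word found in the test text."""
    return Counter(segment(text, dictionary, max_len))
```

The resulting counts are the per-word frequencies that an assessed value generator would then group into categories.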
9. The system according to claim 8, wherein the assessed value generator is configured to:
divide the word frequencies into categories based on the word frequency of each word, wherein identical word frequencies belong to the same category;
determine the number of words corresponding to the word frequency in each category, and determine the proportion of each such number relative to the number of words in the entire word segmentation dictionary;
determine, according to the determined proportions and the number of categories after division, the distribution uniformity value of the word frequencies across the categories of the word segmentation dictionary.
10. The system according to claim 9, wherein the assessed value generator is configured to:
input the determined proportions and the number of categories after division into an information entropy generator to obtain an entropy value corresponding to the word segmentation dictionary.
11. The system according to claim 7, wherein the system further comprises a test text generation module, configured to:
before the dictionary import module imports the plurality of word segmentation dictionaries into the assessment processing device, randomly grab a portion of the text content from the text content of a content library to generate a test text, wherein,
when the number of grabs is one, the portion of text content is a first portion of text content;
when the number of grabs is more than one, the portion of text content is a second portion of text content;
the first portion of text content is larger than the second portion of text content.
12. The system according to any one of claims 7-11, wherein the system further comprises a dictionary optimization module, configured to:
after the selection module takes the word segmentation dictionary corresponding to the maximum assessed value as the word segmentation dictionary to be selected, delete, according to a deletion instruction, at least one word from the word segmentation dictionary to be selected, to generate an updated word segmentation dictionary;
import the updated word segmentation dictionary into the assessment processing device to generate an updated assessed value;
compare the updated assessed value with the assessed value corresponding to the word segmentation dictionary to be selected;
if the updated assessed value is larger, take the updated word segmentation dictionary as the selected dictionary;
if the updated assessed value is smaller, take the word segmentation dictionary to be selected as the selected dictionary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610512054.5A CN106156002A (en) | 2016-06-30 | 2016-06-30 | Word segmentation dictionary selection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610512054.5A CN106156002A (en) | 2016-06-30 | 2016-06-30 | Word segmentation dictionary selection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106156002A (en) | 2016-11-23 |
Family
ID=57350982
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610512054.5A Pending CN106156002A (en) | 2016-06-30 | 2016-06-30 | Word segmentation dictionary selection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156002A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108255956A (en) * | 2017-12-21 | 2018-07-06 | 北京声智科技有限公司 | The method and system of dictionary are adaptively obtained based on historical data and machine learning |
CN109522298A (en) * | 2018-08-29 | 2019-03-26 | 云南电网有限责任公司信息中心 | Data cleaning method for CIM |
CN111178070A (en) * | 2019-12-25 | 2020-05-19 | 平安医疗健康管理股份有限公司 | Word sequence obtaining method and device based on word segmentation and computer equipment |
CN112765975A (en) * | 2020-12-25 | 2021-05-07 | 北京百度网讯科技有限公司 | Word segmentation ambiguity processing method, device, equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101246473A (en) * | 2008-03-28 | 2008-08-20 | 腾讯科技(深圳)有限公司 | Segmentation system evaluating method and segmentation evaluating system |
CN101710326A (en) * | 2009-12-03 | 2010-05-19 | 腾讯科技(深圳)有限公司 | Word stock substitution method, device and input method system |
CN103458462A (en) * | 2012-06-04 | 2013-12-18 | 电信科学技术研究院 | Cell selection method and equipment |
CN104966090A (en) * | 2015-07-21 | 2015-10-07 | 公安部第三研究所 | Visual word generation and evaluation system and method for realizing image comprehension |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108255956A (en) * | 2017-12-21 | 2018-07-06 | 北京声智科技有限公司 | The method and system of dictionary are adaptively obtained based on historical data and machine learning |
CN109522298A (en) * | 2018-08-29 | 2019-03-26 | 云南电网有限责任公司信息中心 | Data cleaning method for CIM |
CN111178070A (en) * | 2019-12-25 | 2020-05-19 | 平安医疗健康管理股份有限公司 | Word sequence obtaining method and device based on word segmentation and computer equipment |
CN111178070B (en) * | 2019-12-25 | 2022-11-25 | 深圳平安医疗健康科技服务有限公司 | Word sequence obtaining method and device based on word segmentation and computer equipment |
CN112765975A (en) * | 2020-12-25 | 2021-05-07 | 北京百度网讯科技有限公司 | Word segmentation ambiguity processing method, device, equipment and medium |
CN112765975B (en) * | 2020-12-25 | 2023-08-04 | 北京百度网讯科技有限公司 | Word segmentation disambiguation processing method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105787025B (en) | Network platform public account classification method and device | |
CN106156002A (en) | Word segmentation dictionary selection method and system | |
CN106528894B (en) | The method and device of label information is set | |
CN111581092B (en) | Simulation test data generation method, computer equipment and storage medium | |
CN111352907A (en) | Method and device for analyzing pipeline file, computer equipment and storage medium | |
CN102402619A (en) | Search method and device | |
CN106874253A (en) | Recognize the method and device of sensitive information | |
CN104317784A (en) | Cross-platform user identification method and cross-platform user identification system | |
CN104317891B (en) | A kind of method and device that label is marked to the page | |
CN103902535A (en) | Method, device and system for obtaining associational word | |
AU2018452738B2 (en) | Binning for nonlinear modeling | |
CN110674247A (en) | Barrage information intercepting method and device, storage medium and equipment | |
CN111510368B (en) | Family group identification method, device, equipment and computer readable storage medium | |
CN105678625A (en) | Method and equipment for determining identity information of user | |
CN105678129A (en) | Method and device for determining user identity information | |
CN110032727A (en) | Risk Identification Method and device | |
CN109558528A (en) | Article method for pushing, device, computer readable storage medium and server | |
CN106257449A (en) | A kind of information determines method and apparatus | |
CN110532528B (en) | Book similarity calculation method based on random walk and electronic equipment | |
CN110008352B (en) | Entity discovery method and device | |
CN110110119B (en) | Image retrieval method, device and computer readable storage medium | |
Leeb | Random numbers for computer simulation | |
CN108509571A (en) | A kind of webpage information data excavation universal method | |
CN105260467B (en) | A kind of SMS classified method and device | |
CN106611059A (en) | Method and device for recommending multi-media files |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20161123 |