CN106951410A - Dictionary generation method, device and electronic equipment - Google Patents


Info

Publication number
CN106951410A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710171679.4A
Other languages
Chinese (zh)
Other versions
CN106951410B (en)
Inventor
梁孟然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201710171679.4A priority Critical patent/CN106951410B/en
Publication of CN106951410A publication Critical patent/CN106951410A/en
Application granted granted Critical
Publication of CN106951410B publication Critical patent/CN106951410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a dictionary generation method, a device and electronic equipment, and relates to the technical field of intelligent dictionaries. It achieves accurate word cutting and generates an effective dictionary in fields with a high update frequency, such as product names. The dictionary generation method includes: a word-cutting step, in which the entries of an initial dictionary are determined according to the frequencies, in a dictionary module, of the characters and character combinations in candidate entries, the dictionary module storing a number of single-character word pieces and their frequencies and a number of two-character word pieces and their frequencies; and an aggregation step, in which the entries in the initial dictionary are traversed and multiple entries that meet an aggregation condition are merged into one entry to generate a final dictionary.

Description

Dictionary generation method, device and electronic equipment
Technical field
The present invention relates to the technical field of intelligent dictionaries, and more particularly to a dictionary generation method, device and electronic equipment.
Background
In the e-commerce field, product names are the most important basic data. Product names have an obvious structure, a high word-formation rate, rich meaning and limited length.
On the other hand, the dictionary is the key to any further operation on products, and product names form a highly specialized field with a high update frequency. Because the vocabulary of product names is also large, maintaining the dictionary by hand requires a huge amount of repetitive work.
However, the existing approach of matching-based word segmentation against a general-purpose dictionary performs poorly. It typically performs a first segmentation of the original sentence using term frequency-inverse document frequency (TF-IDF) on top of an N-gram model, performs a second segmentation of the original sentence with an existing segmenter, and then merges the two segmentation results to generate the dictionary. The N-gram language model has high time and space complexity, and the TF-IDF keyword-extraction algorithm requires a large corpus to be maintained while also having difficulty discovering newly emerging, low-frequency words such as brands and products, so it is not suited to fields with a high word-formation rate such as product names.
For fields with a high update frequency such as product names, it is therefore difficult for the prior art to cut words accurately, and hence difficult to generate an effective dictionary.
Summary of the invention
In view of this, the object of the present invention is to provide a dictionary generation method, device and electronic equipment that can cut words accurately and generate an effective dictionary in fields with a high update frequency, such as product names.
In a first aspect, an embodiment of the present invention provides a dictionary generation method, including:
a word-cutting step: determining the entries of an initial dictionary according to the frequencies, in a dictionary module, of the characters and character combinations in candidate entries, wherein the dictionary module stores a number of single-character word pieces and their frequencies, and a number of two-character word pieces and their frequencies;
an aggregation step: traversing the entries in the initial dictionary and merging multiple entries that meet an aggregation condition into one entry, to generate a final dictionary.
In combination with the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, in which the word-cutting step is specifically:
traversing adjacent characters a and b in the candidate entry, and the character combination ab formed by character a and character b;
obtaining the frequencies of character a, character b and character combination ab in the dictionary module;
when the frequency of character a, the frequency of character b and the frequency of character combination ab meet a word-cutting condition, taking character combination ab as an entry, or part of an entry, in the initial dictionary.
In combination with the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, in which the aggregation step is specifically:
traversing the entries in the initial dictionary;
when entry p is contained in entry q, obtaining the frequencies of entry p and entry q in the initial dictionary;
when the frequency of entry p and the frequency of entry q meet the aggregation condition, replacing entry p with entry q, or replacing entry q with entry p.
In combination with the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, in which, before the word-cutting step, the method further includes:
a preprocessing step: preprocessing input entries to generate candidate entries.
In combination with the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, in which the preprocessing step is specifically:
excluding the useless characters in the input entries;
merging the repeated entries among the input entries to generate candidate entries.
In combination with the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, in which the method further includes:
extracting single-character word pieces from the input entries after the useless characters have been excluded;
updating the frequencies of the single-character word pieces in the dictionary module;
extracting two-character word pieces from the input entries after the useless characters have been excluded;
updating the frequencies of the two-character word pieces in the dictionary module.
In combination with the first aspect, an embodiment of the present invention provides a sixth possible implementation of the first aspect, in which, before the frequency of a single-character word piece is updated in the dictionary module, the method further includes:
when the single-character word piece is in a frequency-reduction character set, applying frequency reduction to that single-character word piece.
In combination with the first aspect, an embodiment of the present invention provides a seventh possible implementation of the first aspect, in which the method further includes:
a weighting step: traversing the entries in the final dictionary and, when the frequencies of two entries in an inclusion relation meet a weighting condition, taking the sum of the frequencies of the two entries as the frequency of the contained entry.
In a second aspect, an embodiment of the present invention further provides a dictionary generation device, including:
a word-cutting module, configured to determine the entries of an initial dictionary according to the frequencies, in a dictionary module, of the characters and character combinations in candidate entries, wherein the dictionary module stores a number of single-character word pieces and their frequencies, and a number of two-character word pieces and their frequencies;
an aggregation module, configured to traverse the entries in the initial dictionary and merge multiple entries that meet an aggregation condition into one entry.
In a third aspect, an embodiment of the present invention further provides electronic equipment, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the steps of the above method.
The embodiments of the present invention bring the following beneficial effects. The dictionary generation method provided by the embodiments mainly includes a word-cutting step and an aggregation step. In the word-cutting step, the entries of the initial dictionary are determined according to the frequencies, in the dictionary module, of the characters and character combinations in the candidate entries. Because the dictionary module stores only single-character and two-character word pieces and their respective frequencies, accurate word cutting can be achieved with low-complexity statistics. In the aggregation step, multiple entries that meet the aggregation condition are merged into one entry, which merges entries with high similarity and generates the final dictionary. The technical solution provided by the embodiments can therefore cut words accurately and generate an effective dictionary in fields with a high update frequency, such as product names.
Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention are realized and attained by the structure particularly pointed out in the description, the claims and the accompanying drawings.
To make the above objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the dictionary generation method provided by embodiment one of the present invention;
Fig. 2 is a flowchart of the dictionary generation method provided by embodiment two of the present invention;
Fig. 3 is a detailed flowchart of step S201 in embodiment two of the present invention;
Fig. 4 is a detailed flowchart of step S202 in embodiment two of the present invention;
Fig. 5 is a detailed flowchart of step S203 in embodiment two of the present invention;
Fig. 6 is a detailed flowchart of step S204 in embodiment two of the present invention;
Fig. 7 is a schematic diagram of the dictionary generation device provided by embodiment three of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
At present, for fields with a high update frequency such as product names, it is difficult for the prior art to cut words accurately, and hence difficult to generate an effective dictionary. On this basis, the dictionary generation method, device and electronic equipment provided by the embodiments of the present invention can cut words accurately and generate an effective dictionary.
Embodiment one:
This embodiment of the present invention provides a dictionary generation method that can be applied to fields with a high update frequency, such as product names. As shown in Fig. 1, the dictionary generation method includes:
S101, word-cutting step: determine the entries of the initial dictionary according to the frequencies, in the dictionary module, of the characters and character combinations in the candidate entries.
The dictionary module stores a number of single-character word pieces and their frequencies, and a number of two-character word pieces and their frequencies.
S102, aggregation step: traverse the entries in the initial dictionary and merge multiple entries that meet the aggregation condition into one entry, to generate the final dictionary.
In the word-cutting step, the entries of the initial dictionary are determined according to the frequencies, in the dictionary module, of the characters and character combinations in the candidate entries. Because the dictionary module stores only single-character and two-character word pieces and their respective frequencies, accurate word cutting can be achieved with low-complexity statistics. In the aggregation step, multiple entries that meet the aggregation condition are merged into one entry, which merges entries with high similarity and generates the final dictionary. The dictionary generation method provided by this embodiment can therefore cut words accurately and generate an effective dictionary in fields with a high update frequency, such as product names.
Embodiment two:
This embodiment of the present invention provides a dictionary generation method that can be applied to fields with a high update frequency, such as product names. As shown in Fig. 2, the dictionary generation method includes the following steps:
S201, preprocessing step: preprocess the input entries to generate candidate entries.
As shown in Fig. 3, the preprocessing step in this embodiment is specifically:
S2011: Exclude the useless characters in the input entries.
Useless characters in the input entries can be excluded with the help of a preset useless-character set, which typically stores in advance characters related to the type, model, specification and the like of a product.
For example, if the input entry is "bright smooth excellent Yoghourt (apple aroma)" and the character rendered here as "taste" (the "aroma" part of the entry) belongs to the useless-character set, this step deletes that character and the input entry becomes "bright smooth excellent Yoghourt (apple)".
S2012: Process the input entries according to the business scenario.
In practice, different business scenarios have different requirements. For example, in a business scenario where the flavor of a food does not matter, "(apple)" can be deleted from the input entry "bright smooth excellent Yoghourt (apple)", so that the input entry becomes "bright smooth excellent Yoghourt".
S2013: Merge the repeated entries among the input entries to generate candidate entries.
Repeated input entries are merged to generate candidate entries, and the candidate entries together form a candidate entry base.
In addition, two entries that were originally different may become repeated entries after steps S2011 and S2012. For example, the two input entries "bright smooth excellent Yoghourt (apple aroma)" and "bright smooth excellent Yoghourt (strawberry flavor)" both become "bright smooth excellent Yoghourt" after steps S2011 and S2012; entries in this situation should also be merged.
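Steps S2011 to S2013 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `preprocess`, the `drop_parenthesised` business rule (delete any bracketed qualifier) and the placeholder useless character `~` are all hypothetical and chosen only for illustration.

```python
import re


def preprocess(entries, useless_chars, drop_parenthesised=True):
    """Return de-duplicated candidate entries (hypothetical helper)."""
    candidates = []
    seen = set()
    for entry in entries:
        # S2011: drop every character found in the useless-character set.
        cleaned = "".join(ch for ch in entry if ch not in useless_chars)
        # S2012: an assumed business rule that ignores bracketed qualifiers
        # such as a flavor; ASCII and full-width brackets are both handled.
        if drop_parenthesised:
            cleaned = re.sub(r"\s*[(（][^)）]*[)）]", "", cleaned)
        cleaned = cleaned.strip()
        # S2013: merge entries that have become identical.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            candidates.append(cleaned)
    return candidates


# "~" stands in for a useless character such as the "taste" character.
candidates = preprocess(["yogurt (apple~)", "yogurt (berry~)"], {"~"})
```

Both inputs collapse to the same candidate entry here, which mirrors the apple and strawberry example in the text.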
Further, the dictionary module can also be maintained during the above preprocessing, which specifically includes the following steps:
S2014: Extract single-character word pieces from the input entries.
Preferably, this step extracts the single-character word pieces and their frequencies from the input entries after the useless characters have been excluded and the business-scenario processing has been applied, to reduce the workload of maintaining the dictionary module.
Further, this step should be performed on the input entries as processed by step S2012, to ensure the accuracy of the single-character word-piece frequencies.
S2015: Judge whether the word piece is in the frequency-reduction character set.
When a single-character word piece is in the frequency-reduction character set, frequency reduction is applied to it. The frequency-reduction character set can be preset, and typically stores in advance characters that describe the unit, size, measure word and the like of a product. For example, in the input entry "bottle opener", the character "bottle" is in the frequency-reduction character set, so frequency reduction is applied to the character "bottle".
Frequency reduction is usually done by division. For example, if the character "bottle" occurs 200 times, the count is divided by 50 (or another value), so the reduced frequency of "bottle" is 4. This reduction expresses that, of the 200 occurrences of "bottle", about 196 are as a measure word (e.g. bottled beer) and the remaining 4 are as part of a product name (e.g. bottle opener), which improves the accuracy of the character-frequency statistics.
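The division described above can be sketched in Python as follows. The function name, the `reduction_set` argument and the use of integer division are assumptions made for illustration; the text only states that the count is divided by 50 or another value.

```python
def reduce_frequency(char, count, reduction_set, divisor=50):
    """Scale down the count of a character in the frequency-reduction set."""
    if char in reduction_set:
        # Assumed: integer division, so 200 occurrences become 4.
        return count // divisor
    return count


# "B" is a placeholder for a measure-word character such as "bottle".
reduced = reduce_frequency("B", 200, {"B"})
unchanged = reduce_frequency("X", 200, {"B"})
```

With the values from the text, a count of 200 and a divisor of 50 give a reduced frequency of 4, while characters outside the set keep their count.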
S2016: Update the single-character dictionary.
The dictionary module in this embodiment includes a single-character dictionary and a two-character dictionary, whose contents are enriched continually as the method runs. That is, when the dictionary generation method of this embodiment is first executed, the dictionary module is empty and holds no word-piece records.
This step updates the frequencies of the single-character word pieces in the dictionary module. That is, the extracted single-character word pieces and their frequencies are written to the single-character dictionary of the dictionary module.
S2017: Extract two-character word pieces from the input entries.
As in step S2014, this step extracts the two-character word pieces and their frequencies from the input entries after the useless characters have been excluded and the business-scenario processing has been applied, to reduce the workload of maintaining the dictionary module.
For example, from the six-character input entry "bright smooth excellent Yoghourt", five two-character word pieces are extracted, one for each pair of adjacent characters (rendered in this translation as "light", "clear and lucid", "freely excellent", "excellent acid" and "Yoghourt").
S2018: Update the two-character dictionary.
The extracted two-character word pieces and their frequencies are written to the two-character dictionary of the dictionary module.
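Steps S2014 and S2016 to S2018 amount to counting single characters and adjacent character pairs. A minimal Python sketch follows, assuming the two dictionaries are `collections.Counter` objects; the function names are hypothetical, and the frequency reduction of step S2015 is left out for brevity.

```python
from collections import Counter


def extract_word_pieces(entry):
    """Return the single-character and two-character word pieces of an entry."""
    singles = list(entry)
    # One two-character piece per pair of adjacent characters.
    doubles = [entry[i:i + 2] for i in range(len(entry) - 1)]
    return singles, doubles


def update_dictionary_module(entries, single_dict, double_dict):
    """Add the word-piece counts of a batch of entries to both dictionaries."""
    for entry in entries:
        singles, doubles = extract_word_pieces(entry)
        single_dict.update(singles)
        double_dict.update(doubles)


single_dict, double_dict = Counter(), Counter()  # empty on first run
update_dictionary_module(["ABC", "ABD"], single_dict, double_dict)
```

A six-character entry yields five two-character pieces, matching the example in the text; the Latin letters are placeholders for the characters of a real entry.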
S202, word-cutting step: determine the entries of the initial dictionary according to the frequencies, in the dictionary module, of the characters and character combinations in the candidate entries.
As shown in Fig. 4, the detailed process of this step is as follows:
S2021: Traverse the characters and character combinations in the candidate entry; set s=1.
S2021a: Set a=s.
S2021b: Set b=a+1.
Here the parameter s tracks the progress of the traversal: s=1 means the traversal starts from the first character of the candidate entry; a=s marks the starting character a currently being judged; b=a+1 marks character b, the character immediately after a, and hence the character combination ab currently being judged.
For example, for the candidate entry "bright smooth excellent Yoghourt", the first character, the second character and the two-character combination they form (rendered "light", "bright" and "light" in this translation) are judged first.
S2022: Obtain the frequencies of character a, character b and character combination ab in the dictionary module.
S2023: Judge whether character a, character b and character combination ab meet the word-cutting condition.
This embodiment judges whether the result of the word-cutting formula Pa·Pb / (Pab² · (Pa + Pb + Pab)) is greater than a word-cutting threshold, where Pa is the frequency of character a, Pb is the frequency of character b, and Pab is the frequency of character combination ab. The word-cutting threshold can typically be set to 1.0.
If the result is less than or equal to 1.0, the frequency of character a, the frequency of character b and the frequency of character combination ab meet the word-cutting condition. This indicates that the characters of combination ab (the first two characters in the example, rendered "light") are often used together within an entry, so ab can serve as a two-character entry in the initial dictionary, or as part of a multi-character entry.
Step S2023a is then performed: the value of parameter a is incremented by 1 (a++) and the flow returns to step S2021b, where the frequencies of the next character combination ab ("clear and lucid") and of its character a ("bright") and character b ("smooth") are obtained and evaluated with the word-cutting formula, until a result greater than 1.0 occurs.
If the result is greater than 1.0, character combination ab is seldom used together within an entry, so the entry should be cut between character a and character b, and step S2024 is performed.
Cutting words in this way determines the entries to be put into the initial dictionary.
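The word-cutting test can be tried on a pair of invented frequency triples. The sketch below reads the "Pab2" in the formula as the square of Pab, which is an assumption about a superscript lost in extraction; the function name and the treatment of an unseen combination (Pab = 0) as "always cut" are also assumptions.

```python
def cut_score(pa, pb, pab):
    """Word-cutting score Pa*Pb / (Pab^2 * (Pa + Pb + Pab))."""
    if pab == 0:
        # Assumed: a combination never seen together is always cut.
        return float("inf")
    return (pa * pb) / (pab ** 2 * (pa + pb + pab))


THRESHOLD = 1.0  # word-cutting threshold from the text

# Two characters that almost always appear together: score well below 1.0.
together = cut_score(10, 10, 8)
# Two frequent characters that rarely appear together: score well above 1.0.
apart = cut_score(100, 100, 1)
```

The first triple gives 100 / (64 · 28), which is about 0.056, so the two characters stay together; the second gives 10000 / 201, which is about 49.75, so the entry is cut between them.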
S2024: Extract the entry spanning characters s to a and put it into the initial dictionary.
In the "bright smooth excellent Yoghourt" example, the extracted entry is the first two characters (rendered "light"), which is put into the initial dictionary. In other examples, if the loop between step S2021b and step S2023 has run several times, an entry of three or more characters can be extracted.
S2025: Judge whether the candidate entry has been fully traversed.
If not, step S2025a is performed: s is set to a+1 and the flow returns to step S2021a, continuing with the character combination "smooth excellent". If the traversal is complete, this step ends and the initial dictionary is generated.
With the word-cutting step, the initial dictionary can be built automatically by a statistical algorithm alone, with no dependence on an existing dictionary or any segmenter. The statistical approach cuts words accurately in linear O(n) time, and only the single-character dictionary and the two-character dictionary need to be maintained to supply the statistics. In addition, the word-cutting formula can quickly and accurately find word pieces such as newly emerging, low-frequency brands or products, so accurate word cutting and an effective dictionary are achieved in fields with a high update frequency such as product names.
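The loop of steps S2021 to S2025 can be sketched in Python as a single left-to-right pass. This is a sketch under assumptions, not the patented implementation: indices are 0-based rather than 1-based, "Pab2" is read as Pab squared, a combination with zero frequency is always cut, and all function names and frequencies are invented for illustration.

```python
from collections import Counter


def cut_score(pa, pb, pab):
    """Word-cutting score; an unseen combination is treated as 'cut'."""
    if pab == 0:
        return float("inf")
    return (pa * pb) / (pab ** 2 * (pa + pb + pab))


def cut_entry(entry, single_dict, double_dict, threshold=1.0):
    """Split one candidate entry into entries for the initial dictionary."""
    pieces = []
    s = 0                                  # S2021: start of the current piece
    while s < len(entry):                  # S2025: stop when fully traversed
        a = s                              # S2021a
        while a + 1 < len(entry):          # S2021b: b = a + 1
            ca, cb = entry[a], entry[a + 1]
            score = cut_score(single_dict[ca], single_dict[cb],
                              double_dict[ca + cb])      # S2022, S2023
            if score > threshold:
                break                      # cut between a and b
            a += 1                         # S2023a: extend the piece
        pieces.append(entry[s:a + 1])      # S2024: take characters s..a
        s = a + 1                          # S2025a
    return pieces


# Invented frequencies: "AB" and "CD" co-occur often, "BC" almost never.
single_dict = Counter({"A": 10, "B": 10, "C": 100, "D": 100})
double_dict = Counter({"AB": 8, "BC": 1, "CD": 60})
pieces = cut_entry("ABCD", single_dict, double_dict)
```

Each adjacent pair is scored exactly once, which is consistent with the linear O(n) behaviour described in the text. An initial dictionary with frequencies could then be built by counting the returned pieces over all candidate entries.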
S203, aggregation step: traverse the entries in the initial dictionary and merge multiple entries that meet the aggregation condition into one entry, to generate the final dictionary.
As shown in Fig. 5, the detailed process of this step is as follows:
S2031: Traverse the entries in the initial dictionary.
S2032: Judge whether there are entries in an inclusion relation.
In this step, usually only entries whose lengths differ by one character are compared. If such entries exist, proceed to step S2033; otherwise, end this step.
S2033: Obtain the frequencies, in the initial dictionary, of the entries in an inclusion relation.
When entry p is contained in entry q, the frequencies of entry p and entry q in the initial dictionary are obtained. For example, entry p "corkage" is contained in entry q "bottle opener", so the frequencies of these two entries in the initial dictionary are obtained.
S2034: Judge whether the frequency of entry p and the frequency of entry q meet the aggregation condition.
This embodiment compares the value of the aggregation formula d = Pp / (Pp + Pq) with aggregation thresholds, where Pp is the frequency of entry p and Pq is the frequency of entry q. Preferably there are two aggregation thresholds, 0.9 and 0.1.
When d lies between the two aggregation thresholds, the aggregation condition is not met; when d is less than 0.1 or greater than 0.9, the aggregation condition is met.
Specifically, if d < 0.1, the frequency of entry p is far below that of entry q, and step S2034a is performed: entry p is replaced with entry q, for example "corkage" is replaced with "bottle opener". If d > 0.9, the frequency of entry p is far above that of entry q, and step S2034b is performed: entry q is replaced with entry p.
If 0.1 ≤ d ≤ 0.9, end this step.
S2035: Generate the final dictionary.
The dictionary generation method of this embodiment runs in a loop, once for each batch of input entries, so this step usually updates a final dictionary that has already been generated, producing a more complete final dictionary. If the current batch is the first input, this step generates the final dictionary for the first time.
In the aggregation step, the aggregation formula is used to merge multiple entries that meet the aggregation condition into one entry. This merges entries with high similarity and generates the final dictionary, making the entries in the final dictionary more concise and accurate.
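The aggregation rule can be sketched in Python as follows. The text says that one entry "replaces" the other but does not say what happens to the counts, so this sketch assumes the removed entry's frequency is added to the surviving entry. The function name and the sample frequencies are hypothetical.

```python
from collections import Counter


def aggregate(initial, low=0.1, high=0.9):
    """Merge contained/containing entry pairs whose frequencies are lopsided."""
    merged = Counter(initial)
    for p in sorted(initial, key=len):
        for q in initial:
            if p not in merged or q not in merged:
                continue  # one of the two was already merged away
            # Only compare entries whose lengths differ by one character.
            if len(q) - len(p) != 1 or p not in q:
                continue
            d = merged[p] / (merged[p] + merged[q])
            if d < low:
                # S2034a: p is rare next to q, so p is replaced with q.
                merged[q] += merged.pop(p)
            elif d > high:
                # S2034b: q is rare next to p, so q is replaced with p.
                merged[p] += merged.pop(q)
    return merged


initial = Counter({"AB": 2, "ABC": 98, "XY": 95, "XYZ": 5, "MN": 50, "MNO": 50})
final = aggregate(initial)
```

Here "AB" is folded into "ABC" (d = 0.02), "XYZ" is folded into "XY" (d = 0.95), and "MN" and "MNO" both survive because d = 0.5 lies between the two thresholds.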
As a preferred scheme, the dictionary generation method provided by this embodiment further includes S204, a weighting step: traverse the entries in the final dictionary and, when the frequencies of two entries in an inclusion relation meet the weighting condition, take the sum of the frequencies of the two entries as the frequency of the contained entry.
As shown in Fig. 6, the detailed process of this step is as follows:
S2041: Traverse the entries in the final dictionary.
S2042: Judge whether there are entries in an inclusion relation.
If such entries exist, proceed to step S2043; otherwise, end this step.
S2043: Obtain the frequencies, in the final dictionary, of the entries in an inclusion relation.
When entry x is contained in entry y, the frequencies of entry x and entry y in the final dictionary are obtained. For example, the entry "Pepsi" is contained in "Pepsi Cola", so the frequencies of these two entries in the final dictionary are obtained.
S2044: Judge whether the frequency of entry x and the frequency of entry y meet the weighting condition.
This embodiment compares the value of the weighting formula e = Px / (Px + Py) with a weighting threshold, where Px is the frequency of entry x and Py is the frequency of entry y. The weighting threshold can typically be set to 0.7.
If e ≥ 0.7, the weighting condition is met and step S2045 is performed. If e < 0.7, the weighting condition is not met and this step ends.
S2045: Weight the frequency.
The sum of the frequencies of entry x and entry y is taken as the frequency of entry x in the final dictionary. For example, the sum of the frequencies of "Pepsi" and "Pepsi Cola" is taken as the final-dictionary frequency of "Pepsi". Step S2046 is then performed.
S2046: Update the final dictionary.
With the weighting step, two entries in an inclusion relation can coexist, and the frequency of the shorter, brand-type entry is weighted, which reflects that entry's frequency more objectively and improves the accuracy of the final dictionary.
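The weighting step can be sketched in Python as follows. The sketch assumes that e is always computed from the frequencies as they were before weighting, so the result does not depend on traversal order; that choice, the function name and the sample frequencies are illustrative assumptions.

```python
from collections import Counter


def weight(final, threshold=0.7):
    """Add a containing entry's frequency to a dominant contained entry."""
    weighted = Counter(final)
    for x in final:
        for y in final:
            if x == y or x not in y:
                continue  # x must be strictly contained in y
            e = final[x] / (final[x] + final[y])
            if e >= threshold:
                # S2045: x takes the sum of both frequencies; y is kept as is.
                weighted[x] += final[y]
    return weighted


final = Counter({"Pepsi": 80, "Pepsi Cola": 20, "Cola": 5})
weighted = weight(final)
```

With these invented counts, "Pepsi" has e = 0.8 against "Pepsi Cola" and its frequency becomes 100, while "Cola" has e = 0.2 and is left unchanged; both longer and shorter entries remain in the dictionary.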
Embodiment three:
As shown in fig. 7, the embodiment of the present invention also provides a kind of generating means of dictionary, including pretreatment module 10, Dictionary module 20, cutting word module 30 and aggregation module 40.
Pretreatment module 10 is used to pre-process input entry, generates candidate entry.Specifically, pretreatment module 10 Include useless character collection 101, pretreatment module 10 excludes the useless character in input entry, root according to useless character collection 101 Input entry is handled according to business scenario, the repetition entry inputted in entry is then combined with, generates candidate entry.
Dictionary module 20 includes individual character dictionary 201 and Shuangzi dictionary 202, individual character dictionary be stored with some monosyllabic word pieces and Its frequency, Shuangzi dictionary is stored with some two-character word pieces and its frequency.
Dictionary module 20 also includes extracting updating block 203, for being carried from the input entry after useless character is eliminated Monosyllabic word piece is taken, and updates the frequency of monosyllabic word piece.Updating block 203 is extracted to be additionally operable to from eliminating the input after useless character Two-character word piece is extracted in entry, and updates the frequency of two-character word piece.
Also include also including frequency reducing character set 102 in lower frequency unit 204, and pretreatment module 10 in dictionary module 20. Before the frequency for updating monosyllabic word piece, lower frequency unit 204 is used in the word piece during monosyllabic word piece is frequency reducing character set 102, right The monosyllabic word piece enters line frequency reduction processing.
Cutting word module 30 is used for the frequency of character and character combination in dictionary module 20 in candidate entry, it is determined that Entry in initial dictionary.Specifically, cutting word module 30 travels through character a and character b adjacent in candidate entry, and word first Accord with a and character b character combination ab.Then character a, the frequency of character b and character combination ab in dictionary module 20 are obtained, when When the frequency of character a frequency, character b frequency and character combination ab meets cutting word condition, using character combination ab as initial A part for entry or entry in dictionary.
The aggregation module 40 traverses the entries in the initial dictionary and merges multiple entries that satisfy the aggregation condition into one entry. Specifically, the aggregation module 40 first traverses the entries in the initial dictionary. When an entry p is contained in an entry q, it obtains the frequencies of entry p and entry q in the initial dictionary. When the frequency of entry p and the frequency of entry q satisfy the aggregation condition, entry p is replaced with entry q, or entry q is replaced with entry p.
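One possible reading of the aggregation logic is sketched below. Everything specific in it is an assumption: the aggregation condition is modelled as two hypothetical thresholds on the ratio of the frequency of q to the frequency of p (`high` and `low`), and the surviving entry is given the sum of the merged frequencies, neither of which is stated in this passage. Frequencies are assumed to be positive.

```python
def aggregate(initial, high=0.8, low=0.1):
    """Merge entries with an inclusion relation into one entry.

    `initial` maps entry -> positive frequency. Returns a new mapping.
    """
    replace = {}
    entries = list(initial)
    for p in entries:
        for q in entries:
            if p == q or p not in q:
                continue                  # only pairs where p is contained in q
            ratio = initial[q] / initial[p]
            if ratio >= high:
                replace[p] = q            # p almost always occurs inside q
            elif ratio <= low:
                replace[q] = p            # q is rare compared with p
    final = {}
    for entry, freq in initial.items():
        target, seen = entry, {entry}
        # Follow replacement chains, guarding against cycles.
        while target in replace and replace[target] not in seen:
            target = replace[target]
            seen.add(target)
        final[target] = final.get(target, 0) + freq
    return final


final = aggregate({"火锅": 10, "火锅店": 9, "小龙虾": 50, "小龙虾盖饭": 2})
```

If an entry qualifies for replacement by several containing entries, this sketch simply keeps the last one found; a real implementation would need a rule for that case.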
Further, the dictionary generating device provided by this embodiment of the present invention also includes a weighting module 50, which traverses the entries in the final dictionary and, when the frequencies of two entries with an inclusion relation satisfy the weighting condition, takes the sum of the frequencies of the two entries as the frequency of the contained entry.
Specifically, the weighting module 50 first traverses the entries in the final dictionary. When an entry x is contained in an entry y, it obtains the frequencies of entry x and entry y in the final dictionary. When the frequency of entry x and the frequency of entry y satisfy the weighting condition, the sum of the frequencies of entry x and entry y is taken as the frequency of entry x in the final dictionary.
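The weighting step can be illustrated with the short sketch below. The weighting condition is not given in this passage, so it is passed in as a callable, and the default (the containing entry y is no more frequent than x) is only a hypothetical stand-in. The function name `apply_weighting` is likewise an assumption.

```python
def apply_weighting(final, condition=lambda f_x, f_y: f_y <= f_x):
    """Add the frequency of a containing entry y to a contained entry x.

    `final` maps entry -> frequency. The original frequencies are used for
    every comparison, so the result does not depend on iteration order.
    """
    weighted = dict(final)
    for x, f_x in final.items():
        for y, f_y in final.items():
            # x must be a different entry that is contained in y.
            if x != y and x in y and condition(f_x, f_y):
                weighted[x] += f_y
    return weighted


weighted = apply_weighting({"小龙虾": 50, "麻辣小龙虾": 5, "烤鱼": 7})
```

Note that when x is contained in more than one entry, this sketch adds each qualifying frequency in turn, whereas the description only discusses a single pair of entries.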
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above may be found in the corresponding process of the foregoing method embodiments and is not repeated here.
The dictionary generating device provided by this embodiment of the present invention has the same technical features as the dictionary generation method provided by the above embodiments, so it can solve the same technical problem and achieve the same technical effect.
An embodiment of the present invention also provides an electronic device, including a memory, a processor, and a computer program that is stored on the memory and can run on the processor. When the processor executes the computer program, the steps of the dictionary generation method provided by Embodiment 1 or Embodiment 2 above are implemented.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, the steps of the dictionary generation method provided by Embodiment 1 or Embodiment 2 above are performed.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the embodiments described above are only specific embodiments of the present invention, intended to illustrate its technical solution rather than to limit it, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent replacements of some of their technical features. Such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the protection scope of the claims.

Claims (11)

1. A method for generating a dictionary, characterized by comprising:
a word-cutting step of determining the entries of an initial dictionary according to the frequencies, in a dictionary module, of the characters and character combinations in candidate entries, wherein the dictionary module stores a number of single-character word pieces and their frequencies, and a number of two-character word pieces and their frequencies; and
an aggregation step of traversing the entries in the initial dictionary and merging multiple entries that satisfy an aggregation condition into one entry, to generate a final dictionary.
2. The method according to claim 1, characterized in that the word-cutting step specifically comprises:
traversing adjacent characters a and b in the candidate entry, together with the character combination ab of character a and character b;
obtaining the frequencies of character a, character b and character combination ab in the dictionary module; and
when the frequency of character a, the frequency of character b and the frequency of character combination ab satisfy a word-cutting condition, taking the character combination ab as an entry in the initial dictionary or as part of such an entry.
3. The method according to claim 1 or 2, characterized in that the aggregation step specifically comprises:
traversing the entries in the initial dictionary;
when an entry p is contained in an entry q, obtaining the frequencies of entry p and entry q in the initial dictionary; and
when the frequency of entry p and the frequency of entry q satisfy the aggregation condition, replacing entry p with entry q, or replacing entry q with entry p.
4. The method according to claim 1, characterized in that, before the word-cutting step, the method further comprises:
a preprocessing step of preprocessing input entries to generate the candidate entries.
5. The method according to claim 4, characterized in that the preprocessing step specifically comprises:
removing the useless characters from the input entries; and
merging the repeated entries among the input entries to generate the candidate entries.
6. The method according to claim 5, characterized by further comprising:
extracting single-character word pieces from the input entries after the useless characters have been removed;
updating the frequencies of the single-character word pieces in the dictionary module;
extracting two-character word pieces from the input entries after the useless characters have been removed; and
updating the frequencies of the two-character word pieces in the dictionary module.
7. The method according to claim 6, characterized in that, before updating the frequency of the single-character word piece in the dictionary module, the method further comprises:
when the single-character word piece is a word piece in a frequency-reduction character set, performing frequency-reduction processing on the single-character word piece.
8. The method according to claim 1, characterized by further comprising:
a weighting step of traversing the entries in the final dictionary and, when the frequencies of two entries with an inclusion relation satisfy a weighting condition, taking the sum of the frequencies of the two entries as the frequency of the contained entry.
9. A device for generating a dictionary, characterized by comprising:
a word-cutting module, configured to determine the entries of an initial dictionary according to the frequencies, in a dictionary module, of the characters and character combinations in candidate entries, wherein the dictionary module stores a number of single-character word pieces and their frequencies, and a number of two-character word pieces and their frequencies; and
an aggregation module, configured to traverse the entries in the initial dictionary and merge multiple entries that satisfy an aggregation condition into one entry, to generate a final dictionary.
10. An electronic device, including a memory, a processor, and a computer program that is stored on the memory and can run on the processor, characterized in that the steps of the method according to any one of claims 1 to 8 are implemented when the processor executes the computer program.
11. A computer-readable storage medium on which a computer program is stored, characterized in that the steps of the method according to any one of claims 1 to 8 are performed when the computer program is run by a processor.
CN201710171679.4A 2017-03-21 2017-03-21 Generation method, device and the electronic equipment of dictionary Active CN106951410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710171679.4A CN106951410B (en) 2017-03-21 2017-03-21 Generation method, device and the electronic equipment of dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710171679.4A CN106951410B (en) 2017-03-21 2017-03-21 Generation method, device and the electronic equipment of dictionary

Publications (2)

Publication Number Publication Date
CN106951410A true CN106951410A (en) 2017-07-14
CN106951410B CN106951410B (en) 2018-01-05

Family

ID=59472756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710171679.4A Active CN106951410B (en) 2017-03-21 2017-03-21 Generation method, device and the electronic equipment of dictionary

Country Status (1)

Country Link
CN (1) CN106951410B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114130A1 (en) * 2003-11-20 2005-05-26 Nec Laboratories America, Inc. Systems and methods for improving feature ranking using phrasal compensation and acronym detection
CN101079027A (en) * 2007-06-27 2007-11-28 腾讯科技(深圳)有限公司 Chinese character word distinguishing method and system
CN101706807A (en) * 2009-11-27 2010-05-12 清华大学 Method for automatically acquiring new words from Chinese webpages
CN102902757A (en) * 2012-09-25 2013-01-30 姚明东 Automatic generation method of e-commerce dictionary
CN104899190A (en) * 2015-06-04 2015-09-09 百度在线网络技术(北京)有限公司 Generation method and device for word segmentation dictionary and word segmentation processing method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622049A (en) * 2017-09-06 2018-01-23 国家电网公司 A kind of special word stock generating method of electric service
CN108874869A (en) * 2018-04-24 2018-11-23 中国地质大学(武汉) A kind of method for building up of the geological classes dictionary based on data collaborative
CN112395408A (en) * 2020-11-19 2021-02-23 平安科技(深圳)有限公司 Stop word list generation method and device, electronic equipment and storage medium
CN112395408B (en) * 2020-11-19 2023-11-07 平安科技(深圳)有限公司 Stop word list generation method and device, electronic equipment and storage medium
CN113032581A (en) * 2021-04-09 2021-06-25 北京百度网讯科技有限公司 Method and device for updating product list
CN113032581B (en) * 2021-04-09 2024-02-06 北京百度网讯科技有限公司 Method and device for updating product list

Also Published As

Publication number Publication date
CN106951410B (en) 2018-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant