CN104615590A - Project name extraction method and device - Google Patents

Publication number
CN104615590A
Authority
CN
China
Prior art keywords
entity word
webpage
destination item
coding
participle
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510093192.XA
Other languages
Chinese (zh)
Inventor
范莹
于治楼
梁华勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Application filed by Inspur Group Co Ltd
Priority to CN201510093192.XA
Publication of CN104615590A

Abstract

The invention provides a project name extraction method and device. The method comprises: obtaining each segmented word of a target project name, looking up the code of each segmented word in a synonym dictionary, and generating a segmented-word list for the target project name; extracting the entity words of each web page and looking up the code of each entity word in the synonym dictionary; using the entity words to construct, for each web page, the entity word combinations corresponding to the target project name, and generating an entity word combination list for each web page; and using the codes of the segmented words and the codes of the entity word combinations to synonym-match the segmented-word list against the entity words in the combinations, thereby obtaining the synonyms of the target project name. The method can thus recognize synonyms of project names.

Description

Project name extraction method and device
Technical field
The present invention relates to the field of computer applications, and in particular to a method and device for extracting project names.
Background
Big data, also called massive data, can provide information for an enterprise's development or operations. When an enterprise adopts big data, the first task is to connect its internal data with external data, that is, to obtain external Internet data on the basis of the enterprise's internal data. In this process, extracting project names is essential, but the same project name is often worded differently on different websites or in different page templates. For example, among the tender announcements on the China Government Procurement Network, some web pages use "bid inviter" while others use "purchaser". At present, a separate extraction program is mainly written by hand for each project name, and synonyms of project names cannot be recognized.
Summary of the invention
The invention provides a method and device for extracting project names, so as to recognize synonyms of project names.
A project name extraction method, comprising:
obtaining each segmented word of a target project name, looking up the code of each segmented word in a synonym dictionary, and generating a segmented-word list for the target project name, the segmented-word list comprising each segmented word of the target project name and the code corresponding to each segmented word;
extracting each entity word of each web page, and looking up the code of each entity word in the synonym dictionary;
using the entity words to construct, for each web page, the entity word combinations corresponding to the target project name, and generating and saving an entity word combination list for each web page, the list comprising each entity word combination and the code corresponding to each entity word combination;
using the codes of the segmented words of the target project name and the codes of the entity word combinations of each web page to synonym-match the segmented-word list of the target project name against the entity word combination list of each web page, thereby obtaining the synonyms of the target project name.
Preferably, before extracting the entity words of each web page, the method further comprises: classifying the web pages, wherein web pages whose project names to be structured are the same belong to the same class. After extracting the entity words of each web page, the method further comprises: deduplicating the entity words within each class of web pages. Extracting each entity word of each web page and looking up its code in the synonym dictionary then comprises: extracting the entity words of the web pages in a class, deduplicating them, and looking up in the synonym dictionary the codes of the entity words that remain after deduplication.
Preferably, after looking up the code of each entity word in the synonym dictionary, the method further comprises: setting a threshold p according to the target project name, and filtering the entity words of each web page.
Filtering the entity words of each web page comprises: according to the coding rule, filtering out and disregarding those deduplicated entity words whose code ends in "@" or "#"; calculating the text frequency of each remaining entity word of the class according to the text frequency formula; and comparing the text frequency with the threshold p to decide whether to filter out the corresponding entity word.
The text frequency formula is $DF_j = D_{bj} / n_b$, where $DF_j$ is the text frequency of the j-th target entity word, $D_{bj}$ is the number of web pages in class b in which the j-th target entity word appears, and $n_b$ is the number of web pages in class b.
Comparing the text frequency with the threshold p to decide whether to filter out the corresponding entity word comprises: when $DF_j < p$, filtering out the j-th target entity word; otherwise, retaining the j-th target entity word.
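As a worked illustration of the threshold test (the numbers here are hypothetical and do not come from the patent), suppose class b contains $n_b = 20$ web pages, the j-th entity word appears in $D_{bj} = 3$ of them, and the threshold is $p = 0.5$:

```latex
DF_j = \frac{D_{bj}}{n_b} = \frac{3}{20} = 0.15 < p = 0.5
```

Under these assumed values the j-th entity word would be filtered out, whereas a word appearing in 18 of the 20 pages would have $DF_j = 0.9 \ge p$ and would be retained.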
Preferably, extracting the entity words of each web page further comprises: generating a sequence number for each entity word of each web page according to the order in which the entity words appear in that page, thereby forming an entity word set for each web page, the set comprising the entity words of the web page and the sequence number of each entity word.
Preferably, constructing the entity word combinations of each web page corresponding to the target project name comprises: counting the number of segmented words in the target project name; and, according to that number and the continuity of the sequence numbers of the entity words of each web page, finding consecutive entity words that can be combined and forming them into an entity word combination, wherein the number of entity words in the combination equals the number of segmented words in the target project name, and the sequence numbers of the entity words in the combination are consecutive.
Preferably, using the codes of the segmented words of the target project name and the codes of the entity word combinations of each web page to synonym-match the segmented-word list against the entity word combination list comprises: aligning the segmented words of the target project name, in order, one-to-one with the entity words of an entity word combination in the list of each web page; comparing whether the code of the current segmented word is the same as the code of the current entity word in the current combination; if so, going on to compare the code of the next segmented word with the code of the next entity word in the current combination; otherwise, ending the comparison between the target project name and the current combination and proceeding to compare the target project name with the next entity word combination; and, when the code of every segmented word of the target project name is the same as the code of the corresponding entity word of a combination in the list, taking that combination as a synonym of the target project name.
Preferably, after obtaining the synonyms of the target project name, the method further comprises: integrating the target project name with its synonyms to generate a word list for the target project name, the word list comprising the target project name and the synonyms of the target project name.
A project name extraction device, comprising:
an acquiring unit, for obtaining each segmented word of a target project name;
a first lookup unit, for looking up in a synonym dictionary the code of each segmented word provided by the acquiring unit, and triggering a first generation unit;
the first generation unit, for receiving the trigger from the first lookup unit and generating a segmented-word list for the target project name, the segmented-word list comprising each segmented word of the target project name and the code corresponding to each segmented word;
an extraction unit, for extracting each entity word of each web page;
a second lookup unit, for looking up the code of each entity word in the synonym dictionary;
a construction unit, for using the entity words to construct, for each web page, the entity word combinations corresponding to the target project name;
a second generation unit, for generating an entity word combination list for each web page from the entity word combinations built by the construction unit, the list comprising each entity word combination and the code corresponding to each entity word combination;
a matching unit, for using the codes of the segmented words of the target project name and the codes of the entity word combinations of each web page to synonym-match the segmented-word list generated by the first generation unit against the entity word combination list of each web page generated by the second generation unit, thereby obtaining the synonyms of the target project name.
Preferably, the project name extraction device further comprises a classification unit, a deduplication unit, and a filter unit, wherein:
the classification unit is for classifying the web pages, wherein web pages whose project names to be structured are the same belong to the same class, and for triggering the extraction unit;
the extraction unit is further for extracting the entity words of the web pages in a class and triggering the deduplication unit;
the deduplication unit is for receiving the trigger from the extraction unit, deduplicating the entity words of the web pages in the class, and triggering the second lookup unit;
the second lookup unit is further for looking up in the synonym dictionary the codes of the deduplicated entity words provided by the deduplication unit;
the filter unit is for: filtering out, according to the coding rule, those deduplicated entity words whose code, as found by the second lookup unit, ends in "@" or "#"; setting a threshold p according to the target project name; calculating, according to the text frequency formula, the text frequency of each entity word of the web pages in the class extracted by the extraction unit; and comparing the text frequency with the threshold p to decide whether to filter out the corresponding entity word, such that when $DF_j < p$ the j-th target entity word is filtered out, and otherwise it is retained.
The text frequency formula is $DF_j = D_{bj} / n_b$, where $DF_j$ is the text frequency of the j-th target entity word, $D_{bj}$ is the number of web pages in class b in which the j-th target entity word appears, and $n_b$ is the number of web pages in class b.
Preferably, the project name extraction device further comprises a counting unit, wherein:
the extraction unit is further for generating a sequence number for each entity word of each web page according to the order in which the entity words appear in that page, thereby forming an entity word set for each web page, the set comprising the entity words of the web page and the sequence number of each entity word;
the counting unit is for counting the number of segmented words in the target project name;
the construction unit is further for finding, according to the number of segmented words counted by the counting unit and the continuity of the sequence numbers of the entity words of each web page provided by the extraction unit, consecutive entity words that can be combined, and forming them into an entity word combination, wherein the number of entity words in the combination equals the number of segmented words in the target project name, and the sequence numbers of the entity words in the combination are consecutive.
Preferably, the matching unit is further for: aligning the segmented words of the target project name in the first generation unit, in order, one-to-one with the entity words of an entity word combination of each web page in the second generation unit; comparing whether the code of the current segmented word is the same as the code of the current entity word in the current combination; if so, going on to compare the code of the next segmented word with the code of the next entity word in the current combination; otherwise, ending the comparison between the target project name and the current combination and proceeding to compare the target project name with the next entity word combination; and, when the code of every segmented word of the target project name is the same as the code of the corresponding entity word of a combination in the list, taking that combination as a synonym of the target project name.
Preferably, the project name extraction device further comprises a third generation unit,
the third generation unit being for integrating the target project name with the synonyms of the target project name provided by the matching unit to generate a word list for the target project name, the word list comprising the target project name and the synonyms of the target project name.
Embodiments of the invention provide a method and device for extracting project names. The segmented words of a target project name are obtained, the code of each segmented word is looked up in a synonym dictionary, and a segmented-word list for the target project name is generated. In parallel, the entity words of each web page are extracted, the code of each entity word is looked up in the synonym dictionary, and the entity words are used to build entity word combinations corresponding to the target project name. Because synonyms share the same code in the synonym dictionary, the embodiments then use the codes of the segmented words and the codes of the entity word combinations of each web page to synonym-match the target project name against the entity word combinations, and thereby recognize synonyms of the project name.
In addition, the segmented words of the target project name are obtained manually in the embodiments. This effectively prevents compound words from appearing among the segmented words, so that the segmented words have the same granularity as the entries of the synonym dictionary and each segmented word can be found, with its code, in the dictionary.
Furthermore, the web pages are classified so that every page in a class has the same project names to be structured and the same matching process, which allows all pages of a class to be matched together. By calculating the text frequency of entity words within a class and comparing it with the threshold p, the entity words close to the segmented words of the target project name can be located more accurately and superfluous entity words removed, which improves the efficiency of obtaining synonyms of the target project name.
Brief description of the drawings
Fig. 1 is a flow chart of the project name extraction method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the project name extraction method provided by another embodiment of the present invention;
Fig. 3 is an architecture diagram of the equipment in which the project name extraction device provided by an embodiment of the present invention resides;
Fig. 4 is a schematic structural diagram of the project name extraction device provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the project name extraction device provided by another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the project name extraction device provided by a further embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the project name extraction device provided by yet another embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the invention.
As shown in Fig. 1, an embodiment of the invention provides a project name extraction method, which may comprise the following steps:
Step 101: obtain each segmented word of the target project name, look up the code of each segmented word in a synonym dictionary, and generate a segmented-word list for the target project name, the list comprising each segmented word of the target project name and the code corresponding to each segmented word;
Step 102: extract each entity word of each web page, and look up the code of each entity word in the synonym dictionary;
Step 103: use the entity words to construct, for each web page, the entity word combinations corresponding to the target project name, and generate and save an entity word combination list for each web page, the list comprising each entity word combination and the code corresponding to each entity word combination;
Step 104: use the codes of the segmented words of the target project name and the codes of the entity word combinations of each web page to synonym-match the segmented-word list of the target project name against the entity word combination list of each web page, thereby obtaining the synonyms of the target project name.
To improve the accuracy of the synonyms obtained for the target project name, before step 102 the embodiment further classifies the web pages, wherein web pages whose project names to be structured are the same belong to the same class. Extracting the entity words of each web page and looking up their codes in the synonym dictionary is then implemented as follows: the entity words of the web pages in a class are extracted and deduplicated, and the codes of the entity words remaining after deduplication are looked up in the synonym dictionary.
Before step 103, the embodiment further sets a threshold p according to the target project name and uses it to filter the entity words of each web page. This filters out many entity words that differ greatly from the segmented words. It is implemented as follows: according to the coding rule, those deduplicated entity words whose code ends in "@" or "#" are filtered out and disregarded; the text frequency of each remaining entity word of the class is calculated according to the text frequency formula; and the text frequency is compared with the threshold p to decide whether to filter out the corresponding entity word.
To locate entity words accurately within each web page, a sequence number is generated for each entity word as it is extracted, according to the order in which the entity words appear in the page. This forms an entity word set for each web page, comprising the entity words of the page and the sequence number of each entity word. Step 103 is then implemented as follows: the number of segmented words in the target project name is counted; according to that number and the continuity of the sequence numbers of the entity words of each web page, consecutive entity words that can be combined are found and formed into an entity word combination, wherein the number of entity words in the combination equals the number of segmented words in the target project name, and the sequence numbers of the entity words in the combination are consecutive.
To obtain the synonyms of the target project name more accurately, step 104 is implemented as follows: the segmented words of the target project name are aligned, in order, one-to-one with the entity words of an entity word combination in the list of each web page; the code of the current segmented word is compared with the code of the current entity word in the current combination; if they are the same, the code of the next segmented word is compared with the code of the next entity word in the current combination; otherwise, the comparison between the target project name and the current combination ends and the target project name is compared with the next entity word combination. When the code of every segmented word of the target project name is the same as the code of the corresponding entity word of a combination in the list, that combination is taken as a synonym of the target project name.
As shown in Fig. 2, an embodiment of the invention provides a project name extraction method, which may comprise the following steps:
Step 201: obtain each segmented word of the target project name, look up the code of each segmented word in a synonym dictionary, and generate a segmented-word list for the target project name;
The embodiment segments each project name manually, for two reasons. First, a project name may be a compound word whose granularity is too coarse to agree with the granularity of the synonym dictionary, in which case no code would be found for the segmented word. Second, the number of project names is never very large, so manual segmentation is entirely feasible.
For example, suppose the embodiment is to extract k project names. The segmentation results for these k project names can be written as $ItemNameList = \{inl_1, inl_2, \ldots, inl_k\}$, where $inl_1, inl_2, \ldots, inl_k$ represent target project names 1, 2, ..., k respectively. $inl_1 = \{inl_{11}, inl_{12}, \ldots, inl_{1x}\}$ means that the 1st segmented word of target project name 1 is $inl_{11}$, the 2nd is $inl_{12}$, ..., and the x-th is $inl_{1x}$. The codes of the segmented words of target project name 1 are written as $inlc_1 = \{inlc_{11}, inlc_{12}, \ldots, inlc_{1x}\}$, that is, $inlc_{11}, inlc_{12}, \ldots, inlc_{1x}$ are the codes obtained from the synonym dictionary for the 1st, 2nd, ..., x-th segmented words of project name 1. The segmented-word list of the target project name comprises each segmented word of the target project name and the code corresponding to each segmented word.
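The segmented-word list described above can be sketched as a simple pairing of words and codes. This is a minimal illustration only: the function name `build_segment_list`, the English placeholder words, and the dictionary codes are all hypothetical, and the patent does not prescribe any particular data structure.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the synonym dictionary: word -> code.
# These codes are invented for illustration and are not real dictionary entries.
SYNONYM_DICT: Dict[str, str] = {
    "bid": "Aa01A01=",
    "inviter": "Bb02B02=",
}


def build_segment_list(
    segments: List[str], synonym_dict: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each manually segmented word with its code (None if no entry exists)."""
    return [(word, synonym_dict.get(word)) for word in segments]


# inl_1: the manually segmented words of target project name 1 (placeholders).
inl_1 = ["bid", "inviter"]
# Each entry is (inl_1x, inlc_1x): a segmented word together with its code.
segment_list_1 = build_segment_list(inl_1, SYNONYM_DICT)
```

Returning `None` for a missing word is a design choice of this sketch; it makes it visible when a segmented word has no dictionary entry, which is the situation manual segmentation is meant to avoid.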
Have " knowing net ", " Chinese thesaurus " " Chinese concept dictionary " etc. at Chinese synonym dictionary, the embodiment of the present invention chooses Chinese thesaurus as the synonym dictionary obtaining participle and entity word coding.Chinese thesaurus is organized into all entries of including together according to tree-shaped hierarchical structure, and vocabulary is divided into large, medium and small 3 classes, and large class has 12, and middle class has 97, and group has 1400.Have a lot of words in each group, these words divide into several clumps (paragraph) according to the distance of the meaning of a word and correlativity again.Word in each paragraph divide into several row again further, and with word or the meaning of a word identical (meaning of a word had is very close) of a line, or the meaning of a word has very strong correlativity.Chinese thesaurus provides 5 layers of coding altogether, as shown in the table:
1st grade represents with capitalization English letter; 2nd grade represents with small English alphabet; 3rd level two decimal integers represent; 4th grade represents with capitalization English letter; 5th grade represents by two decimal integers.8th be marked with 3 kinds, be "=", " # ", "@" respectively."=" representative " equal ", " synonym "; " # " representative " not etc. ", " similar ", belong to correlation word; "@" representative " self-isolation ", " independence ", it did not both have synonym in dictionary, did not have related term yet.In synonym dictionary, synonym has identical coding, such as: fisherman, elderly fishman, fisherman's family, fisherman, old fisherman or the military Ae07C01 coded representation of the synonyms such as youth of fishing.
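The five-level layout described above can be checked mechanically. The following is a minimal sketch, assuming that a full code is the seven level characters followed by the mark in the 8th position (for example `Ae07C01=`); the names `ThesaurusCode`, `parse_code`, and `has_synonyms` are hypothetical and not part of the patent or of any dictionary distribution.

```python
import re
from typing import NamedTuple


class ThesaurusCode(NamedTuple):
    level1: str  # uppercase letter
    level2: str  # lowercase letter
    level3: str  # two decimal digits
    level4: str  # uppercase letter
    level5: str  # two decimal digits
    mark: str    # "=", "#", or "@" in the 8th position


# One capture group per level, plus the trailing mark.
_CODE_RE = re.compile(r"([A-Z])([a-z])(\d{2})([A-Z])(\d{2})([=#@])")


def parse_code(code: str) -> ThesaurusCode:
    """Split an 8-character code into its five levels and its mark."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid 8-character code: {code!r}")
    return ThesaurusCode(*match.groups())


def has_synonyms(code: str) -> bool:
    """True only for '=' codes; '#' (related) and '@' (independent) are excluded."""
    return parse_code(code).mark == "="
```

For instance, `parse_code("Ae07C01=")` splits the "fisherman" code from the text into its levels, with the `=` mark appended here as an assumption.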
Step 202: classify the web pages, wherein web pages whose project names to be structured are the same belong to the same class;
Data collected from the Internet usually resides in different web page templates, and pages built from the same template generally use the same entity words. In the embodiment, the web pages are divided into classes according to template; the project names to be structured should be the same for every page in a class, and the project name matching process is the same within a class. For example, if the extraction process involves N web pages in total, they can be divided into a classes, with the number of web pages in each class being $\{n_1, n_2, \ldots, n_a\}$. Within a class, a word synonymous with a project name should then appear in every web page of the class, or at least in most of them.
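Grouping pages by template can be sketched as below. The patent does not say how a page's template is identified, so this sketch simply assumes each page already carries a template identifier; `classify_pages`, the page ids, and the template ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify_pages(pages: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (page_id, template_id) pairs into classes keyed by template_id."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for page_id, template_id in pages:
        classes[template_id].append(page_id)
    return dict(classes)


# N = 3 pages spread over a = 2 templates (all identifiers are placeholders).
pages = [("p1", "tplA"), ("p2", "tplB"), ("p3", "tplA")]
classes = classify_pages(pages)
# n_b for each class b: the number of pages in that class.
class_sizes = {template: len(ids) for template, ids in classes.items()}
```

The class sizes computed here play the role of $n_1, n_2, \ldots, n_a$ and sum to N.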
Step 203: extract the entity words of the web pages in a class, deduplicate the entity words, and look up the code of each entity word in the synonym dictionary;
The embodiment extracts the entity words of the web pages in a class and, according to the order in which the entity words appear in each page, generates a sequence number for each entity word, forming an entity word set for each web page that comprises the entity words of the page and the sequence number of each entity word. For example, the entity words of the i-th web page of class b can be written as $T_{bi} = \{\langle id_{bi1}, term_{bi1}\rangle, \langle id_{bi2}, term_{bi2}\rangle, \ldots, \langle id_{bim}, term_{bim}\rangle\}$, where id is the sequence number of the entity word within the word segmentation of the whole page text, that is, its position among the words of the text, and term is the entity word itself. The entity words of the class are then deduplicated. Pages of the same class contain a large number of identical entity words, so deduplication avoids looking up the same entity word repeatedly in the Chinese thesaurus and effectively speeds up the code lookup for the class.
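The numbering and deduplication described above can be sketched as follows. How entity words are recognized is not specified in the text, so this sketch assumes a given vocabulary `entity_vocab`; that parameter, the function names, and the sample tokens are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# One page's entity word set T_bi: [(id, term), ...].
EntityWordSet = List[Tuple[int, str]]


def entity_word_set(tokens: List[str], entity_vocab: Set[str]) -> EntityWordSet:
    """Keep entity words, numbering each by its 1-based position in the page text."""
    return [(i, tok) for i, tok in enumerate(tokens, start=1) if tok in entity_vocab]


def deduplicate(class_pages: Iterable[EntityWordSet]) -> Set[str]:
    """Collect each distinct entity word of a class once."""
    return {term for page in class_pages for _, term in page}


def lookup_codes(terms: Set[str], synonym_dict: Dict[str, str]) -> Dict[str, str]:
    """Look up each distinct term once; terms without an entry are skipped."""
    return {term: synonym_dict[term] for term in terms if term in synonym_dict}


vocab = {"purchaser", "address", "phone"}
page_1 = entity_word_set(["purchaser", "address", ":", "Beijing"], vocab)
page_2 = entity_word_set(["purchaser", ":", "phone"], vocab)
unique_terms = deduplicate([page_1, page_2])
```

In this example "purchaser" occurs on both pages but is looked up only once, which is the saving the text attributes to deduplication.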
Step 204: set a threshold p according to the target project name, and filter the entity words of each web page;
According to the coding rule, those deduplicated entity words whose code ends in "@" or "#" are filtered out and disregarded. The text frequency of each remaining entity word of the class is calculated according to the text frequency formula, and the text frequency is compared with the threshold p to decide whether to filter out the corresponding entity word;
The text frequency formula is $DF_j = D_{bj} / n_b$, where $DF_j$ is the text frequency of the j-th target entity word, $D_{bj}$ is the number of web pages in class b in which the j-th target entity word appears, and $n_b$ is the number of web pages in class b;
Comparing the text frequency with the threshold p to decide whether to filter out the corresponding entity word comprises: when $DF_j < p$, filtering out the j-th target entity word; otherwise, retaining the j-th target entity word. For example, in the embodiment the entity words of the i-th web page after the above process are written as $TC_i = \{\langle id_{i1}, term_{i1}, tc_{i1}\rangle, \langle id_{i2}, term_{i2}, tc_{i2}\rangle, \ldots, \langle id_{im}, term_{im}, tc_{im}\rangle\}$, where $tc_{im}$ is the code of the m-th entity word in the i-th web page.
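The two-stage filter of step 204 can be sketched as below, assuming each page of a class is represented as the set of entity words it contains. The function name `filter_entity_words`, the sample words, and the codes are hypothetical.

```python
from typing import Dict, List, Set


def filter_entity_words(
    pages: List[Set[str]], codes: Dict[str, str], p: float
) -> Set[str]:
    """Drop words whose code ends in '@' or '#', then drop words with DF_j < p."""
    n_b = len(pages)  # number of pages in class b
    if n_b == 0:
        return set()
    kept: Set[str] = set()
    for word, code in codes.items():
        if code[-1] in "@#":
            continue  # coding rule: not a true synonym entry
        d_bj = sum(1 for page in pages if word in page)  # pages containing the word
        if d_bj / n_b >= p:  # DF_j = D_bj / n_b; keep unless DF_j < p
            kept.add(word)
    return kept


pages_b = [
    {"purchaser", "address", "misc"},
    {"purchaser", "phone", "misc"},
    {"purchaser", "address", "misc"},
    {"purchaser", "misc"},
]
codes_b = {
    "purchaser": "Aa01A01=",
    "address": "Bb02B02=",
    "phone": "Cc03C03=",
    "misc": "Dd04D04@",
}
kept_words = filter_entity_words(pages_b, codes_b, p=0.5)
```

With these invented inputs, "misc" is removed by the coding rule even though it appears on every page, and "phone" is removed because it appears on only one of the four pages.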
Step 205: Using the filtered entity words, construct the entity word combinations of each web page that correspond to the target project name, then generate and save the entity word combination list of each web page.
The number of participles in the target project name is counted. Based on that number and on the continuity of the sequence numbers of the entity words in each web page, consecutive entity words that can be combined are located and joined into entity word combinations. The number of entity words in each combination equals the number of participles in the target project name, and the sequence numbers of the entity words in a combination are consecutive.
For example, a target project name g consists of x participles and is written $inl_g = \{inl_{g1}, inl_{g2}, \ldots, inl_{gx}\}$. To compute the similarity between the entity words of the i-th web page and the project name, x consecutive entity words that can be combined must first be found. Suppose x = 2. For the entity word with sequence number 1 in the i-th web page, if a word with sequence number 2 is also present in that web page, the two words form an entity word combination. For instance, if the i-th web page contains the entity word <1, "bid inviter"> and also <2, "address">, they form the entity word combination "bid inviter address". If, for <1, "bid inviter">, none of the following entity words has sequence number 2, the word does not satisfy the matching condition.
All qualifying entity word combinations are saved into the entity word combination list. The list of each web page contains each entity word combination and the code corresponding to it, and can be expressed as $TCL_i = \{tcl_{i1}, tcl_{i2}, \ldots, tcl_{ix}\}$ with $tcl_{i1} = \{\langle term_{i11}, tc_{i11}\rangle, \ldots, \langle term_{i1y}, tc_{i1y}\rangle\}$, where $TCL_i$ is the entity word combination list of the i-th web page and $tcl_{i1}$ is the 1st entity word combination in the i-th web page, which consists of the entity word combination name $term_{i11}$ and the entity word combination code $tc_{i11}$.
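The contiguity rule of step 205 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_combinations`, the tuple layout `(sequence number, word, code)`, and the codes `"C1"`, `"C2"`, `"C3"` are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (sequence number, entity word, code from the synonym lexicon).
EntityWord = Tuple[int, str, str]
# One combination: the (word, code) pairs of x consecutive entity words.
Combination = List[Tuple[str, str]]


def build_combinations(entity_words: List[EntityWord], x: int) -> List[Combination]:
    """Return every run of x entity words whose sequence numbers are consecutive."""
    if x <= 0:
        return []
    by_seq: Dict[int, Tuple[str, str]] = {
        seq: (word, code) for seq, word, code in entity_words
    }
    combinations: List[Combination] = []
    for start in sorted(by_seq):
        # A run is valid only if every sequence number start..start+x-1 is present.
        if all(start + k in by_seq for k in range(x)):
            combinations.append([by_seq[start + k] for k in range(x)])
    return combinations


# Sequence number 3 is absent (for example, filtered out), so only 1-2 is contiguous.
page = [(1, "bid inviter", "C1"), (2, "address", "C2"), (4, "contact", "C3")]
combos = build_combinations(page, 2)
print(combos)
```

Indexing the entity words by sequence number keeps the check for a missing neighbour to a single dictionary lookup, which mirrors the rule that a word with no successor at the next sequence number does not satisfy the matching condition.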
Step 206: Using the codes of the participles of the target project name and the codes of the entity word combinations of each web page, perform synonym matching between the participle list of the target project name and the entity word combination list of each web page to obtain the synonyms corresponding to the target project name.
To determine whether an entity word combination is a synonym of the target project name, the embodiment of the present invention pairs the participles of the target project name, in order, one-to-one with the entity words of an entity word combination in the list of each web page, and checks whether the code of the current participle is identical to the code of the current entity word in the current combination. If so, the code of the next participle is compared with the code of the next entity word in the current combination. Otherwise, the comparison between the target project name and the current combination ends, and the target project name is compared with the next entity word combination. When the code of every participle of the target project name is identical to the code of the corresponding entity word of a combination in the list, that combination is obtained as a synonym of the target project name. For example, project name 1 contains 2 participles, $inl_1 = \{inl_{11}, inl_{12}\}$, and is compared with the 1st entity word combination of the i-th web page, $tcl_{i1} = \{\langle term_{i11}, tc_{i11}\rangle, \langle term_{i12}, tc_{i12}\rangle\}$. The embodiment first checks whether the 1st participle $inl_{11}$ and the 1st entity word $term_{i11}$ have the same code; if they do, it then compares the codes of the 2nd participle $inl_{12}$ and the 2nd entity word $term_{i12}$. If the codes at every position are identical, $inl_1$ and $tcl_{i1}$ are regarded as synonyms.
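The position-by-position code comparison of step 206 might look like the following Python sketch. The names `is_synonym` and `find_synonyms`, the `(word, code)` pair layout, the example words, and the codes `"C1"`, `"C2"`, `"C9"` are assumptions introduced for illustration; the patent text does not prescribe a data structure.

```python
from typing import List, Tuple

# Hypothetical layout: a combination is a list of (entity word, code) pairs.
Combination = List[Tuple[str, str]]


def is_synonym(participle_codes: List[str], combination: Combination) -> bool:
    """Compare codes in order; stop at the first position that differs."""
    if len(participle_codes) != len(combination):
        return False
    for participle_code, (_, entity_code) in zip(participle_codes, combination):
        if participle_code != entity_code:
            return False  # end this comparison and move on to the next combination
    return True


def find_synonyms(
    participle_codes: List[str], combinations: List[Combination], sep: str = " "
) -> List[str]:
    """Return the joined text of every combination whose codes all match."""
    return [
        sep.join(word for word, _ in combination)
        for combination in combinations
        if is_synonym(participle_codes, combination)
    ]


# Codes of the two participles of a project name, then two candidate combinations.
name_codes = ["C1", "C2"]
candidates = [
    [("tenderer", "C1"), ("location", "C2")],  # both codes match
    [("tenderer", "C1"), ("fax", "C9")],       # second code differs
]
synonyms = find_synonyms(name_codes, candidates)
print(synonyms)
```

Returning early on the first mismatch corresponds to ending the comparison with the current combination and proceeding to the next one.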
Step 207: Integrate the target project name with the synonyms corresponding to it to generate the correspondence list for the target project name.
The correspondence list contains the target project name and the synonyms corresponding to it, i.e. $MatchResult = \{\langle in_1, (rm_{11}, rm_{12}, \ldots)\rangle, \langle in_2, (rm_{21}, rm_{22}, \ldots)\rangle, \ldots, \langle in_k, (rm_{k1}, rm_{k2}, \ldots)\rangle\}$, where $in_i$ is a project name and $rm_{ij}$ is the j-th synonym found for the i-th project name.
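As a rough illustration of the MatchResult structure, the sketch below pairs each project name with the tuple of synonyms found for it. The function name `build_match_result`, the dictionary input, and the sample strings are hypothetical; only the shape of the result (a name followed by its synonyms) comes from the text above.

```python
from typing import Dict, List, Tuple


def build_match_result(
    project_names: List[str], found: Dict[str, List[str]]
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Pair every project name with its synonyms, dropping repeats but keeping order."""
    result = []
    for name in project_names:
        unique = tuple(dict.fromkeys(found.get(name, [])))
        result.append((name, unique))
    return result


names = ["bid inviter address", "project budget"]
found = {"bid inviter address": ["tenderer location", "tenderer location"]}
match_result = build_match_result(names, found)
print(match_result)
```

A project name with no synonym found still appears in the result, with an empty tuple, so the list always has one entry per project name.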
As shown in Figure 3 and Figure 4, an embodiment of the present invention provides a device for extracting project names. The device embodiment may be implemented in software, in hardware, or in a combination of the two. At the hardware level, Figure 3 is a hardware structure diagram of the equipment in which the device of the embodiment resides; besides the CPU, memory, network interface, and non-volatile storage shown in Figure 3, the equipment usually also includes other hardware, such as a chip responsible for processing project names. Taking a software implementation as an example, as shown in Figure 4, the device in the logical sense is formed when the CPU of the equipment reads the corresponding computer program instructions from non-volatile storage into memory and runs them. The project name extraction device 40 provided by this embodiment comprises:
An acquiring unit 401, configured to obtain each participle of the target project name;
A first lookup unit 402, configured to look up, in the synonym lexicon, the code of each participle provided by the acquiring unit, and to trigger the first generation unit;
A first generation unit 403, configured to receive the trigger from the first lookup unit and generate the participle list of the target project name, the participle list containing each participle of the target project name and the code corresponding to each participle;
An extraction unit 404, configured to extract each entity word of each web page;
A second lookup unit 405, configured to look up the code of each entity word in the synonym lexicon;
A construction unit 406, configured to use the entity words to construct the entity word combinations of each web page that correspond to the target project name;
A second generation unit 407, configured to generate the entity word combination list of each web page from the entity word combinations built by the construction unit, the list containing each entity word combination and the code corresponding to each combination;
A matching unit 408, configured to use the codes of the participles of the target project name and the codes of the entity word combinations of each web page to perform synonym matching between the participle list generated by the first generation unit and the entity word combination lists generated by the second generation unit, and thereby obtain the synonyms corresponding to the target project name.
In another embodiment, as shown in Figure 5, the project name extraction device may further comprise a classification unit 501, a deduplication unit 502, and a filter unit 503, wherein:
The classification unit 501 is configured to classify the web pages, where web pages for which the same project names are to be constructed belong to the same class, and to trigger the extraction unit;
Preferably, the extraction unit 404 is further configured to extract the entity words in web pages of the same class and to trigger the deduplication unit;
The deduplication unit 502 is configured to receive the trigger from the extraction unit, deduplicate the entity words of the same-class web pages, and trigger the second lookup unit;
Preferably, the second lookup unit is further configured to look up, in the synonym lexicon, the codes of the deduplicated entity words of the same-class web pages provided by the deduplication unit;
The filter unit 503 is configured to: filter out, according to the coding rule, those deduplicated entity words found by the second lookup unit whose code ends with "@" or "#"; set a threshold p according to the target project name; compute, with the text frequency formula, the text frequency of each entity word in the same-class web pages extracted by the extraction unit; and compare the text frequency with the threshold p to decide whether to filter out the corresponding entity word. When $DF_j < p$, the j-th target entity word is filtered out; otherwise, it is retained;
The text frequency formula is $DF_j = D_{bj} / n_b$, where $DF_j$ is the text frequency of the j-th target entity word, $D_{bj}$ is the number of web pages in class b in which the j-th target entity word appears, and $n_b$ is the number of web pages in class b.
In another embodiment of the present invention, as shown in Figure 6, the project name extraction device may further comprise a statistics unit 601, wherein:
Preferably, the extraction unit is further configured to generate sequence numbers for the entity words of each web page according to the order in which they appear in the page, forming the entity word set of each web page, which contains the entity words of the web page and the sequence number corresponding to each entity word;
The statistics unit 601 is configured to count the participles of the target project name;
Preferably, the construction unit is further configured to locate, according to the participle count produced by the statistics unit and the continuity of the sequence numbers of the entity words of each web page, consecutive entity words that can be combined, and to join them into entity word combinations, where the number of entity words in each combination equals the number of participles in the target project name and the sequence numbers of the entity words in a combination are consecutive.
In another embodiment of the present invention, preferably, the matching unit is further configured to: pair the participles of the target project name from the first generation unit, in order, one-to-one with the entity words of an entity word combination of each web page from the second generation unit, and check whether the code of the current participle is identical to the code of the current entity word in the current combination; if so, compare the code of the next participle with the code of the next entity word in the current combination; otherwise, end the comparison between the target project name and the current combination and compare the target project name with the next entity word combination; and, when the code of every participle of the target project name is identical to the code of the corresponding entity word of a combination in the entity word combination list, obtain that combination as a synonym of the target project name.
In another embodiment of the present invention, as shown in Figure 7, the project name extraction device may further comprise:
A third generation unit 701, configured to integrate the target project name with the synonyms provided by the matching unit and generate the correspondence list for the target project name, the correspondence list containing the target project name and the synonyms corresponding to it.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants of them are intended to be non-exclusive, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for extracting a project name, characterized by comprising:
Obtaining each participle of a target project name, looking up the code of each participle in a synonym lexicon, and generating a participle list of the target project name, the participle list containing each participle of the target project name and the code corresponding to each participle;
Extracting each entity word of each web page, and looking up the code of each entity word in the synonym lexicon;
Using the entity words, constructing the entity word combinations of each web page that correspond to the target project name, and generating and saving an entity word combination list of each web page, the list containing each entity word combination and the code corresponding to each combination;
Using the codes of the participles of the target project name and the codes of the entity word combinations of each web page, performing synonym matching between the participle list of the target project name and the entity word combination list of each web page to obtain the synonyms corresponding to the target project name.
2. The method according to claim 1, characterized in that:
Before extracting each entity word of each web page, the method further comprises: classifying the web pages, where web pages for which the same project names are to be constructed belong to the same class;
After extracting each entity word of each web page, the method further comprises: deduplicating the entity words in web pages of the same class;
Extracting each entity word of each web page and looking up the code of each entity word in the synonym lexicon comprises: extracting the entity words in web pages of the same class, deduplicating those entity words, and looking up the codes of the deduplicated entity words in the synonym lexicon;
After looking up the code of each entity word in the synonym lexicon, the method further comprises: setting a threshold p according to the target project name, and filtering the entity words in each web page;
Filtering the entity words in each web page comprises: according to the coding rule, filtering out and disregarding those deduplicated entity words whose code ends with "@" or "#"; computing, with the text frequency formula, the text frequency of the remaining entity words of the same-class web pages; and comparing the text frequency with the threshold p to decide whether to filter out the corresponding entity word;
The text frequency formula is $DF_j = D_{bj} / n_b$, where $DF_j$ is the text frequency of the j-th target entity word, $D_{bj}$ is the number of web pages in class b in which the j-th target entity word appears, and $n_b$ is the number of web pages in class b;
Comparing the text frequency with the threshold p to decide whether to filter out the corresponding entity word comprises: when $DF_j < p$, filtering out the j-th target entity word; otherwise, retaining the j-th target entity word.
3. The method according to claim 1, characterized in that:
Extracting the entity words of each web page further comprises: generating sequence numbers for the entity words of each web page according to the order in which they appear in the page, to form the entity word set of each web page, the entity word set containing the entity words of the web page and the sequence number corresponding to each entity word;
Constructing the entity word combinations of each web page that correspond to the target project name comprises: counting the participles of the target project name, and, according to that count and the continuity of the sequence numbers of the entity words of each web page, locating consecutive entity words that can be combined and joining them into entity word combinations, where the number of entity words in each combination equals the number of participles in the target project name and the sequence numbers of the entity words in a combination are consecutive.
4. The method according to claim 1, characterized in that using the codes of the participles of the target project name and the codes of the entity word combinations of each web page to perform synonym matching between the participle list of the target project name and the entity word combination list comprises:
Pairing the participles of the target project name, in order, one-to-one with the entity words of an entity word combination in the list of each web page, and checking whether the code of the current participle is identical to the code of the current entity word in the current combination; if so, comparing the code of the next participle with the code of the next entity word in the current combination; otherwise, ending the comparison between the target project name and the current combination and comparing the target project name with the next entity word combination; and, when the code of every participle of the target project name is identical to the code of the corresponding entity word of a combination in the entity word combination list, obtaining that combination as a synonym of the target project name.
5. The method according to claim 1, characterized in that, after obtaining the synonyms corresponding to the target project name, the method further comprises:
Integrating the target project name with the synonyms corresponding to it to generate a correspondence list for the target project name, the correspondence list containing the target project name and the synonyms corresponding to it.
6. A device for extracting a project name, characterized by comprising:
An acquiring unit, configured to obtain each participle of a target project name;
A first lookup unit, configured to look up, in a synonym lexicon, the code of each participle provided by the acquiring unit, and to trigger a first generation unit;
The first generation unit, configured to receive the trigger from the first lookup unit and generate a participle list of the target project name, the participle list containing each participle of the target project name and the code corresponding to each participle;
An extraction unit, configured to extract each entity word of each web page;
A second lookup unit, configured to look up the code of each entity word in the synonym lexicon;
A construction unit, configured to use the entity words to construct the entity word combinations of each web page that correspond to the target project name;
A second generation unit, configured to generate an entity word combination list of each web page from the entity word combinations built by the construction unit, the list containing each entity word combination and the code corresponding to each combination;
A matching unit, configured to use the codes of the participles of the target project name and the codes of the entity word combinations of each web page to perform synonym matching between the participle list generated by the first generation unit and the entity word combination lists generated by the second generation unit, and thereby obtain the synonyms corresponding to the target project name.
7. The device according to claim 6, characterized by further comprising a classification unit, a deduplication unit, and a filter unit, wherein:
The classification unit is configured to classify the web pages, where web pages for which the same project names are to be constructed belong to the same class, and to trigger the extraction unit;
The extraction unit is further configured to extract the entity words in web pages of the same class and to trigger the deduplication unit;
The deduplication unit is configured to receive the trigger from the extraction unit, deduplicate the entity words of the same-class web pages, and trigger the second lookup unit;
The second lookup unit is further configured to look up, in the synonym lexicon, the codes of the deduplicated entity words of the same-class web pages provided by the deduplication unit;
The filter unit is configured to: filter out, according to the coding rule, those deduplicated entity words found by the second lookup unit whose code ends with "@" or "#"; set a threshold p according to the target project name; compute, with the text frequency formula, the text frequency of each entity word in the same-class web pages extracted by the extraction unit; and compare the text frequency with the threshold p to decide whether to filter out the corresponding entity word, such that when $DF_j < p$ the j-th target entity word is filtered out and otherwise it is retained;
The text frequency formula is $DF_j = D_{bj} / n_b$, where $DF_j$ is the text frequency of the j-th target entity word, $D_{bj}$ is the number of web pages in class b in which the j-th target entity word appears, and $n_b$ is the number of web pages in class b.
8. The device according to claim 6, characterized by further comprising a statistics unit, wherein:
The extraction unit is further configured to generate sequence numbers for the entity words of each web page according to the order in which they appear in the page, forming the entity word set of each web page, which contains the entity words of the web page and the sequence number corresponding to each entity word;
The statistics unit is configured to count the participles of the target project name;
The construction unit is further configured to locate, according to the participle count produced by the statistics unit and the continuity of the sequence numbers of the entity words of each web page, consecutive entity words that can be combined, and to join them into entity word combinations, where the number of entity words in each combination equals the number of participles in the target project name and the sequence numbers of the entity words in a combination are consecutive.
9. The device according to claim 6, characterized in that the matching unit is further configured to:
Pair the participles of the target project name from the first generation unit, in order, one-to-one with the entity words of an entity word combination of each web page from the second generation unit, and check whether the code of the current participle is identical to the code of the current entity word in the current combination; if so, compare the code of the next participle with the code of the next entity word in the current combination; otherwise, end the comparison between the target project name and the current combination and compare the target project name with the next entity word combination; and, when the code of every participle of the target project name is identical to the code of the corresponding entity word of a combination in the entity word combination list, obtain that combination as a synonym of the target project name.
10. The device according to claim 6, characterized by further comprising a third generation unit, wherein:
The third generation unit is configured to integrate the target project name with the synonyms provided by the matching unit and generate a correspondence list for the target project name, the correspondence list containing the target project name and the synonyms corresponding to it.
CN201510093192.XA 2015-03-02 2015-03-02 Project name extraction method and device Pending CN104615590A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510093192.XA CN104615590A (en) 2015-03-02 2015-03-02 Project name extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510093192.XA CN104615590A (en) 2015-03-02 2015-03-02 Project name extraction method and device

Publications (1)

Publication Number Publication Date
CN104615590A true CN104615590A (en) 2015-05-13

Family

ID=53150042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510093192.XA Pending CN104615590A (en) 2015-03-02 2015-03-02 Project name extraction method and device

Country Status (1)

Country Link
CN (1) CN104615590A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045888A (en) * 2015-07-28 2015-11-11 浪潮集团有限公司 Participle training corpus tagging method for HMM (Hidden Markov Model)
CN106776616A (en) * 2015-11-20 2017-05-31 北京国双科技有限公司 Merge the method and device of symmetrical group of entities

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101782998A (en) * 2009-01-20 2010-07-21 复旦大学 Intelligent judging method for illegal on-line product information and system
CN101833556A (en) * 2009-03-12 2010-09-15 英业达股份有限公司 File content management system and method thereof
CN102760134A (en) * 2011-04-28 2012-10-31 北京百度网讯科技有限公司 Method and device for mining synonyms
US20130318124A1 (en) * 2011-02-08 2013-11-28 Fujitsu Limited Computer product, retrieving apparatus, and retrieval method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101782998A (en) * 2009-01-20 2010-07-21 复旦大学 Intelligent judging method for illegal on-line product information and system
CN101833556A (en) * 2009-03-12 2010-09-15 英业达股份有限公司 File content management system and method thereof
US20130318124A1 (en) * 2011-02-08 2013-11-28 Fujitsu Limited Computer product, retrieving apparatus, and retrieval method
CN102760134A (en) * 2011-04-28 2012-10-31 北京百度网讯科技有限公司 Method and device for mining synonyms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
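曹晶 (Cao Jing): "Research on Synonym Mining and Its Application in Concept Information Retrieval Systems" (同义词挖掘及其在概念信息检索系统中的应用研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *
杨关西 (Yang Guanxi): "Research and Implementation of Context-Based Synonym Set Mining" (基于上下文的同义词集挖掘研究与实现), 《华南理工大学硕士学位论文 道客巴巴》 (master's thesis, South China University of Technology) *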

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045888A (en) * 2015-07-28 2015-11-11 浪潮集团有限公司 Participle training corpus tagging method for HMM (Hidden Markov Model)
CN106776616A (en) * 2015-11-20 2017-05-31 北京国双科技有限公司 Merge the method and device of symmetrical group of entities
CN106776616B (en) * 2015-11-20 2020-03-06 北京国双科技有限公司 Method and device for merging symmetrical entity groups

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150513