CN103853738A - Identification method for webpage information related region - Google Patents

Identification method for webpage information related region

Info

Publication number
CN103853738A
Authority
CN
China
Prior art keywords
pronoun
word
nounoun
ground
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210500929.1A
Other languages
Chinese (zh)
Other versions
CN103853738B (en)
Inventor
杨风雷
黎建辉
崔建业
李晓东
周园春
归文胜
汪海燕
杨俊峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Priority to CN201210500929.1A
Publication of CN103853738A
Application granted
Publication of CN103853738B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a method for identifying the region related to webpage information. The method includes: 1) establishing a regional information ontology; 2) extracting the metadata and body text of crawled webpage information and segmenting the extracted title and body text into words; 3) analyzing place-name pronouns (pronouns that stand for places) among the words, judging whether a coreference relation exists between each place-name pronoun and a preceding geographical term, and, if so, replacing the pronoun with the corresponding geographical term; 4) analyzing non-standard place-name words among the words and replacing them with standard words; 5) resolving relative-position regional expressions on the basis of the regional information ontology to obtain exact place-name words; 6) judging the resolved webpage information on the basis of the regional information ontology and assigning it to the region that matches successfully. The method greatly improves the accuracy with which the region related to webpage information is identified.

Description

A method for identifying the region related to webpage information
Technical field
The invention belongs to the field of information technology and relates in particular to a method for determining the region associated with information in a webpage. It is mainly intended for fields such as internet information monitoring, information early warning, and mobile search.
Background art
In recent years, food safety incidents such as clenbuterol, dyed steamed buns, plasticizers, and toxic cucumbers have occurred frequently, causing both a very bad social impact and large economic losses. To avoid or minimize the harm caused by such incidents, event-based risk early-warning technology has begun to attract great attention. Event-based risk early warning requires that information about these events be discovered in advance.
With the rapid development of the Internet, the number of internet users keeps growing. The internet has gradually become the main carrier through which users publish, obtain, and transmit information, and through the interactions among people and organizations it has formed a virtual society that corresponds to, and is associated with, real society. It has become the largest public data source in the world, and its scale keeps increasing. Under these circumstances, it is imperative and highly significant to exploit the characteristics of the internet itself to build a sound social information feedback network, to discover in advance the latent factors that may lead to a crisis, and to provide timely, accurate, and comprehensive information for the emergency management of food safety incidents.
To use internet information for risk early warning of food safety incidents, the event-related information must be obtained through a certain process. Within that process, obtaining the region related to an event is a very important task, because the place where the event occurred can be determined from it, and this is the basis of food safety incident early warning. This requires extracting the content of internet webpage information and analyzing it to determine the region associated with the food safety incident information.
Generally speaking, the traditional way to determine the region (geographic location) associated with webpage information involves three stages: place-name recognition, disambiguation, and geographic area determination. Place-name recognition identifies all place names contained in the webpage information; it is generally done with a gazetteer-based method or with named entity recognition after part-of-speech tagging in natural language processing. Disambiguation (resolution) determines an exact geographic location for a place name that may have several interpretations; the usual approach is to build and compute an index value that measures the popularity of each candidate. Geographic area determination determines the geographic area that the webpage information covers (is associated with). These methods can identify the geographic area related to webpage information to some extent. However, their accuracy is often low, for the following reasons: place names can be identical across different administrative levels; the same word can have different meanings (for example, a place name or a personal name); information often describes locations in relative terms (for example, "south of Beijing"); information contains many referring expressions; a single piece of information may involve several different place names, especially place names of different categories; abbreviations and dialect forms occur in the information; and the accuracy of current natural language processing is itself relatively low.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method that analyzes the content of webpage information in specific steps and thereby determines the geographic area related to that information. Given the needs of food safety incident early warning, the geographic area here mainly means the country and provincial level; regions of other categories can be handled with a similar method, so the granularity is scalable. The method draws on ideas from intelligent systems, and its steps are as follows.
1. Establish the regional information ontology
To meet the needs of extracting the elements of food safety incident information, the regional information ontology is established mainly according to standard administrative divisions. In addition, for each instance in the ontology, an auxiliary table is established with six dimensions: area code, postal code, abbreviation, landmarks, adjacent regions, and relative position.
2. Preprocess the webpage information
For the selected information sources, an internet crawling system crawls the webpage information in each source, and metadata such as title, source, author, publication time, and website location, together with the body text, are extracted and stored. The title and body text are then segmented with a word segmenter, and words that may not be place names are excluded.
3. Resolve place-name pronouns
Pronouns in the webpage title and body text that cannot directly indicate an exact geographic location, such as "this province" or "this city", are resolved. A judgment model is used to decide whether a plausible geographical term exists within the L words, and then within the 2L words (not exceeding the sentence), that precede the place-name pronoun. If no coreference relation holds, the pronoun is determined from the information source or other metadata obtained during metadata extraction.
4. Resolve non-standard words
Place-name words in the webpage title and body text that appear in non-standard forms, such as "beijing" or "bj" in Chinese text, are resolved. Resolution is done mainly by looking the word up in a pre-built table that maps non-standard words to standard words and replacing it.
5. Resolve relative positions
Place-name expressions in the webpage title and body text that use relative positions, such as "the southwestern provinces of China", are resolved. Resolution mainly queries the ontology instances and auxiliary tables established in step 1 to obtain the exact place-name words.
6. Determine the region
After preprocessing and resolution, the region related to the webpage information can be determined. This mainly comprises two steps: judging the related region by pattern matching, and then by a machine learning judgment model.
7. Maintain the ontology
The regional information ontology has an important effect on the accuracy of region determination. To keep improving the method, deficiencies in the ontology such as omissions and errors are supplemented and corrected regularly.
To make the identification of the region related to webpage information accurate and efficient, the present invention establishes a regional information ontology. The ontology is built mainly according to standard administrative divisions, and for each instance an auxiliary table is established with six dimensions: area code, postal code, abbreviation, landmarks, adjacent regions, and relative position.
To improve the accuracy of identification, the present invention first preprocesses the webpage information and then resolves the words that may be place names to obtain unambiguous words. It then judges, by pattern matching and by a judgment model, whether the information can be assigned to a target region, and thereby determines the region related to the webpage information.
To support later steps such as place-name pronoun resolution and relative position resolution, the present invention extracts and stores metadata such as the webpage title and information source from the crawled webpage information, then segments the title and body text and excludes words that may not be place names.
To improve the accuracy of region determination, the preprocessed webpage information undergoes place-name pronoun resolution, relative position resolution, and non-standard word resolution. This solves the problem of low accuracy caused by place-name pronouns, relative positions, and non-standard place-name words.
To improve the accuracy of region determination, pronouns in the webpage title and body text that cannot directly indicate an exact geographic location, such as "this province" or "this city", are resolved. During resolution, a judgment model decides whether a plausible geographical term exists within the 2L words preceding the place-name pronoun; otherwise the pronoun is determined from metadata such as the information source.
To improve the accuracy of region determination, place-name words in the webpage title and body text that appear in non-standard forms, such as "beijing" or "bj" in Chinese text, are resolved. Resolution is done mainly by looking the word up in an established table that maps non-standard words to standard words and replacing it.
To improve the accuracy of region determination, place-name expressions in the webpage title and body text that use relative positions, such as "the southwestern provinces of China", are resolved. Resolution mainly queries the previously established ontology instances and their auxiliary tables.
In determining the region related to webpage information, the present invention applies in sequence pattern matching on the title, pattern matching on the body text, and a judgment model based on machine learning. In the machine learning step, the related region is judged by an integrated region judgment model, which avoids inaccurate region judgments caused by identical place names and by words with more than one meaning (such as common words that are also place names).
Compared with the prior art, the advantages of the present invention are as follows.
After the crawled internet information has been preprocessed and has undergone pronoun resolution, relative position resolution, and non-standard word resolution, the present invention identifies the related region by combining pattern matching with a judgment model based on machine learning. The method solves the problem of low accuracy caused by non-standard place-name words, place-name pronouns, and relative positions. It also avoids inaccurate region judgments caused by identical place names and by words with more than one meaning (including common words that are also place names), and so improves the accuracy with which the region related to webpage information is identified. Because several target regions can be set, the method also handles webpage information that relates to more than one region. By setting target regions at a specific level and building judgment models for them, webpage information can be judged against regions at different levels, which makes the granularity of region judgment scalable. This lays a foundation for the accurate and comprehensive discovery and early warning of food safety incident information.
Brief description of the drawings
Fig. 1 is a flowchart of the method for identifying the region related to webpage information;
Fig. 2 is a schematic diagram of the auxiliary table of the regional information ontology;
Fig. 3 is a schematic diagram of the method for determining the region related to webpage information;
Fig. 4 is a schematic diagram of the method for determining the region related to webpage information based on a machine learning model.
Embodiment
A specific embodiment of the present invention is shown in Fig. 1. Each step is described in detail below.
1. Establish the regional information ontology
Considering the characteristics of food safety incidents and the needs of later analysis such as event information extraction and tracking, the regional information ontology for food safety incidents is built mainly according to standard administrative divisions. For example, regions can be divided at the top level into five categories: Asia, Europe, Africa, the Americas, and Oceania. Each category can be subdivided further; for example, Asia can be divided into six categories: East Asia, West Asia, South Asia, North Asia, Central Asia, and Southeast Asia. The subdivision continues in this way until a category cannot be divided further, at which point it is a bottom-level element (that is, an instance). In addition, for each instance in the ontology, an auxiliary table with six dimensions is established (as shown in Fig. 2) for use in later processing: area code, postal code, abbreviation, landmarks (mountains, lakes, seas, rivers, islands, buildings), adjacent regions (the adjacent same-level regions in each direction, such as east, south, west, and north), and relative position (relative to the upper-level region, such as central or southern).
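As a rough illustration of how an ontology instance and its six-dimension auxiliary table might be represented, the following Python sketch uses a dataclass per instance. The class names, field names, and sample entries are assumptions made for illustration; the patent does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RegionInstance:
    """One ontology instance together with its six auxiliary dimensions."""
    name: str
    parent: Optional[str] = None                 # upper-level region in the hierarchy
    area_code: str = ""
    postal_code: str = ""
    abbreviations: List[str] = field(default_factory=list)
    landmarks: List[str] = field(default_factory=list)
    neighbors: Dict[str, str] = field(default_factory=dict)  # direction -> same-level region
    orientation: str = ""                        # position within the parent, e.g. "southwest"


class RegionOntology:
    """Hypothetical in-memory container for region instances."""

    def __init__(self) -> None:
        self.instances: Dict[str, RegionInstance] = {}

    def add(self, inst: RegionInstance) -> None:
        self.instances[inst.name] = inst

    def children(self, parent: str) -> List[RegionInstance]:
        """Return the instances whose direct parent is the given region."""
        return [i for i in self.instances.values() if i.parent == parent]


# Illustrative entries only; a real ontology would be loaded from curated data.
ontology = RegionOntology()
ontology.add(RegionInstance("China", parent="East Asia"))
ontology.add(RegionInstance("Sichuan", parent="China",
                            abbreviations=["川"], orientation="southwest"))
ontology.add(RegionInstance("Liaoning", parent="China",
                            abbreviations=["辽"], orientation="northeast"))
```

Keeping the parent link on each instance lets later steps walk the hierarchy in either direction without a separate tree structure.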
2. Preprocess the webpage information
For the selected information sources, an internet crawling system (for example, a system based on limited-scope crawling) crawls the webpage information in each source. From the crawled webpage information, metadata such as title, source, author, publication time, and website location are extracted and stored, and the body text is extracted and stored as well.
The extracted title and body text are segmented with a word segmenter based on statistics and a dictionary (including a gazetteer built from the ontology established in step 1). For each word, feature parameters are recorded, such as its start and end positions relative to the text formed by the title and body, the sentence it belongs to, and its positions relative to the start and end of that sentence. Words that may not be place names are then excluded by matching against a vocabulary. The vocabulary is compiled in advance and updated regularly. It contains words that can be both personal names and place names, and words that have other specific meanings but may also be place names. For example, Wuzhong is a city in Ningxia Hui Autonomous Region but can also be a personal name, and Fangzheng is a county in Heilongjiang Province but the same word can also denote the company Founder. Note that words carrying a specific suffix, such as "Wuzhong City", are not excluded.
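The vocabulary-based exclusion described above can be sketched as a simple set lookup. This is a minimal illustration, not the patent's implementation: the vocabulary contents and the function name are hypothetical, and tokens are shown romanized for readability although a real segmenter would emit Chinese words.

```python
from typing import Iterable, List, Set

# Hypothetical vocabulary of words that are place names but are also
# personal names or have another specific meaning.
AMBIGUOUS_WORDS: Set[str] = {"Wuzhong", "Fangzheng"}


def filter_ambiguous(tokens: Iterable[str]) -> List[str]:
    """Drop tokens that exactly match the ambiguous-word vocabulary.

    A suffixed form such as "Wuzhong City" is a different token, so it is
    not in the vocabulary and is kept, mirroring the rule that words with
    a specific place-name suffix are not excluded.
    """
    return [tok for tok in tokens if tok not in AMBIGUOUS_WORDS]
```

For example, `filter_ambiguous(["Wuzhong", "Wuzhong City", "Beijing"])` keeps the last two tokens and drops the bare, ambiguous "Wuzhong".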
3. Resolve place-name pronouns
After segmentation, the webpage title and body text may contain pronouns that stand for places, such as "this province" or "this city". Because these pronouns do not literally indicate an exact geographic location, they must be resolved.
(1) To resolve place-name pronouns, a sliding window for pronoun resolution is established. The window length L is determined in advance, from statistics on the distribution of the number of words between place-name pronouns and their antecedents.
(2) Check whether a plausible geographical term exists within the L words preceding the place-name pronoun (for example, "Liaoning" for "this province"; plausibility is judged by pre-established rules). If one exists, use the judgment model described below to decide whether a coreference relation holds between the geographical term and the place-name pronoun. If it does, determine the geographical term corresponding to the pronoun from that relation, and resolution ends (if several geographical terms satisfy the relation, choose the one nearest to the pronoun). Otherwise go to step (3).
(3) If no plausible geographical term exists within the L words, or the model judges that no coreference relation holds, check whether a plausible geographical term exists within the 2L words preceding the place-name pronoun (not exceeding the sentence, whose boundary can be identified, for example, by a full stop). If one exists, use the judgment model described below to decide whether a coreference relation holds between the geographical term and the place-name pronoun. If it does, determine the geographical term corresponding to the pronoun from that relation, and resolution ends (if several geographical terms satisfy the relation, choose the one nearest to the pronoun). Otherwise go to step (4).
(4) If no plausible geographical term exists within the 2L words, or the model judges that no coreference relation holds, determine the place name that the pronoun refers to from the information source or website location obtained during metadata extraction, by extraction or substitution.
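Steps (2) to (4) amount to a two-stage backward search followed by a metadata fallback, which the following Python sketch illustrates. The function and parameter names are hypothetical; `is_place` stands in for the rule-based plausibility check and `refers` for the trained judgment model. As a simplifying assumption, the sentence boundary is enforced for both window sizes, whereas the text states it explicitly only for the 2L window.

```python
from typing import Callable, Sequence

SENTENCE_END = {".", "。", "!", "?"}


def resolve_pronoun(
    tokens: Sequence[str],
    pronoun_idx: int,
    window: int,
    is_place: Callable[[str], bool],
    refers: Callable[[Sequence[str], int, int], bool],
    fallback_region: str,
) -> str:
    """Return the place name a pronoun refers to, following steps (2) to (4)."""
    # Locate the start of the pronoun's sentence so the search never crosses it.
    sent_start = pronoun_idx
    while sent_start > 0 and tokens[sent_start - 1] not in SENTENCE_END:
        sent_start -= 1

    for span in (window, 2 * window):  # step (2) with L, then step (3) with 2L
        lowest = max(sent_start, pronoun_idx - span)
        # Scan from the nearest candidate outward so the closest match wins.
        for i in range(pronoun_idx - 1, lowest - 1, -1):
            if is_place(tokens[i]) and refers(tokens, i, pronoun_idx):
                return tokens[i]

    # Step (4): fall back to the region taken from the page metadata.
    return fallback_region
```

Scanning from the pronoun backward implements the "nearest geographical term" tie-break without having to collect and sort all candidates.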
Building the judgment model: webpage information containing place-name pronouns is collected to form a sample set. For each place-name pronoun in the sample set, the coreference relation between it and each geographical term in the preceding 2L words (not exceeding the sentence; L as in step (1)) is labeled and used as the class variable. For each such pair, data describing the relation are extracted to form a feature vector. The features include: the length of the geographical term's suffix (the suffix is the part that marks a place name or has place-name character, such as "Autonomous Region" in "Xinjiang Uygur Autonomous Region"; the number of characters in the suffix divided by the text length); the distance between the geographical term and the place-name pronoun (number of words divided by the text length); the relative distance of the geographical term from the start of the text; the relative distance of the place-name pronoun from the start of the text; the relative distance of the geographical term from the start of the sentence; the relative distance of the place-name pronoun from the start of the sentence; the relative distance of the geographical term from the end of the sentence; the relative distance of the place-name pronoun from the end of the sentence (each a number of words divided by the text length); and so on. A machine learning method (such as an SVM) is then applied to the sample set, class variable, and feature vectors to build the judgment model that decides whether a coreference relation exists between a geographical term and a place-name pronoun.
Using the judgment model to decide whether a coreference relation exists between a place-name pronoun and a geographical term: first, data describing the relation between the geographical term and the place-name pronoun are extracted to form a feature vector. The extracted data are the same features used in training: the suffix length of the geographical term (number of characters in the suffix divided by the text length); the distance between the geographical term and the place-name pronoun; the relative distances of the geographical term and of the place-name pronoun from the start of the text; their relative distances from the start of the sentence; and their relative distances from the end of the sentence (each a number of words divided by the text length); and so on. The judgment model built above then classifies the feature vector, and the result determines whether a coreference relation exists between the place-name pronoun and the geographical term.
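The eight listed features can be computed directly from token positions. The sketch below is illustrative only: the function and parameter names are hypothetical, positions are zero-based word indexes, and it assumes that the suffix length is normalized by the text length in characters while the other features are normalized by the text length in words (the text does not say which unit "text length" uses).

```python
from typing import List


def coreference_features(
    n_words: int,
    n_chars: int,
    term_idx: int,
    pronoun_idx: int,
    sent_start: int,
    sent_end: int,
    suffix_chars: int,
) -> List[float]:
    """Build the feature vector for one (geographical term, pronoun) pair."""
    return [
        suffix_chars / n_chars,                 # suffix length of the term
        (pronoun_idx - term_idx) / n_words,     # distance between term and pronoun
        term_idx / n_words,                     # term distance from text start
        pronoun_idx / n_words,                  # pronoun distance from text start
        (term_idx - sent_start) / n_words,      # term distance from sentence start
        (pronoun_idx - sent_start) / n_words,   # pronoun distance from sentence start
        (sent_end - term_idx) / n_words,        # term distance to sentence end
        (sent_end - pronoun_idx) / n_words,     # pronoun distance to sentence end
    ]
```

Vectors of this form, paired with the labeled class variable, could then be passed to any standard classifier such as an SVM for training and prediction.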
4. Resolve non-standard words
After segmentation, the webpage title and body text may contain words that stand for places but use non-standard forms, such as "beijing" or "bj" in Chinese text. Such non-standard place-name forms are resolved by looking them up in an established table that maps non-standard words to standard words (built in advance and updated regularly) and replacing them.
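A lookup-and-replace step of this kind can be sketched with a dictionary. The table contents and function name below are illustrative assumptions; a real table would be curated and updated regularly, and case-insensitive matching is an added convenience not stated in the text.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from non-standard forms to the standard place name.
NONSTANDARD_TO_STANDARD: Dict[str, str] = {
    "beijing": "北京",
    "bj": "北京",
}


def normalize_place_names(tokens: Iterable[str]) -> List[str]:
    """Replace non-standard place-name tokens with their standard form."""
    # Matching on the lowercased token is an assumption for this sketch.
    return [NONSTANDARD_TO_STANDARD.get(tok.lower(), tok) for tok in tokens]
```

Tokens that are not in the table pass through unchanged, so the step is safe to apply to every segmented word.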
5. Resolve relative positions
After segmentation, the webpage title and body text may contain expressions that describe places by relative position, such as "the southwestern provinces of China". These expressions likewise contain no explicit place name. To solve this, the relative-position expressions are resolved by querying the ontology instances and auxiliary tables established in step 1, which yields the exact place-name words. For example, for "the southwestern provinces of China", the provinces belonging to China are first found in the regional information ontology; the relative-position dimension of each province's auxiliary table is then queried; all provinces whose relative position is "southwest" are extracted and substituted for the original expression, which completes the resolution.
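The southwest example reduces to filtering the child regions of a parent by the relative-position dimension. In the sketch below the record layout, names, and sample rows are assumptions for illustration; only the two-step logic (find the children, then filter by position) comes from the text.

```python
from typing import Dict, List

# Illustrative auxiliary-table rows: region, parent, and position within the parent.
REGION_TABLE: List[Dict[str, str]] = [
    {"name": "Sichuan", "parent": "China", "orientation": "southwest"},
    {"name": "Yunnan", "parent": "China", "orientation": "southwest"},
    {"name": "Liaoning", "parent": "China", "orientation": "northeast"},
]


def resolve_relative(parent: str, orientation: str,
                     table: List[Dict[str, str]] = REGION_TABLE) -> List[str]:
    """Return the child regions of `parent` located at `orientation`."""
    return [
        row["name"]
        for row in table
        if row["parent"] == parent and row["orientation"] == orientation
    ]
```

With the sample rows, `resolve_relative("China", "southwest")` yields Sichuan and Yunnan, which would replace the original relative expression in the token stream.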
6. Determine the region
After preprocessing and resolution, the region associated with the webpage information can be determined. This mainly comprises two steps: judging the related region by pattern matching, and then by a machine learning judgment model (as shown in Fig. 3).
The goal of region determination is to identify the region related to the information, which provides the regional basis for discovering food safety incident information. Considering accuracy, computational cost, and operability, pattern matching is applied first. Two issues must be considered here: the matching rules and the scope of the information. The matching rules are based on the established regional information ontology. They mainly use the names and attributes of selected ontology instances, which are combined and then matched against the information. The pattern matching methods used include Boolean matching, frequency matching, and matching on the distance between instance names. The choice of method and the specific rules are determined by statistical analysis of the information (determined in advance and updated regularly). The scope of the information mainly covers two dimensions, the title and the content. Because the title and the content may not agree, the title is processed first. If pattern matching on the title shows that the information can be assigned to the currently selected region (for example, Beijing), pattern matching for that region ends. Otherwise, a second round of pattern matching for that region is applied to the content of the information. Throughout this process, the principle is to prefer leaving information unassigned over assigning it wrongly, so that the identification results are as accurate as possible.
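The title-first, content-second order can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, Boolean matching is used for the title and frequency matching with a threshold for the content, and the threshold value is invented for the example. The text leaves the actual choice of rules to prior statistical analysis.

```python
from typing import Iterable, Optional, Set


def boolean_match(tokens: Iterable[str], region_terms: Set[str]) -> bool:
    """True if any token is a name or attribute of the target region."""
    return any(tok in region_terms for tok in tokens)


def frequency_match(tokens: Iterable[str], region_terms: Set[str],
                    min_count: int) -> bool:
    """True if region terms occur at least `min_count` times."""
    return sum(tok in region_terms for tok in tokens) >= min_count


def match_region(title_tokens: Iterable[str], body_tokens: Iterable[str],
                 region_terms: Set[str], body_min_count: int = 2) -> Optional[str]:
    """Return where the match succeeded ("title" or "body"), or None."""
    if boolean_match(title_tokens, region_terms):
        return "title"          # first round: title only
    if frequency_match(body_tokens, region_terms, body_min_count):
        return "body"           # second round: content
    return None                 # left for the machine learning stage
```

Requiring more than one hit in the content is one way to follow the stated preference for leaving information unassigned over assigning it wrongly.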
If, after the pattern matching above, the information cannot be assigned to a region, a third judgment is made with region judgment models built by machine learning. The models are built in advance as follows. The starting point is a sample set of webpage information (built in advance and updated regularly) that has been processed as in steps 2 to 5 and labeled as to whether it is associated with a given region. The title and content words of each sample that match ontology instance names or attributes are selected and combined, then sorted into five categories: administrative place names (provinces, cities, and so on), area codes, postal codes, abbreviations, and landmarks (mountains, lakes, seas, rivers, islands, buildings, and so on). This yields five feature vectors. The weight of each term in a vector is its word frequency; to reflect the importance of title words, the weight of a title word is multiplied by a predetermined factor. A machine learning method (such as a support vector machine) is then used to build, for each target region, region judgment models based on these five feature vectors (five models, updated regularly as the sample set is updated). The third judgment proceeds as follows. For information that has been processed and resolved in steps 2 to 5 but could not be assigned to a region, the title and content words that match ontology instance names or attributes are selected and combined, then sorted into the same five categories to form five vectors, with the same term weighting. The five region judgment models built above are applied to the five vectors respectively, and their results are combined by weighting. The weight of each category is the sum of word frequencies in that category divided by the sum of word frequencies across all five categories in the webpage information. If the weighted result is greater than a preset threshold, the information can be assigned to the region; otherwise it cannot (as shown in Fig. 4).
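The final weighting step can be written compactly. The sketch below assumes each of the five models returns a numeric decision (for example 1 for "belongs to the region" and 0 otherwise); the category keys, function names, and the threshold value are hypothetical, and only the weight definition (category word-frequency sum divided by the total across the five categories) comes from the text.

```python
from typing import Dict

CATEGORIES = ("admin_name", "area_code", "postal_code", "abbreviation", "landmark")


def weighted_score(freq_sums: Dict[str, float],
                   decisions: Dict[str, float]) -> float:
    """Combine per-category model outputs using word-frequency weights."""
    total = sum(freq_sums[c] for c in CATEGORIES)
    if total == 0:
        return 0.0  # no matching words at all, so nothing supports the region
    return sum(freq_sums[c] / total * decisions[c] for c in CATEGORIES)


def belongs_to_region(freq_sums: Dict[str, float],
                      decisions: Dict[str, float],
                      threshold: float = 0.5) -> bool:
    """Assign the information to the region if the score exceeds the threshold."""
    return weighted_score(freq_sums, decisions) > threshold
```

Because the weights sum to one, categories that contribute more matching words to a page have proportionally more influence on the final decision.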
7. Maintain the ontology
The regional information ontology has an important effect on the accuracy of region determination. Therefore, considering how internet information and the affiliation of regions change over time, and in order to keep improving the method, the process and results of region determination need to be assessed regularly, and deficiencies in the ontology such as omissions and errors need to be supplemented and corrected.
This completes the overall process of determining, fairly completely and accurately, the region associated with webpage information. The method first preprocesses the webpage information, then resolves the words that may be place names to obtain definite place-name words, and then judges, by pattern matching and by judgment models (built in advance for specific target regions), whether the information can be assigned to a target region, thereby determining the region related to the webpage information. The method solves the problem of low accuracy caused by non-standard place-name words, place-name pronouns, and relative positions. It also avoids inaccurate region judgments caused by identical place names and by words with more than one meaning (such as common words that are also place names). Because several target regions can be set, it also handles webpage information that belongs to more than one region. Finally, by setting target regions at a specific level and building judgment models for them, webpage information can be judged against regions at different levels, which makes the granularity of region judgment scalable. Overall, the present invention takes several measures to ensure the accuracy of identifying the region related to webpage information, and thereby lays a foundation for the later mining of the elements of food safety incident information.
It should be noted that the present invention can be used not only to discover the regional element of food safety incident information, but also in any other field that needs to identify the region related to webpage information.

Claims (9)

1. A method for identifying the region related to web page information, comprising the steps of:
1) establishing a region information ontology according to administrative divisions, and establishing an additional table for each instance in the ontology;
2) extracting the metadata information and body content of the crawled web page information, and using a word segmenter to segment the information title in the metadata information and the body content;
3) resolving the place-name pronouns, i.e. pronouns that denote a place, among the words obtained by segmentation: a judgment model is used to judge whether a coreference relation exists between a place-name pronoun and a geographic term appearing before it, and if so, the place-name pronoun is replaced with the corresponding geographic term;
4) resolving the non-standard place-name words among the words obtained by segmentation on the basis of a lookup table of standard and non-standard words, replacing each non-standard word with its standard word;
5) resolving the relative-position region information among the words obtained by segmentation on the basis of the region information ontology instances and their additional tables, to obtain exact place-name words;
6) processing the web page information resolved in steps 3), 4) and 5) by pattern matching against the instance names and attributes in the region information ontology, and assigning the web page information to the region that matches successfully;
wherein the judgment model is established as follows: web page information containing place-name pronouns forms a sample set, and the coreference relation between each place-name pronoun and the geographic terms before it is annotated in the sample set and used as the class variable; a feature vector of the relation between a place-name pronoun and a geographic term before it is established; a machine learning method is then selected to build, from the sample set, the class variable and the feature vector, a judgment model of whether a coreference relation exists between a geographic term and a place-name pronoun;
wherein whether a coreference relation exists between a place-name pronoun and a geographic term appearing before it is judged as follows: the feature vector value of the relation between the place-name pronoun and the geographic term is calculated, and the judgment model is applied to that feature vector value to determine whether the coreference relation between the place-name pronoun and the geographic term exists.
2. The identification method as claimed in claim 1, characterized in that the metadata information comprises the title, source, author, publication time and website location of the web page; and the content of the additional table comprises six items: area code, postal code, abbreviation, famous sights, adjacent regions, and geographic position (orientation and latitude/longitude).
3. The identification method as claimed in claim 2, characterized in that in step 2) the extracted information title and body content are segmented as follows: a word segmenter segments the extracted information title and body content, and for each resulting word records its position relative to the start and end of the text formed by the information title and body content, the sentence it belongs to, and its position relative to the start and end of that sentence.
4. The identification method as claimed in claim 1, 2 or 3, characterized in that an ambiguous place-name table is established, recording place names that can also serve as other names; the ambiguous place-name table is then matched against the words obtained by segmentation in step 2), and the matching words are filtered out; wherein a matching word is retained if it carries a suffix that denotes a place name.
5. The identification method as claimed in claim 1, characterized in that the components of the feature vector comprise: the suffix length of the geographic term, the distance between the geographic term and the place-name pronoun, the relative distance of the geographic term from the start of the text, the relative distance of the place-name pronoun from the start of the text, the relative distance of the geographic term from the start of its sentence, the relative distance of the place-name pronoun from the start of its sentence, the relative distance of the geographic term from the end of its sentence, and the relative distance of the place-name pronoun from the end of its sentence.
6. The identification method as claimed in claim 2, characterized in that the place-name pronouns among the words obtained by segmentation are resolved as follows:
61) establishing a sliding window of length L for pronoun resolution;
62) checking whether a geographic term exists among the L words before the place-name pronoun; if so, applying the judgment model; if a coreference relation exists, determining the geographic term corresponding to the pronoun from that relation and ending the resolution; otherwise proceeding to step 63);
63) checking whether a geographic term exists among the 2L words before the place-name pronoun; if so, applying the judgment model; if a coreference relation exists, determining the geographic term corresponding to the pronoun from that relation and ending the resolution; otherwise proceeding to step 64);
64) determining the place name that the place-name pronoun refers to by extraction or replacement, according to the information source or website location obtained during metadata extraction.
7. The identification method as claimed in claim 6, characterized in that in step 62), if coreference relations hold for several geographic terms among the L words before the place-name pronoun, the geographic term nearest to the place-name pronoun is selected; and in step 64), if coreference relations hold for several geographic terms among the 2L words before the place-name pronoun, the geographic term nearest to the place-name pronoun is selected.
8. The identification method as claimed in claim 2, characterized in that the web page information resolved in steps 3), 4) and 5) is processed by pattern matching against the instance names and attributes in the region information ontology as follows: first, the title of the resolved web page information is matched against the instance names and attributes in the region information ontology, and if the match succeeds the web page is assigned to the selected target region; otherwise the body content of the web page is matched, and if that match succeeds the web page is assigned to the target region.
9. The identification method as claimed in claim 8, characterized in that, if a web page cannot be assigned to the target region, a third judgment is made on that web page using region judgment models established in advance: first, the title and content words of the web page that could not be assigned to the target region are combined and then sorted into five vectors according to five categories, namely administrative place names, area codes, postal codes, abbreviations and famous sights; the target-region judgment models that have been built are applied to these five vectors respectively, and the judgment results are weighted; if the weighted result is greater than the preset threshold, the web page is assigned to the target region, otherwise it cannot be assigned to the target region; wherein the region judgment models are established as follows: a sample set of web page information is built and the web pages are annotated; the title and content words of each web page information sample are combined and then sorted into five feature vectors according to the five categories of administrative place names, area codes, postal codes, abbreviations and famous sights; a machine learning method is then used to build, for the selected region, the region judgment models based on these five feature vectors.
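For illustration only (this sketch is not part of the claims), the sliding-window pronoun resolution of claims 6 and 7 could look like the following Python. The callables `is_geo_term` and `corefers` are hypothetical stand-ins for the ontology lookup and the trained judgment model, and `fallback` stands for the source or website location taken from the metadata in step 64).

```python
def resolve_pronoun(words, pronoun_idx, is_geo_term, corefers,
                    window=5, fallback=None):
    """Resolve the place-name pronoun at `pronoun_idx` to a geographic term.

    Looks back over the previous L words, then over the previous 2L words.
    Candidates are scanned from nearest to farthest, so the first accepted
    candidate is the nearest one. Returns `fallback` if nothing is found.
    """
    for span in (window, 2 * window):
        start = max(0, pronoun_idx - span)
        for i in range(pronoun_idx - 1, start - 1, -1):
            if is_geo_term(words[i]) and corefers(words, i, pronoun_idx):
                return words[i]
    return fallback


geo_terms = {"Beijing", "Tianjin", "Hebei"}
words = ["Beijing", "and", "Tianjin", "said", "that", "this_city"]
resolved = resolve_pronoun(
    words, 5,
    is_geo_term=lambda w: w in geo_terms,
    corefers=lambda ws, i, j: True,  # placeholder for the judgment model
)
print(resolved)
```

The second pass re-examines the first L words as well as the next L; that is redundant but harmless, since any candidate the model rejected in the first pass is rejected again.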
CN201210500929.1A 2012-11-29 2012-11-29 Identification method for webpage information related region Active CN103853738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210500929.1A CN103853738B (en) 2012-11-29 2012-11-29 Identification method for webpage information related region

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210500929.1A CN103853738B (en) 2012-11-29 2012-11-29 Identification method for webpage information related region

Publications (2)

Publication Number Publication Date
CN103853738A true CN103853738A (en) 2014-06-11
CN103853738B CN103853738B (en) 2017-06-27

Family

ID=50861404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210500929.1A Active CN103853738B (en) 2012-11-29 2012-11-29 Identification method for webpage information related region

Country Status (1)

Country Link
CN (1) CN103853738B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426603A (en) * 2011-11-11 2012-04-25 任子行网络技术股份有限公司 Text information regional recognition method and device
US20120101807A1 (en) * 2010-10-25 2012-04-26 Electronics And Telecommunications Research Institute Question type and domain identifying apparatus and method
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUNDA LIU等: "Building HKH Region Geographic Information Sharing Network China Node Based on GeoNetwork", 《INFORMATION TECHNOLOGY AND APPLICATIONS (IFITA), 2010 INTERNATIONAL FORUM ON》 *
杜萍: "基于本体的中国行政区划地名识别与抽取研究", 《中国博士学位论文全文数据库哲学与人文科学辑》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572992B (en) * 2015-01-06 2018-07-17 武汉工程大学 Internet geographical location information normalization method based on multiple constraint reasoning
CN104572992A (en) * 2015-01-06 2015-04-29 武汉工程大学 Multi-constraint reasoning based standardization method for internet geographical location information
CN104615782A (en) * 2015-03-02 2015-05-13 武汉工程大学 Address matching method based on sliding window maximum matching algorithm
CN104615782B (en) * 2015-03-02 2017-10-10 武汉工程大学 Address matching process based on sliding window maximum matching algorithm
CN104951543A (en) * 2015-06-19 2015-09-30 百度在线网络技术(北京)有限公司 Information processing method and device realized through computer
CN104951543B (en) * 2015-06-19 2019-02-22 百度在线网络技术(北京)有限公司 Pass through computer implemented information processing method and device
CN105068989B (en) * 2015-07-23 2018-05-04 中国测绘科学研究院 Place name address extraction method and device
CN105068989A (en) * 2015-07-23 2015-11-18 中国测绘科学研究院 Place name and address extraction method and apparatus
CN107133311A (en) * 2017-04-28 2017-09-05 安徽博约信息科技股份有限公司 Network information ownership place index marker method based on regional code
CN107229698A (en) * 2017-05-24 2017-10-03 北京神州泰岳软件股份有限公司 A kind of method and device of information processing
CN107193974A (en) * 2017-05-25 2017-09-22 北京百度网讯科技有限公司 Localized information based on artificial intelligence determines method and apparatus
US11475055B2 (en) 2017-05-25 2022-10-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Artificial intelligence based method and apparatus for determining regional information
CN107193974B (en) * 2017-05-25 2020-11-10 北京百度网讯科技有限公司 Regional information determination method and device based on artificial intelligence
CN107590123A (en) * 2017-08-07 2018-01-16 问众智能信息科技(北京)有限公司 Vehicle-mounted middle place context reference resolution method and device
CN107766322A (en) * 2017-08-31 2018-03-06 平安科技(深圳)有限公司 Entity recognition method, electronic equipment and computer-readable recording medium of the same name
CN107977399A (en) * 2017-10-09 2018-05-01 北京知道未来信息技术有限公司 A kind of English email signature extracting method and system based on machine learning
CN108280147A (en) * 2018-01-02 2018-07-13 浪潮软件集团有限公司 Data management method and device
CN109272377A (en) * 2018-08-23 2019-01-25 北京京东尚科信息技术有限公司 Articles search method and apparatus
CN109408819A (en) * 2018-10-16 2019-03-01 武大吉奥信息技术有限公司 A kind of core place name extracting method and device based on natural language processing technique
CN110399613A (en) * 2019-07-26 2019-11-01 浪潮软件股份有限公司 A kind of internet news based on part-of-speech tagging are related to place name identification method and system
CN110399613B (en) * 2019-07-26 2023-03-31 浪潮软件股份有限公司 Method and system for identifying internet news related to place names based on part-of-speech tagging
CN111045998A (en) * 2019-12-16 2020-04-21 北京智游网安科技有限公司 Statistical method, system and storage medium for application program affiliated area
CN110991176A (en) * 2020-02-27 2020-04-10 北京海天瑞声科技股份有限公司 Cross-language non-standard word recognition method and device
CN112069824A (en) * 2020-11-11 2020-12-11 北京智慧星光信息技术有限公司 Region identification method, device and medium based on context probability and citation

Also Published As

Publication number Publication date
CN103853738B (en) 2017-06-27

Similar Documents

Publication Publication Date Title
CN103853738A (en) Identification method for webpage information related region
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN102054015B (en) System and method of organizing community intelligent information by using organic matter data model
Zhang et al. Entity linking leveraging automatically generated annotation
CN103399901B (en) A kind of keyword abstraction method
CN104573028A (en) Intelligent question-answer implementing method and system
CN106777957B (en) The new method of biomedical more ginseng event extractions on unbalanced dataset
CN102081602B (en) Method and equipment for determining category of unlisted word
CN101609450A (en) Web page classification method based on training set
CN106570180A (en) Artificial intelligence based voice searching method and device
CN103853700B (en) A kind of event method for early warning found based on region and object information
CN103854064A (en) Event occurrence risk prediction and early warning method targeted to specific zone
CN105426354A (en) Sentence vector fusion method and apparatus
CN103854063A (en) Internet open information-based event occurrence risk prediction and early-warning method
CN110826312B (en) Software requirement specification evaluation method
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN113722478B (en) Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
CN109344355A (en) Automatic returning detection and Block- matching adaptive approach and device for Web evolution
CN104462143A (en) Method and device for establishing chain brand word bank and category word bank
Mokhtari et al. Tagging address queries in maps search
CN110232160B (en) Method and device for detecting interest point transition event and storage medium
CN102737244A (en) Method for determining corresponding relationships between areas and annotations in annotated image
CN108830108A (en) A kind of web page contents altering detecting method based on NB Algorithm
Hu et al. Large-scale location prediction for web pages
CN109344233A (en) A kind of Chinese personal name recognition method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant