CN113626536B - News geocoding method based on deep learning - Google Patents

Publication number
CN113626536B
Authority
CN
China
Prior art keywords
news
text
name
place
geocoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110747499.2A
Other languages
Chinese (zh)
Other versions
CN113626536A (en
Inventor
罗运
胡宏伟
余思佳
罗彩玉
蔡忠亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202110747499.2A
Publication of CN113626536A
Application granted
Publication of CN113626536B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Remote Sensing (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a news geocoding method based on deep learning, which is used to geocode news content. The invention combines a deep learning model with a place name database to geocode news at the province, city and county levels, thereby obtaining the longitude and latitude information contained in the news. This lets readers perceive the location of a news event more intuitively, and the result can also support functions such as geographic search and filtering, sorting by distance, and regional recommendation of news. The invention uses a deep learning model based on ERNIE and Bi-GRU-CRF to perform the named entity recognition task, which identifies place names and organization names in news text more accurately and efficiently. By applying different content extraction methods to different news sources, the method is compatible with news text extraction from the major news portal websites, so that news geocoding has a wider range of application.

Description

News geocoding method based on deep learning
Technical Field
The invention relates to text mining, natural language processing and geocoding in computer science and technology, in particular to a news geocoding method based on deep learning.
Background
With the development of science and network technology, we have entered an era of information explosion, and obtaining information efficiently is increasingly important in today's society. News is an important way for people to obtain information. It is a short report of a recent fact that has social significance and is of public interest, and it is one of the most widely used genres in newspapers, radio and television.
However, news is poorly presented in the space-time dimension. In traditional news reading, people can only learn where the news occurred, or which places it relates to, by reading the news text or viewing pictures; they cannot see the geographic position of the event directly. Readers therefore lack an understanding of the geographic position of the news and of the surroundings of the news location, and news information cannot be perceived and read intuitively. At the same time, the spatial attributes of news cannot be fully mined, so functions such as geographic search and filtering, sorting by distance, and regional recommendation of news are difficult to realize.
Disclosure of Invention
The invention provides a news geocoding method based on deep learning. It uses deep learning to extract the places mentioned in a news text and geocodes them, thereby obtaining the longitude and latitude information contained in the news. Readers can then locate a news event intuitively, and the result can support functions such as geographic search and filtering, sorting by distance, and regional recommendation of news.
The technical scheme provided by the invention is a news geocoding method based on deep learning, which comprises the following steps:
step S10, constructing a place name database of the provinces, cities and counties of China;
step S20, extracting the news text from a given news link or given news content by using a text extractor;
step S30, searching the place name database with the news text extracted in step S20 to obtain the province, city and county names that may be involved in the news;
step S40, performing a named entity recognition task on the news text extracted in step S20 with the deep learning model based on ERNIE and Bi-GRU-CRF, and obtaining a place name candidate list for the news text;
step S50, according to the county name, city name and province name candidate lists, calling a local or national geocoding service to geocode the place names in the place name candidate list;
step S60, organizing the coding result candidate list into a structure in which the geocoding results are arranged as national - place name, province - place name, city - place name and county - place name pairs. The geocoding of the news is thereby realized: the geographic coordinates of the news are determined and the longitude and latitude data contained in the news are obtained.
Further, the step S10 specifically includes:
step S101, obtaining the provinces, cities and counties of all of China, and classifying them as provinces, municipalities directly under the central government, other cities, and counties;
step S102, building the subordinate relations between place names: the parent-child relations between each province and its cities and between each city and its counties are matched one by one and stored in the database.
Step S103, establishing a query service which, given a query keyword, reports whether the corresponding place name exists in the database and returns its place name type and its parent-child relations.
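Steps S101 to S103 can be sketched as a small in-memory gazetteer. This is only an illustration under stated assumptions: the class and method names (`Place`, `Gazetteer`, `query`, `ancestors`) are hypothetical, the three entries are sample data, and place names are treated as unique keys, which a real database of Chinese counties would not be able to assume.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Place:
    name: str
    level: str              # "province", "municipality", "city" or "county"
    parent: Optional[str]   # name of the parent place, None at the top level


class Gazetteer:
    """In-memory stand-in for the place name database of step S10."""

    def __init__(self) -> None:
        self._places: Dict[str, Place] = {}

    def add(self, name: str, level: str, parent: Optional[str] = None) -> None:
        # S101/S102: store the place together with its type and parent.
        self._places[name] = Place(name, level, parent)

    def query(self, keyword: str) -> Optional[Place]:
        # S103: report whether the name exists, with its type and parent.
        return self._places.get(keyword)

    def ancestors(self, name: str) -> List[str]:
        # Walk the parent-child chain upwards (county -> city -> province).
        chain: List[str] = []
        place = self._places.get(name)
        while place is not None and place.parent is not None:
            chain.append(place.parent)
            place = self._places.get(place.parent)
        return chain


gazetteer = Gazetteer()
gazetteer.add("湖北省", "province")
gazetteer.add("武汉市", "city", parent="湖北省")
gazetteer.add("武昌区", "county", parent="武汉市")
```

Here `gazetteer.query("武汉市")` returns the stored type and parent, and `gazetteer.ancestors("武昌区")` walks up to the city and then the province.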
Further, the step S20 specifically includes:
Step S201, the web page is first parsed into a DOM tree by reading its HTML code; each HTML tag is a node, and all text forms leaf nodes of the DOM tree. Each node in the DOM tree is traversed, and the total number of characters in the strings of all text leaf nodes under the node is divided by the total number of child nodes contained in the node to obtain the text density of the node. The node with the highest text density in the DOM tree is selected as the news text DOM node, and the news text is obtained from the text content of the text leaf nodes under that DOM node.
Step S202, determining whether the given content is a link or text. If it is text, it is used directly as the news text in the subsequent steps.
Step S203, for a link, identifying the news website. If the linked website is a news portal other than WeChat and Sina Weibo, the page is processed as in step S201 to extract the news text. If the linked website is Sina Weibo or WeChat, CSS and XPath selectors are used: the class, id and data attributes of the DOM node containing the news text are determined from the DOM structure of the Weibo and WeChat pages, and the news text is obtained with the CSS and XPath rules corresponding to those attributes.
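The text density rule of step S201 can be sketched with Python's standard `html.parser`. This is a minimal reading of the rule, not the patented extractor: "child nodes contained in the node" is interpreted here as all descendant nodes, script and style content is skipped, and every name (`Node`, `TreeBuilder`, `extract_news_text`) is hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag          # "#text" marks a text leaf
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree of element nodes and text leaves."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open tag; ignore unmatched end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIP_TAGS:
            leaf = Node("#text", self.current)
            leaf.text = text
            self.current.children.append(leaf)


def stats(node):
    """Return (characters in text leaves, number of descendant nodes)."""
    if node.tag == "#text":
        return len(node.text), 0
    chars = count = 0
    for child in node.children:
        c, n = stats(child)
        chars += c
        count += n + 1
    return chars, count


def densest_node(root):
    best, best_density = None, -1.0
    stack = list(root.children)
    while stack:
        node = stack.pop()
        if node.tag == "#text":
            continue
        chars, count = stats(node)
        density = chars / count if count else 0.0
        if density > best_density:
            best, best_density = node, density
        stack.extend(node.children)
    return best


def node_text(node):
    if node.tag == "#text":
        return node.text
    return "\n".join(t for t in map(node_text, node.children) if t)


def extract_news_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = densest_node(builder.root)
    return node_text(best) if best else ""
```

Read literally, this ratio favors the single element with the most text per descendant, which on a multi-paragraph article is often one long paragraph rather than the whole article body. A practical extractor would likely restrict the candidates or aggregate sibling paragraphs; the text does not say how, so that refinement is left out.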
Further, the step S30 specifically includes:
step S301, searching for province names: if the text contains a province name that exists in the database, it is added to the province name candidate list.
Step S302, searching for city names: if the text contains a city name that exists in the database, the city name is added to the city name candidate list, and the province to which the city belongs is added to the province name candidate list.
Step S303, searching for county names: if the text contains a county name that exists in the database, the county name is added to the county name candidate list, and the city to which the county belongs and the province to which that city belongs are added to the city name and province name candidate lists respectively.
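Steps S301 to S303 amount to substring matching against the place name database, with each match also pulling in its parents. A minimal sketch follows; the three lookup tables are a hypothetical miniature of the database, and the function names are illustrative.

```python
# Hypothetical miniature place name database (sample entries only).
PROVINCES = ["湖北省", "湖南省"]
CITY_TO_PROVINCE = {"武汉市": "湖北省", "宜昌市": "湖北省"}
COUNTY_TO_CITY = {"武昌区": "武汉市"}


def add_unique(items, name):
    """Append name unless it is already in the list."""
    if name not in items:
        items.append(name)


def match_admin_names(text):
    """Return (province, city, county) candidate lists for a news text."""
    provinces, cities, counties = [], [], []
    for province in PROVINCES:                        # S301
        if province in text:
            add_unique(provinces, province)
    for city, province in CITY_TO_PROVINCE.items():   # S302
        if city in text:
            add_unique(cities, city)
            add_unique(provinces, province)
    for county, city in COUNTY_TO_CITY.items():       # S303
        if county in text:
            add_unique(counties, county)
            add_unique(cities, city)
            add_unique(provinces, CITY_TO_PROVINCE[city])
    return provinces, cities, counties
```

A text that mentions only 武昌区 therefore yields one county candidate plus its parent city and province, which is what later allows the geocoding call to be restricted to a region.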
Further, the step S40 specifically includes:
step S401, constructing the deep learning model based on ERNIE and Bi-GRU-CRF. The ERNIE model uses the ERNIE Base structure and consists of 12 stacked Encoder layers, the input and output of each Encoder layer having 768 hidden units. Each Encoder layer is formed by stacking a self-attention layer, a normalization layer, a fully connected layer and a normalization layer, and each self-attention layer contains 12 attention heads.
The Bi-GRU-CRF model is formed by stacking 2 bidirectional GRU layers and a fully connected layer; the output of the fully connected layer is fed into the CRF layer, which yields the most probable label for each word in the sentence and thereby outputs the named entity recognition result.
The ERNIE model takes the text string as input and outputs a matrix of 768-dimensional text embedding vectors; this matrix is fed into the bottom bidirectional GRU layer of the Bi-GRU-CRF model, and the most probable label of each word is finally output by the CRF layer.
The ERNIE model is trained on the MSRA-NER (SIGHAN 2006) dataset to yield a pre-trained model. When text is input into this model, the matrix of 768-dimensional text embedding vectors corresponding to the model structure is obtained. The Bi-GRU-CRF model is trained with the LAC corpus as the training set; during training only the parameters of the bidirectional GRU layers and the CRF layer are updated, while the parameters of the ERNIE model are frozen and do not participate in training. With the deep learning model based on ERNIE and Bi-GRU-CRF, the named entity recognition task can be performed to obtain the place names and organization names contained in a text. Place names include provinces, cities, counties, roads, landmark names and the like; organization names include government agencies, educational institutions, leisure and entertainment venues and the like.
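A CRF layer's selection of the most probable label sequence is commonly computed with Viterbi decoding. The text does not name the decoding algorithm or the tag set, so the sketch below is an assumption-laden illustration: the BIO-style labels, the scores and the function name are all made up, and the per-word emission scores stand in for the output of the layers beneath the CRF.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]   score of label j at position t
    transitions[i][j] score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


LABELS = ["O", "B-LOC", "I-LOC"]          # assumed tag set
# Assumed transition scores: "O" followed by "I-LOC" is heavily penalized.
TRANSITIONS = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
EMISSIONS = [[1.0, 0.9, 0.0],             # word 1 slightly prefers "O"
             [0.0, 0.0, 1.0]]             # word 2 prefers "I-LOC"

decoded = [LABELS[i] for i in viterbi_decode(EMISSIONS, TRANSITIONS)]
```

Picking the best label per word independently would give `O, I-LOC`, an invalid sequence. Because the transition scores are taken into account, `decoded` is `["B-LOC", "I-LOC"]` instead, which is the reason for placing a CRF on top of the recurrent layers.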
Step S402, merging the place names and organization names obtained by named entity recognition to obtain the place name candidate list.
Step S403, traversing the place name candidate list and deleting the place names that already appear in the province name, city name and county name candidate lists.
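Steps S402 and S403 can be sketched as a merge followed by a filter. The function name is hypothetical, and removing duplicates inside the merged list itself is an extra assumption beyond what the steps state.

```python
def build_place_candidates(locations, organizations,
                           provinces, cities, counties):
    """Merge NER outputs and drop names already matched as admin areas."""
    admin_names = set(provinces) | set(cities) | set(counties)
    candidates = []
    for name in list(locations) + list(organizations):   # S402: merge
        if name in admin_names:                           # S403: drop admin names
            continue
        if name not in candidates:                        # assumption: no repeats
            candidates.append(name)
    return candidates


candidates = build_place_candidates(
    locations=["武汉市", "东湖"],
    organizations=["武汉大学", "东湖"],
    provinces=["湖北省"], cities=["武汉市"], counties=[],
)
```

In this example `candidates` is `["东湖", "武汉大学"]`: the city name is removed because it already serves as a geocoding range, not as a point to geocode.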
Further, the step S50 specifically includes:
step S501, determining the geocoding range list. If the county name candidate list has more than 1 entry, the geocoding range list is the county name candidate list. Otherwise, if the city name candidate list has more than 1 entry, the geocoding range list is the city name candidate list. Otherwise, if the province name candidate list has more than 1 entry, the geocoding range list is the province name candidate list. Otherwise, the geocoding range is national.
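The cascade of step S501 can be written as a single loop from the most specific level to the least specific. The comparison below follows the text as written (strictly greater than 1); whether a list with a single entry should also qualify is not settled by the text, so the threshold is exposed as a parameter. The function name and the level labels are illustrative.

```python
def select_geocode_range(counties, cities, provinces, threshold=1):
    """Return (level, range_list); level is "national" when no list qualifies."""
    levels = (("county", counties), ("city", cities), ("province", provinces))
    for level, names in levels:
        if len(names) > threshold:
            return level, list(names)
    return "national", []
```

For example, `select_geocode_range([], ["武汉市", "宜昌市"], ["湖北省"])` selects the city list, while a call in which every list has at most one entry falls through to the national range.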
Step S502, traversing the place name candidate list. If the geocoding range is not national, then for each item in the geocoding range list the local geocoding service is called for the current place name, with that item as the city restriction. Given the query keyword, the search geographic range restriction parameter, the search city restriction parameter and the search category restriction parameter, the local geocoding service searches the national POI (point of interest) database of Baidu Maps and returns the longitude and latitude of the query keyword in the BD09 coordinate system, together with an understanding score and a credibility score for the coding result. The understanding score indicates whether the input query keyword exists in the POI database and how similar it is to the data there: the higher the understanding score, the more correct the format of the place name and the more likely the corresponding coordinates are found in the database. The credibility score indicates the extent of the area to which the queried place name refers: the higher the credibility score, the more specific the location. To ensure correct place name resolution, the understanding score threshold is set to 70 and the credibility score threshold to 20, so that wrong resolution results are filtered out. The result of the Baidu geocoding service is checked, and the coding result is stored in the coding result candidate list if its understanding score is greater than 70 and its credibility score is greater than 20.
If the geocoding range is national, the national geocoding service is called for the place name. Given the query keyword parameter, the national geocoding service searches the national POI database of Amap (Gaode Map) and returns the longitude and latitude in the Mars coordinate system (GCJ-02). The national geocoding service has high coding accuracy for place names nationwide without a province or city range being specified: it returns a geocoding result only when the place name is correct and unique, and the returned longitude and latitude in the Mars coordinate system are accurate. If the returned result is not null, the coding result is stored in the coding result candidate list.
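The control flow of step S502 can be sketched with the two geocoding services passed in as plain callables. Nothing here calls a real map API: `local_geocode` and `national_geocode` are hypothetical stand-ins, the result field names are assumptions, and the coordinates and scores in the stub tables are made-up values. Only the thresholds (70 and 20) come from the text.

```python
def geocode_candidates(place_names, range_level, range_names,
                       local_geocode, national_geocode,
                       min_understanding=70, min_credibility=20):
    """Collect coding results that pass the score filters of step S502."""
    results = []
    for name in place_names:
        if range_level == "national":
            hit = national_geocode(name)
            if hit is not None:                    # keep any non-null result
                results.append({"name": name, "scope": "national", **hit})
            continue
        for region in range_names:                 # one call per range item
            hit = local_geocode(name, region)
            if hit is None:
                continue
            if (hit["understanding"] > min_understanding
                    and hit["credibility"] > min_credibility):
                results.append({"name": name, "scope": region, **hit})
    return results


# Stub services with made-up values, used only to exercise the function.
def fake_local(name, region):
    table = {
        ("东湖", "武汉市"): {"lng": 114.40, "lat": 30.56,
                          "understanding": 90, "credibility": 50},
        ("东湖", "宜昌市"): {"lng": 111.30, "lat": 30.70,
                          "understanding": 60, "credibility": 50},
    }
    return table.get((name, region))


def fake_national(name):
    return {"lng": 116.40, "lat": 39.90} if name == "天安门" else None


local_hits = geocode_candidates(["东湖"], "city", ["武汉市", "宜昌市"],
                                fake_local, fake_national)
national_hits = geocode_candidates(["天安门", "未知地"], "national", [],
                                   fake_local, fake_national)
```

With these stubs, the 宜昌市 match is discarded because its understanding score does not exceed 70, so `local_hits` holds only the 武汉市 result. Note that the two services return coordinates in different systems (BD09 and GCJ-02), so a consumer mixing both kinds of result would need to track which system each coordinate uses.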
Step S503, traversing the coding result candidate list; for place names with the same name, the item with the higher understanding score and credibility score is kept and the other items are deleted.
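Step S503 can be sketched as keeping one best item per place name. The text does not say how the two scores are combined, so ranking by the pair (understanding score, credibility score) is an assumption, as are the function and field names; results without scores are treated as scoring 0.

```python
def keep_best_per_name(results):
    """Keep, for each place name, the item with the highest score pair."""
    best = {}
    for item in results:
        key = (item.get("understanding", 0), item.get("credibility", 0))
        current = best.get(item["name"])
        if current is None or key > current[0]:
            best[item["name"]] = (key, item)
    return [item for _, item in best.values()]


deduplicated = keep_best_per_name([
    {"name": "东湖", "scope": "武汉市", "understanding": 90, "credibility": 50},
    {"name": "东湖", "scope": "宜昌市", "understanding": 80, "credibility": 60},
    {"name": "武汉大学", "scope": "武汉市", "understanding": 100, "credibility": 80},
])
```

Here the 东湖 entry scoped to 武汉市 is kept because its understanding score is higher, and the order of first appearance of each name is preserved.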
Compared with the prior art, the invention has the advantages that:
The invention uses the deep learning model based on ERNIE and Bi-GRU-CRF to perform the named entity recognition task, which identifies place names and organization names in news text more accurately and efficiently. A place name database is constructed from the provinces, cities and counties of the whole country and matched against the news content, which limits the range over which the geocoding service is called and makes the geocoding result more accurate. By applying different content extraction methods to different news sources, the method is compatible with news text extraction from the major news portal websites, so that news geocoding has a wider range of application.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a flowchart of an algorithm embodying the present invention.
Detailed Description
For a further understanding of the present invention, preferred embodiments of the invention are described below in conjunction with the examples, but it should be understood that these descriptions are merely intended to illustrate further features and advantages of the invention, and are not limiting of the claims of the invention.
The flow of the news geocoding method based on deep learning is shown in FIG. 1, and the method comprises the following steps:
Step S10, constructing a place name database of the provinces, cities and counties of China.
Step S20, for a given news link, extracting the news text with a general news website text extractor.
Step S30, searching the place name database with the news text extracted in step S20 to obtain the province, city and county names that may be involved in the news.
Step S40, performing a named entity recognition task on the news text extracted in step S20 with the deep learning model based on ERNIE and Bi-GRU-CRF, and obtaining a place name candidate list for the news text.
Step S50, according to the county name, city name and province name candidate lists, calling a local or national geocoding service to geocode the place names in the place name candidate list.
Step S60, organizing the coding result candidate list into a structure in which the geocoding results are arranged as national - place name, province - place name, city - place name and county - place name pairs. The geocoding of the news is thereby realized: the geographic coordinates of the news are determined and the longitude and latitude data contained in the news are obtained.
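One way to read the structured organization of step S60 is as a grouping of each geocoded place under the range it was resolved in. The output layout below is an assumption, since the text only names the pairing (national, province, city or county with place name); the function name, the field names and the sample coordinates are illustrative.

```python
def organize_results(results, range_level):
    """Group geocoded places under the range (scope) they were resolved in."""
    grouped = {}
    for item in results:
        entry = {"name": item["name"], "lng": item["lng"], "lat": item["lat"]}
        grouped.setdefault(item["scope"], []).append(entry)
    return {"level": range_level, "groups": grouped}


organized = organize_results(
    [
        {"name": "东湖", "scope": "武汉市", "lng": 114.40, "lat": 30.56},
        {"name": "武汉大学", "scope": "武汉市", "lng": 114.36, "lat": 30.54},
    ],
    "city",
)
```

The result records the level once and lists both places under 武汉市, which corresponds to the city - place name arrangement.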
In one illustrated embodiment, step S10 specifically includes:
Step S101, obtaining the provinces, cities and counties of all of China, and classifying them as provinces, municipalities directly under the central government, other cities, and counties.
Step S102, building the subordinate relations between place names: the parent-child relations between each province and its cities and between each city and its counties are matched one by one and stored in the database.
Step S103, establishing a query service which, given a query keyword, reports whether the corresponding place name exists in the database and returns its place name type and its parent-child relations.
In one illustrated embodiment, step S20 specifically includes:
Step S201, the web page is first parsed into a DOM tree by reading its HTML code; each HTML tag is a node, and all text forms leaf nodes of the DOM tree. Each node in the DOM tree is traversed, and the total number of characters in the strings of all text leaf nodes under the node is divided by the total number of child nodes contained in the node to obtain the text density of the node. The node with the highest text density in the DOM tree is selected as the news text DOM node, and the news text is obtained from the text content of the text leaf nodes under that DOM node.
Step S202, determining whether the given content is a link or text. If it is text, it is used directly as the news text in the subsequent steps.
Step S203, for a link, identifying the news website. If the linked website is Sina Weibo or WeChat, CSS and XPath selectors are used: the class, id and data attributes of the DOM node containing the news text are determined from the DOM structure of the Weibo and WeChat pages, and the news text is obtained with the CSS and XPath rules corresponding to those attributes.
If the linked website is any other news portal, the page is processed as in step S201 to extract the news text.
In one illustrated embodiment, step S30 specifically includes:
Step S301, searching for province names: if the text contains a province name that exists in the database, it is added to the province name candidate list.
Step S302, searching for city names: if the text contains a city name that exists in the database, the city name is added to the city name candidate list, and the province to which the city belongs is added to the province name candidate list.
Step S303, searching for county names: if the text contains a county name that exists in the database, the county name is added to the county name candidate list, and the city to which the county belongs and the province to which that city belongs are added to the city name and province name candidate lists respectively.
In one illustrated embodiment, step S40 specifically includes:
Step S401, merging the place names and organization names obtained by named entity recognition to obtain the place name candidate list.
Step S402, traversing the place name candidate list and deleting the place names that already appear in the province name, city name and county name candidate lists.
In one illustrated embodiment, step S50 specifically includes:
Step S501, determining the geocoding range list. If the county name candidate list has more than 1 entry, the geocoding range list is the county name candidate list. Otherwise, if the city name candidate list has more than 1 entry, the geocoding range list is the city name candidate list. Otherwise, if the province name candidate list has more than 1 entry, the geocoding range list is the province name candidate list. Otherwise, the geocoding range is national.
Step S502, traversing the place name candidate list. If the geocoding range is not national, the local geocoding service is called for the current place name according to the items in the geocoding range list, and the coding result is stored in the coding result candidate list if its understanding score is greater than 70 and its credibility score is greater than 20. If the geocoding range is national, the national geocoding service is called for the place name, and the coding result is stored in the coding result candidate list if the returned result is not null.
Step S503, traversing the coding result candidate list; for place names with the same name, the item with the higher understanding score and credibility score is kept and the other items are deleted.
The above description of the embodiments is only for aiding in the understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims (5)

1. A news geocoding method based on deep learning, characterized by comprising the following steps:
step S10, constructing a place name database of the provinces, cities and counties of China;
step S20, extracting the news text from a given news link or given news content by using a text extractor;
the step S20 specifically includes the steps of,
step S201, first parsing the web page into a DOM tree by reading its HTML code, wherein each HTML tag is a node and all text forms leaf nodes of the DOM tree; traversing each node in the DOM tree, and dividing the total number of characters in the strings of all text leaf nodes under the node by the total number of child nodes contained in the node to obtain the text density of the node; selecting the node with the highest text density in the DOM tree as the news text DOM node; and obtaining the news text from the text content of the text leaf nodes under that DOM node;
step S202, determining whether the given content is a link or text, and, if it is text, using it directly as the news text in the subsequent steps;
step S203, for a link, identifying the news website; if the linked website is a news portal other than WeChat and Sina Weibo, processing the page as in step S201 to extract the news text; if the linked website is Sina Weibo or WeChat, using CSS and XPath selectors, determining the class, id and data attributes of the DOM node containing the news text from the DOM structure of the Weibo and WeChat pages, and obtaining the news text with the CSS and XPath rules corresponding to those attributes;
step S30, searching the place name database with the news text extracted in step S20 to obtain the province, city and county names that may be involved in the news;
step S40, performing a named entity recognition task on the news text extracted in step S20 with the ERNIE pre-trained model and the Bi-GRU-CRF deep learning model, and obtaining a place name candidate list for the news text;
step S50, according to the county name, city name and province name candidate lists, calling a local or national geocoding service to geocode the place names in the place name candidate list;
the step S50 specifically includes the steps of,
step S501, determining the geocoding range list; if the county name candidate list has more than 1 entry, the geocoding range list is the county name candidate list; otherwise, if the city name candidate list has more than 1 entry, the geocoding range list is the city name candidate list; otherwise, if the province name candidate list has more than 1 entry, the geocoding range list is the province name candidate list; otherwise, the geocoding range is national;
step S502, traversing the place name candidate list; if the geocoding range is not national, then for each item in the geocoding range list calling the Baidu geocoding service for the current place name with that item as the city restriction, wherein the Baidu geocoding service, given the query keyword, the search geographic range restriction parameter, the search city restriction parameter and the search category restriction parameter, searches the national POI (point of interest) database of Baidu Maps and returns the longitude and latitude of the query keyword in the BD09 coordinate system together with an understanding score and a credibility score for the coding result; the understanding score indicates whether the input query keyword exists in the POI database and how similar it is to the data there, a higher understanding score meaning that the format of the place name is more correct and that the corresponding coordinates are more likely to be found in the database; the credibility score indicates the extent of the area to which the queried place name refers, a higher credibility score meaning a more specific location; if the understanding score of the coding service result is greater than P1 and the credibility score is greater than P2, storing the coding result in the coding result candidate list;
if the geocoding range is national, calling the national geocoding service for the place name, wherein the national geocoding service, given the query keyword parameter, searches the national POI database of Amap (Gaode Map) and returns the longitude and latitude in the Mars coordinate system (GCJ-02);
step S503, traversing the coding result candidate list, and, for place names with the same name, keeping the item with the higher understanding score and credibility score and deleting the other items;
step S60, organizing the coding result candidate list into a structure in which the geocoding results are arranged as national - place name, province - place name, city - place name and county - place name pairs, so that the geocoding of the news is realized, the geographic coordinates of the news are determined, and the longitude and latitude data contained in the news are obtained.
2. The news geocoding method based on deep learning of claim 1, wherein: the step S10 specifically includes the steps of,
step S101, obtaining the provinces, cities and counties of all of China, and classifying them as provinces, municipalities directly under the central government, other cities, and counties;
step S102, building the subordinate relations between place names, wherein the parent-child relations between each province and its cities and between each city and its counties are matched one by one and stored in the database;
step S103, establishing a query service which, given a query keyword, reports whether the corresponding place name exists in the database and returns its place name type and its parent-child relations.
3. The news geocoding method based on deep learning of claim 1, wherein: the step S30 specifically includes the steps of,
step S301, searching for province names, and if the text contains a province name that exists in the database, adding the province name to a province name candidate list;
step S302, searching for city names, and if the text contains a city name that exists in the database, adding the city name to a city name candidate list and adding the province name corresponding to the city to the province name candidate list;
step S303, searching for county names, and if the text contains a county name that exists in the database, adding the county name to a county name candidate list, and adding the city name corresponding to the county and the province name corresponding to that city to the city name candidate list and the province name candidate list, respectively.
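The three-level search of steps S301 to S303 can be sketched as below. The sketch assumes plain substring matching and in-memory dictionaries for the parent relations; the function name `find_admin_names` and its parameters are hypothetical, and the claim itself looks the names up in the database built in step S10.

```python
def find_admin_names(text, provinces, city_to_province, county_to_city):
    """Return (province_list, city_list, county_list) found in the text.

    Parent names are back-filled: a matched city adds its province, and a
    matched county adds its city and that city's province.
    """
    province_list, city_list, county_list = [], [], []

    def add(target, name):
        # Keep insertion order while avoiding duplicates.
        if name is not None and name not in target:
            target.append(name)

    for province in provinces:                          # S301
        if province in text:
            add(province_list, province)

    for city, province in city_to_province.items():     # S302
        if city in text:
            add(city_list, city)
            add(province_list, province)

    for county, city in county_to_city.items():         # S303
        if county in text:
            add(county_list, county)
            add(city_list, city)
            # A municipality may have no province entry, hence .get().
            add(province_list, city_to_province.get(city))

    return province_list, city_list, county_list
```

Substring matching is the simplest possible lookup and may over-match short names; it is used here only to keep the sketch short.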
4. The news geocoding method based on deep learning of claim 1, wherein: the step S40 specifically includes the steps of,
step S401, constructing a deep learning model based on ERNIE and Bi-GRU-CRF, wherein the ERNIE model uses the ERNIE Base structure, formed by stacking 12 Encoder layers, the input and output of each Encoder layer having 768 hidden units, each Encoder layer being formed by stacking a self-attention layer, a normalization layer, a fully connected layer and a normalization layer, and each self-attention layer containing 12 attention heads;
the Bi-GRU-CRF model is formed by stacking 2 bidirectional GRU layers, the output of the top bidirectional GRU layer being fed into the CRF layer to obtain the most probable label of each word in the sentence, thereby outputting the named entity recognition result;
the ERNIE model takes a character string text as input and outputs a 768-dimensional text embedding vector matrix, which is input into the bottom bidirectional GRU layer of the Bi-GRU-CRF model, and the maximum-likelihood label of each word is finally output at the CRF layer;
training the ERNIE model on the MSRA-NER data set to obtain a pre-trained model, inputting a text into the model to obtain the corresponding 768-dimensional text embedding vector matrix, and training the Bi-GRU-CRF model with the LAC corpus as the training set, wherein only the parameters of the bidirectional GRU layers and the CRF layer are updated during training and the parameters of the ERNIE model are frozen so that they do not participate in training;
through the deep learning model based on ERNIE and Bi-GRU-CRF, the named entity recognition task is performed to obtain the place names and organization names contained in a text, wherein the place names comprise provinces, cities, counties, roads and landmark names, and the organization names comprise government agencies, educational institutions and leisure and entertainment venues;
step S402, merging the place names and organization names obtained by named entity recognition to obtain a place name candidate list;
step S403, traversing the place name candidate list and deleting the place names that already appear in the province name candidate list, the city name candidate list or the county name candidate list.
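The post-processing in steps S402 and S403 can be sketched independently of the neural model. The sketch assumes the CRF layer emits one BIO-style tag per character (`B-LOC`, `I-LOC`, `B-ORG`, `I-ORG`, `O`); the tag scheme and the function names `extract_entities` and `build_place_candidates` are hypothetical, as the claim does not state a tagging format.

```python
def extract_entities(chars, tags):
    """Collect (text, type) spans from per-character BIO tags."""
    entities, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)
        else:
            # "O" or an inconsistent "I-" tag closes the open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities

def build_place_candidates(entities, province_list, city_list, county_list):
    """S402/S403 sketch: merge LOC and ORG names, drop known admin names."""
    admin = set(province_list) | set(city_list) | set(county_list)
    merged = []
    for text, etype in entities:
        if etype in ("LOC", "ORG") and text not in admin and text not in merged:
            merged.append(text)
    return merged
```

Names already captured by the administrative lists are removed so that each place is geocoded only once, at its most specific known level.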
5. The news geocoding method based on deep learning of claim 1, wherein: the understanding score threshold is set to 70, the confidence score threshold is set to 20, i.e., P1 is set to 70 and P2 is set to 20.
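The thresholds of claim 5 can be expressed as a simple filter. This sketch assumes that both thresholds act as lower bounds that a geocoding result must reach to be retained; that reading, the inclusive comparison, the field names, and the function name `passes_thresholds` are assumptions made here for illustration.

```python
P1 = 70  # understanding score threshold, from claim 5
P2 = 20  # confidence score threshold, from claim 5

def passes_thresholds(result, p1=P1, p2=P2):
    """Return True when both scores reach their thresholds.

    Assumption: a result is retained only if understanding >= p1 and
    confidence >= p2. The exact comparison is not given in this section.
    """
    return result["understanding"] >= p1 and result["confidence"] >= p2
```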

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110747499.2A CN113626536B (en) 2021-07-02 2021-07-02 News geocoding method based on deep learning

Publications (2)

Publication Number Publication Date
CN113626536A (en) 2021-11-09
CN113626536B (en) 2023-08-15

Family

ID=78378968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110747499.2A Active CN113626536B (en) 2021-07-02 2021-07-02 News geocoding method based on deep learning

Country Status (1)

Country Link
CN (1) CN113626536B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580412B (en) * 2021-12-29 2024-06-04 西安工程大学 Clothing entity identification method based on field adaptation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009134464A (en) * 2007-11-29 2009-06-18 Nippon Telegr & Teleph Corp <Ntt> Generation device, generation method and generation program of retrieval result snippet considering range meant by place name, and recording medium recording the generation program
CN109033358A (en) * 2018-07-26 2018-12-18 李辰洋 News Aggreagation and the associated method of intelligent entity
CN110472066A (en) * 2019-08-07 2019-11-19 北京大学 A kind of construction method of urban geography semantic knowledge map
WO2020215793A1 (en) * 2019-04-23 2020-10-29 深圳先进技术研究院 Urban aggregation event prediction and positioning method and device
CN112307364A (en) * 2020-11-25 2021-02-02 哈尔滨工业大学 Character representation-oriented news text place extraction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250562A1 (en) * 2009-03-24 2010-09-30 Mireo d.o.o. Recognition of addresses from the body of arbitrary text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant