CN105468791B - Integrity representation method for geographical location entities based on the interactive question-answering community Baidu Knows - Google Patents
- Publication number
- CN105468791B (application CN201610001346.2A)
- Authority
- CN
- China
- Prior art keywords
- area
- defectloc
- answer
- entity
- question
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to an integrity representation method for geographical location entities based on the interactive question-answering community Baidu Knows, comprising the following steps: step 1): extracting defective geographical location entities, denoted defectLoc, through data processing; step 2): generating a question for each extracted defectLoc, "Which region does this defectLoc belong to?", and retrieving it through Baidu Knows; step 3): extracting features from the retrieval results, calculating the score of the defectLoc for each region, and constructing the region feature vector of the defectLoc; step 4): completing the defectLoc by means of rules. The invention is based on microblog city complaint texts, in which geographical location entities are expressed in a non-standard and unstructured way that makes statistical analysis difficult for staff. The proposed method completes defective geographical location entities with high accuracy and meets the needs of practical application well.
Description
Technical Field
The invention belongs to the technical field of integrity representation of geographical location entities in microblog city complaint texts, and particularly relates to an integrity representation method for geographical location entities based on the interactive question-answering community Baidu Knows.
Background
In recent years, with the rise of microblog-based public consultation, more and more government departments have opened official microblogs to interact with citizens. A large number of city complaint microblogs are received every day, and the geographical location entities in them sometimes lack region information. A complete geographical location entity should include both the region and the place name, such as "zhang xiang fuwei city" in Chaoyang District. Geographical location entities in microblog city complaint texts show two phenomena: first, the region of the place name is missing, as in "Zhongguancun"; second, the region of the place name is ambiguous, as in "Chang'an Avenue". Missing or ambiguous regions make statistical analysis very difficult: staff cannot easily count the incidents occurring in each region, so incidents cannot be prevented in time. Geographical location entities in either situation are collectively called defective geographical location entities and are denoted defectLoc. Moreover, place names and region information change over time, which makes it harder to determine the region a place name belongs to. For example, "Chongwenmen Xinjing Jiayuan" originally belonged to Chongwen District and now belongs to Dongcheng District, so detecting such changes in time becomes increasingly important.
Integrity representation of a geographical location entity means adding the missing region information, for example normalizing "Zhongguancun" to "Zhongguancun, Haidian District", or "Chang'an Avenue" to "Chang'an Avenue, Dongcheng District" or "Chang'an Avenue, Xicheng District". This makes statistical analysis easier for city managers, helps them find and prevent the problems of each region, provides early warning, and gives decision support for subsequent work.
At present, domestic research focuses on the identification of place names and geographical location entities, and the integrity of geographical location entities is rarely studied. For the problem of missing region information, related research has mostly relied on building geographical ontologies and geographical knowledge bases. However, building them requires the participation of domain experts, and their consistency and completeness must be maintained. Maintaining a large geographical ontology or knowledge base consumes a great deal of manpower, and the data cannot be updated in time; in particular, when a membership relation changes, many nodes usually have to be modified, so real-time updating is hard to achieve.
Disclosure of Invention
In view of the above problems in the prior art, the present invention provides an integrity representation method for geographical location entities based on the interactive question-answering community Baidu Knows, which avoids the above technical defects.
To achieve this purpose, the invention adopts the following technical scheme:
an integrity representation method for geographical location entities based on the interactive question-answering community Baidu Knows, comprising the following steps:
step 1): extracting defective geographical location entities through data processing, where a defective geographical location entity is a geographical location entity whose region is missing or ambiguous, denoted defectLoc;
step 2): generating a question for each defectLoc extracted in step 1), "Which region does this defectLoc belong to?", and retrieving it through Baidu Knows;
step 3): extracting features from the retrieval results of step 2), calculating the score of the defectLoc for each region, and constructing the region feature vector of the defectLoc;
step 4): completing the defectLoc by means of rules, thereby achieving the integrity representation of the geographical location entity.
Further, step 1) is specifically as follows:
step A: analyzing the identified geographical location entity and judging whether it contains region information; if it does, exiting; if not, going to step B;
step B: locating the original microblog, segmenting it into words with NLPIR, and extracting all @ content to form an @ array; judging whether the array contains unique region information; if it does, completing the defectLoc with that region and filtering it out; if not, going to step C;
step C: extracting the defectLoc still to be processed to form the defectLoc set.
Further, the features extracted in step 3) are specifically:
Feature 1: content features;
Feature 2: Baidu Knows features;
Feature 3: search feedback features.
Further, Feature 1 is specifically as follows:
(1) whether region information appears in the returned question-answer pair;
the region score ScoreA is given by formula (1):
$\mathrm{ScoreA}(QA_j, area_i) = (1-\lambda)\times(area_i/10) + \lambda\times(area_i\,\%\,10)$ (1)
where i denotes the i-th region, j denotes the j-th question-answer pair returned by Baidu Knows, and $\lambda$ is the weight of region information appearing in the answer; $area_i$ is calculated as shown in formulas (2) and (3);
(2) the question similarity set;
the question similarity set is denoted $Simq = \{simq_1, simq_2, \ldots, simq_{10}\}$, where $simq_1$ to $simq_{10}$ are the similarities between the posed question tq and each question in the QA set returned by Baidu Knows, calculated as shown in formula (4):
$\cos(A, B) = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$ (4)
where A and B are two n-dimensional vectors, $A = [A_1, A_2, \ldots, A_n]$ and $B = [B_1, B_2, \ldots, B_n]$, $A_i$ and $B_i$ are the frequencies with which the same character occurs in A and B respectively, and n is the number of distinct characters occurring in A and B.
Further, Feature 2 is specifically as follows:
(1) whether the answer is a recommended answer;
where $\varphi$ denotes the weight of a recommended answer;
(2) the number of likes;
$\mathrm{ScoreI}(QA_i, Agree) = \theta \times \mathrm{count}(QA_i, Agree)$ (6)
where $\theta$ is the weight of each like and $\mathrm{count}(QA_i, Agree)$ is the number of likes in the i-th QA pair;
(3) the answer time;
the answer time is bounded and measured in years, and is calculated as shown in formulas (7) and (8):
$time_i = Now - AnsTime_i$ (7)
where i denotes the i-th QA pair, Now is the current time, and AnsTime is the time at which the question was answered.
Further, Feature 3 is specifically as follows:
the first 3 results returned by the query are given equal weight, and the weight of later results decreases gradually as the rank increases; the specific assignment is shown in formula (9), where i denotes the i-th QA pair.
Further, step 3) is specifically as follows:
the score of the defective geographical location entity defectLoc belonging to region i, $\mathrm{Score}(area_i \mid defectLoc)$, is calculated as shown in formula (10);
where $\mathrm{RowScore}(QA_j, area_i)$ is the score of region i in the j-th QA pair, calculated as shown in formula (11):
$\mathrm{RowScore}(QA_j, area_i) = \mathrm{ScoreA}(QA_j, area_i) \times simq_j \times (1+\mathrm{Rec}(j)) \times (1+\mathrm{ScoreI}(QA_j, Agree)) \times (1+\mathrm{ScoreT}(time_j)) \times (1+\mathrm{Pos}(j))$ (11)
finally, the score feature vector of the defectLoc is constructed from its Score values for all regions:
$\{\mathrm{Score}(area_1 \mid defectLoc), \mathrm{Score}(area_2 \mid defectLoc), \ldots, \mathrm{Score}(area_{16} \mid defectLoc)\}$.
Further, the rules in step 4) are specifically:
rule 1: for a clear geographical location entity there are two cases: first, if the retrieval results contain only one region, that region is the region of the defectLoc; second, if $\max_i P(area_i \mid defectLoc) \ge \gamma$, the maximizing $area_i$ is the region of the defectLoc;
where a clear geographical location entity is a defectLoc for which only one region appears in the retrieval results or $\max_i P(area_i \mid defectLoc) \ge \gamma$, denoted clearLoc; the probability is calculated as shown in formula (12):
$P(area_i \mid defectLoc) = \dfrac{\mathrm{Score}(area_i \mid defectLoc)}{\sum_{k=1}^{16} \mathrm{Score}(area_k \mid defectLoc)}$ (12)
rule 2: for an ambiguous geographical location entity, the defectLoc is disambiguated using countLoc; countLoc counts the occurrences of each region, and a region that appears several times within one QA pair is counted once; the $area_i$ with the maximum countLoc is taken as the region of the defectLoc; if 2 or more regions share the maximum countLoc, the first of them is taken;
where an ambiguous geographical location entity is a defectLoc for which multiple regions appear in the retrieval results and $\max_i P(area_i \mid defectLoc) < \gamma$, denoted ambiguityLoc;
rule 3: for a zero geographical location entity, region completion cannot be carried out;
where a zero geographical location entity is a defectLoc for which no region information appears in the retrieval results, denoted zeroLoc.
The integrity representation method provided by the invention is based on microblog city complaint texts, in which geographical location entities are expressed in a non-standard and unstructured way that makes statistical analysis difficult for staff. The method completes defective geographical location entities with high accuracy and meets the needs of practical application well.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows the proportion of each defectLoc category.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, an integrity representation method for geographical location entities based on the interactive question-answering community Baidu Knows includes the following steps:
Step 1): extracting defective geographical location entities through data processing, where a defective geographical location entity is a geographical location entity whose region is missing or ambiguous, denoted defectLoc;
The invention first identifies geographical location entities using the automatic identification method proposed by Li multiplied by w, and then extracts the defectLoc from the identified entities. When users post a complaint microblog, they may @ the relevant region's account in addition to "@ Beijing 12345". Exploiting this habit, the method extracts the @ content of every complaint microblog; when the @ content contains unique region information, such as "@ Chaoyang District government hotline", that region is used to complete the defectLoc, so that part of the defectLoc is filtered out at this stage. The algorithm for extracting the defectLoc to be processed is as follows:
step A: analyzing the identified geographical location entity and judging whether it contains region information; if it does, exiting; if not, going to step B;
step B: locating the original microblog, segmenting it into words with NLPIR, and extracting all @ content to form an @ array; judging whether the array contains unique region information; if it does, completing the defectLoc with that region and filtering it out; if not, going to step C;
step C: extracting the defectLoc still to be processed to form the defectLoc set.
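Steps A to C can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the region list `REGIONS` and the function names are hypothetical, region names are written in pinyin for readability, and a simple regular expression stands in for the NLPIR word segmentation that the text uses to find @ content.

```python
import re

# Hypothetical region names (pinyin); a real system would list all 16 regions.
REGIONS = ["Haidian", "Chaoyang", "Dongcheng", "Xicheng"]


def regions_in(text):
    """Return the set of known region names that occur in text."""
    return {region for region in REGIONS if region in text}


def triage(entity, microblog):
    """Classify one recognized entity following steps A to C.

    Returns ("complete", entity), ("completed", "<region> <entity>")
    or ("defect", entity).
    """
    # Step A: the entity already carries region information.
    if regions_in(entity):
        return ("complete", entity)
    # Step B: gather every @-mention and look for a unique region.
    mentions = re.findall(r"@(\S+)", microblog)
    found = set()
    for mention in mentions:
        found |= regions_in(mention)
    if len(found) == 1:
        return ("completed", f"{found.pop()} {entity}")
    # Step C: no unique region, so the entity joins the defectLoc set.
    return ("defect", entity)
```

Only entities tagged `"defect"` would go on to the Baidu Knows retrieval in step 2); an entity whose microblog mentions two different regions is deliberately left in the defectLoc set, since the region information is then not unique.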
Step 2): generating a question for each defectLoc extracted in step 1), "Which region does this defectLoc belong to?", and retrieving it through Baidu Knows;
Baidu Knows is one of the most popular Chinese interactive question-answering communities; from 2005 to 2015 the cumulative number of questions it resolved exceeded 377 million. In the two years after its launch, 17596864 questions were posted and 17012767 were resolved, a resolution rate as high as 96.7%. Baidu Knows is also a knowledge community with very high participation and interaction: more than 1000 users visit it every day, and on average 71308 questions and 223907 answers are produced per day, so each question attracts 3.14 participating users on average. Because Baidu Knows has such a large user base and volume of question-answer data, it is well suited to the region completion of defectLoc.
The method mainly uses the open interactive question-answering community Baidu Knows. For each defectLoc extracted in step 1) it generates the question "Which region does this defectLoc belong to?", for example "Which region does Zhongguancun belong to?", and looks up the region through the retrieval function of "zhidao. For example, for "Zhongguancun", the question "Which region does Zhongguancun belong to?" is submitted to Baidu Knows as a search string; a set of 10 QA pairs with similar questions is returned, and the returned results are represented in structured form. Table 1 shows the first 6 QA pairs.
Table 1: Structured data representation for "Zhongguancun"
Step 3): extracting features from the retrieval results of step 2), calculating the score of the defectLoc for each region, and constructing the region feature vector of the defectLoc;
Step 4): completing the defectLoc by means of rules, thereby achieving the integrity representation of the geographical location entity.
The features extracted in step 3) are specifically as follows:
Feature 1: content features;
This feature describes the question-answer content returned by Baidu Knows. It first confirms whether region information appears in the question-answer content. In addition, if a returned question is highly similar to the posed question, the region information appearing in that question and its answer is considered more important.
Feature 2: Baidu Knows features;
Baidu Knows features are properties of the Baidu Knows platform itself that reflect the credibility of the returned QA pairs.
Feature 3: search feedback features.
The weight is calculated from the rank order of the returned results, using the pseudo-relevance feedback technique of search engines. The higher a result ranks among the Baidu Knows results, the more relevant it is to the region of the defectLoc.
(1) Whether region information appears in the returned question-answer pair;
A set $QA = \{QA_1, QA_2, \ldots, QA_{10}\}$ is constructed from the returned question-answer pairs, and the target region set is $Area = \{area_1, area_2, \ldots, area_{16}\}$, where $QA_i$ is the i-th question-answer pair returned by Baidu Knows and each QA pair corresponds to one Area set. Since this is a yes/no judgment, the invention uses a 1 in the tens digit and a 1 in the ones digit to indicate whether $area_i$ appears in the question and in the answer respectively, as shown in formulas (1) and (2).
Each QA pair thus builds a set covering all 16 regions. A region appearing in the question of a QA pair is not as important as one appearing in the answer: the answer is the reply to the question about the missing region, so region information in the answer matters more. The region score ScoreA is given by formula (3):
$\mathrm{ScoreA}(QA_j, area_i) = (1-\lambda)\times(area_i/10) + \lambda\times(area_i\,\%\,10)$ (3)
where i denotes the i-th region, j denotes the j-th question-answer pair returned by Baidu Knows, and $\lambda$ is the weight of region information appearing in the answer;
(2) the question similarity set;
This feature measures the similarity between the posed question tq and every question in the QA set. It is denoted $Simq = \{simq_1, simq_2, \ldots, simq_{10}\}$, where $simq_1$ to $simq_{10}$ are the similarities between tq and each question in the QA set. Cosine similarity over frequency vectors always lies in [0, 1], and the two questions being compared are short, usually about 10 characters, so the invention uses single characters as vector dimensions and cosine similarity as the question similarity measure. Suppose A and B are two n-dimensional vectors, $A = [A_1, A_2, \ldots, A_n]$ and $B = [B_1, B_2, \ldots, B_n]$, where $A_i$ and $B_i$ are the frequencies with which the same character occurs in A and B respectively, and n is the number of distinct characters occurring in A and B. The cosine similarity of A and B is then:
$\cos(A, B) = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$ (4)
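The two content features can be illustrated with a short sketch. It assumes that the `/` in the ScoreA formula is integer division, so that it extracts the tens digit of the two-digit region flag; that reading follows from the digit encoding described above but is not stated explicitly. The function names are hypothetical, and the default `lam=0.7` is the value of λ reported in the experiments.

```python
import math
from collections import Counter


def area_code(area, question, answer):
    """Two-digit flag: tens digit is 1 if area occurs in the question,
    ones digit is 1 if it occurs in the answer."""
    return 10 * int(area in question) + int(area in answer)


def score_a(code, lam=0.7):
    """ScoreA = (1 - lam) * (code / 10) + lam * (code % 10),
    reading '/' as integer division (an assumption)."""
    return (1 - lam) * (code // 10) + lam * (code % 10)


def char_cosine(a, b):
    """Cosine similarity of two strings over character frequencies."""
    freq_a, freq_b = Counter(a), Counter(b)
    dot = sum(freq_a[ch] * freq_b[ch] for ch in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Under this reading a region named only in the answer scores λ, one named only in the question scores 1 - λ, and one named in both scores 1. `char_cosine` counts every character, including spaces and punctuation; for Chinese questions, which contain no spaces, this matches the character-level vectors described in the text.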
Feature 2 is specifically as follows:
(1) whether the answer is a recommended answer;
A recommended answer is an answer of better quality recommended by experienced Baidu Knows users on the platform. A recommended answer therefore generally has higher credibility and is more important than other answers; its weight is denoted $\varphi$.
(2) The number of likes;
In Baidu Knows, other users can endorse the accuracy of an answer by giving it a thumbs-up ("like"); the more likes an answer receives, the higher its quality. The invention calculates the like score as follows:
$\mathrm{ScoreI}(QA_i, Agree) = \theta \times \mathrm{count}(QA_i, Agree)$ (6)
where $\theta$ is the weight of each like and $\mathrm{count}(QA_i, Agree)$ is the number of likes in the i-th QA pair.
(3) The answer time;
The answer time is the time at which the user answering the question in the QA pair posted the answer. Because the region a geographical location belongs to changes over time, an answer closer to the current time is generally more accurate. The invention therefore bounds the answer time, as expressed by formulas (7) and (8):
$time_i = Now - AnsTime_i$ (7)
where i denotes the i-th QA pair, Now is the current time, and AnsTime is the time at which the question was answered.
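The Baidu Knows features can be sketched as below. Only formulas (6) and (7) are taken from the text. The form of `rec` (φ for a recommended answer, 0 otherwise) is an assumption chosen so that a non-recommended answer leaves the total score unchanged; the value of φ is not given in the text, so it is a required argument. Measuring the age as a difference of calendar years is likewise an assumption about what "in years" means.

```python
from datetime import date


def rec(is_recommended, phi):
    """Assumed form of Rec: phi for a recommended answer, 0 otherwise."""
    return phi if is_recommended else 0.0


def score_i(agree_count, theta=0.1):
    """Formula (6): theta times the number of likes of the QA pair."""
    return theta * agree_count


def answer_age_years(now, answered):
    """Formula (7), measured in whole calendar years (an assumption)."""
    return now.year - answered.year


# Example with hypothetical values: an answer from 2013 seen in 2016.
age = answer_age_years(date(2016, 1, 4), date(2013, 6, 1))
```

The mapping from `age` to ScoreT in formula (8) is not reproduced here, so it is left out rather than guessed.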
Feature 3 is specifically as follows:
the first 3 results returned by the query are given equal weight, and the weight of later results decreases gradually as the rank increases; the specific assignment is shown in formula (9), where i denotes the i-th QA pair.
A scoring model for the region of the defectLoc is built for each QA pair from six inputs: the presence of region information, the question similarity, whether the answer is recommended, the number of likes, the answer time, and the rank in the returned results. The presence of region information and the question similarity form the base score. Each remaining feature then modifies this score through a factor of one plus the feature value, according to its importance: if the feature value is 0 the total score is unchanged; otherwise, the larger the feature value, the more the total score increases. Step 3) is specifically as follows:
the score of the defective geographical location entity defectLoc belonging to region i, $\mathrm{Score}(area_i \mid defectLoc)$, is calculated as shown in formula (10);
where $\mathrm{RowScore}(QA_j, area_i)$ is the score of region i in the j-th QA pair, calculated as shown in formula (11):
$\mathrm{RowScore}(QA_j, area_i) = \mathrm{ScoreA}(QA_j, area_i) \times simq_j \times (1+\mathrm{Rec}(j)) \times (1+\mathrm{ScoreI}(QA_j, Agree)) \times (1+\mathrm{ScoreT}(time_j)) \times (1+\mathrm{Pos}(j))$ (11)
finally, the score feature vector of the defectLoc is constructed from its Score values for all regions:
$\{\mathrm{Score}(area_1 \mid defectLoc), \mathrm{Score}(area_2 \mid defectLoc), \ldots, \mathrm{Score}(area_{16} \mid defectLoc)\}$.
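The scoring model can be sketched as follows. `row_score` follows formula (11) directly. The aggregation in `score_vector` assumes that the score of a region is the sum of its RowScore values over all returned QA pairs; formula (10) is not reproduced in the text, so this is an assumption. The dictionary layout of each QA pair is hypothetical, and the feature values (ScoreT and Pos in particular) are taken as precomputed inputs.

```python
def row_score(score_a, simq, rec, score_i, score_t, pos):
    """Formula (11): base score times one-plus-feature factors."""
    return (score_a * simq * (1 + rec) * (1 + score_i)
            * (1 + score_t) * (1 + pos))


def score_vector(qa_pairs, areas):
    """Build {area: Score(area | defectLoc)} from the returned QA pairs.

    Each QA pair is a dict with 'score_a' (a dict mapping area to ScoreA)
    and the scalar features 'simq', 'rec', 'score_i', 'score_t', 'pos'.
    Summation over QA pairs is an assumed reading of formula (10).
    """
    vector = {area: 0.0 for area in areas}
    for qa in qa_pairs:
        for area in areas:
            vector[area] += row_score(
                qa["score_a"].get(area, 0.0), qa["simq"], qa["rec"],
                qa["score_i"], qa["score_t"], qa["pos"])
    return vector
```

Because ScoreA and simq are the only plain multiplicative terms, a QA pair that does not mention a region contributes nothing to that region, while the remaining features can only raise an existing contribution.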
Table 2: Score feature vectors of the defectLoc over all regions
Observation and analysis of the data show that the values of $\mathrm{Score}(area_i \mid defectLoc)$ and the number of regions appearing in the retrieval results are decisive for completing a defective geographical location entity. The method therefore uses rules to complete the region of each category of defective geographical location entity. The rules in step 4) are specifically as follows:
rule 1: for a clear geographical location entity there are two cases: first, if the retrieval results contain only one region, that region is the region of the defectLoc; second, if $\max_i P(area_i \mid defectLoc) \ge \gamma$, the maximizing $area_i$ is the region of the defectLoc; for entry 6 in Table 2, for instance, several regions have scores, yet the region it belongs to can still be determined;
where a clear geographical location entity is a defectLoc for which only one region appears in the retrieval results or $\max_i P(area_i \mid defectLoc) \ge \gamma$, denoted clearLoc; the probability is calculated as shown in formula (12):
$P(area_i \mid defectLoc) = \dfrac{\mathrm{Score}(area_i \mid defectLoc)}{\sum_{k=1}^{16} \mathrm{Score}(area_k \mid defectLoc)}$ (12)
rule 2: for an ambiguous geographical location entity, the defectLoc is disambiguated using countLoc; countLoc counts the occurrences of each region, and a region that appears several times within one QA pair is counted once; the $area_i$ with the maximum countLoc is taken as the region of the defectLoc; if 2 or more regions share the maximum countLoc, the first of them is taken; for entry 2 in Table 3, the countLoc of Haidian is the maximum, 7, and the final completed result is "Wuluju, Haidian District";
where an ambiguous geographical location entity is a defectLoc for which multiple regions appear in the retrieval results and $\max_i P(area_i \mid defectLoc) < \gamma$, denoted ambiguityLoc;
Table 3: countLoc of the defectLoc over all regions
Rule 3: for a zero geographical location entity, region completion cannot be carried out, because such an entity does not necessarily belong to the Beijing area; entry 3 in Table 2 is an example.
A zero geographical location entity is a defectLoc for which no region information appears in the retrieval results, denoted zeroLoc.
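The three rules can be sketched as below. The sketch treats "no region information in the retrieval results" as "every region score is zero", which is an assumption, and it takes countLoc as a precomputed dictionary. The function names are hypothetical; the default `gamma=0.5` is the value reported in the experiments.

```python
def classify(scores, gamma=0.5):
    """Return 'zeroLoc', 'clearLoc' or 'ambiguityLoc' for a score vector."""
    positive = {area: s for area, s in scores.items() if s > 0}
    if not positive:
        return "zeroLoc"          # rule 3: no region appeared at all
    top_share = max(positive.values()) / sum(positive.values())
    if len(positive) == 1 or top_share >= gamma:
        return "clearLoc"         # rule 1
    return "ambiguityLoc"         # rule 2 applies


def complete(entity, scores, count_loc, gamma=0.5):
    """Return '<region> <entity>', or None when completion is impossible."""
    kind = classify(scores, gamma)
    if kind == "zeroLoc":
        return None
    if kind == "clearLoc":
        region = max(scores, key=scores.get)
    else:
        # max() keeps the first region among ties, as rule 2 requires.
        region = max(count_loc, key=count_loc.get)
    return f"{region} {entity}"
```

For example, with hypothetical scores in which no region reaches half of the total, the entity is disambiguated by countLoc, so a countLoc of 7 for Haidian yields "Haidian Wuluju".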
Each defective geographical location entity is classified according to its scores over all regions and then completed using the above rules, so that it is finally normalized into a complete geographical location entity, as shown in Table 4.
Table 4: Integrity representation of the defective geographical location entities in Table 2
The corpus comes from Sina Weibo. With "@ Beijing 12345" as the keyword, a search is run through the Sina Weibo search page "s.weibo.com", and a focused crawler program was written to collect the relevant microblogs automatically. Because the locations in the complaint microblogs are concentrated in the Beijing area, the regions of the geographical location entities comprise 14 districts and 2 counties: Dongcheng, Xicheng, Chaoyang, Fengtai, Shijingshan, Haidian, Mentougou, Fangshan, Tongzhou, Shunyi, Changping, Daxing, Huairou and Pinggu Districts, and Miyun and Yanqing Counties.
1480 Sina Weibo city complaint microblogs are used as the experimental corpus; 1482 geographical location entities were extracted using the method of Li multiplied by w and proofread by professionals. Of these, 840 place names contain clear region information and can support subsequent statistics, while 642 are defective geographical location entities, accounting for 43.32% of the whole corpus. In the preceding data processing, which exploits the habit of @-mentioning region-related accounts, 218 of the 642 defective geographical location entities could be completed, leaving 424 that could not. Of these 424, 90 are repeated occurrences, such as "Guomao" and "Shuangjing"; after the duplicates are removed, 334 defective geographical location entities remain to be completed.
These data show that studying the integrity of geographical location entities is necessary; the invention mainly studies the integrity of the 334 defective geographical location entities. Repeated experiments show that region information in the answer generally contributes more to determining the region than region information in the question, that recommended answers explain the question more authoritatively, and that the number of likes among the Baidu Knows features contributes less. For a defective geographical location entity, if one region has a score greater than or equal to half of the sum of all region scores, the entity can be judged to be a clear geographical location entity. The invention therefore takes λ = 0.7, θ = 0.1, γ = 0.5.
The invention evaluates the experimental results by accuracy, that is, the proportion of correctly completed defective geographical location entities among all defective geographical location entities, calculated as:
$\mathrm{Accuracy} = \dfrac{right}{total} \times 100\%$
where right is the number of defective geographical location entities completed correctly and total is the number of all defective geographical location entities to be completed.
After the data processing stage, 334 defective geographic location entities need to be completed. The experiment proceeds in the following 3 steps:
1) Question retrieval and structuring of the feedback results. A question is retrieved for each of the 334 defectLoc, the retrieval results are structured according to Table 1, and 334 feedback data tables are finally formed.
2) Feature extraction: computing the score of each area and constructing the score feature vector of the defectLoc. Using the feature value calculation methods and the area scoring model of the invention, each area score of each defectLoc is computed from its feedback data table, and the score feature vector is constructed.
3) Classifying all defectLoc according to their score feature vectors and completing them by rules. The 334 defectLoc fall into three categories: 290 unambiguous geographic location entities (clearLoc), 35 ambiguous geographic location entities (ambiguityLoc), and 9 zero geographic location entities (zeroLoc). As shown in Fig. 2, clearLoc accounts for 87% of all defectLoc, indicating that most defectLoc in city-complaint microblogs are clearLoc. Although zeroLoc, which cannot be completed, accounts for only 3%, other methods still need to be found to complete them.
From the experimental results in Table 5, the accuracy of the method in completing clearLoc reaches 96.21%, and the accuracy for ambiguityLoc reaches 85.71%. clearLoc is completed by rule 1: because the Baidu Knows retrieval yields either a unique area or Max(P(area_i | defectLoc)) ≥ γ, ambiguous area information essentially does not appear, so the accuracy is highest. The completion accuracy for ambiguityLoc is somewhat lower than for clearLoc, mainly because several candidate areas exist with close scores, so the disambiguation among them sometimes errs. The method can complete most defectLoc, with coverage reaching 97.31%. For the few zeroLoc that do not return a retrieval result, the method remains ineffective. In summary, the method is suitable for the completeness representation of defectLoc.
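The category shares and the coverage figure can be reproduced from the category counts reported in the experiment (290 clearLoc, 35 ambiguityLoc, 9 zeroLoc). The sketch below reads coverage as the share of defectLoc that are not zeroLoc, which is an interpretation rather than a definition given in the text.

```python
# Category counts reported for the 334 defectLoc.
counts = {"clearLoc": 290, "ambiguityLoc": 35, "zeroLoc": 9}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Assumption: coverage = entities that rules 1 and 2 can handle, over all entities.
coverage = (counts["clearLoc"] + counts["ambiguityLoc"]) / total

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"coverage: {coverage:.2%}")
```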
Table 5: Distribution and accuracy of each category of defective geographic location entities
The completeness representation method for geographic location entities based on the interactive question-answer community Baidu Knows, provided by the invention, is built on microblog city-complaint text and addresses the non-standard, unstructured way in which geographic location entities are expressed there, which makes statistical analysis difficult for staff.
The above embodiments only express implementations of the invention. Although their description is relatively specific and detailed, it should not be construed as limiting the scope of the patent. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (1)
1. A completeness representation method for geographic location entities based on the interactive question-answer community Baidu Knows, characterized by comprising the following steps:
step 1): extracting defective geographic location entities through data processing, wherein a defective geographic location entity is a geographic location entity whose area is missing or ambiguous, denoted defectLoc;
step 2): for each defectLoc extracted in step 1), generating the question "which area does [defectLoc] belong to" and retrieving it through Baidu Knows;
step 3): extracting features from the retrieval results of step 2), computing the score of the defectLoc for each area, and constructing the area feature vector of the defectLoc;
step 3) is specifically as follows:
the score Score(area_i | defectLoc) of the defective geographic location entity defectLoc belonging to area i is computed by formula (10):
where RowScore(QA_j, area_i) is the score that the j-th QA assigns to area i, computed by formula (11):
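$$\mathrm{RowScore}(QA_j, area_i) = \mathrm{ScoreA}(QA_j, area_i) \times simq_j \times (1 + \mathrm{Rec}(j)) \times (1 + \mathrm{ScoreI}(QA_j, Agree)) \times (1 + \mathrm{ScoreT}(time_j)) \times (1 + \mathrm{Pos}(j)) \tag{11}$$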
finally, the score feature vector of the defectLoc is constructed from the scores Score of the defectLoc for all areas:
$$\{\mathrm{Score}(area_1 \mid defectLoc),\ \mathrm{Score}(area_2 \mid defectLoc),\ \ldots,\ \mathrm{Score}(area_{16} \mid defectLoc)\}$$
where i is a positive integer and 1 ≤ i ≤ 16;
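The scoring in step 3) can be sketched in Python as follows. `row_score` follows formula (11) term by term. Summing the row scores over the returned QA pairs to obtain the area score is an assumption, because the body of formula (10) is not available here. The `QARow` container and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QARow:
    """Feature values of one returned QA pair with respect to one area (hypothetical layout)."""
    score_a: float  # ScoreA(QA_j, area_i)
    simq: float     # simq_j, question similarity
    rec: float      # Rec(j), recommended-answer term
    score_i: float  # ScoreI(QA_j, Agree), likes term
    score_t: float  # ScoreT(time_j), answer-time term
    pos: float      # Pos(j), rank-position term


def row_score(r: QARow) -> float:
    """Formula (11): product of ScoreA, simq and the four (1 + x) factors."""
    return (r.score_a * r.simq * (1 + r.rec) * (1 + r.score_i)
            * (1 + r.score_t) * (1 + r.pos))


def area_score(rows: List[QARow]) -> float:
    """ASSUMPTION: the area score is the sum of its row scores."""
    return sum(row_score(r) for r in rows)


def score_vector(rows_by_area: Dict[str, List[QARow]]) -> Dict[str, float]:
    """Score feature vector: one score per area."""
    return {area: area_score(rows) for area, rows in rows_by_area.items()}
```

A row with `score_a=0.7`, `simq=0.5`, `score_i=0.2` and all other terms zero yields 0.7 × 0.5 × 1.2 = 0.42.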
step 4): completing the defectLoc by rules, thereby achieving the completeness representation of the geographic location entity;
step 1) is specifically as follows:
step A: analyzing each identified geographic location entity and judging whether it carries area information; if it does, exiting; otherwise, going to step B;
step B: locating the original microblog, segmenting it with NLPIR, and extracting all @ content to form an @ array; judging whether the array contains unique area information; if it does, completing the defectLoc and filtering it out; otherwise, going to step C;
step C: collecting the defectLoc still to be processed to form the defectLoc set;
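Steps A to C amount to a three-way triage of each entity. The sketch below is illustrative only: the `AREAS` set and the example account names are hypothetical, and a regular expression stands in for the NLPIR segmentation that the claim uses to collect @ content.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical area names; the claim works with 16 areas.
AREAS: Set[str] = {"Chaoyang", "Haidian"}


def mentioned_areas(microblog: str) -> Set[str]:
    """Areas named inside @ mentions (regex stand-in for NLPIR segmentation)."""
    handles = re.findall(r"@(\w+)", microblog)
    return {area for area in AREAS for h in handles if area in h}


def triage(entity: str, microblog: str) -> Tuple[str, Optional[str]]:
    # Step A: the entity already carries area information, nothing to do.
    if any(area in entity for area in AREAS):
        return ("has_area", None)
    # Step B: exactly one area among the @ mentions completes the entity.
    areas = mentioned_areas(microblog)
    if len(areas) == 1:
        return ("completed_by_mention", next(iter(areas)))
    # Step C: otherwise the entity joins the defectLoc set.
    return ("defectLoc", None)
```

For example, `triage("Guomao", "Road blocked again @ChaoyangTraffic")` returns `("completed_by_mention", "Chaoyang")`, while a microblog that mentions accounts from two different areas leaves the entity in the defectLoc set.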
the features extracted in step 3) are specifically:
feature 1: content features;
feature 2: Baidu Knows features;
feature 3: retrieval feedback features;
feature 1 is specifically as follows:
(1) whether the returned question-answer pair contains area information;
the area score ScoreA is given by formula (1):
$$\mathrm{ScoreA}(QA_j, area_i) = (1-\lambda)\times(area_i / 10) + \lambda\times(area_i \,\%\, 10) \tag{1}$$
where i denotes the i-th area, j denotes the j-th question-answer pair returned by Baidu Knows, λ is the weight of area information appearing in the answer, and λ = 0.7; area_i is computed by formulas (2) and (3):
where QA is a question-answer pair returned by Baidu Knows;
(2) the question similarity set;
the question similarity set is denoted Simq = {simq_1, simq_2, …, simq_10}, where simq_1 to simq_10 are the similarities between the posed question tq and each question in the set of QA pairs returned by Baidu Knows, computed by formula (4):
where A and B are two n-dimensional vectors, A = [A_1, A_2, …, A_n], B = [B_1, B_2, …, B_n]; A_i and B_i denote the frequencies with which the same character occurs in A and B respectively; and n is the number of distinct single characters in A and B;
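The two content features can be sketched as follows, under two stated assumptions. First, `area_i` is read as a two-digit code whose tens digit flags the area in the question and whose ones digit flags it in the answer, so that formula (1) weights the two flags by 1 − λ and λ; this reading is inferred from the `/ 10` and `% 10` terms, since formulas (2) and (3) are not available here. Second, the question similarity is taken to be cosine similarity over character-frequency vectors, which matches the description of A, B and n but is not confirmed by a surviving formula (4).

```python
import math
from collections import Counter

LAMBDA = 0.7  # weight of area information appearing in the answer


def score_a(area_code: int, lam: float = LAMBDA) -> float:
    """Formula (1), with '/' read as integer division (assumption)."""
    in_question = area_code // 10  # tens digit
    in_answer = area_code % 10     # ones digit
    return (1 - lam) * in_question + lam * in_answer


def char_cosine(a: str, b: str) -> float:
    """ASSUMPTION: cosine similarity of single-character frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this reading, an area that appears only in the answer (code 1) scores 0.7, one that appears only in the question (code 10) scores 0.3, and one that appears in both (code 11) scores 1.0.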
feature 2 is specifically as follows:
(1) whether the answer is a recommended answer;
wherein,the weight of the recommended answer is represented,
(2) the number of likes;
$$\mathrm{ScoreI}(QA_i, Agree) = \theta \times \mathrm{count}(QA_i, Agree) \tag{6}$$
where θ is the weight of each like, θ = 0.1, and count(QA_i, Agree) is the number of likes in the i-th QA;
(3) the answer time;
answers are time-sensitive, with time measured in years; the calculation is given by formulas (7) and (8):
$$time_i = Now - AnsTime_i \tag{7}$$
where i denotes the i-th QA, Now is the current time, and AnsTime_i is the time at which the question was answered;
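Two of the Baidu Knows features are fully specified above and can be sketched directly: the likes score of formula (6) and the answer age of formula (7). Measuring the age as a difference of calendar years is an assumption about how "the unit is year" is applied. The mapping from age to ScoreT in formula (8) is not available, so it is deliberately left out.

```python
from datetime import date

THETA = 0.1  # weight of each like


def score_i(likes: int, theta: float = THETA) -> float:
    """Formula (6): likes score grows linearly with the number of likes."""
    return theta * likes


def answer_age_years(now: date, answered: date) -> int:
    """Formula (7), in whole calendar years (assumption about the unit)."""
    return now.year - answered.year
```

With θ = 0.1, an answer with 5 likes contributes a factor of 1 + 0.5 in formula (11).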
feature 3 is specifically as follows:
the first 3 results of the returned ranking are treated as having the same weight, and later results are weighted progressively lower as rank increases; the specific assignment is given by formula (9), where i denotes the i-th QA pair;
the rules in step 4) are specifically as follows:
rule 1: for an unambiguous geographic location entity, there are two cases: first, if the retrieval results contain only one area, that area is the area of the defectLoc; second, if Max(P(area_i | defectLoc)) ≥ γ, then area_i is the area of the defectLoc; where γ = 0.5;
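where an unambiguous geographic location entity is a defectLoc whose retrieval results contain exactly one area, or for which Max(P(area_i | defectLoc)) ≥ γ, denoted clearLoc; the probability is computed by formula (12):
$$P(area_i \mid defectLoc) = \frac{\mathrm{Score}(area_i \mid defectLoc)}{\sum_{k=1}^{16} \mathrm{Score}(area_k \mid defectLoc)} \tag{12}$$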
rule 2: for an ambiguous geographic location entity, the defectLoc is disambiguated using countLoc, where countLoc counts the occurrences of each area, multiple occurrences of the same area within one QA being counted once; the area_i attaining Max(countLoc | area_i) is the area of the defectLoc; if 2 or more areas attain Max(countLoc | area_i), the first of them is taken;
where an ambiguous geographic location entity is a defectLoc for which several areas appear in the retrieval results and Max(P(area_i | defectLoc)) < γ, denoted ambiguityLoc;
rule 3: for a zero geographic location entity, no area completion can be performed;
where a zero geographic location entity is a defectLoc for which no area information appears in the retrieval results, denoted zeroLoc.
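The three rules can be read as a small decision procedure over the score feature vector. The sketch below is one possible reading, not the patented implementation: it treats an area as "appearing" when its score is positive, and it takes P(area_i | defectLoc) to be the area's score divided by the sum of all area scores, in line with the description's threshold of "half of the sum of all region scores" for γ = 0.5. The function and argument names are hypothetical.

```python
from typing import Dict, Optional, Tuple

GAMMA = 0.5  # threshold on the largest area probability


def classify(scores: Dict[str, float],
             count_loc: Dict[str, int],
             gamma: float = GAMMA) -> Tuple[str, Optional[str]]:
    """Return (category, completed area) for one defectLoc.

    scores:    area -> Score(area | defectLoc)
    count_loc: area -> number of QA pairs mentioning the area (once per QA)
    """
    present = {a: s for a, s in scores.items() if s > 0}

    # Rule 3: no area information at all, nothing can be completed.
    if not present:
        return ("zeroLoc", None)

    total = sum(present.values())
    best = max(present, key=present.get)

    # Rule 1: a single area, or a dominant area with P >= gamma.
    if len(present) == 1 or present[best] / total >= gamma:
        return ("clearLoc", best)

    # Rule 2: pick the area with the largest countLoc; max() keeps the
    # first area encountered on ties, matching "the first is taken".
    winner = max(present, key=lambda a: count_loc.get(a, 0))
    return ("ambiguityLoc", winner)
```

For instance, scores of 3.0 and 1.0 for two areas give the first a probability of 0.75, so the entity is a clearLoc; three equal scores fall under rule 2 and are resolved by countLoc.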
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610001346.2A CN105468791B (en) | 2016-01-05 | 2016-01-05 | A kind of integrality expression for the geographical location entity known based on interacting Question-Answer community-Baidu |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105468791A | 2016-04-06 |
CN105468791B | 2019-11-15 |
Family
ID=55606491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610001346.2A Active CN105468791B (en) | 2016-01-05 | 2016-01-05 | A kind of integrality expression for the geographical location entity known based on interacting Question-Answer community-Baidu |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105468791B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408743B (en) * | 2018-08-21 | 2020-11-17 | 中国科学院自动化研究所 | Text link embedding method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103473289A (en) * | 2013-08-30 | 2013-12-25 | 深圳市华傲数据技术有限公司 | Device and method for completing communication addresses |
CN103914543A (en) * | 2014-04-03 | 2014-07-09 | 北京百度网讯科技有限公司 | Search result displaying method and device |
CN104537062A (en) * | 2014-12-29 | 2015-04-22 | 北京牡丹电子集团有限责任公司数字电视技术中心 | Address information extracting method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150261858A1 (en) * | 2009-06-29 | 2015-09-17 | Google Inc. | System and method of providing information based on street address |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109255033B (en) | Knowledge graph recommendation method based on location-based service field | |
CN106980692A (en) | A kind of influence power computational methods based on microblogging particular event | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN109753602B (en) | Cross-social network user identity recognition method and system based on machine learning | |
Joho et al. | Overview of NTCIR-11 Temporal Information Access (Temporalia) Task. | |
KR101543780B1 (en) | System and method for expert search by dynamic profile and social network reliability | |
CN105630884B (en) | A kind of geographical location discovery method of microblog hot event | |
CN110457404A (en) | Social media account-classification method based on complex heterogeneous network | |
CN102663129A (en) | Medical field deep question and answer method and medical retrieval system | |
CN104598648B (en) | A kind of microblog users interactive mode gender identification method and device | |
CN105653518A (en) | Specific group discovery and expansion method based on microblog data | |
CN107330020B (en) | User entity analysis method based on structure and attribute similarity | |
CN107577782B (en) | Figure similarity depicting method based on heterogeneous data | |
CN107145545A (en) | Top k zone users text data recommends method in a kind of location-based social networks | |
CN107577665B (en) | Text emotional tendency judging method | |
CN107194560A (en) | The Social search evaluation method clustered in LBSN based on good friend | |
Li et al. | A hybrid model for experts finding in community question answering | |
CN107153687B (en) | Indexing method for social network text data | |
CN109408726B (en) | Question answering person recommendation method in question and answer website | |
CN105843799A (en) | Academic paper label recommendation method based on multi-source heterogeneous information graph model | |
CN110347897A (en) | Micro blog network emotion community detection method based on event detection | |
Liao et al. | Coronavirus pandemic analysis through tripartite graph clustering in online social networks | |
Ju et al. | Relationship strength estimation based on Wechat Friends Circle | |
Grover et al. | Prediction model for influenza epidemic based on Twitter data | |
CN113868387A (en) | Word2vec medical similar problem retrieval method based on improved tf-idf weighting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||