CN105468791B - A Completeness Representation Method for Geographic Location Entities Based on the Interactive Q&A Community Baidu Knows
- Publication number
- CN105468791B (application number CN201610001346.2A)
- Authority
- CN
- China
- Prior art keywords
- area
- defectloc
- geographic location
- question
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for completing geographic location entities using the interactive question-and-answer community Baidu Knows. The method comprises four steps. Step 1): extract defective geographic location entities, denoted defectLoc, through data processing. Step 2): for each extracted defectLoc, generate the question "Which district does this defectLoc belong to?" and search for it on Baidu Knows. Step 3): extract features from the search results, compute the score of the defectLoc for each district, and build the district feature vector of the defectLoc. Step 4): apply rules to complete the defectLoc. The method targets microblog city complaint texts, in which geographic location entities are expressed in non-standard, unstructured ways that make statistical analysis difficult for staff. The invention states that the method completes defective geographic location entities with high accuracy and meets the needs of practical applications.
Description
Technical Field
The invention belongs to the technical field of completeness representation of geographic location entities in microblog city complaint texts, and in particular relates to a completeness representation method for geographic location entities based on the interactive question-and-answer community Baidu Knows.
Background
In recent years, with the rise of microblog-based political participation, more and more government departments have opened official microblog accounts to interact with the public. Because a huge number of complaint microblogs arrive every day, the geographic location entities in them sometimes lack district information. A complete geographic location entity should consist of two parts, a district and a place name, for example "朝阳区豆各庄乡富力又一城" (Fuli Youyicheng, Dougezhuang Township, Chaoyang District). Geographic location entities in microblog city complaint texts show two kinds of defect: (1) the district is missing, as in "中关村" (Zhongguancun); (2) the district is ambiguous, as in "长安街" (Chang'an Avenue). Missing or ambiguous districts make statistical analysis very difficult: staff cannot easily count the incidents in each district and therefore cannot prevent incidents in time. The invention refers to entities with either defect collectively as defective geographic location entities, denoted defectLoc.
Place names and district boundaries also change over time, which makes it harder to determine the district a place belongs to. For example, "崇文门新景家园" (Xinjing Jiayuan, Chongwenmen) originally belonged to Chongwen District but now belongs to Dongcheng District, so detecting such changes promptly is particularly important. Completing a geographic location entity means adding the missing district, for example normalizing "中关村" to "海淀区中关村" (Zhongguancun, Haidian District), or resolving an ambiguous district, for example normalizing "长安街" to "东城区长安街" (Chang'an Avenue, Dongcheng District) or "西城区长安街" (Chang'an Avenue, Xicheng District). This makes statistics and analysis easier for city managers, helps them discover and prevent local problems, supports early warning, and provides decision support for later work.
At present, domestic research focuses on recognizing place names and geographic location entities; the completeness of geographic location entities has received little attention. Existing work on missing district information mostly relies on building geographic ontologies and geographic knowledge bases. Building them requires domain experts, and keeping them consistent and complete takes considerable manpower. The data also cannot be updated promptly: when an affiliation changes, many nodes usually have to be modified, so real-time updates are hard to achieve.
Summary of the Invention
In view of the above problems in the prior art, the object of the invention is to provide a completeness representation method for geographic location entities, based on the interactive question-and-answer community Baidu Knows, that avoids these technical defects.
To achieve this object, the invention adopts the following technical solution.
A completeness representation method for geographic location entities based on the interactive question-and-answer community Baidu Knows, comprising the following steps:
Step 1): Extract defective geographic location entities through data processing. A defective geographic location entity is one whose district is missing or ambiguous, and is denoted defectLoc.
Step 2): For each defectLoc extracted in step 1), generate the question "Which district does this defectLoc belong to?" and search for it on Baidu Knows.
Step 3): Extract features from the results retrieved in step 2), compute the score of the defectLoc for each district, and build the district feature vector of the defectLoc.
Step 4): Apply rules to complete the defectLoc, yielding the completeness representation of the geographic location entity.
Further, step 1) is specifically:
Step A: Analyze each recognized geographic location entity and check whether it contains district information. If it does, exit; otherwise go to step B.
Step B: Locate the original microblog, segment it into words with NLPIR, and extract the content of every @-mention into an @ array. If the array contains unique district information, complete the defectLoc with it and filter it out; otherwise go to step C.
Step C: Collect the defectLoc entities that still need processing to form the defectLoc set.
Further, the features extracted in step 3) are:
Feature 1: content features;
Feature 2: Baidu Knows features;
Feature 3: search feedback features.
Further, feature 1 is specifically:
(1) Whether the returned question-answer pair contains district information.
The district score ScoreA is given by formula (1):
$$\mathrm{ScoreA}(QA_j, area_i) = (1-\lambda)\times(area_i / 10) + \lambda\times(area_i \,\%\, 10) \qquad (1)$$
where i is the i-th district, j is the j-th question-answer pair returned by Baidu Knows, and λ is the weight of district information that appears in the answer. The value area_i is computed by formulas (2) and (3). [Formulas (2) and (3) are not reproduced in this text.]
(2) Question similarity set.
The question similarity set is written Simq = {simq_1, simq_2, …, simq_10}, where simq_1 to simq_10 are the similarities between the posed question tq and each question in the set of QA pairs returned by Baidu Knows. The similarity is computed by formula (4):
$$\cos(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (4)$$
where A and B are two n-dimensional vectors, A = [A_1, A_2, …, A_n] and B = [B_1, B_2, …, B_n], A_i and B_i are the frequencies with which the same character appears in A and B respectively, and n is the number of distinct characters in A and B.
Further, feature 2 is specifically:
(1) Whether the answer is a recommended answer.
[Formula (5) is not reproduced in this text.] Here φ denotes the weight of a recommended answer.
(2) Number of likes.
$$\mathrm{ScoreI}(QA_i, Agree) = \theta \times \mathrm{count}(QA_i, Agree) \qquad (6)$$
where θ is the weight of each like and count(QA_i, Agree) is the number of likes in the i-th QA pair.
(3) Answer time.
The answer time is bounded, in units of years, as shown in formulas (7) and (8):
$$time_i = Now - AnsTime_i \qquad (7)$$
[Formula (8) is not reproduced in this text.]
where i is the i-th QA pair, Now is the current time, and AnsTime_i is the time at which the question was answered.
Further, feature 3 is specifically:
The first three returned results are given the same weight, and the weight of later results decreases gradually as the rank increases. The distribution is given by formula (9), where i is the i-th QA pair. [Formula (9) is not reproduced in this text.]
Further, step 3) is specifically:
The score Score(area_i | defectLoc) with which the defective geographic location entity defectLoc belongs to district i is computed by formula (10):
[Formula (10) is not reproduced in this text.]
where RowScore(QA_j, area_i) is the score with which the j-th QA pair assigns the defectLoc to district i, computed by formula (11):
$$\mathrm{RowScore}(QA_j, area_i) = \mathrm{ScoreA}(QA_j, area_i) \times simq_j \times (1+\mathrm{Rec}(j)) \times (1+\mathrm{ScoreI}(QA_j, Agree)) \times (1+\mathrm{ScoreT}(time_j)) \times (1+\mathrm{Pos}(j)) \qquad (11)$$
From the Score values of the defectLoc for all districts, the score feature vector of the defectLoc is built:
$$\{\mathrm{Score}(area_1 \mid defectLoc),\ \mathrm{Score}(area_2 \mid defectLoc),\ \ldots,\ \mathrm{Score}(area_{16} \mid defectLoc)\}$$
Further, the rules in step 4) are specifically:
Rule 1: For a clear geographic location entity there are two cases. First, if the search results contain only one district, that district is the district of the defectLoc. Second, if Max(P(area_i|defectLoc)) ≥ γ, that area_i is the district of the defectLoc.
A clear geographic location entity, denoted clearLoc, is a defectLoc for which exactly one district appears in the search results, or for which Max(P(area_i|defectLoc)) ≥ γ. The probability is computed by formula (12). [Formula (12) is not reproduced in this text.]
Rule 2: For an ambiguous geographic location entity, countLoc is used to disambiguate the defectLoc. countLoc counts the occurrences of each district; a district that appears several times within one QA pair is counted once. The district area_i that attains Max(countLoc|area_i) is taken as the district of the defectLoc. If two or more districts attain the maximum, the first of them is taken.
An ambiguous geographic location entity, denoted ambiguityLoc, is a defectLoc for which several districts appear in the search results and Max(P(area_i|Location)) < γ.
Rule 3: For a zero geographic location entity, district completion cannot be performed.
A zero geographic location entity, denoted zeroLoc, is a defectLoc for which no district information appears in the search results.
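The three rules can be sketched as a small decision function. This is only an illustration under stated assumptions: formula (12) is missing from this text, so the probability is assumed here to be each district's Score divided by the sum of all Scores; the function name `classify`, the dictionary inputs, and the value of γ are hypothetical and do not come from the patent.

```python
def classify(scores, count_loc, gamma):
    """Return (category, district) for one defectLoc.

    scores:    district -> Score(area_i | defectLoc)
    count_loc: district -> number of QA pairs mentioning the district
               (a district repeated inside one QA pair counts once)
    gamma:     threshold on the maximum probability (value not given in the source)
    """
    present = [d for d, c in count_loc.items() if c > 0]
    # Rule 3: no district appears in the search results.
    if not present:
        return "zeroLoc", None
    # Rule 1, case 1: exactly one district appears.
    if len(present) == 1:
        return "clearLoc", present[0]
    # Rule 1, case 2. ASSUMPTION: P = Score / sum of Scores (formula (12) is missing).
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    if total > 0 and scores[best] / total >= gamma:
        return "clearLoc", best
    # Rule 2: largest countLoc wins; max() keeps the first district on ties.
    return "ambiguityLoc", max(count_loc, key=count_loc.get)
```

For example, with Scores of 3.0 and 2.0 for two districts and γ = 0.8, the largest probability under the assumed normalization is 0.6, so the entity is treated as ambiguous and the district with the larger countLoc is chosen.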
The completeness representation method provided by the invention is based on microblog city complaint texts, in which geographic location entities are expressed in non-standard, unstructured ways that make statistical analysis difficult for staff. By using Baidu Knows, the method completes defective geographic location entities with high accuracy and meets the needs of practical applications.
Brief Description of the Drawings
Figure 1 is a flowchart of the invention;
Figure 2 shows the proportion of each defectLoc category.
Detailed Description
To make the object, technical solution, and advantages of the invention clearer, the invention is further described below with reference to the drawings and specific embodiments. The specific embodiments described here serve only to explain the invention and do not limit it.
As shown in Figure 1, a completeness representation method for geographic location entities based on the interactive question-and-answer community Baidu Knows comprises the following steps:
Step 1): Extract defective geographic location entities through data processing. A defective geographic location entity is one whose district is missing or ambiguous, and is denoted defectLoc.
The invention first recognizes geographic location entities with the automatic recognition method proposed by Li×w, and then extracts the defectLoc entities from the recognized entities. When users post a complaint microblog, besides "@北京12345" (the Beijing 12345 hotline account) they sometimes also @-mention the relevant district. Exploiting this, the invention extracts the content of every @-mention in each complaint microblog. When the @-mentions contain unique district information, for example "@朝阳区政府热线" (Chaoyang District government hotline), that district is used as the district of the defectLoc for its completeness representation, which filters out part of the defectLoc entities. The algorithm for extracting the defectLoc entities that still need processing is as follows:
Step A: Analyze each recognized geographic location entity and check whether it contains district information. If it does, exit; otherwise go to step B.
Step B: Locate the original microblog, segment it into words with NLPIR, and extract the content of every @-mention into an @ array. If the array contains unique district information, complete the defectLoc with it and filter it out; otherwise go to step C.
Step C: Collect the defectLoc entities that still need processing to form the defectLoc set.
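Steps A to C can be sketched as follows. The source segments the microblog with NLPIR; this sketch swaps in a plain regular expression to pull out @-mentions, which is an assumption made only to keep the example self-contained. The district list is an illustrative subset of the 16 target districts, and the names `DISTRICTS`, `find_districts`, and `complete_from_mentions` are hypothetical.

```python
import re

# ASSUMPTION: illustrative subset of the 16 target districts.
DISTRICTS = ["东城", "西城", "朝阳", "海淀", "丰台", "石景山"]

# ASSUMPTION: a regex stands in for NLPIR word segmentation.
MENTION_RE = re.compile(r"@([^\s@,,。::]+)")

def find_districts(text):
    """Return the districts named in text, ordered by first appearance."""
    hits = [(text.find(d), d) for d in DISTRICTS if d in text]
    return [d for _, d in sorted(hits)]

def complete_from_mentions(entity, weibo_text):
    """Return (entity, still_defective) after steps A to C."""
    # Step A: the entity already carries district information.
    if find_districts(entity):
        return entity, False
    # Step B: collect the districts named in the @-mentions.
    found = []
    for mention in MENTION_RE.findall(weibo_text):
        for d in find_districts(mention):
            if d not in found:
                found.append(d)
    if len(found) == 1:
        # Unique district: complete the entity and filter it out.
        return found[0] + "区" + entity, False
    # Step C: keep the entity in the defectLoc set for steps 2) to 4).
    return entity, True
```

Calling `complete_from_mentions("中关村", "@北京12345 @海淀区政府热线 中关村路面积水")` completes the entity to "海淀区中关村", while a microblog that only mentions "@北京12345" leaves the entity in the defectLoc set.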
Step 2): For each defectLoc extracted in step 1), generate the question "Which district does this defectLoc belong to?" and search for it on Baidu Knows.
Baidu Knows is one of the most popular Chinese interactive question-and-answer communities. In the ten years from 2005 to 2015 it resolved more than 377 million questions in total. In the first two years after its launch, 17,596,864 questions were asked and 17,012,767 of them were resolved, a resolution rate of 96.7%. Baidu Knows is also a knowledge community with very high participation and interaction: more than 10 million users visit it every day, producing on average 71,308 questions and 223,907 answers per day, and each question attracts on average 3.14 participating users. Because of its large user base and volume of question-answer data, Baidu Knows is well suited to completing the district of a defectLoc.
The invention uses this open community as follows. For each defectLoc extracted in step 1) it generates a question of the form "defectLoc属于哪个区" ("Which district does defectLoc belong to?"), for example "中关村属于哪个区" ("Which district does Zhongguancun belong to?"). The related-question search of "zhidao.baidu.com" is then used to search for the district of the defectLoc, and the returned results are converted into structured data. For "中关村", the string "中关村属于哪个区" is submitted to Baidu Knows as the query, a set of QA pairs for 10 similar questions is returned, and the results are given a structured representation. Table 1 shows the first 6 QA pairs of that set.
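A minimal sketch of the question template and of one structured QA record is given below. Table 1 is not reproduced in this text, so the record fields are assumptions inferred from the features used later (rank, recommendation flag, likes, answer time); the names `QAPair` and `make_question` are hypothetical, and the sketch deliberately does not show how the search itself is performed.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    """One returned question-answer pair (field names are assumptions)."""
    rank: int          # 1-based position in the returned list
    question: str      # text of the similar question
    answer: str        # text of the answer
    recommended: bool  # whether the answer is a recommended answer
    likes: int         # number of likes on the answer
    answer_year: int   # year in which the answer was posted

def make_question(defect_loc: str) -> str:
    """Build the query string in the form the source describes."""
    return f"{defect_loc}属于哪个区"
```

With this template, `make_question("中关村")` produces the query string "中关村属于哪个区" used in the example above.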
Table 1: Structured data representation for "中关村" (Zhongguancun)
[Table 1 is not reproduced in this text.]
Step 3): Extract features from the results retrieved in step 2), compute the score of the defectLoc for each district, and build the district feature vector of the defectLoc.
Step 4): Apply rules to complete the defectLoc, yielding the completeness representation of the geographic location entity.
The features extracted in step 3) are:
Feature 1: content features.
This feature describes the question-answer content returned by Baidu Knows. First, it must be established whether district information appears in that content. In addition, if a returned question is highly similar to the posed question, the district information in that question and its answer is considered more important.
Feature 2: Baidu Knows features.
Baidu Knows features are attributes of the platform itself that reflect how trustworthy a returned QA pair is.
Feature 3: search feedback features.
Weights are computed from the order of the returned results using search-engine pseudo-feedback. The higher a result ranks in the Baidu Knows results, the more relevant it is to the district of the defectLoc.
Feature 1 consists of the following two parts.
(1) Whether the returned question-answer pair contains district information.
A bag = {QA_1, QA_2, …, QA_10} is built from the returned question-answer pairs, and the set of target districts is Area = {area_1, area_2, …, area_16}, where QA_i is the i-th question-answer pair returned by Baidu Knows and each QA pair has its own Area set. Since this is a yes/no judgment, the invention uses a 1 in the tens digit and a 1 in the ones digit of area_i to indicate that district i appears in the question and in the answer respectively, as shown in formulas (1) and (2). [Formulas (1) and (2) are not reproduced in this text.]
For each QA pair, a set covering all 16 districts is built. A district that appears in the question of a QA pair and a district that appears in its answer differ in importance: the answer is the reply to the missing-district question, so district information in the answer matters more. The district score ScoreA is given by formula (3):
$$\mathrm{ScoreA}(QA_j, area_i) = (1-\lambda)\times(area_i / 10) + \lambda\times(area_i \,\%\, 10) \qquad (3)$$
where i is the i-th district, j is the j-th question-answer pair returned by Baidu Knows, and λ is the weight of district information that appears in the answer. Here area_i / 10 and area_i % 10 are the tens digit and the ones digit of area_i.
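The two-digit encoding and formula (3) can be illustrated with a short sketch. Formulas (1) and (2) are missing from this text, so `encode_area` follows the verbal description only (tens digit for the question, ones digit for the answer). The function names and the value of λ used in the example are assumptions; the source does not state a value for λ.

```python
def encode_area(question, answer, district):
    """Encode district presence: tens digit for the question, ones digit for the answer."""
    tens = 10 if district in question else 0
    ones = 1 if district in answer else 0
    return tens + ones

def score_a(area_code, lam):
    """Formula (3): weight the question digit by (1 - lam) and the answer digit by lam."""
    return (1 - lam) * (area_code // 10) + lam * (area_code % 10)
```

For a QA pair whose answer mentions the district but whose question does not, the code is 1 and the score equals λ, so with an illustrative λ = 0.7 the answer-only case scores 0.7 and the question-only case scores about 0.3.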
(2) Question similarity set.
This feature measures the similarity between the posed question tq and every question in the QA set. It is written Simq = {simq_1, simq_2, …, simq_10}, where simq_1 to simq_10 are the similarities between tq and each question in the QA set. The invention uses cosine similarity over character vectors as the question similarity, for two reasons: for these frequency vectors the cosine similarity always lies in [0, 1], and the two questions being compared are short, usually about 10 characters. Let A and B be two n-dimensional vectors, A = [A_1, A_2, …, A_n] and B = [B_1, B_2, …, B_n], where A_i and B_i are the frequencies with which the same character appears in A and B respectively and n is the number of distinct characters in A and B. The cosine similarity of A and B can then be expressed as:
$$\cos(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
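The character-level cosine similarity described above can be sketched in a few lines. The function name `char_cosine` is hypothetical; the treatment of an empty string (similarity 0) is an assumption, since the source does not discuss that case.

```python
import math
from collections import Counter

def char_cosine(a: str, b: str) -> float:
    """Cosine similarity between the character-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    # Only characters present in both strings contribute to the dot product.
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    # ASSUMPTION: an empty string has similarity 0 with everything.
    return dot / norm if norm else 0.0
```

Comparing "中关村属于哪个区" with "中关村在哪个区" gives 6 shared characters out of 8 and 7 distinct ones, so the similarity is 6 / sqrt(56), roughly 0.80.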
Feature 2 is specifically:
(1) Whether the answer is a recommended answer.
A recommended answer is a higher-quality answer recommended by senior users of the Baidu Knows platform. Recommended answers are therefore usually more credible and more important than other answers. The weight of a recommended answer is denoted φ.
(2) Number of likes.
On Baidu Knows, other users can endorse the accuracy of an answer by giving it a thumbs-up. In general, the more likes an answer has, the higher its quality. The invention computes the like score as:
$$\mathrm{ScoreI}(QA_i, Agree) = \theta \times \mathrm{count}(QA_i, Agree) \qquad (6)$$
where θ is the weight of each like and count(QA_i, Agree) is the number of likes in the i-th QA pair.
(3) Answer time.
The answer time is the time at which the answering user posted the answer in the QA pair. Because the district a place belongs to can change over time, an answer closer to the current time is usually more accurate. The invention therefore bounds the answer time, in units of years, as shown in formulas (7) and (8):
$$time_i = Now - AnsTime_i \qquad (7)$$
[Formula (8) is not reproduced in this text.]
where i is the i-th QA pair, Now is the current time, and AnsTime_i is the time at which the question was answered.
Feature 3 is specifically:
The first three returned results are given the same weight, and the weight of later results decreases gradually as the rank increases. The distribution is given by formula (9), where i is the i-th QA pair. [Formula (9) is not reproduced in this text.]
From the six quantities above (presence of district information, question similarity, recommendation, number of likes, answer time, and result rank), a scoring model is built for the district of the defectLoc in each QA pair. The presence of district information and the question similarity form the base score. Each additional feature then modifies the score computed so far according to its importance: if the feature value is 0 the total score is unchanged, and the larger the feature value, the more the total score increases. Step 3) is specifically:
The score Score(area_i | defectLoc) with which the defective geographic location entity defectLoc belongs to district i is computed by formula (10):
[Formula (10) is not reproduced in this text.]
where RowScore(QA_j, area_i) is the score with which the j-th QA pair assigns the defectLoc to district i, computed by formula (11):
$$\mathrm{RowScore}(QA_j, area_i) = \mathrm{ScoreA}(QA_j, area_i) \times simq_j \times (1+\mathrm{Rec}(j)) \times (1+\mathrm{ScoreI}(QA_j, Agree)) \times (1+\mathrm{ScoreT}(time_j)) \times (1+\mathrm{Pos}(j)) \qquad (11)$$
From the Score values of the defectLoc for all districts, the score feature vector of the defectLoc is built:
$$\{\mathrm{Score}(area_1 \mid defectLoc),\ \mathrm{Score}(area_2 \mid defectLoc),\ \ldots,\ \mathrm{Score}(area_{16} \mid defectLoc)\}$$
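The scoring model can be sketched as follows. Formula (11) is implemented as written. Formulas (5), (8), (9), and (10) are missing from this text, so Rec, ScoreT, and Pos are taken as already-computed numbers, and formula (10) is assumed to be the sum of RowScore over the returned QA pairs. That sum, and the names `row_score` and `score_vector`, are assumptions rather than the patent's stated definitions.

```python
from math import prod

def row_score(score_a, simq, rec=0.0, score_i=0.0, score_t=0.0, pos=0.0):
    """Formula (11): base score times one (1 + feature) factor per extra feature."""
    # A feature value of 0 contributes a factor of 1 and leaves the score unchanged.
    return score_a * simq * prod(1 + f for f in (rec, score_i, score_t, pos))

def score_vector(rows_per_district):
    """Build the score feature vector from per-district lists of RowScore values.

    ASSUMPTION: formula (10) is taken to be a sum over the QA pairs.
    """
    return [sum(rows) for rows in rows_per_district]
```

With a base score of 0.5, a similarity of 0.5, and a recommendation value of 1.0, the row score is 0.5 × 0.5 × 2 = 0.5; a QA pair whose ScoreA is 0 contributes nothing regardless of its other features.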
表2:defectLoc所有区域的得分特征向量Table 2: Score feature vectors for all regions of defectLoc
通过对数据的观察分析发现,缺陷地理位置实体的Score(areai|defectLoc)值及检索结果中出现的区域的个数对缺陷地理位置实体的完整化起决定性作用。本发明利用规则对不同类别的缺陷地理位置实体进行区域完整性表示。其中,所述步骤4)中规则具体为:Through the observation and analysis of the data, it is found that the Score(area i |defectLoc) value of the defective geographic location entity and the number of areas appearing in the retrieval results play a decisive role in the integrity of the defective geographic location entity. The invention utilizes the rules to represent the region completeness of different types of defective geographic location entities. Wherein, the rules in the step 4) are specifically:
Rule 1: For a clear geographic location entity there are two cases. First, if the retrieval results contain only one region, that region is the region of defectLoc. Second, if Max(P(area_i | defectLoc)) ≥ γ, that area_i is the region of defectLoc. For example, entry 6 in Table 2 has scores for several regions, yet its region can still be determined.
A clear geographic location entity, denoted clearLoc, is a defectLoc for which exactly one region appears in the retrieval results, or for which Max(P(area_i | defectLoc)) ≥ γ. The probability is computed as shown in formula (12):
Rule 2: For an ambiguous geographic location entity, countLoc is used to disambiguate defectLoc. countLoc counts, for each region, the number of QA entries in which it appears; a region mentioned several times within one QA entry is counted once. The region area_i with Max(countLoc | area_i) is taken as the region of defectLoc. If two or more regions share the maximum, the first of them is taken. For example, in entry 2 of Table 3, 海淀 (Haidian) has the maximum countLoc of 7, so the normalized complete representation is "海淀区五路居" (Wuluju, Haidian District).
An ambiguous geographic location entity, denoted ambiguityLoc, is a defectLoc for which several regions appear in the retrieval results and Max(P(area_i | defectLoc)) < γ.
Table 3: countLoc of defectLoc over all regions
Rule 3: For a zero geographic location entity, region completion cannot be performed, because such an entity does not necessarily belong to the Beijing area; entry 3 in Table 2 is an example.
A zero geographic location entity, denoted zeroLoc, is a defectLoc for which no region information appears in the retrieval results.
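The three rules can be illustrated with a short Python sketch. It is a hedged reading of the text rather than the patented procedure: formula (12) is not reproduced here, so the probability is assumed to be a region's score divided by the sum of all region scores (consistent with the "half of the sum" criterion stated for γ = 0.5); a region is assumed to "appear" in the retrieval results when its score is positive; and all function names and the sample scores are hypothetical.

```python
from typing import Dict, List, Optional

GAMMA = 0.5  # threshold γ, the value chosen in the text


def probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """ASSUMPTION for formula (12): normalise each score by the sum of all scores."""
    total = sum(scores.values())
    if total == 0:
        return {area: 0.0 for area in scores}
    return {area: s / total for area, s in scores.items()}


def classify(scores: Dict[str, float], gamma: float = GAMMA) -> str:
    """Return 'clearLoc', 'ambiguityLoc' or 'zeroLoc' for one defectLoc."""
    present = [area for area, s in scores.items() if s > 0]  # assumed meaning of "appears"
    if not present:
        return "zeroLoc"
    if len(present) == 1 or max(probabilities(scores).values()) >= gamma:
        return "clearLoc"
    return "ambiguityLoc"


def count_loc(qa_regions: List[List[str]]) -> Dict[str, int]:
    """countLoc: number of QA entries mentioning each region, once per entry."""
    counts: Dict[str, int] = {}
    for regions in qa_regions:
        for region in dict.fromkeys(regions):  # de-duplicate, keep first-seen order
            counts[region] = counts.get(region, 0) + 1
    return counts


def complete(defect_loc: str, scores: Dict[str, float],
             qa_regions: List[List[str]], gamma: float = GAMMA) -> Optional[str]:
    """Apply Rules 1-3; return the completed name, or None when no completion is possible."""
    kind = classify(scores, gamma)
    if kind == "zeroLoc":
        return None  # Rule 3
    if kind == "clearLoc":
        area = max(scores, key=scores.get)  # Rule 1
    else:
        counts = count_loc(qa_regions)  # Rule 2
        if not counts:
            return None
        area = max(counts, key=counts.get)  # on ties, max() keeps the first region seen
    return area + defect_loc


# Hypothetical scores; only the countLoc of 7 for 海淀区 mirrors the Table 3 example.
amb_scores = {"海淀区": 0.4, "朝阳区": 0.35, "丰台区": 0.25}
amb_qa = [["海淀区"]] * 7 + [["朝阳区", "朝阳区"]] * 3 + [["丰台区"]] * 2
result = complete("五路居", amb_scores, amb_qa)
```

Using insertion order to break ties is one way to realise "take the first region"; the source does not say which ordering defines "first".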
Each defective geographic location entity is classified according to its scores over all regions and then completed using the rules above, yielding a normalized, complete geographic location entity, as shown in Table 4.
Table 4: Complete representation of the defective geographic location entities in Table 2
The corpus of the invention comes from Sina Weibo. Posts were retrieved through the Sina Weibo search page "s.weibo.com" using "@北京12345" as the keyword, and a focused crawler was written to collect the relevant microblogs automatically. Because the locations mentioned in the complaint microblogs are concentrated in the Beijing area, the regions to which a geographic location entity may belong comprise 14 districts and 2 counties: Dongcheng District, Xicheng District, Chaoyang District, Fengtai District, Shijingshan District, Haidian District, Mentougou District, Fangshan District, Daxing District, Changping District, Shunyi District, Tongzhou District, Huairou District, Pinggu District, Miyun County, and Yanqing County.
The invention uses 1480 Sina city-complaint microblogs as the experimental corpus. A total of 1482 geographic location entities were extracted using the method of Li×w and then proofread by professionals. Of these, 840 place names contain explicit region information and can support subsequent statistics, while 642 are defective geographic location entities, accounting for 43.32% of all extracted entities. After preliminary data processing that exploits the way these microblogs @-mention accounts carrying region information, 218 of the 642 defective entities can already be given a complete representation, leaving 424 that cannot. Among these 424, 90 are repeated occurrences of common geographic location entities such as "国贸" (Guomao) and "双井" (Shuangjing). After removing these duplicates, 334 defective geographic location entities remain to be completed.
These figures show that research on completing geographic location entities is necessary; the invention focuses on the 334 defective geographic location entities. Repeated experiments show that region information appearing in an answer generally contributes more to determining the region than region information appearing in the question, that a recommended answer explains the question more authoritatively, and that the number of likes among the Baidu Knows (百度知道) features contributes less. For a defective geographic location entity, if the score of one region is greater than or equal to half the sum of the scores of all regions, the entity can be determined to be a clear geographic location entity. The invention therefore takes λ = 0.7, θ = 0.1, γ = 0.5.
The invention evaluates the experimental results by accuracy, that is, the proportion of correctly completed defective geographic location entities among all defective geographic location entities:

$$
\mathrm{Accuracy} = \frac{right}{total}
$$

where right is the number of defective geographic location entities completed correctly and total is the number of all defective geographic location entities to be completed.
After the data processing stage, 334 defective geographic location entities require a complete representation. The experimental method of the invention proceeds in the following three steps:
1) Retrieve questions and structure the returned results. A question search is run for each of the 334 defectLoc, and the search results are structured according to the layout in Table 1, producing 334 feedback data tables.
2) Extract features, compute the scores of all regions, and construct the score feature vector of each defectLoc. Using the feature-value calculations and the region scoring model described above, the score of every region is computed for each defectLoc from its feedback data table, and the score feature vector is constructed.
3) Classify all defectLoc by their score feature vectors and complete them using the rules. The 334 defectLoc fall into 290 clear geographic location entities, 35 ambiguous geographic location entities, and 9 zero geographic location entities. As shown in Figure 2, clearLoc accounts for 87% of all defectLoc, which indicates that most defectLoc in city-complaint microblogs are clearLoc. Although zeroLoc, which cannot be completed, account for only 3%, other methods are still needed to complete them.
The experimental results in Table 5 show that the method completes clearLoc with an accuracy of 96.21% and ambiguityLoc with an accuracy of 85.71%. clearLoc is completed by Rule 1: because the Baidu Knows retrieval yields either a single region or Max(P(area_i | defectLoc)) ≥ γ, ambiguous region information rarely arises, so this category has the highest accuracy. The accuracy for ambiguityLoc is lower than for clearLoc, mainly because several candidate regions exist with similar scores, so disambiguation among them sometimes fails. The method can give a complete representation for most defectLoc, with a coverage of 97.31%. It cannot yet handle the small number of zeroLoc for which the retrieval returns no region information. In summary, the method is suitable for the complete representation of defectLoc.
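As a quick consistency check, the reported percentages can be recomputed from the counts given in the text. The sketch below is only arithmetic on those counts. The numbers of correctly completed entities (279 clearLoc and 30 ambiguityLoc) are not stated in the source; they are back-computed here as the only integers that reproduce 96.21% and 85.71%, and should be read as an inference.

```python
def accuracy(right: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the text."""
    return round(100.0 * right / total, 2)


# Counts stated in the text.
clear, ambiguous, zero = 290, 35, 9
total = clear + ambiguous + zero                 # all defectLoc to be completed

defective_share = accuracy(642, 1482)            # defective entities among all extracted
coverage = accuracy(clear + ambiguous, total)    # defectLoc the rules can complete

# ASSUMPTION: 279 and 30 are inferred from the reported accuracies, not given in the source.
clear_acc = accuracy(279, clear)
ambiguity_acc = accuracy(30, ambiguous)

# Category shares, rounded to whole percentages as in the Figure 2 description.
clear_share = round(100.0 * clear / total)
zero_share = round(100.0 * zero / total)

print(total, defective_share, coverage, clear_acc, ambiguity_acc, clear_share, zero_share)
```

The coverage figure follows from counting clearLoc and ambiguityLoc together as the entities the rules can complete, which is how the 97.31% in the text appears to be obtained.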
Table 5: Distribution and accuracy of each type of defective geographic location entity
The invention provides a method, based on the interactive Q&A community Baidu Knows (百度知道), for expressing geographic location entities completely. It takes Weibo city-complaint texts as its basis and addresses the non-standard, unstructured way in which geographic location entities are expressed there, which makes statistical analysis difficult for staff. The method completes defective geographic location entities with high accuracy and can meet the needs of practical applications.
The embodiments described above express only implementations of the invention. Their description is relatively specific and detailed, but it should not be construed as limiting the patent scope of the invention. It should be noted that a person of ordinary skill in the art can make a number of modifications and improvements without departing from the concept of the invention, and all of these fall within its scope of protection. The scope of protection of this patent is therefore defined by the appended claims.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610001346.2A CN105468791B (en) | 2016-01-05 | 2016-01-05 | An Integrity Expression Method Based on Interactive Q&A Community-Baidu Knows Geographical Entities |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105468791A CN105468791A (en) | 2016-04-06 |
CN105468791B true CN105468791B (en) | 2019-11-15 |
Family
ID=55606491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610001346.2A Active CN105468791B (en) | 2016-01-05 | 2016-01-05 | An Integrity Expression Method Based on Interactive Q&A Community-Baidu Knows Geographical Entities |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105468791B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408743B (en) * | 2018-08-21 | 2020-11-17 | 中国科学院自动化研究所 | Text link embedding method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103473289A (en) * | 2013-08-30 | 2013-12-25 | 深圳市华傲数据技术有限公司 | Device and method for completing communication addresses |
CN103914543A (en) * | 2014-04-03 | 2014-07-09 | 北京百度网讯科技有限公司 | Search result displaying method and device |
CN104537062A (en) * | 2014-12-29 | 2015-04-22 | 北京牡丹电子集团有限责任公司数字电视技术中心 | Address information extracting method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150261858A1 (en) * | 2009-06-29 | 2015-09-17 | Google Inc. | System and method of providing information based on street address |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||