WO2020107929A1 - Method and terminal for obtaining associated information

Publication number
WO2020107929A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
strength value
entities
correlation strength
data
Prior art date
Application number
PCT/CN2019/099124
Other languages
French (fr)
Chinese (zh)
Inventor
陈捷
吴春德
林世国
栾江霞
吴鸿伟
吴文
Original Assignee
厦门市美亚柏科信息股份有限公司
Application filed by 厦门市美亚柏科信息股份有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Definitions

  • the present disclosure relates to the field of data processing technology, and in particular, to a method and terminal for acquiring related information.
  • the technical problem to be solved by the present disclosure is: how to improve the efficiency of obtaining related information from massive data.
  • the present disclosure provides a method for obtaining associated information, including:
  • S1 is specifically:
  • calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set specifically includes:
  • $X(e_i, e_j) = \max(a_{i \to j})$
  • where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
  • constructing a knowledge graph based on the second entity set and the second correlation strength value set specifically includes:
  • S4 is specifically:
  • the entity is output.
  • the shortest path between the retrieval entity and the entity is output.
  • the present disclosure also provides a computer-readable storage medium on which a program is stored, and when the program is executed by a computer, the method for acquiring associated information is executed.
  • the present disclosure further provides a terminal for acquiring associated information, including one or more processors and a memory, the memory stores a program, and is configured to be executed by the one or more processors to perform the following steps:
  • S1 is specifically:
  • calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set specifically includes:
  • $X(e_i, e_j) = \max(a_{i \to j})$
  • where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$;
  • the S4 is specifically:
  • the shortest path between the retrieval entity and the entity is output.
  • the beneficial effect of the present disclosure is that, by constructing a knowledge graph from the massive first data, the data set associated with the retrieval entity can be extracted quickly from massive data, which simplifies data retrieval for business personnel.
  • FIG. 1 is a flowchart of a specific implementation manner of a method for obtaining associated information provided by the present disclosure
  • FIG. 2 is a structural block diagram of a specific implementation manner of a terminal for acquiring associated information provided by the present disclosure
  • FIG. 3 is an example diagram of the relationship between the retrieval entity and the entity in the knowledge graph
  • the present disclosure provides a method for obtaining associated information, including:
  • S1 is specifically:
  • extracting entities from the business data (i.e., the first data) according to business requirements, and setting and directly storing the correlation strength between the extracted entities according to those requirements, helps improve the efficiency and data accuracy of searches by business personnel.
  • calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set specifically includes:
  • $X(e_i, e_j) = \max(a_{i \to j})$
  • where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
  • the entity data whose strength value is higher than the set threshold can be effectively extracted to avoid the omission of key entity information.
  • constructing a knowledge graph based on the second entity set and the second correlation strength value set specifically includes:
  • the normalized correlation strength value has a fixed value range (greater than or equal to 0 and less than or equal to 1), and business personnel can easily set the threshold.
  • S4 is specifically:
  • the entity is output.
  • with the posterior probability method, when new entity data is stored in the database, the posterior probability values of the existing entities are adjusted dynamically; this matters most when the retrieval entity has a large amount of associated entity data, because it ensures that each extraction returns entity data of relatively high importance.
  • the shortest path between the retrieval entity and the entity is output.
  • outputting the shortest path reveals the most direct connection between entities, which helps business personnel understand how two entities are linked and decide whether to view the entities on the connecting path.
  • the present disclosure also provides a computer-readable storage medium on which a program is stored, and when the program is executed by a computer, the method for acquiring associated information is executed.
  • the present disclosure further provides a terminal for acquiring associated information, including one or more processors 1 and a memory 2; the memory 2 stores a program that is configured to be executed by the one or more processors 1 to perform the following steps:
  • S1 is specifically:
  • calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set specifically includes:
  • $X(e_i, e_j) = \max(a_{i \to j})$
  • where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$;
  • the S4 is specifically:
  • the shortest path between the retrieval entity and the entity is output.
  • This embodiment provides a method for obtaining associated information, including:
  • the first data is the daily record data of the business department.
  • a large amount of event recording data will be generated.
  • Most of these massive data are text data, and also contain some table data, which are often distributed and stored in structured and unstructured databases.
  • the person entity, address entity, event entity, item entity and organization entity are extracted from the daily record data of the business department through technologies such as rule matching, OCR recognition, and natural language analysis.
  • the person entity includes the person identity information of the person entity and its associated person indicated in the business record, such as name, ID number, gender, blood type, etc.;
  • the address entity includes address information of companies, group organizations, individuals, etc. involved in the event record, such as the registered address, office address, individual's household registration address, temporary residence address, etc. of the enterprise;
  • the event entity includes information required for event description such as event type, event date, and event content in the event record;
  • the item entity includes identification information of items such as mobile phones, computers, and vehicles included in the event record, such as mobile phone numbers, MAC addresses of computers, and license plate numbers;
  • the organization entity includes information such as the organization name, type, scale, and scope of activities in the event record.
  • OCR recognition technology is used to extract subject data, such as license plate information and business licenses, from the picture data in the event record.
  • the format of these picture data is relatively fixed, and can be recognized by the pre-trained OCR recognition model;
  • person entities, address entities, event entities, item entities, and organizational entities are related to each other according to the relationship in the event record, and the correlation strength value between the entities is set by the business personnel according to the degree of closeness to the event
  • the score ranges from 0 to 100.
  • $X(e_i, e_j) = \max(a_{i \to j})$
  • where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
  • that is, the correlation strength value between any two entities is the maximum correlation strength value found on the path connecting them.
  • for normalization, each $X(e_i, e_j)$ is divided by the maximum value in the second correlation strength value set.
  • the established entities and the correlation strength values between the entities are stored in graph databases such as Neo4j or Titan to construct a knowledge graph.
  • the knowledge graph was first used mainly by Google in semantic search to improve search results. It is now also used in chatbots, intelligent question answering systems, medical services, book information services and other fields.
  • the data in a knowledge graph can be expressed as triples of the form entity 1-relationship-entity 2, in which the entity is the most basic element of the knowledge graph and describes a fact. Different entities have different relationships.
  • a knowledge graph containing a large number of triples connects different types of information into a relationship network, providing the ability to analyze problems from the perspective of relationships.
  • by applying knowledge graph technology to the field of big data, massive amounts of heterogeneous data can be fused and the associations between object data can be constructed, so that business personnel can quickly query, analyze and mine relationships across the full data set and improve work efficiency.
  • the new retrieval data is preprocessed to extract the retrieval entity data set; the retrieval subject is extracted from the obtained retrieval information, such as the subject name, certificate number, contact information, type, location, organization and other daily business information.
  • the retrieval entity is associated with the person, address, event, item, and organization entities in the knowledge graph, and all entity information related to the retrieval entity is extracted according to the established relationships between the entities, forming a data set $\{e_1, e_2, \ldots, e_k\}$.
  • for example, the mobile phone number in the retrieval entity can be associated directly with an item entity, and the person name, address and other information tied to that mobile phone number are then reached through the associations established between the item entity and the person, address, event, and organization entities.
  • the person, address, event, item, and organization entities related to the retrieval subject are extracted to form the data set $\{e_1, e_2, \ldots, e_k\}$;
  • the formula for calculating the posterior probability is:
  • $C_i$ represents the retrieval entity $i$
  • $k$ is the number of entities in the data set
  • $\{e_1, e_2, \ldots, e_k\}$ is the entity data set composed of person, address, event, item, and organization entities.
  • the shortest path between $e_i$ and its associated entity $e_j$ is output as $e_i \to e_j$; the shortest path between $e_i$ and its associated entity $e_m$ is output as $e_i \to e_k \to e_n \to e_m$.
  • the method for obtaining associated information provided by this embodiment constructs a knowledge graph so that the data set associated with the retrieval subject can be extracted quickly, which simplifies data retrieval for business personnel and improves their work efficiency; at the same time, intelligent filtering of the retrieved data improves the efficiency of data queries.
  • This embodiment provides a computer-readable storage medium on which a program is stored; when executed by a computer, the program performs the same steps as the method embodiment described above.
  • This embodiment provides a terminal for acquiring associated information, including one or more processors and a memory; the memory stores a program that is configured to be executed by the one or more processors to perform the same steps as the method embodiment described above.
  • the method and terminal for obtaining associated information provided by the present disclosure construct a knowledge graph from the massive first data, so that the data set associated with the retrieval entity can be extracted quickly from massive data. This simplifies data retrieval for business personnel and improves their work efficiency, while intelligent filtering of the retrieved data improves the efficiency of obtaining associated information from massive data. Further, extracting entities from the business data (i.e., the first data) according to business requirements, and setting and directly storing the correlation strength between the extracted entities according to those requirements, helps improve the efficiency and data accuracy of searches by business personnel.
  • the entity data whose strength value is higher than the set threshold can be effectively extracted to avoid the omission of key entity information.
  • the normalized correlation strength value has a fixed value range (greater than or equal to 0 and less than or equal to 1), so business personnel can conveniently set the threshold.
  • with the posterior probability method, when new entity data is stored in the database, the posterior probability values of the existing entities are adjusted dynamically; this matters most when the retrieval entity has a large amount of associated entity data, because it ensures that each extraction returns relatively important entity data. Further, outputting the shortest path reveals the most direct connection between entities, which helps business personnel understand how two entities are linked and decide whether to view the entities on the connecting path.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to the technical field of data processing, and in particular to a method and terminal for obtaining associated information. In the present disclosure, a knowledge graph is constructed according to preset first data; a retrieval entity is obtained; the entities associated with the retrieval entity are obtained according to the knowledge graph to form a first entity set; and one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold are obtained from the first entity set. This improves the efficiency of obtaining associated information from a massive amount of data.

Description

Method and terminal for obtaining associated information
Related Application
This application claims priority to Chinese patent application No. 201811420058.6, filed on November 26, 2018, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of data processing technology, and in particular to a method and terminal for obtaining associated information.
Background
Many routine business operations generate large amounts of event record data. Most of this massive data is text, along with some tabular data, and it is often distributed across structured and unstructured databases. Under the traditional approach, business personnel who need data must query several different systems and then establish the relationships between the data manually, which is time-consuming and laborious.
Summary of the Disclosure
The technical problem to be solved by the present disclosure is how to improve the efficiency of obtaining associated information from massive data.
To solve the above technical problem, the present disclosure adopts the following technical solution:
The present disclosure provides a method for obtaining associated information, including:
S1. Construct a knowledge graph based on preset first data;
S2. Obtain a retrieval entity;
S3. Obtain the entities associated with the retrieval entity according to the knowledge graph, to obtain a first entity set;
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold.
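Steps S3 and S4 can be illustrated with a minimal Python sketch. It assumes the knowledge graph is already built and held in memory as a dictionary that maps each entity to its associated entities and their normalized correlation strength values; the function names, entity labels and strength values are hypothetical and are not taken from the disclosure.

```python
# Toy knowledge graph (hypothetical data): entity -> {associated entity: strength in [0, 1]}.
graph = {
    "phone:13800000000": {"person:Zhang": 0.9, "address:Xiamen": 0.4, "org:Acme": 0.2},
}

def associated_entities(graph, retrieval_entity):
    """S3: collect the entities associated with the retrieval entity (the first entity set)."""
    return dict(graph.get(retrieval_entity, {}))

def filter_by_strength(first_entity_set, threshold):
    """S4: keep the entities whose strength with the retrieval entity exceeds the threshold."""
    return {entity: s for entity, s in first_entity_set.items() if s > threshold}

first_set = associated_entities(graph, "phone:13800000000")
selected = filter_by_strength(first_set, threshold=0.3)
print(selected)
```

In this sketch only directly linked entities are treated as "associated"; a real deployment would query a graph database instead of an in-memory dictionary.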
Further, S1 is specifically:
extracting entities from the first data to obtain a second entity set;
setting a correlation strength value between each two entities in the second entity set that have an association relationship, to obtain a first correlation strength value set;
calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set;
constructing the knowledge graph according to the second entity set and the second correlation strength value set.
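The first sub-step of S1, extracting entities from the first data, can be sketched with rule matching, one of the extraction techniques the embodiments mention. The regular expressions below are illustrative assumptions (an 11-digit mobile number starting with 1, and a colon-separated MAC address); the rule names and the sample record are hypothetical.

```python
import re

# Illustrative rules only; real rules would depend on the business data.
RULES = {
    "mobile_number": re.compile(r"\b1\d{10}\b"),
    "mac_address": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
}

def extract_item_entities(record):
    """Return a set of (kind, value) pairs found in one text record."""
    found = set()
    for kind, pattern in RULES.items():
        for match in pattern.findall(record):
            found.add((kind, match))
    return found

record = "Caller 13812345678 connected from device 00:1A:2B:3C:4D:5E."
entities = extract_item_entities(record)
print(sorted(entities))
```

The resulting pairs would feed the second entity set; OCR and natural language analysis, which the embodiments also name, are outside this sketch.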
Further, calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set, is specifically:
$$X(e_i, e_j) = \max(a_{i \to j})$$
where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting entities $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
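One way to read this formula is sketched below in Python. The text does not say which connecting path is meant when several exist, so the sketch assumes the path found by a breadth-first search (fewest hops); that choice, the function names and the sample strengths on a 0 to 100 scale are all assumptions.

```python
from collections import deque

def path_between(direct, start, goal):
    """Breadth-first search; returns one connecting path as a list of entities, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in direct.get(node, {}):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

def strength(direct, e_i, e_j):
    """X(e_i, e_j): the largest directly set strength on the connecting path."""
    path = path_between(direct, e_i, e_j)
    if path is None or len(path) < 2:
        return None
    return max(direct[a][b] for a, b in zip(path, path[1:]))

# First correlation strength value set (hypothetical values set by business personnel).
direct = {
    "person:A": {"phone:P": 80},
    "phone:P": {"person:A": 80, "event:E": 35},
    "event:E": {"phone:P": 35},
}
print(strength(direct, "person:A", "event:E"))
```

Here the path from `person:A` to `event:E` passes through `phone:P`, so the pair inherits the larger of the two edge strengths.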
Further, constructing the knowledge graph according to the second entity set and the second correlation strength value set is specifically:
normalizing the second correlation strength value set to obtain a third correlation strength value set;
constructing the knowledge graph according to the third correlation strength value set and the second entity set.
进一步地,所述S4具体为:Further, the S4 is specifically:
计算所述检索实体与所述第一实体集合中每一实体的后验概率,得到后验概率集合;Calculating the posterior probability of each entity in the retrieval entity and the first entity set to obtain a posterior probability set;
若所述第一实体集合中一实体的后验概率大于预设阈值,则输出所述一实体。If the posterior probability of an entity in the first entity set is greater than a preset threshold, the entity is output.
进一步地,还包括:Further, it also includes:
输出所述检索实体与所述一实体之间的最短路径。The shortest path between the retrieval entity and the entity is output.
本公开还提供一种计算机可读存储介质,其上存储有程序,所述程序在被计算机执行时执行所述获取关联信息的方法。The present disclosure also provides a computer-readable storage medium on which a program is stored, and when the program is executed by a computer, the method for acquiring associated information is executed.
本公开另提供一种获取关联信息的终端,包括一个或多个处理器及存储器,所述存储器存储有程序,并且被配置成由所述一个或多个处理器执行以下步骤:The present disclosure further provides a terminal for acquiring associated information, including one or more processors and a memory, the memory stores a program, and is configured to be executed by the one or more processors to perform the following steps:
S1. Construct a knowledge graph according to preset first data;
S2. Obtain a retrieval entity;
S3. Obtain the entities associated with the retrieval entity according to the knowledge graph, to obtain a first entity set;
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold.
Further, S1 is specifically:
extracting entities from the first data to obtain a second entity set;
setting a correlation strength value between every two entities in the second entity set that have an association relationship, to obtain a first correlation strength value set;
calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set;
constructing the knowledge graph according to the second entity set and the second correlation strength value set.
Further, calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain the second correlation strength value set, is specifically:
$X(e_i, e_j) = \max(a_{i \to j})$
where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting entities $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$;
constructing the knowledge graph according to the second entity set and the second correlation strength value set is specifically: normalizing the second correlation strength value set to obtain a third correlation strength value set; and constructing the knowledge graph according to the third correlation strength value set and the second entity set;
S4 is specifically:
calculating the posterior probability between the retrieval entity and each entity in the first entity set, to obtain a posterior probability set; and, if the posterior probability of an entity in the first entity set is greater than the preset threshold, outputting that entity;
outputting the shortest path between the retrieval entity and that entity.
The beneficial effects of the present disclosure are as follows: by constructing a knowledge graph from the massive first data, the present disclosure quickly extracts the data set associated with the retrieval entity from the massive data, which simplifies the data retrieval process for business personnel and improves their work efficiency; at the same time, intelligently filtering the retrieved data improves the efficiency of obtaining associated information from massive data.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of a specific embodiment of the method for obtaining associated information provided by the present disclosure;
FIG. 2 is a structural block diagram of a specific embodiment of the terminal for obtaining associated information provided by the present disclosure;
FIG. 3 is an example diagram of the associations between the retrieval entity and the entities in the knowledge graph.
Description of reference numerals:
1. Processor; 2. Memory.
DETAILED DESCRIPTION
To explain in detail the technical content, objectives, and effects of the present disclosure, the following description is given with reference to the embodiments and the accompanying drawings.
Please refer to FIG. 1 to FIG. 3.
As shown in FIG. 1, the present disclosure provides a method for obtaining associated information, including:
S1. Construct a knowledge graph according to preset first data;
S2. Obtain a retrieval entity;
S3. Obtain the entities associated with the retrieval entity according to the knowledge graph, to obtain a first entity set;
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold.
Further, S1 is specifically:
extracting entities from the first data to obtain a second entity set;
setting a correlation strength value between every two entities in the second entity set that have an association relationship, to obtain a first correlation strength value set;
calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set;
constructing the knowledge graph according to the second entity set and the second correlation strength value set.
As can be seen from the above description, extracting entities from the business data (that is, the first data) according to business needs, setting the correlation strength between the extracted entities according to business needs, and storing it directly helps improve both the efficiency of searches by business personnel and the accuracy of the data.
Further, calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain the second correlation strength value set, is specifically:
$X(e_i, e_j) = \max(a_{i \to j})$
where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting entities $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
As can be seen from the above description, by taking the maximum correlation strength between any two entities as the effective correlation strength value, all entity data whose strength value is higher than the set threshold can be extracted, which avoids omitting key entity information.
Further, constructing the knowledge graph according to the second entity set and the second correlation strength value set is specifically:
normalizing the second correlation strength value set to obtain a third correlation strength value set;
constructing the knowledge graph according to the third correlation strength value set and the second entity set.
As can be seen from the above description, the normalized correlation strength values fall in a fixed range (greater than or equal to 0 and less than or equal to 1), so business personnel can set the threshold conveniently.
Further, S4 is specifically:
calculating the posterior probability between the retrieval entity and each entity in the first entity set, to obtain a posterior probability set;
if the posterior probability of an entity in the first entity set is greater than the preset threshold, outputting that entity.
As can be seen from the above description, with the posterior probability method, when new entity data is stored, the posterior probability values of the existing entities are adjusted dynamically. In particular, when the retrieval entity is associated with a large amount of entity data, this ensures that the entity data extracted each time is of relatively high importance.
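The dynamic adjustment can be illustrated with a short derivation. It assumes, based only on the worked numbers in Embodiment 1 and not on a formula stated in the text, that the posterior of an entity is its normalized strength divided by the sum of the normalized strengths of all entities associated with the retrieval entity. The symbols $S$ and $x_{\text{new}}$ are introduced here for illustration, and the sketch further assumes that the new strength does not exceed the existing maximum, so the normalized values of the existing entities are unchanged.

```latex
% Assumed form of the posterior (inferred from the worked example, not stated in the source)
P(c_i \mid e_j) = \frac{X(e_i, e_j)'}{S}, \qquad S = \sum_{t=1}^{k} X(e_i, e_t)'

% After a new entity e_{k+1} with normalized strength x_{\text{new}} > 0 is stored:
P_{\text{new}}(c_i \mid e_j) = \frac{X(e_i, e_j)'}{S + x_{\text{new}}} < P(c_i \mid e_j)
```

Under these assumptions every existing posterior shrinks when a new associated entity is added, so a fixed threshold keeps only the entities that remain relatively important.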
Further, the method also includes:
outputting the shortest path between the retrieval entity and that entity.
As can be seen from the above description, outputting the shortest path reveals the most direct connection between entities and helps business personnel understand how two entities are linked; the business personnel then decide whether to inspect the entities on the linking path.
The present disclosure also provides a computer-readable storage medium storing a program that, when executed by a computer, performs the above method for obtaining associated information.
As shown in FIG. 2, the present disclosure further provides a terminal for obtaining associated information, including one or more processors 1 and a memory 2, where the memory 2 stores a program configured to be executed by the one or more processors 1 to perform the following steps:
S1. Construct a knowledge graph according to preset first data;
S2. Obtain a retrieval entity;
S3. Obtain the entities associated with the retrieval entity according to the knowledge graph, to obtain a first entity set;
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold.
Further, S1 is specifically:
extracting entities from the first data to obtain a second entity set;
setting a correlation strength value between every two entities in the second entity set that have an association relationship, to obtain a first correlation strength value set;
calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set;
constructing the knowledge graph according to the second entity set and the second correlation strength value set.
Further, calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain the second correlation strength value set, is specifically:
$X(e_i, e_j) = \max(a_{i \to j})$
where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting entities $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$;
constructing the knowledge graph according to the second entity set and the second correlation strength value set is specifically: normalizing the second correlation strength value set to obtain a third correlation strength value set; and constructing the knowledge graph according to the third correlation strength value set and the second entity set;
S4 is specifically:
calculating the posterior probability between the retrieval entity and each entity in the first entity set, to obtain a posterior probability set; and, if the posterior probability of an entity in the first entity set is greater than the preset threshold, outputting that entity;
outputting the shortest path between the retrieval entity and that entity.
Embodiment 1 of the present disclosure is:
This embodiment provides a method for obtaining associated information, including:
S1. Construct a knowledge graph according to preset first data.
The first data is the daily record data of a business department. Many day-to-day business operations generate large amounts of event record data. Most of this massive data is text, although it also includes some tabular data, and it is often distributed across structured and unstructured databases.
S11. Extract entities from the first data to obtain a second entity set.
Person entities, address entities, event entities, item entities, and organization entities are extracted from the daily record data of the business department through techniques such as rule matching, OCR, and natural language analysis.
In this embodiment, the person entity includes the identity information of the persons indicated in the business record and of their associated persons, such as name, ID number, gender, and blood type;
the address entity includes address information of the companies, groups, organizations, and individuals involved in the event record, such as a company's registered address and office address, or an individual's household registration address and temporary residence address;
the event entity includes the information needed to describe an event in the event record, such as event type, event date, and event content;
the item entity includes identification information of items such as mobile phones, computers, and vehicles contained in the event record, such as a mobile phone number, a computer's MAC address, and a license plate number;
the organization entity includes information such as the organization's name, type, scale, and scope of activities in the event record.
Rule matching is used to extract subject data from event record data entered in a standardized format, such as travel records and ID application materials;
OCR is used to extract subject data from image data in event records, such as license plate information and business licenses. The format of such image data is relatively fixed, so it can be recognized with a pre-trained OCR model;
natural language processing is used to extract subject data, such as event description information, from loosely formatted text data in event records.
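As a rough illustration of the rule-matching route only, the following sketch pulls two kinds of item-entity identifiers out of a well-formatted record with regular expressions. The patterns, the `PATTERNS` table, and the `extract_item_entities` function are hypothetical and are not taken from the disclosure; real records would need rules tuned to their actual formats.

```python
import re

# Hypothetical rules; both patterns are assumptions made for illustration.
PATTERNS = {
    # An 11-digit mobile number starting with 1 and a second digit of 3-9.
    "phone_number": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # A colon-separated MAC address such as 00:1A:2B:3C:4D:5E.
    "mac_address": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
}


def extract_item_entities(record):
    """Return (entity_type, value) pairs found in one record string."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(record):
            found.append((entity_type, match.group()))
    return found


record = "Contact 13812345678, device MAC 00:1A:2B:3C:4D:5E"
entities = extract_item_entities(record)
```

The OCR and natural language processing routes would feed the same kind of (type, value) pairs into the second entity set, but they depend on trained models and are not sketched here.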
S12. Set a correlation strength value between every two entities in the second entity set that have an association relationship, to obtain a first correlation strength value set.
Person entities, address entities, event entities, item entities, and organization entities are associated in pairs according to the relationships in the event records. In this embodiment, the correlation strength value between entities is set by business personnel according to how closely the entities are tied to the event, with scores ranging from 0 to 100.
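One way to hold the first correlation strength value set is a mapping from unordered entity pairs to scores, sketched below. The function name, the entity labels, and the choice of a `frozenset` key are assumptions made for illustration; only the 0 to 100 range comes from the text.

```python
def add_association(strengths, entity_a, entity_b, value):
    """Record a manually assigned strength for one pair of directly related entities."""
    if entity_a == entity_b:
        raise ValueError("an association needs two distinct entities")
    if not 0 <= value <= 100:
        raise ValueError("strength must lie between 0 and 100")
    # frozenset makes the pair unordered: (a, b) and (b, a) share one entry.
    strengths[frozenset((entity_a, entity_b))] = value


first_strengths = {}
add_association(first_strengths, "person:Zhang San", "phone:13812345678", 100)
add_association(first_strengths, "person:Zhang San", "address:Xiamen office", 60)
```

Keying on an unordered pair keeps the set free of duplicate entries when the same relationship is entered from either side.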
S13. Calculate the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set; specifically:
$X(e_i, e_j) = \max(a_{i \to j})$
where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting entities $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
That is, the correlation strength value between any two entities is determined by the maximum correlation strength value present on the path between them.
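A minimal sketch of this rule follows, assuming that the path between two entities is the fewest-hop path found by breadth-first search; the text does not say which path to use when several exist. The edge list mirrors the strengths of the FIG. 3 example, where the individual values for the e_k to e_n and e_n to e_m edges are inferred from that example, and the function name is hypothetical.

```python
from collections import deque


def path_max_strengths(edges, source):
    """Map each entity reachable from source to the largest edge strength
    on its fewest-hop path from source (path choice is an assumption)."""
    adjacency = {}
    for u, v, strength in edges:
        adjacency.setdefault(u, []).append((v, strength))
        adjacency.setdefault(v, []).append((u, strength))

    best = {}
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, running_max = queue.popleft()
        for neighbor, strength in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                best[neighbor] = max(running_max, strength)
                queue.append((neighbor, best[neighbor]))
    return best


# Direct strengths (first set); the last two values are inferred, not quoted.
edges = [("e_i", "e_j", 100), ("e_i", "e_k", 10), ("e_k", "e_n", 10), ("e_n", "e_m", 100)]
second_strengths = path_max_strengths(edges, "e_i")
```

With these edges the result assigns 100 to e_j and e_m and 10 to e_k and e_n, which agrees with the values listed for FIG. 3 later in this embodiment.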
S14. Normalize the second correlation strength value set to obtain a third correlation strength value set.
The formula for normalizing $X(e_i, e_j)$ is:
$X(e_i, e_j)' = \dfrac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$
that is, $X(e_i, e_j)$ is divided by the maximum value in the second correlation strength value set.
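The division by the set maximum can be sketched as follows; the function name and the dictionary layout are illustrative assumptions.

```python
def normalize_strengths(strengths):
    """Divide every strength by the largest one so results lie in [0, 1]."""
    if not strengths:
        return {}
    peak = max(strengths.values())
    if peak <= 0:
        raise ValueError("the largest strength must be positive")
    return {entity: value / peak for entity, value in strengths.items()}


second_strengths = {"e_k": 10, "e_j": 100, "e_n": 10, "e_m": 100}
third_strengths = normalize_strengths(second_strengths)
```

Because the largest value always maps to 1, a threshold chosen on the normalized scale keeps its meaning even if the raw scoring range changes.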
S15. Construct the knowledge graph according to the third correlation strength value set and the second entity set.
The established entities and the correlation strength values between them are stored in a graph database such as Neo4j or Titan to construct the knowledge graph. Knowledge graphs were first used mainly by Google in semantic search to improve search results, and are now also applied in chatbots, intelligent question answering systems, medical services, library information services, and other fields. All data in a knowledge graph can be expressed as triples of the form entity 1-relation-entity 2, where entities are the most basic elements of the knowledge graph and describe facts, and different relations exist between different entities. If entities are regarded as nodes and the relations between them as edges, a knowledge graph containing a large number of triples becomes a huge graph that connects different kinds of information into a relationship network and makes it possible to analyze problems from the perspective of relationships. Applying knowledge graph technology to big data makes it possible to fuse massive heterogeneous data and build associations between data objects, so that business personnel can quickly query, analyze, and mine relationships across the full data set, which improves work efficiency.
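The triple view described above can be sketched in memory before anything is written to a graph database. The `Triple` type, the relation names, and the entity labels below are hypothetical, and the sketch deliberately uses plain Python dictionaries rather than any Neo4j or Titan client API.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str        # entity 1
    relation: str    # relation label
    tail: str        # entity 2
    strength: float  # normalized correlation strength in [0, 1]


def build_graph(triples):
    """Turn triples into an undirected adjacency map: node -> {neighbor: (relation, strength)}."""
    graph = defaultdict(dict)
    for t in triples:
        graph[t.head][t.tail] = (t.relation, t.strength)
        graph[t.tail][t.head] = (t.relation, t.strength)
    return dict(graph)


triples = [
    Triple("person:Zhang San", "owns", "phone:13812345678", 1.0),
    Triple("person:Zhang San", "lives_at", "address:Xiamen office", 0.6),
]
graph = build_graph(triples)
```

Each node then carries its neighbors together with the relation and strength of the connecting edge, which is the information the later retrieval steps read.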
S2. Obtain a retrieval entity.
The newly added search data is preprocessed to extract the retrieval entity data set: the retrieval subject is extracted from the obtained search information, for example day-to-day business information such as subject name, ID number, contact information, type involved, location involved, and organization involved.
S3. Obtain the entities associated with the retrieval entity according to the knowledge graph, to obtain a first entity set.
For example, as shown in FIG. 3, the retrieval entity is associated with the person entities, address entities, event entities, item entities, and organization entities in the knowledge graph, and all entity information related to the retrieval entity is extracted according to the associations established between entities to form a data set $\{e_1, e_2, \ldots, e_k\}$. For instance, a mobile phone number in the retrieval entity can be directly associated with an item entity; then, through the associations between that item entity and person, address, event, and organization entities, information such as the name and address of the person who holds the phone number can be found.
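The phone-number example can be sketched as a breadth-first walk that collects every entity reachable from the retrieval entity. The adjacency map, the entity labels, and the function name are hypothetical; the disclosure does not specify how the traversal is implemented.

```python
from collections import deque


def associated_entities(graph, retrieval_entity):
    """Return every entity reachable from the retrieval entity, nearest first."""
    seen = {retrieval_entity}
    ordered = []
    queue = deque([retrieval_entity])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
                queue.append(neighbor)
    return ordered


graph = {
    "phone:13812345678": ["person:Zhang San"],
    "person:Zhang San": ["phone:13812345678", "address:Xiamen office"],
    "address:Xiamen office": ["person:Zhang San"],
    "vehicle:unrelated": [],
}
first_entity_set = associated_entities(graph, "phone:13812345678")
```

Starting from the phone number, the walk reaches the person first and the address second, while the unconnected vehicle is never visited.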
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold. Specifically:
S41. Calculate the posterior probability between the retrieval entity and each entity in the first entity set, to obtain a posterior probability set;
S42. If the posterior probability of an entity in the first entity set is greater than the preset threshold, output that entity.
Using the constructed knowledge graph, the correlation strength values between entities are substituted into the posterior probability calculation, and the person entities, address entities, event entities, item entities, and organization entities related to the retrieval subject are extracted to form the data set $\{e_1, e_2, \ldots, e_k\}$;
the posterior probability is calculated as:
[Formula image: PCTCN2019099124-appb-000002]
where $C_i$ denotes retrieval entity $i$, $k$ is the number of entities in the data set, and $\{e_1, e_2, \ldots, e_k\}$ is the entity data set composed of person entities, address entities, event entities, item entities, and organization entities.
Entity data whose association probability is higher than the specified threshold is pushed in descending order of probability.
For example, in FIG. 3, $X(e_i, e_k) = 10$, $X(e_i, e_j) = 100$, $X(e_i, e_n) = 10$, $X(e_i, e_m) = 100$.
$X(e_i, e_j)$ is then normalized using the formula:
$X(e_i, e_j)' = \dfrac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$
which gives:
$X(e_i, e_k)' = 0.1$, $X(e_i, e_j)' = 1$, $X(e_i, e_n)' = 0.1$, $X(e_i, e_m)' = 1$
Substituting these results into the posterior probability calculation gives:
[Formula image: PCTCN2019099124-appb-000004]
Similarly:
$P(c_i \mid e_j) = 0.45$, $P(c_i \mid e_n) = 0.05$, $P(c_i \mid e_m) = 0.45$
If entities with a probability higher than 0.3 are to be output, then entity $e_j$ and entity $e_m$ are output.
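The numbers in this example can be reproduced with a short sketch. It assumes that the posterior of an entity is its normalized strength divided by the sum of all normalized strengths; this form is inferred from the listed values (1 / 2.2 is about 0.45 and 0.1 / 2.2 is about 0.05) and is not quoted from the formula image, so it should be read as one plausible reading. The function names are hypothetical.

```python
def posterior_probabilities(normalized):
    """Assumed form: each normalized strength divided by the sum of all of them."""
    total = sum(normalized.values())
    if total <= 0:
        raise ValueError("at least one strength must be positive")
    return {entity: value / total for entity, value in normalized.items()}


def entities_above(posteriors, threshold):
    """Entities whose posterior exceeds the threshold, highest probability first."""
    kept = [(entity, p) for entity, p in posteriors.items() if p > threshold]
    return [entity for entity, _ in sorted(kept, key=lambda item: item[1], reverse=True)]


normalized = {"e_k": 0.1, "e_j": 1.0, "e_n": 0.1, "e_m": 1.0}
posteriors = posterior_probabilities(normalized)
selected = entities_above(posteriors, 0.3)  # keeps e_j and e_m
```

Rounded to two decimals, the computed posteriors are 0.45 for e_j and e_m and 0.05 for e_k and e_n, so a threshold of 0.3 keeps exactly the two entities named in the text.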
S5. Output the shortest path between the retrieval entity and that entity.
For example, output the shortest path $e_i \to e_j$ connecting entities $e_i$ and $e_j$, and the shortest path $e_i \to e_k \to e_n \to e_m$ connecting entities $e_i$ and $e_m$.
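A breadth-first search is one simple way to produce such paths. The sketch below assumes that "shortest" means fewest hops, which the text does not state explicitly, and uses an adjacency map inferred from the FIG. 3 example; the function name is hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return the fewest-hop path from start to goal, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Edges inferred from the FIG. 3 example (an assumption, not quoted data).
adjacency = {
    "e_i": ["e_j", "e_k"],
    "e_j": ["e_i"],
    "e_k": ["e_i", "e_n"],
    "e_n": ["e_k", "e_m"],
    "e_m": ["e_n"],
}
path_to_j = shortest_path(adjacency, "e_i", "e_j")
path_to_m = shortest_path(adjacency, "e_i", "e_m")
```

Graph databases generally offer their own path queries, so in a deployed system this step would more likely be delegated to the database than reimplemented.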
In summary, by constructing a knowledge graph, the method for obtaining associated information provided by this embodiment can quickly extract the data set associated with the retrieval subject, which simplifies the data retrieval process for business personnel and improves their work efficiency; at the same time, intelligently filtering the retrieved data improves the efficiency of data queries.
Embodiment 2 of the present disclosure is:
This embodiment provides a computer-readable storage medium storing a program that, when executed by a computer, performs the following steps:
S1. Construct a knowledge graph according to preset first data.
The first data is the daily record data of a business department. Many day-to-day business operations generate large amounts of event record data. Most of this massive data is text, although it also includes some tabular data, and it is often distributed across structured and unstructured databases.
S11. Extract entities from the first data to obtain a second entity set.
Person entities, address entities, event entities, item entities, and organization entities are extracted from the daily record data of the business department through techniques such as rule matching, OCR, and natural language analysis.
In this embodiment, the person entity includes the identity information of the persons indicated in the business record and of their associated persons, such as name, ID number, gender, and blood type;
the address entity includes address information of the companies, groups, organizations, and individuals involved in the event record, such as a company's registered address and office address, or an individual's household registration address and temporary residence address;
the event entity includes the information needed to describe an event in the event record, such as event type, event date, and event content;
the item entity includes identification information of items such as mobile phones, computers, and vehicles contained in the event record, such as a mobile phone number, a computer's MAC address, and a license plate number;
the organization entity includes information such as the organization's name, type, scale, and scope of activities in the event record.
Rule matching is used to extract subject data from event record data entered in a standardized format, such as travel records and ID application materials;
OCR is used to extract subject data from image data in event records, such as license plate information and business licenses. The format of such image data is relatively fixed, so it can be recognized with a pre-trained OCR model;
natural language processing is used to extract subject data, such as event description information, from loosely formatted text data in event records.
S12. Set a correlation strength value between every two entities in the second entity set that have an association relationship, to obtain a first correlation strength value set.
Person entities, address entities, event entities, item entities, and organization entities are associated in pairs according to the relationships in the event records. In this embodiment, the correlation strength value between entities is set by business personnel according to how closely the entities are tied to the event, with scores ranging from 0 to 100.
S13. Calculate the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set; specifically:
$X(e_i, e_j) = \max(a_{i \to j})$
where $e_i$ and $e_j$ are any entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting entities $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.
That is, the correlation strength value between any two entities is determined by the maximum correlation strength value present on the path between them.
S14. Normalize the second correlation strength value set to obtain a third correlation strength value set.
The formula for normalizing $X(e_i, e_j)$ is:
$X(e_i, e_j)' = \dfrac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$
that is, $X(e_i, e_j)$ is divided by the maximum value in the second correlation strength value set.
S15. Construct the knowledge graph according to the third correlation strength value set and the second entity set.
The established entities and the correlation strength values between them are stored in a graph database such as Neo4j or Titan to construct the knowledge graph. Knowledge graphs were first used mainly by Google in semantic search to improve search results, and are now also applied in chatbots, intelligent question answering systems, medical services, library information services, and other fields. All data in a knowledge graph can be expressed as triples of the form entity 1-relation-entity 2, where entities are the most basic elements of the knowledge graph and describe facts, and different relations exist between different entities. If entities are regarded as nodes and the relations between them as edges, a knowledge graph containing a large number of triples becomes a huge graph that connects different kinds of information into a relationship network and makes it possible to analyze problems from the perspective of relationships. Applying knowledge graph technology to big data makes it possible to fuse massive heterogeneous data and build associations between data objects, so that business personnel can quickly query, analyze, and mine relationships across the full data set, which improves work efficiency.
S2. Obtain a retrieval entity.
Here, newly added retrieval data is preprocessed to extract a retrieval entity data set: the retrieval subject is extracted from the obtained retrieval information, such as the subject name, certificate number, contact information, type involved, location involved, organization involved, and other routine business information.
S3. Obtain, according to the knowledge graph, the entities associated with the retrieval entity to obtain a first entity set.
For example, as shown in FIG. 3, the retrieval entity is associated with the person entities, address entities, event entities, item entities, and organization entities in the knowledge graph, and, based on the association relationships established between entities, all entity information related to the retrieval entity is extracted to form the data set {e_1, e_2, ..., e_k}. For instance, a mobile phone number in the retrieval entity can be directly associated with an item entity; then, through the associations between that item entity and person, address, event, and organization entities, information such as the name and address of the person linked to the phone number is retrieved.
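One way to read step S3 is as a graph traversal that starts at the retrieval entity and collects every entity it can reach. The sketch below uses a breadth-first search over a plain adjacency map; the graph contents and the function name `associated_entities` are hypothetical, and the disclosure does not state which traversal is actually used.

```python
from collections import deque


def associated_entities(graph, start):
    """Return every entity reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


# Hypothetical graph: a phone number linked to a person, who is linked to an address.
graph = {
    "phone:13800000000": ["person:Zhang", "event:E1"],
    "person:Zhang": ["phone:13800000000", "address:Xiamen"],
    "event:E1": ["phone:13800000000"],
    "address:Xiamen": ["person:Zhang"],
    "org:Unrelated": [],
}
first_entity_set = associated_entities(graph, "phone:13800000000")
print(sorted(first_entity_set))
```

The address is found even though it is not directly linked to the phone number, which mirrors the phone-to-person-to-address example in the text.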
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold. Specifically:
S41. Calculate the posterior probability between the retrieval entity and each entity in the first entity set to obtain a posterior probability set;
S42. If the posterior probability of an entity in the first entity set is greater than the preset threshold, output that entity.
Here, using the constructed knowledge graph together with the correlation strength values between entities, which are substituted into the posterior probability calculation, the person, address, event, item, and organization entities related to the retrieval subject are extracted to form the data set {e_1, e_2, ..., e_k};
The posterior probability is calculated as:
Figure PCTCN2019099124-appb-000006
where C_i denotes retrieval entity i, k is the number of entities in the data set, and {e_1, e_2, ..., e_k} is the entity data set composed of person, address, event, item, and organization entities.
Entity data whose association probability is higher than the specified threshold is pushed in descending order of probability.
For example, in FIG. 3, X(e_i, e_k) = 10, X(e_i, e_j) = 100, X(e_i, e_n) = 10, X(e_i, e_m) = 100.
X(e_i, e_j) is then normalized using the formula:
X(e_i, e_j)′ = X(e_i, e_j) / max(X)
where max(X) is the maximum value in the second correlation strength value set. This gives:
X(e_i, e_k)′ = 0.1, X(e_i, e_j)′ = 1, X(e_i, e_n)′ = 0.1, X(e_i, e_m)′ = 1
Substituting these results into the posterior probability calculation gives:
Figure PCTCN2019099124-appb-000008
Similarly:
P(c_i|e_j) = 0.45, P(c_i|e_n) = 0.05, P(c_i|e_m) = 0.45
If entities with a probability higher than 0.3 are output, then entities e_j and e_m are output.
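The figures in this example can be checked with a short sketch. The posterior formula itself appears only as an image in this text, so the code assumes that each posterior is the normalized strength divided by the sum of all normalized strengths; that assumption reproduces the stated values 0.45 and 0.05 after rounding, but it is not confirmed as the formula of the disclosure. The function names are hypothetical.

```python
def posterior_probabilities(strengths):
    """Assumed form: each normalized strength divided by the sum of all strengths."""
    total = sum(strengths.values())
    return {entity: value / total for entity, value in strengths.items()}


def push_above_threshold(posteriors, threshold):
    """Keep entities above the threshold, ordered from highest to lowest probability."""
    kept = [(entity, p) for entity, p in posteriors.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Normalized strengths from the FIG. 3 example.
normalized = {"e_k": 0.1, "e_j": 1.0, "e_n": 0.1, "e_m": 1.0}
posteriors = posterior_probabilities(normalized)
pushed = push_above_threshold(posteriors, 0.3)
print({entity: round(p, 2) for entity, p in posteriors.items()})
print([entity for entity, _ in pushed])
```

With the threshold 0.3, only e_j and e_m remain, matching the entities output in the example.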
S5. Output the shortest path between the retrieval entity and the output entity.
For example, the shortest path e_i→e_j linking entities e_i and e_j is output, and the shortest path e_i→e_k→e_n→e_m linking entities e_i and e_m is output.
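The shortest-path output of step S5 can be sketched with a breadth-first search, which finds a path with the fewest hops in an unweighted graph. The edge list below is an assumption chosen to be consistent with the two paths named in the example; FIG. 3 itself is not reproduced in this text, and the function name `shortest_path` is hypothetical.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return a fewest-hop path from source to target, or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Assumed edges: e_i-e_j, e_i-e_k, e_k-e_n, e_n-e_m.
graph = {
    "e_i": ["e_j", "e_k"],
    "e_j": ["e_i"],
    "e_k": ["e_i", "e_n"],
    "e_n": ["e_k", "e_m"],
    "e_m": ["e_n"],
}
print(shortest_path(graph, "e_i", "e_j"))
print(shortest_path(graph, "e_i", "e_m"))
```

Graph databases usually offer a built-in shortest-path query, so a deployed system would likely delegate this step to the database rather than traverse in application code.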
Embodiment three of the present disclosure is as follows:
This embodiment provides a terminal for obtaining associated information, including one or more processors and a memory, where the memory stores a program configured to be executed by the one or more processors to perform the following steps:
S1. Construct a knowledge graph according to preset first data.
Here, the first data is the daily record data of business departments. Many routine business activities generate large volumes of event record data. Most of this data is text, with some tabular data, and it is typically distributed across structured and unstructured databases.
S11. Extract entities from the first data to obtain a second entity set.
Here, person entities, address entities, event entities, item entities, and organization entities are extracted from the daily record data of business departments using techniques such as rule matching, OCR, and natural language analysis.
In this embodiment, the person entity includes identity information of the persons named in business records and of their associated persons, such as name, ID number, gender, and blood type;
The address entity includes address information of the companies, groups, organizations, and individuals involved in event records, such as an enterprise's registered address and office address, or an individual's household registration address and temporary residence address;
The event entity includes the information needed to describe an event in the event record, such as event type, event date, and event content;
The item entity includes identification information of items such as mobile phones, computers, and vehicles contained in event records, such as mobile phone numbers, computer MAC addresses, and license plate numbers;
The organization entity includes information such as the organization's name, type, size, and scope of activities in the event record.
Rule matching is used to extract subject data from event record data entered in a standardized format, such as travel records and certificate application materials;
OCR is used to extract subject data from image data in event records, such as license plate information and business licenses. Such images have a relatively fixed format and can be recognized with a pre-trained OCR model;
Natural language processing is used to extract subject data, such as event descriptions, from loosely formatted text data in event records.
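As an illustration of the rule matching technique only, the sketch below pulls two kinds of item identifiers out of a text record with regular expressions. The patterns (an 11-digit mobile number starting with 1 and a colon-separated MAC address), the sample record, and the function name `extract_item_entities` are all assumptions; the disclosure does not specify its matching rules.

```python
import re

# Hypothetical rules; real deployments would maintain rules per record format.
RULES = {
    "phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "mac": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
}


def extract_item_entities(text):
    """Return every match for each rule, keyed by rule name."""
    return {name: pattern.findall(text) for name, pattern in RULES.items()}


record = "Contact 13812345678, device 00:1A:2B:3C:4D:5E."
items = extract_item_entities(record)
print(items)
```

Rule matching of this kind suits standardized records; the OCR and natural language processing paths described above would need their own models and are not sketched here.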
S12. Set a correlation strength value between each pair of entities in the second entity set that have an association relationship, to obtain a first correlation strength value set.
Here, person, address, event, item, and organization entities are associated pairwise according to the relationships in event records. In this embodiment, the correlation strength value between entities is set by business personnel according to how closely the entities relate to the event, with scores ranging from 0 to 100.
S13. Calculate the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set; specifically:
X(e_i, e_j) = max(a_{i→j})
where e_i and e_j are any entities in the second entity set, a_{i→j} is the correlation strength value between any two nodes on the path connecting e_i and e_j, and X(e_i, e_j) is the correlation strength value between e_i and e_j.
That is, the correlation strength value between any two entities is determined by the maximum correlation strength value found on the path between them.
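The rule X(e_i, e_j) = max(a_{i→j}) can be sketched as a traversal that carries the largest edge value seen so far along the path. The sketch assumes the path is the fewest-hop path found by breadth-first search; the disclosure does not say which path is used when several exist. The edge values are assumptions chosen to reproduce the X values quoted for FIG. 3, and the function name is hypothetical.

```python
from collections import deque


def path_max_strengths(edges, source):
    """Map each reachable entity to the largest edge value on its BFS path from source."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    strengths = {}
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, best = queue.popleft()
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                visited.add(neighbour)
                strengths[neighbour] = max(best, weight)
                queue.append((neighbour, strengths[neighbour]))
    return strengths


# Assumed first correlation strength values (set by business personnel, 0 to 100).
edges = [("e_i", "e_j", 100), ("e_i", "e_k", 10), ("e_k", "e_n", 10), ("e_n", "e_m", 100)]
second_set = path_max_strengths(edges, "e_i")
print(second_set)
```

Under these assumed edges, e_m receives the value 100 even though it is three hops away, because the strongest link on its path is 100.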
S14. Normalize the second correlation strength value set to obtain a third correlation strength value set.
The formula for normalizing X(e_i, e_j) is:
X(e_i, e_j)′ = X(e_i, e_j) / max(X)
That is, X(e_i, e_j) is divided by the maximum value, max(X), in the second correlation strength value set.
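The normalization step is a division by the largest value in the set, as the following minimal sketch shows. The function name `normalize` and the handling of an empty or all-zero set are assumptions added for robustness; the input values are those quoted for FIG. 3.

```python
def normalize(strengths):
    """Divide every strength by the largest one so that results fall in [0, 1]."""
    if not strengths:
        return {}
    peak = max(strengths.values())
    if peak == 0:
        # Assumed behaviour: with no positive strength, everything stays at 0.
        return {pair: 0.0 for pair in strengths}
    return {pair: value / peak for pair, value in strengths.items()}


second_set = {
    ("e_i", "e_k"): 10,
    ("e_i", "e_j"): 100,
    ("e_i", "e_n"): 10,
    ("e_i", "e_m"): 100,
}
third_set = normalize(second_set)
print(third_set)
```

Because the result always lies between 0 and 1, a single threshold can be reused regardless of the scale business personnel used when scoring.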
S15. Construct the knowledge graph from the third correlation strength value set and the second entity set.
Here, the extracted entities and the correlation strength values between them are stored in a graph database such as Neo4j or Titan to construct the knowledge graph. Knowledge graphs were first used mainly by Google in semantic search to improve search results, and are now also applied in chatbots, intelligent question answering systems, medical services, library information services, and other fields. All data in a knowledge graph can be expressed as triples of the form entity 1-relationship-entity 2. Entities are the most basic elements of the knowledge graph and describe facts, and different relationships exist between different entities. If entities are treated as nodes and the relationships between them as edges, a knowledge graph containing a large number of triples becomes one large graph that connects different kinds of information into a relationship network and makes it possible to analyze problems from the perspective of relationships. Applying knowledge graph technology to big data makes it possible to fuse massive heterogeneous data and build association relationships between data objects, so that business personnel can quickly query, analyze, and mine relationships across the full data set, which improves work efficiency.
S2. Obtain a retrieval entity.
Here, newly added retrieval data is preprocessed to extract a retrieval entity data set: the retrieval subject is extracted from the obtained retrieval information, such as the subject name, certificate number, contact information, type involved, location involved, organization involved, and other routine business information.
S3. Obtain, according to the knowledge graph, the entities associated with the retrieval entity to obtain a first entity set.
For example, as shown in FIG. 3, the retrieval entity is associated with the person entities, address entities, event entities, item entities, and organization entities in the knowledge graph, and, based on the association relationships established between entities, all entity information related to the retrieval entity is extracted to form the data set {e_1, e_2, ..., e_k}. For instance, a mobile phone number in the retrieval entity can be directly associated with an item entity; then, through the associations between that item entity and person, address, event, and organization entities, information such as the name and address of the person linked to the phone number is retrieved.
S4. Obtain, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold. Specifically:
S41. Calculate the posterior probability between the retrieval entity and each entity in the first entity set to obtain a posterior probability set;
S42. If the posterior probability of an entity in the first entity set is greater than the preset threshold, output that entity.
Here, using the constructed knowledge graph together with the correlation strength values between entities, which are substituted into the posterior probability calculation, the person, address, event, item, and organization entities related to the retrieval subject are extracted to form the data set {e_1, e_2, ..., e_k};
The posterior probability is calculated as:
Figure PCTCN2019099124-appb-000010
where C_i denotes retrieval entity i, k is the number of entities in the data set, and {e_1, e_2, ..., e_k} is the entity data set composed of person, address, event, item, and organization entities.
Entity data whose association probability is higher than the specified threshold is pushed in descending order of probability.
For example, in FIG. 3, X(e_i, e_k) = 10, X(e_i, e_j) = 100, X(e_i, e_n) = 10, X(e_i, e_m) = 100.
X(e_i, e_j) is then normalized using the formula:
X(e_i, e_j)′ = X(e_i, e_j) / max(X)
where max(X) is the maximum value in the second correlation strength value set. This gives:
X(e_i, e_k)′ = 0.1, X(e_i, e_j)′ = 1, X(e_i, e_n)′ = 0.1, X(e_i, e_m)′ = 1
Substituting these results into the posterior probability calculation gives:
Figure PCTCN2019099124-appb-000012
Similarly:
P(c_i|e_j) = 0.45, P(c_i|e_n) = 0.05, P(c_i|e_m) = 0.45
If entities with a probability higher than 0.3 are output, then entities e_j and e_m are output.
S5. Output the shortest path between the retrieval entity and the output entity.
For example, the shortest path e_i→e_j linking entities e_i and e_j is output, and the shortest path e_i→e_k→e_n→e_m linking entities e_i and e_m is output.
In summary, the method and terminal for obtaining associated information provided by the present disclosure construct a knowledge graph from massive first data, so that the data set associated with a retrieval entity can be quickly extracted from the massive data. This simplifies data retrieval for business personnel and improves their work efficiency, and the intelligent filtering of retrieved data improves the efficiency of obtaining associated information from massive data. Further, entities are extracted from the business data (that is, the first data) according to business needs, and the correlation strengths between the extracted entities are set according to business needs and stored directly, which helps improve retrieval efficiency and data accuracy for business personnel. Further, by taking the maximum correlation strength between any two entities as the effective correlation strength value, all entity data whose strength value is above the set threshold can be extracted, which avoids omitting key entity information. Further, the normalized correlation strength values fall in a fixed range (greater than or equal to 0 and less than or equal to 1), so business personnel can set the threshold easily. Further, with the posterior probability method, when new entity data is added to the database, the posterior probability values of existing entities are adjusted dynamically; in particular, when the retrieval entity has a large amount of associated entity data, this ensures that each extraction returns the entity data of relatively high importance. Further, outputting the shortest path shows the most direct connection between entities and helps business personnel understand how two entities are linked, so that they can decide whether to examine the entities on the linking path.
The above are merely embodiments of the present disclosure and do not limit its patent scope. Any equivalent transformation made using the content of the specification and drawings of the present disclosure, or any direct or indirect application in a related technical field, is likewise included within the scope of patent protection of the present disclosure.

Claims (10)

  1. A method for obtaining associated information, characterized by comprising:
    S1. constructing a knowledge graph according to preset first data;
    S2. obtaining a retrieval entity;
    S3. obtaining, according to the knowledge graph, entities associated with the retrieval entity to obtain a first entity set;
    S4. obtaining, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold.
  2. The method for obtaining associated information according to claim 1, characterized in that S1 is specifically:
    extracting entities from the first data to obtain a second entity set;
    setting a correlation strength value between each pair of entities in the second entity set that have an association relationship, to obtain a first correlation strength value set;
    calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set;
    constructing the knowledge graph according to the second entity set and the second correlation strength value set.
  3. The method for obtaining associated information according to claim 2, characterized in that calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain the second correlation strength value set is specifically:
    X(e_i, e_j) = max(a_{i→j})
    where e_i and e_j are any entities in the second entity set, a_{i→j} is the correlation strength value between any two nodes on the path connecting e_i and e_j, and X(e_i, e_j) is the correlation strength value between e_i and e_j.
  4. The method for obtaining associated information according to claim 2, characterized in that constructing the knowledge graph according to the second entity set and the second correlation strength value set is specifically:
    normalizing the second correlation strength value set to obtain a third correlation strength value set;
    constructing the knowledge graph according to the third correlation strength value set and the second entity set.
  5. The method for obtaining associated information according to claim 2, characterized in that S4 is specifically:
    calculating the posterior probability between the retrieval entity and each entity in the first entity set to obtain a posterior probability set;
    if the posterior probability of an entity in the first entity set is greater than the preset threshold, outputting that entity.
  6. The method for obtaining associated information according to claim 5, characterized by further comprising:
    outputting the shortest path between the retrieval entity and that entity.
  7. A computer-readable storage medium storing a program which, when executed by a computer, performs the method according to any one of claims 1-6.
  8. A terminal for obtaining associated information, characterized by comprising one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
    S1. constructing a knowledge graph according to preset first data;
    S2. obtaining a retrieval entity;
    S3. obtaining, according to the knowledge graph, entities associated with the retrieval entity to obtain a first entity set;
    S4. obtaining, from the first entity set, one or more entities whose correlation strength value with the retrieval entity is greater than a preset threshold.
  9. The terminal for obtaining associated information according to claim 8, characterized in that S1 is specifically:
    extracting entities from the first data to obtain a second entity set;
    setting a correlation strength value between each pair of entities in the second entity set that have an association relationship, to obtain a first correlation strength value set;
    calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set, to obtain a second correlation strength value set;
    constructing the knowledge graph according to the second entity set and the second correlation strength value set.
  10. The terminal for obtaining associated information according to claim 9, characterized in that calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain the second correlation strength value set is specifically:
    X(e_i, e_j) = max(a_{i→j})
    where e_i and e_j are any entities in the second entity set, a_{i→j} is the correlation strength value between any two nodes on the path connecting e_i and e_j, and X(e_i, e_j) is the correlation strength value between e_i and e_j;
    constructing the knowledge graph according to the second entity set and the second correlation strength value set is specifically: normalizing the second correlation strength value set to obtain a third correlation strength value set; and constructing the knowledge graph according to the third correlation strength value set and the second entity set;
    S4 is specifically:
    calculating the posterior probability between the retrieval entity and each entity in the first entity set to obtain a posterior probability set; and if the posterior probability of an entity in the first entity set is greater than the preset threshold, outputting that entity;
    outputting the shortest path between the retrieval entity and that entity.
PCT/CN2019/099124 2018-11-26 2019-08-02 Method and terminal for obtaining associated information WO2020107929A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811420058.6 2018-11-26
CN201811420058.6A CN109739992B (en) 2018-11-26 2018-11-26 Method and terminal for acquiring associated information

Publications (1)

Publication Number Publication Date
WO2020107929A1 true WO2020107929A1 (en) 2020-06-04

Family

ID=66358734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099124 WO2020107929A1 (en) 2018-11-26 2019-08-02 Method and terminal for obtaining associated information

Country Status (2)

Country Link
CN (1) CN109739992B (en)
WO (1) WO2020107929A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739992B (en) * 2018-11-26 2021-06-11 厦门市美亚柏科信息股份有限公司 Method and terminal for acquiring associated information
CN110504028A (en) * 2019-08-22 2019-11-26 上海软中信息系统咨询有限公司 A kind of disease way of inquisition, device, system, computer equipment and storage medium
CN113496332B (en) * 2020-04-02 2024-01-26 中国电信股份有限公司 Industrial Internet fault prediction method and system
CN111831833A (en) * 2020-07-27 2020-10-27 人民卫生电子音像出版社有限公司 Knowledge graph construction method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060242113A1 (en) * 2005-04-20 2006-10-26 Kumar Anand Cybernetic search with knowledge maps
CN106874695A (en) * 2017-03-22 2017-06-20 北京大数医达科技有限公司 The construction method and device of medical knowledge collection of illustrative plates
CN107145744A (en) * 2017-05-08 2017-09-08 合肥工业大学 Construction method, device and the aided diagnosis method of medical knowledge collection of illustrative plates
CN108875053A (en) * 2018-06-28 2018-11-23 国信优易数据有限公司 A kind of knowledge mapping data processing method and device
CN109739992A (en) * 2018-11-26 2019-05-10 厦门市美亚柏科信息股份有限公司 A kind of method and terminal obtaining related information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010124137A1 (en) * 2009-04-22 2010-10-28 Millennium Pharmacy Systems, Inc. Pharmacy management and administration with bedside real-time medical event data collection
CN107247881B (en) * 2017-06-20 2020-04-28 北京大数医达科技有限公司 Multi-mode intelligent analysis method and system
CN108052636B (en) * 2017-12-20 2022-02-25 北京工业大学 Method and device for determining text theme correlation degree and terminal equipment

Also Published As

Publication number Publication date
CN109739992A (en) 2019-05-10
CN109739992B (en) 2021-06-11

Similar Documents

Publication Publication Date Title
WO2020107929A1 (en) Method and terminal for obtaining associated information
CN109635117B (en) Method and device for recognizing user intention based on knowledge graph
US20120278263A1 (en) Cost-sensitive alternating decision trees for record linkage
Chanial et al. Connectionlens: Finding connections across heterogeneous data sources
WO2019085064A1 (en) Medical claim denial determination method, device, terminal apparatus, and storage medium
WO2019196226A1 (en) System information querying method and apparatus, computer device, and storage medium
WO2019001429A1 (en) Multisource data fusion method and apparatus
CN112116331A (en) Talent recommendation method and device
CA2868540C (en) Entity resolution from documents
US20150261837A1 (en) Querying Structured And Unstructured Databases
CN111752922A (en) Method and device for establishing knowledge database and realizing knowledge query
CN111984797A (en) Customer identity recognition device and method
JP2015011723A (en) Information processing method, information processing apparatus, organization name standardizing method, and organization name standardizing apparatus
CN113204644B (en) Government affair encyclopedia construction method based on knowledge graph
US10445061B1 (en) Matching entities during data migration
Kalokyri et al. Integration and exploration of connected personal digital traces
US10838973B2 (en) Processing datasets of varying schemas from tenants
US11880377B1 (en) Systems and methods for entity resolution
KR101752259B1 (en) High value-added content management device and method and recording medium storing program for executing the same and recording medium storing program for executing the same
CN114416848A (en) Data blood relationship processing method and device based on data warehouse
US10997248B2 (en) Data association using complete lists
US20150074132A1 (en) Methods and systems for inmate searching
US11842285B2 (en) Graph database implemented knowledge mesh
EP2929473A1 (en) System and method for determining by an external entity the human hierarchial structure of an organization, using public social networks
CN112163485B (en) Data processing method and device, database system and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19888819

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19888819

Country of ref document: EP

Kind code of ref document: A1