WO2020107929A1

WO2020107929A1 - Method and terminal for obtaining associated information

Info

Publication number: WO2020107929A1
Application number: PCT/CN2019/099124
Authority: WO
Inventors: 陈捷; 吴春德; 林世国; 栾江霞; 吴鸿伟; 吴文
Original assignee: 厦门市美亚柏科信息股份有限公司
Priority date: 2018-11-26
Filing date: 2019-08-02
Publication date: 2020-06-04
Also published as: CN109739992A; CN109739992B

Abstract

The present disclosure relates to the technical field of data processing, and in particular to a method and terminal for obtaining associated information. In the present disclosure, a knowledge graph is constructed according to preset first data; a retrieval entity is obtained; entities associated with the retrieval entity are obtained according to the knowledge graph to obtain a first entity set; and more than one entity whose association strength value with the retrieval entity is greater than a preset threshold is obtained from the first entity set. The efficiency of obtaining associated information from a massive amount of data is thereby improved.

Description

Method and terminal for obtaining related information

Related application

This application claims the priority of the Chinese patent application with the application number 201811420058.6 filed on November 26, 2018. The entire content of this application is incorporated herein by reference.

Technical field

The present disclosure relates to the field of data processing technology, and in particular, to a method and terminal for acquiring related information.

Background

In many daily operations, a large amount of event recording data will be generated. Most of these massive data are text data, and also contain some table data, which are often distributed and stored in structured and unstructured databases. According to the traditional method, when business personnel call data, they need to go to different systems for query and retrieval, and then manually establish the relationship between the data, which takes time and effort.

Summary

The technical problem to be solved by the present disclosure is: how to improve the efficiency of obtaining related information from massive data.

In order to solve the above technical problems, the technical solutions adopted by the present disclosure are:

The present disclosure provides a method for obtaining associated information, including:

S1. Construct a knowledge graph based on preset first data;

S2. Obtain the retrieval entity;

S3. Acquire entities associated with the retrieval entity according to the knowledge graph to obtain a first entity set;

S4. Obtain, from the first entity set, more than one entity whose association strength value with the retrieval entity is greater than a preset threshold.

Further, the S1 is specifically:

Extracting entities from the first data to obtain a second entity set;

Setting a correlation strength value between two entities having an association relationship in the second entity set to obtain a first correlation strength value set;

Calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set;

Construct a knowledge graph according to the second entity set and the second correlation strength value set.

Further, calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set specifically includes:

$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.

Further, constructing a knowledge graph based on the second entity set and the second correlation strength value set specifically includes:

Normalizing the second set of correlation strength values to obtain a third set of correlation strength values;

Construct a knowledge graph according to the third set of association strength values and the second set of entities.

Further, the S4 is specifically:

Calculating the posterior probability of each entity in the retrieval entity and the first entity set to obtain a posterior probability set;

If the posterior probability of an entity in the first entity set is greater than a preset threshold, the entity is output.

Further, it also includes:

The shortest path between the retrieval entity and the entity is output.

The present disclosure also provides a computer-readable storage medium on which a program is stored, and when the program is executed by a computer, the method for acquiring associated information is executed.

The present disclosure further provides a terminal for acquiring associated information, including one or more processors and a memory, the memory stores a program, and is configured to be executed by the one or more processors to perform the following steps:

S1. Construct a knowledge graph based on preset first data;

S2. Obtain the retrieval entity;

Further, the S1 is specifically:

Extracting entities from the first data to obtain a second entity set;

$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$;

Constructing a knowledge graph according to the second entity set and the second correlation strength value set, specifically: normalizing the second correlation strength value set to obtain a third correlation strength value set; constructing a knowledge graph according to the third correlation strength value set and the second entity set;

The S4 is specifically:

Calculating the posterior probability between the retrieval entity and each entity in the first entity set to obtain a posterior probability set; if the posterior probability of an entity in the first entity set is greater than a preset threshold, the entity is output;

The shortest path between the retrieval entity and the entity is output.

The beneficial effect of the present disclosure is as follows: by constructing a knowledge graph based on massive first data, the present disclosure rapidly extracts the data set associated with the retrieval entity from massive data, which simplifies the data retrieval process for business personnel and improves their work efficiency; at the same time, retrieving data through intelligent filtering improves the efficiency of obtaining associated information from massive data.

Brief description of the drawings

FIG. 1 is a flowchart of a specific implementation of the method for obtaining associated information provided by the present disclosure;

FIG. 2 is a structural block diagram of a specific implementation of the terminal for acquiring associated information provided by the present disclosure;

FIG. 3 is an example diagram of the relationship between the retrieval entity and the entity in the knowledge graph;

Label description:

1. Processor; 2. Memory.

Detailed description

In order to explain in detail the technical content of the present disclosure, the objectives and effects achieved, the following will be described in conjunction with the embodiments and accompanying drawings.

Please refer to FIG. 1 to FIG. 3.

As shown in FIG. 1, the present disclosure provides a method for obtaining associated information, including:

S1. Construct a knowledge graph based on preset first data;

S2. Obtain the retrieval entity;

Further, the S1 is specifically:

Extracting entities from the first data to obtain a second entity set;

As can be seen from the above description, extracting entities from business data (i.e., the first data) according to business needs, and setting and directly storing the association strength between the extracted entities according to business needs, helps improve the efficiency and data accuracy of searches by business personnel.

$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.

As can be seen from the above description, by selecting the maximum correlation strength between any two entities as the effective correlation strength value, the entity data whose strength value is higher than the set threshold can be effectively extracted to avoid the omission of key entity information.

As can be seen from the above description, the normalized correlation strength value has a fixed value range (greater than or equal to 0 and less than or equal to 1), and business personnel can easily set the threshold.

Further, the S4 is specifically:

It can be seen from the above description that, with the posterior probability method, when new entity data is stored in the database, the posterior probability values of the existing entities are dynamically adjusted accordingly; especially when a large amount of data is associated with the retrieval entity, this ensures that the entity data extracted each time is of relatively high importance.

Further, it also includes:

The shortest path between the retrieval entity and the entity is output.

It can be seen from the above description that outputting the shortest path reveals the most direct connection between entities, which helps business personnel understand how two entities are linked and decide whether to view the entities on the connecting path.

As shown in FIG. 2, the present disclosure further provides a terminal for acquiring associated information, including one or more processors 1 and a memory 2, wherein the memory 2 stores a program configured to be executed by the one or more processors 1 to perform the following steps:

S1. Construct a knowledge graph based on preset first data;

S2. Obtain the retrieval entity;

Further, the S1 is specifically:

Extracting entities from the first data to obtain a second entity set;

$$X(e_i, e_j) = \max(a_{i \to j})$$

The S4 is specifically:

The shortest path between the retrieval entity and the entity is output.

The first embodiment of the present disclosure is:

This embodiment provides a method for obtaining associated information, including:

S1. Construct a knowledge graph according to preset first data.

Wherein, the first data is the daily record data of the business department. In many daily operations, a large amount of event recording data will be generated. Most of these massive data are text data, and also contain some table data, which are often distributed and stored in structured and unstructured databases.

S11. Extract entities from the first data to obtain a second entity set.

Among them, the person entity, address entity, event entity, item entity and organization entity are extracted from the daily record data of the business department through technologies such as rule matching, OCR recognition, and natural language analysis.

In this embodiment, the person entity includes the person identity information of the person entity and its associated person indicated in the business record, such as name, ID number, gender, blood type, etc.;

The address entity includes address information of companies, group organizations, individuals, etc. involved in the event record, such as the registered address, office address, individual's household registration address, temporary residence address, etc. of the enterprise;

The event entity includes information required for event description such as event type, event date, and event content in the event record;

The item entity includes identification information of items such as mobile phones, computers, and vehicles included in the event record, such as mobile phone numbers, MAC addresses of computers, and license plate numbers;

The organization entity includes information such as the organization name, type, scale, and scope of activities in the event record.

Use rule matching technology to extract subject data from event record data with a standardized format, such as travel records and document application materials;

Use OCR recognition technology to extract subject data from the picture data in the event record, such as license plate information and business licenses. The format of such picture data is relatively fixed, so it can be recognized by a pre-trained OCR recognition model;

Use natural language processing technology to extract subject data, such as event description information, from text data with weak format specifications in event records.
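As an illustration of the rule matching step only, the following minimal Python sketch pulls item identifiers out of a text record with regular expressions. The patterns, the `extract_item_entities` helper, and the sample record are hypothetical and are not taken from the disclosure; a real system would use rules tuned to its own record formats.

```python
import re

# Hypothetical rules for two kinds of item identifiers (illustrative only).
PATTERNS = {
    # An 11-digit mobile number starting with 1, not embedded in a longer digit run.
    "mobile_number": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # A colon-separated MAC address, case-insensitive.
    "mac_address": re.compile(r"(?i)\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b"),
}


def extract_item_entities(text):
    """Return (entity_type, value) pairs for every rule match in a text record."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(0)))
    return found


# Made-up record used only to exercise the rules.
record = "Caller 13812345678 connected from device 00:1A:2B:3C:4D:5E."
entities = extract_item_entities(record)
```

Each extracted pair would then become a candidate member of the second entity set.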

S12. Set a correlation strength value between two entities having an association relationship in the second entity set to obtain a first correlation strength value set.

Specifically, person entities, address entities, event entities, item entities, and organization entities are associated with one another according to their relationships in the event record, and the correlation strength value between entities is set by business personnel according to their degree of closeness to the event; the score ranges from 0 to 100.

S13. Calculate the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set; specifically:

$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.

That is, the correlation strength value between any two entities is the maximum correlation strength value found on the path between them.
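The rule in S13 can be sketched in Python as follows. The text does not say which path is used when several paths connect two entities, so this sketch assumes the path with the fewest hops, found by breadth-first search. The graph layout follows the FIG. 3 paths described in this embodiment, but the edge values for e_k-e_n and e_n-e_m are assumptions chosen so that the results agree with the FIG. 3 values quoted in the worked example; all function names are hypothetical.

```python
from collections import deque

# Directly set association strengths (0-100) between related entities.
EDGES = {
    ("e_i", "e_j"): 100,
    ("e_i", "e_k"): 10,
    ("e_k", "e_n"): 5,    # assumed value, not given in the text
    ("e_n", "e_m"): 100,  # assumed value, not given in the text
}


def build_adjacency(edges):
    """Turn undirected weighted edges into node -> {neighbor: weight}."""
    adj = {}
    for (u, v), weight in edges.items():
        adj.setdefault(u, {})[v] = weight
        adj.setdefault(v, {})[u] = weight
    return adj


def connection_path(adj, src, dst):
    """Fewest-hop path from src to dst via breadth-first search, or None."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adj.get(node, {}):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def association_strength(adj, src, dst):
    """X(src, dst): the largest edge strength along the connecting path."""
    path = connection_path(adj, src, dst)
    if path is None or len(path) < 2:
        return None
    return max(adj[a][b] for a, b in zip(path, path[1:]))


adj = build_adjacency(EDGES)
strengths = {
    target: association_strength(adj, "e_i", target)
    for target in ("e_j", "e_k", "e_n", "e_m")
}
```

Under these assumed edge values, `strengths` holds 100 for e_j, 10 for e_k, 10 for e_n, and 100 for e_m.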

S14. Normalize the second set of correlation strength values to obtain a third set of correlation strength values.

The formula for normalizing $X(e_i, e_j)$ is:

$$X(e_i, e_j)' = \frac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$$

That is, $X(e_i, e_j)$ is divided by the maximum value in the second set of correlation strength values.
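A minimal Python sketch of this normalization, using the four FIG. 3 values from this embodiment as input. The `normalize` helper and the dictionary layout are illustrative assumptions.

```python
def normalize(strengths):
    """Divide every association strength by the largest value in the set."""
    peak = max(strengths.values())
    if peak == 0:
        # Degenerate case: no association at all, so keep everything at zero.
        return {pair: 0.0 for pair in strengths}
    return {pair: value / peak for pair, value in strengths.items()}


# Second set of correlation strength values (FIG. 3 example values).
second_set = {
    ("e_i", "e_k"): 10,
    ("e_i", "e_j"): 100,
    ("e_i", "e_n"): 10,
    ("e_i", "e_m"): 100,
}
third_set = normalize(second_set)
```

Every value in `third_set` falls between 0 and 1, which is what lets business personnel pick a threshold on a fixed scale.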

S15. Construct a knowledge graph according to the third set of association strength values and the second set of entities.

Specifically, the established entities and the correlation strength values between them are stored in a graph database such as Neo4j or Titan to construct the knowledge graph. The knowledge graph was first used mainly by Google in semantic search to improve search results, and is now also used in chatbots, intelligent question answering systems, medical services, book information services, and other fields. Data in a knowledge graph can be expressed as triples of the form entity 1-relationship-entity 2, in which the entity is the most basic element of the knowledge graph and describes a fact, and different entities are connected by different relationships. If entities are regarded as nodes and the relationships between them as edges, a knowledge graph containing a large number of triples becomes a huge network that connects different types of information into a relationship network and makes it possible to analyze problems from the perspective of relationships. By applying knowledge graph technology to big data, massive amounts of heterogeneous data can be fused and the association relationships between object data constructed, so that business personnel can quickly query, analyze, and mine relationships across the full data set, improving work efficiency.
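To make the triple representation concrete, the following Python sketch indexes a few triples as an in-memory adjacency map. It stands in for a graph database purely for illustration and does not use any Neo4j or Titan API. The entity names, relationship labels, and strength values are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (entity 1, relationship, entity 2, normalized strength).
TRIPLES = [
    ("person:P1", "owns", "item:phone-1", 1.0),
    ("person:P1", "lives_at", "address:A1", 0.6),
    ("item:phone-1", "appears_in", "event:E1", 0.3),
]


def build_graph(triples):
    """Index triples as node -> {neighbor: (relationship, strength)}."""
    graph = defaultdict(dict)
    for head, relation, tail, strength in triples:
        # Store both directions so the graph can be walked from either entity.
        graph[head][tail] = (relation, strength)
        graph[tail][head] = (relation, strength)
    return graph


graph = build_graph(TRIPLES)
neighbors_of_phone = sorted(graph["item:phone-1"])
```

Looking up a node returns its directly related entities together with the relationship and strength on each edge.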

S2. Obtain the retrieval entity.

Specifically, new search data is preprocessed to extract the retrieval entity data set; that is, the retrieval subject is extracted from the obtained search information, such as subject name, certificate number, contact information, type, location, organization, and other daily business information.

S3. Acquire entities associated with the retrieval entity according to the knowledge graph to obtain a first entity set.

For example, as shown in FIG. 3, the retrieval entity is associated with person entities, address entities, event entities, item entities, and organization entities in the knowledge graph, and all entity information related to the retrieval entity is extracted according to the established relationships between entities, forming a data set $\{e_1, e_2, \ldots, e_k\}$. For example, a mobile phone number in the retrieval entity can be directly associated with an item entity, and the name, address, and other information linked to that mobile phone number are then obtained through the associations established between the item entity and the person, address, event, and organization entities.
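One way to read step S3 is as a traversal that collects every entity reachable from the retrieval entity. The Python sketch below does this with breadth-first search. The adjacency map mirrors the FIG. 3 paths described in this embodiment, while `e_x` is a hypothetical unrelated entity added to show that unconnected nodes are left out; the function name is also an assumption.

```python
from collections import deque

# Adjacency lists for the example graph; e_x is a hypothetical isolated entity.
ADJ = {
    "e_i": ["e_j", "e_k"],
    "e_j": ["e_i"],
    "e_k": ["e_i", "e_n"],
    "e_n": ["e_k", "e_m"],
    "e_m": ["e_n"],
    "e_x": [],
}


def associated_entities(adj, retrieval_entity):
    """Collect every entity reachable from the retrieval entity, nearest first."""
    seen = {retrieval_entity}
    ordered = []
    queue = deque([retrieval_entity])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
                queue.append(neighbor)
    return ordered


first_entity_set = associated_entities(ADJ, "e_i")
```

Here `first_entity_set` contains e_j, e_k, e_n, and e_m, and omits the isolated e_x.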

S4. Acquire, from the first entity set, more than one entity whose correlation strength value with the retrieval entity is greater than a preset threshold. Specifically:

S41. Calculate the posterior probability of each entity in the retrieval entity and the first entity set to obtain a posterior probability set;

S42. If the posterior probability of an entity in the first entity set is greater than a preset threshold, output the entity.

Specifically, the constructed knowledge graph is combined with the correlation strength values between entities and substituted into the posterior probability calculation, and the person entities, address entities, event entities, item entities, and organization entities related to the retrieval subject are extracted to form the data set $\{e_1, e_2, \ldots, e_k\}$;

The formula for calculating the posterior probability is:

where $C_i$ represents retrieval entity $i$, $k$ is the number of entities in the data set, and $\{e_1, e_2, \ldots, e_k\}$ is the data set composed of person entities, address entities, event entities, item entities, and organization entities.

Push entity data with an association probability higher than the specified threshold in order from high to low.

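For example, in FIG. 3, $X(e_i, e_k) = 10$, $X(e_i, e_j) = 100$, $X(e_i, e_n) = 10$, and $X(e_i, e_m) = 100$.

Each value is then normalized by dividing it by the maximum value in the second set of correlation strength values, here 100:

$$X(e_i, e_j)' = \frac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$$

This gives:

$X(e_i, e_k)' = 0.1$, $X(e_i, e_j)' = 1$, $X(e_i, e_n)' = 0.1$, $X(e_i, e_m)' = 1$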

Substituting the above results into the posterior probability calculation, we can get:

Similarly, we can get:

$P(c_i \mid e_j) = 0.45$, $P(c_i \mid e_n) = 0.05$, $P(c_i \mid e_m) = 0.45$
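The posterior probability formula itself is not reproduced in this text. The quoted results are, however, numerically consistent with dividing each normalized strength by the sum of the normalized strengths of all entities in the first entity set. The Python sketch below checks that arithmetic under this assumption only; it is not confirmed by the source to be the disclosed formula, and `posterior_by_share` is a hypothetical name.

```python
def posterior_by_share(normalized):
    """Assumed rule: each entity's share of the total normalized strength."""
    total = sum(normalized.values())
    return {entity: value / total for entity, value in normalized.items()}


# Normalized strengths between e_i and each associated entity (FIG. 3 example).
normalized = {"e_k": 0.1, "e_j": 1.0, "e_n": 0.1, "e_m": 1.0}
posterior = posterior_by_share(normalized)
rounded = {entity: round(p, 2) for entity, p in posterior.items()}
```

Rounded to two decimals, this yields 0.45 for e_j and e_m and 0.05 for e_n, matching the figures above. The same rule would give 0.05 for e_k, a value this text does not state.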

If entities with a probability higher than 0.3 are to be output, entities $e_j$ and $e_m$ are output.
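The thresholding in S42, together with the high-to-low ordering described above, can be sketched in Python as follows. The function name is hypothetical, and the input reuses the posterior values quoted in this example.

```python
def select_entities(posterior, threshold):
    """Keep entities whose probability exceeds the threshold, highest first."""
    kept = [(entity, p) for entity, p in posterior.items() if p > threshold]
    # sorted() is stable, so entities with equal probability keep input order.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Posterior values quoted in the example.
posterior = {"e_j": 0.45, "e_n": 0.05, "e_m": 0.45}
selected = select_entities(posterior, 0.3)
selected_names = [entity for entity, _ in selected]
```

With the threshold at 0.3, only e_j and e_m remain.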

S5. Output the shortest path between the retrieval entity and each output entity.

For example, the shortest path output between $e_i$ and its associated entity $e_j$ is $e_i \to e_j$, and the shortest path output between $e_i$ and its associated entity $e_m$ is $e_i \to e_k \to e_n \to e_m$.
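A minimal Python sketch of the shortest-path output, assuming that "shortest" means fewest hops in an unweighted graph (the text does not say whether strengths affect path length). The adjacency map mirrors the FIG. 3 paths in this example, and the function name is hypothetical.

```python
from collections import deque

# Adjacency lists matching the FIG. 3 paths described in the example.
ADJ = {
    "e_i": ["e_j", "e_k"],
    "e_j": ["e_i"],
    "e_k": ["e_i", "e_n"],
    "e_n": ["e_k", "e_m"],
    "e_m": ["e_n"],
}


def shortest_path(adj, src, dst):
    """Fewest-hop path from src to dst via breadth-first search, or None."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adj.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


path_to_j = shortest_path(ADJ, "e_i", "e_j")
path_to_m = shortest_path(ADJ, "e_i", "e_m")
printable = " → ".join(path_to_m)
```

`printable` is the string form of the path from e_i to e_m that would be shown to business personnel.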

In summary, the method for obtaining associated information provided by this embodiment can, by constructing a knowledge graph, quickly extract the data set associated with the retrieval subject, which simplifies the data retrieval process for business personnel and improves their work efficiency; at the same time, retrieving data through intelligent filtering improves the efficiency of data queries.

The second embodiment of the present disclosure is:

This embodiment provides a computer-readable storage medium on which a program is stored; when executed by a computer, the program performs the following steps:

S1. Construct a knowledge graph according to preset first data.

S11. Extract entities from the first data to obtain a second entity set.

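$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.

The formula for normalizing $X(e_i, e_j)$, that is, dividing it by the maximum value in the second set of correlation strength values, is:

$$X(e_i, e_j)' = \frac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$$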

S2. Obtain the retrieval entity.

S4. Obtain, from the first entity set, more than one entity whose association strength value with the retrieval entity is greater than a preset threshold. Specifically:

The formula for calculating the posterior probability is:

Then X(e _i , e _j ) is normalized, the formula is:

You get:

Similarly, we can get:

$P(c_i \mid e_j) = 0.45$, $P(c_i \mid e_n) = 0.05$, $P(c_i \mid e_m) = 0.45$

S5. Output the shortest path between the retrieval entity and each output entity.

The third embodiment of the present disclosure is:

This embodiment provides a terminal for acquiring associated information, including one or more processors and a memory, the memory stores a program, and is configured to be executed by the one or more processors to perform the following steps:

S1. Construct a knowledge graph according to preset first data.

S11. Extract entities from the first data to obtain a second entity set.

Use OCR recognition technology to extract subject data from the picture data in the event records, such as license plate information, business licenses, etc. The format of these picture data is relatively fixed, and can be recognized by the pre-trained OCR recognition model;

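$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the correlation strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the correlation strength value between entities $e_i$ and $e_j$.

The formula for normalizing $X(e_i, e_j)$, that is, dividing it by the maximum value in the second set of correlation strength values, is:

$$X(e_i, e_j)' = \frac{X(e_i, e_j)}{\max_{p,q} X(e_p, e_q)}$$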

S2. Obtain the retrieval entity.

The formula for calculating the posterior probability is:

Then X(e _i , e _j ) is normalized, the formula is:

You get:

Similarly, we can get:

$P(c_i \mid e_j) = 0.45$, $P(c_i \mid e_n) = 0.05$, $P(c_i \mid e_m) = 0.45$

S5. Output the shortest path between the retrieval entity and each output entity.

In summary, the method and terminal for obtaining associated information provided by the present disclosure construct a knowledge graph based on massive first data and thereby rapidly extract the data set associated with the retrieval entity from massive data, which simplifies the data retrieval process for business personnel and improves their work efficiency; at the same time, retrieving data through intelligent filtering improves the efficiency of obtaining associated information from massive data. Further, extracting entities from the business data (i.e., the first data) according to business requirements, and setting and directly storing the association strength between the extracted entities according to business requirements, helps improve the efficiency and data accuracy of searches by business personnel. Further, by selecting the maximum correlation strength between any two entities as the effective correlation strength value, entity data whose strength value is higher than the set threshold can be effectively extracted, avoiding the omission of key entity information. Further, the normalized correlation strength value has a fixed value range (greater than or equal to 0 and less than or equal to 1), so business personnel can conveniently set the threshold. Further, with the posterior probability method, when new entity data is stored in the database, the posterior probability values of the existing entities are dynamically adjusted accordingly; especially when a large amount of entity data is associated with the retrieval entity, this ensures that each extraction returns relatively important entity data. Further, outputting the shortest path reveals the most direct connection between entities, which helps business personnel understand how two entities are linked and decide whether to view the entities on the connecting path.

The above is only an embodiment of the present disclosure and does not limit its patent scope. Any equivalent transformation made using the specification and drawings of the present disclosure, whether applied directly or indirectly in related technical fields, is likewise included within the scope of patent protection of the present disclosure.

Claims

A method for obtaining associated information, characterized in that it includes:

S1. Construct a knowledge graph based on preset first data;

S2. Obtain the retrieval entity;

S3. Acquire entities associated with the retrieval entity according to the knowledge graph to obtain a first entity set;

S4. Obtain, from the first entity set, more than one entity whose association strength value with the retrieval entity is greater than a preset threshold.
The method for obtaining associated information according to claim 1, wherein the S1 is specifically:

Extracting entities from the first data to obtain a second entity set;

Setting a correlation strength value between two entities having an association relationship in the second entity set to obtain a first correlation strength value set;

Calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set;

Construct a knowledge graph according to the second entity set and the second correlation strength value set.
The method for obtaining association information according to claim 2, wherein the association strength value between any two entities in the second entity set is calculated according to the first association strength value set to obtain a second association strength value set, specifically:

$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the association strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the association strength value between entities $e_i$ and $e_j$.
The method for obtaining association information according to claim 2, wherein the knowledge graph is constructed based on the second entity set and the second association strength value set, specifically:

Normalizing the second set of correlation strength values to obtain a third set of correlation strength values;

Construct a knowledge graph according to the third set of association strength values and the second set of entities.
The method for obtaining associated information according to claim 2, wherein the S4 is specifically:

Calculating the posterior probability of each entity in the retrieval entity and the first entity set to obtain a posterior probability set;

If the posterior probability of an entity in the first entity set is greater than a preset threshold, the entity is output.
The method for obtaining associated information according to claim 5, further comprising:

The shortest path between the retrieval entity and the entity is output.
A computer-readable storage medium having stored thereon a program which when executed by a computer performs the method according to any one of claims 1-6.
A terminal for acquiring associated information is characterized by including one or more processors and a memory, the memory stores a program, and is configured to be executed by the one or more processors to perform the following steps:

S1. Construct a knowledge graph based on preset first data;

S2. Obtain the retrieval entity;

S3. Acquire entities associated with the retrieval entity according to the knowledge graph to obtain a first entity set;

S4. Obtain, from the first entity set, more than one entity whose association strength value with the retrieval entity is greater than a preset threshold.
The terminal for acquiring associated information according to claim 8, wherein the S1 is specifically:

Extracting entities from the first data to obtain a second entity set;

Setting a correlation strength value between two entities having an association relationship in the second entity set to obtain a first correlation strength value set;

Calculating the correlation strength value between any two entities in the second entity set according to the first correlation strength value set to obtain a second correlation strength value set;

Construct a knowledge graph according to the second entity set and the second correlation strength value set.
The terminal for acquiring association information according to claim 9, wherein the association strength value between any two entities in the second entity set is calculated according to the first association strength value set to obtain a second association strength value set, specifically:

$$X(e_i, e_j) = \max(a_{i \to j})$$

where $e_i$ and $e_j$ are any two entities in the second entity set, $a_{i \to j}$ is the association strength value between any two nodes on the path connecting $e_i$ and $e_j$, and $X(e_i, e_j)$ is the association strength value between entities $e_i$ and $e_j$;

Constructing a knowledge graph according to the second entity set and the second association strength value set, specifically: normalizing the second association strength value set to obtain a third association strength value set; constructing a knowledge graph according to the third association strength value set and the second entity set;

The S4 is specifically:

Calculating the posterior probability between the retrieval entity and each entity in the first entity set to obtain a posterior probability set; if the posterior probability of an entity in the first entity set is greater than a preset threshold, the entity is output;

The shortest path between the retrieval entity and the entity is output.