CN108829858B

CN108829858B - Data query method and device and computer readable storage medium

Info

Publication number: CN108829858B
Application number: CN201810647001.3A
Authority: CN
Inventors: 黄正元; 龚杰; 孙俊; 李伟
Original assignee: JD Digital Technology Holdings Co Ltd
Current assignee: JD Digital Technology Holdings Co Ltd; Jingdong Technology Holding Co Ltd
Priority date: 2018-06-22
Filing date: 2018-06-22
Publication date: 2021-09-17
Anticipated expiration: 2038-06-22
Also published as: CN108829858A

Abstract

The disclosure provides a data query method, a data query device and a computer readable storage medium, and relates to the technical field of data processing. The data query method comprises the following steps: receiving query information input by a user; carrying out natural language processing on query information input by a user, and analyzing to obtain query keywords; determining a query entity corresponding to a query keyword in a knowledge graph; the data chains in the knowledge-graph associated with the query entity are returned to the user. According to the method and the device, the relevant data chain in the knowledge graph is returned to the user after the user inputs the query information, and the upstream and downstream incidence relation of the relevant data information can be fully displayed for the user, so that the range of data analysis of the user is expanded, and the more comprehensive data analysis of the user is facilitated.

Description

Data query method and device and computer readable storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a data query method and apparatus, and a computer-readable storage medium.

Background

With the rapid development of the internet, information is found in abundance in unstructured text, in large numbers of semi-structured tables and web pages, and in the structured data of production systems. The financial industry is far more sensitive to news events than other industries and therefore relies heavily on the accuracy, comprehensiveness, and timeliness of news. If more data and information can be delivered to financial industry users in time, and the intricate relationships within that data can be analyzed, the users' working efficiency can be greatly improved, helping them make the most accurate judgments and obtain the maximum return.

Financial industry users often need to go through several steps before making a final investment or decision. First, data is collected and processed manually: data is gathered from various news sources (such as search engines, news portals, government announcements, and forums) to obtain a large amount of news information, and data that is of no use to the user is filtered out based on the financial user's personal experience. Next, the data is analyzed, summarized, and classified: after acquiring the information, the user extracts the key points of each information event, summarizes and classifies them, and either writes them up as a document or report or stores them in a structured data system (such as Excel or a MySQL database). Because the amount of data is very large, the stored result may omit some information present in the original data for the sake of search convenience and query efficiency. Finally, a decision is made: once the data is classified, the user analyzes it, judges the impact of a news event on the investment target and the scope of that impact, and then makes a final decision based on the analysis results.

Disclosure of Invention

One technical problem that this disclosure solves is how to achieve a more comprehensive data analysis.

According to an aspect of an embodiment of the present disclosure, there is provided a data query method including: receiving query information input by a user; carrying out natural language processing on query information input by a user, and analyzing to obtain query keywords; determining a query entity corresponding to a query keyword in a knowledge graph; the data chains in the knowledge-graph associated with the query entity are returned to the user.

In some embodiments, determining a query entity corresponding to the query keyword in the knowledge-graph comprises: determining an initial query entity corresponding to a query keyword in a knowledge graph; returning to the user the data chains in the knowledge-graph associated with the query entity includes: and returning a data chain which takes the initial query entity as an initial node in the knowledge graph to the user.

In some embodiments, determining a query entity corresponding to the query keyword in the knowledge-graph comprises: determining a starting query entity and an ending query entity corresponding to a query keyword in a knowledge graph; returning to the user the data chains in the knowledge-graph associated with the query entity includes: and returning a data chain which takes the initial query entity as an initial node and takes the end query entity as an end node in the knowledge graph to the user.

In some embodiments, the data chains associated with the query entity in the knowledge-graph are returned to the user in descending order of the number of relationships contained in the data chains.

In some embodiments, the data query method further comprises: collecting database data; carrying out natural language processing on database data, and extracting entities and relations among the entities; and generating a knowledge graph according to the entities and the relationship among the entities.

In some embodiments, natural language processing the database data to extract entities and relationships between entities includes: extracting data keywords from database data by using a natural language processing model; taking the data key words with the word frequency-reverse document frequency higher than a first threshold value as entities; relationships between entities are extracted from database data using a natural language processing model.

In some embodiments, collecting database data comprises: configuring a data source address list, a starting page number, an ending page number and acquisition time; according to the acquisition time, news data determined by a data source address list, a starting page number and an ending page number are automatically extracted; and analyzing to obtain the title and text data in the news data, and storing the title and text data in the database.

According to another aspect of the embodiments of the present disclosure, there is provided a data query apparatus including: the information receiving module is configured to receive query information input by a user; the keyword analysis module is configured to perform natural language processing on query information input by a user and analyze the query information to obtain a query keyword; an entity determination module configured to determine a query entity corresponding to the query keyword in the knowledge graph; a data return module configured to return to the user a data chain in the knowledge-graph associated with the query entity.

In some embodiments, the entity determination module is configured to: determining an initial query entity corresponding to a query keyword in a knowledge graph; the data return module is configured to: and returning a data chain which takes the initial query entity as an initial node in the knowledge graph to the user.

In some embodiments, the entity determination module is configured to: determining a starting query entity and an ending query entity corresponding to a query keyword in a knowledge graph; the data return module is configured to: and returning a data chain which takes the initial query entity as an initial node and takes the end query entity as an end node in the knowledge graph to the user.

In some embodiments, the data return module is configured to: return the data chains associated with the query entity in the knowledge graph to the user in descending order of the number of relationships contained in the data chains.

In some embodiments, the data querying device further comprises a knowledge-graph generation module configured to: collecting database data; carrying out natural language processing on database data, and extracting entities and relations among the entities; and generating a knowledge graph according to the entities and the relationship among the entities.

In some embodiments, the knowledge-graph generation module is configured to: extracting data keywords from database data by using a natural language processing model; taking the data key words with the word frequency-reverse document frequency higher than a first threshold value as entities; relationships between entities are extracted from database data using a natural language processing model.

In some embodiments, the knowledge-graph generation module is configured to: configuring a data source address list, a starting page number, an ending page number and acquisition time; according to the acquisition time, news data determined by a data source address list, a starting page number and an ending page number are automatically extracted; and analyzing to obtain the title and text data in the news data, and storing the title and text data in the database.

According to still another aspect of the embodiments of the present disclosure, there is provided a data query apparatus including: a memory; and a processor coupled to the memory, the processor configured to execute the aforementioned data query method based on instructions stored in the memory.

According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions, when executed by a processor, implement the aforementioned data query method.

According to the method and the device, the relevant data chain in the knowledge graph is returned to the user after the user inputs the query information, and the upstream and downstream incidence relation of the relevant data information can be fully displayed for the user, so that the range of data analysis of the user is expanded, and the more comprehensive data analysis of the user is facilitated.

Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; those skilled in the art can derive other drawings from them without inventive effort.

FIG. 1 shows a flow diagram of one embodiment of a process for generating a knowledge-graph.

FIG. 2 shows a schematic diagram of a knowledge-graph.

FIG. 3 shows a flow diagram of one embodiment of the disclosed data query method.

FIG. 4 shows a complete workflow diagram of data query combined with data collection.

Fig. 5 shows a schematic structural diagram of a data query device according to an embodiment of the present disclosure.

Fig. 6 shows a schematic structural diagram of a data query device according to another embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

Through research, the inventors found that in the related art, to let end users concentrate on event analysis, the data collectors and the end users are usually not the same person or team. As a result, the two parties hold different views on the key points of the collected information, so important data or certain details are inevitably lost, which affects the final data analysis and increases the user's investment risk. The analysis and summary results are fixed into documents or databases for the user's convenience; however, documents usually have fixed templates and databases have fixed fields, so the data processing cannot dynamically analyze the causes and consequences of an event according to the attributes of the current information, and the data is incomplete. In addition, because data is stored incrementally, historical data gradually accumulates, which slows down data queries.

The data query method of the present disclosure is described in detail below in steps.

(I) Generating a knowledge graph

One embodiment of a process for generating a knowledge-graph is first described in conjunction with FIG. 1.

FIG. 1 shows a flow diagram of one embodiment of a process for generating a knowledge-graph. As shown in fig. 1, the process of generating the knowledge-graph in the present embodiment includes steps S102 to S106.

In step S102, database data is collected.

In the related art, the required information is collected from multiple news sources entirely by hand, so a large amount of manpower is needed every day for complex and heavy work, which wastes time; moreover, whenever a data source is added, additional manpower is needed to collect information from it.

To make data collection easier, this embodiment collects news data automatically, saving labor and time. First, a data source address list, a start page number, an end page number, and a collection time are configured. Then, at the collection time, the news data determined by the data source address list, the start page number, and the end page number is extracted automatically. Finally, the title and body text of the news data are parsed out and stored in a database for subsequent natural language processing. If the end user focuses on generating a knowledge graph of financial events, finance-related news data can be collected.
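The collection configuration described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `CollectionConfig`, `page_urls`, `TitleTextParser`, and `parse_news`, the `{page}` URL template convention, and the choice of `<title>` and `<p>` elements as the title and body text are all assumptions. Scheduling at the collection time and the actual HTTP download are left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class CollectionConfig:
    # All field names are hypothetical; the source only lists the four settings.
    source_urls: list[str]   # data source address list, as templates with {page}
    start_page: int          # start page number (inclusive)
    end_page: int            # end page number (inclusive)
    collect_at: str          # collection time, e.g. "02:00" for a daily job

    def page_urls(self) -> list[str]:
        """Expand every source template over the configured page range."""
        return [
            url.format(page=page)
            for url in self.source_urls
            for page in range(self.start_page, self.end_page + 1)
        ]


class TitleTextParser(HTMLParser):
    """Collects the <title> text and the text inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.paragraphs: list[str] = []
        self._tag = None  # the tracked tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def parse_news(html: str) -> dict[str, str]:
    """Return the title and body text that would be stored in the database."""
    parser = TitleTextParser()
    parser.feed(html)
    text = "\n".join(p.strip() for p in parser.paragraphs)
    return {"title": parser.title.strip(), "text": text}
```

A real collector would fetch each URL returned by `page_urls()` at the configured time and pass the response body to `parse_news` before writing the result to the database.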

In step S104, natural language processing is performed on the database data to extract entities and the relationships between them. Specifically, a natural language processing model may be used to extract data keywords from the database data. Take the following news data as an example: "On Monday (June 11), Zhang San arrived in Singapore, where he is about to hold a historic meeting with Li Si of Korea. It is reported that the one-on-one meeting between Zhang San and Li Si will be held at 9:15 a.m. local time on the 12th. After the meeting, Zhang San will hold a press conference and will leave Singapore for the United States at about 8 a.m. Spot gold traced a 'deep V' pattern during the day and broke through the 1,300 mark. However, the rare advance talks by the United States before the meeting and the rise in the 10-year US Treasury yield caused risk-aversion sentiment to recede, and the gold price hovered around the $1,300 mark." In this embodiment, the natural language processing model first segments the text of the news data into terms such as "Zhang San", "Singapore", and "Korea"; it then tags each term with its part of speech, for example "Zhang San" (person name) and "Singapore" (place name); finally, the tagged entities that are needed, including person names, place names, and numbers, are retrieved.
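The three stages described above (segmentation, part-of-speech tagging, and entity retrieval) can be illustrated with a toy sketch. It assumes a small hand-written lexicon in place of a trained natural language processing model; `LEXICON`, `ENTITY_TAGS`, and the function names are hypothetical, and the greedy longest-match segmentation is a stand-in technique, not the one used in the disclosure.

```python
# Hypothetical lexicon mapping known terms to part-of-speech labels.
LEXICON = {
    "Zhang San": "person",
    "Li Si": "person",
    "Singapore": "place",
    "Korea": "place",
}
# Labels that count as entities: person names, place names, numbers.
ENTITY_TAGS = {"person", "place", "number"}


def segment(text: str, lexicon: dict[str, str], max_len: int = 3) -> list[str]:
    """Greedy longest-match segmentation over whitespace-separated words."""
    words = text.replace(",", " ").replace(".", " ").split()
    terms, i = [], 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in lexicon:
                terms.append(candidate)
                i += n
                break
    return terms


def tag(terms: list[str], lexicon: dict[str, str]) -> list[tuple[str, str]]:
    """Attach a part-of-speech label to every term."""
    return [
        (term, "number" if term.isdigit() else lexicon.get(term, "other"))
        for term in terms
    ]


def entities(tagged: list[tuple[str, str]]) -> list[str]:
    """Keep only the terms whose label is an entity type."""
    return [term for term, pos in tagged if pos in ENTITY_TAGS]
```

For instance, `entities(tag(segment(sentence, LEXICON), LEXICON))` keeps the person and place names found in `sentence` and drops everything else.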

Optionally, the natural language processing model may use the word frequency-reverse document frequency algorithm (that is, term frequency-inverse document frequency, TF-IDF) and take the data keywords whose word frequency-reverse document frequency is higher than a first threshold as entities. The word frequency-reverse document frequency is calculated as follows:

X = Y * lg(M / (N + 1))

where X is the word frequency-reverse document frequency of a word, Y is the word frequency of the word, that is, the total number of times the word occurs in the article, M is the total number of documents in the corpus that stores all the news data, and N is the number of documents in the corpus that contain the word. The idea behind the algorithm is that a word or phrase discriminates well between documents if it occurs frequently in one article and rarely in others. Taking data keywords with high word frequency-reverse document frequency values as entities helps generate a knowledge graph that represents the key points of the news.
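The formula can be checked with a short sketch. It assumes that each document has already been segmented into a list of words and that lg denotes the base-10 logarithm; the function names `tf_idf` and `select_entities` and the threshold value used in the usage note are illustrative and do not come from the disclosure.

```python
import math


def tf_idf(word: str, document: list[str], corpus: list[list[str]]) -> float:
    """Compute X = Y * lg(M / (N + 1)) for one word in one document."""
    y = document.count(word)                      # Y: occurrences in the article
    m = len(corpus)                               # M: documents in the corpus
    n = sum(1 for doc in corpus if word in doc)   # N: documents containing the word
    return y * math.log10(m / (n + 1))


def select_entities(
    document: list[str], corpus: list[list[str]], threshold: float
) -> list[str]:
    """Return the distinct words whose score exceeds the first threshold."""
    distinct = dict.fromkeys(document)  # preserves first-seen order
    return [w for w in distinct if tf_idf(w, document, corpus) > threshold]
```

With a corpus of ten documents in which only one contains "gold" (twice), the score of "gold" is 2 * lg(10 / 2), roughly 1.4, so a threshold of 1.0 would keep it. A word that appears in every document gets a negative score, because M / (N + 1) is then less than 1.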

Relationships between entities can further be extracted from the database data using the natural language processing model. For example, suppose the database data records the text "influence of 618 on the stock market". The natural language processing model first segments the text into "618", "on", "stock market", and "influence", and then tags the part of speech of each word: "618" (noun), "stock market" (noun), "on" (preposition). The nouns in the text are entities, and the "influence on ..." construction determines the relationship between them: the starting entity is "618" and the ending entity is "stock market". Those skilled in the art will understand that if the text contains only one entity, that entity is treated as the starting entity by default.
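A simplified sketch of this rule is shown below. It assumes the tagger output is a list of (word, part-of-speech) pairs in the original word order, in which the first noun precedes the preposition; the `"relation"` tag for the word "influence" is a hypothetical label added for illustration, since the source does not state how that word is tagged.

```python
from typing import Optional

# (word, part-of-speech) pairs as produced by an upstream tagger.
Tagged = list[tuple[str, str]]


def extract_relation(
    tagged: Tagged,
) -> Optional[tuple[str, Optional[str], Optional[str]]]:
    """Return (starting entity, relation, ending entity).

    Nouns are treated as entities. The first noun is the starting entity
    and the second noun the ending entity. With a single noun, it becomes
    the starting entity and the other two slots are None.
    """
    nouns = [word for word, pos in tagged if pos == "noun"]
    relation = next((word for word, pos in tagged if pos == "relation"), None)
    if not nouns:
        return None
    if len(nouns) == 1:
        return (nouns[0], None, None)
    return (nouns[0], relation, nouns[1])
```

Applied to the tagged form of "influence of 618 on the stock market", the function yields "618" as the starting entity and "stock market" as the ending entity, connected by "influence".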

When an entity with the same content appears again, the existing entity is not replaced; instead, only a new relationship is added to the node corresponding to that entity, pointing to the next node. In this way a complex network containing a large amount of data is eventually formed, and this network is the knowledge graph.
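The node-reuse behavior can be sketched with an in-memory adjacency structure. This is an assumption-laden illustration: the class `KnowledgeGraph`, its `edges` layout, and the relation labels in the usage note are hypothetical, and the embodiment itself stores the graph in a graph database rather than in a Python dictionary.

```python
class KnowledgeGraph:
    """Directed graph keyed by entity content, so equal entities share a node."""

    def __init__(self) -> None:
        # entity -> list of (relation, target entity) pairs leaving that entity
        self.edges: dict[str, list[tuple[str, str]]] = {}

    def add_relation(self, start: str, relation: str, end: str) -> None:
        # setdefault reuses an existing node instead of replacing it.
        self.edges.setdefault(end, [])
        outgoing = self.edges.setdefault(start, [])
        if (relation, end) not in outgoing:
            outgoing.append((relation, end))
```

Calling `add_relation("Zhang San", "arrives in", "Singapore")` and then `add_relation("Zhang San", "meets", "Li Si")` leaves a single "Zhang San" node with two outgoing relationships.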

In step S106, a knowledge-graph is generated from the entities and relationships between the entities.

FIG. 2 shows a schematic diagram of a knowledge graph. The knowledge graph may be stored in Neo4j, OrientDB, or TITAN, preferably Neo4j. Neo4j is a high-performance graph database: an embedded, disk-based Java persistence engine with full transactional properties that stores structured data in a graph (a network) rather than in tables. Neo4j can also be viewed as a high-performance graph engine with all the features of a full database. The structured information contained in the collected news data can therefore be represented in the Neo4j database.

In this embodiment, data is collected automatically from the data sources, events are processed automatically through natural language processing, and the entities and relationships obtained are stored in the knowledge graph, so that the entities or relationships related to an event can be presented to users in the form of a knowledge graph, helping them quickly obtain information about related events.

(II) data query

One embodiment of the disclosed data query method is described below in conjunction with fig. 3.

FIG. 3 shows a flow diagram of one embodiment of the disclosed data query method. As shown in fig. 3, the data query method in the present embodiment includes steps S302 to S308.

In step S302, query information input by a user is received.

In step S304, natural language processing is performed on the query information input by the user, and the query information is analyzed to obtain a query keyword.

In step S306, a query entity corresponding to the query keyword is determined in the knowledge-graph.

In step S308, the data chain in the knowledge-graph associated with the querying entity is returned to the user.

The query process can be roughly divided into two cases according to the query information input by the user.

In the first case, the query information input by the user is parsed using natural language processing to obtain the query keywords; a starting query entity and an ending query entity corresponding to the query keywords are then determined in the knowledge graph; and the data chains in the knowledge graph that take the starting query entity as the start node and the ending query entity as the end node are returned to the user.

For example, the query information input by the user is "Zhang San meets Li Si". Parsing "Zhang San meets Li Si" yields the query keywords "Zhang San" and "Li Si". The starting query entity "Zhang San" and the ending query entity "Li Si" are determined in the knowledge graph. Two data chains are finally found: the first is "Zhang San" -> "United States" -> "one-on-one meeting" -> "Korea" -> "Li Si", and the second is "Zhang San" -> "sanction" -> "Korea" -> "Li Si".
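One way to find such chains is to enumerate every simple path from the starting entity to the ending entity. The sketch below assumes the graph is held as a plain adjacency dictionary; `GRAPH` reproduces the example with the names "Zhang San" and "Li Si", and `find_chains` is a hypothetical helper, not the query mechanism of any particular graph database.

```python
# Adjacency dictionary for the example: entity -> entities it points to.
GRAPH = {
    "Zhang San": ["United States", "sanction"],
    "United States": ["one-on-one meeting"],
    "one-on-one meeting": ["Korea"],
    "sanction": ["Korea"],
    "Korea": ["Li Si"],
    "Li Si": [],
}


def find_chains(graph: dict[str, list[str]], start: str, end: str) -> list[list[str]]:
    """Return every cycle-free chain that begins at start and stops at end."""
    chains: list[list[str]] = []

    def walk(node: str, path: list[str]) -> None:
        if node == end:
            chains.append(path)
            return
        for successor in graph.get(node, []):
            if successor not in path:  # never revisit a node within one chain
                walk(successor, path + [successor])

    walk(start, [start])
    return chains
```

`find_chains(GRAPH, "Zhang San", "Li Si")` returns the two chains listed in the example, in the order in which the successors of "Zhang San" are stored.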

In the second case, the query information input by the user is parsed using natural language processing to obtain the query keyword; a starting query entity corresponding to the query keyword is then determined in the knowledge graph; and the data chains in the knowledge graph that take the starting query entity as the start node are returned to the user.

When determining the data chain, the traversal starts from the starting query entity and visits its associated ending entities according to a depth-first rule; each ending entity is then taken as a new starting point and its associated ending nodes are traversed in turn, until the traversal is complete or a given condition is reached.
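The depth-first traversal from a single starting entity can be sketched as follows. The stop condition is not specified in the source, so a maximum number of relationships per chain (`max_depth`) is assumed here purely for illustration; the function name `chains_from` and the adjacency-dictionary input are also assumptions.

```python
def chains_from(
    graph: dict[str, list[str]], start: str, max_depth: int
) -> list[list[str]]:
    """Depth-first enumeration of the chains that begin at start.

    A chain is closed when its last node has no unvisited successor or
    when it already contains max_depth relationships (assumed condition).
    """
    chains: list[list[str]] = []

    def walk(path: list[str]) -> None:
        successors = [n for n in graph.get(path[-1], []) if n not in path]
        if not successors or len(path) - 1 >= max_depth:
            chains.append(path)
            return
        for successor in successors:
            # Each ending entity becomes the start of the next step.
            walk(path + [successor])

    walk([start])
    return chains
```

Because the recursion follows one successor all the way down before trying the next, each chain is completed before its siblings are explored, which is the depth-first order described above.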

Optionally, in step S308, the data chains associated with the query entity in the knowledge graph are returned to the user in descending order of the number of relationships they contain. That is, the more relationships there are between two matched entities, the higher the priority of the chain and the earlier it is displayed. Finally, node data with more associations is returned first according to the ranking value of the traversal counts.

For example, the first data chain "Zhang San" -> "United States" -> "one-on-one meeting" -> "Korea" -> "Li Si" takes precedence over the second data chain "Zhang San" -> "sanction" -> "Korea" -> "Li Si".
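The ordering rule can be expressed in a few lines. This sketch assumes each chain is a list of node names, so a chain with k nodes contains k - 1 relationships; `rank_chains` is a hypothetical name.

```python
def rank_chains(chains: list[list[str]]) -> list[list[str]]:
    """Sort chains so that those with more relationships come first."""
    # sorted() is stable, so chains of equal length keep their input order.
    return sorted(chains, key=lambda chain: len(chain) - 1, reverse=True)
```

In the example, the first chain has four relationships and the second has three, so the first is returned ahead of the second regardless of the order in which they were found.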

Optionally, if the query does not hit any entity in the knowledge graph, the text information may be passed to the information collection stage through middleware (e.g., Kafka), and a data collection instruction containing the specified text is executed. FIG. 4 shows a complete workflow diagram of data query combined with data collection.
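The fallback on a miss can be sketched as follows. To stay self-contained, the example uses Python's standard `queue.Queue` as a stand-in for the middleware; it does not use any Kafka client API. The function name and the message fields `instruction` and `text` are assumptions.

```python
import queue


def lookup_or_request_collection(
    keyword: str, known_entities: set[str], collection_queue: queue.Queue
) -> bool:
    """Return True on a hit; on a miss, enqueue a collection instruction."""
    if keyword in known_entities:
        return True
    # Hypothetical message format carrying the text to collect.
    collection_queue.put({"instruction": "collect", "text": keyword})
    return False
```

The collection stage would consume these messages and run a collection job for the given text, so that a later query for the same keyword can hit the graph.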

In the embodiment, the relevant data chain in the knowledge graph is returned to the user after the user inputs the query information, and the upstream and downstream association relation of the relevant data information can be fully displayed for the user, so that the data analysis range of the user is expanded, and the user can perform more comprehensive data analysis.

Furthermore, in a graph database, relationships are the most important elements: entities are associated with each other through relationships to build complex models. Each node in the graph database model directly holds a relationship list that stores the relationship records between that node and other nodes. These relationship records are organized by type and direction and may carry additional attributes. Whenever an operation similar to a relational database JOIN is run, the graph database accesses the connected nodes directly through this list, without having to search and match records, which improves the efficiency and stability of information queries.
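The per-node relationship list can be pictured with a small conceptual model. This is only an illustration of the idea described above, not the on-disk format of Neo4j or any other graph database; the class names, the `"out"`/`"in"` direction labels, and the `MEETS` relationship type are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Relationship:
    rel_type: str          # relationship type, e.g. "MEETS"
    direction: str         # "out" or "in", as seen from the owning node
    other: "Node"          # direct reference to the node at the other end
    properties: dict = field(default_factory=dict)  # additional attributes


@dataclass(eq=False)
class Node:
    name: str
    relationships: list = field(default_factory=list)

    def neighbors(self, rel_type: str, direction: str) -> list:
        """Follow the node's own relationship list; no global lookup needed."""
        return [
            rel.other.name
            for rel in self.relationships
            if rel.rel_type == rel_type and rel.direction == direction
        ]


def connect(start: Node, rel_type: str, end: Node, **properties) -> None:
    """Record the relationship on both end nodes, once per direction."""
    start.relationships.append(Relationship(rel_type, "out", end, properties))
    end.relationships.append(Relationship(rel_type, "in", start, properties))
```

Finding the neighbors of a node only scans that node's own list, which is the contrast with a relational JOIN that the paragraph draws.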

A data query apparatus according to an embodiment of the present disclosure is described below with reference to fig. 5.

Fig. 5 shows a schematic structural diagram of a data query device according to an embodiment of the present disclosure. As shown in fig. 5, the data query device 50 in the present embodiment includes:

an information receiving module 502 configured to receive query information input by a user.

A keyword analysis module 504 configured to perform natural language processing on the query information input by the user and analyze the query information to obtain a query keyword.

An entity determination module 506 configured to determine a query entity corresponding to the query keyword in the knowledge-graph.

A data return module 508 configured to return data chains in the knowledge-graph associated with the query entity to the user.

In some embodiments, the entity determination module 506 is configured to: determining an initial query entity corresponding to a query keyword in a knowledge graph; the data return module 508 is configured to: and returning a data chain which takes the initial query entity as an initial node in the knowledge graph to the user.

In some embodiments, the entity determination module 506 is configured to: determining a starting query entity and an ending query entity corresponding to a query keyword in a knowledge graph; the data return module 508 is configured to: and returning a data chain which takes the initial query entity as an initial node and takes the end query entity as an end node in the knowledge graph to the user.

In some embodiments, the data return module 508 is configured to: return the data chains associated with the query entity in the knowledge graph to the user in descending order of the number of relationships contained in the data chains.

In some embodiments, the data query device 50 further comprises a knowledge-graph generation module 500 configured to: collecting database data; carrying out natural language processing on database data, and extracting entities and relations among the entities; and generating a knowledge graph according to the entities and the relationship among the entities.

In some embodiments, the knowledge-graph generation module 500 is configured to: extracting data keywords from database data by using a natural language processing model; taking the data key words with the word frequency-reverse document frequency higher than a first threshold value as entities; relationships between entities are extracted from database data using a natural language processing model.

In some embodiments, the knowledge-graph generation module 500 is configured to: configuring a data source address list, a starting page number, an ending page number and acquisition time; according to the acquisition time, news data determined by a data source address list, a starting page number and an ending page number are automatically extracted; and analyzing to obtain the title and text data in the news data, and storing the title and text data in the database.

Fig. 6 shows a schematic structural diagram of a data query device according to another embodiment of the present disclosure. As shown in fig. 6, the data query apparatus 60 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute the data query method of any of the foregoing embodiments based on instructions stored in the memory 610.

Memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.

The data query apparatus 60 may further include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, as well as the memory 610 and the processor 620, may be connected, for example, through a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash drive.

The present disclosure also includes a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the data query method in any of the foregoing embodiments.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A method of data query, comprising:

collecting database data;

extracting data keywords from database data by using a natural language processing model;

taking the data key words with the word frequency-reverse document frequency higher than a first threshold value as entities;

extracting relationships between the entities from database data using a natural language processing model;

generating a knowledge graph according to entities and the relationship between the entities;

receiving query information input by a user;

carrying out natural language processing on query information input by a user, and analyzing to obtain query keywords;

determining a starting query entity and an ending query entity corresponding to a query keyword in a knowledge graph;

and returning the data chain which takes the initial query entity as an initial node and takes the end query entity as an end node in the knowledge graph to the user according to the descending order of the priority of the data chain, wherein the priority of the data chain is in positive correlation with the relation number included in the data chain.

2. The data query method of claim 1, wherein the collecting database data comprises:

configuring a data source address list, a starting page number, an ending page number and acquisition time;

according to the acquisition time, news data determined by a data source address list, a starting page number and an ending page number are automatically extracted;

and analyzing to obtain the title and text data in the news data, and storing the title and text data in a database.

3. A data query apparatus, comprising:

a knowledge-graph generation module configured to: collecting database data; extracting data keywords from database data by using a natural language processing model; taking the data key words with the word frequency-reverse document frequency higher than a first threshold value as entities; extracting relationships between the entities from database data using a natural language processing model; generating a knowledge graph according to entities and the relationship between the entities;

the information receiving module is configured to receive query information input by a user;

the keyword analysis module is configured to perform natural language processing on query information input by a user and analyze the query information to obtain a query keyword;

the entity determining module is configured to determine a starting query entity and an ending query entity corresponding to the query keyword in the knowledge graph;

and the data returning module is configured to return the data chain which takes the initial query entity as an initial node and takes the end query entity as an end node in the knowledge graph to the user according to the sequence of the priorities of the data chain from large to small, wherein the priority of the data chain is in positive correlation with the relation number included in the data chain.

4. The data query apparatus of claim 3, wherein the knowledge-graph generation module is configured to:

5. A data query apparatus, comprising:

a memory; and

a processor coupled to the memory, the processor configured to perform the data query method of claim 1 or 2 based on instructions stored in the memory.

6. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the data query method of claim 1 or 2.