
CN110895548B - Method and apparatus for processing information

Info

Publication number: CN110895548B
Application number: CN201810975593.1A
Authority: CN
Inventors: 刘畅; 张阳; 谢奕; 杨双全; 熊云; 郑灿翔; 季昆鹏; 张雪婷
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2018-08-24
Filing date: 2018-08-24
Publication date: 2022-08-09
Anticipated expiration: 2038-08-24
Also published as: CN110895548A

Abstract

The embodiments of the present application disclose a method and an apparatus for processing information. One embodiment of the method comprises: acquiring at least one log, wherein the log comprises entity information of at least one entity; for a log in the at least one log, generating an original edge based on the log according to a preset entity extraction configuration rule; for an entity among the entities involved in the generated original edges, acquiring a vertex identifier corresponding to the entity through a preset vertex identifier dictionary; and for an original edge in the at least one original edge, obtaining the vertex identifiers corresponding to the two entities included in the original edge, and generating an associated edge according to the edge information of the original edge, the vertex identifiers corresponding to the two entities, and the entity information of the two entities. This embodiment can rapidly and accurately extract relationships between entities from massive spatiotemporal data, and facilitates storing and retrieving those relationships.

Description

Method and apparatus for processing information

Technical Field

The embodiments of the present application relate to the field of computer technology, and in particular to a method and an apparatus for processing information.

Background

With the development of the mobile internet, people and things in the real world are directly or indirectly connected, that is, everything is interconnected. Mining and acquiring these relationships is of great significance in many fields. For example, in the financial field, banks and securities brokers can recommend financial products and evaluate risk preferences according to interpersonal relationships. Group-buying and review websites and apps can recommend products and increase advertisement conversion rates through the relationships between people and the places they frequent and between people and restaurants. In the public security field, criminal clues can be sorted out through the relationships between people, between people and places, and between people and things, improving case-handling efficiency; group relationships can also be mined so that certain group events and terrorist events can be predicted and prepared for in advance. People, places, things and the like are referred to as entities, and mining the relationships among entities has great practical significance. With the spread of acquisition devices and sensors in the real world and of handheld and wearable devices, a large amount of structured entity track and entity log information can be collected, and mining entities and entity relationships from this vast ocean of data is extremely challenging.

Disclosure of Invention

The embodiments of the present application provide a method and an apparatus for processing information.

In a first aspect, an embodiment of the present application provides a method for processing information, including: acquiring at least one log, wherein the log comprises entity information of at least one entity; for a log in the at least one log, generating an original edge based on the log according to a preset entity extraction configuration rule, wherein the original edge comprises entity information of two entities, the entities correspond to vertexes in a preset graph database, and the entity extraction configuration rule is used for specifying the positions in the log of the entity information of the two entities of the original edge and comprises edge information of an associated edge of the two entities; for an entity in at least one entity related to the generated at least one original edge, acquiring a vertex identification corresponding to the entity through a preset vertex identification dictionary, wherein the vertex identification dictionary is used for representing the corresponding relation between the vertex identification of a vertex in the graph database and entity information of the entity; and for an original edge in the at least one original edge, obtaining the vertex identifications corresponding to the two entities included in the original edge, and generating an associated edge according to the edge information of the original edge, the vertex identifications corresponding to the two entities and the entity information of the two entities.

In some embodiments, the entity information comprises at least one of: an entity label, an entity key, and an entity attribute; and the edge information comprises at least one of: an edge label and an edge attribute.

In some embodiments, the entity information comprises: an entity label and an entity key; and obtaining the vertex identification corresponding to the entity through a preset vertex identification dictionary comprises: determining whether entity information matching the entity label and the entity key in the entity information of the entity exists in the preset vertex identification dictionary; and if so, determining the vertex identification corresponding to the matched entity information as the vertex identification of the entity, and updating the entity information of the entity in the vertex identification dictionary.

In some embodiments, the method further comprises: if not, generating a vertex identification for the entity, and adding the corresponding relation between the generated vertex identification and the entity information of the entity to the vertex identification dictionary.

In some embodiments, updating the entity information of the entity in the vertex identification dictionary comprises: in response to detecting that the entity attribute in the entity information of the entity differs from the entity attribute recorded for the entity in the vertex identification dictionary, merging the two sets of entity attributes and using the result as the entity attribute of the entity in the vertex identification dictionary.

In some embodiments, generating the original edge based on the log according to a predetermined entity extraction configuration rule includes: respectively reading the entity information of the two entities according to the positions of the entity information of the two entities at the original edge in the log, which are specified in the preset entity extraction configuration rule; analyzing the read entity information of the two entities according to a preset field type rule to obtain the analyzed entity information of the two entities, wherein the field type rule is used for specifying the data type of each field in the entity information; and generating an original edge according to the analyzed entity information of the two entities and the edge information in the entity extraction configuration rule.

In some embodiments, the method further comprises: verifying the entity extraction configuration rule by using the field type rule.

In a second aspect, an embodiment of the present application provides an apparatus for processing information, including: an acquisition unit configured to acquire at least one log, wherein the log includes entity information of at least one entity; an original edge generating unit configured to generate, for a log of at least one log, an original edge based on the log according to a predetermined entity extraction configuration rule, wherein the original edge includes entity information of two entities, the entities correspond to vertices in a preset graph database, and the entity extraction configuration rule is used to specify positions of the entity information of the two entities of the original edge in the log and includes edge information of an associated edge of the two entities; the vertex extraction unit is configured to obtain a vertex identifier corresponding to an entity in at least one entity related to the generated at least one original edge through a preset vertex identifier dictionary, wherein the vertex identifier dictionary is used for representing a corresponding relation between the vertex identifier of a vertex in the graph database and entity information of the entity; and the associated edge generating unit is configured to acquire vertex identifications corresponding to two entities included in at least one original edge, and generate an associated edge according to the edge information of the original edge, the vertex identifications corresponding to the two entities and the entity information of the two entities.

In some embodiments, the entity information comprises: an entity label and an entity key; and the vertex extraction unit is further configured to: determine whether entity information matching the entity label and the entity key in the entity information of the entity exists in the preset vertex identification dictionary; and if so, determine the vertex identification corresponding to the matched entity information as the vertex identification of the entity, and update the entity information of the entity in the vertex identification dictionary.

In some embodiments, the vertex extraction unit is further configured to: if not, generate a vertex identification for the entity, and add the corresponding relation between the generated vertex identification and the entity information of the entity to the vertex identification dictionary.

In some embodiments, the vertex extraction unit is further configured to: in response to detecting that the entity attribute in the entity information of the entity differs from the entity attribute recorded for the entity in the vertex identification dictionary, merge the two sets of entity attributes and use the result as the entity attribute of the entity in the vertex identification dictionary.

In some embodiments, the raw edge generation unit is further configured to: respectively reading the entity information of the two entities according to the positions of the entity information of the two entities at the original edge in the log, which are specified in the preset entity extraction configuration rule; analyzing the read entity information of the two entities according to a preset field type rule to obtain the analyzed entity information of the two entities, wherein the field type rule is used for specifying the data type of each field in the entity information; and generating an original edge according to the analyzed entity information of the two entities and the edge information in the entity extraction configuration rule.

In some embodiments, the apparatus further comprises a verification unit configured to: verify the entity extraction configuration rule by using the field type rule.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the first aspects.

In a fourth aspect, the present application provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of the first aspect.

According to the method and the apparatus for processing information provided by the embodiments of the present application, entities are extracted from logs, the indexes under which the entities are stored in the graph database are obtained, and the associated edges to be stored in the graph database are generated, so that both the speed of extracting relationships among entities and the speed of retrieving those relationships are improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;

FIG. 2 is a flow diagram for one embodiment of a method for processing information according to the present application;

FIGS. 3a, 3b are schematic diagrams of an application scenario of a method for processing information according to the present application;

FIG. 4 is a flow diagram of yet another embodiment of a method for processing information according to the present application;

FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for processing information according to the present application;

FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for processing information or the apparatus for processing information of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices that have a display screen and support log generation, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.

The server 105 may be a server that provides various services, such as a background analysis server that collects and analyzes logs generated on the terminal devices 101, 102, 103. The background analysis server may analyze and otherwise process the received massive logs, and feed back a processing result (e.g., an associated edge generated according to a relationship between entities) to the terminal device or store the processing result in a graph database.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.

It should be noted that the method for processing information provided in the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for processing information is generally disposed in the server 105.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for processing information in accordance with the present application is shown. The method for processing information comprises the following steps:

Step 201, acquiring at least one log.

In this embodiment, an execution subject of the method for processing information (e.g., the server shown in fig. 1) may acquire massive log data from a third-party server through a wired or wireless connection, or may obtain users' operation logs directly from terminal devices. The log comprises entity information of at least one entity. The data source from which relationships between entities are extracted may be police spatiotemporal big data. The data come from various product lines and from various sensors and acquisition devices in the real world, and constitute massive log data. Each piece of data describes one of a person, an event, a place, an object, a case, or a relationship. Persons, events, places, objects, and cases are entities. A person is an individual, and the entity information of a person may include the person's name, age, and the like. An event is an event that has been discovered, and its entity information may include the event name, occurrence time, and the like. A place is an address, e.g., the address where an event occurs or the address where a person appears. An object is an article, which may be an article operated by a person or an article related to an event, and its entity information may include the article name and the like. A case refers to a case such as a missing-person case or a fraud case.

Step 202, for the log in at least one log, generating an original edge based on the log according to a predetermined entity extraction configuration rule.

In this embodiment, the original edge includes entity information of two entities, and the entities correspond to vertices in a preset graph database. If a log includes entity information of only one entity, the log is filtered out. A graph database is a non-relational database that uses graph theory to store relationship information between entities. The most common example is interpersonal relationships in social networks. Relational databases are poorly suited to storing "relationship" data: queries over such data are complex and slow and fall short of expectations, and the design of graph databases remedies exactly this deficiency. A graph is a collection of vertices (Vertex) and edges (Edge), each of which may have its own attributes. The entity information of the two entities is read from the log at the positions that the preset entity extraction configuration rule specifies for the two entities of the original edge. An original edge is then generated according to the entity information of the two entities and the edge information in the entity extraction configuration rule.
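The vertex-and-edge model described above can be pictured with a minimal Python sketch. The class names `Vertex` and `Edge`, their attribute names, and the helper `has_two_entities` are illustrative assumptions rather than part of the embodiment; the sample labels and keys are taken from the example used later in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                      # entity label, e.g. "el_p_idcard"
    key: str                        # entity key, e.g. "e_idcard_id#..."
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Vertex
    target: Vertex
    label: str                      # edge label, e.g. "rl_pa_travel"
    properties: dict = field(default_factory=dict)


def has_two_entities(entities: list) -> bool:
    """Assumed filter: a log carrying fewer than two entities yields no edge."""
    return len(entities) >= 2


person = Vertex("el_p_idcard", "e_idcard_id#310000196002196105")
car = Vertex("el_o_car", "e_car_id#17792")
edge = Edge(person, car, "rl_pa_travel")
```

A log from which only `person` could be read would be dropped by `has_two_entities`, since no edge can be formed from a single vertex.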

The entity extraction configuration rule specifies the positions in the log of the entity information of the two entities of the original edge, and includes the edge information of the associated edge of the two entities. The edge information may be used to indicate the relationship between the two entities. The content of the entity extraction configuration rule mainly comprises the fields to be extracted, which fields serve as the label (entity label), key (entity key), and property (entity attribute), the separator, the input and output paths, and the like. The content of the rule is filled in by the user, so new logs and new entity extraction requirements do not require code-level modification. For example, the entity extraction configuration rule may specify that the entity information of entity 1 is extracted from the 2nd field in the log and the entity information of entity 2 from the 12th field, and may specify the field separator. When the log is parsed, it is split into fields at each occurrence of the separator, and the content of the fields at the specified positions is extracted as entity information.
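The split-then-pick behaviour described above can be sketched in a few lines of Python. The function name `extract_fields`, the zero-based position convention, and the choice to return `None` for a log that is too short are assumptions made for illustration only.

```python
def extract_fields(line: str, positions: dict, delimiter: str = "\t"):
    """Split one log line on the delimiter and pick fields by position.

    `positions` maps an output name to a zero-based field index.
    Returns None when the line has too few fields (assumed behaviour).
    """
    parts = line.rstrip("\n").split(delimiter)
    if max(positions.values()) >= len(parts):
        return None
    return {name: parts[index] for name, index in positions.items()}


# A synthetic 13-field log line: "v0\tv1\t...\tv12".
sample_line = "\t".join(f"v{i}" for i in range(13))
picked = extract_fields(sample_line, {"entity1": 2, "entity2": 12})
too_short = extract_fields("a\tb", {"entity1": 2, "entity2": 12})
```

Because the positions live in a plain mapping, supporting a new log layout only means supplying a different mapping, which mirrors the point that no code-level change is needed.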

Fig. 3a shows the content of the entity extraction configuration rule, where src_config is the log-related configuration and dst_config is the original-edge-related configuration.

In src_config, directories is the path of the input logs and delimiter is the separator. fields lists the fields: e_idcard_id: 0 indicates that the content of field 0 is obtained from the log as e_idcard_id; e_idcard_name: 1 indicates that the content of field 1 is obtained as e_idcard_name; e_idcard_huji: 4 indicates that the content of field 4 is obtained as e_idcard_huji; e_idcard_status: 5 indicates that the content of field 5 is obtained as e_idcard_status; e_car_id: 67 indicates that the content of field 67 is obtained as e_car_id; and e_car_license: 68 indicates that the content of field 68 is obtained as e_car_license.

In dst_config, directories is the path for outputting original edges and delimiter is the separator. meta_fields holds the meta fields: vertex1_label: el_p_idcard indicates that the label of vertex 1 is el_p_idcard; vertex1_key: e_idcard_id indicates that the key of vertex 1 is e_idcard_id; vertex2_label: el_o_car indicates that the label of vertex 2 is el_o_car; vertex2_key: e_car_id indicates that the key of vertex 2 is e_car_id; and edge_label: rl_pa_travel indicates that the edge label is rl_pa_travel. property_fields holds the property fields: vertex1_property gives the properties of vertex 1, namely the 4 properties "e_idcard_id", "e_idcard_name", "e_idcard_huji", and "e_idcard_status"; vertex2_property gives the properties of vertex 2, namely the 2 properties "e_car_id" and "e_car_license"; and edge_alert gives the properties of the edge, which are empty in this example.

The relationships between entities are extracted from the original log and persisted to disk in JSON format. JSON is chosen because it carries its own schema and data types, which makes parsing convenient and means that no separate schema or meta information needs to be maintained.

The generated raw edges are as follows:

el_p_idcard e_idcard_id#310000196002196105 {"e_idcard_id":"310000196002196105","e_idcard_name":"feather true 211","e_idcard_huji":"Shanghai City","e_idcard_status":"expired"} el_o_car e_car_id#17792 {"e_car_id":17792,"e_car_license":"Guangdong VR03316"} rl_pa_travel {}

Here el_p_idcard is the entity label of entity 1, e_idcard_id#310000196002196105 is the entity key of entity 1, and the content of the first { } is the entity attributes of entity 1. el_o_car is the entity label of entity 2, e_car_id#17792 is the entity key of entity 2, and the content of the second { } is the entity attributes of entity 2. rl_pa_travel is the edge label, and the final { } holds the edge attributes, which are empty in this example.
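To make the path from configuration to original edge concrete, the following Python sketch rebuilds the rule of fig. 3a from its textual description and produces one original edge from a synthetic log line. The dictionary layout, the function name `build_raw_edge`, and the tab-separated output are assumptions, since the figure itself is not reproduced here; the directories entries are omitted because their values are not given. Field type parsing is left out, so every value stays a string (e_car_id is "17792" rather than the number 17792).

```python
import json

# Reconstruction of the rule in fig. 3a from its description (layout assumed).
CONFIG = {
    "src_config": {
        "delimiter": "\t",
        "fields": {
            "e_idcard_id": 0, "e_idcard_name": 1, "e_idcard_huji": 4,
            "e_idcard_status": 5, "e_car_id": 67, "e_car_license": 68,
        },
    },
    "dst_config": {
        "delimiter": "\t",
        "meta_fields": {
            "vertex1_label": "el_p_idcard", "vertex1_key": "e_idcard_id",
            "vertex2_label": "el_o_car", "vertex2_key": "e_car_id",
            "edge_label": "rl_pa_travel",
        },
        "property_fields": {
            "vertex1_property": ["e_idcard_id", "e_idcard_name",
                                 "e_idcard_huji", "e_idcard_status"],
            "vertex2_property": ["e_car_id", "e_car_license"],
            "edge_alert": [],  # edge attributes, empty in this example
        },
    },
}


def build_raw_edge(line: str, config: dict) -> str:
    """Turn one log line into an original edge: label, key, JSON per vertex."""
    src, dst = config["src_config"], config["dst_config"]
    parts = line.rstrip("\n").split(src["delimiter"])
    values = {name: parts[i] for name, i in src["fields"].items()}
    meta, props = dst["meta_fields"], dst["property_fields"]

    def vertex(n: int) -> list:
        key_field = meta[f"vertex{n}_key"]
        attrs = {p: values[p] for p in props[f"vertex{n}_property"]}
        return [meta[f"vertex{n}_label"],
                f"{key_field}#{values[key_field]}",
                json.dumps(attrs, ensure_ascii=False)]

    edge_attrs = json.dumps({p: values[p] for p in props["edge_alert"]})
    columns = vertex(1) + vertex(2) + [meta["edge_label"], edge_attrs]
    return dst["delimiter"].join(columns)


# Synthetic 69-field log line carrying the sample values from the text.
log_fields = [""] * 69
log_fields[0], log_fields[1] = "310000196002196105", "feather true 211"
log_fields[4], log_fields[5] = "Shanghai City", "expired"
log_fields[67], log_fields[68] = "17792", "Guangdong VR03316"
raw_edge = build_raw_edge("\t".join(log_fields), CONFIG)
```

The resulting `raw_edge` has eight tab-separated columns: label, key, and attribute JSON for each of the two entities, followed by the edge label and the edge attribute JSON.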

In some optional implementation manners of this embodiment, generating an original edge based on the log according to a predetermined entity extraction configuration rule includes:

Step 2021, reading the entity information of the two entities from the log at the positions that the predetermined entity extraction configuration rule specifies for the two entities of the original edge.

The entity information includes at least one of: an entity label, an entity key, and an entity attribute. The edge information includes at least one of: an edge label and an edge attribute. As shown in fig. 3a, the entity information of two entities is extracted from the log according to the entity extraction configuration rule.

Step 2022, analyzing the entity information of the two read entities according to a predetermined field type rule to obtain the analyzed entity information of the two entities.

The field type rule specifies the data type of each field in the entity information. For example, if the data type of a field is integer, the content of that field is parsed as an integer when the log is parsed. As shown in fig. 3b, the data type in the log is specified, denoted by type; the data type of each field in the entity information is also specified, as listed under desc. For example, e_idcard_age: Integer means that the data type of the attribute e_idcard_age is integer; if the field in the log is 0x1f, the value is parsed as age 31.
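A field type rule of this kind can be sketched as a mapping from field name to type name plus one parser per type. In the Python sketch below, only the entry e_idcard_age: Integer comes from the description; the other entries, the type name "String", and the names `FIELD_TYPES`, `PARSERS`, and `parse_fields` are assumptions. Python's `int(text, 0)` infers the base from the prefix, so "0x1f" is read as 31.

```python
# Field name -> type name. Only e_idcard_age: Integer is from the text;
# the remaining entries are illustrative assumptions.
FIELD_TYPES = {
    "e_idcard_age": "Integer",
    "e_car_id": "Integer",
    "e_idcard_id": "String",
}

# One parser per type name; base 0 lets int() accept "0x1f" as well as "31".
PARSERS = {
    "Integer": lambda text: int(text, 0),
    "String": str,
}


def parse_fields(raw: dict, field_types: dict) -> dict:
    """Convert raw string values using the declared type of each field.

    Fields without a declared type are kept as strings (assumed default).
    """
    return {
        name: PARSERS[field_types.get(name, "String")](value)
        for name, value in raw.items()
    }


parsed = parse_fields(
    {"e_idcard_age": "0x1f", "e_car_id": "17792",
     "e_idcard_id": "310000196002196105"},
    FIELD_TYPES,
)
```

Keeping the identity number as a string while parsing e_car_id as an integer is consistent with the sample original edge, where the former is quoted and the latter is not.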

Step 2023, generating an original edge according to the entity information of the two analyzed entities and the edge information in the entity extraction configuration rule.

The entity label, the entity key, and the entity attributes are parsed from the log according to their respective data types. An original edge is then generated from the parsing result and the edge information specified in the entity extraction configuration rule.

Step 203, for an entity in the at least one entity related to the generated at least one original edge, obtaining a vertex identifier corresponding to the entity through a preset vertex identifier dictionary.

In this embodiment, different original edges may involve the same entity, so a vertex identifier is obtained for each entity. The vertex identifier dictionary represents the correspondence between the vertex identifiers of vertices in the graph database and the entity information of entities. In other words, the vertex identifier serves as an index into the graph database, through which entity information can be quickly looked up and inserted. Inserting a vertex into the graph database yields the database's built-in vertex identifier, and obtaining vertex identifiers in advance helps increase the speed of loading data into the database. The original edges are therefore de-duplicated and merged against the vertex identifier dictionary: an entity whose entity label and entity key match an entry in the vertex identifier dictionary is a vertex that needs to be updated in the graph database, and an entity whose entity label and entity key match no entry is a vertex that needs to be added to the graph database.

In some optional implementations of this embodiment, obtaining the vertex identifier corresponding to the entity through a preset vertex identifier dictionary includes: determining whether entity information matching the entity label and the entity key in the entity information of the entity exists in the preset vertex identifier dictionary; and if so, determining the vertex identifier corresponding to the matched entity information as the vertex identifier of the entity, and updating the entity information of the entity in the vertex identifier dictionary. That is, if a duplicate vertex is found in the vertex identifier dictionary, the entity information of the entity to which the vertex corresponds is updated, and the entity information is not stored again as a new vertex. Optionally, when the entity attributes in the entity information of an entity differ from the entity attributes recorded for that entity in the vertex identifier dictionary, the two sets of entity attributes are merged. For example, if the entity attributes of entity 1 in the original vertex identifier dictionary are {P1: K1, P2: K2} and the entity attributes of the currently extracted entity 1 are {P3: K3}, the updated entity attributes of entity 1 are {P1: K1, P2: K2, P3: K3}. If the attribute field is the same but the attribute values differ, the old value is replaced with the new value according to the chronological order in the log. For example, if the entity attributes of entity 1 in the original vertex identifier dictionary are {P3: K3} and the entity attributes of the currently extracted entity 1 are {P3: K4}, the entity attributes of entity 1 in the updated vertex identifier dictionary are {P3: K4}.
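The match-and-merge rule above can be sketched in Python as follows. The dictionary layout (keyed by the pair of entity label and entity key), the function name `lookup_and_update`, and the sample vertex identifier 1 are assumptions. `dict.update` adds new attribute fields and overwrites existing ones, which matches the merge rule provided that logs are processed in chronological order.

```python
def lookup_and_update(vertex_dict: dict, label: str, key: str,
                      properties: dict):
    """Return the vertex identifier for (label, key) and merge attributes.

    Returns None when no matching entry exists in the dictionary.
    """
    entry = vertex_dict.get((label, key))
    if entry is None:
        return None
    # New fields are added; fields seen before take the newer value.
    entry["properties"].update(properties)
    return entry["vertex_id"]


# Hypothetical dictionary holding entity 1 with attributes {P1: K1, P2: K2}.
vertex_dict = {
    ("label1", "key1"): {"vertex_id": 1,
                         "properties": {"P1": "K1", "P2": "K2"}},
}

first_id = lookup_and_update(vertex_dict, "label1", "key1", {"P3": "K3"})
after_merge = dict(vertex_dict[("label1", "key1")]["properties"])

second_id = lookup_and_update(vertex_dict, "label1", "key1", {"P3": "K4"})
after_replace = dict(vertex_dict[("label1", "key1")]["properties"])

missing = lookup_and_update(vertex_dict, "label1", "other", {"P1": "K1"})
```

After the first call the attributes are {P1: K1, P2: K2, P3: K3}; after the second, P3 holds K4, reproducing both examples in the paragraph above.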

An example of updating the entity information of entities in the vertex identifier dictionary is shown below, where #U indicates that the piece of information is updated:

Entity 1:

e_idcard_id#310000200712161365 el_p_idcard e_idcard_id#310000200712161365 {"e_idcard_id":"310000200712161365","e_idcard_name":"yield 6","e_idcard_huji":"Shanghai City","e_idcard_status":"normal"} 12302 #U

For entity 1, whose label is el_p_idcard and whose key is e_idcard_id#310000200712161365, the updated property is {"e_idcard_id":"310000200712161365","e_idcard_name":"yield 6","e_idcard_huji":"Shanghai City","e_idcard_status":"normal"}.

Entity 2:

e_car_id#40511 el_o_car e_car_id#40511 {"e_car_id":40511,"e_car_license":"Yu R00686"} 12304 #U

For entity 2, whose label is el_o_car and whose key is e_car_id#40511, the updated property is {"e_car_id":40511,"e_car_license":"Yu R00686"}.

In some optional implementations of this embodiment, if no matching entity information exists, a new vertex identifier is generated for the entity according to a preset encoding rule, and the correspondence between the newly generated vertex identifier and the entity information of the entity is added to the vertex identifier dictionary.

An example of entity information newly added to the vertex identifier dictionary is shown below, where #I indicates that the piece of information is newly inserted:

Entity 1:

e_idcard_id#310000200705245112 el_p_idcard e_idcard_id#310000200705245112 {"e_idcard_id":"310000200705245472","e_idcard_name":"late 546","e_idcard_huji":"Shanghai City","e_idcard_status":"immigration"} #I

This indicates that entity 1, whose label is el_p_idcard, whose key is e_idcard_id#310000200705245112, and whose property is {"e_idcard_id":"310000200705245472","e_idcard_name":"late 546","e_idcard_huji":"Shanghai City","e_idcard_status":"immigration"}, is newly added to the vertex identifier dictionary.

Entity 2:

e_car_id#81866 el_o_car e_car_id#81866 {"e_car_id":81866,"e_car_license":"Min B00304"} #I

This indicates that entity 2, whose label is el_o_car, whose key is e_car_id#81866, and whose property is {"e_car_id":81866,"e_car_license":"Min B00304"}, is newly added to the vertex identifier dictionary.

Step 204, for an original edge in at least one original edge, obtaining vertex identifications corresponding to two entities included in the original edge, and generating a related edge according to the edge information of the original edge, the vertex identifications corresponding to the two entities, and the entity information of the two entities.

In this embodiment, for an original edge in at least one original edge, entity information of two entities included in the original edge is compared with a vertex identifier dictionary, and an existing vertex identifier or a newly added vertex identifier is obtained. The format of the associated edge may be:

[ Label of entity 1, key of entity 1, vertex identification of entity 1, property of entity 1, Label of entity 2, key of entity 2, vertex identification of entity 2, property of entity 2, Label of edge, property of edge ]

Compared with the original edge, the associated edge additionally carries the vertex identifiers, which makes insertion into the graph database fast and convenient. Generating each associated edge requires two vertex identifier lookups, one for each entity.
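A minimal sketch of turning an original edge into an associated edge in the format given above follows. It assumes, which this passage does not state explicitly, that the original edge holds the same eight elements as the associated edge minus the two vertex identifiers. The function build_associated_edge and the lookup callable are hypothetical names.

```python
def build_associated_edge(original_edge, lookup_vertex_id):
    """Insert the two vertex identifiers into an original edge.

    original_edge (assumed layout):
        [label1, key1, props1, label2, key2, props2, edge_label, edge_props]
    lookup_vertex_id: callable (label, key) -> vertex identifier
    """
    label1, key1, props1, label2, key2, props2, edge_label, edge_props = original_edge
    vertex_id1 = lookup_vertex_id(label1, key1)  # first lookup
    vertex_id2 = lookup_vertex_id(label2, key2)  # second lookup
    return [label1, key1, vertex_id1, props1,
            label2, key2, vertex_id2, props2,
            edge_label, edge_props]


# Toy lookup table standing in for the vertex identifier dictionary.
known = {
    ("el_p_idcard", "e_idcard_id#310000196002196105"): 12310,
    ("el_o_car", "e_car_id#17792"): 12302,
}
edge = build_associated_edge(
    ["el_p_idcard", "e_idcard_id#310000196002196105", {},
     "el_o_car", "e_car_id#17792", {},
     "pa_travel", {}],
    lambda label, key: known[(label, key)],
)
print(edge)
```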

With continued reference to figs. 3a and 3b, schematic diagrams of an application scenario of the method for processing information according to this embodiment are shown. In the application scenario of figs. 3a and 3b, the server obtains a large number of logs from a third-party server; the process of generating an edge for storage in the graph database is described below using one of those logs as an example. The log, which contains hundreds of characters, is split into fields numbered from 0 according to the delimiter \t specified by the entity extraction configuration rule in fig. 3a. The content of each field is then located according to the fields specified by the entity extraction configuration rule, and each field is parsed under the data type restriction given in fig. 3b. Field 0 thus yields e_idcard_id, whose content is "310000196002196105". Field 1 yields e_idcard_name, whose content is "feathers 211". Field 4 yields e_idcard_huji, whose content is "Shanghai city". Field 5 yields e_idcard_status, whose content is "expired". Field 67 yields e_car_id, whose content is "17792", and field 68 yields e_car_license, whose content is "yue VR 03316".

The generated raw edges are as follows:

For entity 1, looking up the vertex identifier dictionary finds no entity key whose content is 310000196002196105, so a new vertex identifier 12310 is generated for the entity, and the correspondence between the newly generated vertex identifier and the entity information is added to the vertex identifier dictionary.

For entity 2, looking up the vertex identifier dictionary finds an existing entity key whose content is 17792, and the vertex identifier corresponding to that entity key is 12302. Therefore 12302 is taken as the vertex identifier of entity 2, and the entity information corresponding to vertex identifier 12302 in the vertex identifier dictionary is updated according to the entity information of entity 2. For example, if the entity information corresponding to vertex identifier 12302 in the vertex identifier dictionary does not include the vehicle age while the entity information of entity 2 includes the vehicle age (e_car_age), the vehicle age is added to the entity information corresponding to vertex identifier 12302; that is, the entity information is merged.

Finally, the associated edges are generated as follows:

el_p_idcard, e_idcard_id#310000196002196105, 12310, {"e_idcard_id": "310000196002196105", "e_idcard_name": "feather 211", "e_idcard_huji": "shanghai", "e_idcard_status": "expired"}, el_o_car, e_car_id#17792, 12302, {"e_car_id": 17792, "e_car_license": "Guangdong VR03316", "e_car_rl": 7}, pa_travel, {}

The server may store the generated associated edges in a graph database for convenient indexing by vertex identification.

According to the method provided by the embodiment of the application, the entities are extracted from the log, and then the vertex identification in the graph database is used as the index of the entities to generate the associated edge in the graph database.

With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for processing information is shown. The flow 400 of the method for processing information includes the steps of:

Step 401, an entity extraction configuration rule is obtained.

In this embodiment, an execution body of the method for processing information (e.g., the server shown in fig. 1) may obtain the entity extraction configuration rule from a third-party server through a wired or wireless connection. The entity extraction configuration rule specifies the positions in the log of the entity information of the two entities of the original edge, and includes the edge information of the associated edge of the two entities. The edge information may be used to indicate a relationship between the two entities. The entity extraction configuration rule mainly contains the fields to be extracted, the fields that belong to the label (entity label), key (entity key), and property (entity attribute), the separator, the input and output paths, and the like. The content of the rule is filled in by a user, so no code-level modification or new code is needed to handle new logs or new entity extraction requirements. For example, the entity extraction configuration rule may specify that the entity information of entity 1 is extracted from the 2nd field in the log, that the entity information of entity 2 is extracted from the 12th field in the log, and which separator delimits the fields. When the log is parsed, a new field begins each time the separator is encountered; after the log has been split into fields, the content of the fields at the specified positions is extracted as entity information. The content of the entity extraction configuration rule is shown in fig. 3a, and the detailed description can refer to step 202.
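To make the role of the entity extraction configuration rule concrete, the sketch below splits one log line by the configured separator and reads the fields at the configured positions. The rule layout (the RULE dictionary and its keys), the helper names, and the sample log line are all hypothetical; the actual rule format is the one shown in fig. 3a, and the field positions here are shortened for readability.

```python
# Hypothetical rule layout; field positions are counted from 0.
RULE = {
    "delimiter": "\t",
    "entity1": {"label": "el_p_idcard", "key_field": "e_idcard_id",
                "fields": {0: "e_idcard_id", 1: "e_idcard_name"}},
    "entity2": {"label": "el_o_car", "key_field": "e_car_id",
                "fields": {3: "e_car_id", 4: "e_car_license"}},
    "edge": {"label": "pa_travel", "properties": {}},
}


def extract_entity(columns, spec):
    """Build (label, key, properties) for one entity from the split log."""
    props = {name: columns[index] for index, name in spec["fields"].items()}
    key = f'{spec["key_field"]}#{props[spec["key_field"]]}'
    return spec["label"], key, props


def generate_original_edge(log_line, rule):
    """Split the log by the separator and assemble an original edge."""
    columns = log_line.rstrip("\n").split(rule["delimiter"])
    label1, key1, props1 = extract_entity(columns, rule["entity1"])
    label2, key2, props2 = extract_entity(columns, rule["entity2"])
    edge = rule["edge"]
    return [label1, key1, props1, label2, key2, props2,
            edge["label"], dict(edge["properties"])]


# Made-up log line with five tab-separated fields.
line = "310000196002196105\tname-a\tunused\t17792\tLIC-001\n"
print(generate_original_edge(line, RULE))
```

Because the positions, labels, and separator live in data rather than in code, supporting a new log format only requires a new rule, which is the point the paragraph above makes.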

Step 402, a field type rule is obtained.

In this embodiment, the field type rule may be obtained from a third-party server through a wired or wireless connection. The field type rule specifies the data type of each field in the entity information. For example, if the data type of a field is integer, the content of that field is parsed as an integer when the log is parsed. As shown in fig. 3b, the data types in the log are specified, denoted by type. The data type of each field in the entity information, as listed under desc, is also specified. For example, e_idcard_age: integer means that the data type of the attribute e_idcard_age is integer; if the field in the log is 0x1f, its value is parsed as an age of 31.
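Type-driven parsing of a field can be sketched as follows. The FIELD_TYPES mapping, the PARSERS table, and parse_field are hypothetical names; only the integer and string types are covered, and the real rule format is the one in fig. 3b.

```python
# Hypothetical field type rule: attribute name -> declared data type.
FIELD_TYPES = {
    "e_idcard_age": "integer",
    "e_idcard_name": "string",
}

# One parser per declared type. int(text, 0) infers the base from the prefix,
# so both "31" and "0x1f" are accepted. Note that decimal text with leading
# zeros such as "031" is rejected in this mode.
PARSERS = {
    "integer": lambda text: int(text, 0),
    "string": str,
}


def parse_field(name, raw, field_types=FIELD_TYPES):
    """Parse the raw text of a field according to its declared type."""
    return PARSERS[field_types[name]](raw)


print(parse_field("e_idcard_age", "0x1f"))
print(parse_field("e_idcard_name", "name-a"))
```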

Step 403, checking whether the entity extraction configuration rule matches with the field type rule.

In this embodiment, it may be verified whether each field in the entity extraction configuration rule exists in the field type rule. If a field does not exist, either the field in the entity extraction configuration rule or the field type rule has been filled in incorrectly, or the field is missing from the field type rule. It is also possible to check whether a field is erroneous against its agreed type. For example, e_idcard_age is agreed to be an integer; if e_idcard_age in the field type rule is a non-integer type such as date or string, the field type rule contains an error and needs to be corrected. Error prompt information may be output during correction so that staff can quickly locate and fix the error.
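The two checks described above, field presence and agreement with the agreed type, can be sketched as one validation pass that returns readable error messages. The function check_rules and its three arguments are hypothetical; in particular, the source does not say where the agreed types are stored, so they are passed in as a separate mapping here.

```python
def check_rules(extracted_fields, field_types, agreed_types):
    """Return a list of error messages; an empty list means the rules match.

    extracted_fields: field names used by the entity extraction configuration rule
    field_types:      field name -> type declared in the field type rule
    agreed_types:     field name -> type agreed on in advance (assumed input)
    """
    errors = []
    for name in extracted_fields:
        if name not in field_types:
            errors.append(f"{name}: missing from the field type rule")
        elif name in agreed_types and field_types[name] != agreed_types[name]:
            errors.append(
                f"{name}: expected {agreed_types[name]}, found {field_types[name]}"
            )
    return errors


problems = check_rules(
    ["e_idcard_id", "e_idcard_age", "e_car_id"],
    {"e_idcard_id": "string", "e_idcard_age": "date"},
    {"e_idcard_age": "integer"},
)
for message in problems:
    print(message)
```

Collecting every problem before reporting, rather than stopping at the first one, matches the stated goal of letting staff locate and fix errors quickly.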

Step 404, if they match, at least one log is obtained.

In this embodiment, if the entity extraction configuration rule matches the field type rule, at least one log is obtained. This step is substantially the same as step 201 and thus is not described in detail. If not, the entity extraction configuration rule and the field type rule need to be modified.

Step 405, for the log in at least one log, generating an original edge based on the log according to a predetermined entity extraction configuration rule.

Step 406, for an entity in the at least one entity related to the generated at least one original edge, obtaining a vertex identifier corresponding to the entity through a preset vertex identifier dictionary.

Step 407, for an original edge of at least one original edge, obtaining vertex identifiers corresponding to two entities included in the original edge, and generating an associated edge according to the edge information of the original edge, the vertex identifiers corresponding to the two entities, and the entity information of the two entities.

Steps 404 to 407 are substantially the same as steps 201 to 204, and therefore the description thereof is omitted.

As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for processing information in this embodiment highlights the step of checking the entity extraction configuration rule. The scheme described in this embodiment can therefore improve the accuracy of extracting the relationships between entities.

With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for processing information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.

As shown in fig. 5, the apparatus 500 for processing information of this embodiment includes: an acquisition unit 501, an original edge generation unit 502, a vertex extraction unit 503, and an associated edge generation unit 504. The acquisition unit 501 is configured to acquire at least one log, wherein the log includes entity information of at least one entity. The original edge generation unit 502 is configured to generate, for a log of the at least one log, an original edge based on the log according to a predetermined entity extraction configuration rule, wherein the original edge includes entity information of two entities, the entities correspond to vertices in a preset graph database, and the entity extraction configuration rule is used for specifying positions of the entity information of the two entities of the original edge in the log and includes edge information of an associated edge of the two entities. The vertex extraction unit 503 is configured to, for an entity in the at least one entity related to the generated at least one original edge, obtain a vertex identifier corresponding to the entity through a preset vertex identifier dictionary, wherein the vertex identifier dictionary is used to characterize a correspondence between a vertex identifier of a vertex in the graph database and entity information of the entity. The associated edge generation unit 504 is configured to, for an original edge in the at least one original edge, obtain vertex identifiers corresponding to the two entities included in the original edge, and generate an associated edge according to the edge information of the original edge, the vertex identifiers corresponding to the two entities, and the entity information of the two entities.

In this embodiment, specific processing of the acquiring unit 501, the original edge generating unit 502, the vertex extracting unit 503, and the associated edge generating unit 504 of the apparatus 500 for processing information may refer to step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2.

In some optional implementations of this embodiment, the entity information includes at least one of: an entity label, an entity key, and an entity attribute; and the edge information includes at least one of: an edge label and an edge attribute.

In some optional implementations of this embodiment, the entity information includes: an entity label and an entity key; and the vertex extraction unit 503 is further configured to: determine whether entity information matching the entity label and the entity key in the entity information of the entity exists in a preset vertex identification dictionary; and if so, determine the vertex identification corresponding to the matched entity information as the vertex identification of the entity, and update the entity information of the entity in the vertex identification dictionary.

In some optional implementations of this embodiment, the vertex extraction unit 503 is further configured to: if not, generating the vertex identification of the entity, and adding the generated corresponding relation between the vertex identification of the entity and the entity information of the entity in the vertex identification dictionary.

In some optional implementations of this embodiment, the vertex extraction unit 503 is further configured to: and in response to detecting that the entity attribute in the entity information of the entity is different from the entity attribute in the entity information of the entity in the vertex identification dictionary, combining the entity attribute in the entity information of the entity with the entity attribute in the entity information of the entity in the vertex identification dictionary to serve as the entity attribute in the entity information of the entity in the vertex identification dictionary.

In some optional implementations of this embodiment, the original edge generating unit 502 is further configured to: respectively reading the entity information of the two entities according to the positions of the entity information of the two entities at the original edge in the log, which are specified in the preset entity extraction configuration rule; analyzing the read entity information of the two entities according to a preset field type rule to obtain the analyzed entity information of the two entities, wherein the field type rule is used for specifying the data type of each field in the entity information; and generating an original edge according to the analyzed entity information of the two entities and the edge information in the entity extraction configuration rule.

In some optional implementations of this embodiment, the apparatus 500 further comprises a verification unit (not shown) configured to: and checking the entity extraction configuration rule by using the field type rule.

Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing an electronic device (e.g., the server shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, an original edge generation unit, a vertex extraction unit, and an associated edge generation unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, the obtaining unit may also be described as a "unit that obtains at least one log".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: obtain at least one log, wherein the log comprises entity information of at least one entity; for a log in the at least one log, generate an original edge based on the log according to a preset entity extraction configuration rule; for an entity in at least one entity related to the generated at least one original edge, acquire a vertex identifier corresponding to the entity through a preset vertex identifier dictionary; and for an original edge in the at least one original edge, obtain the vertex identifiers corresponding to the two entities included in the original edge, and generate an associated edge according to the edge information of the original edge, the vertex identifiers corresponding to the two entities, and the entity information of the two entities.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for processing information, comprising:

acquiring at least one log, wherein the log comprises entity information of at least one entity;

for a log in the at least one log, generating an original edge based on the log according to a preset entity extraction configuration rule, wherein the original edge comprises entity information of two entities, the entities correspond to vertexes in a preset graph database, and the entity extraction configuration rule is used for specifying positions of the entity information of the two entities of the original edge in the log and comprises edge information of an associated edge of the two entities;

for an entity in at least one entity related to at least one generated original edge, acquiring a vertex identification corresponding to the entity through a preset vertex identification dictionary, wherein the vertex identification dictionary is used for representing the corresponding relation between the vertex identification of a vertex in the graph database and entity information of the entity, the vertex identification is used as an index of the graph database, the entity information is searched and inserted through the vertex identification, the original edge and the vertex identification dictionary are subjected to de-duplication and merging operations, a vertex whose entity label and entity key match the vertex identification dictionary is a vertex or an entity to be updated in the graph database, and a vertex whose entity label and entity key do not match the vertex identification dictionary is a vertex or an entity to be newly added to the graph database;

and for the original edge in the at least one original edge, acquiring vertex identifications corresponding to two entities included in the original edge, generating an associated edge according to the edge information of the original edge, the vertex identifications corresponding to the two entities and the entity information of the two entities, storing the generated associated edge in a graph database, and indexing through the vertex identifications.

2. The method of claim 1, wherein the entity information comprises at least one of: an entity label, an entity key and an entity attribute, and the edge information comprises at least one of: an edge label and an edge attribute.

3. The method of claim 2, wherein the entity information comprises: an entity label and an entity key; and

the obtaining of the vertex identifier corresponding to the entity through the preset vertex identifier dictionary includes:

determining whether entity information matched with an entity label and an entity key in the entity information of the entity exists in a preset vertex identification dictionary;

and if the entity information exists, determining the vertex identification corresponding to the matched entity information as the vertex identification of the entity, and updating the entity information of the entity in the vertex identification dictionary.

4. The method of claim 3, wherein the method further comprises:

and if the entity does not exist, generating the vertex identification of the entity, and adding the generated corresponding relation between the vertex identification of the entity and the entity information of the entity in the vertex identification dictionary.

5. The method of claim 3, wherein the updating entity information for the entity in the vertex identification dictionary comprises:

and in response to detecting that the entity attribute in the entity information of the entity is different from the entity attribute in the entity information of the entity in the vertex identification dictionary, combining the entity attribute in the entity information of the entity with the entity attribute in the entity information of the entity in the vertex identification dictionary and then using the combined entity attribute as the entity attribute in the entity information of the entity in the vertex identification dictionary.

6. The method of claim 1, wherein generating the original edge based on the log according to a predetermined entity extraction configuration rule comprises:

respectively reading the entity information of the two entities according to the positions of the entity information of the two entities at the original edge in the log, which are specified in the preset entity extraction configuration rule;

analyzing the read entity information of the two entities according to a preset field type rule to obtain the analyzed entity information of the two entities, wherein the field type rule is used for specifying the data type of each field in the entity information;

and generating an original edge according to the analyzed entity information of the two entities and the edge information in the entity extraction configuration rule.

7. The method of claim 6, wherein the method further comprises:

and checking the entity extraction configuration rule by using the field type rule.

8. An apparatus for processing information, comprising:

an acquisition unit configured to acquire at least one log, wherein the log includes entity information of at least one entity;

an original edge generating unit configured to generate an original edge based on a log in the at least one log according to a predetermined entity extraction configuration rule, wherein the original edge includes entity information of two entities, the entities correspond to vertices in a preset graph database, and the entity extraction configuration rule is used for specifying positions of the entity information of the two entities of the original edge in the log and includes edge information of an associated edge of the two entities;

a vertex extraction unit, configured to obtain a vertex identifier corresponding to an entity in at least one entity related to at least one generated original edge through a preset vertex identifier dictionary, where the vertex identifier dictionary is used to represent a corresponding relationship between a vertex identifier of a vertex in the graph database and entity information of the entity, the vertex identifier is used as an index of the graph database, the entity information is searched and inserted through the vertex identifier, the original edge and the vertex identifier dictionary are subjected to deduplication and merging operations, a vertex whose entity label and entity key match the vertex identifier dictionary is a vertex or an entity to be updated in the graph database, and a vertex whose entity label and entity key do not match the vertex identifier dictionary is a vertex or an entity to be newly added to the graph database;

and the associated edge generating unit is configured to acquire vertex identifications corresponding to two entities included in an original edge, generate an associated edge according to the edge information of the original edge, the vertex identifications corresponding to the two entities and the entity information of the two entities, store the generated associated edge in a graph database, and perform indexing through the vertex identifications.

9. The apparatus of claim 8, wherein the entity information comprises at least one of: an entity label, an entity key and an entity attribute, and the edge information comprises at least one of: an edge label and an edge attribute.

10. The apparatus of claim 9, wherein the entity information comprises: an entity label and an entity key; and

the vertex extraction unit is further configured to:

11. The apparatus of claim 10, wherein the vertex extraction unit is further configured to:

12. The apparatus of claim 10, wherein the vertex extraction unit is further configured to:

13. The apparatus of claim 8, wherein the raw edge generation unit is further configured to:

14. The apparatus of claim 13, wherein the apparatus further comprises a verification unit configured to:

15. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.