CN110895584B - Method and apparatus for generating data - Google Patents


Info

Publication number
CN110895584B
Authority
CN
China
Prior art keywords
data
entity
type
entities
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811051031.4A
Other languages
Chinese (zh)
Other versions
CN110895584A (en)
Inventor
张阳
谢奕
刘畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811051031.4A priority Critical patent/CN110895584B/en
Publication of CN110895584A publication Critical patent/CN110895584A/en
Publication of CN110895584B publication Critical patent/CN110895584B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a method and an apparatus for generating data. One embodiment of the method includes: acquiring target data; analyzing the target data and determining entities of preset types included in the target data; extracting identification data and attribute data of the determined entities, and type data and attribute data of the relationships between the entities; and generating graph data according to the extracted data. This implementation can effectively analyze data to generate graph data, which facilitates retrieval.

Description

Method and apparatus for generating data
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating data.
Background
In public security scenarios, a large amount of monitoring data is generated every day; meanwhile, as the internet becomes widespread, the volume of users' internet behavior data grows day by day. Against this background, in order to monitor the network environment in a timely manner and to support clue analysis and deduction in public security technical investigation and criminal investigation scenarios, associative retrieval over existing data is becoming increasingly important.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating data.
In a first aspect, an embodiment of the present application provides a method for generating data, including: acquiring target data; analyzing the target data and determining entities of preset types included in the target data; extracting identification data and attribute data of the determined entities, and type data and attribute data of the relationships between the entities; and generating graph data according to the extracted data.
In some embodiments, the attribute data of the entity includes at least one key-value pair, and the attribute data of the entity includes a data type of the key and a data type of the value; and generating graph data according to the extracted data includes: performing at least one of the following processes on the extracted data, and generating graph data from the processed data: deleting, from the attribute data of the entity, key-value pairs whose value data type does not match a preset data type; for a determined entity, determining, according to the type data of the relationships involving the entity, a first number of relationships involving the entity whose type is a preset type, and, in response to determining that the first number is greater than a first preset threshold, deleting the type data and attribute data of the relationships involving the entity whose type is the preset type; for a determined entity, determining, according to the type data of the relationships involving the entity, a second number of entities having a relationship with the entity, and, in response to determining that the second number is greater than a second preset threshold, deleting the identification data and attribute data of the entity.
In some embodiments, the attribute data of the entity comprises at least one key-value pair; and generating graph data based on the extracted data, comprising: analyzing the keys in the at least one key value pair, and determining the key value pairs corresponding to the keys meeting the preset conditions; generating graph data according to the determined key-value pairs.
In some embodiments, the above method further comprises: storing the generated graph data in a preset graph database; and storing those of the at least one key-value pair that do not satisfy the preset condition in a database other than the graph database.
In some embodiments, the storing the generated graph data in a preset graph database includes: for the determined entity, determining whether the entity is included in the graph database according to the identification data of the entity; and in response to the fact that the entity is determined to be included in the graph database, updating the attribute data of the entity stored in the graph database according to the extracted attribute data of the entity.
In some embodiments, the storing the generated graph data in a preset graph database includes: responsive to determining that the entity is not included in the graph database, generating an entity identification for the entity; storing the generated entity identification, the identification data of the entity and the attribute data of the entity in the graph database.
In some embodiments, the storing the generated graph data in a preset graph database includes: in response to completion of the storing of the identification data and the attribute data of the determined entities, type data and attribute data of the relationships between the entities are stored in the graph database.
In a second aspect, an embodiment of the present application provides an apparatus for generating data, including: an acquisition unit configured to acquire target data; an analysis unit configured to analyze the target data and determine entities of preset types included in the target data; an extracting unit configured to extract identification data and attribute data of the determined entities, and type data and attribute data of the relationships between the entities; and a generating unit configured to generate graph data according to the extracted data.
In some embodiments, the attribute data of the entity includes at least one key-value pair, and the attribute data of the entity includes a data type of the key and a data type of the value; and the generating unit is further configured to: perform at least one of the following processes on the extracted data, and generate graph data from the processed data: deleting, from the attribute data of the entity, key-value pairs whose value data type does not match a preset data type; for a determined entity, determining, according to the type data of the relationships involving the entity, a first number of relationships involving the entity whose type is a preset type, and, in response to determining that the first number is greater than a first preset threshold, deleting the type data and attribute data of the relationships involving the entity whose type is the preset type; for a determined entity, determining, according to the type data of the relationships involving the entity, a second number of entities having a relationship with the entity, and, in response to determining that the second number is greater than a second preset threshold, deleting the identification data and attribute data of the entity.
In some embodiments, the attribute data of the entity comprises at least one key-value pair; and the generating unit is further configured to: analyzing the keys in the at least one key value pair, and determining the key value pairs corresponding to the keys meeting the preset conditions; graph data is generated according to the determined key-value pairs.
In some embodiments, the above apparatus further comprises: a first storage unit configured to store the generated graph data in a preset graph database; and a second storage unit configured to store the key-value pairs that do not satisfy the preset condition in a database other than the graph database.
In some embodiments, the first storage unit is further configured to: for the determined entity, determining whether the entity is included in the graph database according to the identification data of the entity; and in response to the fact that the entity is determined to be included in the graph database, updating the attribute data of the entity stored in the graph database according to the extracted attribute data of the entity.
In some embodiments, the first storage unit is further configured to: responsive to determining that the entity is not included in the graph database, generating an entity identification for the entity; and storing the generated entity identification, the identification data of the entity and the attribute data of the entity into the graph database.
In some embodiments, the first storage unit is further configured to: in response to completion of the storing of the identification data and the attribute data of the determined entities, type data and attribute data of the relationships between the entities are stored in the graph database.
In a third aspect, an embodiment of the present application provides a server, including: one or more processors; a storage device, on which one or more programs are stored, which when executed by the one or more processors cause the one or more processors to implement the method as described in any embodiment of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements a method as described in any of the embodiments of the first aspect.
The method and the apparatus for generating data provided by the above embodiments of the present application first acquire target data, then analyze the target data and determine entities of preset types included in the target data. Next, the identification data and attribute data of the determined entities, and the type data and attribute data of the relationships between the entities, are extracted. Finally, graph data is generated from the extracted data. The method and the apparatus of these embodiments can effectively analyze data to generate graph data, which facilitates retrieval.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating data according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for generating data according to the present application;
FIG. 4 is a flow diagram of another embodiment of a method for generating data according to the present application;
FIG. 5 is a block diagram of one embodiment of an apparatus for generating data according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating data or the apparatus for generating data of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting data transmission, including but not limited to smart phones, tablet computers, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server that provides various services, such as a background server that processes data transmitted by the terminal devices 101, 102, 103. The background server may perform processing such as analysis on the received data and feed back a processing result (e.g., generated graph data) to the terminal devices 101, 102, 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the method for generating data provided in the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for generating data is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating data in accordance with the present application is shown. The method for generating data of the embodiment comprises the following steps:
step 201, target data is acquired.
In the present embodiment, the execution subject of the method for generating data (e.g., the server 105 shown in fig. 1) may acquire the target data through a wired or wireless connection. The target data may be used to generate graph data. The execution subject may obtain the target data in various ways: for example, the target data may be crawled from the network by a web crawler (also called a web spider or web robot, a program or script that automatically captures web information according to certain rules), obtained from a relational database, read from local storage of the execution subject, or received from a terminal. The target data may be in various formats, for example tabular data or text data.
It is noted that the wireless connection means may include, but is not limited to, a 3G/4G connection, a WiFi connection, a bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection means now known or developed in the future.
Step 202, analyzing the target data, and determining entities of preset types included in the target data.
After obtaining the target data, the execution subject may analyze the target data to determine the entities of preset types included in the target data. The execution subject may identify entities included in the target data through Named Entity Recognition (NER) and Entity Linking (EL) techniques. An entity represents an object or concept in the real world described in the database. Entities are things that exist in the objective world and are distinguishable from each other. An entity can be a person, a place, an object, or an abstract concept. In this embodiment, the types of entity to be identified may be preset, and the types may include a person, a place, an object, and the like.
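As a rough illustration of restricting recognition to preset entity types, the sketch below uses a simple dictionary lookup in place of a real NER/EL model. The gazetteer contents, the type names, and the function `find_entities` are all hypothetical and are not taken from the application.

```python
# Hypothetical gazetteer mapping known surface forms to entity types.
# A real system would use NER and entity linking models instead.
GAZETTEER = {
    "Zhang San": "person",
    "Beijing": "place",
    "Acme Corp": "organization",
}

# Preset entity types to keep (an assumption for this sketch).
PRESET_TYPES = {"person", "place", "object"}


def find_entities(text, gazetteer=GAZETTEER, preset_types=PRESET_TYPES):
    """Return (surface form, type) pairs found in text, preset types only."""
    found = []
    for name, entity_type in gazetteer.items():
        if name in text and entity_type in preset_types:
            found.append((name, entity_type))
    return found


entities = find_entities("Zhang San of Acme Corp arrived in Beijing.")
```

Here "Acme Corp" is matched by the lookup but dropped, because its type is not among the preset types.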
Step 203, extracting the determined identification data and attribute data of the entities and the type data and attribute data of the relationship between the entities.
After determining the entities included in the target data, the execution subject may extract the data of the entities and the data of the relationships between the entities from the target data. The data of an entity may include identification data and attribute data, and the data of a relationship may include type data and attribute data. The identification data of an entity is data that distinguishes the entity, for example a name. The attribute data of an entity is data that describes the entity. For a person entity, the attribute data may include a birth date, a height, an identification number, and the like. For an object entity, the attribute data may include volume, size, date of manufacture, brand, and the like. The type data of a relationship is data that describes the type of relationship between entities. For example, relationships between persons may include a couple relationship, a friendship relationship, a parent-child relationship, and the like. The relationship between a person and an object may include an attribution relationship and the like. The relationship between a person and a place may be an arrival relationship and the like. The attribute data of a relationship is data that describes the relationship, for example the time at which the relationship occurred or the place where it occurred.
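One possible in-memory shape for the extracted data is sketched below. The class names, field names, and sample values are illustrative assumptions; the application does not prescribe a particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    identification: str          # identification data, e.g. a name
    entity_type: str             # one of the preset types
    attributes: dict = field(default_factory=dict)   # attribute key-value pairs


@dataclass
class Relationship:
    source: str                  # identification data of one entity
    target: str                  # identification data of the other entity
    relation_type: str           # type data of the relationship
    attributes: dict = field(default_factory=dict)   # e.g. time, place


person = Entity("Zhang San", "person", {"birth_date": "1990-01-01", "height_cm": 175})
car = Entity("Car A", "object", {"brand": "ExampleBrand"})
owns = Relationship("Car A", "Zhang San", "attribution", {"since": "2018-05-01"})
```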
Step 204, generating graph data according to the extracted data.
After extracting the data, the execution subject may generate graph data from the extracted data. Specifically, the execution subject may use the extracted entities as nodes and the relationships between the entities as edges, and connect the nodes to obtain a graph structure over the entities. The execution subject may then store the identification data and attribute data of each entity in the corresponding node. The type data and attribute data of each relationship may likewise be stored in the corresponding edge.
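The node-and-edge construction described above can be sketched with plain dictionaries. The layout of the `graph` dictionary and the helper `neighbours` are assumptions made for illustration, not the representation used by the application.

```python
def build_graph(entities, relationships):
    """Build graph data: one node per entity, one edge per relationship."""
    nodes = {
        e["id"]: {"identification": e["id"], "attributes": dict(e["attributes"])}
        for e in entities
    }
    edges = [
        {
            "source": r["source"],
            "target": r["target"],
            "type": r["type"],
            "attributes": dict(r["attributes"]),
        }
        for r in relationships
    ]
    return {"nodes": nodes, "edges": edges}


def neighbours(graph, node_id):
    """Return the nodes connected to node_id by any edge, in sorted order."""
    linked = set()
    for edge in graph["edges"]:
        if edge["source"] == node_id:
            linked.add(edge["target"])
        elif edge["target"] == node_id:
            linked.add(edge["source"])
    return sorted(linked)


graph = build_graph(
    entities=[
        {"id": "Zhang San", "attributes": {"birth_date": "1990-01-01"}},
        {"id": "Li Si", "attributes": {}},
        {"id": "Beijing", "attributes": {}},
    ],
    relationships=[
        {"source": "Zhang San", "target": "Li Si", "type": "friend", "attributes": {}},
        {"source": "Zhang San", "target": "Beijing", "type": "arrival",
         "attributes": {"time": "2018-09-01"}},
    ],
)
```

Once the data is in this form, a retrieval such as "which entities are linked to Zhang San" reduces to a scan of the edges, as `neighbours(graph, "Zhang San")` does.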
In some implementations, the execution subject may also determine the direction of an edge according to the type of the relationship. For example, for a couple relationship, the edge points to both nodes. For an arrival relationship, the edge points to the node of the place.
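A minimal sketch of choosing edge direction from the relationship type follows. Which relationship types count as symmetric is an assumption here, as is the function name `directed_edges`.

```python
# Hypothetical set of relationship types treated as bidirectional.
SYMMETRIC_TYPES = {"couple", "friend"}


def directed_edges(source, target, relation_type):
    """Return the directed (from, to) pairs for one relationship."""
    if relation_type in SYMMETRIC_TYPES:
        # A symmetric relationship points to both nodes.
        return [(source, target), (target, source)]
    # Otherwise the edge points from source to target,
    # e.g. from a person to the place for an arrival relationship.
    return [(source, target)]
```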
In some implementations, the execution subject may also display the identification data of the entity on the node and the type data of the relationship on the edge.
In some optional implementations of this embodiment, the execution subject may further send the extracted data and the target data to a terminal. After receiving the target data and the extracted data, the user of the terminal may identify entities in the target data that the execution subject did not determine, and further extract the identification data and attribute data of those entities and the type data and attribute data of the relationships between the entities. After the user completes the extraction, the extracted data may be transmitted to the execution subject through the terminal. The execution subject may generate the graph data from the data it extracted itself and the data received from the terminal.
With continuing reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating data according to the present embodiment. In the application scenario of fig. 3, the server obtains a batch of police service data. After analyzing the police service data, the server determines the entities included in it and extracts the identification data and attribute data of the entities and the type data and attribute data of the relationships between the entities. The server can then send the extracted data to a terminal used by a staff member of a public security department. After reviewing the data extracted by the server, the staff member can supplement it by extracting further identification data and attribute data of entities, and type data and attribute data of relationships between entities, from the police service data. The server also receives the data supplemented by the staff member. From the data obtained in these two ways, the server generates graph data representing the entities and the relationships between them.
The method for generating data provided by the above embodiments of the present application first acquires target data, then analyzes the target data and determines entities of preset types included in the target data. Next, the identification data and attribute data of the determined entities, and the type data and attribute data of the relationships between the entities, are extracted. Finally, graph data is generated from the extracted data. The method of this embodiment can effectively analyze data to generate graph data, which facilitates retrieval.
In some optional implementations of this embodiment, the step 204 may further include the following steps not shown in fig. 2: processing the extracted data; and generating graph data according to the processed data.
In this implementation, the execution subject may first process the extracted data and then generate graph data from the processed data. The processing may include, but is not limited to, deleting abnormal data, deleting duplicate data, and the like. Here, if a certain item of data of an entity differs significantly from the corresponding item of data of other entities, the data of that entity can be considered abnormal data. For example, other entities typically have relationships with around 10 entities; if entity A has relationships with thousands of entities, the data of entity A is considered abnormal data. Abnormal data may also include attribute data that is logically inconsistent. For example, if the attribute data of a relationship states that a couple relationship was established at a future time, the attribute data of the relationship is considered abnormal data. The execution subject may also delete duplicate portions of the extracted data.
In some specific implementations of this implementation, the attribute data of the entity includes at least one key-value pair. The attribute data of the entity includes a data type of the key and a data type of the value.
The above processing may include at least one of the following. First, deleting, from the attribute data of the entity, key-value pairs whose value data type does not match a preset data type. Second, for a determined entity, determining, according to the type data of the relationships involving the entity, a first number of relationships involving the entity whose type is a preset type; and, in response to determining that the first number is greater than a first preset threshold, deleting the type data and attribute data of the relationships involving the entity whose type is the preset type. Third, for a determined entity, determining, according to the type data of the relationships involving the entity, a second number of entities having a relationship with the entity; and, in response to determining that the second number is greater than a second preset threshold, deleting the identification data and attribute data of the entity.
In this implementation, the execution subject may detect whether the data type of the value in a key-value pair matches the preset data type; if not, the key-value pair is determined to be abnormal data. For example, for the key-value pair "age = ABC", the data type of the value "ABC" is string (a character string) while the preset data type is int (an integer); the two data types do not match, so the key-value pair is abnormal data. After determining that a key-value pair is abnormal data, the execution subject may delete it.
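The type check can be sketched as a filter over an entity's attribute dictionary. The mapping `PRESET_VALUE_TYPES` and the choice to keep keys that have no preset type are assumptions made for this example.

```python
# Hypothetical preset data types per attribute key.
PRESET_VALUE_TYPES = {"age": int, "name": str}


def drop_mismatched(attributes, preset_types=PRESET_VALUE_TYPES):
    """Delete key-value pairs whose value type does not match the preset type."""
    cleaned = {}
    for key, value in attributes.items():
        expected = preset_types.get(key)
        # Assumption: keys without a preset type are kept unchanged.
        if expected is None or isinstance(value, expected):
            cleaned[key] = value
    return cleaned


cleaned = drop_mismatched({"name": "Zhang San", "age": "ABC"})
```

In this run the pair with key `age` is removed because its value is a string rather than an integer.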
In this implementation, for each entity, the execution subject may also determine the first number of relationships of a particular type from the type data of the relationships between the entity and other entities. If the first number is greater than the first preset threshold, the data of that type of relationship for the entity is considered abnormal data, and the type data and attribute data of the relationships of that type may be deleted. For example, the execution subject may determine the number of couple relationships involving the entity. If the entity is determined to have couple relationships with a plurality of entities, those couple relationships are determined to be abnormal data, and the execution subject may delete the couple relationship data of that entity.
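The first-number check can be sketched as follows, assuming relationships are dictionaries with `source`, `target`, and `type` keys. The function name and the threshold value used in the example are illustrative assumptions.

```python
def drop_excess_relationships(relationships, entity_id, relation_type, first_threshold):
    """Delete relationships of one type for an entity if there are too many."""

    def matches(r):
        return r["type"] == relation_type and entity_id in (r["source"], r["target"])

    # First number: relationships of the preset type that involve the entity.
    first_number = sum(1 for r in relationships if matches(r))
    if first_number > first_threshold:
        return [r for r in relationships if not matches(r)]
    return list(relationships)


rels = [
    {"source": "A", "target": "B", "type": "couple"},
    {"source": "A", "target": "C", "type": "couple"},
    {"source": "A", "target": "D", "type": "friend"},
]
# With a threshold of 1, two couple relationships for A count as abnormal.
filtered = drop_excess_relationships(rels, "A", "couple", first_threshold=1)
```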
In this implementation, for each entity, the execution subject may further determine the number of entities having a relationship with the entity according to the type data of the relationship between the entity and other entities. If the number is larger than a second preset threshold value, the data of the entity is considered to be abnormal data, and the identification data and the attribute data of the entity can be deleted.
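A sketch of the second-number check is given below. Removing the edges that touch a deleted entity is an added assumption so that the result stays consistent; the text above only states that the identification data and attribute data of the entity are deleted.

```python
def drop_overconnected_entities(entities, relationships, second_threshold):
    """Delete entities related to more than second_threshold other entities."""
    linked = {e["id"]: set() for e in entities}
    for r in relationships:
        if r["source"] in linked and r["target"] in linked:
            linked[r["source"]].add(r["target"])
            linked[r["target"]].add(r["source"])
    # Second number: count of distinct entities related to each entity.
    removed = {eid for eid, others in linked.items() if len(others) > second_threshold}
    kept_entities = [e for e in entities if e["id"] not in removed]
    # Assumption: edges touching a removed entity are dropped as well.
    kept_relationships = [
        r for r in relationships
        if r["source"] not in removed and r["target"] not in removed
    ]
    return kept_entities, kept_relationships


ents = [{"id": name} for name in ("H", "A", "B", "C")]
rels2 = [
    {"source": "H", "target": "A", "type": "friend"},
    {"source": "H", "target": "B", "type": "friend"},
    {"source": "H", "target": "C", "type": "friend"},
    {"source": "A", "target": "B", "type": "friend"},
]
kept_ents, kept_rels = drop_overconnected_entities(ents, rels2, second_threshold=2)
```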
In some optional implementations of this embodiment, the attribute data of the entity may include at least one key-value pair. The above step 204 may include the following steps not shown in fig. 2: analyzing the keys in at least one key value pair, and determining the key value pairs corresponding to the keys meeting the preset conditions; generating graph data according to the determined key-value pairs.
In this implementation, the execution subject may analyze the key in each of the at least one key-value pair to determine whether the key satisfies a preset condition. The preset condition may be used to determine that a key-value pair describes the entity and is unlikely to change over time; such data may also be referred to as steady-state attribute data, for example a date of birth, a place of work, or a home address. Key-value pairs whose data is relatively likely to change over time are considered not to satisfy the preset condition; such data may also be referred to as non-steady-state attribute data, for example hobbies. The execution subject may generate the graph data according to the key-value pairs that satisfy the preset condition. Specifically, the execution subject may store the key-value pairs that satisfy the preset condition in the node.
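One simple way to express such a preset condition is a whitelist of keys treated as steady-state. The key names in `STEADY_STATE_KEYS` and the function `split_attributes` are assumptions for illustration; the application does not say how the condition is implemented.

```python
# Hypothetical set of keys considered unlikely to change over time.
STEADY_STATE_KEYS = {"birth_date", "work_place", "home_address"}


def split_attributes(attributes, steady_keys=STEADY_STATE_KEYS):
    """Split key-value pairs into (steady-state, non-steady-state) dictionaries."""
    steady = {k: v for k, v in attributes.items() if k in steady_keys}
    non_steady = {k: v for k, v in attributes.items() if k not in steady_keys}
    return steady, non_steady


steady, non_steady = split_attributes(
    {"birth_date": "1990-01-01", "home_address": "Example Road 1", "hobby": "chess"}
)
```

Only the `steady` part would go into the node; the `non_steady` part is left for separate handling.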
In some optional implementations of this embodiment, the method may further include the following steps not shown in fig. 2: storing the generated graph data in a preset graph database; and storing those of the at least one key-value pair that do not satisfy the preset condition in a database other than the graph database.
In this implementation, after generating the graph data, the execution subject may store the generated graph data in a preset graph database, and store the key-value pairs in the attribute data of the entity that do not satisfy the preset condition in a database other than the graph database. That other database may be a Redis database or MongoDB. Redis is an open-source, log-type key-value database written in ANSI C that supports networking and can be memory-based or persistent. MongoDB is a database based on distributed file storage, written in C++.
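The split between the two stores can be sketched with two plain dictionaries standing in for the graph database and the other database. No real Redis or MongoDB client is used here, and the `entity:key` naming scheme is an assumption.

```python
def store_entity(graph_db, other_db, entity_id, steady, non_steady):
    """Store steady-state pairs on the node and the rest in the other store."""
    # Stand-in for writing node attributes to the graph database.
    graph_db[entity_id] = dict(steady)
    # Stand-in for a key-value store; keys are namespaced by entity.
    for key, value in non_steady.items():
        other_db[f"{entity_id}:{key}"] = value


graph_db, other_db = {}, {}
store_entity(
    graph_db, other_db, "Zhang San",
    steady={"birth_date": "1990-01-01"},
    non_steady={"hobby": "chess"},
)
```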
With continued reference to FIG. 4, a flow 400 of one embodiment of storing graph data in a method for generating data according to the present application is shown. As shown in fig. 4, for each determined entity, the graph data generated by the embodiment shown in fig. 2 may be stored in the present embodiment through the following steps:
step 401, determining whether the entity is included in the graph database according to the identification data of the entity.
In this embodiment, since the graph database stores the identification data of entities, the execution subject may use the identification data of the entity to query whether the graph database includes the same identification data. If the graph database includes the same identification data, the entity is included in the graph database, and step 402 is executed; if the graph database does not include the same identification data, the entity is not included in the graph database, and step 403 is executed.
Step 402, in response to determining that the entity is included in the graph database, updating the stored attribute data of the entity in the graph database according to the extracted attribute data of the entity.
After determining that the entity is included in the graph database, the execution subject may update the attribute data of the entity stored in the graph database according to the extracted attribute data of the entity. Specifically, when updating, the execution subject may first determine whether the graph database already stores a counterpart of the extracted attribute data (that is, attribute data for the same key). If so, the execution subject may replace the stored attribute data with the extracted attribute data. If not, the execution subject may store the extracted attribute data in the graph database. For example, the data stored in the graph database includes the work place "Changping District, Beijing", and the extracted attribute data includes the work place "Haidian District, Beijing". The execution subject may update the work place in the graph database to "Haidian District, Beijing".
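Read as a per-key merge, the update rule can be sketched as below. Treating "the same attribute" as "the same key" is an interpretation, and the function name and sample values are hypothetical.

```python
def update_attributes(stored, extracted):
    """Return stored attributes updated with the extracted ones."""
    updated = dict(stored)
    for key, value in extracted.items():
        # Replaces the value when the key is already stored, adds it otherwise.
        updated[key] = value
    return updated


stored = {"work_place": "District A", "birth_date": "1990-01-01"}
extracted = {"work_place": "District B", "home_address": "Example Road 1"}
merged = update_attributes(stored, extracted)
```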
In response to determining that the entity is not included in the graph database, an entity identification for the entity is generated, STEP 403.
Upon determining that the entity is not included in the graph database, the executing entity may first generate an entity identification for the entity. The entity identification may be used to build an index directory. Specifically, the execution main body may determine the entity identifier corresponding to the identifier data of the entity according to a preset correspondence list between the identifier data and the entity identifier. For example, if the entity identifier in the correspondence list is a natural number arranged in order, the entity identifier of the entity may be determined in the order. Alternatively, the executing entity may use information of the entity, which is unique to the entity, as the entity identifier, where the information may be an identification number, a passport number, or the like.
Step 404, storing the generated entity identification, the identification data of the entity and the attribute data of the entity in a graph database.
After generating the entity identification of the entity, the execution subject may store the entity identification, the identification data of the entity, and the attribute data of the entity in the graph database.
In some optional implementations, the execution subject may further store the generated entity identification in the preset correspondence list between identification data and entity identifications, so that the execution subject can refer to the list when it later generates entity identifications.
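Steps 401 to 404 amount to an insert-or-update keyed on identification data. The following is a minimal sketch under stated assumptions, not the patented implementation: an in-memory dictionary stands in for the graph database and for the correspondence list, the names `GraphStore` and `upsert_entity` are hypothetical, and entity identifications are sequential natural numbers as in the example above.

```python
from itertools import count


class GraphStore:
    """Toy in-memory stand-in for a graph database (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}           # entity identification -> stored record
        self.id_by_ident = {}     # correspondence list: identification data -> entity identification
        self._next_id = count(1)  # sequential natural numbers

    def upsert_entity(self, identification, attributes):
        # Step 401: look the entity up by its identification data.
        entity_id = self.id_by_ident.get(identification)
        if entity_id is not None:
            # Step 402: replace values for existing keys and add new keys.
            self.nodes[entity_id]["attributes"].update(attributes)
            return entity_id
        # Step 403: generate a new entity identification.
        entity_id = next(self._next_id)
        # Step 404: store the identification, identification data, and attributes.
        self.nodes[entity_id] = {
            "identification": identification,
            "attributes": dict(attributes),
        }
        # Optional: record the new identification in the correspondence list.
        self.id_by_ident[identification] = entity_id
        return entity_id
```

Calling `upsert_entity` twice with the same identification data returns the same entity identification and merges the attributes, while a previously unseen entity receives the next number in sequence.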
In some optional implementations of this embodiment, after completing the storage of the identification data and the attribute data of each entity, the execution subject may perform the following steps not shown in fig. 4: in response to completion of the storing of the identification data and attribute data of the determined entities, type data and attribute data of the relationships between the entities are stored in the graph database.
After the data storage of the nodes in the graph data is completed, the data of the edges between the nodes may be stored in the graph database, i.e., the type data and the attribute data of the relationships between the entities are stored in the graph database.
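The node-then-edge ordering can be sketched as follows. This is an illustrative sketch only: the function name `store_graph` and the tuple layouts are hypothetical, plain Python containers stand in for the graph database, and skipping an edge whose endpoint was never stored is an assumption that the text does not state.

```python
def store_graph(entities, relationships):
    """Store nodes first, then the edges that refer to the stored nodes.

    entities: iterable of (identification, attributes) tuples.
    relationships: iterable of (source, target, rel_type, attributes) tuples,
        where source and target are identification data of entities.
    Returns (nodes, edges) as plain containers standing in for a graph database.
    """
    # Phase 1: store identification data and attribute data of every entity.
    nodes = {}
    for identification, attributes in entities:
        nodes[identification] = dict(attributes)

    # Phase 2: store type data and attribute data of the relationships.
    edges = []
    for source, target, rel_type, attributes in relationships:
        # Assumption: an edge is stored only if both endpoints exist as nodes.
        if source in nodes and target in nodes:
            edges.append({
                "source": source,
                "target": target,
                "type": rel_type,
                "attributes": dict(attributes),
            })
    return nodes, edges
```

Storing edges only after all nodes are in place means every edge can be resolved against existing nodes at write time.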
In some optional implementations of this embodiment, after the entity identification is generated, the execution subject may establish the index directory according to the generated entity identification, to facilitate rapid retrieval of entities or nodes in the graph database. The execution subject may also select an appropriate storage device for the established index directory according to the number of entity identifications. When the number of entity identifications is small (e.g., less than 1 million), the execution subject may store the index directory in memory. When the number of entity identifications is moderate (e.g., less than 100 million), the execution subject may store the index directory in a database. When the number of entity identifications is larger still, the execution subject may store the index directory on disk.
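The choice of storage for the index directory can be sketched as a simple threshold function. The sketch assumes the example thresholds given in the text (1 million and 100 million entity identifications); the function name and the returned labels are hypothetical.

```python
MEMORY_LIMIT = 1_000_000      # example threshold from the text
DATABASE_LIMIT = 100_000_000  # example threshold from the text


def choose_index_storage(num_entity_ids):
    """Return where the index directory would be kept for a given number of identifications."""
    if num_entity_ids < MEMORY_LIMIT:
        return "memory"
    if num_entity_ids < DATABASE_LIMIT:
        return "database"
    return "disk"
```

In practice the thresholds would be configuration values tuned to the available memory and to the database in use.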
In some optional implementation manners of this embodiment, the execution main body may further back up the extracted data, the processed data obtained by processing the extracted data, the generated entity identifier, the established index directory, and the generated graph data, so as to improve the data recovery capability and the disaster recovery capability.
In some optional implementations of this embodiment, before storing the graph data, the execution subject may further detect whether the amount of generated graph data reaches a preset threshold, and store the generated graph data in the graph database if it does. During storage, multiple processes and multiple threads are used in combination, so that the graph data can be stored in the graph database in batches, which improves storage efficiency.
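A minimal sketch of threshold-triggered batch storage follows. It uses a thread pool only; the multi-process part mentioned above is omitted for brevity. The class `BatchWriter`, its parameters, and the `write_batch` callback are hypothetical, and flushing a final partial batch on `close` is an assumption that the text does not describe.

```python
from concurrent.futures import ThreadPoolExecutor


class BatchWriter:
    """Buffer graph records and hand them to worker threads in batches."""

    def __init__(self, write_batch, threshold=1000, workers=4):
        self._write_batch = write_batch  # callable that stores one batch
        self._threshold = threshold      # preset amount that triggers a write
        self._buffer = []
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = []

    def add(self, record):
        self._buffer.append(record)
        # Store only once the buffered amount reaches the preset threshold.
        if len(self._buffer) >= self._threshold:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._futures.append(self._pool.submit(self._write_batch, batch))

    def close(self):
        # Assumption: write whatever is left, then wait for all batches.
        self.flush()
        for future in self._futures:
            future.result()  # re-raises any exception from a worker thread
        self._pool.shutdown()
```

Because batches are written concurrently, the order in which they reach the store is not guaranteed; a caller that needs nodes stored before edges would finish the node writer first.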
The method for generating data provided by the embodiment of the application can provide index service and improve retrieval efficiency; and at the same time, the efficiency of storing graph data can be improved.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for generating data, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating data of the present embodiment includes: an acquisition unit 501, an analysis unit 502, an extraction unit 503, and a generation unit 504.
An acquisition unit 501 configured to acquire target data.
An analyzing unit 502 configured to analyze the target data and determine entities of preset types included in the target data.
An extracting unit 503 configured to extract the determined identification data, attribute data of the entities and type data, attribute data of the relationships between the entities.
A generating unit 504 configured to generate graph data from the extracted data.
In some optional implementations of this embodiment, the attribute data of the entity includes at least one key-value pair, and the attribute data of the entity includes a data type of the key and a data type of the value. The generating unit 504 may be further configured to perform at least one of the following processes on the extracted data and generate graph data from the processed data: deleting, from the attribute data of the entity, key-value pairs whose value data type does not match a preset data type; for the determined entity, determining, according to the type data of the relationships with the entity, a first number of relationships with the entity whose type is a preset type, and in response to determining that the first number is greater than a first preset threshold, deleting the type data and attribute data of the relationships with the entity whose type is the preset type; for the determined entity, determining, according to the type data of the relationships with the entity, a second number of entities having a relationship with the entity, and in response to determining that the second number is greater than a second preset threshold, deleting the identification data and attribute data of the entity.
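The three optional processes can be sketched as three small functions. This is a sketch under stated assumptions: all function names and data layouts are hypothetical, edges are `(source, target, type)` tuples, a key with no preset type is kept, and removing the edges of a deleted entity is an assumption added so that no edge dangles.

```python
from collections import Counter, defaultdict


def drop_mismatched_values(attributes, expected_types):
    """Delete key-value pairs whose value does not match the preset type for its key."""
    return {
        key: value
        for key, value in attributes.items()
        if key not in expected_types or isinstance(value, expected_types[key])
    }


def prune_relationships(edges, preset_type, first_threshold):
    """Delete relationships of preset_type for entities having more than first_threshold of them."""
    counts = Counter()
    for source, target, rel_type in edges:
        if rel_type == preset_type:
            counts[source] += 1
            counts[target] += 1
    over_limit = {entity for entity, n in counts.items() if n > first_threshold}
    return [
        (s, t, r) for s, t, r in edges
        if not (r == preset_type and (s in over_limit or t in over_limit))
    ]


def drop_hub_entities(entities, edges, second_threshold):
    """Delete entities related to more than second_threshold distinct entities."""
    neighbors = defaultdict(set)
    for source, target, _ in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    hubs = {e for e in entities if len(neighbors[e]) > second_threshold}
    kept_entities = {e: attrs for e, attrs in entities.items() if e not in hubs}
    # Assumption: edges touching a deleted entity are removed as well.
    kept_edges = [(s, t, r) for s, t, r in edges if s not in hubs and t not in hubs]
    return kept_entities, kept_edges
```

Each function returns new containers instead of mutating its input, so the processes can be applied in any combination.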
In some optional implementations of this embodiment, the attribute data of the entity includes at least one key-value pair. The generating unit 504 may be further configured to: analyzing keys in at least one key value pair, and determining key value pairs corresponding to the keys meeting preset conditions; generating graph data according to the determined key-value pairs.
In some optional implementations of this embodiment, the apparatus 500 may further include a first storage unit and a second storage unit that are not shown in fig. 5.
A first storage unit configured to store the generated graph data in a preset graph database.
A second storage unit configured to store, in a database other than the graph database, those key-value pairs of the at least one key-value pair that do not satisfy the preset condition.
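The split between key-value pairs used for graph data and key-value pairs stored elsewhere can be sketched as below. The preset condition is not specified in this passage, so it is passed in as a predicate; the name `split_by_key` and the allow-list used in the usage lines are hypothetical.

```python
def split_by_key(attributes, key_meets_condition):
    """Partition key-value pairs by whether the key meets the preset condition.

    Returns (graph_part, other_part): the first feeds graph data generation,
    the second would be stored in a database other than the graph database.
    """
    graph_part, other_part = {}, {}
    for key, value in attributes.items():
        target = graph_part if key_meets_condition(key) else other_part
        target[key] = value
    return graph_part, other_part


# Usage with a hypothetical allow-list as the preset condition.
GRAPH_KEYS = {"name", "work place"}
graph_part, other_part = split_by_key(
    {"name": "A", "work place": "District A", "biography": "long text"},
    lambda key: key in GRAPH_KEYS,
)
```

Keeping bulky or rarely queried attributes out of the graph database is one plausible reason for such a split; the text itself does not state the motivation here.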
In some optional implementations of this embodiment, the first storage unit may be further configured to: for the determined entity, determining whether the entity is included in the graph database according to the identification data of the entity; in response to determining that the entity is included in the graph database, the stored attribute data for the entity in the graph database is updated according to the extracted attribute data for the entity.
In some optional implementations of this embodiment, the first storage unit may be further configured to: in response to determining that the entity is not included in the graph database, generating an entity identification for the entity; storing the generated entity identification, the identification data of the entity and the attribute data of the entity in a graph database.
In some optional implementations of this embodiment, the first storage unit may be further configured to: in response to completion of the storing of the identification data and the attribute data of the determined entities, type data and attribute data of the relationships between the entities are stored in the graph database.
The apparatus for generating data according to the above embodiment of the present application may first acquire target data, then analyze the target data and determine entities of preset types included in the target data. It may then extract the identification data and attribute data of the determined entities, as well as the type data and attribute data of the relationships between the entities. Finally, it generates graph data from the extracted data. The apparatus of this embodiment can effectively analyze data to generate graph data, which facilitates retrieval.
It should be understood that units 501 to 504 recited in the apparatus 500 for generating data correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations and features described above for the method for generating data are equally applicable to the apparatus 500 and the units contained therein, and will not be described again here.
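The way the four units cooperate can be sketched as a small pipeline. This is only a structural sketch: the class name, the callable signatures, and the toy lambdas in the usage lines are hypothetical, and the actual analysis and extraction logic is not reproduced.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DataGenerationApparatus:
    acquire: Callable[[], Any]            # acquisition unit 501
    analyze: Callable[[Any], list]        # analysis unit 502
    extract: Callable[[Any, list], dict]  # extraction unit 503
    generate: Callable[[dict], dict]      # generation unit 504

    def run(self):
        target = self.acquire()                     # acquire target data
        entities = self.analyze(target)             # determine entities of preset types
        extracted = self.extract(target, entities)  # entity and relationship data
        return self.generate(extracted)             # build graph data


# Usage with toy stand-ins for each unit.
apparatus = DataGenerationApparatus(
    acquire=lambda: "Alice works with Bob",
    analyze=lambda text: [w for w in text.split() if w in {"Alice", "Bob"}],
    extract=lambda text, ents: {
        "entities": ents,
        "relations": [(ents[0], ents[1], "colleague")],
    },
    generate=lambda data: {"nodes": data["entities"], "edges": data["relations"]},
)
graph = apparatus.run()
```

Passing the units in as callables mirrors the statement that each unit corresponds to one step of the method, and lets any single step be replaced without touching the others.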
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use to implement a server according to embodiments of the present application is shown. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609 and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, an analysis unit, an extraction unit, and a generation unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, the acquisition unit may also be described as a "unit that acquires target data".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire target data; analyze the target data and determine entities of preset types included in the target data; extract the identification data and attribute data of the determined entities and the type data and attribute data of the relationships between the entities; and generate graph data from the extracted data.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (16)

1. A method for generating data, comprising:
acquiring target data, wherein the target data are text data;
analyzing the target data and determining entities of preset types included in the target data;
extracting the determined identification data and attribute data of the entities and the type data and attribute data of the relationship between the entities;
sending the extracted data and the target data to a terminal, and receiving identification data and attribute data of the supplementary entities, and type data and attribute data of the relationship between the supplementary entities;
generating graph data according to the extracted data and the received data;
storing the generated graph data in a preset graph database;
establishing an index directory based on the entity identifiers in the graph database.
2. The method of claim 1, wherein the attribute data of the entity comprises at least one key-value pair, the attribute data of the entity comprising a data type of the key and a data type of the value; and
generating graph data from the extracted data and the received data, comprising:
performing at least one of the following processes on the extracted data and the received data and generating graph data from the processed data:
deleting key value pairs of which the data types are not matched with the preset data types in the attribute data of the entity and the attribute data of the supplementary entity;
determining, for the determined entity and the received supplementary entity, according to the type data of the relationships with the entity, a first number of relationships with the entity whose type is a preset type; in response to determining that the first number is greater than a first preset threshold, deleting the type data and attribute data of the relationships with the entity whose type is the preset type;
determining, for the determined entity and the received supplemental entity, a second number of entities having a relationship with the entity based on the type data of the relationship with the entity; and deleting the identification data and the attribute data of the entity in response to the fact that the second number is larger than a second preset threshold value.
3. The method of claim 1, wherein the attribute data of the entity comprises at least one key-value pair; and
generating graph data according to the extracted data and the received data, including:
analyzing the keys in the at least one key value pair, and determining the key value pairs corresponding to the keys meeting the preset conditions;
generating graph data according to the determined key-value pairs.
4. The method of claim 3, wherein the method further comprises:
storing key-value pairs of the at least one key-value pair that do not satisfy a preset condition in a database other than the graph database.
5. The method of claim 4, wherein said storing the generated graph data in a preset graph database comprises:
for the determined entity and the received supplemental entity, determining whether the entity is included in the graph database based on identification data for the entity; in response to determining that the entity is included in the graph database, updating stored attribute data of the entity in the graph database based on the extracted and received attribute data of the entity.
6. The method of claim 5, wherein said storing the generated graph data in a preset graph database comprises:
in response to determining that the entity is not included in the graph database, generating an entity identification for the entity;
storing the generated entity identification, identification data of the entity and attribute data of the entity in the graph database.
7. The method of claim 6, wherein said storing the generated graph data in a preset graph database comprises:
in response to completion of the storing of the identification data and attribute data of the determined entities, storing type data and attribute data of relationships between entities in the graph database.
8. An apparatus for generating data, comprising:
an acquisition unit configured to acquire target data, wherein the target data is text data;
an analysis unit configured to analyze the target data and determine entities of preset types included in the target data;
an extracting unit configured to extract the determined identification data, attribute data of the entities and type data, attribute data of the relationship between the entities;
a receiving unit configured to transmit the extracted data and the target data to a terminal, and receive identification data, attribute data of the supplementary entities and type data, attribute data of relationships between the supplementary entities;
a generation unit configured to generate graph data from the extracted data and the received data;
a first storage unit configured to store the generated graph data in a preset graph database;
an establishing unit configured to establish an index directory based on the entity identification in the graph database.
9. The apparatus of claim 8, wherein the attribute data of the entity comprises at least one key-value pair, the attribute data of the entity comprising a data type of the key and a data type of the value; and
the generation unit is further configured to:
performing at least one of the following processes on the extracted data and the received data and generating graph data from the processed data:
deleting key value pairs of which the data types are not matched with the preset data types in the attribute data of the entity and the attribute data of the supplementary entity;
determining, for the determined entity and the received supplementary entity, according to the type data of the relationships with the entity, a first number of relationships with the entity whose type is a preset type; in response to determining that the first number is greater than a first preset threshold, deleting the type data and attribute data of the relationships with the entity whose type is the preset type;
determining, for the determined entity and the received supplemental entity, a second number of entities having a relationship with the entity based on type data of the relationship with the entity; and deleting the identification data and the attribute data of the entity in response to the fact that the second number is larger than a second preset threshold value.
10. The apparatus of claim 8, wherein the attribute data of the entity comprises at least one key-value pair; and
the generation unit is further configured to:
analyzing the keys in the at least one key value pair, and determining the key value pairs corresponding to the keys meeting the preset conditions;
generating graph data according to the determined key-value pairs.
11. The apparatus of claim 10, wherein the apparatus further comprises:
a second storage unit configured to store key-value pairs that do not satisfy a preset condition among the at least one key-value pair in a database other than the graph database.
12. The apparatus of claim 11, wherein the first storage unit is further configured to:
for the determined entity and the received supplemental entity, determining whether the entity is included in the graph database based on identification data for the entity; in response to determining that the entity is included in the graph database, updating stored attribute data of the entity in the graph database based on the extracted and received attribute data of the entity.
13. The apparatus of claim 12, wherein the first storage unit is further configured to:
in response to determining that the entity is not included in the graph database, generating an entity identification for the entity;
storing the generated entity identification, identification data of the entity and attribute data of the entity in the graph database.
14. The apparatus of claim 13, wherein the first storage unit is further configured to:
in response to completion of the storing of the identification data and attribute data of the determined entities, storing type data and attribute data of relationships between entities in the graph database.
15. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
16. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201811051031.4A 2018-09-10 2018-09-10 Method and apparatus for generating data Active CN110895584B (en)

Publications (2)

Publication Number Publication Date
CN110895584A (en) 2020-03-20
CN110895584B (en) 2023-01-03

Family

ID=69785082

Country Status (1)

Country Link
CN (1) CN110895584B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant