CN111159390A - Information extraction method, device and storage medium - Google Patents

Information extraction method, device and storage medium

Info

Publication number
CN111159390A
Authority
CN
China
Prior art keywords
node
nodes
child
node set
ancestor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911414416.7A
Other languages
Chinese (zh)
Inventor
张涵初
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN201911414416.7A
Publication of CN111159390A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Abstract

The application provides an information extraction method, device, and storage medium. The method comprises: acquiring a global node set and a keyword for extracting target information, wherein the nodes in the global node set are stored according to a mesh structure, and the page information contained in the nodes of the global node set forms global information; screening out, from the global node set and according to the mesh structure, a first node set corresponding to the keyword, wherein the nodes in the first node set satisfy a preset correlation relationship with the keyword and have no topic drift; and extracting the page information contained in each node in the first node set to obtain the target information.

Description

Information extraction method, device and storage medium
Technical Field
The present application relates to the field of information processing, and in particular, to an information extraction method, device, and storage medium.
Background
With the rapid growth of information, natural language processing has gradually shifted from statistical methods to deep learning methods, which provide advanced solutions to language problems. However, deep learning relies on a large amount of text data, and encyclopedias are a natural source of such data because they are easy to obtain, high in text quality, and broad in coverage.
Here, encyclopedias refer to the various electronic encyclopedias that are free, manually edited, and contain massive amounts of text. Because an encyclopedia holds a huge volume of text under numerous categories, extracting suitable text data from it is a pressing problem when applying deep learning methods.
The existing method for extracting text data from an encyclopedia directly takes the text data within a limited number of layers under a field as the deep learning text data. This cannot extract all of the related text data, and the extracted data contains a large amount of text irrelevant to the field. The text data used for deep learning is therefore incomplete and inaccurate, which prolongs the deep learning processing time and reduces processing efficiency and accuracy.
Disclosure of Invention
In view of the above, embodiments of the present application provide an information extraction method, an information extraction apparatus, and a storage medium to solve the problems in the prior art.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides an information extraction method, including:
acquiring a global node set and keywords for extracting target information, wherein the nodes in the global node set are stored according to a mesh structure, and page information contained in each node in the global node set forms global information;
screening a first node set corresponding to the keyword in the global node set according to the mesh structure, wherein the nodes in the first node set are nodes which meet a preset correlation relationship with the keyword and do not have topic drift;
and extracting page information contained in each node in the first node set to obtain target information.
In a second aspect, an embodiment of the present application provides an information extraction apparatus, including:
an acquisition module, configured to acquire a global node set and keywords for extracting target information, wherein the nodes in the global node set are stored according to a mesh structure, and page information contained in each node in the global node set forms global information;
a screening module, configured to screen out a first node set corresponding to the keyword in the global node set according to the mesh structure, wherein nodes in the first node set meet a preset correlation relationship with the keyword and do not have topic drift;
and an extraction module, configured to extract the page information contained in each node in the first node set to obtain the target information.
In a third aspect, an embodiment of the present application provides an information extraction device, including:
a memory for storing executable instructions;
and a processor, configured to implement the method provided by the embodiments of the application when executing the executable instructions stored in the memory.
In a fourth aspect, an embodiment of the present application provides a storage medium storing executable instructions for causing a processor to implement a method provided by an embodiment of the present application when the processor executes the executable instructions.
In the information extraction method provided by the embodiments of the application, a global node set and a keyword for extracting target information are obtained, and a first node set corresponding to the keyword is screened out from the global node set according to the mesh structure in which the global node set is stored, wherein the nodes in the first node set satisfy a preset correlation relationship with the keyword and have no topic drift; the page information contained in each node in the first node set is then extracted to obtain the target information. Because every node in the screened first node set is free of topic drift, the accuracy of the extracted target information is improved. Because the nodes from which page information is extracted are selected by whether topic drift occurs, rather than being limited to a fixed number of layers, the extracted target information is also more comprehensive.
Drawings
Fig. 1 is a schematic network architecture diagram of an information extraction method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an information extraction terminal provided in an embodiment of the present application;
Fig. 3 is a schematic flow chart of an implementation of the information extraction method according to the embodiment of the present application;
Fig. 4A is a schematic flowchart of another implementation of the information extraction method according to the embodiment of the present application;
Fig. 4B is a schematic diagram of a storage structure of a node according to an embodiment of the present application;
Fig. 5A is a schematic flowchart of another implementation of the information extraction method according to the embodiment of the present application;
Fig. 5B is a schematic storage structure diagram of another node according to an embodiment of the present application;
Fig. 6 is a schematic flow chart of another implementation of the information extraction method according to the embodiment of the present application;
Fig. 7 is a schematic flow chart of another implementation of the information extraction method according to the embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order. Where permitted, the specific order or sequence of "first", "second", and "third" may be interchanged, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in the embodiments are explained below.
1) Ontology: in computer science and information science, an ontology is a "formalized, clear and detailed description of a shared conceptual system". An ontology provides a shared vocabulary, namely the types or concepts of objects that exist in a specific field, together with their attributes and interrelations. Put differently, an ontology is a special type of structured term set that is well suited to use in computer systems, or a formalized expression of a set of concepts and their relationships in a particular domain. An ontology is compiled by taking the knowledge of a field of interest as material and applying the ontology principles of information science. Ontologies can be used to reason about the properties of a domain and also to define the domain (i.e., to model it).
2) Domain Ontology (Domain Ontology): is a description of discipline concepts, including concepts in disciplines, attributes of concepts, relationships between concepts, and constraints on attributes and relationships. Domain ontologies may represent specific knowledge within a specific domain, since knowledge has significant domain characteristics. The "domain" is established according to the needs of the ontology builder, and may be a subject domain, a combination of several domains, or a small range of domains.
In order to better understand the embodiments of the present application, first, a description is given of an information extraction method and existing disadvantages in the related art.
With the rapid growth of information, natural language processing has gradually shifted from statistical methods to deep learning methods, which provide advanced solutions to language problems. However, deep learning methods based on neural networks rely on a large amount of training text. Electronic encyclopedias are free, manually edited, and contain massive amounts of text, so they are easy to obtain, high in text quality, and broad in coverage. An encyclopedia nevertheless holds a huge volume of text under numerous categories, while usually only the text data related to a certain field is needed. Extracting the field-related text content from an encyclopedia completely and efficiently is therefore of great significance for further work.
In the related art, an encyclopedia is stored in a form similar to a network topology, and all nodes in the encyclopedia constitute a global node set with a mesh structure. One method for collecting field text selects a starting node from the global node set, traverses downward for a fixed number of layers to obtain a node set related to the starting node, and then extracts the page information contained in each node of that set as the field text data of the starting node. The fixed-layer method is simple and easy to implement. However, an encyclopedia is edited by members of the public and has no strict specifications or definitions for pages, concepts, layers, and the like, so its classification system is loose and lacks consistency, and topic drift and incomplete collection of related data often occur. Because the related art directly takes the page data of the nodes traversed within the limited layers as the extracted text data, it cannot extract all text data related to the starting node, and the extracted text data is not comprehensive enough. In addition, the node set contains nodes whose topic has drifted from that of the starting node, so the extracted text data includes a large amount of text irrelevant to the field of the starting node and is not accurate enough. When the extracted text data is then used for deep learning, the processing time is prolonged, and the processing efficiency and accuracy are reduced.
To address the above problems, the embodiments of the present application provide an information extraction method in which a first node set is screened out of the global node set. The nodes in the first node set have no topic drift with respect to the keyword, so extracting text data from these nodes improves the accuracy of the text data extracted for the keyword. The nodes from which page information is extracted are selected by whether topic drift occurs, rather than being limited to a fixed number of layers, so the extracted text data is also more comprehensive.
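The fixed-layer baseline described above can be illustrated with a minimal sketch. The graph, the node names, and the function `collect_fixed_depth` are hypothetical and chosen only to show both failure modes; they are not taken from the application.

```python
from collections import deque

def collect_fixed_depth(children, start, max_depth):
    """Collect every node within max_depth layers below start (breadth-first)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # the layer limit stops expansion regardless of relevance
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

# Hypothetical category graph: "film music" has drifted away from the start
# topic, while "GPU drivers" is relevant but lies below the layer limit.
children = {
    "graphics hardware": ["graphics chips", "video"],
    "graphics chips": ["GPU vendors"],
    "GPU vendors": ["GPU drivers"],
    "video": ["film music"],
}
result = collect_fixed_depth(children, "graphics hardware", 2)
```

With a limit of two layers, the drifted node is collected and the relevant deeper node is missed, which is the pair of problems the paragraph above attributes to the fixed-layer method.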
Fig. 1 is a schematic diagram of the network architecture of an information extraction method according to an embodiment of the present application. As shown in Fig. 1, the network architecture includes at least an information extraction device 100, a server 200, and a network 300. The information extraction device 100 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or another terminal capable of network communication. The information extraction device 100 includes at least an information extraction apparatus, through which a user can extract target information. In an exemplary application, the information extraction device 100 establishes a communication link with the server 200 via the network 300, where the network 300 may be a wide area network, a local area network, or a combination thereof, and uses a wireless link to transmit data.
In this network architecture, in response to a user's operation instruction to search for target information in the field of a keyword, the information extraction device 100 may send first request information for acquiring the global node set to the server 200 through the network 300. The server 200 carries its stored global node set in first response information and returns it to the information extraction device 100. The information extraction device 100 then uses a preset correlation relationship to screen out of the global node set the nodes that correspond to the keyword and have no topic drift, forming a first node set, and extracts the page information of each node in the first node set to obtain the target information in the field of the keyword.
The apparatus provided in the embodiments of the present application may be implemented as hardware or a combination of hardware and software, and various exemplary implementations of the apparatus provided in the embodiments of the present application are described below.
Other exemplary configurations of the information extraction apparatus 100 are envisioned from the exemplary configuration of the information extraction apparatus shown in fig. 2, and thus the configurations described herein should not be considered limiting, e.g., some of the components described below may be omitted, or components not described below may be added to accommodate the particular needs of certain applications.
The information extraction apparatus 100 shown in fig. 2 includes: at least one processor 110, memory 140, at least one network interface 120, and a user interface 130. The components in the information extraction device 100 are coupled together by a bus system 150. It will be appreciated that the bus system 150 is used to enable connection and communication among these components. The bus system 150 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as bus system 150 in fig. 2.
The user interface 130 may include a display, a keyboard, a mouse, a touch-sensitive pad, a touch screen, and the like.
The memory 140 may be either volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM). The volatile Memory may be Random Access Memory (RAM). The memory 140 described in embodiments herein is intended to comprise any suitable type of memory.
The memory 140 in the embodiment of the present application is capable of storing data to support the operation of the information extracting apparatus 100. Examples of such data include: any computer program for operating on the information extraction device 100, such as an operating system and an application program. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
As an example of implementing the method provided by the embodiments of the present application in software, the method may be directly embodied as a combination of software modules executed by the processor 110. The software modules may be located in a storage medium in the memory 140, and the processor 110 reads the executable instructions included in the software modules in the memory 140 and completes the method provided by the embodiments of the present application in combination with the necessary hardware (for example, the processor 110 and other components connected to the bus system 150).
By way of example, the Processor 110 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
The information extraction method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the device provided by the embodiment of the present application.
Referring to fig. 3, fig. 3 is a schematic diagram of an implementation flow of an information extraction method provided in the embodiment of the present application, and is applied to the information extraction device shown in fig. 1. This embodiment will be described with reference to the steps shown in fig. 3.
Step S301, a global node set and keywords for extracting target information are obtained.
The nodes in the global node set are stored according to a mesh structure, and page information contained in each node in the global node set forms global information.
In this embodiment, the information extraction device may obtain the global node set from a server. The server stores the encyclopedia information and the nodes that hold it, and access is established through the index relationship between nodes and information. The server also stores the topological relationships among all nodes, and all nodes stored on the server form the global node set. In the global node set, every node except the leaf nodes may have multiple child nodes, and every node except the globally unique root node belongs to at least one parent node. The nodes in the global node set are therefore stored according to a mesh topology.
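One way to picture this storage is a pair of adjacency maps, one for children and one for parents. The sketch below is an assumption made for illustration: the class `NodeMesh`, its methods, and the example node names are hypothetical and do not come from the application.

```python
from collections import defaultdict

class NodeMesh:
    """Mesh-structured node set: a node may have many children and many parents."""

    def __init__(self):
        self.children = defaultdict(set)  # node name -> names of child nodes
        self.parents = defaultdict(set)   # node name -> names of parent nodes
        self.pages = {}                   # node name -> page information

    def add_edge(self, parent, child):
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def is_leaf(self, node):
        return not self.children.get(node)

    def roots(self):
        nodes = set(self.children) | set(self.parents)
        return {n for n in nodes if not self.parents.get(n)}

mesh = NodeMesh()
mesh.add_edge("root", "video")
mesh.add_edge("root", "graphics hardware")
# "video hardware" belongs to two parents, which makes the structure a mesh
# rather than a tree.
mesh.add_edge("video", "video hardware")
mesh.add_edge("graphics hardware", "video hardware")
```

Keeping the parent map alongside the child map is a design choice of this sketch; it makes ancestor lookups as cheap as downward traversal.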
In this embodiment, the keyword for extracting the target information may be a keyword pre-stored in the information extraction device, or a keyword extracted after semantic analysis is performed on a sentence input by the user, or a keyword received from another terminal, or a keyword obtained by an input operation instruction executed by the user on the information extraction device. When the keyword is acquired by an input operation instruction performed by the user on the information extraction device, the input operation instruction performed by the user may be to type the keyword through a keyboard, or to select the keyword in the global node set by mouse clicking. In this embodiment, the number of keywords used for extracting the target information may be multiple, and the multiple keywords may be related words or unrelated words.
In this embodiment, the information extraction device may obtain the global node set and the keywords after receiving the operation instruction for information extraction, or may obtain the global node set and the keywords in advance and store them in its own memory, and directly obtain them from its own memory after receiving the operation instruction for information extraction.
Step S302, according to the mesh structure, a first node set corresponding to the keyword is screened out from the global node set.
Here, the nodes included in the first node set are nodes that satisfy a preset correlation with the keyword and do not have topic drift.
A node corresponding to the keyword is determined in the global node set. Starting from that node, the mesh structure of the global node set is traversed downward along its paths, and nodes that have topic drift with respect to the keyword are removed, yielding the first node set.
If multiple keywords are used for extracting the target information, the screened first node set is the union of the node sets obtained by screening for each keyword separately.
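For the multi-keyword case, the combination step can be sketched as follows. The function `screen_keywords` and the per-keyword screening callback `screen_one` are hypothetical names; the per-keyword results shown are made-up placeholders.

```python
def screen_keywords(keywords, screen_one):
    """Combine the node sets screened for each keyword into one first node set."""
    first_node_set = set()
    for keyword in keywords:
        first_node_set |= screen_one(keyword)  # per-keyword screening result
    return first_node_set

# Hypothetical per-keyword results standing in for the real screening step.
per_keyword = {
    "video": {"video", "video hardware"},
    "graphics hardware": {"graphics hardware", "graphics chips", "video hardware"},
}
combined = screen_keywords(["video", "graphics hardware"], per_keyword.__getitem__)
```

Because the result is a set, a node reachable from several keywords appears only once.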
Step S303, extracting page information included in each node in the first node set to obtain target information.
After the first node set corresponding to the keyword is obtained, the page information contained in each node in the first node set is extracted and used as the target information. Here, the page information may have been stored in the information extraction device when the global node set was acquired. Alternatively, after the first node set is obtained, the device may send request information carrying the first node set to the server in order to acquire the page information of the nodes in the first node set.
When the page information of the global node set is acquired while the global node set is acquired, the target information can be quickly extracted, and the efficiency of acquiring the target information is improved. Especially, when information extraction is carried out for multiple times, only one time of request information needs to be sent to the server, the efficiency of extracting target information for multiple times is greatly improved, and the extraction time is saved.
When the page information contained in each node in the first node set is instead obtained from the server after the first node set is obtained, the target information reflects the server's content at the current moment, which keeps the extracted target information up to date and improves its accuracy. Moreover, only the page information of the nodes in the first node set needs to be downloaded from the server, which greatly reduces the amount of downloaded data, saves download time, and reduces the storage space occupied on the information extraction device.
In the information extraction method provided by this embodiment, a global node set and keywords for extracting target information are obtained, wherein the nodes in the global node set are stored according to a mesh structure, and the page information contained in each node in the global node set forms global information. A first node set corresponding to the keyword is then screened out from the global node set according to the mesh structure in which the global node set is stored, wherein the nodes in the first node set satisfy a preset correlation relationship with the keyword and have no topic drift. Finally, the page information contained in each node in the first node set is extracted to obtain the target information. Because every node in the screened first node set is free of topic drift, the accuracy of the extracted target information is improved. Because the nodes from which page information is extracted are selected by whether topic drift occurs, rather than being limited to a fixed number of layers, the extracted target information is also more comprehensive.
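The two ways of obtaining page information can be contrasted in a small sketch. All names here (`extract_from_cache`, `extract_on_demand`, `fetch_page`, and the sample pages) are hypothetical; the server is simulated by a dictionary.

```python
def extract_from_cache(first_node_set, cached_pages):
    """Option 1: pages were downloaded together with the global node set."""
    return {node: cached_pages[node] for node in first_node_set}

def extract_on_demand(first_node_set, fetch_page):
    """Option 2: request only the selected pages after screening."""
    return {node: fetch_page(node) for node in first_node_set}

# Simulated server content and a fetch function that records each request.
server_pages = {"video": "page A", "video hardware": "page B", "film music": "page C"}
requests = []

def fetch_page(node):
    requests.append(node)
    return server_pages[node]

first_node_set = {"video", "video hardware"}
cached = dict(server_pages)  # full local copy, as in option 1
eager = extract_from_cache(first_node_set, cached)
lazy = extract_on_demand(first_node_set, fetch_page)
```

Option 1 trades local storage for repeated fast extraction, while option 2 touches only the pages of the screened nodes and always reads the server's current content.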
On the basis of the embodiment shown in fig. 3, an information extraction method is further provided in the embodiment of the present application. Fig. 4A is a schematic view of another implementation flow of the information extraction method provided in this embodiment of the present application, and as shown in fig. 4A, the information extraction method provided in this embodiment includes the following steps:
step S401, a global node set and keywords for extracting target information are obtained.
The nodes in the global node set are stored according to a mesh structure, and page information contained in each node in the global node set forms global information.
Step S401 and step S405 in this embodiment refer to the description of corresponding parts in the embodiment shown in fig. 3, and are not described herein again. Step S402 to step S404 are an implementation manner of "screening out the first node set corresponding to the keyword in the global node set according to the mesh structure", and specifically, the implementation manner is as follows:
step S402, determining the node corresponding to the keyword in the global node set to obtain the starting node.
After the keyword is obtained, if it was selected in the global node set through a mouse click performed by the user, the selected item is itself the node corresponding to the keyword in the global node set, namely the starting node. If the keyword was obtained in another way, for example typed on a keyboard or received from another terminal, the node corresponding to the keyword must be determined in the global node set. If a node matching the keyword is found in the global node set, that node is determined to be the node corresponding to the keyword, namely the starting node. If no matching node is found, semantic analysis is performed on the keyword, and the node in the global node set whose semantics are closest to the keyword is determined as the node corresponding to the keyword, which gives the starting node.
Step S403, starting from the starting node, traversing downward according to the mesh structure, and acquiring the first nodes in the global node set that satisfy a preset correlation with the keyword and have no topic drift.
In the global node set of the mesh structure, the traversal proceeds downward from the starting node along the paths of the mesh structure to obtain all first nodes that satisfy the preset correlation relationship with the keyword and have no topic drift.
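The lookup order described above (exact match first, closest node otherwise) might be sketched as below. The application does not specify how semantic closeness is computed, so this sketch substitutes character-level similarity from Python's standard `difflib` module purely as a stand-in; `resolve_start_node` is a hypothetical name.

```python
import difflib

def resolve_start_node(keyword, node_names):
    """Return the node for a keyword: exact match first, else the closest name."""
    if keyword in node_names:
        return keyword
    # Stand-in for semantic analysis (an assumption): string similarity only.
    matches = difflib.get_close_matches(keyword, list(node_names), n=1, cutoff=0.0)
    return matches[0] if matches else None

names = ["graphics hardware", "video", "loud speakers"]
exact = resolve_start_node("video", names)
closest = resolve_start_node("graphic hardware", names)
```

A real system would likely replace the fallback with an embedding-based or thesaurus-based comparison, since string similarity cannot relate synonyms that are spelled differently.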
In the actual traversal process, either depth-first or breadth-first traversal can be adopted. Both visit the same nodes, only in a different order. Fig. 4B is a schematic storage structure diagram of a node according to an embodiment of the present disclosure. As shown in fig. 4B, when depth-first traversal starts from the starting node "computing output devices" according to the mesh structure, the traversal results are "loud speakers", "graphics hardware", "display adapter", "graphics chips", "video", "music and video", "animation music", "file music", and "video hardware". When breadth-first traversal starts from the same starting node "computing output devices" according to the mesh structure, the traversal results are "loud speakers", "graphics hardware", "video", "display adapter", "graphics chips", "music and video", "video hardware", "animation music", and "file music".
After all the traversed nodes are obtained, it is further judged whether each node satisfies the preset correlation relationship with the keyword and whether each node has topic drift. The nodes in the traversal result that satisfy the preset correlation relationship and have no topic drift are determined as first nodes.
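The two traversal orders can be reproduced with a short sketch. The edge list `FIG_4B` is an assumption: Fig. 4B itself is not reproduced in the text, so the parent-child links below were inferred from the two orders listed above and may not match the drawing exactly.

```python
from collections import deque

# Assumed structure, inferred from the depth-first and breadth-first listings.
FIG_4B = {
    "computing output devices": ["loud speakers", "graphics hardware", "video"],
    "graphics hardware": ["display adapter", "graphics chips"],
    "video": ["music and video", "video hardware"],
    "music and video": ["animation music", "file music"],
}

def depth_first(children, start):
    """Visit descendants of start, going as deep as possible before backtracking."""
    order, seen = [], {start}
    def visit(node):
        for child in children.get(node, []):
            if child not in seen:  # a mesh node may be reachable via several parents
                seen.add(child)
                order.append(child)
                visit(child)
    visit(start)
    return order

def breadth_first(children, start):
    """Visit descendants of start one layer at a time."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

dfs = depth_first(FIG_4B, "computing output devices")
bfs = breadth_first(FIG_4B, "computing output devices")
```

The `seen` set matters in a mesh structure: without it, a node with several parents would be visited once per parent.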
Step S404, determining the starting node and the acquired first node as a first node set.
Here, the nodes included in the first node set are nodes that satisfy a preset correlation with the keyword and do not have topic drift.
The starting node and all the first nodes are determined as the first node set that is screened out from the global node set and corresponds to the keyword.
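Steps S403 and S404 amount to filtering the traversal result and adding the starting node. In the sketch below, the predicates `is_related` and `has_topic_drift` are hypothetical placeholders for the preset correlation check and the drift check, and the labels assigned to the sample nodes are invented for illustration.

```python
def build_first_node_set(start, traversed, is_related, has_topic_drift):
    """Keep traversed nodes that are related and drift-free, plus the start node."""
    first_nodes = [n for n in traversed if is_related(n) and not has_topic_drift(n)]
    return {start} | set(first_nodes)

traversed = ["loud speakers", "graphics hardware", "video",
             "music and video", "video hardware"]
unrelated = {"loud speakers"}   # hypothetical: fails the correlation check
drifted = {"music and video"}   # hypothetical: flagged as topic drift

first_node_set = build_first_node_set(
    "computing output devices",
    traversed,
    is_related=lambda n: n not in unrelated,
    has_topic_drift=lambda n: n in drifted,
)
```

Passing the two checks in as functions keeps the set construction independent of how correlation and drift are actually measured.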
Step S405, extracting page information contained in each node in the first node set to obtain target information.
In the information extraction method provided by this embodiment of the application, the node corresponding to the keyword is determined in the global node set to obtain the start node. Traversal then proceeds downward from the start node through the mesh-structured global node set to acquire the first nodes that satisfy the preset correlation with the keyword and have no subject drift, which yields a first node set comprising the start node and all first nodes. Because the nodes in the first node set have no subject drift, the accuracy of the extracted target information is improved. Because the nodes from which page information is extracted are selected according to whether subject drift occurs, rather than being limited to a fixed number of layers, the extracted target information is also more comprehensive.
On the basis of the embodiment shown in fig. 4A, the embodiment of the present application further provides an information extraction method. Fig. 5A is a schematic view of another implementation flow of the information extraction method provided in the embodiment of the present application, and as shown in fig. 5A, the information extraction method provided in the embodiment includes the following steps:
Step S501, acquiring a global node set and keywords for extracting target information.
The nodes in the global node set are stored according to a mesh structure, and page information contained in each node in the global node set forms global information.
For steps S501, S502, S509, and S510 in this embodiment, refer to the description of the corresponding parts of the embodiment shown in fig. 4A; details are not repeated here.
Step S502, determining the node corresponding to the keyword in the global node set to obtain the start node.
Step S503 to step S508 are an implementation manner of "starting from the start node, traversing downward according to the mesh structure, and obtaining a first node in the global node set, which satisfies a preset correlation with the keyword and has no subject drift", and specifically, the implementation manner is as follows:
Step S503, establishing a current node set.
Here, the start node is included in the current node set.
A current node set is established; at the beginning it includes only one node, namely the start node. Still taking fig. 4B as an example, the current node set is {computing output devices}.
Step S504, acquiring all child nodes of each node included in the current node set in the global node set.
The method provided by this embodiment adopts breadth-first traversal to acquire, in the global node set, the child nodes of each node included in the current node set. At the beginning, the current node set includes only the start node, so all child nodes of the start node are acquired in the global node set; that is, the child nodes of "computing output devices" are "loudspeakers", "graphics hardware", and "video".
Step S505, deleting the child nodes with subject drift from all the child nodes to obtain a child node set.
Each child node is judged to determine whether it has subject drift: if so, it is deleted; if not, it is added to the child node set. After all child nodes of all nodes in the current node set have been judged and all child nodes with subject drift have been deleted, the child node set is obtained. In this example, no subject drift is found in "loudspeakers", "graphics hardware", or "video", so the resulting child node set is {loudspeakers, graphics hardware, video}.
Step S506, judging whether the child node set is empty.
When the child node set is not empty, it indicates that the child nodes in the child node set are child nodes without subject drift, and as the currently obtained child node set is { loudspeakers, graphics hardware, video }, and is obviously not empty, the method proceeds to step S507.
When the child node set is empty, it indicates that there is no child node for each node in the current node set, or all child nodes of each node in the current node set are child nodes with subject drift, and all child nodes have been deleted, and then step S509 is performed.
Step S507, determining a child node in the child node set as a first node.
Since the child nodes in the child node set satisfy the preset correlation with the keyword and have no subject drift, they can be determined as first nodes. In this embodiment, the child nodes "loudspeakers", "graphics hardware", and "video" are all descendant nodes of the start node "computing output devices" without subject drift, and all of them can serve as first nodes in the first node set.
Step S508, updating the child node set to a current node set.
To examine all descendant nodes of the start node, when the child node set is not empty the child nodes of each node in the child node set must be examined in turn. The child node set is therefore taken as the new current node set, and the process returns to step S504.
In this embodiment, because only the child node set (from which the drifted nodes have been deleted) becomes the new current node set, the descendant nodes of a child node with subject drift are not traversed when the next layer is examined; that is, it is unnecessary to judge whether the child nodes of a drifted node also drift. This reduces the number of judgments, shortens the time needed to obtain the first node set, and allows the target information to be obtained sooner.
In step S509, the start node and the acquired first node are determined as a first node set.
Here, the nodes included in the first node set are nodes that satisfy a preset correlation with the keyword and do not have subject drift.
When step S506 determines that the child node set is empty, the judgment of whether subject drift occurs has been completed for every descendant node of the start node that needs to be judged; at this point, the start node and the first nodes obtained in each cycle are determined as the first node set.
Step S510, extracting page information included in each node in the first node set to obtain target information.
In the information extraction method provided by this embodiment of the application, a current node set is established, all child nodes of each node in the current node set are acquired in the global node set, and the child nodes with subject drift are deleted to obtain a child node set. When the child node set is not empty, its members are the child nodes of the current node set without subject drift; they are determined as first nodes, and the child node set becomes the current node set so that traversal continues with the next layer. When the child node set is empty, the drift nodes among the descendant nodes of the start node have all been deleted; the start node and the first nodes obtained in each cycle are then determined as the first node set, which guarantees that the nodes in the first node set have no subject drift. Extracting target information from the nodes in the first node set therefore improves the accuracy of the extracted target information. Because the nodes from which page information is extracted are selected according to whether subject drift occurs, rather than being limited to a fixed number of layers, the extracted target information is also more comprehensive.
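The loop formed by steps S503 to S509 can be sketched in Python as follows. This is an illustrative sketch rather than the claimed implementation: `first_node_set` and the `has_drift(child, current)` predicate are hypothetical names, the edge list is inferred from the traversal orders quoted for fig. 4B, and the drifted node chosen in the usage lines is only an assumption made for demonstration.

```python
def first_node_set(children, start, has_drift):
    """Layer-by-layer screening following steps S503 to S509.

    children:  dict mapping a node to the list of its child nodes
    has_drift: hypothetical predicate (child, current_set) -> bool
    """
    current = {start}                 # S503: current node set holds the start node
    first_nodes = set()
    while True:
        # S504: collect all child nodes of every node in the current node set
        candidates = {c for node in current for c in children.get(node, [])}
        # S505: delete the child nodes with subject drift
        kept = {c for c in candidates if not has_drift(c, current)}
        # Guard not stated in the source: skip nodes already accepted, so that
        # a node reachable through several parents is expanded only once.
        kept -= first_nodes | {start}
        if not kept:                  # S506: child node set is empty
            break
        first_nodes |= kept           # S507: survivors are first nodes
        current = kept                # S508: child node set becomes current set
    return {start} | first_nodes      # S509: start node plus all first nodes

# Edges inferred from the orders quoted for fig. 4B (an assumption).
children = {
    "computing output devices": ["loudspeakers", "graphics hardware", "video"],
    "graphics hardware": ["display adapter", "graphics chips"],
    "video": ["music and video", "video hardware"],
    "music and video": ["animation music", "file music"],
}
# Illustrative assumption: only "music and video" is treated as drifted.
result = first_node_set(children, "computing output devices",
                        lambda child, current: child == "music and video")
```

Because "music and video" is removed before it can enter the current node set, its children "animation music" and "file music" are never examined, which is the saving in judgments that the embodiment describes.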
In some embodiments, the step S505 "delete the child node with the subject drift in all the child nodes to obtain the child node set", may be implemented by:
Step S5051, acquiring, in the global node set, the direct parent node of each child node.
The direct parent node of a child node is the parent node that has the greatest correlation with the child node among all parent nodes of the child node.
When a node has subject drift, its direct parent node is not among the descendant nodes of the start node. Based on this, in the present embodiment, whether subject drift occurs is determined by judging whether the direct parent node of the child node belongs to the current node set.
Step S5052, if the direct parent node of the current child node does not belong to the current node set, determining that the current child node is a child node in which the subject drift occurs.
When it is judged that the direct parent node of the child node does not belong to the current node set, the direct parent node lies in the part of the global node set outside the descendant nodes of the start node. The correlation between the child node and its parent node in the current node set is then smaller than the correlation between the child node and its direct parent node, and the child node is determined to have subject drift.
Step S5053, deleting the child node with the subject drift from all the child nodes, and obtaining a child node set.
In the embodiments shown in steps S5051 to S5053, by obtaining the direct parent node of each child node in the global node set, when the direct parent node belongs to the current node set, it is determined that the subject drift does not occur in the child node, and when the direct parent node does not belong to the current node set, it is determined that the subject drift occurs in the child node, it is implemented to determine whether the subject drift occurs in the node, and delete the child node in which the subject drift occurs from all the child nodes, so as to obtain all the child nodes in which the subject drift does not occur.
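The membership test of steps S5052 and S5053 reduces to a set filter. The sketch below assumes the direct parent of each candidate is already known; `drop_drifted` is a hypothetical name, and the direct parent given for "video hardware" is an assumption for illustration.

```python
def drop_drifted(candidates, direct_parent, current):
    """Keep a child only if its direct parent belongs to the current node set.

    candidates:    child nodes of the nodes in the current node set
    direct_parent: dict mapping each child to its direct parent node
    current:       the current node set
    A child with no recorded direct parent is treated as drifted here
    (a choice made for this sketch, not stated in the source).
    """
    return {c for c in candidates if direct_parent.get(c) in current}

current = {"video"}
candidates = {"music and video", "video hardware"}
direct_parent = {
    "music and video": "music culture",   # most correlated parent lies outside the branch
    "video hardware": "video",            # assumed for illustration
}
kept = drop_drifted(candidates, direct_parent, current)
```

Here "music and video" is dropped because its direct parent "music culture" is not in the current node set, while "video hardware" is kept.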
In some embodiments, the step S5051 "obtaining the direct parent node of each child node in the global node set" may be implemented by:
Step S5051a, acquiring all parent nodes of the current child node in the global node set.
When the direct parent node of a child node is to be obtained, all parent nodes of the child node are first obtained according to the mesh structure of the global node set; that is, all nodes from which the child node can be reached directly, without passing through any other node, are obtained and taken as the parent nodes of the child node.
Step S5051b, calculating the correlation between the current child node and each parent node of the current child node.
After all parent nodes of the current child node are obtained, the correlation between the child node and each parent node is calculated, and the parent node with the greatest correlation is selected as the direct parent node of the child node.
Step S5051c, determining the parent node with the greatest correlation as the direct parent node of the current child node.
In the embodiment shown in steps S5051a to S5051c, the correlation between the child node and each of its parent nodes is calculated and the parent node with the greatest correlation is taken as the direct parent node of the child node, so that whether the child node has subject drift can be judged according to its direct parent node.
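Selecting the direct parent in steps S5051a to S5051c is an arg-max over the parent nodes. The following sketch assumes a correlation function is supplied from elsewhere; `pick_direct_parent` and `SCORES` are hypothetical names, and the value 0.1 is a placeholder chosen for illustration (only 0.2 appears in the description).

```python
def pick_direct_parent(child, parents_of, relevance):
    """Return the parent of `child` with the greatest correlation.

    parents_of: dict mapping a node to the list of its parent nodes (S5051a)
    relevance:  callable (child, parent) -> float (S5051b)
    Ties go to the parent listed first; the source does not say how ties
    are broken. Raises ValueError for a node without parents (the root).
    """
    scores = {p: relevance(child, p) for p in parents_of[child]}
    return max(scores, key=scores.get)                      # S5051c

parents_of = {"music and video": ["music culture", "video"]}
SCORES = {
    ("music and video", "music culture"): 0.2,  # value given in the description
    ("music and video", "video"): 0.1,          # placeholder value (assumption)
}
best = pick_direct_parent("music and video", parents_of,
                          lambda c, p: SCORES[(c, p)])
```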
In some embodiments, the step S5051b "calculating the relevance of the current child node to each parent node of the current child node" may be implemented by:
Step S051b1, acquiring, in the global node set, the ancestor nodes of the ith parent node of the current child node to obtain a first ancestor node set.
Here, i = 1, 2, …, M, where M is the number of parent nodes of the current child node and M is a positive integer.
The ancestor nodes of the ith parent node of the current child node are all nodes on all paths that start from the root node of the global node set and reach the ith parent node. Fig. 5B is a schematic diagram of another storage structure of nodes according to an embodiment of the present application. As shown in fig. 5B, the current child node is "music and video", and the ancestor nodes of its 1st parent node "music culture" are "categories", "entertainment", and "music", which gives the first ancestor node set {categories, entertainment, music}.
Step S051b2, based on the ancestor node of each child node of the current child node, determining a second ancestor node set.
The child node of the current child node "music and video" is "animation music", so the second ancestor node set is determined according to the ancestor nodes of "animation music".
In this embodiment, step S051b2 may be implemented by:
Step Sb21, acquiring the ancestor nodes of each child node of the current child node to obtain a third ancestor node set.
Still referring to fig. 5B, the child node of the current child node "music and video" is "animation music", and the ancestor nodes of "animation music" include "categories", "entertainment", "animation", "music", "music culture", "music and video", "information", "computer", "computing output devices", and "video", which gives the third ancestor node set {categories, entertainment, animation, music, music culture, music and video, information, computer, computing output devices, video}.
Step Sb22, removing the ancestor node included in the path passing through the ith parent node from the third ancestor node set, to obtain a second ancestor node set.
The path through the 1st parent node "music culture" is 1 → 2 → 4 → 5 → 6. After this path is removed, the remaining paths are 1 → 2 → 3 and 1 → 8 → 9 → 10 → 11 → 6. The ancestor nodes on the remaining paths are "categories", "entertainment", "animation", "music and video", "information", "computer", "computing output devices", and "video", which gives the second ancestor node set {categories, entertainment, animation, music and video, information, computer, computing output devices, video}.
Step S051b3, calculating the correlation degree between the current child node and the ith parent node according to the first ancestor node set and the second ancestor node set.
The correlation between the current child node "music and video" and its 1st parent node "music culture" is calculated according to the first ancestor node set {categories, entertainment, music} and the second ancestor node set {categories, entertainment, animation, music and video, information, computer, computing output devices, video}.
In this embodiment, step S051b3 may be implemented by:
Step Sb31, determining the intersection of the first ancestor node set and the second ancestor node set to obtain a shared ancestor set.
The intersection of the first ancestor node set {categories, entertainment, music} and the second ancestor node set {categories, entertainment, animation, music and video, information, computer, computing output devices, video} is calculated, and the resulting shared ancestor set is {categories, entertainment}.
Step Sb32, determining a union of the first ancestor node set and the third ancestor node set to obtain an ancestor union.
The union of the first ancestor node set {categories, entertainment, music} and the third ancestor node set {categories, entertainment, animation, music, music culture, music and video, information, computer, computing output devices, video} is calculated, and the resulting ancestor union is {categories, entertainment, animation, music, music culture, music and video, information, computer, computing output devices, video}.
Step Sb33, determining the first element number of the shared ancestor set and the second element number of the ancestor union, respectively.
The shared ancestor set {categories, entertainment} has 2 elements, so the first element number is 2. The ancestor union {categories, entertainment, animation, music, music culture, music and video, information, computer, computing output devices, video} has 10 elements, so the second element number is 10.
Step Sb34, determining the ratio of the first element number to the second element number as the correlation between the current child node and the ith parent node.
The correlation between the current child node "music and video" and its 1st parent node "music culture" is therefore 2/10 = 0.2.
In the information extraction method provided in this embodiment, a third ancestor node set is obtained by acquiring the ancestor nodes of each child node of the current child node, and a second ancestor node set is obtained by removing from the third ancestor node set the ancestor nodes on paths that pass through the ith parent node. The correlation between the current child node and the ith parent node is then calculated from the first ancestor node set and the second ancestor node set. According to this correlation, the parent node with the greatest correlation can be selected from the multiple parent nodes as the direct parent node of the current child node, so that whether subject drift occurs in the current child node can be determined in combination with the current node set, and the first node set can be obtained.
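The set operations of steps S051b1 to Sb34 can be checked against the worked example with a short Python sketch. The parent links below are an assumption inferred from the paths quoted for fig. 5B, the names `PARENTS`, `ancestors`, and `relevance` are hypothetical, and the second ancestor node set is computed by ignoring the ith parent during the upward search, which is one reading of step Sb22.

```python
# Parent links inferred from the paths quoted for fig. 5B (an assumption).
PARENTS = {
    "entertainment": ["categories"],
    "animation": ["entertainment"],
    "music": ["entertainment"],
    "music culture": ["music"],
    "information": ["categories"],
    "computer": ["information"],
    "computing output devices": ["computer"],
    "video": ["computing output devices"],
    "music and video": ["music culture", "video"],
    "animation music": ["animation", "music and video"],
}
ROOT = "categories"
CHILDREN = {}
for _child, _parents in PARENTS.items():
    for _parent in _parents:
        CHILDREN.setdefault(_parent, []).append(_child)

def ancestors(node, blocked=None):
    """Nodes from which `node` is reachable without passing through `blocked`."""
    found, stack = set(), [node]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent != blocked and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def relevance(v, parent_i):
    """Correlation of child node v with its parent parent_i (S051b1 to Sb34)."""
    kids = CHILDREN.get(v, [])
    first = ancestors(parent_i)                          # S051b1
    third = set().union(*(ancestors(s) for s in kids))   # Sb21
    second = set()                                       # Sb22
    for s in kids:
        for a in ancestors(s, blocked=parent_i):
            # keep only nodes still connected to the root without parent_i
            if a == ROOT or ROOT in ancestors(a, blocked=parent_i):
                second.add(a)
    shared = first & second                              # Sb31
    union = first | third                                # Sb32
    # The childless-root case is not covered by the source; return 0.0 there.
    return len(shared) / len(union) if union else 0.0    # Sb33, Sb34

rel_music_culture = relevance("music and video", "music culture")  # 2 / 10
```

With these inferred links the sketch reproduces the shared ancestor set of 2 elements and the ancestor union of 10 elements described above.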
Based on the foregoing embodiments, an embodiment of the present application further provides an information extraction method applied to the network architecture shown in fig. 1. Fig. 6 is a schematic diagram of a further implementation flow of the information extraction method provided in the embodiment of the present application. As shown in fig. 6, the information extraction method provided in this embodiment includes the following steps:
Step S601, the server sends the global node set to the information extraction device.
In this embodiment, the information extraction device obtains the global node set from a server. The server may send the global node set to the information extraction device on its own initiative, or may send the global node set after receiving a trigger instruction, where the trigger instruction may originate from the server itself or be sent by the information extraction device. In actual implementation, the server may send the global node set to the information extraction device after receiving a trigger instruction sent by the information extraction device, such as a request instruction for obtaining the global node set.
In step S602, the information extraction device acquires a keyword for extracting target information in response to an input operation instruction performed by a user on the information extraction device.
Here, the order of steps S601 and S602 is not limited, and the information extraction device may receive the global node set first and then acquire the keyword, or may acquire the keyword first and then receive the global node set. And, the global node set may be received in advance and stored in its own memory, and after a period of time, the keyword may be acquired.
Step S603, the information extraction device determines the node corresponding to the keyword in the global node set to obtain the start node.
After the keyword is obtained, the node with the closest semantics is selected from the global node set as the node corresponding to the keyword, namely the start node.
In step S604, the information extraction device establishes a current node set.
Here, the start node is included in the current node set.
Step S605, the information extraction device acquires all child nodes of each node included in the current node set in the global node set.
Here, the child nodes of each node included in the current node set may be acquired in the global node set by using a breadth-first traversal method.
Step S606, the information extraction device obtains all parent nodes of the current child node in the global node set.
When the direct parent node of a child node is to be obtained, all parent nodes of the child node are first obtained according to the mesh structure of the global node set; that is, all nodes from which the child node can be reached directly, without passing through any other node, are obtained and taken as the parent nodes of the child node.
Step S607, the information extraction device obtains the ancestor node of the ith parent node of the current child node in the global node set to obtain a first ancestor node set.
Here, i = 1, 2, …, M, where M is the number of parent nodes of the current child node and M is a positive integer.
The ancestor nodes of the ith parent node of the current child node are all nodes on all paths that start from the root node of the global node set and reach the ith parent node. All ancestor nodes of the ith parent node are taken as the first ancestor node set.
Step S608, the information extraction device obtains ancestor nodes of each child node of the current child node, and obtains a third ancestor node set.
In step S609, the information extraction device removes, from the third ancestor node set, the ancestor nodes on paths that pass through the ith parent node, to obtain a second ancestor node set.
Here, the second ancestor node set may be obtained in the manner of step S609, or through the following steps: first, according to the mesh structure of the global node set, all paths from the root node to each child node of the current child node are obtained as a first path set, and all paths from the root node to the ith parent node are obtained as a second path set; then the second path set is removed from the first path set, and all nodes passed by the remaining paths are taken as the second ancestor node set.
Step S610, the information extraction device determines an intersection of the first ancestor node set and the second ancestor node set, to obtain a shared ancestor set.
In step S611, the information extraction device determines a union of the first ancestor node set and the third ancestor node set to obtain an ancestor union.
Step S612, the information extraction device determines the number of the first elements of the shared ancestor set and the number of the second elements of the ancestor union set respectively.
In step S613, the information extraction device determines the ratio of the first element number to the second element number as the correlation between the current child node and the ith parent node.
In step S614, the information extraction device determines the parent node with the largest correlation as the direct parent node of the current child node.
The direct parent node of a child node is the parent node that has the greatest correlation with the child node among all parent nodes of the child node.
Step S615, the information extraction device determines whether the direct parent node of the current child node belongs to the current node set.
If the direct parent node of the current child node belongs to the current node set, it indicates that the subject drift does not occur in the current child node, and the step S606 is entered. If the direct parent node of the current child node does not belong to the current node set, it indicates that the subject drift occurs in the current child node, and the step S616 is performed.
In step S616, the information extraction device determines that the current child node is a child node in which subject drift occurs.
In step S617, the information extraction device deletes the child node with the subject drift from all the child nodes to obtain a child node set.
In step S618, the information extraction device determines whether the child node set is empty.
When the child node set is not empty, it indicates that the child nodes in the child node set are child nodes without subject drift, and the step S619 is entered. When the child node set is empty, it indicates that there is no child node for each node in the current node set, or all child nodes of each node in the current node set are child nodes with subject drift, and all child nodes have been deleted, and then step S621 is performed.
Step S619, the information extraction device determines a child node in the child node set as a first node.
Since the child nodes in the child node set satisfy the preset correlation with the keyword and have no subject drift, they can be determined as first nodes.
Step S620, the information extraction device updates the child node set to a current node set.
To examine all descendant nodes of the start node, when the child node set is not empty the child nodes of each node in the child node set must be examined in turn. The child node set is therefore taken as the new current node set, and the process returns to step S605.
Step S621, the information extraction device determines the start node and the acquired first node as a first node set.
Here, the nodes included in the first node set are nodes that satisfy a preset correlation with the keyword and do not have subject drift.
In step S622, the information extraction device extracts page information included in each node in the first node set to obtain target information.
In the information extraction method provided by the embodiment of the application, the first node set corresponding to the keyword is obtained through screening in the global node set, the nodes in the first node set are nodes without subject drift, the accuracy of the extracted target information is improved, the nodes for extracting the page information are determined through whether the subject drift occurs or not, and the nodes are not limited to the nodes on the limited layer any more, so that the extracted target information is more comprehensive.
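The alternative, path-based way of obtaining the second ancestor node set that is described alongside step S609 can be sketched as follows. The sketch reads "removing the second path set from the first path set" as discarding every root-to-child path that passes through the ith parent node; that reading, the parent links inferred from the paths quoted for fig. 5B, and the names `paths_to` and `second_ancestor_set` are all assumptions.

```python
# Parent links inferred from the paths quoted for fig. 5B (an assumption).
PARENTS = {
    "entertainment": ["categories"],
    "animation": ["entertainment"],
    "music": ["entertainment"],
    "music culture": ["music"],
    "information": ["categories"],
    "computer": ["information"],
    "computing output devices": ["computer"],
    "video": ["computing output devices"],
    "music and video": ["music culture", "video"],
    "animation music": ["animation", "music and video"],
}

def paths_to(node):
    """All paths from the root to `node`, as tuples; assumes no cycles."""
    parents = PARENTS.get(node, [])
    if not parents:                       # the root has no parent
        return [(node,)]
    return [path + (node,) for p in parents for path in paths_to(p)]

def second_ancestor_set(v, parent_i):
    """Nodes on root-to-child paths of v's children that avoid parent_i."""
    children_of_v = [c for c, ps in PARENTS.items() if v in ps]
    nodes = set()
    for child in children_of_v:
        for path in paths_to(child):      # first path set
            if parent_i not in path:      # drop paths through the ith parent
                nodes.update(path[:-1])   # the child itself is not an ancestor
    return nodes

second = second_ancestor_set("music and video", "music culture")
```

For the fig. 5B example this yields the same eight nodes as the removal described in step S609. Enumerating every path can become expensive in a densely connected mesh, so this form is mainly useful for checking small examples.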
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Ontology was originally a branch of philosophy and has been given new meaning with the development of artificial intelligence. The widely recognized definition is: an explicit formal specification of a shared conceptual model. The higher a concept sits in an ontology's classification system, the more abstract and general its meaning; the lower it sits, the more concrete and detailed its meaning. The concept similarity of an ontology refers to the degree of association between concepts and the closeness of their meanings. A basic assumption of ontology concept similarity is that two similar concepts have a common ancestor node in the ontology; common algorithms fall into two major categories, those based on path distance and those based on information entropy. The nodes of an encyclopedia embody a classification of the objective world, and the meaning of each node can be regarded as a concept in an ontology, so the classification system of the nodes can be regarded as an ontology describing open-domain knowledge.
In an encyclopedia, every node except the globally unique root node is subordinate to at least one parent node, and each node includes at least one child node. The nodes of the entire encyclopedia thus form a mesh-like, hierarchical topology, in which a subordinate node can be regarded as a subclass, that is, a refinement, of its superordinate node.
In order to extract target document data of interest from an encyclopedia, the nodes of the encyclopedia can be regarded as an ontology, and the position of each node in the mesh structure also reflects its meaning. The conceptual theory and similarity calculation methods of ontologies are therefore used to select the nodes related to the field of interest from the mesh-structured nodes, and the page documents of the selected nodes are extracted as the document data related to that field.
In the information extraction method provided in the embodiment of the present application, a node c_start is first determined among the encyclopedia nodes according to the field of interest and used as the entry node (i.e., the node corresponding to the keyword in other embodiments). With c_start as the starting point (i.e., the start node in other embodiments), a breadth-first or depth-first algorithm can be used to traverse downward layer by layer according to the superordinate-subordinate relationships of the node hierarchy, so as to obtain the set of all child nodes. The child nodes encountered during traversal include nodes related to the starting topic (i.e., nodes without subject drift in other embodiments) as well as unrelated drift nodes (i.e., nodes with subject drift in other embodiments). To obtain an accurate set of page documents for the field, the unrelated drift nodes must be identified and eliminated, and once a node is confirmed to be a drift node, traversal of that node's branch stops. Fig. 7 is a schematic view of a further implementation flow of the information extraction method provided in the embodiment of the present application. As shown in fig. 7, the information extraction method provided in this embodiment includes the following steps:
Step S701, enqueuing the starting point.
Step S702 determines whether the queue is empty.
When the queue is empty, the drift nodes among all the child nodes in the child node set have been excluded and all nodes without subject drift have been obtained, and the process proceeds to step S710. When the queue is not empty, there are still child nodes for which it has not been determined whether subject drift occurs, and the process proceeds to step S703.
Step S703, dequeuing the head node of the queue.
The head node of the queue is taken as the node v_p, and all child nodes of v_p are obtained in breadth-first order.
In step S704, a child node v of the node v_p is acquired.
Starting from the first child node of v_p and ending with its last child node, a loop judges whether subject drift occurs in each child node of v_p; the judgment is carried out in steps S705 to S707.
Step S705, determining whether the child node v is successfully acquired.
If the child node v is successfully acquired, there is still a child node for which it has not been determined whether subject drift occurs, that is, the last child node of v_p has not yet been judged, and step S706 is performed. If the child node v is not acquired, either v_p has no child node or all child nodes of v_p have already been judged; the child nodes of the next node in the queue are then judged, and the process returns to step S703.
Step S706, calculating the correlation degree between the child node v and each parent node.
A child node v may have multiple parent nodes, all of which, like the node v_p, are related to v through the superordinate-subordinate relationship. However, their correlations with v generally differ: when the content of the indirect child nodes (i.e., the child nodes of the child node v) and the indirect parent nodes (i.e., the parent nodes of the child node v) is considered, one parent node has a closer semantic relationship with v than the others. In this embodiment, the parent node with the closest relationship, i.e., the highest correlation, is referred to as the direct parent node. The child nodes of the child node v are an extension and refinement of the meaning of v, and they also contribute semantic content to v.
Based on this, in the embodiment, when the correlation between the child node v and the ith parent node v_pi is calculated, the ancestor nodes of the ith parent node v_pi and each child node of the child node v are taken into consideration together.
In this embodiment, a formula for calculating the correlation rel (v, v _ pi) between the child node v and the ith parent node v _ pi is shown in the following formula 1:
[Equation image BDA0002350811980000231: formula (1)]
where LCA(v_sj, v_pi) is the set of nodes on the shared path of the jth child node v_sj of the child node v and the ith parent node v_pi of the child node v; the shared path of v_sj and v_pi does not include the path from v_pi to the root node, e.g., LCA(animation).
t(v) is the set of super-concepts of the child node v (i.e., the ancestor node set in other embodiments): t(v) = {v_gp | v_gp is an ancestor node of the node v}, e.g., t(music) = {resources, entertainment}.
|·| denotes the number of elements of a set, e.g., |LCA(animation)| = |{categories, entertainment}| = 2.
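As an illustration of the super-concept set t(v) and of the element count |·|, the following is a minimal Python sketch that collects the ancestors of a node in a mesh (multi-parent) structure. The `parents` mapping, the function name `ancestor_set`, and the node names are hypothetical and are not taken from the figures of this application.

```python
from collections import deque

def ancestor_set(node, parents):
    """Return t(node): every node reachable by following parent links upward."""
    seen = set()
    queue = deque(parents.get(node, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # in a mesh, one ancestor can be reached along several paths
        seen.add(current)
        queue.extend(parents.get(current, ()))
    return seen

# Hypothetical mesh: "music" has two parents that share the ancestor "root".
parents = {
    "music": ["arts", "media"],
    "arts": ["root"],
    "media": ["root"],
    "root": [],
}

t_music = ancestor_set("music", parents)
print(sorted(t_music), len(t_music))
```

Because the structure is a mesh rather than a tree, the `seen` set ensures that a shared ancestor is counted only once, which is what the set notation above requires.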
Taking fig. 5B as an example, the correlation degrees between the child node 6 and its 1st parent node 5 and its 2nd parent node 11 are calculated respectively:
[Equation images BDA0002350811980000232 and BDA0002350811980000241: calculation of rel(6, 5) and rel(6, 11)]
From the above, rel(6, 5) > rel(6, 11), i.e., the correlation degree between the child node 6 and its 1st parent node 5 is greater than the correlation degree between the child node 6 and its 2nd parent node 11.
Step S707, determining whether the correlation degree between the child node v and the node v_p is the maximum.
Each child node v in the mesh structure is regarded as a concept in the ontology. To determine whether the child node v falls within the starting point field, the parent node having the maximum correlation degree with the child node v among its parent nodes v_p1, v_p2, …, v_pn is calculated and denoted c_p*. In this embodiment, the node c_p* having the maximum correlation degree with the child node v is calculated as shown in formula 2 below:
c_p* = max{rel(v, v_pi)} (2)
where i is a positive integer less than or equal to n. The node returned by max{rel(v, v_pi)} is the parent node that, among the n parent nodes of the child node v, has the maximum correlation degree with the child node v, namely c_p*.
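The selection in formula 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `direct_parent` is hypothetical, and the numeric scores are made-up values chosen so that rel(6, 5) > rel(6, 11); they are not the values computed in this application.

```python
def direct_parent(child, parent_ids, rel):
    """Return c_p*: the parent whose correlation degree with `child` is highest."""
    # max() with a key returns the first maximal element if several parents tie.
    return max(parent_ids, key=lambda p: rel(child, p))

# Hypothetical correlation degrees for child node 6 with parent nodes 5 and 11.
scores = {(6, 5): 0.5, (6, 11): 0.25}

best = direct_parent(6, [5, 11], lambda v, p: scores[(v, p)])
print(best)  # 5
```

Note that formula 2 is written with max, but what is needed is the parent node that attains the maximum, which is why the sketch uses `max` with a `key` function rather than taking the maximum of the scores themselves.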
Judging the correlation between a child node and the starting point field means judging whether the child node belongs to the starting point field. Whether topic drift has occurred at the child node v is determined by calculating the correlation degree between the child node v and each of its parent nodes v_p1, v_p2, …, v_pn and comparing these correlation degrees. If the parent node having the maximum correlation degree with the child node v is the node v_p, the child node v is related to the starting point field, that is, no topic drift has occurred at the child node v. If the parent node having the maximum correlation degree with the child node v is not the node v_p, the child node v is unrelated to the starting point field, that is, topic drift has occurred at the child node v.
For example, with the node 10 as the starting point, the child node 11 is traversed first. The child node 11 has no other parent node, so its correlation degree with the node 10 is the maximum and no topic drift occurs at the child node 11. The traversal then continues downward to the child node 6. According to the calculation in step S706, the correlation degree between the child node 6 and the node 5 is greater than that between the child node 6 and the node 11, so the correlation degree between the child node 6 and the node 11 is not the maximum, and topic drift occurs at the child node 6.
When the correlation degree between the child node v and the node v_p is the maximum, no topic drift has occurred at the child node v, that is, the child node v is related to the starting point field. The child nodes of the child node v then also need to be judged, so the child node v is to be enqueued, and the process proceeds to step S708.
When the correlation degree between the child node v and the node v_p is not the maximum, topic drift has occurred at the child node v, that is, the child node v is unrelated to the starting point field. The child nodes of a child node with topic drift need not be judged, so the child node v is not enqueued; judgment of the next child node of the node v_p starts directly, and the process proceeds to step S709.
Step S708, adding the child node v to the queue.
After it is determined that no topic drift has occurred at the current child node v of the node v_p, the next child node of the node v_p is judged, and the process proceeds to step S709.
Step S709, acquiring the next child node v of the node v_p.
After the next child node v of the node v_p is acquired, whether topic drift has occurred at it is judged, and the process returns to step S705.
Step S710, acquiring the page documents of all nodes without topic drift to obtain the page documents related to the starting point field.
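The queue-based flow of steps S703 to S709 can be sketched in Python as follows. This is a minimal illustration, not the implementation of this embodiment: the function name, the `children` and `parents` mappings, the hypothetical node 7, and the numeric correlation degrees are all assumptions. Only the facts that node 6 has parent nodes 5 and 11, that node 11 is a child of node 10, and that rel(6, 5) > rel(6, 11) come from the description above.

```python
from collections import deque

def collect_related_nodes(start, children, parents, rel):
    """Breadth-first traversal that does not descend below nodes with topic drift."""
    related = [start]
    queued = {start}
    queue = deque([start])
    while queue:
        v_p = queue.popleft()                    # S703: dequeue the head node
        for v in children.get(v_p, ()):          # S704/S709: each child of v_p in turn
            if v in queued:
                continue                         # assumption: never enqueue a node twice
            # S706: correlation degree of v with each of its parents
            best = max(parents[v], key=lambda p: rel(v, p))
            if best == v_p:                      # S707: v_p is the most related parent
                queue.append(v)                  # S708: no drift, enqueue v
                queued.add(v)
                related.append(v)
            # otherwise v has drifted and its children are never visited
    return related

# Hypothetical mesh fragment: 10 -> 11 -> 6 -> 7 and 5 -> 6.
children = {10: [11], 11: [6], 5: [6], 6: [7]}
parents = {11: [10], 6: [5, 11], 7: [6]}
scores = {(6, 5): 0.5, (6, 11): 0.25}          # made-up values with rel(6,5) > rel(6,11)

def rel(v, p):
    return scores.get((v, p), 1.0)              # single-parent nodes get a default score

print(collect_related_nodes(10, children, parents, rel))  # [10, 11]
print(collect_related_nodes(5, children, parents, rel))   # [5, 6, 7]
```

Starting from node 10, node 6 is rejected because its most related parent is node 5, so the traversal depth is decided by the data rather than fixed in advance. Step S710 would then fetch the page documents of the returned nodes.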
In the information extraction method provided in this embodiment, encyclopedic nodes are regarded as an ontology. A node in the mesh structure is selected as the starting point according to the field; this node is the entry node representing the highest-level concept of the field to be extracted. Starting from the entry node, the mesh structure is traversed layer by layer, and whether the page document of each encyclopedic node is related to the starting point field is judged from the position of the node in the directory structure, by combining the mesh structure with the concept similarity calculation of the ontology. When a node is related to the starting point field, the traversal continues downward from it; when a node is unrelated to the starting point field, the traversal does not continue below it and subsequent processing of the node is abandoned, so that the depth to be traversed is determined dynamically. Compared with acquiring page documents by traversing to a fixed depth, the method of this embodiment makes full use of the mesh-structure storage of encyclopedic nodes and judges the relevance between nodes and the starting point from the mesh structure, which is intuitive, efficient, and easy to understand. Irrelevant noise page documents are excluded from the acquired page documents, the acquired documents are more comprehensive, and the quality of the document data is improved.
The following continues the description of an exemplary structure in which the information extraction device 80 provided by the embodiments of the present application is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the information extraction device 80 stored in the memory 140 may include:
an obtaining module 81, configured to obtain the global node set and the keywords used for extracting the target information.
Here, the nodes in the global node set are stored according to a mesh structure, and the page information included in each node in the global node set constitutes global information.
a screening module 82, configured to screen out, according to the mesh structure, a first node set corresponding to the keyword in the global node set.
Here, the nodes included in the first node set are nodes that satisfy a preset correlation relationship with the keyword and have no subject drift.
an extracting module 83, configured to extract the page information included in each node in the first node set to obtain the target information.
In some embodiments, the screening module 82 further includes:
a first determining submodule, configured to determine the node corresponding to the keyword in the global node set to obtain a start node;
an obtaining submodule, configured to traverse downward from the start node according to the mesh structure to obtain a first node in the global node set that satisfies the preset correlation relationship with the keyword and has no subject drift;
a second determining submodule, configured to determine the start node and the obtained first node as the first node set.
In some embodiments, the obtaining submodule further includes:
an establishing unit, configured to establish a current node set.
Here, the current node set includes the start node.
an obtaining unit, configured to obtain, in the global node set, all child nodes of each node included in the current node set;
a deleting unit, configured to delete, from all the child nodes, the child nodes in which subject drift occurs, to obtain a child node set;
a determining unit, configured to determine the child nodes in the child node set as first nodes when the child node set is not empty;
an updating unit, configured to take the child node set as the new current node set.
In some embodiments, the deleting unit further includes:
an obtaining subunit, configured to obtain the direct parent node of each child node in the global node set.
Here, the direct parent node of a child node is the parent node that, among all parent nodes of the child node, has the maximum correlation degree with the child node;
a determining subunit, configured to determine that the current child node is a child node in which subject drift occurs if the direct parent node of the current child node does not belong to the current node set;
a deleting subunit, configured to delete the child nodes in which subject drift occurs from all the child nodes to obtain the child node set.
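The cooperation of the establishing, obtaining, deleting, determining and updating units can be sketched as a level-by-level loop in Python. This is a hedged illustration only: the function name `screen_first_node_set`, the `children` mapping, and the precomputed `direct_parent_of` mapping are hypothetical, and the graph below is a made-up fragment in which node 6 has node 5 as its direct parent.

```python
def screen_first_node_set(start, children, direct_parent_of):
    """Level-by-level screening: keep a child only if its direct parent is in the current set."""
    current = {start}                  # establishing unit: current node set holds the start node
    first_node_set = {start}
    while current:
        # obtaining unit: all child nodes of every node in the current node set
        candidates = set()
        for node in current:
            candidates.update(children.get(node, ()))
        # deleting unit: a child whose direct parent is outside the current set has drifted
        kept = {c for c in candidates if direct_parent_of[c] in current}
        kept -= first_node_set         # assumption: skip nodes already collected
        first_node_set |= kept         # determining unit: surviving children are first nodes
        current = kept                 # updating unit: child node set becomes the current set
    return first_node_set

# Hypothetical mesh fragment: 10 -> 11 -> 6 -> 7 and 5 -> 6.
children = {10: [11], 11: [6], 5: [6], 6: [7]}
direct_parent_of = {11: 10, 6: 5, 7: 6}

print(sorted(screen_first_node_set(10, children, direct_parent_of)))  # [10, 11]
print(sorted(screen_first_node_set(5, children, direct_parent_of)))   # [5, 6, 7]
```

The loop ends when the child node set is empty, which matches the condition under which the determining unit no longer adds first nodes.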
In some embodiments, the obtaining subunit is further configured to:
acquire all parent nodes of the current child node in the global node set;
calculate the correlation degree between the current child node and each of its parent nodes;
and determine the parent node with the maximum correlation degree as the direct parent node of the current child node.
In some embodiments, the obtaining subunit is further configured to:
obtain, in the global node set, the ancestor nodes of the ith parent node of the current child node to obtain a first ancestor node set, where i = 1, 2, …, M, and M is a positive integer;
determine a second ancestor node set based on the ancestor nodes of each child node of the current child node;
and calculate the correlation degree between the current child node and the ith parent node according to the first ancestor node set and the second ancestor node set.
In some embodiments, the obtaining subunit is further configured to:
obtain the ancestor nodes of all child nodes of the current child node to obtain a third ancestor node set;
and remove, from the third ancestor node set, the ancestor nodes included after the path passing through the ith parent node, to obtain the second ancestor node set.
In some embodiments, the obtaining subunit is further configured to:
determine the intersection of the first ancestor node set and the second ancestor node set to obtain a shared ancestor set;
determine the union of the first ancestor node set and the third ancestor node set to obtain an ancestor union;
determine the first element number of the shared ancestor set and the second element number of the ancestor union, respectively;
and determine the ratio of the first element number to the second element number as the correlation degree between the current child node and the ith parent node.
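The ratio described in the last four operations can be sketched in Python with ordinary set operations. The function name `correlation_degree`, the example sets, and the choice to return 0.0 for an empty union are assumptions made for illustration; how the second ancestor node set is derived from the third is not reproduced here and the sets are simply passed in.

```python
def correlation_degree(first, second, third):
    """Ratio of |first ∩ second| to |first ∪ third|."""
    shared = first & second            # shared ancestor set
    union = first | third              # ancestor union
    if not union:
        return 0.0                     # assumption: no ancestors at all means no correlation
    return len(shared) / len(union)

# Hypothetical ancestor sets for one parent node of the current child node.
first = {"a", "b", "c"}                # ancestors of the ith parent node
third = {"a", "b", "d"}                # ancestors of all child nodes of the current child node
second = {"a", "b"}                    # third set after the removal step

print(correlation_degree(first, second, third))  # 0.5
```

Here the shared ancestor set has 2 elements and the ancestor union has 4, so the correlation degree is 2 / 4 = 0.5. The value always lies between 0 and 1 as long as the second set is a subset of the third, since the intersection can then never be larger than the union.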
It should be noted that the above description of the information extraction device embodiment is similar to the above description of the method embodiments, and the device embodiment has the same beneficial effects as the method embodiments. For technical details not disclosed in the information extraction device embodiment of the present application, refer to the description of the method embodiments of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform the methods provided by embodiments of the present application, for example, the methods as illustrated in fig. 3 to 7.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A method of information extraction, the method comprising:
acquiring a global node set and keywords for extracting target information, wherein the nodes in the global node set are stored according to a mesh structure, and page information contained in each node in the global node set forms global information;
screening a first node set corresponding to the keyword in the global node set according to the mesh structure, wherein the nodes in the first node set are nodes which meet a preset correlation relationship with the keyword and do not have subject drift;
and extracting page information contained in each node in the first node set to obtain target information.
2. The method of claim 1, wherein the screening of a first node set corresponding to the keyword in the global node set according to the mesh structure comprises:
determining the node corresponding to the keyword in the global node set to obtain a start node;
starting from the start node, traversing downward according to the mesh structure to obtain a first node in the global node set that satisfies the preset correlation relationship with the keyword and has no subject drift;
and determining the starting node and the acquired first node as a first node set.
3. The method according to claim 2, wherein the step of obtaining, starting from the start node, a first node in the global node set that satisfies a preset correlation relationship with the keyword and has no subject drift according to a downward traversal of the mesh structure comprises:
establishing a current node set, wherein the current node set comprises the starting node;
acquiring all child nodes of each node included in the current node set in the global node set;
deleting child nodes with subject drift in all the child nodes to obtain a child node set;
and when the child node set is not empty, determining child nodes in the child node set as first nodes, and updating the child node set to be a current node set.
4. The method according to claim 3, wherein the deleting child nodes with subject drift from all the child nodes to obtain a child node set comprises:
acquiring a direct parent node of each child node in the global node set, wherein the direct parent node of a child node is the parent node that, among all parent nodes of the child node, has the maximum correlation degree with the child node;
if the direct parent node of the current child node does not belong to the current node set, determining the current child node as a child node with subject drift;
and deleting the child nodes with subject drift from all the child nodes to obtain a child node set.
5. The method of claim 4, wherein the acquiring of a direct parent node of each child node in the global node set comprises:
acquiring all parent nodes of the current child node in the global node set;
calculating the correlation degree of the current child node and each father node of the current child node;
and determining the parent node with the maximum relevance as the direct parent node of the current child node.
6. The method of claim 5, wherein said current child node has M parent nodes, and wherein said calculating the relevance of said current child node to each parent node of said current child node comprises:
in the global node set, obtaining the ancestor nodes of the ith parent node of the current child node to obtain a first ancestor node set, wherein i = 1, 2, …, M, and M is a positive integer;
determining a second ancestor node set based on the ancestor nodes of each child node of the current child node;
and calculating the correlation degree of the current child node and the ith parent node according to the first ancestor node set and the second ancestor node set.
7. The method of claim 6, wherein determining a second set of ancestor nodes based on ancestor nodes of each of the current child nodes comprises:
obtaining ancestor nodes of all child nodes of the current child node to obtain a third ancestor node set;
and removing, from the third ancestor node set, the ancestor nodes included after the path passing through the ith parent node, to obtain a second ancestor node set.
8. The method of claim 7, said calculating a relevance of said current child node to said ith parent node from said first ancestor node set and said second ancestor node set, comprising:
determining an intersection of the first ancestor node set and the second ancestor node set to obtain a shared ancestor set;
determining a union of the first ancestor node set and the third ancestor node set to obtain an ancestor union;
respectively determining the first element number of the shared ancestor set and the second element number of the ancestor union set;
and determining the ratio of the first element number to the second element number as the correlation degree of the current child node and the ith parent node.
9. An information extraction device characterized by comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 7 when executing executable instructions stored in the memory.
10. A storage medium having stored thereon executable instructions for causing a processor to perform the method of any one of claims 1 to 7 when executed.
CN201911414416.7A 2019-12-31 2019-12-31 Information extraction method, device and storage medium Pending CN111159390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911414416.7A CN111159390A (en) 2019-12-31 2019-12-31 Information extraction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911414416.7A CN111159390A (en) 2019-12-31 2019-12-31 Information extraction method, device and storage medium

Publications (1)

Publication Number Publication Date
CN111159390A true CN111159390A (en) 2020-05-15

Family

ID=70560042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911414416.7A Pending CN111159390A (en) 2019-12-31 2019-12-31 Information extraction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111159390A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130159236A1 (en) * 2011-12-16 2013-06-20 Huawei Technologies Co., Ltd. Data processing method and system, and relevant apparartus
CN106776714A (en) * 2016-11-21 2017-05-31 辽宁工程技术大学 Search method, device and system
CN108512765A (en) * 2017-02-28 2018-09-07 中国科学院声学研究所 A kind of Web content method of diffusion based on network node distribution Pagerank
CN110033851A (en) * 2019-04-02 2019-07-19 腾讯科技(深圳)有限公司 Information recommendation method, device, storage medium and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination