WO2018113532A1 - Information extraction method and system - Google Patents

Info

Publication number
WO2018113532A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
attribute
rule
text
Prior art date
Application number
PCT/CN2017/115185
Other languages
English (en)
French (fr)
Inventor
李阳
张锋
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018113532A1
Priority to US16/385,163 (US11093520B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8373 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation

Definitions

  • The present application relates to the field of information extraction, and in particular to an information extraction method and system applicable to different texts.
  • Structured data can be organized into a matrix structure in which the nature of each data item and the position of its value are fixed, so that it can be located precisely; it is usually the data managed by a database.
  • Semi-structured data follows a regular title-and-body syntax, such as a specialized channel on a professional website.
  • Unstructured data refers to data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent with the two-dimensional logical tables of a database, including office documents in all formats, text, images, XML data, HTML data, various reports, and audio/video data.
  • Most web data exists as unstructured data that applications cannot understand or use.
  • The embodiments of the present application provide an information extraction method, system, and storage medium.
  • The rule base includes a plurality of rules for generating nodes; each rule indicates the role of the node it generates, and the role of a node is root node or non-root node.
  • The information extraction system of the embodiments of the present application may include at least one processor and a memory, wherein the memory stores computer-readable instructions, and the instructions may cause the at least one processor to:
  • The rule base includes a plurality of rules for generating nodes; each rule indicates the role of the node it generates, and the role of a node is root node or non-root node.
  • In the technical solution of the embodiments of the present application, the unstructured text is decomposed into words, the words are described with structured data, and the structured data is then combined using preset rules to obtain a root node describing the unstructured text; the structured data in the root node is taken as the extracted structured data.
  • The extraction logic is based on preset rules and does not require a large corpus or a trained extraction model, so the implementation cost is low.
  • FIG. 1a is a flowchart of an information extraction method according to various embodiments of the present application.
  • FIG. 1b is a flowchart of an information extraction method according to various embodiments of the present application.
  • FIG. 2 is a schematic diagram showing the internal structure of a server in each embodiment.
  • FIG. 3a is a schematic diagram of an application scenario of an information extraction method in each embodiment.
  • FIG. 3b is a flowchart of an information extraction method in each embodiment.
  • FIG. 5 is a flowchart of an information extraction method in each embodiment.
  • FIG. 14 is a schematic diagram of synthesizing a parent node from child nodes according to a node synthesis rule, and forming an information tree according to the correspondence between the child nodes and the parent node.
  • FIG. 16 is a schematic structural diagram of an information extraction system in each embodiment.
  • FIG. 17 is a schematic structural diagram of an information extraction system in each embodiment.
  • FIG. 21 is a schematic structural diagram of an information extraction system in each embodiment.
  • FIG. 22 is a schematic structural diagram of an information extraction system in each embodiment.
  • FIG. 23 is a schematic structural diagram of an information extraction system in each embodiment.
  • FIG. 1a is a flowchart of an information extraction method according to an embodiment of the present application.
  • The information extraction method can be performed by a computing device (such as a server or a PC), for example, by an information extraction application in the computing device.
  • The method can include the following steps.
  • Step S11: Acquire unstructured text.
  • Step S12: Parse the unstructured text (the text to be extracted) according to a preset node format, and generate a first node set composed of nodes describing the unstructured text.
  • Step S13: Obtain a preset rule base, where the rule base includes a plurality of rules for generating nodes, each rule indicates the role of the node it generates, and the role of a node is root node or non-root node.
  • Step S14: Synthesize the nodes in the first node set according to the rule base to generate a root node, and generate structured information according to the root node.
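  • The four steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the whitespace tokenizer, the dictionary-based rule layout, and the names `Node`, `RULES`, `parse`, `synthesize`, and `extract` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)


def parse(text):
    """S12: one node per whitespace-separated word (toy segmentation)."""
    return [Node("word", {"text": w}) for w in text.split()]


# S13: a toy rule base (hypothetical layout). Each rule lists the texts it
# accepts in each input slot, names its output node, and states the role
# of that output node.
RULES = [
    {"inputs": [{"Alice", "Bob"}, {"visited"}, {"Paris", "Rome"}],
     "output": "event", "role": "root",
     "attrs": ["person", "action", "place"]},
]


def synthesize(nodes, rules):
    """S14: slide each rule over adjacent nodes and build its output node."""
    for rule in rules:
        n = len(rule["inputs"])
        for i in range(len(nodes) - n + 1):
            window = nodes[i:i + n]
            if all(node.attrs["text"] in allowed
                   for node, allowed in zip(window, rule["inputs"])):
                attrs = {key: node.attrs["text"]
                         for key, node in zip(rule["attrs"], window)}
                attrs["role"] = rule["role"]
                return Node(rule["output"], attrs)
    return None


def extract(text):
    """S11-S14: return the root node's attributes, or None if no root forms."""
    root = synthesize(parse(text), RULES)
    if root is not None and root.attrs["role"] == "root":
        return root.attrs
    return None


print(extract("Alice visited Paris"))
```

  • A real rule base would match on attributes such as part of speech rather than on literal word lists; the literal sets here only keep the sketch short.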
  • FIG. 1b is a flowchart of an information extraction method according to an embodiment of the present application.
  • The information extraction method can be performed by a computing device (such as a server or a PC), for example, by an information extraction application in the computing device.
  • The method can include the following steps.
  • Step S110: Acquire unstructured text and a preset rule base.
  • Unstructured text can be obtained from the memory of the computing device or from a storage device on the network. Unstructured text can be obtained from files in various formats, such as office documents, XML files, and HTML files.
  • The rule base includes a plurality of rules for generating nodes (hereinafter also referred to as node synthesis rules). Each rule indicates the role of the node generated by the rule, and the role of a node is root node or non-root node.
  • A node is structured data that conforms to a preset format and can be used to describe a piece of text, such as a word, a phrase, or a sentence.
  • Rule bases can be transferred and stored as files.
  • A rule base can include one or more files.
  • The files included in the rule base can be obtained in various ways. For example, a file can be read into the rule base from a location pre-configured in the information extraction application (e.g., a URL or a storage path in the computing device).
  • The information extraction application can also provide a file input interface, receive a file through that interface, and add it to the rule base. The embodiments do not limit the source of the files in the rule base; these files may be obtained in any feasible way.
  • Step S121: Segment the unstructured text to obtain a word set containing a plurality of words.
  • Word segmentation divides a piece of text into individual words. Texts in different languages can use different word segmentation techniques.
  • The order of the words in the word set is consistent with their order in the unstructured text.
  • Step S122: Generate a first node set from the word set.
  • Each node in the first node set describes one word in the word set and includes the attribute names and attribute values of one or more attributes.
  • A preset processing method can be used to generate the node corresponding to a word.
  • The node conforms to the preset node format (i.e., the "node format of the node expressing the text information" described below).
  • A node can include a node name and one or more attributes.
  • The preset processing method may include extracting information from a word as the node name of a node or as the attribute value of a specified attribute.
  • For example, the part of speech of the word may be extracted as the attribute value of the attribute "part of speech" of the node corresponding to the word, and the character string of the word may be extracted as the node name of the corresponding node or as the attribute value of the attribute "text".
  • The embodiments do not limit the type of information extracted from words or the manner of extracting it.
  • The preset processing method may include a processing method pre-configured in the configuration file of the information extraction application, and may also include a custom processing method received as external input.
  • In some examples, the nodes corresponding to all words in the word set may be grouped into the first node set.
  • In other examples, the nodes corresponding to only some of the words in the word set may form the first node set.
  • For example, the words in the word set can be filtered to remove meaningless words, and the nodes corresponding to the remaining words form the first node set.
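  • Steps S121 and S122 might look like the sketch below. It assumes English text split on whitespace and a toy part-of-speech lexicon; the names `STOP_WORDS`, `POS_LEXICON`, `segment`, and `build_first_node_set` are hypothetical, and a production system would substitute a language-appropriate segmenter and tagger.

```python
# Hypothetical filter list and toy part-of-speech lexicon (assumptions).
STOP_WORDS = {"the", "a", "an", "of"}
POS_LEXICON = {"alice": "noun", "visited": "verb", "paris": "noun"}


def segment(text):
    """S121: split the text into words, preserving their original order."""
    return text.replace(".", " ").replace(",", " ").split()


def build_first_node_set(words):
    """S122: one node per word that survives filtering.

    Each node carries the word's string in the "text" attribute and its
    part of speech in the "pos" attribute.
    """
    nodes = []
    for word in words:
        if word.lower() in STOP_WORDS:
            continue  # drop words treated as meaningless
        nodes.append({
            "name": "word",
            "text": word,
            "pos": POS_LEXICON.get(word.lower(), "unknown"),
        })
    return nodes


words = segment("Alice visited the city of Paris.")
first_node_set = build_first_node_set(words)
```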
  • Step S141: Process the nodes in the first node set using the rules in the rule base to generate a second node set.
  • Each node in the second node set describes at least one node in the first node set.
  • In steps S12 and S13, the unstructured text has been transformed into a first node set describing the words in the unstructured text, where each node includes one or more pieces of information (i.e., attribute values extracted from a word) and describes only a simple corpus (i.e., a single word). That is, the task of steps S12 and S13 is to disassemble the text (i.e., the "analysis" described later) and to analyze and extract information for a single word.
  • The task of step S14 is to merge the nodes according to the information in each node, so that the merged nodes describe, with structured data, a more complex corpus (a phrase or sentence) that has grammatical structure.
  • One or more merged nodes can be obtained, and the merged nodes form the second node set.
  • The merging of nodes is based on the rules in the rule base.
  • Each rule can set the generation rule of a node according to the grammar rules of the language of the unstructured text. Grammar rules include the ways in which words are combined, the types of combination, and how the rules are expressed.
  • In some examples, the unstructured text may be divided into a plurality of subtexts, and the processing of steps S12 to S14 is then performed on each subtext one by one.
  • A subtext can be a clause, a sentence, a paragraph, and the like.
  • In other examples, a word set can be generated for the entire unstructured text.
  • A separator mark may be added to the word set corresponding to the entire unstructured text to mark the starting position of each subtext.
  • Separator nodes may also be added to the first node set; the nodes between two adjacent separator nodes correspond to one subtext.
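  • The separator-node variant can be illustrated as follows. The sketch assumes that a subtext is a sentence ending in ".", "!" or "?", and that a separator is an ordinary node whose name is "separator"; both choices, and the function names, are hypothetical.

```python
import re

SEPARATOR = {"name": "separator"}  # hypothetical marker node


def nodes_with_separators(text):
    """Build one node list for the whole text, marking each subtext's start."""
    nodes = []
    for sentence in re.split(r"[.!?]+", text):
        words = sentence.split()
        if not words:
            continue
        nodes.append(dict(SEPARATOR))
        nodes.extend({"name": "word", "text": w} for w in words)
    return nodes


def split_by_separator(nodes):
    """Recover the per-subtext node groups from the separator positions."""
    groups = []
    for node in nodes:
        if node["name"] == "separator":
            groups.append([])
        elif groups:
            groups[-1].append(node)
    return groups


all_nodes = nodes_with_separators("Alice left. Bob stayed home.")
groups = split_by_separator(all_nodes)
```

  • Each group can then be passed to the rule-matching stage independently, which keeps a rule from merging words across a sentence boundary.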
  • Step S142: Output the attribute name and attribute value of each attribute of the nodes having the root node role in the second node set as the structured information.
  • Each rule indicates the role of the node generated by the rule, for example, root node or non-root node.
  • A non-root node describes a piece of text with incomplete semantics, and a root node describes a piece of text with complete semantics, such as a clause, a single sentence, a complex sentence, or a paragraph.
  • Whether the node generated by a rule is a root node is determined by the grammar rules on which the rule is based. For example, when a node's attributes include a description of the person, time, place, and behavior, it can be used as a root node.
  • The role of a node can be represented by the attribute value of an attribute of the node. For example, if the attribute value of the node's attribute "role" is "root", the node is a root node.
  • One or more root nodes may be included in the second node set.
  • In some examples, the second node set may include one root node.
  • In other examples, the second node set may include multiple root nodes, each corresponding to a subtext in the unstructured text.
  • The root node includes multiple attributes; the attribute values of specified attributes may be extracted according to a preset extraction rule and output in the form of structured data.
  • The output structured data can be stored in a preset format, such as JavaScript Object Notation (JSON) data.
  • The output structured data can be stored in a preset storage device for subsequent queries.
  • The extracted structured data can be applied in various scenarios, such as data mining and knowledge graph construction.
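  • A minimal sketch of step S142, assuming nodes are plain dictionaries whose "role" attribute marks root nodes, as in the example above. The function name and the default list of attributes to keep are hypothetical.

```python
import json


def root_nodes_to_json(second_node_set,
                       wanted=("person", "time", "place", "action")):
    """Keep only root nodes and serialize their selected attributes."""
    records = []
    for node in second_node_set:
        if node.get("role") != "root":
            continue  # non-root nodes describe incomplete semantics
        records.append({key: node[key] for key in wanted if key in node})
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
    return json.dumps(records, ensure_ascii=False, sort_keys=True)


second_node_set = [
    {"name": "event", "role": "root",
     "person": "Alice", "action": "visited", "place": "Paris"},
    {"name": "phrase", "role": "non-root", "text": "the city"},
]
output = root_nodes_to_json(second_node_set)
```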
  • In the technical solution of the embodiments of the present application, the unstructured text is decomposed into words, the words are described with structured data, and the structured data is then combined using preset rules to obtain a root node describing the unstructured text; the structured data in the root node is taken as the extracted structured data.
  • The extraction logic is based on preset rules and needs neither a large corpus nor a trained extraction model, so the implementation cost is lower.
  • In some examples, a node corresponding to each word in the word set may be generated, where the node includes a first attribute whose attribute value is the character string of the word; the first node set is then formed from the nodes corresponding to the respective words in the word set.
  • In step S122, a second attribute may also be set in the node corresponding to each word, with the part of speech of the word as its attribute value.
  • Parts of speech can include nouns, verbs, prepositions, adverbs, adjectives, and so on.
  • A first word having a preset content type may also be identified in the word set.
  • The preset content type is selected from the group consisting of: a person's name, a place name, a date, a time, and a proper noun. The preset content type is then represented by the node name of the node corresponding to the first word or by the attribute value of a third attribute.
  • The text of the first word may also be converted into target text in a specified format corresponding to the preset content type, and a fourth attribute whose attribute value is the target text is added to the node corresponding to the first word.
  • For example, the text of a word recognized as a date can be converted into a preset date format: "June 23, 2008" is converted to "2008-6-23".
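  • The date conversion in this example can be sketched as below. The regular expression covers only the English "Month D, YYYY" form used in the example, and the attribute name `normalized` stands in for the fourth attribute; both are assumptions for illustration.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

DATE_RE = re.compile(r"^([A-Z][a-z]+) (\d{1,2}), (\d{4})$")


def normalize_date(text):
    """Convert 'June 23, 2008' to '2008-6-23'; return None if not a date."""
    match = DATE_RE.match(text)
    if match is None or match.group(1) not in MONTHS:
        return None
    month, day, year = match.groups()
    return f"{int(year)}-{MONTHS[month]}-{int(day)}"


def annotate_date(node):
    """Return a copy of the node, renamed and extended if it holds a date."""
    target = normalize_date(node["text"])
    if target is None:
        return node
    # The node name carries the content type; "normalized" holds the target text.
    return {**node, "name": "time", "normalized": target}


date_node = annotate_date({"name": "word", "text": "June 23, 2008"})
```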
  • Each rule in the rule base includes a description of one or more input nodes and the manner in which an output node is generated from the one or more input nodes.
  • At least one node may be selected from the first node set, and the rule base is searched for a rule whose input nodes match the at least one node.
  • The processing steps here can be repeated one or more times.
  • The first node set that has undergone the above processing is taken as the second node set.
  • In some examples, all node combinations including one or more nodes may be traversed; in other examples, only node combinations consisting of adjacent nodes may be selected. The method of selecting nodes is not limited here and can be designed as needed.
  • The description of one or more input nodes in a rule can include at least one of the following:
  • a condition that the attribute value of a specified attribute of one of the one or more input nodes is required to satisfy.
  • The manner in which the one or more input nodes are used to generate an output node in a rule may include at least one of the following:
  • using the attribute value of a specified attribute of one of the one or more input nodes as the attribute value of a specified attribute of the output node.
  • The manner of combining the attribute values of the specified attributes of at least two nodes to obtain a combined value may include: merging, according to the merge mode specified by the rule, the attribute values of the specified attributes of the at least two nodes into a string or an array of strings.
  • The merge mode may include one of the following:
  • the strings are merged into an array of strings, each of which is an element in the array.
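  • The two result shapes named above, a single string or an array of strings, can be illustrated as follows. The mode names `"concat"` and `"array"` and the `sep` parameter are hypothetical labels, not terms from the source.

```python
def merge_values(nodes, attr, mode, sep=""):
    """Combine one attribute across several input nodes.

    mode="concat" joins the values into one string;
    mode="array" keeps each value as a separate element of a list.
    """
    values = [node[attr] for node in nodes if attr in node]
    if mode == "concat":
        return sep.join(values)
    if mode == "array":
        return list(values)
    raise ValueError(f"unknown merge mode: {mode!r}")


inputs = [{"text": "New"}, {"text": "York"}]
as_string = merge_values(inputs, "text", "concat", sep=" ")
as_array = merge_values(inputs, "text", "array")
```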
  • The rule base may include priority information for each rule.
  • The rule base may be searched in descending order of rule priority.
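  • Priority-ordered, repeated rule application might be sketched like this. It assumes rules match the node names of adjacent nodes and that every rule consumes at least two nodes (so the loop always terminates); the rule layout and all names are hypothetical.

```python
def apply_rules(nodes, rules):
    """Repeatedly apply the highest-priority matching rule to adjacent nodes."""
    nodes = list(nodes)
    ordered = sorted(rules, key=lambda r: r["priority"], reverse=True)
    changed = True
    while changed:
        changed = False
        for rule in ordered:
            size = len(rule["inputs"])
            for i in range(len(nodes) - size + 1):
                window = nodes[i:i + size]
                if [n["name"] for n in window] == rule["inputs"]:
                    merged = {
                        "name": rule["output"],
                        "role": rule["role"],
                        "text": " ".join(n["text"] for n in window),
                    }
                    nodes[i:i + size] = [merged]  # replace inputs by output
                    changed = True
                    break
            if changed:
                break  # restart from the highest-priority rule
    return nodes


rules = [
    {"priority": 0, "inputs": ["noun", "verb"], "output": "clause", "role": "non-root"},
    {"priority": 2, "inputs": ["verb", "noun"], "output": "vp", "role": "non-root"},
    {"priority": 1, "inputs": ["noun", "vp"], "output": "event", "role": "root"},
]
first = [
    {"name": "noun", "text": "Alice"},
    {"name": "verb", "text": "visited"},
    {"name": "noun", "text": "Paris"},
]
second = apply_rules(first, rules)
```

  • Here the priority-2 rule fires before the priority-0 rule, so "visited Paris" is grouped first and the remaining noun then combines with it into a root node.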
  • Computing device 200 includes a processor and a storage medium.
  • The storage medium stores an information extraction system.
  • The information extraction system can be implemented by computer-readable instructions.
  • The information extraction system can perform the information extraction method of the embodiments of the present application and extract structured data from unstructured text.
  • Computing device 200 can include one or more physical devices, such as a distributed computing system or a server cluster.
  • FIG. 3a is a schematic diagram of an example of an application scenario of an information extraction method according to an embodiment of the present application.
  • The terminal 100 communicates with the server 200 through a network.
  • The terminal 100 can receive text input by the user (i.e., unstructured text) and send it to the server 200 through the network.
  • The server 200 extracts information from the text to form structured extracted information (i.e., structured data), thereby implementing standardized automatic management of the document.
  • The server 200 can also transmit the extraction result to the terminal 100 for display.
  • The terminal 100 can be a smartphone, a tablet, a personal digital assistant (PDA), or a personal computer.
  • Server 200 can be a standalone physical server or a cluster of physical servers.
  • Server 200 can include a processor, a storage medium, a memory, and a network interface that are linked by a system bus.
  • The storage medium of the server 200 stores an operating system, a database, and an information extraction system.
  • The database is used to store data such as the node formats used for information extraction and the node synthesis rules (i.e., the rules in the rule base).
  • The processor of the server 200 provides computing and control capabilities to support the operation of the entire server 200.
  • The memory of the server 200 provides an environment for the operation of the information extraction system in the storage medium.
  • The network interface of the server 200 is used to communicate with the external terminal 100 through a network connection, for example, to receive the text to be extracted sent by the terminal 100.
  • FIG. 3b is a flowchart of an information extraction method according to an embodiment of the present application. This method can be applied to the server shown in FIG. 2. The method can include the following steps.
  • Step 101: Acquire the text to be extracted.
  • The text to be extracted may be any text data composed of text, and may be semi-structured web data or unstructured text data (i.e., unstructured text).
  • Acquiring the text to be extracted includes obtaining text data displayed in a specified application, such as text data published by a specified website or by a specified information publishing platform.
  • Step 103: Define the node format of the node that expresses text information.
  • Defining the node format of the node that expresses text information means obtaining a predefined node format, or using the format of the output node defined by each rule in the rule base, as the node format of subsequently generated nodes.
  • A node is the basic unit that expresses text information. Each node has a unified node format, and text information is expressed through nodes of the same format. Because every node carries text information that is labeled in a uniform way, operation rules can conveniently be set on the text information to extract it.
  • Step 105: Parse the text to be extracted according to the node format to generate nodes that express the text information of the text to be extracted, and form a queue from the nodes.
  • The text to be extracted is parsed into nodes that have the preset format and carry text information.
  • The text to be extracted is usually parsed in units of sentences; each sentence is parsed into a plurality of nodes that express text information, and these nodes form a corresponding queue.
  • In this embodiment, the first node set is implemented as a queue.
  • The parsing step here may include the above steps S12 and S13.
  • Step 107: Acquire node synthesis rules for generating a parent node from child nodes.
  • A node synthesis rule processes nodes by an operation rule: the text information expressed by a plurality of nodes (i.e., input nodes) is synthesized according to the operation rule to form a new node (i.e., an output node). That is, a rule is a description of one or more input nodes together with the manner in which an output node is generated from them.
  • The plurality of nodes are the child nodes, and the new node they form is the parent node, which contains summary text information of the text information contained in the child nodes.
  • Each node synthesis rule contains a set of correspondences between a parent node and its child nodes. Obtaining node synthesis rules can be implemented by providing an extractor interface.
  • User-defined node synthesis rules are received through the extractor interface.
  • A class can be defined to implement the extractor interface.
  • The extractor interface can also take the text to be extracted as a parameter and generate the required extraction result according to the node synthesis rules.
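  • The interface described above could be modeled as an abstract base class with one method that supplies user-defined rules and one that takes the text to be extracted. The sketch below is only one possible shape: the class names `Extractor` and `KeywordExtractor`, the method names, and the keyword-based rule layout are all hypothetical.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical interface: supplies rules and runs extraction on text."""

    @abstractmethod
    def rules(self):
        """Return the user-defined node synthesis rules."""

    @abstractmethod
    def extract(self, text):
        """Take the text to be extracted and return the extraction result."""


class KeywordExtractor(Extractor):
    """Toy implementation: each rule maps one keyword to an output node name."""

    def __init__(self, keyword_to_name):
        self._keyword_to_name = dict(keyword_to_name)

    def rules(self):
        return [{"match": keyword, "output": name}
                for keyword, name in sorted(self._keyword_to_name.items())]

    def extract(self, text):
        words = text.split()
        return [{"name": rule["output"], "text": rule["match"]}
                for rule in self.rules() if rule["match"] in words]


extractor = KeywordExtractor({"Paris": "location", "Monday": "time"})
result = extractor.extract("Alice visited Paris")
```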
  • Step 109: Synthesize the nodes in the queue according to the node synthesis rules to generate a parent node, and form the extracted information according to the parent node.
  • The nodes in the queue are matched against the node synthesis rules in turn, and the matching nodes are synthesized according to the matched rule to generate a parent node.
  • Each parent node includes summary text information obtained by synthesizing the text information contained in its child nodes according to at least one node synthesis rule.
  • A parent node generated by one node synthesis rule can serve as a child node in another node synthesis rule, so by defining different node synthesis rules, information extraction can proceed step by step through the correspondence between parent and child nodes. In this way, the text to be extracted is processed to obtain the corresponding extracted information.
  • An information tree containing the text information of the text to be extracted and the extraction result may be formed, where the final extracted information is stored in the parent node at the top of the information tree, and that top parent node is the root node.
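  • The information tree can be pictured with a small data structure in which each parent keeps references to the children it was synthesized from. In this sketch the "summary" of a parent is simply its children's text joined by spaces, which is an assumption; `TreeNode`, `make_parent`, and `leaf_texts` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    name: str
    text: str
    children: list = field(default_factory=list)


def make_parent(name, children):
    """Build a parent whose text summarizes (here: joins) its children's text."""
    return TreeNode(name, " ".join(c.text for c in children), list(children))


def leaf_texts(node):
    """Walk down from a node and collect the text of its leaf descendants."""
    if not node.children:
        return [node.text]
    return [t for child in node.children for t in leaf_texts(child)]


words = [TreeNode("word", w) for w in ["Alice", "visited", "Paris"]]
action = make_parent("action", words[1:])         # parent of two word nodes
root = make_parent("event", [words[0], action])   # top of the tree: root node
```

  • Because `action` is both a parent (of two words) and a child (of `root`), the sketch also shows how the output of one rule feeds another.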
  • The information extraction method provided by the above embodiment obtains the node format of the node expressing text information and the node synthesis rules for generating a parent node from child nodes, and parses the text to be extracted into nodes that express text information in the predetermined node format.
  • The node synthesis rules can be customized according to the desired extraction result. Because a node synthesis rule expresses the correspondence between a plurality of child nodes and a parent node, the text information expressed by the child nodes can be synthesized according to the rule to obtain a parent node containing summary text information.
  • The final extracted information is obtained by passing the extraction upward step by step through the correspondence between parent and child nodes, so information extraction with this method is not limited by the structure of the data in the text to be extracted.
  • The node synthesis rules can be customized and supplemented to meet the requirements of individual, particularly complex texts.
  • The entire extraction logic is easy to understand and easy to extend in real time, and it does not require a large annotated corpus to train an extraction model, so the implementation cost is low.
  • FIG. 4 is a flowchart of an information extraction method according to an embodiment of the present application.
  • In step 103 of FIG. 3b, defining the node format of the node that expresses the text information may include:
  • Step 1031: Set a custom node.
  • Here and below, "set a ** node" means setting the node name or an attribute of the node to a specified value so that the node takes the form of a ** node.
  • In the node format of custom nodes, each custom node is identified by a first identification symbol.
  • The node content of each custom node includes a node name and text information expressed by the correspondence between a text information attribute (i.e., an attribute of the node) and a text information attribute value (i.e., the attribute value of that attribute).
  • The correspondence between a text information attribute and its text information attribute value is identified by a second identification symbol.
  • Each text information attribute value is identified by a third identification symbol.
  • Because each custom node is identified by the first identification symbol, different custom nodes can be distinguished by it.
  • The text information attached to each custom node is expressed by the correspondence between the text information attribute and the text information attribute value.
  • Because the correspondence between a text information attribute and its attribute value is identified by the second identification symbol, the different pieces of text information contained in the node content can be separated by it.
  • Because each text information attribute value is identified by the third identification symbol, the text information attribute and the text information attribute value can be distinguished by it.
  • The node content of each custom node may include text information expressed by multiple correspondences between text information attributes and attribute values, and different correspondences are usually separated by a preset symbol. In some embodiments, the preset symbol is a space, and the node name is an arbitrary string that does not contain a space.
  • The first identification symbol is angle brackets (<>), that is, each custom node is enclosed in angle brackets.
  • The third identification symbol is double quotation marks (""), that is, each text information attribute value is enclosed in double quotation marks.
  • The node name is A node.
  • When no value is given, the text information attribute value defaults to "true".
  • For example, the node whose node name is event is expressed as <event root>, where event is the node name, root is the text information attribute, and the text information attribute value is "true".
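  • A small parser makes the node format concrete. The angle brackets, the double-quoted values, the space separator, and the "true" default come from the description above; the use of "=" between an attribute and its value is an assumption made for this sketch, and `parse_node` is a hypothetical helper.

```python
import re

# attribute, optionally followed by ="value" (the "=" is an assumed symbol)
ATTR_RE = re.compile(r'([^\s="]+)(?:="([^"]*)")?')


def parse_node(source):
    """Parse '<name attr attr="value">' into (name, {attr: value})."""
    source = source.strip()
    if not (source.startswith("<") and source.endswith(">")):
        raise ValueError(f"not a node: {source!r}")
    name, _, rest = source[1:-1].strip().partition(" ")
    if not name:
        raise ValueError("node name is missing")
    attrs = {}
    for match in ATTR_RE.finditer(rest):
        value = match.group(2)
        # An attribute written without a value defaults to "true".
        attrs[match.group(1)] = "true" if value is None else value
    return name, attrs


event = parse_node("<event root>")
date = parse_node('<time text="June 23, 2008" value="2008-6-23">')
```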
  • This expression format for text information is close to ordinary ways of thinking and is easy to understand.
  • Parsing text information into nodes in the set node format, with the text information expressed through text information attributes and attribute values, makes it possible to introduce part-of-speech information, which subsequent rules can rely on when extracting text information.
  • FIG. 5 is a flowchart of an information extraction method according to an embodiment of the present application. As shown in FIG. 5, in step 1031 of FIG. 4, the step of setting a custom node may include the following steps.
  • Step 1032: Set a node that expresses text information related to time, address, or person as a built-in node.
  • Step 1033: Set a node that expresses text information related to the event type as a message node.
  • Custom nodes include built-in nodes and message nodes.
  • A built-in node is a node that includes common text information such as time, address, person, proper noun, and the like.
  • Nodes that express text information related to time, address, and person may each be set as a separate built-in node.
  • Set the node with time-related text information as a time built-in node, such as <time>, where time is the node name of the time built-in node.
  • Set the node with address-related text information as an address built-in node, such as <location>, where location is the node name of the address built-in node.
  • Set the node with person-related text information as a person built-in node, such as <people>, where people is the node name of the person built-in node.
  • The corresponding text information is parsed to generate time built-in nodes, address built-in nodes, and person built-in nodes.
  • The message node is a node that includes text information of the event type.
  • The node name of a message node is a message, such as <word>, where word is the node name of the message node. The message node is the initial node parsed from the text to be extracted, and is generated by parsing text information that describes the event type.
  • The information extraction process for the text to be extracted can be expressed as nodes forming a message tree, which is composed of mapping relationships between child nodes and parent nodes.
  • A node can be a child node and a parent node at the same time, at different levels of the tree.
  • The parent node at the top of the message tree does not act as a child node of any node and is the root node.
  • The child node at the bottom of the message tree does not act as the parent node of any node and is a leaf node.
  • The message node is the leaf node.
  • The text information contained in the text to be extracted can be parsed into a queue of nodes that express the text information, so that the attached text information can be extracted by operating on the nodes according to preset grammar rules.
  • FIG. 6 is a flowchart of an information extraction method according to an embodiment of the present application.
  • the step of defining a node format of a node expressing the text information may include:
  • Step 1034: Set the types of the text information attribute and the text information attribute value.
  • The text information attributes include an original string, a regularized string, and a part-of-speech tag. The attribute value corresponding to the original string is the original text.
  • The attribute value corresponding to the regularized string is the text obtained by converting the original text into a preset format, and the attribute value corresponding to the part-of-speech tag is a preset character that identifies the part of speech of the original text.
  • The text information attributes of nodes carrying different text information are predefined, and the types of the text information attribute and the text information attribute value are set.
  • The text information attributes mainly include the original string, the regularized string, and the part of speech.
  • For the regularized string, for example, the corresponding attribute value is the converted text "2008-06-23" of the original text "June 23, 2008" in the text to be extracted.
  • For the part-of-speech tag, for example, the corresponding attribute value is the preset character cc, indicating that the part of speech of the text information attached to the node is cc.
  • The preset characters are chosen mainly to be easy to remember and to distinguish parts of speech; the number of characters and the naming rules can be set arbitrarily.
  • While the text is parsed into nodes, the text information attached to each node can be identified by its corresponding attributes and used in node synthesis.
  • The operating conditions for node synthesis are defined through unified text information attributes.
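The three attributes above (original string, regularized string, part-of-speech tag) can be sketched for the date example given in the text. This is an illustration only: the attribute names `str`, `norm`, and `pos`, the tag `"t"`, and the function name are assumptions, and the conversion only handles English dates written like "June 23, 2008".

```python
from datetime import datetime


def make_time_attrs(original: str) -> dict:
    """Build the attributes of a time node from its original text.

    ASSUMPTION: the preset format is ISO-style YYYY-MM-DD, as in the
    "2008-06-23" example; attribute names and the tag "t" are made up.
    """
    parsed = datetime.strptime(original, "%B %d, %Y")
    return {
        "str": original,                      # original string
        "norm": parsed.strftime("%Y-%m-%d"),  # regularized string
        "pos": "t",                           # part-of-speech tag (hypothetical)
    }
```

Keeping the original and the regularized string side by side lets later rules match on a uniform format while the source wording stays available for output.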
  • FIG. 7 is a flowchart of an information extraction method according to an embodiment of the present application. As shown in FIG. 7, in step 103 of FIG. 3b, the step of defining a node format of a node expressing the text information may include:
  • Step 1035: Set a text information attribute; the text information attribute includes a nullable attribute.
  • The attribute value corresponding to the nullable attribute is usually "true", and it is usually not written out but expressed by default.
  • A node whose text information attribute is the nullable attribute can be empty, that is, it is a nullable node.
  • The nullable attribute is represented by orEmpty, such as <and orEmpty>, where the node whose node name is and is a nullable node.
  • a node with a nullable attribute can be applied in a node synthesis rule to represent an input node. By setting an input node as a nullable node, the node synthesis rule expresses that the text information attached to the input node can be omitted, that is, the input node may not exist.
  • FIG. 8 is a flowchart of an information extraction method according to an embodiment of the present application. As shown in FIG. 8, in step 103 of FIG. 3b, the step of defining a node format of a node expressing the text information may include:
  • Step 1036: Set the types of the text information attribute and the text information attribute value. The text information attribute includes a filtering attribute, and the text information attribute value corresponding to the filtering attribute is a filtering condition.
  • an attribute of an input node may include a filter attribute.
  • the text information attribute value corresponding to the filter attribute is the content contained in the specific filter condition.
  • the filtering relationship expressed by the filtering attribute and its corresponding text information attribute value includes equal or unequal, and the node whose text information attribute is the filtering attribute is a filtering node.
  • The attribute name of the filter attribute is represented by $pos, for example:
  • <B $pos="nr">
  • <C $pos!="adj">
  • The node named B and the node named C are filter nodes.
  • The text information attribute value of the part-of-speech tag of the node <B> must be nr.
  • The text information attribute value of the part-of-speech tag of the node <C> cannot be adj.
  • the method can be used to set the filtering attribute of the input node in the node synthesis rule, and express the condition that the text information attached to the input node needs to be satisfied, for example, must be the same as or different from the specified value in the filtering condition, so as to achieve different condition matching.
  • a node can define multiple filtering attributes, and the relationship between multiple filtering conditions can be a relationship of "and” or "or".
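The equal/not-equal filtering and the "and"/"or" combination described above can be sketched as follows. The class `Filter`, the function `matches`, and the `mode` parameter are hypothetical names introduced for illustration; how the patent's rule syntax chooses between "and" and "or" is not shown in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """One filter condition on a node attribute."""
    attr: str             # attribute to test, e.g. "pos"
    value: str            # value named in the filter condition
    negate: bool = False  # True means "must not equal"


def matches(node_attrs: dict, filters: list, mode: str = "and") -> bool:
    """Check a node's attributes against one or more filter conditions."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    if not filters:
        return True  # no condition to satisfy
    results = [
        (node_attrs.get(f.attr) == f.value) != f.negate
        for f in filters
    ]
    return all(results) if mode == "and" else any(results)
```

For example, a node tagged `nr` passes `Filter("pos", "nr")`, and a node tagged `adj` fails `Filter("pos", "adj", negate=True)`.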
  • In step 103, the step of defining a node format of a node expressing the text information may include:
  • The attribute value corresponding to the root node attribute is usually "true", and it is usually not written out but expressed by default. A node whose text information attribute is the root node attribute is a root node.
  • The root node attribute is represented by root, such as <marry root>, where the node whose name is marry is the root node.
  • In a node synthesis rule, marking a node as the root node expresses that the text information attached to that node is the final extracted information.
  • In step 103, the step of defining a node format of a node expressing the text information may include:
  • the value of the text information attribute corresponding to the priority attribute is usually a numeric value.
  • the priority of the node synthesis rule is expressed by the priority attribute and its corresponding text information attribute value.
  • the priority may be sequentially decreased from 1 to 10.
  • In step 1031, the step of setting a custom node may include:
  • The text to be extracted is usually parsed in units of sentences; each sentence is parsed into a form in which text information is expressed by a plurality of nodes, which together constitute a queue.
  • The start node corresponds to the head of the node queue formed by one sentence.
  • The end node corresponds to the tail of the node queue formed by one sentence.
  • The text to be extracted may be parsed by sentence or by paragraph to generate a node queue, with the start and end nodes dividing the paragraph.
  • FIG. 9 is a flowchart of an information extraction method according to an embodiment of the present application.
  • the step of defining a node format of a node expressing the text information includes:
  • Step 1039: Set a text node, whose node format expresses the text information by directly displaying the original text.
  • FIG. 10 is a flowchart of an information extraction method according to an embodiment of the present application.
  • the node format of the node includes expressing the text information by a correspondence between the text information attribute and the text information attribute value.
  • the step of acquiring a node synthesis rule for generating a parent node by using a child node may include:
  • Step 1071: Obtain a copy synthesis rule that copies a text information attribute value of the specified child node as a text information attribute value of the parent node.
  • The parent node and the child nodes included in each node synthesis rule are distinguished by a preset fourth identification symbol.
  • The parent node is located on the left side of the fourth identification symbol, and the child nodes are located on its right side.
  • The fourth identification symbol thus divides the node synthesis rule into a left part and a right part.
  • the copy synthesis rule refers to copying the text information attribute value of the specified child node as the text information attribute value of the parent node to complete the extraction of the text information attached to the child node to form a parent node.
  • the text information attribute value of the parent node in the copy synthesis rule is represented by a preset fifth identifier.
  • FIG. 11 is a flowchart of an information extraction method according to an embodiment of the present application.
  • the node format of the node includes expressing the text information by a correspondence between the text information attribute and the text information attribute value.
  • the step of acquiring a node synthesis rule for generating a parent node by using a child node may include:
  • Step 1072: Obtain a merge synthesis rule that combines text information attribute values of multiple child nodes to generate a text information attribute value of the parent node.
  • The merge synthesis rule refers to merging the text information attribute values of the specified plurality of child nodes as the text information attribute value of the parent node, so as to complete the extraction of the text information attached to the child nodes to form the parent node.
  • The text information attribute value of the parent node in the merge synthesis rule is represented by a preset sixth identification symbol.
  • The sixth identification symbol is $join followed by an index list. The index list includes multiple numbers separated by a preset identification symbol, and indicates that the text information attribute values of the child nodes corresponding to those numbers are merged.
  • For example, the sixth identification symbol $join 1,3 indicates that the text information attribute values of the first and third child nodes, that is, the child nodes <B> and <C>, are merged as the text information attribute value of the attribute attr1 of the node <A>.
  • If the index list contains no number, that is, the child nodes to be merged are not specified, all child nodes are merged by default. A specified identification symbol may also be used to indicate that all child nodes are to be merged; in one embodiment, the specified identification symbol is an underscore (_).
  • The preset identification symbol that separates the numbers in the index list also serves as the separator placed between the merged child attribute values in the text information attribute value of the parent node.
  • The index list may also use a specified identification symbol to represent the text information attribute values of the child nodes to be merged.
  • the text information attribute value of the parent node is not included in the index.
  • the specified identifier is ⁇ empty.
  • FIG. 12 is a flowchart of an information extraction method according to an embodiment of the present application.
  • the node format of the node includes expressing the text information by a correspondence between the text information attribute and the text information attribute value.
  • the step of acquiring a node synthesis rule for generating a parent node by using a child node may include:
  • Step 1073: Obtain a collection synthesis rule that selects the text information attribute values of a specified text information attribute of all child nodes to generate a text information attribute value of the parent node.
  • All child nodes include the child nodes of the parent node generated in the node synthesis rule and the child nodes of those child nodes.
  • The collection synthesis rule refers to collecting the text information attribute values of the specified text information attribute of all the child nodes as the text information attribute value of the parent node, so as to complete the extraction of the text information attached to the child nodes to form the parent node.
  • The text information attribute of the parent node in the collection synthesis rule is represented by a preset seventh identification symbol, and the text information attribute value of that attribute is a text information attribute of the child nodes.
  • The mapping table includes the mapping relationship between the collected text information attributes of all the child nodes and the corresponding text information attribute values.
  • The seventh identification symbol is collect.
  • For example, the parent node <B> is generated from the child node <C> and the child node <D>, and the rule for the node <A> indicates that the parent node <A> is generated, through the collection synthesis rule, from the text information attribute values that all of its child nodes have for the text information attribute role.
  • The result of collecting for the parent node <A> is:
  • The copy synthesis rule, the merge synthesis rule, and the collection synthesis rule each define a node synthesis rule by defining how the attribute values of the parent node are obtained. With node synthesis rules that generate a parent node from child nodes, the parent node generates its own new information from the information of its child nodes. Because a parent node can in turn serve as a child node in the synthesis rules of other nodes, the text information attached to child nodes is passed upward and finally gathered into the topmost parent node. In this way, each sentence of the text to be extracted is parsed into an information tree whose topmost parent node is the root node, and the final extracted information is formed in the root node.
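The three ways of producing a parent attribute value (copy, merge, collect) can be sketched as plain functions. All names here (`Node`, `copy_attr`, `join_attr`, `collect_attr`) are hypothetical, child indices are assumed to be 1-based as in the `$join 1,3` example, and the collection result is simplified to a list of values rather than the mapping table mentioned in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def copy_attr(children: list, index: int, attr: str) -> str:
    """Copy rule: take one attribute value from the child at a 1-based index."""
    return children[index - 1].attrs[attr]


def join_attr(children: list, attr: str, indices=None, sep: str = ",") -> str:
    """Merge rule: join the attribute values of the chosen children.

    With no indices, all children are merged. Children that lack the
    attribute are skipped (an assumption of this sketch).
    """
    picked = children if not indices else [children[i - 1] for i in indices]
    return sep.join(c.attrs[attr] for c in picked if attr in c.attrs)


def collect_attr(children: list, attr: str) -> list:
    """Collect rule: gather an attribute from children and their descendants."""
    values = []
    for child in children:
        if attr in child.attrs:
            values.append(child.attrs[attr])
        values.extend(collect_attr(child.children, attr))
    return values
```

Because `collect_attr` recurses into `children`, a value attached several levels down still reaches the topmost parent, which is the upward passing of information described above.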
  • FIG. 13 is a flowchart of an information extraction method according to an embodiment of the present application.
  • the nodes in the queue are synthesized according to the node synthesis rule to generate a parent node, and the step of forming the extraction information according to the parent node may include the following steps.
  • Step 1091: Determine whether the queue is empty.
  • Step 1093: When the queue is not empty, store the node at the head of the queue in the database to form a node to be extracted.
  • Step 1095: Match the nodes to be extracted in the database against the node synthesis rules.
  • When the nodes to be extracted match a node synthesis rule, synthesize them according to that rule to generate a parent node, and return to the step of determining whether the queue is empty.
  • Step 1097: When the queue is empty, form the extraction information according to the parent node in the database.
  • The database is a stack.
  • Before step 1091 of determining whether the queue is empty, the method further includes:
  • Step 1090: Initialize the stack.
  • The specific matching method is: compare the item at the top of the stack with the last item of the node synthesis rule, checking that both the node name and the filter condition match. If they match, or if the current rule node is a nullable node, continue with the previous item of the stack. If all the right-hand nodes of the node synthesis rule are matched, the match succeeds.
  • On a successful match, the corresponding items in the stack are synthesized into a new node: the items are removed from the stack and the new node is pushed onto it.
  • The name of the new node is defined by the left-hand node of the node synthesis rule, and its text information attribute values are generated according to the information transfer rule defined by the node synthesis rule. When the node queue is empty, the text information attributes and attribute values contained in the root node on the stack are extracted.
  • These form the extraction result, where the root node is the topmost parent node of the information tree.
  • The node synthesis rules thus realize information extraction from the text to be extracted, and the extraction information is formed from the text information attached to the parent node generated by a node synthesis rule.
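The queue-and-stack procedure above resembles a shift-reduce loop, and can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Node`, `Pattern`, `Rule`, `match_rule`, and `extract` are hypothetical, filter conditions and priorities are left out, the parent simply joins its children's `str` values, and repeating the match until no rule applies is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


@dataclass
class Pattern:
    name: str
    nullable: bool = False  # corresponds to the orEmpty attribute


@dataclass
class Rule:
    parent: str         # left-hand node name
    pattern: list       # right-hand items, in sentence order
    is_root: bool = False


def match_rule(stack: list, rule: Rule):
    """Match rule items right-to-left against the top of the stack.

    Returns the number of stack items consumed, or None on failure.
    """
    pos, taken = len(stack) - 1, 0
    for item in reversed(rule.pattern):
        if pos >= 0 and stack[pos].name == item.name:
            pos -= 1
            taken += 1
        elif not item.nullable:
            return None  # a required item is missing
    return taken or None


def extract(nodes: list, rules: list) -> list:
    """Shift nodes from the queue onto the stack, reducing when a rule matches."""
    queue, stack = deque(nodes), []
    while queue:                          # step 1091: is the queue empty?
        stack.append(queue.popleft())     # step 1093: move the head to the stack
        reduced = True
        while reduced:                    # step 1095: try to synthesize a parent
            reduced = False
            for rule in rules:
                count = match_rule(stack, rule)
                if count:
                    children = stack[-count:]
                    del stack[-count:]
                    attrs = {"str": " ".join(c.attrs.get("str", "") for c in children)}
                    if rule.is_root:
                        attrs["root"] = "true"
                    stack.append(Node(rule.parent, attrs, children))
                    reduced = True
                    break
    # step 1097: the root nodes left on the stack carry the extraction result
    return [n for n in stack if n.attrs.get("root") == "true"]
```

A rule whose right-hand side consists only of its own parent name would loop forever in this sketch, so rule sets used with it should always make progress when they reduce.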
  • Defining a node synthesis rule according to the to-be-extracted text includes:
  • For the text to be extracted, the nodes in the queue are synthesized according to the node synthesis rules to generate parent nodes, and the extraction information is formed from the parent nodes.
  • This process may be represented by a tree structure, in which the correspondence between child nodes and parent nodes in every two adjacent layers matches a corresponding node synthesis rule.
  • the specific matching process is shown in the following table (where the node in the queue queue omits the text information attribute representation, and the node in the stack stack omits the child node representation).
  • Node synthesis rules are defined on the principle of passing text information upward, so the path by which information is extracted from nodes is clear.
  • The syntax of the node format and of the node synthesis rules is close to how people ordinarily think, so custom rules are easy to write and understand. Extending the rule set during information extraction only requires adding new rules, without modifying earlier ones. The coupling between node synthesis rules is low, and an extractor interface is supported for obtaining custom node synthesis rules, so the system is easy to extend. Node synthesis rules can reference one another through tags, and general-purpose node synthesis rules can be extracted into separate files as needed.
  • The text information attached to nodes in node synthesis rules is expressed as correspondences between text information attributes and text information attribute values, so part-of-speech information can be used, and custom text information attributes, together with node synthesis rules that operate on nodes based on those attributes, are supported.
  • Node synthesis rules support recursive definition, which increases the generalization ability of the information extraction method. Information extraction does not require a large training corpus, so the cost is low.
  • FIG. 15 is a schematic diagram of an information extraction system according to an embodiment of the present application.
  • the system may include an obtaining module 11, a node module 13, a parsing module 15, a rules module 17, and an extracting module 19, wherein the obtaining module 11 is configured to acquire text to be extracted.
  • the node module 13 is used to define a node format of a node that expresses text information.
  • the parsing module 15 is configured to parse the text to be extracted according to the node format to generate a node that expresses the text information of the text to be extracted, and form a queue through the node.
  • the rule module 17 is configured to acquire a node synthesis rule for generating a parent node by using a child node.
  • the extracting module 19 is configured to synthesize a node in the queue according to a node synthesis rule to generate a parent node, and form extraction information according to the parent node.
  • FIG. 16 is an information extraction system of an embodiment, wherein the node module 13 includes a custom node unit 131.
  • the custom node unit 131 is configured to set a custom node, and the node format of the custom node is identified by the first identifier symbol for each custom node.
  • the node content of each custom node includes a node name and text information expressed by a correspondence relationship between the text information attribute and the text information attribute value. The correspondence between the text information attribute and the corresponding text information attribute value is identified by the second identification symbol, and each text information attribute value is identified by the third identification symbol.
  • FIG. 17 is an information extraction system of an embodiment, and the custom node unit 131 includes a built-in node unit 132 and a message node unit 133.
  • the built-in node unit 132 is configured to set a node that expresses time, address, and text information related to a person as a built-in node.
  • the message node unit 133 is configured to set a node that expresses text information related to the event type as a message node.
  • FIG. 18 is an information extraction system provided by an embodiment, and the node module 13 includes an attribute unit 134.
  • the attribute unit 134 is configured to set a type of the text information attribute and the text information attribute value, and the text information attribute includes an original character string, a regularized character string, and a part-of-speech tag.
  • The text information attribute value corresponding to the original string is the original text.
  • The text information attribute value corresponding to the regularized string is the text obtained by converting the original text into a preset format.
  • The text information attribute value corresponding to the part-of-speech tag is a preset character used to identify the part of speech of the original text.
  • node module 13 includes an attribute unit 134.
  • the attribute unit 134 is used to set a text information attribute, and the text information attribute includes a nullable attribute.
  • node module 13 includes an attribute unit 134.
  • the attribute unit is used to set the type of the text information attribute and the text information attribute value, the text information attribute includes a filtering attribute, and the text information attribute value corresponding to the filtering attribute is a filtering condition.
  • FIG. 19 is an information extraction system provided by an embodiment, and the node module 13 includes a text node unit 135.
  • the text node unit 135 is used to set a text node, and the node format of the text node is to directly display the original text expression text information.
  • FIG. 20 is an information extraction system provided by an embodiment, and the rule module 17 includes a copy synthesis rule unit 171.
  • the node format of the node includes expressing the text information by a correspondence between the text information attribute and the text information attribute value.
  • the copy synthesis rule unit is configured to obtain a copy synthesis rule that copies the text information attribute value of the specified child node as the text information attribute value of the parent node.
  • FIG. 21 is an information extraction system provided by an embodiment, and the rule module 17 includes a merge synthesis rule unit 172.
  • the node format of the node includes expressing the text information by a correspondence between the text information attribute and the text information attribute value.
  • the merge synthesis rule unit is configured to obtain a merge synthesis rule that combines text information attribute values of multiple child nodes to generate a text information attribute value of the parent node.
  • FIG. 22 is an information extraction system provided by the embodiment, and the rule module 17 includes a collection synthesis rule unit 173.
  • the node format of the node includes expressing the text information by a correspondence between the text information attribute and the text information attribute value.
  • the collection and synthesis rule unit is configured to obtain a collection and synthesis rule for selecting a text information attribute value of a specified text information attribute of all child nodes to generate a text information attribute value of the parent node.
  • FIG. 23 is an information extraction system provided by the embodiment, and the extraction module 19 includes a determination unit 191, a storage unit 193, a matching unit 195, and an extraction unit 197.
  • the determining unit 191 is configured to determine whether the queue is empty.
  • the storage unit 193 is configured to store the node of the queue header into the database to form a node to be extracted when the queue is not empty.
  • the matching unit 195 is configured to match the node to be extracted in the database with the node synthesis rule. When the node to be extracted matches the node synthesis rule, the node to be extracted is synthesized according to the node synthesis rule to generate a parent node, and the process returns to the step of determining whether the queue is empty.
  • the extracting unit 197 is configured to form extraction information according to the parent node in the database when the queue is empty.
  • Node synthesis rules are defined on the principle of passing text information upward, so their definitions are clear and easy to understand. Extending the rule set during information extraction only requires adding new rules, without modifying earlier ones. The coupling between node synthesis rules is low, and an extractor interface is supported for obtaining custom node synthesis rules, so the system is easy to extend. Node synthesis rules can reference one another through tags.
  • General-purpose node synthesis rules can be extracted into separate files as needed, which makes them easy to manage and reuse. The text information attached to nodes in node synthesis rules is expressed as correspondences between text information attributes and text information attribute values.
  • This form of expression can use part-of-speech information and supports custom text information attributes and node synthesis rules based on them. Node synthesis rules support recursive definition, which increases the generalization ability of the information extraction method. Information extraction is achieved without a large training corpus, so the cost is low.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

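An information extraction method includes: obtaining unstructured text (S11); parsing the text to be extracted according to a preset node format to generate a first node set consisting of nodes that describe the unstructured text (S12); obtaining a preset rule base, the rule base including multiple rules for generating nodes, each rule specifying the role of the node it generates, the role being root node or non-root node (S13); and synthesizing the nodes in the first node set according to the rule base to generate a root node, and generating structured information according to the root node (S14).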

Description

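Information extraction method and system
Related documents
This application claims priority to Chinese Patent Application No. 201611200449.8, entitled "Information extraction method and system" and filed with the Chinese Patent Office on December 22, 2016, which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of information extraction, and in particular to an information extraction method and system applicable to different texts.
Background
With the rapid development of Internet technology, the Web has grown into a huge, distributed, and shared information resource. The massive information on the Web can be roughly divided into three kinds: structured, semi-structured, and unstructured. Structured data can be organized into rows and columns, where the positions at which the nature and magnitude of the data appear are fixed, so the data can be located precisely; it is usually data managed by a database. Semi-structured data has a regular syntax for titles and body text, for example the sub-channels of specialized websites. Unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent with two-dimensional logical database tables; it includes office documents of all formats, text, pictures, XML data, HTML data, various reports, images, and audio/video data. Most Web data exists as unstructured data, which applications cannot understand or use.
Summary
To make massive unstructured Web data usable, embodiments of this application provide an information extraction method, system, and storage medium.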
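The information extraction method of the embodiments of this application may include:
obtaining unstructured text;
parsing the text to be extracted according to a preset node format to generate a first node set consisting of nodes that describe the unstructured text;
obtaining a preset rule base, where the rule base includes multiple rules for generating nodes, each rule specifies the role of the node that the rule generates, and the role of a node is root node or non-root node; and
synthesizing the nodes in the first node set according to the rule base to generate a root node, and generating structured information according to the root node.
The information extraction system of the embodiments of this application may include at least one processor and a memory, the memory storing computer-readable instructions that can cause the at least one processor to:
obtain unstructured text;
parse the text to be extracted according to a preset node format to generate a first node set consisting of nodes that describe the unstructured text;
obtain a preset rule base, where the rule base includes multiple rules for generating nodes, each rule specifies the role of the node that the rule generates, and the role of a node is root node or non-root node; and
synthesize the nodes in the first node set according to the rule base to generate a root node, and generate structured information according to the root node.
The computer-readable storage medium of the embodiments of this application may include computer-readable instructions that can cause at least one processor to:
obtain unstructured text;
parse the text to be extracted according to a preset node format to generate a first node set consisting of nodes that describe the unstructured text;
obtain a preset rule base, where the rule base includes multiple rules for generating nodes, each rule specifies the role of the node that the rule generates, and the role of a node is root node or non-root node; and
synthesize the nodes in the first node set according to the rule base to generate a root node, and generate structured information according to the root node.
In the technical solution of the embodiments of this application, unstructured text is decomposed into words, structured data is used to describe these words, and the structured data is then merged using preset rules to obtain a root node that describes the unstructured text; the structured data in the root node is taken as the extracted structured data. Because the extraction logic is based on preset rules, there is no need for a large annotated corpus or for training an extraction model, and the implementation cost is low.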
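Brief description of the drawings
FIG. 1a is a flowchart of an information extraction method according to embodiments of this application;
FIG. 1b is a flowchart of an information extraction method according to embodiments of this application;
FIG. 2 is a schematic diagram of the internal structure of a server in the embodiments;
FIG. 3a is a schematic diagram of an application scenario of the information extraction method in the embodiments;
FIG. 3b is a flowchart of the information extraction method in the embodiments;
FIG. 4 is a flowchart of the information extraction method in the embodiments;
FIG. 5 is a flowchart of the information extraction method in the embodiments;
FIG. 6 is a flowchart of the information extraction method in the embodiments;
FIG. 7 is a flowchart of the information extraction method in the embodiments;
FIG. 8 is a flowchart of the information extraction method in the embodiments;
FIG. 9 is a flowchart of the information extraction method in the embodiments;
FIG. 10 is a flowchart of the information extraction method in the embodiments;
FIG. 11 is a flowchart of the information extraction method in the embodiments;
FIG. 12 is a flowchart of the information extraction method in the embodiments;
FIG. 13 is a flowchart of the information extraction method in the embodiments;
FIG. 14 is a schematic diagram showing, in the information extraction method, parent nodes synthesized from child nodes according to node synthesis rules, and an information tree formed from the correspondence between child nodes and parent nodes;
FIG. 15 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 16 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 17 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 18 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 19 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 20 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 21 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 22 is a schematic structural diagram of the information extraction system in the embodiments;
FIG. 23 is a schematic structural diagram of the information extraction system in the embodiments.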
实施本发明的方式
为了使本发明的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本发明进行进一步详细说明。应当理解,此处所描述的具体实施例仅用以解释本发明,并不用于限定本发明。
本申请各实施例的信息抽取方案用于从非结构化数据中抽取出结构化数据,使得海量的非结构化数据中的信息能够被计算机理解并处理。图1a为本申请实施例的一种信息抽取方法的流程图。该信息抽取方法可以由计算设备(如服务器、PC等)执行,例如可以由计算设备中的信息抽取应用执行。如图1a所示,该方法可以包括以下步骤。各实施例的 信息抽取方法可以包括以下步骤。
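Step S11: obtain unstructured text;
Step S12: parse the unstructured text according to a preset node format to generate a first node set composed of nodes that describe the unstructured text;
Step S13: obtain a preset rule base, the rule base including a plurality of rules for generating nodes, each rule indicating the role of the node it generates, the role of a node being root node or non-root node;
Step S14: synthesize the nodes in the first node set according to the rule base to generate a root node, and generate structured information according to the root node.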
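Fig. 1b is a flowchart of an information extraction method according to an embodiment of this application. The method may be performed by a computing device (such as a server or a PC), for example by an information extraction application on the computing device. As shown in Fig. 1b, the method may include the following steps.
Step S110: obtain unstructured text and a preset rule base.
The unstructured text may be obtained from the memory of the computing device or from a storage device on a network. It may come from files of various formats, such as office documents, XML files, and HTML files.
The rule base includes a plurality of rules for generating nodes (also called node synthesis rules below). Each rule indicates the role of the node it generates, which is either root node or non-root node. In this document, a node is structured data conforming to a preset format that can describe a piece of text such as a character, word, phrase, or sentence. The rule base may be transferred and stored as files, and may include one or more files. The files may be obtained in various ways. For example, a file may be read from a location preconfigured in the information extraction application (such as a URL or a storage path on the computing device) and added to the rule base. As another example, the application may provide a file input interface through which files are received and added to the rule base. The embodiments do not restrict the source of the files in the rule base; they may be obtained in any feasible way.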
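Step S121: segment the unstructured text into words to obtain a word set including a plurality of words.
Word segmentation splits a piece of text into individual words. Texts in different languages may use different segmentation techniques.
In some examples, the words in the word set are arranged in the same order as in the unstructured text.
Step S122: generate a first node set from the word set.
Each node in the first node set describes one word in the word set and includes the attribute names and attribute values of one or more attributes.
In some examples, the node corresponding to a word may be generated with a preset processing method. The node conforms to a preset node format (the "node format of nodes expressing text information" described later). For example, a node may include a node name and one or more attributes. The preset processing method may include extracting information from the word to serve as the node name or as the value of a specified attribute. For example, the part of speech of a word may be extracted as the value of the attribute "part of speech" of the corresponding node; the character string of the word may be extracted as the node name or as the value of the attribute "text" of the corresponding node; and so on. The embodiments do not limit the kinds of information extracted from a word or the way it is extracted. The preset processing method may be one preconfigured in the configuration file of the information extraction application, or a custom processing method received as external input.
In some examples, the nodes corresponding to all words in the word set form the first node set. In other examples, only the nodes corresponding to some of the words form the first node set. For example, the words in the word set may be filtered to remove meaningless words, and the nodes corresponding to the remaining words form the first node set.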
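Steps S121 and S122 can be illustrated with a minimal Python sketch. It assumes the text has already been segmented and part-of-speech tagged by some external tool; the `Node` class, the `build_first_node_set` function, the attribute names `text` and `pos`, and the stop-word list are all hypothetical names chosen for illustration and are not defined by the application.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Structured description of one word: a node name plus attributes."""
    name: str
    attrs: dict = field(default_factory=dict)


# Hypothetical list of "meaningless" words to filter out before building nodes.
STOP_WORDS = {"的"}


def build_first_node_set(tagged_words, stop_words=STOP_WORDS):
    """Turn (word, part-of-speech) pairs into nodes, keeping the original order."""
    nodes = []
    for word, pos in tagged_words:
        if word in stop_words:
            continue  # optional filtering step described in the text
        nodes.append(Node("word", {"text": word, "pos": pos}))
    return nodes


# Assumed output of an external segmenter/tagger for a short sentence.
tagged = [("刘德华", "nr"), ("和", "cc"), ("朱丽倩", "nr"), ("结婚", "vi")]
first_nodes = build_first_node_set(tagged)
```

Keeping the nodes in a list preserves word order, which the later rule matching depends on.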
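Step S141: process the nodes in the first node set with the rules in the rule base to generate a second node set.
Each node in the second node set describes at least one node in the first node set.
Through steps S121 and S122, the unstructured text has been converted into a first node set describing the words in the text, in which each node includes one or more pieces of information (attribute values) extracted from one word and describes only a simple piece of language material (a single word). In other words, the task of steps S121 and S122 is to take the text apart (the "parsing" referred to later) and to analyze and extract information from individual words. The task of step S141 is to merge these nodes according to the information they carry; a merged node describes, as structured data, more complex language material that has grammatical structure (a phrase or a sentence). One or more rounds of merging yield one or more merged nodes, which form the second node set. Merging is driven by the rules in the rule base. Each rule may define how a node is generated based on the grammar of the language of the unstructured text. Grammar rules cover how words combine, their types, and how they express meaning.
In some examples, the unstructured text may be split into several sub-texts, and each sub-text is then processed in turn through steps S121 to S142 above. A sub-text may be, for example, a clause, a sentence, or a paragraph, and a word set may be generated for each sub-text. In other examples, a single word set is generated for the whole unstructured text. In some examples, separator marks indicating where each sub-text starts may be added to the word set of the whole text. When the first node set for that word set is generated, separator nodes may likewise be added to it, so that the nodes between two adjacent separator nodes correspond to one sub-text.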
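Step S142: output the attribute names and attribute values of the attributes of the nodes that have the root node role in the second node set as structured information.
Each rule indicates the role of the node it generates, for example root node or non-root node. A non-root node describes a piece of text whose meaning is incomplete, whereas a root node describes a piece of text with complete meaning, such as a clause, a simple sentence, a compound sentence, or a paragraph. Whether the node generated by a rule is a root node is determined by the grammar rule on which the rule is based. For example, a node whose attributes describe a person, a time, a place, and an action may serve as a root node. The role of a node may be represented by the value of one of its attributes. For example, when the value of the attribute "role" of a node is "root", the node is a root node.
The second node set may include one or more root nodes. For example, when the second node set corresponds to one sub-text of the unstructured text, such as a sentence, it may include one root node. When the second node set corresponds to the whole unstructured text, it may include several root nodes, each corresponding to one sub-text. A root node includes several attributes; the values of specified attributes may be extracted according to a preset extraction rule and output as structured data. The output structured data may be stored in a preset format, for example JavaScript Object Notation (JSON), and may be stored in a preset storage device for later queries. The extracted structured data can be used in various scenarios, such as data mining and knowledge graph construction.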
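As a rough illustration of step S142, the following Python sketch filters root-role nodes out of a second node set and serializes selected attributes as JSON. The function name `export_roots`, the dictionary layout of a node, and the sample attribute names (`event`, `couple`, `marryTime`) are assumptions made for this example; only the attribute `role` with value `root` is taken from the description above.

```python
import json


def export_roots(nodes, wanted=None):
    """Serialize the attributes of root-role nodes as a JSON array string.

    nodes:  list of attribute dictionaries, one per node (assumed layout).
    wanted: optional set of attribute names to keep; None keeps all.
    """
    records = []
    for attrs in nodes:
        if attrs.get("role") != "root":
            continue  # non-root nodes do not carry complete semantics
        record = {k: v for k, v in attrs.items()
                  if k != "role" and (wanted is None or k in wanted)}
        records.append(record)
    return json.dumps(records, ensure_ascii=False)


# Hypothetical second node set for one sentence.
second_set = [
    {"role": "root", "event": "marry",
     "couple": ["刘德华", "朱丽倩"], "marryTime": "2008-6-23"},
    {"role": "non-root", "text": "于"},
]
output = export_roots(second_set)
```

Passing `ensure_ascii=False` keeps the Chinese strings readable in the stored JSON.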
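In the technical solution of the embodiments of this application, unstructured text is decomposed into words, the words are described with structured data, and the structured data is then merged using preset rules to obtain a root node describing the unstructured text; the structured data in the root node is taken as the extracted structured data. Because the extraction logic is based on preset rules, no large annotated corpus or trained extraction model is needed, and the implementation cost is low.
In some embodiments, in step S122, a node may be generated for each word in the word set, the node including a first attribute whose value is the character string of the word; the first node set is then generated so that it includes the nodes corresponding to the words in the word set.
In some embodiments, in step S122, a second attribute may also be set in the node corresponding to each word, the value of the second attribute indicating the part of speech of the word. Parts of speech may include noun, verb, preposition, adverb, adjective, and so on.
In some embodiments, in step S122, a first word having a preset content type may also be identified in the word set, the preset content type being selected from: person name, place name, date, time, and proper noun. The preset content type is then represented by the node name, or by the value of a third attribute, of the node corresponding to the first word.
In some embodiments, in step S122, the text of the first word may also be converted into target text in a specified format corresponding to the preset content type, and a fourth attribute whose value is the target text is added to the node corresponding to the first word. For example, the text of a word identified as a date may be converted into a preset date format, such as converting "2008年6月23日" into "2008-6-23".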
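The date conversion mentioned above can be sketched in Python as follows. The function name `normalize_date` and the choice of a regular expression are assumptions for illustration; the application only specifies the input and output shown in its example.

```python
import re

# Matches dates written as <year>年<month>月<day>日.
_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def normalize_date(text):
    """Convert a Chinese-style date into 'YYYY-M-D'; return None if it is not a date."""
    match = _DATE.fullmatch(text)
    if match is None:
        return None
    year, month, day = match.groups()
    # int() drops any leading zero so the result follows the 2008-6-23 style.
    return f"{year}-{int(month)}-{int(day)}"
```

The returned string would be stored as the value of the fourth attribute, while the original text stays available in the first attribute.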
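In some embodiments, each rule in the rule base includes a description of one or more input nodes and the way of generating an output node from the one or more input nodes. In step S141, at least one node may be selected from the first node set, and the rule base is searched for a rule whose input nodes match the at least one node. The at least one node is used as the input nodes of the rule, a second node is generated in the way the rule specifies, and the second node replaces the at least one node in the first node set. These processing steps may be repeated one or more times. The first node set after this processing is taken as the second node set. When selecting nodes from the first node set, some examples traverse every combination of one or more nodes, and some examples select only combinations of adjacent nodes; the way of selecting nodes is not limited here and may be designed as needed.
In some embodiments, the description of the one or more input nodes in a rule may include at least one of the following:
a condition that the value of a specified attribute of one of the one or more input nodes must satisfy;
the order in which the one or more input nodes are arranged.
In some embodiments, the way of generating an output node from the one or more input nodes in a rule may include at least one of the following:
using the value of a specified attribute of one of the one or more input nodes as the value of a specified attribute of the output node;
merging the values of a specified attribute of at least two of the one or more input nodes into a merged value, and using the merged value as the value of a specified attribute of the output node.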
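In some embodiments, when the values of the specified attribute of the at least two nodes in the rule are character strings, merging them into a merged value may include merging the values of the specified attribute of the at least two nodes into a character string or a string array in the merging manner specified by the rule. The merging manner may be one of the following:
concatenating the first character strings into a second character string; or
merging the first character strings into a string array, each first character string being an element of the array.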
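The two merging manners can be shown with a small Python sketch. The function name `merge_values`, the mode names `"concat"` and `"array"`, and the optional separator are illustrative assumptions rather than terms from the application.

```python
def merge_values(values, mode, sep=""):
    """Merge attribute values (strings) taken from two or more input nodes.

    mode "concat": join the strings into one string, optionally with a separator.
    mode "array":  keep each string as one element of a list.
    """
    if mode == "concat":
        return sep.join(values)
    if mode == "array":
        return list(values)
    raise ValueError(f"unknown merge mode: {mode!r}")
```

For instance, merging `"jack"` and `"lucy"` yields `"jack,lucy"` with `mode="concat", sep=","` and `["jack", "lucy"]` with `mode="array"`.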
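In some embodiments, the rule base may include priority information for each rule, and when the rule base is searched, rules may be looked up in descending order of priority.
Fig. 2 is a schematic diagram of a computing device 200 according to the embodiments of this application. As shown in Fig. 2, the computing device 200 includes a processor and a storage medium. The storage medium stores an information extraction system, which may be implemented by computer-readable instructions. The information extraction system can perform the information extraction method of the embodiments of this application to extract structured data from unstructured text. In some examples, the computing device 200 may include one or more physical devices, such as a distributed computing system or a server cluster.
Fig. 3a is a schematic diagram of an example application scenario of the information extraction method according to an embodiment of this application. As shown in Fig. 3a, a terminal 100 communicates with a server 200 over a network. The terminal 100 may receive text entered by a user (the unstructured text) and send it to the server 200 over the network. The server 200 extracts information from the text to form structured extracted information (the structured data), enabling standardized and automated document management. The server 200 may also send the extraction result to the terminal 100 for display. The terminal 100 may be a smartphone, a tablet computer, a personal digital assistant (PDA), or a personal computer. The server 200 may be a standalone physical server or a cluster of physical servers.
The structure of the server 200 may be as shown in Fig. 2. For example, the server 200 may include a processor, a storage medium, memory, and a network interface connected by a system bus. The storage medium of the server 200 stores an operating system, a database, and an information extraction system. The database stores data such as the node format and the node synthesis rules (the rules in the rule base) used for information extraction. The processor of the server 200 provides computing and control capabilities and supports the operation of the whole server 200. The memory of the server 200 provides the runtime environment for the information extraction system stored in the storage medium. The network interface of the server 200 is used to communicate with the external terminal 100 over a network connection, for example to receive the text to be extracted sent by the terminal 100.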
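Fig. 3b is a flowchart of an information extraction method according to an embodiment of this application. The method can be applied to the server shown in Fig. 2 and may include the following steps.
Step 101: obtain the text to be extracted.
The text to be extracted may be any text data composed of characters, and may be semi-structured Web data or unstructured text data (the unstructured text). Obtaining the text to be extracted includes obtaining text data displayed in a specified application, such as text data published by a specified website or a specified information publishing platform.
Step 103: define the node format of nodes expressing text information.
Here, "defining the node format of nodes expressing text information" means obtaining a predefined node format, or the output node format defined by the rules in the rule base, as the node format of subsequently generated nodes. A node is the basic unit for expressing text information. All nodes share a uniform node format: text information is grouped in the same format, each node carries text information, and the text information within a node is marked according to uniform rules. This makes it convenient to define operation rules over the text information and thereby extract it.
Step 105: parse the text to be extracted according to the node format to generate nodes expressing the text information of the text to be extracted, and form a queue from the nodes.
The text to be extracted is parsed into nodes of the preset format that carry text information. Usually the text is parsed sentence by sentence: each sentence is parsed into a plurality of nodes expressing text information, which form one queue. In this embodiment, the first node set is implemented as a queue. In one example, the parsing here may include steps S121 and S122 above.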
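Step 107: obtain node synthesis rules for generating a parent node from child nodes.
A node synthesis rule processes nodes through an operation rule: the text information expressed by a plurality of nodes (the input nodes) is synthesized into a new node (the output node) according to the operation rule. That is, a rule consists of the description of one or more input nodes and the way of generating an output node from them, as described above. The plurality of nodes are child nodes, and the new node is their parent node; the parent node contains summary text information of the text information contained in those nodes. Each node synthesis rule contains one set of correspondences between a parent node and child nodes. Node synthesis rules may be obtained by providing an extractor interface, through which user-defined node synthesis rules are received. When node synthesis rules need to be added for different texts to be extracted, a class may be defined to implement the extractor interface. In some embodiments, the extractor interface may also take the text to be extracted as a parameter and produce the required extraction result according to the node synthesis rules.
Step 109: synthesize the nodes in the queue according to the node synthesis rules to generate parent nodes, and form the extracted information from the parent nodes.
The nodes in the queue are matched against the node synthesis rules in turn, and according to the matching result the corresponding nodes are synthesized into a parent node as the rule specifies. Each parent node contains summary text information obtained by synthesizing the text information of its child nodes according to at least one node synthesis rule. A parent node generated by one rule can serve as a child node in another rule. By defining different rules, information is therefore passed upward step by step through the parent-child correspondences, and the text to be extracted is reduced to the corresponding extracted information. The passing relationships between the nodes carrying text information form an information tree that contains both the text information of the text to be extracted and the extraction result. The final extracted information is stored in the parent node at the top of the tree, which is the root node.
In the information extraction method of the above embodiment, the node format of nodes expressing text information and the node synthesis rules for generating parent nodes from child nodes are obtained, so that the text to be extracted can be parsed into nodes expressing text information in the predefined node format. The node synthesis rules can be customized according to the desired extraction result and express the correspondence between several child nodes and a parent node, so that the text information expressed by child nodes is synthesized into a parent node containing summary text information, and information is passed upward step by step until the final extracted information is obtained. Extraction with this method is not restricted by the structure of the data in the text to be extracted; the rules can be customized and supplemented for individual, particularly complex texts; the whole extraction logic is easy to understand and easy to extend in real time; and no extraction model needs to be trained on a large annotated corpus, so the implementation cost is low.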
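Fig. 4 is a flowchart of the information extraction method according to an embodiment of this application. As shown in Fig. 4, in step 103 of Fig. 3b, defining the node format of nodes expressing text information may include:
Step 1031: set custom nodes.
In this document, "setting an X node" means setting the node name or attributes of a node to specified values so that the node takes the form of an X node; the same applies below.
In the node format of custom nodes, each custom node is marked with a first identifier. The node content of each custom node includes a node name and text information expressed through correspondences between text information attributes (the attributes of the node) and text information attribute values (the values of those attributes). The correspondence between a text information attribute and its value is marked with a second identifier, and each text information attribute value is marked with a third identifier.
Because each custom node is marked with the first identifier, the first identifier separates different custom nodes. The text information carried by a custom node is expressed through correspondences between text information attributes and text information attribute values. The second identifier marks the correspondence between an attribute and its value, so it separates the different pieces of text information within the node content. The third identifier marks each attribute value, so it distinguishes attributes from attribute values. The node content of a custom node may contain text information expressed by several attribute-value correspondences, which are usually separated by a preset symbol; in some embodiments the preset symbol is a space, and the node name is any character string that contains no space.
In a specific embodiment, the first identifier is a pair of angle brackets (<>), that is, each custom node is enclosed in angle brackets; the second identifier is the equals sign (=), that is, each text information attribute is connected to its value by an equals sign; and the third identifier is a pair of double quotation marks (""), that is, each text information attribute value is enclosed in double quotation marks. A node named A is then written as <A attr1="value1">, where A is the node name, attr1 is the text information attribute, and value1 is the text information attribute value. In some embodiments, when a correspondence contains only the attribute name and no attribute value, the value defaults to "true". For example, a node named event written as <event root> has the node name event, the text information attribute root, and the text information attribute value "true".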
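The node notation of this specific embodiment can be read with a short Python sketch. The function name `parse_node` and the regular expressions are assumptions for illustration; the sketch covers only the plain form `<name attr="value" flag>` and does not handle filter attributes such as `$pos!=`.

```python
import re

# <name ...attributes...>: the name is any run of characters without spaces.
_NODE = re.compile(r"<\s*([^\s<>]+)([^<>]*)>")
# attr="value" or a bare attr (whose value then defaults to "true").
_ATTR = re.compile(r'([^\s=]+)(?:\s*=\s*"([^"]*)")?')


def parse_node(text):
    """Parse one custom node into (node_name, {attribute: value})."""
    match = _NODE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a custom node: {text!r}")
    name, rest = match.groups()
    attrs = {}
    for attr in _ATTR.finditer(rest):
        value = attr.group(2)
        attrs[attr.group(1)] = "true" if value is None else value
    return name, attrs
```

With this sketch, `parse_node('<A attr1="value1">')` gives the name `A` with `attr1` mapped to `value1`, and `parse_node('<event root>')` gives the name `event` with `root` mapped to `"true"`.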
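With the node format set in this way, the node content includes a node name and text information expressed through attribute-value correspondences. This representation is close to the way people ordinarily think, is easy to understand, and makes it convenient to parse text information into nodes. Expressing text information as attributes and attribute values also brings in part-of-speech information, which later makes it easier to define extraction rules based on parts of speech.
Fig. 5 is a flowchart of the information extraction method according to an embodiment of this application. As shown in Fig. 5, in step 1031 of Fig. 4, setting custom nodes may include the following steps.
Step 1032: set nodes expressing text information related to time, address, and person as built-in nodes.
Step 1033: set nodes expressing text information related to event types as message nodes.
Custom nodes include built-in nodes and message nodes. A built-in node is a node containing commonly used text information, such as a time, an address, a person, or a proper noun. For example, nodes expressing text information related to time, address, and person may each be set as built-in nodes. A node carrying time-related text information is set as a time built-in node, such as <time>, where time is the node name. A node carrying address text information is set as an address built-in node, such as <location>, where location is the node name. A node carrying person text information is set as a person built-in node, such as <people>, where people is the node name. Because time, address, and person are usually information that must appear in the extraction result, setting such nodes as built-in nodes allows the time, address, and person text information in the text to be extracted to be recognized automatically and parsed into time, address, and person built-in nodes.
A message node is a node containing text information about an event type. With message nodes set, the event-type text information in the text to be extracted can be recognized automatically and parsed into message nodes. The node name of a message node denotes a message, for example <word>, where word is the node name. A message node is an initial parsed node of the text to be extracted, generated by parsing the text information that describes the event type. The extraction process is expressed as a message tree formed by nodes, which consists of the mappings between child nodes and parent nodes. Some nodes are both a child node and a parent node at different levels of the tree. The parent node at the top of the message tree is not a child of any node and is the root node. A child node at the bottom of the message tree is not the parent of any node and is a leaf node. Message nodes are leaf nodes.
With the types of custom nodes set to include built-in nodes and message nodes, the text information in the text to be extracted can be parsed into a queue of nodes expressing that text information, and the text information they carry can then be extracted by applying operations based on preset grammar rules to the nodes.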
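Fig. 6 is a flowchart of the information extraction method according to an embodiment of this application. As shown in Fig. 6, in step 103 of Fig. 3b, defining the node format of nodes expressing text information may include:
Step 1034: set the types of text information attributes and text information attribute values. The text information attributes include an original string, a normalized string, and a part-of-speech tag. The value of the original string attribute is the original text. The value of the normalized string attribute is the original text converted into a preset format. The value of the part-of-speech tag attribute is a preset character code that identifies the part of speech of the original text.
The text information attributes of nodes carrying different text information are predefined, and the types of attributes and attribute values are set. The attributes mainly include the original string, the normalized string, and the part-of-speech tag. The original string attribute takes as its value the original text in the text to be extracted, for example <people original="刘德华">, where original denotes the original string attribute and its value is the original text "刘德华". The normalized string attribute takes as its value the original text converted into a preset format, for example <time text="2008-06-23">, where text denotes the normalized string attribute and its value "2008-06-23" is the converted form of the original text "2008年6月23日". The part-of-speech tag gives the part of speech of the original text, and its value is a preset character code that distinguishes parts of speech, for example <word pos="cc">, where pos denotes the part-of-speech tag attribute and the preset code cc indicates that the part of speech of the text carried by the node is cc. The preset codes are chosen mainly so that parts of speech are easy to remember and tell apart; their length and naming may be set freely. With these three attributes, each node generated while parsing the text to be extracted is marked according to the properties of the text information it carries, so that node synthesis rules can define their matching conditions in terms of uniform text information attributes.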
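Fig. 7 is a flowchart of the information extraction method according to an embodiment of this application. As shown in Fig. 7, in step 103 of Fig. 3b, defining the node format of nodes expressing text information may include:
Step 1035: set text information attributes, the text information attributes including a nullable attribute.
The value of the nullable attribute is usually "true" and is usually not written, being expressed by default. A node whose text information attributes include the nullable attribute may be empty; it is a nullable node. In one embodiment, the nullable attribute is written orEmpty, as in <and orEmpty>, where the node named and is a nullable node. Nullable nodes can be used in node synthesis rules to denote input nodes. When an input node is set as a nullable node, the rule states that the text information carried by that input node may be omitted, that is, the input node need not be present.
Fig. 8 is a flowchart of the information extraction method according to an embodiment of this application. As shown in Fig. 8, in step 103 of Fig. 3b, defining the node format of nodes expressing text information may include:
Step 1036: set the types of text information attributes and text information attribute values, the text information attributes including a filter attribute whose value is a filter condition. In a node synthesis rule, the attributes of an input node may include a filter attribute, and the value of the filter attribute is the content of the specific filter condition. The relationship expressed by a filter attribute and its value is either equality or inequality, and a node whose text information attributes include a filter attribute is a filter node. In one embodiment, the filter attribute is named $pos, equality is written (=), and inequality is written (!=), as in <B $pos="nr"> and <C $pos!="adj">, where the nodes named B and C are both filter nodes: the part-of-speech tag of node <B> must be nr, and the part-of-speech tag of node <C> must not be adj. Filter attributes can thus be set on the input nodes of a node synthesis rule to state a condition that the text information carried by an input node must satisfy, for example that it must equal, or must differ from, the value specified in the filter condition, so that different matching conditions can be expressed.
It will be understood that a node may define several filter attributes, and the relationship between the filter conditions may be "and" or "or".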
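A filter check of this kind can be sketched in Python as below. The function name `passes_filters` and the tuple representation of a filter are hypothetical; the sketch combines several filters with "and" only, which is one of the two combinations the text allows.

```python
def passes_filters(attrs, filters):
    """Check a node's attributes against filter conditions, combined with 'and'.

    attrs:   attribute dictionary of the candidate node, e.g. {"pos": "nr"}.
    filters: list of (attribute, operator, value) with operator "=" or "!=".
    """
    for attribute, operator, expected in filters:
        actual = attrs.get(attribute)
        if operator == "=":
            if actual != expected:
                return False
        elif operator == "!=":
            if actual == expected:
                return False
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
    return True
```

Under this representation, the rule input `<B $pos="nr">` corresponds to the filter list `[("pos", "=", "nr")]`, and `<C $pos!="adj">` corresponds to `[("pos", "!=", "adj")]`.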
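In some embodiments, in step 103, defining the node format of nodes expressing text information may include:
setting text information attributes, the text information attributes including a root node attribute. The value of the root node attribute is usually "true" and is usually not written, being expressed by default; a node whose text information attributes include the root node attribute is a root node. In one embodiment, the root node attribute is written root, as in <marry root>, where the node named marry is a root node. In a node synthesis rule, marking a node as a root node states that the text information carried by that node is the final extracted information.
In some embodiments, in step 103, defining the node format of nodes expressing text information may include:
setting text information attributes, the text information attributes including a priority attribute. The value of the priority attribute is usually a number. The priority attribute and its value express the priority of a node synthesis rule. In one embodiment, the priority attribute is written level, as in <level="1">. Typically, priority decreases from 1 to 10. If several node synthesis rules are hit at the same time while child nodes are being synthesized into a parent node, the rule with the higher priority is executed first.
In some embodiments, in step 1031, setting custom nodes may include:
setting the node expressing the beginning of the text to be extracted as a start node;
setting the node expressing the end of the text to be extracted as an end node.
Usually the text to be extracted is parsed sentence by sentence, each sentence being parsed into several nodes expressing text information that form one queue. The start node is at the head of the node queue formed from a sentence, and the end node is at its tail. When the text to be extracted contains several sentences or paragraphs, it may be parsed into node queues sentence by sentence or paragraph by paragraph, and the start and end nodes can delimit the paragraphs.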
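In some embodiments, Fig. 9 is a flowchart of the information extraction method according to an embodiment of this application. As shown in Fig. 9, in step 103 of Fig. 3b, defining the node format of nodes expressing text information includes:
Step 1039: set text nodes, the node format of a text node being the original text displayed directly to express the text information.
A text node places the original text directly in the node list generated by parsing the text to be extracted. Unlike a custom node, a text node needs no identifier to set it apart. For example, if the text to be extracted contains "的", it can appear directly as the text node "的" in the node queue generated by parsing the text. Given the earlier definitions of message nodes and of the original string attribute, a text node is equivalent to a message node whose text information attribute value is the original text. In one embodiment, the text node "的" is equivalent to the message node <word text="的">. A text node means the text itself and carries no other text information attributes. Text nodes simplify the form of some nodes in the parsed node queue, so that the text information expressed by the nodes is easier to read.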
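In some embodiments, Fig. 10 shows the information extraction method according to an embodiment of this application, in which the node format of a node expresses text information through correspondences between text information attributes and text information attribute values. In step 107, obtaining node synthesis rules for generating a parent node from child nodes may include:
Step 1071: obtain a copy synthesis rule, which copies the text information attribute value of a specified child node as the text information attribute value of the parent node.
In each node synthesis rule, the parent node and the child nodes are separated by a preset fourth identifier. The parent node is on the left of the fourth identifier and the child nodes are on the right, so the fourth identifier divides the rule into a left part and a right part. In one embodiment, the fourth identifier is (:=), as in <A>:=<B><C><D>, which is a node synthesis rule that synthesizes the three child nodes <B><C><D> into the parent node <A>.
A copy synthesis rule copies the text information attribute value of a specified child node as the text information attribute value of the parent node, thereby extracting the text information carried by the child node into the parent node. In a copy synthesis rule, the text information attribute value of the parent node is written with a preset fifth identifier. In one embodiment, the fifth identifier is $ followed by a number, the number indicating the child node whose attribute value is copied. For example, <A attr1="$1">:=<B><C><D> synthesizes the parent node <A> from the child nodes <B><C><D> by a copy synthesis rule; the fifth identifier $1 means that the text information attribute value of the first child node on the right, node <B>, becomes the value of the attribute attr1 of node <A>.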
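The `$` plus number convention can be sketched in Python as follows. The application does not state which attribute of the child is copied; this sketch assumes the child attribute with the same name as the parent attribute, and the function name `apply_copy` and the dictionary representation of child nodes are likewise hypothetical.

```python
import re


def apply_copy(parent_attr, spec, children):
    """Resolve a '$n' copy spec for one parent attribute.

    parent_attr: name of the parent attribute being filled, e.g. "text".
    spec:        the value written in the rule, e.g. "$2".
    children:    attribute dictionaries of the matched child nodes, in rule order.
    Assumption: the value is copied from the child's attribute of the same name.
    """
    match = re.fullmatch(r"\$(\d+)", spec)
    if match is None:
        return spec  # a literal value, not a copy spec
    child = children[int(match.group(1)) - 1]  # rule indices start at 1
    return child.get(parent_attr)
```

For a rule such as `<atTime text="$2">:=<at><time>`, the second child is the time node, so under the stated assumption its `text` value becomes the `text` value of the new `atTime` node.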
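In some embodiments, Fig. 11 is a flowchart of the information extraction method according to an embodiment of this application. The node format of a node expresses text information through correspondences between text information attributes and text information attribute values. In step 107, obtaining node synthesis rules for generating a parent node from child nodes may include:
Step 1072: obtain a merge synthesis rule, which selects the text information attribute values of several child nodes and merges them into the text information attribute value of the parent node.
A merge synthesis rule merges the text information attribute values of several specified child nodes into the text information attribute value of the parent node, thereby extracting the text information carried by the child nodes into the parent node. In a merge synthesis rule, the text information attribute value of the parent node is written with a preset sixth identifier. In one embodiment, the sixth identifier is $join followed by an index list, the index list containing several numbers separated by a preset separator symbol; the numbers indicate the child nodes whose attribute values are merged. For example, <A attr1="$join 1,3">:=<B><and><C> synthesizes the parent node <A> from the child nodes <B><and><C> by a merge synthesis rule; the sixth identifier $join 1,3 means that the attribute values of the first and third child nodes, <B> and <C>, are merged into the value of the attribute attr1 of node <A>. When the index list contains no numbers, that is, the child nodes to merge are not specified, all child nodes are merged by default; a designated symbol may also be used to indicate that all child nodes are to be merged, and in one embodiment this symbol is the underscore (_). The separator symbol that separates the numbers in the index list also serves as the separator placed between the merged attribute values in the parent node's attribute value. The index list may also use a designated symbol to indicate that the attribute values are merged without any separator; in one embodiment this symbol is \empty.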
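A partial Python sketch of the `$join` convention follows. It handles an explicit index list such as `$join 1,3` (using the list's separator between the merged values) and the "all children" forms `$join` and `$join _`. The function name `apply_join`, the `attr` parameter naming which child attribute is merged, and the choice of an empty separator for the "all children" forms are assumptions; the `\empty` marker is not covered because its exact syntax is not given in the text.

```python
import re


def apply_join(spec, children, attr="text"):
    """Resolve a '$join ...' spec into one merged string.

    spec:     e.g. "$join 1,3", "$join _" or "$join".
    children: attribute dictionaries of the matched child nodes, in rule order.
    attr:     child attribute whose values are merged (an assumption of this sketch).
    """
    if not spec.startswith("$join"):
        raise ValueError(f"not a join spec: {spec!r}")
    body = spec[len("$join"):].strip()
    if body in ("", "_"):
        # No explicit indices: merge every child, with no separator (assumed).
        indices, sep = range(1, len(children) + 1), ""
    else:
        indices = [int(n) for n in re.findall(r"\d+", body)]
        separators = re.findall(r"\D+", body)
        sep = separators[0] if separators else ""
    return sep.join(children[i - 1][attr] for i in indices)
```

With children whose `text` values are `jack`, `and`, and `lucy`, the spec `$join 1,3` skips the middle node and produces `jack,lucy`.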
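In some embodiments, Fig. 12 is a flowchart of the information extraction method according to an embodiment of this application. The node format of a node expresses text information through correspondences between text information attributes and text information attribute values. In step 107, obtaining node synthesis rules for generating a parent node from child nodes may include:
Step 1073: obtain a collect synthesis rule, which selects the values of a specified text information attribute from all child nodes to generate the text information attribute value of the parent node.
Here, "all child nodes" includes the child nodes of the parent node generated by the rule as well as the children of those child nodes. A collect synthesis rule collects the values of a specified text information attribute from all child nodes as the text information attribute value of the parent node, thereby extracting the text information carried by the child nodes into the parent node. In a collect synthesis rule, the text information attribute of the parent node is written with a preset seventh identifier, and the value of that attribute is the name of the child-node attribute to collect. The result collected into the parent node is a mapping table that maps each collected attribute value of the child nodes to the corresponding text information. In one embodiment, the seventh identifier is collect. For example:
<A collect="role">:=<B><at><T role="time" text="1984-11-25">
<B>:=<C role="participator" text="jack"><and><D role="participator" text="lucy">
Here the parent node <B> is generated from the child nodes <C> and <D>, and node <A> is generated by a collect synthesis rule from the values of the attribute role of all of its child nodes. The result collected in the parent node <A> is:
role.participator=[jack,lucy]
role.time=[1984-11-25]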
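The collection over children and grandchildren can be reproduced with a short recursive Python sketch. The `Node` class, the `collect` function, and the assumption that the collected value is each node's `text` attribute are illustrative choices; they are inferred from the example result above rather than stated by the application.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def collect(node, key, value_attr="text"):
    """Gather {value of `key`: [value_attr of each descendant carrying `key`]}."""
    result = {}
    for child in node.children:
        label = child.attrs.get(key)
        if label is not None:
            result.setdefault(label, []).append(child.attrs.get(value_attr))
        # Descend so that grandchildren are collected as well.
        for sub_label, values in collect(child, key, value_attr).items():
            result.setdefault(sub_label, []).extend(values)
    return result


# The two example rules, written out as an already-built tree.
b = Node("B", children=[
    Node("C", {"role": "participator", "text": "jack"}),
    Node("and"),
    Node("D", {"role": "participator", "text": "lucy"}),
])
a = Node("A", children=[b, Node("at"), Node("T", {"role": "time", "text": "1984-11-25"})])
roles = collect(a, "role")
```

Here `roles` maps `participator` to the two names and `time` to the single date, matching the mapping table given in the example.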
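The copy, merge, and collect synthesis rules in the above embodiments each define a node synthesis rule through the definition of the parent node's attribute value. With node synthesis rules that generate a parent node from child nodes, the parent node derives its own new information from the information of its children, and a parent node can serve as a child node in other rules. The text information carried by child nodes is thus passed upward, and the results are finally gathered in the topmost parent node. In this way each sentence of the text to be extracted is parsed into an information tree carrying information; the topmost parent node is the root node, and the final extracted information is formed in the root node.
In some embodiments, Fig. 13 is a flowchart of the information extraction method according to an embodiment of this application. In step 109, synthesizing the nodes in the queue according to the node synthesis rules to generate parent nodes and forming the extracted information from the parent nodes may include the following steps.
Step 1091: determine whether the queue is empty;
Step 1093: when the queue is not empty, store the node at the head of the queue in a database as a node to be extracted;
Step 1095: match the nodes to be extracted in the database against the node synthesis rules; when the nodes to be extracted match a node synthesis rule, synthesize them into a parent node according to the rule, and return to the step of determining whether the queue is empty;
Step 1097: when the queue is empty, form the extracted information from the parent nodes in the database.
The nodes in the node queue generated by parsing the text to be extracted are added to the database in order, and the nodes in the database are the objects matched against the node synthesis rules; this realizes the principle of passing extracted information upward step by step.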
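In some embodiments, the database is a stack, and before step 1091 of determining whether the queue is empty, the method further includes:
Step 1090: initialize the stack.
Taking a stack as the database, a specific embodiment of the principle of passing extracted information upward step by step is as follows. After initialization the stack is stack=[] and the node queue is queue=[word1,word2,...]. The node queue is checked; while it is not empty, the head element of queue is popped and pushed onto stack, and the elements in the stack are tested against the node synthesis rules in order of rule priority. The test works as follows: the top item of the stack is matched against the last item of the rule, checking that both the node name and the filter conditions match; if they match, or if the current rule node is a nullable node, matching continues with the previous item of the stack. If all nodes on the right side of the rule are matched, the match succeeds: the corresponding stack items are synthesized into a new node according to the rule, the corresponding items are removed from the stack, and the new node is pushed onto the stack. The name of the new node is defined by the left-side node of the rule, and the text information attribute values of the new node are generated according to the information passing defined by the rule. When the node queue is empty, the text information attributes and attribute values contained in the root node in the stack are extracted to form the extraction result; this root node is the topmost parent node of the information tree. It will be understood that when the text to be extracted contains relatively little text information, a single node synthesis rule may suffice to extract the information, in which case the extracted information is formed from the text information carried by the parent node generated by that one rule.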
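The queue-and-stack procedure can be sketched in Python as a small shift-reduce loop. This is a simplified illustration, not the application's implementation: nodes are reduced to a match key (the node name for custom nodes, the text itself for text nodes), filter conditions, nullable nodes, and attribute passing are omitted, and the rule tuples, priority numbers, and function names are hypothetical.

```python
from collections import deque

# Each rule: (level, parent_name, [child keys]); a lower level means higher priority.
RULES = [
    (1, "and", ["和"]),
    (1, "at", ["于"]),
    (2, "atTime", ["at", "time"]),
    (3, "marry", ["people", "and", "people", "atTime", "结婚"]),
]


def match_top(stack, rhs):
    """Return len(rhs) if the top of the stack matches the rule's right side, else 0."""
    if len(stack) < len(rhs):
        return 0
    top = stack[-len(rhs):]
    if all(node[0] == key for node, key in zip(top, rhs)):
        return len(rhs)
    return 0


def extract(queue_items, rules=RULES):
    """Shift nodes from the queue onto the stack and reduce whenever a rule matches."""
    queue = deque(queue_items)
    stack = []  # each entry is (match_key, children)
    while queue:
        stack.append(queue.popleft())
        reduced = True
        while reduced:  # keep reducing until no rule matches the stack top
            reduced = False
            for _level, parent, rhs in sorted(rules, key=lambda rule: rule[0]):
                count = match_top(stack, rhs)
                if count:
                    children = stack[-count:]
                    del stack[-count:]
                    stack.append((parent, children))
                    reduced = True
                    break
    return stack


sentence = [("begin", []), ("people", []), ("和", []), ("people", []),
            ("于", []), ("time", []), ("结婚", []), ("end", [])]
result = extract(sentence)
```

After the loop, the stack holds the start node, one `marry` node whose children are the five matched nodes, and the end node; in a fuller implementation the `marry` node would carry the collected attributes that form the extraction result.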
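Taking the text to be extracted "刘德华和朱丽倩于2008年6月23日结婚" (Liu Dehua and Zhu Liqian married on June 23, 2008) as an example, the process of forming the extracted information with the information extraction method of the above embodiments is as follows.
With the node format defined as in the above embodiments, parsing the text to be extracted generates the node queue: <begin><people pos="nr" text="刘德华" original="刘德华"><word pos="cc" text="和" original="和"><people pos="nr" text="朱丽倩" original="朱丽倩"><word pos="p" text="于" original="于"><time pos="time" text="2008-6-23" original="2008年6月23日"><word pos="vi" text="结婚" original="结婚"><end>
The node synthesis rules defined for this text to be extracted include: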
<marry root collect="role">:=<people role="couple"><and><people role="couple"><atTime orEmpty role="marryTime">结婚
<and>:=和
<and>:=与
<atTime text="$2">:=<at><time>
<at>:=在
<at>:=于
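Referring to Fig. 14, the process of forming the extracted information for this text, in which the nodes in the queue are synthesized into parent nodes according to the node synthesis rules and the extracted information is formed from the parent nodes, can be represented as a tree. The correspondence between child nodes and parent nodes in each pair of adjacent levels of the tree matches the corresponding node synthesis rule. The matching process is shown in the following table (in the table, the text information attributes of nodes in the queue are omitted, and the child nodes of nodes in the stack are omitted).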
Figure PCTCN2017115185-appb-000001
Figure PCTCN2017115185-appb-000002
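In the information extraction method of the above embodiments, node synthesis rules are defined on the principle of passing text information upward, so extraction through the rules follows a clear line of reasoning; the syntax of the node format and of the rules is close to the way people ordinarily think, so custom rules are easy to write and understand. During extraction, extending the rules only requires adding new rules without modifying existing ones; the coupling between rules is low, and the extractor interface supports obtaining custom rules, so the method is easy to extend. Rules can reference each other through tags without being rewritten, and general-purpose rules can be placed in separate files, which makes them easy to manage and reuse. The text information carried by the nodes in the rules is expressed through correspondences between text information attributes and attribute values, so part-of-speech information can be used, and custom text information attributes and node synthesis rules that operate on them are supported; the rules also support recursive definition, which improves the generalization ability of the method. Finally, extraction requires no large training corpus, so the cost is low.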
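Fig. 15 is a schematic diagram of an information extraction system according to an embodiment of this application. The system may include an obtaining module 11, a node module 13, a parsing module 15, a rule module 17, and an extraction module 19. The obtaining module 11 obtains the text to be extracted. The node module 13 defines the node format of nodes expressing text information. The parsing module 15 parses the text to be extracted according to the node format to generate nodes expressing the text information of the text to be extracted, and forms a queue from the nodes. The rule module 17 obtains node synthesis rules for generating a parent node from child nodes. The extraction module 19 synthesizes the nodes in the queue according to the node synthesis rules to generate parent nodes, and forms the extracted information from the parent nodes.
In some embodiments, Fig. 16 shows the information extraction system of an embodiment, in which the node module 13 includes a custom node unit 131. The custom node unit 131 sets custom nodes; in the node format of custom nodes, each custom node is marked with a first identifier. The node content of each custom node includes a node name and text information expressed through correspondences between text information attributes and text information attribute values. The correspondence between a text information attribute and its value is marked with a second identifier, and each text information attribute value is marked with a third identifier.
In some embodiments, Fig. 17 shows the information extraction system of an embodiment, in which the custom node unit 131 includes a built-in node unit 132 and a message node unit 133. The built-in node unit 132 sets nodes expressing text information related to time, address, and person as built-in nodes. The message node unit 133 sets nodes expressing text information related to event types as message nodes.
In some embodiments, Fig. 18 shows the information extraction system of an embodiment, in which the node module 13 includes an attribute unit 134. The attribute unit 134 sets the types of text information attributes and text information attribute values, the text information attributes including an original string, a normalized string, and a part-of-speech tag. The value of the original string attribute is the original text, the value of the normalized string attribute is the original text converted into a preset format, and the value of the part-of-speech tag attribute is a preset character code that identifies the part of speech of the original text.
In some embodiments, the node module 13 includes an attribute unit 134. The attribute unit 134 sets text information attributes, the text information attributes including a nullable attribute.
In some embodiments, the node module 13 includes an attribute unit 134. The attribute unit sets the types of text information attributes and text information attribute values, the text information attributes including a filter attribute whose value is a filter condition.
In some embodiments, Fig. 19 shows the information extraction system of an embodiment, in which the node module 13 includes a text node unit 135. The text node unit 135 sets text nodes, the node format of a text node being the original text displayed directly to express the text information.
In some embodiments, Fig. 20 shows the information extraction system of an embodiment, in which the rule module 17 includes a copy synthesis rule unit 171. The node format of a node expresses text information through correspondences between text information attributes and text information attribute values. The copy synthesis rule unit obtains a copy synthesis rule, which copies the text information attribute value of a specified child node as the text information attribute value of the parent node.
In some embodiments, Fig. 21 shows the information extraction system of an embodiment, in which the rule module 17 includes a merge synthesis rule unit 172. The node format of a node expresses text information through correspondences between text information attributes and text information attribute values. The merge synthesis rule unit obtains a merge synthesis rule, which selects the text information attribute values of several child nodes and merges them into the text information attribute value of the parent node.
In some embodiments, Fig. 22 shows the information extraction system of an embodiment, in which the rule module 17 includes a collect synthesis rule unit 173. The node format of a node expresses text information through correspondences between text information attributes and text information attribute values. The collect synthesis rule unit obtains a collect synthesis rule, which selects the values of a specified text information attribute from all child nodes to generate the text information attribute value of the parent node.
In some embodiments, Fig. 23 shows the information extraction system of an embodiment, in which the extraction module 19 includes a determining unit 191, a storage unit 193, a matching unit 195, and an extraction unit 197. The determining unit 191 determines whether the queue is empty. The storage unit 193 stores the node at the head of the queue in a database as a node to be extracted when the queue is not empty. The matching unit 195 matches the nodes to be extracted in the database against the node synthesis rules and, when they match a rule, synthesizes them into a parent node according to the rule and returns to the step of determining whether the queue is empty. The extraction unit 197 forms the extracted information from the parent nodes in the database when the queue is empty.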
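In the information extraction system of the above embodiments, node synthesis rules are defined on the principle of passing text information upward, so the rule definitions follow a clear line of reasoning and are easy to understand. During extraction, extending the rules only requires adding new rules without modifying existing ones; the coupling between rules is low, and the extractor interface supports obtaining custom rules, so the system is easy to extend. Rules can reference each other through tags without being rewritten, and general-purpose rules can be placed in separate files, which makes them easy to manage and reuse. The text information carried by the nodes in the rules is expressed through correspondences between text information attributes and attribute values, so part-of-speech information can be used, and custom text information attributes and node synthesis rules based on them are supported; the rules also support recursive definition, which improves the generalization ability of the extraction. Finally, extraction requires no large training corpus, so the cost is low.
A person of ordinary skill in the art will understand that all or part of the procedures of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the procedures of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above embodiments describe only several implementations. Although the description is specific and detailed, it should not be understood as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention.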

Claims (24)

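  1. An information extraction method, comprising:
    obtaining unstructured text;
    parsing the text to be extracted according to a preset node format to generate a first node set composed of nodes that describe the unstructured text;
    obtaining a preset rule base, the rule base including a plurality of rules for generating nodes, each rule indicating the role of the node it generates, the role of a node being root node or non-root node;
    synthesizing the nodes in the first node set according to the rule base to generate a root node, and generating structured information according to the root node.
  2. The information extraction method of claim 1, wherein parsing the text to be extracted according to the preset node format to generate the first node set composed of nodes that describe the unstructured text comprises:
    segmenting the unstructured text into words to obtain a word set including a plurality of words;
    generating the first node set from the word set, wherein each node in the first node set describes one word in the word set and includes the attribute values of one or more attributes.
  3. The information extraction method of claim 1, wherein synthesizing the nodes in the first node set according to the rule base to generate a root node and generating structured information according to the root node comprises:
    processing the nodes in the first node set with the rules in the rule base to generate a second node set, wherein each node in the second node set describes at least one node in the first node set;
    outputting the attribute values of one or more attributes of the nodes having the root node role in the second node set as the structured information.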
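  4. The information extraction method of claim 2, wherein generating the first node set from the word set comprises:
    generating a node corresponding to each word in the word set, the node including a first attribute whose value is the character string of the word;
    generating the first node set, the first node set including the nodes corresponding to the words in the word set.
  5. The information extraction method of claim 4, further comprising:
    setting a second attribute in the node corresponding to each word, the value of the second attribute indicating the part of speech of the word.
  6. The information extraction method of claim 4, further comprising:
    identifying, in the word set, a first word having a preset content type, wherein the preset content type is one of: person name, place name, date, time, and proper noun;
    representing the preset content type by the node name, or by the value of a third attribute, of the node corresponding to the first word.
  7. The information extraction method of claim 6, further comprising:
    converting the text of the first word into target text in a specified format corresponding to the preset content type;
    adding a fourth attribute to the preprocessed node corresponding to the first word, the value of the fourth attribute being the target text.
  8. The information extraction method of claim 2, wherein the nodes in the first node set are arranged in the same order as their corresponding words in the unstructured text.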
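  9. The information extraction method of claim 3, wherein each rule in the rule base includes a description of one or more input nodes and a way of generating an output node from the one or more input nodes, and processing the nodes in the first node set with the rules in the rule base to generate the second node set comprises:
    selecting at least one node from the first node set, and searching the rule base for a rule whose input nodes match the at least one node;
    using the at least one node as the input nodes of the rule, generating a second node in the way specified by the rule, and replacing the at least one node in the first node set with the second node;
    taking the first node set as the second node set.
  10. The information extraction method of claim 9, wherein the description of the one or more input nodes in the rule includes at least one of:
    a condition that the value of a specified attribute of one of the one or more input nodes must satisfy;
    the order in which the one or more input nodes are arranged.
  11. The information extraction method of claim 9, wherein the way of generating an output node from the one or more input nodes in the rule includes at least one of:
    using the value of a specified attribute of one of the one or more input nodes as the value of a specified attribute of the output node;
    merging the values of a specified attribute of at least two of the one or more input nodes into a merged value, and using the merged value as the value of a specified attribute of the output node.
  12. The information extraction method of claim 11, wherein the values of the specified attribute of the at least two nodes in the rule are character strings, and merging the values of the specified attribute of the at least two nodes into a merged value comprises:
    merging the values of the specified attribute of the at least two nodes into a character string or a string array in the merging manner specified by the rule;
    the merging manner being one of:
    concatenating the first character strings into a second character string; or
    merging the first character strings into a string array, each first character string being an element of the string array.
  13. The information extraction method of claim 9, wherein the rule base includes priority information for each rule, and searching for a rule comprises:
    searching the rule base for the rule in descending order of rule priority.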
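  14. An information extraction system, comprising at least one processor and a memory, the memory storing computer-readable instructions that can cause the at least one processor to:
    obtain unstructured text;
    parse the text to be extracted according to a preset node format to generate a first node set composed of nodes that describe the unstructured text;
    obtain a preset rule base, the rule base including a plurality of rules for generating nodes, each rule indicating the role of the node it generates, the role of a node being root node or non-root node;
    synthesize the nodes in the first node set according to the rule base to generate a root node, and generate structured information according to the root node.
  15. The information extraction system of claim 14, wherein parsing the text to be extracted according to the preset node format to generate the first node set composed of nodes that describe the unstructured text comprises:
    segmenting the unstructured text into words to obtain a word set including a plurality of words;
    generating the first node set from the word set, wherein each node in the first node set describes one word in the word set and includes the attribute values of one or more attributes.
  16. The information extraction system of claim 14, wherein synthesizing the nodes in the first node set according to the rule base to generate a root node and generating structured information according to the root node comprises:
    processing the nodes in the first node set with the rules in the rule base to generate a second node set, wherein each node in the second node set describes at least one node in the first node set;
    outputting the attribute values of one or more attributes of the nodes having the root node role in the second node set as the structured information.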
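  17. The information extraction system of claim 15, wherein generating the first node set from the word set comprises:
    generating a node corresponding to each word in the word set, the node including a first attribute whose value is the character string of the word;
    generating the first node set, the first node set including the nodes corresponding to the words in the word set.
  18. The information extraction system of claim 17, wherein the instructions can cause the at least one processor to:
    set a second attribute in the node corresponding to each word, the value of the second attribute indicating the part of speech of the word.
  19. The information extraction system of claim 17, wherein the instructions can cause the at least one processor to:
    identify, in the word set, a first word having a preset content type, wherein the preset content type is selected from: person name, place name, date, time, and proper noun;
    represent the preset content type by the node name, or by the value of a third attribute, of the node corresponding to the first word.
  20. The information extraction system of claim 19, wherein the instructions can cause the at least one processor to:
    convert the text of the first word into target text in a specified format corresponding to the preset content type;
    add a fourth attribute to the preprocessed node corresponding to the first word, the value of the fourth attribute being the target text.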
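  21. The information extraction system of claim 16, wherein processing the nodes in the first node set with the rules in the rule base to generate the second node set comprises:
    selecting at least one node from the first node set, and searching the rule base for a rule such that the at least one node conforms to the description of one or more input nodes in the rule;
    using the at least one node as the input nodes of the rule, generating a second node in the way the rule specifies for generating an output node from the one or more input nodes, and replacing the at least one node in the first node set with the second node;
    taking the first node set as the second node set.
  22. The information extraction system of claim 21, wherein generating the second node comprises:
    using the value of a specified attribute of one of the at least one node as the value of a specified attribute of the second node;
    merging the values of a specified attribute of at least two of the at least one node into a merged value, and using the merged value as the value of a specified attribute of the second node.
  23. The information extraction system of claim 22, wherein merging the values of the specified attribute of the at least two nodes into a merged value comprises:
    when the values of the specified attribute of the at least two nodes are character strings, merging the values of the specified attribute of the at least two nodes into a character string or a string array in the merging manner specified by the rule;
    the merging manner being one of:
    concatenating the first character strings into a second character string; or
    merging the first character strings into a string array, each first character string being an element of the string array.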
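  24. A computer-readable storage medium, comprising computer-readable instructions that can cause at least one processor to:
    obtain unstructured text;
    parse the text to be extracted according to a preset node format to generate a first node set composed of nodes that describe the unstructured text;
    obtain a preset rule base, the rule base including a plurality of rules for generating nodes, each rule indicating the role of the node it generates, the role of a node being root node or non-root node;
    synthesize the nodes in the first node set according to the rule base to generate a root node, and generate structured information according to the root node.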
PCT/CN2017/115185 2016-12-22 2017-12-08 Information extraction method and system WO2018113532A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/385,163 US11093520B2 (en) 2016-12-22 2019-04-16 Information extraction method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611200449.8 2016-12-22
CN201611200449.8A CN108228676B (zh) 2016-12-22 2016-12-22 Information extraction method and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/385,163 Continuation US11093520B2 (en) 2016-12-22 2019-04-16 Information extraction method and system

Publications (1)

Publication Number Publication Date
WO2018113532A1 true WO2018113532A1 (zh) 2018-06-28

Family

ID=62624361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115185 WO2018113532A1 (zh) 2016-12-22 2017-12-08 Information extraction method and system

Country Status (3)

Country Link
US (1) US11093520B2 (zh)
CN (1) CN108228676B (zh)
WO (1) WO2018113532A1 (zh)

Also Published As

Publication number Publication date
CN108228676A (zh) 2018-06-29
US11093520B2 (en) 2021-08-17
CN108228676B (zh) 2021-08-13
US20190243842A1 (en) 2019-08-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17885307

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17885307

Country of ref document: EP

Kind code of ref document: A1