WO2020020085A1 - Representation learning method and device - Google Patents

Representation learning method and device Download PDF

Info

Publication number
WO2020020085A1
WO2020020085A1 PCT/CN2019/096895 CN2019096895W WO2020020085A1 WO 2020020085 A1 WO2020020085 A1 WO 2020020085A1 CN 2019096895 W CN2019096895 W CN 2019096895W WO 2020020085 A1 WO2020020085 A1 WO 2020020085A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
vector
type
representation
relationship
Prior art date
Application number
PCT/CN2019/096895
Other languages
French (fr)
Chinese (zh)
Inventor
贾岩涛
刘冬
李冯福
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2020020085A1 publication Critical patent/WO2020020085A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the field of big data technology, and in particular, to a method and a device for representation learning.
  • Knowledge Graph describes structured concepts, entities and their relationships in a structured way, expressing information on the Internet into a form closer to the human cognitive world, providing a better organization and management And the ability to understand the vast amount of information on the Internet.
  • knowledge map has gradually become one of the key technologies, and has been widely used in intelligent search, intelligent question answering, personalized recommendation, content distribution and other fields.
  • Knowledge graph representation learning method aims to represent the entities and relationships in the knowledge graph as vectors in a low-dimensional vector space, thereby transforming the calculation between entities and relationships into numerical calculations between vectors.
  • the present application provides a representation learning method and device, which are used to characterize deep-level semantic information in a knowledge map and improve the accuracy of representation learning.
  • a representation learning method including: determining a type representation vector of an entity according to a type of an entity in a triple of a knowledge graph of a fused text, the entity including a head entity and a tail entity; and according to a relationship in the triple
  • the type of the relationship determines the type representation vector of the relationship; the context representation vector of the entity is determined based on the textual information of the entity; the context representation vector of the relationship is determined based on the weight value of the relationship; the vector is represented by the type of the entity, the vector of the context representation of the entity, and the relationship
  • the type represents the vector and the context representation vector of the relationship to construct the scoring function of the triplet; the objective function is constructed based on the scoring function of the triplet; the objective function is minimized to learn the representation vector of the entity and the representation vector of the relationship.
  • the type and weight value of the relationship represent a certain level of deep semantic information
  • the type and weight of the entity are considered by considering the type and context of the entity, as well as the type and weight value of the relationship.
  • Representation vectors and relational representation vectors can depict deep-level semantic information in the knowledge map and improve the accuracy of representation learning.
  • the method before determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text, the method further includes: initializing the representation vector of the head entity, the representation vector of the tail entity, and the relationship Representation vector, entity type representation matrix, relation type representation matrix, word representation matrix, and weight vector. In this way, in the subsequent representation learning process, the type representation vector of the relationship, the type representation vector of the entity, the context representation vector of the relationship, and the context representation vector of the entity can be determined.
  • determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text includes: determining the type identification vector of the head entity according to the type of the head entity; Determine the type representation vector of the head entity; where f 1 (h) represents the type representation vector of the head entity, W etype represents the entity type representation matrix, and v etype (h) represents the type identification vector of the head entity; according to the type of the tail entity, Determine the type identification vector of the tail entity; according to the formula Determine the type of the tail entity to represent the vector; where f 1 (t) represents the type of the tail entity represents the vector, and v etype (t) represents the type identification vector of the tail entity.
  • determining the type of the relationship to represent the vector according to the type of the relationship in the triple includes: determining the type identification vector of the relationship according to the type of the relationship; and according to the formula: Determine the type of the relationship to represent the vector; where g 1 (r) represents the type of the relationship represents the vector, W rtype represents the relationship type represents the matrix, and v rtype (r) represents the type identification vector of the relationship.
  • determining an entity's contextual representation vector according to the entity's text information includes: determining the words related to the head entity based on the text information of the head entity; Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, ⁇ , ⁇ are constants between 0 and 1, v h represents the vector of representation of the head entity, and w i represents Words related to the head entity, ⁇ 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i ; according to the text information of the tail entity, determine the Tail entity related words; according to the formula: Determine the context representation vector of the tail entity; where f 2 (t) represents the context representation vector of the tail entity, v t represents the vector of the tail entity, mi represents the words related to the tail entity, and ⁇ 2 represents all the tail entity related. V vocubalary (m i )
  • determining the contextual representation vector of the relationship according to the weight value of the relationship includes: according to the formula: Determine the context representation vector of the relationship; where g 2 (r) represents the context representation vector of the relationship, v r represents the vector of the relationship, n i represents the weight value of the relationship, and ⁇ 3 represents the set of weight values of the relationship.
  • the scoring function for the triples is: among them, Represents compound operations, f 1 (h) represents the type of the head entity represents a vector, g 1 (r) represents the type of a relation represents a vector, f 1 (t) represents the type of a tail entity represents a vector, and f 2 (h) represents a head entity
  • the context represents the vector of, g 2 (r) represents the context of the relationship, and f 2 (t) represents the context of the tail entity.
  • (h, r, t) represents a positive example triplet
  • represents a positive example triplet set
  • (h ′, r, t ′) represents a negative example triplet
  • h ′ represents a head entity of a negative example
  • t ′ represents the tail entity of the negative example
  • ⁇ ′ represents the triple set of negative examples
  • M is a constant.
  • the method before determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge map of the fused text, the method further includes: obtaining an initial knowledge map; constructing a fusion based on a framework of the initial knowledge map
  • the framework of textual knowledge graphs; the framework of textual knowledge graphs defines at least the following: extended attributes of entities, extended attributes of relations, and extended relations between entities; extended attributes of entities include textual information of entities; according to the initial knowledge graph Obtain external data from the entity or relationship information in the external data; from the external data, determine the extended attribute value of the entity and the extended attribute value of the relationship to build a knowledge map of the fused text.
  • the server constructs a knowledge graph of the fused text by extending the framework of the initial knowledge graph and supplementing the extended attribute values of the related entities and the extended attribute values of the relationship. In this way, compared with the original knowledge map, the knowledge map of the fused text is more complete in content.
  • a representation learning device including: a type representation module for determining a type representation vector of an entity according to a type of an entity in a triple of a knowledge graph of a fused text, and the entity includes a head entity and a tail entity; The type of the relationship in the triple, determines the type of the relationship to represent the vector.
  • the context representation module is used to determine the context representation vector of the entity according to the text information of the entity; and determine the context representation vector of the relationship according to the weight value of the relationship.
  • a processing module for constructing a scoring function of a triple based on the type representation vector of the entity, the context representation vector of the entity, the type representation vector of the relationship, and the context representation vector of the relationship; constructing an objective function based on the scoring function of the triple Minimize the objective function and learn the representation vector of the entity and the representation vector of the relationship.
  • the processing module is further used to initialize the representation vector of the head entity, the representation vector of the tail entity, the relationship vector, the entity type representation matrix, the relationship type representation matrix, the word representation matrix, and the weight vector.
  • the type representation module is used to determine the type identification vector of the head entity according to the type of the head entity; according to the formula Determine the type representation vector of the head entity; where f 1 (h) represents the type representation vector of the head entity, W etype represents the entity type representation matrix, and v etype (h) represents the type identification vector of the head entity; according to the type of the tail entity, Determine the type identification vector of the tail entity; according to the formula Determine the type of the tail entity to represent the vector; where f 1 (t) represents the type of the tail entity represents the vector, and v etype (t) represents the type identification vector of the tail entity.
  • the type representation module is used to determine the type identification vector of the relationship according to the type of the relationship; according to the formula: Determine the type of the relationship to represent the vector; where g 1 (r) represents the type of the relationship represents the vector, W rtype represents the relationship type represents the matrix, and v rtype (r) represents the type identification vector of the relationship.
  • the context representation module is used to determine the words related to the head entity based on the text information of the head entity; according to the formula: Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, ⁇ , ⁇ are constants between 0 and 1, v h represents the vector of representation of the head entity, and w i represents Words related to the head entity, ⁇ 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i ; according to the text information of the tail entity, determine the Tail entity related words; according to the formula: Determine the context representation vector of the tail entity; where f 2 (t) represents the context representation vector of the tail entity, v t represents the vector of the tail entity, mi represents the words related to the tail entity, and ⁇ 2 represents all the tail entity related. V vocubalary (m i ) represents the identity vector of mi
  • the context representation module is used according to the formula: Determine the context representation vector of the relationship; where g 2 (r) represents the context representation vector of the relationship, v r represents the vector of the relationship, n i represents the weight value of the relationship, and ⁇ 3 represents the set of weight values of the relationship.
  • the scoring function for the triples is: among them, Represents compound operations, f 1 (h) represents the type of the head entity represents a vector, g 1 (r) represents the type of a relation represents a vector, f 1 (t) represents the type of a tail entity represents a vector, and f 2 (h) represents a head entity
  • the context represents the vector of, g 2 (r) represents the context of the relationship, and f 2 (t) represents the context of the tail entity.
  • (h, r, t) represents a positive example triplet
  • represents a positive example triplet set
  • (h ′, r, t ′) represents a negative example triplet
  • h ′ represents a head entity of a negative example
  • t ′ represents the tail entity of the negative example
  • ⁇ ′ represents the set of triples of the negative example
  • M is a constant.
  • the representation learning device further includes: a framework extension module, a data acquisition module, and an extension mapping module.
  • Framework extension module used to obtain the initial knowledge map; based on the framework of the initial knowledge map, construct a framework of the text knowledge map; the framework of the text knowledge map defines at least the following: extended attributes of entities, extended attributes of relationships, and entities The extended relationship between them; the extended attributes of the entity include the textual information of the entity.
  • a data acquisition module is configured to acquire external data according to entity information or relationship information in the initial knowledge map.
  • the extended mapping module is used to determine the extended attribute value of the entity and the extended attribute value of the relationship from the external data to construct a knowledge map of the fused text.
  • a server including: a processor, a memory, a bus, and a communication interface; the memory is configured to store a computer to execute instructions, the processor is connected to the memory through the bus, and when the server runs When the processor executes the computer execution instructions stored in the memory, so that the server executes the representation learning method according to any one of the first aspects.
  • a computer-readable storage medium stores instructions that, when run on a computer, enable the computer to perform the representation learning according to any one of the first aspects. method.
  • a computer program product containing instructions, which, when run on a computer, enables the computer to execute the representation learning method according to any one of the first aspects.
  • a chip system includes a processor for supporting a server to implement the functions involved in the first aspect.
  • the chip system further includes a memory, and the memory is configured to store program instructions and data necessary for the server.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • the technical effects brought by any one of the design methods in the second aspect to the sixth aspect may refer to the technical effects brought by the different design methods in the first aspect, and are not repeated here.
  • FIG. 1 is a schematic diagram of a communication system according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for constructing a knowledge map according to an embodiment of the present application
  • FIG. 4 is a schematic diagram of an initial knowledge map provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a fused text knowledge map according to an embodiment of the present application.
  • FIG. 6 is a flowchart illustrating a learning method according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a learning device according to an embodiment of the present application.
  • the knowledge map is a symbolic expression of the objective world.
  • the knowledge graph itself is a networked knowledge base where entities with attributes are connected through relationships. From the perspective of the graph, the knowledge graph is essentially a network, where nodes represent entities (or concepts) in the objective world, and edges represent various relationships or attributes of entities.
  • entity refers to specific things that are distinguishable and exist independently. For example, “apple”, “banana”, etc. can all be entities.
  • Concept refers to the conceptual representation of objective things that people form in the process of cognizing the world, such as people, animals, and plants.
  • a concept can be understood as a collection of entities with the same characteristics.
  • Relations are used to describe objectively existing associations between entities and concepts.
  • the relationship between the entities may be an include relationship, a subordinate relationship, or the like.
  • a mobile phone contains a camera, that is, an inclusion relationship exists between the mobile phone and the camera.
  • a property is a characterization of an abstract aspect of an object. It is worth noting that an entity (or concept) generally has many properties and relationships. These properties and relationships can be referred to as the attributes of the entity (or concept). For example, if the entity is Beijing, the attributes of Beijing include population, area, and so on.
  • the attribute value is the value of the specified attribute of the object.
  • China's area is 9.6 million square kilometers
  • 9.6 million square kilometers is the value of the area attribute.
  • a triple is a universal representation of a knowledge graph.
  • the basic forms of triples include (first entity-relation-tail entity) and (concept-attribute-attribute value).
  • China-Capital-Beijing is an example of a triad (head entity-relation-tail entity), where China is the head entity, Beijing is the tail entity, and the capital is the relationship between China and Beijing.
  • Beijing-population-2069.3 million constitutes an example of (concept-attribute-attribute value) triples, in which population is an attribute and 20.693,000 is the attribute value.
  • the triples refer to the basic form of (first entity-relationship-tail entity).
  • the schema of the knowledge graph is a specification for modeling concepts, an abstract model describing the objective world, and a clear definition of concepts and their relationships in a formal way. Understandably, the schema defines the data model in the knowledge graph. Specifically, the schema defines the types of entities and the types of relationships.
  • FIG. 1 shows a communication system to which the technical solution provided in this application is applicable.
  • the communication system includes a server 10 and a terminal device 20.
  • the server 10 and the terminal device 20 communicate through a wireless network or a wired network.
  • the terminal device 20 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like.
  • the terminal device 20 may be installed with a client having functions of intelligent search, intelligent question answering, and the like.
  • the server 10 is configured to provide services such as intelligent search, intelligent question answering, and the like for the terminal device.
  • the server 10 includes a frame extension unit, a data acquisition unit, an extension mapping unit, a feature calculation unit, and a storage unit.
  • the frame expansion unit is configured to construct a frame of a knowledge graph fused with text according to a frame of an initial knowledge graph.
  • the data obtaining unit is configured to obtain external data from the Internet according to entity information or relationship information in the initial knowledge map.
  • the extended mapping unit is configured to generate an extended attribute value of an entity and an extended attribute value of a relationship from external data; add the extended attribute value of the entity and the extended attribute value of the relationship to a knowledge map of the fused text, Construct a knowledge map of the fused text.
  • the feature calculation unit is configured to determine a representation vector of an entity and a representation vector of a relationship in a knowledge map of the fused text.
  • the storage unit is configured to store related data of the constructed knowledge map of the fused text.
  • FIG. 2 is a schematic diagram of a hardware structure of a server according to an embodiment of the present application.
  • the server includes at least one processor 101, a communication line 102, a memory 103, and at least one communication interface 104.
  • the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more processors for controlling the execution of the program of the solution of the present application. integrated circuit.
  • CPU central processing unit
  • ASIC application-specific integrated circuit
  • the communication line 102 may include a path for transmitting information between the aforementioned components.
  • the communication interface 104 uses any device such as a transceiver to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area networks (WLAN), etc. .
  • RAN radio access network
  • WLAN wireless local area networks
  • the memory 103 may be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (random access memory, RAM), or other types that can store information and instructions
  • the dynamic storage device can also be electrically erasable programmable read-only memory (EEPROM-ready-only memory (EEPROM)), compact disc (read-only memory (CD-ROM)) or other optical disk storage, optical disk storage (Including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or can be used to carry or store desired program code in the form of instructions or data structures and can be used by a computer Any other media accessed, but not limited to this.
  • the memory 103 may exist independently, and is connected to the processor through the communication line 102.
  • the memory 103 may also be integrated with the processor 101.
  • the memory 103 is configured to store computer-executable instructions for executing the solution of the present application.
  • the processor 101 is configured to execute computer-executable instructions stored in the memory 103, so as to implement the technical solution provided by the following embodiments of the present application.
  • the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
  • the processor 101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 2.
  • the server may include multiple processors, such as the processor 101 and the processor 107 in FIG. 2. Each of these processors may be a single-CPU processor or a multi-CPU processor.
  • a processor herein may refer to one or more devices, circuits, and / or processing cores for processing data (such as computer program instructions).
  • a method for constructing a knowledge map includes the following steps:
  • the server obtains an initial knowledge map.
  • the initial knowledge map is obtained by the server from the Internet; or, the initial knowledge map is manually entered into the server.
  • the server constructs a framework of the text knowledge map.
  • the framework of the text-fused knowledge graph further includes at least: extended attributes of entities, extended attributes of relationships, and extended relations between entities.
  • the extended attributes of entities are the attributes of undefined entities in the framework of the initial knowledge graph. For example, for the Beijing entity, there are only two attributes defined in the framework of the initial knowledge map: area and population, while the framework of the fused knowledge map also defines another attribute: latitude and longitude. In this way, latitude and longitude are the extended attributes of Beijing.
  • the extended attributes of the entity may be determined according to the following manner: the server extracts the entity name information from the existing text information or external text information using a text mining algorithm according to the entity name information in the initial knowledge map Related high-frequency words, combined with part-of-speech filtering techniques, form extended attributes of entities.
  • the text mining algorithm includes a topic model, core word extraction, and named entity recognition.
  • the extended attributes of the entity include at least: text information of the entity. That is, the knowledge graph of the fused text is a knowledge graph having at least the attribute of the text information of the entity.
  • the extended attributes of relationships are the attributes of undefined relationships in the framework of the initial knowledge graph.
  • the extended attributes of the relationship include: the type of the head entity, the type of the tail entity, and the like.
  • the extended attributes of the relationship are manually defined by an expert.
  • the extended relationship between entities is the relationship between undefined entities in the framework of the initial knowledge graph.
  • the extended relationship between entities includes a distance relationship, a proximity relationship, and the like.
  • the extended relationship may be obtained from a relational database, which includes relationships between various entities. It should be noted that in order to ensure the rationality of the extended relationship between entities, after determining the extended relationship between the entities, the server uses the relationship between the entities specified in the knowledge map of the current fused text as training data, and uses weak supervision Algorithm and reinforcement learning algorithm to verify the rationality of the extended relationship, thereby removing the unreasonable extended relationship.
  • the server obtains external data according to entity information or relationship information in the knowledge map of the fused text.
  • the information of the entity includes a name of the entity.
  • the information of the relationship includes the name of the relationship.
  • the external data includes information of extended attributes of entities or relationships.
  • the external data may be structured data, semi-structured data, or unstructured data. If the external data is unstructured data, the external data may be text information or multimedia information, and the multimedia information includes videos, pictures, and web pages.
  • the server obtains external data from the Internet by using a crawler or other technology according to entity information or relationship information in the fused text knowledge map.
  • the server directly extracts external data from an encyclopedia website (such as Baidu Encyclopedia, Wikipedia) or a vertical website (such as an electronic product website, a book website, a movie website, or a music website).
  • encyclopedia websites and vertical websites include a lot of entity attribute information, for example, the book website includes information such as the author, publisher, and publishing time of the book
  • the server can generate a certain wrapper (or template) by generating certain rules. Use a wrapper to extract external data that contains attribute information.
  • the method of generating a wrapper can be divided into: a manual method (that is, writing a wrapper manually), a supervised method, a semi-supervised method, and an unsupervised method.
  • the server determines an extended attribute value of the entity and an extended attribute value of the relationship from external data to construct a knowledge map of the fused text.
  • the server extracts the extended attribute value of the entity or the extended attribute value of the relationship from the external data by using a manually defined or automatically generated matching mode.
  • the server uses data mining methods to mine the relationship mode between the attribute and the attribute value from the text information, so as to realize the attribute name and attribute. Positioning of the value in the text. It can be understood that in a real language environment, there are some keywords (such as attribute names) used to limit and define the meaning of the attribute value near many attribute values, so these keywords can be used to locate the attribute value.
  • the server adds the extended attribute value of the entity and the extended attribute value of the relationship to the knowledge graph of the fused text to complete the creation of the knowledge graph of the fused text.
  • FIG. 4 shows a schematic diagram of an initial knowledge map
  • FIG. 5 shows a schematic diagram of a knowledge map that fuses text.
  • the framework of the initial knowledge graph defines two entity types: products and components.
  • the products are: Huawei P10 and Huawei P8, the parts are: camera, lens 1, lens 2 and lens.
  • Lens 1, lens 2 and lens have two attributes: sensor and pixel.
  • the knowledge map of the fused text shown in FIG. 5 is obtained based on the expansion of the initial knowledge map shown in FIG. 4.
  • Huawei P10 has extended attributes: theme and frequency. Containment relationships have extended attributes: htype, hr frequency, ttype, and rt frequency.
  • htype indicates the type of the head entity
  • hr frequency indicates the frequency of the head entity and the relationship
  • ttype indicates the type of the tail entity
  • rt frequency indicates the frequency of the relationship and the tail entity.
  • the method for constructing a knowledge graph constructs a knowledge graph of the fused text by extending the framework of the initial knowledge graph and supplementing the extended attribute values of the related entities and the extended attribute values of the relationship. In this way, compared with the original knowledge map, the knowledge map of the fused text is more complete in content.
  • FIG. 6 shows a flowchart of a learning method according to an embodiment of the present application. The method includes the following steps:
  • the server initializes the representation vector of the head entity, the representation vector of the tail entity, the representation vector of the relationship, the entity type representation matrix, the relationship type representation matrix, the word representation matrix, and the weight value in the triple of the knowledge graph of the fused text. Represents a vector.
  • the server uses methods such as uniform distribution initialization and Bernoulli distribution initialization to initialize the representation vector of the head entity, the representation vector of the tail entity, and the relationship in the triple of the knowledge graph of the fused text.
  • a representation vector, the entity type representation matrix, the relation type representation matrix, the word representation matrix, and a representation vector of the weight value are examples of uniform distribution initialization and Bernoulli distribution initialization.
  • the dimensions of the representation vector of the head entity, the dimensions of the representation vector of the tail entity, the dimensions of the relationship representation vector, and the weight value of the representation vector are all preset. Fixed. Furthermore, the dimensions of the representation vector of the head entity, the dimensions of the representation vector of the tail entity, the dimensions of the relationship representation vector, and the weight value of the representation vector are equal.
  • the entity type representation matrix is used to map the type identification vector of the entity to the type representation vector of the entity.
  • the type identification vector of the entity is used to directly characterize the type to which the entity belongs.
  • the type representation vector of the entity is used to indirectly characterize the type to which the entity belongs.
  • the number of rows of the entity type representation matrix is equal to the number of dimensions of the entity type representation vector.
  • the number of columns of the entity type representation matrix is equal to the number of dimensions of the type identification vector of the entity.
  • the dimension of the type representation vector of the entity is equal to the dimension of the representation vector of the entity.
  • the dimension of the type identification vector of the entity is equal to the total number of types of the entity.
  • Each dimension of the entity type identification vector corresponds to a type of the entity, and each dimension of the entity type identification vector has a value of 0 or 1. If the dimension of the entity type identification vector is 1, it indicates that the entity belongs to the type corresponding to the dimension.
  • the framework of the knowledge graph defines the types of entities including type 1, type 2, type 3, and type 4. If the type to which entity A belongs is type 1 and type 4, the type identification vector of entity A is (1,0 , 0,1).
  • the relationship type representation matrix is used to map the type identification vector of the relationship to the type representation vector of the relationship.
  • the type identification vector of the relationship is used to directly characterize the type to which the relationship belongs.
  • the type representation vector of the relationship is used to indirectly characterize the type to which the relationship belongs.
  • the number of rows of the relationship type representation matrix is equal to the number of dimensions of the relationship type representation vector.
  • the relationship type indicates that the number of columns of the matrix is equal to the dimension of the type identification vector of the relationship.
  • the dimension of the type-representation vector of the relation is equal to the dimension of the relation-representation vector.
  • the dimension of the type identification vector of the relationship is equal to the total number of types of the relationship.
  • Each dimension of the type identification vector of the relationship corresponds to a type of the entity, and each dimension of the type identification vector of the entity has a value of 0 or 1. If the value of the dimension identification vector of the relationship is 1, the relationship belongs to the corresponding type of the dimension; if the value of the dimension identification vector of the relationship is 0, the relationship does not belong to the corresponding type of the dimension.
  • the word representation matrix is used to map an identification vector of a word to a representation vector of a word.
  • the identification vector of the word is used to directly characterize the position of the word in the vocabulary.
  • the word type representation vector is used to indirectly characterize the position of the word in the vocabulary.
  • the vocabulary contains all entity-related words in the knowledge graph.
  • the number of rows of the word representation matrix is equal to the number of dimensions of the word representation vector.
  • the number of columns of the word representation matrix is equal to the dimension of the identification vector of the word.
  • the dimension of the representation vector of the word is equal to the dimension of the representation vector of the entity.
  • the dimension of the identification vector of the word is equal to the total number of words in the vocabulary.
  • Each dimension of the word's identification vector corresponds to a position in the vocabulary.
  • the value of the word's identification vector is 0 or 1. If a dimension of the word's identification vector is 0, it means that the word is not in the position of the vocabulary corresponding to that dimension; if a dimension of the word's identification vector is 1, it means that the word's corresponding vocabulary in that dimension position.
  • the representation vector of the weight value is used to represent the weight value of the relationship.
  • the weight value of the relationship is used to explain the degree of correlation between the two entities to which the relationship is connected.
  • the server determines a type representation vector of the entity according to the type of the entity in the triple of the knowledge map of the fused text.
  • the entities include a head entity and a tail entity.
  • the type of the entity is defined by a framework of a knowledge graph of fused text. And, the type of the entity is not unique. In other words, an entity can correspond to multiple types. Exemplarily, it is assumed that the entity is a mobile phone, and the mobile phone may be an electronic product or a communication tool. Here, the electronic product or communication tool is the type to which the mobile phone belongs.
  • the server first determines the type identification vector of the head entity based on the type of the head entity; then, the server then determines the vector based on the formula A type representation vector of the head entity is determined.
  • f 1 (h) represents a type representation vector of the head entity
  • W etype represents the entity type representation matrix
  • v etype (h) represents a type identification vector of the head entity
  • ⁇ ⁇ represents a second norm.
  • the server first determines the type identification vector of the tail entity according to the type of the tail entity; then, the server then determines the vector based on the formula A type representation vector of the tail entity is determined.
  • f 1 (t) represents a type representation vector of the tail entity
  • v etype (t) represents a type identification vector of the tail entity.
  • the server determines a type representation vector of the relationship according to the type of the relationship in the triple.
  • the type of the relationship is defined by a framework of a knowledge graph of fused text.
  • the type of the relationship includes an include relationship, a subordinate relationship, a side-by-side relationship, and the like.
  • the server may first determine the type identification vector of the relationship according to the type of the relationship.
  • the server then according to the formula: Determine the type representation vector of the relationship; where g 1 (r) represents the type representation vector of the relationship, W rtype represents the relationship type representation matrix, and v rtype (r) represents the type identification vector of the relationship.
  • the server determines a context representation vector of the entity according to the text information of the entity.
  • the context representation vector of the entity is used to characterize the context features of the entity.
  • the text information is stored in advance by the server.
  • the server determines words related to the head entity according to the text information of the head entity; then, the server according to the formula: Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, ⁇ , ⁇ are constants between 0 and 1, and v h represents the head entity's Represents a vector, w i represents a word related to the head entity, ⁇ 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i .
  • the server determines words related to the tail entity according to the text information of the tail entity; then, the server according to the formula: Determining a context representation vector of the tail entity; wherein f 2 (t) represents a context representation vector of the tail entity, v t represents a representation vector of the tail entity, and mi represents a word related to the tail entity, ⁇ 2 represents a set of all words related to the tail entity, and V vocubalary (m i ) represents an identification vector of mi .
  • the server may determine the words related to the entity in the following implementation manner: the server selects a text sequence within a certain distance from the entity name from the text information, and uses word segmentation technology to divide the text sequence into individual Words, these individual words are words related to the entity.
  • the above word segmentation technology may be a word segmentation technology based on string matching, a word segmentation technology based on understanding, and a word segmentation technology based on statistics.
  • word segmentation technology based on string matching
  • word segmentation technology based on understanding e.g., a word segmentation technology based on understanding
  • word segmentation technology based on statistics e.g., a word segmentation technology based on statistics.
  • the server determines a context representation vector of the relationship according to a weight value of the relationship.
  • the context representation vector of the relationship is used to represent the weight feature of the relationship.
  • the server uses the formula: Determine the contextual representation vector of the relationship.
  • g 2 (r) represents a context representation vector of the relationship
  • v r represents a representation vector of the relationship
  • n i represents a weight value of the relationship
  • ⁇ 3 represents a set of ownership weight values of the relationship.
  • the server constructs a scoring function for the triplet according to a type representation vector of the entity, a context representation vector of the entity, a type representation vector of the relationship, and a context representation vector of the relationship.
  • the scoring function of the triples is:
  • Bitwise multiplication refers to multiplying the value of each dimension of the first vector by the value of the corresponding dimension of the second vector to generate the value of the corresponding dimension of the third vector.
  • the server constructs an objective function according to the scoring function of the triple.
  • the objective function is:
  • (h, r, t) represents a positive example triplet
  • represents a positive example triplet set
  • (h ′, r, t ′) represents a negative example triplet
  • h ′ represents a negative example head entity
  • t ′ Represents a negative example tail entity
  • ⁇ ′ represents a negative example triple set
  • M is a constant.
  • the positive triples are triples that exist in the knowledge graph of the fused text
  • the negative triples are triples that do not exist in the knowledge graph of the fused text.
  • the negative triples are obtained by randomly replacing the head or tail entities with the positive triples.
  • the negative triples are not included in the negative triples.
  • the server minimizes the objective function and learns a representation vector of the entity and a representation vector of the relationship.
  • the server uses a gradient descent algorithm to iteratively update the representation vector of the head entity, the representation vector of the tail entity, the representation vector of the relationship, the entity type representation matrix, and the relationship.
  • the type and weight value of the relationship represent certain deep-level semantic information, therefore, by considering the type and context of the entity and the type and weight value of the relationship, The identified vector of the entity and the vector of the relation can describe the deep-level semantic information in the knowledge map, and improve the accuracy of the representation learning.
  • the server includes a hardware structure and / or a software module corresponding to each function.
  • the server includes a hardware structure and / or a software module corresponding to each function.
  • the server may be divided according to the foregoing method example.
  • each module or unit may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above integrated modules may be implemented in the form of hardware, or in the form of software modules or units.
  • the division of modules or units in the embodiments of the present application is schematic, and is only a logical function division. In actual implementation, there may be another division manner.
  • FIG. 7 illustrates a possible structural diagram of a learning device according to the foregoing embodiment.
  • the representation learning device includes a type representation module 701, a context representation module 702, a processing module 703, a framework extension module 704, a data acquisition module 705, and an extension mapping module 706.
  • the type indicating module 701 is configured to support the server to perform steps S202 and S203 in FIG. 6.
  • the context representation module 702 is configured to support the server to perform steps S204 and S205 in FIG. 6.
  • the processing module 703 is configured to support the server to execute steps S201, S206, S207, and S208 in FIG.
  • the frame expansion module 704 is configured to support the server to perform steps S101 and S102 in FIG. 3.
  • the data acquisition module 705 is configured to support the server to execute step S103 in FIG. 3.
  • the extended mapping module 706 is configured to support the server to execute step S104 in FIG. 3.
  • the presentation learning device is presented in the form of dividing each functional module corresponding to each function, or the presentation learning device is presented in the form of dividing each functional module in an integrated manner.
  • the “module” here may include application-specific integrated circuits (ASICs), circuits, processors and memories executing one or more software or firmware programs, integrated logic circuits, or other devices that can provide the above functions .
  • ASICs application-specific integrated circuits
  • the representation learning apparatus may be implemented by using the server shown in FIG. 2.
  • the type representation module 701, context representation module 702, processing module 703, framework extension module 704, and extension mapping module 706 in FIG. 7 may be implemented by the processor 101 in FIG. 2, and the data acquisition module 705 in FIG. 7 may It is implemented by the communication interface 104 in FIG. 2.
  • the embodiment of the present application does not make any limitation on this.
  • the embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores instructions; when the computer-readable storage medium runs on the server shown in FIG. 2, The server executes the representation learning method shown in FIG. 3 or FIG. 6.
  • the embodiment of the present application further provides a computer program product containing instructions, which when run on a computer enables the computer to execute the representation learning method shown in FIG. 3 or FIG. 6.
  • an embodiment of the present application provides a chip system including a processor, which is configured to support a server to implement the representation learning method shown in FIG. 3 or FIG. 6.
  • the chip system further includes a memory. This memory is used to store the necessary program instructions and data of the receiver.
  • the memory may not be in the chip system.
  • the chip system may be composed of a chip, and may also include a chip and other discrete devices, which are not specifically limited in the embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be from a website site, a computer, a server, or a data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device including one or more servers, data centers, and the like that can be integrated with the medium.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A representation learning method and device, relating to the technical field of big data, and used for solving the problem of current representation learning methods being unable to portray deep semantic information in a knowledge graph. The method comprises: determining, according to the type of entity in a triplet of a knowledge graph which fuses texts, a type representation vector of the entity (S202); determining, according to the type of a relationship in the triplet, a type representation vector of the relationship (S203); determining a context representation vector of the entity according to text information of the entity (S204); determining a context representation vector of the relationship according to a weight value of the relationship (S205); constructing a scoring function of the triplet according to the type representation vector of the entity, the context representation vector of the entity, the type representation vector of the relationship and the context representation vector of the relationship (S206); constructing a target function according to the scoring function of the triplet (S207); and minimizing the target function, and learning the representation vector of the entity and the representation vector of the relationship (S208).

Description

表示学习方法及装置Representation learning method and device
本申请要求于2018年7月24日提交国家知识产权局、申请号为201810822334.5、发明名称为“表示学习方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority from a Chinese patent application filed with the State Intellectual Property Office on July 24, 2018, with application number 201810822334.5, and the invention name is "indicating learning methods and devices", the entire contents of which are incorporated herein by reference.
技术领域Technical field
本申请涉及大数据技术领域,尤其涉及表示学习方法及装置。The present application relates to the field of big data technology, and in particular, to a method and a device for representation learning.
背景技术Background technique
知识图谱(Knowledge Graph)以结构化的方式描述客观世界中概念、实体及它们之间的关系,将互联网的信息表达成更接近人类认知世界的形式,提供了一种更好地组织、管理和理解互联网海量信息的能力。随着人工智能的技术发展和应用,知识图谱逐渐成为关键技术之一,现已被广泛应用于智能搜索、智能问答、个性化推荐、内容分发等领域。Knowledge Graph describes structured concepts, entities and their relationships in a structured way, expressing information on the Internet into a form closer to the human cognitive world, providing a better organization and management And the ability to understand the vast amount of information on the Internet. With the development and application of artificial intelligence technology, knowledge map has gradually become one of the key technologies, and has been widely used in intelligent search, intelligent question answering, personalized recommendation, content distribution and other fields.
由于知识图谱中的实体、概念以及关系均采用了离散的符号化表示,这些离散的符号化表示难以直接应用于计算或推理等方面。因此,为了有效利用知识图谱中的符号化知识,研究人员提出了知识图谱的表示学习方法。知识图谱的表示学习方法旨在将知识图谱中的实体和关系都表示成低维向量空间中的向量,从而将实体和关系之间的计算转化为向量间的数值计算。Because the entities, concepts, and relationships in the knowledge graph use discrete symbolic representations, these discrete symbolic representations are difficult to apply directly to calculations or inferences. Therefore, in order to effectively use the symbolized knowledge in the knowledge graph, researchers have proposed a representation learning method for the knowledge graph. Knowledge graph representation learning method aims to represent the entities and relationships in the knowledge graph as vectors in a low-dimensional vector space, thereby transforming the calculation between entities and relationships into numerical calculations between vectors.
当前的表示学习方法不能刻画知识图谱中深层次的语义信息,例如“公司A-吞了-公司B”和“蟒蛇-吞了-兔子”,两个“吞了”具有不同的含义,但是在当前的表示学习中会表示成同一个向量,导致出现错误。Current representation learning methods cannot describe the deep-seated semantic information in the knowledge graph, such as "company A-swallow-company B" and "python-swallow-rabbit". The two "swallow" have different meanings, but The current representation learning will be represented as the same vector, resulting in errors.
发明内容Summary of the Invention
本申请提供一种表示学习方法及装置,用于刻画知识图谱中深层次的语义信息,提高表示学习的准确性。The present application provides a representation learning method and device, which are used to characterize deep-level semantic information in a knowledge map and improve the accuracy of representation learning.
为达到上述目的,本申请采用如下技术方案:In order to achieve the above purpose, this application uses the following technical solutions:
第一方面,提供一种表示学习方法,包括:根据融合文本的知识图谱的三元组中实体的类型,确定实体的类型表示向量,该实体包括头实体和尾实体;根据三元组中关系的类型,确定关系的类型表示向量;根据实体的文本信息,确定实体的上下文表示向量;根据关系的权重值,确定关系的上下文表示向量;根据实体的类型表示向量、实体的上下文表示向量、关系的类型表示向量以及关系的上下文表示向量,构建三元组的打分函数;根据三元组的打分函数,构建目标函数;最小化目标函数,学习实体的表示向量和关系的表示向量。基于该技术方案,由于实体的类型和上下文,关系的类型和权重值均代表一定的深层次语义信息,因此通过考虑实体的类型和上下文,以及关系的类型和权重值,使得确定出来的实体的表示向量以及关系的表示向量能够刻画出知识图谱中深层次的语义信息,提高表示学习的准确性。According to a first aspect, a representation learning method is provided, including: determining a type representation vector of an entity according to a type of an entity in a triple of a knowledge graph of a fused text, the entity including a head entity and a tail entity; and according to a relationship in the triple The type of the relationship determines the type representation vector of the relationship; the context representation vector of the entity is determined based on the textual information of the entity; the context representation vector of the relationship is determined based on the weight value of the relationship; the vector is represented by the type of the entity, the vector of the context representation of the entity, and the relationship The type represents the vector and the context representation vector of the relationship to construct the scoring function of the triplet; the objective function is constructed based on the scoring function of the triplet; the objective function is minimized to learn the representation vector of the entity and the representation vector of the relationship. Based on this technical solution, because the type and context of the entity, the type and weight value of the relationship represent a certain level of deep semantic information, the type and weight of the entity are considered by considering the type and context of the entity, as well as the type and weight value of the relationship. Representation vectors and relational representation vectors can depict deep-level semantic information in the knowledge map and improve the accuracy of representation learning.
一种可能的设计中,在根据融合文本的知识图谱的三元组中实体的类型,确定实体的类型表示向量之前,该方法还包括:初始化头实体的表示向量、尾实体的表示向量、关系的表示向量、实体类型表示矩阵、关系类型表示矩阵、词表示矩阵以及权重值的表示向量。 这样一来,在后续的表示学习过程中,能够确定关系的类型表示向量、实体的类型表示向量、关系的上下文表示向量以及实体的上下文表示向量。In a possible design, before determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text, the method further includes: initializing the representation vector of the head entity, the representation vector of the tail entity, and the relationship Representation vector, entity type representation matrix, relation type representation matrix, word representation matrix, and weight vector. In this way, in the subsequent representation learning process, the type representation vector of the relationship, the type representation vector of the entity, the context representation vector of the relationship, and the context representation vector of the entity can be determined.
一种可能的设计中,根据融合文本的知识图谱的三元组中实体的类型,确定实体的类型表示向量,包括:根据头实体的类型,确定头实体的类型标识向量;根据公式
Figure PCTCN2019096895-appb-000001
确定头实体的类型表示向量;其中,f 1(h)表示头实体的类型表示向量,W etype表示实体类型表示矩阵,v etype(h)表示头实体的类型标识向量;根据尾实体的类型,确定尾实体的类型标识向量;根据公式
Figure PCTCN2019096895-appb-000002
确定尾实体的类型表示向量;其中,f 1(t)表示尾实体的类型表示向量,v etype(t)表示尾实体的类型标识向量。
In a possible design, determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text includes: determining the type identification vector of the head entity according to the type of the head entity;
Figure PCTCN2019096895-appb-000001
Determine the type representation vector of the head entity; where f 1 (h) represents the type representation vector of the head entity, W etype represents the entity type representation matrix, and v etype (h) represents the type identification vector of the head entity; according to the type of the tail entity, Determine the type identification vector of the tail entity; according to the formula
Figure PCTCN2019096895-appb-000002
Determine the type of the tail entity to represent the vector; where f 1 (t) represents the type of the tail entity represents the vector, and v etype (t) represents the type identification vector of the tail entity.
一种可能的设计中,根据三元组中关系的类型,确定关系的类型表示向量,包括:根据关系的类型,确定关系的类型标识向量;根据公式:
Figure PCTCN2019096895-appb-000003
确定关系的类型表示向量;其中,g 1(r)表示关系的类型表示向量,W rtype表示关系类型表示矩阵,v rtype(r)表示关系的类型标识向量。
In a possible design, determining the type of the relationship to represent the vector according to the type of the relationship in the triple, includes: determining the type identification vector of the relationship according to the type of the relationship; and according to the formula:
Figure PCTCN2019096895-appb-000003
Determine the type of the relationship to represent the vector; where g 1 (r) represents the type of the relationship represents the vector, W rtype represents the relationship type represents the matrix, and v rtype (r) represents the type identification vector of the relationship.
一种可能的设计中,根据实体的文本信息,确定实体的上下文表示向量,包括:根据头实体的文本信息,确定与头实体相关的词;根据公式:
Figure PCTCN2019096895-appb-000004
确定头实体的上下文表示向量;其中,f 2(h)表示头实体的上下文表示向量,α、β为取值在0到1之间的常数,v h表示头实体的表示向量,w i表示与头实体相关的词,ε 1表示所有与头实体相关的词构成的集合,W word表示词表示矩阵,V vocubalary(w i)表示w i的标识向量;根据尾实体的文本信息,确定与尾实体相关的词;根据公式:
Figure PCTCN2019096895-appb-000005
确定尾实体的上下文表示向量;其中,f 2(t)表示尾实体的上下文表示向量,v t表示尾实体的表示向量,m i表示与尾实体相关的词,ε 2表示所有与尾实体相关的词构成的集合,V vocubalary(m i)表示m i的标识向量。
In a possible design, determining an entity's contextual representation vector according to the entity's text information includes: determining the words related to the head entity based on the text information of the head entity;
Figure PCTCN2019096895-appb-000004
Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, α, β are constants between 0 and 1, v h represents the vector of representation of the head entity, and w i represents Words related to the head entity, ε 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i ; according to the text information of the tail entity, determine the Tail entity related words; according to the formula:
Figure PCTCN2019096895-appb-000005
Determine the context representation vector of the tail entity; where f 2 (t) represents the context representation vector of the tail entity, v t represents the vector of the tail entity, mi represents the words related to the tail entity, and ε 2 represents all the tail entity related. V vocubalary (m i ) represents the identity vector of mi .
一种可能的设计中,根据关系的权重值,确定关系的上下文表示向量,包括:根据公式:
Figure PCTCN2019096895-appb-000006
确定关系的上下文表示向量;其中,g 2(r)表示关系的上下文表示向量,v r表示关系的表示向量,n i表示关系的权重值,ε 3表示关系的所有权重值构成的集合,
Figure PCTCN2019096895-appb-000007
表示n i的表示向量。
In a possible design, determining the contextual representation vector of the relationship according to the weight value of the relationship includes: according to the formula:
Figure PCTCN2019096895-appb-000006
Determine the context representation vector of the relationship; where g 2 (r) represents the context representation vector of the relationship, v r represents the vector of the relationship, n i represents the weight value of the relationship, and ε 3 represents the set of weight values of the relationship.
Figure PCTCN2019096895-appb-000007
A representation vector representing n i .
可选的,三元组的打分函数为:
Figure PCTCN2019096895-appb-000008
Figure PCTCN2019096895-appb-000009
其中,
Figure PCTCN2019096895-appb-000010
表示复合运算,f 1(h)表示头实体的类型表示向量,g 1(r)表示关系的类型表示向量,f 1(t)表示尾实体的类型表示向量,f 2(h)表示头实体的上下文表示向量,g 2(r)表示关系的上下文表示向量,f 2(t)表示尾实体的上下文表示向量。
Optionally, the scoring function for the triples is:
Figure PCTCN2019096895-appb-000008
Figure PCTCN2019096895-appb-000009
among them,
Figure PCTCN2019096895-appb-000010
Represents compound operations, f 1 (h) represents the type of the head entity represents a vector, g 1 (r) represents the type of a relation represents a vector, f 1 (t) represents the type of a tail entity represents a vector, and f 2 (h) represents a head entity The context represents the vector of, g 2 (r) represents the context of the relationship, and f 2 (t) represents the context of the tail entity.
可选的,目标函数为:L=∑ (h,r,t)∈Δ(∑ (h′,r,t′)∈Δ′max(0,S(h,r,t)+M-S(h′,r,t′)))。其中,(h,r,t)表示正例三元组,Δ表示正例三元组集合,(h′,r,t′)表示负例三元组,h′表示负例的头实体,t′表示负例的尾实体,Δ′表示负例三元组集合,M为常数。 Optionally, the objective function is: L = ∑ (h, r, t) ∈ Δ (∑ (h ′, r, t ′) ∈ Δ ′ max (0, S (h, r, t) + MS (h ′, R, t ′))). Among them, (h, r, t) represents a positive example triplet, Δ represents a positive example triplet set, (h ′, r, t ′) represents a negative example triplet, and h ′ represents a head entity of a negative example, t ′ represents the tail entity of the negative example, Δ ′ represents the triple set of negative examples, and M is a constant.
一种可能的设计中,在根据融合文本的知识图谱的三元组中实体的类型,确定实体的类型表示向量之前,该方法还包括:获取初始知识图谱;基于初始知识图谱的框架,构建融合文本的知识图谱的框架;融合文本的知识图谱的框架至少定义以下内容:实体的扩展属性、关系的扩展属性以及实体之间的扩展关系;实体的扩展属性包括实体的文本信息;根据初始知识图谱中实体的信息或者关系的信息,获取外部数据;从外部数据中,确定实 体的扩展属性值以及关系的扩展属性值,以构建融合文本的知识图谱。基于该技术方案,服务器通过扩展初始知识图谱的框架,补充相关的实体的扩展属性值和关系的扩展属性值,构建出融合文本的知识图谱。这样一来,相比于初始知识图谱,融合文本的知识图谱在内容上更加完备。In a possible design, before determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge map of the fused text, the method further includes: obtaining an initial knowledge map; constructing a fusion based on a framework of the initial knowledge map The framework of textual knowledge graphs; the framework of textual knowledge graphs defines at least the following: extended attributes of entities, extended attributes of relations, and extended relations between entities; extended attributes of entities include textual information of entities; according to the initial knowledge graph Obtain external data from the entity or relationship information in the external data; from the external data, determine the extended attribute value of the entity and the extended attribute value of the relationship to build a knowledge map of the fused text. Based on this technical solution, the server constructs a knowledge graph of the fused text by extending the framework of the initial knowledge graph and supplementing the extended attribute values of the related entities and the extended attribute values of the relationship. In this way, compared with the original knowledge map, the knowledge map of the fused text is more complete in content.
第二方面,提供一种表示学习装置,包括:类型表示模块,用于根据融合文本的知识图谱的三元组中实体的类型,确定实体的类型表示向量,实体包括头实体和尾实体;根据三元组中关系的类型,确定关系的类型表示向量。上下文表示模块,用于根据实体的文本信息,确定实体的上下文表示向量;根据关系的权重值,确定关系的上下文表示向量。处理模块,用于根据实体的类型表示向量、实体的上下文表示向量、关系的类型表示向量以及关系的上下文表示向量,构建三元组的打分函数;根据三元组的打分函数,构建目标函数;最小化目标函数,学习实体的表示向量和关系的表示向量。In a second aspect, a representation learning device is provided, including: a type representation module for determining a type representation vector of an entity according to a type of an entity in a triple of a knowledge graph of a fused text, and the entity includes a head entity and a tail entity; The type of the relationship in the triple, determines the type of the relationship to represent the vector. The context representation module is used to determine the context representation vector of the entity according to the text information of the entity; and determine the context representation vector of the relationship according to the weight value of the relationship. A processing module for constructing a scoring function of a triple based on the type representation vector of the entity, the context representation vector of the entity, the type representation vector of the relationship, and the context representation vector of the relationship; constructing an objective function based on the scoring function of the triple Minimize the objective function and learn the representation vector of the entity and the representation vector of the relationship.
一种可能的设计中,处理模块,还用于初始化头实体的表示向量、尾实体的表示向量、关系的表示向量、实体类型表示矩阵、关系类型表示矩阵、词表示矩阵以及权重值的表示向量。In a possible design, the processing module is further used to initialize the representation vector of the head entity, the representation vector of the tail entity, the relationship vector, the entity type representation matrix, the relationship type representation matrix, the word representation matrix, and the weight vector. .
一种可能的设计中,类型表示模块,用于根据头实体的类型,确定头实体的类型标识向量;根据公式
Figure PCTCN2019096895-appb-000011
确定头实体的类型表示向量;其中,f 1(h)表示头实体的类型表示向量,W etype表示实体类型表示矩阵,v etype(h)表示头实体的类型标识向量;根据尾实体的类型,确定尾实体的类型标识向量;根据公式
Figure PCTCN2019096895-appb-000012
确定尾实体的类型表示向量;其中,f 1(t)表示尾实体的类型表示向量,v etype(t)表示尾实体的类型标识向量。
In a possible design, the type representation module is used to determine the type identification vector of the head entity according to the type of the head entity; according to the formula
Figure PCTCN2019096895-appb-000011
Determine the type representation vector of the head entity; where f 1 (h) represents the type representation vector of the head entity, W etype represents the entity type representation matrix, and v etype (h) represents the type identification vector of the head entity; according to the type of the tail entity, Determine the type identification vector of the tail entity; according to the formula
Figure PCTCN2019096895-appb-000012
Determine the type of the tail entity to represent the vector; where f 1 (t) represents the type of the tail entity represents the vector, and v etype (t) represents the type identification vector of the tail entity.
一种可能的设计中,类型表示模块,用于根据关系的类型,确定关系的类型标识向量;根据公式:
Figure PCTCN2019096895-appb-000013
确定关系的类型表示向量;其中,g 1(r)表示关系的类型表示向量,W rtype表示关系类型表示矩阵,v rtype(r)表示关系的类型标识向量。
In a possible design, the type representation module is used to determine the type identification vector of the relationship according to the type of the relationship; according to the formula:
Figure PCTCN2019096895-appb-000013
Determine the type of the relationship to represent the vector; where g 1 (r) represents the type of the relationship represents the vector, W rtype represents the relationship type represents the matrix, and v rtype (r) represents the type identification vector of the relationship.
一种可能的设计中,上下文表示模块,用于根据头实体的文本信息,确定与头实体相关的词;根据公式:
Figure PCTCN2019096895-appb-000014
确定头实体的上下文表示向量;其中,f 2(h)表示头实体的上下文表示向量,α、β为取值在0到1之间的常数,v h表示头实体的表示向量,w i表示与头实体相关的词,ε 1表示所有与头实体相关的词构成的集合,W word表示词表示矩阵,V vocubalary(w i)表示w i的标识向量;根据尾实体的文本信息,确定与尾实体相关的词;根据公式:
Figure PCTCN2019096895-appb-000015
确定尾实体的上下文表示向量;其中,f 2(t)表示尾实体的上下文表示向量,v t表示尾实体的表示向量,m i表示与尾实体相关的词,ε 2表示所有与尾实体相关的词构成的集合,V vocubalary(m i)表示m i的标识向量。
In a possible design, the context representation module is used to determine the words related to the head entity based on the text information of the head entity; according to the formula:
Figure PCTCN2019096895-appb-000014
Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, α, β are constants between 0 and 1, v h represents the vector of representation of the head entity, and w i represents Words related to the head entity, ε 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i ; according to the text information of the tail entity, determine the Tail entity related words; according to the formula:
Figure PCTCN2019096895-appb-000015
Determine the context representation vector of the tail entity; where f 2 (t) represents the context representation vector of the tail entity, v t represents the vector of the tail entity, mi represents the words related to the tail entity, and ε 2 represents all the tail entity related. V vocubalary (m i ) represents the identity vector of mi .
一种可能的设计中,上下文表示模块,用于根据公式:
Figure PCTCN2019096895-appb-000016
确定关系的上下文表示向量;其中,g 2(r)表示关系的上下文表示向量,v r表示关系的表示向量,n i表示关系的权重值,ε 3表示关系的所有权重值构成的集合,
Figure PCTCN2019096895-appb-000017
表示n i的表示向量。
In one possible design, the context representation module is used according to the formula:
Figure PCTCN2019096895-appb-000016
Determine the context representation vector of the relationship; where g 2 (r) represents the context representation vector of the relationship, v r represents the vector of the relationship, n i represents the weight value of the relationship, and ε 3 represents the set of weight values of the relationship.
Figure PCTCN2019096895-appb-000017
A representation vector representing n i .
可选的,三元组的打分函数为:
Figure PCTCN2019096895-appb-000018
Figure PCTCN2019096895-appb-000019
其中,
Figure PCTCN2019096895-appb-000020
表示复合运算,f 1(h)表示头实体的类型表示向量,g 1(r)表示关系的 类型表示向量,f 1(t)表示尾实体的类型表示向量,f 2(h)表示头实体的上下文表示向量,g 2(r)表示关系的上下文表示向量,f 2(t)表示尾实体的上下文表示向量。
Optionally, the scoring function for the triples is:
Figure PCTCN2019096895-appb-000018
Figure PCTCN2019096895-appb-000019
among them,
Figure PCTCN2019096895-appb-000020
Represents compound operations, f 1 (h) represents the type of the head entity represents a vector, g 1 (r) represents the type of a relation represents a vector, f 1 (t) represents the type of a tail entity represents a vector, and f 2 (h) represents a head entity The context represents the vector of, g 2 (r) represents the context of the relationship, and f 2 (t) represents the context of the tail entity.
可选的,目标函数为:L=∑ (h,r,t)∈Δ(∑ (h′,r,t′)∈Δ′max(0,S(h,r,t)+M-S(h′,r,t′)))。其中,(h,r,t)表示正例三元组,Δ表示正例三元组集合,(h′,r,t′)表示负例三元组,h′表示负例的头实体,t′表示负例的尾实体,Δ′表示负例三元组集合,M为常数。 Optionally, the objective function is: L = ∑ (h, r, t) ∈ Δ (∑ (h ′, r, t ′) ∈ Δ ′ max (0, S (h, r, t) + MS (h ′, R, t ′))). Among them, (h, r, t) represents a positive example triplet, Δ represents a positive example triplet set, (h ′, r, t ′) represents a negative example triplet, and h ′ represents a head entity of a negative example, t ′ represents the tail entity of the negative example, Δ ′ represents the set of triples of the negative example, and M is a constant.
一种可能的设计中,表示学习装置还包括:框架扩展模块、数据获取模块以及扩展映射模块。框架扩展模块,用于获取初始知识图谱;基于初始知识图谱的框架,构建融合文本的知识图谱的框架;融合文本的知识图谱的框架至少定义以下内容:实体的扩展属性、关系的扩展属性以及实体之间的扩展关系;实体的扩展属性包括实体的文本信息。数据获取模块,用于根据初始知识图谱中实体的信息或者关系的信息,获取外部数据。扩展映射模块,用于从外部数据中,确定实体的扩展属性值以及关系的扩展属性值,以构建融合文本的知识图谱。In a possible design, the representation learning device further includes: a framework extension module, a data acquisition module, and an extension mapping module. Framework extension module, used to obtain the initial knowledge map; based on the framework of the initial knowledge map, construct a framework of the text knowledge map; the framework of the text knowledge map defines at least the following: extended attributes of entities, extended attributes of relationships, and entities The extended relationship between them; the extended attributes of the entity include the textual information of the entity. A data acquisition module is configured to acquire external data according to entity information or relationship information in the initial knowledge map. The extended mapping module is used to determine the extended attribute value of the entity and the extended attribute value of the relationship from the external data to construct a knowledge map of the fused text.
第三方面,提供一种服务器,包括:处理器、存储器、总线和通信接口;所述存储器用于存储计算机执行指令,所述处理器与所述存储器通过所述总线连接,当所述服务器运行时,所述处理器执行所述存储器存储的所述计算机执行指令,以使所述服务器执行如上述第一方面中任一项所述的表示学习方法。According to a third aspect, a server is provided, including: a processor, a memory, a bus, and a communication interface; the memory is configured to store a computer to execute instructions, the processor is connected to the memory through the bus, and when the server runs When the processor executes the computer execution instructions stored in the memory, so that the server executes the representation learning method according to any one of the first aspects.
第四方面,提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机可以执行上述第一方面中任一项所述的表示学习方法。According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, enable the computer to perform the representation learning according to any one of the first aspects. method.
第五方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机可以执行上述第一方面中任一项所述的表示学习方法。According to a fifth aspect, a computer program product containing instructions is provided, which, when run on a computer, enables the computer to execute the representation learning method according to any one of the first aspects.
第六方面,提供了一种芯片系统,该芯片系统包括处理器,用于支持服务器实现上述第一方面中所涉及的功能。在一种可能的设计中,该芯片系统还包括存储器,该存储器,用于保存服务器必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。According to a sixth aspect, a chip system is provided. The chip system includes a processor for supporting a server to implement the functions involved in the first aspect. In a possible design, the chip system further includes a memory, and the memory is configured to store program instructions and data necessary for the server. The chip system can be composed of chips, and can also include chips and other discrete devices.
其中,第二方面至第六方面中任一种设计方式所带来的技术效果可参见第一方面中不同设计方式所带来的技术效果,此处不再赘述。The technical effects brought by any one of the design methods in the second aspect to the sixth aspect may refer to the technical effects brought by the different design methods in the first aspect, and are not repeated here.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1为本申请实施例提供的一种通信系统的示意图;FIG. 1 is a schematic diagram of a communication system according to an embodiment of the present application; FIG.
图2为本申请实施例提供的一种服务器的结构示意图;2 is a schematic structural diagram of a server according to an embodiment of the present application;
图3为本申请实施例提供的一种构建知识图谱的方法的流程图;3 is a flowchart of a method for constructing a knowledge map according to an embodiment of the present application;
图4为本申请实施例提供的一种初始知识图谱的示意图;4 is a schematic diagram of an initial knowledge map provided by an embodiment of the present application;
图5为本申请实施例提供的一种融合文本的知识图谱的示意图;FIG. 5 is a schematic diagram of a fused text knowledge map according to an embodiment of the present application; FIG.
图6为本申请实施例提供的一种表示学习方法的流程图;6 is a flowchart illustrating a learning method according to an embodiment of the present application;
图7为本申请实施例提供的一种表示学习装置的结构示意图。FIG. 7 is a schematic structural diagram of a learning device according to an embodiment of the present application.
具体实施方式detailed description
在介绍本申请实施例提供的方法之前,对本申请实施例涉及的术语进行简单介绍。Before introducing the method provided in the embodiments of the present application, the terms involved in the embodiments of the present application are briefly introduced.
知识图谱是对客观世界的一种符号表达。知识图谱本身是一个具有属性的实体通过关系连接而成的网状知识库。从图的角度来看,知识图谱在本质上是一种网络,其中的节点表示客观世界的实体(或概念),而边则表示实体间的各种关系或者实体的属性。The knowledge map is a symbolic expression of the objective world. The knowledge graph itself is a networked knowledge base where entities with attributes are connected through relationships. From the perspective of the graph, the knowledge graph is essentially a network, where nodes represent entities (or concepts) in the objective world, and edges represent various relationships or attributes of entities.
其中,实体是指具有可区别性且独立存在的具体事物。例如,“苹果”、“香蕉”等都可 以是实体。Among them, entity refers to specific things that are distinguishable and exist independently. For example, "apple", "banana", etc. can all be entities.
概念是指人们在认知世界过程中形成的对客观事物的概念化表示,例如人、动物、植物。换句话说,概念可以理解为具有同种特性的实体构成的集合。Concept refers to the conceptual representation of objective things that people form in the process of cognizing the world, such as people, animals, and plants. In other words, a concept can be understood as a collection of entities with the same characteristics.
关系用于描述实体、概念之间的客观存在的关联。示例性的,实体之间的关系可以为包含关系、上下位关系等。例如,手机包含摄像头,也即手机和摄像头之间存在包含关系。Relations are used to describe objectively existing associations between entities and concepts. Exemplarily, the relationship between the entities may be an include relationship, a subordinate relationship, or the like. For example, a mobile phone contains a camera, that is, an inclusion relationship exists between the mobile phone and the camera.
属性是一个对象的抽象方面的刻画。值得说明的是,一个实体(或概念)一般具有许多的性质与关系,这些性质与关系可以称之为实体(或概念)的属性。例如,以实体是北京为例,北京的属性包括:人口、面积等。A property is a characterization of an abstract aspect of an object. It is worth noting that an entity (or concept) generally has many properties and relationships. These properties and relationships can be referred to as the attributes of the entity (or concept). For example, if the entity is Beijing, the attributes of Beijing include population, area, and so on.
属性值是对象指定属性的值。例如中国的面积为:960万平方公里,960万平方公里即是面积这一属性的值。The attribute value is the value of the specified attribute of the object. For example, China's area is 9.6 million square kilometers, and 9.6 million square kilometers is the value of the area attribute.
三元组是知识图谱的一种通用表示形式。三元组的基本形式包括(首实体-关系-尾实体)和(概念-属性-属性值)。例如,中国-首都-北京是一个(首实体-关系-尾实体)的三元组示例,其中,中国是首实体,北京是尾实体,首都就是中国和北京的关系。北京-人口-2069.3万构成一个(概念-属性-属性值)的三元组示例,其中,人口是一种属性,2069.3万是属性值。需要说明的是,在本申请实施例中,如无特殊说明,三元组均指(首实体-关系-尾实体)这一基本形式。A triple is a universal representation of a knowledge graph. The basic forms of triples include (first entity-relation-tail entity) and (concept-attribute-attribute value). For example, China-Capital-Beijing is an example of a triad (head entity-relation-tail entity), where China is the head entity, Beijing is the tail entity, and the capital is the relationship between China and Beijing. Beijing-population-2069.3 million constitutes an example of (concept-attribute-attribute value) triples, in which population is an attribute and 20.693,000 is the attribute value. It should be noted that in the embodiments of the present application, unless otherwise specified, the triples refer to the basic form of (first entity-relationship-tail entity).
知识图谱的框架(schema)是对概念进行建模的规范,是描述客观世界的抽象模型,以形式化方式对概念及其之间的联系给出明确的定义。可以理解的是,schema定义了知识图谱中的数据模型。具体的,schema定义了实体的类型和关系的类型。The schema of the knowledge graph is a specification for modeling concepts, an abstract model describing the objective world, and a clear definition of concepts and their relationships in a formal way. Understandably, the schema defines the data model in the knowledge graph. Specifically, the schema defines the types of entities and the types of relationships.
图1给出了本申请提供的技术方案所适用的一种通信系统,该通信系统包括服务器10以及终端设备20。服务器10与终端设备20之间通过无线网络或有线网络来通信。FIG. 1 shows a communication system to which the technical solution provided in this application is applicable. The communication system includes a server 10 and a terminal device 20. The server 10 and the terminal device 20 communicate through a wireless network or a wired network.
该终端设备20可以手机、平板电脑、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等等。终端设备20可以安装具有智能搜索、智能问答等功能的客户端。The terminal device 20 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The terminal device 20 may be installed with a client having functions of intelligent search, intelligent question answering, and the like.
服务器10用于为终端设备提供智能搜索、智能问答等服务。服务器10包括框架扩展单元、数据获取单元、扩展映射单元、特征计算单元和存储单元。The server 10 is configured to provide services such as intelligent search, intelligent question answering, and the like for the terminal device. The server 10 includes a frame extension unit, a data acquisition unit, an extension mapping unit, a feature calculation unit, and a storage unit.
其中,所述框架扩展单元用于根据初始的知识图谱的框架,构建融合文本的知识图谱的框架。Wherein, the frame expansion unit is configured to construct a frame of a knowledge graph fused with text according to a frame of an initial knowledge graph.
所述数据获取单元用于根据初始的知识图谱中实体的信息或者关系的信息,从互联网中获取外部数据。The data obtaining unit is configured to obtain external data from the Internet according to entity information or relationship information in the initial knowledge map.
所述扩展映射单元用于从外部数据中,生成实体的扩展属性值和关系的扩展属性值;将所述实体的扩展属性值以及所述关系的扩展属性值添加到融合文本的知识图谱中,构建所述融合文本的知识图谱。The extended mapping unit is configured to generate an extended attribute value of an entity and an extended attribute value of a relationship from external data; add the extended attribute value of the entity and the extended attribute value of the relationship to a knowledge map of the fused text, Construct a knowledge map of the fused text.
所述特征计算单元用于确定融合文本的知识图谱中的实体的表示向量和关系的表示向量。The feature calculation unit is configured to determine a representation vector of an entity and a representation vector of a relationship in a knowledge map of the fused text.
所述存储单元用于存储构建好的融合文本的知识图谱的相关数据。The storage unit is configured to store related data of the constructed knowledge map of the fused text.
图2为本申请实施例提供的一种服务器的硬件结构示意图。该服务器包括至少一个处理器101,通信线路102,存储器103以及至少一个通信接口104。FIG. 2 is a schematic diagram of a hardware structure of a server according to an embodiment of the present application. The server includes at least one processor 101, a communication line 102, a memory 103, and at least one communication interface 104.
处理器101可以是一个通用中央处理器(central processing unit,CPU),微处理器, 特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制本申请方案程序执行的集成电路。The processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more processors for controlling the execution of the program of the solution of the present application. integrated circuit.
通信线路102可包括一通路,在上述组件之间传送信息。The communication line 102 may include a path for transmitting information between the aforementioned components.
通信接口104,使用任何收发器一类的装置,用于与其他设备或通信网络通信,如以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。The communication interface 104 uses any device such as a transceiver to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area networks (WLAN), etc. .
存储器103可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器103可以独立存在,通过通信线路102与处理器相连接。存储器103也可以和处理器101集成在一起。The memory 103 may be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (random access memory, RAM), or other types that can store information and instructions The dynamic storage device can also be electrically erasable programmable read-only memory (EEPROM-ready-only memory (EEPROM)), compact disc (read-only memory (CD-ROM)) or other optical disk storage, optical disk storage (Including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or can be used to carry or store desired program code in the form of instructions or data structures and can be used by a computer Any other media accessed, but not limited to this. The memory 103 may exist independently, and is connected to the processor through the communication line 102. The memory 103 may also be integrated with the processor 101.
其中,存储器103用于存储执行本申请方案的计算机可执行指令。处理器101用于执行存储器103中存储的计算机可执行指令,从而实现本申请下述实施例提供的技术方案。The memory 103 is configured to store computer-executable instructions for executing the solution of the present application. The processor 101 is configured to execute computer-executable instructions stored in the memory 103, so as to implement the technical solution provided by the following embodiments of the present application.
可选地,本申请实施例中的计算机可执行指令也可以称之为应用程序代码,本申请实施例对此不作具体限定。Optionally, the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
在具体实现中,作为一种实施例,处理器101可以包括一个或多个CPU,例如图2中的CPU0和CPU1。In a specific implementation, as an embodiment, the processor 101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 2.
在具体实现中,作为一种实施例,服务器可以包括多个处理器,例如图2中的处理器101和处理器107。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。In a specific implementation, as an embodiment, the server may include multiple processors, such as the processor 101 and the processor 107 in FIG. 2. Each of these processors may be a single-CPU processor or a multi-CPU processor. A processor herein may refer to one or more devices, circuits, and / or processing cores for processing data (such as computer program instructions).
如图3所示,为本申请实施例提供的一种构建知识图谱的方法,该方法包括以下步骤:As shown in FIG. 3, a method for constructing a knowledge map provided by an embodiment of the present application includes the following steps:
S101、服务器获取初始知识图谱。S101. The server obtains an initial knowledge map.
其中,所述初始知识图谱是服务器从互联网中获取的;或者,所述初始知识图谱是人工录入到服务器的。The initial knowledge map is obtained by the server from the Internet; or, the initial knowledge map is manually entered into the server.
S102、服务器基于所述初始知识图谱的框架,构建融合文本的知识图谱的框架。S102. Based on the framework of the initial knowledge map, the server constructs a framework of the text knowledge map.
相比于初始知识图谱的框架,融合文本的知识图谱的框架至少还包括:实体的扩展属性、关系的扩展属性以及实体之间的扩展关系。Compared to the framework of the initial knowledge graph, the framework of the text-fused knowledge graph further includes at least: extended attributes of entities, extended attributes of relationships, and extended relations between entities.
实体的扩展属性即为初始知识图谱的框架中未定义的实体的属性。例如,对于北京这一实体,初始知识图谱的框架中仅定义的两个属性:面积和人口,而融合文本的知识图谱的框架中还定义了另一个属性:经纬度。这样一来,经纬度即为北京的扩展属性。可选的,实体的扩展属性可以根据以下方式来确定:服务器根据初始知识图谱中的实体的名称信息,利用文本挖掘算法,从已有的文本信息或者外部的文本信息中抽取与实体的名称信息有关的高频词,并结合词性过滤技术,形成实体的扩展属性。其中,所述文本挖掘算法包括:主题模型、核心词抽取以及命名实体识别。The extended attributes of entities are the attributes of undefined entities in the framework of the initial knowledge graph. For example, for the Beijing entity, there are only two attributes defined in the framework of the initial knowledge map: area and population, while the framework of the fused knowledge map also defines another attribute: latitude and longitude. In this way, latitude and longitude are the extended attributes of Beijing. Optionally, the extended attributes of the entity may be determined according to the following manner: the server extracts the entity name information from the existing text information or external text information using a text mining algorithm according to the entity name information in the initial knowledge map Related high-frequency words, combined with part-of-speech filtering techniques, form extended attributes of entities. The text mining algorithm includes a topic model, core word extraction, and named entity recognition.
需要说明的是,在本申请实施例中,实体的扩展属性至少包括:实体的文本信息。也就是说,融合文本的知识图谱是至少具有实体的文本信息这一属性的知识图谱。It should be noted that, in the embodiment of the present application, the extended attributes of the entity include at least: text information of the entity. That is, the knowledge graph of the fused text is a knowledge graph having at least the attribute of the text information of the entity.
关系的扩展属性即为初始知识图谱的框架中未定义的关系的属性。示例性的,关系的扩展属性包括:头实体的类型、尾实体的类型等。可选的,所述关系的扩展属性由专家人工定义。The extended attributes of relationships are the attributes of undefined relationships in the framework of the initial knowledge graph. Exemplarily, the extended attributes of the relationship include: the type of the head entity, the type of the tail entity, and the like. Optionally, the extended attributes of the relationship are manually defined by an expert.
实体之间的扩展关系即为初始知识图谱的框架中未定义的实体之间的关系。示例性的,实体之间的扩展关系包括:距离关系、临近关系等。可选的,扩展关系可从关系数据库中获取,所述关系数据库包含了各种实体之间的关系。需要说明的是,为了保证实体之间的扩展关系的合理性,在确定实体之间的扩展关系后,服务器将当前融合文本的知识图谱中明确的实体之间的关系作为训练数据,利用弱监督算法以及强化学习算法,验证扩展关系的合理性,从而去除掉不合理的扩展关系。The extended relationship between entities is the relationship between undefined entities in the framework of the initial knowledge graph. Exemplarily, the extended relationship between entities includes a distance relationship, a proximity relationship, and the like. Optionally, the extended relationship may be obtained from a relational database, which includes relationships between various entities. It should be noted that in order to ensure the rationality of the extended relationship between entities, after determining the extended relationship between the entities, the server uses the relationship between the entities specified in the knowledge map of the current fused text as training data, and uses weak supervision Algorithm and reinforcement learning algorithm to verify the rationality of the extended relationship, thereby removing the unreasonable extended relationship.
S103、服务器根据所述融合文本的知识图谱中实体的信息或者关系的信息,获取外部数据。S103. The server obtains external data according to entity information or relationship information in the knowledge map of the fused text.
其中,所述实体的信息包括实体的名称。所述关系的信息包括关系的名称。所述外部数据包括实体或者关系的扩展属性的信息。所述外部数据可以为结构化的数据、半结构化的数据或者非结构化的数据。若所述外部数据为非结构化的数据,所述外部数据可以为文本信息或多媒体信息,所述多媒体信息包括:视频、图片和网页。The information of the entity includes a name of the entity. The information of the relationship includes the name of the relationship. The external data includes information of extended attributes of entities or relationships. The external data may be structured data, semi-structured data, or unstructured data. If the external data is unstructured data, the external data may be text information or multimedia information, and the multimedia information includes videos, pictures, and web pages.
一种可选的实现方式中,服务器根据所述融合文本的知识图谱中实体的信息或者关系的信息,利用爬虫等技术从互联网获取外部数据。In an optional implementation manner, the server obtains external data from the Internet by using a crawler or other technology according to entity information or relationship information in the fused text knowledge map.
示例性的,服务器直接从百科类网站(例如百度百科、维基百科)或者垂直网站(例如电子产品网站、图书网站、电影网站、音乐网站)中提取外部数据。由于百科类网站和垂直网站包括有大量的实体的属性信息,例如图书网站中包括了图书的作者、出版社、出版时间等信息,因此服务器可以通过生成一定规则的包装器(或称为模板),以包装器来提取包含属性信息的外部数据。需要说明的是,包装器的生成方法可以分为:人工法(即以人工的方式编写包装器)、监督方法、半监督方法以及无监督方法。Exemplarily, the server directly extracts external data from an encyclopedia website (such as Baidu Encyclopedia, Wikipedia) or a vertical website (such as an electronic product website, a book website, a movie website, or a music website). Because encyclopedia websites and vertical websites include a lot of entity attribute information, for example, the book website includes information such as the author, publisher, and publishing time of the book, the server can generate a certain wrapper (or template) by generating certain rules. Use a wrapper to extract external data that contains attribute information. It should be noted that the method of generating a wrapper can be divided into: a manual method (that is, writing a wrapper manually), a supervised method, a semi-supervised method, and an unsupervised method.
S104、服务器从外部数据中,确定所述实体的扩展属性值以及所述关系的扩展属性值,以构建所述融合文本的知识图谱。S104. The server determines an extended attribute value of the entity and an extended attribute value of the relationship from external data to construct a knowledge map of the fused text.
可选的,若外部数据为结构化数据或者半结构化数据,服务器利用人工定义或者自动生成的匹配模式,从外部数据中提取所述实体的扩展属性值或者关系的扩展属性值。Optionally, if the external data is structured data or semi-structured data, the server extracts the extended attribute value of the entity or the extended attribute value of the relationship from the external data by using a manually defined or automatically generated matching mode.
可选的,若外部数据为非结构化数据,例如外部数据为文本信息,则服务器采用数据挖掘的方法,从文本信息中挖掘属性与属性值之间的关系模式,从而实现对属性名称与属性值在文本中的定位。可以理解的是,在真实语言环境中,许多属性值的附近均存在一些用于限制和界定该属性值含义的关键词(例如属性名称),因此可以利用这些关键词来定位属性值。Optionally, if the external data is unstructured data, for example, the external data is text information, the server uses data mining methods to mine the relationship mode between the attribute and the attribute value from the text information, so as to realize the attribute name and attribute. Positioning of the value in the text. It can be understood that in a real language environment, there are some keywords (such as attribute names) used to limit and define the meaning of the attribute value near many attribute values, so these keywords can be used to locate the attribute value.
之后,服务器将实体的扩展属性值以及关系的扩展属性值补充到融合文本的知识图谱中,以完成对融合文本的知识图谱的创建。After that, the server adds the extended attribute value of the entity and the extended attribute value of the relationship to the knowledge graph of the fused text to complete the creation of the knowledge graph of the fused text.
示例性的,图4示出一种初始知识图谱的示意图,图5示出一种融合文本的知识图谱的示意图。如图4所示,初始知识图谱的框架定义了两种实体类型:产品和零部件。产品有:华为P10和华为P8,零部件有:摄像头、镜头1、镜头2以及镜头。镜头1、镜头2 以及镜头具有两种属性:传感器以及像素。图5所示的融合文本的知识图谱是基于图4所示的初始知识图谱扩展得到的。如图5所示,华为P10和华为P8之间存在扩展关系:有序共现关系。华为P10具有扩展属性:主题和频度。包含关系具有扩展属性:htype、hr频度、ttype以及rt频度。其中,htype表示头实体的类型,hr频度表示头实体与关系的频度,ttype表示尾实体的类型,rt频度表示关系与尾实体的频度。需要说明的是,虽然图5中未示出,图5中的任一实体还具有文本信息这一扩展属性。For example, FIG. 4 shows a schematic diagram of an initial knowledge map, and FIG. 5 shows a schematic diagram of a knowledge map that fuses text. As shown in Figure 4, the framework of the initial knowledge graph defines two entity types: products and components. The products are: Huawei P10 and Huawei P8, the parts are: camera, lens 1, lens 2 and lens. Lens 1, lens 2 and lens have two attributes: sensor and pixel. The knowledge map of the fused text shown in FIG. 5 is obtained based on the expansion of the initial knowledge map shown in FIG. 4. As shown in Figure 5, there is an extended relationship between Huawei P10 and Huawei P8: an orderly co-occurrence relationship. Huawei P10 has extended attributes: theme and frequency. Containment relationships have extended attributes: htype, hr frequency, ttype, and rt frequency. Among them, htype indicates the type of the head entity, hr frequency indicates the frequency of the head entity and the relationship, ttype indicates the type of the tail entity, and rt frequency indicates the frequency of the relationship and the tail entity. It should be noted that although not shown in FIG. 5, any entity in FIG. 5 also has an extended attribute of text information.
本申请实施例提供的构建知识图谱的方法,通过扩展初始知识图谱的框架,补充相关的实体的扩展属性值和关系的扩展属性值,构建出融合文本的知识图谱。这样一来,相比于初始知识图谱,融合文本的知识图谱在内容上更加完备。The method for constructing a knowledge graph provided in the embodiments of the present application constructs a knowledge graph of the fused text by extending the framework of the initial knowledge graph and supplementing the extended attribute values of the related entities and the extended attribute values of the relationship. In this way, compared with the original knowledge map, the knowledge map of the fused text is more complete in content.
在构建融合文本的知识图谱之后,需要对构建好的融合文本的知识图谱进行表示学习,以便于有效利用融合文本的知识图谱中的知识。图6示出本申请实施例提供的一种表示学习方法的流程图,该方法包括如下步骤:After constructing the knowledge graph of the fused text, it is necessary to perform learning on the constructed knowledge graph of the fused text in order to effectively utilize the knowledge in the knowledge graph of the fused text. FIG. 6 shows a flowchart of a learning method according to an embodiment of the present application. The method includes the following steps:
S201、服务器初始化所述融合文本的知识图谱的三元组中头实体的表示向量、尾实体的表示向量、关系的表示向量、实体类型表示矩阵、关系类型表示矩阵、词表示矩阵以及权重值的表示向量。S201. The server initializes the representation vector of the head entity, the representation vector of the tail entity, the representation vector of the relationship, the entity type representation matrix, the relationship type representation matrix, the word representation matrix, and the weight value in the triple of the knowledge graph of the fused text. Represents a vector.
具体的,服务器采用均匀分布初始化、伯努利分布初始化等方法来初始化所述融合文本的知识图谱的三元组中所述头实体的表示向量、所述尾实体的表示向量、所述关系的表示向量、所述实体类型表示矩阵、所述关系类型表示矩阵、所述词表示矩阵以及所述权重值的表示向量。Specifically, the server uses methods such as uniform distribution initialization and Bernoulli distribution initialization to initialize the representation vector of the head entity, the representation vector of the tail entity, and the relationship in the triple of the knowledge graph of the fused text. A representation vector, the entity type representation matrix, the relation type representation matrix, the word representation matrix, and a representation vector of the weight value.
需要说明的是,所述头实体的表示向量的维数、所述尾实体的表示向量的维数、所述关系的表示向量的维数以及所述权重值表示向量的维数均是预先设定的。并且,所述头实体的表示向量的维数、所述尾实体的表示向量的维数、所述关系的表示向量的维数以及所述权重值表示向量的维数相等。It should be noted that the dimensions of the representation vector of the head entity, the dimensions of the representation vector of the tail entity, the dimensions of the relationship representation vector, and the weight value of the representation vector are all preset. Fixed. Furthermore, the dimensions of the representation vector of the head entity, the dimensions of the representation vector of the tail entity, the dimensions of the relationship representation vector, and the weight value of the representation vector are equal.
在本申请的实施例中,所述实体类型表示矩阵用于使实体的类型标识向量映射为实体的类型表示向量。所述实体的类型标识向量用于直接表征实体所属的类型。所述实体的类型表示向量用于间接表征实体所属的类型。In the embodiment of the present application, the entity type representation matrix is used to map the type identification vector of the entity to the type representation vector of the entity. The type identification vector of the entity is used to directly characterize the type to which the entity belongs. The type representation vector of the entity is used to indirectly characterize the type to which the entity belongs.
需要说明的是,所述实体类型表示矩阵的行数等于所述实体的类型表示向量的维数。所述实体类型表示矩阵的列数等于所述实体的类型标识向量的维数。其中,所述实体的类型表示向量的维数等于实体的表示向量的维数。所述实体的类型标识向量的维数等于实体的类型的总数。实体的类型标识向量的每一个维度分别对应实体的一种类型,实体的类型标识向量的每一个维度的取值为0或1。若实体的类型标识向量的一个维度取值为1,说明实体属于该维度对应的类型。若实体的类型标识向量的一个维度取值为0,说明实体不属于该维度对应的类型。示例性的,知识图谱的框架定义实体的类型包括类型1、类型2、类型3以及类型4,若实体A所属的类型为类型1和类型4,则实体A的类型标识向量为(1,0,0,1)。It should be noted that the number of rows of the entity type representation matrix is equal to the number of dimensions of the entity type representation vector. The number of columns of the entity type representation matrix is equal to the number of dimensions of the type identification vector of the entity. Wherein, the dimension of the type representation vector of the entity is equal to the dimension of the representation vector of the entity. The dimension of the type identification vector of the entity is equal to the total number of types of the entity. Each dimension of the entity type identification vector corresponds to a type of the entity, and each dimension of the entity type identification vector has a value of 0 or 1. If the dimension of the entity type identification vector is 1, it indicates that the entity belongs to the type corresponding to the dimension. If one dimension of the entity's type identification vector is 0, it indicates that the entity does not belong to the type corresponding to the dimension. Exemplarily, the framework of the knowledge graph defines the types of entities including type 1, type 2, type 3, and type 4. If the type to which entity A belongs is type 1 and type 4, the type identification vector of entity A is (1,0 , 0,1).
在本申请实施例中,所述关系类型表示矩阵用于使关系的类型标识向量映射为关系的类型表示向量。所述关系的类型标识向量用于直接表征关系所属的类型。所述关系的类型表示向量用于间接表征关系所属的类型。In the embodiment of the present application, the relationship type representation matrix is used to map the type identification vector of the relationship to the type representation vector of the relationship. The type identification vector of the relationship is used to directly characterize the type to which the relationship belongs. The type representation vector of the relationship is used to indirectly characterize the type to which the relationship belongs.
需要说明的是,所述关系类型表示矩阵的行数等于所述关系的类型表示向量的维数。 所述关系类型表示矩阵的列数等于所述关系的类型标识向量的维数。所述关系的类型表示向量的维数等于所述关系的表示向量的维数。所述关系的类型标识向量的维数等于关系的类型的总数。关系的类型标识向量的每一个维度分别对应实体的一种类型,实体的类型标识向量的每一个维度的取值为0或1。若关系的类型标识向量的一个维度取值为1,说明关系属于该维度对应的类型;若关系的类型标识向量的一个维度取值为0,说明关系不属于该维度对应的类型。It should be noted that the number of rows of the relationship type representation matrix is equal to the number of dimensions of the relationship type representation vector. The relationship type indicates that the number of columns of the matrix is equal to the dimension of the type identification vector of the relationship. The dimension of the type-representation vector of the relation is equal to the dimension of the relation-representation vector. The dimension of the type identification vector of the relationship is equal to the total number of types of the relationship. Each dimension of the type identification vector of the relationship corresponds to a type of the entity, and each dimension of the type identification vector of the entity has a value of 0 or 1. If the value of the dimension identification vector of the relationship is 1, the relationship belongs to the corresponding type of the dimension; if the value of the dimension identification vector of the relationship is 0, the relationship does not belong to the corresponding type of the dimension.
在本申请实施例中,所述词表示矩阵用于使词的标识向量映射为词的表示向量。所述词的标识向量用于直接表征词在词表中的位置。所述词的类型表示向量用于间接表征词在词表中的位置。词表包含知识图谱中所有实体相关的词。In the embodiment of the present application, the word representation matrix is used to map an identification vector of a word to a representation vector of a word. The identification vector of the word is used to directly characterize the position of the word in the vocabulary. The word type representation vector is used to indirectly characterize the position of the word in the vocabulary. The vocabulary contains all entity-related words in the knowledge graph.
需要说明的是,所述词表示矩阵的行数等于所述词的表示向量的维数。所述词表示矩阵的列数等于所述词的标识向量的维数。所述词的表示向量的维数等于实体的表示向量的维数。所述词的标识向量的维数等于词表中词的总数。词的标识向量的每一个维度对应词表中的一个位置。词的标识向量的每一个维度的取值为0或1。若词的标识向量的一个维度取值为0,说明该词不在该维度对应的词表的位置;若词的标识向量的一个维度取值为1,说明该词在该维度对应的词表的位置。It should be noted that the number of rows of the word representation matrix is equal to the number of dimensions of the word representation vector. The number of columns of the word representation matrix is equal to the dimension of the identification vector of the word. The dimension of the representation vector of the word is equal to the dimension of the representation vector of the entity. The dimension of the identification vector of the word is equal to the total number of words in the vocabulary. Each dimension of the word's identification vector corresponds to a position in the vocabulary. The value of the word's identification vector is 0 or 1. If a dimension of the word's identification vector is 0, it means that the word is not in the position of the vocabulary corresponding to that dimension; if a dimension of the word's identification vector is 1, it means that the word's corresponding vocabulary in that dimension position.
在本申请实施例中,所述权重值的表示向量用于表示关系的权重值。关系的权重值用于说明关系所连接的两个实体之间的相关程度。In the embodiment of the present application, the representation vector of the weight value is used to represent the weight value of the relationship. The weight value of the relationship is used to explain the degree of correlation between the two entities to which the relationship is connected.
S202、服务器根据融合文本的知识图谱的三元组中实体的类型,确定所述实体的类型表示向量。S202. The server determines a type representation vector of the entity according to the type of the entity in the triple of the knowledge map of the fused text.
其中,所述实体包含头实体和尾实体。The entities include a head entity and a tail entity.
在本申请实施例中,所述实体的类型是由融合文本的知识图谱的框架定义的。并且,所述实体的类型并不是唯一的。换句话说,一个实体可以对应多个类型。示例性的,假设实体为手机,手机可以是电子产品,也可以是通信工具。这里,电子产品或者通信工具即为手机所属的类型。In the embodiment of the present application, the type of the entity is defined by a framework of a knowledge graph of fused text. And, the type of the entity is not unique. In other words, an entity can correspond to multiple types. Exemplarily, it is assumed that the entity is a mobile phone, and the mobile phone may be an electronic product or a communication tool. Here, the electronic product or communication tool is the type to which the mobile phone belongs.
具体的,服务器先根据头实体的类型,确定头实体的类型标识向量;然后,服务器再根据公式
Figure PCTCN2019096895-appb-000021
确定所述头实体的类型表示向量。其中,f 1(h)表示所述头实体的类型表示向量,W etype表示所述实体类型表示矩阵,v etype(h)表示所述头实体的类型标识向量,‖ ‖表示二范数。
Specifically, the server first determines the type identification vector of the head entity based on the type of the head entity; then, the server then determines the vector based on the formula
Figure PCTCN2019096895-appb-000021
A type representation vector of the head entity is determined. Among them, f 1 (h) represents a type representation vector of the head entity, W etype represents the entity type representation matrix, v etype (h) represents a type identification vector of the head entity, and ‖ ‖ represents a second norm.
具体的,服务器先根据尾实体的类型,确定尾实体的类型标识向量;然后,服务器再根据公式
Figure PCTCN2019096895-appb-000022
确定所述尾实体的类型表示向量。其中,f 1(t)表示所述尾实体的类型表示向量,v etype(t)表示所述尾实体的类型标识向量。
Specifically, the server first determines the type identification vector of the tail entity according to the type of the tail entity; then, the server then determines the vector based on the formula
Figure PCTCN2019096895-appb-000022
A type representation vector of the tail entity is determined. Wherein, f 1 (t) represents a type representation vector of the tail entity, and v etype (t) represents a type identification vector of the tail entity.
S203、服务器根据所述三元组中关系的类型,确定所述关系的类型表示向量。S203. The server determines a type representation vector of the relationship according to the type of the relationship in the triple.
其中,所述关系的类型是由融合文本的知识图谱的框架定义的。所述关系的类型包括:包含关系、上下位关系、并列关系等。Wherein, the type of the relationship is defined by a framework of a knowledge graph of fused text. The type of the relationship includes an include relationship, a subordinate relationship, a side-by-side relationship, and the like.
具体的,服务器可以先根据关系的类型,确定所述关系的类型标识向量。然后,服务器根据公式:
Figure PCTCN2019096895-appb-000023
确定所述关系的类型表示向量;其中,g 1(r)表示所述关系的类型表示向量,W rtype表示所述关系类型表示矩阵,v rtype(r)表示所述关系的类型标识向量。
Specifically, the server may first determine the type identification vector of the relationship according to the type of the relationship. The server then according to the formula:
Figure PCTCN2019096895-appb-000023
Determine the type representation vector of the relationship; where g 1 (r) represents the type representation vector of the relationship, W rtype represents the relationship type representation matrix, and v rtype (r) represents the type identification vector of the relationship.
S204、服务器根据所述实体的文本信息,确定所述实体的上下文表示向量。S204. The server determines a context representation vector of the entity according to the text information of the entity.
其中,所述实体的上下文表示向量用于表征所述实体的上下文特征。所述文本信息为 服务器预先存储的。The context representation vector of the entity is used to characterize the context features of the entity. The text information is stored in advance by the server.
具体的,服务器根据所述头实体的文本信息,确定与所述头实体相关的词;然后,服务器根据公式:
Figure PCTCN2019096895-appb-000024
确定所述头实体的上下文表示向量;其中,f 2(h)表示所述头实体的上下文表示向量,α、β为取值在0到1之间的常数,v h表示所述头实体的表示向量,w i表示与所述头实体相关的词,ε 1表示所有与所述头实体相关的词构成的集合,W word表示词表示矩阵,V vocubalary(w i)表示w i的标识向量。
Specifically, the server determines words related to the head entity according to the text information of the head entity; then, the server according to the formula:
Figure PCTCN2019096895-appb-000024
Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, α, β are constants between 0 and 1, and v h represents the head entity's Represents a vector, w i represents a word related to the head entity, ε 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i .
具体的,服务器根据所述尾实体的文本信息,确定与所述尾实体相关的词;然后,服务器根据公式:
Figure PCTCN2019096895-appb-000025
确定所述尾实体的上下文表示向量;其中,f 2(t)表示所述尾实体的上下文表示向量,v t表示所述尾实体的表示向量,m i表示与所述尾实体相关的词,ε 2表示所有与所述尾实体相关的词构成的集合,V vocubalary(m i)表示m i的标识向量。
Specifically, the server determines words related to the tail entity according to the text information of the tail entity; then, the server according to the formula:
Figure PCTCN2019096895-appb-000025
Determining a context representation vector of the tail entity; wherein f 2 (t) represents a context representation vector of the tail entity, v t represents a representation vector of the tail entity, and mi represents a word related to the tail entity, ε 2 represents a set of all words related to the tail entity, and V vocubalary (m i ) represents an identification vector of mi .
需要说明的是,服务器确定与实体相关的词可采用以下实现方式:服务器从文本信息中选取距离实体的名称在一定范围内的文字序列,使用分词技术将所述文字序列划分为一个个单独的词,这些单独的词即为与实体相关的词。It should be noted that the server may determine the words related to the entity in the following implementation manner: the server selects a text sequence within a certain distance from the entity name from the text information, and uses word segmentation technology to divide the text sequence into individual Words, these individual words are words related to the entity.
可选的,上述分词技术可以为基于字符串匹配的分词技术、基于理解的分词技术和基于统计的分词技术。这些分词技术的具体实现方式可以参考现有技术,本申请实施例对此不予赘述。Optionally, the above word segmentation technology may be a word segmentation technology based on string matching, a word segmentation technology based on understanding, and a word segmentation technology based on statistics. For specific implementations of these word segmentation technologies, reference may be made to the prior art, which is not described in the embodiments of the present application.
S205、服务器根据所述关系的权重值,确定所述关系的上下文表示向量。S205. The server determines a context representation vector of the relationship according to a weight value of the relationship.
其中,所述关系的上下文表示向量用于表征所述关系的权重特征。Wherein, the context representation vector of the relationship is used to represent the weight feature of the relationship.
具体的,服务器根据公式:
Figure PCTCN2019096895-appb-000026
确定所述关系的上下文表示向量。其中,g 2(r)表示所述关系的上下文表示向量,v r表示所述关系的表示向量,n i表示所述关系的权重值,ε 3表示关系的所有权重值构成的集合,
Figure PCTCN2019096895-appb-000027
表示n i的表示向量。
Specifically, the server uses the formula:
Figure PCTCN2019096895-appb-000026
Determine the contextual representation vector of the relationship. Among them, g 2 (r) represents a context representation vector of the relationship, v r represents a representation vector of the relationship, n i represents a weight value of the relationship, and ε 3 represents a set of ownership weight values of the relationship.
Figure PCTCN2019096895-appb-000027
A representation vector representing n i .
S206、服务器根据所述实体的类型表示向量、所述实体的上下文表示向量、所述关系的类型表示向量以及所述关系的上下文表示向量,构建所述三元组的打分函数。S206. The server constructs a scoring function for the triplet according to a type representation vector of the entity, a context representation vector of the entity, a type representation vector of the relationship, and a context representation vector of the relationship.
其中,所述三元组的打分函数为:The scoring function of the triples is:
Figure PCTCN2019096895-appb-000028
Figure PCTCN2019096895-appb-000028
其中,
Figure PCTCN2019096895-appb-000029
表示复合运算。需要说明的是,所述复合运算包括向量加法或者按位相乘。
among them,
Figure PCTCN2019096895-appb-000029
Represents a compound operation. It should be noted that the compound operation includes vector addition or bitwise multiplication.
其中,按位相乘是指第一向量的每一个维度的取值乘以第二向量的对应维度的取值,生成第三向量的对应维度的取值。示例性的,
Figure PCTCN2019096895-appb-000030
Figure PCTCN2019096895-appb-000031
Bitwise multiplication refers to multiplying the value of each dimension of the first vector by the value of the corresponding dimension of the second vector to generate the value of the corresponding dimension of the third vector. Exemplary,
Figure PCTCN2019096895-appb-000030
Figure PCTCN2019096895-appb-000031
S207、服务器根据所述三元组的打分函数,构建目标函数。S207. The server constructs an objective function according to the scoring function of the triple.
其中,所述目标函数为:The objective function is:
Figure PCTCN2019096895-appb-000032
Figure PCTCN2019096895-appb-000032
其中,(h,r,t)表示正例三元组,Δ表示正例三元组集合,(h′,r,t′)表示负例三元组,h′表示负例头实体,t′表示负例尾实体,Δ′表示负例三元组集合,M为常数。Among them, (h, r, t) represents a positive example triplet, Δ represents a positive example triplet set, (h ′, r, t ′) represents a negative example triplet, h ′ represents a negative example head entity, and t ′ Represents a negative example tail entity, Δ ′ represents a negative example triple set, and M is a constant.
需要说明的是,正例三元组为融合文本的知识图谱中存在的三元组,负例三元组为融合文本的知识图谱中不存在的三元组。负例三元组由正例三元组随机替换头实体或者尾实体而得到。负例三元组集合中不包括正例三元组。It should be noted that the positive triples are triples that exist in the knowledge graph of the fused text, and the negative triples are triples that do not exist in the knowledge graph of the fused text. The negative triples are obtained by randomly replacing the head or tail entities with the positive triples. The negative triples are not included in the negative triples.
另外,S(h′,r,t′)的生成方法可参考上述S(h,r,t)的生成方法,本申请实施例对此不再赘述。In addition, for the method for generating S (h ′, r, t ′), reference may be made to the above-mentioned method for generating S (h, r, t), which is not described in the embodiment of the present application.
S208、服务器最小化所述目标函数,学习所述实体的表示向量和所述关系的表示向量。S208. The server minimizes the objective function and learns a representation vector of the entity and a representation vector of the relationship.
一种可选的实现方式中,服务器以梯度下降算法来迭代更新所述头实体的表示向量、所述尾实体的表示向量、所述关系的表示向量、所述实体类型表示矩阵、所述关系类型表示矩阵、所述词表示矩阵、权重值的表示向量、负例头实体的表示向量以及负例尾实体的表示向量等,以使得目标函数能求解出最小值,进而能够确定所述头实体的表示向量、所述尾实体的表示向量和所述关系的表示向量。In an optional implementation manner, the server uses a gradient descent algorithm to iteratively update the representation vector of the head entity, the representation vector of the tail entity, the representation vector of the relationship, the entity type representation matrix, and the relationship. The type representation matrix, the word representation matrix, the representation vector of weight values, the representation vector of negative example head entities, and the representation vector of negative example tail entities, so that the objective function can solve the minimum value, and then the head entity can be determined. A representation vector of, a representation vector of the tail entity, and a representation vector of the relationship.
需要说明的是,所述梯度下降算法可参考现有技术,本申请实施例对此不予赘述。It should be noted that, for the gradient descent algorithm, reference may be made to the prior art, which is not described in the embodiment of the present application.
本申请实施例提供的表示学习方法,由于实体的类型和上下文,关系的类型和权重值均代表一定的深层次语义信息,因此通过考虑实体的类型和上下文,以及关系的类型和权重值,从而使得确定出来的实体的表示向量以及关系的表示向量能够刻画出知识图谱中深层次的语义信息,提高表示学习的准确性。In the representation learning method provided in the embodiments of the present application, since the type and context of the entity, the type and weight value of the relationship represent certain deep-level semantic information, therefore, by considering the type and context of the entity and the type and weight value of the relationship, The identified vector of the entity and the vector of the relation can describe the deep-level semantic information in the knowledge map, and improve the accuracy of the representation learning.
上述主要从服务器的角度对本申请实施例提供的方案进行了介绍。可以理解的是服务器为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。The above mainly introduces the solution provided by the embodiment of the present application from the perspective of the server. It can be understood that, in order to implement the above functions, the server includes a hardware structure and / or a software module corresponding to each function. Those skilled in the art should easily realize that, with reference to the units and algorithm steps of each example described in the embodiments disclosed herein, this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is performed by hardware or computer software-driven hardware depends on the specific application of the technical solution and design constraints. Professional technicians can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
本申请实施例可以根据上述方法示例对服务器进行划分,例如,可以对应各个功能划分各个模块或者单元,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件模块或者单元的形式实现。其中,本申请实施例中对模块或者单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。In the embodiment of the present application, the server may be divided according to the foregoing method example. For example, each module or unit may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The above integrated modules may be implemented in the form of hardware, or in the form of software modules or units. The division of modules or units in the embodiments of the present application is schematic, and is only a logical function division. In actual implementation, there may be another division manner.
比如,在采用对应各个功能划分各个功能模块的情况下,图7示出上述实施例涉及的表示学习装置的一种可能的结构示意图。如图7所示,该表示学习装置包括:类型表示模块701、上下文表示模块702、处理模块703、框架扩展模块704、数据获取模块705以及扩展映射模块706。其中,类型表示模块701用于支持服务器执行图6中的步骤S202和S203。上下文表示模块702用于支持服务器执行图6中的步骤S204和S205。处理模块703用于支持服务器执行图6中的步骤S201、S206、S207和S208。框架扩展模块704用于支持服务器执行图3中的步骤S101和S102。数据获取模块705用于支持服务器执行图3中的步骤S103。扩展映射模块706用于支持服务器执行图3中的步骤S104。For example, in a case where each functional module is divided corresponding to each function, FIG. 7 illustrates a possible structural diagram of a learning device according to the foregoing embodiment. As shown in FIG. 7, the representation learning device includes a type representation module 701, a context representation module 702, a processing module 703, a framework extension module 704, a data acquisition module 705, and an extension mapping module 706. The type indicating module 701 is configured to support the server to perform steps S202 and S203 in FIG. 6. The context representation module 702 is configured to support the server to perform steps S204 and S205 in FIG. 6. The processing module 703 is configured to support the server to execute steps S201, S206, S207, and S208 in FIG. The frame expansion module 704 is configured to support the server to perform steps S101 and S102 in FIG. 3. The data acquisition module 705 is configured to support the server to execute step S103 in FIG. 3. The extended mapping module 706 is configured to support the server to execute step S104 in FIG. 3.
在本申请实施例中,该表示学习装置以对应各个功能划分各个功能模块的形式来呈现,或者,该表示学习装置以采用集成的方式划分各个功能模块的形式来呈现。这里的“模块”可以包括特定应用集成电路(Application-Specific Integrated Circuit,ASIC),电路,执行一个或多个软件或固件程序的处理器和存储器,集成逻辑电路,或其他可以提供上述功能的器件。在一个简单的实施例中,本领域的技术人员可以想到该表示学习装置可以采用图2所示的服务器来实现。比如,图7中的类型表示模块701、上下文表示模块702、处理模 块703、框架扩展模块704以及扩展映射模块706可以由图2中的处理器101来实现,图7中的数据获取模块705可以由图2中的通信接口104来实现。本申请实施例对此不做任何限制。In the embodiment of the present application, the presentation learning device is presented in the form of dividing each functional module corresponding to each function, or the presentation learning device is presented in the form of dividing each functional module in an integrated manner. The “module” here may include application-specific integrated circuits (ASICs), circuits, processors and memories executing one or more software or firmware programs, integrated logic circuits, or other devices that can provide the above functions . In a simple embodiment, those skilled in the art can think that the representation learning apparatus may be implemented by using the server shown in FIG. 2. For example, the type representation module 701, context representation module 702, processing module 703, framework extension module 704, and extension mapping module 706 in FIG. 7 may be implemented by the processor 101 in FIG. 2, and the data acquisition module 705 in FIG. 7 may It is implemented by the communication interface 104 in FIG. 2. The embodiment of the present application does not make any limitation on this.
可选的,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令;当所述计算机可读存储介质在图2所示的服务器上运行时,使得该服务器执行图3或图6所示的表示学习方法。Optionally, the embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores instructions; when the computer-readable storage medium runs on the server shown in FIG. 2, The server executes the representation learning method shown in FIG. 3 or FIG. 6.
可选的,本申请实施例还提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机可以执行图3或图6所示的表示学习方法。Optionally, the embodiment of the present application further provides a computer program product containing instructions, which when run on a computer enables the computer to execute the representation learning method shown in FIG. 3 or FIG. 6.
可选的,本申请实施例提供了一种芯片系统,该芯片系统包括处理器,用于支持服务器实现图3或图6所示的表示学习方法。在一种可能的设计中,该芯片系统还包括存储器。该存储器,用于保存接收机必要的程序指令和数据。当然,存储器也可以不在芯片系统中。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件,本申请实施例对此不作具体限定。Optionally, an embodiment of the present application provides a chip system including a processor, which is configured to support a server to implement the representation learning method shown in FIG. 3 or FIG. 6. In a possible design, the chip system further includes a memory. This memory is used to store the necessary program instructions and data of the receiver. Of course, the memory may not be in the chip system. The chip system may be composed of a chip, and may also include a chip and other discrete devices, which are not specifically limited in the embodiments of the present application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件程序实现时,可以全部或部分地以计算机程序产品的形式来实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或者数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可以用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带),光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are wholly or partially generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be from a website site, a computer, a server, or a data center. Transmission by wire (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) to another website site, computer, server, or data center. The computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device including one or more servers, data centers, and the like that can be integrated with the medium. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
尽管在此结合各实施例对本申请进行了描述,然而,在实施所要求保护的本申请过程中,本领域技术人员通过查看所述附图、公开内容、以及所附权利要求书,可理解并实现所述公开实施例的其他变化。在权利要求中,“包括”(comprising)一词不排除其他组成部分或步骤,“一”或“一个”不排除多个的情况。单个处理器或其他单元可以实现权利要求中列举的若干项功能。相互不同的从属权利要求中记载了某些措施,但这并不表示这些措施不能组合起来产生良好的效果。Although the present application is described in conjunction with the embodiments herein, in the process of implementing the claimed application, those skilled in the art can understand and understand by looking at the drawings, the disclosure, and the appended claims. Other variations of the disclosed embodiments are implemented. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude the case of a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. Certain measures are recited in mutually different dependent claims, but this does not mean that these measures cannot be combined to produce good results.
尽管结合具体特征及其实施例对本申请进行了描述,显而易见的,在不脱离本申请的精神和范围的情况下,可对其进行各种修改和组合。相应地,本说明书和附图仅仅是所附权利要求所界定的本申请的示例性说明,且视为已覆盖本申请范围内的任意和所有修改、变化、组合或等同物。显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。Although the present application has been described with reference to specific features and embodiments thereof, it is obvious that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the application as defined by the appended claims, and are deemed to have covered any and all modifications, changes, combinations, or equivalents that fall within the scope of the application. Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. In this way, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application also intends to include these changes and variations.

Claims (20)

  1. 一种表示学习方法,其特征在于,所述方法包括:A representation learning method, characterized in that the method comprises:
    根据融合文本的知识图谱的三元组中实体的类型,确定所述实体的类型表示向量,所述实体包括头实体和尾实体;Determining a type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text, the entity including a head entity and a tail entity;
    根据所述三元组中关系的类型,确定所述关系的类型表示向量;Determine the type representation vector of the relationship according to the type of the relationship in the triple;
    根据所述实体的文本信息,确定所述实体的上下文表示向量;Determining a context representation vector of the entity according to the text information of the entity;
    根据所述关系的权重值,确定所述关系的上下文表示向量;Determining a contextual representation vector of the relationship according to a weight value of the relationship;
    根据所述实体的类型表示向量、所述实体的上下文表示向量、所述关系的类型表示向量以及所述关系的上下文表示向量,构建所述三元组的打分函数;Constructing a scoring function for the triplet according to the type representation vector of the entity, the context representation vector of the entity, the type representation vector of the relationship, and the context representation vector of the relationship;
    根据所述三元组的打分函数,构建目标函数;Constructing an objective function according to the scoring function of the triplet;
    最小化所述目标函数,学习所述实体的表示向量和所述关系的表示向量。Minimize the objective function and learn a representation vector of the entity and a representation vector of the relationship.
  2. 根据权利要求1所述的表示学习方法,其特征在于,在所述根据融合文本的知识图谱的三元组中实体的类型,确定所述实体的类型表示向量之前,所述方法还包括:The representation learning method according to claim 1, wherein before determining the type representation vector of the entity in the triple of the knowledge graph of the fused text knowledge map, the method further comprises:
    初始化所述头实体的表示向量、所述尾实体的表示向量、所述关系的表示向量、实体类型表示矩阵、关系类型表示矩阵、词表示矩阵以及权重值的表示向量。Initialize a representation vector of the head entity, a representation vector of the tail entity, a representation vector of the relationship, an entity type representation matrix, a relationship type representation matrix, a word representation matrix, and a weight value representation vector.
  3. 根据权利要求2所述的表示学习方法,其特征在于,所述根据融合文本的知识图谱的三元组中实体的类型,确定所述实体的类型表示向量,包括:The representation learning method according to claim 2, wherein determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text comprises:
    根据所述头实体的类型,确定所述头实体的类型标识向量;Determining a type identification vector of the head entity according to the type of the head entity;
    根据公式
    Figure PCTCN2019096895-appb-100001
    确定所述头实体的类型表示向量;其中,f 1(h)表示所述头实体的类型表示向量,W etype表示所述实体类型表示矩阵,v etype(h)表示所述头实体的类型标识向量;
    According to formula
    Figure PCTCN2019096895-appb-100001
    Determine the type representation vector of the head entity; where f 1 (h) represents the type representation vector of the head entity, W etype represents the entity type representation matrix, and v etype (h) represents the type identifier of the head entity vector;
    根据所述尾实体的类型,确定所述尾实体的类型标识向量;Determining a type identification vector of the tail entity according to the type of the tail entity;
    根据公式
    Figure PCTCN2019096895-appb-100002
    确定所述尾实体的类型表示向量;其中,f 1(t)表示所述尾实体的类型表示向量,v etype(t)表示所述尾实体的类型标识向量。
    According to formula
    Figure PCTCN2019096895-appb-100002
    The type representation vector of the tail entity is determined; wherein f 1 (t) represents the type representation vector of the tail entity, and v etype (t) represents the type identification vector of the tail entity.
  4. 根据权利要求2所述的表示学习方法,其特征在于,所述根据所述三元组中关系的类型,确定所述关系的类型表示向量,包括:The representation learning method according to claim 2, wherein determining the type representation vector of the relationship according to the type of the relationship in the triplet comprises:
    根据所述关系的类型,确定所述关系的类型标识向量;Determining a type identification vector of the relationship according to the type of the relationship;
    根据公式:
    Figure PCTCN2019096895-appb-100003
    确定所述关系的类型表示向量;其中,g 1(r)表示所述关系的类型表示向量,W rtype表示所述关系类型表示矩阵,v rtype(r)表示所述关系的类型标识向量。
    According to the formula:
    Figure PCTCN2019096895-appb-100003
    Determine the type representation vector of the relationship; where g 1 (r) represents the type representation vector of the relationship, W rtype represents the relationship type representation matrix, and v rtype (r) represents the type identification vector of the relationship.
  5. 根据权利要求2所述的表示学习方法,其特征在于,所述根据所述实体的文本信息,确定所述实体的上下文表示向量,包括:The representation learning method according to claim 2, wherein determining the context representation vector of the entity based on the text information of the entity comprises:
    根据所述头实体的文本信息,确定与所述头实体相关的词;Determining words related to the head entity according to the text information of the head entity;
    根据公式:
    Figure PCTCN2019096895-appb-100004
    确定所述头实体的上下文表示向量;其中,f 2(h)表示所述头实体的上下文表示向量,α、β为取值在0到1之间的常数,v h表示所述头实体的表示向量,w i表示与所述头实体相关的词,ε 1表示所有与所述头实体相关的词构成的集合,W word表示词表示矩阵,V vocubalary(w i)表示w i的 标识向量;
    According to the formula:
    Figure PCTCN2019096895-appb-100004
    Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, α, β are constants between 0 and 1, and v h represents the head entity's Represents a vector, w i represents a word related to the head entity, ε 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i ;
    根据所述尾实体的文本信息,确定与所述尾实体相关的词;Determining words related to the tail entity according to the text information of the tail entity;
    根据公式:
    Figure PCTCN2019096895-appb-100005
    确定所述尾实体的上下文表示向量;其中,f 2(t)表示所述尾实体的上下文表示向量,v t表示所述尾实体的表示向量,m i表示与所述尾实体相关的词,ε 2表示所有与所述尾实体相关的词构成的集合,V vocubalary(m i)表示m i的标识向量。
    According to the formula:
    Figure PCTCN2019096895-appb-100005
    Determining a context representation vector of the tail entity; wherein f 2 (t) represents a context representation vector of the tail entity, v t represents a representation vector of the tail entity, and mi represents a word related to the tail entity, ε 2 represents a set of all words related to the tail entity, and V vocubalary (m i ) represents an identification vector of mi .
  6. 根据权利要求2所述的表示学习方法,其特征在于,所述根据所述关系的权重值,确定所述关系的上下文表示向量,包括:The representation learning method according to claim 2, wherein the determining a context representation vector of the relationship according to a weight value of the relationship comprises:
    根据公式:
    Figure PCTCN2019096895-appb-100006
    确定所述关系的上下文表示向量;其中,g 2(r)表示所述关系的上下文表示向量,v r表示所述关系的表示向量,n i表示所述关系的权重值,ε 3表示所述关系的所有权重值构成的集合,
    Figure PCTCN2019096895-appb-100007
    表示n i的表示向量。
    According to the formula:
    Figure PCTCN2019096895-appb-100006
    Determine the context representation vector of the relationship; where g 2 (r) represents the context representation vector of the relationship, v r represents the representation vector of the relationship, n i represents the weight value of the relationship, and ε 3 represents the A collection of relationship weights,
    Figure PCTCN2019096895-appb-100007
    A representation vector representing n i .
  7. 根据权利要求1至6任一项所述的表示学习方法,其特征在于,所述三元组的打分函数为:The representation learning method according to any one of claims 1 to 6, wherein the scoring function of the triples is:
    Figure PCTCN2019096895-appb-100008
    Figure PCTCN2019096895-appb-100008
    其中,ο表示复合运算,f 1(h)表示头实体的类型表示向量,g 1(r)表示关系的类型表示向量,f 1(t)表示尾实体的类型表示向量,f 2(h)表示头实体的上下文表示向量,g 2(r)表示关系的上下文表示向量,f 2(t)表示尾实体的上下文表示向量。 Among them, ο represents a compound operation, f 1 (h) represents a type of a head entity represents a vector, g 1 (r) represents a type of a relation represents a vector, f 1 (t) represents a type of a tail entity represents a vector, and f 2 (h) The context representation vector represents the head entity, g 2 (r) represents the context representation vector of the relationship, and f 2 (t) represents the context representation vector of the tail entity.
  8. 根据权利要求7所述的表示学习方法,其特征在于,所述目标函数为:The representation learning method according to claim 7, wherein the objective function is:
    Figure PCTCN2019096895-appb-100009
    Figure PCTCN2019096895-appb-100009
    其中,(h,r,t)表示正例三元组,Δ表示正例三元组集合,(h′,r,t′)表示负例三元组,h′表示负例的头实体,t′表示负例的尾实体,Δ′表示负例三元组集合,M为常数。Among them, (h, r, t) represents a positive example triplet, Δ represents a positive example triplet set, (h ′, r, t ′) represents a negative example triplet, and h ′ represents a head entity of a negative example, t ′ represents the tail entity of the negative example, Δ ′ represents the set of triples of the negative example, and M is a constant.
  9. 根据权利要求1至8任一项所述的表示学习方法,其特征在于,在所述根据融合文本的知识图谱的三元组中实体的类型,确定所述实体的类型表示向量之前,所述方法还包括:The representation learning method according to any one of claims 1 to 8, characterized in that before determining the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text, the The method also includes:
    获取初始知识图谱;Obtain the initial knowledge map;
    基于所述初始知识图谱的框架,构建融合文本的知识图谱的框架;所述融合文本的知识图谱的框架至少定义以下内容:实体的扩展属性、关系的扩展属性以及实体之间的扩展关系;所述实体的扩展属性包括实体的文本信息;Based on the framework of the initial knowledge map, construct a framework of the knowledge map of the fused text; the framework of the knowledge map of the fused text defines at least the following: the extended attributes of the entity, the extended attributes of the relationship, and the extended relationships between the entities; The extended attributes of the entity include text information of the entity;
    根据所述初始知识图谱中实体的信息或者关系的信息,获取外部数据;Obtaining external data according to entity information or relationship information in the initial knowledge map;
    从外部数据中,确定所述实体的扩展属性值以及所述关系的扩展属性值,以构建所述融合文本的知识图谱。From the external data, determine the extended attribute value of the entity and the extended attribute value of the relationship to construct a knowledge map of the fused text.
  10. 一种表示学习装置,其特征在于,包括:A representation learning device, comprising:
    类型表示模块,用于根据融合文本的知识图谱的三元组中实体的类型,确定所述实体的类型表示向量,所述实体包括头实体和尾实体;根据所述三元组中关系的类型,确定所述关系的类型表示向量;A type representation module, configured to determine the type representation vector of the entity according to the type of the entity in the triple of the knowledge graph of the fused text, the entity including a head entity and a tail entity; and according to the type of the relationship in the triple To determine a type representation vector of the relationship;
    上下文表示模块,用于根据所述实体的文本信息,确定所述实体的上下文表示向 量;根据所述关系的权重值,确定所述关系的上下文表示向量;A context representation module, configured to determine a context representation vector of the entity according to text information of the entity; and determine a context representation vector of the relationship according to a weight value of the relationship;
    处理模块,用于根据所述实体的类型表示向量、所述实体的上下文表示向量、所述关系的类型表示向量以及所述关系的上下文表示向量,构建所述三元组的打分函数;根据所述三元组的打分函数,构建目标函数;最小化所述目标函数,学习所述实体的表示向量和所述关系的表示向量。A processing module, configured to construct a scoring function of the triple according to the type representation vector of the entity, the context representation vector of the entity, the type representation vector of the relationship, and the context representation vector of the relationship; The scoring function of the triplet is described to construct an objective function; the objective function is minimized, and a representation vector of the entity and a representation vector of the relationship are learned.
  11. 根据权利要求10所述的表示学习装置,其特征在于,The representation learning device according to claim 10, wherein:
    所述处理模块,还用于初始化所述头实体的表示向量、所述尾实体的表示向量、所述关系的表示向量、实体类型表示矩阵、关系类型表示矩阵、词表示矩阵以及权重值的表示向量。The processing module is further configured to initialize a representation vector of the head entity, a representation vector of the tail entity, a representation vector of the relationship, an entity type representation matrix, a relationship type representation matrix, a word representation matrix, and a weight value representation. vector.
  12. 根据权利要求11所述的表示学习装置,其特征在于,The representation learning device according to claim 11, wherein:
    所述类型表示模块,用于根据所述头实体的类型,确定所述头实体的类型标识向量;The type indicating module is configured to determine a type identification vector of the head entity according to the type of the head entity;
    根据公式
    Figure PCTCN2019096895-appb-100010
    确定所述头实体的类型表示向量;其中,f 1(h)表示所述头实体的类型表示向量,W etype表示所述实体类型表示矩阵,v etype(h)表示所述头实体的类型标识向量;
    According to formula
    Figure PCTCN2019096895-appb-100010
    Determine the type representation vector of the head entity; where f 1 (h) represents the type representation vector of the head entity, W etype represents the entity type representation matrix, and v etype (h) represents the type identifier of the head entity vector;
    根据所述尾实体的类型,确定所述尾实体的类型标识向量;Determining a type identification vector of the tail entity according to the type of the tail entity;
    根据公式
    Figure PCTCN2019096895-appb-100011
    确定所述尾实体的类型表示向量;其中,f 1(t)表示所述尾实体的类型表示向量,v etype(t)表示所述尾实体的类型标识向量。
    According to formula
    Figure PCTCN2019096895-appb-100011
    The type representation vector of the tail entity is determined; wherein f 1 (t) represents the type representation vector of the tail entity, and v etype (t) represents the type identification vector of the tail entity.
  13. 根据权利要求11所述的表示学习装置,其特征在于,The representation learning device according to claim 11, wherein:
    所述类型表示模块,用于根据所述关系的类型,确定所述关系的类型标识向量;The type indicating module is configured to determine a type identification vector of the relationship according to the type of the relationship;
    根据公式:
    Figure PCTCN2019096895-appb-100012
    确定所述关系的类型表示向量;其中,g 1(r)表示所述关系的类型表示向量,W rtype表示所述关系类型表示矩阵,v rtype(r)表示所述关系的类型标识向量。
    According to the formula:
    Figure PCTCN2019096895-appb-100012
    Determine the type representation vector of the relationship; where g 1 (r) represents the type representation vector of the relationship, W rtype represents the relationship type representation matrix, and v rtype (r) represents the type identification vector of the relationship.
  14. 根据权利要求11所述的表示学习装置,其特征在于,The representation learning device according to claim 11, wherein:
    所述上下文表示模块,用于根据所述头实体的文本信息,确定与所述头实体相关的词;The context representation module, configured to determine words related to the head entity according to the text information of the head entity;
    根据公式:
    Figure PCTCN2019096895-appb-100013
    确定所述头实体的上下文表示向量;其中,f 2(h)表示所述头实体的上下文表示向量,α、β为取值在0到1之间的常数,v h表示所述头实体的表示向量,w i表示与所述头实体相关的词,ε 1表示所有与所述头实体相关的词构成的集合,W word表示词表示矩阵,V vocubalary(w i)表示w i的标识向量;
    According to the formula:
    Figure PCTCN2019096895-appb-100013
    Determine the context representation vector of the head entity; where f 2 (h) represents the context representation vector of the head entity, α, β are constants between 0 and 1, and v h represents the head entity's Represents a vector, w i represents a word related to the head entity, ε 1 represents a set of all words related to the head entity, W word represents a matrix of words, and V vocubalary (w i ) represents an identification vector of w i ;
    根据所述尾实体的文本信息,确定与所述尾实体相关的词;Determining words related to the tail entity according to the text information of the tail entity;
    根据公式:
    Figure PCTCN2019096895-appb-100014
    确定所述尾实体的上下文表示向量;其中,f 2(t)表示所述尾实体的上下文表示向量,v t表示所述尾实体的表示向量,m i表示与所述尾实体相关的词,ε 2表示所有与所述尾实体相关的词构成的集合,V vocubalary(m i)表示m i的标识向量。
    According to the formula:
    Figure PCTCN2019096895-appb-100014
    Determining a context representation vector of the tail entity; wherein f 2 (t) represents a context representation vector of the tail entity, v t represents a representation vector of the tail entity, and mi represents a word related to the tail entity, ε 2 represents a set of all words related to the tail entity, and V vocubalary (m i ) represents an identification vector of mi .
  15. 根据权利要求11所述的表示学习装置,其特征在于,The representation learning device according to claim 11, wherein:
    所述上下文表示模块,用于根据公式:
    Figure PCTCN2019096895-appb-100015
    确定所述关系的上下文表示向量;其中,g 2(r)表示所述关系的上下文表示向量,v r表示所述关系的表示向量,n i表示所述关系的权重值,ε 3表示所述关系的所有权重值构成的集合,
    Figure PCTCN2019096895-appb-100016
    表示n i的表示向量。
    The context representation module is used to formulate according to the formula:
    Figure PCTCN2019096895-appb-100015
    Determine the context representation vector of the relationship; where g 2 (r) represents the context representation vector of the relationship, v r represents the representation vector of the relationship, n i represents the weight value of the relationship, and ε 3 represents the A collection of relationship weights,
    Figure PCTCN2019096895-appb-100016
    A representation vector representing n i .
  16. 根据权利要求10至15任一项所述的表示学习装置,其特征在于,所述三元组的打分函数为:The representation learning device according to any one of claims 10 to 15, wherein the scoring function of the triples is:
    Figure PCTCN2019096895-appb-100017
    Figure PCTCN2019096895-appb-100017
    其中,ο表示复合运算,f 1(h)表示头实体的类型表示向量,g 1(r)表示关系的类型表示向量,f 1(t)表示尾实体的类型表示向量,f 2(h)表示头实体的上下文表示向量,g 2(r)表示关系的上下文表示向量,f 2(t)表示尾实体的上下文表示向量。 Among them, ο represents a compound operation, f 1 (h) represents a type of a head entity represents a vector, g 1 (r) represents a type of a relation represents a vector, f 1 (t) represents a type of a tail entity represents a vector, and f 2 (h) The context representation vector represents the head entity, g 2 (r) represents the context representation vector of the relationship, and f 2 (t) represents the context representation vector of the tail entity.
  17. 根据权利要求16所述的表示学习装置,其特征在于,所述目标函数为:The representation learning device according to claim 16, wherein the objective function is:
    Figure PCTCN2019096895-appb-100018
    Figure PCTCN2019096895-appb-100018
    其中,(h,r,t)表示正例三元组,Δ表示正例三元组集合,(h′,r,t′)表示负例三元组,h′表示负例的头实体,t′表示负例的尾实体,Δ′表示负例三元组集合,M为常数。Among them, (h, r, t) represents a positive example triplet, Δ represents a positive example triplet set, (h ′, r, t ′) represents a negative example triplet, and h ′ represents a head entity of a negative example, t ′ represents the tail entity of the negative example, Δ ′ represents the set of triples of the negative example, and M is a constant.
  18. 根据权利要求10至17任一项所述的表示学习装置,其特征在于,所述表示学习装置还包括:框架扩展模块、数据获取模块以及扩展映射模块;The representation learning device according to any one of claims 10 to 17, wherein the representation learning device further comprises: a frame extension module, a data acquisition module, and an extension mapping module;
    框架扩展模块,用于获取初始知识图谱;基于所述初始知识图谱的框架,构建融合文本的知识图谱的框架;所述融合文本的知识图谱的框架至少定义以下内容:实体的扩展属性、关系的扩展属性以及实体之间的扩展关系;所述实体的扩展属性包括实体的文本信息;A framework extension module is used to obtain an initial knowledge map; based on the framework of the initial knowledge map, construct a framework of the knowledge map of the fused text; the framework of the knowledge map of the fused text defines at least the following: the extended attributes of the entity, the Extended attributes and extended relationships between entities; the extended attributes of the entities include textual information of the entities;
    所述数据获取模块,用于根据所述初始知识图谱中实体的信息或者关系的信息,获取外部数据;The data acquisition module is configured to acquire external data according to information of entities or relationships in the initial knowledge map;
    所述扩展映射模块,用于从外部数据中,确定所述实体的扩展属性值以及所述关系的扩展属性值,以构建所述融合文本的知识图谱。The extended mapping module is configured to determine an extended attribute value of the entity and an extended attribute value of the relationship from external data to construct a knowledge map of the fused text.
  19. 一种服务器,其特征在于,包括:处理器、存储器、总线和通信接口;所述存储器用于存储计算机执行指令,所述处理器与所述存储器通过所述总线连接,当所述服务器运行时,所述处理器执行所述存储器存储的所述计算机执行指令,以使所述服务器执行权利要求1至9任一项所述的表示学习方法。A server is characterized in that it includes: a processor, a memory, a bus, and a communication interface; the memory is used to store a computer to execute instructions, the processor is connected to the memory through the bus, and when the server is running The processor executes the computer execution instructions stored in the memory, so that the server executes the representation learning method according to any one of claims 1 to 9.
  20. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,当所述计算机可读存储介质在计算机上运行时,使得计算机执行权利要求1至9任一项所述的表示学习方法。A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the computer-readable storage medium is run on a computer, the computer causes the computer to execute any one of claims 1 to 9 The representation learning method described above.
PCT/CN2019/096895 2018-07-24 2019-07-19 Representation learning method and device WO2020020085A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810822334.5A CN110851609A (en) 2018-07-24 2018-07-24 Representation learning method and device
CN201810822334.5 2018-07-24

Publications (1)

Publication Number Publication Date
WO2020020085A1 true WO2020020085A1 (en) 2020-01-30

Family

ID=69181200

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096895 WO2020020085A1 (en) 2018-07-24 2019-07-19 Representation learning method and device

Country Status (2)

Country Link
CN (1) CN110851609A (en)
WO (1) WO2020020085A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111496784A (en) * 2020-03-27 2020-08-07 山东大学 Space environment identification method and system for robot intelligent service

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784059A (en) * 2021-01-20 2021-05-11 和美(深圳)信息技术股份有限公司 Knowledge graph representation learning method and device, electronic device and storage medium
CN112784066B (en) * 2021-03-15 2023-11-03 中国平安人寿保险股份有限公司 Knowledge graph-based information feedback method, device, terminal and storage medium
CN113158668B (en) * 2021-04-19 2023-02-28 平安科技(深圳)有限公司 Relationship alignment method, device, equipment and medium based on structured information
CN115168599B (en) * 2022-06-20 2023-06-20 北京百度网讯科技有限公司 Multi-triplet extraction method, device, equipment, medium and product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630901A (en) * 2015-12-21 2016-06-01 清华大学 Knowledge graph representation learning method
CN107644062A (en) * 2017-08-29 2018-01-30 广州思涵信息科技有限公司 The knowledge content Weight Analysis System and method of a kind of knowledge based collection of illustrative plates
CN107741943A (en) * 2017-06-08 2018-02-27 清华大学 The representation of knowledge learning method and server of a kind of binding entity image
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 A kind of knowledge mapping of binding sequence text message represents learning method and device
CN107885760A (en) * 2016-12-21 2018-04-06 桂林电子科技大学 It is a kind of to represent learning method based on a variety of semantic knowledge mappings
CN108197290A (en) * 2018-01-19 2018-06-22 桂林电子科技大学 A kind of knowledge mapping expression learning method for merging entity and relationship description

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630901A (en) * 2015-12-21 2016-06-01 清华大学 Knowledge graph representation learning method
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 A kind of knowledge mapping of binding sequence text message represents learning method and device
CN107885760A (en) * 2016-12-21 2018-04-06 桂林电子科技大学 It is a kind of to represent learning method based on a variety of semantic knowledge mappings
CN107741943A (en) * 2017-06-08 2018-02-27 清华大学 The representation of knowledge learning method and server of a kind of binding entity image
CN107644062A (en) * 2017-08-29 2018-01-30 广州思涵信息科技有限公司 The knowledge content Weight Analysis System and method of a kind of knowledge based collection of illustrative plates
CN108197290A (en) * 2018-01-19 2018-06-22 桂林电子科技大学 A kind of knowledge mapping expression learning method for merging entity and relationship description

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111496784A (en) * 2020-03-27 2020-08-07 山东大学 Space environment identification method and system for robot intelligent service

Also Published As

Publication number Publication date
CN110851609A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
WO2020020085A1 (en) Representation learning method and device
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US11520812B2 (en) Method, apparatus, device and medium for determining text relevance
US9754188B2 (en) Tagging personal photos with deep networks
JP6634515B2 (en) Question clustering processing method and apparatus in automatic question answering system
JP2022002075A (en) Information recommendation method and device, electronic apparatus, program and computer readable storage medium
CN111539197B (en) Text matching method and device, computer system and readable storage medium
US9536444B2 (en) Evaluating expert opinions in a question and answer system
WO2014126657A1 (en) Latent semantic analysis for application in a question answer system
US20130031126A1 (en) Weighting metric for visual search of entity-relationship databases
CN111563192B (en) Entity alignment method, device, electronic equipment and storage medium
WO2017181866A1 (en) Making graph pattern queries bounded in big graphs
US10474747B2 (en) Adjusting time dependent terminology in a question and answer system
CN110275962B (en) Method and apparatus for outputting information
US20160124954A1 (en) Using Synthetic Events to Identify Complex Relation Lookups
CN111522886B (en) Information recommendation method, terminal and storage medium
CN106777218B (en) Ontology matching method based on attribute similarity
CN112364947A (en) Text similarity calculation method and device
WO2023168810A1 (en) Method and apparatus for predicting properties of drug molecule, storage medium, and computer device
US10977573B1 (en) Distantly supervised wrapper induction for semi-structured documents
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN109472023B (en) Entity association degree measuring method and system based on entity and text combined embedding and storage medium
Xu et al. An upper-ontology-based approach for automatic construction of IOT ontology
US9910890B2 (en) Synthetic events to chain queries against structured data
US11409773B2 (en) Selection device, selection method, and non-transitory computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19840846

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19840846

Country of ref document: EP

Kind code of ref document: A1