CN115292520A - Knowledge graph construction method for multi-source mobile application - Google Patents

Knowledge graph construction method for multi-source mobile application Download PDF

Info

Publication number
CN115292520A
CN115292520A (application CN202211187813.7A)
Authority
CN
China
Prior art keywords
app, entity, mobile application, entities, source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211187813.7A
Other languages
Chinese (zh)
Other versions
CN115292520B (en)
Inventor
李炜卓
罗维柒
张浩魏
边宇阳
周文博
隋永波
季秋
高辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211187813.7A priority Critical patent/CN115292520B/en
Publication of CN115292520A publication Critical patent/CN115292520A/en
Application granted granted Critical
Publication of CN115292520B publication Critical patent/CN115292520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/367 — Information retrieval; creation of semantic tools: ontology
    • G06F40/216 — Natural language analysis: parsing using statistical methods
    • G06F40/242 — Natural language analysis: lexical tools, dictionaries
    • G06F40/279 — Natural language analysis: recognition of textual entities
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Semantic analysis
    • G06N5/02 — Knowledge representation; symbolic representation


Abstract

The invention discloses a knowledge graph construction method for multi-source mobile applications. The method generates a triple set from mobile application data drawn from different data sources; encodes the entities and relations to obtain corresponding vector representations; computes the similarity between entity vectors, takes the entities whose vector similarity exceeds a set threshold as initial semantic-equivalence entity pairs, and determines a seed set; infers potential semantic-equivalence entity pairs from the seed set according to meta-rules; computes the probability that each potential semantic-equivalence entity pair holds; and compares the computed probability with a set probability threshold to finally determine the semantic equivalence relations between entities across the multi-source mobile applications, thereby obtaining the multi-source mobile application knowledge graph. The method significantly reduces the cost of manually annotating entity semantic-equivalence relations across multi-source data during knowledge graph construction.

Description

Knowledge graph construction method for multi-source mobile application
Technical Field
The invention belongs to the field of knowledge representation and processing in knowledge engineering, and particularly relates to a mobile application knowledge graph construction method under the condition of multi-source data.
Background
With the popularity of smartphones and mobile devices, the number of mobile applications (APPs) is growing rapidly, bringing great convenience to online shopping, education, personal finance, and other activities.
However, as more and more APPs are developed and released, many APPs on the network also carry malicious risks: spreading harmful information, violating user privacy, or even breaching national information security regulations. For ordinary users, a comprehensive mobile application knowledge base helps them look up APPs and guard against APP fraud; for network security analysts, a comprehensive mobile application knowledge graph not only helps them discover potential risks faster but also helps secure the mobile network to a certain extent.
Although scholars have proposed mobile application knowledge bases such as DREBIN, AndroZoo++, and AndroVault, these knowledge bases suffer from single data sources, small overall data volumes, and incomplete attributes, so they cannot present APP information comprehensively. Moreover, the existing APP knowledge bases focus on analyzing the low-level data of individual APPs (such as application permissions and application privacy), and thus lack correlation analysis between APPs and cannot share and reuse APP information across multi-source data. Constructing a mobile application knowledge graph from multi-source data and establishing semantic associations between APPs across different data sources is therefore essential for upper-layer APP analysis (such as risk early warning and risk association). It also provides high-quality data resources for research in the knowledge engineering and network security communities.
Disclosure of Invention
The invention aims to construct, from multi-source data, a low-cost and high-quality mobile application knowledge graph.
In order to realize the technical purpose, the invention adopts the following technical scheme:
the invention provides a multisource-oriented mobile application knowledge graph construction method, which comprises the following steps:
generating a triple set {(S_o_app_z, r, e)} based on the retrieved mobile application data from different data sources, where S_o_app_z corresponds to the head entity and is defined as the mobile application numbered z in the o-th data source, r corresponds to the relation, and e corresponds to the tail entity;
respectively coding the entity and the relation to obtain corresponding vector representation;
calculating the similarity between entity vectors using cosine values, and preliminarily determining the entities whose vector similarity exceeds a set threshold as entity semantic-equivalence pairs;
determining a seed set from the preliminarily determined entity semantic-equivalence pairs, and inferring potential entity or relation semantic-equivalence pairs from the seed set according to meta-rules;
calculating, with a probabilistic graph model, the probability that each potential entity or relation semantic-equivalence pair holds; comparing the calculated probability with a set probability threshold, finally determining from the comparison result the semantic equivalence relations between entities or relations across the multi-source mobile applications, and thereby obtaining the multi-source mobile application knowledge graph.
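The first three steps above can be sketched on toy data as follows. A character-bigram counter stands in for the BERT encoder described later, and the two application identifiers are invented for illustration; this is a minimal sketch, not the patent's implementation.

```python
import math

# Toy "encoder": character-bigram count vector (stand-in for BERT).
def encode(text):
    vec = {}
    for a, b in zip(text, text[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

# Cosine similarity between two sparse count vectors.
def cosine(u, v):
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Step 1: triples (S_o_app_z, r, e) from two hypothetical sources.
triples = [
    ("S1_app_wechat", "developer", "Tencent"),
    ("S2_app_weixin", "developer", "Tencent"),
]

# Steps 2-3: encode head entities, then seed equivalence pairs whose
# cosine similarity clears a set threshold.
vectors = {h: encode(h) for h, _, _ in triples}
seeds = {(a, b) for a in vectors for b in vectors
         if a < b and cosine(vectors[a], vectors[b]) > 0.3}
```

Here the two head entities share enough character bigrams to clear the threshold, so they become an initial semantic-equivalence pair.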
Further, encoding the entities and relations to obtain corresponding vector representations includes:
expressing each triple as a sentence in the form "subject-predicate-is-object": (S_o_app_z [SEP] r [SEP] is [SEP] e), where [SEP] is the word-segmentation separator and "S_o_app_z", "r", "is", and "e" are each treated as word blocks during segmentation;
taking the sentence as input, encoding the segmented word blocks with an adapted Chinese pre-trained BERT model to obtain the vector representations of S_o_app_z, r, and e in each triple.
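The sentence construction step can be sketched as follows; the triple values are illustrative stand-ins, and the resulting string is what would be handed to the BERT tokenizer.

```python
# Build the "subject [SEP] predicate [SEP] is [SEP] object" sentence
# that serves as BERT input for one triple.
def triple_to_sentence(head, relation, tail, sep="[SEP]"):
    # "head", "relation", "is", and "tail" each become one word block.
    return f"({head} {sep} {relation} {sep} is {sep} {tail})"

sentence = triple_to_sentence("S1_app_1", "developer", "ExampleSoft")
# -> "(S1_app_1 [SEP] developer [SEP] is [SEP] ExampleSoft)"
```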
Further, during the encoding of entities and relations, nouns or adjectives among the segmented word blocks are randomly replaced by synonyms from a synonym dictionary according to a replacement probability, computed as:

P(t_i) = exp(−w(t_i)) / Σ_{j=1}^{n_w} exp(−w(t_j))

where t_i is a word block in the sentence, n_w is the number of word blocks in the sentence, j is the word-block index, w(t_i) is the penalty incurred by replacing word block t_i, and exp(·) is the exponential function.
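A sketch of this replacement probability, assuming it is a softmax over the negated penalties w(t_i), so that word blocks that are cheap to replace are replaced more often; the negative sign is an interpretation of the "penalty" wording, and the penalty values below are illustrative.

```python
import math

# Softmax over negated replacement penalties: one probability per
# word block t_i, summing to 1 across the sentence.
def replacement_probs(penalties):
    weights = [math.exp(-w) for w in penalties]
    total = sum(weights)
    return [w / total for w in weights]

# Three word blocks; the first has zero penalty, so it is the most
# likely replacement target.
probs = replacement_probs([0.0, 1.0, 1.0])
```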
Furthermore, the seed set is denoted ES = AES ∪ RES ∪ EES, where AES is the set of head-entity semantic-equivalence pairs, RES the set of relation semantic-equivalence pairs, and EES the set of tail-entity semantic-equivalence pairs;
the meta-rules include:

Rule R_1: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), where S_i_app_x is the mobile application numbered x in the i-th data source, S_i_r_x its corresponding relation, and S_i_e_x its corresponding tail entity, and S_j_app_y, S_j_r_y, S_j_e_y are likewise the mobile application numbered y in the j-th data source, its corresponding relation, and its corresponding tail entity: if S_i_app_x and S_j_app_y are a head-entity semantic-equivalence pair, i.e. the head-entity semantic equivalence S_i_app_x ≡ S_j_app_y holds, and S_i_e_x and S_j_e_y are a tail-entity semantic-equivalence pair, i.e. the tail-entity semantic equivalence S_i_e_x ≡ S_j_e_y holds, then S_i_r_x and S_j_r_y are a relation semantic-equivalence pair, i.e. the relation semantic equivalence S_i_r_x ≡ S_j_r_y holds with confidence p. Rule R_1 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_p (S_i_r_x ≡ S_j_r_y)

Rule R_2: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y): if the head-entity semantic equivalence S_i_app_x ≡ S_j_app_y holds and the relation semantic equivalence S_i_r_x ≡ S_j_r_y holds, then the tail-entity semantic equivalence S_i_e_x ≡ S_j_e_y holds with confidence q. Rule R_2 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_r_x ≡ S_j_r_y) ⇒_q (S_i_e_x ≡ S_j_e_y)

Rule R_3: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y): if the relation semantic equivalence S_i_r_x ≡ S_j_r_y holds and the tail-entity semantic equivalence S_i_e_x ≡ S_j_e_y holds, then the head-entity semantic equivalence S_i_app_x ≡ S_j_app_y holds with confidence l. Rule R_3 is expressed as:

(S_i_r_x ≡ S_j_r_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_l (S_i_app_x ≡ S_j_app_y)
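One round of the meta-rule inference can be sketched as follows; the triples and equivalence sets are invented for illustration, and the helper stands in for the patent's rule engine.

```python
# Apply rules R1-R3 to one pair of triples from different sources.
# aes/res/ees are the head-entity, relation, and tail-entity
# semantic-equivalence pair sets (the seed set ES = AES u RES u EES).
def apply_meta_rules(t1, t2, aes, res, ees):
    new = []
    head, rel, tail = (t1[0], t2[0]), (t1[1], t2[1]), (t1[2], t2[2])
    if head in aes and tail in ees and rel not in res:
        new.append(("R1", rel))   # infer relation equivalence
    if head in aes and rel in res and tail not in ees:
        new.append(("R2", tail))  # infer tail-entity equivalence
    if rel in res and tail in ees and head not in aes:
        new.append(("R3", head))  # infer head-entity equivalence
    return new

t1 = ("S1_app_1", "developer", "ExampleSoft")
t2 = ("S2_app_9", "dev_company", "ExampleSoft")
inferred = apply_meta_rules(
    t1, t2,
    aes={("S1_app_1", "S2_app_9")},        # heads already aligned
    res=set(),
    ees={("ExampleSoft", "ExampleSoft")},  # tails already aligned
)
```

With the head and tail entities aligned, rule R_1 fires and proposes that "developer" and "dev_company" are semantically equivalent relations.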
Further, the probability that a potential entity or relation semantic-equivalence pair holds is computed with the probabilistic graph model, as follows:

P(equiv = F | R_1, R_2, R_3, S_0) = (1 − λ_0) · ∏_{i=1}^{3} (1 − p_i)^{K_i}

P(equiv = T | R_1, R_2, R_3, S_0) = 1 − (1 − λ_0) · ∏_{i=1}^{3} (1 − p_i)^{K_i}

where R_i = T denotes that the i-th rule satisfies its triggering condition, i ∈ {1, 2, 3}, and R_i = F that it does not; λ_0 is the similarity between the original semantic-equivalence entity pair; p_i is the probability that rule R_i holds, corresponding to the confidence of rule R_i; K_i is the number of times rule R_i is triggered; and S_0 is the initial probability of entity or relation semantic equivalence across different data sources. P(equiv = F | R_1, R_2, R_3, S_0) is the probability that the cross-source entity or relation semantic equivalence does not hold given rules R_1, R_2, R_3 and the initial probability S_0, and P(equiv = T | R_1, R_2, R_3, S_0) is the probability that it does hold.
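A sketch of this Noisy-or combination: the equivalence fails only if the prior evidence and every rule firing all fail independently. The concrete similarity, confidences, and trigger counts below are illustrative.

```python
# Noisy-or: P(equiv) = 1 - (1 - lam0) * prod_i (1 - p_i)^K_i
def noisy_or(lam0, confidences, triggers):
    p_false = 1.0 - lam0                    # prior evidence fails
    for p_i, k_i in zip(confidences, triggers):
        p_false *= (1.0 - p_i) ** k_i       # each rule firing fails
    return 1.0 - p_false                    # probability equiv holds

# lam0: similarity of the original pair; confidences p, q, l of rules
# R1-R3; triggers: how often each rule fired for this candidate pair.
p = noisy_or(lam0=0.5, confidences=[0.8, 0.7, 0.6], triggers=[1, 0, 2])
```

The resulting probability is then compared against the set probability threshold to accept or reject the candidate pair.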
Further, determining the seed set from the initial semantic-equivalence entity pairs includes:
determining the seed set from both the entity semantic-equivalence pairs preliminarily determined by similarity and the entity equivalence pairs obtained by comparing the character-string lengths of the entities.
Further, after encoding the entities and relations into vector representations:
the knowledge graph representation learning model is used to update the vector representations of the head and tail entities, and the network representation learning model is used to obtain the final entity vector representations from the updated representations.
Further, iterative hybrid representation learning is performed in advance on the network representation learning model and the knowledge graph representation learning model, as follows:

Step 201: train the knowledge graph representation learning model, with the loss function:

L_KG^{(k+1)} = Σ_{(h,r,e)∈T} Σ_{(h',r,e')∈T'} [ f_k(h, r, e) − f_k(h', r, e') + γ ]_+

where k is the round number of the iterative hybrid learning; L_KG^{(k+1)} is the loss function of the knowledge-graph-based representation learning model in round k+1; T is the triple set and T' is the negative-example triple set obtained by a negative sampling process that randomly replaces the head entity h or tail entity e of a triple with h' or e', with r the corresponding relation; [x]_+ = max(x, 0) is the hinge (fold) loss and γ is the margin; f_k(h, r, e) is the scoring function of the knowledge graph representation learning model for triple (h, r, e) at iteration k; and f_k(h', r, e') is its scoring function for the triple (h', r, e') after updating the head and tail entities at iteration k;

Step 202: for the triple vector representations after knowledge graph representation learning training, in the training of the network representation learning model the head-entity and tail-entity vectors are updated to the vectors of nodes v_i and v_j at iteration k of the network representation learning model, with v_i, v_j ∈ R^d, where d is the vector dimension and R^d is the network semantic space of dimension d; the loss function of network representation learning is defined as:

L_N^{(k+1)} = − Σ_{v_i ∈ V} Σ_{v_j ∈ N(v_i)} log g_k(v_i, v_j)

where L_N^{(k+1)} is the loss function of the network representation learning model in round k+1; V is the node set of the network representation learning model; N(v_i) is the set of neighbor nodes of v_i; and g_k(v_i, v_j) is the scoring function of the network representation learning model when updating nodes v_i and v_j at iteration k;

Step 203: take the learned vectors of nodes v_i and v_j as the head-entity and tail-entity vectors of the knowledge graph representation learning model in round k+1, and carry out round k+1 of its training;

the iterative hybrid representation learning terminates after the prescribed number of iterations, yielding the final vector representations of all entities.
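The alternation of steps 201–203 can be sketched as a simple loop; the two per-round "training" functions below are trivial numeric stand-ins for the actual knowledge graph and network representation learning models.

```python
# Stand-in for one round of knowledge graph representation learning.
def train_kg_round(vectors):
    return {k: [x * 0.9 for x in v] for k, v in vectors.items()}

# Stand-in for one round of network representation learning.
def train_network_round(vectors):
    return {k: [x + 0.1 for x in v] for k, v in vectors.items()}

# Steps 201-203: each round trains the KG model, hands the entity
# vectors to the network model as node vectors, and feeds the node
# vectors back as the next round's entity vectors.
def hybrid_training(entity_vectors, rounds):
    for _ in range(rounds):                               # fixed budget
        entity_vectors = train_kg_round(entity_vectors)       # step 201
        entity_vectors = train_network_round(entity_vectors)  # steps 202-203
    return entity_vectors

final = hybrid_training({"S1_app_1": [1.0, 2.0]}, rounds=2)
```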
Further, the method includes the following constraints:

Constraint CS_1: for an obtained head-entity semantic-equivalence pair S_i_app_x ≡ S_j_app_y and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding head entities S_i_app_x and S_j_app_y of the two triples are replaced during negative sampling, S_i_app_x and S_j_app_y must be excluded as negative-example alternatives;

Constraint CS_2: for an obtained tail-entity semantic-equivalence pair S_i_e_x ≡ S_j_e_y and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding tail entity S_i_e_x or S_j_e_y of the two triples is replaced during negative sampling, S_i_e_x and S_j_e_y must be excluded as negative examples. Here S_i_app_x is the mobile application numbered x in the i-th data source, S_i_r_x its corresponding relation, and S_i_e_x its corresponding tail entity; S_j_app_y is the mobile application numbered y in the j-th data source, S_j_r_y its corresponding relation, and S_j_e_y its corresponding tail entity.
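A sketch of constraint CS_1 applied during negative sampling: entities known to be semantically equivalent to a triple's head are never used as its corrupted replacement. Entity names here are illustrative.

```python
import random

# Corrupt the head of a triple while honoring constraint CS1:
# the head itself and anything semantically equivalent to it are
# banned as negative-example replacements.
def corrupt_head(triple, all_heads, head_equiv_pairs, rng):
    head, r, e = triple
    banned = {head} \
        | {b for a, b in head_equiv_pairs if a == head} \
        | {a for a, b in head_equiv_pairs if b == head}
    candidates = [h for h in all_heads if h not in banned]
    return (rng.choice(candidates), r, e)

rng = random.Random(0)
neg = corrupt_head(
    ("S1_app_1", "developer", "ExampleSoft"),
    all_heads=["S1_app_1", "S2_app_9", "S1_app_2"],
    head_equiv_pairs={("S1_app_1", "S2_app_9")},  # CS1: excluded
    rng=rng,
)
```

Only "S1_app_2" survives the ban, so it becomes the corrupted head; an analogous filter on tail entities implements CS_2.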
Further, calculating the similarity between entity vectors using cosine values and determining the entities whose similarity exceeds a set threshold as initial semantic-equivalence entity pairs includes:

Step 3.1: compute by cosine value the direct similarity sim_d(S_i_app_x, S_j_app_y) between the corresponding head entities:

sim_d(S_i_app_x, S_j_app_y) = ( v(S_i_app_x) · v(S_j_app_y) ) / ( ‖v(S_i_app_x)‖ · ‖v(S_j_app_y)‖ )

where S_i_app_x is the mobile application numbered x in the i-th data source, S_j_app_y is the mobile application numbered y in the j-th data source, and v(S_i_app_x) and v(S_j_app_y) are their vector representations;

Step 3.2: compute, combining the vector representations of the tail entities, the indirect similarity sim_ind(S_i_app_x, S_j_app_y) between the corresponding head entities:

ē_i = (1/N) · Σ_{n=1}^{N} v(e_n^i),  ē_j = (1/M) · Σ_{m=1}^{M} v(e_m^j)

sim_ind(S_i_app_x, S_j_app_y) = ( ē_i · ē_j ) / ( ‖ē_i‖ · ‖ē_j‖ )

where v(e_1^i), …, v(e_N^i) are the vector representations of the tail entities associated with S_i_app_x, v(e_1^j), …, v(e_M^j) are those associated with S_j_app_y, N and M are their respective numbers, and ē_i and ē_j are the indirect vector representations of the entities associated with S_i_app_x and S_j_app_y respectively;

Step 3.3: weight the direct and indirect similarity of S_i_app_x and S_j_app_y to obtain their final similarity sim(S_i_app_x, S_j_app_y):

sim(S_i_app_x, S_j_app_y) = α · sim_d(S_i_app_x, S_j_app_y) + (1 − α) · sim_ind(S_i_app_x, S_j_app_y)

where α, the weight of the direct similarity, takes a real value in [0, 1].
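Steps 3.1–3.3 can be sketched as follows; the vectors and the weight value are illustrative.

```python
import math

# Cosine similarity between two dense vectors.
def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Component-wise mean of a list of vectors (the indirect representation).
def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Direct similarity of the head-entity vectors, indirect similarity of
# the averaged tail-entity vectors, and their weighted combination.
def app_similarity(head_i, tails_i, head_j, tails_j, alpha=0.6):
    direct = cos(head_i, head_j)
    indirect = cos(mean(tails_i), mean(tails_j))
    return alpha * direct + (1 - alpha) * indirect

s = app_similarity(
    head_i=[1.0, 0.0], tails_i=[[1.0, 1.0], [1.0, 0.0]],
    head_j=[1.0, 0.0], tails_j=[[1.0, 1.0], [1.0, 0.0]],
)
```

Identical head vectors and identical tail-entity sets give a final similarity of 1, the maximum.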
The invention has the following beneficial technical effects. The method obtains initial semantic-equivalence entity pairs by similarity calculation, mines potential entity semantic-equivalence relations with meta-rules, computes the probability that each potential semantic-equivalence entity pair holds with a probabilistic graph model, and finally determines the semantic equivalence relations between entities across the multi-source mobile applications from those probabilities, reducing the computational complexity of building the entity semantic-equivalence dataset. The method migrates readily to the construction of other multi-domain, multi-source knowledge graphs. It significantly reduces the cost of manually annotating entity semantic-equivalence relations across multi-source data during knowledge graph construction, generates high-quality structured triples and entity equivalence relations, and realizes the value of sharing and reusing mobile application information. In addition, the hybrid training of knowledge graph representation learning and network representation learning further improves the accuracy of associated-entity discovery.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow diagram of a multi-source mobile application knowledge graph construction method of an embodiment of the method of the present invention;
FIG. 2 is a flow diagram of knowledge graph representation learning and network representation based entity discovery for a method embodiment of the present invention;
FIG. 3 is a meta-rule for modeling entity alignment based on a probabilistic graph model Noisy-or according to an embodiment of the method of the present invention.
Detailed Description
To further clarify the technical solutions of the present application, the following detailed description will be made with reference to the accompanying drawings and specific embodiments. It should be noted that the following description is only a preferred embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be construed as the protective scope of the present invention.
Embodiment: the knowledge graph construction method for multi-source mobile applications comprises the following steps:

Step 1: generate a triple set {(S_o_app_z, r, e)} based on the acquired mobile application data from different data sources, where S_o_app_z is a unique identifier corresponding to the head entity, defined as the mobile application numbered z in the o-th data source, r corresponds to the relation, and e corresponds to the tail entity;

Step 2: encode the entities and relations separately to obtain corresponding vector representations;

Step 3: calculate the similarity between entity vectors using cosine values, and preliminarily determine the entities whose vector similarity exceeds a set threshold as entity semantic-equivalence pairs;

Step 4: determine a seed set from the preliminarily determined entity semantic-equivalence pairs, and infer potential entity or relation semantic-equivalence pairs from the seed set according to the meta-rules;

Step 5: calculate, with the probabilistic graph model, the probability that each potential entity or relation semantic-equivalence pair holds; compare the calculated probability with a set probability threshold, finally determine from the comparison result the semantic equivalence relations between entities or relations across the multi-source mobile applications, and thereby obtain the multi-source mobile application knowledge graph.
In a specific embodiment, a crawler script framework is used to collect data associated with mobile applications from the major application stores, and the mobile application names are collected to form an application name list. More comprehensive supplementary data for the mobile applications is obtained from encyclopedias. The name list S may be obtained by the following set operation:

S = {app_z | S_o_app_z, o = 1, 2, …, O; z = 1, 2, …, Z}

For the z-th mobile application app_z, its record in data source o is denoted S_o_app_z, where O is the number of different sources and Z is the number of mobile applications in each data source.
Preprocess the acquired data, analyze the data types, and convert the classified structured and unstructured data into a structured triple set, including:

Step 1.1: analyze the data types associated with the data sources of each mobile application and divide them into two categories: structured data and unstructured data;

Step 1.2: parse the attribute types of the structured data, typically attribute tags of mobile applications in application stores (e.g., developer, company, language, version number, release date of the APP in the APP store) and infobox descriptions of mobile applications in encyclopedias; such data can be converted directly into structured triples {(S_o_app_z, r, e)}, where S_o_app_z corresponds to the head entity, r is the relation corresponding to the attribute tag, and e is the entity corresponding to attribute tag r, i.e., the tail entity of the triple;

Step 1.3: parse the attribute types of the unstructured data, generally textual introductions or descriptions of mobile applications in application stores and encyclopedias; identify entities in the text with named entity recognition, classify the identified entities under the drafted relations, and finally form a set of triples {(S_o_app_z, r', e')} to complete the structured triple information of the mobile application, where S_o_app_z corresponds to the head entity, r' is a drafted relation, and e' is the entity corresponding to r', i.e., the tail entity of the triple.
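Step 1.2 can be sketched as follows; the attribute record and its field names are illustrative stand-ins for an application-store listing.

```python
# Convert a structured attribute record into triples (S_o_app_z, r, e):
# each attribute tag becomes the relation, its value the tail entity.
def record_to_triples(app_id, attributes):
    return [(app_id, relation, value)
            for relation, value in attributes.items()]

triples = record_to_triples(
    "S1_app_1",
    {"developer": "ExampleSoft", "version": "2.3.1", "language": "zh"},
)
```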
In this embodiment, encoding the entities and relations to obtain their vector representations includes:

Step 2.1: convert the data into the text format required by the pre-training model according to its type. For the structured triple form {(S_o_app_z, r, e)}, express each triple as a sentence of the form "subject-predicate-is-object": (S_o_app_z [SEP] r [SEP] is [SEP] e), where [SEP] is the word-segmentation separator and "S_o_app_z", "r", "is", and "e" are each treated as word blocks, denoted tokens, during segmentation.

For the attribute types of unstructured data, a word segmentation tool is used to segment the text; to improve segmentation precision, a manually defined dictionary can be bound to the segmentation tool.

Step 2.2: encode the segmented tokens with the adapted Chinese pre-trained BERT model to obtain the vector representations of all tokens.
Further, in other embodiments, to improve the encoding effect and the accuracy of predicting the original sentence sequence, nouns or adjectives in the segmented word blocks are randomly replaced with synonyms from a synonym dictionary during encoding; the replacement probability is computed as:

P(t_i) = exp(−w(t_i)) / Σ_{j=1}^{n_w} exp(−w(t_j))

where t_i is a word block in the sentence, n_w is the number of word blocks in the sentence, w(t_i) ∈ [0, 1] is the penalty incurred by replacing word block t_i, and exp(·) is the exponential function.

In a specific embodiment, based on the vector representations obtained with the adapted Chinese pre-trained BERT model, the method further includes: updating the head-entity and tail-entity vector representations with the knowledge graph representation learning model, and obtaining the final entity vector representations with the network representation learning model from the updated representations; these final vectors are used to calculate the cosine similarity between entity vectors.
A hybrid training mode combining knowledge graph representation learning and network representation learning is adopted, performing hybrid iterative training over all triples, as shown in fig. 2; it specifically includes:
step 201: for all triples, train the knowledge graph representation learning model, whose loss function is:
L_KG^{(k+1)} = Σ_{(h,r,e)∈T} Σ_{(h',r,e')∈T'} [γ + f_k(h, r, e) − f_k(h', r, e')]_+

wherein k denotes the round of iterative hybrid learning; L_KG^{(k+1)} is the loss function of the knowledge graph representation learning model in round k+1; T is the set of positive triples and T' is the set of negative triples obtained by negative sampling, which randomly replaces the head entity h or the tail entity e of a triple with h' or e'; [x]_+ is the hinge loss, i.e. the maximum of x and 0; γ is the margin; f_k(h, r, e) is the scoring function of the knowledge graph representation learning model for the triple (h, r, e) at the k-th iteration; f_k(h', r, e') is the scoring function for the triple (h', r, e') after the head and tail entities are replaced at the k-th iteration;
step 202: for the triple vector representations after knowledge graph representation learning, in the training of the network representation learning model the head and tail entity vectors are updated respectively to the vectors of the nodes v_i, v_j at the k-th iteration of the network representation learning model, v_i, v_j ∈ R^d, wherein d is the dimension of the vector representation and R^d is the network semantic space of dimension d; the loss function of network representation learning is defined as:
L_NE^{(k+1)} = − Σ_{v_i∈V} Σ_{v_j∈N(v_i)} log g_k(v_i, v_j)

wherein L_NE^{(k+1)} is the loss function of the network representation learning model in round k+1; V is the node set of the network representation learning model; N(v_i) is the set of neighbor nodes of v_i; g_k(v_i, v_j) is the scoring function of the network representation learning model when updating nodes v_i, v_j in the k-th iteration;
step 203: the learned vectors of nodes v_i, v_j are taken as the head and tail entity vectors of the knowledge graph representation learning model for round k+1, and round k+1 of knowledge graph representation learning is performed;
the iterative hybrid representation learning is terminated after a predetermined number of iterations, yielding the final vector representations of all entities.
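The alternating scheme of steps 201 to 203 can be sketched as a training skeleton; the two update functions below are placeholders (dummy nudges) for whatever concrete knowledge graph / network representation learning models are used, which the patent deliberately leaves open:

```python
def kg_update(vectors):
    """Placeholder for one round of knowledge graph representation
    learning (e.g. margin-based triple scoring); here a dummy nudge."""
    return {k: [x + 0.01 for x in v] for k, v in vectors.items()}

def network_update(vectors):
    """Placeholder for one round of network representation learning
    over nodes v_i, v_j; here a dummy nudge."""
    return {k: [x * 0.99 for x in v] for k, v in vectors.items()}

def hybrid_train(vectors, rounds):
    """Iterative hybrid representation learning: each round runs the
    KG model, hands the updated head/tail entity vectors to the
    network model, and feeds the node vectors back into the next
    KG round, terminating at the predetermined iteration count."""
    for _ in range(rounds):
        vectors = kg_update(vectors)       # step 201
        vectors = network_update(vectors)  # steps 202-203
    return vectors

entities = {"app1": [0.0, 1.0], "app2": [1.0, 0.0]}
final = hybrid_train(entities, rounds=3)
```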
In a specific embodiment, the knowledge graph representation learning model and the network representation learning model can be implemented with existing techniques; they are not the inventive point of the present application, which does not limit their concrete implementation, so a detailed description is omitted.
In this embodiment, step 3 is referred to as the "multi-source entity discovery algorithm" and step 4 as the "multi-source entity alignment algorithm", as shown in fig. 1. In other embodiments, to increase the reliability of the similarity between mobile applications, the entity discovery algorithm indirectly calculates the similarity between mobile applications from the entity vectors associated with them and weights it with the direct similarity to obtain the final similarity between mobile applications; the method includes the following steps:
step 3.1: calculate the direct similarity sim_d(S_i_app_x, S_j_app_y) between the corresponding head entities via the cosine value, with the formula:

sim_d(S_i_app_x, S_j_app_y) = (v_i_x · v_j_y) / (‖v_i_x‖ ‖v_j_y‖)

wherein S_i_app_x denotes the mobile application numbered x from the i-th data source, S_j_app_y denotes the mobile application numbered y from the j-th data source, and v_i_x and v_j_y are the vector representations of S_i_app_x and S_j_app_y respectively;
step 3.2: knotVector representation of closure entities computes S for corresponding head entities i _app x And S j _app y Indirect similarity between them
Figure 220875DEST_PATH_IMAGE061
The formula is as follows:
Figure 420912DEST_PATH_IMAGE048
Figure 820801DEST_PATH_IMAGE062
Figure 840710DEST_PATH_IMAGE063
wherein the first stepiThe source number of the seed data isxOf a mobile application S i _app x The vector representation of the associated tail entity is noted as:
Figure 249825DEST_PATH_IMAGE051
(ii) a First, thejThe source number of the seed data isyOf a mobile application S j _app y Associated tail entity, noted
Figure 304369DEST_PATH_IMAGE064
N、MIs the number;
Figure 875159DEST_PATH_IMAGE065
is as followsiThe source number of the seed data isxOf a mobile application S i _app x An indirect vector representation of the associated entity,
Figure 382363DEST_PATH_IMAGE066
is as followsjThe source number of the seed data isyMobile application S j _app y Indirect vector of associated entityRepresents;
step 3.3: weight the direct and indirect similarity of the mobile application S_i_app_x numbered x from the i-th data source and the mobile application S_j_app_y numbered y from the j-th data source to obtain the final similarity sim(S_i_app_x, S_j_app_y), calculated as:

sim(S_i_app_x, S_j_app_y) = α · sim_d(S_i_app_x, S_j_app_y) + (1 − α) · sim_ind(S_i_app_x, S_j_app_y)

wherein α is the weight of the direct similarity, a real value in [0, 1].
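Steps 3.1 to 3.3 can be sketched as follows, assuming (as one reading of the formulas) that an application's indirect vector is the mean of its associated tail-entity vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def app_similarity(v_x, v_y, tails_x, tails_y, alpha=0.5):
    """Weighted combination of direct similarity (cosine of the two
    app vectors, step 3.1) and indirect similarity (cosine of the
    mean tail-entity vectors, step 3.2), per step 3.3."""
    direct = cosine(v_x, v_y)
    indirect = cosine(mean_vector(tails_x), mean_vector(tails_y))
    return alpha * direct + (1 - alpha) * indirect

sim = app_similarity([1.0, 0.0], [1.0, 0.0],
                     [[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0]])
print(round(sim, 3))  # 1.0: identical direct and indirect vectors
```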
In this embodiment, the entity discovery result after threshold screening is used as the input of the entity alignment algorithm, from which the correspondence between entities in the multi-source data is obtained.
Optionally, in the multi-source entity discovery algorithm, an initial set of semantic equivalence pairs is obtained at the semantic level from the similarity between the entities' vectors, and a set of syntactic equivalence pairs is obtained at the syntactic level by calculating the syntactic similarity between entities with a method based on string length and edit distance; the two sets complement each other, and the union of the semantic and syntactic equivalence pair sets is taken to determine the initial seed set, which can improve the accuracy of entity discovery.
The multi-source entity alignment algorithm specifically comprises the following steps:
step 4.1: obtain semantic equivalence pairs of entities across different data sources based on a string-equality algorithm, screen out entity equivalence pairs with high similarity from the entity discovery results according to a predetermined threshold, and manually check both results to form the initial seed set of semantic equivalence pairs, noted ES = AES ∪ RES ∪ EES, wherein AES denotes the set of semantic equivalence pairs of head entities, RES the set of semantic equivalence pairs of relations, and EES the set of semantic equivalence pairs of tail entities;
step 4.2: infer semantic equivalence pairs of potential entities or relations from AES, RES and EES according to designed meta-rules, which include:
rule 1 R_1: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity;

if S_i_app_x and S_j_app_y stand in the semantic equivalence relation of head entities, i.e. form a semantic equivalence pair of head entities, expressed as (S_i_app_x ≡ S_j_app_y), and S_i_e_x and S_j_e_y stand in the semantic equivalence relation of tail entities, i.e. form a semantic equivalence pair of tail entities, expressed as (S_i_e_x ≡ S_j_e_y), then S_i_r_x and S_j_r_y stand in the semantic equivalence relation of relations, i.e. form a semantic equivalence pair of relations (S_i_r_x ≡ S_j_r_y), with confidence p; rule R_1 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_p (S_i_r_x ≡ S_j_r_y)
rule 2 R_2: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities, expressed as (S_i_app_x ≡ S_j_app_y), and the relations S_i_r_x and S_j_r_y form a semantic equivalence pair of relations, expressed as (S_i_r_x ≡ S_j_r_y), then S_i_e_x and S_j_e_y stand in the semantic equivalence relation of tail entities (S_i_e_x ≡ S_j_e_y), with confidence q.
In a specific embodiment, optionally, the similarity between relation vectors is also calculated via cosine values, and the relations whose vector similarity exceeds the set threshold are preliminarily determined as semantic equivalence pairs of relations.
Rule R_2 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_r_x ≡ S_j_r_y) ⇒_q (S_i_e_x ≡ S_j_e_y)
rule 3 R_3: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_r_x and S_j_r_y stand in the semantic equivalence relation of relations, expressed as (S_i_r_x ≡ S_j_r_y), and S_i_e_x and S_j_e_y stand in the semantic equivalence relation of tail entities, expressed as (S_i_e_x ≡ S_j_e_y), then S_i_app_x and S_j_app_y stand in the semantic equivalence relation of head entities (S_i_app_x ≡ S_j_app_y), with confidence l; rule R_3 is expressed as:

(S_i_r_x ≡ S_j_r_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_l (S_i_app_x ≡ S_j_app_y)
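A minimal sketch of forward application of the three meta-rules over two triple sets follows; the set-of-pairs representation and helper names are illustrative, not from the patent:

```python
def apply_meta_rules(triples_a, triples_b, head_pairs, rel_pairs, tail_pairs):
    """Derive candidate equivalence pairs from two triple sets.
    R1: equal heads + equal tails => candidate relation pair
    R2: equal heads + equal rels  => candidate tail pair
    R3: equal rels  + equal tails => candidate head pair"""
    new_rel, new_tail, new_head = set(), set(), set()
    for (h1, r1, e1) in triples_a:
        for (h2, r2, e2) in triples_b:
            heads_eq = (h1, h2) in head_pairs
            rels_eq = (r1, r2) in rel_pairs
            tails_eq = (e1, e2) in tail_pairs
            if heads_eq and tails_eq:   # rule R1
                new_rel.add((r1, r2))
            if heads_eq and rels_eq:    # rule R2
                new_tail.add((e1, e2))
            if rels_eq and tails_eq:    # rule R3
                new_head.add((h1, h2))
    return new_head, new_rel, new_tail

ta = [("app_a", "developer", "X")]
tb = [("app_b", "dev", "X2")]
heads, rels, tails = apply_meta_rules(
    ta, tb, {("app_a", "app_b")}, set(), {("X", "X2")})
print(rels)  # {('developer', 'dev')} inferred by rule R1
```

Each derived pair would then be scored with the rule confidences p, q, l in step 4.3 rather than accepted outright.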
step 4.3: as shown in fig. 3, calculate the probability that a potential semantic equivalence pair of entities or relations holds according to the probability graph model, with the formulas:

P(R_i = T) = p_i,  P(R_i = F) = 1 − p_i

P(E = F | R_1, R_2, R_3, S_0) = (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

P(E = T | R_1, R_2, R_3, S_0) = 1 − (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

wherein R_i = T denotes that the i-th rule satisfies its trigger condition, i ∈ {1, 2, 3}, and R_i = F denotes that the i-th rule does not satisfy its trigger condition; λ_0 denotes the similarity between the original semantic equivalence entity pair; p_i denotes the probability that rule R_i holds, corresponding to the confidence of rule R_i; K_i denotes the number of times rule R_i is triggered; P(R_i) denotes the probability distribution of rule R_i; S_0 is the initial probability that the entity semantic equivalence or relation semantic equivalence across data sources holds; P(E = F | R_1, R_2, R_3, S_0) is the probability, given rules R_1, R_2, R_3 and the initial probability S_0, that the entity or relation semantic equivalence across data sources does not hold, and P(E = T | R_1, R_2, R_3, S_0) is the probability that it holds. Optionally, S_0 is the initial probability of the semantic equivalence of the different data source entities, i.e. λ_0.
Step 4.4: based on the designed meta-rules of the three equivalence relations, calculate the probability that the semantic equivalence relations between entities of different data sources hold, and screen according to a predetermined threshold to obtain the final semantic equivalence entity pairs.
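Under a noisy-or reading of the probability graph model (an assumption for illustration: the pair fails only if the prior fails and every rule firing fails), the probability of step 4.3 can be sketched as:

```python
def equivalence_probability(lambda0, confidences, trigger_counts):
    """P(E=T | R1..R3, S0) under a noisy-or combination.
    lambda0: initial probability S0 (similarity of the seed pair);
    confidences[i]: p_i, the confidence of rule R_i;
    trigger_counts[i]: K_i, how often rule R_i fired."""
    p_false = 1.0 - lambda0
    for p_i, k_i in zip(confidences, trigger_counts):
        p_false *= (1.0 - p_i) ** k_i
    return 1.0 - p_false

prob = equivalence_probability(0.6, [0.8, 0.7, 0.5], [1, 0, 2])
# additional rule firings can only raise the probability above lambda0
```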
In a specific embodiment, the knowledge graph construction method for multi-source mobile applications further includes: taking the result of the multi-source entity alignment algorithm as a constraint on the multi-source entity discovery algorithm, so that the entity discovery and entity alignment algorithms complement and constrain each other, finally completing the iteration of the algorithm; the specific constraints are:
constraint CS_1: given an obtained semantic equivalence relation of head entities (S_i_app_x ≡ S_j_app_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding head entities S_i_app_x and S_j_app_y in the two triples are replaced during negative sampling, S_i_app_x and S_j_app_y must be excluded from the negative-example replacement candidates;
constraint CS_2: given an obtained semantic equivalence relation of tail entities (S_i_e_x ≡ S_j_e_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding tail entities S_i_e_x or S_j_e_y in the two triples are replaced during negative sampling, S_i_e_x and S_j_e_y must be excluded from the negative-example replacement candidates; wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity.
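The constraints CS_1/CS_2 amount to filtering the candidate pool during negative sampling; a sketch (entity names illustrative):

```python
import random

def corrupt_triple(triple, all_entities, excluded, rng, replace_head=True):
    """Negative sampling that respects constraints CS_1/CS_2: entities
    known to be semantically equivalent to the slot being replaced are
    excluded from the negative-example candidates."""
    h, r, e = triple
    slot = h if replace_head else e
    candidates = [x for x in all_entities if x != slot and x not in excluded]
    repl = rng.choice(candidates)
    return (repl, r, e) if replace_head else (h, r, repl)

rng = random.Random(0)
neg = corrupt_triple(("S1_app1", "rel", "e1"),
                     ["S1_app1", "S2_app7", "other_app"],
                     excluded={"S2_app7"},  # aligned with S1_app1
                     rng=rng)
print(neg)  # head replaced by 'other_app', never by the aligned 'S2_app7'
```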
In a specific embodiment, the multi-source entity discovery algorithm of step 3 and the multi-source entity alignment algorithm of step 4 can be repeated until no new entity pair appears, finally forming the multi-source mobile application knowledge graph.
In the above embodiment, the probability that the entity semantic equivalence relation holds for the initial semantic equivalence entity pairs of different data sources is calculated based on the designed meta-rules of the three equivalence relations, and the final semantic equivalence entity pairs are obtained by screening according to a predetermined threshold. In other embodiments, the accuracy of the entity alignment algorithm can be improved by obtaining semantic equivalence pairs of entities with an image rule PR, including:
representing the picture identifiers (icons) of mobile applications from different data sources as vectors, modeling with the grayscale of the picture identifiers;
extracting deep feature representations from the image grayscale with a convolutional neural network, and judging whether the mobile applications are equivalent with the image matching rule:
image rule PR:

(S_i_app_x, picture identifier, S_i_Pic_x) ∧ (S_j_app_y, picture identifier, S_j_Pic_y) ∧ Sim(S_i_Pic_x, S_j_Pic_y) ≥ δ ⇒ (S_i_app_x ≡ S_j_app_y)

wherein S_i_app_x and S_j_app_y are mobile applications from the different data sources i and j, S_i_Pic_x and S_j_Pic_y are the images associated with the picture identifiers of S_i_app_x and S_j_app_y, "≡" is the semantic equivalence relation, and δ is the image matching threshold, a value in [0, 1]; if the image matching similarity is greater than the set threshold, the mobile applications S_i_app_x and S_j_app_y are semantically equivalent.
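As a simplified stand-in for the CNN feature extractor (the cosine comparison on flattened grayscale values below is an illustrative assumption, not the patent's network), rule PR can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened grayscale vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def icons_equivalent(gray_x, gray_y, delta=0.9):
    """Apply rule PR: the two apps are judged semantically equivalent
    when the similarity of their (flattened) grayscale icon features
    reaches the matching threshold delta in [0, 1]."""
    return cosine(gray_x, gray_y) >= delta

# two flattened 2x2 grayscale icons
icon_a = [0.1, 0.9, 0.9, 0.1]
icon_b = [0.1, 0.9, 0.9, 0.1]
print(icons_equivalent(icon_a, icon_b))  # True: identical icons
```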
By combining the iterative strategy of entity discovery and entity alignment, the invention significantly reduces the manual annotation cost of establishing entity correspondences across multi-source data during graph construction, and can be extended to knowledge graph construction in other fields.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A knowledge graph construction method for multi-source mobile application is characterized by comprising the following steps:
generating a triple set {(S_o_app_z, r, e)} based on mobile application data acquired from different data sources, wherein S_o_app_z is the corresponding head entity, defined as the mobile application numbered z from the o-th data source, r is the corresponding relation, and e is the corresponding tail entity;
respectively coding the entity and the relation to obtain corresponding vector representation;
calculating the similarity between entity vectors by utilizing cosine values, and preliminarily determining the entity corresponding to the vector representation with the similarity exceeding a set threshold value as a semantic equivalence pair of the entity;
determining a seed set according to the preliminarily determined semantic equivalence pairs of the entities, and reasoning out the semantic equivalence pairs of potential entities or relationships from the seed set according to a meta-rule;
calculating the probability of establishing a semantic equivalence pair of potential entities or relations according to a probability graph model; and comparing the calculated probability with a set probability threshold, finally determining the semantic equivalence relation between entities or relations in the multi-source mobile application according to the comparison result, and further obtaining the knowledge graph of the multi-source mobile application.
2. The method for constructing the knowledge graph for the multi-source mobile application according to claim 1, wherein the encoding of the entities and the relationships to obtain corresponding vector representations comprises:
expressing each triple as a sentence in the "subject predicate is object" pattern: (S_o_app_z [SEP] r [SEP] is [SEP] e), wherein [SEP] is the word-segmentation separator symbol, and "S_o_app_z", "r", "is", and "e" are all regarded as word blocks during word segmentation;

using the sentences as input, encoding the word blocks obtained by word segmentation with the adapted Chinese pre-training model BERT to obtain the vector representations of "S_o_app_z", "r", "is", and "e" in the triple.
3. The knowledge graph construction method for multi-source mobile applications according to claim 1, characterized in that, during the encoding of the entities and relations, nouns or adjectives among the segmented word blocks are randomly replaced with their synonyms according to a replacement probability based on a synonym dictionary, the replacement probability being calculated as:

P(t_i) = exp(−w(t_i)) / Σ_{j=1}^{n_w} exp(−w(t_j))

wherein t_i is a word block in the sentence, n_w is the number of word blocks in the sentence, j is the index of a word block, w(t_i) is the loss caused by replacing word block t_i in the sentence, and exp(·) is the exponential function.
4. The knowledge graph construction method for multi-source mobile applications, characterized in that the seed set is noted ES = AES ∪ RES ∪ EES, wherein AES denotes the set of semantic equivalence pairs of head entities, RES the set of semantic equivalence pairs of relations, and EES the set of semantic equivalence pairs of tail entities;
the meta-rule includes:
rule 1 R_1: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity;

if S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities, noted (S_i_app_x ≡ S_j_app_y), and S_i_e_x and S_j_e_y form a semantic equivalence pair of tail entities, noted (S_i_e_x ≡ S_j_e_y), then S_i_r_x and S_j_r_y form a semantic equivalence pair of relations (S_i_r_x ≡ S_j_r_y) with confidence p; rule R_1 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_p (S_i_r_x ≡ S_j_r_y)
rule 2 R_2: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities, noted (S_i_app_x ≡ S_j_app_y), and the relations S_i_r_x and S_j_r_y form a semantic equivalence pair of relations, noted (S_i_r_x ≡ S_j_r_y), then S_i_e_x and S_j_e_y form a semantic equivalence pair of tail entities (S_i_e_x ≡ S_j_e_y) with confidence q; rule R_2 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_r_x ≡ S_j_r_y) ⇒_q (S_i_e_x ≡ S_j_e_y)
rule 3 R_3: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_r_x and S_j_r_y form a semantic equivalence pair of relations, noted (S_i_r_x ≡ S_j_r_y), and S_i_e_x and S_j_e_y form a semantic equivalence pair of tail entities, noted (S_i_e_x ≡ S_j_e_y), then S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities (S_i_app_x ≡ S_j_app_y) with confidence l; rule R_3 is expressed as:

(S_i_r_x ≡ S_j_r_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_l (S_i_app_x ≡ S_j_app_y)
5. The knowledge graph construction method for multi-source mobile applications according to claim 4, characterized in that the probability that a potential semantic equivalence pair of entities or relations holds is calculated according to the probability graph model, with the formulas:

P(R_i = T) = p_i,  P(R_i = F) = 1 − p_i

P(E = F | R_1, R_2, R_3, S_0) = (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

P(E = T | R_1, R_2, R_3, S_0) = 1 − (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

wherein R_i = T denotes that the i-th rule satisfies its trigger condition, i ∈ {1, 2, 3}, and R_i = F denotes that the i-th rule does not satisfy its trigger condition; λ_0 denotes the similarity between the original semantic equivalence entity pair; p_i denotes the probability that rule R_i holds, corresponding to the confidence of rule R_i; K_i denotes the number of times rule R_i is triggered; P(R_i) denotes the probability distribution of rule R_i; S_0 is the initial probability that the entity semantic equivalence or relation semantic equivalence across data sources holds; P(E = F | R_1, R_2, R_3, S_0) is the probability, given rules R_1, R_2, R_3 and the initial probability S_0, that the entity or relation semantic equivalence across data sources does not hold, and P(E = T | R_1, R_2, R_3, S_0) is the probability that it holds.
6. The knowledge graph construction method for multi-source mobile applications according to claim 1, characterized in that determining the seed set according to the initial semantic equivalence entity pairs comprises:

determining the seed set based on the semantic equivalence pairs of entities preliminarily determined according to the similarity, together with the equivalence pairs of entities obtained by string comparison of the character lengths between the entities.
7. The method of claim 1, wherein, after the entities and relations are encoded to obtain vector representations, the method further comprises:
updating the vector representations of the head and tail entities with a knowledge graph representation learning model, and obtaining the final entity vector representations with a network representation learning model based on the updated vector representations.
8. The knowledge graph construction method for multi-source mobile applications according to claim 7, characterized in that iterative hybrid representation learning is performed in advance on the network representation learning model and the knowledge graph representation learning model, comprising the following steps:
step 201: training a knowledge graph representation learning model, wherein a loss function of the training model is as follows:
L_KG^{(k+1)} = Σ_{(h,r,e)∈T} Σ_{(h',r,e')∈T'} [γ + f_k(h, r, e) − f_k(h', r, e')]_+

wherein k denotes the round of iterative hybrid learning; L_KG^{(k+1)} is the loss function of the knowledge graph representation learning model in round k+1; T is the set of positive triples and T' is the set of negative triples obtained by negative sampling, which randomly replaces the head entity h or the tail entity e of a triple with h' or e'; [x]_+ is the hinge loss, i.e. the maximum of x and 0; γ is the margin; f_k(h, r, e) is the scoring function of the knowledge graph representation learning model for the triple (h, r, e) at the k-th iteration; f_k(h', r, e') is the scoring function for the triple (h', r, e') after the head and tail entities are replaced at the k-th iteration;
step 202: for the triple vector representations after knowledge graph representation learning training, in the network representation learning model training the head and tail entity vectors are updated respectively to the vectors of the nodes v_i, v_j at the k-th iteration of the network representation learning model, v_i, v_j ∈ R^d, wherein d is the dimension of the vector representation and R^d is the network semantic space of dimension d; the loss function of network representation learning is defined as:

L_NE^{(k+1)} = − Σ_{v_i∈V} Σ_{v_j∈N(v_i)} log g_k(v_i, v_j)
wherein L_NE^{(k+1)} is the loss function of the network representation learning model in round k+1; V is the node set of the network representation learning model; N(v_i) is the set of neighbor nodes of node v_i; g_k(v_i, v_j) is the scoring function of the network representation learning model when updating nodes v_i, v_j in the k-th iteration;
step 203: the learned vectors of nodes v_i, v_j are taken as the head and tail entity vectors of the knowledge graph representation learning model for round k+1, and round k+1 of knowledge graph representation learning is performed;
the iterative hybrid representation learning is terminated after a predetermined number of iterations, yielding the final vector representations of all entities.
9. The method for constructing the knowledge graph for the multi-source mobile application according to claim 8, wherein the method comprises the following constraints:
constraint CS_1: given an obtained semantic equivalence pair of head entities (S_i_app_x ≡ S_j_app_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding head entities S_i_app_x and S_j_app_y in the two triples are replaced during negative sampling, S_i_app_x and S_j_app_y must be excluded from the negative-example replacement candidates;
constraint CS_2: given an obtained semantic equivalence pair of tail entities (S_i_e_x ≡ S_j_e_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding tail entities S_i_e_x or S_j_e_y in the two triples are replaced during negative sampling, S_i_e_x and S_j_e_y must be excluded from the negative-example replacement candidates; wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity.
10. The method for constructing a knowledge graph for multi-source mobile applications according to claim 1, wherein the similarity between entity vectors is calculated using cosine values, and the entities whose vector representations have a similarity exceeding a set threshold are determined as initial semantically equivalent entity pairs, comprising:
step 3.1: calculating the direct similarity sim_d(S_i_app_x, S_j_app_y) between the corresponding head entities by cosine value, with the formula:
sim_d(S_i_app_x, S_j_app_y) = (v(S_i_app_x) · v(S_j_app_y)) / (||v(S_i_app_x)|| · ||v(S_j_app_y)||)
wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_j_app_y is the mobile application numbered y from the j-th data source, and v(S_i_app_x) and v(S_j_app_y) are the vector representations of S_i_app_x and S_j_app_y respectively;
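The direct similarity of step 3.1 is plain cosine similarity between the two head-entity vectors; a minimal implementation:

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between vectors u and v:
    # dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```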
step 3.2: calculating the indirect similarity sim_ind(S_i_app_x, S_j_app_y) between the corresponding head entities by combining the vector representations of their tail entities, with the formulas:
v̄(S_i_app_x) = (1/N) Σ_{n=1..N} v(S_i_e_n)
v̄(S_j_app_y) = (1/M) Σ_{m=1..M} v(S_j_e_m)
sim_ind(S_i_app_x, S_j_app_y) = (v̄(S_i_app_x) · v̄(S_j_app_y)) / (||v̄(S_i_app_x)|| · ||v̄(S_j_app_y)||)
wherein v(S_i_e_1), ..., v(S_i_e_N) are the vector representations of the tail entities associated with the mobile application S_i_app_x numbered x from the i-th data source; v(S_j_e_1), ..., v(S_j_e_M) are the vector representations of the tail entities associated with the mobile application S_j_app_y numbered y from the j-th data source; N and M are the respective numbers of associated tail entities; v̄(S_i_app_x) is the indirect vector representation of the entities associated with S_i_app_x, and v̄(S_j_app_y) is the indirect vector representation of the entities associated with S_j_app_y;
step 3.3: weighting the direct similarity and the indirect similarity of the mobile application S_i_app_x numbered x from the i-th data source and the mobile application S_j_app_y numbered y from the j-th data source to obtain the final similarity sim(S_i_app_x, S_j_app_y) between them, with the calculation formula:
sim(S_i_app_x, S_j_app_y) = α · sim_d(S_i_app_x, S_j_app_y) + (1 − α) · sim_ind(S_i_app_x, S_j_app_y)
wherein α is the weight of the direct similarity, taking a real value in [0, 1].
CN202211187813.7A 2022-09-28 2022-09-28 Knowledge graph construction method for multi-source mobile application Active CN115292520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211187813.7A CN115292520B (en) 2022-09-28 2022-09-28 Knowledge graph construction method for multi-source mobile application


Publications (2)

Publication Number Publication Date
CN115292520A true CN115292520A (en) 2022-11-04
CN115292520B CN115292520B (en) 2023-02-03

Family

ID=83833596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211187813.7A Active CN115292520B (en) 2022-09-28 2022-09-28 Knowledge graph construction method for multi-source mobile application

Country Status (1)

Country Link
CN (1) CN115292520B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160335544A1 (en) * 2015-05-12 2016-11-17 Claudia Bretschneider Method and Apparatus for Generating a Knowledge Data Model
CN109582761A (en) * 2018-09-21 2019-04-05 浙江师范大学 A kind of Chinese intelligent Answer System method of the Words similarity based on the network platform
CN109992786A (en) * 2019-04-09 2019-07-09 杭州电子科技大学 A kind of semantic sensitive RDF knowledge mapping approximate enquiring method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FU DUANKANG: "Semantic Search for Software Crowdsourcing Services Based on Knowledge Graphs", China Master's Theses Full-text Database (Information Science and Technology) *
HU PANPAN: "Natural Language Processing: From Introduction to Practice", 30 April 2020, China Railway Publishing House *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116049148A (en) * 2023-04-03 2023-05-02 中国科学院成都文献情报中心 Construction method of domain meta knowledge engine in meta publishing environment
CN116049148B (en) * 2023-04-03 2023-07-18 中国科学院成都文献情报中心 Construction method of domain meta knowledge engine in meta publishing environment
CN116756327A (en) * 2023-08-21 2023-09-15 天际友盟(珠海)科技有限公司 Threat information relation extraction method and device based on knowledge inference and electronic equipment
CN116756327B (en) * 2023-08-21 2023-11-10 天际友盟(珠海)科技有限公司 Threat information relation extraction method and device based on knowledge inference and electronic equipment

Also Published As

Publication number Publication date
CN115292520B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
CN111444340B (en) Text classification method, device, equipment and storage medium
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN115292520B (en) Knowledge graph construction method for multi-source mobile application
CN116992005B (en) Intelligent dialogue method, system and equipment based on large model and local knowledge base
CN117033571A (en) Knowledge question-answering system construction method and system
US20240143644A1 (en) Event detection
CN109918647A (en) A kind of security fields name entity recognition method and neural network model
Zhang et al. Multifeature named entity recognition in information security based on adversarial learning
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN115310551A (en) Text analysis model training method and device, electronic equipment and storage medium
Kathuria et al. Real time sentiment analysis on twitter data using deep learning (Keras)
CN113761190A (en) Text recognition method and device, computer readable medium and electronic equipment
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN115759254A (en) Question-answering method, system and medium based on knowledge-enhanced generative language model
CN116303881A (en) Enterprise organization address matching method and device based on self-supervision representation learning
CN117688560A (en) Semantic analysis-oriented intelligent detection method for malicious software
Pu et al. Lexical knowledge enhanced text matching via distilled word sense disambiguation
CN117807482A (en) Method, device, equipment and storage medium for classifying customs clearance notes
Yang et al. CNN-based two-branch multi-scale feature extraction network for retrosynthesis prediction
CN115757837B (en) Confidence evaluation method and device for knowledge graph, electronic equipment and medium
CN115048929A (en) Sensitive text monitoring method and device
Sultana et al. Fake News Detection Using Machine Learning Techniques
Mamatha et al. Supervised aspect category detection of co-occurrence data using conditional random fields
CN117971990B (en) Entity relation extraction method based on relation perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant