CN115292520A - Knowledge graph construction method for multi-source mobile application - Google Patents

Knowledge graph construction method for multi-source mobile application Download PDF

Info

Publication number
CN115292520A
CN115292520A (application CN202211187813.7A)
Authority
CN
China
Prior art keywords
app, entity, mobile application, entities, source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211187813.7A
Other languages
Chinese (zh)
Other versions
CN115292520B (en)
Inventor
李炜卓
罗维柒
张浩魏
边宇阳
周文博
隋永波
季秋
高辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211187813.7A priority Critical patent/CN115292520B/en
Publication of CN115292520A publication Critical patent/CN115292520A/en
Application granted granted Critical
Publication of CN115292520B publication Critical patent/CN115292520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/367 — Information retrieval; creation of semantic tools: ontology
    • G06F40/216 — Natural language analysis: parsing using statistical methods
    • G06F40/242 — Natural language analysis: lexical tools, dictionaries
    • G06F40/279 — Natural language analysis: recognition of textual entities
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Semantic analysis
    • G06N5/02 — Knowledge representation; symbolic representation


Abstract

The invention discloses a knowledge graph construction method for multi-source mobile applications. The method generates a triple set from mobile application data drawn from different data sources; encodes the entities and relations to obtain corresponding vector representations; computes the similarity between entity vectors, takes the entities whose vector similarity exceeds a set threshold as initial semantic-equivalence entity pairs, and determines a seed set; infers potential semantic-equivalence entity pairs from the seed set according to meta-rules; computes the probability that each potential semantic-equivalence entity pair holds; and compares the computed probability with a set probability threshold to finally determine the semantic equivalence relations between entities across the multi-source mobile applications, thereby obtaining the multi-source mobile application knowledge graph. The method significantly reduces the cost of manually annotating entity semantic-equivalence relations across multi-source data during knowledge graph construction.

Description

Knowledge graph construction method for multi-source mobile application
Technical Field
The invention belongs to the field of knowledge representation and processing in knowledge engineering, and particularly relates to a mobile application knowledge graph construction method under the condition of multi-source data.
Background
With the popularity of smartphones and mobile devices, the number of mobile applications (APPs) is growing rapidly, bringing great convenience to online shopping, education, personal finance, and other activities.
However, as more and more APPs are developed and released, many APPs on the network also carry malicious risks: spreading harmful information, violating user privacy, or even breaching national information security regulations. For ordinary users, a comprehensive mobile application knowledge base helps them look up APPs and guard against APP fraud; for network security analysts, a comprehensive mobile application knowledge graph not only helps them discover potential risks faster but also helps secure the mobile network to a certain extent.
Although scholars have proposed mobile application knowledge bases such as DREBIN, AndroZoo++, and AndroVault, these knowledge bases suffer from single data sources, small overall data volumes, and incomplete attributes, so they cannot present APP information comprehensively. Moreover, the existing APP knowledge bases focus on analyzing the low-level data of individual APPs (such as application permissions and application privacy), and thus lack correlation analysis between APPs and cannot share and reuse APP information across multi-source data. Constructing a mobile application knowledge graph from multi-source data and establishing semantic associations between APPs across different data sources is therefore essential for upper-layer APP analysis (such as risk early warning and risk association). It also provides high-quality data resources for research in the knowledge engineering and network security communities.
Disclosure of Invention
The invention aims to construct, from multi-source data, a low-cost and high-quality mobile application knowledge graph.
In order to realize the technical purpose, the invention adopts the following technical scheme:
the invention provides a multisource-oriented mobile application knowledge graph construction method, which comprises the following steps:
generating a triple set {(S_o_app_z, r, e)} based on the retrieved mobile application data from different data sources, where S_o_app_z corresponds to the head entity and is defined as the mobile application numbered z in the o-th data source, r corresponds to the relation, and e corresponds to the tail entity;
respectively coding the entity and the relation to obtain corresponding vector representation;
calculating the similarity between entity vectors using cosine values, and preliminarily determining the entities whose vector similarity exceeds a set threshold as entity semantic-equivalence pairs;
determining a seed set from the preliminarily determined entity semantic-equivalence pairs, and inferring potential entity or relation semantic-equivalence pairs from the seed set according to meta-rules;
calculating, with a probabilistic graph model, the probability that each potential entity or relation semantic-equivalence pair holds; comparing the calculated probability with a set probability threshold, finally determining from the comparison result the semantic equivalence relations between entities or relations across the multi-source mobile applications, and thereby obtaining the multi-source mobile application knowledge graph.
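The first three steps above can be sketched on toy data as follows. A character-bigram counter stands in for the BERT encoder described later, and the two application identifiers are invented for illustration; this is a minimal sketch, not the patent's implementation.

```python
import math

# Toy "encoder": character-bigram count vector (stand-in for BERT).
def encode(text):
    vec = {}
    for a, b in zip(text, text[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

# Cosine similarity between two sparse count vectors.
def cosine(u, v):
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Step 1: triples (S_o_app_z, r, e) from two hypothetical sources.
triples = [
    ("S1_app_wechat", "developer", "Tencent"),
    ("S2_app_weixin", "developer", "Tencent"),
]

# Steps 2-3: encode head entities, then seed equivalence pairs whose
# cosine similarity clears a set threshold.
vectors = {h: encode(h) for h, _, _ in triples}
seeds = {(a, b) for a in vectors for b in vectors
         if a < b and cosine(vectors[a], vectors[b]) > 0.3}
```

Here the two head entities share enough character bigrams to clear the threshold, so they become an initial semantic-equivalence pair.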
Further, encoding the entities and relations to obtain corresponding vector representations includes:
expressing each triple as a sentence in the form "subject-predicate-is-object": (S_o_app_z [SEP] r [SEP] is [SEP] e), where [SEP] is the word-segmentation separator and "S_o_app_z", "r", "is", and "e" are each treated as word blocks during segmentation;
taking the sentence as input, encoding the segmented word blocks with an adapted Chinese pre-trained BERT model to obtain the vector representations of S_o_app_z, r, and e in each triple.
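The sentence construction step can be sketched as follows; the triple values are illustrative stand-ins, and the resulting string is what would be handed to the BERT tokenizer.

```python
# Build the "subject [SEP] predicate [SEP] is [SEP] object" sentence
# that serves as BERT input for one triple.
def triple_to_sentence(head, relation, tail, sep="[SEP]"):
    # "head", "relation", "is", and "tail" each become one word block.
    return f"({head} {sep} {relation} {sep} is {sep} {tail})"

sentence = triple_to_sentence("S1_app_1", "developer", "ExampleSoft")
# -> "(S1_app_1 [SEP] developer [SEP] is [SEP] ExampleSoft)"
```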
Further, during the encoding of entities and relations, nouns or adjectives among the segmented word blocks are randomly replaced by synonyms from a synonym dictionary according to a replacement probability, computed as:

P(t_i) = exp(−w(t_i)) / Σ_{j=1}^{n_w} exp(−w(t_j))

where t_i is a word block in the sentence, n_w is the number of word blocks in the sentence, j is the word-block index, w(t_i) is the penalty incurred by replacing word block t_i, and exp(·) is the exponential function.
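A sketch of this replacement probability, assuming it is a softmax over the negated penalties w(t_i), so that word blocks that are cheap to replace are replaced more often; the negative sign is an interpretation of the "penalty" wording, and the penalty values below are illustrative.

```python
import math

# Softmax over negated replacement penalties: one probability per
# word block t_i, summing to 1 across the sentence.
def replacement_probs(penalties):
    weights = [math.exp(-w) for w in penalties]
    total = sum(weights)
    return [w / total for w in weights]

# Three word blocks; the first has zero penalty, so it is the most
# likely replacement target.
probs = replacement_probs([0.0, 1.0, 1.0])
```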
Furthermore, the seed set is denoted ES = AES ∪ RES ∪ EES, where AES is the set of head-entity semantic-equivalence pairs, RES the set of relation semantic-equivalence pairs, and EES the set of tail-entity semantic-equivalence pairs;
the meta-rules include:

Rule R_1: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), where S_i_app_x is the mobile application numbered x in the i-th data source, S_i_r_x its corresponding relation, and S_i_e_x its corresponding tail entity, and S_j_app_y, S_j_r_y, S_j_e_y are likewise the mobile application numbered y in the j-th data source, its corresponding relation, and its corresponding tail entity: if S_i_app_x and S_j_app_y are a head-entity semantic-equivalence pair, i.e. the head-entity semantic equivalence S_i_app_x ≡ S_j_app_y holds, and S_i_e_x and S_j_e_y are a tail-entity semantic-equivalence pair, i.e. the tail-entity semantic equivalence S_i_e_x ≡ S_j_e_y holds, then S_i_r_x and S_j_r_y are a relation semantic-equivalence pair, i.e. the relation semantic equivalence S_i_r_x ≡ S_j_r_y holds with confidence p. Rule R_1 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_p (S_i_r_x ≡ S_j_r_y)

Rule R_2: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y): if the head-entity semantic equivalence S_i_app_x ≡ S_j_app_y holds and the relation semantic equivalence S_i_r_x ≡ S_j_r_y holds, then the tail-entity semantic equivalence S_i_e_x ≡ S_j_e_y holds with confidence q. Rule R_2 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_r_x ≡ S_j_r_y) ⇒_q (S_i_e_x ≡ S_j_e_y)

Rule R_3: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y): if the relation semantic equivalence S_i_r_x ≡ S_j_r_y holds and the tail-entity semantic equivalence S_i_e_x ≡ S_j_e_y holds, then the head-entity semantic equivalence S_i_app_x ≡ S_j_app_y holds with confidence l. Rule R_3 is expressed as:

(S_i_r_x ≡ S_j_r_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_l (S_i_app_x ≡ S_j_app_y)
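One round of the meta-rule inference can be sketched as follows; the triples and equivalence sets are invented for illustration, and the helper stands in for the patent's rule engine.

```python
# Apply rules R1-R3 to one pair of triples from different sources.
# aes/res/ees are the head-entity, relation, and tail-entity
# semantic-equivalence pair sets (the seed set ES = AES u RES u EES).
def apply_meta_rules(t1, t2, aes, res, ees):
    new = []
    head, rel, tail = (t1[0], t2[0]), (t1[1], t2[1]), (t1[2], t2[2])
    if head in aes and tail in ees and rel not in res:
        new.append(("R1", rel))   # infer relation equivalence
    if head in aes and rel in res and tail not in ees:
        new.append(("R2", tail))  # infer tail-entity equivalence
    if rel in res and tail in ees and head not in aes:
        new.append(("R3", head))  # infer head-entity equivalence
    return new

t1 = ("S1_app_1", "developer", "ExampleSoft")
t2 = ("S2_app_9", "dev_company", "ExampleSoft")
inferred = apply_meta_rules(
    t1, t2,
    aes={("S1_app_1", "S2_app_9")},        # heads already aligned
    res=set(),
    ees={("ExampleSoft", "ExampleSoft")},  # tails already aligned
)
```

With the head and tail entities aligned, rule R_1 fires and proposes that "developer" and "dev_company" are semantically equivalent relations.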
Further, the probability that a potential entity or relation semantic-equivalence pair holds is computed with the probabilistic graph model, as follows:

P(equiv = F | R_1, R_2, R_3, S_0) = (1 − λ_0) · ∏_{i=1}^{3} (1 − p_i)^{K_i}

P(equiv = T | R_1, R_2, R_3, S_0) = 1 − (1 − λ_0) · ∏_{i=1}^{3} (1 − p_i)^{K_i}

where R_i = T denotes that the i-th rule satisfies its triggering condition, i ∈ {1, 2, 3}, and R_i = F that it does not; λ_0 is the similarity between the original semantic-equivalence entity pair; p_i is the probability that rule R_i holds, corresponding to the confidence of rule R_i; K_i is the number of times rule R_i is triggered; and S_0 is the initial probability of entity or relation semantic equivalence across different data sources. P(equiv = F | R_1, R_2, R_3, S_0) is the probability that the cross-source entity or relation semantic equivalence does not hold given rules R_1, R_2, R_3 and the initial probability S_0, and P(equiv = T | R_1, R_2, R_3, S_0) is the probability that it does hold.
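A sketch of this Noisy-or combination: the equivalence fails only if the prior evidence and every rule firing all fail independently. The concrete similarity, confidences, and trigger counts below are illustrative.

```python
# Noisy-or: P(equiv) = 1 - (1 - lam0) * prod_i (1 - p_i)^K_i
def noisy_or(lam0, confidences, triggers):
    p_false = 1.0 - lam0                    # prior evidence fails
    for p_i, k_i in zip(confidences, triggers):
        p_false *= (1.0 - p_i) ** k_i       # each rule firing fails
    return 1.0 - p_false                    # probability equiv holds

# lam0: similarity of the original pair; confidences p, q, l of rules
# R1-R3; triggers: how often each rule fired for this candidate pair.
p = noisy_or(lam0=0.5, confidences=[0.8, 0.7, 0.6], triggers=[1, 0, 2])
```

The resulting probability is then compared against the set probability threshold to accept or reject the candidate pair.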
Further, determining the seed set from the initial semantic-equivalence entity pairs includes:
determining the seed set from both the entity semantic-equivalence pairs preliminarily determined by similarity and the entity equivalence pairs obtained by comparing the character-string lengths of the entities.
Further, after encoding the entities and relations into vector representations:
the knowledge graph representation learning model is used to update the vector representations of the head and tail entities, and the network representation learning model is used to obtain the final entity vector representations from the updated representations.
Further, iterative hybrid representation learning is performed in advance on the network representation learning model and the knowledge graph representation learning model, as follows:

Step 201: train the knowledge graph representation learning model, with the loss function:

L_KG^{(k+1)} = Σ_{(h,r,e)∈T} Σ_{(h',r,e')∈T'} [ f_k(h, r, e) − f_k(h', r, e') + γ ]_+

where k is the round number of the iterative hybrid learning; L_KG^{(k+1)} is the loss function of the knowledge-graph-based representation learning model in round k+1; T is the triple set and T' is the negative-example triple set obtained by a negative sampling process that randomly replaces the head entity h or tail entity e of a triple with h' or e', with r the corresponding relation; [x]_+ = max(x, 0) is the hinge (fold) loss and γ is the margin; f_k(h, r, e) is the scoring function of the knowledge graph representation learning model for triple (h, r, e) at iteration k; and f_k(h', r, e') is its scoring function for the triple (h', r, e') after updating the head and tail entities at iteration k;

Step 202: for the triple vector representations after knowledge graph representation learning training, in the training of the network representation learning model the head-entity and tail-entity vectors are updated to the vectors of nodes v_i and v_j at iteration k of the network representation learning model, with v_i, v_j ∈ R^d, where d is the vector dimension and R^d is the network semantic space of dimension d; the loss function of network representation learning is defined as:

L_N^{(k+1)} = − Σ_{v_i ∈ V} Σ_{v_j ∈ N(v_i)} log g_k(v_i, v_j)

where L_N^{(k+1)} is the loss function of the network representation learning model in round k+1; V is the node set of the network representation learning model; N(v_i) is the set of neighbor nodes of v_i; and g_k(v_i, v_j) is the scoring function of the network representation learning model when updating nodes v_i and v_j at iteration k;

Step 203: take the learned vectors of nodes v_i and v_j as the head-entity and tail-entity vectors of the knowledge graph representation learning model in round k+1, and carry out round k+1 of its training;

the iterative hybrid representation learning terminates after the prescribed number of iterations, yielding the final vector representations of all entities.
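The alternation of steps 201–203 can be sketched as a simple loop; the two per-round "training" functions below are trivial numeric stand-ins for the actual knowledge graph and network representation learning models.

```python
# Stand-in for one round of knowledge graph representation learning.
def train_kg_round(vectors):
    return {k: [x * 0.9 for x in v] for k, v in vectors.items()}

# Stand-in for one round of network representation learning.
def train_network_round(vectors):
    return {k: [x + 0.1 for x in v] for k, v in vectors.items()}

# Steps 201-203: each round trains the KG model, hands the entity
# vectors to the network model as node vectors, and feeds the node
# vectors back as the next round's entity vectors.
def hybrid_training(entity_vectors, rounds):
    for _ in range(rounds):                               # fixed budget
        entity_vectors = train_kg_round(entity_vectors)       # step 201
        entity_vectors = train_network_round(entity_vectors)  # steps 202-203
    return entity_vectors

final = hybrid_training({"S1_app_1": [1.0, 2.0]}, rounds=2)
```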
Further, the method includes the following constraints:

Constraint CS_1: for an obtained head-entity semantic-equivalence pair S_i_app_x ≡ S_j_app_y and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding head entities S_i_app_x and S_j_app_y of the two triples are replaced during negative sampling, S_i_app_x and S_j_app_y must be excluded as negative-example alternatives;

Constraint CS_2: for an obtained tail-entity semantic-equivalence pair S_i_e_x ≡ S_j_e_y and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding tail entity S_i_e_x or S_j_e_y of the two triples is replaced during negative sampling, S_i_e_x and S_j_e_y must be excluded as negative examples. Here S_i_app_x is the mobile application numbered x in the i-th data source, S_i_r_x its corresponding relation, and S_i_e_x its corresponding tail entity; S_j_app_y is the mobile application numbered y in the j-th data source, S_j_r_y its corresponding relation, and S_j_e_y its corresponding tail entity.
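A sketch of constraint CS_1 applied during negative sampling: entities known to be semantically equivalent to a triple's head are never used as its corrupted replacement. Entity names here are illustrative.

```python
import random

# Corrupt the head of a triple while honoring constraint CS1:
# the head itself and anything semantically equivalent to it are
# banned as negative-example replacements.
def corrupt_head(triple, all_heads, head_equiv_pairs, rng):
    head, r, e = triple
    banned = {head} \
        | {b for a, b in head_equiv_pairs if a == head} \
        | {a for a, b in head_equiv_pairs if b == head}
    candidates = [h for h in all_heads if h not in banned]
    return (rng.choice(candidates), r, e)

rng = random.Random(0)
neg = corrupt_head(
    ("S1_app_1", "developer", "ExampleSoft"),
    all_heads=["S1_app_1", "S2_app_9", "S1_app_2"],
    head_equiv_pairs={("S1_app_1", "S2_app_9")},  # CS1: excluded
    rng=rng,
)
```

Only "S1_app_2" survives the ban, so it becomes the corrupted head; an analogous filter on tail entities implements CS_2.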
Further, calculating the similarity between entity vectors using cosine values and determining the entities whose similarity exceeds a set threshold as initial semantic-equivalence entity pairs includes:

Step 3.1: compute by cosine value the direct similarity sim_d(S_i_app_x, S_j_app_y) between the corresponding head entities:

sim_d(S_i_app_x, S_j_app_y) = ( v(S_i_app_x) · v(S_j_app_y) ) / ( ‖v(S_i_app_x)‖ · ‖v(S_j_app_y)‖ )

where S_i_app_x is the mobile application numbered x in the i-th data source, S_j_app_y is the mobile application numbered y in the j-th data source, and v(S_i_app_x) and v(S_j_app_y) are their vector representations;

Step 3.2: compute, combining the vector representations of the tail entities, the indirect similarity sim_ind(S_i_app_x, S_j_app_y) between the corresponding head entities:

ē_i = (1/N) · Σ_{n=1}^{N} v(e_n^i),  ē_j = (1/M) · Σ_{m=1}^{M} v(e_m^j)

sim_ind(S_i_app_x, S_j_app_y) = ( ē_i · ē_j ) / ( ‖ē_i‖ · ‖ē_j‖ )

where v(e_1^i), …, v(e_N^i) are the vector representations of the tail entities associated with S_i_app_x, v(e_1^j), …, v(e_M^j) are those associated with S_j_app_y, N and M are their respective numbers, and ē_i and ē_j are the indirect vector representations of the entities associated with S_i_app_x and S_j_app_y respectively;

Step 3.3: weight the direct and indirect similarity of S_i_app_x and S_j_app_y to obtain their final similarity sim(S_i_app_x, S_j_app_y):

sim(S_i_app_x, S_j_app_y) = α · sim_d(S_i_app_x, S_j_app_y) + (1 − α) · sim_ind(S_i_app_x, S_j_app_y)

where α, the weight of the direct similarity, takes a real value in [0, 1].
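Steps 3.1–3.3 can be sketched as follows; the vectors and the weight value are illustrative.

```python
import math

# Cosine similarity between two dense vectors.
def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Component-wise mean of a list of vectors (the indirect representation).
def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Direct similarity of the head-entity vectors, indirect similarity of
# the averaged tail-entity vectors, and their weighted combination.
def app_similarity(head_i, tails_i, head_j, tails_j, alpha=0.6):
    direct = cos(head_i, head_j)
    indirect = cos(mean(tails_i), mean(tails_j))
    return alpha * direct + (1 - alpha) * indirect

s = app_similarity(
    head_i=[1.0, 0.0], tails_i=[[1.0, 1.0], [1.0, 0.0]],
    head_j=[1.0, 0.0], tails_j=[[1.0, 1.0], [1.0, 0.0]],
)
```

Identical head vectors and identical tail-entity sets give a final similarity of 1, the maximum.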
The invention has the following beneficial technical effects. The method obtains initial semantic-equivalence entity pairs by similarity calculation, mines potential entity semantic-equivalence relations with meta-rules, computes the probability that each potential semantic-equivalence entity pair holds with a probabilistic graph model, and finally determines the semantic equivalence relations between entities across the multi-source mobile applications from those probabilities, reducing the computational complexity of building the entity semantic-equivalence dataset. The method migrates readily to the construction of other multi-domain, multi-source knowledge graphs. It significantly reduces the cost of manually annotating entity semantic-equivalence relations across multi-source data during knowledge graph construction, generates high-quality structured triples and entity equivalence relations, and realizes the value of sharing and reusing mobile application information. In addition, the hybrid training of knowledge graph representation learning and network representation learning further improves the accuracy of associated-entity discovery.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow diagram of a multi-source mobile application knowledge graph construction method of an embodiment of the method of the present invention;
FIG. 2 is a flow diagram of knowledge graph representation learning and network representation based entity discovery for a method embodiment of the present invention;
FIG. 3 is a meta-rule for modeling entity alignment based on a probabilistic graph model Noisy-or according to an embodiment of the method of the present invention.
Detailed Description
To further clarify the technical solutions of the present application, the following detailed description will be made with reference to the accompanying drawings and specific embodiments. It should be noted that the following description is only a preferred embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be construed as the protective scope of the present invention.
Embodiment: the knowledge graph construction method for multi-source mobile applications comprises the following steps:

Step 1: generate a triple set {(S_o_app_z, r, e)} based on the acquired mobile application data from different data sources, where S_o_app_z is a unique identifier corresponding to the head entity, defined as the mobile application numbered z in the o-th data source, r corresponds to the relation, and e corresponds to the tail entity;

Step 2: encode the entities and relations separately to obtain corresponding vector representations;

Step 3: calculate the similarity between entity vectors using cosine values, and preliminarily determine the entities whose vector similarity exceeds a set threshold as entity semantic-equivalence pairs;

Step 4: determine a seed set from the preliminarily determined entity semantic-equivalence pairs, and infer potential entity or relation semantic-equivalence pairs from the seed set according to the meta-rules;

Step 5: calculate, with the probabilistic graph model, the probability that each potential entity or relation semantic-equivalence pair holds; compare the calculated probability with a set probability threshold, finally determine from the comparison result the semantic equivalence relations between entities or relations across the multi-source mobile applications, and thereby obtain the multi-source mobile application knowledge graph.
In a specific embodiment, a crawler script framework is used to collect data associated with mobile applications from the major application stores, and the mobile application names are collected to form an application name list. More comprehensive supplementary data for the mobile applications is obtained from encyclopedias. The name list S may be obtained by the following set operation:

S = {app_z | S_o_app_z, o = 1, 2, …, O; z = 1, 2, …, Z}

For the z-th mobile application app_z, its record in data source o is denoted S_o_app_z, where O is the number of different sources and Z is the number of mobile applications in each data source.
Preprocess the acquired data, analyze the data types, and convert the classified structured and unstructured data into a structured triple set, including:

Step 1.1: analyze the data types associated with the data sources of each mobile application and divide them into two categories: structured data and unstructured data;

Step 1.2: parse the attribute types of the structured data, typically attribute tags of mobile applications in application stores (e.g., developer, company, language, version number, release date of the APP in the APP store) and infobox descriptions of mobile applications in encyclopedias; such data can be converted directly into structured triples {(S_o_app_z, r, e)}, where S_o_app_z corresponds to the head entity, r is the relation corresponding to the attribute tag, and e is the entity corresponding to attribute tag r, i.e., the tail entity of the triple;

Step 1.3: parse the attribute types of the unstructured data, generally textual introductions or descriptions of mobile applications in application stores and encyclopedias; identify entities in the text with named entity recognition, classify the identified entities under the drafted relations, and finally form a set of triples {(S_o_app_z, r', e')} to complete the structured triple information of the mobile application, where S_o_app_z corresponds to the head entity, r' is a drafted relation, and e' is the entity corresponding to r', i.e., the tail entity of the triple.
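Step 1.2 can be sketched as follows; the attribute record and its field names are illustrative stand-ins for an application-store listing.

```python
# Convert a structured attribute record into triples (S_o_app_z, r, e):
# each attribute tag becomes the relation, its value the tail entity.
def record_to_triples(app_id, attributes):
    return [(app_id, relation, value)
            for relation, value in attributes.items()]

triples = record_to_triples(
    "S1_app_1",
    {"developer": "ExampleSoft", "version": "2.3.1", "language": "zh"},
)
```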
In this embodiment, encoding the entities and relations to obtain their vector representations includes:

Step 2.1: convert the data into the text format required by the pre-training model according to its type. For the structured triple form {(S_o_app_z, r, e)}, express each triple as a sentence of the form "subject-predicate-is-object": (S_o_app_z [SEP] r [SEP] is [SEP] e), where [SEP] is the word-segmentation separator and "S_o_app_z", "r", "is", and "e" are each treated as word blocks, denoted tokens, during segmentation.

For the attribute types of unstructured data, a word segmentation tool is used to segment the text; to improve segmentation precision, a manually defined dictionary can be bound to the segmentation tool.

Step 2.2: encode the segmented tokens with the adapted Chinese pre-trained BERT model to obtain the vector representations of all tokens.
Further, in other embodiments, to improve the encoding effect and the accuracy of predicting the original sentence sequence, nouns or adjectives in the segmented word blocks are randomly replaced with synonyms from a synonym dictionary during encoding; the replacement probability is computed as:

P(t_i) = exp(−w(t_i)) / Σ_{j=1}^{n_w} exp(−w(t_j))

where t_i is a word block in the sentence, n_w is the number of word blocks in the sentence, w(t_i) ∈ [0, 1] is the penalty incurred by replacing word block t_i, and exp(·) is the exponential function.

In a specific embodiment, based on the vector representations obtained with the adapted Chinese pre-trained BERT model, the method further includes: updating the head-entity and tail-entity vector representations with the knowledge graph representation learning model, and obtaining the final entity vector representations with the network representation learning model from the updated representations; these final vectors are used to calculate the cosine similarity between entity vectors.
A hybrid training mode combining knowledge graph representation learning and network representation learning is adopted, performing hybrid iterative training over all triples, as shown in fig. 2; it specifically includes:
step 201: for all triples, train the knowledge graph representation learning model, whose loss function is:
L_KG^{(k+1)} = Σ_{(h,r,e)∈T} Σ_{(h',r,e')∈T'} [γ + f_k(h, r, e) − f_k(h', r, e')]_+

wherein k denotes the round of iterative hybrid learning; L_KG^{(k+1)} is the loss function of the knowledge graph representation learning model in round k+1; T is the set of positive triples and T' is the set of negative triples obtained by negative sampling, which randomly replaces the head entity h or the tail entity e of a triple with h' or e'; [x]_+ is the hinge loss, i.e. the maximum of x and 0; γ is the margin; f_k(h, r, e) is the scoring function of the knowledge graph representation learning model for the triple (h, r, e) at the k-th iteration; f_k(h', r, e') is the scoring function for the triple (h', r, e') after the head and tail entities are replaced at the k-th iteration;
step 202: for the triple vector representations after knowledge graph representation learning, in the training of the network representation learning model the head and tail entity vectors are updated respectively to the vectors of the nodes v_i, v_j at the k-th iteration of the network representation learning model, v_i, v_j ∈ R^d, wherein d is the dimension of the vector representation and R^d is the network semantic space of dimension d; the loss function of network representation learning is defined as:
L_NE^{(k+1)} = − Σ_{v_i∈V} Σ_{v_j∈N(v_i)} log g_k(v_i, v_j)

wherein L_NE^{(k+1)} is the loss function of the network representation learning model in round k+1; V is the node set of the network representation learning model; N(v_i) is the set of neighbor nodes of v_i; g_k(v_i, v_j) is the scoring function of the network representation learning model when updating nodes v_i, v_j in the k-th iteration;
step 203: the learned vectors of nodes v_i, v_j are taken as the head and tail entity vectors of the knowledge graph representation learning model for round k+1, and round k+1 of knowledge graph representation learning is performed;
the iterative hybrid representation learning is terminated after a predetermined number of iterations, yielding the final vector representations of all entities.
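The alternating scheme of steps 201 to 203 can be sketched as a training skeleton; the two update functions below are placeholders (dummy nudges) for whatever concrete knowledge graph / network representation learning models are used, which the patent deliberately leaves open:

```python
def kg_update(vectors):
    """Placeholder for one round of knowledge graph representation
    learning (e.g. margin-based triple scoring); here a dummy nudge."""
    return {k: [x + 0.01 for x in v] for k, v in vectors.items()}

def network_update(vectors):
    """Placeholder for one round of network representation learning
    over nodes v_i, v_j; here a dummy nudge."""
    return {k: [x * 0.99 for x in v] for k, v in vectors.items()}

def hybrid_train(vectors, rounds):
    """Iterative hybrid representation learning: each round runs the
    KG model, hands the updated head/tail entity vectors to the
    network model, and feeds the node vectors back into the next
    KG round, terminating at the predetermined iteration count."""
    for _ in range(rounds):
        vectors = kg_update(vectors)       # step 201
        vectors = network_update(vectors)  # steps 202-203
    return vectors

entities = {"app1": [0.0, 1.0], "app2": [1.0, 0.0]}
final = hybrid_train(entities, rounds=3)
```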
In a specific embodiment, the knowledge graph representation learning model and the network representation learning model can be implemented with existing techniques; they are not the inventive point of the present application, which does not limit their concrete implementation, so a detailed description is omitted.
In this embodiment, step 3 is referred to as the "multi-source entity discovery algorithm" and step 4 as the "multi-source entity alignment algorithm", as shown in fig. 1. In other embodiments, to increase the reliability of the similarity between mobile applications, the entity discovery algorithm indirectly calculates the similarity between mobile applications from the entity vectors associated with them and weights it with the direct similarity to obtain the final similarity between mobile applications; the method includes the following steps:
step 3.1: calculate the direct similarity sim_d(S_i_app_x, S_j_app_y) between the corresponding head entities via the cosine value, with the formula:

sim_d(S_i_app_x, S_j_app_y) = (v_i_x · v_j_y) / (‖v_i_x‖ ‖v_j_y‖)

wherein S_i_app_x denotes the mobile application numbered x from the i-th data source, S_j_app_y denotes the mobile application numbered y from the j-th data source, and v_i_x and v_j_y are the vector representations of S_i_app_x and S_j_app_y respectively;
step 3.2: knotVector representation of closure entities computes S for corresponding head entities i _app x And S j _app y Indirect similarity between them
Figure 220875DEST_PATH_IMAGE061
The formula is as follows:
Figure 420912DEST_PATH_IMAGE048
Figure 820801DEST_PATH_IMAGE062
Figure 840710DEST_PATH_IMAGE063
wherein the first stepiThe source number of the seed data isxOf a mobile application S i _app x The vector representation of the associated tail entity is noted as:
Figure 249825DEST_PATH_IMAGE051
(ii) a First, thejThe source number of the seed data isyOf a mobile application S j _app y Associated tail entity, noted
Figure 304369DEST_PATH_IMAGE064
N、MIs the number;
Figure 875159DEST_PATH_IMAGE065
is as followsiThe source number of the seed data isxOf a mobile application S i _app x An indirect vector representation of the associated entity,
Figure 382363DEST_PATH_IMAGE066
is as followsjThe source number of the seed data isyMobile application S j _app y Indirect vector of associated entityRepresents;
step 3.3: weight the direct and indirect similarity of the mobile application S_i_app_x numbered x from the i-th data source and the mobile application S_j_app_y numbered y from the j-th data source to obtain the final similarity sim(S_i_app_x, S_j_app_y), calculated as:

sim(S_i_app_x, S_j_app_y) = α · sim_d(S_i_app_x, S_j_app_y) + (1 − α) · sim_ind(S_i_app_x, S_j_app_y)

wherein α is the weight of the direct similarity, a real value in [0, 1].
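Steps 3.1 to 3.3 can be sketched as follows, assuming (as one reading of the formulas) that an application's indirect vector is the mean of its associated tail-entity vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def app_similarity(v_x, v_y, tails_x, tails_y, alpha=0.5):
    """Weighted combination of direct similarity (cosine of the two
    app vectors, step 3.1) and indirect similarity (cosine of the
    mean tail-entity vectors, step 3.2), per step 3.3."""
    direct = cosine(v_x, v_y)
    indirect = cosine(mean_vector(tails_x), mean_vector(tails_y))
    return alpha * direct + (1 - alpha) * indirect

sim = app_similarity([1.0, 0.0], [1.0, 0.0],
                     [[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0]])
print(round(sim, 3))  # 1.0: identical direct and indirect vectors
```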
In this embodiment, the entity discovery result after threshold screening is used as the input of the entity alignment algorithm, from which the correspondence between entities in the multi-source data is obtained.
Optionally, in the multi-source entity discovery algorithm, an initial set of semantic equivalence pairs is obtained at the semantic level from the similarity between the entities' vectors, and a set of syntactic equivalence pairs is obtained at the syntactic level by calculating the syntactic similarity between entities with a method based on string length and edit distance; the two sets complement each other, and the union of the semantic and syntactic equivalence pair sets is taken to determine the initial seed set, which can improve the accuracy of entity discovery.
The multi-source entity alignment algorithm specifically comprises the following steps:
step 4.1: obtain semantic equivalence pairs of entities across different data sources based on a string-equality algorithm, screen out entity equivalence pairs with high similarity from the entity discovery results according to a predetermined threshold, and manually check both results to form the initial seed set of semantic equivalence pairs, noted ES = AES ∪ RES ∪ EES, wherein AES denotes the set of semantic equivalence pairs of head entities, RES the set of semantic equivalence pairs of relations, and EES the set of semantic equivalence pairs of tail entities;
step 4.2: infer semantic equivalence pairs of potential entities or relations from AES, RES and EES according to designed meta-rules, which include:
rule 1 R_1: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity;

if S_i_app_x and S_j_app_y stand in the semantic equivalence relation of head entities, i.e. form a semantic equivalence pair of head entities, expressed as (S_i_app_x ≡ S_j_app_y), and S_i_e_x and S_j_e_y stand in the semantic equivalence relation of tail entities, i.e. form a semantic equivalence pair of tail entities, expressed as (S_i_e_x ≡ S_j_e_y), then S_i_r_x and S_j_r_y stand in the semantic equivalence relation of relations, i.e. form a semantic equivalence pair of relations (S_i_r_x ≡ S_j_r_y), with confidence p; rule R_1 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_p (S_i_r_x ≡ S_j_r_y)
rule 2 R_2: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities, expressed as (S_i_app_x ≡ S_j_app_y), and the relations S_i_r_x and S_j_r_y form a semantic equivalence pair of relations, expressed as (S_i_r_x ≡ S_j_r_y), then S_i_e_x and S_j_e_y stand in the semantic equivalence relation of tail entities (S_i_e_x ≡ S_j_e_y), with confidence q.
In a specific embodiment, optionally, the similarity between relation vectors is also calculated via cosine values, and the relations whose vector similarity exceeds the set threshold are preliminarily determined as semantic equivalence pairs of relations.
Rule R_2 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_r_x ≡ S_j_r_y) ⇒_q (S_i_e_x ≡ S_j_e_y)
rule 3 R_3: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_r_x and S_j_r_y stand in the semantic equivalence relation of relations, expressed as (S_i_r_x ≡ S_j_r_y), and S_i_e_x and S_j_e_y stand in the semantic equivalence relation of tail entities, expressed as (S_i_e_x ≡ S_j_e_y), then S_i_app_x and S_j_app_y stand in the semantic equivalence relation of head entities (S_i_app_x ≡ S_j_app_y), with confidence l; rule R_3 is expressed as:

(S_i_r_x ≡ S_j_r_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_l (S_i_app_x ≡ S_j_app_y)
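A minimal sketch of forward application of the three meta-rules over two triple sets follows; the set-of-pairs representation and helper names are illustrative, not from the patent:

```python
def apply_meta_rules(triples_a, triples_b, head_pairs, rel_pairs, tail_pairs):
    """Derive candidate equivalence pairs from two triple sets.
    R1: equal heads + equal tails => candidate relation pair
    R2: equal heads + equal rels  => candidate tail pair
    R3: equal rels  + equal tails => candidate head pair"""
    new_rel, new_tail, new_head = set(), set(), set()
    for (h1, r1, e1) in triples_a:
        for (h2, r2, e2) in triples_b:
            heads_eq = (h1, h2) in head_pairs
            rels_eq = (r1, r2) in rel_pairs
            tails_eq = (e1, e2) in tail_pairs
            if heads_eq and tails_eq:   # rule R1
                new_rel.add((r1, r2))
            if heads_eq and rels_eq:    # rule R2
                new_tail.add((e1, e2))
            if rels_eq and tails_eq:    # rule R3
                new_head.add((h1, h2))
    return new_head, new_rel, new_tail

ta = [("app_a", "developer", "X")]
tb = [("app_b", "dev", "X2")]
heads, rels, tails = apply_meta_rules(
    ta, tb, {("app_a", "app_b")}, set(), {("X", "X2")})
print(rels)  # {('developer', 'dev')} inferred by rule R1
```

Each derived pair would then be scored with the rule confidences p, q, l in step 4.3 rather than accepted outright.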
step 4.3: as shown in fig. 3, calculate the probability that a potential semantic equivalence pair of entities or relations holds according to the probability graph model, with the formulas:

P(R_i = T) = p_i,  P(R_i = F) = 1 − p_i

P(E = F | R_1, R_2, R_3, S_0) = (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

P(E = T | R_1, R_2, R_3, S_0) = 1 − (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

wherein R_i = T denotes that the i-th rule satisfies its trigger condition, i ∈ {1, 2, 3}, and R_i = F denotes that the i-th rule does not satisfy its trigger condition; λ_0 denotes the similarity between the original semantic equivalence entity pair; p_i denotes the probability that rule R_i holds, corresponding to the confidence of rule R_i; K_i denotes the number of times rule R_i is triggered; P(R_i) denotes the probability distribution of rule R_i; S_0 is the initial probability that the entity semantic equivalence or relation semantic equivalence across data sources holds; P(E = F | R_1, R_2, R_3, S_0) is the probability, given rules R_1, R_2, R_3 and the initial probability S_0, that the entity or relation semantic equivalence across data sources does not hold, and P(E = T | R_1, R_2, R_3, S_0) is the probability that it holds. Optionally, S_0 is the initial probability of the semantic equivalence of the different data source entities, i.e. λ_0.
Step 4.4: based on the designed meta-rules of the three equivalence relations, calculate the probability that the semantic equivalence relations between entities of different data sources hold, and screen according to a predetermined threshold to obtain the final semantic equivalence entity pairs.
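Under a noisy-or reading of the probability graph model (an assumption for illustration: the pair fails only if the prior fails and every rule firing fails), the probability of step 4.3 can be sketched as:

```python
def equivalence_probability(lambda0, confidences, trigger_counts):
    """P(E=T | R1..R3, S0) under a noisy-or combination.
    lambda0: initial probability S0 (similarity of the seed pair);
    confidences[i]: p_i, the confidence of rule R_i;
    trigger_counts[i]: K_i, how often rule R_i fired."""
    p_false = 1.0 - lambda0
    for p_i, k_i in zip(confidences, trigger_counts):
        p_false *= (1.0 - p_i) ** k_i
    return 1.0 - p_false

prob = equivalence_probability(0.6, [0.8, 0.7, 0.5], [1, 0, 2])
# additional rule firings can only raise the probability above lambda0
```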
In a specific embodiment, the knowledge graph construction method for multi-source mobile applications further includes: taking the result of the multi-source entity alignment algorithm as a constraint on the multi-source entity discovery algorithm, so that the entity discovery and entity alignment algorithms complement and constrain each other, finally completing the iteration of the algorithm; the specific constraints are:
constraint CS_1: given an obtained semantic equivalence relation of head entities (S_i_app_x ≡ S_j_app_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding head entities S_i_app_x and S_j_app_y in the two triples are replaced during negative sampling, S_i_app_x and S_j_app_y must be excluded from the negative-example replacement candidates;
constraint CS_2: given an obtained semantic equivalence relation of tail entities (S_i_e_x ≡ S_j_e_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding tail entities S_i_e_x or S_j_e_y in the two triples are replaced during negative sampling, S_i_e_x and S_j_e_y must be excluded from the negative-example replacement candidates; wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity.
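The constraints CS_1/CS_2 amount to filtering the candidate pool during negative sampling; a sketch (entity names illustrative):

```python
import random

def corrupt_triple(triple, all_entities, excluded, rng, replace_head=True):
    """Negative sampling that respects constraints CS_1/CS_2: entities
    known to be semantically equivalent to the slot being replaced are
    excluded from the negative-example candidates."""
    h, r, e = triple
    slot = h if replace_head else e
    candidates = [x for x in all_entities if x != slot and x not in excluded]
    repl = rng.choice(candidates)
    return (repl, r, e) if replace_head else (h, r, repl)

rng = random.Random(0)
neg = corrupt_triple(("S1_app1", "rel", "e1"),
                     ["S1_app1", "S2_app7", "other_app"],
                     excluded={"S2_app7"},  # aligned with S1_app1
                     rng=rng)
print(neg)  # head replaced by 'other_app', never by the aligned 'S2_app7'
```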
In a specific embodiment, the multi-source entity discovery algorithm of step 3 and the multi-source entity alignment algorithm of step 4 can be repeated until no new entity pair appears, finally forming the multi-source mobile application knowledge graph.
In the above embodiment, the probability that the entity semantic equivalence relation holds for the initial semantic equivalence entity pairs of different data sources is calculated based on the designed meta-rules of the three equivalence relations, and the final semantic equivalence entity pairs are obtained by screening according to a predetermined threshold. In other embodiments, the accuracy of the entity alignment algorithm can be improved by obtaining semantic equivalence pairs of entities with an image rule PR, including:
representing the picture identifiers (icons) of mobile applications from different data sources as vectors, modeling with the grayscale of the picture identifiers;
extracting deep feature representations from the image grayscale with a convolutional neural network, and judging whether the mobile applications are equivalent with the image matching rule:
image rule PR:

(S_i_app_x, picture identifier, S_i_Pic_x) ∧ (S_j_app_y, picture identifier, S_j_Pic_y) ∧ Sim(S_i_Pic_x, S_j_Pic_y) ≥ δ ⇒ (S_i_app_x ≡ S_j_app_y)

wherein S_i_app_x and S_j_app_y are mobile applications from the different data sources i and j, S_i_Pic_x and S_j_Pic_y are the images associated with the picture identifiers of S_i_app_x and S_j_app_y, "≡" is the semantic equivalence relation, and δ is the image matching threshold, a value in [0, 1]; if the image matching similarity is greater than the set threshold, the mobile applications S_i_app_x and S_j_app_y are semantically equivalent.
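As a simplified stand-in for the CNN feature extractor (the cosine comparison on flattened grayscale values below is an illustrative assumption, not the patent's network), rule PR can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened grayscale vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def icons_equivalent(gray_x, gray_y, delta=0.9):
    """Apply rule PR: the two apps are judged semantically equivalent
    when the similarity of their (flattened) grayscale icon features
    reaches the matching threshold delta in [0, 1]."""
    return cosine(gray_x, gray_y) >= delta

# two flattened 2x2 grayscale icons
icon_a = [0.1, 0.9, 0.9, 0.1]
icon_b = [0.1, 0.9, 0.9, 0.1]
print(icons_equivalent(icon_a, icon_b))  # True: identical icons
```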
By combining the iterative strategy of entity discovery and entity alignment, the invention significantly reduces the manual annotation cost of establishing entity correspondences across multi-source data during graph construction, and can be extended to knowledge graph construction in other fields.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A knowledge graph construction method for multi-source mobile application is characterized by comprising the following steps:
generating a triple set {(S_o_app_z, r, e)} based on mobile application data acquired from different data sources, wherein S_o_app_z is the corresponding head entity, defined as the mobile application numbered z from the o-th data source, r is the corresponding relation, and e is the corresponding tail entity;
respectively coding the entity and the relation to obtain corresponding vector representation;
calculating the similarity between entity vectors by utilizing cosine values, and preliminarily determining the entity corresponding to the vector representation with the similarity exceeding a set threshold value as a semantic equivalence pair of the entity;
determining a seed set according to the preliminarily determined semantic equivalence pairs of the entities, and reasoning out the semantic equivalence pairs of potential entities or relationships from the seed set according to a meta-rule;
calculating the probability of establishing a semantic equivalence pair of potential entities or relations according to a probability graph model; and comparing the calculated probability with a set probability threshold, finally determining the semantic equivalence relation between entities or relations in the multi-source mobile application according to the comparison result, and further obtaining the knowledge graph of the multi-source mobile application.
2. The method for constructing the knowledge graph for the multi-source mobile application according to claim 1, wherein the encoding of the entities and the relationships to obtain corresponding vector representations comprises:
expressing each triple as a sentence in the "subject predicate is object" pattern: (S_o_app_z [SEP] r [SEP] is [SEP] e), wherein [SEP] is the word-segmentation separator symbol, and "S_o_app_z", "r", "is", and "e" are all regarded as word blocks during word segmentation;

using the sentences as input, encoding the word blocks obtained by word segmentation with the adapted Chinese pre-training model BERT to obtain the vector representations of "S_o_app_z", "r", "is", and "e" in the triple.
3. The knowledge graph construction method for multi-source mobile applications according to claim 1, characterized in that, during the encoding of the entities and relations, nouns or adjectives among the segmented word blocks are randomly replaced with their synonyms according to a replacement probability based on a synonym dictionary, the replacement probability being calculated as:

P(t_i) = exp(−w(t_i)) / Σ_{j=1}^{n_w} exp(−w(t_j))

wherein t_i is a word block in the sentence, n_w is the number of word blocks in the sentence, j is the index of a word block, w(t_i) is the loss caused by replacing word block t_i in the sentence, and exp(·) is the exponential function.
4. The knowledge graph construction method for multi-source mobile applications, characterized in that the seed set is noted ES = AES ∪ RES ∪ EES, wherein AES denotes the set of semantic equivalence pairs of head entities, RES the set of semantic equivalence pairs of relations, and EES the set of semantic equivalence pairs of tail entities;
the meta-rule includes:
rule 1 R_1: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity;

if S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities, noted (S_i_app_x ≡ S_j_app_y), and S_i_e_x and S_j_e_y form a semantic equivalence pair of tail entities, noted (S_i_e_x ≡ S_j_e_y), then S_i_r_x and S_j_r_y form a semantic equivalence pair of relations (S_i_r_x ≡ S_j_r_y) with confidence p; rule R_1 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_p (S_i_r_x ≡ S_j_r_y)
rule 2 R_2: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities, noted (S_i_app_x ≡ S_j_app_y), and the relations S_i_r_x and S_j_r_y form a semantic equivalence pair of relations, noted (S_i_r_x ≡ S_j_r_y), then S_i_e_x and S_j_e_y form a semantic equivalence pair of tail entities (S_i_e_x ≡ S_j_e_y) with confidence q; rule R_2 is expressed as:

(S_i_app_x ≡ S_j_app_y) ∧ (S_i_r_x ≡ S_j_r_y) ⇒_q (S_i_e_x ≡ S_j_e_y)
rule 3 R_3: for triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), if S_i_r_x and S_j_r_y form a semantic equivalence pair of relations, noted (S_i_r_x ≡ S_j_r_y), and S_i_e_x and S_j_e_y form a semantic equivalence pair of tail entities, noted (S_i_e_x ≡ S_j_e_y), then S_i_app_x and S_j_app_y form a semantic equivalence pair of head entities (S_i_app_x ≡ S_j_app_y) with confidence l; rule R_3 is expressed as:

(S_i_r_x ≡ S_j_r_y) ∧ (S_i_e_x ≡ S_j_e_y) ⇒_l (S_i_app_x ≡ S_j_app_y)
5. The knowledge graph construction method for multi-source mobile applications according to claim 4, characterized in that the probability that a potential semantic equivalence pair of entities or relations holds is calculated according to the probability graph model, with the formulas:

P(R_i = T) = p_i,  P(R_i = F) = 1 − p_i

P(E = F | R_1, R_2, R_3, S_0) = (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

P(E = T | R_1, R_2, R_3, S_0) = 1 − (1 − λ_0) ∏_{i=1}^{3} (1 − p_i)^{K_i}

wherein R_i = T denotes that the i-th rule satisfies its trigger condition, i ∈ {1, 2, 3}, and R_i = F denotes that the i-th rule does not satisfy its trigger condition; λ_0 denotes the similarity between the original semantic equivalence entity pair; p_i denotes the probability that rule R_i holds, corresponding to the confidence of rule R_i; K_i denotes the number of times rule R_i is triggered; P(R_i) denotes the probability distribution of rule R_i; S_0 is the initial probability that the entity semantic equivalence or relation semantic equivalence across data sources holds; P(E = F | R_1, R_2, R_3, S_0) is the probability, given rules R_1, R_2, R_3 and the initial probability S_0, that the entity or relation semantic equivalence across data sources does not hold, and P(E = T | R_1, R_2, R_3, S_0) is the probability that it holds.
6. The knowledge graph construction method for multi-source mobile applications according to claim 1, characterized in that determining the seed set according to the initial semantic equivalence entity pairs comprises:

determining the seed set based on the semantic equivalence pairs of entities preliminarily determined according to the similarity, together with the equivalence pairs of entities obtained by string comparison of the character lengths between the entities.
7. The method of claim 1, wherein, after the entities and relations are encoded to obtain vector representations, the method further comprises:
updating the vector representations of the head and tail entities with a knowledge graph representation learning model, and obtaining the final entity vector representations with a network representation learning model based on the updated vector representations.
8. The knowledge graph construction method for multi-source mobile applications according to claim 7, characterized in that iterative hybrid representation learning is performed in advance on the network representation learning model and the knowledge graph representation learning model, comprising the following steps:
step 201: training a knowledge graph representation learning model, wherein a loss function of the training model is as follows:
L_KG^{(k+1)} = Σ_{(h,r,e)∈T} Σ_{(h',r,e')∈T'} [γ + f_k(h, r, e) − f_k(h', r, e')]_+

wherein k denotes the round of iterative hybrid learning; L_KG^{(k+1)} is the loss function of the knowledge graph representation learning model in round k+1; T is the set of positive triples and T' is the set of negative triples obtained by negative sampling, which randomly replaces the head entity h or the tail entity e of a triple with h' or e'; [x]_+ is the hinge loss, i.e. the maximum of x and 0; γ is the margin; f_k(h, r, e) is the scoring function of the knowledge graph representation learning model for the triple (h, r, e) at the k-th iteration; f_k(h', r, e') is the scoring function for the triple (h', r, e') after the head and tail entities are replaced at the k-th iteration;
step 202: for the triple vector representations after knowledge graph representation learning training, in the network representation learning model training the head and tail entity vectors are updated respectively to the vectors of the nodes v_i, v_j at the k-th iteration of the network representation learning model, v_i, v_j ∈ R^d, wherein d is the dimension of the vector representation and R^d is the network semantic space of dimension d; the loss function of network representation learning is defined as:

L_NE^{(k+1)} = − Σ_{v_i∈V} Σ_{v_j∈N(v_i)} log g_k(v_i, v_j)
wherein L_NE^{(k+1)} is the loss function of the network representation learning model in round k+1; V is the node set of the network representation learning model; N(v_i) is the set of neighbor nodes of node v_i; g_k(v_i, v_j) is the scoring function of the network representation learning model when updating nodes v_i, v_j in the k-th iteration;
step 203: the learned vectors of nodes v_i, v_j are taken as the head and tail entity vectors of the knowledge graph representation learning model for round k+1, and round k+1 of knowledge graph representation learning is performed;
the iterative hybrid representation learning is terminated after a predetermined number of iterations, yielding the final vector representations of all entities.
9. The method for constructing the knowledge graph for the multi-source mobile application according to claim 8, wherein the method comprises the following constraints:
constraint CS_1: given an obtained semantic equivalence pair of head entities (S_i_app_x ≡ S_j_app_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding head entities S_i_app_x and S_j_app_y in the two triples are replaced during negative sampling, S_i_app_x and S_j_app_y must be excluded from the negative-example replacement candidates;
constraint CS_2: given an obtained semantic equivalence pair of tail entities (S_i_e_x ≡ S_j_e_y) and the known triples (S_i_app_x, S_i_r_x, S_i_e_x) and (S_j_app_y, S_j_r_y, S_j_e_y), when the corresponding tail entities S_i_e_x or S_j_e_y in the two triples are replaced during negative sampling, S_i_e_x and S_j_e_y must be excluded from the negative-example replacement candidates; wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_i_r_x is its corresponding relation, and S_i_e_x is its corresponding tail entity; S_j_app_y is the mobile application numbered y from the j-th data source, S_j_r_y is its corresponding relation, and S_j_e_y is its corresponding tail entity.
10. The method for constructing a knowledge graph for multi-source mobile applications according to claim 1, wherein the similarity between entity vectors is calculated using cosine values, and the entities whose vector representations have a similarity exceeding a set threshold are determined as initial semantically equivalent entity pairs, comprising:
step 3.1: calculating the direct similarity sim_d(S_i_app_x, S_j_app_y) between the corresponding head entities by cosine value, with the formula:
sim_d(S_i_app_x, S_j_app_y) = (v(S_i_app_x) · v(S_j_app_y)) / (||v(S_i_app_x)|| · ||v(S_j_app_y)||)
wherein S_i_app_x is the mobile application numbered x from the i-th data source, S_j_app_y is the mobile application numbered y from the j-th data source, and v(S_i_app_x) and v(S_j_app_y) are the vector representations of S_i_app_x and S_j_app_y respectively;
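The direct similarity of step 3.1 is plain cosine similarity between the two head-entity vectors; a minimal implementation:

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between vectors u and v:
    # dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```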
step 3.2: calculating the indirect similarity sim_ind(S_i_app_x, S_j_app_y) between the corresponding head entities by combining the vector representations of their tail entities, with the formulas:
v̄(S_i_app_x) = (1/N) Σ_{n=1..N} v(S_i_e_n)
v̄(S_j_app_y) = (1/M) Σ_{m=1..M} v(S_j_e_m)
sim_ind(S_i_app_x, S_j_app_y) = (v̄(S_i_app_x) · v̄(S_j_app_y)) / (||v̄(S_i_app_x)|| · ||v̄(S_j_app_y)||)
wherein v(S_i_e_1), ..., v(S_i_e_N) are the vector representations of the tail entities associated with the mobile application S_i_app_x numbered x from the i-th data source; v(S_j_e_1), ..., v(S_j_e_M) are the vector representations of the tail entities associated with the mobile application S_j_app_y numbered y from the j-th data source; N and M are the respective numbers of associated tail entities; v̄(S_i_app_x) is the indirect vector representation of the entities associated with S_i_app_x, and v̄(S_j_app_y) is the indirect vector representation of the entities associated with S_j_app_y;
step 3.3: weighting the direct similarity and the indirect similarity of the mobile application S_i_app_x numbered x from the i-th data source and the mobile application S_j_app_y numbered y from the j-th data source to obtain the final similarity sim(S_i_app_x, S_j_app_y) between them, with the calculation formula:
sim(S_i_app_x, S_j_app_y) = α · sim_d(S_i_app_x, S_j_app_y) + (1 − α) · sim_ind(S_i_app_x, S_j_app_y)
wherein α is the weight of the direct similarity, taking a real value in [0, 1].
CN202211187813.7A 2022-09-28 2022-09-28 Knowledge graph construction method for multi-source mobile application Active CN115292520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211187813.7A CN115292520B (en) 2022-09-28 2022-09-28 Knowledge graph construction method for multi-source mobile application


Publications (2)

Publication Number Publication Date
CN115292520A true CN115292520A (en) 2022-11-04
CN115292520B CN115292520B (en) 2023-02-03

Family

ID=83833596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211187813.7A Active CN115292520B (en) 2022-09-28 2022-09-28 Knowledge graph construction method for multi-source mobile application

Country Status (1)

Country Link
CN (1) CN115292520B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160335544A1 (en) * 2015-05-12 2016-11-17 Claudia Bretschneider Method and Apparatus for Generating a Knowledge Data Model
CN109582761A (en) * 2018-09-21 2019-04-05 浙江师范大学 A kind of Chinese intelligent Answer System method of the Words similarity based on the network platform
CN109992786A (en) * 2019-04-09 2019-07-09 杭州电子科技大学 A kind of semantic sensitive RDF knowledge mapping approximate enquiring method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FU DUANKANG: "Semantic Search for Software Crowdsourcing Services Based on Knowledge Graphs", China Master's Theses Full-text Database (Information Science and Technology) *
HU PANPAN: "Natural Language Processing: From Introduction to Practice", 30 April 2020, China Railway Publishing House *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116049148A (en) * 2023-04-03 2023-05-02 中国科学院成都文献情报中心 Construction method of domain meta knowledge engine in meta publishing environment
CN116049148B (en) * 2023-04-03 2023-07-18 中国科学院成都文献情报中心 Construction method of domain meta knowledge engine in meta publishing environment
CN116756327A (en) * 2023-08-21 2023-09-15 天际友盟(珠海)科技有限公司 Threat information relation extraction method and device based on knowledge inference and electronic equipment
CN116756327B (en) * 2023-08-21 2023-11-10 天际友盟(珠海)科技有限公司 Threat information relation extraction method and device based on knowledge inference and electronic equipment

Also Published As

Publication number Publication date
CN115292520B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
CN111444340B (en) Text classification method, device, equipment and storage medium
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN115292520B (en) Knowledge graph construction method for multi-source mobile application
CN116992005B (en) Intelligent dialogue method, system and equipment based on large model and local knowledge base
CN117033571A (en) Knowledge question-answering system construction method and system
US20240143644A1 (en) Event detection
CN109918647A (en) A kind of security fields name entity recognition method and neural network model
Zhang et al. Multifeature named entity recognition in information security based on adversarial learning
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN115310551A (en) Text analysis model training method and device, electronic equipment and storage medium
Kathuria et al. Real time sentiment analysis on twitter data using deep learning (Keras)
CN113761190A (en) Text recognition method and device, computer readable medium and electronic equipment
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN115759254A (en) Question-answering method, system and medium based on knowledge-enhanced generative language model
CN116303881A (en) Enterprise organization address matching method and device based on self-supervision representation learning
CN117688560A (en) Semantic analysis-oriented intelligent detection method for malicious software
Pu et al. Lexical knowledge enhanced text matching via distilled word sense disambiguation
CN117807482A (en) Method, device, equipment and storage medium for classifying customs clearance notes
Yang et al. CNN-based two-branch multi-scale feature extraction network for retrosynthesis prediction
CN115757837B (en) Confidence evaluation method and device for knowledge graph, electronic equipment and medium
CN115048929A (en) Sensitive text monitoring method and device
Sultana et al. Fake News Detection Using Machine Learning Techniques
Mamatha et al. Supervised aspect category detection of co-occurrence data using conditional random fields
CN117971990B (en) Entity relation extraction method based on relation perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant