Detailed Description
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of illustrating the present disclosure and should not be construed as limiting the same.
Fig. 1 is a schematic flowchart of a relationship building method based on tags according to an embodiment of the present disclosure.
This embodiment is described by taking as an example the case in which the tag-based relationship building method is configured in a tag-based relationship building apparatus. The tag-based relationship building apparatus may be disposed in a server or in an electronic device, which is not limited in this disclosure.
This embodiment takes as an example the case in which the tag-based relationship building method is configured in an electronic device. The electronic device may be a hardware device having any of various operating systems, such as a smart phone, a tablet computer, a personal digital assistant, or an e-book reader.
It should be noted that, in terms of hardware, the execution subject of the embodiment of the present disclosure may be, for example, a Central Processing Unit (CPU) in a server or an electronic device, and, in terms of software, may be, for example, a related background service in the server or the electronic device, which is not limited herein.
As shown in Fig. 1, the tag-based relationship building method includes:
S101: determining an entity identifier.
A unique identifier that is inherent to an entity and used to distinguish the entity from other entities may be referred to as an entity identifier. The entity may be identified using data, a character string, or text. The entity identifier may be, for example, a character string composed of one or more of letters (a-z, A-Z), digits (0-9), and the underscore "_", or may be text representing the characteristics of the entity, and the like, which is not limited herein.
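As a minimal sketch of the character-string form of the entity identifier described above, the following check accepts only letters, digits, and underscores. The names `ENTITY_ID_PATTERN` and `is_valid_entity_id` are hypothetical and are not part of the disclosure; an identifier given as free text would need a different check.

```python
import re

# Assumed format: one or more letters (a-z, A-Z), digits (0-9), or underscores.
ENTITY_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_entity_id(identifier: str) -> bool:
    """Return True if the identifier uses only letters, digits, and underscores."""
    return ENTITY_ID_PATTERN.fullmatch(identifier) is not None


print(is_valid_entity_id("vehicle_0042"))  # True
print(is_valid_entity_id("vehicle-0042"))  # False, the hyphen is not allowed
```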
In the embodiment of the present disclosure, determining the entity identifier may be selecting, from a database, an entity for which an entity identifier has been configured in advance, or determining the entity identifier corresponding to an entity from a sample text according to entity features such as the semantics of the entity.
S102: parsing a plurality of candidate entities corresponding to the entity identifier from a plurality of data sets, wherein the plurality of data sets include: data sets of multiple business domains, data sets of multiple channels, and data sets of multiple types.
The plurality of data sets may be data sets of a plurality of business domains, a plurality of channels, and a plurality of types. The data sets of the multiple business domains may be data sets containing different business data, such as data sets generated by business domains for city management, traffic administration, and the like. The data sets of the multiple channels may be different data sets formed according to different data source channels, such as a data set derived from a production channel or a data set derived from a processing channel. The data sets of the multiple types may be data sets divided according to whether the data is structured or unstructured, such as data sets representing traffic and data sets representing various types of sensor measurements in a city.
In the embodiment of the present disclosure, the data of all business domains, all channels, and all types may be collectively referred to as global data, and the plurality of data sets may be sets of global data.
In the embodiment of the present disclosure, parsing the plurality of candidate entities corresponding to the entity identifier from the plurality of data sets may be determining, according to the entity identifier, the entities in the plurality of data sets. The plurality of candidate entities may have the same entity identifier, but the semantics and attributes they represent may be the same or different; that is, ambiguity may exist among the plurality of candidate entities.
For example, suppose the entity identifier is "Xiaoming". In data sets of different types, the candidate entity "Xiaoming" may represent a person, a business, or a vehicle; that is, the candidate entity "Xiaoming" may have different semantics and attributes.
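The parsing in S102 can be sketched as a lookup of one identifier across several data sets. This is only an illustration under simplifying assumptions: the `Record` structure, the data set names, and `parse_candidates` are hypothetical, and real data sets would not be in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    entity_id: str    # entity identifier shared by ambiguous candidates
    entity_type: str  # e.g. "person", "business", "vehicle"
    source: str       # hypothetical name of the data set the record came from


def parse_candidates(entity_id, data_sets):
    """Collect every record whose identifier matches, across all data sets."""
    return [rec
            for records in data_sets.values()
            for rec in records
            if rec.entity_id == entity_id]


data_sets = {
    "population": [Record("Xiaoming", "person", "population")],
    "registry": [Record("Xiaoming", "business", "registry"),
                 Record("Xiaohong", "business", "registry")],
    "traffic": [Record("Xiaoming", "vehicle", "traffic")],
}
candidates = parse_candidates("Xiaoming", data_sets)
print(sorted(c.entity_type for c in candidates))  # ['business', 'person', 'vehicle']
```

The three results share one identifier but differ in type, which is the ambiguity the later steps have to resolve.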
S103: selecting a part of the candidate entities from the plurality of candidate entities as target entities.
In the embodiment of the present disclosure, selecting the part of the candidate entities may be randomly selecting some of the plurality of candidate entities, or selecting some of them according to the type of the data set, which is not limited herein.
For example, for the plurality of candidate entities "Xiaoming" parsed from the plurality of data sets, a selection ratio (e.g., 50%) may be set and a part of the candidate entities may be randomly selected as target entities as needed, or the candidate entities "Xiaoming" that represent a person may be selected according to the type of the data set.
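The two selection strategies mentioned for S103 can be sketched as follows. The function names, the dictionary layout of a candidate, and the rule of keeping at least one candidate are assumptions made for illustration only.

```python
import random


def select_by_ratio(candidates, ratio, seed=None):
    """Randomly keep about `ratio` of the candidates (at least one if any exist)."""
    if not candidates:
        return []
    k = max(1, round(len(candidates) * ratio))
    return random.Random(seed).sample(candidates, k)


def select_by_type(candidates, wanted_type):
    """Keep only the candidates of the wanted type."""
    return [c for c in candidates if c["type"] == wanted_type]


candidates = [
    {"id": "Xiaoming", "type": "person"},
    {"id": "Xiaoming", "type": "business"},
    {"id": "Xiaoming", "type": "vehicle"},
    {"id": "Xiaoming", "type": "person"},
]
half = select_by_ratio(candidates, 0.5, seed=0)  # two randomly chosen candidates
people = select_by_type(candidates, "person")    # the two "person" candidates
print(len(half), len(people))  # 2 2
```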
S104: determining a plurality of target tags corresponding to the plurality of target entities.
A tag that reflects the semantic information of an application scenario and is used to recognize and describe a target entity may be referred to as a target tag. It is understood that both the static basic attributes and the dynamic behavior data of the target entity may serve as target tags. For example, when the target entity is a person, the static basic attributes may be basic information (age, sex, etc.), family information, work information, and the like, and the dynamic behavior data may be data generated by various dynamic behaviors such as study, work, daily life, entertainment, and social activities.
It is understood that one target entity may have a plurality of target tags, and one target tag may correspond to one or more target entities, which is not limited herein.
In the embodiment of the present disclosure, determining the plurality of target tags corresponding to the plurality of target entities may be determining the target tags according to the attributes of the target entities in the data set, or may be setting one or more attributes as target tags in advance and then determining, according to the attributes of the target entities in the data set, the target tags that correspond to those attributes.
For example, given the sample text "Xiaoming's cup is a glass cup", the target entity "cup" can be determined to be an article, and the tags "glass product" and "belongs to Xiaoming" can be determined for the cup according to the context semantics of the target entity "cup" in the database.
S105: determining the target entity relationship among the plurality of target entities according to the plurality of target tags.
A relationship used to describe the connection between multiple target entities may be referred to as a target entity relationship. The target entity relationship may be, for example, a word that represents the relationship between target entities. For example, according to a sample text in which the target entity "Xiaoming" and the target entity "glass cup" appear, the target entity relationship between the two can be determined to be "owned".
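One simple way to picture S105 is a rule that reads a relationship directly off the tags. The sketch below assumes that each target entity carries its tags as a dictionary and that an "owner" tag naming another target entity implies the relationship "owned"; both the tag layout and the rule are illustrative assumptions, not the method of the disclosure.

```python
def relations_from_tags(tags_by_entity):
    """Emit (owner, "owned", item) when an item's "owner" tag names another target entity."""
    relations = []
    for entity, tags in tags_by_entity.items():
        owner = tags.get("owner")
        if owner in tags_by_entity and owner != entity:
            relations.append((owner, "owned", entity))
    return relations


tags_by_entity = {
    "Xiaoming": {"category": "person"},
    "cup": {"category": "article", "material": "glass", "owner": "Xiaoming"},
}
print(relations_from_tags(tags_by_entity))  # [('Xiaoming', 'owned', 'cup')]
```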
In this embodiment, an entity identifier is determined; a plurality of candidate entities corresponding to the entity identifier are parsed from a plurality of data sets, wherein the plurality of data sets include data sets of multiple business domains, data sets of multiple channels, and data sets of multiple types; a part of the candidate entities is selected from the plurality of candidate entities as target entities; a plurality of target tags corresponding to the plurality of target entities are determined; and the target entity relationship among the plurality of target entities is determined according to the plurality of target tags. Because the target entities are tagged, the tags corresponding to the candidate entities are determined, and the entity relationships among the candidate entities are then matched, intelligent matching of candidate entities and entity relationships can be realized. This reduces the influence of ambiguity on candidate entity identification and effectively improves the accuracy of identifying target entities and target entity relationships.
Fig. 2 is a schematic flowchart of a method for constructing a relationship based on a tag according to another embodiment of the present disclosure.
As shown in Fig. 2, the tag-based relationship building method includes:
S201: acquiring massive entity data.
Data that is huge in volume and contains a plurality of target entities and a plurality of target entity relationships may be referred to as massive entity data. The massive entity data may contain a plurality of data sets, which is not limited herein.
In the embodiment of the present disclosure, acquiring the massive entity data may be acquiring, according to an actual application scenario, massive data containing target entities and target entity relationships, or collecting data related to target entities and target entity relationships from a large database.
For example, in a real-world scenario, an enterprise acquires the data recorded in its enterprise management platform, where the data includes information such as basic user information and user consumption information.
S202: extracting a plurality of reference entities from the massive entity data, wherein the plurality of reference entities correspond to the same or different information, and the information is business domain information, channel information, or type information.
In the embodiment of the present disclosure, extracting the plurality of reference entities may be randomly extracting a plurality of reference entities from the massive entity data. It can be understood that the plurality of reference entities correspond to the same or different information; that is, several reference entities may represent the same object. For example, the reference entity "fire truck", the reference entity "fire emergency vehicle", and the reference entity "urban fire truck" may all represent the same object.
In the embodiment of the present disclosure, the information may be business domain information, channel information, or type information. For example, in a data set of a business domain, the information may be customer information, business information, and the like represented by a reference entity; in a data set of a channel, the information may be the pieces of information about the reference entities collected through that channel; and in a data set of a type, the information may be the pieces of information that represent the types of the reference entities.
S203: constructing a target entity knowledge base according to the plurality of reference entities.
A knowledge base that is constructed according to the plurality of reference entities and is related to the business domain information, the channel information, or the type information may be referred to as a target entity knowledge base. The target entity knowledge base may be a knowledge graph, or a knowledge base containing a plurality of target entities and target entity relationships.
In the embodiment of the present disclosure, a knowledge base based on target entities may be constructed according to the corresponding information among the plurality of reference entities. For example, in a city management scenario, the target entities "Xiaoming" and "Xiaohong" representing persons and the target entities "Xiaoming" and "Xiaohong" representing vehicles may be identified, and a target entity knowledge base such as { Xiaoming-person, Xiaohong-person, Xiaoming-vehicle, Xiaohong-vehicle } may be constructed.
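As a rough sketch of S203, reference entities that denote the same object can be grouped under one knowledge base entry. The tuple layout (mention, canonical key, type), the canonical keys, and `build_knowledge_base` are hypothetical; a real knowledge base or knowledge graph would also store relationships.

```python
from collections import defaultdict


def build_knowledge_base(reference_entities):
    """Group reference entities (mention, canonical key, type) by canonical key."""
    kb = defaultdict(lambda: {"type": None, "mentions": set()})
    for mention, canonical, entity_type in reference_entities:
        kb[canonical]["type"] = entity_type
        kb[canonical]["mentions"].add(mention)
    return dict(kb)


reference_entities = [
    ("fire truck", "fire_truck", "vehicle"),
    ("fire emergency vehicle", "fire_truck", "vehicle"),
    ("urban fire truck", "fire_truck", "vehicle"),
    ("Xiaoming", "xiaoming_person", "person"),
]
kb = build_knowledge_base(reference_entities)
print(sorted(kb["fire_truck"]["mentions"]))
# ['fire emergency vehicle', 'fire truck', 'urban fire truck']
```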
S204: tagging the massive entity data based on the plurality of business domains respectively, to obtain a plurality of candidate tags respectively corresponding to the plurality of business domains.
In the embodiment of the present disclosure, the massive entity data of all business domains, all channels, and all types may be sorted, and the information corresponding to the massive entity data may be tagged to generate a plurality of corresponding candidate tags.
For example, the massive entity data may be divided into production data, sales data, and the like; the production data of the production business domain and the sales data of the sales business domain may then each be tagged, so as to obtain production tags, sales tags, and the like respectively corresponding to the plurality of business domains, such as the production business domain and the sales business domain.
S205: tagging the massive entity data based on the plurality of channels respectively, to obtain a plurality of candidate tags respectively corresponding to the plurality of channels.
In the embodiment of the present disclosure, the massive entity data may be tagged according to the plurality of channels. For example, the massive entity data may be divided into data from different source channels, such as internet data and real-scene data, and the data may then be tagged to obtain a plurality of candidate tags respectively corresponding to the plurality of channels.
S206: tagging the massive entity data based on the plurality of types respectively, to obtain a plurality of candidate tags respectively corresponding to the plurality of types.
In the embodiment of the present disclosure, the massive entity data may be tagged according to the plurality of types. For example, the massive entity data may be divided into multiple categories representing people, enterprises, objects, and others, and tagged according to the divided types to obtain a plurality of candidate tags respectively corresponding to the plurality of types.
S207: forming a global data tag library according to the plurality of candidate tags.
In the embodiment of the present disclosure, the global data tag library includes the obtained candidate tags respectively corresponding to the plurality of business domains, the plurality of channels, and the plurality of types.
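Steps S204 to S207 can be pictured as tagging the same records along three dimensions and collecting the results in one library. In the sketch below the record fields, the dimension names "domain", "channel", and "type", and the example tags are all assumptions for illustration.

```python
def label_by(records, dimension):
    """Map each value of `dimension` (e.g. a business domain) to the tags seen for it."""
    labels = {}
    for rec in records:
        labels.setdefault(rec[dimension], set()).update(rec["tags"])
    return labels


def build_global_tag_library(records):
    """Combine the per-domain, per-channel, and per-type candidate tags."""
    return {dim: label_by(records, dim) for dim in ("domain", "channel", "type")}


records = [
    {"domain": "production", "channel": "internet", "type": "enterprise",
     "tags": ["output volume"]},
    {"domain": "sales", "channel": "real scene", "type": "person",
     "tags": ["purchase history"]},
    {"domain": "sales", "channel": "internet", "type": "enterprise",
     "tags": ["revenue"]},
]
library = build_global_tag_library(records)
print(sorted(library["domain"]["sales"]))  # ['purchase history', 'revenue']
```

Keeping the three dimensions as separate top-level keys makes it possible to look up candidate tags by business domain, by channel, or by type without rescanning the data.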
Optionally, in some embodiments, the types of candidate tags include: attribute tags, feature tags, fact tags, inference tags.
A candidate tag formed from an attribute that is inherent in the massive entity data and does not change with external conditions may be referred to as an attribute tag. The attribute describes a certain entity or a certain relationship, and the corresponding data source is generally independent of any specific scenario, so attribute tags are relatively static and have a long life cycle.
A candidate tag formed by tagging features in the massive entity data that can be distinguished and identified from one another may be referred to as a feature tag. Unlike an attribute tag, a feature tag may change as external conditions change.
According to how they are generated and calculated, candidate tags may be divided into fact tags and inference tags. Fact tags may be non-enumerable tags that can be obtained by simply sorting the massive entity data, for example, a person's address, a birth date, a social security account number, and the like. Inference tags may be further subdivided, for example, into statistical tags, rule tags, mining tags, and the like. Statistical tags may be tags assembled, based on experience and actual business requirements, from the dimensions and the metric matrix of the scenario in which the massive entity data is located, for example, the number of valid samples. Rule tags may be tags that do not correspond directly to the massive entity data and need to be obtained through rule definition and calculation, for example, tags such as "infant" and "teenager". Mining tags may be tags that cannot be obtained directly and need to be obtained through complex logical analysis and reasoning, that is, conclusions drawn from regularities across multiple events of the analyzed object in multiple scenarios, for example, a high-risk enterprise tag, a high-growth enterprise tag, and the like.
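The difference between a fact tag and the inference tags derived from it can be illustrated with a small sketch. The age thresholds, the tag names returned, and both function names are assumptions chosen for the example and are not defined by the disclosure.

```python
from datetime import date


def age_rule_tag(birth_date, today):
    """Rule tag derived from a fact tag (birth date); the thresholds are illustrative."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    if age < 3:
        return "infant"
    if 13 <= age < 20:
        return "teenager"
    return "other"


def valid_sample_count(records, required_fields):
    """Statistical tag: number of records in which every required field is present."""
    return sum(all(rec.get(f) is not None for f in required_fields)
               for rec in records)


today = date(2025, 1, 1)
print(age_rule_tag(date(2010, 6, 1), today))  # teenager
print(age_rule_tag(date(2024, 3, 1), today))  # infant

records = [
    {"address": "District A", "birth_date": date(2010, 6, 1)},
    {"address": None, "birth_date": date(2024, 3, 1)},
]
print(valid_sample_count(records, ["address", "birth_date"]))  # 1
```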
S208: determining an entity identifier.
S209: parsing a plurality of candidate entities corresponding to the entity identifier from a plurality of data sets, wherein the plurality of data sets include: data sets of multiple business domains, data sets of multiple channels, and data sets of multiple types.
For the description of S208 to S209, reference may be made to the above embodiments; details are not repeated here.
S210: disambiguating the plurality of candidate entities according to the target entity knowledge base to obtain the target entities.
In the target entity knowledge base, the mention of a candidate entity may correspond to one or more different meanings; that is, the candidate entity is ambiguous and needs to be disambiguated. For example, in the sample text "the door has no lock", "lock" may be regarded as a candidate entity, and it may refer either to the object "lock" (a static noun) or to the action "to lock" (a dynamic verb); the candidate entity "lock" is therefore ambiguous and needs to be disambiguated. Entity disambiguation may be the process of linking a candidate entity to its corresponding target entity.
Fig. 3 is a schematic diagram of a candidate entity disambiguation model according to another embodiment of the present disclosure. As shown in Fig. 3, the candidate entities are disambiguated, according to the plurality of candidate tags corresponding to the candidate entities before disambiguation, by using entity linking and an "unsupervised clustering + relationship transfer" method. The similarity between entity mentions is calculated using entity mention similarity calculation based on surface features, entity mention similarity calculation based on extended features, and entity mention similarity calculation based on a social network. The entity mentions are then clustered by a clustering algorithm, and each category in the clustering result corresponds to one target entity. This method of processing the similarities with a clustering algorithm and relationship transfer may be referred to as the unsupervised clustering and relationship transfer method.
The surface features may refer to the basic features of a candidate entity; the extended features (hidden features) may refer to features of the candidate entity that need to be obtained through a certain amount of calculation or processing; and the social network may refer to features of the role the candidate entity plays in society. Calculating entity mention similarity based on surface features, extended features, and the social network can improve the similarity calculation.
For example, when the candidate entity is a vehicle, the surface features may be the color, shape, size, and other features of the candidate entity; the hidden features may be the corresponding vehicle condition, driving condition, and other features; and the social network may be the use of the vehicle (such as urban fire protection or public transportation).
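The clustering of entity mentions described above can be sketched under strong simplifying assumptions: similarity is taken as a weighted Jaccard overlap of the three feature groups, and "relationship transfer" is interpreted here as transitive merging with a union-find structure. The weights, the threshold, and this interpretation are illustrative choices, not the disclosed algorithm.

```python
def jaccard(a, b):
    """Overlap of two feature sets, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(m1, m2, weights=(0.4, 0.3, 0.3)):
    """Weighted mix of surface, extended, and social-network feature overlap."""
    keys = ("surface", "extended", "social")
    return sum(w * jaccard(m1[k], m2[k]) for w, k in zip(weights, keys))


def cluster(mentions, threshold=0.5):
    """Merge similar mentions; merges propagate transitively via union-find."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if similarity(mentions[i], mentions[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(mentions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


mentions = [
    {"surface": {"red", "large"}, "extended": {"in service"},
     "social": {"urban fire protection"}},
    {"surface": {"red", "large"}, "extended": {"in service"},
     "social": {"urban fire protection"}},
    {"surface": {"blue", "long"}, "extended": {"in service"},
     "social": {"public transportation"}},
]
print(cluster(mentions))  # [[0, 1], [2]]
```

Each resulting group of mention indices would correspond to one target entity; here the two fire-protection vehicles fall into one cluster and the public-transportation vehicle into another.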
Optionally, in some embodiments, disambiguating the plurality of candidate entities according to the target entity knowledge base to obtain the target entities may be: linking a candidate entity to a plurality of target candidate entities in the target entity knowledge base; determining a plurality of similarities between the candidate entity and the plurality of target candidate entities according to a pre-trained entity link model; and taking a target candidate entity whose similarity satisfies a set condition as the target entity. The entity link model is obtained by training an artificial intelligence model in advance, based on an association modeling method and a consistency modeling method, until the model converges. Because the similarity is determined by the pre-trained entity link model and the target candidate entity whose similarity meets the requirement is selected as the target entity, the similarity between the candidate entity and the target candidate entities can be calculated more accurately, which improves the disambiguation of ambiguous candidate entities.
In other embodiments, the plurality of candidate entities may also be disambiguated by mining the semantic features of the candidate entities in their context and determining the target entities using those semantic features, which is not limited herein.
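The selection of a target entity by similarity can be sketched as follows. The scoring function is passed in as a parameter so that a trained entity link model could be substituted; `overlap_score` is only a placeholder based on word overlap, and the threshold `min_score` stands in for the "set condition", both being assumptions for illustration.

```python
def overlap_score(a, b):
    """Placeholder for the pre-trained entity link model: Jaccard overlap of context words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def link_entity(mention_context, target_candidates, score_fn, min_score=0.1):
    """Return the best-scoring target candidate, or None if none reaches min_score."""
    scored = [(score_fn(mention_context, ctx), name)
              for name, ctx in target_candidates.items()]
    if not scored:
        return None
    best_score, best_name = max(scored)
    return best_name if best_score >= min_score else None


target_candidates = {
    "lock (object)": {"door", "key", "device"},
    "lock (action)": {"fasten", "secure", "action"},
}
context = {"door", "has", "no", "lock"}
print(link_entity(context, target_candidates, overlap_score))  # lock (object)
```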
For example, Fig. 4 is a schematic diagram of a BERT model structure according to another embodiment of the present disclosure. As shown in Fig. 4, an input sequence of a sample text sentence is given. After BERT preprocessing and vectorization, the corresponding vector representations, such as the sentence segmentation vector, the position vector, and the word vector, are obtained. The sentence segmentation vector, the position vector, and the word vector are then encoded by an encoder to obtain an output vector. After processing by the BERT model, the similarity between the triplet entities can be calculated according to the target tags, and the inference of the global weak relationship between the triplet entities is thereby completed.
S211: identifying, from the global data tag library, the candidate tags corresponding to the plurality of target entities, and using them as the target tags.
In the embodiment of the present disclosure, identifying the candidate tags corresponding to the plurality of target entities from the global data tag library may be determining the candidate tags of the target entities during the disambiguation process and then identifying them from the global data tag library, or may be identifying the target tags after the disambiguation process is completed.
For example, if the candidate tag of the candidate entity "Xiaoming" is determined to be "car" while the candidate entity "Xiaoming" is being disambiguated, the candidate tag "car" may be identified from the global data tag library and used as the target tag.
S212: determining the target entity relationship among the plurality of target entities according to the plurality of target tags.
For the description of S212, reference may be made to the above embodiments; details are not repeated here.
In this embodiment, massive entity data is acquired; a plurality of reference entities are extracted from the massive entity data, wherein the plurality of reference entities correspond to the same or different information, and the information is business domain information, channel information, or type information; a target entity knowledge base is constructed according to the plurality of reference entities; the massive entity data is tagged based on the plurality of business domains, the plurality of channels, and the plurality of types respectively, to obtain a plurality of candidate tags respectively corresponding to the business domains, the channels, and the types; and a global data tag library is formed according to the plurality of candidate tags. An entity identifier is then determined; a plurality of candidate entities corresponding to the entity identifier are parsed from a plurality of data sets, wherein the plurality of data sets include data sets of multiple business domains, data sets of multiple channels, and data sets of multiple types; the plurality of candidate entities are disambiguated according to the target entity knowledge base to obtain the target entities; the candidate tags corresponding to the plurality of target entities are identified from the global data tag library and used as the target tags; and the target entity relationship among the plurality of target entities is determined according to the plurality of target tags.
Because the plurality of candidate tags are divided by business domain, channel, and type, a large and complete tag system can be constructed, which effectively improves the accuracy of tag identification; the rich set of candidate tags also helps to disambiguate the candidate entities. Because the plurality of reference entities are extracted from the massive entity data and the target entity knowledge base is constructed from them, the source range of the reference data is expanded and the target entity knowledge base becomes more widely applicable, which helps to improve the disambiguation of the plurality of candidate entities.
Fig. 5 is a schematic flowchart of a method for constructing a relationship based on tags according to another embodiment of the present disclosure.
As shown in Fig. 5, the tag-based relationship building method includes:
S501: determining an entity identifier.
S502: parsing a plurality of candidate entities corresponding to the entity identifier from a plurality of data sets, wherein the plurality of data sets include: data sets of multiple business domains, data sets of multiple channels, and data sets of multiple types.
S503: selecting a part of the candidate entities from the plurality of candidate entities as target entities.
S504: determining a plurality of target tags corresponding to the plurality of target entities.
For the description of S501 to S504, reference may be made to the above embodiments; details are not repeated here.
S505: identifying triplet entities from the plurality of target entities by using an entity identification method.
Two target entities together with the target entity relationship between them may be referred to as a triplet, and the target entities in the triplet may be referred to as triplet entities. For example, the sample text "Xiao Li is Xiaoming's leader" may form the triplet ("Xiao Li", "leader", "Xiaoming"), in which "leader" is the entity relationship and "Xiao Li" and "Xiaoming" are the triplet entities.
Optionally, in some embodiments, identifying the triplet entities may be: performing entity tagging and part-of-speech tagging on the sample text to which the target entities belong; extracting from the sample text the target entities that match the tags; determining the part of speech of each tagged target entity; and determining the triplet entities from the target entities and the tagged target entities according to the part of speech.
In the embodiment of the present disclosure, the entity tagging and the part-of-speech tagging may be performed on the sample text to which the target entities belong by combining a sequence labeling algorithm with a rule matching algorithm. For example, the tag "P" may denote a person's name, "L" may denote location information, "O" may denote an organization of multiple people, and the like.
In an embodiment of the present disclosure, the sequence labeling algorithm may be an algorithm model combining three models, namely a Language Model (LM), a Long Short-Term Memory (LSTM) network, and a Conditional Random Field (CRF). The rule matching algorithm may define rules in advance and then identify and extract the triplet entities according to the predefined rules. For example, a predefined rule may be: extract the subject/object together with its modifiers and compound words, and extract the punctuation marks between the subject/object and its modifiers and compound words.
Optionally, in other embodiments, the triplet entities may be determined by using grammatical order. The sample text may be decomposed by an algorithm into character strings such as a subject, a predicate, and an object, and the entities represented by the subject and the object may be determined as the triplet entities according to the grammatical order. For example, in the sample text "Mingming processes department service", the subject "Mingming", the predicate "processes", and the object "department service" may be determined, and the triplet entities are then determined to be "Mingming" and "department service".
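The grammatical-order approach can be sketched as follows, assuming that an upstream parser has already labeled each token with its grammatical role. The role names and `triple_from_roles` are hypothetical; the parsing step itself is not shown.

```python
def triple_from_roles(tagged_tokens):
    """Build (subject, predicate, object) from tokens already labeled with roles."""
    parts = {"subject": [], "predicate": [], "object": []}
    for token, role in tagged_tokens:
        if role in parts:
            parts[role].append(token)
    if not all(parts.values()):
        return None  # a role is missing, so no triplet can be formed
    return tuple(" ".join(parts[r]) for r in ("subject", "predicate", "object"))


tagged = [("Mingming", "subject"), ("processes", "predicate"),
          ("department", "object"), ("service", "object")]
print(triple_from_roles(tagged))  # ('Mingming', 'processes', 'department service')
```

The first and last elements of the result are the triplet entities, and the middle element is the candidate relationship between them.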
S506: processing the plurality of target tags and the triplet entities by using a sequence labeling algorithm and a text binary classification algorithm to determine a global weak relationship among the triplet entities, wherein the global weak relationship is an entity relationship based on the plurality of business domains, the plurality of channels, and the plurality of types.
Within the global scope, a relationship between triplet entities that is relatively dynamic and has a small influence on the triplet entities may be referred to as a global weak relationship.
In the embodiment of the present disclosure, processing the plurality of target tags and the triplet entities may be extracting the global weak relationship among the triplet entities according to the target tags by using a Bidirectional Encoder Representations from Transformers model (hereinafter referred to as the BERT model). The BERT model can capture the sequential information between different words in a sample text and pass information in both the forward and backward directions. In the BERT model, the head and the tail of a complete sample text sentence may be marked with [CLS] and [SEP] respectively, so that two different sample text sentences can be distinguished. For example, the sample text "performing urban greening processing" becomes "[CLS] performing urban greening processing [SEP]" after marking.
In the embodiment of the present disclosure, the global weak relationship among the triplet entities may also be obtained by processing the plurality of target tags and the triplet entities through a text binary classification algorithm. The text binary classification algorithm may extract the text features of the sample text data, represent the information in the sample text according to the text features, and convert that information into a format that can be recognized by a computer (e.g., a computer programming language format).
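The [CLS] and [SEP] marking can be illustrated with plain string handling. The helper names are hypothetical; in practice a BERT tokenizer adds these special tokens itself, and the sentence-pair layout shown follows the usual BERT input convention.

```python
def mark_sentence(sentence):
    """Wrap one sentence with the [CLS] head marker and the [SEP] tail marker."""
    return f"[CLS] {sentence} [SEP]"


def mark_pair(first, second):
    """Mark two sentences so that [SEP] separates and terminates them."""
    return f"[CLS] {first} [SEP] {second} [SEP]"


print(mark_sentence("performing urban greening processing"))
# [CLS] performing urban greening processing [SEP]
```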
S507: taking the global weak relationship as the entity relationship among the triplet entities.
In the embodiment of the present disclosure, entity relationships of the plurality of business domains, the plurality of channels, and the plurality of types are taken as the entity relationship among the triplet entities. For example, the entity relationship "business amount" and the entity relationship "cost" of multiple business domains may be taken as entity relationships among the triplet entities, that is, as global weak relationships; when a triplet entity is associated with the global weak relationship "business amount" or the global weak relationship "cost", the triplet entity corresponding to that global weak relationship may be tagged.
S508: entity attributes corresponding to the triple entities are determined.
A static attribute of a triple entity may be referred to as an entity attribute. An entity attribute may be independent of any specific scene and may have a long life cycle, for example, region, gender, and the like.
In the embodiment of the present disclosure, the entity attributes corresponding to the triple entities may be determined by searching a global data tag library in advance for attribute tags that represent entity attributes, and then determining the attribute tags corresponding to the global weak relationship according to the global weak relationship among the triple entities. Alternatively, the global weak relationships among the triple entities may be enumerated, and the entity attributes may then be determined by machine identification or manual identification, which is not limited herein.
S509: the triple entity, the entity relationship, and the entity attribute are taken together as the target entity relationship.
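For illustration only, looking up attribute tags in a tag library may be sketched as a simple filter. The record layout (entity, tag_type, name, value), the sample entries, and the function name entity_attributes are hypothetical; the disclosure does not define how the tag library is stored.

```python
# Hypothetical tag library entries; tag_type distinguishes attribute tags from other tags.
TAG_LIBRARY = [
    {"entity": "person_001", "tag_type": "attribute", "name": "region", "value": "district A"},
    {"entity": "person_001", "tag_type": "attribute", "name": "gender", "value": "female"},
    {"entity": "person_001", "tag_type": "feature", "name": "commute", "value": "frequent"},
    {"entity": "person_002", "tag_type": "attribute", "name": "region", "value": "district B"},
]


def entity_attributes(entity_id, tag_library):
    """Collect the static attribute tags recorded for one entity."""
    return {
        tag["name"]: tag["value"]
        for tag in tag_library
        if tag["entity"] == entity_id and tag["tag_type"] == "attribute"
    }


print(entity_attributes("person_001", TAG_LIBRARY))
```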
In the embodiment of the present disclosure, the triple entity, the global weak relationship corresponding to the triple entity, and the entity attribute corresponding to the triple entity are collectively used as the target entity relationship.
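One possible way to hold the three parts together is a small record type. The sketch below is an assumption made for illustration: the class name TargetEntityRelationship, its field names, and the sample values are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TargetEntityRelationship:
    """Bundles a triple entity with its entity relationship and entity attributes."""
    triple_entity: tuple[str, str, str]
    entity_relationship: str  # the global weak relationship taken as the relationship
    entity_attributes: dict[str, str] = field(default_factory=dict)


def build_target_relationship(triple_entity, weak_relationship, attributes):
    """Combine the three parts into one target entity relationship record."""
    return TargetEntityRelationship(tuple(triple_entity), weak_relationship, dict(attributes))


record = build_target_relationship(
    ("enterprise_01", "pays", "project_07"), "cost", {"region": "district A"}
)
print(record)
```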
For example, fig. 6 is a flowchart illustrating a tag-based relationship construction and convergence method according to another embodiment of the present disclosure. As shown in fig. 6, a tag-based relationship is first established. The relationship is logically divided into an event layer, an entity layer, and a space-time layer, which correspond to an event identification code, an entity identification code, and a space-time identification code, respectively. The event identification code is used to identify events occurring in a city, such as health code identification, close-contact investigation, vaccination, and nucleic acid detection. In the entity layer, the entities may be of types such as people, enterprises, houses, vehicles, and city components in the city. A plurality of attribute tags and a plurality of feature tags may be extracted from data such as base tables, text, images, audio, and video by an automatic labeling learning method, so as to construct large-scale, atomic-level weak relationships. The event identification code is then combined with constraint conditions (for example, temporal constraints such as co-operation and co-travel, and spatial constraints such as co-living, co-working, and co-checking) to accurately converge the weak relationships into strong relationships based on the space-time mapping of the attribute tags and the feature tags, thereby establishing the entity layer. The space-time identification code corresponds to the space-time layer and is used to identify data such as time-series features, space-time grids, space-time trajectories, and high-resolution images.
The strength of the relationships among the corresponding entities in the entity layer is obtained through the event identification code, the entity identification code, and the space-time identification code, and the result of the corresponding event is then obtained according to the strength of those relationships.
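The convergence from weak relationships to strong relationships may be pictured with the following sketch. It assumes, for illustration only, that a weak relationship exists between two entities sharing at least one tag, and that it is promoted to a strong relationship when both entities were observed under the same event code, time slot, and space-time grid cell. The data layout, the identifiers, and both function names are hypothetical.

```python
from itertools import combinations

# Hypothetical entity layer: tags plus (event code, time slot, grid cell) observations.
ENTITIES = {
    "p1": {"tags": {"district A"}, "observations": {("nucleic acid detection", "2022-01-01T09", "grid_17")}},
    "p2": {"tags": {"district A"}, "observations": {("nucleic acid detection", "2022-01-01T09", "grid_17")}},
    "p3": {"tags": {"district A"}, "observations": {("vaccination", "2022-01-02T14", "grid_03")}},
}


def weak_pairs(entities):
    """Entity pairs that share at least one tag (weak relationship)."""
    pairs = set()
    for a, b in combinations(sorted(entities), 2):
        if entities[a]["tags"] & entities[b]["tags"]:
            pairs.add((a, b))
    return pairs


def strong_pairs(entities, weak):
    """Weak pairs that also satisfy the same event, time, and space constraint."""
    return {
        (a, b)
        for a, b in weak
        if entities[a]["observations"] & entities[b]["observations"]
    }


weak = weak_pairs(ENTITIES)
print(sorted(weak), sorted(strong_pairs(ENTITIES, weak)))
```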
In this embodiment, an entity identifier is determined, and a plurality of candidate entities corresponding to the entity identifier are parsed from a plurality of data sets, where the plurality of data sets include data sets of a plurality of service domains, data sets of a plurality of channels, and data sets of a plurality of types. Some of the candidate entities are selected as target entities, and a plurality of target tags corresponding to the plurality of target entities are determined. Triple entities are identified from the plurality of target entities by an entity identification method, and the plurality of target tags and the triple entities are processed by a sequence labeling algorithm and a text binary classification algorithm to determine a global weak relationship among the triple entities, where the global weak relationship is an entity relationship based on the plurality of service domains, the plurality of channels, and the plurality of types. The global weak relationship is taken as the entity relationship among the triple entities, the entity attributes corresponding to the triple entities are determined, and the triple entities, the entity relationships, and the entity attributes are taken together as the target entity relationship. Because the entity relationships among the triple entities are determined in combination with a relationship recognition model, the extraction efficiency of the target entity relationship can be improved and a higher-quality target entity relationship can be obtained. Because the sequence labeling algorithm and the text binary classification algorithm are used to determine the global weak relationship among the triple entities, large-scale, atomic-level global weak relationships can be constructed, which improves the completeness with which the entity relationships among the triple entities are expressed.
Fig. 7 is a schematic structural diagram of a tag-based relationship building apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the tag-based relationship building apparatus 70 includes:
a determining module 701, configured to determine an entity identifier;
an analysis module 702, configured to parse a plurality of candidate entities corresponding to the entity identifier from a plurality of data sets, where the plurality of data sets include: data sets of a plurality of service domains, data sets of a plurality of channels, and data sets of a plurality of types;
a selecting module 703, configured to select a part of candidate entities from the multiple candidate entities as target entities;
a first determining module 704 for determining a plurality of target tags corresponding to a plurality of target entities;
the second determining module 705 is configured to determine a target entity relationship between a plurality of target entities according to the plurality of target tags.
In some embodiments of the present disclosure, as shown in fig. 8, which is a schematic structural diagram of a tag-based relationship building apparatus according to another embodiment of the present disclosure, the apparatus further includes:
an obtaining module 706, configured to obtain massive entity data before determining the entity identifier;
a first processing module 707, configured to perform tagging processing on the massive entity data based on a plurality of service domains, respectively, to obtain a plurality of candidate tags respectively corresponding to the plurality of service domains;
a second processing module 708, configured to perform tagging processing on the massive entity data based on a plurality of channels, respectively, to obtain a plurality of candidate tags respectively corresponding to the plurality of channels;
a third processing module 709, configured to perform tagging processing on the massive entity data based on a plurality of types, respectively, to obtain a plurality of candidate tags respectively corresponding to the plurality of types;
a forming module 710, configured to form a global data tag library according to the plurality of candidate tags;
the first determining module 704 is specifically configured to:
identify, from the global data tag library, candidate tags corresponding to the target entities, and use the identified candidate tags as the target tags.
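The cooperation of the three processing modules, the forming module, and the first determining module may be sketched as follows. This is a minimal illustration assuming that each piece of entity data already carries a service domain, a channel, a data type, and one tag; the record layout, the sample values, and the function names are hypothetical and are not defined by the disclosure.

```python
from collections import defaultdict

# Hypothetical massive entity data, reduced to three records.
RECORDS = [
    {"entity": "enterprise_01", "service_domain": "taxation", "channel": "web portal",
     "data_type": "base table", "tag": "registered in district A"},
    {"entity": "enterprise_01", "service_domain": "market supervision", "channel": "mobile app",
     "data_type": "text", "tag": "high business amount"},
    {"entity": "person_001", "service_domain": "health", "channel": "mobile app",
     "data_type": "image", "tag": "resident of district A"},
]


def tag_by_dimension(records, dimension):
    """Candidate tags grouped by one dimension: service_domain, channel, or data_type."""
    tags = defaultdict(set)
    for record in records:
        tags[record[dimension]].add((record["entity"], record["tag"]))
    return tags


def build_tag_library(records):
    """Merge the candidate tags of all three dimensions into one library keyed by entity."""
    library = defaultdict(set)
    for dimension in ("service_domain", "channel", "data_type"):
        for entity_tags in tag_by_dimension(records, dimension).values():
            for entity, tag in entity_tags:
                library[entity].add(tag)
    return library


def target_tags(library, target_entities):
    """Look up the candidate tags of each target entity and use them as target tags."""
    return {entity: sorted(library.get(entity, set())) for entity in target_entities}


library = build_tag_library(RECORDS)
print(target_tags(library, ["enterprise_01"]))
```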
In some embodiments of the present disclosure, as shown in fig. 8, the types of candidate tags include:
attribute tags, feature tags, fact tags, inference tags.
In some embodiments of the present disclosure, as shown in fig. 8, further comprising:
the extraction module 711 is configured to extract a plurality of reference entities from the massive entity data, where the plurality of reference entities correspond to the same or different information, and the information is service domain information, channel information, or type information;
a construction module 712 for constructing a target entity knowledge base from a plurality of reference entities;
the selecting module 703 is specifically configured to:
perform disambiguation processing on the plurality of candidate entities according to the target entity knowledge base to obtain the target entities.
In some embodiments of the present disclosure, as shown in fig. 8, the selecting module 703 is specifically configured to:
linking the candidate entity to a plurality of target candidate entities in the target entity knowledge base;
determining, according to a pre-trained entity link model, a plurality of similarities between the candidate entity and the plurality of target candidate entities, respectively;
and taking a target candidate entity whose similarity meets a set condition as the target entity, wherein the entity link model is obtained by training an initial artificial intelligence model in advance, based on an association modeling method and a consistency modeling method, until the artificial intelligence model converges, the trained artificial intelligence model being used as the entity link model.
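The selection step above may be illustrated with the following sketch. Because the disclosure does not specify the entity link model, a simple Jaccard similarity over word sets is swapped in as a stand-in for the pre-trained model, and a threshold of 0.5 is assumed as the set condition. Both choices, and the function names, are assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a stand-in for the similarity given by an entity link model."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def link_entity(candidate: str, knowledge_base: list[str], threshold: float = 0.5):
    """Return the most similar knowledge base entry if it meets the threshold, else None."""
    scored = [(jaccard(candidate, target), target) for target in knowledge_base]
    best_score, best_target = max(scored, default=(0.0, None))
    return best_target if best_score >= threshold else None


KNOWLEDGE_BASE = ["city greening bureau office", "city tax bureau"]
print(link_entity("city greening bureau", KNOWLEDGE_BASE))
```

A candidate entity with no sufficiently similar entry is left unlinked (None), which is one simple way of limiting the influence of ambiguous candidates.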
In some embodiments of the present disclosure, as shown in fig. 8, the second determining module 705 is specifically configured to:
identifying a triple entity from a plurality of target entities by adopting an entity identification method;
determining entity relationships among the triple entities by combining a relationship recognition model according to the target tags;
determining entity attributes corresponding to the triple entities;
and taking the triple entity, the entity relationship and the entity attribute as the target entity relationship.
In some embodiments of the present disclosure, as shown in fig. 8, the second determining module 705 is specifically configured to:
carrying out entity marking and part-of-speech marking on a sample text to which a target entity belongs;
extracting, from the sample text, the target entity that matches the mark, and determining the part of speech of the marked target entity;
and determining the triple entity from the target entity and the marked target entity according to the part of speech.
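As a rough sketch of the steps above, the fragment below takes a sample text that has already been given entity marks and part-of-speech marks, and forms a triple whenever two marked noun entities are joined by a verb. This rule, the tag names NOUN and VERB, and the function name extract_triples are assumptions made for illustration; the disclosure does not fix a particular rule or tag set.

```python
def extract_triples(tagged_tokens):
    """tagged_tokens: list of (word, part_of_speech, is_entity) tuples in text order."""
    # Positions of the marked target entities whose part of speech is a noun.
    entity_positions = [
        index
        for index, (_, pos, is_entity) in enumerate(tagged_tokens)
        if is_entity and pos == "NOUN"
    ]
    triples = []
    # Pair each marked entity with the next one and look for a verb between them.
    for left, right in zip(entity_positions, entity_positions[1:]):
        verbs = [word for word, pos, _ in tagged_tokens[left + 1:right] if pos == "VERB"]
        if verbs:
            triples.append((tagged_tokens[left][0], verbs[0], tagged_tokens[right][0]))
    return triples


SAMPLE = [
    ("enterprise", "NOUN", True),
    ("pays", "VERB", False),
    ("the", "DET", False),
    ("cost", "NOUN", True),
]
print(extract_triples(SAMPLE))
```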
In some embodiments of the present disclosure, as shown in fig. 8, the relationship recognition model includes a sequence labeling algorithm and a text binary classification algorithm, and the second determining module 705 is specifically configured to:
process the plurality of target tags and the triple entities by using the sequence labeling algorithm and the text binary classification algorithm to determine a global weak relationship among the triple entities, wherein the global weak relationship is an entity relationship based on a plurality of service domains, a plurality of channels, and a plurality of types;
and take the global weak relationship as the entity relationship among the triple entities.
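For illustration, the output of a sequence labeling algorithm is often expressed as one label per word. The sketch below assumes a B/I/O labeling scheme (begin, inside, outside), which is a common convention but is not stated in the disclosure, and shows how labeled words could be grouped into entity spans before a binary classifier decides whether a relationship holds between them. The function name decode_bio is hypothetical.

```python
def decode_bio(tokens, labels):
    """Group words labeled B (begin) and I (inside) into entity spans; O is outside."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


TOKENS = ["city", "greening", "bureau", "pays", "cost"]
LABELS = ["B", "I", "I", "O", "B"]
print(decode_bio(TOKENS, LABELS))
```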
Corresponding to the tag-based relationship construction method provided in the embodiments of fig. 1 to fig. 6, the present disclosure also provides a tag-based relationship building apparatus. Because the apparatus corresponds to the method provided in the embodiments of fig. 1 to fig. 6, the implementations of the tag-based relationship construction method also apply to the tag-based relationship building apparatus provided in the embodiments of the present disclosure, and are not described in detail again here.
In this embodiment, an entity identifier is determined, and a plurality of candidate entities corresponding to the entity identifier are parsed from a plurality of data sets, where the plurality of data sets include data sets of a plurality of service domains, data sets of a plurality of channels, and data sets of a plurality of types. Some of the candidate entities are selected as target entities, a plurality of target tags corresponding to the plurality of target entities are determined, and the target entity relationship among the plurality of target entities is determined according to the plurality of target tags. By tagging the target entities, determining the tags corresponding to the candidate entities, and then matching the entity relationships among the candidate entities, intelligent matching of the candidate entities and the entity relationships can be achieved, the influence of ambiguity on candidate entity identification is reduced, and the accuracy of identifying the target entities and the target entity relationship is effectively improved.
In order to achieve the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the tag-based relationship construction method as proposed by the foregoing embodiments of the present disclosure.
In order to implement the above embodiments, the present disclosure also provides an electronic device, including: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the label-based relationship construction method as proposed by the foregoing embodiments of the present disclosure is realized.
In order to implement the foregoing embodiments, the present disclosure also proposes a computer program product. When instructions in the computer program product are executed by a processor, the tag-based relationship construction method proposed by the foregoing embodiments of the present disclosure is performed.
FIG. 9 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present disclosure. The electronic device 12 shown in fig. 9 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 9, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16. Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, to name a few.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, and commonly referred to as a "hard drive").
Although not shown in FIG. 9, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described in this disclosure.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing the label-based relationship construction method mentioned in the foregoing embodiments, by executing a program stored in the system memory 28.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the embodiments of the present disclosure. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.