CN116542244A - Entity disambiguation method and device for power industry

Publication number: CN116542244A
Application number: CN202310457707.4A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 尹从峰, 章玥, 史亚冰, 蒋烨
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Pending
Prior art keywords: similarity, power, sample, text, entity

Classifications

    • G06F 40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • Y04S 10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The disclosure provides an entity disambiguation method and device for the power industry, relating to the technical field of artificial intelligence and, in particular, to the technical fields of big data and knowledge graphs. The specific implementation scheme is as follows: sample data of at least two sample power entities are input into a similarity scoring model to be trained, and the model outputs the predicted similarity of the at least two sample power entities; the predicted similarity is compared with the similarity labels of the at least two sample power entities, and the parameters of the similarity scoring model to be trained are adjusted according to the comparison result to obtain a trained similarity scoring model. The similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, where the output of each text embedding representation module is connected to an input of the similarity scoring module. The present disclosure enables disambiguation of entities in the power industry.

Description

Entity disambiguation method and device for power industry
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular to the technical fields of big data and knowledge graphs.
Background
Entities in the power industry include not only power equipment but also various types of knowledge nodes, such as power equipment operation and maintenance knowledge, accident handling knowledge, and power equipment overhaul knowledge. This knowledge is derived from operation record documents in the power industry, where different expressions of the same knowledge point are unavoidable, so ambiguity in the knowledge graph needs to be resolved through disambiguation.
Disclosure of Invention
The disclosure provides an entity disambiguation method and device in the power industry.
According to an aspect of the present disclosure, there is provided a training method of a similarity scoring model, including:
inputting sample data of at least two sample power entities into a similarity scoring model to be trained, and outputting, by the similarity scoring model to be trained, the predicted similarity of the at least two sample power entities;
comparing the predicted similarity with similarity labels of the at least two sample power entities, and adjusting parameters of the similarity scoring model to be trained according to the comparison result to obtain a trained similarity scoring model; wherein,
the similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, and the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
According to another aspect of the present disclosure, there is provided an entity disambiguation method of the power industry, comprising:
acquiring at least two power industry candidate data;
inputting the at least two power industry candidate data into a pre-trained similarity scoring model, and outputting, by the similarity scoring model, the similarity of at least two candidate power entities;
disambiguating the at least two candidate power entities according to the similarity;
wherein the similarity scoring model is obtained through training with the training method of the similarity scoring model.
According to another aspect of the present disclosure, there is provided a training apparatus of a similarity scoring model, including:
the similarity prediction module is used for inputting sample data of at least two sample power entities into a similarity scoring model to be trained, and outputting, by the similarity scoring model to be trained, the predicted similarity of the at least two sample power entities;
the adjustment module is used for comparing the predicted similarity with similarity labels of the at least two sample power entities, and adjusting parameters of the similarity scoring model to be trained according to the comparison result to obtain a trained similarity scoring model; wherein,
the similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, and the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
According to another aspect of the present disclosure, there is provided an entity disambiguation device for the power industry, comprising:
the second acquisition module is used for acquiring at least two power industry candidate data;
the similarity determination module is used for inputting the at least two power industry candidate data into a pre-trained similarity scoring model, and outputting, by the similarity scoring model, the similarity of at least two candidate power entities;
the disambiguation module is used for disambiguating the at least two candidate power entities according to the similarity;
wherein the similarity scoring model is obtained by training with the training device of the similarity scoring model.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
The embodiments of the disclosure provide a training method for a similarity scoring model, where the model is used to predict the similarity between two or more power entities. Because the similarity labels of the sample power entities serve as the annotation data used to train the model, the model can be trained without a large amount of task-specific annotated data, which improves training efficiency and effectiveness and improves the accuracy of judging the similarity of power entities.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a training method of a similarity scoring model according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of an entity disambiguation method for the power industry proposed by an embodiment of the present disclosure;
FIG. 3 is an overall schematic diagram of an entity disambiguation method for the power industry according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a similarity scoring model according to an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of a training method of a similarity scoring model proposed according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a similarity scoring model according to an embodiment of the present disclosure;
FIG. 7 is a schematic flowchart of a training method for a text embedding representation module according to an embodiment of the present disclosure;
FIG. 8A is a first schematic diagram of masking pre-training text proposed according to an embodiment of the present disclosure;
FIG. 8B is a second schematic diagram of masking pre-training text proposed according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a training apparatus 900 of a similarity scoring model according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a training apparatus 1000 of a similarity scoring model according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of an entity disambiguation device 1100 for the power industry according to an embodiment of the present disclosure;
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The entity disambiguation method for the power industry aims to solve the ambiguity problem of entities in the power industry knowledge graph. Two different descriptions of the same device operation and maintenance knowledge point are presented below.
Description 1:
description 2:
As can be seen from the above two descriptions, although the wording of the "work requirements" differs between them, they actually correspond to the operation and maintenance knowledge of the same device; knowledge points with diverse descriptions therefore need to be identified through entity disambiguation.
The scheme provided by the embodiments of the disclosure can be applied to fields such as artificial intelligence, big data, and knowledge graphs; it can be directly applied to the construction of knowledge graphs for the power industry, which in turn underpins services such as information retrieval, intelligent analysis, and decision support. The techniques involved in the embodiments of the disclosure are briefly described below.
Artificial Intelligence (AI) technology is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
A Knowledge Graph (KG) is a semantic network formed by interconnected entity concepts and may include nodes (entities/attribute values) and edges (relationships/attributes). Entities are the most basic elements in a knowledge graph, and different relationships exist between different entities. Attributes mainly refer to the features, characteristics, and parameters that an object may have; an attribute value is the value of a specified attribute of an object.
Entity disambiguation is currently performed mainly in the following ways:
(1) Exact matching based on a name dictionary: a name dictionary is maintained for each term; when an input entity/facet name is found in the name dictionary, the corresponding entity/facet is returned as the disambiguation result;
(2) Disambiguation based on manually defined rules: when the content patterns of entity/facet names are fixed, the identity of an input entity/facet name with an entity/facet in the library can be judged through manually pre-configured N-gram matching templates and the like, and used for disambiguation;
(3) Disambiguation based on machine learning methods: features for disambiguation are designed and extracted through manual feature engineering, and disambiguation is then performed with machine learning methods such as support vector machines, random forests, and gradient boosting trees;
(4) Disambiguation based on a semantic similarity matching model over a large-scale supervised corpus: when a large amount of labeled semantic similarity corpora exists in the industry, a classifier can be trained on a deep network such as a Long Short-Term Memory network (LSTM) or a Convolutional Neural Network (CNN) to determine the equivalence of an input entity/facet with an entity/facet in the knowledge base, and used for disambiguation.
The main problem with approach (1) is that the cost of building a complete dictionary is high, making it difficult to use for large-scale power industry data. The main problems with approach (2) are: first, manually defined rules generalize poorly, and once a new term expression appears, the manually defined rules are likely to fail; second, this approach is also labor intensive and difficult to use for large-scale power industry data. The main problems with approach (3) are: the cost of manually designing features is typically high, and this approach does not work well in cases where semantics or knowledge are needed to determine equivalence. The main problems with approach (4) are: a large amount of similarity annotation data is needed, and annotating power industry data requires professional knowledge, so the cost is high; moreover, conventional deep models based on structures such as LSTM and CNN often cannot fully utilize the information in a large-scale corpus, so the effect is poor.
Entities in the power industry include not only power equipment but also various types of knowledge nodes, such as power equipment operation and maintenance knowledge, accident handling knowledge, and power equipment overhaul knowledge. This knowledge is derived from operation record documents in the power industry, and because the recording personnel and recording styles differ across documents, different expressions of the same knowledge point are unavoidable. For power industry data, the embodiments of the disclosure provide a training method for a similarity scoring model, where the similarity scoring model is used to determine the similarity between entities of the power industry (power entities for short), so as to disambiguate different power entities whose similarity exceeds a threshold. In addition, the embodiments of the disclosure provide an entity disambiguation method for the power industry, which uses the similarity scoring model obtained with the training method to perform entity disambiguation in the power industry.
Fig. 1 is an application scenario schematic diagram of a training method of a similarity scoring model according to an embodiment of the present disclosure. Referring to fig. 1, the training method of the similarity score model according to the embodiments of the present disclosure may be used in a system including a server 110 and a terminal 120. The server 110 establishes a wired or wireless communication connection with the terminal 120. Alternatively, the server 110 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides a cloud computing service. The terminal 120 may be a personal computer (Personal Computer, PC), a vehicle-mounted terminal, a tablet computer, a smart phone, a wearable device, a smart robot, or the like, which has data computing, processing, and storage capabilities.
In embodiments of the present disclosure, the terminal 120 in the system may be configured to obtain training text, such as sample data of a sample power entity, and send the training text to the server 110. Server 110 may then train the similarity scoring model using the training text.
In some embodiments, the training text for training the similarity scoring model may also be pre-stored in the server 110. Accordingly, the training system of the similarity scoring model may not include the terminal 120.
Alternatively, the system can also perform specific tasks, such as entity disambiguation in the power industry. Accordingly, the terminal 120 in the system may be configured to obtain the power industry candidate data and send the power industry candidate data to the server 110 for entity disambiguation. The server 110 stores a similarity scoring model after training is completed in advance, and after the server receives the power industry candidate data, the server may input two or more power industry candidate data into the similarity scoring model, perform entity disambiguation on the two or more power industry candidate data by using the similarity scoring model, and output a result. The server 110 may then send the results of the entity disambiguation to the terminal 120.
In some examples, the terminal 120 may also store a similarity score model that has been pre-trained, and after the terminal 120 obtains two or more power industry candidate data, the two or more power industry candidate data may be directly input into the similarity score model, and the similarity score model may perform entity disambiguation on the input two or more power industry candidate data and output the result data. Accordingly, the entity disambiguation system of the power industry may also not include the server 110.
The training method of the similarity scoring model and the entity disambiguation method for the power industry provided by the embodiments of the present disclosure are described below with reference to the technical introduction and the application scenario. The method may be applied to a computer device, which may be the server 110 in the scenario shown in fig. 1, the terminal 120 in the scenario shown in fig. 1, or another device, which is not limited by the present disclosure.
For convenience of description, the method for entity disambiguation in the power industry according to the embodiments of the present disclosure is described first, and then the method for training the similarity score model according to the embodiments of the present disclosure is described. The similarity scoring model adopted in the entity disambiguation method in the power industry can be obtained through training by adopting the training method.
Fig. 2 is a schematic flow chart of an entity disambiguation method for the power industry according to an embodiment of the present disclosure, including:
S210, acquiring at least two power industry candidate data;
S220, inputting the at least two power industry candidate data into a pre-trained similarity scoring model, and outputting, by the similarity scoring model, the similarity of at least two candidate power entities;
S230, disambiguating the at least two candidate power entities according to the similarity;
the similarity scoring model is obtained through training by the training method of the similarity scoring model provided by the embodiment of the disclosure. The specific training will be described later.
In some embodiments, the at least two power industry candidate data input to the similarity scoring model are data belonging to the same group after a plurality of power industry candidate data have been grouped in advance. The foregoing "grouping" may also be referred to as "binning". The grouping (or binning) process can be regarded as a coarse grouping of the massive power industry candidate data extracted from power industry documents: the same group may contain repeated data (e.g., power industry candidate data with a similarity score greater than or equal to a preset threshold) as well as similar but non-repeated data (e.g., power industry candidate data with a similarity score less than the preset threshold), and the purpose of the disambiguation process is to accurately identify the repeated portions.
In the process of entity disambiguation in the power industry, a similarity scoring model is introduced; the model compares the similarity of two candidate power entities, so that similar candidate power entities can be eliminated to a certain extent.
FIG. 3 is an overall schematic diagram of an entity disambiguation method for the power industry according to an embodiment of the present disclosure; in some implementations, as shown in fig. 3, a manner of acquiring at least two power industry candidate data set forth in the embodiment of the disclosure may include:
s310, extracting a plurality of power industry candidate data from a power industry document; according to a preset architecture (Schema) file and/or configuration file of the power industry knowledge points, grouping the extracted candidate data of the power industry to obtain a plurality of groups; each group comprises a plurality of power industry candidate data;
s320, acquiring at least two power industry candidate data from any group.
Grouping the extracted power industry candidate data through the preset Schema file and/or configuration file of the power industry knowledge points makes it possible to pick out similar candidates from the complex and variable power industry candidate data, which reduces the amount of computation required for the subsequent entity disambiguation.
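To make the grouping step concrete, below is a minimal sketch of binning candidate records by a key built from schema-configured attributes. It is an illustration only, not the patent's implementation; the record layout and the choice of key attributes are assumptions.

```python
from collections import defaultdict

def bin_candidates(candidates, key_attrs):
    """Coarsely group power industry candidate records (the "binning" step).

    candidates: list of dicts such as {"class": ..., "attrs": {...}}
                (layout assumed for illustration)
    key_attrs:  attribute names, taken from a configuration file, whose
                values form the grouping key; the key is deliberately
                coarse, so one bucket may hold both true duplicates and
                merely similar records.
    """
    buckets = defaultdict(list)
    for record in candidates:
        key = (record["class"],) + tuple(record["attrs"].get(a, "") for a in key_attrs)
        buckets[key].append(record)
    return buckets

# Example: bucket operation-and-maintenance records by device type.
groups = bin_candidates(
    [{"class": "O&M strategy",
      "attrs": {"device type": "transformer", "work requirements": "check oil level"}},
     {"class": "O&M strategy",
      "attrs": {"device type": "transformer", "work requirements": "inspect oil level"}}],
    key_attrs=["device type"],
)
```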
The architecture (Schema) file may include at least one of an entity in the power industry, an entity attribute corresponding to the entity, and a value type of the entity attribute.
Table 1 is an example of a Schema file. In Table 1, "operation and maintenance strategy" is an entity whose corresponding entity attributes are "maintenance category", "work requirements", "period class I", "period class II", "device type", and "device category". The other entities, their corresponding entity attributes, and the value types of those attributes are shown in Table 1.
TABLE 1

Class | Attribute/relationship name | Attribute value type | Single/multi-value
Operation and maintenance strategy | Maintenance category | Text | Single
Operation and maintenance strategy | Work requirements | Text | Single
Operation and maintenance strategy | Period class I | Text | Single
Operation and maintenance strategy | Period class II | Text | Single
Operation and maintenance strategy | Device type | Text | Single
Operation and maintenance strategy | Device category | Text | Single
Accident event | Accident equipment | Text | Single
Accident event | Protection action | Text | Single
Accident event | Part | Text | Single
Accident event | Label | Text | Single
Accident event | Event | Text | Single
Equipment overhaul maintenance plan | Unit | Text | Single
Equipment overhaul maintenance plan | Service and maintenance type | Text | Single
Equipment overhaul maintenance plan | Device category | Text | Single
Equipment overhaul maintenance plan | Device type | Text | Single
Equipment overhaul maintenance plan | Associated device | Text | Single
Equipment overhaul maintenance plan | Service class | Text | Single
Equipment overhaul maintenance plan | Voltage class | Text | Single
Equipment overhaul maintenance plan | Device model | Text | Single
Equipment overhaul maintenance plan | Equipment manufacturer | Text | Single
Equipment overhaul maintenance plan | Planned implementation time | Date | Single
Equipment overhaul maintenance plan | Actual implementation time | Date | Single
Equipment overhaul maintenance plan | Whether overdue | Text | Single
The configuration file of the power industry knowledge points may include at least one of: the entity attributes used for grouping, the key granularity of the grouping, and the preset threshold used in the data aggregation process.
The specific content of the Schema file and the configuration file ensures that the candidate power entities corresponding to at least two power industry candidate data in the same group have a certain degree of association, preventing power industry candidate data whose candidate power entities differ too much from being placed in the same group.
Still taking fig. 3 as an example, after the at least two power industry candidate data are obtained, they may be input into the pre-trained similarity scoring model, and disambiguation is performed according to the similarity output by the model.
Specifically, as shown in fig. 3, disambiguating the at least two candidate power entities according to the similarity includes:
s330, under the condition that the similarity is larger than or equal to a preset threshold value, aggregating at least two candidate electric power entities to obtain a group; selecting a first candidate electric power entity from two or more candidate electric power entities in the same grouping, and deleting the rest candidate electric power entities;
and S340, storing the first candidate electric power entity in the knowledge graph, and taking the deleted candidate electric power entity as related information of the first candidate electric power entity.
The method for disambiguating at least two candidate power entities provided by the embodiments of the disclosure can accurately predict the similarity between two or more power entities, reducing the problems that arise from incorrectly determining that similarity.
For example, suppose the power industry candidate data corresponding to two candidate power entities are input into the similarity scoring model, the similarity output by the model is a rational number in the range [0,1], and the preset threshold is 0.5. If the model outputs a similarity of 0.1, its prediction for the two candidate power entities is that they are dissimilar; if it outputs a similarity of 0.9, its prediction is that they are similar, and the two candidate power entities may be aggregated into one group.
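A minimal sketch of this threshold rule follows; `score` stands for the trained similarity scoring model and is a hypothetical callable, and the dict-based entity representation is an assumption for illustration.

```python
def disambiguate_pair(entity_a, entity_b, score, threshold=0.5):
    """If the predicted similarity reaches the threshold, keep the first
    candidate power entity and fold the second into it as related
    information (S330/S340); otherwise keep both entities."""
    if score(entity_a, entity_b) >= threshold:
        entity_a.setdefault("related", []).append(entity_b)
        return [entity_a]          # entity_a goes into the knowledge graph
    return [entity_a, entity_b]    # dissimilar: both entities are retained
```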
In combination with the above, the entity disambiguation method for the power industry provided by the embodiments of the present disclosure inputs at least two power industry candidate data into a pre-trained similarity scoring model, the model outputs the similarity of the at least two candidate power entities, and the at least two candidate power entities are disambiguated according to that similarity.
Fig. 4 is a schematic diagram of a similarity scoring model according to an embodiment of the present disclosure. As shown in fig. 4, the similarity scoring model includes at least two text embedding representation modules and a similarity scoring module, and the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
Taking fig. 4 as an example, based on the foregoing, inputting at least two power industry candidate data into the pre-trained similarity scoring model may include:
inputting each power industry candidate data into its corresponding text embedding representation module.
The similarity scoring model provided by the embodiments of the disclosure can use the text embedding representation modules to preprocess each power industry candidate data, which improves the accuracy of the subsequent similarity scoring module and reduces the time required to determine the similarity of the power entities.
The text embedding representation module consists of 12 Transformer encoders.
The text embedding representation module can capture the association relationships between power entities and attributes in the power industry, as well as the semantic information of power texts. The text embedding representation module can be obtained by training according to the training method of the similarity scoring model provided by the embodiments of the disclosure; the specific training method is described later.
It should be noted that, in the embodiment of the present disclosure, any power industry candidate data may include a candidate power entity and at least one candidate attribute value of the candidate power entity.
Accordingly, based on the foregoing, inputting the power industry candidate data into the text embedding representation module includes:
inputting at least one candidate attribute value in the power industry candidate data into the corresponding text embedding representation module.
In some embodiments, the at least one candidate attribute value corresponding to any power industry candidate data may be used to describe the attribute information of the power entity in that candidate data. Specifically, the attribute information may be obtained from an existing knowledge graph, or by crawling unstructured data related to each power industry candidate data from the web with a web crawler. The data sources for such unstructured data may be websites with basic descriptions of power entities, such as encyclopedias and forums. For example, given the power industry candidate data "a power transformer is a stationary electrical device that converts an AC voltage of one value into another voltage, or several voltages of different values, at the same frequency", the power entity corresponding to this candidate data may include "power transformer", and the candidate attribute values may include "electrical device".
The text embedding representation module encodes the at least one candidate attribute value in the power industry candidate data, so that the similarity scoring module can score using the encoding corresponding to those candidate attribute values, reducing the time required for entity disambiguation in the power industry.
The above briefly introduces the entity disambiguation method of the power industry proposed by the embodiments of the present disclosure, and a similarity scoring model for implementing entity disambiguation of the power industry.
The following will describe in detail a training method of the similarity score model according to the embodiments of the present disclosure. The similarity scoring model adopted in the entity disambiguation method in the power industry can be obtained through training by adopting the training method.
Fig. 5 shows a training method of a similarity scoring model according to an embodiment of the present disclosure. The method may be applied to a training apparatus for the similarity scoring model; for example, the apparatus may be deployed in a terminal, a server, or another processing device in a stand-alone, multi-machine, or clustered system, and can support application scenarios such as searching pictures, text, and video. The terminal may be User Equipment (UE), a mobile device, a Personal Digital Assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc. In some possible implementations, the method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in fig. 5, the training method of the similarity scoring model includes:
S510, inputting sample data of at least two sample power entities into a similarity scoring model to be trained, and outputting, by the similarity scoring model to be trained, the predicted similarity of the at least two sample power entities;
S520, comparing the predicted similarity with similarity labels of the at least two sample power entities, and adjusting parameters of the similarity scoring model to be trained according to the comparison result to obtain a trained similarity scoring model; wherein,
the similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, and the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
The embodiments of the disclosure provide a training method for a similarity scoring model, where the model is used to predict the similarity between two or more power entities. Because the similarity labels of the sample power entities serve as the annotation data used to train the model, the model can be trained without a large amount of task-specific annotated data, which improves training efficiency and effectiveness and improves the accuracy of judging the similarity of power entities.
Fig. 6 is a schematic structural diagram of a similarity scoring model according to an embodiment of the present disclosure. As shown in fig. 6, in the embodiment of the present disclosure, the number of sample power entities is equal to the number of text embedding representation modules, and the at least two sample power entities are in one-to-one correspondence with the at least two text embedding representation modules.
Inputting sample data of at least two sample power entities into the similarity scoring model to be trained comprises: inputting the sample data of each sample power entity into its corresponding text embedding representation module.
Inputting the sample data of each sample power entity into its corresponding text embedding representation module ensures that each module processes the sample data of one sample power entity accurately, preventing the sample data of multiple sample power entities from being confused.
The text embedding representation module may consist of 12 Transformer encoders.
Taking fig. 6 as an example, if there are two sample power entities (i.e., entity A and entity B in the figure), the similarity scoring model may include two text embedding representation modules.
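The patent publishes no code, so the following PyTorch sketch is only one plausible reading of the architecture in fig. 6: one 12-layer Transformer encoder per text embedding representation module and a simple cosine head as the similarity scoring module. Hidden size, vocabulary size, and the choice of cosine scoring are assumptions; whether the two modules share weights is not specified.

```python
import torch
import torch.nn as nn

class TextEmbeddingModule(nn.Module):
    """One text embedding representation module: 12 Transformer encoders."""
    def __init__(self, vocab_size=21128, dim=768, heads=12, layers=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids):                 # (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))
        return hidden[:, 0]                       # [CLS] vector for the sequence

class SimilarityScoringModel(nn.Module):
    """Two text embedding modules feeding one similarity scoring module."""
    def __init__(self):
        super().__init__()
        self.encoder_a = TextEmbeddingModule()    # for entity A's sequence
        self.encoder_b = TextEmbeddingModule()    # for entity B's sequence

    def forward(self, ids_a, ids_b):
        vec_a, vec_b = self.encoder_a(ids_a), self.encoder_b(ids_b)
        cos = nn.functional.cosine_similarity(vec_a, vec_b, dim=-1)
        return (cos + 1) / 2                      # map [-1, 1] to [0, 1]
```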
In one example, the sample data of the sample power entity includes at least one sample attribute value of the sample power entity;
in combination with the above, inputting the sample data of each sample power entity into its corresponding text embedding representation module includes:
for each sample power entity, inputting at least one sample attribute value of the sample power entity into the corresponding text embedding representation module.
The text embedding representation module provided by the embodiments of the disclosure can encode the candidate attribute values in the power industry candidate data, which facilitates the subsequent scoring by the similarity scoring module and reduces the time required for entity disambiguation in the power industry.
For example, if the sample power entity is "power transformer" and the sample data of the sample power entity includes the sample attribute value "electrical device", then "electrical device" can be input into the text embedding representation module.
Of course, the embodiments of the present disclosure do not limit the number of sample attribute values that the sample data of a sample entity includes; for example, the sample data may include 3 or 5 sample attribute values.
Specifically, when the sample data of a sample power entity includes at least two sample attribute values, inputting the at least one sample attribute value of the sample power entity into the corresponding text embedding representation module comprises:
sequentially concatenating the at least two sample attribute values in the sample data, inserting a separator between adjacent sample attribute values to obtain an attribute value sequence;
inputting the attribute value sequence into the text embedding representation module.
For example, suppose the sample data of a sample power entity includes: "the power transformer is a stationary electrical device; the power transformer is a labor-intensive product; the power transformer is a power transmission and transformation device manufactured according to electromagnetic principles." Then the sample power entity is "power transformer", and the sample attribute values in its sample data include "electrical device", "labor-intensive product", and "power transmission and transformation device". A separator such as [SEP] may be inserted between these attribute values, so that the attribute value sequence corresponding to the sample data is "electrical device [SEP] labor-intensive product [SEP] power transmission and transformation device", which can then be input into the text embedding representation module.
Alternatively, suppose the sample data of a sample power entity includes: "the transformer is a power transmission and transformation device manufactured according to electromagnetic principles; the transformer is a device that changes alternating voltage using the principle of electromagnetic induction; the transformer is a basic device of power transmission and distribution." Then the sample power entity is "transformer", and the sample attribute values include "power transmission and transformation device", "device that changes alternating voltage", and "basic device". Inserting [SEP] separators yields the attribute value sequence "power transmission and transformation device [SEP] device that changes alternating voltage [SEP] basic device", which can then be input into the text embedding representation module.
Inputting multiple sample attribute values corresponding to a sample power entity into the text embedding representation module can improve the accuracy of the similarity output by the similarity scoring model and avoid the inaccuracy that can result from relying on a single sample attribute value.
As shown in fig. 6, if the sample data of the sample power entity includes attribute value 1 and attribute value 2 of entity A, then the attribute value sequence corresponding to that sample data, i.e., "attribute value 1 [SEP] attribute value 2" of entity A, may be input into the text embedding representation module.
As also shown in fig. 6, the attribute value sequence proposed by an embodiment of the present disclosure may further include a [CLS] token, which may be used to represent the semantic features of the attribute value sequence.
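A small sketch of assembling such a sequence; the [CLS]/[SEP] convention follows the description above, while the string-level joining (rather than tokenizer-level special tokens) is a simplification.

```python
def build_attribute_sequence(attribute_values):
    """Join an entity's attribute values into one model input, prefixed
    with [CLS] and separated by [SEP], as in fig. 6."""
    return "[CLS] " + " [SEP] ".join(attribute_values)

seq = build_attribute_sequence(
    ["electrical device", "labor-intensive product",
     "power transmission and transformation device"])
# -> "[CLS] electrical device [SEP] labor-intensive product [SEP] power ..."
```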
In some implementations, taking fig. 6 as an example, the similarity scoring model set forth in the embodiments of the present disclosure may further include a similarity scoring module, which may be used to determine the predicted similarity of the at least two sample power entities.
Specifically, outputting, by the similarity scoring model to be trained, the predicted similarity of the at least two sample power entities includes:
each text embedding representation module in the similarity scoring model to be trained encodes the attribute value sequence it receives to obtain a corresponding vector representation, and inputs the vector representation into the similarity scoring module of the model;
the similarity scoring module calculates the similarity of the two or more received vector representations to obtain the predicted similarity of the at least two sample power entities.
For example, if the attribute value sequences input into the text embedding representation modules are "electrical device [SEP] labor-intensive product [SEP] power transmission and transformation device" and "power transmission and transformation device [SEP] device that changes alternating voltage [SEP] basic device", the two sequences are encoded separately, and the resulting vector representations are used to determine the predicted similarity between the power entity "power transformer" corresponding to the first sequence and the power entity "transformer" corresponding to the second sequence.
The similarity scoring model to be trained provided by the embodiments of the disclosure uses the two or more received vector representations to determine the predicted similarity of the at least two sample power entities, which reduces the time the similarity scoring module needs to output the predicted similarity and saves the resources required for training the similarity scoring model.
In some implementations, the embodiments of the disclosure may calculate a loss function (i.e., compare the predicted similarity with the similarity labels of the at least two sample power entities) using the predicted similarity and the similarity labels of the at least two sample power entities, and adjust the parameters of the similarity scoring model to be trained according to the loss function (i.e., the comparison result). In the embodiments of the disclosure, when the parameters are adjusted using the loss function, a batch of samples of the sample data of at least two sample power entities may be input into the similarity scoring model.
When the prediction results corresponding to the at least two sample power entities do not match their similarity labels, the loss function (i.e., the comparison result) may be calculated from the prediction results and the similarity labels.
For example, taking the case that the predicted similarity output by the similarity scoring model is a rational number in the range [0,1]: if the model outputs a predicted similarity of 0.1 for the sample data of the at least two sample power entities, its prediction is that the sample power entities are dissimilar; if it outputs a predicted similarity of 0.9, its prediction is that they are similar.
Furthermore, if the at least two sample power entities are similar, their similarity label may be set to "1"; if they are dissimilar, their similarity label may be set to "0".
When the similarity scoring model is trained, sample data corresponding to at least two similar sample power entities and/or sample data corresponding to at least two dissimilar sample power entities can be input into the model, and the model outputs a predicted similarity for them.
For example, if the model outputs a predicted similarity of 0.1 for sample data of at least two similar sample power entities, the prediction does not conform to the pre-labeled similarity label; in this case, the loss function (i.e., the comparison result) may be determined from the difference between the predicted similarity and the similarity label.
Alternatively, if the model outputs a predicted similarity of 0.9 for sample data of at least two similar sample power entities, the prediction conforms to the pre-labeled similarity label; in this case, the loss function may be determined to be 0 (i.e., the comparison result is that the difference between the predicted similarity and the similarity label is 0).
The larger the difference between the predicted similarity and the similarity label, the larger the corresponding loss function (i.e., the comparison result indicates a larger difference between the predicted similarity and the similarity labels of the at least two sample power entities). Determining the loss function (i.e., the comparison result) in this way can accelerate the convergence of the similarity scoring model and improve its training speed.
It should be noted that the embodiments of the present disclosure are not limited to expressing the comparison result as a loss function; the above is merely an example. For instance, the parameters of the similarity scoring model may be adjusted directly according to the comparison result (i.e., the difference between the predicted similarity and the similarity labels of the at least two sample power entities).
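As one concrete choice consistent with the above (the patent itself does not fix a particular loss), the sketch below uses a binary cross-entropy loss between the predicted similarity and the 0/1 similarity label; `SimilarityScoringModel` is the earlier architecture sketch, and the learning rate is an assumption.

```python
import torch

model = SimilarityScoringModel()                  # from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = torch.nn.BCELoss()

def train_step(ids_a, ids_b, label):
    """label: tensor of 1.0 where the two sample power entities are
    similar and 0.0 where they are dissimilar."""
    predicted = model(ids_a, ids_b)               # predicted similarity in [0, 1]
    loss = loss_fn(predicted, label)              # grows with |prediction - label|
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # adjust model parameters
    return loss.item()
```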
In combination with the foregoing, the similarity scoring model provided by the embodiments of the present disclosure may include at least two text embedding representation modules and a similarity scoring module.
Therefore, in order to improve the accuracy of the similarity scoring model, the embodiments of the disclosure further provide a training method for the text embedding representation module.
FIG. 7 is a schematic flowchart of a training method for a text embedding representation module. The method generally comprises:
S710, acquiring a pre-training text;
S720, masking the pre-training text so that at least one word in the pre-training text is replaced with a mask;
S730, pre-training the text embedding representation module with the masked pre-training text.
The pre-training text includes public data of the power industry, such as text fragments of power-related documents in Baidu Wenku and documents provided by other web pages; it may also include such text fragments and/or documents after further filtering to remove low-quality content (such as meaningless runs of symbols and tables).
In the embodiments of the disclosure, in order for the trained text embedding representation module to better learn the knowledge semantics of the power industry domain, the pre-training text may be masked using at least one of an entity mask, a relationship mask, and a concept mask, or a combination of the three.
Masking the pre-training text yields a large number of samples (i.e., sample data of sample power entities) for training the text embedding representation module; training the module with a large number of samples improves the accuracy of the resulting text embedding representation module.
Annotated text resources in the power industry are scarce, while unannotated text resources are abundant. For specific tasks such as entity disambiguation in the power industry in particular, the relevant training sample data is so limited that a similarity judgment model cannot learn generalizable rules from the scarce annotated text resources alone.
Accordingly, the embodiments of the present disclosure propose masking the pre-training text so that at least one word in the pre-training text is replaced with a mask. The method specifically comprises the following steps:
randomly selecting words in the pre-training text, and replacing the randomly selected words with masks; and/or,
determining a first power entity and a first attribute corresponding to the first power entity in the pre-training text, and replacing the first power entity and the first attribute with masks.
The method for masking the pre-training text provided by the embodiments of the disclosure can label a large amount of unlabeled data (i.e., the pre-training text), yielding a large amount of data (i.e., sample data of sample power entities) that can subsequently be used to train the text embedding representation module.
Fig. 8A is a first schematic diagram of masking pre-training text proposed according to an embodiment of the present disclosure. As shown in fig. 8A, this method randomly selects words in the pre-training text and replaces them with masks. It should be noted that this approach cannot represent the correlation between entities and attribute values.
Therefore, the embodiment of the disclosure also provides another way to mask the pre-training text. FIG. 8B is a second schematic diagram of masking pre-training text according to an embodiment of the present disclosure. As shown in FIG. 8B, this method determines a first power entity and a first attribute corresponding to the first power entity in the pre-training text, and replaces both with masks. Because this masking method can represent the relationship between the entity and the attribute value, a text embedded representation module trained with pre-training text masked in this way can learn the association relations between power entities and attributes in the power industry, as well as the semantic information of power texts.
In order to replace the first power entity and the first attribute in the pre-training text with masks, an embodiment of the disclosure further determines the first power entity and the first attribute corresponding to the first power entity in the pre-training text. The determining method specifically comprises:
acquiring pre-training text information of the power industry, wherein the pre-training text information of the power industry includes association relations between power entities and attributes in the pre-training text;
and determining the first power entity in the pre-training text and the first attribute corresponding to the first power entity by using the association relations.
Determining the first power entity and its corresponding first attribute in the pre-training text improves how well the subsequently trained text embedded representation module learns power industry knowledge; meanwhile, using the association relations to locate the first power entity and the first attribute in the pre-training text improves the accuracy of masking power entities and attributes in the pre-training text information.
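For illustration, the two masking strategies and the role of the association relations can be sketched in Python as follows. The word-level tokenization, the [MASK] convention, and the tiny entity-attribute lexicon standing in for the pre-training text information are assumptions of this sketch, not details fixed by the disclosure.

import random

MASK = "[MASK]"

def random_word_mask(tokens, mask_prob=0.15):
    # First strategy: replace randomly selected words with a mask.
    return [MASK if random.random() < mask_prob else t for t in tokens]

def entity_attribute_mask(tokens, associations):
    # Second strategy: mask a first power entity together with a first
    # attribute. `associations` stands in for the association relations in
    # the pre-training text information (hypothetical surface forms).
    masked = list(tokens)
    for i, token in enumerate(tokens):
        if token in associations:                  # a first power entity
            masked[i] = MASK
            for j, other in enumerate(tokens):
                if other in associations[token]:   # its first attribute
                    masked[j] = MASK
    return masked

tokens = "the transformer has a rated capacity of 50 MVA".split()
print(entity_attribute_mask(tokens, {"transformer": ["capacity"]}))
# ['the', '[MASK]', 'has', 'a', 'rated', '[MASK]', 'of', '50', 'MVA']

Masking the entity and its attribute value together is what allows the subsequent pre-training to force the module to reconstruct both from context, which is how the association relation is learned.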
In addition, in order to enable the text embedded representation module to acquire knowledge and characteristics of the power industry and the correlation between power entities and attributes, the embodiment of the disclosure further pre-trains the text embedded representation module using the masked pre-training text, which specifically comprises:
taking the masked language model (Masked Language Model, MLM) task as the training task, and pre-training the text embedded representation module with the masked pre-training text, so that the text embedded representation module learns the semantic information in the pre-training text and/or the association relation between the first power entity and the first attribute.
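A minimal sketch of this MLM pre-training step is given below. The Hugging Face transformers library, the BERT checkpoint, the example sentence, and the hard-coded mask positions are illustrative assumptions; the disclosure does not name a specific library, model, or hyper-parameters.

import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

text = "main transformer rated capacity is 50 mva"   # hypothetical power text
enc = tokenizer(text, return_tensors="pt")
labels = enc["input_ids"].clone()

# Mask the span covering the entity "main transformer" and its attribute
# "rated capacity". In the method these positions come from the association
# relations; here they are hard-coded, assuming each word is one wordpiece
# (positions 1-4 follow [CLS]).
enc["input_ids"][0, [1, 2, 3, 4]] = tokenizer.mask_token_id

# The MLM loss ignores positions labeled -100, so only masked tokens are scored.
labels[enc["input_ids"] != tokenizer.mask_token_id] = -100

outputs = model(**enc, labels=labels)   # cross-entropy over masked positions
outputs.loss.backward()
optimizer.step()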
The text embedded representation module provided by the embodiment of the disclosure effectively exploits the latent semantics in the language model and increases the knowledge-semantic expression capability of the text embedded representation model in the power industry domain.
The embodiment of the disclosure further provides a training device for a similarity scoring model. FIG. 9 is a schematic structural diagram of a training device 900 for a similarity scoring model according to an embodiment of the disclosure, the device including:
the similarity prediction module 910 is configured to input sample data of at least two sample electric power entities into a similarity scoring model to be trained, and output a predicted similarity of the at least two sample electric power entities from the similarity scoring model to be trained;
the adjustment module 920 is configured to compare the predicted similarity with the similarity labels of the at least two sample electric power entities, and adjust parameters of the similarity scoring model to be trained according to the comparison result, so as to obtain a trained similarity scoring model; wherein,
the similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, wherein the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
In some embodiments, the number of sample power entities is equal to the number of text embedded representation modules, and the at least two sample power entities are in one-to-one correspondence with the at least two text embedded representation modules;
the similarity prediction module 910 is configured to input sample data of each sample power entity into a corresponding text embedded representation module respectively.
In some embodiments, the sample data of the sample power entity includes at least one sample attribute value of the sample power entity;
the similarity prediction module 910 is configured to input, for each sample power entity, at least one sample attribute value of the sample power entity into a corresponding text-embedded representation module.
In some embodiments, in the case that at least two sample attribute values of a sample power entity are included in sample data of the sample power entity, the similarity prediction module 910 is configured to:
sequentially connecting at least two sample attribute values in the sample data of the sample electric power entity, and inserting separators between adjacent sample attribute values to obtain an attribute value sequence;
and inputting the attribute value sequence into the text embedded representation module.
In some implementations, the similarity prediction module 910 is configured to:
Each text embedded representation module in the similarity scoring model to be trained encodes the received attribute value sequence to obtain corresponding vector representation, and inputs the vector representation into the similarity scoring module in the similarity scoring model to be trained;
the similarity scoring module calculates a similarity of the received two or more vector representations to obtain a predicted similarity of the at least two sample power entities.
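The structure described above can be read as a two-tower arrangement. The sketch below makes one illustrative set of choices: BERT encoders as the text embedded representation modules, the [CLS] vector as the text embedding, cosine similarity as the similarity scoring module, and a mean-squared-error loss against the similarity label. None of these specifics is fixed by the disclosure.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class SimilarityScoringModel(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        # one text embedded representation module per sample power entity
        self.encoder_a = BertModel.from_pretrained(name)
        self.encoder_b = BertModel.from_pretrained(name)
        self.score = nn.CosineSimilarity(dim=-1)    # similarity scoring module

    def forward(self, batch_a, batch_b):
        vec_a = self.encoder_a(**batch_a).last_hidden_state[:, 0]   # [CLS]
        vec_b = self.encoder_b(**batch_b).last_hidden_state[:, 0]
        return self.score(vec_a, vec_b)             # predicted similarity

def attribute_sequence(values, sep="[SEP]"):
    # sequentially connect sample attribute values, inserting separators
    # between adjacent values to obtain the attribute value sequence
    return f" {sep} ".join(values)

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
seq_a = attribute_sequence(["110 kV", "main transformer", "substation A"])
seq_b = attribute_sequence(["110kV", "no. 1 main transformer", "substation A"])

model = SimilarityScoringModel()
pred = model(tok(seq_a, return_tensors="pt"), tok(seq_b, return_tensors="pt"))

# compare the predicted similarity with the similarity label and adjust the
# parameters of the similarity scoring model to be trained
label = torch.tensor([1.0])       # annotated similarity of the two samples
loss = nn.functional.mse_loss(pred, label)
loss.backward()

Whether the two encoders share weights, and whether cosine similarity is replaced by a learned scoring layer, are design choices the disclosure leaves open.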
FIG. 10 is a schematic structural diagram of a training apparatus 1000 for a similarity scoring model according to an embodiment of the disclosure. As shown in FIG. 10, in some implementations, the training apparatus 1000 for the similarity scoring model further includes:
a first obtaining module 1030, configured to obtain a pre-training text;
a masking module 1040, configured to perform masking processing on the pre-training text, so that at least one word in the pre-training text is replaced by a mask;
the pre-training module 1050 is configured to pre-train the text embedding representation module using the pre-trained text after the mask processing.
In some implementations, the masking module 1040 is configured to:
randomly selecting words in the pre-training text, and replacing the randomly selected words with masks; and/or,
and determining a first power entity and a first attribute corresponding to the first power entity in the pre-training text, and replacing the first power entity and the first attribute with masks.
In some embodiments, the pre-training module 1050 is configured to:
taking the masked language model task as the training task, and pre-training the text embedded representation module with the masked pre-training text, so that the text embedded representation module learns the semantic information in the pre-training text and/or the association relation between the first power entity and the first attribute.
In some implementations, the masking module 1040 is configured to:
acquiring pre-training text information of the power industry, wherein the pre-training text information of the power industry includes association relations between power entities and attributes in the pre-training text;
and determining the first power entity in the pre-training text and the first attribute corresponding to the first power entity by using the association relations.
The embodiment of the present disclosure further provides an entity disambiguation device for the power industry. FIG. 11 is a schematic structural diagram of an entity disambiguation device 1100 for the power industry according to an embodiment of the present disclosure, the device including:
a second obtaining module 1110, configured to obtain at least two power industry candidate data;
the similarity determining module 1120 is configured to input the at least two power industry candidate data into a pre-trained similarity scoring model, and output the similarity of at least two candidate power entities from the similarity scoring model;
A disambiguation module 1130 configured to disambiguate at least two candidate power entities according to the similarity;
the similarity scoring model is obtained through training according to the training method of the similarity scoring model.
In some embodiments, the similarity scoring model includes at least two text-embedded representation modules and a similarity scoring module, where the output ends of the text-embedded representation modules are respectively connected with the input ends of the similarity scoring module;
the similarity determining module 1120 is configured to input each piece of power industry candidate data into a corresponding text embedded representation module respectively.
In some embodiments, any of the power industry candidate data includes a candidate power entity and at least one candidate attribute value for the candidate power entity;
and the similarity determining module 1120 is configured to input the at least one candidate attribute value in the power industry candidate data into the corresponding text embedded representation module.
In some embodiments, the second acquisition module 1110 is configured to:
extracting a plurality of power industry candidate data from a power industry document;
according to a preset architecture file and/or configuration file of the power industry knowledge points, grouping the extracted candidate data of the power industry to obtain a plurality of groups; each group comprises a plurality of power industry candidate data;
and acquiring at least two power industry candidate data from any one group.
In some embodiments, the architecture file of the power industry knowledge point includes:
at least one of an entity in the power industry, an entity attribute corresponding to the entity, and a value type of the entity attribute.
In some embodiments, the profile of the power industry knowledge point includes:
at least one of an entity attribute used for grouping, a key granularity of the grouping, and a preset threshold value used in data aggregation processing.
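A hedged sketch of this grouping step follows. The configuration field names (group_by_attributes, key_granularity, similarity_threshold) and the candidate records are invented for illustration; the disclosure only states that the files describe entities, their attributes and value types, the grouping attributes, the key granularity, and a preset threshold.

from collections import defaultdict
from itertools import combinations

# hypothetical configuration file for power industry knowledge points
config = {
    "group_by_attributes": ["voltage_level", "station"],   # grouping attributes
    "key_granularity": "exact",                            # key granularity
    "similarity_threshold": 0.9,                           # preset threshold
}

candidates = [
    {"name": "main transformer #1", "voltage_level": "110kV", "station": "A"},
    {"name": "No. 1 main transformer", "voltage_level": "110kV", "station": "A"},
    {"name": "main transformer #1", "voltage_level": "220kV", "station": "B"},
]

# group the extracted power industry candidate data by the configured key
groups = defaultdict(list)
for cand in candidates:
    key = tuple(cand.get(attr) for attr in config["group_by_attributes"])
    groups[key].append(cand)

# at least two power industry candidate data taken from any one group form a
# pair for the similarity scoring model
pairs = [pair for group in groups.values() for pair in combinations(group, 2)]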
In some implementations, the disambiguation module 1130 is to:
aggregating the at least two candidate electric power entities into one grouping under the condition that the similarity is greater than or equal to a preset threshold;
selecting a first candidate electric power entity from the two or more candidate electric power entities in the same grouping, and deleting the remaining candidate electric power entities;
and saving the first candidate electric power entity in the knowledge graph, with the deleted candidate electric power entities taken as related information of the first candidate electric power entity.
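These disambiguation steps amount to clustering candidates by thresholded similarity and keeping one representative per cluster. The sketch below uses a union-find structure to aggregate the candidates; union-find is an implementation choice of this sketch, not something the disclosure mandates.

def disambiguate(candidates, scored_pairs, threshold=0.9):
    # scored_pairs holds (index_a, index_b, similarity) triples produced by
    # the similarity scoring model for candidate pairs within one group
    parent = list(range(len(candidates)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # aggregate two candidates when their similarity reaches the threshold
    for i, j, sim in scored_pairs:
        if sim >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(candidates)):
        clusters.setdefault(find(i), []).append(candidates[i])

    # keep a first candidate per grouping; the deleted candidates survive
    # only as related information of the retained entity
    knowledge_graph = []
    for members in clusters.values():
        kept = dict(members[0], related_information=members[1:])
        knowledge_graph.append(kept)
    return knowledge_graph

entries = disambiguate(
    [{"name": "main transformer #1"}, {"name": "No. 1 main transformer"}],
    [(0, 1, 0.95)],
)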
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical scheme of the disclosure, the acquisition, storage, and application of any user personal information involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the apparatus 1200 includes a computing unit 1201, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in device 1200 are connected to I/O interface 1205, including: an input unit 1206 such as a keyboard, mouse, etc.; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as the entity disambiguation method of the power industry. For example, in some embodiments, the entity disambiguation method of the power industry may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the above-described entity disambiguation method of the power industry may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the entity disambiguation method of the power industry by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A training method of a similarity scoring model, comprising:
inputting sample data of at least two sample electric power entities into a similarity scoring model to be trained, and outputting predicted similarity of the at least two sample electric power entities by the similarity scoring model to be trained;
comparing the predicted similarity with similarity labels of the at least two sample electric power entities, and adjusting parameters of the similarity scoring model to be trained according to a comparison result to obtain a trained similarity scoring model; wherein,
The similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, and the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
2. The method of claim 1, wherein the number of sample power entities is equal to the number of text embedded representation modules, the at least two sample power entities being in one-to-one correspondence with the at least two text embedded representation modules;
the inputting the sample data of at least two sample electric entities into a similarity scoring model to be trained comprises: sample data of each sample power entity is respectively input into a corresponding text embedded representation module.
3. The method of claim 2, wherein the sample data of the sample power entity includes at least one sample attribute value of the sample power entity;
the step of respectively inputting the sample data of each sample electric entity into a corresponding text embedded representation module comprises the following steps:
for each sample power entity, at least one sample attribute value of the sample power entity is input into a corresponding text-embedded representation module.
4. A method according to claim 3, wherein, in case at least two sample attribute values of the sample power entity are included in the sample data of the sample power entity, the inputting of the at least one sample attribute value of the sample power entity into the corresponding text embedded representation module comprises:
sequentially connecting at least two sample attribute values in the sample data of the sample electric power entity, and inserting separators between adjacent sample attribute values to obtain an attribute value sequence;
and inputting the attribute value sequence into the text embedded representation module.
5. The method of claim 4, wherein the similarity scoring model to be trained outputs predicted similarities for the at least two sample power entities, comprising:
each text embedded representation module in the similarity scoring model to be trained encodes the received attribute value sequence to obtain corresponding vector representation, and inputs the vector representation into the similarity scoring module in the similarity scoring model to be trained;
the similarity scoring module calculates a similarity of the received two or more vector representations to obtain a predicted similarity of the at least two sample power entities.
6. The method of any of claims 1-5, further comprising:
acquiring a pre-training text;
masking the pre-training text so that at least one word in the pre-training text is replaced with a mask;
and pre-training the text embedded representation module by using the pre-training text after mask processing.
7. The method of claim 6, wherein the masking the pre-training text such that at least one word in the pre-training text is replaced with a mask comprises:
randomly selecting words in the pre-training text, and replacing the randomly selected words with masks; and/or,
and determining a first power entity in the pre-training text and a first attribute corresponding to the first power entity, and replacing the first power entity and the first attribute with masks.
8. The method of claim 7, wherein the pre-training the text-embedded representation module with the masked pre-trained text comprises:
and taking a masked language model (MLM) task as a training task, and pre-training the text embedding representation module by utilizing the pre-training text after mask processing so that the text embedding representation module learns semantic information in the pre-training text and/or the association relation between the first power entity and the first attribute.
9. The method of claim 7 or 8, wherein the determining a first power entity in the pre-training text and a first attribute corresponding to the first power entity comprises:
acquiring pre-training text information of the power industry, wherein the pre-training text information of the power industry comprises association relations between power entities and attributes in the pre-training text;
and determining the first power entity in the pre-training text and a first attribute corresponding to the first power entity by utilizing the association relation.
10. A method of entity disambiguation in the power industry, comprising:
acquiring at least two power industry candidate data;
inputting the at least two power industry candidate data into a pre-trained similarity scoring model, and outputting the similarity of at least two candidate power entities by the similarity scoring model;
disambiguating the at least two candidate power entities according to the similarity;
wherein the similarity scoring model is trained by the method of any one of claims 1-9.
11. The method of claim 10, wherein the similarity scoring model comprises at least two text-embedded representation modules and a similarity scoring module, and the output end of each text-embedded representation module is respectively connected with the input end of the similarity scoring module;
the inputting the at least two power industry candidate data into a pre-trained similarity scoring model comprises:
And respectively inputting each candidate data of the electric power industry into a corresponding text embedded representation module.
12. The method of claim 11, wherein any one of the power industry candidate data comprises a candidate power entity and at least one candidate attribute value of the candidate power entity;
the inputting each power industry candidate data into a corresponding text embedded representation module comprises:
inputting the at least one candidate attribute value in the power industry candidate data into the corresponding text embedded representation module.
13. The method of any of claims 10-12, wherein the acquiring at least two power industry candidate data comprises:
extracting a plurality of power industry candidate data from a power industry document;
according to a preset architecture file and/or configuration file of the power industry knowledge points, grouping the extracted candidate data of the power industry to obtain a plurality of groups; each group comprises a plurality of power industry candidate data;
and acquiring at least two power industry candidate data from any group.
14. The method of claim 13, wherein the architecture file of the power industry knowledge point comprises:
at least one of an entity in the power industry, an entity attribute corresponding to the entity, and a value type of the entity attribute.
15. The method of claim 13, wherein the profile of the power industry knowledge point comprises:
at least one of an entity attribute used for grouping, a key granularity of the grouping, and a preset threshold value used in data aggregation processing.
16. The method according to any of claims 10-15, wherein said disambiguating said at least two candidate power entities according to said similarity comprises:
aggregating the at least two candidate electric power entities into one grouping under the condition that the similarity is greater than or equal to a preset threshold;
selecting a first candidate electric power entity from two or more candidate electric power entities in the same grouping, and deleting the remaining candidate electric power entities;
and storing the first candidate electric power entity in a knowledge graph, and taking the deleted candidate electric power entities as related information of the first candidate electric power entity.
17. A training device for a similarity scoring model, comprising:
the similarity prediction module is used for inputting sample data of at least two sample electric power entities into a similarity scoring model to be trained, and outputting predicted similarity of the at least two sample electric power entities by the similarity scoring model to be trained;
the adjustment module is used for comparing the predicted similarity with the similarity labels of the at least two sample electric power entities, and adjusting parameters of the similarity scoring model to be trained according to a comparison result so as to obtain a trained similarity scoring model; wherein,
the similarity scoring model to be trained comprises at least two text embedding representation modules and a similarity scoring module, and the output ends of the text embedding representation modules are respectively connected with the input ends of the similarity scoring module.
18. An entity disambiguation device in the power industry, comprising:
the second acquisition module is used for acquiring at least two power industry candidate data;
the similarity determination module is used for inputting the at least two power industry candidate data into a pre-trained similarity scoring model, and outputting the similarity of the at least two candidate power entities by the similarity scoring model;
the disambiguation module is used for disambiguating the at least two candidate electric power entities according to the similarity;
wherein the similarity scoring model is trained using the apparatus of claim 17.
19. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-16.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-16.
CN202310457707.4A 2023-04-24 2023-04-24 Entity disambiguation method and device for power industry Pending CN116542244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310457707.4A CN116542244A (en) 2023-04-24 2023-04-24 Entity disambiguation method and device for power industry

Publications (1)

Publication Number Publication Date
CN116542244A true CN116542244A (en) 2023-08-04

Family

ID=87451665

Country Status (1)

Country Link
CN (1) CN116542244A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination