CN112307134A - Entity information processing method, entity information processing device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112307134A
CN112307134A (application CN202011196563.4A; granted as CN112307134B)
Authority
CN
China
Prior art keywords
entity, candidate, target, entity names, names
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011196563.4A
Other languages
Chinese (zh)
Other versions
CN112307134B (en)
Inventor
骆金昌
万凡
王海威
王杰
陈坤斌
刘准
和为
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011196563.4A
Publication of CN112307134A
Application granted
Publication of CN112307134B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/284 Relational databases
    • G06F 16/288 Entity relationship models
    • G06F 16/367 Ontology (creation of semantic tools)
    • G06F 18/22 Pattern recognition: matching criteria, e.g. proximity measures
    • G06F 18/2321 Pattern recognition: non-hierarchical clustering using statistics or function optimisation
    • G06F 40/30 Semantic analysis (handling natural language data)
    • G06N 20/00 Machine learning
    • Y02P 90/30 Computing systems specially adapted for manufacturing


Abstract

The disclosure provides an entity information processing method, an entity information processing device, electronic equipment and a storage medium, and relates to the field of deep learning. The specific implementation scheme is as follows: identifying N document materials of a target department to obtain candidate entity names respectively corresponding to the N document materials, N being an integer greater than or equal to 1; generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials, M being an integer greater than or equal to 1; and determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-class entities corresponding to the target department in the relational graph.

Description

Entity information processing method, entity information processing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to the field of deep learning.
Background
Relationship graphs, which contain entities of a first type (i.e., "matters"), entities of a second type (i.e., "people"), and the relationships between them, are increasingly used in enterprises. A relationship graph can provide rich functionality, such as searching for the person in charge of a matter and viewing that person's information. However, how to construct the first type of entities in the relationship graph efficiently and accurately remains a problem to be solved.
Disclosure of Invention
The disclosure provides an entity information processing method, an entity information processing device, an electronic device and a storage medium.
According to a first aspect of the present disclosure, there is provided an entity information processing method, including:
identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
According to a second aspect of the present disclosure, there is provided an entity information processing apparatus including:
the identification module is used for identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module, configured to generate M candidate clusters corresponding to the target department based on candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and the entity name determining module is used for determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the aforementioned method.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the aforementioned method.
With the method and apparatus of the present disclosure, candidate entity names can be determined from the document materials of a target department, and the target entity names of one or more first type entities corresponding to the target department in the relationship graph can then be determined from those candidate names. This avoids the low efficiency, poor timeliness, and inaccurate results of manually analyzing entity names, ensures the processing efficiency and accuracy of obtaining the target entity names, and in turn ensures the efficiency and accuracy of constructing or updating the relationship graph.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flowchart of an entity information processing method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of the process flow for constructing candidate clusters in the entity information processing method according to an embodiment of the disclosure;
FIG. 3 is a first schematic diagram of the composition structure of an entity information processing apparatus according to an embodiment of the disclosure;
FIG. 4 is a second schematic diagram of the composition structure of an entity information processing apparatus according to an embodiment of the disclosure;
FIG. 5 is a block diagram of an electronic device for implementing the entity information processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to assist understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
An embodiment of the present disclosure provides an entity information processing method, as shown in fig. 1, including:
s101: identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
s102: generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
s103: and determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
Embodiments of the present disclosure may be applied to an electronic device, such as a server or a terminal device.
The target department may be any one of a plurality of departments in an organization or enterprise, and each department can be processed using the scheme provided in this embodiment. Any one department is referred to here as the target department; the remaining departments are processed in the same way as the target department and are not described separately.
The N document materials of the target department may include at least one type of document material, such as weekly reports and promotional materials of the target department.
The N document materials may be obtained by collecting materials within the department: for example, all document materials uploaded by the employees of the target department may be collected as the N document materials, or the N document materials may be randomly sampled from the document materials uploaded by the employees.
Identifying N document materials of a target department to obtain candidate entity names of first-class entities corresponding to the N document materials, which may include: and respectively inputting the N document materials of the target department into a preset model to obtain candidate entity names respectively output by the preset model.
Generating M candidate clusters corresponding to the target department based on the candidate entity names corresponding to the N document materials, respectively, may include: and clustering the candidate entity names respectively corresponding to the N document materials of the target department to obtain the M candidate clusters corresponding to the target department.
Further, a candidate entity name may be selected from one or more candidate entity names included in each candidate cluster as a target entity name corresponding to each candidate cluster; the target entity name is taken as the target entity name of a first type entity of the target department.
It should be understood that the specific value of M may differ according to the actual situation. If the target department ultimately yields the target entity name of a single first type entity, M equals 1; if the target department corresponds to two or more first type entities, the corresponding target entity names number two or more, and M is 2 or greater. Not all possible cases are enumerated here.
The first type of entity may refer to a "matter" entity in the relationship graph; a "matter" may cover various contents, for example projects, platforms, and tools. The first type of entity may comprise one or more entities; that is, one or more matter entities may be included in the relationship graph.
The target entity name or the candidate entity name of the corresponding first-type entity may refer to an attribute or information of "things" to be used in the relationship graph, for example, the entity name of "things" may be: the name of the project, the name of the platform, the name of the tool, etc.
With this scheme, candidate entity names can be determined from document materials collected department by department, and one or more target entity names corresponding to each department in the relationship graph can be determined from those candidate names. The target entity names contained in the relationship graph can thus be determined simply by collecting the departments' document materials, avoiding the low efficiency, poor timeliness, and inaccurate results of manual analysis, ensuring the efficiency and accuracy of obtaining the target entity names, and in turn the efficiency and accuracy of constructing or updating the relationship graph.
Specifically, in the above S101, the identifying N document materials of the target department to obtain candidate entity names corresponding to the N document materials includes:
inputting a jth document material in the N document materials of the target department and a target department corresponding to the jth document material into a preset model to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
The N document materials may be extracted from documents within the enterprise, including weekly reports, promotional materials, briefing reports, project filing materials, and so on. Because these materials exist in large quantities within the enterprise, they can be obtained at very low cost. Moreover, such materials are often timely; weekly reports, for example, are written every week, so collecting this kind of document material also satisfies timeliness requirements.
The jth document material is any one of the N document materials. Each of the N document materials is processed in the same way to obtain its corresponding candidate entity name, so the processing of all N document materials is not repeated one by one.
It should be understood that the input information of the preset model may specifically be the name of the target department and the jth document material; furthermore, the jth document material may be pre-segmented to obtain at least one segmented sentence, and the at least one segmented sentence and the name of the target department are used as input information of the preset model; correspondingly, the output information of the preset model may be a candidate entity name.
Therefore, the embodiment provides that the document material is analyzed by adopting the preset model to obtain the candidate entity name corresponding to the document material, so that the problems of low efficiency and poor accuracy caused by manual analysis or simple character matching can be avoided, the accuracy of subsequently determining the target entity name is improved, and the processing efficiency is improved.
Further, for the preset model, the preset model may be obtained by training sample data included in a training set. Regarding the way of constructing the training set, it may include:
acquiring historical candidate entity names respectively corresponding to a plurality of departments;
matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department;
and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
Specifically, the historical document material may be obtained by extracting the historical document material from the documents inside the enterprise; for example, historical documentation material including department project names may be included, such as weekly reports, promotional material, and the like. Because such historic document material exists in large quantities within the enterprise, it can be obtained at a very low cost.
Generating a training set based on the historic document materials of each department and the historic entity names corresponding to the historic document materials, wherein each historic document material, the historic entity name corresponding to the historic document material and the corresponding department can be used as each sample data, and each sample data is added to the training set. Finally, the training set may include all of the above sample data.
It should be noted that, when constructing the training set, the historical entity names of a department are matched against the historical document materials of that same department to label the materials; this reduces noise and improves the quality of the training set. Determining the historical entity name corresponding to each historical document material amounts to labeling that material, that is, using the matched historical entity name as the label of the historical document material. In the related art, sample data in a training set is generally labeled manually, which is costly; in this embodiment, the labeling can be completed automatically simply by matching historical entity names and historical document materials within the same department, avoiding the excessive cost of manual labeling while being both faster and more accurate.
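As a minimal sketch of this department-level matching (the function and data names below are illustrative, not from the disclosure), each historical document material is labeled only with the historical entity names of its own department that actually appear in its text:

```python
def weak_label(dept_documents, dept_entity_names):
    """Build training samples by matching each department's historical
    entity names against that same department's document materials.

    dept_documents: dict mapping department -> list of document texts
    dept_entity_names: dict mapping department -> list of known
        historical entity names for that department
    Returns a list of (document, department, matched_names) samples;
    documents with no match are skipped, which reduces noise.
    """
    samples = []
    for dept, docs in dept_documents.items():
        names = dept_entity_names.get(dept, [])
        for doc in docs:
            matched = [n for n in names if n in doc]
            if matched:  # only matched materials become labeled samples
                samples.append((doc, dept, matched))
    return samples
```

Note that "Alpha platform" being an entity of one department does not label another department's documents: the match is always restricted to the same department, which is the noise-reduction point made above.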
Therefore, the labeling work of the data of the training set is automatically completed by equipment, and as the historical entity names of the same department are adopted to label the historical document materials of the same department during the labeling of the sample data, the department is used as the granularity of the information or as the global information, the effect of entity extraction can be improved, the quality of the sample data of the training set can be improved, and meanwhile, the noise can be reduced.
And then, training the preset model based on the historical document materials of all departments and the corresponding historical entity names contained in the training set to obtain the trained preset model.
That is, the preset model is trained on the constructed training set, whose sample data combine the historical document materials of each of a plurality of departments with the historical entity names (such as project names) of the corresponding department. During training, the historical document material in each sample may be split into one or more sentences; the split sentences together with the department name serve as the input of the preset model, and the historical entity name corresponding to that historical document material serves as the expected output. For example, when training the preset model, the input layer and its features include the sentences of the historical document material and the department, expressed as: sentence + <SEP> + department.
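The input format above can be sketched as follows (a hedged illustration: the helper names and the naive sentence splitter are my own, not from the disclosure; a production system would use a proper sentence segmenter):

```python
def split_sentences(document: str) -> list[str]:
    """Naively split a document material into sentences, treating
    Chinese full stops and Western periods as boundaries."""
    parts = []
    for chunk in document.replace("\u3002", ".").split("."):
        chunk = chunk.strip()
        if chunk:
            parts.append(chunk)
    return parts


def build_model_inputs(document: str, department: str) -> list[str]:
    """Build one `sentence + <SEP> + department` input per sentence,
    matching the input format described for the preset model."""
    return [f"{s} <SEP> {department}" for s in split_sentences(document)]
```

For example, a two-sentence weekly report for department "D" yields two model inputs, each carrying the department as global context.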
The convergence condition in the training of the preset model may be that the number of iterations reaches a preset threshold and/or that the loss function is smaller than the preset threshold. The specific convergence conditions may also include more, and this embodiment is not exhaustive.
The preset model may be constructed using BERT (Bidirectional Encoder Representations from Transformers) together with a Conditional Random Field (CRF) layer. Using the pre-trained language model BERT for semantic vector extraction enables accurate semantic representation of sentences and improves semantic transferability, so that good results can be obtained even with a relatively small training set.
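The disclosure gives no implementation details for the BERT+CRF tagger, but the role of the CRF layer can be illustrated in isolation (all tag names and scores below are invented): Viterbi decoding picks the tag sequence maximizing emission plus transition scores, which is how a CRF turns per-token scores into a consistent BIO label sequence.

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag path.

    emissions: list (one entry per token) of dicts tag -> score,
        e.g. the per-token scores a BERT encoder would produce.
    transitions: dict (prev_tag, cur_tag) -> score; unlisted pairs
        default to 0.0, and a strongly negative score forbids a pair
        (e.g. an "I" tag directly after "O" in BIO tagging).
    """
    tags = list(emissions[0])
    score = {t: emissions[0][t] for t in tags}
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointer = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: score[p] + transitions.get((p, cur), 0.0))
            new_score[cur] = score[prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointer[cur] = prev
        score = new_score
        backpointers.append(pointer)
    best = max(tags, key=score.get)
    path = [best]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path
```

With a transition score of -100 for ("O", "I"), the decoder prefers "B" at the first token even when "O" scores slightly higher there, because only the B-I path remains consistent.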
It can be seen that, in the processing of the training preset model, since the labeling work of the data of the training set is automatically completed by the equipment, and the historical entity name of the same department is adopted to label the historical document material of the same department during the labeling of the sample data, the quality of the sample data of the training set can be improved, the noise can be reduced, and the identification accuracy of the finally obtained preset model can be ensured when the training of the preset model is performed based on the training set.
By adopting the processing, the currently input document materials can be analyzed based on the preset model, the entity name corresponding to each currently input document material is obtained, and the entity name is used as the candidate entity name of each document material. Then, the foregoing processing of S102 is executed, and based on the candidate entity names respectively corresponding to the N document materials, M candidate clusters corresponding to the target department are determined, as shown in fig. 2, which may include:
s201: screening the N candidate entity names respectively corresponding to the N document materials to obtain L candidate entity names;
s202: clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
As to S201, the following processing modes may be used:
Mode 1: acquiring frequency information of the N candidate entity names, and selecting the L candidate entity names whose frequency information is greater than a preset frequency threshold;
or,
Mode 2: filtering the N candidate entity names of the N document materials based on a preset rule, and retaining the L candidate entity names that do not satisfy the preset rule;
or,
Mode 3: combining mode 1 with mode 2, which may be:
filtering the N candidate entity names of the N document materials based on the preset rule and retaining at least one candidate entity name that does not satisfy the preset rule; then acquiring frequency information of the retained candidate entity names, and selecting from them the L candidate entity names whose frequency information is greater than the preset frequency threshold.
In the method 1, firstly, frequency statistics is performed on the candidate entity names of each department to obtain frequency information corresponding to each candidate entity name, and then the low-frequency candidate entity names are filtered by combining the frequency information. Therefore, the accuracy of subsequent clustering can be improved.
The preset frequency threshold may be set according to actual conditions, for example, 3 times may be used as the preset frequency threshold, or 4 times may be used as the preset frequency threshold.
In mode 2, the preset rule may include: the same as the preset keyword. The preset keyword may be set according to an actual situation, for example, "commercialization" may be used as a preset keyword, and accordingly, the candidate entity name including the preset keyword "commercialization" is deleted.
In mode 3, the two modes are used in combination: the candidate entity names satisfying the preset rule are deleted first, and the low-frequency candidate entity names are then filtered out. The order may also be reversed: after filtering out the candidate entity names whose frequency is below the preset frequency threshold, the names satisfying the preset rule are deleted from the remainder. Either way, the L candidate entity names corresponding to the target department are finally obtained.
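Mode 3 can be sketched with `collections.Counter` (the threshold value and banned-keyword list below are illustrative, echoing the "commercialization" example above):

```python
from collections import Counter


def filter_candidates(candidate_names, banned_keywords, min_frequency=3):
    """Filter raw candidate entity names in two steps.

    1. Rule filtering: drop any name containing a banned keyword
       (e.g. "commercialization" in the example above).
    2. Frequency filtering: keep only names whose occurrence count
       exceeds the preset frequency threshold.
    Returns the surviving names with their occurrence counts.
    """
    kept = [n for n in candidate_names
            if not any(k in n for k in banned_keywords)]
    counts = Counter(kept)
    return {name: c for name, c in counts.items() if c > min_frequency}
```

With a threshold of 3, a name seen four times survives while a name seen twice is dropped as low-frequency noise.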
In S202, clustering the L candidate entity names to obtain the M candidate clusters corresponding to the target department may specifically include: performing similarity calculation on the L candidate entity names, and adding candidate entity names whose distance is smaller than a preset threshold (i.e., whose similarity is sufficiently high) into the same cluster, finally obtaining the M candidate clusters corresponding to the target department.
Further, the similarity calculation may be a calculation of edit distance similarity and/or semantic similarity. Accordingly, the preset threshold may include at least one of a preset edit distance threshold and a preset semantic similarity threshold.
For example, the DBSCAN nearest-neighbor clustering algorithm may be used to cluster the candidate entity names, which addresses the entity fusion problem. The similarity of candidate entity names may be the literal edit distance: when the edit distance between two candidate entity names is smaller than the preset edit distance threshold, they are clustered together.
In another example, the semantic similarity may be calculated by using a Deep Structured Semantic Model (DSSM) or other models, and names of candidate entities with semantic distances smaller than a preset semantic similarity threshold are taken as the same class and grouped under the same cluster.
In another example, when the edit distance between any two candidate entity names is smaller than the preset edit distance similarity threshold and the semantic distance is smaller than the preset semantic similarity threshold, the two candidate entity names are clustered into the same cluster.
Of course, other similarity calculation may also be adopted to determine the similarity between the candidate entity names, which all may be within the protection scope of the present embodiment, and this is not exhaustive here.
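A minimal sketch of the edit-distance variant follows (a transitive single-linkage grouping stands in here for DBSCAN, and the distance threshold is illustrative, not from the disclosure):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cluster_names(names, max_distance=2):
    """Group candidate entity names whose pairwise edit distance is
    within the threshold; linked pairs are merged transitively, so each
    resulting cluster plays the role of one first-type entity."""
    clusters = []
    for name in names:
        merged = None
        for cluster in clusters:
            if any(edit_distance(name, m) <= max_distance for m in cluster):
                if merged is None:
                    cluster.append(name)
                    merged = cluster
                else:  # name links two existing clusters: merge them
                    merged.extend(cluster)
                    cluster.clear()
        if merged is None:
            clusters.append([name])
    return [c for c in clusters if c]
```

So near-duplicate names such as "Alpha platform" and "Alpha platforms" fall into one cluster, while unrelated names stay separate.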
In this way, the candidate entity names obtained from the document materials are filtered in advance, and the filtered candidate entity names are further clustered to obtain the M candidate clusters corresponding to the target department.
In S103, determining target entity names of M first-class entities corresponding to the target department in the relationship graph based on the candidate entity names included in the M candidate clusters, respectively, includes:
acquiring frequency information of candidate entity names contained in the ith candidate cluster in the M candidate clusters;
and taking the candidate entity name with the highest frequency information among the candidate entity names contained in the ith candidate cluster as the target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
The ith candidate cluster may be any one of the M candidate clusters, and since the corresponding target entity name is determined in the same manner for each candidate cluster, only one of the candidate clusters is described here, and the processing manners of the remaining candidate clusters are the same, which is not described in detail.
With the above processing, based on the frequency information of each candidate entity name in the ith candidate cluster, the candidate entity name with the highest frequency of occurrence is selected as the target entity standard name of the ith first-class entity corresponding to that cluster, and the remaining candidate entity names in the cluster are all used as target entity aliases of the ith first-class entity. In this way, each candidate cluster may yield one or more target entity aliases for its corresponding first-class entity, but only one target entity standard name.
Because a plurality of candidate clusters can be constructed for a target department, each candidate cluster can be considered to correspond to one first-class entity, and the target entity standard name and target entity aliases of that first-class entity can be determined based on the candidate cluster; finally, the target entity standard names and target entity aliases corresponding to the plurality of first-class entities of the target department can be obtained.
Therefore, with this scheme, the target entity standard name and one or more target entity aliases of a matter can be determined based on the constructed candidate clusters, which provides a more accurate expression for constructing the matter entities in the relationship graph. Moreover, because target entity alias information is added, more reference information is available for subsequent generalized searches, making the relationship graph more accurate and more convenient to use.
Based on the above processing, the target entity names of the first-class entities in the relationship graph can be obtained. Further, the second-class entities associated with each matter can be obtained, so that the associations between the target entity names of the first-class entities and the related second-class entities are constructed in the relationship graph. The method specifically includes:
acquiring second-class entities associated with the kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities, and establishing association relations between the kth first-class entity and those second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
Specifically, each of the M first-class entities may include a target entity standard name and one or more target entity aliases; one or more document materials corresponding to the standard name of the target entity and the alias names of the one or more target entities can be searched, and one or more second-class entities are extracted from the one or more document materials. This allows the acquisition of the relevant second type entities having a relationship with each of the first type entities.
Wherein the second type of entity may specifically refer to a "human" entity in the relationship graph.
Further, relationships between the first type entities and related second type entities having an association relationship therewith may be established in the relationship graph. That is, one or more second-class entities having a relationship with each first-class entity are obtained, and then the association relationship between each first-class entity and the one or more second-class entities related to the first-class entity is added to the relationship graph.
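A non-limiting sketch of this linking step is shown below. The person-extraction step is assumed to have been performed elsewhere (each document already carries its extracted person names); the function and entity identifiers are illustrative only:

```python
def link_entities(first_entities, documents):
    """first_entities: {entity_id: [standard name, *aliases]}.
    documents: list of (text, persons) pairs, where `persons` is
    the list of person names already extracted from that text.
    Returns graph edges (first-class entity id, person name)."""
    edges = set()
    for eid, names in first_entities.items():
        for text, persons in documents:
            # A document is associated with the first-class entity
            # if it mentions the standard name or any alias.
            if any(name in text for name in names):
                edges.update((eid, person) for person in persons)
    return edges

# Example: one matter entity with an alias, matched against two
# hypothetical documents.
edges = link_entities(
    {"E1": ["ops review", "operations review"]},
    [("weekly operations review hosted", ["Alice"]),
     ("billing notes", ["Bob"])])
```

The resulting edges are the association relations added between first-class and second-class entities in the relationship graph.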
Wherein the second-class entity may be a person, and the person may be represented in the relationship graph by the person's name; in addition, a second-class entity such as a person may also include related attribute information or entity information, for example the person's position, title, and the like, which are not exhaustively listed here.
Therefore, the related second-class entities in the relationship graph can be determined through the matter names, thereby completing the construction of the relationship graph. Moreover, because the construction of the matter names and the acquisition of the related second-class entities are based on the same document materials, the relations between the matters and the related second-class entities in the relationship graph can be constructed simply by analyzing the matter entities in advance, which improves the efficiency of constructing the relationship graph.
An embodiment of the present invention further provides an entity information processing apparatus, as shown in fig. 3, including:
the identification module 31 is configured to identify N document materials of a target department, and obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module 32, configured to generate M candidate clusters corresponding to the target department based on candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and an entity name determining module 33, configured to determine, based on candidate entity names included in the M candidate clusters, target entity names of the M first-class entities corresponding to the target department in the relationship graph.
The identification module 31 is configured to input a jth document material of the N document materials of the target department and a target department corresponding to the jth document material into a preset model, so as to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
On the basis of fig. 3, the information processing apparatus provided in the present embodiment, as shown in fig. 4, further includes:
a training set constructing module 34, configured to obtain historical candidate entity names corresponding to multiple departments respectively; matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department; and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
As shown in fig. 4, the apparatus further includes:
and the model training module 35 is configured to train the preset model based on the historical document materials of the departments and the historical entity names corresponding to the historical document materials included in the training set, so as to obtain the trained preset model.
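The training-set construction performed by module 34 can be sketched as follows (an illustrative, non-limiting example; the department and document names are hypothetical). Each historical document is matched against the historical candidate entity names of its own department, and the matched names serve as labels:

```python
def build_training_set(dept_docs, dept_candidate_names):
    """dept_docs: {department: [historical document text, ...]}.
    dept_candidate_names: {department: [historical candidate
    entity name, ...]}. Each document is labeled with every
    historical name of its department that it mentions, yielding
    (department, document, matched names) training samples."""
    samples = []
    for dept, docs in dept_docs.items():
        names = dept_candidate_names.get(dept, [])
        for doc in docs:
            matched = [n for n in names if n in doc]
            if matched:  # keep only documents with at least one label
                samples.append((dept, doc, matched))
    return samples

samples = build_training_set(
    {"infra": ["the deploy pipeline failed", "lunch menu"]},
    {"infra": ["deploy pipeline"]})
```

Such samples are then used by module 35 to train the preset model that maps a (document, department) pair to candidate entity names.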
The clustering module 32 is configured to screen N candidate entity names corresponding to the N document materials, respectively, to obtain L candidate entity names; l is an integer of 1 or more and N or less; clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
The entity name determining module 33 is configured to obtain frequency information of the candidate entity names included in an ith candidate cluster of the M candidate clusters, wherein i is an integer of 1 or more and M or less; and to take the candidate entity name with the highest frequency among the candidate entity names contained in the ith candidate cluster as the target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and take the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
As shown in fig. 4, the apparatus further includes:
a relationship construction module 36, configured to obtain second-class entities associated with a kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities; and establish association relations between the kth first-class entity and those second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 5, it is a block diagram of an electronic device according to the entity information processing method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic apparatus includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor executes the entity information processing method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the entity information processing method provided by the present application.
The memory 702, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the information processing method in the embodiments of the present application (e.g., the recognition module, the clustering module, the entity name determination module, the training set construction module, and the model training module shown in fig. 4). The processor 701 executes various functional applications of the server and data processing by executing non-transitory software programs, instructions, and modules stored in the memory 702, that is, implements the information processing method in the above-described method embodiment.
The memory 702 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device of the information processing method, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 702 may optionally include a memory remotely located from the processor 701, and such remote memory may be connected to the electronic device of the information processing method through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the information processing method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 5 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the technical solution of the embodiments of the present application, candidate entity names corresponding to the document materials are determined based on the document materials of a department, with the department as the unit of input, and one or more target entity names corresponding to each department in the relationship graph are then determined based on those candidate entity names. Thus, the target entity names of the departments contained in the relationship graph can be determined simply by collecting the departments' document materials. This avoids the low efficiency, poor timeliness, and inaccurate results of manual analysis, ensuring the processing efficiency and accuracy of obtaining the target entity names, and further ensuring the efficiency and accuracy of building or updating the relationship graph.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. An entity information processing method includes:
identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
2. The method according to claim 1, wherein the identifying N document materials of the target department to obtain candidate entity names corresponding to the N document materials respectively comprises:
inputting a jth document material in the N document materials of the target department and a target department corresponding to the jth document material into a preset model to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
3. The method of claim 2, wherein the method further comprises:
acquiring historical candidate entity names respectively corresponding to a plurality of departments;
matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department;
and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
4. The method of claim 3, wherein the method further comprises:
and training the preset model based on the historical document materials of all departments and the corresponding historical entity names contained in the training set to obtain the trained preset model.
5. The method of claim 1, wherein the determining M candidate clusters corresponding to the target department based on the candidate entity names corresponding to the N document materials respectively comprises:
screening the N candidate entity names respectively corresponding to the N document materials to obtain L candidate entity names; l is an integer of 1 or more and N or less;
clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
6. The method according to claim 1, wherein the determining, based on the candidate entity names respectively included in the M candidate clusters, the target entity names of the M first-class entities corresponding to the target department in the relationship graph includes:
acquiring frequency information of candidate entity names contained in the ith candidate cluster in the M candidate clusters; wherein i is an integer of 1 or more and M or less;
and taking the candidate entity name with the highest frequency among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
7. The method of any of claims 1-6, wherein the method further comprises:
obtaining second-class entities associated with a kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities; establishing association relations between the kth first-class entity and the second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
8. An entity information processing apparatus comprising:
the identification module is used for identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module, configured to generate M candidate clusters corresponding to the target department based on candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and the entity name determining module is used for determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
9. The apparatus according to claim 8, wherein the identifying module is configured to input a jth document material of the N document materials of the target department and a target department corresponding to the jth document material into a preset model, so as to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the training set building module is used for acquiring historical candidate entity names respectively corresponding to a plurality of departments; matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department; and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
11. The apparatus of claim 10, wherein the apparatus further comprises:
and the model training module is used for training the preset model based on the historical document materials of all departments and the corresponding historical entity names thereof contained in the training set to obtain the trained preset model.
12. The apparatus according to claim 8, wherein the clustering module is configured to screen L candidate entity names from N candidate entity names respectively corresponding to the N document materials; l is an integer of 1 or more and N or less; clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
13. The apparatus according to claim 8, wherein the entity name determining module is configured to obtain frequency information of candidate entity names included in an ith candidate cluster of the M candidate clusters, wherein i is an integer of 1 or more and M or less; and to take the candidate entity name with the highest frequency among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and take the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
14. The apparatus of any one of claims 8-13, wherein the apparatus further comprises:
the relationship construction module is configured to obtain second-class entities associated with a kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities; and establish association relations between the kth first-class entity and the second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202011196563.4A 2020-10-30 2020-10-30 Entity information processing method, device, electronic equipment and storage medium Active CN112307134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011196563.4A CN112307134B (en) 2020-10-30 2020-10-30 Entity information processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112307134A true CN112307134A (en) 2021-02-02
CN112307134B CN112307134B (en) 2024-02-06

Family

ID=74333114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011196563.4A Active CN112307134B (en) 2020-10-30 2020-10-30 Entity information processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112307134B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118087A (en) * 2021-10-18 2022-03-01 广东明创软件科技有限公司 Entity determination method, entity determination device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877477B1 (en) * 2007-06-28 2009-01-07 주식회사 케이티 Apparatus and method for recognizing the named entity using backoff n-gram features
US20130311467A1 (en) * 2012-05-18 2013-11-21 Xerox Corporation System and method for resolving entity coreference
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN106909655A (en) * 2017-02-27 2017-06-30 中国科学院电子学研究所 Found and link method based on the knowledge mapping entity that production alias is excavated
US9785696B1 (en) * 2013-10-04 2017-10-10 Google Inc. Automatic discovery of new entities using graph reconciliation
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
US20180137404A1 (en) * 2016-11-15 2018-05-17 International Business Machines Corporation Joint learning of local and global features for entity linking via neural networks
CN110263318A (en) * 2018-04-23 2019-09-20 腾讯科技(深圳)有限公司 Processing method, device, computer-readable medium and the electronic equipment of entity name
CN110277149A (en) * 2019-06-28 2019-09-24 北京百度网讯科技有限公司 Processing method, device and the equipment of electronic health record
CN110334211A (en) * 2019-06-14 2019-10-15 电子科技大学 A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
杨一帆; 陈文亮: "A Joint Model for Entity Alias Extraction in Tourism Scenarios", Journal of Chinese Information Processing, no. 06, pages 59-67 *
熊玲; 徐增壮; 王潇斌; 洪宇; 朱巧明: "Research on an Entity Search Model Based on Coreference Resolution", Journal of Chinese Information Processing, no. 05, pages 94-101 *
陆伟; 武川: "A Survey of Entity Linking Research", Journal of the China Society for Scientific and Technical Information, no. 01, pages 107-114 *


Also Published As

Publication number Publication date
CN112307134B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN113807098B (en) Model training method and device, electronic equipment and storage medium
CN111966890B (en) Text-based event pushing method and device, electronic equipment and storage medium
CN111967262A (en) Method and device for determining entity tag
CN111428049B (en) Event thematic generation method, device, equipment and storage medium
CN111709247A (en) Data set processing method and device, electronic equipment and storage medium
CN111522967A (en) Knowledge graph construction method, device, equipment and storage medium
CN110020422A (en) Feature word determination method, apparatus, and server
CN112541359B (en) Document content identification method, device, electronic equipment and medium
CN111767334B (en) Information extraction method, device, electronic equipment and storage medium
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN110569370B (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN111538815A (en) Text query method, device, equipment and storage medium
US11468236B2 (en) Method and apparatus for performing word segmentation on text, device, and medium
CN111783861A (en) Data classification method, model training device and electronic equipment
CN112329453A (en) Sample chapter generation method, device, equipment and storage medium
CN111241302B (en) Position information map generation method, device, equipment and medium
CN111782975A (en) Retrieval method and device and electronic equipment
CN112084150A (en) Model training method, data retrieval method, device, equipment and storage medium
CN111738015A (en) Method and device for analyzing emotion polarity of article, electronic equipment and storage medium
CN113342946B (en) Model training method and device for customer service robot, electronic equipment and medium
CN112307134B (en) Entity information processing method, device, electronic equipment and storage medium
CN113361240A (en) Method, device, equipment and readable storage medium for generating target article
CN111310481B (en) Speech translation method, device, computer equipment and storage medium
CN112015866A (en) Method, device, electronic equipment and storage medium for generating synonymous text
CN111026916A (en) Text description conversion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant