CN112307134A - Entity information processing method, entity information processing device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112307134A
CN112307134A (application CN202011196563.4A; granted as CN112307134B)
Authority
CN
China
Prior art keywords
entity, candidate, target, entity names, names
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011196563.4A
Other languages
Chinese (zh)
Other versions
CN112307134B (en)
Inventor
骆金昌
万凡
王海威
王杰
陈坤斌
刘准
和为
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011196563.4A
Publication of CN112307134A
Application granted
Publication of CN112307134B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/284 Relational databases
    • G06F 16/288 Entity relationship models
    • G06F 16/367 Ontology (creation of semantic tools)
    • G06F 18/22 Pattern recognition: matching criteria, e.g. proximity measures
    • G06F 18/2321 Pattern recognition: non-hierarchical clustering using statistics or function optimisation
    • G06F 40/30 Semantic analysis (handling natural language data)
    • G06N 20/00 Machine learning
    • Y02P 90/30 Computing systems specially adapted for manufacturing


Abstract

The disclosure provides an entity information processing method, an entity information processing device, electronic equipment and a storage medium, and relates to the field of deep learning. The specific implementation scheme is as follows: identifying N document materials of a target department to obtain candidate entity names respectively corresponding to the N document materials, N being an integer greater than or equal to 1; generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials, M being an integer greater than or equal to 1; and determining, based on the candidate entity names respectively contained in the M candidate clusters, target entity names of M first-class entities corresponding to the target department in the relational graph.

Description

Entity information processing method, entity information processing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to the field of deep learning.
Background
Relationship graphs, which contain entities of a first type (i.e., "matters"), entities of a second type (i.e., "people"), and the relationships between them, are increasingly used in enterprises. A relationship graph can provide rich functionality, such as searching for the person in charge of a matter and viewing that person's information. However, how to construct the first type of entities in the relationship graph efficiently and accurately remains a problem to be solved.
Disclosure of Invention
The disclosure provides an entity information processing method, an entity information processing device, an electronic device and a storage medium.
According to a first aspect of the present disclosure, there is provided an entity information processing method, including:
identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
According to a second aspect of the present disclosure, there is provided an entity information processing apparatus including:
the identification module is used for identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module, configured to generate M candidate clusters corresponding to the target department based on candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and the entity name determining module is used for determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the aforementioned method.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the aforementioned method.
With the method and apparatus of the present disclosure, candidate entity names can be determined from the document materials of a target department, and the target entity names of one or more first type entities corresponding to the target department in the relationship graph can then be determined from those candidate names. This avoids the low efficiency, poor timeliness, and inaccurate results of manually analyzing entity names, ensures the processing efficiency and accuracy of obtaining the target entity names, and in turn ensures the efficiency and accuracy of constructing or updating the relationship graph.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flowchart of an entity information processing method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of the process flow for constructing candidate clusters in the entity information processing method according to an embodiment of the disclosure;
FIG. 3 is a first schematic diagram of the composition structure of an entity information processing apparatus according to an embodiment of the disclosure;
FIG. 4 is a second schematic diagram of the composition structure of an entity information processing apparatus according to an embodiment of the disclosure;
FIG. 5 is a block diagram of an electronic device for implementing the entity information processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to assist understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
An embodiment of the present disclosure provides an entity information processing method, as shown in fig. 1, including:
s101: identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
s102: generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
s103: and determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
Embodiments of the present disclosure may be applied to an electronic device, such as a server or a terminal device.
The target department may be any one of a plurality of departments in an organization or enterprise, and each department can be processed using the scheme provided in this embodiment. Any one department is referred to here as the target department; the remaining departments are processed in the same way as the target department and are not described separately.
The N document materials of the target department may include at least one type of document material, such as weekly reports and promotional materials of the target department.
The N document materials may be obtained by collecting materials within the department: for example, all document materials uploaded by the employees of the target department may be collected as the N document materials, or the N document materials may be randomly sampled from the document materials uploaded by the employees.
Identifying N document materials of a target department to obtain candidate entity names of first-class entities corresponding to the N document materials, which may include: and respectively inputting the N document materials of the target department into a preset model to obtain candidate entity names respectively output by the preset model.
Generating M candidate clusters corresponding to the target department based on the candidate entity names corresponding to the N document materials, respectively, may include: and clustering the candidate entity names respectively corresponding to the N document materials of the target department to obtain the M candidate clusters corresponding to the target department.
Further, a candidate entity name may be selected from one or more candidate entity names included in each candidate cluster as a target entity name corresponding to each candidate cluster; the target entity name is taken as the target entity name of a first type entity of the target department.
It should be understood that the specific value of M may differ according to the actual situation. If the target department ultimately yields the target entity name of a single first type entity, M equals 1; if the target department corresponds to two or more first type entities, the corresponding target entity names number two or more, and M is 2 or greater. Not all possible cases are enumerated here.
The first type of entity may refer to a "matter" entity in the relationship graph; a "matter" may cover various contents, for example projects, platforms, and tools. The first type of entity may comprise one or more entities; that is, one or more matter entities may be included in the relationship graph.
The target entity name or the candidate entity name of the corresponding first-type entity may refer to an attribute or information of "things" to be used in the relationship graph, for example, the entity name of "things" may be: the name of the project, the name of the platform, the name of the tool, etc.
With this scheme, candidate entity names can be determined from document materials collected department by department, and one or more target entity names corresponding to each department in the relationship graph can be determined from those candidate names. The target entity names contained in the relationship graph can thus be determined simply by collecting the departments' document materials, avoiding the low efficiency, poor timeliness, and inaccurate results of manual analysis, ensuring the efficiency and accuracy of obtaining the target entity names, and in turn the efficiency and accuracy of constructing or updating the relationship graph.
Specifically, in the above S101, the identifying N document materials of the target department to obtain candidate entity names corresponding to the N document materials includes:
inputting a jth document material in the N document materials of the target department and a target department corresponding to the jth document material into a preset model to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
The N document materials may be extracted from documents within the enterprise, including weekly reports, promotional materials, briefing reports, project filing materials, and so on. Because these materials exist in large quantities within the enterprise, they can be obtained at very low cost. Moreover, such materials are often timely; weekly reports, for example, are written every week, so collecting this kind of document material also satisfies timeliness requirements.
The jth document material is any one of the N document materials. Each of the N document materials is processed in the same way to obtain its corresponding candidate entity name, so the processing of all N document materials is not repeated one by one.
It should be understood that the input information of the preset model may specifically be the name of the target department and the jth document material; furthermore, the jth document material may be pre-segmented to obtain at least one segmented sentence, and the at least one segmented sentence and the name of the target department are used as input information of the preset model; correspondingly, the output information of the preset model may be a candidate entity name.
Therefore, the embodiment provides that the document material is analyzed by adopting the preset model to obtain the candidate entity name corresponding to the document material, so that the problems of low efficiency and poor accuracy caused by manual analysis or simple character matching can be avoided, the accuracy of subsequently determining the target entity name is improved, and the processing efficiency is improved.
Further, for the preset model, the preset model may be obtained by training sample data included in a training set. Regarding the way of constructing the training set, it may include:
acquiring historical candidate entity names respectively corresponding to a plurality of departments;
matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department;
and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
Specifically, the historical document material may be obtained by extracting the historical document material from the documents inside the enterprise; for example, historical documentation material including department project names may be included, such as weekly reports, promotional material, and the like. Because such historic document material exists in large quantities within the enterprise, it can be obtained at a very low cost.
Generating a training set based on the historic document materials of each department and the historic entity names corresponding to the historic document materials, wherein each historic document material, the historic entity name corresponding to the historic document material and the corresponding department can be used as each sample data, and each sample data is added to the training set. Finally, the training set may include all of the above sample data.
It should be noted that, when constructing the training set, the historical entity names of a department are matched against the historical document materials of that same department to label the materials; this reduces noise and improves the quality of the training set. Determining the historical entity name corresponding to each historical document material amounts to labeling that material, that is, using the matched historical entity name as the label of the historical document material. In the related art, sample data in a training set is generally labeled manually, which is costly; in this embodiment, the labeling can be completed automatically simply by matching historical entity names and historical document materials within the same department, avoiding the excessive cost of manual labeling while being both faster and more accurate.
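As a minimal sketch of this department-level matching (the function and data names below are illustrative, not from the disclosure), each historical document material is labeled only with the historical entity names of its own department that actually appear in its text:

```python
def weak_label(dept_documents, dept_entity_names):
    """Build training samples by matching each department's historical
    entity names against that same department's document materials.

    dept_documents: dict mapping department -> list of document texts
    dept_entity_names: dict mapping department -> list of known
        historical entity names for that department
    Returns a list of (document, department, matched_names) samples;
    documents with no match are skipped, which reduces noise.
    """
    samples = []
    for dept, docs in dept_documents.items():
        names = dept_entity_names.get(dept, [])
        for doc in docs:
            matched = [n for n in names if n in doc]
            if matched:  # only matched materials become labeled samples
                samples.append((doc, dept, matched))
    return samples
```

Note that "Alpha platform" being an entity of one department does not label another department's documents: the match is always restricted to the same department, which is the noise-reduction point made above.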
Therefore, the labeling work of the data of the training set is automatically completed by equipment, and as the historical entity names of the same department are adopted to label the historical document materials of the same department during the labeling of the sample data, the department is used as the granularity of the information or as the global information, the effect of entity extraction can be improved, the quality of the sample data of the training set can be improved, and meanwhile, the noise can be reduced.
And then, training the preset model based on the historical document materials of all departments and the corresponding historical entity names contained in the training set to obtain the trained preset model.
That is, the preset model is trained on the constructed training set, whose sample data combine the historical document materials of each of a plurality of departments with the historical entity names (such as project names) of the corresponding department. During training, the historical document material in each sample may be split into one or more sentences; the split sentences together with the department name serve as the input of the preset model, and the historical entity name corresponding to that historical document material serves as the expected output. For example, when training the preset model, the input layer and its features include the sentences of the historical document material and the department, expressed as: sentence + <SEP> + department.
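The input format above can be sketched as follows (a hedged illustration: the helper names and the naive sentence splitter are my own, not from the disclosure; a production system would use a proper sentence segmenter):

```python
def split_sentences(document: str) -> list[str]:
    """Naively split a document material into sentences, treating
    Chinese full stops and Western periods as boundaries."""
    parts = []
    for chunk in document.replace("\u3002", ".").split("."):
        chunk = chunk.strip()
        if chunk:
            parts.append(chunk)
    return parts


def build_model_inputs(document: str, department: str) -> list[str]:
    """Build one `sentence + <SEP> + department` input per sentence,
    matching the input format described for the preset model."""
    return [f"{s} <SEP> {department}" for s in split_sentences(document)]
```

For example, a two-sentence weekly report for department "D" yields two model inputs, each carrying the department as global context.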
The convergence condition in the training of the preset model may be that the number of iterations reaches a preset threshold and/or that the loss function is smaller than the preset threshold. The specific convergence conditions may also include more, and this embodiment is not exhaustive.
The preset model may be constructed using BERT (Bidirectional Encoder Representations from Transformers) together with a Conditional Random Field (CRF) layer. Using the pre-trained language model BERT for semantic vector extraction enables accurate semantic representation of sentences and improves semantic transferability, so that good results can be obtained even with a relatively small training set.
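The disclosure gives no implementation details for the BERT+CRF tagger, but the role of the CRF layer can be illustrated in isolation (all tag names and scores below are invented): Viterbi decoding picks the tag sequence maximizing emission plus transition scores, which is how a CRF turns per-token scores into a consistent BIO label sequence.

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag path.

    emissions: list (one entry per token) of dicts tag -> score,
        e.g. the per-token scores a BERT encoder would produce.
    transitions: dict (prev_tag, cur_tag) -> score; unlisted pairs
        default to 0.0, and a strongly negative score forbids a pair
        (e.g. an "I" tag directly after "O" in BIO tagging).
    """
    tags = list(emissions[0])
    score = {t: emissions[0][t] for t in tags}
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointer = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: score[p] + transitions.get((p, cur), 0.0))
            new_score[cur] = score[prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointer[cur] = prev
        score = new_score
        backpointers.append(pointer)
    best = max(tags, key=score.get)
    path = [best]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path
```

With a transition score of -100 for ("O", "I"), the decoder prefers "B" at the first token even when "O" scores slightly higher there, because only the B-I path remains consistent.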
It can be seen that, in the processing of the training preset model, since the labeling work of the data of the training set is automatically completed by the equipment, and the historical entity name of the same department is adopted to label the historical document material of the same department during the labeling of the sample data, the quality of the sample data of the training set can be improved, the noise can be reduced, and the identification accuracy of the finally obtained preset model can be ensured when the training of the preset model is performed based on the training set.
By adopting the processing, the currently input document materials can be analyzed based on the preset model, the entity name corresponding to each currently input document material is obtained, and the entity name is used as the candidate entity name of each document material. Then, the foregoing processing of S102 is executed, and based on the candidate entity names respectively corresponding to the N document materials, M candidate clusters corresponding to the target department are determined, as shown in fig. 2, which may include:
s201: screening the N candidate entity names respectively corresponding to the N document materials to obtain L candidate entity names;
s202: clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
As to S201, the following processing modes may be used:
Mode 1: acquiring frequency information of the N candidate entity names, and selecting the L candidate entity names whose frequency information is greater than a preset frequency threshold;
or,
Mode 2: filtering the N candidate entity names of the N document materials based on a preset rule, and retaining the L candidate entity names that do not satisfy the preset rule;
or,
Mode 3: combining mode 1 with mode 2, which may be:
filtering the N candidate entity names of the N document materials based on the preset rule and retaining at least one candidate entity name that does not satisfy the preset rule; then acquiring frequency information of the retained candidate entity names, and selecting from them the L candidate entity names whose frequency information is greater than the preset frequency threshold.
In the method 1, firstly, frequency statistics is performed on the candidate entity names of each department to obtain frequency information corresponding to each candidate entity name, and then the low-frequency candidate entity names are filtered by combining the frequency information. Therefore, the accuracy of subsequent clustering can be improved.
The preset frequency threshold may be set according to actual conditions, for example, 3 times may be used as the preset frequency threshold, or 4 times may be used as the preset frequency threshold.
In mode 2, the preset rule may include: the same as the preset keyword. The preset keyword may be set according to an actual situation, for example, "commercialization" may be used as a preset keyword, and accordingly, the candidate entity name including the preset keyword "commercialization" is deleted.
In mode 3, the two modes are used in combination: the candidate entity names satisfying the preset rule are deleted first, and the low-frequency candidate entity names are then filtered out. The order may also be reversed: after filtering out the candidate entity names whose frequency is below the preset frequency threshold, the names satisfying the preset rule are deleted from the remainder. Either way, the L candidate entity names corresponding to the target department are finally obtained.
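Mode 3 can be sketched with `collections.Counter` (the threshold value and banned-keyword list below are illustrative, echoing the "commercialization" example above):

```python
from collections import Counter


def filter_candidates(candidate_names, banned_keywords, min_frequency=3):
    """Filter raw candidate entity names in two steps.

    1. Rule filtering: drop any name containing a banned keyword
       (e.g. "commercialization" in the example above).
    2. Frequency filtering: keep only names whose occurrence count
       exceeds the preset frequency threshold.
    Returns the surviving names with their occurrence counts.
    """
    kept = [n for n in candidate_names
            if not any(k in n for k in banned_keywords)]
    counts = Counter(kept)
    return {name: c for name, c in counts.items() if c > min_frequency}
```

With a threshold of 3, a name seen four times survives while a name seen twice is dropped as low-frequency noise.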
In S202, clustering the L candidate entity names to obtain the M candidate clusters corresponding to the target department may specifically include: performing similarity calculation on the L candidate entity names, and adding candidate entity names whose distance is smaller than a preset threshold (i.e., whose similarity is sufficiently high) into the same cluster, finally obtaining the M candidate clusters corresponding to the target department.
Further, the similarity calculation may be a calculation of edit distance similarity and/or semantic similarity. Accordingly, the preset threshold may include at least one of a preset edit distance threshold and a preset semantic similarity threshold.
For example, the DBSCAN nearest-neighbor clustering algorithm may be used to cluster the candidate entity names, which addresses the entity fusion problem. The similarity of candidate entity names may be the literal edit distance: when the edit distance between two candidate entity names is smaller than the preset edit distance threshold, they are clustered together.
In another example, the semantic similarity may be calculated by using a Deep Structured Semantic Model (DSSM) or other models, and names of candidate entities with semantic distances smaller than a preset semantic similarity threshold are taken as the same class and grouped under the same cluster.
In another example, when the edit distance between any two candidate entity names is smaller than the preset edit distance similarity threshold and the semantic distance is smaller than the preset semantic similarity threshold, the two candidate entity names are clustered into the same cluster.
Of course, other similarity calculation may also be adopted to determine the similarity between the candidate entity names, which all may be within the protection scope of the present embodiment, and this is not exhaustive here.
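A minimal sketch of the edit-distance variant follows (a transitive single-linkage grouping stands in here for DBSCAN, and the distance threshold is illustrative, not from the disclosure):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cluster_names(names, max_distance=2):
    """Group candidate entity names whose pairwise edit distance is
    within the threshold; linked pairs are merged transitively, so each
    resulting cluster plays the role of one first-type entity."""
    clusters = []
    for name in names:
        merged = None
        for cluster in clusters:
            if any(edit_distance(name, m) <= max_distance for m in cluster):
                if merged is None:
                    cluster.append(name)
                    merged = cluster
                else:  # name links two existing clusters: merge them
                    merged.extend(cluster)
                    cluster.clear()
        if merged is None:
            clusters.append([name])
    return [c for c in clusters if c]
```

So near-duplicate names such as "Alpha platform" and "Alpha platforms" fall into one cluster, while unrelated names stay separate.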
In this way, the candidate entity names obtained from the document materials are filtered in advance, and the filtered candidate entity names are further clustered to obtain the M candidate clusters corresponding to the target department.
In S103, determining target entity names of M first-class entities corresponding to the target department in the relationship graph based on the candidate entity names included in the M candidate clusters, respectively, includes:
acquiring frequency information of candidate entity names contained in the ith candidate cluster in the M candidate clusters;
and taking the candidate entity name with the highest frequency information among the candidate entity names contained in the ith candidate cluster as the target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
The ith candidate cluster may be any one of the M candidate clusters, and since the corresponding target entity name is determined in the same manner for each candidate cluster, only one of the candidate clusters is described here, and the processing manners of the remaining candidate clusters are the same, which is not described in detail.
With the above processing, based on the frequency information of each candidate entity name in the ith candidate cluster, the candidate entity name with the highest frequency of occurrence is selected as the target entity standard name of the ith first-class entity corresponding to that cluster, and the remaining candidate entity names in the cluster are all used as target entity aliases of the ith first-class entity. In this way, each candidate cluster may yield one or more target entity aliases for its corresponding first-class entity, but only one target entity standard name.
Because a plurality of candidate clusters can be constructed for a target department, each candidate cluster can be considered to correspond to one first-class entity, and the target entity standard name and target entity aliases of that first-class entity can be determined based on the candidate cluster; finally, the target entity standard names and target entity aliases corresponding to the plurality of first-class entities of the target department can be obtained.
Therefore, with this scheme, the target entity standard name and one or more target entity aliases of a matter can be determined based on the constructed candidate clusters, which provides a more accurate expression for constructing the matter entities in the relationship graph. Moreover, because target entity alias information is added, more reference information is available for subsequent generalized searches, making the relationship graph more accurate and more convenient to use.
Based on the above processing, the target entity names of the first-class entities in the relationship graph can be obtained. Further, the second-class entities associated with each matter can be obtained, so that the associations between the target entity names of the first-class entities and the related second-class entities are constructed in the relationship graph. The method specifically includes:
acquiring second-class entities associated with the kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities, and establishing association relations between the kth first-class entity and those second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
Specifically, each of the M first-class entities may include a target entity standard name and one or more target entity aliases; one or more document materials corresponding to the standard name of the target entity and the alias names of the one or more target entities can be searched, and one or more second-class entities are extracted from the one or more document materials. This allows the acquisition of the relevant second type entities having a relationship with each of the first type entities.
Wherein the second type of entity may specifically refer to a "human" entity in the relationship graph.
Further, relationships between the first type entities and related second type entities having an association relationship therewith may be established in the relationship graph. That is, one or more second-class entities having a relationship with each first-class entity are obtained, and then the association relationship between each first-class entity and the one or more second-class entities related to the first-class entity is added to the relationship graph.
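A non-limiting sketch of this linking step is shown below. The person-extraction step is assumed to have been performed elsewhere (each document already carries its extracted person names); the function and entity identifiers are illustrative only:

```python
def link_entities(first_entities, documents):
    """first_entities: {entity_id: [standard name, *aliases]}.
    documents: list of (text, persons) pairs, where `persons` is
    the list of person names already extracted from that text.
    Returns graph edges (first-class entity id, person name)."""
    edges = set()
    for eid, names in first_entities.items():
        for text, persons in documents:
            # A document is associated with the first-class entity
            # if it mentions the standard name or any alias.
            if any(name in text for name in names):
                edges.update((eid, person) for person in persons)
    return edges

# Example: one matter entity with an alias, matched against two
# hypothetical documents.
edges = link_entities(
    {"E1": ["ops review", "operations review"]},
    [("weekly operations review hosted", ["Alice"]),
     ("billing notes", ["Bob"])])
```

The resulting edges are the association relations added between first-class and second-class entities in the relationship graph.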
Wherein the second-class entity may be a person, and the person may be represented in the relationship graph by the person's name; in addition, a second-class entity such as a person may also include related attribute information or entity information, for example the person's position, title, and the like, which are not exhaustively listed here.
Therefore, the related second-class entities in the relationship graph can be determined through the matter names, thereby completing the construction of the relationship graph. Moreover, because the construction of the matter names and the acquisition of the related second-class entities are based on the same document materials, the relations between the matters and the related second-class entities in the relationship graph can be constructed simply by analyzing the matter entities in advance, which improves the efficiency of constructing the relationship graph.
An embodiment of the present invention further provides an entity information processing apparatus, as shown in fig. 3, including:
the identification module 31 is configured to identify N document materials of a target department, and obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module 32, configured to generate M candidate clusters corresponding to the target department based on candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and an entity name determining module 33, configured to determine, based on candidate entity names included in the M candidate clusters, target entity names of the M first-class entities corresponding to the target department in the relationship graph.
The identification module 31 is configured to input a jth document material of the N document materials of the target department and a target department corresponding to the jth document material into a preset model, so as to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
On the basis of fig. 3, the information processing apparatus provided in the present embodiment, as shown in fig. 4, further includes:
a training set constructing module 34, configured to obtain historical candidate entity names corresponding to multiple departments respectively; matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department; and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
As shown in fig. 4, the apparatus further includes:
and the model training module 35 is configured to train the preset model based on the historical document materials of the departments and the historical entity names corresponding to the historical document materials included in the training set, so as to obtain the trained preset model.
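The training-set construction performed by module 34 can be sketched as follows (an illustrative, non-limiting example; the department and document names are hypothetical). Each historical document is matched against the historical candidate entity names of its own department, and the matched names serve as labels:

```python
def build_training_set(dept_docs, dept_candidate_names):
    """dept_docs: {department: [historical document text, ...]}.
    dept_candidate_names: {department: [historical candidate
    entity name, ...]}. Each document is labeled with every
    historical name of its department that it mentions, yielding
    (department, document, matched names) training samples."""
    samples = []
    for dept, docs in dept_docs.items():
        names = dept_candidate_names.get(dept, [])
        for doc in docs:
            matched = [n for n in names if n in doc]
            if matched:  # keep only documents with at least one label
                samples.append((dept, doc, matched))
    return samples

samples = build_training_set(
    {"infra": ["the deploy pipeline failed", "lunch menu"]},
    {"infra": ["deploy pipeline"]})
```

Such samples are then used by module 35 to train the preset model that maps a (document, department) pair to candidate entity names.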
The clustering module 32 is configured to screen N candidate entity names corresponding to the N document materials, respectively, to obtain L candidate entity names; l is an integer of 1 or more and N or less; clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
The entity name determining module 33 is configured to obtain frequency information of the candidate entity names included in an ith candidate cluster of the M candidate clusters, wherein i is an integer of 1 or more and M or less; and to take the candidate entity name with the highest frequency among the candidate entity names contained in the ith candidate cluster as the target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and take the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
As shown in fig. 4, the apparatus further includes:
a relationship construction module 36, configured to obtain second-class entities associated with a kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities; and establish association relations between the kth first-class entity and those second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 5, it is a block diagram of an electronic device according to the entity information processing method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic apparatus includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor executes the entity information processing method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the entity information processing method provided by the present application.
The memory 702, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the information processing method in the embodiments of the present application (e.g., the recognition module, the clustering module, the entity name determination module, the training set construction module, and the model training module shown in fig. 4). The processor 701 executes various functional applications of the server and data processing by executing non-transitory software programs, instructions, and modules stored in the memory 702, that is, implements the information processing method in the above-described method embodiment.
The memory 702 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device of the information processing method, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 702 may optionally include a memory remotely located from the processor 701, and such remote memory may be connected to the electronic device of the information processing method through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the information processing method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 5 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the technical solution of the embodiments of the present application, candidate entity names corresponding to the document materials are determined based on the document materials of a department, with the department as the unit of input, and one or more target entity names corresponding to each department in the relationship graph are then determined based on those candidate entity names. Thus, the target entity names of the departments contained in the relationship graph can be determined simply by collecting the departments' document materials. This avoids the low efficiency, poor timeliness, and inaccurate results of manual analysis, ensuring the processing efficiency and accuracy of obtaining the target entity names, and further ensuring the efficiency and accuracy of building or updating the relationship graph.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. An entity information processing method includes:
identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
generating M candidate clusters corresponding to the target department based on the candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
2. The method according to claim 1, wherein the identifying N document materials of the target department to obtain candidate entity names corresponding to the N document materials respectively comprises:
inputting a jth document material in the N document materials of the target department and a target department corresponding to the jth document material into a preset model to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
3. The method of claim 2, wherein the method further comprises:
acquiring historical candidate entity names respectively corresponding to a plurality of departments;
matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department;
and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
4. The method of claim 3, wherein the method further comprises:
and training the preset model based on the historical document materials of all departments and the corresponding historical entity names contained in the training set to obtain the trained preset model.
5. The method of claim 1, wherein the determining M candidate clusters corresponding to the target department based on the candidate entity names corresponding to the N document materials respectively comprises:
screening the N candidate entity names respectively corresponding to the N document materials to obtain L candidate entity names; l is an integer of 1 or more and N or less;
clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
6. The method according to claim 1, wherein the determining, based on the candidate entity names respectively included in the M candidate clusters, the target entity names of the M first-class entities corresponding to the target department in the relationship graph includes:
acquiring frequency information of candidate entity names contained in the ith candidate cluster in the M candidate clusters; wherein i is an integer of 1 or more and M or less;
and taking the candidate entity name with the highest frequency among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and taking the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
7. The method of any of claims 1-6, wherein the method further comprises:
obtaining second-class entities associated with a kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities; establishing association relations between the kth first-class entity and the second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
8. An entity information processing apparatus comprising:
the identification module is used for identifying N document materials of a target department to obtain candidate entity names corresponding to the N document materials respectively; n is an integer greater than or equal to 1;
a clustering module, configured to generate M candidate clusters corresponding to the target department based on candidate entity names respectively corresponding to the N document materials; m is an integer greater than or equal to 1;
and the entity name determining module is used for determining target entity names of M first-class entities corresponding to the target department in the relational graph based on the candidate entity names respectively contained in the M candidate clusters.
9. The apparatus according to claim 8, wherein the identifying module is configured to input a jth document material of the N document materials of the target department and a target department corresponding to the jth document material into a preset model, so as to obtain a candidate entity name corresponding to the jth document material output by the preset model; wherein j is an integer of 1 or more and N or less.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the training set building module is used for acquiring historical candidate entity names respectively corresponding to a plurality of departments; matching the historical document materials of each department in the plurality of departments with the historical candidate entity names of the corresponding department to obtain the historical entity names corresponding to the historical document materials of each department; and generating a training set based on the historical document materials of all departments and the corresponding historical entity names.
11. The apparatus of claim 10, wherein the apparatus further comprises:
and the model training module is used for training the preset model based on the historical document materials of all departments and the corresponding historical entity names thereof contained in the training set to obtain the trained preset model.
12. The apparatus according to claim 8, wherein the clustering module is configured to screen L candidate entity names from N candidate entity names respectively corresponding to the N document materials; l is an integer of 1 or more and N or less; clustering the L candidate entity names to obtain M candidate clusters corresponding to the target department; wherein different candidate clusters of the M candidate clusters contain different candidate entity names.
13. The apparatus according to claim 8, wherein the entity name determining module is configured to obtain frequency information of candidate entity names included in an ith candidate cluster of the M candidate clusters, wherein i is an integer of 1 or more and M or less; and to take the candidate entity name with the highest frequency among the candidate entity names contained in the ith candidate cluster as a target entity standard name of the ith first-class entity corresponding to the ith candidate cluster, and take the entity names in the ith candidate cluster other than the target entity standard name as target entity aliases of the ith first-class entity.
14. The apparatus of any one of claims 8-13, wherein the apparatus further comprises:
the relationship construction module is configured to obtain second-class entities associated with a kth first-class entity from the document materials corresponding to the target entity names of the kth first-class entity among the M first-class entities; and establish association relations between the kth first-class entity and the second-class entities in the relationship graph based on the second-class entities associated with the kth first-class entity; wherein k is an integer of 1 or more and M or less.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202011196563.4A 2020-10-30 2020-10-30 Entity information processing method, device, electronic equipment and storage medium Active CN112307134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011196563.4A CN112307134B (en) 2020-10-30 2020-10-30 Entity information processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112307134A true CN112307134A (en) 2021-02-02
CN112307134B CN112307134B (en) 2024-02-06

Family

ID=74333114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011196563.4A Active CN112307134B (en) 2020-10-30 2020-10-30 Entity information processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112307134B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118087A (en) * 2021-10-18 2022-03-01 广东明创软件科技有限公司 Entity determination method, entity determination device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877477B1 (en) * 2007-06-28 2009-01-07 주식회사 케이티 Apparatus and method for recognizing the named entity using backoff n-gram features
US20130311467A1 (en) * 2012-05-18 2013-11-21 Xerox Corporation System and method for resolving entity coreference
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN106909655A (en) * 2017-02-27 2017-06-30 中国科学院电子学研究所 Found and link method based on the knowledge mapping entity that production alias is excavated
US9785696B1 (en) * 2013-10-04 2017-10-10 Google Inc. Automatic discovery of new entities using graph reconciliation
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
US20180137404A1 (en) * 2016-11-15 2018-05-17 International Business Machines Corporation Joint learning of local and global features for entity linking via neural networks
CN110263318A (en) * 2018-04-23 2019-09-20 腾讯科技(深圳)有限公司 Processing method, device, computer-readable medium and the electronic equipment of entity name
CN110277149A (en) * 2019-06-28 2019-09-24 北京百度网讯科技有限公司 Processing method, device and the equipment of electronic health record
CN110334211A (en) * 2019-06-14 2019-10-15 电子科技大学 A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
杨一帆; 陈文亮: "A Joint Model for Entity Alias Extraction in Tourism Scenarios", Journal of Chinese Information Processing, no. 06, pages 59-67 *
熊玲; 徐增壮; 王潇斌; 洪宇; 朱巧明: "Research on an Entity Search Model Based on Coreference Resolution", Journal of Chinese Information Processing, no. 05, pages 94-101 *
陆伟; 武川: "A Survey of Entity Linking Research", Journal of the China Society for Scientific and Technical Information, no. 01, pages 107-114 *


Also Published As

Publication number Publication date
CN112307134B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN113807098B (en) Model training method and device, electronic equipment and storage medium
CN111966890B (en) Text-based event pushing method and device, electronic equipment and storage medium
CN111967262A (en) Method and device for determining entity tag
CN111428049B (en) Event thematic generation method, device, equipment and storage medium
CN111709247A (en) Data set processing method and device, electronic equipment and storage medium
CN111522967A (en) Knowledge graph construction method, device, equipment and storage medium
CN110020422A (en) Feature word determination method, apparatus, and server
CN112541359B (en) Document content identification method, device, electronic equipment and medium
CN111767334B (en) Information extraction method, device, electronic equipment and storage medium
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN110569370B (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN111538815A (en) Text query method, device, equipment and storage medium
US11468236B2 (en) Method and apparatus for performing word segmentation on text, device, and medium
CN111783861A (en) Data classification method, model training device and electronic equipment
CN112329453A (en) Sample chapter generation method, device, equipment and storage medium
CN111241302B (en) Position information map generation method, device, equipment and medium
CN111782975A (en) Retrieval method and device and electronic equipment
CN112084150A (en) Model training method, data retrieval method, device, equipment and storage medium
CN111738015A (en) Method and device for analyzing emotion polarity of article, electronic equipment and storage medium
CN113342946B (en) Model training method and device for customer service robot, electronic equipment and medium
CN112307134B (en) Entity information processing method, device, electronic equipment and storage medium
CN113361240A (en) Method, device, equipment and readable storage medium for generating target article
CN111310481B (en) Speech translation method, device, computer equipment and storage medium
CN112015866A (en) Method, device, electronic equipment and storage medium for generating synonymous text
CN111026916A (en) Text description conversion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant