CN117725108A

CN117725108A - Data mining method, device, electronic equipment and computer readable storage medium

Info

Publication number: CN117725108A
Application number: CN202310499102.1A
Authority: CN
Inventors: 李�浩
Original assignee: Xiaohongshu Technology Co ltd
Current assignee: Xiaohongshu Technology Co ltd
Priority date: 2023-05-05
Filing date: 2023-05-05
Publication date: 2024-03-19

Abstract

In the embodiments of the application, text information to be mined is acquired, a plurality of entities in the text information are extracted, and candidate entities corresponding to each entity are recalled; at least one group of positive example entities having an entity relation is determined from the plurality of entities based on the candidate entities corresponding to the entities, wherein each positive example entity in each group matches a candidate entity corresponding to another positive example entity in the group; at least one group of negative example entities having no entity relation is selected from the entities other than the positive example entities; and a sample data set is obtained based on the at least one group of positive example entities and the at least one group of negative example entities, and a preset entity relation model is trained according to the sample data set, wherein the entity relation model is used for identifying the entity relation between two entities input into the model. The embodiments of the application can reduce the human resources consumed in mining data for model training.

Description

Data mining method, device, electronic equipment and computer readable storage medium

Technical Field

The present disclosure relates to the field of data mining technologies, and in particular, to a data mining method, a data mining device, an electronic device, and a computer readable storage medium.

Background

At present, neural network models are increasingly widely applied in daily life, and the sample data required to train a neural network model must be mined by a data mining method. For example, a plurality of entities may be mined from a text to serve as the sample data. However, because entity information is not provided in a structured manner in the text, whether relations exist among different entities in the text cannot be accurately identified. As a result, the relations among different entities can only be labeled manually, so that considerable human resources must be spent on mining data for model training.

Disclosure of Invention

The embodiment of the application provides a data mining method, a data mining device, electronic equipment and a computer readable storage medium, which can reduce human resources spent on mining data during model training.

In a first aspect, an embodiment of the present application provides a data mining method, where the method includes:

acquiring text information to be mined, extracting a plurality of entities in the text information, and recalling candidate entities corresponding to each entity;

determining at least one set of positive example entities with entity relation from the entities based on candidate entities corresponding to the entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity;

selecting at least one group of negative example entities without entity relation from the entities except the positive example entities;

and obtaining a sample data set based on at least one group of the positive example entities and at least one group of the negative example entities, and training a preset entity relation model according to the sample data set, wherein the entity relation model is used for identifying entity relations between two entities input into the model.

In a second aspect, an embodiment of the present application further provides a data mining apparatus, where the apparatus includes:

the information acquisition module is used for acquiring text information to be mined, extracting a plurality of entities in the text information and recalling candidate entities corresponding to each entity;

the entity determining module is used for determining at least one group of positive example entities with entity relation from the entities based on candidate entities corresponding to the entities, wherein the positive example entities and the candidate entities corresponding to the positive example entities in each group are matched with each other;

the entity selection module is used for selecting at least one group of negative example entities without entity relation from the entities except the positive example entities;

the model training module is used for obtaining a sample data set based on at least one group of positive example entities and at least one group of negative example entities, and training a preset entity relation model according to the sample data set, wherein the entity relation model is used for identifying entity relations between two entities input into the model.

In a third aspect, embodiments of the present application further provide an electronic device, including a processor and a memory storing a plurality of instructions; the processor loads the instructions from the memory to perform the steps in any of the data mining methods provided by the embodiments of the present application.

In a fourth aspect, embodiments of the present application further provide a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform steps in any of the data mining methods provided by the embodiments of the present application.

In the embodiments of the application, text information to be mined is obtained, a plurality of entities in the text information are extracted, and candidate entities corresponding to each entity are recalled; at least one set of positive example entities having an entity relation is determined from the entities based on the candidate entities corresponding to the entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity; at least one set of negative example entities having no entity relation is selected from the entities other than the positive example entities; and a sample data set is obtained based on the at least one set of positive example entities and the at least one set of negative example entities, and a preset entity relation model is trained according to the sample data set, wherein the entity relation model is used for identifying the entity relation between two entities input into the model. In this way, at least one set of positive example entities having an entity relation can be mined from the text through the candidate entities corresponding to the entities in the text, and at least one set of negative example entities having no entity relation can also be mined, thereby reducing the human resources spent on mining data for model training.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of an embodiment of a data mining method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a data mining apparatus according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

Before explaining the embodiments of the present application in detail, some terms related to the embodiments of the present application are explained.

In the description of the embodiments of the present application, the terms "first," "second," and the like may be used herein to describe various concepts, but such concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiments of the application provide a data mining method, a data mining device, electronic equipment and a computer readable storage medium. Specifically, the data mining method of the embodiments of the application may be executed by an electronic device, where the electronic device may be a device such as a terminal or a server. The terminal may be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer (PC), or a personal digital assistant (PDA), and the terminal may further include a client, which may be a game application client, a browser client carrying a game program, an instant messaging client, or the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.

For example, the electronic device is illustrated by taking a terminal as an example, and the terminal can acquire text information to be mined, extract a plurality of entities in the text information, and recall candidate entities corresponding to each entity; determining at least one set of positive example entities with entity relation from the entities based on candidate entities corresponding to the entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity; selecting at least one group of negative example entities without entity relation from the entities except the positive example entities; and obtaining a sample data set based on at least one group of the positive example entities and at least one group of the negative example entities, and training a preset entity relation model according to the sample data set, wherein the entity relation model is used for identifying entity relations between two entities input into the model.

Based on the above problems, embodiments of the present application provide a data mining method, apparatus, electronic device, and computer readable storage medium, which can reduce human resources spent on mining data during model training.

The following detailed description is provided with reference to the accompanying drawings. The order in which the following embodiments are described is not intended to limit the preferred order of the embodiments. Although a logical order is depicted in the flowchart, in some cases the steps shown or described may be performed in an order different from that depicted in the figures.

In this embodiment, a terminal is taken as an example for illustration. This embodiment provides a data mining method and, as shown in FIG. 1, a specific flow of the data mining method may be as follows:

101. Acquiring text information to be mined, extracting a plurality of entities in the text information, and recalling candidate entities corresponding to each entity.

The text information to be mined refers to text information from which sample data can be mined, and an entity may be a specific object or subject in the text information, namely, an object that can exist independently. For example, the text information may be a note, and accordingly, the entities in the text information may be entity words in the note.

In this embodiment, the terminal may obtain a plurality of entities in the text information by performing entity extraction on the text information, and then obtain the candidate entities corresponding to each entity by means of recall. The candidate entities are used to indicate other entities having a relationship with the entity, so that whether a relationship exists between entities can be determined based on the candidate entities. The terminal may recall the candidate entities corresponding to each entity through Elasticsearch.

For example, when the text information is a note about a film or television work (such as the television series "A Dream of Splendor"), the above entities may be persons or works related to the film or television work, for example, "Liu Yifei" and "Zhao Paner". The candidate entities that can be recalled by "Liu Yifei" may then be entities related to the word "Liu Yifei", such as "Zhao Paner", "xiaolong" and "A Dream of Splendor", and the candidate entities that can be recalled by "Zhao Paner" may be entities related to the word "Zhao Paner", such as "Liu Yifei", "Le Jiatong", "Zhao Paner wind-month-saving dust" and "A Dream of Splendor".

Further, when recalling the candidate entity corresponding to each entity, what is recalled may be an entity identifier corresponding to the candidate entity, or may be a hyperlink ID corresponding to at least one candidate entity. A hyperlink ID may be used to jump to a relevant description page, and the relevant description page includes at least one candidate entity corresponding to the entity. For example, the relevant description page includes, but is not limited to, a summary or overview of the entity, basic information of the entity, the main works corresponding to the entity, the person relationships related to the entity, and so on, in which the candidate entities are mentioned.

Illustratively, based on the examples of "Liu Yifei" and "Zhao Paner" described above, the hyperlink IDs "136156" and "60781132" may each correspond to a different candidate entity of "Liu Yifei", and the hyperlink IDs "3277104" and "56105242" may each correspond to a different candidate entity of "Zhao Paner".

In some embodiments, after obtaining the text information to be mined, in order to facilitate accurate extraction of entities from the text information, the text information may be preprocessed, so as to facilitate entity extraction of data in the preprocessed text information, and obtain key entities in the preprocessed text information.

Specifically, the text information can be preprocessed by a note information processing module in the terminal. That is, the note information processing module removes special symbols from the texts in the text information, such as the title, body and topics, and may also normalize English letter case, traditional and simplified characters, different languages and the like, so that texts with the same semantics are converted into entity texts with the same format and the same language.
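By way of illustration only, the preprocessing described above might be sketched as follows. The function name `preprocess_text` and the choice of Unicode NFKC normalization are assumptions made for this sketch and are not taken from the application; conversion between traditional and simplified characters or between languages would require a mapping table or translation component and is not shown.

```python
import re
import unicodedata

# Assumption: anything that is not a letter, digit, underscore or whitespace
# (for example punctuation, '#' topic markers, emoji) counts as a special symbol.
_SPECIAL = re.compile(r"[^\w\s]", flags=re.UNICODE)


def preprocess_text(text: str) -> str:
    """Normalize note text (title, body, topics) before entity extraction."""
    # NFKC folds full-width letters and digits into their standard forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace special symbols with spaces so neighbouring words stay separated.
    text = _SPECIAL.sub(" ", text)
    # Unify English letter case and collapse repeated whitespace.
    return " ".join(text.lower().split())
```

With this sketch, texts that differ only in case, width or decoration are mapped to the same string, which is what allows the same entity to be extracted from differently formatted notes.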

In some embodiments, to enable entity extraction of all the information in the text information, before extracting the plurality of entities in the text information, the method may further include: if at least one text picture exists in the text information, carrying out picture character recognition on each text picture to obtain picture character information of at least one text picture; and adding the picture text information of at least one picture to a preset text adding position of the text information, wherein the text adding position can be set according to requirements, for example, a position after a title and before a text is used as the text adding position.

Specifically, the terminal may perform the above-mentioned picture character recognition by using OCR character recognition technology.

For example, when the text information is a note, a user publishing the note typically adds information about the commodity to the pictures as captions. OCR character recognition therefore needs to be performed on the information shown in the pictures in the note, and the recognized picture text information is added at a position after the title and before the body of the note.
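A minimal sketch of adding the recognized picture text at the preset position (after the title and before the body) is given below. The OCR engine is passed in as a callable because the application does not name a specific OCR library; `merge_picture_text` and its parameters are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable


def merge_picture_text(title: str, body: str, pictures: Iterable[bytes],
                       ocr: Callable[[bytes], str]) -> str:
    """Insert recognized picture text after the title and before the body.

    `ocr` is any function mapping raw picture bytes to recognized text;
    which OCR engine backs it is left open here.
    """
    recognized = (ocr(picture).strip() for picture in pictures)
    # Skip pictures in which no text was recognized.
    picture_text = [text for text in recognized if text]
    return "\n".join([title, *picture_text, body])
```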

102. And determining at least one set of positive example entities with entity relation from the plurality of entities based on candidate entities corresponding to the plurality of entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity.

It can be understood that, since the candidate entities corresponding to an entity are used to indicate other entities having a relationship with that entity, in this embodiment, when judging a given entity among the plurality of entities, the terminal only needs to compare the candidate entities of the entity currently being judged with the other entities among the plurality of entities, or with the candidate entities of those other entities, to determine which of the plurality of entities has a relationship with the given entity. At least two entities having a relationship with each other are determined as a set of positive example entities, and at least one set of positive example entities exists among the plurality of entities.

It can be appreciated that whether a relationship exists between different entities in a text cannot be accurately identified directly from the text. For example, a note may contain a description of "Liu Yifei" and "Zhao Paner" in a new drama, yet whether a relationship exists between "Liu Yifei" and "Zhao Paner" cannot be identified from the note itself. In this embodiment, the terminal identifies that a relationship exists between "Liu Yifei" and "Zhao Paner" in the note based on the candidate entities corresponding to the entities, so that a downstream entity linking task can be performed based on the mined relationship data.

In some embodiments, the terminal may determine at least one set of positive example entities by comparing entities with candidate entities. That is, determining, based on the candidate entities corresponding to the plurality of entities, at least one set of positive example entities having an entity relationship from the plurality of entities may include: based on the candidate entities corresponding to a target entity, if a matching entity that matches a candidate entity corresponding to the target entity exists among the plurality of entities, determining the target entity and the matching entity as a set of positive example entities having an entity relationship. The target entity is the entity currently being judged.

Illustratively, based on the above example, suppose the above entities are "Liu Yifei" and "Zhao Paner" and the above target entity is "Liu Yifei". The candidate entities that the target entity "Liu Yifei" can recall may be entities related to the word "Liu Yifei", such as "Zhao Paner", "xiaolong" and "A Dream of Splendor". Since the entity "Zhao Paner" matches the candidate entity "Zhao Paner" corresponding to the target entity "Liu Yifei", the entities "Liu Yifei" and "Zhao Paner" may be determined as a set of positive example entities having an entity relationship.

Illustratively, based on the above example, suppose the above entities are "Liu Yifei" and "Zhao Paner", the above target entity is "Liu Yifei", and the relevant description page of the hyperlink ID "136156", which corresponds to a candidate entity of the entity "Liu Yifei", includes "Zhao Paner". Since the entity "Zhao Paner" matches the content of the candidate entity "136156" corresponding to the target entity "Liu Yifei", the entities "Liu Yifei" and "Zhao Paner" can be determined as a set of positive example entities having an entity relationship.
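The entity-to-candidate comparison described above can be sketched as follows. This is an illustrative sketch only: `mine_positive_pairs` is a hypothetical name, the recalled candidates are assumed to be available as plain strings, and each set of positive example entities is assumed to contain exactly two entities.

```python
from itertools import permutations


def mine_positive_pairs(candidates: dict[str, set[str]]) -> set[frozenset[str]]:
    """Return unordered entity pairs in which one entity matches a candidate of the other.

    `candidates` maps each entity extracted from the text to the candidate
    entities recalled for it.
    """
    positives = set()
    for target, other in permutations(candidates, 2):
        # `other` is a matching entity if it appears among the target's candidates.
        if other in candidates[target]:
            positives.add(frozenset((target, other)))
    return positives


recalled = {
    "Liu Yifei": {"Zhao Paner", "xiaolong"},
    "Zhao Paner": {"Liu Yifei"},
}
positive_pairs = mine_positive_pairs(recalled)
```

Using an unordered `frozenset` for each pair means that a match found from either side yields the same positive example set, so duplicates are avoided.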

In some embodiments, the terminal may determine at least one set of positive example entities by comparing candidate entities with candidate entities. That is, determining, based on the candidate entities corresponding to the plurality of entities, at least one set of positive example entities having an entity relationship from the plurality of entities may further include: based on the candidate entities corresponding to a target entity, if a matching candidate entity that matches a candidate entity corresponding to the target entity exists among the candidate entities corresponding to the other entities, determining the target entity and the entity corresponding to the matching candidate entity as a set of positive example entities having an entity relationship. The target entity is the entity currently being judged.

In some embodiments, while the terminal compares candidate entities of different entities with one another, entities with one another, or entities with candidate entities to obtain sets of positive example entities having an entity relationship, the terminal may introduce an entity data set to help it quickly determine the at least one set of positive example entities. The entity data set includes associated entity pairs, each of which indicates entities and/or candidate entities that match each other, so that the speed of determining the at least one set of positive example entities is improved based on the entity data set. The number of associated entity pairs in the entity data set may be set based on requirements, for example 6.6 million.

Specifically, the entity data set may include associated entity pairs constructed from encyclopedia entries, and may further include manually constructed associated entity pairs.

Specifically, establishing the corresponding entity data set may include: the terminal parses entries of Baidu Baike to obtain in-link information, namely the information on the relevant description pages of the candidate entities; acquires associated entity pairs, namely relation pairs each formed by the entity of an entry and an entity in the in-link information; and stores the associated entity pairs in the entity data set.

Illustratively, the entry related to the movie works is parsed, and the cast in the in-link information can be parsed, so that the associated entity pair corresponding to the actor and the role is extracted. For example, in a movie work, "Liu Yifei" and "Zhao Paner" may constitute a pair of associated entities, and "Liu Yifei" and "xiaolong" may constitute a pair of associated entities.
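As a hedged sketch, an entity data set of associated entity pairs could be held as a set of unordered pairs, which makes the later match check a constant-time lookup. The parsing of encyclopedia pages is not shown; `entries` is assumed to already map each entry entity to the entities found in its in-link information, and both function names are hypothetical.

```python
def build_entity_dataset(entries: dict[str, list[str]]) -> set[frozenset[str]]:
    """Build associated entity pairs from parsed encyclopedia entries."""
    pairs = set()
    for entry_entity, linked_entities in entries.items():
        for linked_entity in linked_entities:
            # An entity is not paired with itself.
            if linked_entity != entry_entity:
                pairs.add(frozenset((entry_entity, linked_entity)))
    return pairs


def is_associated(dataset: set[frozenset[str]], a: str, b: str) -> bool:
    """Check whether two entities form an associated entity pair."""
    return frozenset((a, b)) in dataset


dataset = build_entity_dataset({"Liu Yifei": ["Zhao Paner", "xiaolong"]})
```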

In some embodiments, to ensure the accuracy of each set of positive example entities, the positive example entities may be further restricted: only if an entity has exactly one trusted recall entity, that is, only one of its candidate entities matches another entity, are the entity and the matched entity determined as a set of positive example entities having an entity relationship. Specifically, the terminal may first determine at least one set of positive example entities and, if the same entity appears in two sets of positive example entities, delete the two sets of positive example entities in which that entity appears.
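One possible reading of this restriction is sketched below: any set of positive example entities sharing an entity with another set is discarded, so that every remaining entity has exactly one trusted match. This interpretation and the name `keep_unambiguous_pairs` are assumptions of the sketch rather than details stated in the application.

```python
from collections import Counter


def keep_unambiguous_pairs(pairs: set[frozenset[str]]) -> set[frozenset[str]]:
    """Keep only pairs whose entities appear in no other pair."""
    # Count in how many positive example sets each entity occurs.
    occurrences = Counter(entity for pair in pairs for entity in pair)
    return {pair for pair in pairs
            if all(occurrences[entity] == 1 for entity in pair)}
```

For instance, given the pairs (A, B), (A, C) and (D, E), entity A has two matches, so both of its pairs are removed and only (D, E) remains.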

103. And selecting at least one group of negative example entities which do not have entity relation from the entities except the positive example entities.

It can be appreciated that data mining generally mines only positive example data, and the construction of negative example data is usually very simple. In this embodiment, the terminal may further select at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities, so that the data in the text information is fully utilized and the efficiency of data mining is improved.

In some embodiments, the selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities may include: based on the entity type of any target positive example entity in a set of positive example entities, randomly selecting one entity of the same type from the entities other than the positive example entities, and replacing the target positive example entity with the entity of the same type to obtain a set of negative example entities. The entities other than the positive example entities may be limited to the entities, other than the positive example entities, extracted from the text information, or may be entities other than the positive example entities in a preset entity library.

The above type may be determined based on the entities. For example, if a set of positive example entities is "anime" and "Naruto", then the entity of the same type may be "Detective Conan", with "Detective Conan" replacing "Naruto", or may be "manga", with "manga" replacing "anime".
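The same-type replacement can be sketched as follows. The entity names, the type labels and the function `replace_same_type` are hypothetical and serve only to illustrate the replacement; how entity types are obtained is not specified here.

```python
import random


def replace_same_type(positive_pair, target, entity_types, pool, rng):
    """Build one negative example pair by same-type replacement.

    positive_pair: tuple of two entities with an entity relationship
    target:        the positive example entity to be replaced
    entity_types:  mapping from entity to its type
    pool:          entities other than the positive example entities
    rng:           a random.Random instance, passed in for reproducibility
    """
    same_type = sorted(
        entity for entity in pool
        if entity not in positive_pair
        and entity_types.get(entity) == entity_types[target]
    )
    if not same_type:
        return None  # no replacement of the same type is available
    replacement = rng.choice(same_type)
    return tuple(replacement if e == target else e for e in positive_pair)


# Hypothetical entities: "work_a" and "work_b" share the type "work".
types = {"category_x": "category", "work_a": "work", "work_b": "work"}
negative = replace_same_type(("category_x", "work_a"), "work_a",
                             types, ["work_b"], random.Random(0))
```

Replacing an entity with one of the same type keeps the negative pair superficially plausible, which tends to make it a more informative training sample than a pair of unrelated types.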

In some embodiments, the selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities may include: based on the entity type of any target positive example entity in a set of positive example entities, randomly selecting one entity of a different type from the entities other than the positive example entities, and replacing the target positive example entity with the entity of the different type to obtain a set of negative example entities. The entities other than the positive example entities may be limited to the entities, other than the positive example entities, extracted from the text information, or may be entities other than the positive example entities in a preset entity library.

In some embodiments, the selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities may include: grouping the plurality of entities in pairs, every two entities forming a group, to obtain a plurality of groups of entities; determining the other groups of entities apart from the at least one group of positive example entities, and calculating a recall score for each of the other groups of entities; and taking the group of entities with the highest recall score among the other groups as a set of negative example entities.

If the candidate entity is a hyperlink ID, the recall score may be the number of identical characters contained in the relevant description pages of the candidate entities respectively corresponding to the two entities, the group with the largest number of identical characters having the highest recall score.
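A minimal sketch of selecting the non-positive group with the highest recall score follows. It assumes that the "number of identical characters" is counted as a multiset overlap of the characters of the two relevant description pages; other counting rules are equally compatible with the description. `recall_score` and `hardest_negative` are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def recall_score(page_a: str, page_b: str) -> int:
    """Number of characters the two description pages have in common."""
    shared = Counter(page_a) & Counter(page_b)  # multiset intersection
    return sum(shared.values())


def hardest_negative(pages: dict[str, str], positives: set[frozenset[str]]):
    """Return the non-positive entity pair with the highest recall score.

    `pages` maps each entity to the text of its relevant description page.
    """
    best_pair, best_score = None, -1
    for a, b in combinations(sorted(pages), 2):
        if frozenset((a, b)) in positives:
            continue  # positive example groups are excluded
        score = recall_score(pages[a], pages[b])
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair
```

A pair that scores highly yet is not a positive example looks related without being so, which is why it may serve as a difficult negative example.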

In some embodiments, the selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities may include: grouping the plurality of entities in pairs, every two entities forming a group, to obtain a plurality of groups of entities; and determining the other groups of entities apart from the at least one group of positive example entities, and randomly sampling one group from the other groups of entities to serve as a set of negative example entities.

104. And obtaining a sample data set based on at least one group of the positive example entities and at least one group of the negative example entities, and training a preset entity relation model according to the sample data set, wherein the entity relation model is used for identifying entity relations between two entities input into the model.

In this embodiment, after obtaining positive samples composed of the at least one set of positive example entities and negative samples composed of the at least one set of negative example entities, the terminal has in effect obtained a sample data set composed of the positive samples and the negative samples, so that the terminal only needs to train on the sample data set to obtain a corresponding entity relationship model capable of identifying relationships between entities. The label of each set of positive example entities corresponding to a positive sample is "entity relationship exists", and the label of each set of negative example entities corresponding to a negative sample is "no entity relationship exists".

It can be understood that, conventionally, the relationships between different entities can only be labeled manually in order to mine training data for model training, which consumes considerable human resources. In this embodiment, the positive example entities and negative example entities are mined automatically, so that this consumption is reduced. Moreover, the time cost of re-labeling sample data when facing a new field is avoided, and the transferability of the trained model across different scenarios is improved.

In some embodiments, in order to improve the accuracy of the data in the sample data set, the terminal may remove noise samples from the sample data set by means of confident learning, so as to obtain a high-quality data set. This improves the accuracy and quality of the data, reduces manpower consumption, and ensures that the data provided downstream is high-quality and reliable, with good transferability and robustness. For example, the noise data may be positive samples that are wrongly treated as negative samples.

In some embodiments, obtaining the sample data set based on at least one set of the positive example entities and at least one set of the negative example entities may include: dividing the at least one set of positive example entities and the at least one set of negative example entities into a test data set and a training data set; training on the training samples in the training data set to obtain a supervision model; performing entity relationship prediction on each test sample in the test data set through the supervision model to obtain a probability value of each test sample under at least one relationship category; performing sample screening on the test samples in the test data set based on the sample label corresponding to each test sample and the probability value of each test sample under the at least one relationship category, so as to obtain a screened test data set; and obtaining the sample data set based on the test samples in the screened test data set and the training samples in the training data set. In this way a high-quality data set is obtained, the accuracy and quality of the data are improved, manpower consumption is reduced, and the data provided downstream is guaranteed to be high-quality and reliable, with good transferability and robustness.

If the sample label of each set of positive example entities corresponding to a positive sample is "entity relationship exists", and the sample label of each set of negative example entities corresponding to a negative sample is "no entity relationship exists", then there are two relationship categories, namely a category in which an entity relationship exists and a category in which no entity relationship exists, and the two categories can be indicated by 0 and 1.

The terminal may divide the at least one set of positive example entities and the at least one set of negative example entities into 10 parts, select one part as the test data set, and use the remaining 9 parts as the training data set. A supervision model is obtained by training on the 9 parts of training data, and the test data set is input into the supervision model to obtain the probability value of each test sample of the test data set under at least one relationship category. The sample label corresponding to each test sample in the test data set is then determined, that is, which relationship category each test sample in the test data set finally belongs to.
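The division into 10 parts can be sketched as follows. The shuffling, the fixed seed and the name `split_into_parts` are assumptions of this sketch; training the supervision model itself is omitted because the application does not fix a model type.

```python
import random


def split_into_parts(samples, n_parts=10, seed=0):
    """Shuffle the samples, hold out one part for testing, train on the rest."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Striding assigns samples to parts whose sizes differ by at most one.
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    test_set = parts[0]
    training_set = [sample for part in parts[1:] for sample in part]
    return test_set, training_set
```

Rotating the held-out part so that every sample is scored once by a model that never saw it is a common extension of this scheme, although that rotation is not stated in the description above.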

In some embodiments, the performing sample screening on the test samples in the test data set based on the sample label corresponding to each test sample and the probability value of each test sample under at least one relationship category to obtain a screened test data set may include: based on the probability values of the test samples in the test data set under at least one relation category, obtaining an average probability value under each relation category; based on the sample label of the test sample, determining a comparison probability value under the relation category corresponding to the sample label from the average probability value under each relation category, and determining a target probability value under the relation category corresponding to the sample label from the probability value of the test sample under at least one relation category; calculating a probability difference value between the comparison probability value corresponding to each test sample and the target probability value; and carrying out sample screening on the test samples in the test data set based on the probability difference value corresponding to each test sample to obtain a screened test data set.

The sample screening may screen out a preset number of test samples with the largest probability difference, or screen out a preset number of test samples that, under a given relation category, fall below the average probability value by the largest probability difference. The preset number may be the product of the total number and a comparison coefficient, and the comparison coefficient may be set as required, for example, 1/20.

For example, suppose a test data set contains 10 test samples, of which 5 are labeled as having no entity relationship and 5 are labeled as having an entity relationship. The probability values of the 5 test samples labeled as having no entity relationship, under the category in which no entity relationship exists, are 0.64, 0.67, 0.82, 0.84 and 0.88 respectively, so the average probability value of that category is 0.77. The probability values of the 5 test samples labeled as having an entity relationship, under the category in which an entity relationship exists, are 0.72, 0.75, 0.83, 0.84 and 0.95 respectively, so the average probability value of that category is 0.818. Since the first test sample labeled as having no entity relationship (0.64) is below its category average and its difference from the average (0.13) is the largest, it can be screened out, thereby obtaining a screened test data set.
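The worked example above can be reproduced with a short sketch. The function name, the sample identifiers, and the tuple layout are hypothetical; the sign convention (average minus the sample's own probability) and passing the number of removed samples directly as `n_remove` are assumptions, since the text does not fix a rounding rule for the preset number.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple

# Hypothetical layout: (sample_id, label, probability under the category named by the label).
TestSample = Tuple[str, int, float]


def screen_test_samples(test_samples: List[TestSample], n_remove: int):
    """Drop the n_remove samples that fall furthest below their category average."""
    probs_by_label: Dict[int, List[float]] = {}
    for _, label, prob in test_samples:
        probs_by_label.setdefault(label, []).append(prob)
    # Comparison probability value: the average probability of each relation category.
    average = {label: mean(probs) for label, probs in probs_by_label.items()}
    # Probability difference: comparison value minus the sample's target value.
    gaps = [(average[label] - prob, sample_id) for sample_id, label, prob in test_samples]
    below_average = sorted((g for g in gaps if g[0] > 0), reverse=True)
    removed: Set[str] = {sample_id for _, sample_id in below_average[:n_remove]}
    kept = [s for s in test_samples if s[0] not in removed]
    return kept, removed, average


# Probability values taken from the example in the text.
no_relation = [0.64, 0.67, 0.82, 0.84, 0.88]
has_relation = [0.72, 0.75, 0.83, 0.84, 0.95]
test_samples = ([(f"neg_{i}", 0, p) for i, p in enumerate(no_relation, 1)]
                + [(f"pos_{i}", 1, p) for i, p in enumerate(has_relation, 1)])

kept, removed, average = screen_test_samples(test_samples, n_remove=1)
print(removed)  # {'neg_1'}, the sample with probability 0.64
```

The largest gap in the other category is 0.818 - 0.72 = 0.098, which is smaller than 0.13, so only the 0.64 sample is removed when one sample is screened out.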

In some embodiments, the sample screening of the test samples in the test data set may be performed over multiple cycles. That is, after each round of screening, the screened test data set and the sample data set are re-partitioned to select a new test data set for the next round of screening; alternatively, after each round, the screened test data set is itself screened again. The number of cycles may be set as required, for example, 3.

It can be understood that, by performing sample screening on the test samples in the test data set to obtain a screened test data set, and obtaining the sample data set based on the test samples in the screened test data set and the training samples in the training data set, a high-quality data set is obtained, which improves the accuracy and quality of the data. For example, experiments show that the measured accuracy was 83.7% before screening and 92.4% after screening.

As can be seen from the above, text information to be mined is obtained, a plurality of entities in the text information are extracted, and candidate entities corresponding to each entity are recalled; at least one set of positive example entities having an entity relationship is determined from the plurality of entities based on the candidate entities corresponding to the plurality of entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity; at least one set of negative example entities without an entity relationship is selected from the entities other than the positive example entities; and a sample data set is obtained based on at least one set of the positive example entities and at least one set of the negative example entities, and a preset entity relation model is trained according to the sample data set, wherein the entity relation model is used for identifying the entity relationship between two entities input into the model. In this way, at least one set of positive example entities having an entity relationship and at least one set of negative example entities without an entity relationship can be mined from the text through the candidate entities corresponding to the entities in the text, thereby reducing the manpower resources spent on mining data for model training.

In order to better implement the above method, the embodiment of the application also provides a data mining device, which may be specifically integrated in an electronic device, for example, a computer device, where the computer device may be a terminal, a server, or other devices.

The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.

For example, this embodiment takes a data mining apparatus integrated in a terminal as an example to describe the method of the embodiments of the present application in detail. This embodiment provides a data mining apparatus which, as shown in fig. 2, may include:

an information obtaining module 201, configured to obtain text information to be mined, extract a plurality of entities in the text information, and recall candidate entities corresponding to each of the entities;

an entity determining module 202, configured to determine at least one set of positive example entities having an entity relationship from a plurality of entities based on candidate entities corresponding to the plurality of entities, where each positive example entity and the candidate entity corresponding to the positive example entity in each set are matched with each other;

an entity selection module 203, configured to select at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities;

a model training module 204, configured to obtain a sample data set based on at least one set of the positive example entities and at least one set of the negative example entities, and train a preset entity relationship model according to the sample data set, where the entity relationship model is used to identify an entity relationship between two entities input into the model.

In some embodiments, the data mining apparatus further includes a text recognition module, where the text recognition module is specifically configured to:

if at least one text picture exists in the text information, carrying out picture character recognition on each text picture to obtain picture character information of at least one text picture;

and adding the picture text information of the at least one text picture to a preset text adding position of the text information.

In some embodiments, the model training module 204 is specifically configured to:

dividing at least one set of the positive example entities and at least one set of the negative example entities into a test data set and a training data set;

training each training sample in the training data set to obtain a supervision model;

carrying out entity relation prediction on each test sample in the test data set through the supervision model to obtain a probability value of each test sample under at least one relation category;

sample screening is carried out on the test samples in the test data set based on the sample labels corresponding to each test sample and the probability value of each test sample under at least one relation class, so as to obtain a screened test data set;

and obtaining the sample data set based on the test samples in the screened test sample set and the training samples in the training data set.

In some embodiments, the model training module 204 is further specifically configured to:

based on the probability values of the test samples in the test data set under at least one relation category, obtaining an average probability value under each relation category;

based on the sample label of the test sample, determining a comparison probability value under the relation category corresponding to the sample label from the average probability value under each relation category, and determining a target probability value under the relation category corresponding to the sample label from the probability value of the test sample under at least one relation category;

calculating a probability difference value between the comparison probability value corresponding to each test sample and the target probability value;

and carrying out sample screening on the test samples in the test data set based on the probability difference value corresponding to each test sample to obtain a screened test data set.

In some embodiments, the entity determination module 202 is specifically configured to:

based on the candidate entity corresponding to a target entity, if a matching entity that matches the candidate entity corresponding to the target entity exists among the plurality of entities, determining the target entity and the matching entity as a set of positive example entities having an entity relationship.
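The pairing rule above can be illustrated with a minimal sketch. How a "match" between an extracted entity and a recalled candidate is decided is not specified in this section, so exact string equality is assumed here; the function name, the `candidates` mapping, and the sample entity names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_positive_pairs(entities: List[str], candidates: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Pair a target entity with any other extracted entity that matches one of its candidates."""
    pairs = []
    for a, b in combinations(entities, 2):
        # Assumption: "match" means the other entity's name equals a recalled candidate.
        if b in candidates.get(a, set()) or a in candidates.get(b, set()):
            pairs.append((a, b))
    return pairs


# Toy data: entities extracted from one text and the candidates recalled for each.
entities = ["BrandX", "LipstickY", "CityZ"]
candidates = {
    "BrandX": {"LipstickY", "SerumW"},
    "LipstickY": {"BrandX"},
    "CityZ": set(),
}
positive_pairs = find_positive_pairs(entities, candidates)
print(positive_pairs)  # [('BrandX', 'LipstickY')]
```

"SerumW" is recalled as a candidate of "BrandX" but does not occur among the extracted entities, so it produces no pair.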

In some embodiments, the entity selection module 203 is specifically configured to:

based on the entity type of any target positive entity in a group of positive entities, randomly selecting one entity of the same type from the entities except the positive entities, and replacing the target positive entities by the entities of the same type to obtain a group of negative entities.

dividing any two entities in the plurality of entities into a group to obtain a plurality of groups of entities;

determining other groups of entities except at least one group of the positive example entities, and calculating recall scores among the other groups of entities;

and taking the entity with the highest recall score in other groups of entities as a negative example entity.
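Both ways of selecting negative examples listed above can be sketched as follows. The function names, the `entity_types` mapping, and the `recall_score` callable are hypothetical; in particular, how the recall score is computed is not defined in this section, so a fixed lookup table stands in for it.

```python
import random
from itertools import combinations
from typing import Callable, Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]


def replace_with_same_type(positive_pair: Pair, entity_types: Dict[str, str],
                           positive_entities: Set[str], rng: random.Random) -> Optional[Pair]:
    """Swap the target entity for a random non-positive entity of the same type."""
    target, partner = positive_pair
    pool = [e for e, t in entity_types.items()
            if t == entity_types[target] and e not in positive_entities]
    return (rng.choice(pool), partner) if pool else None


def highest_scoring_other_pair(entities: List[str], positive_pairs: List[Pair],
                               recall_score: Callable[[str, str], float]) -> Optional[Pair]:
    """Among the pairs that are not positive examples, take the one with the highest recall score."""
    positives = {frozenset(p) for p in positive_pairs}
    others = [p for p in combinations(entities, 2) if frozenset(p) not in positives]
    return max(others, key=lambda p: recall_score(*p)) if others else None


# Toy data: entity types and one positive pair.
entity_types = {"BrandX": "brand", "BrandQ": "brand", "LipstickY": "product", "SerumW": "product"}
positive_pairs = [("BrandX", "LipstickY")]
positive_entities = {"BrandX", "LipstickY"}

negative_by_type = replace_with_same_type(positive_pairs[0], entity_types,
                                          positive_entities, random.Random(0))

# Stand-in recall scores; unknown pairs score 0.0.
scores = {frozenset({"BrandQ", "SerumW"}): 0.9, frozenset({"BrandX", "SerumW"}): 0.4}
negative_by_score = highest_scoring_other_pair(
    list(entity_types), positive_pairs, lambda a, b: scores.get(frozenset({a, b}), 0.0))
print(negative_by_type, negative_by_score)
```

The second strategy tends to produce harder negatives, because a non-positive pair with a high recall score looks similar to a genuine relation.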

As can be seen from the above, the information obtaining module 201 obtains the text information to be mined, extracts a plurality of entities in the text information, and recalls candidate entities corresponding to each entity; the entity determining module 202 determines at least one set of positive example entities having an entity relationship from the plurality of entities based on the candidate entities corresponding to the plurality of entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity; the entity selection module 203 selects at least one set of negative example entities without an entity relationship from the entities other than the positive example entities; and the model training module 204 obtains a sample data set based on at least one set of the positive example entities and at least one set of the negative example entities, and trains a preset entity relationship model according to the sample data set, where the entity relationship model is used to identify an entity relationship between two entities input into the model. In this way, at least one set of positive example entities having an entity relationship and at least one set of negative example entities without an entity relationship can be mined from the text through the candidate entities corresponding to the entities in the text, which reduces the human resources spent on mining data for model training.

Correspondingly, the embodiment of the application also provides an electronic device, which may be a terminal such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer (PC), or a personal digital assistant (PDA). Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 3, the electronic device 300 includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored on the memory 302 and executable on the processor 301. The processor 301 is electrically connected to the memory 302. It will be appreciated by those skilled in the art that the electronic device structure shown in the figure does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The processor 301 is a control center of the electronic device 300, connects various portions of the entire electronic device 300 using various interfaces and lines, and performs various functions of the electronic device 300 and processes data by running or loading software programs and/or modules stored in the memory 302, and invoking data stored in the memory 302, thereby performing overall monitoring of the electronic device 300.

In the embodiment of the present application, the processor 301 in the electronic device 300 loads the instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 executes the application programs stored in the memory 302, so as to implement various functions:

In some embodiments, before extracting the plurality of entities in the text information, the method further includes:

In some embodiments, the obtaining the sample data set based on at least one set of the positive instance entities and at least one set of the negative instance entities includes:

In some embodiments, the sample screening is performed on the test samples in the test data set based on the sample label corresponding to each test sample and the probability value of each test sample under at least one relationship category, so as to obtain a screened test data set, including:

In some embodiments, the determining, based on candidate entities corresponding to the entities, at least one set of positive example entities having entity relationships from the entities includes:

In some embodiments, the selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entities includes:

Thus, the electronic device 300 provided in this embodiment may have the following technical effects: the manpower resources spent in mining the data in model training are reduced.

For the specific implementation of each operation above, reference may be made to the previous embodiments; details are not repeated here.

Optionally, as shown in fig. 3, the electronic device 300 further includes: a touch display 303, a radio frequency circuit 304, an audio circuit 305, an input unit 306, and a power supply 307. The processor 301 is electrically connected to the touch display 303, the radio frequency circuit 304, the audio circuit 305, the input unit 306, and the power supply 307, respectively. Those skilled in the art will appreciate that the electronic device structure shown in fig. 3 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The touch display 303 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display 303 may include a display panel and a touch panel. The display panel may be used to display information entered by a user or provided to a user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. The touch panel may be used to collect touch operations performed by the user on or near it (such as operations performed on or near the touch panel by the user using a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions, which trigger the corresponding programs. Optionally, the touch panel may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 301, and can receive and execute commands sent from the processor 301. The touch panel may overlay the display panel; upon detecting a touch operation on or near it, the touch panel passes the operation to the processor 301 to determine the type of touch event, and the processor 301 then provides a corresponding visual output on the display panel according to the type of touch event.
In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display 303 to implement the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions. That is, the touch display 303 may also implement an input function as part of the input unit 306.

The radio frequency circuit 304 may be configured to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or other electronic device and exchange signals with it.

The audio circuit 305 may be used to provide an audio interface between a user and the electronic device through a speaker and a microphone. On the one hand, the audio circuit 305 may convert received audio data into an electrical signal and transmit it to the speaker, which converts the electrical signal into a sound signal for output; on the other hand, the microphone converts collected sound signals into electrical signals, which are received by the audio circuit 305 and converted into audio data; the audio data is then output to the processor 301 for processing and transmitted to, for example, another electronic device via the radio frequency circuit 304, or output to the memory 302 for further processing. The audio circuit 305 may also include an earphone jack to provide communication between a peripheral headset and the electronic device.

The input unit 306 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

The power supply 307 is used to power the various components of the electronic device 300. Optionally, the power supply 307 may be logically connected to the processor 301 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The power supply 307 may also include any one or more components such as a direct current or alternating current power source, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.

Although not shown in fig. 3, the electronic device 300 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described herein.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of computer programs that can be loaded by a processor to perform the steps of any of the data mining methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:

It can be seen that the computer program can be loaded by a processor to perform the steps of any of the data mining methods provided in the embodiments of the present application, thereby bringing about the following technical effects: the manpower resources spent in mining the data in model training are reduced.

The computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Since the computer program stored in the computer-readable storage medium can execute the steps of any data mining method provided in the embodiments of the present application, it can achieve the beneficial effects of any such method; these are detailed in the previous embodiments and are not repeated here.

The foregoing has described in detail a data mining method, apparatus, electronic device and computer-readable storage medium provided by the embodiments of the present application. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is intended only to assist in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, in light of the ideas of the present application, make changes to the specific implementations and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims

1. A method of data mining, the method comprising:

determining at least one set of positive example entities with entity relation from a plurality of entities based on candidate entities corresponding to the entities, wherein each positive example entity in each set is matched with the candidate entity corresponding to the positive example entity;

and obtaining a sample data set based on at least one group of positive instance entities and at least one group of negative instance entities, and training a preset entity relation model according to the sample data set, wherein the entity relation model is used for identifying entity relations between two entities input into the model.

2. The data mining method of claim 1, further comprising, prior to extracting the plurality of entities in the text information:

3. The data mining method of claim 1, wherein the obtaining a sample data set based on at least one set of the positive instance entities and at least one set of the negative instance entities comprises:

dividing at least one set of said positive instance entities and at least one set of said negative instance entities into a test data set and a training data set;

carrying out entity relation prediction on each test sample in the test data set through the supervision model to obtain a probability value of each test sample under at least one relation category;

sample screening is carried out on the test samples in the test data set based on the sample labels corresponding to each test sample and the probability value of each test sample under at least one relation class, so that a screened test data set is obtained;

4. The data mining method according to claim 3, wherein the performing sample screening on the test samples in the test data set based on the sample label corresponding to each test sample and the probability value of each test sample under at least one relation category to obtain a screened test data set includes:

Obtaining an average probability value under each relation category based on the probability value of each test sample in the test data set under at least one relation category;

5. The data mining method according to claim 1, wherein the determining at least one set of positive example entities having entity relationships from the plurality of entities based on candidate entities corresponding to the plurality of entities includes:

based on the candidate entity corresponding to a target entity, if a matching entity that matches the candidate entity corresponding to the target entity exists among the plurality of entities, determining the target entity and the matching entity as a set of positive example entities having an entity relationship.

6. The method of any one of claims 1 to 5, wherein selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entity comprises:

based on the entity type of any target positive example entity in a group of positive example entities, randomly selecting one entity of the same type from the entities except the positive example entities, and replacing the target positive example entity with the entity of the same type to obtain a group of negative example entities.

7. The method of any one of claims 1 to 5, wherein selecting at least one set of negative example entities that do not have an entity relationship from the entities other than the positive example entity comprises:

8. A data mining apparatus, the apparatus comprising:

The entity determining module is used for determining at least one group of positive example entities with entity relation from the entities based on candidate entities corresponding to the entities, wherein the candidate entities corresponding to the positive example entities in each group are matched with each other;

the model training module is used for obtaining a sample data set based on at least one group of positive instance entities and at least one group of negative instance entities, and training a preset entity relation model according to the sample data set, wherein the entity relation model is used for identifying entity relations between two entities input into the model.

9. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the data mining method as claimed in any one of claims 1 to 7.

10. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the data mining method of any of claims 1 to 7.