CN113239149B - Entity processing method, device, electronic equipment and storage medium - Google Patents

Entity processing method, device, electronic equipment and storage medium

Info

Publication number
CN113239149B
CN113239149B (application CN202110529375.7A)
Authority
CN
China
Prior art keywords
information
entity
clustering
entities
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110529375.7A
Other languages
Chinese (zh)
Other versions
CN113239149A (en)
Inventor
Zhang Hao (张皓)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110529375.7A
Publication of CN113239149A
Application granted
Publication of CN113239149B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

The disclosure provides an entity processing method, an entity processing device, electronic equipment and a storage medium, and relates to the field of natural language processing. The specific implementation scheme is as follows: clustering a plurality of first information containing target entities to obtain N information clusters; determining a clustering effect based on the N information clusters; determining an ambiguity score of the target entity according to the clustering effect; and determining a mode for detecting the target entity in the second information according to the ambiguity score of the target entity. By using the embodiment of the disclosure, the efficiency and accuracy of detecting the target entity in the information can be improved.

Description

Entity processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to the field of natural language processing.
Background
In some application scenarios, it is desirable to detect, in information, certain entities that have a particular meaning; such entities may be referred to as target entities. Common detection modes include vocabulary detection and model detection. Vocabulary detection directly compares text against a vocabulary, searching the information for entities listed in the vocabulary. Model detection inputs the information into a model, and the model outputs the target entities contained in the information.
Disclosure of Invention
The disclosure provides an entity processing method, an entity processing device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided an entity processing method, including:
clustering a plurality of first information containing target entities to obtain N information clusters; wherein N is an integer greater than or equal to 1;
determining a clustering effect based on the N information clusters;
determining an ambiguity score of the target entity according to the clustering effect;
and determining a mode for detecting the target entity in the second information according to the ambiguity score of the target entity.
According to another aspect of the present disclosure, there is provided an entity processing apparatus, comprising:
the information clustering module is used for clustering a plurality of pieces of first information containing target entities to obtain N information clusters; wherein N is an integer greater than or equal to 1;
the effect determining module is used for determining a clustering effect based on the N information clusters;
the score determining module is used for determining an ambiguity score of the target entity according to the clustering effect;
and the mode determining module is used for determining a mode for detecting the target entity in the second information according to the ambiguity score of the target entity.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.
According to the technology disclosed by the invention, the efficiency and the accuracy of detecting the target entity in the second information are improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an entity processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an entity processing method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an entity processing method according to yet another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an entity processing apparatus according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an entity processing apparatus according to another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an entity processing apparatus according to yet another embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing the entity processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of an entity processing method according to an embodiment of the present disclosure, the method comprising:
step S11, clustering a plurality of pieces of first information containing target entities to obtain N information clusters; wherein N is an integer greater than or equal to 1;
step S12, determining a clustering effect based on N information clusters;
step S13, determining ambiguity scores of target entities according to the clustering effect;
step S14, determining the mode of detecting the target entity in the second information according to the ambiguity score of the target entity.
Illustratively, the information in the embodiments of the present disclosure, including but not limited to at least one of the first information, the second information, and the third information, may be various types of text information, such as advertisements, newspapers, magazines, and documents. In different application scenarios, the information and the target entities therein take different forms. For example, in the advertising field, the information may comprise advertisements and the target entity may comprise a brand name. In the field of news dissemination, the information may include newspapers, magazines, etc., and the target entity may include certain sensitive words.
For example, a preset clustering algorithm may be used to cluster the plurality of first information containing the target entity. Clustering algorithms include, but are not limited to, K-Means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
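As an illustrative sketch only (the embodiment does not prescribe a particular implementation), the clustering of step S11 can be demonstrated with a minimal hand-rolled K-Means over toy feature vectors. The vectors, the deterministic initialization, and the choice of K-Means over the other named algorithms are all assumptions for the example; real first information would first be converted to feature vectors.

```python
# Minimal K-Means sketch: cluster toy 2-D feature vectors standing in for
# pieces of first information. Deterministic initialization (first k points
# as centers) is chosen here only to make the example reproducible.

def kmeans(points, k, iters=20):
    def dist2(p, q):
        # squared Euclidean distance between two vectors
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centers

# Two well-separated toy groups of "first information" feature vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(points, k=2)
```

With well-separated inputs like these, the two groups end up in distinct information clusters, which is the property the later ambiguity evaluation relies on.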
For example, after the N information clusters are obtained by clustering, a conventional clustering effect evaluation method, such as the Calinski-Harabasz (CH) Index, may be used to determine the clustering effect. Clustering groups similar pieces of first information into the same category: among the N information clusters, first information within the same cluster is similar, while first information in different clusters differs. The clustering effect can therefore reflect whether the semantics of the plurality of first information are diverse. For example, the smaller the number N of information clusters, the more single and clear the semantics of the first information containing the target entity, and the lower the ambiguity of the target entity. Determining the ambiguity score of the target entity from the clustering effect therefore allows the score to accurately reflect the ambiguity of the target entity.
In the embodiments of the disclosure, ambiguity characterizes whether the text corresponding to an entity can be read with multiple meanings across different information. For example, "notebook" may have different meanings in different information: in digital-product information it generally refers to a portable computer, while in the educational field it generally refers to a paper booklet for recording things. The ambiguity of "notebook" is therefore high. By contrast, "federated learning" has a single meaning across different information, generally referring to machine learning that combines data from multiple parties, and so has low ambiguity.
In practical applications, the ambiguity of a target entity may lead to inaccurate detection results. For example, when the target entity is "notebook" in the portable-computer sense, occurrences of "notebook" meaning a paper booklet are easily detected by mistake. A high-precision detection mode improves accuracy, but often incurs a large computational cost.
The disclosed embodiments determine the mode of detecting the target entity in the second information based on the ambiguity score of the target entity. For example, when the ambiguity score is low, the target entity is detected in the second information in an efficient mode; when the ambiguity score is high, the target entity is detected in the second information in a high-accuracy mode. In this way, both the efficiency and the accuracy of detecting the target entity in the second information can be improved. For example, the methods of the disclosed embodiments may be applied in the advertising field to identify brand names in advertisements and confirm whether the brand names used by advertisers are authorized.
Illustratively, the determining the mode of detecting the target entity in the second information according to the ambiguity score of the target entity in the foregoing step S14 may include:
and selecting a mode for detecting the target entity in the second information from the vocabulary detection mode and the model detection mode according to the ambiguity score of the target entity and a preset threshold value.
For example, when the ambiguity score of the target entity is equal to or greater than a preset threshold, a model detection mode is selected as a mode for detecting the target entity in the second information. For another example, if the ambiguity score of the target entity is smaller than the preset threshold, selecting the vocabulary detection mode as the mode for detecting the target entity in the second information.
Vocabulary detection directly searches the information, by comparison with a vocabulary of target entities, for text identical to a target entity, and is therefore the most efficient. Model detection analyzes the input information with a neural network model and identifies, based on the analysis result, whether a target entity is present in the information, and therefore has higher accuracy. Selecting between these two modes according to the ambiguity score of the target entity and the preset threshold allows the target entity to be detected both more efficiently and more accurately.
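The threshold comparison of step S14 can be sketched as follows. This is an illustrative sketch: the threshold value 0.5 and the example scores are hypothetical, not values prescribed by the embodiment.

```python
# Sketch of step S14: choose between vocabulary detection and model
# detection by comparing the target entity's ambiguity score with a
# preset threshold (0.5 here is an assumed value).

def choose_detection_mode(ambiguity_score, threshold=0.5):
    # Low ambiguity: plain vocabulary matching is efficient and safe.
    # Ambiguity at or above the threshold: use the costlier, more
    # accurate model detection.
    if ambiguity_score >= threshold:
        return "model"
    return "vocabulary"

low = choose_detection_mode(0.1)   # e.g. a single-meaning entity like "federated learning"
high = choose_detection_mode(0.9)  # e.g. a multi-meaning entity like "notebook"
```

The design point is simply that computation is spent only where ambiguity makes cheap string matching unreliable.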
For example, the plurality of first information may be clustered based on features of the plurality of first information. In a specific example, the step S11 includes clustering a plurality of first information including the target entity to obtain N information clusters, where the clustering includes:
performing word segmentation on each piece of first information in the plurality of pieces of first information respectively to obtain a word sequence of each piece of first information;
combining the words in the word sequence of each piece of first information based on a preset combination mode to obtain a feature set of each piece of first information;
and clustering the plurality of first information based on the feature set of each first information to obtain N information clusters.
Taking the first information as advertisements and the target entity as a brand name, the advertisements containing brand name B form an advertisement data set $T_B = \{t'_i\}_{i=1}^{|T_B|}$, where $t'_i$ is the $i$-th advertisement and $|T_B|$ is the number of advertisements in the advertisement data set.

For each advertisement, denoted $t'$ without loss of generality, word segmentation yields the word sequence $\{t'_1, t'_2, \ldots, t'_l \mid l > 1\}$, where $t'_j$ is the $j$-th word in the word sequence and $l$ is the number of words in the word sequence.

The preset combination mode may be an N-gram sliding-window operation over the word sequence: starting from each word, text segments of up to a preset length are formed, each segment serves as an ngram feature of the first information, and the feature set is obtained from these ngram features. Specifically, given the parameter ngram = k, the ngram feature set generated for the word sequence $\{t'_1, t'_2, \ldots, t'_l \mid l > 1\}$ of advertisement $t'$ is:

$$f_{t',k} = \{\, t'_i t'_{i+1} \cdots t'_{i+j} \mid i,\ i+j \in [1, l],\ j \in [0, k) \,\} \quad \text{(1)}$$

For example, the advertisement "vacation village B worth one trip" is segmented into the word sequence {vacation village, B, worth, one trip}. If k = 3, the resulting ngram feature set $f_{t',3}$ contains every text segment of at most 3 consecutive words in the sequence: {vacation village, vacation village B, vacation village B worth, B, B worth, B worth one trip, worth, worth one trip, one trip}.

Based on the feature sets of the advertisements, the feature set of brand B in the advertisement dimension is obtained, expressed as $S_{B,k} = \{\, f_{t',k} \mid t' \in T_B \,\}$. The plurality of advertisements are clustered based on $S_{B,k}$ to obtain at least one information cluster.
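The ngram feature generation of formula (1) can be sketched as below. The word sequence mirrors the worked example above; word segmentation itself is assumed to have been performed already.

```python
# Sketch of formula (1): every segment of up to k consecutive words in a
# word sequence becomes one ngram feature of the piece of first information.

def ngram_features(words, k):
    feats = set()
    l = len(words)
    for i in range(l):          # segment start (1-based i in the patent's notation)
        for j in range(k):      # extra words appended, j in [0, k)
            if i + j < l:
                feats.add(" ".join(words[i:i + j + 1]))
    return feats

# Word sequence of the example advertisement "vacation village B worth one trip".
words = ["vacation village", "B", "worth", "one trip"]
feats = ngram_features(words, k=3)
```

For a 4-word sequence and k = 3 this yields four 1-word, three 2-word, and two 3-word segments, i.e. nine features in total, matching the example set.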
Therefore, according to the steps, rich text fragments related to the target entity are obtained based on the plurality of first information containing the target entity, and the accuracy of the clustering result can be improved by clustering the plurality of first information based on the rich text fragments, so that the accuracy of the clustering effect and the ambiguity score can be improved.
Illustratively, the foregoing step S12, determining the clustering effect based on the N information clusters, includes:
and determining the clustering effect according to the intra-cluster distance in each of the N information clusters.
That is, the clustering effect is determined according to the distance between each piece of first information in an information cluster and the center point of that cluster. The smaller the intra-cluster distances, the more clear and uniform the meaning of each piece of information containing the target entity; therefore, a clustering effect determined from intra-cluster distances can accurately reflect the ambiguity of the target entity.
The clustering effect may also be determined from the number of information clusters. Specifically, determining a clustering effect based on the N information clusters includes:
determining a clustering effect according to N;
or,
and determining the clustering effect according to the intra-cluster distance and N in each of the N information clusters.
That is, in the embodiment of the present disclosure, the clustering effect may be determined according to the intra-cluster spacing in each information cluster and/or the number N of information clusters.
Illustratively, the clustering effect may be determined based on the following clustering effect evaluation function:

$$S_B = \frac{N_G}{|T_B|} \sum_{i=1}^{N_G} \sum_{a \in C_i} d(a, \bar{C}_i) \quad \text{(2)}$$

where $S_B$ is the clustering effect; $|T_B|$ is the number of first information; $N_G$ is the number N of information clusters obtained by clustering; $C_i$ denotes the $i$-th information cluster; $a$ denotes a piece of first information containing the target entity B that lies in the $i$-th information cluster $C_i$; $\bar{C}_i$ denotes the center point of the $i$-th information cluster; and $d(a, \bar{C}_i)$ denotes the distance of the first information $a$ from the center point of the $i$-th information cluster.
The smaller the number of information clusters and the smaller the intra-cluster distances, the more single and clear the semantics of each piece of information containing the target entity. The above exemplary mode therefore determines the clustering effect based on the intra-cluster distances and N, so that the clustering effect accurately reflects the ambiguity of the target entity.
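The evaluation described above can be sketched as below. The exact weighting between the cluster count and the intra-cluster distances is an assumption for illustration (here the mean distance is scaled by the number of clusters); the embodiment only requires that smaller values reflect more single, clear semantics.

```python
# Sketch of a clustering-effect evaluation combining the number of clusters
# N_G with intra-cluster distances (distance of each piece of first
# information to its cluster center). Smaller result = lower ambiguity.
import math

def cluster_effect(clusters):
    # clusters: list of clusters; each cluster is a list of 2-D feature vectors
    total = sum(len(c) for c in clusters)   # |T_B|, total pieces of first information
    n_g = len(clusters)                     # number of information clusters N
    intra = 0.0
    for c in clusters:
        cx = sum(p[0] for p in c) / len(c)  # cluster center point
        cy = sum(p[1] for p in c) / len(c)
        intra += sum(math.dist(p, (cx, cy)) for p in c)
    return n_g * intra / total

tight = [[(0.0, 0.0), (0.1, 0.0)]]                             # one tight cluster
loose = [[(0.0, 0.0), (3.0, 0.0)], [(0.0, 5.0), (4.0, 5.0)]]   # two spread clusters
```

A single tight cluster (single, clear semantics) scores lower than several spread-out clusters (diverse semantics), which is the ordering the ambiguity score needs.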
In practical applications, a clustering algorithm requiring parameter tuning (e.g., K-Means) may be employed. When clustering is run multiple times with different parameter settings, a clustering effect can be evaluated for each setting, and the optimal clustering effect is taken to determine the ambiguity score. For example, the optimal clustering effect may itself be taken as the ambiguity score. Specifically, the ambiguity score may be expressed based on the following formula:
$$A_{B,k} = \min_{\theta}\, S_B^{(c,\theta)} \quad \text{(3)}$$

where $A_{B,k}$ is the ambiguity score; $c$ is the clustering algorithm; $\theta$ is a parameter of clustering algorithm $c$; and $S_B^{(c,\theta)}$ is the clustering effect obtained for parameter $\theta$.
In a case where there are a plurality of target entities, step S14 of determining the mode of detecting the target entity in the second information according to the ambiguity score of the target entity may include: sorting the plurality of target entities in ascending order of ambiguity score; and traversing the sorted target entities, where, while the ambiguity score of the traversed target entity is smaller than the preset threshold, the mode of detecting that target entity in the second information is determined to be the vocabulary detection mode, and once the ambiguity score of the traversed target entity is no longer smaller than the preset threshold, the mode of detecting that target entity (and each subsequent one) in the second information is determined to be the model detection mode.
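The multi-entity traversal can be sketched as follows. Entity names, scores, and the threshold are hypothetical illustration values.

```python
# Sketch of the multi-entity case: sort target entities by ascending
# ambiguity score, then assign vocabulary detection while scores stay
# below the threshold and model detection once the threshold is reached.

def assign_modes(scored_entities, threshold):
    # scored_entities: list of (entity, ambiguity_score) pairs
    ordered = sorted(scored_entities, key=lambda e: e[1])  # small to large
    modes = {}
    for entity, score in ordered:
        modes[entity] = "vocabulary" if score < threshold else "model"
    return modes

modes = assign_modes(
    [("notebook", 0.9), ("federated learning", 0.1), ("brand X", 0.6)],
    threshold=0.5,
)
```

Because the list is sorted, the assignment flips from vocabulary to model exactly once, so the traversal never needs to re-check earlier entities.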
Fig. 2 is a schematic diagram of an entity processing method according to another embodiment of the present disclosure, the method comprising:
step S21, a plurality of entities to be processed and a plurality of third information sets respectively corresponding to the entities to be processed are obtained;
step S22, clustering a plurality of entities to be processed based on a plurality of third information sets to obtain at least one entity cluster;
step S23, traversing at least one entity cluster, and carrying out batch correction on at least one entity to be processed in the traversed entity cluster to obtain a target entity.
The third information set comprises a plurality of third information, and each third information comprises a corresponding entity to be processed.
The entity processing method shown in fig. 2 and the entity processing method shown in fig. 1 may be implemented independently or in association with each other. For example, the method shown in fig. 2 alone or the method shown in fig. 1 alone may be performed; alternatively, the method shown in fig. 2 may be performed first to obtain the target entity, and then the method shown in fig. 1 may be performed to determine, for that target entity, the mode of detecting it in the second information.
Illustratively, the plurality of entities to be processed may be entity nouns identified in various types of information in a particular domain. For example, the entities to be processed may be brand names identified in various types of advertisements using techniques such as template mining and named entity recognition (NER).
Entities identified with techniques such as template mining and NER often carry boundary recognition errors, and therefore need manual correction. Correcting many to-be-processed entities one by one is inefficient.
According to the method, the plurality of to-be-processed entities can be clustered based on the third information corresponding to each entity, so that entities with similar features are grouped into one entity cluster. When correcting the entities, at least one entity cluster is traversed and the to-be-processed entities in each traversed cluster are corrected in batch: entities in different clusters are corrected separately, while entities in the same cluster are corrected together. Because entities in the same cluster have similar features, this concentrated correction helps improve the efficiency of manual correction.
For example, the boundary features of the entity to be processed may be obtained based on the third information set, and the clustering may be performed based on the boundary features of the entity to be processed. Specifically, in some optional exemplary embodiments, step S22, clustering, based on the plurality of third information sets, the plurality of to-be-processed entities to obtain at least one entity cluster includes:
based on the plurality of third information sets, a plurality of boundary feature sets respectively corresponding to the plurality of entities to be processed are obtained;
based on the plurality of boundary feature sets, clustering a plurality of to-be-processed entities to obtain at least one entity cluster.
Illustratively, the plurality of to-be-processed entities may be clustered based on the plurality of boundary feature sets using any preset clustering algorithm.
Because the clustering is carried out based on the boundary characteristics, the boundary characteristics of the entities to be processed in the same entity cluster are similar, so that the problem of error in boundary identification is solved, and the efficiency of manual correction is improved more effectively.
Illustratively, obtaining a plurality of boundary feature sets corresponding to a plurality of to-be-processed entities respectively based on a plurality of third information sets includes:
based on a kth entity to be processed in the plurality of entities to be processed, segmenting each third information in a kth third information set in the plurality of third information sets to obtain a plurality of text fragments corresponding to each third information; wherein k is an integer of 1 or more;
obtaining boundary feature subsets corresponding to each piece of third information respectively based on words adjacent to the kth entity to be processed in the text fragments;
and obtaining a boundary feature set corresponding to the kth entity to be processed based on the boundary feature subsets respectively corresponding to each third information.
Taking the entity to be processed as a brand name, the third information as an advertisement, the third information set as an advertisement data set as an example, and obtaining the boundary feature set of the entity to be processed based on a plurality of advertisements in the advertisement data set as follows:
let the advertisement data set containing the brand name B beWherein (1)>For advertising data setsIth advertisement, |T B And I is the advertisement quantity in the advertisement data set.
Each advertisement containing B is set as t 'without losing generality, can be segmented into a plurality of text fragments by B, and the set of the text fragments is { t' 1 ,B,t′ 2 ,B,t′ 3 ,…,B,t′ l |l>1}. Wherein t' i Is the ith text segment in advertisement t' except B.
For each shape { t }' i ,B,t′ i+1 Text of }, t 'respectively' i And t' i+1 Word segmentation is carried out to obtain two word sequences { t' i,1 ,t′ i,2 ,…,t′ i,m M is greater than or equal to 1 and { t' i+1,1 ,t′ i+1,2 ,…,t′ i+1,n And n is not less than 1. Wherein m is t' i The number of words in (1), n is t' i+1 The number of words in (a).
Given the parameter ngram=k, then for each shape like { t' i ,B,t′ i+1 The text-generated ngram boundary feature set of } is:
for all shapes in the advertisement based t ', like { t' i ,B,t′ i+1 Combining the text-generated ngram boundary feature sets to obtain boundary feature subsets corresponding to the advertisement t' and related to the brand B:
for example, advertisement t' is "some spring holiday village B deserves a good deal of travel". First, the advertisement t' is segmented into { some ground spa holidays, B, worth one trip }, based on B. Then, the words of 'certain ground hot spring holiday village' and 'worth doing one trip' are respectively cut, and two word sequences { certain ground, hot spring, holiday village } and { worth doing, everything, one trip } are obtained. If k=2, the generated boundary feature set includes a text segment "spa-holiday" composed of the last two words in the word sequence { somewhere, spa, holiday }, a text segment "holiday" composed of the last word, and a text segment "worth" composed of the first two words in the word sequence { worth, everyone, one trip }, a text segment "worth" composed of the first word. Since B appears only once in the advertisement, the boundary feature subset of B corresponding to the advertisement is { spa-vacation village, worth }.
Combining boundary feature subsets corresponding to the advertisements respectively and related to the brand B to obtain an overall boundary feature set of the brand B:
f B,k ={f t′,B,k |t′∈T B formula (6)
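The boundary feature extraction described above can be sketched as follows. The advertisement is represented as a pre-segmented word list (a stand-in for real Chinese word segmentation), and the words themselves are the illustration values from the example above.

```python
# Sketch of boundary feature extraction: for each occurrence of brand B in
# an advertisement's word sequence, take suffixes of up to k words on its
# left and prefixes of up to k words on its right as boundary features.
# (A simplification: fragments are taken relative to each occurrence of B
# rather than strictly between consecutive occurrences.)

def boundary_features(words, brand, k):
    feats = set()
    for idx, w in enumerate(words):
        if w != brand:
            continue
        left, right = words[:idx], words[idx + 1:]
        # left side: suffixes of 1..k words ending just before B
        for j in range(1, k + 1):
            if len(left) >= j:
                feats.add(" ".join(left[-j:]))
        # right side: prefixes of 1..k words starting just after B
        for j in range(1, k + 1):
            if len(right) >= j:
                feats.add(" ".join(right[:j]))
    return feats

# Pre-segmented words of "somewhere hot spring vacation village B worth one trip".
words = ["somewhere", "hot spring", "vacation village", "B", "worth", "one trip"]
feats = boundary_features(words, "B", k=2)
```

Entities whose surrounding words recur in the same patterns end up with overlapping feature sets, which is what makes boundary-error cases cluster together for batch correction.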
For example, after obtaining the boundary feature set of each entity to be processed, any clustering algorithm may be used to cluster a plurality of entities to be processed based on a plurality of boundary feature sets. Clustering algorithms include, but are not limited to, K-Means, DBSCAN, OPTICS, and the like.
The implementation of the embodiments of the present disclosure is described below using a specific example. Fig. 3 is a schematic diagram of this specific example. As shown in fig. 3, a brand entity recognition result list 310 is first obtained from massive advertisements based on a recognition technique such as NER. It contains a plurality of brands to be processed (corresponding to the to-be-processed entities), which may have boundary errors and need manual correction.
Then, for each brand to be processed, advertisements containing the brand to be processed are searched in the advertisement library 320, and each brand advertisement sampling result 330 (corresponding to a plurality of third information sets corresponding to a plurality of entities to be processed respectively) is obtained.
After each brand advertisement sampling result 330 is obtained, each brand to be processed is corrected based on each brand advertisement sampling result 330 using the brand boundary recognition error correction module 340. The correction process includes:
for each brand, brand dimension feature set generation is achieved. The brand dimension feature set is, for example, a boundary feature set;
realizing brand dimension clustering based on the brand dimension feature set to obtain at least one brand cluster (equivalent to an entity cluster);
traversing at least one brand cluster, and correcting boundary recognition errors in batches for at least one brand in the traversed brand cluster.
After correction for a plurality of brands to be processed, a plurality of relatively accurate brands can be obtained and added to the brand library. Using the brand ambiguity ranking module 350, each brand in the brand library (equivalent to the target entity) may be ranked to determine whether the manner in which each brand is subsequently detected in the advertisement is vocabulary detection or model detection based on the ranking results. Specifically, the ordering process includes:
for each brand, generating an advertisement dimension feature set, i.e., generating feature sets for a plurality of advertisements of that brand (corresponding to the first information);
evaluating an ambiguity score for each brand through advertisement dimension clustering; specifically, for each brand, the plurality of advertisements of that brand are clustered to obtain a plurality of advertisement clusters, and the ambiguity score is evaluated based on the clustering effect of those advertisement clusters;
ranking the brands according to their ambiguity scores.
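A compact sketch of this advertisement-dimension scoring might look as follows. The token-set features, the Jaccard threshold, and the use of the cluster count as the ambiguity score are all illustrative assumptions, since the text leaves the concrete clustering algorithm open.

```python
def token_set(text):
    return set(text.lower().split())

def cluster_ads(ads, threshold=0.3):
    """Greedy single-pass clustering of advertisements by Jaccard similarity of token sets."""
    clusters = []  # each cluster: list of token sets; first member is the seed
    for ad in ads:
        toks = token_set(ad)
        for cluster in clusters:
            inter, union = len(toks & cluster[0]), len(toks | cluster[0])
            if union and inter / union >= threshold:
                cluster.append(toks)
                break
        else:
            clusters.append([toks])
    return clusters

def ambiguity_score(ads):
    """More advertisement clusters -> more diverse contexts -> higher assumed ambiguity."""
    return len(cluster_ads(ads))

def rank_brands(brand_to_ads):
    """Rank brands from most to least ambiguous."""
    return sorted(brand_to_ads, key=lambda b: ambiguity_score(brand_to_ads[b]), reverse=True)
```

A brand whose advertisements split into many clusters (e.g., fruit-related and phone-related contexts for the same string) ranks as more ambiguous than one whose advertisements all cluster together.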
It can be seen that the method of the embodiments of the present disclosure determines, according to the ambiguity score of the target entity, the manner of detecting the target entity in the second information, which can improve both the efficiency and the accuracy of that detection. In some application examples, the to-be-processed entities identified by technologies such as template mining and NER are clustered; to-be-processed entities in different entity clusters are corrected separately, while those in the same entity cluster are corrected together. Because entities in the same cluster have similar characteristics, this improves the efficiency of manual correction.
As an implementation of the above methods, the present disclosure further provides an entity processing apparatus. Fig. 4 is a schematic diagram of an entity processing apparatus according to an embodiment of the present disclosure, the apparatus including:
an information clustering module 410, configured to cluster a plurality of first information including a target entity to obtain N information clusters; wherein N is an integer greater than or equal to 1;
an effect determining module 420, configured to determine a clustering effect based on the N information clusters;
a score determining module 430, configured to determine an ambiguity score of the target entity according to the clustering effect;
a mode determination module 440 for determining a mode for detecting the target entity in the second information based on the ambiguity score of the target entity.
Illustratively, as shown in FIG. 5, the information clustering module 410 includes:
a word segmentation unit 411, configured to segment each first information in the plurality of first information, to obtain a word sequence of each first information;
a combining unit 412, configured to combine the words in the word sequence of each first information based on a preset combination manner, so as to obtain a feature set of each first information;
a first clustering unit 413, configured to cluster the plurality of first information based on the feature set of each first information, to obtain N information clusters.
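The combining unit's "preset combination manner" is not fixed by the text; one plausible reading is combining adjacent words of the word sequence into n-grams, as in this hypothetical sketch (`word_ngrams` and the choice of `max_n` are assumptions):

```python
def word_ngrams(words, max_n=2):
    """Combine adjacent words of a word sequence into n-grams (one assumed
    'preset combination manner'); the union of all n-grams forms the feature set."""
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats
```

For the word sequence `["brand", "new", "phone"]` this yields the unigrams plus the bigrams "brand new" and "new phone", so pieces of first information sharing local word patterns end up with overlapping feature sets for the subsequent clustering.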
Illustratively, the effect determination module 420 is to:
determine the clustering effect according to the intra-cluster distance in each of the N information clusters.
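One way to turn intra-cluster distances into a single clustering-effect number, purely illustrative since the module only requires that the effect reflect intra-cluster spacing, is the mean pairwise distance over all clusters (here with Jaccard distance on feature sets):

```python
def jaccard_distance(a, b):
    """Distance between two feature sets: 1 minus Jaccard similarity."""
    union = a | b
    return 1.0 - (len(a & b) / len(union)) if union else 0.0

def clustering_effect(clusters):
    """Mean pairwise intra-cluster distance, over clusters with >= 2 members.
    A lower value indicates tighter clusters, i.e., less semantic diversity."""
    dists = []
    for members in clusters:
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                dists.append(jaccard_distance(members[i], members[j]))
    return sum(dists) / len(dists) if dists else 0.0
```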
Illustratively, the mode determination module 440 is configured to:
select the manner of detecting the target entity in the second information from a vocabulary detection manner and a model detection manner according to the ambiguity score of the target entity and a preset threshold value.
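The threshold decision itself reduces to a comparison. The sketch below pairs it with a trivial vocabulary detector to show the intended trade-off; the default threshold, function names, and the substring-based vocabulary detection are illustrative assumptions:

```python
def choose_detection_manner(ambiguity_score, threshold=2):
    """Low-ambiguity entities are safe for cheap vocabulary matching;
    high-ambiguity entities are routed to a (not shown) model-based detector."""
    return "model" if ambiguity_score > threshold else "vocabulary"

def vocabulary_detect(entity, text):
    """The cheap path: detect the entity in second information by direct lookup."""
    return entity in text
```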
Fig. 6 is a schematic diagram of an entity processing apparatus according to another embodiment of the present disclosure, the apparatus including:
an obtaining module 610, configured to obtain a plurality of to-be-processed entities and a plurality of third information sets corresponding to the plurality of to-be-processed entities respectively;
the entity clustering module 620 is configured to cluster a plurality of entities to be processed based on a plurality of third information sets, to obtain at least one entity cluster;
the correction module 630 is configured to traverse at least one entity cluster, and perform batch correction on at least one entity to be processed in the traversed entity cluster, so as to obtain a target entity.
Illustratively, as shown in FIG. 6, the entity clustering module 620 includes:
a boundary determining unit 621, configured to obtain a plurality of boundary feature sets corresponding to a plurality of entities to be processed, respectively, based on the plurality of third information sets;
a second clustering unit 622, configured to cluster the plurality of to-be-processed entities based on the plurality of boundary feature sets, to obtain at least one entity cluster.
The functions of each unit, module or sub-module in each apparatus of the embodiments of the present disclosure may be referred to the corresponding descriptions in the above method embodiments, which are not repeated herein.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the entity processing method. For example, in some embodiments, the entity processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the entity processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the entity processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. An entity processing method, comprising:
clustering a plurality of first information containing a target entity to obtain N information clusters; wherein N is an integer greater than or equal to 1;
determining a clustering effect based on an intra-cluster spacing in each of the N information clusters and/or the number N of the information clusters, wherein the clustering effect characterizes whether semantics among the plurality of first information are diversified;
determining an ambiguity score of the target entity according to the clustering effect, wherein the ambiguity characterizes whether the text corresponding to the entity can be read with various meanings in the information; and
selecting a manner of detecting the target entity in second information from a vocabulary detection manner and a model detection manner according to the ambiguity score of the target entity and a preset threshold.
2. The method of claim 1, wherein clustering the plurality of first information containing the target entity to obtain N clusters of information comprises:
performing word segmentation on each piece of first information in the plurality of first information containing the target entity, to obtain a word sequence of each piece of first information;
combining the words in the word sequence of each piece of first information based on a preset combination manner, to obtain a feature set of each piece of first information; and
clustering the plurality of first information based on the feature set of each piece of first information, to obtain the N information clusters.
3. The method of claim 1 or 2, further comprising:
acquiring a plurality of to-be-processed entities and a plurality of third information sets respectively corresponding to the plurality of to-be-processed entities;
clustering the plurality of to-be-processed entities based on the plurality of third information sets, to obtain at least one entity cluster; and
traversing the at least one entity cluster, and performing batch correction on at least one to-be-processed entity in each traversed entity cluster, to obtain the target entity.
4. The method of claim 3, wherein the clustering the plurality of to-be-processed entities based on the plurality of third information sets to obtain at least one entity cluster includes:
obtaining, based on the plurality of third information sets, a plurality of boundary feature sets respectively corresponding to the plurality of to-be-processed entities; and
clustering the plurality of to-be-processed entities based on the plurality of boundary feature sets, to obtain the at least one entity cluster.
5. An entity processing apparatus, comprising:
the information clustering module is used for clustering a plurality of pieces of first information containing target entities to obtain N information clusters; wherein N is an integer greater than or equal to 1;
the effect determining module is used for determining a clustering effect based on the N information clusters, wherein the clustering effect characterizes semantic diversity among a plurality of first information;
the score determining module is used for determining an ambiguity score of the target entity according to the clustering effect, wherein the ambiguity characterizes whether the text corresponding to the entity can be read with various meanings in the information;
a mode determining module, configured to determine a mode of detecting the target entity in the second information according to the ambiguity score of the target entity;
the effect determining module is used for:
determining a clustering effect according to the intra-cluster spacing in each of the N information clusters and/or the number N of the information clusters;
the mode determining module is used for:
selecting the mode for detecting the target entity in the second information from a vocabulary detection mode and a model detection mode according to the ambiguity score of the target entity and a preset threshold.
6. The apparatus of claim 5, wherein the information clustering module comprises:
the word segmentation unit is used for respectively segmenting each first message in the plurality of first messages containing the target entity to obtain a word sequence of each first message;
the combination unit is used for combining the words in the word sequence of each piece of first information based on a preset combination mode to obtain a feature set of each piece of first information;
and the first clustering unit is used for clustering the plurality of first information based on the feature set of each piece of first information to obtain N information clusters.
7. The apparatus of claim 5 or 6, further comprising:
the acquisition module is used for acquiring a plurality of to-be-processed entities and a plurality of third information sets respectively corresponding to the to-be-processed entities;
the entity clustering module is used for clustering the plurality of to-be-processed entities based on the plurality of third information sets to obtain at least one entity cluster;
and the correction module is used for traversing the at least one entity cluster, and carrying out batch correction on at least one entity to be processed in the traversed entity cluster to obtain the target entity.
8. The apparatus of claim 7, wherein the entity clustering module comprises:
the boundary determining unit is used for obtaining a plurality of boundary feature sets corresponding to the plurality of entities to be processed respectively based on the plurality of third information sets;
and the second clustering unit is used for clustering the plurality of entities to be processed based on the plurality of boundary feature sets to obtain at least one entity cluster.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.
CN202110529375.7A 2021-05-14 2021-05-14 Entity processing method, device, electronic equipment and storage medium Active CN113239149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529375.7A CN113239149B (en) 2021-05-14 2021-05-14 Entity processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113239149A (en) 2021-08-10
CN113239149B (en) 2024-01-19

Family

ID=77134378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529375.7A Active CN113239149B (en) 2021-05-14 2021-05-14 Entity processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113239149B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704643B (en) * 2021-09-03 2022-10-18 北京百度网讯科技有限公司 Method and device for determining state of target object, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN112231507A (en) * 2020-10-14 2021-01-15 维沃移动通信有限公司 Identification method and device and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7685201B2 (en) * 2006-09-08 2010-03-23 Microsoft Corporation Person disambiguation using name entity extraction-based clustering
US10546026B2 (en) * 2017-03-31 2020-01-28 International Business Machines Corporation Advanced search-term disambiguation
US20210110322A1 (en) * 2019-10-09 2021-04-15 Visa International Service Association Computer Implemented Method for Detecting Peers of a Client Entity


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Chinese Named Entity Disambiguation Method Based on Context Information; Wang Xuyang; Jiang Xiqiu; Application Research of Computers (04); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant