CN110442674B - Label propagation clustering method, terminal equipment, storage medium and device - Google Patents

Label propagation clustering method, terminal equipment, storage medium and device

Info

Publication number
CN110442674B
Authority
CN
China
Prior art keywords
text
target
node
clustering
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910504157.0A
Other languages
Chinese (zh)
Other versions
CN110442674A (en
Inventor
尹帆
张广凯
宋中山
覃俊
郑禄
吴经龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN201910504157.0A
Publication of CN110442674A
Application granted
Publication of CN110442674B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a label propagation clustering method, a terminal device, a storage medium and an apparatus. The method comprises the following steps: acquiring frequent words of each text; extracting text information of the texts from a sample text set, and constructing a heterogeneous text network from the text information through a preset mapping relation; generating a node influence threshold value for the corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and acquiring target labels according to the node influence threshold value; generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation, and acquiring target text nodes according to the total similarity threshold value; and propagating the target labels among the target text nodes, and clustering the texts corresponding to the same target label to obtain clustering result clusters. The technical scheme of the invention addresses the technical problems of high label propagation randomness, low clustering accuracy and low reliability.

Description

Label propagation clustering method, terminal equipment, storage medium and device
Technical Field
The present invention relates to the field of label propagation and clustering technologies, and in particular to a label propagation clustering method, a terminal device, a storage medium, and an apparatus.
Background
At present, in fields such as agricultural production, information retrieval, finance, and biological information processing, large amounts of data must be processed before they can be used; typically, labels are propagated over the data and the data are then clustered. For example, when analyzing pest damage to crops, the damage symptoms of the affected crops are first labeled, and it is then determined which pest type the damage belongs to; a label propagation algorithm can cluster the symptoms quickly to obtain a result, so that the pest damage can finally be remedied. However, label propagation algorithms are highly random, and the accuracy and reliability of clustering the labeled data are low.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main object of the invention is to provide a label propagation clustering method, a terminal device, a storage medium and an apparatus, aiming to solve the technical problems of high label propagation randomness, low clustering accuracy and low reliability.
In order to achieve the above object, the present invention provides a label propagation clustering method, which includes the following steps:
performing word segmentation processing on the texts in the sample text set to obtain frequent words of each text;
extracting text information of the text from the sample text set, and constructing a heterogeneous text network according to the text information through a preset mapping relation;
generating a node influence threshold value by corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and acquiring a target label according to the node influence threshold value;
generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation, and acquiring a target text node according to the total similarity threshold value;
and transmitting the target label among the target text nodes, and clustering texts corresponding to the same target label to obtain a clustering result cluster.
Preferably, the performing word segmentation processing on the texts in the sample text set to obtain frequent words of each text specifically includes:
performing word segmentation and part-of-speech tagging on the texts in the sample text set through FNLP to obtain feature words;
performing TF-IDF operation on the characteristic words to obtain the word frequency and the inverse document frequency of the characteristic words;
generating a weight threshold value of the characteristic word through a preset weight corresponding relation according to the word frequency and the inverse document frequency;
and comparing the weight threshold of the feature words with a preset frequent word threshold, and acquiring target feature words according to comparison results so as to take the target feature words as frequent words of the text.
Preferably, the extracting text information of the text from the sample text set, and constructing a heterogeneous text network according to the text information through a preset mapping relationship specifically includes:
extracting text information of the text from the sample text set;
and setting directed edges between the text nodes with the text information according to the text information through a preset mapping relation so as to construct a heterogeneous text network.
Preferably, the generating a node influence threshold value from the corresponding text node in the heterogeneous text network according to a preset node influence relationship, and acquiring the target label according to the node influence threshold value specifically includes:
generating a node influence threshold value by the corresponding text node in the heterogeneous text network through a preset node influence relationship;
and comparing the node influence threshold with a preset node influence threshold, and acquiring a target text according to a comparison result so as to take frequent words of the target text as target labels.
Preferably, the generating a total similarity threshold between the texts in the heterogeneous text network through a preset total similarity relationship, and obtaining a target text node according to the total similarity threshold specifically includes:
constructing a frequent word-text matrix according to the frequent words and the text to obtain text vectors corresponding to the text, and generating an internal feature similarity threshold value between the texts through a preset cosine similarity relation for the text vectors;
in the heterogeneous text network, generating an extrinsic feature similarity threshold value between the texts through a preset path similarity relation;
generating a total similarity threshold of the text by presetting a total similarity relation according to the internal feature similarity threshold and the external feature similarity threshold;
and acquiring a target text node according to the total similarity threshold.
Preferably, the obtaining a target text node according to the total similarity threshold specifically includes:
comparing the total similarity threshold with a preset text total similarity threshold, and acquiring target text nodes in the heterogeneous text network according to the comparison result.
Preferably, the propagating the target label among the target text nodes and clustering texts corresponding to the same target label to obtain a clustering result cluster specifically includes:
if the target text nodes are connected by a directed edge in the heterogeneous text network, the target label is propagated among the target text nodes according to the direction of the directed edge;
if the target text nodes are connected by an undirected edge or a bidirectional edge in the heterogeneous text network, the target text nodes are sorted according to their corresponding node influence thresholds to obtain a sorting result, and the target label is propagated among the target text nodes according to the sorting result;
and clustering the texts corresponding to the same target label to obtain a clustering result cluster.
In addition, to achieve the above object, the present invention further provides a terminal device, including: a memory, a processor, and a label propagation clustering program stored on the memory and executable on the processor, the label propagation clustering program, when executed by the processor, implementing the steps of the label propagation clustering method described above.
In addition, to achieve the above object, the present invention further provides a storage medium on which a label propagation clustering program is stored, and the label propagation clustering program, when executed by a processor, implements the steps of the label propagation clustering method described above.
In addition, to achieve the above object, the present invention further provides a label propagation clustering device, including:
the frequent word acquisition module is used for carrying out word segmentation processing on the texts in the sample text set so as to acquire frequent words of each text;
the heterogeneous text network construction module is used for extracting text information of the text from the sample text set and constructing a heterogeneous text network according to the text information through a preset mapping relation;
the target label acquisition module is used for generating a node influence threshold value for the corresponding text node in the heterogeneous text network through a preset node influence relationship and acquiring a target label according to the node influence threshold value;
the target text node acquisition module is used for generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation and acquiring a target text node according to the total similarity threshold value;
and the propagation and clustering module is used for propagating the target labels among the target text nodes and clustering texts corresponding to the same target labels to obtain a clustering result cluster.
In the invention, the frequent words of each text are obtained by performing word segmentation processing on the texts in the sample text set; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information through a preset mapping relation; a node influence threshold value is generated for the corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and target labels are acquired according to the node influence threshold value; a total similarity threshold value between the texts in the heterogeneous text network is generated through a preset total similarity relation, and target text nodes are acquired according to the total similarity threshold value; and the target labels are propagated among the target text nodes, and the texts corresponding to the same target label are clustered to obtain clustering result clusters. The technical scheme of the invention addresses the technical problems of high label propagation randomness, low clustering accuracy and low reliability.
Drawings
Fig. 1 is a schematic structural diagram of a terminal device in a hardware operating environment according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a first embodiment of the label propagation clustering method of the present invention;
Fig. 3 is a schematic flow chart of a second embodiment of the label propagation clustering method of the present invention;
Fig. 4 is a block diagram of a first embodiment of the label propagation clustering apparatus of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a terminal device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the terminal device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen, and may optionally further include a standard wired interface and a wireless interface; in the present invention, the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the terminal device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a tag-propagated clustering program.
In the terminal device shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a peripheral and performing data communication with the peripheral; the terminal device calls the label propagation clustering program stored in the memory 1005 through the processor 1001, and executes the label propagation clustering method provided by the embodiment of the present invention, performing the following operations:
performing word segmentation processing on the texts in the sample text set to obtain frequent words of each text;
extracting text information of the text from the sample text set, and constructing a heterogeneous text network according to the text information through a preset mapping relation;
generating a node influence threshold value by corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and acquiring a target label according to the node influence threshold value;
generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation, and acquiring a target text node according to the total similarity threshold value;
and transmitting the target label among the target text nodes, and clustering texts corresponding to the same target label to obtain a clustering result cluster.
Further, the processor 1001 may call a clustering routine of tag propagation stored in the memory 1005, and also perform the following operations:
performing word segmentation and part-of-speech tagging on the texts in the sample text set through FNLP to obtain feature words;
performing TF-IDF operation on the characteristic words to obtain the word frequency and the inverse document frequency of the characteristic words;
generating a weight threshold value of the characteristic word through a preset weight corresponding relation according to the word frequency and the inverse document frequency;
and comparing the weight threshold of the feature words with a preset frequent word threshold, and acquiring target feature words according to comparison results so as to take the target feature words as frequent words of the text.
Further, the processor 1001 may call a clustering routine of tag propagation stored in the memory 1005, and also perform the following operations:
extracting text information of the text from the sample text set;
and setting directed edges between the text nodes with the text information according to the text information through a preset mapping relation so as to construct a heterogeneous text network.
Further, the processor 1001 may call a clustering routine of tag propagation stored in the memory 1005, and also perform the following operations:
generating a node influence threshold value by the corresponding text node in the heterogeneous text network through a preset node influence relationship;
and comparing the node influence threshold with a preset node influence threshold, and acquiring a target text according to a comparison result so as to take frequent words of the target text as target labels.
Further, the processor 1001 may call a clustering routine of tag propagation stored in the memory 1005, and also perform the following operations:
constructing a frequent word-text matrix according to the frequent words and the text to obtain text vectors corresponding to the text, and generating an internal feature similarity threshold value between the texts through a preset cosine similarity relation for the text vectors;
in the heterogeneous text network, generating an extrinsic feature similarity threshold value between the texts through a preset path similarity relation;
generating a total similarity threshold of the text by presetting a total similarity relation according to the internal feature similarity threshold and the external feature similarity threshold;
and acquiring a target text node according to the total similarity threshold.
Further, the processor 1001 may call a clustering routine of tag propagation stored in the memory 1005, and also perform the following operations:
comparing the total similarity threshold with a preset text total similarity threshold, and acquiring target text nodes in the heterogeneous text network according to the comparison result.
Further, the processor 1001 may call a clustering routine of tag propagation stored in the memory 1005, and also perform the following operations:
if the target text nodes are connected by a directed edge in the heterogeneous text network, the target label is propagated among the target text nodes according to the direction of the directed edge;
if the target text nodes are connected by an undirected edge or a bidirectional edge in the heterogeneous text network, the target text nodes are sorted according to their corresponding node influence thresholds to obtain a sorting result, and the target label is propagated among the target text nodes according to the sorting result;
and clustering the texts corresponding to the same target label to obtain a clustering result cluster.
In the embodiment, the frequent words of each text are obtained by performing word segmentation processing on the texts in the sample text set; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information through a preset mapping relation; a node influence threshold value is generated for the corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and target labels are acquired according to the node influence threshold value; a total similarity threshold value between the texts in the heterogeneous text network is generated through a preset total similarity relation, and target text nodes are acquired according to the total similarity threshold value; and the target labels are propagated among the target text nodes, and the texts corresponding to the same target label are clustered to obtain clustering result clusters. The technical scheme of the invention addresses the technical problems of high label propagation randomness, low clustering accuracy and low reliability.
Based on the above hardware structure, embodiments of the label propagation clustering method of the present invention are proposed.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the label propagation clustering method of the present invention.
In a first embodiment, the label propagation clustering method includes the following steps:
Step S10: performing word segmentation processing on the texts in the sample text set to obtain frequent words of each text.
It is understood that, in the present embodiment, the text refers to a representation form of written language, and from a literature perspective, it is usually a sentence or a combination of sentences having complete and systematic meaning; a text may be a sentence, a paragraph, or a chapter, which is not described in detail herein.
In the specific implementation, a sample text set is collected in advance, word segmentation and part-of-speech tagging are performed on texts in the sample text set to obtain feature words, word frequency and inverse document frequency of the feature words are obtained according to the feature words, and then frequent words of each text are obtained according to the preset weight corresponding relation.
Step S20: extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information through a preset mapping relation.
It should be noted that, in this embodiment, the text information includes follow relationships among the authors of the texts, as well as likes, forwards, and citations of the texts, and the like, which are not described in detail here.
In a specific implementation, text information of the texts is extracted from the sample text set, and directed edges are set between the text nodes having the text information, according to the text information and through a preset mapping relation, so as to construct a heterogeneous text network.
Step S30: generating a node influence threshold value for the corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and acquiring target labels according to the node influence threshold value.
It should be noted that, in this embodiment, the node influence threshold is compared with a preset node influence threshold, and a target text is obtained according to the comparison result, so that the frequent words of the target text are used as target labels.
Step S40: generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation, and acquiring target text nodes according to the total similarity threshold value.
It should be noted that, in this embodiment, the intrinsic feature similarity threshold is obtained from the frequent words through the preset cosine similarity relation; the extrinsic feature similarity threshold is obtained in the heterogeneous text network through the preset path similarity relation; and finally, the total similarity threshold of the texts is generated from the intrinsic feature similarity threshold and the extrinsic feature similarity threshold through the preset total similarity relation, so as to obtain the target text nodes.
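The similarity computation in step S40 can be illustrated with a minimal sketch. Only the cosine part follows directly from the text: binary text vectors are built from a frequent word-text matrix and compared by cosine similarity. The path similarity relation and the total similarity relation are not specified in this passage, so the extrinsic similarity is passed in as a given number and the combination is ASSUMED to be a weighted sum with a hypothetical weight `alpha`. All function names are illustrative.

```python
import math


def text_vectors(frequent):
    """Build binary text vectors from a frequent word-text matrix.

    frequent: dict mapping text id -> set of frequent words of that text.
    """
    vocab = sorted(set().union(*frequent.values()))
    return {t: [1 if w in words else 0 for w in vocab]
            for t, words in frequent.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def total_similarity(intrinsic, extrinsic, alpha=0.5):
    """ASSUMPTION: the total similarity is a weighted sum of both parts."""
    return alpha * intrinsic + (1 - alpha) * extrinsic


frequent = {
    "d1": {"pest", "leaf"},
    "d2": {"pest", "spray"},
    "d3": {"market", "price"},
}
vectors = text_vectors(frequent)
intrinsic = cosine_similarity(vectors["d1"], vectors["d2"])
# 0.8 stands in for a path-based (extrinsic) similarity computed elsewhere.
print(total_similarity(intrinsic, 0.8))
```

Text pairs whose total similarity exceeds the preset text total similarity threshold would then be kept as target text nodes.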
Step S50: propagating the target labels among the target text nodes, and clustering the texts corresponding to the same target label to obtain clustering result clusters.
It should be noted that, in this embodiment, a label propagation algorithm is introduced: the target labels are propagated among the target text nodes, and finally the texts corresponding to the same target label are clustered to obtain clustering result clusters, at which point the whole process ends.
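A minimal sketch of the propagation and clustering in step S50 follows, under several stated assumptions: labels travel from the source to the target of a directed edge; on undirected or bidirectional edges the node with the higher influence value passes its label to the other; and a node that already carries a label keeps it. These choices, and the function name, are illustrative and are not confirmed by the text.

```python
from collections import defaultdict


def propagate_and_cluster(seed_labels, directed, undirected, influence):
    """Spread seed labels over the network, then group nodes by label.

    seed_labels: dict node -> target label (the initially labeled nodes)
    directed:    set of (source, target) edges
    undirected:  set of (a, b) edges without direction (or bidirectional)
    influence:   dict node -> node influence value used for ordering
    """
    labels = dict(seed_labels)
    changed = True
    while changed:
        changed = False
        # Directed edges: the label follows the edge direction.
        for src, dst in sorted(directed):
            if src in labels and dst not in labels:
                labels[dst] = labels[src]
                changed = True
        # Undirected edges: the higher-influence endpoint passes its label on.
        for a, b in sorted(undirected):
            hi, lo = (a, b) if influence[a] >= influence[b] else (b, a)
            if hi in labels and lo not in labels:
                labels[lo] = labels[hi]
                changed = True
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return dict(clusters)


clusters = propagate_and_cluster(
    seed_labels={"d1": "pest", "d4": "market"},
    directed={("d1", "d2")},
    undirected={("d2", "d3"), ("d4", "d5")},
    influence={"d1": 1.2, "d2": 1.0, "d3": 0.8, "d4": 1.1, "d5": 0.9},
)
print(clusters)
```

Iterating over sorted edges makes the result deterministic, which reflects the stated goal of reducing the randomness of label propagation.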
It is worth to be noted that, in the embodiment, a weighted directed heterogeneous text network is introduced, and the multi-dimensional features of the text are mined to perform similarity calculation, so that the accuracy and the reliability of the clustering result are improved.
In the first embodiment, the frequent words of each text are obtained by performing word segmentation processing on the texts in the sample text set; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information through a preset mapping relation; a node influence threshold value is generated for the corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and target labels are acquired according to the node influence threshold value; a total similarity threshold value between the texts in the heterogeneous text network is generated through a preset total similarity relation, and target text nodes are acquired according to the total similarity threshold value; and the target labels are propagated among the target text nodes, and the texts corresponding to the same target label are clustered to obtain clustering result clusters. The technical scheme of the invention addresses the technical problems of high label propagation randomness, low clustering accuracy and low reliability.
Referring to fig. 3, fig. 3 is a flowchart illustrating a clustering method for tag propagation according to a second embodiment of the present invention, and the second embodiment of the clustering method for tag propagation according to the present invention is proposed based on the first embodiment illustrated in fig. 2.
In the second embodiment, the step S10 specifically includes:
Step S11: performing word segmentation and part-of-speech tagging on the texts in the sample text set through FNLP (a machine-learning-based toolkit for Chinese natural language text processing) to obtain feature words; and performing a TF-IDF (Term Frequency-Inverse Document Frequency) operation on the feature words to obtain the word frequency and the inverse document frequency of the feature words. TF-IDF is a weighting technique commonly used in information retrieval and data mining, where TF denotes term frequency and IDF denotes inverse document frequency.
In this embodiment, the TF-IDF operation uses the following calculation formulas
Figure GDA0002213811640000101
and
Figure GDA0002213811640000102
to obtain the word frequency $tf_{ij}$ and the inverse document frequency $idf_i$, where i and j are positive integers.
Step S12: generating a weight threshold value of the characteristic word through a preset weight corresponding relation according to the word frequency and the inverse document frequency; and comparing the weight threshold of the feature words with a preset frequent word threshold, and acquiring target feature words according to comparison results so as to take the target feature words as frequent words of the text.
It should be noted that, in this embodiment, the preset weight correspondence is the calculation formula $w_i = tf_{ij} \cdot idf_i$, which yields the weight threshold $w_i$ of each feature word. The weight threshold $w_i$ is compared with the preset frequent word threshold, and the feature words whose weight threshold $w_i$ is larger than the preset frequent word threshold are taken as the frequent words $f_i$ of the text.
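Steps S11 and S12 can be sketched as follows. The formula images are not reproduced in this text, so the sketch ASSUMES the common textbook definitions: tf is the count of a word in a text divided by the text length, and idf is the natural logarithm of the number of texts divided by the number of texts containing the word. Segmentation (done with FNLP in the embodiment) is assumed to have happened already; `frequent_words` and its parameters are illustrative names.

```python
import math
from collections import Counter


def frequent_words(docs, threshold):
    """Return, for each text, the words whose tf * idf weight exceeds threshold.

    docs: list of token lists (texts that are already segmented).
    """
    n_docs = len(docs)
    # Document frequency: in how many texts each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        selected = set()
        for word, n in counts.items():
            tf = n / len(doc)                        # assumed tf definition
            idf = math.log(n_docs / doc_freq[word])  # assumed idf definition
            if tf * idf > threshold:                 # w_i = tf_ij * idf_i
                selected.add(word)
        result.append(selected)
    return result


docs = [
    ["rice", "pest", "leaf", "pest"],
    ["rice", "market", "price"],
    ["rice", "pest", "spray"],
]
print(frequent_words(docs, threshold=0.1))
```

In this example "rice" occurs in every text, so its idf is zero and it is never selected as a frequent word.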
Further, the step S20 specifically includes:
Step S21: extracting text information of the texts from the sample text set.
It should be noted that, in this embodiment, the text information includes follow relationships among the authors of the texts, as well as likes, forwards, and citations of the texts, and the like, which are not described in detail here; each text and its corresponding author are each taken as a node.
Step S22: setting directed edges between the text nodes having the text information, according to the text information and through a preset mapping relation, so as to construct a heterogeneous text network.
It should be noted that, in this embodiment, directed edges are added between nodes that have one of the preset mapping relations: two author nodes marked as having a follow relationship, an author node and a text node marked as having a forwarding relationship, and two text nodes marked as having a citation relationship. In addition, for author nodes not marked as having a follow relationship, if one author likes or comments on the texts of another author and the percentage of such texts exceeds a preset attention probability threshold, a directed edge is also added. The rules can be expressed abstractly as follows:
If (u_i comments on or likes d_j)
{
Add new edge u_i → d_j to the network
}
If (u_i follows u_j)
{
Add new edge u_i → u_j to the network
}
Else if (u_i does not follow u_j and the probability that u_i follows u_j is greater than the preset attention probability threshold)
{
Add new edge u_i → u_j to the network
}
A two-dimensional heterogeneous text network is constructed according to the above rules. The correspondence between the different edges in the network is shown in the following table:
network relationships Representation form
Author u1Published text d1 u1-d1
Author u1Pay attention to the author u2 Eu12:u1u2
Author node u1Praise or comment text d4 Eud14:u1----→d4
Text d1Reference is made to text d2 Ed12:d1---→d2
It is easy to understand that a multidimensional heterogeneous text network can be constructed according to a plurality of nodes and characteristic information thereof, which is not described in detail herein.
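By way of non-limiting illustration, the edge rules of step S22 can be sketched as follows. The input structures and the name `build_heterogeneous_edges` are hypothetical, forwarding edges are omitted for brevity, and the "percentage of the text number" is interpreted here as the share of the other author's texts that were liked or commented on; the specification does not define it more precisely.

```python
from collections import defaultdict


def build_heterogeneous_edges(author_of, follows, interactions, citations,
                              attention_threshold):
    """Return the set of directed edges (source, target) of the network.

    author_of:    dict text -> author (the undirected "published" relation)
    follows:      set of (u_i, u_j) pairs, u_i follows u_j
    interactions: set of (u_i, d_j) pairs, u_i liked or commented on d_j
    citations:    set of (d_i, d_j) pairs, d_i cites d_j
    """
    edges = set()
    edges.update(interactions)   # u_i -> d_j
    edges.update(follows)        # u_i -> u_j
    edges.update(citations)      # d_i -> d_j

    # Number of texts published by each author.
    texts_per_author = defaultdict(int)
    for author in author_of.values():
        texts_per_author[author] += 1

    # Texts of u_j that u_i liked or commented on.
    touched = defaultdict(set)
    for user, text in interactions:
        target = author_of[text]
        if target != user:
            touched[(user, target)].add(text)

    for (user, target), texts in touched.items():
        if (user, target) in follows:
            continue             # already covered by the follow rule
        share = len(texts) / texts_per_author[target]
        if share > attention_threshold:
            edges.add((user, target))   # implied follow relationship
    return edges
```

Representing the network as a plain set of ordered pairs keeps the sketch self-contained; a graph library could be substituted without changing the rules.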
Further, the step S30 specifically includes:
Step S31: generating a node influence threshold for the corresponding text nodes in the heterogeneous text network through a preset node influence relation.
It should be noted that, in this embodiment, the preset node influence relation is the following calculation formula:
Figure GDA0002213811640000111
from which the node influence threshold is obtained. Here, if the i-th node and the j-th node are directly connected, then $a_{ij} = 1$, otherwise $a_{ij} = 0$; $k_j$ denotes the degree of the j-th node; and
Figure GDA0002213811640000112
denotes the probability of the i-th node randomly walking to the j-th node. In the initial state, $s_i(0) = 1$ for all nodes except the initial node g, and $s_g(0) = 0$. Finally, the node influence threshold of node g is distributed evenly among the other N nodes, with the calculation formula $s_i = s_i(t_c) + s_g(t_c) \cdot N^{-1}$, where $s_g(t_c)$ is the node influence threshold of node g at steady state and $t_c$ denotes the number of iterations at convergence.
Step S32: comparing the node influence threshold with a preset node influence threshold, and acquiring a target text according to the comparison result, so as to take the frequent words of the target text as target labels.
It should be noted that, in this embodiment, for each text node whose node influence threshold is greater than the preset node influence threshold, the text corresponding to that text node is mined as a target text, and the frequent words of the target text are used as target labels.
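By way of non-limiting illustration, steps S31 and S32 can be sketched as follows. The formula itself survives only as a figure reference, so the update rule used here is an assumption: it follows the LeaderRank scheme, which matches the surrounding description (an initial node g with $s_g(0) = 0$, all other scores starting at 1, and the final score of g shared evenly among the N nodes). The names `node_influence`, `target_labels` and `"__g__"` are hypothetical.

```python
def node_influence(adjacency, tol=1e-9, max_iter=10_000):
    """adjacency: dict node -> set of nodes it points to (directed edges)."""
    ground = "__g__"   # hypothetical identifier for the initial node g
    nodes = set(adjacency) | {v for targets in adjacency.values() for v in targets}
    if not nodes:
        return {}
    # Assumed: g is linked bidirectionally to every node, as in LeaderRank.
    out = {n: set(adjacency.get(n, ())) | {ground} for n in nodes}
    out[ground] = set(nodes)
    score = {n: 1.0 for n in nodes}   # s_i(0) = 1
    score[ground] = 0.0               # s_g(0) = 0
    for _ in range(max_iter):
        new = dict.fromkeys(out, 0.0)
        for j, targets in out.items():
            share = score[j] / len(targets)   # s_j(t) split over the out-links of j
            for i in targets:
                new[i] += share
        delta = max(abs(new[n] - score[n]) for n in out)
        score = new
        if delta < tol:
            break
    bonus = score[ground] / len(nodes)        # s_g(t_c) * N^{-1}
    return {n: score[n] + bonus for n in nodes}


def target_labels(influence, frequent_words, influence_threshold):
    """Frequent words of the texts whose influence exceeds the preset threshold."""
    return {text: words for text, words in frequent_words.items()
            if influence.get(text, 0.0) > influence_threshold}
```

On a network with no edges other than those to g, the iteration alternates between two states and does not converge, which is why the loop is bounded by `max_iter`.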
Further, the step S40 specifically includes:
Step S41: constructing a frequent word-text matrix according to the frequent words and the texts to obtain the text vector corresponding to each text, and generating an intrinsic feature similarity threshold between the texts from the text vectors through a preset cosine similarity relation.
It should be noted that, in this embodiment, a frequent word-text matrix M is constructed from the mined frequent words $f_i$ and the texts, where M is a 0-1 matrix of the following form:
Figure GDA0002213811640000121
Each entry is assigned according to whether the text contains the frequent word, which is expressed abstractly as follows:
If (frequent word f_i ∈ d_j)
{
M[i][j]=1;
}
else
{
M[i][j]=0;
}
Each text $d_j$ is thus represented by an n-dimensional text vector composed of 0s and 1s, of the form $d_j = \{1, 0, \ldots\}$. The preset cosine similarity relation is then used to calculate the intrinsic feature similarity threshold $S_{In\,d_{ij}}$ between the texts, where the calculation formula of the preset cosine similarity relation is as follows:
Figure GDA0002213811640000122
That is, the cosine value between the n-dimensional vector of a given text and each of the other n-dimensional vectors is calculated.
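By way of non-limiting illustration, the frequent word-text matrix and the cosine comparison of step S41 can be sketched as follows. Texts are assumed to be given as sets of tokens, the function names are hypothetical, and returning 0 for a text that contains no frequent word is an assumed convention rather than something the specification states.

```python
import math


def frequent_word_text_matrix(frequent_words, texts):
    """M[i][j] = 1 if frequent word i occurs in text j, otherwise 0."""
    return [[1 if word in text else 0 for text in texts]
            for word in frequent_words]


def text_vector(matrix, j):
    """Column j of M: the 0-1 vector that represents text d_j."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Standard cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # assumed convention for an all-zero text vector
    return dot / (norm_u * norm_v)
```

Because the vectors contain only 0s and 1s, the dot product is simply the number of frequent words that the two texts share.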
Step S42: generating, in the heterogeneous text network, an extrinsic feature similarity threshold between the texts through a preset path similarity relation.
It should be noted that, in this embodiment, the calculation is based on the weighted directed meta paths
Figure GDA0002213811640000123
in which each attribute function $\delta(R_l)$ on the text information relation $R_l$ takes a determined value. The similarity between the author nodes is calculated by using the preset path similarity relation, that is, the extrinsic feature similarity threshold $S_{Out\,d_{ij}}$ of the texts is calculated, with the following formula:
Figure GDA0002213811640000131
where P is the meta path, and x and y are objects of the same type.
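By way of non-limiting illustration only, a path-based similarity between same-type objects x and y can be sketched as follows. The formula of this embodiment survives only as a figure reference, so this sketch does not reproduce it: it substitutes the standard, unweighted PathSim measure over an author-text-author meta path, and ignores the attribute function weights. The names `path_counts` and `pathsim` are hypothetical.

```python
def path_counts(author_text):
    """Count author-text-author path instances for every ordered pair of authors.

    author_text: dict author -> set of texts the author is linked to.
    """
    counts = {}
    for x, texts_x in author_text.items():
        for y, texts_y in author_text.items():
            counts[(x, y)] = len(texts_x & texts_y)
    return counts


def pathsim(counts, x, y):
    """PathSim: twice the x-y path count over the sum of the self-path counts."""
    denom = counts[(x, x)] + counts[(y, y)]
    return 2 * counts[(x, y)] / denom if denom else 0.0
```

The normalisation by the self-path counts keeps the value between 0 and 1 and prevents an author linked to very many texts from appearing similar to everyone.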
Step S43: generating a total similarity threshold of the texts through a preset total similarity relation according to the intrinsic feature similarity threshold and the extrinsic feature similarity threshold; and comparing the total similarity threshold with a preset text total similarity threshold, and acquiring target text nodes in the heterogeneous text network according to the comparison result.
It should be noted that, in this embodiment, the preset total similarity relation is the formula $S_{d_{ij}} = S_{In\,d_{ij}} \cdot W_{In} + S_{Out\,d_{ij}} \cdot W_{Out}$, which yields the total similarity threshold $S_{d_{ij}}$, where $W_{In}$ and $W_{Out}$ are the weights assigned to the intrinsic feature similarity and to the extrinsic feature similarity, respectively. The text nodes in the heterogeneous text network whose total similarity threshold is larger than the preset text total similarity threshold are taken as target text nodes.
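By way of non-limiting illustration, the weighted combination and the comparison of step S43 can be sketched as follows. The dictionaries keyed by text pairs, the function names and the treatment of a missing extrinsic value as 0 are assumptions made for the sketch; the specification does not prescribe values for the weights or for the preset threshold.

```python
def total_similarity(s_in, s_out, w_in, w_out):
    """S_dij = S_In_dij * W_In + S_Out_dij * W_Out."""
    return s_in * w_in + s_out * w_out


def target_text_pairs(intrinsic, extrinsic, w_in, w_out, preset_threshold):
    """Keep the text pairs whose total similarity exceeds the preset threshold.

    intrinsic, extrinsic: dict (d_i, d_j) -> similarity value.
    """
    selected = {}
    for pair, s_in in intrinsic.items():
        s_out = extrinsic.get(pair, 0.0)   # assumed: no path means similarity 0
        s = total_similarity(s_in, s_out, w_in, w_out)
        if s > preset_threshold:
            selected[pair] = s
    return selected
```

For example, with weights 0.6 and 0.4, an intrinsic similarity of 0.5 and an extrinsic similarity of 1.0 give a total of 0.5 · 0.6 + 1.0 · 0.4 = 0.7.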
Further, the step S50 specifically includes:
Step S51: if the target text nodes are connected by a directed edge in the heterogeneous text network, propagating the target label among the target text nodes according to the direction of the directed edge; and clustering the texts corresponding to the same target label to obtain a clustering result cluster.
Step S52: if the target text nodes are connected by an undirected edge or a bidirectional edge in the heterogeneous text network, sorting the target text nodes according to their corresponding node influence thresholds to obtain a sorting result, and propagating the target label among the target text nodes according to the sorting result; and clustering the texts corresponding to the same target label to obtain a clustering result cluster.
It should be noted that, in this embodiment, the sorting result is obtained by sorting the node influence thresholds corresponding to the target text node in a descending order.
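By way of non-limiting illustration, steps S51 and S52 can be sketched as follows. The specification does not spell out the propagation procedure, so several choices here are assumptions: a pair of edges in both directions is treated as a bidirectional edge, the label travels from the more influential node to the less influential one on such an edge, each node keeps the first label that reaches it, and each seed node carries a single label. The name `propagate_labels` is hypothetical.

```python
from collections import defaultdict, deque


def propagate_labels(edges, influence, seed_labels):
    """Spread seed labels over the network and group the nodes by label.

    edges:       set of (a, b) directed pairs; both (a, b) and (b, a) present
                 means a bidirectional edge
    influence:   dict node -> node influence value
    seed_labels: dict node -> target label
    """
    successors = defaultdict(set)
    for a, b in edges:
        if (b, a) in edges:
            # Bidirectional edge: follow the descending influence order.
            if influence[a] >= influence[b]:
                successors[a].add(b)
        else:
            successors[a].add(b)   # directed edge: follow its direction
    labels = dict(seed_labels)
    # Visit the seed nodes in descending order of influence.
    queue = deque(sorted(seed_labels, key=lambda n: influence[n], reverse=True))
    while queue:
        node = queue.popleft()
        ordered = sorted(successors[node], key=lambda n: influence[n], reverse=True)
        for nxt in ordered:
            if nxt not in labels:   # the first label to arrive is kept
                labels[nxt] = labels[node]
                queue.append(nxt)
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return dict(clusters)
```

Fixing the visiting order by influence makes the result deterministic, in contrast to label propagation schemes that update nodes in random order.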
In the second embodiment, the frequent words of each text are obtained by performing word segmentation on the texts in the sample text set; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed according to the text information through a preset mapping relation; a node influence threshold is generated for the corresponding text nodes in the heterogeneous text network through a preset node influence relation, and a target label is acquired according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network through a preset total similarity relation, and target text nodes are acquired according to the total similarity threshold; and the target label is propagated among the target text nodes, and the texts corresponding to the same target label are clustered to obtain a clustering result cluster. The technical scheme of the invention can thereby address the technical problems of highly random label propagation and low clustering accuracy and reliability.
In addition, an embodiment of the present invention further provides a storage medium, where a tag propagation clustering program is stored on the storage medium, and when executed by a processor, the tag propagation clustering program implements the following operations:
performing word segmentation processing on the texts in the sample text set to obtain frequent words of each text;
extracting text information of the text from the sample text set, and constructing a heterogeneous text network according to the text information through a preset mapping relation;
generating a node influence threshold value by corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and acquiring a target label according to the node influence threshold value;
generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation, and acquiring a target text node according to the total similarity threshold value;
and transmitting the target label among the target text nodes, and clustering texts corresponding to the same target label to obtain a clustering result cluster.
Further, the tag propagated clustering program when executed by the processor further implements the following operations:
performing word segmentation and part-of-speech tagging on the texts in the sample text set through FNLP to obtain feature words;
performing TF-IDF operation on the characteristic words to obtain the word frequency and the inverse document frequency of the characteristic words;
generating a weight threshold value of the characteristic word through a preset weight corresponding relation according to the word frequency and the inverse document frequency;
and comparing the weight threshold of the feature words with a preset frequent word threshold, and acquiring target feature words according to comparison results so as to take the target feature words as frequent words of the text.
Further, the tag propagated clustering program when executed by the processor further implements the following operations:
extracting text information of the text from the sample text set;
and setting directed edges between the text nodes with the text information according to the text information through a preset mapping relation so as to construct a heterogeneous text network.
Further, the tag propagated clustering program when executed by the processor further implements the following operations:
generating a node influence threshold value by the corresponding text node in the heterogeneous text network through a preset node influence relationship;
and comparing the node influence threshold with a preset node influence threshold, and acquiring a target text according to a comparison result so as to take frequent words of the target text as target labels.
Further, the tag propagated clustering program when executed by the processor further implements the following operations:
constructing a frequent word-text matrix according to the frequent words and the text to obtain text vectors corresponding to the text, and generating an internal feature similarity threshold value between the texts through a preset cosine similarity relation for the text vectors;
in the heterogeneous text network, generating an extrinsic feature similarity threshold value between the texts through a preset path similarity relation;
generating a total similarity threshold of the text by presetting a total similarity relation according to the internal feature similarity threshold and the external feature similarity threshold;
and acquiring a target text node according to the total similarity threshold.
Further, the tag propagated clustering program when executed by the processor further implements the following operations:
comparing the total similarity threshold with a preset text total similarity threshold, and acquiring target text nodes in the heterogeneous text network according to the comparison result.
Further, the tag propagated clustering program when executed by the processor further implements the following operations:
if the target text node is a target text node of a directed edge in the heterogeneous text network, the target label is spread among the target text nodes according to the direction of the directed edge;
if the target text node is a target text node with no directional edge or two-way edge in the heterogeneous text network, sequencing according to a node influence threshold corresponding to the target text node and obtaining a sequencing result, and spreading the target label among the target text nodes according to the sequencing result;
and clustering the texts corresponding to the same target label to obtain a clustering result cluster.
In this embodiment, the frequent words of each text are obtained by performing word segmentation on the texts in the sample text set; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed according to the text information through a preset mapping relation; a node influence threshold is generated for the corresponding text nodes in the heterogeneous text network through a preset node influence relation, and a target label is acquired according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network through a preset total similarity relation, and target text nodes are acquired according to the total similarity threshold; and the target label is propagated among the target text nodes, and the texts corresponding to the same target label are clustered to obtain a clustering result cluster. The technical scheme of the invention can thereby address the technical problems of highly random label propagation and low clustering accuracy and reliability.
In addition, referring to fig. 4, an embodiment of the present invention further provides a tag propagation clustering apparatus, where the tag propagation clustering apparatus includes:
a frequent word obtaining module 10, configured to perform word segmentation on the texts in the sample text set to obtain the frequent words of each text.
It is understood that, in the present embodiment, the text refers to a representation form of written language, and from a literature perspective, it is usually a sentence or a combination of sentences having complete and systematic meaning; a text may be a sentence, a paragraph, or a chapter, which is not described in detail herein.
In the specific implementation, a sample text set is collected in advance, word segmentation and part-of-speech tagging are performed on texts in the sample text set to obtain feature words, word frequency and inverse document frequency of the feature words are obtained according to the feature words, and then frequent words of each text are obtained according to the preset weight corresponding relation.
a heterogeneous text network construction module 20, configured to extract text information of the texts from the sample text set, and construct a heterogeneous text network according to the text information through a preset mapping relation.
It should be noted that, in this embodiment, the text information includes the follow (attention) relationships among the authors of the texts, as well as the likes, comments, forwarding and citation of the texts, and so on, which are not described in detail here.
In a specific implementation, text information of the texts is extracted from the sample text set, and directed edges are set, according to the text information and through a preset mapping relation, between the nodes related by the text information, so as to construct a heterogeneous text network.
a target label obtaining module 30, configured to generate a node influence threshold for the corresponding text nodes in the heterogeneous text network through a preset node influence relation, and obtain a target label according to the node influence threshold.
It should be noted that, in this embodiment, the node influence threshold is compared with a preset node influence threshold, and a target text is obtained according to the comparison result, so that the frequent words of the target text are used as target labels.
a target text node obtaining module 40, configured to generate a total similarity threshold between the texts in the heterogeneous text network through a preset total similarity relation, and obtain target text nodes according to the total similarity threshold.
It should be noted that, in this embodiment, the intrinsic feature similarity threshold is obtained according to the frequent words and the preset cosine similarity relation; the extrinsic feature similarity threshold is obtained in the heterogeneous text network through the preset path similarity relation; and finally, the total similarity threshold of the texts is generated through the preset total similarity relation according to the intrinsic feature similarity threshold and the extrinsic feature similarity threshold, so as to obtain the target text nodes.
a propagation and clustering module 50, configured to propagate the target labels among the target text nodes, and cluster the texts corresponding to the same target label to obtain a clustering result cluster.
It should be noted that, in this embodiment, a label propagation algorithm is introduced, the target labels are propagated among the target text nodes, and finally, the texts corresponding to the same target labels are clustered to obtain a clustering result cluster until the whole process is finished.
It is worth to be noted that, in the embodiment, a weighted directed heterogeneous text network is introduced, and the multi-dimensional features of the text are mined to perform similarity calculation, so that the accuracy and the reliability of the clustering result are improved.
In this embodiment, the frequent words of each text are obtained by performing word segmentation on the texts in the sample text set; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed according to the text information through a preset mapping relation; a node influence threshold is generated for the corresponding text nodes in the heterogeneous text network through a preset node influence relation, and a target label is acquired according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network through a preset total similarity relation, and target text nodes are acquired according to the total similarity threshold; and the target label is propagated among the target text nodes, and the texts corresponding to the same target label are clustered to obtain a clustering result cluster. The technical scheme of the invention can thereby address the technical problems of highly random label propagation and low clustering accuracy and reliability.
Other embodiments or specific implementation manners of the label propagation clustering device of the present invention may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order, but rather the words first, second, third, etc. are to be interpreted as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g., a Read Only Memory (ROM)/Random Access Memory (RAM), a magnetic disk, an optical disk), and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A label propagation clustering method is characterized by comprising the following steps:
performing word segmentation processing on the texts in the sample text set to obtain frequent words of each text;
extracting text information of the text from the sample text set, and constructing a heterogeneous text network according to the text information through a preset mapping relation;
generating a node influence threshold value by corresponding text nodes in the heterogeneous text network through a preset node influence relationship, and acquiring a target label according to the node influence threshold value;
generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation, and acquiring a target text node according to the total similarity threshold value;
propagating the target labels among the target text nodes, and clustering texts corresponding to the same target labels to obtain a clustering result cluster;
the propagating the target label among the target text nodes and clustering texts corresponding to the same target label to obtain a clustering result cluster specifically includes:
if the target text node is a target text node of a directed edge in the heterogeneous text network, the target label is spread among the target text nodes according to the direction of the directed edge;
if the target text node is a target text node with no directional edge or two-way edge in the heterogeneous text network, sequencing according to a node influence threshold corresponding to the target text node and obtaining a sequencing result, and spreading the target label among the target text nodes according to the sequencing result;
and clustering the texts corresponding to the same target label to obtain a clustering result cluster.
2. The label propagation clustering method according to claim 1, wherein the performing word segmentation on the texts in the sample text set to obtain frequent words of each text specifically comprises:
performing word segmentation and part-of-speech tagging on the texts in the sample text set through FNLP to obtain feature words;
performing TF-IDF operation on the characteristic words to obtain the word frequency and the inverse document frequency of the characteristic words;
generating a weight threshold value of the characteristic word through a preset weight corresponding relation according to the word frequency and the inverse document frequency;
and comparing the weight threshold of the feature words with a preset frequent word threshold, and acquiring target feature words according to comparison results so as to take the target feature words as frequent words of the text.
3. The label propagation clustering method according to claim 1, wherein the extracting text information of the text from the sample text set and constructing a heterogeneous text network according to the text information through a preset mapping relationship specifically comprises:
extracting text information of the text from the sample text set;
and setting directed edges between the text nodes with the text information according to the text information through a preset mapping relation so as to construct a heterogeneous text network.
4. The method according to any one of claims 1 to 3, wherein the generating a node influence threshold value from a preset node influence relationship for a corresponding text node in the heterogeneous text network, and obtaining a target label according to the node influence threshold value specifically comprises:
generating a node influence threshold value by the corresponding text node in the heterogeneous text network through a preset node influence relationship;
and comparing the node influence threshold with a preset node influence threshold, and acquiring a target text according to a comparison result so as to take frequent words of the target text as target labels.
5. The label propagation clustering method according to any one of claims 1 to 3, wherein the generating a total similarity threshold between the texts through a preset total similarity relationship in the heterogeneous text network, and obtaining a target text node according to the total similarity threshold specifically comprises:
constructing a frequent word-text matrix according to the frequent words and the text to obtain text vectors corresponding to the text, and generating an internal feature similarity threshold value between the texts through a preset cosine similarity relation for the text vectors;
in the heterogeneous text network, generating an extrinsic feature similarity threshold value between the texts through a preset path similarity relation;
generating a total similarity threshold of the text by presetting a total similarity relation according to the internal feature similarity threshold and the external feature similarity threshold;
and acquiring a target text node according to the total similarity threshold.
6. The label propagation clustering method according to claim 5, wherein the obtaining of the target text node according to the total similarity threshold specifically comprises:
comparing the total similarity threshold with a preset text total similarity threshold, and acquiring target text nodes in the heterogeneous text network according to the comparison result.
7. A terminal device, characterized in that the terminal device comprises: memory, a processor and a tag-propagated clustering program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the tag-propagated clustering method according to any one of claims 1 to 6.
8. A storage medium, characterized in that the storage medium has stored thereon a tag-propagated clustering program, which when executed by a processor implements the steps of the tag-propagated clustering method according to any one of claims 1 to 6.
9. A label propagation clustering device, characterized in that the label propagation clustering device comprises:
the frequent word acquisition module is used for carrying out word segmentation processing on the texts in the sample text set so as to acquire frequent words of each text;
the heterogeneous text network construction module is used for extracting text information of the text from the sample text set and constructing a heterogeneous text network according to the text information through a preset mapping relation;
the target label acquisition module is used for generating a node influence threshold value for the corresponding text node in the heterogeneous text network through a preset node influence relationship and acquiring a target label according to the node influence threshold value;
the target text node acquisition module is used for generating a total similarity threshold value between the texts in the heterogeneous text network through a preset total similarity relation and acquiring a target text node according to the total similarity threshold value;
the propagation and clustering module is used for propagating the target labels among the target text nodes and clustering texts corresponding to the same target labels to obtain a clustering result cluster;
the propagation and clustering module is further configured to propagate the target label between the target text nodes according to the direction of the directed edge when the target text node is a target text node of the directed edge in the heterogeneous text network;
the propagation and clustering module is further configured to, when the target text node is a target text node with no directional edge or a bidirectional edge in the heterogeneous text network, perform ranking according to a node influence threshold corresponding to the target text node and obtain a ranking result, and propagate the target label among the target text nodes according to the ranking result;
and the propagation and clustering module is also used for clustering the texts corresponding to the same target label to obtain a clustering result cluster.
CN201910504157.0A 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device Active CN110442674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910504157.0A CN110442674B (en) 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device

Publications (2)

Publication Number Publication Date
CN110442674A CN110442674A (en) 2019-11-12
CN110442674B true CN110442674B (en) 2021-09-14

Family

ID=68429199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910504157.0A Active CN110442674B (en) 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device

Country Status (1)

Country Link
CN (1) CN110442674B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191882B (en) * 2019-12-17 2022-11-25 安徽大学 Method and device for identifying influential developers in heterogeneous information network
CN112699237B (en) * 2020-12-24 2021-10-15 百度在线网络技术(北京)有限公司 Label determination method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102768670A (en) * 2012-05-31 2012-11-07 哈尔滨工程大学 Webpage clustering method based on node property label propagation
US8832091B1 (en) * 2012-10-08 2014-09-09 Amazon Technologies, Inc. Graph-based semantic analysis of items
CN106951524A (en) * 2017-03-21 2017-07-14 哈尔滨工程大学 Overlapping community discovery method based on node influence power
CN108364234A (en) * 2018-03-08 2018-08-03 重庆邮电大学 A kind of microblogging community discovery method propagated based on node influence power label
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extracting method, device and readable storage medium storing program for executing based on text cluster

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990209B2 (en) * 2012-09-06 2015-03-24 International Business Machines Corporation Distributed scalable clustering and community detection

Also Published As

Publication number Publication date
CN110442674A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN108629043B (en) Webpage target information extraction method, device and storage medium
CN109815487B (en) Text quality inspection method, electronic device, computer equipment and storage medium
JP6398510B2 (en) Entity linking method and entity linking apparatus
EP1304627B1 (en) Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects
Broderick et al. Combinatorial clustering and the beta negative binomial process
EP2866421A1 (en) Method and apparatus for identifying a same user in multiple social networks
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN108269122B (en) Advertisement similarity processing method and device
CN110880006B (en) User classification method, apparatus, computer device and storage medium
CN110032650B (en) Training sample data generation method and device and electronic equipment
CN110442674B (en) Label propagation clustering method, terminal equipment, storage medium and device
CN116719997A (en) Policy information pushing method and device and electronic equipment
CN110705281A (en) Resume information extraction method based on machine learning
CN117763126A (en) Knowledge retrieval method, device, storage medium and apparatus
Dendek et al. Evaluation of features for author name disambiguation using linear support vector machines
JP4534019B2 (en) Name and keyword grouping method, program, recording medium and apparatus thereof
CN113449063B (en) Method and device for constructing document structure information retrieval library
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN111985217B (en) Keyword extraction method, computing device and readable storage medium
CN111159331B (en) Text query method, text query device and computer storage medium
CN114741489A (en) Document retrieval method, document retrieval device, storage medium and electronic equipment
US20130238607A1 (en) Seed set expansion
CN114067343A (en) Data set construction method, model training method and corresponding device
CN110059180B (en) Article author identity recognition and evaluation model training method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant