CN110442674A - Label propagation clustering method, terminal device, storage medium and apparatus
- Publication number
- CN110442674A (application CN201910504157.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- node
- target
- propagated
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The invention discloses a label propagation clustering method, a terminal device, a storage medium and an apparatus. The method comprises: obtaining the frequent words of each text; extracting text information of the texts from a sample text set, and constructing a heterogeneous text network from the text information according to a preset mapping relationship; generating a node influence threshold for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence threshold; generating a total similarity threshold between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity threshold; and propagating the target labels between the target text nodes and clustering the texts that have the same target label to obtain the cluster result. The technical solution of the invention addresses the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
Description
Technical field
The present invention relates to the technical field of label propagation and clustering, and in particular to a label propagation clustering method, a terminal device, a storage medium and an apparatus.
Background art
At present, in fields such as agricultural production, information retrieval and finance, large amounts of data must be processed before they can be used, and the data are generally labeled and the labels propagated before clustering. For example, when analyzing crop pests, the damage observed on the affected crops is labeled and it is then determined which kind of pest the damage belongs to; a label propagation algorithm can cluster the observed damage to reach that result, so that the pest can finally be remedied. However, such label propagation algorithms are not only random, but the accuracy and credibility of clustering the labeled data are also low.
The above content is provided only to aid understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.
Summary of the invention
The main purpose of the present invention is to provide a label propagation clustering method, a terminal device, a storage medium and an apparatus, with the aim of solving the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
To achieve the above purpose, the present invention provides a label propagation clustering method comprising the following steps:
Performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
Extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to a preset mapping relationship;
Generating a node influence threshold for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence threshold;
Generating a total similarity threshold between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity threshold;
Propagating the target labels between the target text nodes, and clustering the texts that have the same target label to obtain the cluster result.
Preferably, performing word segmentation on the texts in the sample text set to obtain the frequent words of each text specifically comprises:
Performing word segmentation and part-of-speech tagging on the texts in the sample text set using FNLP to obtain feature words;
Performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
Generating a weight threshold for each feature word from the term frequency and the inverse document frequency according to a preset weight correspondence;
Comparing the weight threshold of each feature word with a preset frequent word threshold, obtaining target feature words according to the comparison result, and taking the target feature words as the frequent words of the text.
Preferably, extracting the text information of the texts from the sample text set and constructing the heterogeneous text network from the text information according to the preset mapping relationship specifically comprises:
Extracting the text information of the texts from the sample text set;
Setting, according to the text information and the preset mapping relationship, directed edges between the text nodes that have the text information, so as to construct the heterogeneous text network.
Preferably, generating the node influence threshold for each corresponding text node in the heterogeneous text network according to the preset node influence relationship and obtaining the target labels according to the node influence threshold specifically comprises:
Generating a node influence threshold for each corresponding text node in the heterogeneous text network according to the preset node influence relationship;
Comparing the node influence threshold with a preset node influence threshold, obtaining target texts according to the comparison result, and taking the frequent words of the target texts as the target labels.
Preferably, generating the total similarity threshold between the texts in the heterogeneous text network according to the preset total similarity relationship and obtaining the target text nodes according to the total similarity threshold specifically comprises:
Constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector corresponding to each text, and generating an internal feature similarity threshold between the texts from the text vectors according to a preset cosine similarity relationship;
Generating, in the heterogeneous text network, an external feature similarity threshold between the texts according to a preset path similarity relationship;
Generating the total similarity threshold of the texts from the internal feature similarity threshold and the external feature similarity threshold according to the preset total similarity relationship;
Obtaining the target text nodes according to the total similarity threshold.
Preferably, obtaining the target text nodes according to the total similarity threshold specifically comprises:
Comparing the total similarity threshold with a preset total text similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison result.
Preferably, propagating the target labels between the target text nodes and clustering the texts that have the same target label to obtain the cluster result specifically comprises:
If the target text nodes are connected by a directed edge in the heterogeneous text network, propagating the target labels between the target text nodes along the direction of the directed edge;
If the target text nodes are connected by an undirected or bidirectional edge in the heterogeneous text network, sorting the target text nodes by their corresponding node influence thresholds to obtain a sorting result, and propagating the target labels between the target text nodes according to the sorting result;
Clustering the texts that have the same target label to obtain the cluster result.
In addition, to achieve the above purpose, the present invention also proposes a terminal device comprising a memory, a processor, and a label propagation clustering program that is stored in the memory and can run on the processor; when executed by the processor, the label propagation clustering program implements the steps of the label propagation clustering method described above.
In addition, to achieve the above purpose, the present invention also proposes a storage medium on which a label propagation clustering program is stored; when executed by a processor, the label propagation clustering program implements the steps of the label propagation clustering method described above.
In addition, to achieve the above purpose, the present invention also proposes a label propagation clustering apparatus comprising:
A frequent word acquisition module, configured to perform word segmentation on the texts in a sample text set to obtain the frequent words of each text;
A heterogeneous text network construction module, configured to extract text information of the texts from the sample text set and to construct a heterogeneous text network from the text information according to a preset mapping relationship;
A target label acquisition module, configured to generate a node influence threshold for each corresponding text node in the heterogeneous text network according to a preset node influence relationship and to obtain target labels according to the node influence threshold;
A target text node acquisition module, configured to generate a total similarity threshold between the texts in the heterogeneous text network according to a preset total similarity relationship and to obtain target text nodes according to the total similarity threshold;
A propagation and clustering module, configured to propagate the target labels between the target text nodes and to cluster the texts that have the same target label to obtain the cluster result.
In the present invention, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to a preset mapping relationship; a node influence threshold is generated for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity threshold; the target labels are propagated between the target text nodes, and the texts that have the same target label are clustered to obtain the cluster result. The technical solution of the present invention thereby addresses the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the terminal device in the hardware operating environment involved in the embodiments of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the label propagation clustering method of the present invention;
Fig. 3 is a flow diagram of the second embodiment of the label propagation clustering method of the present invention;
Fig. 4 is a structural block diagram of the first embodiment of the label propagation clustering apparatus of the present invention.
The realization of the purpose, the functional features and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed description of the embodiments
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of the terminal device in the hardware operating environment involved in the embodiments of the present invention.
As shown in Fig. 1, the terminal device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 provides the connections and communication between these components. The user interface 1003 may include a display screen (Display) and, optionally, a standard wired interface and a wireless interface; in the present invention the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the structure shown in Fig. 1 does not limit the terminal device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a label propagation clustering program.
In the terminal device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server and exchange data with it; the user interface 1003 is mainly used to connect to peripherals and exchange data with them; the terminal device calls, through the processor 1001, the label propagation clustering program stored in the memory 1005 and performs the following operations of the label propagation clustering method provided by the embodiments of the present invention:
Performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
Extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to a preset mapping relationship;
Generating a node influence threshold for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence threshold;
Generating a total similarity threshold between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity threshold;
Propagating the target labels between the target text nodes, and clustering the texts that have the same target label to obtain the cluster result.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
Performing word segmentation and part-of-speech tagging on the texts in the sample text set using FNLP to obtain feature words;
Performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
Generating a weight threshold for each feature word from the term frequency and the inverse document frequency according to a preset weight correspondence;
Comparing the weight threshold of each feature word with a preset frequent word threshold, obtaining target feature words according to the comparison result, and taking the target feature words as the frequent words of the text.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
Extracting the text information of the texts from the sample text set;
Setting, according to the text information and the preset mapping relationship, directed edges between the text nodes that have the text information, so as to construct the heterogeneous text network.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
Generating a node influence threshold for each corresponding text node in the heterogeneous text network according to the preset node influence relationship;
Comparing the node influence threshold with a preset node influence threshold, obtaining target texts according to the comparison result, and taking the frequent words of the target texts as the target labels.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
Constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector corresponding to each text, and generating an internal feature similarity threshold between the texts from the text vectors according to a preset cosine similarity relationship;
Generating, in the heterogeneous text network, an external feature similarity threshold between the texts according to a preset path similarity relationship;
Generating the total similarity threshold of the texts from the internal feature similarity threshold and the external feature similarity threshold according to the preset total similarity relationship;
Obtaining the target text nodes according to the total similarity threshold.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
Comparing the total similarity threshold with a preset total text similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison result.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
If the target text nodes are connected by a directed edge in the heterogeneous text network, propagating the target labels between the target text nodes along the direction of the directed edge;
If the target text nodes are connected by an undirected or bidirectional edge in the heterogeneous text network, sorting the target text nodes by their corresponding node influence thresholds to obtain a sorting result, and propagating the target labels between the target text nodes according to the sorting result;
Clustering the texts that have the same target label to obtain the cluster result.
In this embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to a preset mapping relationship; a node influence threshold is generated for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity threshold; the target labels are propagated between the target text nodes, and the texts that have the same target label are clustered to obtain the cluster result. The technical solution of the present invention thereby addresses the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
Based on the above hardware structure, embodiments of the label propagation clustering method of the present invention are proposed.
Referring to Fig. 2, Fig. 2 is a flow diagram of the first embodiment of the label propagation clustering method of the present invention.
In the first embodiment, the label propagation clustering method comprises the following steps:
Step S10: performing word segmentation on the texts in a sample text set to obtain the frequent words of each text.
It should be understood that, in this embodiment, a text is the written form of language; from a literary point of view, it is usually a sentence, or a combination of several sentences, with a complete and systematic meaning. A text may be a sentence, a paragraph or a chapter; the possibilities are not enumerated here.
In a specific implementation, a sample text set is collected in advance, word segmentation and part-of-speech tagging are performed on the texts in the sample text set to obtain feature words, the term frequency and inverse document frequency of the feature words are computed, and the frequent words of each text are then obtained according to the preset weight correspondence.
Step S20: extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to a preset mapping relationship.
It should be noted that, in this embodiment, the text information includes follow relationships between the authors of the texts and information about likes, forwards and citations of the texts, among others; these are not enumerated here.
In a specific implementation, the text information of the texts is extracted from the sample text set, and directed edges are set between the text nodes that have the text information, according to the text information and the preset mapping relationship, so as to construct the heterogeneous text network.
Step S30: generating a node influence threshold for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence threshold.
It should be noted that, in this embodiment, the node influence threshold is compared with a preset node influence threshold, target texts are obtained according to the comparison result, and the frequent words of the target texts are taken as the target labels.
Step S40: generating a total similarity threshold between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity threshold.
It should be noted that, in this embodiment, the internal feature similarity threshold is obtained from the frequent words according to the preset cosine similarity relationship; at the same time, the external feature similarity threshold is obtained in the heterogeneous text network according to the preset path similarity relationship; finally, the total similarity threshold of the texts is generated from the internal feature similarity threshold and the external feature similarity threshold according to the preset total similarity relationship, so as to obtain the target text nodes.
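As an illustrative sketch only, the combination of the two similarities described in step S40 might look as follows. The cosine similarity over 0-1 text vectors follows the description; the weighted sum in `total_similarity`, the weight `alpha` and the external similarity value `0.4` are assumptions made for illustration, because the text does not state the preset total similarity relationship or the preset path similarity relationship.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def total_similarity(internal, external, alpha=0.5):
    # Assumed combination rule (weighted sum); the preset total similarity
    # relationship itself is not given in the text.
    return alpha * internal + (1 - alpha) * external

# Hypothetical frequent-word vocabulary and two texts described by frequent words.
vocabulary = ["rust", "aphid", "leaf", "wheat"]
text_a = {"rust", "leaf", "wheat"}
text_b = {"aphid", "leaf", "wheat"}

# 0-1 text vectors: 1 if the frequent word occurs in the text, else 0.
vec_a = [1 if w in text_a else 0 for w in vocabulary]
vec_b = [1 if w in text_b else 0 for w in vocabulary]

internal = cosine_similarity(vec_a, vec_b)
# The external (path-based) similarity is taken as a given number here.
total = total_similarity(internal, external=0.4)
```

Text pairs whose `total` exceeds a chosen preset total text similarity threshold would then be kept as target text nodes.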
Step S50: propagating the target labels between the target text nodes, and clustering the texts that have the same target label to obtain the cluster result.
It should be noted that, in this embodiment, a label propagation algorithm is used to propagate the target labels between the target text nodes; finally, the texts that have the same target label are clustered to obtain the cluster result, at which point the whole process ends.
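A minimal sketch of step S50 is given below under stated assumptions. Labels follow directed edges in the edge direction; on undirected edges the more influential endpoint is assumed to pass its label to the less influential one, which is only one possible reading of "propagating according to the sorting result". The function name, the single label per seed node and the sample data are all hypothetical.

```python
from collections import defaultdict, deque

def propagate_and_cluster(seed_labels, directed, undirected, influence):
    """Spread seed labels over target text nodes and group texts by label."""
    successors = defaultdict(set)
    for src, dst in directed:
        successors[src].add(dst)  # follow the direction of the directed edge
    for a, b in undirected:
        # Assumed rule: the more influential endpoint passes its label on.
        hi, lo = (a, b) if influence[a] >= influence[b] else (b, a)
        successors[hi].add(lo)

    labels = dict(seed_labels)
    # Visit seeds from most to least influential so stronger labels spread first.
    queue = deque(sorted(seed_labels, key=lambda n: influence[n], reverse=True))
    while queue:
        node = queue.popleft()
        ordered = sorted(successors[node], key=lambda n: influence[n], reverse=True)
        for nxt in ordered:
            if nxt not in labels:  # a node keeps the first label it receives
                labels[nxt] = labels[node]
                queue.append(nxt)

    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return dict(clusters)

# Hypothetical target text nodes, seed labels and influence values.
influence = {"d1": 1.5, "d2": 1.0, "d3": 0.6, "d4": 1.3, "d5": 0.9}
clusters = propagate_and_cluster(
    seed_labels={"d1": "rust", "d4": "aphid"},
    directed=[("d1", "d2"), ("d4", "d5")],
    undirected=[("d2", "d3"), ("d5", "d3")],
    influence=influence,
)
```

Because the visiting order is fixed by the influence values, the sketch involves no random tie-breaking, which is the kind of randomness the description aims to remove.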
It is worth noting that this embodiment introduces a weighted, directed heterogeneous text network and mines multi-dimensional features of the texts for the similarity calculation, which improves the accuracy and credibility of the clustering result.
In the first embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to a preset mapping relationship; a node influence threshold is generated for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity threshold; the target labels are propagated between the target text nodes, and the texts that have the same target label are clustered to obtain the cluster result. The technical solution of the present invention thereby addresses the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
Referring to Fig. 3, Fig. 3 is a flow diagram of the second embodiment of the label propagation clustering method of the present invention; the second embodiment is proposed on the basis of the first embodiment shown in Fig. 2.
In the second embodiment, step S10 specifically includes:
Step S11: performing word segmentation and part-of-speech tagging on the texts in the sample text set using FNLP (a machine-learning-based development toolkit for Chinese natural language text processing) to obtain feature words; performing a TF-IDF operation (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining, where TF denotes term frequency and IDF denotes inverse document frequency) on the feature words to obtain the term frequency and inverse document frequency of the feature words.
It should be noted that, in this embodiment, the TF-IDF operation uses two calculation formulas [formulas not reproduced in the source text] to obtain the term frequency $tf_{ij}$ and the inverse document frequency $idf_i$, where $i$ and $j$ are positive integers.
Step S12: generating a weight threshold for each feature word from the term frequency and the inverse document frequency according to the preset weight correspondence; comparing the weight threshold of each feature word with the preset frequent word threshold, obtaining target feature words according to the comparison result, and taking the target feature words as the frequent words of the text.
It should be noted that, in this embodiment, the preset weight correspondence is the formula $w_i = tf_{ij} \times idf_i$, which gives the weight threshold $w_i$ of a feature word; $w_i$ is compared with the preset frequent word threshold, and the feature words whose weight threshold $w_i$ is greater than the preset frequent word threshold are selected as the frequent words $f_i$ of the text.
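Steps S11 and S12 can be sketched roughly as follows. The sketch assumes the texts have already been segmented into feature words (FNLP itself is not called here) and uses the common textbook definitions, $tf_{ij}$ as the share of word $i$ among the words of text $j$ and $idf_i$ as the logarithm of the number of texts divided by the number of texts containing word $i$; the formulas actually used are not reproduced in the text, so these definitions, the function name and the sample data are assumptions.

```python
import math
from collections import Counter

def frequent_words(tokenized_texts, frequent_word_threshold):
    """Return, for each text, the words whose TF-IDF weight exceeds the threshold."""
    n_texts = len(tokenized_texts)
    # Document frequency: number of texts that contain each word.
    doc_freq = Counter()
    for tokens in tokenized_texts:
        doc_freq.update(set(tokens))

    result = []
    for tokens in tokenized_texts:
        counts = Counter(tokens)
        total = len(tokens)
        selected = set()
        for word, count in counts.items():
            tf = count / total                        # term frequency tf_ij
            idf = math.log(n_texts / doc_freq[word])  # inverse document frequency idf_i
            if tf * idf > frequent_word_threshold:    # weight w_i = tf_ij * idf_i
                selected.add(word)
        result.append(selected)
    return result

# Hypothetical, already segmented texts.
texts = [
    ["wheat", "leaf", "rust", "rust"],
    ["wheat", "aphid", "aphid", "leaf"],
    ["rice", "blast", "leaf"],
]
labels = frequent_words(texts, frequent_word_threshold=0.3)
```

In this sample, "leaf" occurs in every text, so its inverse document frequency is zero and it is never selected, however often it appears.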
Further, step S20 specifically includes:
Step S21: extracting the text information of the texts from the sample text set.
It should be noted that, in this embodiment, the text information includes follow relationships between the authors of the texts and information about likes, forwards and citations of the texts, among others; these are not enumerated here. In addition, each text and its corresponding author are each treated as a node.
Step S22: setting, according to the text information and the preset mapping relationship, directed edges between the text nodes that have the text information, so as to construct the heterogeneous text network.
It should be noted that, in this embodiment, a directed edge is added between nodes in each of the following situations covered by the preset mapping relationship: two author nodes linked by a follow relationship, an author node and the text node it forwarded, and two text nodes linked by a citation relationship. In addition, for author nodes not linked by a follow relationship, if the percentage of one author's texts that another author has liked or commented on exceeds the preset follow probability threshold, a directed edge is also added. In abstract form:
if (u_i likes or comments on d_j)
{
    add edge u_i → d_j to the network;
}
if (u_i follows u_j)
{
    add edge u_i → u_j to the network;
}
else if (u_i does not follow u_j and the association probability of u_i following u_j is greater than the preset follow probability threshold)
{
    add edge u_i → u_j to the network;
}
A two-dimensional heterogeneous text network is constructed according to the above rules. The mapping of the different edge types in the network is given in a table [table not reproduced in the source text].
It is easy to understand that a multi-dimensional heterogeneous text network can also be constructed from multiple kinds of nodes and their feature information; this is not elaborated here.
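The edge rules of step S22 can be sketched as follows. The function name, the argument layout and the sample data are hypothetical, and the edge directions for forwards (author to forwarded text) and citations (citing text to cited text) are assumptions, since the text fixes the direction only for follows, likes and comments.

```python
from collections import defaultdict

def build_network(follows, likes_or_comments, forwards, citations,
                  texts_by_author, follow_probability_threshold):
    """Return a set of directed edges (source, target) over author and text nodes."""
    edges = set()
    edges.update(follows)            # author u_i -> author u_j (u_i follows u_j)
    edges.update(likes_or_comments)  # author u_i -> text d_j
    edges.update(forwards)           # author u_i -> forwarded text d_j (assumed direction)
    edges.update(citations)          # citing text -> cited text (assumed direction)

    # For each author pair, collect the texts of u_j that u_i liked or commented on.
    author_of = {t: a for a, ts in texts_by_author.items() for t in ts}
    interacted = defaultdict(set)
    for u_i, d_j in likes_or_comments:
        u_j = author_of.get(d_j)
        if u_j is not None and u_j != u_i:
            interacted[(u_i, u_j)].add(d_j)

    # Add an inferred follow edge when the interaction share exceeds the threshold.
    for (u_i, u_j), texts in interacted.items():
        if (u_i, u_j) in follows:
            continue
        share = len(texts) / len(texts_by_author[u_j])
        if share > follow_probability_threshold:
            edges.add((u_i, u_j))
    return edges

# Hypothetical authors u1..u3 and texts d1..d3.
edges = build_network(
    follows={("u1", "u2")},
    likes_or_comments={("u3", "d1"), ("u3", "d2"), ("u1", "d3")},
    forwards={("u2", "d3")},
    citations={("d2", "d1")},
    texts_by_author={"u1": ["d1", "d2"], "u2": ["d3"], "u3": []},
    follow_probability_threshold=0.5,
)
```

Here u3 has interacted with both of u1's texts without following u1, so the inferred edge u3 → u1 is added alongside the six explicit edges.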
Further, step S30 specifically includes:
Step S31: generate a node influence threshold for each corresponding text node in the heterogeneous text network by means of the preset node influence relationship.
It should be noted that, in this embodiment, the node influence threshold is obtained using the preset node influence relationship, i.e. a calculation formula [formula not reproduced] in which $a_{ij} = 1$ if the i-th node and the j-th node are directly connected and 0 otherwise; $k_j$ denotes the degree of the j-th node; and a further term [symbol not reproduced] denotes the probability that a random walk moves from the i-th node to the j-th node. In the initial state, every node other than the start node g has $s_i(0) = 1$, and $s_g(0) = 0$. Finally, the node influence threshold of node g is distributed equally among the other N nodes, calculated as $S_i = s_i(t_c) + s_g(t_c) \cdot N^{-1}$, where $s_g(t_c)$ is the node influence threshold of node g in the steady state and $t_c$ is the number of iterations to convergence.
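The iteration formula itself is not reproduced in the text, but the initial conditions and the final redistribution step match the LeaderRank scheme. The sketch below therefore assumes the standard LeaderRank update, $s_i(t+1) = \sum_j \frac{a_{ji}}{k_j^{out}} s_j(t)$, with the start node g linked in both directions to every other node. That update rule, the function name `leader_rank`, and the tolerance parameters are assumptions rather than details confirmed by the source.

```python
def leader_rank(nodes, edges, tol=1e-9, max_iter=1000):
    """LeaderRank-style influence scores (assumed update rule, see lead-in).

    nodes: iterable of hashable node ids
    edges: iterable of directed (source, target) pairs
    """
    nodes = list(nodes)
    ground = object()  # hypothetical start node g, linked both ways to all nodes
    out = {n: set() for n in nodes}
    for src, dst in edges:
        out[src].add(dst)
    out[ground] = set(nodes)
    for n in nodes:
        out[n].add(ground)

    # Initial state: s_i(0) = 1 for ordinary nodes, s_g(0) = 0.
    score = {n: 1.0 for n in nodes}
    score[ground] = 0.0

    for _ in range(max_iter):
        nxt = dict.fromkeys(score, 0.0)
        # Each node splits its current score evenly over its out-links.
        for src, targets in out.items():
            share = score[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        delta = max(abs(nxt[n] - score[n]) for n in score)
        score = nxt
        if delta < tol:
            break

    # S_i = s_i(t_c) + s_g(t_c) / N: share the start node's score equally.
    bonus = score[ground] / len(nodes)
    return {n: score[n] + bonus for n in nodes}
```

Because every node has at least one out-link (to g), the total score is conserved across iterations, so the final scores sum to N.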
Step S32: compare the node influence threshold with the preset node influence threshold, obtain the target texts according to the comparison result, and use the frequent words of the target texts as target labels.
It should be noted that, in this embodiment, for each text node whose node influence threshold is greater than the preset node influence threshold, the corresponding text is retrieved as a target text, and the frequent words of that target text are used as target labels.
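The selection in step S32 amounts to a filter over the computed influence values. A minimal sketch, assuming the influence values and frequent words are held in dictionaries keyed by text node (the names and structures are hypothetical):

```python
def select_target_labels(influence, frequent_words, preset_threshold):
    """Map each target text node to its frequent words, used as target labels.

    influence:        dict text node -> computed node influence value
    frequent_words:   dict text node -> set of frequent words of that text
    preset_threshold: preset node influence threshold
    """
    return {
        node: frequent_words[node]
        for node, value in influence.items()
        if value > preset_threshold  # strictly greater, as in the description
    }
```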
Further, step S40 specifically includes:
Step S41: construct a frequent word-text matrix from the frequent words and the texts to obtain the text vector corresponding to each text, and generate the internal feature similarity threshold between the texts from the text vectors by means of the preset cosine similarity relationship.
It should be noted that, in this embodiment, the frequent word-text matrix M is constructed from the mined frequent words $f_i$ and the texts, where M is a 0-1 matrix [matrix not reproduced].
Each entry is assigned according to whether the text contains the frequent word, which can be expressed abstractly as follows:
if (frequent word f_i ∈ d_j)
{
    M[i][j] = 1;
}
else
{
    M[i][j] = 0;
}
Each text $d_j$ is thus represented as an n-dimensional Boolean vector of 0s and 1s, of the form $d_j = \{1, 0, \ldots\}$. The preset cosine similarity relationship is then used to calculate the internal feature similarity threshold $S^{In}_{d_{ij}}$ between the texts, i.e. the cosine of the angle between each pair of n-dimensional text vectors: $S^{In}_{d_{ij}} = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}$.
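The matrix construction and cosine calculation can be sketched as follows, assuming each text is already available as a set of tokens. The function names are illustrative only.

```python
import math


def build_frequent_word_text_matrix(frequent_words, texts):
    """M[i][j] = 1 if frequent word i occurs in text j, else 0.

    frequent_words: list of words f_i
    texts:          list of token sets, one per text d_j
    """
    return [[1 if word in tokens else 0 for tokens in texts]
            for word in frequent_words]


def text_vector(matrix, j):
    """Column j of M: the Boolean vector representing text d_j."""
    return [row[j] for row in matrix]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text with no frequent words is similar to nothing
    return dot / (norm_a * norm_b)
```

For 0-1 vectors the dot product is simply the number of frequent words the two texts share, so the similarity is that count divided by the geometric mean of the two texts' frequent-word counts.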
Step S42: in the heterogeneous text network, generate the external feature similarity threshold between the texts by means of the preset path similarity relationship.
It should be noted that, in this embodiment, in each weighted directed-edge meta-path [meta-path expression not reproduced], the attribute function $\delta_l(R_l)$ on each text information relation $R_l$ takes a definite value. The similarity between author nodes is calculated using the preset path similarity relationship; that is, the external feature similarity $S^{Out}_{d_{ij}}$ of the texts is calculated by a formula [formula not reproduced] in which P is the meta-path and x and y are objects of the same type.
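The path similarity formula is not reproduced in the text. As a stand-in, the sketch below uses the standard PathSim measure over a symmetric text-author-text meta-path, $s(x, y) = \frac{2\,M_{xy}}{M_{xx} + M_{yy}}$, where M counts weighted path instances. The choice of PathSim, the meta-path, and the `weights` structure are all assumptions, not the patent's stated formula.

```python
def path_sim(weights, x, y):
    """PathSim between two same-type objects over a symmetric meta-path.

    weights: dict text -> {author: edge weight}; hypothetical structure in
             which the edge weight plays the role of the attribute value
             attached to the relation.
    """
    def path_count(p, q):
        # Weighted number of text-author-text path instances from p to q.
        return sum(w * weights[q].get(author, 0.0)
                   for author, w in weights[p].items())

    denominator = path_count(x, x) + path_count(y, y)
    if denominator == 0:
        return 0.0
    return 2.0 * path_count(x, y) / denominator
```

The normalization by the self-path counts keeps the score in [0, 1] and prevents highly connected objects from dominating purely through their number of links.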
Step S43: generate the total similarity threshold of the texts from the internal feature similarity threshold and the external feature similarity threshold by means of the preset total similarity relationship; compare the total similarity threshold with the preset text total similarity threshold, and obtain the target text nodes in the heterogeneous text network according to the comparison result.
It should be noted that, in this embodiment, the total similarity threshold $S_{d_{ij}}$ is obtained using the preset total similarity relationship, i.e. the calculation formula $S_{d_{ij}} = S^{In}_{d_{ij}} \cdot W_{In} + S^{Out}_{d_{ij}} \cdot W_{Out}$, where $W_{In}$ and $W_{Out}$ are the weights assigned to the internal feature similarity and the external feature similarity respectively. The text nodes in the heterogeneous text network whose total similarity threshold is greater than the preset text total similarity threshold are taken as the target text nodes.
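The weighted combination and the threshold comparison can be sketched as below. The default weights of 0.5 are arbitrary, and treating both members of a qualifying pair as target text nodes is an interpretation of the description rather than something it states explicitly.

```python
def total_similarity(s_in, s_out, w_in=0.5, w_out=0.5):
    """S = S_In * W_In + S_Out * W_Out (weights here are example values)."""
    return s_in * w_in + s_out * w_out


def select_target_text_nodes(pair_scores, preset_threshold,
                             w_in=0.5, w_out=0.5):
    """Collect the text nodes of every pair whose total similarity exceeds
    the preset text total similarity threshold.

    pair_scores: dict (d_i, d_j) -> (internal similarity, external similarity)
    """
    targets = set()
    for (d_i, d_j), (s_in, s_out) in pair_scores.items():
        if total_similarity(s_in, s_out, w_in, w_out) > preset_threshold:
            targets.update((d_i, d_j))
    return targets
```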
Further, step S50 specifically includes:
Step S51: if the target text nodes are connected by a directed edge in the heterogeneous text network, propagate the target labels between the target text nodes in the direction of the directed edge; cluster the texts having the same target label to obtain the clusters of the clustering result.
Step S52: if the target text nodes are connected by an undirected edge or a bidirectional edge in the heterogeneous text network, sort the target text nodes by their node influence thresholds to obtain a ranking, and propagate the target labels between the target text nodes according to the ranking; cluster the texts having the same target label to obtain the clusters of the clustering result.
It should be noted that, in this embodiment, the ranking is obtained by sorting the node influence thresholds of the target text nodes in descending order.
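Steps S51 and S52 can be sketched as a single propagation routine. This is a simplified illustration under several assumptions: undirected and bidirectional edges are supplied as two opposite directed pairs, a label is only copied onto a node that has none yet, and the function name and data structures are hypothetical.

```python
from collections import defaultdict


def propagate_and_cluster(labels, edges, influence, max_rounds=100):
    """Propagate labels along edges, then group nodes by label.

    labels:    dict node -> label, or None if the node has no label yet
    edges:     iterable of directed (source, target) pairs
    influence: dict node -> node influence value
    """
    labels = dict(labels)
    edge_set = set(edges)
    # Process edges whose source has the highest influence first.
    ordered = sorted(edge_set, key=lambda e: (-influence[e[0]], e))
    for _ in range(max_rounds):
        changed = False
        for src, dst in ordered:
            # Two-way edge: only the higher-influence endpoint acts as source.
            if (dst, src) in edge_set and influence[dst] > influence[src]:
                continue
            if labels.get(src) is not None and labels.get(dst) is None:
                labels[dst] = labels[src]
                changed = True
        if not changed:
            break
    clusters = defaultdict(list)
    for node, label in labels.items():
        if label is not None:
            clusters[label].append(node)
    return labels, dict(clusters)
```

Fixing the propagation order by edge direction and influence rank, instead of picking neighbours at random, is what removes the randomness that the description attributes to conventional label propagation.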
In the second embodiment, word segmentation is performed on the texts in the sample text set to obtain the frequent words of each text; the text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information by means of a preset mapping relationship; a node influence threshold is generated for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and target labels are obtained according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network by means of a preset total similarity relationship, and target text nodes are obtained according to the total similarity threshold; the target labels are propagated between the target text nodes, and the texts having the same target label are clustered to obtain the clusters of the clustering result. The technical solution of the present invention can thereby solve the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
In addition, an embodiment of the present invention further provides a storage medium on which a label propagation clustering program is stored; when executed by a processor, the label propagation clustering program implements the following operations:
performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
extracting the text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information by means of a preset mapping relationship;
generating a node influence threshold for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and obtaining target labels according to the node influence threshold;
generating a total similarity threshold between the texts in the heterogeneous text network by means of a preset total similarity relationship, and obtaining target text nodes according to the total similarity threshold;
propagating the target labels between the target text nodes, and clustering the texts having the same target label to obtain the clusters of the clustering result.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
performing word segmentation and part-of-speech tagging on the texts in the sample text set using FNLP to obtain feature words;
performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
generating a weight threshold for the feature words from the term frequency and the inverse document frequency by means of a preset weight correspondence;
comparing the weight threshold of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison result, and using the target feature words as the frequent words of the text.
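The frequent-word operations listed above can be sketched as follows. The sketch starts from already segmented texts, since FNLP is a separate toolkit and is not called here, and it assumes the preset weight correspondence is the plain product of term frequency and inverse document frequency with a natural logarithm. Both assumptions, along with the function name, are illustrative.

```python
import math
from collections import Counter


def frequent_words(tokenized_texts, preset_threshold):
    """Return, for each text, the words whose TF-IDF weight exceeds the
    preset frequent word threshold.

    tokenized_texts: list of token lists (segmentation done beforehand)
    """
    n_texts = len(tokenized_texts)
    # Document frequency: number of texts in which each word occurs.
    doc_freq = Counter()
    for tokens in tokenized_texts:
        doc_freq.update(set(tokens))

    result = []
    for tokens in tokenized_texts:
        counts = Counter(tokens)
        total = len(tokens)
        kept = {
            word
            for word, count in counts.items()
            # weight = tf * idf (assumed form of the weight correspondence)
            if (count / total) * math.log(n_texts / doc_freq[word])
            > preset_threshold
        }
        result.append(kept)
    return result
```

A word that appears in every text has an inverse document frequency of zero, so it never passes the threshold regardless of how often it occurs.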
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
extracting the text information of the texts from the sample text set;
setting, according to the text information and the preset mapping relationship, directed edges between the text nodes related by the text information, so as to construct the heterogeneous text network.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
generating a node influence threshold for each corresponding text node in the heterogeneous text network by means of the preset node influence relationship;
comparing the node influence threshold with the preset node influence threshold, obtaining the target texts according to the comparison result, and using the frequent words of the target texts as target labels.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector corresponding to each text, and generating the internal feature similarity threshold between the texts from the text vectors by means of the preset cosine similarity relationship;
generating, in the heterogeneous text network, the external feature similarity threshold between the texts by means of the preset path similarity relationship;
generating the total similarity threshold of the texts from the internal feature similarity threshold and the external feature similarity threshold by means of the preset total similarity relationship;
obtaining the target text nodes according to the total similarity threshold.
Further, when executed by the processor, the label propagation clustering program also implements the following operation:
comparing the total similarity threshold with the preset text total similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison result.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
if the target text nodes are connected by a directed edge in the heterogeneous text network, propagating the target labels between the target text nodes in the direction of the directed edge;
if the target text nodes are connected by an undirected edge or a bidirectional edge in the heterogeneous text network, sorting the target text nodes by their node influence thresholds to obtain a ranking, and propagating the target labels between the target text nodes according to the ranking;
clustering the texts having the same target label to obtain the clusters of the clustering result.
In this embodiment, word segmentation is performed on the texts in the sample text set to obtain the frequent words of each text; the text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information by means of a preset mapping relationship; a node influence threshold is generated for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and target labels are obtained according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network by means of a preset total similarity relationship, and target text nodes are obtained according to the total similarity threshold; the target labels are propagated between the target text nodes, and the texts having the same target label are clustered to obtain the clusters of the clustering result. The technical solution of the present invention can thereby solve the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
In addition, an embodiment of the present invention further provides a label propagation clustering apparatus. Referring to Fig. 4, the label propagation clustering apparatus includes:
a frequent word acquisition module 10, configured to perform word segmentation on the texts in a sample text set to obtain the frequent words of each text.
It can be understood that, in this embodiment, a text is a form of expression of written language; from a literary point of view, it is usually a sentence, or a combination of sentences, that carries a complete and systematic meaning. A text may be a sentence, a paragraph or a chapter, which is not detailed here.
In a specific implementation, a sample text set is collected in advance; word segmentation and part-of-speech tagging are performed on the texts in the sample text set to obtain feature words; the term frequency and inverse document frequency of the feature words are obtained; and the frequent words of each text are then obtained according to the preset weight correspondence.
a heterogeneous text network construction module 20, configured to extract the text information of the texts from the sample text set and to construct a heterogeneous text network from the text information by means of a preset mapping relationship.
It should be noted that, in this embodiment, the text information includes follow relationships between the authors of the texts and information about the texts being liked, forwarded and quoted, among others, which are not enumerated here.
In a specific implementation, the text information of the texts is extracted from the sample text set, and, according to the text information and the preset mapping relationship, directed edges are set between the text nodes related by the text information, so as to construct the heterogeneous text network.
a target label acquisition module 30, configured to generate a node influence threshold for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and to obtain target labels according to the node influence threshold.
It should be noted that, in this embodiment, the node influence threshold is compared with the preset node influence threshold, the target texts are obtained according to the comparison result, and the frequent words of the target texts are used as target labels.
a target text node acquisition module 40, configured to generate a total similarity threshold between the texts in the heterogeneous text network by means of a preset total similarity relationship, and to obtain target text nodes according to the total similarity threshold.
It should be noted that, in this embodiment, the internal feature similarity threshold is obtained from the frequent words and the preset cosine similarity relationship, and the external feature similarity threshold is obtained in the heterogeneous text network by means of the preset path similarity relationship. Finally, the total similarity threshold of the texts is generated from the internal feature similarity threshold and the external feature similarity threshold by means of the preset total similarity relationship, so as to obtain the target text nodes.
a propagation and clustering module 50, configured to propagate the target labels between the target text nodes and to cluster the texts having the same target label to obtain the clusters of the clustering result.
It should be noted that this embodiment uses a label propagation algorithm to propagate the target labels between the target text nodes; finally, the texts having the same target label are clustered to obtain the clusters of the clustering result, at which point the whole process ends.
It is worth noting that this embodiment introduces a weighted directed heterogeneous text network and mines multi-dimensional features of the texts for the similarity calculation, which improves the accuracy and credibility of the clustering result.
In this embodiment, word segmentation is performed on the texts in the sample text set to obtain the frequent words of each text; the text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information by means of a preset mapping relationship; a node influence threshold is generated for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and target labels are obtained according to the node influence threshold; a total similarity threshold between the texts is generated in the heterogeneous text network by means of a preset total similarity relationship, and target text nodes are obtained according to the total similarity threshold; the target labels are propagated between the target text nodes, and the texts having the same target label are clustered to obtain the clusters of the clustering result. The technical solution of the present invention can thereby solve the technical problems of randomness in label propagation and of low clustering accuracy and credibility.
For other embodiments or specific implementations of the label propagation clustering apparatus of the present invention, reference may be made to the method embodiments above, which are not repeated here.
It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or system that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present invention.
Claims (10)
1. A label propagation clustering method, characterized in that the label propagation clustering method comprises the following steps:
performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
extracting the text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information by means of a preset mapping relationship;
generating a node influence threshold for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and obtaining target labels according to the node influence threshold;
generating a total similarity threshold between the texts in the heterogeneous text network by means of a preset total similarity relationship, and obtaining target text nodes according to the total similarity threshold;
propagating the target labels between the target text nodes, and clustering the texts having the same target label to obtain the clusters of the clustering result.
2. The label propagation clustering method according to claim 1, characterized in that performing word segmentation on the texts in the sample text set to obtain the frequent words of each text specifically comprises:
performing word segmentation and part-of-speech tagging on the texts in the sample text set using FNLP to obtain feature words;
performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
generating a weight threshold for the feature words from the term frequency and the inverse document frequency by means of a preset weight correspondence;
comparing the weight threshold of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison result, and using the target feature words as the frequent words of the text.
3. The label propagation clustering method according to claim 1, characterized in that extracting the text information of the texts from the sample text set and constructing a heterogeneous text network from the text information by means of a preset mapping relationship specifically comprises:
extracting the text information of the texts from the sample text set;
setting, according to the text information and the preset mapping relationship, directed edges between the text nodes related by the text information, so as to construct the heterogeneous text network.
4. The label propagation clustering method according to any one of claims 1 to 3, characterized in that generating a node influence threshold for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship and obtaining target labels according to the node influence threshold specifically comprises:
generating a node influence threshold for each corresponding text node in the heterogeneous text network by means of the preset node influence relationship;
comparing the node influence threshold with the preset node influence threshold, obtaining the target texts according to the comparison result, and using the frequent words of the target texts as target labels.
5. The label propagation clustering method according to any one of claims 1 to 3, characterized in that generating a total similarity threshold between the texts in the heterogeneous text network by means of a preset total similarity relationship and obtaining target text nodes according to the total similarity threshold specifically comprises:
constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector corresponding to each text, and generating the internal feature similarity threshold between the texts from the text vectors by means of a preset cosine similarity relationship;
generating, in the heterogeneous text network, the external feature similarity threshold between the texts by means of a preset path similarity relationship;
generating the total similarity threshold of the texts from the internal feature similarity threshold and the external feature similarity threshold by means of the preset total similarity relationship;
obtaining the target text nodes according to the total similarity threshold.
6. The label propagation clustering method according to claim 5, characterized in that obtaining the target text nodes according to the total similarity threshold specifically comprises:
comparing the total similarity threshold with a preset text total similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison result.
7. The label propagation clustering method according to any one of claims 1 to 6, characterized in that propagating the target labels between the target text nodes and clustering the texts having the same target label to obtain the clusters of the clustering result specifically comprises:
if the target text nodes are connected by a directed edge in the heterogeneous text network, propagating the target labels between the target text nodes in the direction of the directed edge;
if the target text nodes are connected by an undirected edge or a bidirectional edge in the heterogeneous text network, sorting the target text nodes by their node influence thresholds to obtain a ranking, and propagating the target labels between the target text nodes according to the ranking;
clustering the texts having the same target label to obtain the clusters of the clustering result.
8. A terminal device, characterized in that the terminal device comprises: a memory, a processor, and a label propagation clustering program stored on the memory and executable on the processor, wherein the label propagation clustering program, when executed by the processor, implements the steps of the label propagation clustering method according to any one of claims 1 to 7.
9. A storage medium, characterized in that a label propagation clustering program is stored on the storage medium, and the label propagation clustering program, when executed by a processor, implements the steps of the label propagation clustering method according to any one of claims 1 to 7.
10. A label propagation clustering apparatus, characterized in that the label propagation clustering apparatus comprises:
a frequent word acquisition module, configured to perform word segmentation on the texts in a sample text set to obtain the frequent words of each text;
a heterogeneous text network construction module, configured to extract the text information of the texts from the sample text set and to construct a heterogeneous text network from the text information by means of a preset mapping relationship;
a target label acquisition module, configured to generate a node influence threshold for each corresponding text node in the heterogeneous text network by means of a preset node influence relationship, and to obtain target labels according to the node influence threshold;
a target text node acquisition module, configured to generate a total similarity threshold between the texts in the heterogeneous text network by means of a preset total similarity relationship, and to obtain target text nodes according to the total similarity threshold;
a propagation and clustering module, configured to propagate the target labels between the target text nodes and to cluster the texts having the same target label to obtain the clusters of the clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910504157.0A CN110442674B (en) | 2019-06-11 | 2019-06-11 | Label propagation clustering method, terminal equipment, storage medium and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910504157.0A CN110442674B (en) | 2019-06-11 | 2019-06-11 | Label propagation clustering method, terminal equipment, storage medium and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110442674A true CN110442674A (en) | 2019-11-12 |
CN110442674B CN110442674B (en) | 2021-09-14 |
Family
ID=68429199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910504157.0A Active CN110442674B (en) | 2019-06-11 | 2019-06-11 | Label propagation clustering method, terminal equipment, storage medium and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442674B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110442674B (en) | 2021-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110837550B (en) | | Knowledge graph-based question answering method and device, electronic equipment and storage medium |
US7194466B2 (en) | | Object clustering using inter-layer links |
EP2866421B1 (en) | | Method and apparatus for identifying a same user in multiple social networks |
CN112148987B (en) | | Message pushing method based on target object activity and related equipment |
US9536201B2 (en) | | Identifying associations in data and performing data analysis using a normalized highest mutual information score |
US20230102337A1 (en) | | Method and apparatus for training recommendation model, computer device, and storage medium |
CN109739978A (en) | | Text clustering method, text clustering device and terminal device |
WO2021143267A1 (en) | | Image detection-based fine-grained classification model processing method, and related devices |
CN105868108A (en) | | Instruction-set-irrelevant binary code similarity detection method based on neural network |
Belle et al. | | Serial coalescent simulations suggest a weak genealogical relationship between Etruscans and modern Tuscans |
CN108154198A (en) | | Knowledge base entity normalizing method, system, terminal and computer readable storage medium |
WO2019071904A1 (en) | | Bayesian network-based question-answering apparatus, method and storage medium |
CN110909222A (en) | | User portrait establishing method, device, medium and electronic equipment based on clustering |
CN112183881A (en) | | Public opinion event prediction method and device based on social network and storage medium |
CN110929145A (en) | | Public opinion analysis method, public opinion analysis device, computer device and storage medium |
Paez et al. | | Inducing non-orthogonal and non-linear decision boundaries in decision trees via interactive basis functions |
CN112785005A (en) | | Multi-target task assistant decision-making method and device, computer equipment and medium |
CN110442674A (en) | | Label propagation clustering method, terminal device, storage medium and device |
WO2022227171A1 (en) | | Method and apparatus for extracting key information, electronic device, and medium |
WO2021120588A1 (en) | | Method and apparatus for language generation, computer device, and storage medium |
CN115248890A (en) | | User interest portrait generation method and device, electronic equipment and storage medium |
CN111667018A (en) | | Object clustering method and device, computer readable medium and electronic equipment |
WO2020252925A1 (en) | | Method and apparatus for searching user feature group for optimized user feature, electronic device, and computer nonvolatile readable storage medium |
CN116204709A (en) | | Data processing method and related device |
CN110941638A (en) | | Application classification rule base construction method, application classification method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |