CN110442674A - Label propagation clustering method, terminal device, storage medium, and apparatus - Google Patents

Label propagation clustering method, terminal device, storage medium, and apparatus

Info

Publication number
CN110442674A
CN110442674A (application CN201910504157.0A)
Authority
CN
China
Prior art keywords
text
node
target
propagated
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910504157.0A
Other languages
Chinese (zh)
Other versions
CN110442674B (en)
Inventor
尹帆
张广凯
宋中山
覃俊
郑禄
吴经龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN201910504157.0A
Publication of CN110442674A
Application granted
Publication of CN110442674B
Active (current legal status)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Abstract

The invention discloses a label propagation clustering method, a terminal device, a storage medium, and an apparatus. The method comprises: obtaining the frequent words of each text; extracting text information of the texts from a sample text set, and constructing a heterogeneous text network from the text information according to preset mapping relationships; generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence thresholds; generating total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity thresholds; propagating the target labels among the target text nodes, and clustering the texts that carry the same target label to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.

Description

Label propagation clustering method, terminal device, storage medium, and apparatus
Technical field
The present invention relates to the field of label propagation and clustering technology, and in particular to a label propagation clustering method, a terminal device, a storage medium, and an apparatus.
Background technique
At present, fields such as agricultural production, information retrieval, and finance all require large amounts of data to be processed before they can be used, and labels are usually propagated over the data before it is clustered. For example, when analyzing crop pests and diseases, the damaged crops must first be labeled with the observed damage, and it must then be judged which kind of pest the damage belongs to; a label propagation algorithm can be used to cluster such phenomena to obtain a result, so that a remedy for the pest can finally be applied. However, this label propagation algorithm suffers from randomness, and the accuracy and confidence of clustering the labeled data are not high.
The above content is provided only to help understand the technical solution of the present invention and does not constitute an admission that it is prior art.
Summary of the invention
The main purpose of the present invention is to provide a label propagation clustering method, a terminal device, a storage medium, and an apparatus, aiming to solve the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
To achieve the above object, the present invention provides a label propagation clustering method, which comprises the following steps:
performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to preset mapping relationships;
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence thresholds;
generating total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity thresholds;
propagating the target labels among the target text nodes, and clustering the texts that carry the same target label to obtain cluster result clusters.
Preferably, performing word segmentation on the texts in the sample text set to obtain the frequent words of each text specifically includes:
performing word segmentation and part-of-speech tagging on the texts in the sample text set with FNLP to obtain feature words;
performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
generating weight thresholds of the feature words from the term frequency and the inverse document frequency according to a preset weight correspondence;
comparing the weight thresholds of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison results, and taking the target feature words as the frequent words of the texts.
Preferably, extracting the text information of the texts from the sample text set and constructing a heterogeneous text network from the text information according to the preset mapping relationships specifically includes:
extracting the text information of the texts from the sample text set;
setting directed edges between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
Preferably, generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to the preset node influence relationship and obtaining target labels according to the node influence thresholds specifically includes:
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to the preset node influence relationship;
comparing the node influence thresholds with a preset node influence threshold, obtaining target texts according to the comparison results, and taking the frequent words of the target texts as the target labels.
Preferably, generating total similarity thresholds between the texts in the heterogeneous text network according to the preset total similarity relationship and obtaining target text nodes according to the total similarity thresholds specifically includes:
constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector of each text, and generating the internal feature similarity thresholds between the texts from the text vectors according to a preset cosine similarity relationship;
generating the external feature similarity thresholds between the texts in the heterogeneous text network according to a preset path similarity relationship;
generating the total similarity thresholds of the texts from the internal feature similarity thresholds and the external feature similarity thresholds according to the preset total similarity relationship;
obtaining the target text nodes according to the total similarity thresholds.
Preferably, obtaining target text nodes according to the total similarity thresholds specifically includes:
according to the total similarity thresholds, comparing the total similarity thresholds with a preset text total similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison results.
Preferably, propagating the target labels among the target text nodes and clustering the texts that carry the same target label to obtain cluster result clusters specifically includes:
if a target text node lies on a directed edge in the heterogeneous text network, propagating the target labels among the target text nodes along the direction of the directed edge;
if a target text node lies on an undirected or bidirectional edge in the heterogeneous text network, sorting the target text nodes by their node influence thresholds to obtain a ranking, and propagating the target labels among the target text nodes according to the ranking;
clustering the texts that carry the same target label to obtain cluster result clusters.
In addition, to achieve the above object, the present invention also proposes a terminal device, which includes a memory, a processor, and a label propagation clustering program stored on the memory and executable on the processor, wherein the label propagation clustering program, when executed by the processor, implements the steps of the label propagation clustering method described above.
In addition, to achieve the above object, the present invention also proposes a storage medium on which a label propagation clustering program is stored, wherein the label propagation clustering program, when executed by a processor, implements the steps of the label propagation clustering method described above.
In addition, to achieve the above object, the present invention also proposes a label propagation clustering apparatus, which includes:
a frequent word obtaining module, configured to perform word segmentation on the texts in a sample text set to obtain the frequent words of each text;
a heterogeneous text network construction module, configured to extract text information of the texts from the sample text set and construct a heterogeneous text network from the text information according to preset mapping relationships;
a target label obtaining module, configured to generate node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship and obtain target labels according to the node influence thresholds;
a target text node obtaining module, configured to generate total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship and obtain target text nodes according to the total similarity thresholds;
a propagation and clustering module, configured to propagate the target labels among the target text nodes and cluster the texts that carry the same target label to obtain cluster result clusters.
In the present invention, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to preset mapping relationships; node influence thresholds are generated for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence thresholds; total similarity thresholds between the texts are generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity thresholds; the target labels are propagated among the target text nodes, and the texts that carry the same target label are clustered to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
Detailed description of the invention
Fig. 1 is a schematic structural diagram of a terminal device in the hardware running environment involved in an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a first embodiment of the label propagation clustering method of the present invention;
Fig. 3 is a schematic flowchart of a second embodiment of the label propagation clustering method of the present invention;
Fig. 4 is a structural block diagram of a first embodiment of the label propagation clustering apparatus of the present invention.
The realization of the objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of a terminal device in the hardware running environment involved in an embodiment of the present invention.
As shown in Fig. 1, the terminal device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display screen (Display), and may optionally also include standard wired and wireless interfaces; in the present invention the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (WI-FI) interface). The memory 1005 may be a high-speed random access memory (RAM), or a non-volatile memory (NVM) such as a magnetic disk storage; the memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the structure shown in Fig. 1 does not constitute a limitation of the terminal device, which may include more or fewer components than illustrated, or combine certain components, or have a different component layout.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a label propagation clustering program.
In the terminal device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server and exchange data with the background server; the user interface 1003 is mainly used to connect to peripherals and exchange data with the peripherals; the terminal device calls, through the processor 1001, the label propagation clustering program stored in the memory 1005 and executes the label propagation clustering method provided by the embodiments of the present invention, which includes the following operations:
performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to preset mapping relationships;
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence thresholds;
generating total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity thresholds;
propagating the target labels among the target text nodes, and clustering the texts that carry the same target label to obtain cluster result clusters.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
performing word segmentation and part-of-speech tagging on the texts in the sample text set with FNLP to obtain feature words;
performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
generating weight thresholds of the feature words from the term frequency and the inverse document frequency according to a preset weight correspondence;
comparing the weight thresholds of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison results, and taking the target feature words as the frequent words of the texts.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
extracting the text information of the texts from the sample text set;
setting directed edges between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to the preset node influence relationship;
comparing the node influence thresholds with a preset node influence threshold, obtaining target texts according to the comparison results, and taking the frequent words of the target texts as the target labels.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector of each text, and generating the internal feature similarity thresholds between the texts from the text vectors according to a preset cosine similarity relationship;
generating the external feature similarity thresholds between the texts in the heterogeneous text network according to a preset path similarity relationship;
generating the total similarity thresholds of the texts from the internal feature similarity thresholds and the external feature similarity thresholds according to the preset total similarity relationship;
obtaining the target text nodes according to the total similarity thresholds.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
according to the total similarity thresholds, comparing the total similarity thresholds with a preset text total similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison results.
Further, the processor 1001 may call the label propagation clustering program stored in the memory 1005 and also perform the following operations:
if a target text node lies on a directed edge in the heterogeneous text network, propagating the target labels among the target text nodes along the direction of the directed edge;
if a target text node lies on an undirected or bidirectional edge in the heterogeneous text network, sorting the target text nodes by their node influence thresholds to obtain a ranking, and propagating the target labels among the target text nodes according to the ranking;
clustering the texts that carry the same target label to obtain cluster result clusters.
In this embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to preset mapping relationships; node influence thresholds are generated for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence thresholds; total similarity thresholds between the texts are generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity thresholds; the target labels are propagated among the target text nodes, and the texts that carry the same target label are clustered to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
Based on the above hardware structure, embodiments of the label propagation clustering method of the present invention are proposed.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of the first embodiment of the label propagation clustering method of the present invention; the first embodiment of the label propagation clustering method of the present invention is proposed.
In the first embodiment, the label propagation clustering method comprises the following steps:
Step S10: performing word segmentation on the texts in a sample text set to obtain the frequent words of each text.
It should be understood that, in this embodiment, a text is a form of written language; from a literary point of view, it is usually a sentence or a combination of sentences with complete, systematic meaning; a text may be a sentence, a paragraph, or a chapter, which will not be enumerated here.
In a specific implementation, a sample text set is collected in advance, word segmentation and part-of-speech tagging are performed on the texts in the sample text set to obtain feature words, the term frequency and inverse document frequency of the feature words are obtained, and the frequent words of each text are then obtained according to the preset weight correspondence.
Step S20: extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to preset mapping relationships.
It should be noted that, in this embodiment, the text information includes follow relationships between the authors of the texts and information about likes, forwards, and citations of the texts, which will not be enumerated here.
In a specific implementation, the text information of the texts is extracted from the sample text set, and directed edges are set between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
Step S30: generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence thresholds.
It should be noted that, in this embodiment, the node influence thresholds are compared with a preset node influence threshold, target texts are obtained according to the comparison results, and the frequent words of the target texts are taken as the target labels.
Step S40: generating total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity thresholds.
It should be noted that, in this embodiment, the internal feature similarity thresholds are obtained from the frequent words according to the preset cosine similarity relationship; at the same time, the external feature similarity thresholds are obtained in the heterogeneous text network according to the preset path similarity relationship; finally, the total similarity thresholds of the texts are generated from the internal feature similarity thresholds and the external feature similarity thresholds according to the preset total similarity relationship, so as to obtain the target text nodes.
Step S50: propagating the target labels among the target text nodes, and clustering the texts that carry the same target label to obtain cluster result clusters.
It should be noted that, in this embodiment, a label propagation algorithm is used: the target labels are propagated among the target text nodes, and finally the texts that carry the same target label are clustered to obtain cluster result clusters, at which point the whole process ends.
It is worth noting that this embodiment introduces a weighted directed heterogeneous text network and mines multidimensional features of the texts for similarity calculation, which improves the accuracy and confidence of the clustering results.
In the first embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to preset mapping relationships; node influence thresholds are generated for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence thresholds; total similarity thresholds between the texts are generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity thresholds; the target labels are propagated among the target text nodes, and the texts that carry the same target label are clustered to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of the second embodiment of the label propagation clustering method of the present invention; based on the first embodiment shown in Fig. 2, the second embodiment of the label propagation clustering method of the present invention is proposed.
In the second embodiment, the step S10 specifically includes:
Step S11: performing word segmentation and part-of-speech tagging on the texts in the sample text set with FNLP (a machine-learning-based toolkit for Chinese natural language text processing) to obtain feature words; and performing a TF-IDF (term frequency-inverse document frequency, a common weighting technique for information retrieval and data mining, where TF means term frequency and IDF means inverse document frequency) operation on the feature words to obtain the term frequency and inverse document frequency of the feature words.
It should be noted that, in this embodiment, the TF-IDF operation, i.e. the standard calculation formulas tf_ij = n_ij / Σ_k n_kj and idf_i = log(|D| / |{j : t_i ∈ d_j}|), is used to obtain the term frequency tf_ij and the inverse document frequency idf_i, where n_ij is the number of occurrences of feature word t_i in text d_j, |D| is the number of texts in the sample text set, and i and j are positive integers.
Step S12: generating weight thresholds of the feature words from the term frequency and the inverse document frequency according to a preset weight correspondence; comparing the weight thresholds of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison results, and taking the target feature words as the frequent words of the texts.
It should be noted that, in this embodiment, the preset weight correspondence, i.e. the calculation formula w_i = tf_ij · idf_i, is used to obtain the weight threshold w_i of each feature word; the weight threshold w_i is compared with the preset frequent word threshold, and the feature words whose weight threshold w_i is greater than the preset frequent word threshold are mined as the frequent words f_i of the text.
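As a rough illustration of steps S11-S12, the sketch below computes tf-idf weights and keeps the words whose weight exceeds a frequent-word threshold; the tokenised input and the threshold value are placeholder assumptions (the patent itself uses FNLP for Chinese segmentation and part-of-speech tagging), not the patented implementation.

import math
from collections import Counter

def frequent_words(docs_tokens, threshold=0.1):
    """docs_tokens: list of token lists, one per text, after segmentation and POS tagging."""
    n_docs = len(docs_tokens)
    df = Counter()                                   # number of texts containing each word
    for tokens in docs_tokens:
        df.update(set(tokens))
    result = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1
        kept = set()
        for word, n in counts.items():
            tf = n / total                           # term frequency tf_ij
            idf = math.log(n_docs / df[word])        # inverse document frequency idf_i
            if tf * idf > threshold:                 # weight w_i = tf_ij * idf_i vs. frequent word threshold
                kept.add(word)
        result.append(kept)
    return result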
Further, the step S20 specifically includes:
Step S21: extracting the text information of the texts from the sample text set.
It should be noted that, in this embodiment, the text information includes follow relationships between the authors of the texts and information about likes, forwards, and citations of the texts, which will not be enumerated here; each text and its corresponding author are taken as nodes.
Step S22: setting directed edges between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
It should be noted that, in this embodiment, a new directed edge is added between two author nodes that have a follow relationship, between an author node and the text node it forwards, and between text nodes that have a citation relationship, i.e. between nodes that fall under the above preset mapping relationships. In addition, for author nodes without an explicit follow relationship, if the percentage of one author's texts that another author likes or comments on exceeds a preset follow probability threshold, a new directed edge is also added. This is represented abstractly as follows:
If (u_i likes or comments on d_j)
{
    add the edge u_i → d_j to the network
}
If (u_i follows u_j)
{
    add the edge u_i → u_j to the network
}
Else if (u_i does not follow u_j and the probability that u_i follows u_j is greater than the preset follow probability threshold)
{
    add the edge u_i → u_j to the network
}
A two-dimensional heterogeneous text network is constructed according to the above rules; the different edge types in the network correspond to the mapping relationships described above (likes/comments, follows, forwards, and citations).
It is easy to see that a multidimensional heterogeneous text network can likewise be constructed from multiple node types and their characteristic information, which will not be enumerated here.
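The edge-construction rules above can be sketched as follows; the input format, the helper name, and the follow probability threshold value are illustrative assumptions, and a plain set of (source, target) pairs stands in for the weighted directed network.

def build_heterogeneous_network(likes, follows, forwards, cites, texts_by_author,
                                follow_prob_threshold=0.5):
    """likes/follows/forwards/cites: iterables of (source, target) pairs; returns directed edges."""
    edges = set()
    for u, d in likes:                    # u_i likes or comments on d_j  ->  edge u_i -> d_j
        edges.add((u, d))
    for u, d in forwards:                 # forwarding author -> forwarded text
        edges.add((u, d))
    for d1, d2 in cites:                  # citing text -> cited text
        edges.add((d1, d2))
    for u1, u2 in follows:                # follower -> followed author
        edges.add((u1, u2))
    # implicit follow edge: u1 likes/comments on a large enough share of u2's texts
    liked_by = {}
    for u, d in likes:
        liked_by.setdefault(u, set()).add(d)
    for u1, liked in liked_by.items():
        for u2, docs in texts_by_author.items():
            if u1 != u2 and docs and len(liked & set(docs)) / len(docs) > follow_prob_threshold:
                edges.add((u1, u2))
    return edges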
Further, the step S30 specifically includes:
Step S31: generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to the preset node influence relationship.
It should be noted that, in this embodiment, the preset node influence relationship, i.e. the iterative formula s_i(t+1) = Σ_j (a_ij / k_j) · s_j(t), is used to obtain the node influence thresholds, where a_ij = 1 if node i and node j are directly connected and 0 otherwise, k_j is the degree of node j, and a_ij / k_j represents the random walk transition probability between node i and node j. In the initial state, s_i(0) = 1 for all nodes except a ground node g, and s_g(0) = 0. Finally, the node influence score of node g is distributed evenly to the other N nodes, with the calculation formula S_i = s_i(t_c) + s_g(t_c) · N⁻¹, where s_g(t_c) is the node influence score of node g in the steady state and t_c is the number of iterations to convergence.
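Read as a LeaderRank-style iteration, the influence computation described above can be sketched as follows; the ground-node name, the convergence tolerance, and the iteration cap are assumed parameters rather than values taken from the patent.

def node_influence(adj, nodes, ground="g", tol=1e-6, max_iter=1000):
    """adj: dict node -> set of neighbours; a ground node g (not in nodes) is linked to every node."""
    neigh = {v: set(adj.get(v, ())) | {ground} for v in nodes}
    neigh[ground] = set(nodes)
    s = {v: 1.0 for v in nodes}                      # s_i(0) = 1 for every node except g
    s[ground] = 0.0                                  # s_g(0) = 0
    for _ in range(max_iter):
        # s_i(t+1) = sum_j (a_ij / k_j) * s_j(t): each neighbour spreads its score evenly
        new = {i: sum(s[j] / len(neigh[j]) for j in neigh[i]) for i in neigh}
        if max(abs(new[v] - s[v]) for v in neigh) < tol:
            s = new
            break
        s = new
    n = len(nodes)
    return {v: s[v] + s[ground] / n for v in nodes}  # S_i = s_i(t_c) + s_g(t_c) / N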
Step S32: comparing the node influence thresholds with a preset node influence threshold, obtaining target texts according to the comparison results, and taking the frequent words of the target texts as the target labels.
It should be noted that, in this embodiment, for the text nodes whose node influence threshold is greater than the preset node influence threshold, the corresponding texts are mined as the target texts, and the frequent words of the target texts are taken as the target labels.
Further, the step S40 specifically includes:
Step S41: constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector of each text, and generating the internal feature similarity thresholds between the texts from the text vectors according to the preset cosine similarity relationship.
It should be noted that, in this embodiment, the mined frequent words f_i and the texts are used to construct a frequent word-text matrix M, where M is a 0-1 matrix.
Each entry of M is determined by whether the text contains the frequent word, represented abstractly as follows:
If (frequent word f_i ∈ d_j)
{
    M[i][j] = 1;
}
else
{
    M[i][j] = 0;
}
In this way, each text d_j is represented as an n-dimensional vector of 0s and 1s, e.g. d_j = {1, 0, …}; the preset cosine similarity relationship is then used to calculate the internal feature similarity threshold SIn_dij between the texts, where the calculation formula of the preset cosine similarity relationship is SIn_dij = (d_i · d_j) / (‖d_i‖ · ‖d_j‖), i.e. the cosine of the angle between each pair of n-dimensional text vectors.
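As a rough illustration of step S41, the sketch below builds the 0-1 frequent word-text matrix and computes the pairwise cosine similarities; the function name, the vocabulary ordering, and the dictionary output format are illustrative assumptions, not part of the patent.

import math

def internal_similarity(frequent_words_per_text):
    """frequent_words_per_text: list of sets, the frequent words of each text d_j."""
    vocab = sorted(set().union(*frequent_words_per_text))
    # 0-1 frequent word-text matrix M: entry is 1 if the frequent word occurs in the text
    vectors = [[1 if w in words else 0 for w in vocab] for words in frequent_words_per_text]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    n = len(vectors)
    return {(i, j): cosine(vectors[i], vectors[j]) for i in range(n) for j in range(i + 1, n)}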
Step S42: generating the external feature similarity thresholds between the texts in the heterogeneous text network according to the preset path similarity relationship.
It should be noted that, in this embodiment, in each weighted directed meta path, the attribute function δ_l(R_l) defined on the text information relationship R_l is a fixed value; the preset path similarity relationship is used to calculate the similarity between nodes, i.e. the external feature similarity of the texts SOut_dij is calculated with the formula SOut(x, y) = 2 · |{p_x⇝y : p ∈ P}| / (|{p_x⇝x : p ∈ P}| + |{p_y⇝y : p ∈ P}|), where P is the meta path and x and y are objects of the same type.
Step S43: generating the total similarity thresholds of the texts from the internal feature similarity thresholds and the external feature similarity thresholds according to the preset total similarity relationship; comparing the total similarity thresholds with a preset text total similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison results.
It should be noted that, in this embodiment, the preset total similarity relationship, i.e. the calculation formula S_dij = SIn_dij · W_In + SOut_dij · W_Out, is used to obtain the total similarity threshold S_dij, where W_In and W_Out are the weights assigned to the internal feature similarity and the external feature similarity respectively; the text nodes in the heterogeneous text network whose total similarity threshold is greater than the preset text total similarity threshold are taken as the target text nodes.
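Combining the internal and external similarities, a minimal sketch of steps S42-S43 is given below; the path-count input, the weights w_in and w_out, and the threshold value are illustrative assumptions rather than the patented parameters.

def pathsim(path_counts, x, y):
    """path_counts[(a, b)]: number of meta-path instances from a to b under the meta path P."""
    xy = path_counts.get((x, y), 0)
    xx = path_counts.get((x, x), 0)
    yy = path_counts.get((y, y), 0)
    return 2 * xy / (xx + yy) if (xx + yy) else 0.0  # SOut(x, y) = 2|p_x~y| / (|p_x~x| + |p_y~y|)

def target_text_nodes(internal, path_counts, pairs, w_in=0.6, w_out=0.4, threshold=0.5):
    """Keep texts whose total similarity S_dij = SIn*W_In + SOut*W_Out exceeds the preset threshold."""
    targets = set()
    for i, j in pairs:
        total = internal.get((i, j), 0.0) * w_in + pathsim(path_counts, i, j) * w_out
        if total > threshold:
            targets.update((i, j))                   # both texts become target text nodes
    return targets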
Further, the step S50 specifically includes:
Step S51: if a target text node lies on a directed edge in the heterogeneous text network, propagating the target labels among the target text nodes along the direction of the directed edge; and clustering the texts that carry the same target label to obtain cluster result clusters.
Step S52: if a target text node lies on an undirected or bidirectional edge in the heterogeneous text network, sorting the target text nodes by their node influence thresholds to obtain a ranking, and propagating the target labels among the target text nodes according to the ranking; and clustering the texts that carry the same target label to obtain cluster result clusters.
It should be noted that, in this embodiment, the ranking is obtained by arranging the target text nodes in descending order of their node influence thresholds.
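A minimal sketch of the propagation rule in steps S51-S52, assuming seed labels, edge lists, and influence scores are available as plain Python dictionaries and lists; the names and the tie-breaking rule are illustrative.

def propagate_labels(seed_labels, directed_edges, undirected_edges, influence):
    """seed_labels: dict node -> target label (the frequent word of an influential text)."""
    assigned = dict(seed_labels)
    # Step S51: along directed edges the label moves in the direction of the edge
    for src, dst in directed_edges:
        if src in assigned and dst not in assigned:
            assigned[dst] = assigned[src]
    # Step S52: on undirected/bidirectional edges, visit edges in descending order of node influence
    ordered = sorted(undirected_edges,
                     key=lambda e: max(influence.get(e[0], 0), influence.get(e[1], 0)),
                     reverse=True)
    for u, v in ordered:
        hi, lo = (u, v) if influence.get(u, 0) >= influence.get(v, 0) else (v, u)
        if hi in assigned and lo not in assigned:
            assigned[lo] = assigned[hi]               # the more influential node passes its label on
    clusters = {}
    for node, label in assigned.items():              # texts with the same target label form one cluster
        clusters.setdefault(label, []).append(node)
    return clusters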
In the second embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to preset mapping relationships; node influence thresholds are generated for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence thresholds; total similarity thresholds between the texts are generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity thresholds; the target labels are propagated among the target text nodes, and the texts that carry the same target label are clustered to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
In addition, an embodiment of the present invention also proposes a storage medium on which a label propagation clustering program is stored; when executed by a processor, the label propagation clustering program implements the following operations:
performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to preset mapping relationships;
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence thresholds;
generating total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity thresholds;
propagating the target labels among the target text nodes, and clustering the texts that carry the same target label to obtain cluster result clusters.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
performing word segmentation and part-of-speech tagging on the texts in the sample text set with FNLP to obtain feature words;
performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
generating weight thresholds of the feature words from the term frequency and the inverse document frequency according to a preset weight correspondence;
comparing the weight thresholds of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison results, and taking the target feature words as the frequent words of the texts.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
extracting the text information of the texts from the sample text set;
setting directed edges between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to the preset node influence relationship;
comparing the node influence thresholds with a preset node influence threshold, obtaining target texts according to the comparison results, and taking the frequent words of the target texts as the target labels.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
constructing a frequent word-text matrix from the frequent words and the texts to obtain the text vector of each text, and generating the internal feature similarity thresholds between the texts from the text vectors according to a preset cosine similarity relationship;
generating the external feature similarity thresholds between the texts in the heterogeneous text network according to a preset path similarity relationship;
generating the total similarity thresholds of the texts from the internal feature similarity thresholds and the external feature similarity thresholds according to the preset total similarity relationship;
obtaining the target text nodes according to the total similarity thresholds.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
according to the total similarity thresholds, comparing the total similarity thresholds with a preset text total similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison results.
Further, when executed by the processor, the label propagation clustering program also implements the following operations:
if a target text node lies on a directed edge in the heterogeneous text network, propagating the target labels among the target text nodes along the direction of the directed edge;
if a target text node lies on an undirected or bidirectional edge in the heterogeneous text network, sorting the target text nodes by their node influence thresholds to obtain a ranking, and propagating the target labels among the target text nodes according to the ranking;
clustering the texts that carry the same target label to obtain cluster result clusters.
In this embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to preset mapping relationships; node influence thresholds are generated for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence thresholds; total similarity thresholds between the texts are generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity thresholds; the target labels are propagated among the target text nodes, and the texts that carry the same target label are clustered to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
In addition, referring to Fig. 4, an embodiment of the present invention also proposes a label propagation clustering apparatus, which includes:
a frequent word obtaining module 10, configured to perform word segmentation on the texts in a sample text set to obtain the frequent words of each text.
It should be understood that, in this embodiment, a text is a form of written language; from a literary point of view, it is usually a sentence or a combination of sentences with complete, systematic meaning; a text may be a sentence, a paragraph, or a chapter, which will not be enumerated here.
In a specific implementation, a sample text set is collected in advance, word segmentation and part-of-speech tagging are performed on the texts in the sample text set to obtain feature words, the term frequency and inverse document frequency of the feature words are obtained, and the frequent words of each text are then obtained according to the preset weight correspondence.
a heterogeneous text network construction module 20, configured to extract text information of the texts from the sample text set and construct a heterogeneous text network from the text information according to preset mapping relationships.
It should be noted that, in this embodiment, the text information includes follow relationships between the authors of the texts and information about likes, forwards, and citations of the texts, which will not be enumerated here.
In a specific implementation, the text information of the texts is extracted from the sample text set, and directed edges are set between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
a target label obtaining module 30, configured to generate node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship and obtain target labels according to the node influence thresholds.
It should be noted that, in this embodiment, the node influence thresholds are compared with a preset node influence threshold, target texts are obtained according to the comparison results, and the frequent words of the target texts are taken as the target labels.
a target text node obtaining module 40, configured to generate total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship and obtain target text nodes according to the total similarity thresholds.
It should be noted that, in this embodiment, the internal feature similarity thresholds are obtained from the frequent words according to the preset cosine similarity relationship; at the same time, the external feature similarity thresholds are obtained in the heterogeneous text network according to the preset path similarity relationship; finally, the total similarity thresholds of the texts are generated from the internal feature similarity thresholds and the external feature similarity thresholds according to the preset total similarity relationship, so as to obtain the target text nodes.
a propagation and clustering module 50, configured to propagate the target labels among the target text nodes and cluster the texts that carry the same target label to obtain cluster result clusters.
It should be noted that, in this embodiment, a label propagation algorithm is used: the target labels are propagated among the target text nodes, and finally the texts that carry the same target label are clustered to obtain cluster result clusters, at which point the whole process ends.
It is worth noting that this embodiment introduces a weighted directed heterogeneous text network and mines multidimensional features of the texts for similarity calculation, which improves the accuracy and confidence of the clustering results.
In this embodiment, word segmentation is performed on the texts in a sample text set to obtain the frequent words of each text; text information of the texts is extracted from the sample text set, and a heterogeneous text network is constructed from the text information according to preset mapping relationships; node influence thresholds are generated for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and target labels are obtained according to the node influence thresholds; total similarity thresholds between the texts are generated in the heterogeneous text network according to a preset total similarity relationship, and target text nodes are obtained according to the total similarity thresholds; the target labels are propagated among the target text nodes, and the texts that carry the same target label are clustered to obtain cluster result clusters. The technical solution of the present invention solves the technical problems of randomness in label propagation and of low accuracy and low confidence of clustering.
For other embodiments or specific implementations of the label propagation clustering apparatus of the present invention, reference may be made to the above method embodiments, which will not be described again here.
It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and the like does not indicate any order; these words may be interpreted as names.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as a read-only memory (ROM)/random access memory (RAM), a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the contents of the specification and accompanying drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A label propagation clustering method, characterized in that the label propagation clustering method comprises the following steps:
performing word segmentation on the texts in a sample text set to obtain the frequent words of each text;
extracting text information of the texts from the sample text set, and constructing a heterogeneous text network from the text information according to preset mapping relationships;
generating node influence thresholds for the corresponding text nodes in the heterogeneous text network according to a preset node influence relationship, and obtaining target labels according to the node influence thresholds;
generating total similarity thresholds between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtaining target text nodes according to the total similarity thresholds;
propagating the target labels among the target text nodes, and clustering the texts that carry the same target label to obtain cluster result clusters.
2. The label propagation clustering method according to claim 1, characterized in that performing word segmentation on the texts in the sample text set to obtain the frequent words of each text specifically includes:
performing word segmentation and part-of-speech tagging on the texts in the sample text set with FNLP to obtain feature words;
performing a TF-IDF operation on the feature words to obtain the term frequency and inverse document frequency of the feature words;
generating weight thresholds of the feature words from the term frequency and the inverse document frequency according to a preset weight correspondence;
comparing the weight thresholds of the feature words with a preset frequent word threshold, obtaining target feature words according to the comparison results, and taking the target feature words as the frequent words of the texts.
3. The label propagation clustering method according to claim 1, characterized in that extracting the text information of the texts from the sample text set and constructing a heterogeneous text network from the text information according to the preset mapping relationships specifically includes:
extracting the text information of the texts from the sample text set;
setting directed edges between the text nodes that carry the text information according to the preset mapping relationships, so as to construct the heterogeneous text network.
4. The label propagation clustering method according to any one of claims 1 to 3, characterized in that generating the node influence threshold for each corresponding text node in the heterogeneous text network according to the preset node influence relationship and obtaining the target labels according to the node influence threshold specifically comprises:
generating the node influence threshold for each corresponding text node in the heterogeneous text network according to the preset node influence relationship;
comparing the node influence threshold with a preset node influence threshold, obtaining target texts according to the comparison result, and taking the frequent words of the target texts as the target labels.
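The preset node influence relationship is defined in the description rather than in claim 4, so the sketch below simply approximates node influence by normalized degree and keeps the frequent words of texts whose influence exceeds a hypothetical threshold; it is an illustration under those assumptions, not the patented relationship:

```python
# Illustrative sketch of claim 4: node influence approximated by normalized
# degree (an assumption), texts above a hypothetical threshold become target
# texts, and their frequent words become the target labels.
def target_labels(g, frequent, preset_influence_threshold=0.1):
    # g: heterogeneous text network (networkx graph)
    # frequent: dict text id -> list of frequent words
    text_nodes = [n for n, d in g.nodes(data=True) if d.get("kind") == "text"]
    influence = {n: g.degree(n) / max(1, g.number_of_nodes() - 1)
                 for n in text_nodes}
    labels = {}
    for n in text_nodes:
        if influence[n] > preset_influence_threshold:   # target text
            labels[n] = frequent.get(n, [])             # its frequent words become labels
    return influence, labels
```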
5. The label propagation clustering method according to any one of claims 1 to 3, characterized in that generating the total similarity threshold between the texts in the heterogeneous text network according to the preset total similarity relationship and obtaining the target text nodes according to the total similarity threshold specifically comprises:
constructing a frequent word-text matrix from the frequent words and the texts to obtain a text vector for each text, and applying a preset cosine similarity relationship to the text vectors to generate an internal feature similarity threshold between the texts;
generating an external feature similarity threshold between the texts in the heterogeneous text network according to a preset path similarity relationship;
generating the total similarity threshold of the texts from the internal feature similarity threshold and the external feature similarity threshold according to the preset total similarity relationship;
obtaining the target text nodes according to the total similarity threshold.
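A hedged sketch of claim 5's combination step: cosine similarity over the frequent word-text vectors as the internal feature similarity, a shared-neighbour ratio in the heterogeneous network as a stand-in for the path similarity, and an equally weighted sum as the total similarity; the stand-in measure and the 0.5/0.5 weights are assumptions, not the relationships defined in the description:

```python
# Illustrative sketch of claim 5: internal similarity = cosine of the
# frequent word-text vectors; external similarity = shared-neighbour ratio
# (a stand-in for the preset path similarity); total = weighted sum.
import numpy as np

def total_similarity(vec_a, vec_b, g, node_a, node_b, alpha=0.5):
    # vec_a, vec_b: rows of the frequent word-text matrix for the two texts
    # g: heterogeneous text network (directed networkx graph); alpha: assumed weight
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    internal = float(np.dot(vec_a, vec_b) / denom) if denom else 0.0  # cosine similarity
    neigh_a, neigh_b = set(g.successors(node_a)), set(g.successors(node_b))
    union = neigh_a | neigh_b
    external = len(neigh_a & neigh_b) / len(union) if union else 0.0  # path-similarity stand-in
    return alpha * internal + (1 - alpha) * external
```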
6. The label propagation clustering method according to claim 5, characterized in that obtaining the target text nodes according to the total similarity threshold specifically comprises:
comparing the total similarity threshold with a preset total text similarity threshold, and obtaining the target text nodes in the heterogeneous text network according to the comparison result.
7. The label propagation clustering method according to any one of claims 1 to 6, characterized in that propagating the target labels between the target text nodes and clustering the texts having the same target labels to obtain the result clusters specifically comprises:
if the target text nodes are target text nodes connected by a directed edge in the heterogeneous text network, propagating the target labels between the target text nodes along the direction of the directed edge;
if the target text nodes are target text nodes connected by an undirected edge or a bidirectional edge in the heterogeneous text network, sorting them according to the node influence thresholds corresponding to the target text nodes to obtain a sorting result, and propagating the target labels between the target text nodes according to the sorting result;
clustering the texts having the same target labels to obtain the result clusters.
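A minimal sketch of the propagation and clustering behaviour recited in claim 7, assuming simple dictionary-based data structures: labels move along directed edges in the edge's direction, while on undirected or bidirectional edges the endpoint with the higher node influence propagates first; texts sharing a target label end up in the same result cluster:

```python
# Illustrative sketch of claim 7; edge kinds, set-valued labels and a single
# propagation pass are assumptions made for brevity.
def propagate_labels(edges, labels, influence):
    # edges: list of (u, v, kind) with kind in {"directed", "undirected"}
    # labels: dict node -> set of target labels; influence: dict node -> float
    for u, v, kind in edges:
        if kind == "directed":
            # a directed edge propagates labels along its direction
            labels.setdefault(v, set()).update(labels.get(u, set()))
        else:
            # undirected/bidirectional: higher-influence endpoint propagates first
            src, dst = sorted((u, v), key=lambda n: influence.get(n, 0.0),
                              reverse=True)
            labels.setdefault(dst, set()).update(labels.get(src, set()))
    # texts that share a target label form one result cluster
    clusters = {}
    for node, labs in labels.items():
        for lab in labs:
            clusters.setdefault(lab, set()).add(node)
    return clusters
```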
8. A terminal device, characterized in that the terminal device comprises: a memory, a processor, and a label propagation clustering program stored in the memory and executable on the processor, wherein the label propagation clustering program, when executed by the processor, implements the steps of the label propagation clustering method according to any one of claims 1 to 7.
9. A storage medium, characterized in that a label propagation clustering program is stored on the storage medium, and the label propagation clustering program, when executed by a processor, implements the steps of the label propagation clustering method according to any one of claims 1 to 7.
10. A label propagation clustering apparatus, characterized in that the label propagation clustering apparatus comprises:
a frequent word acquisition module, configured to perform word segmentation on the texts in a sample text set to obtain frequent words of each text;
a heterogeneous text network construction module, configured to extract text information of the texts from the sample text set and construct a heterogeneous text network from the text information according to a preset mapping relationship;
a target label acquisition module, configured to generate a node influence threshold for each corresponding text node in the heterogeneous text network according to a preset node influence relationship, and obtain target labels according to the node influence threshold;
a target text node acquisition module, configured to generate a total similarity threshold between the texts in the heterogeneous text network according to a preset total similarity relationship, and obtain target text nodes according to the total similarity threshold;
a propagation and clustering module, configured to propagate the target labels between the target text nodes, and cluster the texts having the same target labels to obtain result clusters.
CN201910504157.0A 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device Active CN110442674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910504157.0A CN110442674B (en) 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910504157.0A CN110442674B (en) 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device

Publications (2)

Publication Number Publication Date
CN110442674A true CN110442674A (en) 2019-11-12
CN110442674B CN110442674B (en) 2021-09-14

Family

ID=68429199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910504157.0A Active CN110442674B (en) 2019-06-11 2019-06-11 Label propagation clustering method, terminal equipment, storage medium and device

Country Status (1)

Country Link
CN (1) CN110442674B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102768670A (en) * 2012-05-31 2012-11-07 哈尔滨工程大学 Webpage clustering method based on node property label propagation
US20140067808A1 (en) * 2012-09-06 2014-03-06 International Business Machines Corporation Distributed Scalable Clustering and Community Detection
US8832091B1 (en) * 2012-10-08 2014-09-09 Amazon Technologies, Inc. Graph-based semantic analysis of items
CN106951524A (en) * 2017-03-21 2017-07-14 哈尔滨工程大学 Overlapping community discovery method based on node influence power
CN108364234A (en) * 2018-03-08 2018-08-03 重庆邮电大学 A kind of microblogging community discovery method propagated based on node influence power label
CN108959453A (en) * 2018-06-14 2018-12-07 中南民族大学 Information extracting method, device and readable storage medium storing program for executing based on text cluster

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191882A (en) * 2019-12-17 2020-05-22 安徽大学 Method and device for identifying influential developers in heterogeneous information network
CN111191882B (en) * 2019-12-17 2022-11-25 安徽大学 Method and device for identifying influential developers in heterogeneous information network
CN112699237A (en) * 2020-12-24 2021-04-23 百度在线网络技术(北京)有限公司 Label determination method, device and storage medium

Also Published As

Publication number Publication date
CN110442674B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN110837550B (en) Knowledge graph-based question answering method and device, electronic equipment and storage medium
US7194466B2 (en) Object clustering using inter-layer links
EP2866421B1 (en) Method and apparatus for identifying a same user in multiple social networks
CN112148987B (en) Message pushing method based on target object activity and related equipment
US9536201B2 (en) Identifying associations in data and performing data analysis using a normalized highest mutual information score
US20230102337A1 (en) Method and apparatus for training recommendation model, computer device, and storage medium
CN109739978A (en) A kind of Text Clustering Method, text cluster device and terminal device
WO2021143267A1 (en) Image detection-based fine-grained classification model processing method, and related devices
CN105868108A (en) Instruction-set-irrelevant binary code similarity detection method based on neural network
Belle et al. Serial coalescent simulations suggest a weak genealogical relationship between Etruscans and modern Tuscans
CN108154198A (en) Knowledge base entity normalizing method, system, terminal and computer readable storage medium
WO2019071904A1 (en) Bayesian network-based question-answering apparatus, method and storage medium
CN110909222A (en) User portrait establishing method, device, medium and electronic equipment based on clustering
CN112183881A (en) Public opinion event prediction method and device based on social network and storage medium
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
Paez et al. Inducing non-orthogonal and non-linear decision boundaries in decision trees via interactive basis functions
CN112785005A (en) Multi-target task assistant decision-making method and device, computer equipment and medium
CN110442674A (en) Clustering method, terminal device, storage medium and the device that label is propagated
WO2022227171A1 (en) Method and apparatus for extracting key information, electronic device, and medium
WO2021120588A1 (en) Method and apparatus for language generation, computer device, and storage medium
CN115248890A (en) User interest portrait generation method and device, electronic equipment and storage medium
CN111667018A (en) Object clustering method and device, computer readable medium and electronic equipment
WO2020252925A1 (en) Method and apparatus for searching user feature group for optimized user feature, electronic device, and computer nonvolatile readable storage medium
CN116204709A (en) Data processing method and related device
CN110941638A (en) Application classification rule base construction method, application classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant