CN110134768A - Processing method, device, equipment and the storage medium of text - Google Patents

Processing method, device, equipment and the storage medium of text

Info

Publication number
CN110134768A
Authority
CN
China
Prior art keywords
text
node
tree
text node
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910395287.5A
Other languages
Chinese (zh)
Other versions
CN110134768B (en)
Inventor
赵旸
邱旻峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910395287.5A
Publication of CN110134768A
Application granted
Publication of CN110134768B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Abstract

This application discloses a text processing method, apparatus, device, and storage medium, and relates to the Internet field. The method includes: determining the pairwise similarity between texts from their word-segment sets; when the similarity is greater than a similarity threshold, connecting the text nodes of the two similar texts to generate a first tree structure graph; and determining a first root node in the first tree structure graph and taking the text corresponding to the first root node as the deduplicated text. By connecting similar texts, the method forms a tree structure graph; the root node of that graph can then be found quickly by traversal, which enables fast deduplication of a large number of texts.

Description

Processing method, device, equipment and the storage medium of text
Technical field
This application relates to the Internet field, and in particular to a text processing method, apparatus, device, and storage medium.
Background
In scenarios involving massive numbers of short texts, many of those short texts are similar to one another. To meet business requirements, these similar texts need to be filtered so that a deduplicated text collection can be output.
For deduplicating massive text collections, the SimHash algorithm is the most representative approach. SimHash maps each text to a 01 string (bit string), and similar texts yield similar 01 strings. The number of positions at which the 01 strings of two texts differ is the Hamming distance. When the Hamming distance is less than or equal to a distance threshold, the two texts have high similarity, are treated as similar texts, and need deduplication; when the Hamming distance is greater than the distance threshold, the two texts have low similarity, are treated as two different texts, and do not need deduplication.
When deduplicating long texts, the distance threshold for the Hamming distance is usually set to 3. This threshold measures the similarity between long texts accurately and greatly reduces the complexity and running time of the algorithm, so it achieves a good deduplication result. For short texts, however, the Hamming distance between two similar texts is typically greater than 3, and the larger the distance threshold, the higher the complexity of the algorithm and, correspondingly, the longer its running time. SimHash therefore cannot deduplicate short texts quickly while also guaranteeing high deduplication accuracy.
Summary of the invention
The embodiments of the present application provide a text processing method, apparatus, device, and storage medium, which can solve the problem that, when short texts are deduplicated, fast deduplication and high deduplication accuracy cannot both be achieved. The technical solution is as follows:
According to one aspect of the application, a text processing method is provided, the method including:
receiving at least two texts sent by a terminal, the at least two texts including a first text and a second text;
generating a first text node from the first text and a second text node from the second text, the first text node including a first word-segment set of the first text and the second text node including a second word-segment set of the second text;
determining a first similarity between the first word-segment set and the second word-segment set;
when the first similarity is greater than a similarity threshold, connecting the first text node and the second text node to generate a first tree structure graph; and
deduplicating the texts corresponding to the text nodes in the first tree structure graph.
According to another aspect of the application, a text processing apparatus is provided, the apparatus including:
a receiving module, configured to receive at least two texts sent by a terminal, the at least two texts including a first text and a second text;
a generation module, configured to generate a first text node from the first text and a second text node from the second text, the first text node including a first word-segment set of the first text and the second text node including a second word-segment set of the second text;
a determining module, configured to determine a first similarity between the first word-segment set and the second word-segment set;
a connection module, configured to connect the first text node and the second text node and generate a first tree structure graph when the first similarity is greater than a similarity threshold; and
a deduplication module, configured to deduplicate the texts corresponding to the text nodes in the first tree structure graph.
According to another aspect of the application, a terminal is provided, the terminal including:
a memory; and
a processor connected to the memory;
wherein the processor is configured to load and execute executable instructions to implement the text processing method described in the first aspect above and its optional embodiments.
According to another aspect of the application, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the text processing method described in the first aspect above and its optional embodiments.
The beneficial effects of the technical solutions provided by the embodiments of the present application include at least the following:
The server generates a first text node and a second text node from the first text and the second text sent by the terminal, where the first text node includes a first word-segment set and the second text node includes a second word-segment set. It determines a first similarity between the two word-segment sets and, when the first similarity is greater than the similarity threshold, connects the similar first and second text nodes to generate a first tree structure graph. It then determines a first root node in the first tree structure graph and takes the text corresponding to the first root node as the deduplicated text. The method connects similar texts to form a tree structure graph. Because the text nodes in one tree structure graph all belong to similar texts, keeping one text node from that graph and discarding the rest yields the deduplicated text node, and hence the deduplicated text. The root node of a tree structure graph can be found quickly by traversal, which enables fast deduplication of a large number of texts; meanwhile, setting a similarity threshold identifies similar texts accurately, which ensures deduplication accuracy.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below are only some embodiments of the application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the application;
Fig. 2 is a flowchart of a text processing method provided by an exemplary embodiment of the application;
Fig. 3 is a flowchart of a text processing method provided by another exemplary embodiment of the application;
Fig. 4 is a flowchart of a text processing method provided by another exemplary embodiment of the application;
Fig. 5 is a schematic diagram of a text node provided by an exemplary embodiment of the application;
Fig. 6 is a schematic structural diagram of an inverted index provided by an exemplary embodiment of the application;
Fig. 7 is a schematic structural diagram of an inverted index provided by another exemplary embodiment of the application;
Fig. 8 is a schematic structural diagram of an inverted index provided by another exemplary embodiment of the application;
Fig. 9 is a schematic diagram of a tree structure graph of text nodes provided by an exemplary embodiment of the application;
Fig. 10 is a schematic diagram of a tree structure graph of text nodes provided by another exemplary embodiment of the application;
Fig. 11 is a schematic diagram of a tree structure graph of text nodes provided by another exemplary embodiment of the application;
Fig. 12 is a schematic diagram of a tree structure subgraph of text nodes provided by an exemplary embodiment of the application;
Fig. 13 is a block diagram of a text processing apparatus provided by an exemplary embodiment of the application;
Fig. 14 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the application;
Fig. 15 is a schematic structural diagram of a server provided by an exemplary embodiment of the application.
Detailed description
To make the objectives, technical solutions, and advantages of the application clearer, the embodiments of the application are described in further detail below with reference to the drawings.
Several terms used in this application are briefly introduced first:
SimHash (similarity hash) algorithm: a deduplication algorithm for collections of hundreds of millions of texts. The algorithm maps each text to a 01 string with a hash function and computes the Hamming distance between two texts from their 01 strings; the Hamming distance characterizes the similarity between the two texts. Because similar texts yield similar 01 strings, the Hamming distance can be used to decide whether two texts are similar, and therefore whether they need deduplication.
In SimHash, comparing the 01 strings of two texts means counting the positions at which the two strings differ; that count is the Hamming distance. When the Hamming distance is less than or equal to the distance threshold, the two texts have high similarity, are treated as similar texts, and need deduplication; when the Hamming distance is greater than the distance threshold, the two texts have low similarity, are treated as two different texts, and do not need deduplication.
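The comparison described above (counting the bit positions at which two fingerprints differ, which is the Hamming distance) can be sketched as follows. This is a minimal illustration, assuming the 01 strings are held as Python integers; the function names and the default threshold of 3 are illustrative choices, not taken from the application.

```python
def hamming_distance(a: int, b: int, bits: int = 64) -> int:
    """Count the bit positions at which two fingerprints differ."""
    mask = (1 << bits) - 1
    # XOR leaves a 1 exactly where the two fingerprints disagree.
    return bin((a ^ b) & mask).count("1")


def is_similar(a: int, b: int, threshold: int = 3) -> bool:
    """Treat two texts as similar when the distance is at most the threshold."""
    return hamming_distance(a, b) <= threshold
```

For instance, `hamming_distance(0b1011, 0b0001)` counts the two positions where the strings differ.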
The distance threshold is usually set to 3. For example, divide a 64-bit hash value, from high bits to low bits, into four consecutive 16-bit blocks. When the Hamming distance is less than or equal to 3, at most three of the 16-bit blocks contain a differing bit, so two similar texts always share one 16-bit block that is identical at every position; the similarity between the two texts is high, the two texts can be judged similar, and they are deduplicated. When the Hamming distance is greater than 3, every one of the four 16-bit blocks may contain at least one differing bit; the similarity between the two texts is low, the two texts can be judged dissimilar, and no deduplication is needed.
Therefore, the 64-bit hash value of each text can be divided, from high bits to low bits, into four consecutive 16-bit 01 string segments. Each 16-bit segment serves as an index key, and every text that has the same 16-bit segment at the corresponding position is added under that key, forming an index. For example, the high 16 bits of text 1 serve as key 1, and the other texts whose high 16 bits are identical are added under key 1. Once the index is built, the texts under each key are deduplicated with SimHash, and the keys are processed in parallel.
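The block-based index described above can be sketched as below. This is a hedged illustration: the key is modeled as a `(block position, block value)` pair so that only blocks at the same position are grouped, and the names `split_blocks` and `build_block_index` are hypothetical.

```python
from collections import defaultdict


def split_blocks(fingerprint: int) -> list[int]:
    """Split a 64-bit fingerprint into four 16-bit blocks, high bits first."""
    return [(fingerprint >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def build_block_index(fingerprints: dict[str, int]) -> dict[tuple[int, int], list[str]]:
    """Group text IDs under every (block position, block value) key they contain."""
    index = defaultdict(list)
    for text_id, fingerprint in fingerprints.items():
        for position, block in enumerate(split_blocks(fingerprint)):
            index[(position, block)].append(text_id)
    return index


index = build_block_index({
    "t1": 0xAAAA000000000000,
    "t2": 0xAAAA000000000001,
})
```

Here `t1` and `t2` share their high 16 bits, so both land under the key `(0, 0xAAAA)` and become candidates for a Hamming-distance check, while their lowest blocks differ and fall under separate keys.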
Jaccard similarity: a measure that can characterize the similarity between two texts. As an illustration, define the Jaccard similarity for text A and text B: segmenting text A yields word-segment set A, and segmenting text B yields word-segment set B. The Jaccard similarity is then calculated as:
$$\mathrm{Sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where Sim(A, B) is the Jaccard similarity, |A ∩ B| is the number of word segments in the intersection of A and B, and |A ∪ B| is the number of word segments in the union of A and B.
The Jaccard similarity lies between 0 and 1. A Jaccard similarity of 0 indicates that the two texts are entirely different; a Jaccard similarity of 1 indicates that the two texts are identical. In practical text deduplication, a similarity threshold is usually set as the criterion. When the Jaccard similarity of two texts is greater than the similarity threshold, the two texts are judged to be similar and are deduplicated; when the Jaccard similarity of two texts is less than or equal to the similarity threshold, the two texts are judged to be dissimilar and are not deduplicated.
Union-Find (disjoint-set) algorithm: operates on a disjoint-set data structure. It can merge two disjoint subsets into one set and can determine which subset an element belongs to. In this application, the Union-Find algorithm can be used to determine all disjoint tree structure subgraphs in a tree structure graph, so as to obtain the text corresponding to the retained text node of each tree structure subgraph, that is, the deduplicated text.
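A generic disjoint-set structure of the kind referred to above might look like the following sketch. It is not the application's implementation: the class layout, the path-halving optimization, and the text identifiers are assumptions made for illustration.

```python
class UnionFind:
    """Minimal disjoint-set structure keyed by arbitrary hashable IDs."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the root (representative) of the set containing x."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, child, parent):
        """Merge the set containing child into the set containing parent."""
        root_child, root_parent = self.find(child), self.find(parent)
        if root_child != root_parent:
            self.parent[root_child] = root_parent


uf = UnionFind()
uf.union("text1", "text2")   # text1 and text2 are similar
uf.union("text3", "text2")   # text3 is similar to text2
uf.find("text4")             # text4 is similar to nothing and stays alone
roots = sorted({uf.find(t) for t in ["text1", "text2", "text3", "text4"]})
```

Each distinct root stands for one group of similar texts, so keeping only the roots (`text2` and `text4` here) leaves one text per group.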
In this application, n tree structure graphs can be joined to form a new tree structure graph, and the n tree structure graphs become subgraphs of the new tree structure graph. In the new tree structure graph, the root node of each subgraph is connected, as a child node, to the root node of the new tree structure graph.
Neo4j: a graph database used to store the nodes and connection relationships of a graph data structure; it is also a high-performance graph engine that supports the Union-Find algorithm. In this application, Neo4j stores the tree structure graph formed by the text nodes, and the deduplicated text nodes are output by means of the Union-Find algorithm.
LDA topic model: an unsupervised learning model that is trained to divide texts by topic; after texts are input, it groups texts with the same topic together. The number of topics has no fixed value and must be preset before training, and after training ends the training parameters need to be adjusted manually to reach the expected clustering effect.
When texts are deduplicated with the SimHash algorithm above and the distance threshold for the Hamming distance is set to 3, the similarity between long texts is measured accurately, the complexity and running time of the algorithm are greatly reduced, and a good deduplication result is achieved. For short texts, however, the Hamming distance between two similar texts is usually greater than 3, and the larger the distance threshold, the higher the complexity of the algorithm and, correspondingly, the longer its running time. SimHash therefore cannot deduplicate short texts quickly while also guaranteeing high deduplication accuracy. This application provides a text processing method to solve this problem.
Referring to Fig. 1, which shows a structural block diagram of the computer system provided by an exemplary embodiment, the computer system includes a terminal 120, a text processing server 140, a word segmentation server 160, and a database server 180.
An application program is installed on the terminal 120 and can be used to publish text information. Optionally, the application program may include at least one of an online game program, a communication program, a video playback program, and a lifestyle service program. When a user publishes text information through the application program, the terminal sends the text information to the text processing server 140 over a wired or wireless network.
The text processing server 140 is connected to the terminal 120 over a wired or wireless network. The text processing server 140 is also connected to the word segmentation server 160 and the database server 180, respectively, over a wired or wireless network.
The text processing server 140 receives the text information sent by the terminal 120, extracts the texts from it, and has the word segmentation server 160 segment each text to obtain its word-segment set. It then generates a text node for each text, each text node including the word-segment set of its text. The text processing server 140 also determines the pairwise similarity between texts from their word-segment sets, establishes different connection relationships between pairs of texts depending on whether they are similar or dissimilar, forms a tree structure graph, and stores the tree structure graph in the database server 180.
The text processing server 140 also uses a graph algorithm to obtain, from the tree structure graph in the database server 180, the texts corresponding to the deduplicated text nodes, and stores those texts in the database server 180. Optionally, the graph algorithm may be the Union-Find algorithm.
Optionally, the database server 180 includes a graph database and a text database. The text processing server 140 stores the tree structure graph in the graph database and the texts in the text database. Optionally, the graph database supports the Union-Find algorithm. Optionally, the graph database may be Neo4j.
In some embodiments, the text processing server 140 also performs a second round of deduplication on the deduplicated texts. Optionally, it does so with the SimHash algorithm and stores the resulting texts in the text database of the database server 180.
Referring to Fig. 2, which shows the text processing method provided by an exemplary embodiment, and taking the application of the method to the computer system shown in Fig. 1 as an example, the method includes:
Step 201: The terminal sends text information to the server.
Step 202: The server receives the text information sent by the terminal.
The text information includes at least two texts, so the server receives at least two texts sent by the terminal. The at least two texts include a first text and a second text.
Step 203: The server generates a first text node from the first text and a second text node from the second text.
The server segments the first text to obtain the first word-segment set of the first text, and generates a first text node that includes the first word-segment set. In the same way, the server generates a second text node that includes the second word-segment set of the second text.
Step 204: The server determines a first similarity between the first word-segment set and the second word-segment set.
Optionally, the first similarity between the first word-segment set and the second word-segment set is characterized by the Jaccard similarity. Illustrative steps are as follows:
The server counts the word segments in the intersection of the first word-segment set and the second word-segment set to obtain a first word-segment count;
The server counts the word segments in the union of the first word-segment set and the second word-segment set to obtain a second word-segment count;
The server determines the ratio of the first word-segment count to the second word-segment count as the Jaccard similarity.
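The three counting steps above can be sketched in a few lines. This is a minimal illustration that models each set of word segments as a Python `set` of strings; the sample tokens and the treatment of two empty sets are assumptions.

```python
def jaccard_similarity(tokens_a: set, tokens_b: set) -> float:
    """Ratio of the intersection size to the union size of two token sets."""
    if not tokens_a and not tokens_b:
        return 1.0  # assumption: two empty sets are treated as identical
    intersection_count = len(tokens_a & tokens_b)  # first count
    union_count = len(tokens_a | tokens_b)         # second count
    return intersection_count / union_count


first = {"cheap", "flights", "to", "paris"}
second = {"cheap", "flights", "to", "rome"}
similarity = jaccard_similarity(first, second)
```

The two sample sets share three of five distinct tokens, so the ratio is 3/5.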
Step 205: When the first similarity is greater than the similarity threshold, the server connects the first text node and the second text node to generate a first tree structure graph.
The similarity threshold is the threshold used to decide whether two texts are similar. When the similarity is greater than the similarity threshold, the two texts are judged similar; when the similarity is less than or equal to the similarity threshold, the two texts are judged dissimilar.
When the server determines that the first similarity is greater than the similarity threshold, it judges the first text and the second text to be similar, establishes a connection between the first text node and the second text node, and generates the first tree structure graph. The first tree structure graph represents the connection relationship between the first text node and the second text node.
Step 206: The server deduplicates the texts corresponding to the text nodes in the first tree structure graph.
The first tree structure graph includes a first root node. The server traverses the text nodes in the first tree structure graph, finds the first root node, and takes the text corresponding to the first root node as the deduplicated text.
As an illustration, the server traverses the text nodes in the first tree structure graph, determines that the second text node is the first root node, and takes the text corresponding to the first root node as the deduplicated text.
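Steps 205 and 206 can be sketched for the two-node case as follows. This is a simplified illustration, not the application's implementation: the tree is modeled as a plain parent map, and the threshold value 0.5, the node IDs, and the sample texts are all hypothetical.

```python
SIMILARITY_THRESHOLD = 0.5  # hypothetical value; the application does not fix one


def connect_if_similar(parent, first_id, second_id, similarity,
                       threshold=SIMILARITY_THRESHOLD):
    """Attach first_id under second_id only when similarity exceeds the threshold."""
    parent.setdefault(first_id, None)
    parent.setdefault(second_id, None)
    if similarity > threshold:
        parent[first_id] = second_id
        return True
    return False


def find_root(parent, node_id):
    """Walk parent links upward until a node without a parent is reached."""
    while parent[node_id] is not None:
        node_id = parent[node_id]
    return node_id


texts = {"n1": "cheap flights to paris", "n2": "cheap flights to rome"}
parent = {}
connect_if_similar(parent, "n1", "n2", similarity=0.6)
deduplicated = {texts[find_root(parent, node_id)] for node_id in parent}
```

Both nodes resolve to the root `n2`, so only the text of `n2` survives. A sketch handling more than two nodes would link roots rather than overwrite an existing parent.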
In summary, in the text processing method provided by this embodiment, the server generates a first text node and a second text node from the first text and the second text sent by the terminal, where the first text node includes a first word-segment set and the second text node includes a second word-segment set. It determines a first similarity between the two word-segment sets and, when the first similarity is greater than the similarity threshold, connects the similar first and second text nodes to generate a first tree structure graph. It then determines a first root node in the first tree structure graph and takes the text corresponding to the first root node as the deduplicated text. The method connects similar texts to form a tree structure graph. Because the text nodes in one tree structure graph all belong to similar texts, keeping one text node from that graph and discarding the rest yields the deduplicated text node, and hence the deduplicated text. The root node of a tree structure graph can be found quickly by traversal, which enables fast deduplication of a large number of texts; meanwhile, setting a similarity threshold identifies similar texts accurately, which ensures deduplication accuracy.
Based on Fig. 2, before the pairwise similarity between texts is determined, the at least two texts can be partitioned so that potentially similar texts are grouped together, which reduces the number of comparisons between texts. Illustrative steps are as follows:
The server generates an inverted index of the first text node and the second text node. The inverted index includes keys, and each key is generated from a word segment of a text. The server determines that the first text node and the second text node fall under the same key.
As an illustration, the first text node further includes a first node ID, and the second text node further includes a second node ID. The inverted index for the first text node and the second text node is built as follows:
The server creates keys from the first word-segment set of the first text node: the first word-segment set contains j word segments, each word segment serves as one key, and j keys are created accordingly. The first node ID of the first text node is stored under each of these keys.
The server matches each word segment in the second word-segment set of the second text node against the word segments corresponding to the j keys. When the p-th word segment matches the word segment corresponding to the q-th key, the second node ID of the second text node is stored under the q-th key. When the p-th word segment matches none of the word segments corresponding to the j keys, a new key is created from the p-th word segment and the second node ID is stored under the new key. Here j, q, and p are positive integers.
After the inverted index is generated, if the server determines that the first text node and the second text node fall under the same key, it calculates the similarity between the first word-segment set and the second word-segment set. Texts that share no key are entirely different texts and do not need to be compared for deduplication.
In the text processing method, the inverted index effectively groups texts that share a word segment and excludes entirely different texts. This greatly reduces the number of pairwise comparisons between texts and gathers potentially similar texts together, which improves screening precision.
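The inverted index and the resulting reduction in comparisons can be sketched as follows. This is an illustrative sketch under stated assumptions: each text node is reduced to a node ID mapped to its set of word segments, and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(nodes: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map every word segment (key) to the IDs of the nodes that contain it."""
    index = defaultdict(list)
    for node_id, tokens in nodes.items():
        for token in sorted(tokens):
            index[token].append(node_id)
    return dict(index)


def candidate_pairs(index: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Collect node pairs that share at least one key; only these get compared."""
    pairs = set()
    for node_ids in index.values():
        for a, b in combinations(sorted(node_ids), 2):
            pairs.add((a, b))
    return pairs


nodes = {
    "n1": {"cheap", "flights", "paris"},
    "n2": {"cheap", "flights", "rome"},
    "n3": {"weather", "tomorrow"},
}
index = build_inverted_index(nodes)
pairs = candidate_pairs(index)
```

Of the three possible pairs, only `("n1", "n2")` shares a key, so `n3` is never compared with the other two nodes.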
It should be noted that, based on Fig. 2, the at least two texts may include a third text in addition to the first text and the second text. Referring to Fig. 3, and in combination with the inverted index method above, when the third text is dissimilar to both the first text and the second text, the text processing method includes the following steps:
Step 301: The terminal sends text information to the server.
Step 302: The server receives the text information sent by the terminal.
Step 303: The server generates a first text node from the first text, a second text node from the second text, and a third text node from the third text.
The server generates a first node ID for the first text; the first node ID is the unique node identifier of the first text node. The server segments the content of the first text to obtain the first word-segment set, and generates a first text node that includes the first node ID and the first word-segment set.
Optionally, the first node ID may be generated randomly by the server or generated according to a predefined generation rule; the application places no limitation on this.
In the same way, the server generates the second text node from the second text and the third text node from the third text. The second text node includes the second word-segment set and a second node ID, and the third text node includes a third word-segment set and a third node ID.
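A text node carrying a unique node ID and a set of word segments can be sketched as below. The sketch makes two loud assumptions: the ID is a random UUID (the application allows either a random ID or one produced by a predefined rule), and `segment` is a whitespace-splitting placeholder, whereas real Chinese text would need a proper word segmenter.

```python
import uuid
from dataclasses import dataclass, field


def segment(text: str) -> frozenset:
    """Placeholder segmenter (assumption): lowercase and split on whitespace."""
    return frozenset(text.lower().split())


@dataclass
class TextNode:
    text: str
    tokens: frozenset = field(init=False)  # filled in from the text below
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self):
        self.tokens = segment(self.text)


first_node = TextNode("Cheap flights to Paris")
second_node = TextNode("cheap flights to Rome")
```

Each node gets its own ID, and the `tokens` field is what later steps compare and index.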
Step 304, server generates the inverted index of the first text node, the second text node and third text node.
Referring to the above-mentioned process to the first text node and the second text node inverted index, third text node is added It stores into index, that is, by third node ID to corresponding key assignments.
Step 305, server determines that the first text node, the second text node and third text node belong to same key assignments Under.
Server determines to include first node ID, second node ID and third node ID under a key assignments, then it represents that first Text node, the second text node and third text node are to belong to text node under same key assignments.
Step 306: the server generates a tree structure graph from the first text node, the second text node and the third text node under the same key.
The server obtains the first node ID, the second node ID and the third node ID under the same key, then obtains the first word-segmentation set according to the first node ID, the second word-segmentation set according to the second node ID, and the third word-segmentation set according to the third node ID.
The server determines the similarity between the first and second word-segmentation sets as the first similarity. When the first similarity is greater than the similarity threshold, the server establishes a connection between the first text node and the second text node, generating the first tree structure graph.
The server determines the second similarities between the third word-segmentation set and, respectively, the first and second word-segmentation sets under the same key. When the second similarity is less than or equal to the similarity threshold, the server generates a second tree structure graph from the third text node. The text nodes in the second tree structure graph and those in the first tree structure graph have no intersection.
The server merges the first and second tree structure graphs with the Union-Find algorithm, generating a new tree structure graph in which the first and second tree structure graphs are each a subgraph.
Illustratively, connecting the first text node and the second text node to generate the first tree structure graph can be implemented with the Union-Find algorithm. The illustrative steps are as follows.
Using the Union-Find (disjoint-set) algorithm, the server connects the first text node to the second text node to generate the first tree structure graph; that is, the server marks the connection between the first and second text nodes with a first connection identifier, which indicates that the first text node is similar to the second text node.
By default, the server sets the initial depth of both the first and second text nodes to i. Marking the connection with the first connection identifier and generating the first tree structure graph proceeds as follows. First, the second text node is determined as the parent node of the first text node; the depth of the second text node remains i and the depth of the first text node becomes i+1. Second, the connection between the first text node and its parent node is recorded in the first text node using the first connection identifier. Finally, the connected first and second text nodes form the first tree structure graph.
In the first tree structure graph, the second text node is the parent node of the first text node. While connecting to the second text node, the first text node marks its connection to the parent node with the first connection identifier, which indicates that the first text node is similar to its parent node, that is, to the second text node. The first text node also records the node identifier of the parent node, which is used to determine that a connection exists between the first text node and the parent node.
In a tree structure graph, a text node whose parent node is itself is the root node of that graph. The server therefore determines the third text node as the second root node and generates the second tree structure graph.
The Union-Find algorithm also uses a second connection identifier between two text nodes; it indicates that the child node and the parent node are dissimilar.
After determining the first root node of the first tree structure graph and the second root node of the second tree structure graph, the server marks the connection between the two root nodes with the second connection identifier, merging the first and second tree structure graphs into a new tree structure graph.
Optionally, the first root node in the first tree structure graph is determined as the parent node of the second root node in the second tree structure graph, and the two graphs are merged into the new tree structure graph by connecting the first root node and the second root node. The server also determines the first root node as the root node of the new tree structure graph; that is, the second text node becomes the root node of the new tree structure graph.
Here the second root node is the third text node, which is a child node of the second text node. The third text node marks its connection to the parent node with the second connection identifier, which indicates that the third text node is dissimilar to its parent node, that is, to the second text node. The third text node also records the node identifier of the parent node, which is used to determine that a connection exists between the third text node and the parent node.
After generating the new tree structure graph, the server generates the new depth relationship: the depth of the first text node is still i+1, the depth of the second text node is still i, and the depth of the third text node changes from i to i+1, where i is an integer.
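The parent, connection-identifier and depth bookkeeping described above can be sketched in Python as follows. The class name `LabeledForest`, the constants `SIMILAR` and `DISSIMILAR`, and the choice of 0 for the initial depth i are assumptions for illustration. The sketch only updates the depth of the attached node itself, which is sufficient for the three-node case described here.

```python
SIMILAR, DISSIMILAR = 1, 2  # first / second connection identifier (names assumed)


class LabeledForest:
    def __init__(self, initial_depth=0):
        self.parent = {}  # node ID -> parent node ID (a root is its own parent)
        self.link = {}    # node ID -> identifier of the connection to its parent
        self.depth = {}   # node ID -> depth in the tree
        self.initial_depth = initial_depth

    def add(self, node_id):
        self.parent[node_id] = node_id
        self.link[node_id] = None
        self.depth[node_id] = self.initial_depth

    def attach(self, child, parent, label):
        """Make `parent` the parent of `child`; record the label on the child."""
        self.parent[child] = parent
        self.link[child] = label
        self.depth[child] = self.depth[parent] + 1

    def subgraph_root(self, node_id):
        """Follow SIMILAR connections only, up to the root of the node's subgraph."""
        while self.parent[node_id] != node_id and self.link[node_id] == SIMILAR:
            node_id = self.parent[node_id]
        return node_id


forest = LabeledForest(initial_depth=0)
for nid in ("1", "2", "3"):
    forest.add(nid)
forest.attach("1", "2", SIMILAR)     # first tree: node 2 is the parent of node 1
forest.attach("3", "2", DISSIMILAR)  # merge: node 3 hangs under node 2
```

With i = 0, the resulting depths are 1 for node 1, 0 for node 2 and 1 for node 3, matching the depth relationship stated above.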
Step 307: the server performs deduplication on the texts corresponding to the tree structure graph.
The server processes the texts corresponding to the first and second tree structure graphs; that is, it traverses each text node in the new tree structure graph, determines from the connection identifiers and node identifiers the tree subgraphs contained in the new graph, and takes the root node of each tree subgraph as the text node corresponding to a deduplicated text.
Optionally, a text node includes information such as the text content, service attribute, publication time and context information of the text, and the text corresponding to a deduplicated text node is obtained from this information.
Illustratively, the first and second tree structure graphs are subgraphs of the new tree structure graph. The Union-Find algorithm identifies them within the new graph, traversal of the text nodes yields the first root node and the second root node of the two subgraphs, and the texts corresponding to these root nodes are determined as the deduplicated texts.
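One way to read the traversal in step 307 is sketched below in Python. A node is kept when it is the root of the merged graph or when its connection to its parent carries the dissimilar identifier, since both cases mark the root of a subgraph. The function name, the string labels and the dictionary layout are assumptions for this example, not the application's data format.

```python
def deduplicated_texts(parent, link, texts):
    """Return the texts of the subgraph roots in a merged tree.

    parent: node ID -> parent node ID (the overall root is its own parent)
    link:   node ID -> "similar" or "dissimilar" connection to the parent
    texts:  node ID -> text content
    """
    kept = []
    for node_id in parent:
        is_overall_root = parent[node_id] == node_id
        starts_subgraph = link.get(node_id) == "dissimilar"
        if is_overall_root or starts_subgraph:
            kept.append(texts[node_id])
    return kept


# The three-node case described above: node 2 is the root, node 1 is similar
# to it, and node 3 is attached with the dissimilar identifier.
parent = {"1": "2", "2": "2", "3": "2"}
link = {"1": "similar", "3": "dissimilar"}
texts = {"1": "first text", "2": "second text", "3": "third text"}
result = deduplicated_texts(parent, link, texts)  # ["second text", "third text"]
```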
In summary, in the text processing method provided by this embodiment, the server generates a first, second and third text node from the first, second and third texts sent by the terminal, where the first text node includes the first word-segmentation set, the second text node includes the second word-segmentation set, and the third text node includes the third word-segmentation set. The server generates a new tree structure graph according to the pairwise similarities among the first, second and third word-segmentation sets; the new graph contains two subgraphs, namely the first and second tree structure graphs. The server then traverses each text node in the new graph to obtain the deduplicated texts. The method establishes connections between similar texts to form a tree structure graph. Because all text nodes in one tree structure graph correspond to similar texts, keeping one text node from that graph and discarding the rest yields the deduplicated text node and hence the deduplicated text. Traversing the text nodes with the Union-Find algorithm quickly finds the root node of each subgraph, enabling fast deduplication of large numbers of texts; setting a similarity threshold determines similar texts accurately, which ensures the accuracy of deduplication.
In addition, each subgraph is an aggregation of similar texts, so the method also effectively aggregates similar texts. Compared with the LDA topic model, the method can aggregate similar texts without manually setting topics or manually tuning training parameters.
In addition, combining dissimilar tree structure graphs through the Union-Find algorithm and then traversing the text nodes increases the speed of deduplicating large numbers of texts while reducing the complexity of traversing multiple tree structure graphs.
Illustratively, the embodiment shown in Fig. 3 is explained with three texts as an example:
Text 1: Main broadcaster, take me along to play together next time.
Text 2: Main broadcaster, can we play together?
Text 3: We will go "opening black" (gaming slang for teaming up) together next time.
As shown in Fig. 4, the process is divided into the following six steps:
Step one: word segmentation.
The server segments the three texts, cutting each sentence into a combination of words and removing stop words, modal particles, punctuation marks and the like, so that the main information of the sentence is retained and forms a word-segmentation set. The word-segmentation sets obtained from the three texts are:
Text 1: ["main broadcaster", "next time", "together", "play"];
Text 2: ["main broadcaster", "we", "together", "play"];
Text 3: ["we", "next time", "together", "opening black"].
Step two: create nodes.
The server maps each text to a text node. The server generates the node ID of the text; optionally, the server also obtains the service attribute of the text. The server then generates the text node from the word-segmentation set, the node ID and the service attribute.
Optionally, the text node further includes information such as the text content, service attribute, publication time and context information of the text.
Fig. 5 shows text node 1, text node 2 and text node 3, which correspond to text 1, text 2 and text 3.
Text node 1 includes the node ID "1" of text 1, the word-segmentation set ["main broadcaster", "next time", "together", "play"] and the service "service 1";
Text node 2 includes the node ID "2" of text 2, the word-segmentation set ["main broadcaster", "we", "together", "play"] and the service "service 2";
Text node 3 includes the node ID "3" of text 3, the word-segmentation set ["we", "next time", "together", "opening black"] and the service "service 3".
Step three: add to the inverted index.
The server builds an inverted index for text node 1, text node 2 and text node 3, adding the node ID of each text to the index.
The server creates the inverted index for text 1. First, each word in text node 1 is used as a key, and a linked-list structure is created for each key; the linked list includes linked-list nodes, which store node IDs. Second, the node ID "1" is stored in the linked-list node.
As shown in Fig. 6, the node ID "1" is stored in the linked-list node under each of the keys "main broadcaster", "next time", "together" and "play".
The server creates the inverted index for text 2. First, it determines that the keys "main broadcaster", "together" and "play" have already been created, so only "we" is added as a new key. Second, the node ID "2" is stored in the corresponding linked-list nodes.
Optionally, when storing the node ID "2" under a key, the server may store it in a new linked-list node, or store it together with the node ID "1" in the same linked-list node.
Optionally, on determining that the linked-list node under the key "main broadcaster" stores the node ID "1", the server determines the similarity between the word-segmentation sets of text node 1 and text node 2; the calculated Jaccard similarity is 0.6 (three shared words out of five distinct words). Assuming a similarity threshold of 0.5, the server determines that the two word-segmentation sets are similar and stores the node ID "2" together with the node ID "1" in the same linked-list node.
As shown in Fig. 7, at this point the linked-list node under the key "main broadcaster" stores the node IDs "1" and "2"; the one under "next time" stores the node ID "1"; the one under "together" stores the node IDs "1" and "2"; the one under "play" stores the node IDs "1" and "2"; and the one under "we" stores the node ID "2".
The server creates the inverted index for text 3. First, it determines that the keys "we", "next time" and "together" have already been created, so only "opening black" is added as a new key. Second, the node ID "3" is stored in the corresponding linked-list nodes.
Optionally, on determining that the linked-list node under the key "together" stores the node IDs "1" and "2", the server traverses the corresponding text node 1 and text node 2 and determines the similarity between the word-segmentation sets of text node 3 and text node 1, and between those of text node 3 and text node 2. In both cases the calculated Jaccard similarity is 0.33 (rounded to two decimal places; two shared words out of six distinct words). Assuming a similarity threshold of 0.5, the server determines that the word-segmentation set of text node 3 is dissimilar to those of text node 1 and text node 2, and stores the node ID "3" in a new linked-list node.
As shown in Fig. 8, at this point the linked-list nodes under the key "main broadcaster" store the node IDs "1" and "2"; those under "next time" store "1" and "3"; those under "together" store "1", "2" and "3"; those under "play" store "1" and "2"; those under "we" store "2" and "3"; and the one under "opening black" stores "3".
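The two similarity values quoted above can be checked with a short Python sketch of the Jaccard similarity (size of the intersection divided by size of the union). The function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions of this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


t1 = ["main broadcaster", "next time", "together", "play"]
t2 = ["main broadcaster", "we", "together", "play"]
t3 = ["we", "next time", "together", "opening black"]

sim_12 = jaccard(t1, t2)            # 3 shared / 5 distinct = 0.6
sim_13 = round(jaccard(t1, t3), 2)  # 2 shared / 6 distinct = 0.33
sim_23 = round(jaccard(t2, t3), 2)  # 2 shared / 6 distinct = 0.33
```

With the threshold of 0.5 assumed in the example, only the pair of text 1 and text 2 exceeds the threshold.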
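The index state of Fig. 8 can be reproduced with a minimal Python sketch. For simplicity it stores a plain list of node IDs per key rather than the linked-list nodes described above, so it does not model whether similar IDs share one linked-list node; the function name `add_to_index` is hypothetical.

```python
def add_to_index(index, node_id, tokens):
    """Store node_id under every word of the node's word-segmentation set."""
    for token in tokens:
        index.setdefault(token, []).append(node_id)


index = {}
add_to_index(index, "1", ["main broadcaster", "next time", "together", "play"])
add_to_index(index, "2", ["main broadcaster", "we", "together", "play"])
add_to_index(index, "3", ["we", "next time", "together", "opening black"])
```

After the three insertions, the key "together" holds all three node IDs, which is why text node 3 is compared against both text node 1 and text node 2.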
Step four: establish node connections.
After determining that the word-segmentation set of text node 1 is similar to that of text node 2, the server establishes a connection between text node 1 and text node 2, generating the first tree structure graph, as shown in Fig. 9. After determining that the word-segmentation set of text node 3 is dissimilar to those of text node 1 and text node 2, the server determines text node 3 as the second tree structure graph, which is not connected to text node 1 or text node 2, as shown in Fig. 10.
Step five: text clustering.
Using the Union-Find algorithm, the server merges the first and second tree structure graphs, which have no intersection, into a new tree structure graph, thereby aggregating the text nodes. Fig. 11 is a schematic diagram of the new tree structure graph: in constructing it, the server determines text node 2 as the root node; text node 2 remains the root node of the subgraph "first tree structure graph", and text node 3 is the root node of the subgraph "second tree structure graph".
In the figure, the connection between text node 1 and text node 2 is drawn as a solid line and the connection between text node 2 and text node 3 as a dashed line, to indicate that the two connections are of different kinds.
Step six: output the result.
By traversing the text nodes in the new tree structure graph with the Union-Find algorithm, the server can identify the two disjoint subgraphs and then obtain text 2 and text 3, which correspond to the root nodes text node 2 and text node 3; these are the deduplicated texts, as shown in Fig. 12.
It should be noted that the server may, as needed, output the deduplicated texts and store them in a database.
In summary, in the text processing method provided by this embodiment, the server completes text deduplication in six steps: word segmentation, node creation, addition to the inverted index, establishment of node connections, text clustering and output of the result. The method establishes connections between similar texts to form a tree structure graph. Because all text nodes in one tree structure graph correspond to similar texts, keeping one text node from that graph and discarding the rest yields the deduplicated text node and hence the deduplicated text. Traversing the text nodes with the Union-Find algorithm quickly finds the root node of each subgraph, enabling fast deduplication of large numbers of texts; setting a similarity threshold determines similar texts accurately, which ensures the accuracy of deduplication.
In addition, each subgraph is an aggregation of similar texts, so the method also effectively aggregates similar texts. Compared with the LDA topic model, the method can aggregate similar texts without manually setting topics or manually tuning training parameters.
It should also be noted that the pseudocode for the algorithm of steps three and four above may be as follows:
Initial state:
Text set T = {t_1, t_2, ..., t_n}
Deduplicated set R = {}
Similarity threshold sim_th
Output:
Deduplication result R
Algorithm procedure:
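The body of the algorithm procedure is not reproduced in this text. The following Python sketch is therefore only an assumed reading of steps three and four, not the applicant's pseudocode: each text t_i is segmented, candidate nodes are looked up through the inverted index, and any candidate whose Jaccard similarity exceeds sim_th has its root attached under the new node. The function name `dedup_texts`, the `segment` callback and the rule that the newer node becomes the parent (chosen to match the worked example, where text node 2 is the parent of text node 1) are all assumptions.

```python
from collections import defaultdict


def dedup_texts(T, segment, sim_th):
    """Return the deduplicated result R for the text collection T."""
    parent = {}                # union-find forest over text positions
    tokens = {}                # position -> word-segmentation set
    index = defaultdict(list)  # inverted index: word -> positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, text in enumerate(T):
        tokens[i] = set(segment(text))
        parent[i] = i
        # Candidates are earlier nodes that share at least one key with t_i.
        candidates = {j for tok in tokens[i] for j in index[tok]}
        for j in sorted(candidates):
            sim = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])
            if sim > sim_th:
                root = find(j)
                if root != i:
                    parent[root] = i  # the newer node becomes the parent
        for tok in tokens[i]:
            index[tok].append(i)

    # One representative (the root) per group of similar texts.
    return [T[i] for i in range(len(T)) if find(i) == i]


# The three example texts, with their word-segmentation sets given directly.
SEGMENTS = {
    "text 1": ["main broadcaster", "next time", "together", "play"],
    "text 2": ["main broadcaster", "we", "together", "play"],
    "text 3": ["we", "next time", "together", "opening black"],
}
R = dedup_texts(["text 1", "text 2", "text 3"], SEGMENTS.__getitem__, 0.5)
# R == ["text 2", "text 3"]
```

Looking up candidates through the inverted index avoids comparing every pair of texts: only texts that share at least one word are ever compared.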
Refer to Fig. 13, which shows a text processing device provided by an exemplary embodiment of this application. The device may be implemented as part or all of a server through software, hardware or a combination of the two, with an application program installed. The device includes:
a receiving module 401, configured to receive at least two texts sent by the terminal, the at least two texts including a first text and a second text;
a first generation module 402, configured to generate a first text node from the first text and a second text node from the second text, where the first text node includes the first word-segmentation set of the first text and the second text node includes the second word-segmentation set of the second text;
a first determining module 403, configured to determine a first similarity between the first word-segmentation set and the second word-segmentation set;
a connection module 404, configured to establish a connection between the first text node and the second text node when the first similarity is greater than the similarity threshold, generating a first tree structure graph;
a deduplication module 405, configured to perform deduplication on the texts corresponding to the text nodes in the first tree structure graph.
In some embodiments, the connection module 404 is configured to mark the connection between the first text node and the second text node with a first connection identifier, generating the first tree structure graph; the first connection identifier indicates that the first text node is similar to the second text node.
In some embodiments, the depth of the first text node and of the second text node has the initial value i;
the connection module 404 is configured to determine the second text node as the parent node of the first text node; record the connection between the first text node and the parent node in the first text node using the first connection identifier; and generate the first tree structure graph, where the depth of the second text node is i, the depth of the first text node becomes i+1, and i is an integer.
In some embodiments, the deduplication module 405 is configured to traverse the text nodes in the first tree structure graph; determine the second text node in the first tree structure graph as the first root node; and determine the text corresponding to the first root node as the deduplicated text.
In some embodiments, the device further includes:
a second generation module 406, configured to generate the inverted index of the first text node and the second text node, where the inverted index includes keys generated from the segmented words of the texts;
a second determining module 407, configured to determine that the first text node and the second text node belong to the same key.
In some embodiments, the at least two texts further include a third text;
in the device:
the first generation module 402 is configured to generate a third text node from the third text, the third text node including a third word-segmentation set;
the second determining module 407 is configured to determine, according to the third word-segmentation set, that the third text node belongs to the same key as the first text node and the second text node;
the first determining module 403 is configured to determine the second similarities between the third word-segmentation set and, respectively, the first and second word-segmentation sets under the same key;
the connection module 404 is configured to generate a second tree structure graph from the third text node when the second similarity is less than or equal to the similarity threshold, where the text nodes in the second tree structure graph and those in the first tree structure graph have no intersection;
the deduplication module 405 is configured to perform deduplication on the texts corresponding to the first tree structure graph and the second tree structure graph.
In some embodiments, the deduplication module 405 is configured to determine the first root node of the first tree structure graph and the second root node of the second tree structure graph; mark the connection between the first root node and the second root node with a second connection identifier, merging the first and second tree structure graphs into a new tree structure graph, where the second connection identifier indicates that the first root node and the second root node are dissimilar and the first and second tree structure graphs are each a subgraph of the new tree structure graph; and traverse the new tree structure graph to obtain the deduplicated texts corresponding to the text nodes in each subgraph.
In some embodiments, the first generation module 402 is configured to generate the first node ID of the first text; segment the text content of the first text to obtain the first word-segmentation set; and generate the first text node including the first node ID and the first word-segmentation set.
In some embodiments, the first text node includes the first node ID and the second text node includes the second node ID;
the device further includes:
an obtaining module 408, configured to obtain the first word-segmentation set according to the first node ID and the second word-segmentation set according to the second node ID.
In summary, the text processing device provided by this embodiment generates a first, second and third text node from the first, second and third texts sent by the terminal, where the first text node includes the first word-segmentation set, the second text node includes the second word-segmentation set, and the third text node includes the third word-segmentation set. It generates a new tree structure graph according to the pairwise similarities among the first, second and third word-segmentation sets; the new graph contains two subgraphs, namely the first and second tree structure graphs. It then traverses each text node in the new graph to obtain the deduplicated texts. The device establishes connections between similar texts to form a tree structure graph. Because all text nodes in one tree structure graph correspond to similar texts, keeping one text node from that graph and discarding the rest yields the deduplicated text node and hence the deduplicated text. Traversing the tree structure graph quickly finds the root node of each subgraph, enabling fast deduplication of large numbers of texts; setting a similarity threshold determines similar texts accurately, which ensures the accuracy of deduplication.
Refer to Fig. 14, which shows a structural block diagram of a terminal 500 provided by an exemplary embodiment of this application. The terminal 500 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer or a desktop computer. The terminal 500 may also be referred to by other names such as user equipment, portable terminal, laptop terminal or desktop terminal.
Generally, the terminal 500 includes a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, for example a 4-core processor or a 5-core processor. The processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.
The memory 502 may include one or more computer-readable storage media, which may be non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 502 stores at least one instruction, which is executed by the processor 501 to implement the text processing method provided in the method embodiments of this application.
In some embodiments, the terminal 500 optionally further includes a peripheral device interface 503 and at least one peripheral device. The processor 501, the memory 502 and the peripheral device interface 503 may be connected by a bus or signal lines. Each peripheral device may be connected to the peripheral device interface 503 by a bus, a signal line or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 504, a display screen 505, an audio circuit 506, a positioning component 507 and a power supply 508.
The peripheral device interface 503 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 501 and the memory 502. In some embodiments, the processor 501, the memory 502 and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502 and the peripheral device interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 504 receives and transmits RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices through electromagnetic signals: it converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card and so on. The radio frequency circuit 504 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocols include but are not limited to metropolitan area networks, mobile communication networks of each generation (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may also include circuitry related to NFC (Near Field Communication), which is not limited in this application.
The display screen 505 displays a UI (User Interface). The UI may include graphics, text, icons, video and any combination of them. When the display screen 505 is a touch display screen, it can also collect touch signals on or above its surface. A touch signal may be input to the processor 501 as a control signal for processing. In this case, the display screen 505 may also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, arranged on the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, arranged on different surfaces of the terminal 500 or in a folding design; in still other embodiments, the display screen 505 may be a flexible display screen arranged on a curved or folding surface of the terminal 500. The display screen 505 may even be set to a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The audio circuit 506 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 501 for processing or to the radio frequency circuit 504 for voice communication. For stereo collection or noise reduction, there may be multiple microphones arranged at different parts of the terminal 500. The microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 506 may also include a headphone jack.
The positioning component 507 locates the current geographic position of the terminal 500 to implement navigation or LBS (Location Based Service). The positioning component 507 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia or the Galileo system of the European Union.
The power supply 508 supplies power to the components in the terminal 500. The power supply 508 may be alternating current, direct current, a disposable battery or a rechargeable battery. When the power supply 508 includes a rechargeable battery, the battery may support wired or wireless charging and may also support fast-charging technology.
Those skilled in the art will understand that the structure shown in Fig. 14 does not limit the terminal 500, which may include more or fewer components than shown, combine certain components, or use a different component arrangement.
Figure 15 shows the structural schematic diagram of the server of the application one embodiment offer.The server is for implementing The processing method of the text provided in embodiment is provided.Specifically:
The server 600 is including central processing unit (CPU) 601 including random access memory (RAM) 602 and only Read the system storage 604 of memory (ROM) 603, and the system of connection system storage 604 and central processing unit 601 Bus 605.The server 600 further includes the basic input/output that information is transmitted between each device helped in computer System (I/O system) 606, and large capacity for storage program area 613, application program 614 and other program modules 615 are deposited Store up equipment 607.
The basic input/output system 606 includes a display 608 for displaying information and an input device 609, such as a mouse or keyboard, through which the user inputs information. The display 608 and the input device 609 are both connected to the central processing unit 601 through an input/output controller 610 connected to the system bus 605. The basic input/output system 606 may also include the input/output controller 610 for receiving and processing input from multiple other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 610 also provides output to a display screen, printer, or other type of output device.
The mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable medium provide non-volatile storage for the server 600. That is, the mass storage device 607 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, tape cassettes, magnetic tape, disk storage, or other magnetic storage devices. Of course, those skilled in the art will know that computer storage media are not limited to the above. The system memory 604 and the mass storage device 607 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 600 may also run via a remote computer connected to a network such as the Internet. That is, the server 600 may be connected to the network 612 through a network interface unit 611 connected to the system bus 605; in other words, the network interface unit 611 may also be used to connect to other types of networks or remote computer systems (not shown).
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (12)

1. A text processing method, characterized in that the method comprises:
receiving at least two texts sent by a terminal, the at least two texts including a first text and a second text;
generating a first text node according to the first text and a second text node according to the second text, the first text node including a first word-segment set of the first text and the second text node including a second word-segment set of the second text;
determining a first similarity between the first word-segment set and the second word-segment set;
when the first similarity is greater than a similarity threshold, establishing a connection relationship between the first text node and the second text node to generate a first tree structure graph;
performing deduplication on the texts corresponding to the text nodes in the first tree structure graph.
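The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: whitespace splitting stands in for a real word segmenter, Jaccard overlap stands in for the unspecified similarity measure, and the names `TextNode`, `segment`, `jaccard`, and `deduplicate` as well as the threshold of 0.5 are all hypothetical. For simplicity the earlier text is treated as the parent of a later similar text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextNode:
    node_id: int
    text: str
    words: frozenset              # word-segment (participle) set of the text
    parent: Optional[int] = None  # node_id of the similar node this one links to


def segment(text: str) -> frozenset:
    # Assumption: whitespace splitting stands in for a real word segmenter.
    return frozenset(text.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    # Assumption: Jaccard overlap is used as the set similarity.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.5):
    """Link each text to an earlier similar root and keep only the roots."""
    nodes = [TextNode(i, t, segment(t)) for i, t in enumerate(texts)]
    for i, node in enumerate(nodes):
        for earlier in nodes[:i]:
            # Only link to tree roots, so each tree keeps one representative.
            if earlier.parent is None and jaccard(node.words, earlier.words) > threshold:
                node.parent = earlier.node_id
                break
    return [n.text for n in nodes if n.parent is None]
```

Called on `["the cat sat on the mat", "the cat sat on a mat", "stock prices fell sharply today"]`, the second text shares five of six distinct words with the first and is linked to it, so only the first and third texts are kept.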
2. The method according to claim 1, characterized in that establishing the connection relationship between the first text node and the second text node to generate the first tree structure graph comprises:
marking the connection relationship between the first text node and the second text node with a first connection identifier to generate the first tree structure graph, the first connection identifier indicating that the first text node is similar to the second text node.
3. The method according to claim 2, characterized in that the depths of the first text node and the second text node are both an initial value i;
marking the connection relationship between the first text node and the second text node with the first connection identifier to generate the first tree structure graph comprises:
determining the second text node as the parent node of the first text node;
recording the connection relationship between the first text node and the parent node in the first text node using the first connection identifier;
generating the first tree structure graph, wherein the depth of the second text node is i, the depth of the first text node becomes i+1, and i is an integer.
4. The method according to any one of claims 1 to 3, characterized in that performing deduplication on the texts corresponding to the text nodes in the first tree structure graph comprises:
traversing the text nodes in the first tree structure graph;
determining the second text node in the first tree structure graph as a first root node;
determining the text corresponding to the first root node as the deduplicated text.
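The parent, depth, and root bookkeeping of claims 3 and 4 can be sketched as follows. This is an illustrative assumption only: the class `TextNode`, the functions `link_similar` and `find_root`, and the string value used for the first connection identifier are hypothetical, and the initial depth i is taken to be 0. The sketch links a single node and does not update the depths of an already linked subtree.

```python
SIMILAR = "similar"  # hypothetical value for the first connection identifier


class TextNode:
    def __init__(self, node_id, text, depth=0):
        self.node_id = node_id
        self.text = text
        self.depth = depth   # initial value i (assumed to be 0 here)
        self.parent = None   # parent node once linked
        self.link = None     # connection identifier recorded in the child


def link_similar(child, parent):
    """Record the parent and the identifier in the child; child depth becomes i+1."""
    child.parent = parent
    child.link = SIMILAR
    child.depth = parent.depth + 1


def find_root(node):
    """Follow parent links upward; the root's text is the deduplicated text."""
    while node.parent is not None:
        node = node.parent
    return node
```

For example, after `link_similar(first, second)` the call `find_root(first).text` returns the text of `second`, which is the text kept after deduplication.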
5. The method according to any one of claims 1 to 3, characterized in that, before determining the similarity between the first word-segment set and the second word-segment set, the method comprises:
generating an inverted index of the first text node and the second text node, the inverted index including key values generated from the word segments of the texts;
determining that the first text node and the second text node belong to the same key value.
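The inverted-index pre-filter of claim 5 can be sketched as below. The claim does not say how key values are derived from the word segments, so this sketch assumes the simplest case in which every word is itself a key; the function names `build_inverted_index` and `candidate_pairs` are hypothetical. Only node pairs that share at least one key are passed on to the similarity computation.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(word_sets):
    """Map each word (used as the key) to the set of node IDs whose word set contains it."""
    index = defaultdict(set)
    for node_id, words in word_sets.items():
        for word in words:
            index[word].add(node_id)
    return index


def candidate_pairs(index):
    """Return the node-ID pairs that fall under at least one common key."""
    pairs = set()
    for node_ids in index.values():
        pairs.update(combinations(sorted(node_ids), 2))
    return pairs
```

The design intent is to avoid comparing every text with every other text: texts with no key in common cannot be similar under a set-overlap measure, so they are never compared.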
6. The method according to claim 5, characterized in that the at least two texts further include a third text;
the method further comprises:
generating a third text node according to the third text, the third text node including a third word-segment set;
determining, according to the third word-segment set, that the third text node belongs to the same key value as the first text node and the second text node;
determining second similarities between the third word-segment set and, respectively, the first word-segment set and the second word-segment set under the same key value;
when the second similarities are less than or equal to the similarity threshold, generating a second tree structure graph according to the third text node, the text nodes in the second tree structure graph having no intersection with the text nodes in the first tree structure graph;
performing deduplication on the texts corresponding to the first tree structure graph and the second tree structure graph.
7. The method according to claim 6, characterized in that performing deduplication on the texts corresponding to the first tree structure graph and the second tree structure graph comprises:
determining the first root node of the first tree structure graph and a second root node of the second tree structure graph;
marking the connection relationship between the first root node and the second root node with a second connection identifier and merging the first tree structure graph and the second tree structure graph to generate a new tree structure graph, the second connection identifier indicating that the first root node and the second root node are dissimilar, and the first tree structure graph and the second tree structure graph each being a subgraph of the new tree structure graph;
traversing the new tree structure graph to obtain the deduplicated text corresponding to the text nodes in each subgraph.
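The merge step of claim 7 can be sketched as follows, under stated assumptions. The claim does not fix the direction of the link between the two roots, so this sketch attaches the second root as a child of the first; the identifier values and the names `TextNode`, `merge_trees`, and `deduplicated_texts` are hypothetical. Each subgraph contributes one text: the node reached through a "dissimilar" edge, plus the overall root.

```python
SIMILAR, DISSIMILAR = "similar", "dissimilar"  # hypothetical identifier values


class TextNode:
    def __init__(self, text):
        self.text = text
        self.children = []  # list of (child node, connection identifier)

    def add_child(self, child, label):
        self.children.append((child, label))


def merge_trees(first_root, second_root):
    """Join two trees by a 'dissimilar' edge between their roots (direction assumed)."""
    first_root.add_child(second_root, DISSIMILAR)
    return first_root


def deduplicated_texts(root):
    """Traverse the merged tree and keep one text per subgraph."""
    kept = [root.text]
    stack = [root]
    while stack:
        node = stack.pop()
        for child, label in node.children:
            if label == DISSIMILAR:
                kept.append(child.text)  # root of another subgraph
            stack.append(child)
    return kept
```

Nodes reached through a "similar" edge are near-duplicates of their subgraph root and are skipped, so a merged tree built from two subgraphs yields exactly two texts.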
8. The method according to any one of claims 1 to 3, characterized in that generating the first text node according to the first text comprises:
generating a first node ID of the first text;
segmenting the text content of the first text into words to obtain the first word-segment set;
generating the first text node including the first node ID and the first word-segment set.
9. The method according to claim 8, characterized in that the first text node includes the first node ID and the second text node includes a second node ID;
before determining the similarity between the first word-segment set and the second word-segment set, the method further comprises:
obtaining the first word-segment set according to the first node ID, and obtaining the second word-segment set according to the second node ID.
10. A text processing apparatus, characterized in that the apparatus comprises:
a receiving module, configured to receive at least two texts sent by a terminal, the at least two texts including a first text and a second text;
a first generation module, configured to generate a first text node according to the first text and a second text node according to the second text, the first text node including a first word-segment set of the first text and the second text node including a second word-segment set of the second text;
a first determining module, configured to determine a first similarity between the first word-segment set and the second word-segment set;
a connection module, configured to, when the first similarity is greater than a similarity threshold, establish a connection relationship between the first text node and the second text node to generate a first tree structure graph;
a deduplication module, configured to perform deduplication on the texts corresponding to the text nodes in the first tree structure graph.
11. A terminal, the terminal comprising:
a memory;
a processor connected to the memory;
wherein the processor is configured to load and execute executable instructions to implement the text processing method according to any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the text processing method according to any one of claims 1 to 9.
CN201910395287.5A 2019-05-13 2019-05-13 Text processing method, device, equipment and storage medium Active CN110134768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910395287.5A CN110134768B (en) 2019-05-13 2019-05-13 Text processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910395287.5A CN110134768B (en) 2019-05-13 2019-05-13 Text processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110134768A CN110134768A (en) 2019-08-16
CN110134768B CN110134768B (en) 2023-05-26

Family

ID=67573701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910395287.5A Active CN110134768B (en) 2019-05-13 2019-05-13 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110134768B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130031461A1 (en) * 2011-07-29 2013-01-31 Hewlett-Packard Development Company, L.P. Detecting repeat patterns on a web page
CN106294350A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 A kind of text polymerization and device
CN106407195A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Method and system for eliminating duplication of webpage
CN107025218A (en) * 2017-04-07 2017-08-08 腾讯科技(深圳)有限公司 A kind of text De-weight method and device
CN107644010A (en) * 2016-07-20 2018-01-30 阿里巴巴集团控股有限公司 A kind of Text similarity computing method and device
CN107844527A (en) * 2017-10-13 2018-03-27 平安科技(深圳)有限公司 Web page address De-weight method, electronic equipment and computer-readable recording medium
CN109299443A (en) * 2018-09-04 2019-02-01 中山大学 A kind of newsletter archive De-weight method based on Minimum Vertex Covering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
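YU WANG et al.: "A Fast KNN algorithm for text categorization", 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS *
MA YUE: "Research on near-duplicate web page deduplication based on a body-text structure tree", China Master's Theses Full-text Database, Information Science and Technology *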

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765756A (en) * 2019-10-29 2020-02-07 北京齐尔布莱特科技有限公司 Text processing method and device, computing equipment and medium
CN110765756B (en) * 2019-10-29 2023-12-01 北京齐尔布莱特科技有限公司 Text processing method, device, computing equipment and medium
CN110955751A (en) * 2019-11-13 2020-04-03 广州供电局有限公司 Method, device and system for removing duplication of work ticket text and computer storage medium
CN113688629A (en) * 2021-08-04 2021-11-23 德邦证券股份有限公司 Text deduplication method and device and storage medium

Also Published As

Publication number Publication date
CN110134768B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
US20180357312A1 (en) Generating a playlist
CN109918669A (en) Entity determines method, apparatus and storage medium
CN107451175A (en) A kind of data processing method and equipment based on block chain
CN109166593A (en) audio data processing method, device and storage medium
CN107251011A (en) Training system and method for sequence label device
CN109522538A (en) Table content divides column method, apparatus, equipment and storage medium automatically
TW202029079A (en) Method and device for identifying irregular group
CN108536463A (en) Obtain method, apparatus, equipment and the computer readable storage medium of resource packet
CN102999562B (en) Routing inquiry result
CN107526777A (en) A kind of method and apparatus handled based on version number file
US20220100972A1 (en) Configurable generic language understanding models
CN110134768A (en) Processing method, device, equipment and the storage medium of text
CN106233282A (en) Use the application searches of capacity of equipment
US20150248409A1 (en) Sorting and displaying documents according to sentiment level in an online community
CN106970958B (en) A kind of inquiry of stream file and storage method and device
CN108536753A (en) The determination method and relevant apparatus of duplicate message
US20180018392A1 (en) Topic identification based on functional summarization
CN108572789A (en) Disk storage method and apparatus, information push method and device and electronic equipment
US20180365551A1 (en) Cognitive communication assistant services
CN105612511A (en) Identifying and structuring related data
CN107609880A (en) A kind of user's appraisal procedure, device and equipment being directed to using sharing articles
CN109346102B (en) Method and device for detecting audio beginning crackle and storage medium
CN113935332A (en) Book grading method and book grading equipment
CN113971400B (en) Text detection method and device, electronic equipment and storage medium
CN111062490B (en) Method and device for processing and identifying network data containing private data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant