CN106649273A - Text processing method and text processing device - Google Patents

Text processing method and text processing device

Info

Publication number
CN106649273A
Authority
CN
China
Prior art keywords
similarity
threshold value
candidate solution
multigroup
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611220192.2A
Other languages
Chinese (zh)
Other versions
CN106649273B (en)
Inventor
董超
张霞
赵立军
崔朝辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201611220192.2A
Publication of CN106649273A
Application granted
Publication of CN106649273B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text processing method and a text processing device. The method comprises: after randomly acquiring two texts to be detected, obtaining a first-type similarity and a second-type similarity between the two texts according to at least a first similarity algorithm and a second similarity algorithm, and then obtaining the similarity between the two texts according to the first-type similarity, the second-type similarity, a first threshold value and a second threshold value. That is, two type similarities are obtained with at least two algorithms, and the similarity indicating whether the two texts are similar is obtained from the two type similarities and their corresponding threshold values. Compared with judging whether two texts are similar by means of a single similarity algorithm, this approach improves the accuracy of the similarity that indicates whether the two texts are similar, and therefore improves detection accuracy.

Description

Text processing method and device
Technical field
The invention belongs to the technical field of text information processing, and more particularly relates to a text processing method and device.
Background
As computers are increasingly applied to natural language processing tasks such as text information processing, an effective and accurate method is needed to calculate the text similarity between a text to be detected and an already-detected text, where an already-detected text is one that has passed text similarity detection. Methods for calculating text similarity, particularly for short texts, play an increasingly important role in text-related computer research and applications. For example, in text retrieval, short-text similarity can improve the recall and precision of a search engine; in text mining, short-text similarity serves as a measure for discovering latent knowledge in text databases; in web-based image retrieval, the descriptive short text surrounding an image can be used to improve accuracy.
At present, text similarity can be calculated by segmenting the two texts to be detected with a word segmentation technique to obtain each word in the two texts, mapping the resulting words into a VSM (Virtual Switch Matrix), vectorizing the text fragments through the VSM, and then obtaining the fragment similarity between the two texts with a vector similarity algorithm. The similarity between the two texts is obtained from the fragment similarity. However, similarity obtained through such vectorization is highly sensitive to missing words, which reduces the accuracy of the similarity. High sensitivity to missing words means that, when the similarity is calculated, a difference in words can change the similarity value greatly.
Summary of the invention
In view of this, the object of the invention is to provide a text processing method and device for improving the accuracy of the similarity and, in turn, the accuracy of detection. The technical solution is as follows:
The invention provides a text processing method, the method comprising:
Randomly obtaining two texts to be detected;
Calculating, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
Obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
When the similarity between the two texts to be detected is within a preset range, determining that the two texts to be detected are similar;
When the similarity between the two texts to be detected is not within the preset range, determining that the two texts to be detected are not similar.
Preferably, obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold comprises:
Obtaining a first similarity between the two texts to be detected according to the first-type similarity and the first threshold;
Obtaining a second similarity between the two texts to be detected according to the second-type similarity and the second threshold;
Obtaining the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight and a preset second weight.
Preferably, the method further comprises: obtaining in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm;
Obtaining the first threshold and the second threshold in advance comprises:
Randomly generating multiple groups of candidate solutions, each group of candidate solutions comprising a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm;
Obtaining multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the acquisition process is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, with that fitness function, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions from the similarity between each pair of training samples; and selecting multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples comprises two texts whose similarity has been manually labelled, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
Performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and performing the acquisition process on the new candidate solutions to obtain multiple groups of best candidate solutions from them, until a preset condition is met;
Selecting the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, the third threshold in the selected best candidate solution serving as the first threshold and the fourth threshold in the selected best candidate solution serving as the second threshold.
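The evaluation of candidate solutions described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the text does not give the fitness function, so the sketch takes fitness to be the fraction of labelled training pairs that a candidate classifies correctly; the decision rule, the `step` function, the equal weights and the toy training pairs are all hypothetical. It covers only random generation, fitness evaluation and choosing the fittest candidate, not the crossover and mutation loop.

```python
import random

def step(x):
    # Assumed threshold function: 1 for a positive margin, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def score(sim1, sim2, t3, t4, a=0.5, b=0.5):
    # Hypothetical decision rule combining the two thresholded similarities.
    return step(a * step(sim1 - t3) + b * step(sim2 - t4))

def fitness(candidate, training_pairs):
    """Fraction of labelled pairs on which the candidate agrees with the label."""
    t3, t4 = candidate
    hits = sum(
        (score(s1, s2, t3, t4) == 1.0) == label
        for s1, s2, label in training_pairs
    )
    return hits / len(training_pairs)

def random_candidates(n, rng):
    # Each candidate is (third threshold, fourth threshold), both in [0, 1).
    return [(rng.random(), rng.random()) for _ in range(n)]

# Toy data: (first-type similarity, second-type similarity, manual label "similar?")
training_pairs = [
    (0.90, 0.95, True),
    (0.80, 0.91, True),
    (0.30, 0.40, False),
    (0.20, 0.55, False),
]

rng = random.Random(0)
population = random_candidates(20, rng)
best = max(population, key=lambda c: fitness(c, training_pairs))
```

In a full run, `best` would only be read off after the selection, crossover and mutation rounds have met the preset stopping condition.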
Preferably, selecting multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
Obtaining the sum of the fitness of all candidate solutions;
Obtaining the relative fitness of each group of candidate solutions according to the fitness sum of all candidate solutions and the fitness of each group of candidate solutions;
Randomly generating a value between 0 and 1, and selecting the multiple groups of best candidate solutions according to the randomly generated value.
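One common way to realise this kind of selection is roulette-wheel selection, sketched below. The text does not say exactly how the random value picks a candidate, so the cumulative-interval rule here is an assumption, as are the function name `roulette_select` and the requirement that the fitness sum be positive.

```python
import random
from itertools import accumulate

def roulette_select(candidates, fitnesses, k, rng):
    """Pick k candidates, each with probability equal to its relative fitness."""
    total = sum(fitnesses)                      # fitness sum of all candidates (assumed > 0)
    relative = [f / total for f in fitnesses]   # relative fitness of each candidate
    cumulative = list(accumulate(relative))     # upper bounds of each candidate's interval
    chosen = []
    for _ in range(k):
        r = rng.random()                        # random value in [0, 1)
        for cand, upper in zip(candidates, cumulative):
            if r < upper:
                chosen.append(cand)
                break
        else:
            # Guard against floating-point rounding in the last cumulative sum.
            chosen.append(candidates[-1])
    return chosen

rng = random.Random(1)
picked = roulette_select([(0.2, 0.5), (0.6, 0.9), (0.7, 0.8)], [0.25, 0.75, 1.0], 2, rng)
```

Candidates with higher relative fitness occupy a wider interval and are therefore drawn more often, but a weaker candidate can still be selected.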
Preferably, the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented in binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome;
Performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions comprises:
Randomly pairing the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
Randomly setting a crossover point position according to the length of the chromosome corresponding to the third threshold, and exchanging, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the third thresholds;
Randomly setting a gene mutation position in the chromosome corresponding to the third threshold, and inverting the gene at the gene mutation position;
After part of the genes have been exchanged between the chromosomes corresponding to the third thresholds and the gene at the gene mutation position has been inverted, obtaining the third threshold corresponding to the changed chromosome;
Randomly pairing the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
Randomly setting a crossover point position according to the length of the chromosome corresponding to the fourth threshold, and exchanging, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the fourth thresholds;
Randomly setting a gene mutation position in the chromosome corresponding to the fourth threshold, and inverting the gene at the gene mutation position;
After part of the genes have been exchanged between the chromosomes corresponding to the fourth thresholds and the gene at the gene mutation position has been inverted, obtaining the fourth threshold corresponding to the changed chromosome;
Obtaining the multiple groups of new candidate solutions according to the third thresholds and the fourth thresholds corresponding to the changed chromosomes.
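The crossover and mutation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the 10-bit chromosome length, the fixed-point encoding, single-point crossover, one flipped bit per chromosome, and all function names are assumptions chosen for the example.

```python
import random

BITS = 10  # assumed chromosome length; the text does not fix one

def encode(threshold, bits=BITS):
    """Map a threshold in [0, 1] to a fixed-length binary string (chromosome)."""
    level = round(threshold * (2 ** bits - 1))
    return format(level, f"0{bits}b")

def decode(chromosome):
    """Map a binary string back to a threshold in [0, 1]."""
    return int(chromosome, 2) / (2 ** len(chromosome) - 1)

def crossover(c1, c2, rng):
    """Swap the tails of two chromosomes after a random crossover point."""
    point = rng.randint(1, len(c1) - 1)
    return c1[:point] + c2[point:], c2[:point] + c1[point:]

def mutate(chromosome, rng):
    """Invert the bit at one randomly chosen position."""
    pos = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[pos] == "0" else "0"
    return chromosome[:pos] + flipped + chromosome[pos + 1:]

def next_thresholds(thresholds, rng):
    """Randomly pair chromosomes, cross each pair, then mutate every child."""
    chromosomes = [encode(t) for t in thresholds]
    rng.shuffle(chromosomes)  # random pairing: neighbours after the shuffle form a pair
    children = []
    for i in range(0, len(chromosomes) - 1, 2):
        children.extend(crossover(chromosomes[i], chromosomes[i + 1], rng))
    if len(chromosomes) % 2:  # an unpaired chromosome skips crossover
        children.append(chromosomes[-1])
    return [decode(mutate(c, rng)) for c in children]

rng = random.Random(42)
third = [0.20, 0.50, 0.70, 0.90]    # third thresholds of the best candidates
fourth = [0.60, 0.80, 0.85, 0.95]   # fourth thresholds of the best candidates
new_candidates = list(zip(next_thresholds(third, rng), next_thresholds(fourth, rng)))
```

The third and fourth thresholds are evolved separately and then recombined into new candidate solutions, mirroring the separate treatment of the two chromosomes in the text.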
The invention also provides a text processing device, the device comprising:
A text acquiring unit, configured to randomly obtain two texts to be detected;
A first computing unit, configured to calculate, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
A second computing unit, configured to obtain the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
A determining unit, configured to determine that the two texts to be detected are similar when the similarity between them is within a preset range, and to determine that the two texts to be detected are not similar when the similarity between them is not within the preset range.
Preferably, the second computing unit is configured to obtain a first similarity between the two texts to be detected according to the first-type similarity and the first threshold, to obtain a second similarity between the two texts to be detected according to the second-type similarity and the second threshold, and to obtain the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight and a preset second weight.
Preferably, the device further comprises: an obtaining unit, configured to obtain in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm;
The obtaining unit comprises: a first generating subunit, a first selecting subunit, a second generating subunit and a second selecting subunit;
The first generating subunit is configured to randomly generate multiple groups of candidate solutions, each group of candidate solutions comprising a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm;
The first selecting subunit is configured to obtain multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the acquisition process is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, with that fitness function, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions from the similarity between each pair of training samples; and selecting multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples comprises two texts whose similarity has been manually labelled, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
The second generating subunit is configured to perform crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and to perform the acquisition process on the new candidate solutions to obtain multiple groups of best candidate solutions from them, until a preset condition is met;
The second selecting subunit is configured to select the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, the third threshold in the selected best candidate solution serving as the first threshold and the fourth threshold in the selected best candidate solution serving as the second threshold.
Preferably, the first selecting subunit selecting multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
Obtaining the sum of the fitness of all candidate solutions, obtaining the relative fitness of each group of candidate solutions according to the fitness sum of all candidate solutions and the fitness of each group of candidate solutions, randomly generating a value between 0 and 1, and selecting the multiple groups of best candidate solutions according to the randomly generated value.
Preferably, the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented in binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome;
The second generating subunit comprises: a first pairing subunit, a first exchanging subunit, a first inverting subunit, a first obtaining subunit, a second pairing subunit, a second exchanging subunit, a second inverting subunit, a second obtaining subunit and a candidate solution obtaining subunit;
The first pairing subunit is configured to randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
The first exchanging subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the third threshold and to exchange, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the third thresholds;
The first inverting subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the third threshold and to invert the gene at the gene mutation position;
The first obtaining subunit is configured to obtain the third threshold corresponding to the changed chromosome after part of the genes have been exchanged between the chromosomes corresponding to the third thresholds and the gene at the gene mutation position has been inverted;
The second pairing subunit is configured to randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
The second exchanging subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the fourth threshold and to exchange, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the fourth thresholds;
The second inverting subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the fourth threshold and to invert the gene at the gene mutation position;
The second obtaining subunit is configured to obtain the fourth threshold corresponding to the changed chromosome after part of the genes have been exchanged between the chromosomes corresponding to the fourth thresholds and the gene at the gene mutation position has been inverted;
The candidate solution obtaining subunit is configured to obtain the multiple groups of new candidate solutions according to the third thresholds and the fourth thresholds corresponding to the changed chromosomes.
Compared with the prior art, the technical solution provided by the invention has the following advantages:
As the above technical solution shows, after two texts to be detected are randomly obtained, a first-type similarity and a second-type similarity between the two texts are obtained according to at least a first similarity algorithm and a second similarity algorithm, and the similarity between the two texts is then obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. In other words, the invention obtains two type similarities with at least two algorithms and derives, from the two type similarities and their respective thresholds, the similarity indicating whether the two texts are similar. Compared with the existing approach of judging whether two texts to be detected are similar with a single similarity algorithm, the similarity obtained by the invention is more accurate, which in turn improves the accuracy of detection.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the text processing method provided by an embodiment of the invention;
Fig. 2 is a flowchart of obtaining the thresholds provided by an embodiment of the invention;
Fig. 3 is a flowchart of generating new candidate solutions provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a text processing device provided by an embodiment of the invention;
Fig. 5 is another schematic structural diagram of the text processing device provided by an embodiment of the invention;
Fig. 6 is a schematic structural diagram of the obtaining unit in the text processing device provided by an embodiment of the invention.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the invention.
Refer to Fig. 1, which shows a flowchart of the text processing method provided by an embodiment of the invention, used to improve the accuracy of the similarity and, in turn, the accuracy of detection. Specifically, the text processing method shown in Fig. 1 may comprise the following steps:
101: Randomly obtain two texts to be detected, where the two texts to be detected are two texts that need to be checked for similarity. They may be, for example, two randomly obtained papers, works or patent application documents, and in embodiments of the invention they may be downloaded from websites that provide such texts.
102: According to at least a first similarity algorithm and a second similarity algorithm, calculate a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, where the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm.
That is, in embodiments of the invention, two type similarities between the two texts to be detected are calculated with at least two similarity algorithms. To improve the accuracy of the similarity between the two texts, at least one of the first similarity algorithm and the second similarity algorithm is relatively insensitive to missing words, and the value of a similarity obtained with an algorithm that is relatively insensitive to missing words varies less than the value of a similarity obtained with an algorithm that is sensitive to missing words.
Current similarity algorithms include the word-frequency cosine similarity algorithm, the TF-IDF (Term Frequency-Inverse Document Frequency) cosine similarity algorithm, the text edit distance similarity algorithm and the SimHash similarity algorithm. The inventors studied these four algorithms and found that, ordered from most to least sensitive to missing words, they are: the word-frequency cosine similarity algorithm, the TF-IDF cosine similarity algorithm, the text edit distance similarity algorithm and the SimHash similarity algorithm. In embodiments of the invention, the word-frequency cosine similarity algorithm and the TF-IDF cosine similarity algorithm may each be used as the first similarity algorithm, and the text edit distance similarity algorithm and the SimHash similarity algorithm may each be used as the second similarity algorithm.
Of course, any two of these four similarity algorithms may also be chosen. For example, the word-frequency cosine similarity algorithm may be chosen as the first similarity algorithm and the TF-IDF cosine similarity algorithm as the second similarity algorithm, or the text edit distance similarity algorithm may be chosen as the first similarity algorithm and the SimHash similarity algorithm as the second similarity algorithm.
The following takes the word-frequency cosine similarity algorithm as the first similarity algorithm and the SimHash similarity algorithm as the second similarity algorithm to illustrate how the two type similarities of two texts to be detected are calculated:
Calculation process of the word-frequency cosine similarity algorithm: the two texts to be detected are segmented with a word segmentation technique to obtain each word in the two texts, and the word frequencies of these words are computed to form an N-dimensional vector, where N is the number of words after segmentation. The vectors corresponding to the two texts to be detected, $D_1$ and $D_2$, are:
$$V_1 = \{t_{11}, t_{12}, \ldots, t_{1j}, \ldots, t_{1N}\}$$
$$V_2 = \{t_{21}, t_{22}, \ldots, t_{2j}, \ldots, t_{2N}\}$$
where $V_1$ is the vector of text $D_1$, $t_{1j}$ is the word frequency of the $j$-th word in $D_1$, $V_2$ is the vector of text $D_2$, and $t_{2j}$ is the word frequency of the $j$-th word in $D_2$.
After the vector representations of the two texts to be detected are obtained, the word-frequency cosine similarity between the two texts is obtained by calculating the cosine of the angle between the vectors:
$$Sim_{tf} = \cos(V_1, V_2) = \frac{\sum_{j=1}^{N} t_{1j}\, t_{2j}}{\sqrt{\sum_{j=1}^{N} t_{1j}^{2}}\;\sqrt{\sum_{j=1}^{N} t_{2j}^{2}}}$$
For example, suppose the two texts to be detected are: $D_1$ = "During the red alert, in accordance with the relevant work plan, the capital's public security authorities launched a high-level duty scheme to respond to haze weather and carried out all work in responding to heavy-pollution weather", and $D_2$ = "During the red alert, Beijing public security launched a high-level duty scheme and carried out all work in responding to haze weather". The two texts are segmented and each resulting word is recorded in the word sequence = {red, early warning, period, according to, related, work, plan, capital, public security, launch, high-level, duty, scheme, respond, haze, weather, do, well, heavy pollution, various, Beijing}. The frequency of each word of the sequence in each text then forms the vector of that text:
V1=[1,1,1,1,1,2,1,1,1,1,1,1,1,2,1,2,1,1,1,1,0]
V2=[1,1,1,0,0,1,0,0,1,1,1,1,1,1,1,1,1,1,0,1,1]
Taking the cosine of the angle between $V_1$ and $V_2$ then gives the word-frequency cosine similarity between the two texts to be detected, $Sim_{tf}$ = 0.8660.
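The word-frequency cosine calculation can be sketched in a few lines. This is a minimal sketch, assuming the texts have already been segmented into word lists; the function name `tf_cosine` and the toy token lists are illustrative and not taken from the patent.

```python
import math
from collections import Counter

def tf_cosine(tokens1, tokens2):
    """Cosine of the angle between the word-frequency vectors of two token lists."""
    tf1, tf2 = Counter(tokens1), Counter(tokens2)
    vocab = sorted(set(tf1) | set(tf2))       # shared word sequence (the N dimensions)
    v1 = [tf1[w] for w in vocab]              # frequency of each word in text 1
    v2 = [tf2[w] for w in vocab]              # frequency of each word in text 2
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm if norm else 0.0        # empty input is treated as similarity 0

# Toy, already-segmented inputs (hypothetical).
sim = tf_cosine(["haze", "weather", "response", "work"],
                ["haze", "weather", "work", "plan"])
```

Because every word missing from one text zeroes a coordinate of its vector, dropping or replacing a few words changes the result noticeably, which is the sensitivity discussed in the text.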
The word-frequency cosine similarity algorithm reflects the difference between two texts to be detected very intuitively in terms of lexical features. However, this algorithm is highly sensitive to missing words, so differences in the words selected when the word-frequency cosine similarity is calculated can change its value considerably. For this reason, embodiments of the invention may introduce a similarity algorithm that is less sensitive to missing words, namely the SimHash similarity algorithm mentioned above.
Correspondingly, the calculation process of the SimHash similarity algorithm is: the two texts to be detected are segmented with a word segmentation technique to obtain each word in the two texts, each word is converted into a K-bit feature word, and the K-bit feature words form a hash value HashCode. The hash values corresponding to the two texts to be detected, $D_1$ and $D_2$, are as follows:
$$HashCode_1 = hash(w_{11}, w_{12}, \ldots, w_{1j}, \ldots, w_{1p})$$
$$HashCode_2 = hash(w_{21}, w_{22}, \ldots, w_{2j}, \ldots, w_{2q})$$
where $HashCode_1$ is the hash value of text $D_1$ and $HashCode_2$ is the hash value of text $D_2$, each of length K. The value of K is preset; 64 is optimal, but the value is not limited in embodiments of the invention. $w_{1j}$ is the feature word converted from the $j$-th word in $D_1$, and $w_{2j}$ is the feature word converted from the $j$-th word in $D_2$. The SimHash similarity is then obtained from the distance between the feature words of the two texts to be detected, by a formula in which $Hamming(HashCode_1, HashCode_2)$ is the Hamming distance between the two hash values.
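A SimHash comparison of this kind can be sketched as follows. Several parts are assumptions rather than details from the text: the fingerprint is built with the commonly used per-bit voting construction, each word is hashed with MD5 truncated to 64 bits, and the similarity is normalised as 1 minus the Hamming distance divided by K. The function names are illustrative, and K is assumed to be at most 64.

```python
import hashlib

def simhash(tokens, k=64):
    """K-bit fingerprint: each word's hash votes +1/-1 on every bit position (k <= 64)."""
    weights = [0] * k
    for tok in tokens:
        # K-bit feature word for this token (first 8 bytes of its MD5 digest).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(k):
            weights[i] += 1 if (h >> i) & 1 else -1
    code = 0
    for i, w in enumerate(weights):
        if w > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def simhash_similarity(tokens1, tokens2, k=64):
    # Assumed normalisation: identical fingerprints give 1.0, fully different give 0.0.
    return 1 - hamming(simhash(tokens1, k), simhash(tokens2, k)) / k

sim = simhash_similarity(["haze", "weather", "response", "work"],
                         ["haze", "weather", "work", "plan"])
```

Since each bit is decided by a majority vote over all words, removing a single word flips only the bits where the vote was close, which is why this measure tends to react less strongly to missing words.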
103: Obtain in advance a first threshold corresponding to the first similarity algorithm and a second threshold corresponding to the second similarity algorithm. The first threshold and the second threshold are calculated from pre-labeled training samples, that is, pairs of training samples for which a user has manually marked the comparison result (similar or dissimilar). To calculate the thresholds, the similarity between each manually labeled pair of training samples is obtained with an existing similarity algorithm, and the first threshold and the second threshold are calculated from the labeled comparison results and these similarities. In the embodiment of the present invention, for example, the first threshold α = 0.65 and the second threshold β = 0.88.
104: Obtain the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. Taking the word frequency cosine similarity as the first-type similarity and the SimHash similarity as the second-type similarity as an example, the similarity between the two texts to be detected can be obtained by the following formula:
Score = f(a*f(Sim_tf - α) + b*f(Sim_hash - β))
where a is a preset first weight and b is a preset second weight, each of which may typically be set to 0.5, and α and β are, respectively, the first threshold corresponding to the word frequency cosine similarity algorithm and the second threshold corresponding to the SimHash similarity algorithm. The thresholds determine the similarity between the two texts to be detected; that similarity is used to judge whether the two texts are similar and, in turn, whether plagiarism exists between them.
105: When the similarity between the two texts to be detected is within a preset range, determine that the two texts to be detected are similar.
106: When the similarity between the two texts to be detected is not within the preset range, determine that the two texts to be detected are dissimilar.
The preset range may depend on the practical application. For example, if two texts to be detected are regarded as similar only when their similarity is high, the preset range may be a similarity of 90% to 99%. Other approaches may of course be adopted. For example, when the similarity is calculated as Score = f(a*f(Sim_tf - α) + b*f(Sim_hash - β)), where f is a function whose value is either -1 or 1, the value of Score is -1 or 1. A Score of -1 indicates that the two texts to be detected are dissimilar and that no plagiarism exists between them; a Score of 1 indicates that the two texts to be detected are very similar and that plagiarism exists between them. When Score is calculated by this formula, the preset range therefore contains only the single value 1.
Taking the above two texts to be detected as an example, when the similarity between them is calculated with the formula Score = f(a*f(Sim_tf - α) + b*f(Sim_hash - β)) and its accompanying definition of f, with a = b = 0.5, α = 0.65 and β = 0.88, the resulting similarity is Score = 1. The two texts to be detected are therefore judged to be similar, and plagiarism exists.
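The Score calculation can be sketched as follows. Two points are assumptions made for this illustration: f is taken to be a sign-type function that returns 1 for arguments greater than or equal to 0 and -1 otherwise (the text only establishes that its value is -1 or 1), and the SimHash similarity of 0.90 is a hypothetical value, since only Sim_tf = 0.8660 is given.

```python
def f(x):
    # ASSUMPTION: the boundary case f(0) = 1 is a choice made for this sketch.
    return 1 if x >= 0 else -1


def score(sim_tf, sim_hash, alpha, beta, a=0.5, b=0.5):
    """Score = f(a*f(Sim_tf - alpha) + b*f(Sim_hash - beta))."""
    return f(a * f(sim_tf - alpha) + b * f(sim_hash - beta))


# Sim_tf from the embodiment; sim_hash = 0.90 is hypothetical.
print(score(0.8660, 0.90, alpha=0.65, beta=0.88))  # 1: judged similar
print(score(0.50, 0.50, alpha=0.65, beta=0.88))    # -1: judged dissimilar
```

With equal weights, the inner sum is 0 when the two algorithms disagree, so the result in that case depends entirely on how f treats 0.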
Based on the above similarity formula, a feasible way in the embodiment of the present invention of obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold includes:
According to the first-type similarity and the first threshold, obtain the first similarity f(Sim_tf - α) between the two texts to be detected.
According to the second-type similarity and the second threshold, obtain the second similarity f(Sim_hash - β) between the two texts to be detected.
According to the first similarity, the second similarity, the preset first weight and the preset second weight, obtain the similarity Score between the two texts to be detected.
As can be seen from the above technical solution, after two texts to be detected are obtained at random, a first-type similarity and a second-type similarity between them are obtained according to at least the first similarity algorithm and the second similarity algorithm, and the similarity between the two texts is then obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. In other words, the present invention obtains two types of similarity from at least two algorithms and, from the two types of similarity and their respective thresholds, obtains a similarity indicating whether the two texts to be detected are similar. Compared with the existing approach of judging whether two texts are similar with a single similarity algorithm, the similarity obtained by the present invention indicates more accurately whether the two texts are similar, which in turn improves the accuracy of detection.
In the embodiment of the present invention, the process of obtaining in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm is shown in Fig. 2 and may include the following steps:
201: Randomly generate multiple groups of candidate solutions, each group including a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm.
202: Obtain multiple groups of best candidate solutions from the multiple groups of candidate solutions. The acquisition process is as follows: obtain the fitness function corresponding to each group of candidate solutions; use that fitness function to calculate the similarity between each pair of training samples in the training set; obtain the fitness of each group of candidate solutions from the similarities of the pairs of training samples; and choose multiple groups of best candidate solutions according to the fitness of each group. Each pair of training samples consists of two texts whose similarity has been labeled manually, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions.
In the embodiment of the present invention, the fitness function is expressed as follows:
Fit = P(a*f(Sim_tf - α) + b*f(Sim_hash - β))
where a is a preset first weight and b is a preset second weight, each of which may typically be set to 0.5, and α and β are, respectively, the third threshold corresponding to the first similarity algorithm and the fourth threshold corresponding to the second similarity algorithm. For the N pairs of training samples in the training set, the similarity that each candidate solution calculates for each pair is recorded as Fit_{i,j}, which denotes the similarity of the j-th pair of training samples calculated by the i-th candidate solution. The fitness of each candidate solution is then calculated from these values.
Optionally, choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions includes: obtaining the fitness sum of all candidate solutions; obtaining the relative fitness of each group of candidate solutions from the fitness sum and the fitness of that group; and then randomly generating a value between 0 and 1 and choosing multiple groups of best candidate solutions according to the randomly generated value.
For example, if the fitness sum is F_total and the fitness of the i-th candidate solution is F_i, the relative fitness is F_i/F_total. This value is the probability that the candidate solution is inherited by the next group of candidate solutions; the probability value of each group of candidate solutions occupies one region, and all the probability values sum to 1.
When the randomly generated value is less than or equal to the relative fitness of some candidate solution, the candidate solutions whose relative fitness is greater than or equal to the randomly generated value can be selected as best candidate solutions. For example, suppose three groups of candidate solutions are generated with relative fitness 2/3, 1/3 and 0, and the randomly generated value is 1/3; the candidate solutions with relative fitness 2/3 and 1/3 are then chosen as best candidate solutions. If the randomly generated value is 1/2, the rule of selecting candidate solutions whose relative fitness is greater than or equal to the randomly generated value yields only one best candidate solution. The embodiment of the present invention may therefore supplement the selection rule: if that rule yields exactly one best candidate solution, the candidate solution whose relative fitness is less than and closest to the randomly generated value is additionally chosen as a best candidate solution.
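The selection rule described above, including the supplementary rule for the case where only one candidate qualifies, can be sketched as follows. The function name, the optional parameter r (used to fix the random value for reproducibility) and the candidate labels are hypothetical.

```python
import random


def select_best(candidates, fitnesses, r=None):
    """Select best candidate solutions by relative fitness.

    A candidate is selected when its relative fitness F_i / F_total is
    greater than or equal to the random value r. If exactly one candidate
    qualifies, the candidate whose relative fitness is below r and closest
    to r is added as well.
    """
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness sum must be positive")
    relative = [fit / total for fit in fitnesses]
    if r is None:
        r = random.random()  # random value between 0 and 1
    chosen = [i for i, p in enumerate(relative) if p >= r]
    if len(chosen) == 1:
        below = [i for i, p in enumerate(relative) if p < r]
        if below:
            chosen.append(max(below, key=lambda i: relative[i]))
    # If r exceeds every relative fitness the result is empty; the text
    # does not describe that case, so the caller would have to redraw r.
    return [candidates[i] for i in chosen]


# The three-candidate example: relative fitness 2/3, 1/3 and 0.
print(select_best(["A", "B", "C"], [2, 1, 0], r=1 / 3))  # ['A', 'B']
print(select_best(["A", "B", "C"], [2, 1, 0], r=1 / 2))  # ['A', 'B']
```

In both calls the same two candidates are returned, the second time through the supplementary rule.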
203: Perform crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and perform the acquisition process on the multiple groups of new candidate solutions to obtain multiple groups of best candidate solutions from them, until a preset condition is met.
The preset condition may take either of two feasible forms. One is that the difference between the fitness of the previous candidate solution and the fitness of the current candidate solution corresponding to it is within a preset difference; the other is that new candidate solutions have been generated a manually set number of times. In the embodiment of the present invention, either preset condition may be chosen, and convergence is deemed to be reached once it is satisfied.
204: Choose the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions; the third threshold in the chosen best candidate solution is used as the first threshold, and the fourth threshold in the chosen best candidate solution is used as the second threshold.
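The two preset conditions of step 203 and the final choice in step 204 can be sketched as small helper functions. All names are hypothetical, and a candidate solution is represented here as a (third threshold, fourth threshold) pair purely for illustration.

```python
def fitness_converged(previous_fitness, current_fitness, preset_difference):
    """First preset condition: the fitness changed by no more than a preset difference."""
    return abs(current_fitness - previous_fitness) <= preset_difference


def generation_limit_reached(generation, max_generations):
    """Second preset condition: new candidate solutions were generated a set number of times."""
    return generation >= max_generations


def pick_thresholds(best_candidates, fitnesses):
    """Step 204: take the best candidate with the highest fitness.

    Returns (first threshold, second threshold), that is, the third and
    fourth thresholds of the chosen candidate solution.
    """
    best_index = max(range(len(best_candidates)), key=lambda i: fitnesses[i])
    third_threshold, fourth_threshold = best_candidates[best_index]
    return third_threshold, fourth_threshold


# Hypothetical final population of best candidate solutions.
print(pick_thresholds([(0.60, 0.80), (0.65, 0.88)], [0.70, 0.90]))  # (0.65, 0.88)
```

Only one of the two condition functions would be used in a given run, since the embodiment chooses a single preset condition.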
In the embodiment of the present invention, the third threshold and the fourth threshold take values between 0 and 1 and are represented by binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome. For example, if the binary string corresponding to the third threshold is 0110, it is regarded as one chromosome and each binary character is regarded as one gene on the chromosome. Correspondingly, the process of performing crossover and mutation on the third threshold and the fourth threshold to obtain multiple groups of new candidate solutions is shown in Fig. 3 and may include the following steps:
301: Randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions. That is, the chromosomes corresponding to multiple third thresholds can be obtained from the multiple groups of best candidate solutions, and these chromosomes are randomly paired two by two.
302: According to the length of the chromosome corresponding to the third threshold, randomly set a crossover point position and, according to the crossover point position, exchange part of the genes between the randomly paired chromosomes corresponding to the third thresholds.
For example, if the randomly paired chromosomes corresponding to the third thresholds are 0010 and 0100 and the randomly set crossover point position is the 2nd gene, then with the 2nd gene as the boundary all genes after the 2nd gene are exchanged, and the chromosomes after the exchange are 0000 and 0110.
303: Randomly set a gene mutation position in the chromosome corresponding to the third threshold and perform an inversion operation on the gene at the gene mutation position. The inversion operation is: if the gene at the gene mutation position is 0, it is changed to 1; if the gene at the gene mutation position is 1, it is changed to 0.
304: After part of the genes between the chromosomes corresponding to the third thresholds have been exchanged and the gene at the gene mutation position has been inverted, obtain the third threshold corresponding to the changed chromosome.
305: Randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions.
306: According to the length of the chromosome corresponding to the fourth threshold, randomly set a crossover point position and, according to the crossover point position, exchange part of the genes between the randomly paired chromosomes corresponding to the fourth thresholds.
307: Randomly set a gene mutation position in the chromosome corresponding to the fourth threshold and perform the inversion operation on the gene at the gene mutation position.
308: After part of the genes between the chromosomes corresponding to the fourth thresholds have been exchanged and the gene at the gene mutation position has been inverted, obtain the fourth threshold corresponding to the changed chromosome.
309: Obtain multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
It should be noted here that, besides performing crossover and mutation on the third threshold and the fourth threshold in the order described above, the embodiment of the present invention may also perform crossover and mutation on the third threshold and the fourth threshold simultaneously, or perform crossover and mutation on one threshold first and then on the other; and when crossover and mutation are performed on a given threshold, the crossover and the mutation may be carried out at the same time.
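The crossover and mutation steps for one threshold (steps 301 to 304, or equally 305 to 308) can be sketched as follows. Several details are assumptions made for this illustration: the decoding of a binary string to a value between 0 and 1, mutating exactly one gene per chromosome, and mutating an unpaired chromosome without crossover when the number of chromosomes is odd. All function names are hypothetical.

```python
import random


def decode(chromosome):
    """Map a binary string to a threshold in [0, 1].

    ASSUMPTION: value = integer / (2**length - 1). The text only states
    that the thresholds lie between 0 and 1 and are binary encoded.
    """
    return int(chromosome, 2) / (2 ** len(chromosome) - 1)


def crossover(c1, c2, point):
    """Exchange all genes after the gene at 1-based position `point`."""
    return c1[:point] + c2[point:], c2[:point] + c1[point:]


def mutate(chromosome, position):
    """Invert the gene at 0-based `position`: 0 becomes 1, 1 becomes 0."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


def cross_and_mutate(chromosomes, rng=random):
    """Random pairing, single-point crossover, then one inversion per chromosome."""
    shuffled = chromosomes[:]
    rng.shuffle(shuffled)                    # random pairing, two by two
    result = []
    for i in range(0, len(shuffled) - 1, 2):
        a, b = shuffled[i], shuffled[i + 1]
        point = rng.randint(1, len(a) - 1)   # random crossover point
        a, b = crossover(a, b, point)
        result.append(mutate(a, rng.randrange(len(a))))
        result.append(mutate(b, rng.randrange(len(b))))
    if len(shuffled) % 2:                    # unpaired chromosome: mutation only
        last = shuffled[-1]
        result.append(mutate(last, rng.randrange(len(last))))
    return result


print(crossover("0010", "0100", 2))  # ('0000', '0110'), the example given for step 302
print(mutate("0110", 0))             # '1110'
```

Running cross_and_mutate once on the third-threshold chromosomes and once on the fourth-threshold chromosomes, then decoding and recombining the results, would correspond to step 309.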
For brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
Referring to Fig. 4, a structure of a text processing apparatus provided by an embodiment of the present invention is shown. The apparatus is used to reduce the sensitivity of the similarity to missing words and thereby improve the accuracy of the similarity. Specifically, the text processing apparatus shown in Fig. 4 may include: a text acquiring unit 11, a first computing unit 12, a second computing unit 13 and a determining unit 14.
The text acquiring unit 11 is configured to obtain two texts to be detected at random. The two texts to be detected are two texts that need to be checked for textual similarity; they may be two randomly obtained papers, works, patent application documents or the like, and in the embodiment of the present invention they may be downloaded from websites that provide such texts.
The first computing unit 12 is configured to calculate, according to at least the first similarity algorithm and the second similarity algorithm, the first-type similarity between the two texts to be detected and the second-type similarity between the two texts to be detected, where the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm.
That is, in the embodiment of the present invention, the first computing unit 12 calculates two types of similarity between the two texts to be detected using at least two similarity algorithms. To improve the accuracy of the similarity between the two texts to be detected, at least one of the first similarity algorithm and the second similarity algorithm is less sensitive to missing words, and the value of the similarity obtained by the less sensitive algorithm varies less than the value of the similarity obtained by the algorithm that is sensitive to missing words.
Existing similarity algorithms include the word frequency cosine similarity algorithm, the TF-IDF cosine similarity algorithm, the text edit distance similarity algorithm and the SimHash similarity algorithm. The inventors studied these four similarity algorithms and found that, ordered from most to least sensitive to missing words, they are: the word frequency cosine similarity algorithm, the TF-IDF cosine similarity algorithm, the text edit distance similarity algorithm and the SimHash similarity algorithm. In the embodiment of the present invention, the word frequency cosine similarity algorithm and the TF-IDF cosine similarity algorithm may each be regarded as a first similarity algorithm, and the text edit distance similarity algorithm and the SimHash similarity algorithm may each be regarded as a second similarity algorithm.
Of course, any two of these four similarity algorithms may be chosen. For example, the word frequency cosine similarity algorithm may be chosen as the first similarity algorithm and the TF-IDF cosine similarity algorithm as the second similarity algorithm, or the text edit distance similarity algorithm may be chosen as the first similarity algorithm and the SimHash similarity algorithm as the second similarity algorithm.
The second computing unit 13 is configured to obtain the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold, where the first threshold is a threshold obtained in advance that corresponds to the first similarity algorithm and the second threshold is a threshold obtained in advance that corresponds to the second similarity algorithm. For the process of obtaining the first threshold and the second threshold, refer to the related description in the method embodiments, which is not repeated here.
Optionally, the second computing unit 13 is configured to obtain a first similarity between the two texts to be detected according to the first-type similarity and the first threshold, obtain a second similarity between the two texts to be detected according to the second-type similarity and the second threshold, and obtain the similarity between the two texts to be detected according to the first similarity, the second similarity, the preset first weight and the preset second weight.
The determining unit 14 is configured to determine that the two texts to be detected are similar when the similarity between them is within the preset range, and to determine that the two texts to be detected are dissimilar when the similarity between them is not within the preset range.
The preset range may depend on the practical application. For example, if two texts to be detected are regarded as similar only when their similarity is high, the preset range may be a similarity of 90% to 99%. Other approaches may of course be adopted. For example, when the similarity is calculated as Score = f(a*f(Sim_tf - α) + b*f(Sim_hash - β)), where f is a function whose value is either -1 or 1, the value of Score is -1 or 1. A Score of -1 indicates that the two texts to be detected are dissimilar and that no plagiarism exists between them; a Score of 1 indicates that the two texts to be detected are very similar and that plagiarism exists between them. When Score is calculated by this formula, the preset range therefore contains only the single value 1.
Taking the above two texts to be detected as an example, when the similarity between them is calculated with the formula Score = f(a*f(Sim_tf - α) + b*f(Sim_hash - β)) and its accompanying definition of f, with a = b = 0.5, α = 0.65 and β = 0.88, the resulting similarity is Score = 1. The two texts to be detected are therefore judged to be similar, and plagiarism exists.
As can be seen from the above technical solution, after two texts to be detected are obtained at random, a first-type similarity and a second-type similarity between them are obtained according to at least the first similarity algorithm and the second similarity algorithm, and the similarity between the two texts is then obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. In other words, the present invention obtains two types of similarity from at least two algorithms and, from the two types of similarity and their respective thresholds, obtains a similarity indicating whether the two texts to be detected are similar. Compared with the existing approach of judging whether two texts are similar with a single similarity algorithm, the similarity obtained by the present invention indicates more accurately whether the two texts are similar, which in turn improves the accuracy of detection.
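The division of work among the units of Fig. 4 can be sketched as a small class. This is only an illustration under assumptions: the class and method names are hypothetical, the text acquiring unit 11 is omitted (the two texts are passed in directly), the two similarity algorithms are supplied as callables, f is assumed to return 1 for arguments greater than or equal to 0 and -1 otherwise, and the preset range is assumed to contain only the value 1.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Similarity = Callable[[str, str], float]


def f(x: float) -> int:
    # ASSUMPTION: f(0) = 1; the text only establishes that f yields -1 or 1.
    return 1 if x >= 0 else -1


@dataclass
class TextProcessingApparatus:
    first_algorithm: Similarity    # e.g. word frequency cosine similarity
    second_algorithm: Similarity   # e.g. SimHash similarity
    first_threshold: float
    second_threshold: float
    a: float = 0.5                 # preset first weight
    b: float = 0.5                 # preset second weight

    def compute_type_similarities(self, text1: str, text2: str) -> Tuple[float, float]:
        """Role of the first computing unit 12."""
        return self.first_algorithm(text1, text2), self.second_algorithm(text1, text2)

    def compute_score(self, sim1: float, sim2: float) -> int:
        """Role of the second computing unit 13."""
        first = f(sim1 - self.first_threshold)
        second = f(sim2 - self.second_threshold)
        return f(self.a * first + self.b * second)

    def is_similar(self, text1: str, text2: str) -> bool:
        """Role of the determining unit 14, with the preset range {1}."""
        sim1, sim2 = self.compute_type_similarities(text1, text2)
        return self.compute_score(sim1, sim2) == 1


# Stand-in algorithms returning fixed values; 0.90 is a hypothetical SimHash similarity.
apparatus = TextProcessingApparatus(lambda s, t: 0.8660, lambda s, t: 0.90, 0.65, 0.88)
print(apparatus.is_similar("text one", "text two"))  # True
```

Passing the algorithms in as callables mirrors the statement that any two of the four listed similarity algorithms may be chosen.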
Referring to Fig. 5, another structure of the text processing apparatus provided by an embodiment of the present invention is shown. On the basis of Fig. 4, it may further include an obtaining unit 15, configured to obtain in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm.
In the embodiment of the present invention, the structure of the obtaining unit 15 is shown in Fig. 6 and may include: a first generating subunit 151, a first choosing subunit 152, a second generating subunit 153 and a second choosing subunit 154.
The first generating subunit 151 is configured to randomly generate multiple groups of candidate solutions, each group including a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm.
The first choosing subunit 152 is configured to obtain multiple groups of best candidate solutions from the multiple groups of candidate solutions. The acquisition process is as follows: obtain the fitness function corresponding to each group of candidate solutions; use that fitness function to calculate the similarity between each pair of training samples in the training set; obtain the fitness of each group of candidate solutions from the similarities of the pairs of training samples; and choose multiple groups of best candidate solutions according to the fitness of each group. Each pair of training samples consists of two texts whose similarity has been labeled manually, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions.
Optionally, the first choosing subunit 152 chooses multiple groups of best candidate solutions according to the fitness of each group of candidate solutions by: obtaining the fitness sum of all candidate solutions; obtaining the relative fitness of each group of candidate solutions from the fitness sum and the fitness of that group; randomly generating a value between 0 and 1; and choosing multiple groups of best candidate solutions according to the randomly generated value. For the other processes, refer to the related description in the method embodiments, which is not repeated here.
The second generating subunit 153 is configured to perform crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and to perform the acquisition process on the multiple groups of new candidate solutions to obtain multiple groups of best candidate solutions from them, until a preset condition is met.
The preset condition may take either of two feasible forms. One is that the difference between the fitness of the previous candidate solution and the fitness of the current candidate solution corresponding to it is within a preset difference; the other is that new candidate solutions have been generated a manually set number of times. In the embodiment of the present invention, either preset condition may be chosen, and convergence is deemed to be reached once it is satisfied.
The second choosing subunit 154 is configured to choose the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions; the third threshold in the chosen best candidate solution is used as the first threshold, and the fourth threshold in the chosen best candidate solution is used as the second threshold.
In the embodiment of the present invention, the third threshold and the fourth threshold take values between 0 and 1 and are represented by binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome. For example, if the binary string corresponding to the third threshold is 0110, it is regarded as one chromosome and each binary character is regarded as one gene on the chromosome. Correspondingly, the second generating subunit may include: a first pairing subunit, a first exchanging subunit, a first inverting subunit, a first obtaining subunit, a second pairing subunit, a second exchanging subunit, a second inverting subunit, a second obtaining subunit and a candidate solution obtaining subunit.
The first pairing subunit is configured to randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions.
The first exchanging subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the third threshold and, according to the crossover point position, exchange part of the genes between the randomly paired chromosomes corresponding to the third thresholds.
The first inverting subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the third threshold and perform the inversion operation on the gene at the gene mutation position.
The first obtaining subunit is configured to obtain the third threshold corresponding to the changed chromosome after part of the genes between the chromosomes corresponding to the third thresholds have been exchanged and the gene at the gene mutation position has been inverted.
The second pairing subunit is configured to randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions.
The second exchanging subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the fourth threshold and, according to the crossover point position, exchange part of the genes between the randomly paired chromosomes corresponding to the fourth thresholds.
The second inverting subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the fourth threshold and perform the inversion operation on the gene at the gene mutation position.
The second obtaining subunit is configured to obtain the fourth threshold corresponding to the changed chromosome after part of the genes between the chromosomes corresponding to the fourth thresholds have been exchanged and the gene at the gene mutation position has been inverted.
The candidate solution obtaining subunit is configured to obtain multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A text processing method, characterized in that the method comprises:
Randomly obtaining two texts to be detected;
Calculating, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
Obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold, and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm, and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
When the similarity between the two texts to be detected is within a preset range, determining that the two texts to be detected are similar;
When the similarity between the two texts to be detected is not within the preset range, determining that the two texts to be detected are not similar.
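As a rough illustration of the calculation step in claim 1, the sketch below computes two type similarities for one pair of texts. The claim does not name the two similarity algorithms; character-set Jaccard overlap and the ratio from Python's `difflib.SequenceMatcher` are assumptions chosen only because they run without extra dependencies, and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """First-type similarity (assumed): Jaccard overlap of the character sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sequence_similarity(a: str, b: str) -> float:
    """Second-type similarity (assumed): difflib's matching-block ratio."""
    return SequenceMatcher(None, a, b).ratio()


def type_similarities(a: str, b: str) -> Tuple[float, float]:
    """Return (first-type similarity, second-type similarity) for two texts."""
    return jaccard_similarity(a, b), sequence_similarity(a, b)
```

Both values fall between 0 and 1, which lets each one be compared against a threshold from the same interval.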
2. The method according to claim 1, characterized in that obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold, and the second threshold comprises:
Obtaining a first similarity between the two texts to be detected according to the first-type similarity and the first threshold;
Obtaining a second similarity between the two texts to be detected according to the second-type similarity and the second threshold;
Obtaining the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight, and a preset second weight.
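One possible reading of claim 2, together with the range check of claim 1, is sketched below. The claims do not say how a type similarity and its threshold yield the first or second similarity, so the binarization rule here (1.0 when the threshold is reached, otherwise 0.0) is an assumption, as are the equal default weights, the preset range [0.5, 1.0], and every function name.

```python
def thresholded(type_similarity: float, threshold: float) -> float:
    """Assumed rule: 1.0 when the type similarity reaches its threshold, else 0.0."""
    return 1.0 if type_similarity >= threshold else 0.0


def combined_similarity(first_type: float, second_type: float,
                        first_threshold: float, second_threshold: float,
                        first_weight: float = 0.5, second_weight: float = 0.5) -> float:
    """Weighted sum of the first and second similarities."""
    first = thresholded(first_type, first_threshold)
    second = thresholded(second_type, second_threshold)
    return first_weight * first + second_weight * second


def is_similar(similarity: float, low: float = 0.5, high: float = 1.0) -> bool:
    """The two texts count as similar when the similarity lies in the preset range."""
    return low <= similarity <= high
```

With these defaults, a pair is judged similar as soon as either algorithm clears its own threshold; other weights or ranges would change that behavior.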
3. The method according to claim 1, characterized in that the method further comprises: obtaining in advance the first threshold corresponding to the first similarity algorithm and obtaining in advance the second threshold corresponding to the second similarity algorithm;
Obtaining in advance the first threshold corresponding to the first similarity algorithm and obtaining in advance the second threshold corresponding to the second similarity algorithm comprises:
Randomly generating multiple groups of candidate solutions, each group of candidate solutions including one third threshold corresponding to the first similarity algorithm and one fourth threshold corresponding to the second similarity algorithm;
Obtaining multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the acquisition process for the multiple groups of best candidate solutions is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, through the fitness function corresponding to each group of candidate solutions, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions according to the similarity between each pair of training samples; and choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples includes two texts whose similarity has been manually labeled, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
Performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and performing the acquisition process on the multiple groups of new candidate solutions to obtain multiple groups of best candidate solutions from the multiple groups of new candidate solutions, until a preset condition is met;
Choosing the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, wherein the third threshold in the chosen best candidate solution is used as the first threshold and the fourth threshold in the chosen best candidate solution is used as the second threshold.
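The overall loop of claim 3 can be sketched as a small genetic search over threshold pairs. Several parts are assumptions made so the example runs on its own: training pairs are given as precomputed type similarities plus a manual label, fitness is the share of pairs classified correctly under an equal-weight rule, selection is plain truncation, crossover averages two parents' thresholds, mutation is a small random shift, the preset condition is a fixed number of generations, and the best candidate seen so far is remembered. None of these details or names come from the claim.

```python
import random
from typing import List, Tuple

Candidate = Tuple[float, float]     # (third threshold, fourth threshold)
Sample = Tuple[float, float, bool]  # (first-type sim, second-type sim, manual label)


def fitness(candidate: Candidate, training_set: List[Sample]) -> float:
    """Assumed fitness: share of training pairs whose predicted label matches the manual label."""
    third, fourth = candidate
    correct = 0
    for first_type, second_type, labeled_similar in training_set:
        # Assumed decision rule: equal weights, similar when the weighted score is >= 0.5.
        score = 0.5 * (first_type >= third) + 0.5 * (second_type >= fourth)
        correct += int((score >= 0.5) == labeled_similar)
    return correct / len(training_set)


def clip(value: float) -> float:
    """Keep a threshold inside [0, 1]."""
    return min(1.0, max(0.0, value))


def evolve(training_set: List[Sample], population_size: int = 20, keep: int = 10,
           generations: int = 30, mutation_rate: float = 0.1, seed: int = 0) -> Candidate:
    rng = random.Random(seed)
    rate = lambda c: fitness(c, training_set)
    # Randomly generate the initial groups of candidate solutions.
    population = [(rng.random(), rng.random()) for _ in range(population_size)]
    best = max(population, key=rate)
    for _ in range(generations):  # assumed preset condition: fixed generation count
        # Keep the fittest candidates as parents (truncation selection, an assumption).
        parents = sorted(population, key=rate, reverse=True)[:keep]
        population = []
        while len(population) < population_size:
            a, b = rng.sample(parents, 2)
            # Crossover is applied to each threshold separately (real-valued stand-in).
            third, fourth = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
            if rng.random() < mutation_rate:
                third = clip(third + rng.uniform(-0.1, 0.1))
            if rng.random() < mutation_rate:
                fourth = clip(fourth + rng.uniform(-0.1, 0.1))
            population.append((third, fourth))
        best = max(population + [best], key=rate)
    return best  # (first threshold, second threshold)
```

The returned pair plays the role of the first and second thresholds; a real setup would compute the training similarities with the same two algorithms used at detection time.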
4. The method according to claim 3, characterized in that choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
Obtaining the sum of the fitness of all candidate solutions;
Obtaining the relative fitness of each group of candidate solutions according to the sum of the fitness of all candidate solutions and the fitness of each group of candidate solutions;
Randomly generating a value between 0 and 1, and choosing multiple groups of best candidate solutions according to the randomly generated value.
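The selection in claim 4 resembles fitness-proportionate ("roulette wheel") selection, and the sketch below follows that reading. How the random value maps to a chosen candidate is not spelled out in the claim, so the cumulative-share lookup is an assumption, as are the function names and the fallback to equal shares when every fitness is zero.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def relative_fitness(fitnesses: Sequence[float]) -> List[float]:
    """Divide each fitness by the sum of all fitness values."""
    total = sum(fitnesses)
    if total == 0:
        return [1.0 / len(fitnesses)] * len(fitnesses)  # assumed fallback: equal shares
    return [f / total for f in fitnesses]


def roulette_select(candidates: Sequence[T], fitnesses: Sequence[float],
                    count: int, rng: Optional[random.Random] = None) -> List[T]:
    """Pick `count` candidates, each with probability equal to its relative fitness."""
    rng = rng or random.Random()
    cumulative = list(accumulate(relative_fitness(fitnesses)))
    chosen = []
    for _ in range(count):
        value = rng.random()  # random value in [0, 1)
        index = bisect_right(cumulative, value)
        index = min(index, len(candidates) - 1)  # guard against rounding in the last sum
        chosen.append(candidates[index])
    return chosen
```

A candidate whose relative fitness is 0.75 occupies three quarters of the interval from 0 to 1, so it is drawn about three times out of four.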
5. The method according to claim 3, characterized in that the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented by binary coding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as one chromosome;
Performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions comprises:
Randomly pairing the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
Randomly setting a crossover point position according to the length of the chromosome corresponding to the third threshold, and, according to the crossover point position, exchanging partial genes between the randomly paired chromosomes corresponding to the third thresholds;
Randomly setting a gene mutation position in the chromosome corresponding to the third threshold, and inverting the gene at the gene mutation position;
After exchanging the partial genes between the chromosomes corresponding to the third thresholds and inverting the gene at the gene mutation position, obtaining the third threshold corresponding to the changed chromosome;
Randomly pairing the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
Randomly setting a crossover point position according to the length of the chromosome corresponding to the fourth threshold, and, according to the crossover point position, exchanging partial genes between the randomly paired chromosomes corresponding to the fourth thresholds;
Randomly setting a gene mutation position in the chromosome corresponding to the fourth threshold, and inverting the gene at the gene mutation position;
After exchanging the partial genes between the chromosomes corresponding to the fourth thresholds and inverting the gene at the gene mutation position, obtaining the fourth threshold corresponding to the changed chromosome;
Obtaining multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
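The binary-coded crossover and mutation of claim 5 can be sketched as follows. The chromosome length of 10 bits, the linear mapping between a threshold and its bit string, pairing by shuffling and taking neighbors, and mutating exactly one bit per chromosome are assumptions; the claim fixes only the general steps, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

BITS = 10  # assumed chromosome length


def encode(threshold: float, bits: int = BITS) -> str:
    """Map a threshold in [0, 1] to a fixed-length binary string (the chromosome)."""
    level = round(threshold * (2 ** bits - 1))
    return format(level, f"0{bits}b")


def decode(chromosome: str) -> float:
    """Map a binary string back to a threshold in [0, 1]."""
    return int(chromosome, 2) / (2 ** len(chromosome) - 1)


def crossover(a: str, b: str, rng: random.Random) -> Tuple[str, str]:
    """Single-point crossover: swap the tails after a random cut inside the string."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome: str, rng: random.Random) -> str:
    """Invert the gene (bit) at one randomly chosen position."""
    position = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


def vary_thresholds(thresholds: List[float], rng: random.Random) -> List[float]:
    """Pair, cross, and mutate the chromosomes of one kind of threshold."""
    chromosomes = [encode(t) for t in thresholds]
    rng.shuffle(chromosomes)  # random pairing: neighbors after shuffling form a pair
    for i in range(0, len(chromosomes) - 1, 2):
        chromosomes[i], chromosomes[i + 1] = crossover(chromosomes[i], chromosomes[i + 1], rng)
    return [decode(mutate(c, rng)) for c in chromosomes]
```

Running `vary_thresholds` once on the third thresholds and once on the fourth thresholds, then zipping the two results, would give the new groups of candidate solutions.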
6. A text processing device, characterized in that the device comprises:
A text obtaining unit, configured to randomly obtain two texts to be detected;
A first calculating unit, configured to calculate, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
A second calculating unit, configured to obtain the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold, and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm, and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
A determining unit, configured to determine that the two texts to be detected are similar when the similarity between the two texts to be detected is within a preset range, and to determine that the two texts to be detected are not similar when the similarity between the two texts to be detected is not within the preset range.
7. The device according to claim 6, characterized in that the second calculating unit is configured to obtain a first similarity between the two texts to be detected according to the first-type similarity and the first threshold, to obtain a second similarity between the two texts to be detected according to the second-type similarity and the second threshold, and to obtain the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight, and a preset second weight.
8. The device according to claim 6, characterized in that the device further comprises: an obtaining unit, configured to obtain in advance the first threshold corresponding to the first similarity algorithm and to obtain in advance the second threshold corresponding to the second similarity algorithm;
The obtaining unit includes: a first generating subunit, a first choosing subunit, a second generating subunit, and a second choosing subunit;
The first generating subunit is configured to randomly generate multiple groups of candidate solutions, each group of candidate solutions including one third threshold corresponding to the first similarity algorithm and one fourth threshold corresponding to the second similarity algorithm;
The first choosing subunit is configured to obtain multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the acquisition process for the multiple groups of best candidate solutions is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, through the fitness function corresponding to each group of candidate solutions, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions according to the similarity between each pair of training samples; and choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples includes two texts whose similarity has been manually labeled, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
The second generating subunit is configured to perform crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and to perform the acquisition process on the multiple groups of new candidate solutions to obtain multiple groups of best candidate solutions from the multiple groups of new candidate solutions, until a preset condition is met;
The second choosing subunit is configured to choose the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, wherein the third threshold in the chosen best candidate solution is used as the first threshold and the fourth threshold in the chosen best candidate solution is used as the second threshold.
9. The device according to claim 8, characterized in that the first choosing subunit choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
Obtaining the sum of the fitness of all candidate solutions; obtaining the relative fitness of each group of candidate solutions according to the sum of the fitness of all candidate solutions and the fitness of each group of candidate solutions; randomly generating a value between 0 and 1; and choosing multiple groups of best candidate solutions according to the randomly generated value.
10. The device according to claim 8, characterized in that the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented by binary coding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as one chromosome;
The second generating subunit includes: a first pairing subunit, a first exchange subunit, a first inversion subunit, a first obtaining subunit, a second pairing subunit, a second exchange subunit, a second inversion subunit, a second obtaining subunit, and a candidate solution obtaining subunit;
The first pairing subunit is configured to randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
The first exchange subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the third threshold, and, according to the crossover point position, to exchange partial genes between the randomly paired chromosomes corresponding to the third thresholds;
The first inversion subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the third threshold, and to invert the gene at the gene mutation position;
The first obtaining subunit is configured to obtain the third threshold corresponding to the changed chromosome after the partial genes between the chromosomes corresponding to the third thresholds have been exchanged and the gene at the gene mutation position has been inverted;
The second pairing subunit is configured to randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
The second exchange subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the fourth threshold, and, according to the crossover point position, to exchange partial genes between the randomly paired chromosomes corresponding to the fourth thresholds;
The second inversion subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the fourth threshold, and to invert the gene at the gene mutation position;
The second obtaining subunit is configured to obtain the fourth threshold corresponding to the changed chromosome after the partial genes between the chromosomes corresponding to the fourth thresholds have been exchanged and the gene at the gene mutation position has been inverted;
The candidate solution obtaining subunit is configured to obtain multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
CN201611220192.2A 2016-12-26 2016-12-26 Text processing method and device Active CN106649273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611220192.2A CN106649273B (en) 2016-12-26 2016-12-26 Text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611220192.2A CN106649273B (en) 2016-12-26 2016-12-26 Text processing method and device

Publications (2)

Publication Number Publication Date
CN106649273A (en) 2017-05-10
CN106649273B (en) 2020-03-17

Family

ID=58828374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611220192.2A Active CN106649273B (en) 2016-12-26 2016-12-26 Text processing method and device

Country Status (1)

Country Link
CN (1) CN106649273B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729323A (en) * 2017-11-29 2018-02-23 深圳中泓在线股份有限公司 Web documents similarity detection method and device, server and storage medium
CN108021553A (en) * 2017-09-30 2018-05-11 北京颐圣智能科技有限公司 Word treatment method, device and the computer equipment of disease term
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A kind of alignment matrix similarity retrieval method based on multistage fingerprint
CN109165291A (en) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A kind of text matching technique and electronic equipment
CN109508379A (en) * 2018-12-21 2019-03-22 上海文军信息技术有限公司 A kind of short text clustering method indicating and combine similarity based on weighted words vector
CN110362987A (en) * 2019-06-29 2019-10-22 南京理工大学 A kind of lightweight assessment algorithm of Cipher Strength
CN112733140A (en) * 2020-12-28 2021-04-30 上海观安信息技术股份有限公司 Detection method and system for model tilt attack

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646080A (en) * 2013-12-12 2014-03-19 北京京东尚科信息技术有限公司 Microblog duplication-eliminating method and system based on reverse-order index
CN104657472A (en) * 2015-02-13 2015-05-27 南京邮电大学 EA (Evolutionary Algorithm)-based English text clustering method
CN105512249A (en) * 2015-12-01 2016-04-20 福建工程学院 Noumenon coupling method based on compact evolution algorithm
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
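Li Chunmei: "Research and Application of Web News Classification Based on TF-IDF", Journal of Guizhou Normal University (Natural Sciences) *
Pan Weiwei: "Research on Genetic Algorithms in XML Document Clustering", China Master's Theses Full-text Database, Information Science and Technology *
Zhai Lili: "Mutation-Weighted Fuzzy C-Means Clustering Algorithm Based on Breadth-First Search", Statistics and Decision *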

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021553A (en) * 2017-09-30 2018-05-11 北京颐圣智能科技有限公司 Word treatment method, device and the computer equipment of disease term
CN107729323A (en) * 2017-11-29 2018-02-23 深圳中泓在线股份有限公司 Web documents similarity detection method and device, server and storage medium
CN108573045A (en) * 2018-04-18 2018-09-25 同方知网数字出版技术股份有限公司 A kind of alignment matrix similarity retrieval method based on multistage fingerprint
CN108573045B (en) * 2018-04-18 2021-12-24 同方知网数字出版技术股份有限公司 Comparison matrix similarity retrieval method based on multi-order fingerprints
CN109165291A (en) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A kind of text matching technique and electronic equipment
CN109165291B (en) * 2018-06-29 2021-07-09 厦门快商通信息技术有限公司 Text matching method and electronic equipment
CN109508379A (en) * 2018-12-21 2019-03-22 上海文军信息技术有限公司 A kind of short text clustering method indicating and combine similarity based on weighted words vector
CN110362987A (en) * 2019-06-29 2019-10-22 南京理工大学 A kind of lightweight assessment algorithm of Cipher Strength
CN112733140A (en) * 2020-12-28 2021-04-30 上海观安信息技术股份有限公司 Detection method and system for model tilt attack
CN112733140B (en) * 2020-12-28 2023-12-22 上海观安信息技术股份有限公司 Detection method and system for model inclination attack

Also Published As

Publication number Publication date
CN106649273B (en) 2020-03-17

Similar Documents

Publication Publication Date Title
CN106649273A (en) Text processing method and text processing device
JP6846469B2 (en) Method and device for determining the effectiveness of points of interest based on Internet text mining
CN110727766B (en) Sensitive word detection method
CN106611052B (en) The determination method and device of text label
CN111708888B (en) Classification method, device, terminal and storage medium based on artificial intelligence
CN110188346B (en) Intelligent research and judgment method for network security law case based on information extraction
CN104598611B (en) The method and system being ranked up to search entry
CN106855853A (en) Entity relation extraction system based on deep neural network
CN110738039B (en) Case auxiliary information prompting method and device, storage medium and server
CN110135157A (en) Malware homology analysis method, system, electronic equipment and storage medium
CN112463976B (en) Knowledge graph construction method taking crowd sensing task as center
CN105550227B (en) Named entity identification method and device
US8612371B1 (en) Computing device and method using associative pattern memory using recognition codes for input patterns
CN109241527B (en) Automatic generation method of false comment data set of Chinese commodity
CN110516210B (en) Text similarity calculation method and device
CN110830489B (en) Method and system for detecting counterattack type fraud website based on content abstract representation
CN110826056A (en) Recommendation system attack detection method based on attention convolution self-encoder
US20160350264A1 (en) Server and method for extracting content for commodity
CN110688455A (en) Method, medium and computer equipment for filtering invalid comments based on artificial intelligence
CN108710911A (en) It is a kind of based on semi-supervised application market brush list application detection method
CN106202007B (en) A kind of appraisal procedure of MATLAB program files similarity
CN114090880A (en) Method and device for commodity recommendation, electronic equipment and storage medium
Zhu et al. Human activity recognition based on similarity
CN110147798A (en) A kind of semantic similarity learning method can be used for network information detection
TWI778442B (en) Device and method for detecting purpose of article

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant