CN106649273A - Text processing method and text processing device - Google Patents
- Publication number
- CN106649273A CN201611220192.2A
- Authority
- CN
- China
- Prior art keywords
- similarity
- threshold value
- candidate solution
- multigroup
- threshold
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a text processing method and a text processing device. The method comprises the following steps: after two texts to be detected are obtained at random, a first-type similarity and a second-type similarity between the two texts are obtained according to at least a first similarity algorithm and a second similarity algorithm; the similarity between the two texts is then obtained according to the first-type similarity, the second-type similarity, a first threshold and a second threshold. In other words, two types of similarity are obtained from at least two algorithms, and a similarity indicating whether the two texts to be detected are similar is obtained from those two types of similarity and their corresponding thresholds. Compared with judging whether two texts are similar through a single similarity algorithm, this improves the accuracy of the similarity that indicates whether the two texts are similar, and thereby the accuracy of detection.
Description
Technical field
The invention belongs to the technical field of text information processing and, more particularly, relates to a text processing method and device.
Background art
As computers are applied ever more widely to natural language processing tasks such as text information processing, an effective and accurate method is needed to calculate the text similarity between a text to be detected and an already detected text. Methods for computing text similarity, particularly for short texts, play an increasingly important role in research on and applications of computer text. For example, in text retrieval, short-text similarity can improve the recall and the precision of a search engine; in text mining, short-text similarity serves as a measure for discovering latent knowledge in a text database; in web-based image retrieval, the descriptive short text surrounding an image can be used to improve accuracy. Here, an already detected text is a text on which text similarity detection has already been performed.
At present, text similarity can be computed as follows: the two texts to be detected are segmented by a word segmentation technique to obtain the words in each text; the words are mapped into a VSM (Vector Space Model), which vectorizes the text fragments; a vector similarity algorithm then gives the fragment similarity between the two texts, from which the similarity between the two texts is obtained. However, a similarity obtained through such vectorization is highly sensitive to missing words, which lowers its accuracy. High sensitivity to missing words means that, when the similarity is calculated, a difference in the words involved can change the value of the similarity considerably.
Summary of the invention
In view of this, an object of the invention is to provide a text processing method and device for improving the accuracy of the similarity and, in turn, the accuracy of detection. Specifically, the technical solution is as follows:
The invention provides a text processing method, the method comprising:
obtaining two texts to be detected at random;
calculating, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
when the similarity between the two texts to be detected is within a preset range, determining that the two texts to be detected are similar;
when the similarity between the two texts to be detected is not within the preset range, determining that the two texts to be detected are not similar.
Preferably, obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold comprises:
obtaining a first similarity between the two texts to be detected according to the first-type similarity and the first threshold;
obtaining a second similarity between the two texts to be detected according to the second-type similarity and the second threshold;
obtaining the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight and a preset second weight.
Preferably, the method further comprises: obtaining in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm;
obtaining in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm comprises:
randomly generating multiple groups of candidate solutions, each group of candidate solutions comprising a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm;
obtaining multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the process of obtaining the multiple groups of best candidate solutions is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, through that fitness function, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions according to the similarity between each pair of training samples; and choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples comprises two texts whose similarity has been manually labelled, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and performing the obtaining process on the multiple groups of new candidate solutions, so as to obtain multiple groups of best candidate solutions from them, until a preset condition is met;
choosing the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, the third threshold in the chosen best candidate solution serving as the first threshold and the fourth threshold in the chosen best candidate solution serving as the second threshold.
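The fitness evaluation described above can be illustrated with a minimal sketch. The text does not give the exact fitness function, so the version below makes several assumptions: fitness is taken to be the fraction of manually labelled training pairs that a candidate classifies correctly, the two type similarities of each pair are precomputed, and the names `step`, `predict` and `fitness`, the equal weights, and the form of the step function are all hypothetical.

```python
from typing import List, Tuple

# One labelled training pair: (first-type similarity, second-type similarity,
# manual label), where the label is 1 for "similar" and -1 for "not similar".
Sample = Tuple[float, float, int]


def step(x: float) -> int:
    """Assumed two-valued function: 1 for a positive argument, -1 otherwise."""
    return 1 if x > 0 else -1


def predict(sim1: float, sim2: float, t3: float, t4: float,
            a: float = 0.5, b: float = 0.5) -> int:
    """Combine both type similarities with the candidate thresholds t3 and t4."""
    return step(a * step(sim1 - t3) + b * step(sim2 - t4))


def fitness(candidate: Tuple[float, float], training_set: List[Sample]) -> float:
    """Assumed fitness: share of training pairs whose label is reproduced."""
    t3, t4 = candidate
    correct = sum(1 for s1, s2, label in training_set
                  if predict(s1, s2, t3, t4) == label)
    return correct / len(training_set)
```

A candidate whose thresholds separate the labelled similar pairs from the dissimilar ones receives a fitness close to 1 under this assumption, which is what lets the later selection step prefer it.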
Preferably, choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
obtaining the sum of the fitness of all candidate solutions;
obtaining the relative fitness of each group of candidate solutions according to the sum of the fitness of all candidate solutions and the fitness of that group;
randomly generating a number between 0 and 1, and choosing multiple groups of best candidate solutions according to the randomly generated number.
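One common way to turn a relative fitness and a random number between 0 and 1 into a choice is roulette-wheel selection. The sketch below assumes that this is the intended mapping, which the text does not state explicitly; the function names and the handling of an all-zero fitness sum are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Candidate = Tuple[float, float]  # (third threshold, fourth threshold)


def relative_fitness(fitnesses: Sequence[float]) -> List[float]:
    """Divide each fitness by the fitness sum of all candidate solutions."""
    total = sum(fitnesses)
    if total == 0:
        # Assumption: fall back to equal shares when every fitness is zero.
        return [1.0 / len(fitnesses)] * len(fitnesses)
    return [f / total for f in fitnesses]


def roulette_select(candidates: Sequence[Candidate],
                    fitnesses: Sequence[float],
                    k: int,
                    rng: random.Random) -> List[Candidate]:
    """Pick k candidates, each with probability equal to its relative fitness."""
    shares = relative_fitness(fitnesses)
    chosen: List[Candidate] = []
    for _ in range(k):
        r = rng.random()  # random number in [0, 1)
        cumulative = 0.0
        for candidate, share in zip(candidates, shares):
            cumulative += share
            if r < cumulative:
                chosen.append(candidate)
                break
        else:
            # Guard against floating-point rounding in the cumulative sum.
            chosen.append(candidates[-1])
    return chosen
```

Because the draw is made with replacement, a candidate with a large relative fitness can be chosen several times, while a candidate with zero fitness is never chosen.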
Preferably, the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented by binary coding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome;
performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions comprises:
randomly pairing the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
randomly setting a crossover point position according to the length of the chromosome corresponding to the third threshold, and exchanging, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the third thresholds;
randomly setting a gene mutation position in the chromosome corresponding to the third threshold, and inverting the gene at the gene mutation position;
after part of the genes has been exchanged between the chromosomes corresponding to the third thresholds and the gene at the gene mutation position has been inverted, obtaining the third threshold corresponding to the changed chromosome;
randomly pairing the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
randomly setting a crossover point position according to the length of the chromosome corresponding to the fourth threshold, and exchanging, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the fourth thresholds;
randomly setting a gene mutation position in the chromosome corresponding to the fourth threshold, and inverting the gene at the gene mutation position;
after part of the genes has been exchanged between the chromosomes corresponding to the fourth thresholds and the gene at the gene mutation position has been inverted, obtaining the fourth threshold corresponding to the changed chromosome;
obtaining multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
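The binary coding, single-point crossover and bit-inversion mutation described above can be sketched as follows. The chromosome length of 10 bits, the fixed-point encoding, and all function names are assumptions made for illustration; the text only requires that a threshold between 0 and 1 be represented as a binary string.

```python
import random
from typing import List, Tuple

BITS = 10  # assumed chromosome length; the text does not fix it


def encode(threshold: float, bits: int = BITS) -> str:
    """Encode a threshold in [0, 1] as a fixed-length binary string."""
    level = round(threshold * (2 ** bits - 1))
    return format(level, f"0{bits}b")


def decode(chromosome: str) -> float:
    """Map a binary string back to a threshold in [0, 1]."""
    return int(chromosome, 2) / (2 ** len(chromosome) - 1)


def crossover(c1: str, c2: str, rng: random.Random) -> Tuple[str, str]:
    """Exchange the genes after a randomly set crossover point."""
    point = rng.randrange(1, len(c1))
    return c1[:point] + c2[point:], c2[:point] + c1[point:]


def mutate(chromosome: str, rng: random.Random) -> str:
    """Invert the gene at one randomly set mutation position."""
    pos = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[pos] == "0" else "0"
    return chromosome[:pos] + flipped + chromosome[pos + 1:]


def next_thresholds(thresholds: List[float], rng: random.Random) -> List[float]:
    """Randomly pair, cross over and mutate one kind of threshold."""
    chromosomes = [encode(t) for t in thresholds]
    rng.shuffle(chromosomes)  # random pairing
    children: List[str] = []
    for i in range(0, len(chromosomes) - 1, 2):
        children.extend(crossover(chromosomes[i], chromosomes[i + 1], rng))
    if len(chromosomes) % 2:
        children.append(chromosomes[-1])  # unpaired chromosome passes through
    return [decode(mutate(c, rng)) for c in children]
```

In this sketch `next_thresholds` would be called once on the third thresholds and once on the fourth thresholds of the best candidate solutions, and the two resulting lists would be combined pairwise into the new candidate solutions.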
The invention also provides a text processing device, the device comprising:
a text acquisition unit, configured to obtain two texts to be detected at random;
a first computing unit, configured to calculate, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
a second computing unit, configured to obtain the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
a determining unit, configured to determine that the two texts to be detected are similar when the similarity between them is within a preset range, and to determine that the two texts to be detected are not similar when the similarity between them is not within the preset range.
Preferably, the second computing unit is configured to obtain a first similarity between the two texts to be detected according to the first-type similarity and the first threshold, to obtain a second similarity between the two texts to be detected according to the second-type similarity and the second threshold, and to obtain the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight and a preset second weight.
Preferably, the device further comprises an obtaining unit, configured to obtain in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm;
the obtaining unit comprises a first generation subunit, a first selection subunit, a second generation subunit and a second selection subunit;
the first generation subunit is configured to randomly generate multiple groups of candidate solutions, each group of candidate solutions comprising a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm;
the first selection subunit is configured to obtain multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the process of obtaining the multiple groups of best candidate solutions is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, through that fitness function, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions according to the similarity between each pair of training samples; and choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples comprises two texts whose similarity has been manually labelled, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
the second generation subunit is configured to perform crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and to perform the obtaining process on the multiple groups of new candidate solutions, so as to obtain multiple groups of best candidate solutions from them, until a preset condition is met;
the second selection subunit is configured to choose the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, the third threshold in the chosen best candidate solution serving as the first threshold and the fourth threshold in the chosen best candidate solution serving as the second threshold.
Preferably, the first selection subunit choosing multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
obtaining the sum of the fitness of all candidate solutions; obtaining the relative fitness of each group of candidate solutions according to the sum of the fitness of all candidate solutions and the fitness of that group; randomly generating a number between 0 and 1; and choosing multiple groups of best candidate solutions according to the randomly generated number.
Preferably, the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented by binary coding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome;
the second generation subunit comprises a first pairing subunit, a first exchange subunit, a first inversion subunit, a first obtaining subunit, a second pairing subunit, a second exchange subunit, a second inversion subunit, a second obtaining subunit and a candidate solution obtaining subunit;
the first pairing subunit is configured to randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
the first exchange subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the third threshold and to exchange, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the third thresholds;
the first inversion subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the third threshold and to invert the gene at the gene mutation position;
the first obtaining subunit is configured to obtain the third threshold corresponding to the changed chromosome after part of the genes has been exchanged between the chromosomes corresponding to the third thresholds and the gene at the gene mutation position has been inverted;
the second pairing subunit is configured to randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
the second exchange subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the fourth threshold and to exchange, according to the crossover point position, part of the genes between the randomly paired chromosomes corresponding to the fourth thresholds;
the second inversion subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the fourth threshold and to invert the gene at the gene mutation position;
the second obtaining subunit is configured to obtain the fourth threshold corresponding to the changed chromosome after part of the genes has been exchanged between the chromosomes corresponding to the fourth thresholds and the gene at the gene mutation position has been inverted;
the candidate solution obtaining subunit is configured to obtain multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
Compared with the prior art, the technical solution provided by the invention has the following advantages:
As the above technical solution shows, after two texts to be detected are obtained at random, a first-type similarity and a second-type similarity between the two texts are obtained according to at least a first similarity algorithm and a second similarity algorithm, and the similarity between the two texts is then obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. That is, the invention obtains two types of similarity from at least two algorithms and, from those two types of similarity and their respective thresholds, obtains a similarity indicating whether the two texts to be detected are similar. Compared with the existing approach of judging whether two texts are similar through a single similarity algorithm, this improves the accuracy of the similarity that indicates whether the two texts are similar, and thereby the accuracy of detection.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the text processing method provided by an embodiment of the invention;
Fig. 2 is a flow chart of obtaining the thresholds provided by an embodiment of the invention;
Fig. 3 is a flow chart of generating new candidate solutions provided by an embodiment of the invention;
Fig. 4 is a structural diagram of a text processing device provided by an embodiment of the invention;
Fig. 5 is another structural diagram of the text processing device provided by an embodiment of the invention;
Fig. 6 is a structural diagram of the obtaining unit in the text processing device provided by an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, which shows a flow chart of the text processing method provided by an embodiment of the invention, the method improves the accuracy of the similarity and, in turn, the accuracy of detection. Specifically, the text processing method shown in Fig. 1 may comprise the following steps:
101: Two texts to be detected are obtained at random. The two texts to be detected are two texts that need to be checked for similarity; they may be two randomly obtained papers, works, patent application documents or the like, and in the embodiments of the invention they can be downloaded from websites that provide such texts.
102: According to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected are calculated, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm.
That is, in the embodiments of the invention, two types of similarity between the two texts to be detected are calculated with at least two similarity algorithms. To improve the accuracy of the similarity between the two texts, at least one of the first and second similarity algorithms is less sensitive to missing words; the value of a similarity obtained by such an algorithm varies less than the value of a similarity obtained by an algorithm that is sensitive to missing words.
Existing similarity algorithms include the word-frequency cosine similarity algorithm, the TF-IDF (Term Frequency-Inverse Document Frequency) cosine similarity algorithm, the text edit-distance similarity algorithm and the SimHash similarity algorithm. The inventors studied these four algorithms and found that, ordered from most to least sensitive to missing words, they are: the word-frequency cosine similarity algorithm, the TF-IDF cosine similarity algorithm, the text edit-distance similarity algorithm and the SimHash similarity algorithm. In the embodiments of the invention, either the word-frequency cosine similarity algorithm or the TF-IDF cosine similarity algorithm can serve as the first similarity algorithm, and either the text edit-distance similarity algorithm or the SimHash similarity algorithm can serve as the second similarity algorithm.
Of course, any two of these four similarity algorithms may also be chosen. For example, the word-frequency cosine similarity algorithm may be chosen as the first similarity algorithm and the TF-IDF cosine similarity algorithm as the second, or the text edit-distance similarity algorithm may be chosen as the first similarity algorithm and the SimHash similarity algorithm as the second.
Taking the word-frequency cosine similarity algorithm as the first similarity algorithm and the SimHash similarity algorithm as the second similarity algorithm, the following illustrates how the two types of similarity between two texts to be detected are calculated:
Calculation process of the word-frequency cosine similarity algorithm: the two texts to be detected are segmented by a word segmentation technique to obtain the words in the two texts, and the word frequencies of these words form an N-dimensional vector, where N is the number of words after segmentation. The vectors corresponding to the two texts to be detected, D1 and D2, are:
$$V_1 = \{t_{11}, t_{12}, \ldots, t_{1j}, \ldots, t_{1N}\}$$
$$V_2 = \{t_{21}, t_{22}, \ldots, t_{2j}, \ldots, t_{2N}\}$$
where $V_1$ is the vector of the text to be detected $D_1$, $t_{1j}$ is the word frequency of the j-th word in $D_1$, $V_2$ is the vector of the text to be detected $D_2$, and $t_{2j}$ is the word frequency of the j-th word in $D_2$.
After the vector representations of the two texts to be detected are obtained, the word-frequency cosine similarity between the two texts is obtained by calculating the cosine of the angle between the vectors:
$$Sim_{tf} = \frac{\sum_{j=1}^{N} t_{1j}\, t_{2j}}{\sqrt{\sum_{j=1}^{N} t_{1j}^{2}}\;\sqrt{\sum_{j=1}^{N} t_{2j}^{2}}}$$
For example, the two texts to be detected are: D1 = "During the red alert, in accordance with the relevant work contingency plan, the capital police launched a high-level duty scheme to respond to the haze weather and did well in every item of heavy-pollution weather response work", and D2 = "During the red alert, the Beijing police launched a high-level duty scheme and did well in every item of haze weather response work". The two texts are segmented, and each word obtained is recorded in the word sequence = {red, alert, period, according to, relevant, work, contingency plan, capital, police, launch, high-level, duty, scheme, respond, haze, weather, do, well, heavy pollution, every item, Beijing}. The word frequency of each word of the sequence in the corresponding text is then counted to form the vector of that text:
V1=[1,1,1,1,1,2,1,1,1,1,1,1,1,2,1,2,1,1,1,1,0]
V2=[1,1,1,0,0,1,0,0,1,1,1,1,1,1,1,1,1,1,0,1,1]
Then, taking the cosine of the angle between V1 and V2, the word-frequency cosine similarity between the two texts to be detected is obtained as Sim_tf = 0.8660.
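The word-frequency cosine calculation can be sketched in a few lines. The sketch assumes the texts have already been segmented into word lists (segmentation itself is outside its scope), and the function name `tf_cosine` and the convention of returning 0 for an empty text are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tf_cosine(words1: List[str], words2: List[str]) -> float:
    """Cosine of the angle between the word-frequency vectors of two texts."""
    tf1, tf2 = Counter(words1), Counter(words2)
    # Shared word sequence: every word that occurs in either text.
    vocabulary = sorted(set(tf1) | set(tf2))
    v1 = [tf1[w] for w in vocabulary]
    v2 = [tf2[w] for w in vocabulary]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an empty text
    return dot / (norm1 * norm2)
```

A word that appears in only one of the two texts adds to that text's norm without adding to the dot product, which is why this measure reacts strongly to missing words.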
The word-frequency cosine similarity algorithm reflects the difference between two texts to be detected very intuitively in terms of vocabulary features. However, the algorithm is highly sensitive to missing words, so differences in the words selected when the word-frequency cosine similarity is calculated can change its value considerably. For this reason, the embodiments of the invention introduce a similarity algorithm that is less sensitive to missing words, namely the SimHash similarity algorithm mentioned above.
Correspondingly, the calculation process of the SimHash similarity algorithm is: the two texts to be detected are segmented by a word segmentation technique to obtain the words in the two texts, each word in the two texts is converted into a K-bit feature word, and the K-bit feature words form a hash value HashCode. The hash values corresponding to the two texts to be detected, D1 and D2, are respectively:
$$HashCode_1 = hash(w_{11}, w_{12}, \ldots, w_{1j}, \ldots, w_{1p})$$
$$HashCode_2 = hash(w_{21}, w_{22}, \ldots, w_{2j}, \ldots, w_{2q})$$
where $HashCode_1$ is the hash value of the text to be detected $D_1$ and $HashCode_2$ is the hash value of the text to be detected $D_2$; both are bit strings of length K. The value of K is preset, preferably 64, and is not limited in the embodiments of the invention. $w_{1j}$ is the feature word converted from the j-th word in $D_1$, and $w_{2j}$ is the feature word converted from the j-th word in $D_2$. The SimHash similarity is then obtained from the distance between the hash values of the two texts to be detected, where $Hamming(HashCode_1, HashCode_2)$ is the Hamming distance between the two hash values.
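A minimal sketch of a SimHash-style similarity is given below. Several details are assumptions, because the text does not spell them out: each word is hashed with MD5 truncated to K bits, the K-bit text hash is formed by per-bit voting over the word hashes, and the Hamming distance is normalised as 1 - distance / K. The function names are hypothetical.

```python
import hashlib
from typing import List

K = 64  # hash length in bits; the text names 64 as the preferred value


def word_hash(word: str, k: int = K) -> int:
    """Assumed feature word: the low k bits of the word's MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << k) - 1)


def simhash(words: List[str], k: int = K) -> int:
    """Combine the word hashes into one k-bit hash by per-bit voting."""
    votes = [0] * k
    for word in words:
        h = word_hash(word, k)
        for bit in range(k):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    code = 0
    for bit in range(k):
        if votes[bit] > 0:
            code |= 1 << bit
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def simhash_similarity(words1: List[str], words2: List[str], k: int = K) -> float:
    """Assumed normalisation: 1 minus the Hamming distance divided by k."""
    return 1 - hamming(simhash(words1, k), simhash(words2, k)) / k
```

Because every bit of the text hash is decided by a vote over all words, removing one word usually changes only a few bits, which fits the observation that SimHash is less sensitive to missing words.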
103: A first threshold corresponding to the first similarity algorithm and a second threshold corresponding to the second similarity algorithm are obtained in advance. The first threshold and the second threshold are calculated from training samples labelled in advance, that is, pairs of training samples for which a user has manually marked the comparison result (samples similar or samples not similar). To calculate the first and second thresholds, an existing similarity algorithm gives the similarity between two training samples whose comparison result has been manually marked, and the first and second thresholds are calculated from the marked comparison result and the similarity between the two training samples. In the embodiments of the invention, for example, the first threshold is α = 0.65 and the second threshold is β = 0.88.
104: The similarity between the two texts to be detected is obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. Taking the word-frequency cosine similarity as the first-type similarity and the SimHash similarity as the second-type similarity, the similarity between the two texts to be detected can be obtained with the following formula:
$$Score = f\big(a \cdot f(Sim_{tf} - \alpha) + b \cdot f(Sim_{hash} - \beta)\big)$$
where a is the preset first weight and b is the preset second weight, each of which can typically be set to 0.5, and α and β are, respectively, the first threshold corresponding to the word-frequency cosine similarity algorithm and the second threshold corresponding to the SimHash similarity algorithm. The thresholds determine the similarity between the two texts to be detected; whether the two texts are similar is judged from that similarity, which in turn shows whether plagiarism exists between the two texts.
105: When the similarity between the two texts to be detected is within a preset range, the two texts to be detected are determined to be similar.
106: When the similarity between the two texts to be detected is not within the preset range, the two texts to be detected are determined to be not similar.
The preset range depends on the practical application. For example, if two texts to be detected are regarded as similar when their similarity is high, the preset range can be a similarity range of 90% to 99%. Other approaches can of course be adopted. For example, when the similarity is calculated as
$$\mathrm{Score} = f\big(a \cdot f(\mathrm{Sim}_{tf} - \alpha) + b \cdot f(\mathrm{Sim}_{hash} - \beta)\big)$$
with f a two-valued function that outputs either -1 or 1, the value of Score is -1 or 1. A Score of -1 indicates that the two texts to be detected are not similar and that no plagiarism exists between them. A Score of 1 indicates that the two texts to be detected are very similar and that plagiarism exists between them. When Score is calculated by this formula, the preset range therefore contains only the single value 1.
Taking the two texts to be detected described above as an example, the similarity between them is calculated with the formula
$$\mathrm{Score} = f\big(a \cdot f(\mathrm{Sim}_{tf} - \alpha) + b \cdot f(\mathrm{Sim}_{hash} - \beta)\big)$$
and the function f, with a = b = 0.5, α = 0.65 and β = 0.88. The resulting similarity is Score = 1, so the two texts to be detected are judged to be similar and plagiarism is found to exist.
Based on the above similarity formula, a feasible way of obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold in the embodiment of the present invention includes:
According to the first-type similarity and the first threshold, obtaining a first similarity $f(\mathrm{Sim}_{tf} - \alpha)$ between the two texts to be detected.
According to the second-type similarity and the second threshold, obtaining a second similarity $f(\mathrm{Sim}_{hash} - \beta)$ between the two texts to be detected.
According to the first similarity, the second similarity, the preset first weight and the preset second weight, obtaining the similarity Score between the two texts to be detected.
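The three steps above can be sketched in Python as follows. The sketch assumes that f returns 1 when its argument is greater than or equal to 0 and -1 otherwise; the handling of exactly 0 is an assumption, as are the function names `step`, `combined_score` and `is_similar`. The default weights and thresholds are the example values given in the text.

```python
def step(x: float) -> int:
    """Assumed form of f: 1 when x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def combined_score(sim_tf: float, sim_hash: float,
                   alpha: float = 0.65, beta: float = 0.88,
                   a: float = 0.5, b: float = 0.5) -> int:
    """Score = f(a * f(Sim_tf - alpha) + b * f(Sim_hash - beta))."""
    first_similarity = step(sim_tf - alpha)     # first-type similarity vs. first threshold
    second_similarity = step(sim_hash - beta)   # second-type similarity vs. second threshold
    return step(a * first_similarity + b * second_similarity)


def is_similar(score: int) -> bool:
    """Steps 105 and 106 with a preset range that contains only the value 1."""
    return score == 1
```

With equal weights, the two inner terms cancel when only one similarity reaches its threshold, so the verdict in that case depends entirely on how f treats 0. Under the assumption used here such a pair is reported as similar.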
As can be seen from the above technical solution, after two texts to be detected are obtained at random, the first-type similarity and the second-type similarity between them are obtained according to at least the first similarity algorithm and the second similarity algorithm, and the similarity between the two texts to be detected is then obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. In other words, the present invention obtains two types of similarity from at least two algorithms, and derives from these two types of similarity and their respective thresholds a similarity indicating whether the two texts to be detected are similar. Compared with the existing approach of judging whether two texts to be detected are similar with a single similarity algorithm, the similarity obtained by the present invention is more accurate, which in turn improves the accuracy of detection.
In the embodiment of the present invention, the process of obtaining in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm is shown in Fig. 2 and may include the following steps:
201: Multiple groups of candidate solutions are generated at random, each group of candidate solutions including a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm.
202: Multiple groups of optimal candidate solutions are obtained from the multiple groups of candidate solutions. The obtaining process is as follows: the fitness function corresponding to each group of candidate solutions is obtained; the similarity between each pair of training samples in the training set is calculated with that fitness function; the fitness of each group of candidate solutions is obtained from the similarities between the pairs of training samples; and multiple groups of optimal candidate solutions are selected according to the fitness of each group. Each pair of training samples consists of two texts whose similarity has been manually labeled, and the fitness of an optimal candidate solution is greater than the fitness of the other candidate solutions.
In the embodiment of the present invention, the fitness function is expressed as follows:
$$\mathrm{Fit} = P\big(a \cdot f(\mathrm{Sim}_{tf} - \alpha) + b \cdot f(\mathrm{Sim}_{hash} - \beta)\big)$$
where a is a preset first weight and b is a preset second weight, both of which can generally be set to 0.5, and α and β are respectively the third threshold corresponding to the first similarity algorithm and the fourth threshold corresponding to the second similarity algorithm. For the N pairs of training samples in the training set, the similarity that each candidate solution calculates for each pair of training samples is recorded as $\mathrm{Fit}_{i,j}$, which denotes the similarity of the j-th pair of training samples calculated by the i-th candidate solution. The fitness of the i-th candidate solution is then computed from the values $\mathrm{Fit}_{i,j}$ over the N pairs.
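As an illustration of how a candidate solution could be scored against the labeled training set, the following Python sketch assumes that the fitness is the fraction of training pairs whose computed result agrees with the manual label. That definition is an assumption, since the fitness formula itself is not given here; the form of f (1 for arguments greater than or equal to 0, otherwise -1), the tuple layout `(sim_tf, sim_hash, label)` and all function names are likewise illustrative.

```python
from typing import Sequence, Tuple

# (sim_tf, sim_hash, label); label is 1 for "similar" and -1 for "not similar".
LabeledPair = Tuple[float, float, int]


def step(x: float) -> int:
    """Assumed form of f: 1 when x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def predict(sim_tf: float, sim_hash: float, alpha: float, beta: float,
            a: float = 0.5, b: float = 0.5) -> int:
    """Result computed by one candidate solution (alpha, beta) for one pair."""
    return step(a * step(sim_tf - alpha) + b * step(sim_hash - beta))


def fitness(alpha: float, beta: float, training_set: Sequence[LabeledPair]) -> float:
    """Assumed fitness: share of pairs whose result matches the manual label."""
    if not training_set:
        raise ValueError("training set must not be empty")
    correct = sum(
        1 for sim_tf, sim_hash, label in training_set
        if predict(sim_tf, sim_hash, alpha, beta) == label
    )
    return correct / len(training_set)
```

Here `alpha` and `beta` play the roles of the third and fourth thresholds of one candidate solution, and the similarities of each pair are assumed to have been computed beforehand.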
Optionally, selecting multiple groups of optimal candidate solutions according to the fitness of each group of candidate solutions includes: obtaining the sum of the fitness of all candidate solutions; obtaining the relative fitness of each group of candidate solutions from that sum and the fitness of the group; and then generating at random a value between 0 and 1 and selecting multiple groups of optimal candidate solutions according to the randomly generated value.
For example, if the sum of the fitness is $F_{total}$ and the fitness of the i-th candidate solution is $F_i$, the relative fitness is $F_i / F_{total}$. This value is the probability that the candidate solution is inherited by the next group of candidate solutions. The probability value of each group of candidate solutions forms a region, and all the probability values sum to 1.
When the randomly generated value is less than or equal to the relative fitness of a candidate solution, the candidate solutions whose relative fitness is greater than or equal to the randomly generated value can be selected as optimal candidate solutions. For example, three groups of candidate solutions are generated with relative fitness 2/3, 1/3 and 0. If the randomly generated value is 1/3, the candidate solutions with relative fitness 2/3 and 1/3 are selected as optimal candidate solutions. If the randomly generated value is 1/2, this selection rule yields only one optimal candidate solution. The embodiment of the present invention therefore supplements the selection rule: if selecting the candidate solutions whose relative fitness is greater than or equal to the randomly generated value yields only one optimal candidate solution, the candidate solution whose relative fitness is less than, and closest to, the randomly generated value can additionally be selected as an optimal candidate solution.
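The selection rule and its supplement can be sketched in Python as follows. The function names and the choice to return candidate indices are illustrative assumptions. The random value is passed in as a parameter so that the worked example from the text can be reproduced.

```python
import random
from typing import List, Sequence


def relative_fitness(fitness_values: Sequence[float]) -> List[float]:
    """Divide each fitness by the sum of all fitness values."""
    total = sum(fitness_values)
    if total == 0:
        raise ValueError("sum of fitness values must be non-zero")
    return [value / total for value in fitness_values]


def select_optimal(relative: Sequence[float], r: float) -> List[int]:
    """Return the indices of the candidates selected for a random value r."""
    chosen = [i for i, p in enumerate(relative) if p >= r]
    if len(chosen) == 1:
        # Supplementary rule: also take the candidate whose relative fitness
        # is below r and closest to it.
        below = [i for i, p in enumerate(relative) if p < r]
        if below:
            chosen.append(max(below, key=lambda i: relative[i]))
    return sorted(chosen)


def select_with_random_value(fitness_values: Sequence[float]) -> List[int]:
    """Draw the random value between 0 and 1, then apply the rule."""
    return select_optimal(relative_fitness(fitness_values), random.random())
```

For the example in the text, fitness values in the ratio 2:1:0 give relative fitness 2/3, 1/3 and 0, and both r = 1/3 and r = 1/2 select the first two candidates.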
203: Crossover and mutation are performed on the third thresholds in the multiple groups of optimal candidate solutions and on the fourth thresholds in the multiple groups of optimal candidate solutions to obtain multiple groups of new candidate solutions, and the obtaining process is performed on the multiple groups of new candidate solutions so as to obtain multiple groups of optimal candidate solutions from them, until a preset condition is met.
There are two feasible preset conditions. One is that the difference between the fitness of the previous candidate solution and the fitness of the corresponding current candidate solution is within a preset difference. The other is that new candidate solutions have been generated a manually set number of times. In the embodiment of the present invention, either preset condition can be chosen, and the process is regarded as converged once the chosen preset condition is met.
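The two preset conditions can be written as a small Python check. This is a sketch only: the function name, the parameter names and the convention of selecting a condition by passing either `max_difference` or `max_generations` are assumptions made for illustration.

```python
from typing import Optional


def preset_condition_met(previous_fitness: float, current_fitness: float,
                         generation: int,
                         max_difference: Optional[float] = None,
                         max_generations: Optional[int] = None) -> bool:
    """Return True when the chosen preset condition is satisfied.

    Pass max_difference to use the fitness-difference condition, or
    max_generations to use the condition on the number of times new
    candidate solutions have been generated.
    """
    if max_difference is not None:
        return abs(current_fitness - previous_fitness) <= max_difference
    if max_generations is not None:
        return generation >= max_generations
    raise ValueError("one preset condition must be chosen")
```

A loop over steps 202 and 203 would call this check after each new group of candidate solutions and stop as soon as it returns True.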
204: The optimal candidate solution whose fitness is greater than that of the other optimal candidate solutions is selected. The third threshold in the selected optimal candidate solution is used as the first threshold, and the fourth threshold in the selected optimal candidate solution is used as the second threshold.
In the embodiment of the present invention, the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented by binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome. For example, if the binary string corresponding to the third threshold is 0110, it can be regarded as a chromosome, and each binary character is regarded as a gene on the chromosome. Correspondingly, the process of performing crossover and mutation on the third threshold and the fourth threshold to obtain multiple groups of new candidate solutions is shown in Fig. 3 and may include the following steps:
301: The chromosomes corresponding to the third thresholds in the multiple groups of optimal candidate solutions are paired at random. That is, the chromosomes corresponding to multiple third thresholds can be obtained from the multiple groups of optimal candidate solutions, and these chromosomes are randomly paired two by two.
302: According to the length of the chromosome corresponding to the third threshold, a crossover point position is set at random, and according to the crossover point position, some of the genes are exchanged between the randomly paired chromosomes corresponding to the third thresholds.
For example, if the randomly paired chromosomes corresponding to the third thresholds are 0010 and 0100 and the randomly set crossover point position is the 2nd gene, then, taking the 2nd gene as the boundary, all genes after the 2nd gene are exchanged, and the chromosomes after the exchange are 0000 and 0110.
303: A gene mutation position is set at random in the chromosome corresponding to the third threshold, and an inversion operation is performed on the gene at the gene mutation position. The inversion operation is: if the gene at the gene mutation position is 0, it is changed to 1; if the gene at the gene mutation position is 1, it is changed to 0.
304: After some of the genes have been exchanged between the chromosomes corresponding to the third thresholds and the inversion operation has been performed on the gene at the gene mutation position, the third threshold corresponding to the changed chromosome is obtained.
305: The chromosomes corresponding to the fourth thresholds in the multiple groups of optimal candidate solutions are paired at random.
306: According to the length of the chromosome corresponding to the fourth threshold, a crossover point position is set at random, and according to the crossover point position, some of the genes are exchanged between the randomly paired chromosomes corresponding to the fourth thresholds.
307: A gene mutation position is set at random in the chromosome corresponding to the fourth threshold, and an inversion operation is performed on the gene at the gene mutation position.
308: After some of the genes have been exchanged between the chromosomes corresponding to the fourth thresholds and the inversion operation has been performed on the gene at the gene mutation position, the fourth threshold corresponding to the changed chromosome is obtained.
309: Multiple groups of new candidate solutions are obtained according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
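The pairing, crossover and mutation steps for one threshold can be sketched in Python as follows. Chromosomes are held as strings of "0" and "1". The helper names, the use of 1-based gene positions and the `decode` mapping from a binary string back to a value in [0, 1] are assumptions for illustration, since the text does not specify how a threshold is encoded.

```python
import random
from typing import List, Tuple


def pair_randomly(chromosomes: List[str], rng: random.Random) -> List[Tuple[str, str]]:
    """Shuffle and pair two by two; with an odd count the last one is left out."""
    shuffled = chromosomes[:]
    rng.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


def crossover(chrom_a: str, chrom_b: str, point: int) -> Tuple[str, str]:
    """Exchange all genes after the point-th gene (1-based)."""
    if len(chrom_a) != len(chrom_b):
        raise ValueError("chromosomes must have the same length")
    if not 1 <= point <= len(chrom_a):
        raise ValueError("crossover point out of range")
    return chrom_a[:point] + chrom_b[point:], chrom_b[:point] + chrom_a[point:]


def mutate(chrom: str, position: int) -> str:
    """Invert the gene at the given 1-based position (0 -> 1, 1 -> 0)."""
    index = position - 1
    flipped = "1" if chrom[index] == "0" else "0"
    return chrom[:index] + flipped + chrom[index + 1:]


def new_chromosomes(chromosomes: List[str], rng: random.Random) -> List[str]:
    """Pairing, crossover and mutation with random positions for one threshold."""
    children = []
    for chrom_a, chrom_b in pair_randomly(chromosomes, rng):
        point = rng.randint(1, len(chrom_a))
        for child in crossover(chrom_a, chrom_b, point):
            children.append(mutate(child, rng.randint(1, len(child))))
    return children


def decode(chrom: str) -> float:
    """Assumed decoding: binary integer scaled to the interval [0, 1]."""
    return int(chrom, 2) / (2 ** len(chrom) - 1)
```

Calling `crossover("0010", "0100", 2)` reproduces the example in step 302 and returns `("0000", "0110")`. The same functions would be applied separately to the chromosomes of the fourth threshold.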
It should be noted here that, besides performing crossover and mutation on the third threshold and the fourth threshold in the order described above, the embodiment of the present invention can also perform crossover and mutation on the third threshold and the fourth threshold simultaneously, or perform crossover and mutation on one threshold first and then on the other. When crossover and mutation are performed on a given threshold, the crossover and the mutation can also be carried out at the same time.
For brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
Referring to Fig. 4, which shows a structure of a text processing device provided by an embodiment of the present invention, the device is used to reduce the sensitivity of the similarity to missing words and thereby improve the accuracy of the similarity. Specifically, the text processing device shown in Fig. 4 may include: a text acquiring unit 11, a first computing unit 12, a second computing unit 13 and a determining unit 14.
The text acquiring unit 11 is configured to obtain two texts to be detected at random. The two texts to be detected are two texts that need to be compared to determine whether they are similar. They can be two randomly obtained papers, works, patent application documents and the like, and in the embodiment of the present invention they can be downloaded from websites that provide such texts.
The first computing unit 12 is configured to calculate, according to at least the first similarity algorithm and the second similarity algorithm, the first-type similarity between the two texts to be detected and the second-type similarity between the two texts to be detected, where the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm.
That is, in the embodiment of the present invention the first computing unit 12 calculates two types of similarity between the two texts to be detected with at least two similarity algorithms. To improve the accuracy of the similarity between the two texts to be detected, at least one of the first similarity algorithm and the second similarity algorithm is relatively insensitive to missing words, and the similarity value obtained by an algorithm that is relatively insensitive to missing words changes less than the similarity value obtained by an algorithm that is sensitive to missing words.
Current similarity algorithms include the word-frequency cosine similarity algorithm, the TF-IDF cosine similarity algorithm, the text edit-distance similarity algorithm and the SimHash similarity algorithm. The inventors studied these four similarity algorithms and found that, ordered from most to least sensitive to missing words, they are: the word-frequency cosine similarity algorithm, the TF-IDF cosine similarity algorithm, the text edit-distance similarity algorithm and the SimHash similarity algorithm. In the embodiment of the present invention, the word-frequency cosine similarity algorithm and the TF-IDF cosine similarity algorithm can each be regarded as the first similarity algorithm, and the text edit-distance similarity algorithm and the SimHash similarity algorithm can each be regarded as the second similarity algorithm.
Of course, any two of these four similarity algorithms can be chosen. For example, the word-frequency cosine similarity algorithm is chosen as the first similarity algorithm and the TF-IDF cosine similarity algorithm as the second similarity algorithm, or the text edit-distance similarity algorithm is chosen as the first similarity algorithm and the SimHash similarity algorithm as the second similarity algorithm.
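As a reminder of what the first of these algorithms computes, the following Python sketch implements the standard cosine similarity over word-count vectors. It assumes the texts have already been segmented into word lists, and the function name is illustrative. It is the textbook formulation and is not claimed to match every detail of the word-frequency cosine similarity algorithm used in the embodiment.

```python
import math
from collections import Counter
from typing import Sequence


def term_frequency_cosine(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(tf_a[word] * tf_b[word] for word in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because every missing word changes the count vector directly, this measure reacts strongly to missing words, which is consistent with its position at the sensitive end of the ordering above.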
The second computing unit 13 is configured to obtain the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold and the second threshold, where the first threshold is a threshold obtained in advance that corresponds to the first similarity algorithm and the second threshold is a threshold obtained in advance that corresponds to the second similarity algorithm. For the process of obtaining the first threshold and the second threshold, refer to the related description in the method embodiment, which is not repeated here.
Optionally, the second computing unit 13 is configured to obtain a first similarity between the two texts to be detected according to the first-type similarity and the first threshold, obtain a second similarity between the two texts to be detected according to the second-type similarity and the second threshold, and obtain the similarity between the two texts to be detected according to the first similarity, the second similarity, the preset first weight and the preset second weight.
The determining unit 14 is configured to determine that the two texts to be detected are similar when the similarity between them is within the preset range, and to determine that the two texts to be detected are not similar when the similarity between them is not within the preset range.
The preset range depends on the practical application. For example, if two texts to be detected are regarded as similar when their similarity is high, the preset range can be a similarity range of 90% to 99%. Other approaches can of course be adopted. For example, when the similarity is calculated as
$$\mathrm{Score} = f\big(a \cdot f(\mathrm{Sim}_{tf} - \alpha) + b \cdot f(\mathrm{Sim}_{hash} - \beta)\big)$$
with f a two-valued function that outputs either -1 or 1, the value of Score is -1 or 1. A Score of -1 indicates that the two texts to be detected are not similar and that no plagiarism exists between them. A Score of 1 indicates that the two texts to be detected are very similar and that plagiarism exists between them. When Score is calculated by this formula, the preset range therefore contains only the single value 1.
Taking the two texts to be detected described above as an example, when the similarity between them is calculated with this formula and the function f, with a = b = 0.5, α = 0.65 and β = 0.88, the resulting similarity is Score = 1, so the two texts to be detected are judged to be similar and plagiarism is found to exist.
As can be seen from the above technical solution, after two texts to be detected are obtained at random, the first-type similarity and the second-type similarity between them are obtained according to at least the first similarity algorithm and the second similarity algorithm, and the similarity between the two texts to be detected is then obtained according to the first-type similarity, the second-type similarity, the first threshold and the second threshold. In other words, the present invention obtains two types of similarity from at least two algorithms, and derives from these two types of similarity and their respective thresholds a similarity indicating whether the two texts to be detected are similar. Compared with the existing approach of judging whether two texts to be detected are similar with a single similarity algorithm, the similarity obtained by the present invention is more accurate, which in turn improves the accuracy of detection.
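The division of work among the units of Fig. 4 can be pictured with the following Python sketch. The class layout, the method names and the representation of a similarity algorithm as a callable over two word lists are all assumptions for illustration, as is the form of f (1 for arguments greater than or equal to 0, otherwise -1). The text acquiring unit 11 is omitted, and the two texts are passed in directly.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

SimilarityAlgorithm = Callable[[Sequence[str], Sequence[str]], float]


def step(x: float) -> int:
    """Assumed form of f: 1 when x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


@dataclass
class TextProcessingDevice:
    first_algorithm: SimilarityAlgorithm
    second_algorithm: SimilarityAlgorithm
    first_threshold: float
    second_threshold: float
    first_weight: float = 0.5
    second_weight: float = 0.5

    def first_computing_unit(self, text_a: Sequence[str],
                             text_b: Sequence[str]) -> Tuple[float, float]:
        """Compute the first-type and second-type similarities."""
        return (self.first_algorithm(text_a, text_b),
                self.second_algorithm(text_a, text_b))

    def second_computing_unit(self, first_type: float, second_type: float) -> int:
        """Combine both similarities with their thresholds and weights."""
        return step(self.first_weight * step(first_type - self.first_threshold)
                    + self.second_weight * step(second_type - self.second_threshold))

    def determining_unit(self, score: int) -> bool:
        """The preset range is assumed to contain only the value 1."""
        return score == 1

    def process(self, text_a: Sequence[str], text_b: Sequence[str]) -> bool:
        first_type, second_type = self.first_computing_unit(text_a, text_b)
        return self.determining_unit(self.second_computing_unit(first_type, second_type))
```

Passing the algorithms in as callables keeps the sketch independent of which two of the four similarity algorithms are chosen.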
Referring to Fig. 5, which shows another structure of the text processing device provided by an embodiment of the present invention, the device may further include, on the basis of Fig. 4: an obtaining unit 15, configured to obtain in advance the first threshold corresponding to the first similarity algorithm and the second threshold corresponding to the second similarity algorithm.
In the embodiment of the present invention, the structure of the obtaining unit 15 is shown in Fig. 6 and may include: a first generating subunit 151, a first selecting subunit 152, a second generating subunit 153 and a second selecting subunit 154.
The first generating subunit 151 is configured to generate multiple groups of candidate solutions at random, each group of candidate solutions including a third threshold corresponding to the first similarity algorithm and a fourth threshold corresponding to the second similarity algorithm.
The first selecting subunit 152 is configured to obtain multiple groups of optimal candidate solutions from the multiple groups of candidate solutions. The obtaining process is as follows: the fitness function corresponding to each group of candidate solutions is obtained; the similarity between each pair of training samples in the training set is calculated with that fitness function; the fitness of each group of candidate solutions is obtained from the similarities between the pairs of training samples; and multiple groups of optimal candidate solutions are selected according to the fitness of each group. Each pair of training samples consists of two texts whose similarity has been manually labeled, and the fitness of an optimal candidate solution is greater than the fitness of the other candidate solutions.
Optionally, the selection of multiple groups of optimal candidate solutions by the first selecting subunit 152 according to the fitness of each group of candidate solutions includes: obtaining the sum of the fitness of all candidate solutions; obtaining the relative fitness of each group of candidate solutions from that sum and the fitness of the group; generating at random a value between 0 and 1; and selecting multiple groups of optimal candidate solutions according to the randomly generated value. For the remaining details, refer to the related description in the method embodiment, which is not repeated here.
The second generating subunit 153 is configured to perform crossover and mutation on the third thresholds in the multiple groups of optimal candidate solutions and on the fourth thresholds in the multiple groups of optimal candidate solutions to obtain multiple groups of new candidate solutions, and to perform the obtaining process on the multiple groups of new candidate solutions so as to obtain multiple groups of optimal candidate solutions from them, until a preset condition is met.
There are two feasible preset conditions. One is that the difference between the fitness of the previous candidate solution and the fitness of the corresponding current candidate solution is within a preset difference. The other is that new candidate solutions have been generated a manually set number of times. In the embodiment of the present invention, either preset condition can be chosen, and the process is regarded as converged once the chosen preset condition is met.
The second selecting subunit 154 is configured to select the optimal candidate solution whose fitness is greater than that of the other optimal candidate solutions. The third threshold in the selected optimal candidate solution is used as the first threshold, and the fourth threshold in the selected optimal candidate solution is used as the second threshold.
In the embodiment of the present invention, the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented by binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as a chromosome. For example, if the binary string corresponding to the third threshold is 0110, it can be regarded as a chromosome, and each binary character is regarded as a gene on the chromosome. Correspondingly, the second generating subunit may include: a first pairing subunit, a first exchanging subunit, a first inverting subunit, a first obtaining subunit, a second pairing subunit, a second exchanging subunit, a second inverting subunit, a second obtaining subunit and a candidate solution obtaining subunit.
The first pairing subunit is configured to randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of optimal candidate solutions.
The first exchanging subunit is configured to set a crossover point position at random according to the length of the chromosome corresponding to the third threshold and, according to the crossover point position, to exchange some of the genes between the randomly paired chromosomes corresponding to the third thresholds.
The first inverting subunit is configured to set a gene mutation position at random in the chromosome corresponding to the third threshold and to perform an inversion operation on the gene at the gene mutation position.
The first obtaining subunit is configured to obtain the third threshold corresponding to the changed chromosome after some of the genes have been exchanged between the chromosomes corresponding to the third thresholds and the inversion operation has been performed on the gene at the gene mutation position.
The second pairing subunit is configured to randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of optimal candidate solutions.
The second exchanging subunit is configured to set a crossover point position at random according to the length of the chromosome corresponding to the fourth threshold and, according to the crossover point position, to exchange some of the genes between the randomly paired chromosomes corresponding to the fourth thresholds.
The second inverting subunit is configured to set a gene mutation position at random in the chromosome corresponding to the fourth threshold and to perform an inversion operation on the gene at the gene mutation position.
The second obtaining subunit is configured to obtain the fourth threshold corresponding to the changed chromosome after some of the genes have been exchanged between the chromosomes corresponding to the fourth thresholds and the inversion operation has been performed on the gene at the gene mutation position.
The candidate solution obtaining subunit is configured to obtain multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference can be made to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly, and for the relevant parts reference can be made to the description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the generic principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A text processing method, characterized in that the method comprises:
randomly obtaining two texts to be detected;
calculating, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
obtaining a similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold, and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
when the similarity between the two texts to be detected is within a preset range, determining that the two texts to be detected are similar;
when the similarity between the two texts to be detected is not within the preset range, determining that the two texts to be detected are dissimilar.
2. The method according to claim 1, characterized in that obtaining the similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, the first threshold, and the second threshold comprises:
obtaining a first similarity between the two texts to be detected according to the first-type similarity and the first threshold;
obtaining a second similarity between the two texts to be detected according to the second-type similarity and the second threshold;
obtaining the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight, and a preset second weight.
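Claims 1 and 2 can be read as a small scoring pipeline. The Python sketch below is one possible reading, not the patented implementation. The claims do not say how a threshold turns a type similarity into a similarity, so the rule in `apply_threshold` (keep the score if it reaches the threshold, otherwise 0), the equal default weights, and the preset range [0.5, 1.0] are all assumptions, as are the function names.

```python
def apply_threshold(type_similarity: float, threshold: float) -> float:
    """Assumed rule: a score below its threshold contributes nothing."""
    return type_similarity if type_similarity >= threshold else 0.0


def combined_similarity(first_type: float, second_type: float,
                        first_threshold: float, second_threshold: float,
                        first_weight: float = 0.5,
                        second_weight: float = 0.5) -> float:
    """Weighted sum of the two thresholded scores (claim 2, under the assumed rule)."""
    first = apply_threshold(first_type, first_threshold)
    second = apply_threshold(second_type, second_threshold)
    return first_weight * first + second_weight * second


def is_similar(similarity: float, low: float = 0.5, high: float = 1.0) -> bool:
    """Claim 1 decision: similar only if the value lies in the preset range (assumed bounds)."""
    return low <= similarity <= high
```

With these assumed settings, two scores of 0.8 and 0.6 against thresholds 0.6 and 0.4 combine to 0.7 and count as similar, while 0.8 and 0.3 combine to 0.4 and do not.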
3. The method according to claim 1, characterized in that the method further comprises: obtaining in advance the first threshold corresponding to the first similarity algorithm and obtaining in advance the second threshold corresponding to the second similarity algorithm;
wherein obtaining in advance the first threshold corresponding to the first similarity algorithm and obtaining in advance the second threshold corresponding to the second similarity algorithm comprises:
randomly generating multiple groups of candidate solutions, each group of candidate solutions including one third threshold corresponding to the first similarity algorithm and one fourth threshold corresponding to the second similarity algorithm;
obtaining multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the acquisition process for the multiple groups of best candidate solutions is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, by means of the fitness function corresponding to each group of candidate solutions, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions according to the similarity between each pair of training samples; and selecting the multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples includes two texts whose similarity has been manually annotated, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and performing crossover and mutation on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and performing the acquisition process on the multiple groups of new candidate solutions to obtain multiple groups of best candidate solutions from the multiple groups of new candidate solutions, until a preset condition is met;
selecting the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, the third threshold in the selected best candidate solution serving as the first threshold and the fourth threshold in the selected best candidate solution serving as the second threshold.
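The loop in claim 3 (generate, evaluate, select, cross and mutate, repeat) might be sketched as follows. This is an illustration under several assumptions that the claim leaves open: training pairs are given as precomputed scores plus a boolean manual label, fitness is the fraction of pairs classified correctly, the preset condition is a fixed number of generations, and the fittest candidates are carried over unchanged. The crossover and mutation here act on real values for brevity, which is a simplification of the binary encoding in claim 5. All names are hypothetical.

```python
import random

# (first-type score, second-type score, manual label): an assumed data layout.
Pair = tuple[float, float, bool]
Candidate = tuple[float, float]  # (third threshold, fourth threshold)


def fitness(cand: Candidate, pairs: list[Pair]) -> float:
    """Assumed fitness: share of pairs whose prediction matches the manual label."""
    t3, t4 = cand
    hits = sum(((a >= t3) and (b >= t4)) == label for a, b, label in pairs)
    return hits / len(pairs)


def evolve(pairs: list[Pair], pop_size: int = 20, keep: int = 10,
           generations: int = 30, seed: int = 0) -> Candidate:
    rng = random.Random(seed)
    # Randomly generate the initial groups of candidate solutions.
    population = [(rng.random(), rng.random()) for _ in range(pop_size)]
    for _ in range(generations):  # assumed preset condition: generation count
        # Acquisition process: score every candidate, keep the fittest.
        ranked = sorted(population, key=lambda c: fitness(c, pairs), reverse=True)
        best = ranked[:keep]
        children = []
        while len(children) < pop_size - keep:
            p, q = rng.sample(best, 2)
            # Crossover per threshold: third with third, fourth with fourth.
            child = [p[0] if rng.random() < 0.5 else q[0],
                     p[1] if rng.random() < 0.5 else q[1]]
            # Mutation: nudge one threshold, clamped to [0, 1].
            i = rng.randrange(2)
            child[i] = min(1.0, max(0.0, child[i] + rng.uniform(-0.1, 0.1)))
            children.append((child[0], child[1]))
        population = best + children
    # The fittest remaining candidate supplies the first and second thresholds.
    return max(population, key=lambda c: fitness(c, pairs))
```

Calling `first_threshold, second_threshold = evolve(pairs)` returns the pair of thresholds that the rest of the method would then use.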
4. The method according to claim 3, characterized in that selecting the multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
obtaining the sum of the fitness of all candidate solutions;
obtaining the relative fitness of each group of candidate solutions according to the sum of the fitness of all candidate solutions and the fitness of each group of candidate solutions;
randomly generating a value between 0 and 1, and selecting the multiple groups of best candidate solutions according to the randomly generated value.
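The steps of claim 4 resemble fitness-proportionate (roulette-wheel) selection. The sketch below assumes that reading: each random value in [0, 1) picks the first candidate whose cumulative relative fitness exceeds it. The claim itself does not spell out how the random value maps to a candidate, so this mapping and the name `roulette_select` are assumptions.

```python
import random
from itertools import accumulate


def roulette_select(candidates, fitnesses, k, rng=random):
    """Pick k candidates with probability proportional to fitness (assumed reading)."""
    total = sum(fitnesses)  # sum of the fitness of all candidate solutions
    if total <= 0:
        raise ValueError("total fitness must be positive")
    relative = [f / total for f in fitnesses]  # relative fitness per candidate
    cumulative = list(accumulate(relative))    # right edges of the wheel slices
    chosen = []
    for _ in range(k):
        r = rng.random()  # randomly generated value in [0, 1)
        for cand, edge in zip(candidates, cumulative):
            if r < edge:
                chosen.append(cand)
                break
        else:
            # Floating-point rounding can leave the last edge just below 1.
            chosen.append(candidates[-1])
    return chosen
```

Because selection is with replacement, a very fit candidate can be chosen more than once, which is the usual behavior of this scheme.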
5. The method according to claim 3, characterized in that the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented in binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as one chromosome;
wherein performing crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and performing crossover and mutation on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions comprises:
randomly pairing the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
randomly setting a crossover point position according to the length of the chromosome corresponding to the third threshold, and exchanging, according to the crossover point position, some of the genes between the randomly paired chromosomes corresponding to the third thresholds;
randomly setting a gene mutation position in the chromosome corresponding to the third threshold, and performing an inversion operation on the gene at the gene mutation position;
after exchanging some of the genes between the chromosomes corresponding to the third thresholds and performing the inversion operation on the gene at the gene mutation position, obtaining the third threshold corresponding to the changed chromosome;
randomly pairing the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
randomly setting a crossover point position according to the length of the chromosome corresponding to the fourth threshold, and exchanging, according to the crossover point position, some of the genes between the randomly paired chromosomes corresponding to the fourth thresholds;
randomly setting a gene mutation position in the chromosome corresponding to the fourth threshold, and performing an inversion operation on the gene at the gene mutation position;
after exchanging some of the genes between the chromosomes corresponding to the fourth thresholds and performing the inversion operation on the gene at the gene mutation position, obtaining the fourth threshold corresponding to the changed chromosome;
obtaining the multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
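The chromosome operations in claim 5 can be illustrated with a short Python sketch. The 10-bit chromosome length, the fixed-point encoding, and every function name are assumptions made for the example; the claim only requires a binary string, a random crossover point, and inversion of the gene at a random position.

```python
import random

BITS = 10  # assumed chromosome length


def encode(threshold: float, bits: int = BITS) -> str:
    """Map a threshold in [0, 1] to a fixed-length binary string."""
    level = round(threshold * (2 ** bits - 1))
    return format(level, f"0{bits}b")


def decode(chromosome: str) -> float:
    """Map a binary string back to a threshold in [0, 1]."""
    return int(chromosome, 2) / (2 ** len(chromosome) - 1)


def crossover(a: str, b: str, rng: random.Random) -> tuple[str, str]:
    """Swap the tails of two paired chromosomes at a random crossover point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome: str, rng: random.Random) -> str:
    """Invert the gene (bit) at one randomly chosen position."""
    pos = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[pos] == "0" else "0"
    return chromosome[:pos] + flipped + chromosome[pos + 1:]


def breed(thresholds: list[float], rng: random.Random) -> list[float]:
    """Pair chromosomes at random, cross them, mutate, and decode the results."""
    chromosomes = [encode(t) for t in thresholds]
    rng.shuffle(chromosomes)  # random pairing: neighbors after the shuffle
    children = []
    for a, b in zip(chromosomes[0::2], chromosomes[1::2]):
        children.extend(crossover(a, b, rng))
    return [decode(mutate(c, rng)) for c in children]
```

Following the claim, `breed` would be called once on the third thresholds and once on the fourth thresholds of the best candidates, and the two result lists zipped into new candidate solutions. With an odd number of inputs this sketch leaves the unpaired chromosome out.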
6. A text processing device, characterized in that the device comprises:
a text acquiring unit, configured to randomly obtain two texts to be detected;
a first computing unit, configured to calculate, according to at least a first similarity algorithm and a second similarity algorithm, a first-type similarity between the two texts to be detected and a second-type similarity between the two texts to be detected, wherein the first-type similarity is calculated according to the first similarity algorithm and the second-type similarity is calculated according to the second similarity algorithm;
a second computing unit, configured to obtain a similarity between the two texts to be detected according to the first-type similarity, the second-type similarity, a first threshold, and a second threshold, wherein the first threshold is a previously obtained threshold corresponding to the first similarity algorithm and the second threshold is a previously obtained threshold corresponding to the second similarity algorithm;
a determining unit, configured to determine that the two texts to be detected are similar when the similarity between the two texts to be detected is within a preset range, and to determine that the two texts to be detected are dissimilar when the similarity between the two texts to be detected is not within the preset range.
7. The device according to claim 6, characterized in that the second computing unit is configured to obtain a first similarity between the two texts to be detected according to the first-type similarity and the first threshold, to obtain a second similarity between the two texts to be detected according to the second-type similarity and the second threshold, and to obtain the similarity between the two texts to be detected according to the first similarity, the second similarity, a preset first weight, and a preset second weight.
8. The device according to claim 6, characterized in that the device further comprises: an obtaining unit, configured to obtain in advance the first threshold corresponding to the first similarity algorithm and to obtain in advance the second threshold corresponding to the second similarity algorithm;
the obtaining unit comprises: a first generation subunit, a first selection subunit, a second generation subunit, and a second selection subunit;
the first generation subunit is configured to randomly generate multiple groups of candidate solutions, each group of candidate solutions including one third threshold corresponding to the first similarity algorithm and one fourth threshold corresponding to the second similarity algorithm;
the first selection subunit is configured to obtain multiple groups of best candidate solutions from the multiple groups of candidate solutions, wherein the acquisition process for the multiple groups of best candidate solutions is: obtaining the fitness function corresponding to each group of candidate solutions; calculating, by means of the fitness function corresponding to each group of candidate solutions, the similarity between each pair of training samples in a training set; obtaining the fitness of each group of candidate solutions according to the similarity between each pair of training samples; and selecting the multiple groups of best candidate solutions according to the fitness of each group of candidate solutions, wherein each pair of training samples includes two texts whose similarity has been manually annotated, and the fitness of a best candidate solution is greater than the fitness of the other candidate solutions;
the second generation subunit is configured to perform crossover and mutation on the third thresholds in the multiple groups of best candidate solutions and to perform crossover and mutation on the fourth thresholds in the multiple groups of best candidate solutions to obtain multiple groups of new candidate solutions, and to perform the acquisition process on the multiple groups of new candidate solutions to obtain multiple groups of best candidate solutions from the multiple groups of new candidate solutions, until a preset condition is met;
the second selection subunit is configured to select the best candidate solution whose fitness is greater than the fitness of the other best candidate solutions, the third threshold in the selected best candidate solution serving as the first threshold and the fourth threshold in the selected best candidate solution serving as the second threshold.
9. The device according to claim 8, characterized in that the first selection subunit selecting the multiple groups of best candidate solutions according to the fitness of each group of candidate solutions comprises:
obtaining the sum of the fitness of all candidate solutions; obtaining the relative fitness of each group of candidate solutions according to the sum of the fitness of all candidate solutions and the fitness of each group of candidate solutions; randomly generating a value between 0 and 1; and selecting the multiple groups of best candidate solutions according to the randomly generated value.
10. The device according to claim 8, characterized in that the values of the third threshold and the fourth threshold lie between 0 and 1 and are represented in binary encoding, so that the binary string corresponding to the third threshold and the binary string corresponding to the fourth threshold each serve as one chromosome;
the second generation subunit comprises: a first pairing subunit, a first exchange subunit, a first inversion subunit, a first acquisition subunit, a second pairing subunit, a second exchange subunit, a second inversion subunit, a second acquisition subunit, and a candidate solution acquisition subunit;
the first pairing subunit is configured to randomly pair the chromosomes corresponding to the third thresholds in the multiple groups of best candidate solutions;
the first exchange subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the third threshold, and to exchange, according to the crossover point position, some of the genes between the randomly paired chromosomes corresponding to the third thresholds;
the first inversion subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the third threshold, and to perform an inversion operation on the gene at the gene mutation position;
the first acquisition subunit is configured to obtain the third threshold corresponding to the changed chromosome after some of the genes between the chromosomes corresponding to the third thresholds have been exchanged and the inversion operation has been performed on the gene at the gene mutation position;
the second pairing subunit is configured to randomly pair the chromosomes corresponding to the fourth thresholds in the multiple groups of best candidate solutions;
the second exchange subunit is configured to randomly set a crossover point position according to the length of the chromosome corresponding to the fourth threshold, and to exchange, according to the crossover point position, some of the genes between the randomly paired chromosomes corresponding to the fourth thresholds;
the second inversion subunit is configured to randomly set a gene mutation position in the chromosome corresponding to the fourth threshold, and to perform an inversion operation on the gene at the gene mutation position;
the second acquisition subunit is configured to obtain the fourth threshold corresponding to the changed chromosome after some of the genes between the chromosomes corresponding to the fourth thresholds have been exchanged and the inversion operation has been performed on the gene at the gene mutation position;
the candidate solution acquisition subunit is configured to obtain the multiple groups of new candidate solutions according to the third thresholds corresponding to the changed chromosomes and the fourth thresholds corresponding to the changed chromosomes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611220192.2A CN106649273B (en) | 2016-12-26 | 2016-12-26 | Text processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611220192.2A CN106649273B (en) | 2016-12-26 | 2016-12-26 | Text processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649273A (en) | 2017-05-10 |
CN106649273B (en) | 2020-03-17 |
Family
ID=58828374
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611220192.2A Active CN106649273B (en) | 2016-12-26 | 2016-12-26 | Text processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649273B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107729323A (en) * | 2017-11-29 | 2018-02-23 | 深圳中泓在线股份有限公司 | Web documents similarity detection method and device, server and storage medium |
CN108021553A (en) * | 2017-09-30 | 2018-05-11 | 北京颐圣智能科技有限公司 | Word treatment method, device and the computer equipment of disease term |
CN108573045A (en) * | 2018-04-18 | 2018-09-25 | 同方知网数字出版技术股份有限公司 | A kind of alignment matrix similarity retrieval method based on multistage fingerprint |
CN109165291A (en) * | 2018-06-29 | 2019-01-08 | 厦门快商通信息技术有限公司 | A kind of text matching technique and electronic equipment |
CN109508379A (en) * | 2018-12-21 | 2019-03-22 | 上海文军信息技术有限公司 | A kind of short text clustering method indicating and combine similarity based on weighted words vector |
CN110362987A (en) * | 2019-06-29 | 2019-10-22 | 南京理工大学 | A kind of lightweight assessment algorithm of Cipher Strength |
CN112733140A (en) * | 2020-12-28 | 2021-04-30 | 上海观安信息技术股份有限公司 | Detection method and system for model tilt attack |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103646080A (en) * | 2013-12-12 | 2014-03-19 | 北京京东尚科信息技术有限公司 | Microblog duplication-eliminating method and system based on reverse-order index |
CN104657472A (en) * | 2015-02-13 | 2015-05-27 | 南京邮电大学 | EA (Evolutionary Algorithm)-based English text clustering method |
CN105512249A (en) * | 2015-12-01 | 2016-04-20 | 福建工程学院 | Noumenon coupling method based on compact evolution algorithm |
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
Non-Patent Citations (3)
Title |
---|
李春梅: "基于TF-IDF的网页新闻分类的研究与应用", 《贵州师范大学学报(自然科学版)》 * |
潘炜炜: "遗传算法在XML文档聚类中的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
翟丽丽: "基于广度优先搜索的变异加权模糊C-均值聚类算法", 《统计与决策》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108021553A (en) * | 2017-09-30 | 2018-05-11 | 北京颐圣智能科技有限公司 | Word treatment method, device and the computer equipment of disease term |
CN107729323A (en) * | 2017-11-29 | 2018-02-23 | 深圳中泓在线股份有限公司 | Web documents similarity detection method and device, server and storage medium |
CN108573045A (en) * | 2018-04-18 | 2018-09-25 | 同方知网数字出版技术股份有限公司 | A kind of alignment matrix similarity retrieval method based on multistage fingerprint |
CN108573045B (en) * | 2018-04-18 | 2021-12-24 | 同方知网数字出版技术股份有限公司 | Comparison matrix similarity retrieval method based on multi-order fingerprints |
CN109165291A (en) * | 2018-06-29 | 2019-01-08 | 厦门快商通信息技术有限公司 | A kind of text matching technique and electronic equipment |
CN109165291B (en) * | 2018-06-29 | 2021-07-09 | 厦门快商通信息技术有限公司 | Text matching method and electronic equipment |
CN109508379A (en) * | 2018-12-21 | 2019-03-22 | 上海文军信息技术有限公司 | A kind of short text clustering method indicating and combine similarity based on weighted words vector |
CN110362987A (en) * | 2019-06-29 | 2019-10-22 | 南京理工大学 | A kind of lightweight assessment algorithm of Cipher Strength |
CN112733140A (en) * | 2020-12-28 | 2021-04-30 | 上海观安信息技术股份有限公司 | Detection method and system for model tilt attack |
CN112733140B (en) * | 2020-12-28 | 2023-12-22 | 上海观安信息技术股份有限公司 | Detection method and system for model inclination attack |
Also Published As
Publication number | Publication date |
---|---|
CN106649273B (en) | 2020-03-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649273A (en) | Text processing method and text processing device | |
JP6846469B2 (en) | Method and device for determining the effectiveness of points of interest based on Internet text mining | |
CN110727766B (en) | Sensitive word detection method | |
CN106611052B (en) | The determination method and device of text label | |
CN111708888B (en) | Classification method, device, terminal and storage medium based on artificial intelligence | |
CN110188346B (en) | Intelligent research and judgment method for network security law case based on information extraction | |
CN104598611B (en) | The method and system being ranked up to search entry | |
CN106855853A (en) | Entity relation extraction system based on deep neural network | |
CN110738039B (en) | Case auxiliary information prompting method and device, storage medium and server | |
CN110135157A (en) | Malware homology analysis method, system, electronic equipment and storage medium | |
CN112463976B (en) | Knowledge graph construction method taking crowd sensing task as center | |
CN105550227B (en) | Named entity identification method and device | |
US8612371B1 (en) | Computing device and method using associative pattern memory using recognition codes for input patterns | |
CN109241527B (en) | Automatic generation method of false comment data set of Chinese commodity | |
CN110516210B (en) | Text similarity calculation method and device | |
CN110830489B (en) | Method and system for detecting counterattack type fraud website based on content abstract representation | |
CN110826056A (en) | Recommendation system attack detection method based on attention convolution self-encoder | |
US20160350264A1 (en) | Server and method for extracting content for commodity | |
CN110688455A (en) | Method, medium and computer equipment for filtering invalid comments based on artificial intelligence | |
CN108710911A (en) | It is a kind of based on semi-supervised application market brush list application detection method | |
CN106202007B (en) | A kind of appraisal procedure of MATLAB program files similarity | |
CN114090880A (en) | Method and device for commodity recommendation, electronic equipment and storage medium | |
Zhu et al. | Human activity recognition based on similarity | |
CN110147798A (en) | A kind of semantic similarity learning method can be used for network information detection | |
TWI778442B (en) | Device and method for detecting purpose of article |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||