CN109857856A - Method and system for determining the retrieval ranking of texts - Google Patents
- Publication number: CN109857856A (other listed identifiers: CN201910082601.4A, CN201910082601A)
- Authority: CN (China)
- Prior art keywords: text, magnitude, association, collection, degree
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method and system for determining the retrieval ranking of texts. The method comprises the following steps: obtaining a target text to be retrieved and a candidate text set; obtaining a degree-of-association value between the target text and each text in the candidate text set; ranking the texts in the candidate text set by the degree-of-association values according to a first preset rule, and constructing a first text set according to a first preset screening condition; and ranking the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text. The embodiments provided by the invention combine the strengths of multiple algorithms, improve the precision of patent search results, and improve the user's retrieval efficiency.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a method and system for determining the retrieval ranking of texts.
Background art
In the prior art, when documents (for example journal articles or patents) are retrieved, ranking the candidate documents with different kinds of similarity calculation methods (for example structural analysis, semantic analysis, or keyword analysis) yields different ranking results. Even a single kind of similarity calculation method can produce different results: with semantic analysis, for example, the similarity computed between a pair of patents in their original language differs from the similarity computed between their translations. For one and the same target patent, different solutions therefore rank the candidate patents by similarity in many different ways, each with its own ranking rule, and the resulting rankings can differ considerably. The most relevant patent that the user actually needs may be ranked within the top 10 by one solution and after position 1000 by another. In this case the user cannot tell which search result is best, and browsing the results of every ranking mode one by one greatly reduces retrieval efficiency.
Summary of the invention
Therefore, the present invention provides a method and system for selecting and ranking retrieved documents, which overcome the deficiency in the prior art that the optimal search result cannot be obtained because different document retrieval methods rank the results differently.
In a first aspect, an embodiment of the present invention provides a method for determining the retrieval ranking of texts, comprising the following steps: obtaining a target text to be retrieved and a candidate text set; obtaining a degree-of-association value between the target text and each text in the candidate text set; ranking the texts in the candidate text set by the degree-of-association values according to a first preset rule, and constructing a first text set according to a first preset screening condition; and ranking the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text.
In one embodiment, the step of ranking the texts in the first text set according to the second preset rule to obtain the retrieval ranking result for the target text comprises: ranking the texts in the first text set according to a third preset rule, excluding noise texts according to a second preset screening condition, and constructing a second text set; and ranking the texts in the second text set according to the second preset rule to obtain the retrieval ranking result for the target text.
In one embodiment, the step of obtaining the degree-of-association value between the target text and each text in the candidate text set comprises: calculating the degree-of-association value between the target text and each text in the candidate text set with each of N preset association measurement algorithms, where N is a positive integer greater than or equal to 2.
In one embodiment, the step of ranking the texts in the candidate text set by the degree-of-association values according to the first preset rule and constructing the first text set according to the first preset screening condition comprises: ranking the texts in the candidate text set by the degree-of-association values obtained with each of the N preset association measurement algorithms, to obtain N ranked sets; and performing a comprehensive ranking of the N ranked sets according to the first preset rule and constructing the first text set according to the first preset screening condition. Preferably, the step of constructing the first text set according to the first preset screening condition comprises: analyzing, according to a preset strategy, the degree-of-association values between the target text and each text in the candidate text set calculated by the N preset association measurement algorithms, to obtain an analysis result; judging from the analysis result whether each text in the candidate text set meets a preset condition; and selecting the texts in the candidate text set that meet the preset condition into the first text set.
In one embodiment, the step of performing a comprehensive ranking of the N ranked sets according to the first preset rule and constructing the first text set according to the first preset screening condition comprises: assigning weights, according to the first preset rule, to the degree-of-association values obtained with the N preset measurement algorithms; multiplying each degree-of-association value by its corresponding weight and adding the products to obtain a comprehensive degree-of-association value; determining the comprehensive ranking result by the size of the comprehensive degree-of-association values; and selecting the texts whose comprehensive degree-of-association value is greater than a first preset comprehensive threshold into the first text set.
In one embodiment, the step of ranking the texts in the candidate text set by the degree-of-association values according to the first preset rule and constructing the first text set according to the first preset screening condition comprises: sorting the degree-of-association values obtained with each of the N preset measurement algorithms by size, to obtain N ranked sets; and, for each text in the N ranked sets, selecting into the first text set the texts whose degree-of-association value is greater than a first degree-of-association threshold and/or whose rank position is smaller than a first rank-position threshold.
In one embodiment, the step of ranking the texts in the first text set according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text set comprises: for the texts in the first text set, assigning weights according to the third preset rule to the degree-of-association values obtained with the N preset measurement algorithms; multiplying each degree-of-association value by its corresponding weight and adding the products to obtain a comprehensive degree-of-association value; determining the comprehensive ranking result by the size of the comprehensive degree-of-association values; treating the texts whose comprehensive degree-of-association value is less than a second preset comprehensive threshold as noise texts; and removing the noise texts from the first text set to construct the second text set.
In one embodiment, the step of ranking the texts in the first text set according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text set comprises: obtaining, with the N preset association measurement algorithms, second degree-of-association values between the texts in the first text set and the target text; sorting the second degree-of-association values obtained with each algorithm by size, to obtain N ranked sets; for each text in the N ranked sets, treating as noise texts the texts whose degree-of-association value is less than a second degree-of-association threshold and/or whose rank position is greater than a second rank-position threshold; and removing the noise texts from the first text set to construct the second text set.
In one embodiment, for the texts in the first text set, the second preset rule is set according to the size of the degree-of-association value with respect to the target text, or according to the rank position of that value, and the retrieval ranking result for the target text is obtained accordingly.
Preferably, the N association measurement algorithms are used to obtain the degree-of-association values between preset samples and each text in the candidate text set; the recall rate of the preset samples' degree-of-association values on preset intervals is obtained; corresponding weights are set for the N association measurement algorithms according to the recall rate on the preset intervals; a comprehensive ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the comprehensive ranking values.
Preferably, N rank orders of the degree-of-association values for the target text are obtained with the N association measurement algorithms; a comprehensive ranking value is obtained for each text in the candidate text set from the N rank orders; and the retrieval ranking result for the target text is obtained from the comprehensive ranking values.
Preferably, the N association measurement algorithms are used to obtain the degree-of-association values between preset samples and each text in the candidate text set, and to obtain the rank position, by degree-of-association value, that the most relevant text corresponding to each preset sample takes in the candidate text set; corresponding weights are set for the N association measurement algorithms according to the average recall rate of the rank positions of the preset samples or the recall rate on preset intervals; a comprehensive ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the comprehensive ranking values.
In one embodiment, for the texts in the second text set, the second preset rule is set according to the size of the degree-of-association value with respect to the target text, or according to the rank position of that value; the texts are ranked accordingly and the retrieval ranking result for the target text is obtained.
Preferably, the N association measurement algorithms are used to obtain the degree-of-association values between preset samples and each text in the candidate text set; the recall rate of the preset samples' degree-of-association values on preset intervals is obtained; corresponding weights are set for the N association measurement algorithms according to the recall rate on the preset intervals; a comprehensive ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the comprehensive ranking values.
Preferably, N rank orders of the degree-of-association values for the target text are obtained with the N association measurement algorithms; a comprehensive ranking value is obtained for each text in the candidate text set from the N rank orders; and the retrieval ranking result for the target text is obtained from the comprehensive ranking values.
Preferably, the N association measurement algorithms are used to obtain the degree-of-association values between preset samples and each text in the candidate text set, and to obtain the rank position, by degree-of-association value, that the most relevant text corresponding to each preset sample takes in the candidate text set; corresponding weights are set for the N association measurement algorithms according to the average recall rate of the rank positions of the preset samples or the recall rate on preset intervals; a comprehensive ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the comprehensive ranking values.
In one embodiment, the step of obtaining the degree-of-association value between the target text and each text in the candidate text set comprises: using one preset association measurement algorithm or N preset association measurement algorithms and, based on a variant text corresponding to the target text, obtaining the degree-of-association value between the target text and each text in the candidate text set or the variant text corresponding to each text in the candidate text set.
In a second aspect, an embodiment of the present invention provides a system for determining the retrieval ranking of texts, comprising: a target text and candidate text set obtaining module, configured to obtain a target text to be retrieved and a candidate text set; a degree-of-association value obtaining module, configured to obtain the degree-of-association value between the target text and each text in the candidate text set; a first text set construction module, configured to rank the texts in the candidate text set by the degree-of-association values according to a first preset rule and to construct a first text set according to a first preset screening condition; and a retrieval ranking result obtaining module, configured to rank the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text.
In a third aspect, an embodiment of the present invention provides a computer device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the method for determining the retrieval ranking of texts provided by the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method for determining the retrieval ranking of texts provided by the first aspect of the present invention.
The technical solution of the present invention has the following advantages:
The method and system for determining the retrieval ranking of texts provided by the present invention first obtain a target text to be retrieved, which can be a patent, and a candidate text set; then obtain the degree-of-association value, which can be a similarity, between the target text and each text in the candidate text set; then rank the texts in the candidate text set by the degree-of-association values according to a first preset rule and construct a first text set according to a first preset screening condition; and finally rank the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text. In the prior art, the user cannot tell which search result is best and has to browse the results of each ranking mode one by one, so retrieval efficiency is low. By contrast, the method provided by the embodiments of this application combines the strengths of multiple algorithms, improves the precision of patent search results, and improves the user's retrieval efficiency.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a specific example of the method for determining the retrieval ranking of texts provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the steps of one embodiment of constructing the first text set in the method for determining the retrieval ranking of texts provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the precision of three algorithms in each rank interval, provided by an embodiment of the present invention;
Fig. 4 is a flowchart of another specific example of the method for determining the retrieval ranking of texts provided by an embodiment of the present invention;
Fig. 5 is a flowchart of the steps of one embodiment of constructing the second text set provided by an embodiment of the present invention;
Fig. 6 is a flowchart of the steps of another embodiment of constructing the second text set provided by an embodiment of the present invention;
Fig. 7 is a structural diagram of a specific example of the system for determining the retrieval ranking of texts provided by an embodiment of the present invention;
Fig. 8 is a structural diagram of a specific example of the computer device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment 1
An embodiment of the present invention provides a method for determining the retrieval ranking of texts. The method can be applied to an electronic device, which may be a server or a terminal. As shown in Fig. 1, the method comprises the following steps:
Step S1: obtain a target text to be retrieved and a candidate text set.
In practical applications, the target text to be retrieved includes, but is not limited to, technical literature, patents, and academic papers. In this embodiment the target text is illustrated with a patent, and the candidate text set can be a candidate patent set. The server can receive the target patent to be retrieved that the user enters at a user terminal, and obtain the candidate patent set from a patent database. Depending on the usage scenario, the candidate set may be all patents in the database, or a patent set customized by other means, for example a set containing only Chinese patents, or a subset of all patents in one technical field of the patent database, which may contain, for example, 10,000 patents. The number of patents in the candidate patent set is given only as an example and is not limiting.
Step S2: obtain the degree-of-association value between the target text and each text in the candidate text set.
In practical applications, the degree-of-association value between the target text and each text in the candidate text set can be any metric that expresses how strongly the two texts are associated, such as similarity, novelty, importance, or action value. This embodiment uses similarity as the example: N similarity algorithms, with N not less than 2, can be used to obtain the similarity between the target patent and each text in the preset candidate patent set. Here the similarity values are obtained with three similarity calculation methods, namely structural analysis, keyword analysis, and semantic analysis, but this is not limiting; other embodiments may choose any two or more similarity calculation methods.
In practical applications, the step of obtaining the degree-of-association value between the target text and each text in the candidate text set comprises: using one preset association measurement algorithm or N preset association measurement algorithms and, based on a variant text corresponding to the target text, obtaining the degree-of-association value between the target text and each text in the candidate text set or the variant text corresponding to each text in the candidate text set.
A variant text in this embodiment is a text in another form of expression associated with the original text, for example: a translation of the original text into another language; an abbreviation, rewrite, or summary of the original content; part of the text contained in the original (for a patent text, for example, the abstract, the claims, or all or part of the description of the target text); or another text corresponding to the original content (for a patent text, for example, a patent family member of the original patent). These are examples only and are not limiting.
In one embodiment, the process of obtaining the degree-of-association value between the target text and each text in the candidate text set can use N preset association measurement algorithms and, based on the Chinese text of the target text or its English translation, obtain the respective degree-of-association values between the target text and each text in the candidate text set. For example, N preset similarity algorithms can be used to calculate the similarity between the English text of the target patent and each patent in the candidate text set, or the similarity between the Chinese translation of an English patent and each patent in the candidate text set, to obtain different rankings.
In one embodiment, the process of obtaining the degree-of-association value between the target text and each text in the candidate text set can use a single association measurement algorithm to obtain both the degree-of-association value between the target patent and each patent in the candidate patent set and the similarity value between the target patent and the translation into another language of each text in the candidate patent set, to obtain different rankings.
In another embodiment, the process of obtaining the degree-of-association value between the target text and each text in the candidate text set can also use N preset association measurement algorithms and, based on different textual content of the target text, obtain the degree-of-association value between the target text and each text in the candidate text set. For example, N preset similarity algorithms can be used to calculate the similarity between the abstract, the claims, or the description of the target patent and the abstract, the claims, or all or part of the description of each text in the candidate patent set, to obtain different rankings.
Step S3: rank the texts in the candidate text set by the degree-of-association values according to the first preset rule, and construct the first text set according to the first preset screening condition.
In this embodiment, the first text set is a preliminary patent set built by preliminary screening of the patents in the candidate patent set.
In one embodiment, as shown in Fig. 2, the process of constructing the preliminary patent set in step S3 can include the following steps:
Step S31: rank the texts in the candidate text set by the degree-of-association values obtained with each of the N preset association measurement algorithms, to obtain N ranked sets.
In one embodiment, the three similarity algorithms above (structural analysis, semantic analysis, and keyword analysis) are each used to sort the similarities between the target patent and each patent in the preset patent set from largest to smallest, giving three patent sets X, Y, and Z that correspond to the three rankings produced by the three similarity algorithms.
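Step S31 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `rank_positions`, the candidate identifiers A to D, and all similarity values are hypothetical, and ties are broken by identifier only to keep the result deterministic.

```python
def rank_positions(scores):
    """Map each candidate id to its 1-based rank, largest similarity first."""
    # Ties are broken by id so the ordering is deterministic (an assumption).
    ordered = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return {doc: pos for pos, doc in enumerate(ordered, start=1)}


# Hypothetical similarity values from the three algorithms.
structural = {"A": 0.91, "B": 0.40, "C": 0.75, "D": 0.10}
semantic = {"A": 0.55, "B": 0.80, "C": 0.60, "D": 0.20}
keyword = {"A": 0.30, "B": 0.35, "C": 0.90, "D": 0.05}

# One ranked set per algorithm, corresponding to X, Y, and Z in the text.
X = rank_positions(structural)
Y = rank_positions(semantic)
Z = rank_positions(keyword)
```

Storing rank positions in a dictionary keyed by candidate makes it easy to look up the three positions of one patent, which the rank-based screening rules later in this embodiment rely on.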
Step S32: perform a comprehensive ranking of the N ranked sets according to the first preset rule, and construct the first text set according to the first preset screening condition.
In one embodiment, weights are assigned, according to the first preset rule, to the degree-of-association values obtained with the N preset measurement algorithms; each degree-of-association value is multiplied by its corresponding weight and the products are added to obtain a comprehensive degree-of-association value; the comprehensive ranking result is determined by the size of the comprehensive degree-of-association values; and the texts whose comprehensive value is greater than a first preset comprehensive threshold are selected into the first text set.
In one embodiment, known most-relevant texts, for example the X documents cited during examination, can be used to compare and verify how well the three similarity algorithms above perform, to find the rank interval in which each calculation method has an advantage, and then to choose how many patents each calculation method contributes to the preliminary set. For example, 100 patents were sampled and ranked with each of the three similarity algorithms for comparison; only 3 groups of the data are shown here, in Table 1:
Table 1
Table 1 shows, for each sampled target patent (for example CN104983351A), the rank position in the full database of its most relevant patent, as obtained with keyword analysis (algorithm 1), structural analysis (algorithm 2), and semantic analysis (algorithm 3). The most relevant patent is the X document cited by the examiner for the target patent; for CN104983351A, for example, the most relevant patent is CN203247669U.
For the 100 sampled patents, the number of most-relevant documents falling in each rank interval was counted for each algorithm. Table 2 shows these counts, and Fig. 3 shows the precision comparison curves of the three similarity algorithms formed from the data in Table 2. Recall rate is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library, and measures the recall of a retrieval system. Based on the data in Table 2, the recall rate of the most-relevant documents of the target patents in each rank interval is as shown in Table 3:
Table 2
Table 3
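The interval statistics described above can be sketched as follows. The sketch assumes that each sampled target patent has exactly one most-relevant document, so that the recall of an interval is the share of samples whose most-relevant document ranks inside it; the exact definition behind Table 3 may differ. The function name `interval_recall`, the interval boundaries, and the rank values are all hypothetical.

```python
from bisect import bisect_left


def interval_recall(ranks, upper_bounds):
    """Share of samples whose most-relevant document ranks in each interval.

    upper_bounds holds inclusive upper rank limits in ascending order, so
    [10, 100, 1000] means the intervals 1-10, 11-100, 101-1000, and >1000.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for rank in ranks:
        # bisect_left finds the first interval whose upper limit is >= rank.
        counts[bisect_left(upper_bounds, rank)] += 1
    return [count / len(ranks) for count in counts]


# Hypothetical rank positions of the most-relevant document under one algorithm.
ranks_algorithm_1 = [3, 8, 15, 250, 60, 9, 1200, 40]
recall_algorithm_1 = interval_recall(ranks_algorithm_1, [10, 100, 1000])
```

Running the same function on the rank positions produced by each algorithm gives one recall profile per algorithm, which is the kind of information the weighting step below draws on.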
According to the above statistics, for a patent J in the candidate patent set, if the similarities to the target patent O computed by the three algorithms are Rx, Ry, and Rz, the recall rate of the interval in which each of Rx, Ry, and Rz falls can be calculated to determine a weight for each interval. If these weights are W1, W2, and W3, the comprehensive similarity value J' used for re-ranking is: J' = Rx*W1 + Ry*W2 + Rz*W3. The patents whose comprehensive similarity value is greater than a preset value are selected into the preliminary patent set.
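The weighted combination J' = Rx*W1 + Ry*W2 + Rz*W3 and the threshold screen can be sketched as below. For simplicity the sketch uses one fixed weight per algorithm, whereas the text derives the weights from the recall rate of the interval in which each value falls; the function names, weights, threshold, and similarity values are hypothetical.

```python
def combined_similarity(similarities, weights):
    """Weighted sum of one candidate's similarity values."""
    return sum(s * w for s, w in zip(similarities, weights))


def preliminary_set(candidates, weights, threshold):
    """Candidates whose combined similarity exceeds threshold, best first."""
    scored = {
        doc: combined_similarity(sims, weights)
        for doc, sims in candidates.items()
    }
    kept = [doc for doc, value in scored.items() if value > threshold]
    return sorted(kept, key=lambda doc: -scored[doc])


# Hypothetical (Rx, Ry, Rz) triples for three candidate patents.
candidates = {
    "J1": (0.9, 0.5, 0.3),
    "J2": (0.2, 0.3, 0.1),
    "J3": (0.6, 0.8, 0.7),
}
selected = preliminary_set(candidates, weights=(0.5, 0.3, 0.2), threshold=0.5)
```

With these numbers J1 scores about 0.66, J2 about 0.21, and J3 about 0.68, so J3 and J1 are kept and J2 is screened out.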
In one embodiment, the degree-of-association values obtained with each of the N preset measurement algorithms are sorted by size to obtain N ranked sets; for each text in the N ranked sets, the texts whose degree-of-association value is greater than a first degree-of-association threshold and/or whose rank position is smaller than a first rank-position threshold are selected into the first text set.
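A minimal sketch of this value-and/or-rank screen follows. It reads the rule as a union over the N ranked sets: a text is selected if it passes the test under at least one algorithm. That reading, the `require_both` switch used to model the "and/or", and all names and numbers are assumptions made for illustration.

```python
def screen(scores_by_algorithm, min_value, max_rank, require_both=False):
    """Texts that pass the value and/or rank test under at least one algorithm."""
    selected = set()
    for scores in scores_by_algorithm:
        ordered = sorted(scores, key=lambda doc: (-scores[doc], doc))
        for position, doc in enumerate(ordered, start=1):
            by_value = scores[doc] > min_value  # above the value threshold
            by_rank = position < max_rank       # ahead of the rank threshold
            if require_both:
                passed = by_value and by_rank
            else:
                passed = by_value or by_rank
            if passed:
                selected.add(doc)
    return selected


# Hypothetical degree-of-association values from two algorithms.
algorithm_1 = {"A": 0.9, "B": 0.4, "C": 0.2}
algorithm_2 = {"A": 0.1, "B": 0.3, "C": 0.85}

either = screen([algorithm_1, algorithm_2], min_value=0.8, max_rank=3)
both = screen([algorithm_1, algorithm_2], min_value=0.8, max_rank=3,
              require_both=True)
```

Here candidate B never exceeds the value threshold but ranks second under both algorithms, so it is kept under the "or" reading and dropped under the "and" reading.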
In one embodiment, screening can be based on the sum of rank positions: obtain the rank positions Kx, Ky, and Kz of patent K in the three sets X, Y, and Z, and add them; if the sum is less than a preset threshold, the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if ∑(Kx, Ky, Kz) is less than 500 (this limit is a default set from experience or can be set by the user), the patent is selected into the preliminary patent set.
In one embodiment, screening can be based on the average rank position: obtain the average mean(Kx, Ky, Kz) of the rank positions of patent K in the three sets X, Y, and Z; if the average is less than a preset threshold, the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if mean(Kx, Ky, Kz) is less than 100 (this limit is a default set from experience or can be set by the user), the patent is selected into the preliminary patent set.
In one embodiment, screening can be based on the minimum rank position: obtain the rank positions Kx, Ky, and Kz of patent K in the three sets X, Y, and Z, and find the smallest of them, min(Kx, Ky, Kz); if it is less than a preset threshold, the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if min(Kx, Ky, Kz) is within the top 50 (this limit is a default set from experience or can be set by the user), the patent is selected into the preliminary patent set.
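The three rank-based screens above (sum, average, and minimum of the rank positions) differ only in how the positions are aggregated, so they can be sketched with one helper that takes the aggregation function as a parameter. The helper name, the rank values, and the use of a strict "less than" comparison are assumptions; the thresholds 500, 100, and 50 are the example values from the text.

```python
from statistics import mean


def passes_rank_screen(ranks, aggregate, threshold):
    """True if the aggregated rank positions fall below the threshold."""
    return aggregate(ranks) < threshold


# Hypothetical rank positions Kx, Ky, Kz of one patent K in sets X, Y, Z.
k_ranks = (120, 35, 260)

by_sum = passes_rank_screen(k_ranks, sum, 500)    # sum is 415
by_mean = passes_rank_screen(k_ranks, mean, 100)  # mean is about 138.3
by_min = passes_rank_screen(k_ranks, min, 50)     # minimum is 35
```

The example shows why the choice of rule matters: this patent passes the sum and minimum screens but fails the average screen, because one poor rank position pulls the average up.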
In one embodiment, screening can require several rank positions to fall below a preset rank threshold at the same time: obtain the rank positions Kx, Ky, and Kz of patent K in the three sets X, Y, and Z; if two or more of Kx, Ky, and Kz are within the top n, the patent is selected into the preliminary patent set. In practical applications, the preset threshold n can be increased accordingly as the number of conditions that must be met increases.
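This "at least two rankings within the top n" rule can be sketched as a simple count. The function name, the parameter `min_hits`, the choice n = 50, and the rank values are hypothetical; "within the top n" is read here as a rank position of n or better.

```python
def passes_top_n(ranks, n, min_hits=2):
    """True if at least min_hits of the rank positions are within the top n."""
    hits = sum(1 for rank in ranks if rank <= n)
    return hits >= min_hits


# Hypothetical rank positions (Kx, Ky, Kz) for two candidate patents.
selected_patent = passes_top_n((12, 400, 30), n=50)    # two positions in top 50
rejected_patent = passes_top_n((12, 400, 300), n=50)   # only one position
```

Raising `min_hits` to 3 makes the rule stricter, which is the situation in which the text suggests enlarging n.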
In one embodiment, the rank positions Kx, Ky, and Kz of patent K in the three sets X, Y, and Z can be obtained, and every pair of rank positions summed to give ∑(Kx, Ky), ∑(Kx, Kz), and ∑(Ky, Kz); the smallest of these sums is taken, and if it is less than a preset threshold the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if min(∑(Kx, Ky), ∑(Kx, Kz), ∑(Ky, Kz)) is within the top 100 (this limit is a default set from experience or can be set by the user), the patent is selected into the preliminary patent set.
In one embodiment, arrangement precedence Kx, Ky, Kz of the patent K in X, Y, Z tri- set are obtained respectively, are gone
Except the arrangement maximum numerical value of precedence, and its remainder values is summed, when precedence is less than preset threshold, is then selected in primary election patent collection
It closes, such as: for the patent K in candidate collection, if ∑ (Kx, Ky, Kz)-max (Kx, Ky, Kz) is less than 70, (this range is according to warp
Test value default or can be by user's sets itself) within position, then the patent is selected into primary election range.
In one embodiment, arrangement precedence Kx, Ky, Kz of the patent K in X, Y, Z tri- set are obtained respectively, are gone
After the arrangement maximum numerical value max (Kx, Ky, Kz) of precedence, the average value of other two ways arrangement precedence is obtained, precedence exists
When less than preset threshold, then it is selected in primary election patent set, such as: for the patent K in candidate collection, if (∑ (Kx, Ky, Kz)-
Max (Kx, Ky, Kz))/2 less than 70 (this range is defaulted based on experience value or can be by user's sets itself) positions within, then will
The patent is selected into primary election patent set.
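The three statistics used in the preceding embodiments can be sketched as small helper functions. The function names are assumptions of this sketch. Note that for three ranks the minimum pairwise sum equals the total minus the largest rank, so the first two statistics differ only in the threshold applied to them.

```python
from itertools import combinations

def min_pairwise_sum(ranks):
    """Smallest sum over all pairs of ranks, e.g. min of Kx+Ky, Kx+Kz, Ky+Kz."""
    return min(a + b for a, b in combinations(ranks, 2))

def sum_without_worst(ranks):
    """Sum of the ranks after dropping the largest (worst) one."""
    return sum(ranks) - max(ranks)

def mean_without_worst(ranks):
    """Average of the ranks after dropping the largest (worst) one."""
    return sum_without_worst(ranks) / (len(ranks) - 1)

k = (10, 250, 30)             # hypothetical ranks Kx, Ky, Kz
print(min_pairwise_sum(k))    # 40
print(sum_without_worst(k))   # 40
print(mean_without_worst(k))  # 20.0
```

A patent would then be kept when the chosen statistic is below the preset threshold (for example 100 or 70 in the text).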
In one embodiment, the association degree values between the target text and each text in the candidate text collection, computed separately by the preset N association degree measurement algorithms, are analyzed according to a preset strategy to obtain an analysis result; based on the analysis result, it is judged whether each text in the candidate text collection meets a preset condition, and the texts that meet the preset condition are selected into the first text collection. For example, screening may use a gap analysis: the ranks Kx, Ky, Kz of patent K in the three sets X, Y, Z are obtained, the maximum max(Kx, Ky, Kz) and the minimum min(Kx, Ky, Kz) of the ranks are taken, and a rank-gap coefficient is computed. The rank-gap coefficient may be computed by any of the following four optional schemes:
Optional scheme 1: C1 = (max(Kx, Ky, Kz) - min(Kx, Ky, Kz)) / max(Kx, Ky, Kz);
Optional scheme 2: C2 = (max(Kx, Ky, Kz) - min(Kx, Ky, Kz)) / min(Kx, Ky, Kz);
Optional scheme 3: C3 = max(Kx, Ky, Kz) / min(Kx, Ky, Kz);
Optional scheme 4: C4 = min(Kx, Ky, Kz) / max(Kx, Ky, Kz).
A rank-gap coefficient can be obtained by any of the four optional schemes above (these are examples only and are not limiting). The coefficient is compared with a preset threshold to judge whether the patent is a high-gap patent (one whose ranks under two different ranking methods differ widely). If it is, a preset strategy determines whether the patent is admitted to the primary-selection patent set. The preset strategy may be a patent selection scheme derived from big-data statistical analysis and practical experience. For example, suppose such analysis and experience indicate that, when the rank Kx of patent K in set X is much smaller than its rank Ky in set Y, the patent is selected into the primary-selection patent set if it meets condition 1 (for example, its technology belongs to technical field F1) and is not selected if it meets condition 2 (for example, its technology belongs to technical field F2).
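The four optional coefficients are simple functions of the best and worst rank, as the following sketch shows. The function names and the choice of the third scheme with a threshold of 10 for the high-gap test are assumptions; the text does not fix which coefficient or threshold is used.

```python
def rank_gap_coefficients(ranks):
    """Return the four optional coefficients for one patent, keyed by scheme number.

    ranks are 1-based positions, so the minimum is at least 1 and no division
    by zero can occur.
    """
    hi, lo = max(ranks), min(ranks)
    return {
        1: (hi - lo) / hi,  # scheme 1
        2: (hi - lo) / lo,  # scheme 2
        3: hi / lo,         # scheme 3
        4: lo / hi,         # scheme 4
    }

def is_high_gap(ranks, threshold=10.0):
    # Assumption: "high gap" is judged on scheme 3 against a chosen threshold.
    return rank_gap_coefficients(ranks)[3] > threshold

coeffs = rank_gap_coefficients((10, 250, 30))  # hypothetical Kx, Ky, Kz
print(coeffs[2], coeffs[3])       # 24.0 25.0
print(is_high_gap((10, 250, 30)))  # True
print(is_high_gap((10, 12, 15)))   # False
```

Whether a high-gap patent is finally kept still depends on the preset strategy (for example the technical-field conditions), which this sketch does not model.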
In one embodiment, a preset minimum relevance threshold Rtx, Rty, Rtz is given for each relevance calculation method, and only patents whose relevance exceeds the minimum threshold are selected into the primary-selection patent set.
In one embodiment, a composite threshold Rt1 is preset. For patent K, its similarities Rx, Ry, Rz relative to the target patent O are obtained with the three relevance calculation methods; the maximum max(Rx, Ry, Rz) is taken and compared with Rt1, and if it is greater than Rt1, patent K is selected into the primary-selection set.
In one embodiment, a composite threshold Rt2 is preset. For patent K, its similarities Rx, Ry, Rz relative to the target patent O are obtained with the three relevance calculation methods; the average mean(Rx, Ry, Rz) is taken and compared with Rt2, and if it is greater than Rt2, patent K of the candidate text collection is selected into the primary-selection set.
In one embodiment, a minimum relevance threshold is set for each similarity algorithm; if two or more of the similarities Rx, Ry, Rz of patent K are greater than their preset thresholds, the patent is admitted to the primary-selection patent set.
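The four similarity-based screens can be sketched side by side. All names and sample values are hypothetical, and the first function reads "higher than the minimum threshold" as "higher under every algorithm", which is one possible interpretation of the text.

```python
def passes_all_thresholds(sims, thresholds):
    """Every similarity exceeds its own minimum threshold (Rtx, Rty, Rtz)."""
    return all(s > t for s, t in zip(sims, thresholds))

def passes_max(sims, rt1):
    """The largest similarity exceeds the composite threshold Rt1."""
    return max(sims) > rt1

def passes_mean(sims, rt2):
    """The average similarity exceeds the composite threshold Rt2."""
    return sum(sims) / len(sims) > rt2

def passes_votes(sims, thresholds, min_votes=2):
    """At least min_votes similarities exceed their own thresholds."""
    return sum(1 for s, t in zip(sims, thresholds) if s > t) >= min_votes

sims = (0.90, 0.85, 0.96)        # hypothetical Rx, Ry, Rz
thresholds = (0.88, 0.88, 0.88)  # hypothetical Rtx, Rty, Rtz
print(passes_all_thresholds(sims, thresholds))  # False (Ry is below 0.88)
print(passes_max(sims, 0.95))                   # True
print(passes_mean(sims, 0.90))                  # True
print(passes_votes(sims, thresholds))           # True
```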
The above embodiments are only optional examples and are not limiting. In other embodiments, provided they do not conflict with one another, two or more of the above selection methods may be applied together to the patents in the candidate patent set to construct the primary-selection patent set.
Step S4: each text in the first text collection is sorted according to the second preset rule to obtain the retrieval ranking result for the target text. In the embodiment of the present invention, the second preset rule is set according to the size of the association degree value with respect to the target text, or according to the rank of that value.
In one embodiment, as shown in Figure 4, step S4 may specifically include the following steps:
Step S41: each text in the first text collection is sorted according to a third preset rule, noise texts are excluded according to a second preset screening condition, and a second text collection is constructed.
In the embodiment of the present invention, the second text collection is the similar-patent set obtained after the user further screens and denoises the primary-selection patent set.
In one embodiment, as shown in Figure 5, the process of constructing the similar-patent set may specifically include the following steps:
Step S411: for the texts in the first text collection, weights are assigned according to the third preset rule to the association degree values obtained with the preset N measurement algorithms.
In practical applications, in the embodiment of the present invention, the novelty, similarity and so on of each patent in the primary-selection patent set are obtained with respect to the target patent using the preset N measurement algorithms. The third preset rule may follow the first preset rule used when constructing the first text collection, with the preset values adjusted as appropriate, or other preset rules may be used, for example rules set manually from experience; this is only an example and is not limiting.
Step S412: each association degree value is multiplied by its corresponding weight and the products are summed to obtain a composite association degree value.
In the embodiment of the present invention, the weight corresponding to each association degree value may be determined from the recall rate of each analysis algorithm in the interval concerned, as shown in Figure 3, which gives the weight proportion for each interval.
Step S413: the composite ranking result is determined according to the size of the composite association degree value.
Step S414: texts whose composite association degree value is less than a second preset composite association degree threshold are treated as noise texts.
In the embodiment of the present invention, a patent whose composite association degree value is less than a preset value, or whose rank is greater than a preset value, may be treated as a noise patent; this is only an example and is not limiting.
Step S415: the noise texts are removed from the first text collection and the second text collection is constructed.
In the embodiment of the present invention, the similar-patent set is constructed by removing the noise patents from the primary-selection patent set constructed earlier.
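Steps S411 to S415 can be sketched as follows. The function names, the fixed weights and the sample similarities are assumptions; in the patent the weights come from the third preset rule, and the noise cut-offs are preset values that the text does not specify.

```python
def composite_scores(sims, weights):
    """S411/S412: weight each algorithm's value and sum the products per text."""
    return {pid: sum(s * w for s, w in zip(vals, weights)) for pid, vals in sims.items()}

def build_second_collection(sims, weights, min_score=None, max_rank=None):
    """S413 to S415: rank by composite value, drop noise, return the kept ids in order."""
    scores = composite_scores(sims, weights)
    ordered = sorted(scores, key=scores.get, reverse=True)  # S413
    kept = []
    for rank, pid in enumerate(ordered, start=1):
        if min_score is not None and scores[pid] < min_score:
            continue  # S414: composite value below the threshold, treated as noise
        if max_rank is not None and rank > max_rank:
            continue  # optional rank cut-off, also treated as noise
        kept.append(pid)
    return kept  # S415: the second text collection

# Hypothetical values from three algorithms for three texts.
sims = {"A": (0.9, 0.8, 0.7), "B": (0.3, 0.4, 0.2), "C": (0.6, 0.9, 0.8)}
weights = (0.2, 0.3, 0.5)
print(build_second_collection(sims, weights, min_score=0.5))  # ['C', 'A']
print(build_second_collection(sims, weights, max_rank=1))     # ['C']
```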
In another embodiment, as shown in Figure 6, the process of constructing the similar-patent set may specifically include the following steps:
Step S416: second association degree values between the texts in the first text collection and the target text are obtained according to the preset N association degree measurement algorithms.
Step S417: the texts are sorted by size according to the second association degree values under each algorithm, giving N ordered sets.
Step S418: among the texts of the N ordered sets, those whose association degree value is less than a second association degree threshold and/or whose rank is greater than a second rank threshold are treated as noise texts.
Step S419: the noise texts are removed from the first text collection and the second text collection is constructed.
In the embodiment of the present invention, noise patents may be removed by setting appropriate thresholds on similarity and/or on the rank produced by each similarity algorithm, in the same way as when constructing the primary-selection patent set, to construct the similar-patent set; details are not repeated here.
Step S42: each text in the second text collection is sorted according to the second preset rule to obtain the retrieval ranking result for the target text.
In the embodiment of the present invention, the second preset rule is set according to the size of the association degree value with respect to the target text, or according to the rank of that value. In one embodiment, weights are distributed equally (the three algorithms have the same weight), that is, Wx=Wy=Wz=1/3. For example, if the similarities of a patent J relative to the target patent O obtained by the three algorithms are Rx=90%, Ry=85%, Rz=96%, then the simple weighted average similarity of candidate patent J relative to the target patent is R=90%*1/3+85%*1/3+96%*1/3=90.3%. The retrieval ranking result for the target patent is obtained from the weighted average similarity of each patent.
In a specific implementation, a weight may be assigned manually to each algorithm on the basis of experience; for example, Wx=20%, Wy=30%, Wz=50% may be assigned, and the retrieval ranking result for the target patent is obtained from the weighted value of each patent.
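The worked example can be checked with a few lines of Python. The function name `weighted_similarity` is hypothetical, and the requirement that the weights sum to 1 is an assumption of this sketch.

```python
def weighted_similarity(sims, weights):
    """Weighted sum of the per-algorithm similarities for one patent."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(s * w for s, w in zip(sims, weights))

sims_j = (0.90, 0.85, 0.96)  # Rx, Ry, Rz from the example
equal = weighted_similarity(sims_j, (1 / 3, 1 / 3, 1 / 3))
manual = weighted_similarity(sims_j, (0.20, 0.30, 0.50))
print(f"{equal:.1%}")   # 90.3%
print(f"{manual:.1%}")  # 91.5%
```

With the manually assigned weights the same patent scores slightly higher, because the largest weight falls on the algorithm that gave the highest similarity.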
In one embodiment, the N association degree measurement algorithms are used to obtain the association degree values between preset samples and each text in the candidate text collection; the recall rate of the preset samples in each preset interval of association degree value is obtained; corresponding weights are set for the N algorithms according to the recall rates in the preset intervals; a composite ranking value is obtained for each text in the candidate text collection; and the retrieval ranking result for the target text is obtained from the composite ranking values. For example, the relevance range of each calculation method is divided into several intervals, and the recall rate of each interval is calculated from the number of X documents recalled in that interval and the corresponding total; for instance, relevance may be divided into the following six intervals:
Statistical results for algorithm 1:
Greater than 95%: Z11 = (number of X documents recalled / total) = 5%
95%~90%: Z12 = (number of X documents recalled / total) = 10%
90%~80%: Z13 = (number of X documents recalled / total) = 11%
80%~70%: Z14 = (number of X documents recalled / total) = 13%
70%~60%: Z15 = (number of X documents recalled / total) = 19%
60% or less: Z16 = (number of X documents recalled / total) = 42%
Statistical results for algorithm 2:
Greater than 95%: Z21 = (number of X documents recalled / total) = 3%
95%~90%: Z22 = (number of X documents recalled / total) = 12%
90%~80%: Z23 = (number of X documents recalled / total) = 17%
80%~70%: Z24 = (number of X documents recalled / total) = 15%
70%~60%: Z25 = (number of X documents recalled / total) = 23%
60% or less: Z26 = (number of X documents recalled / total) = 30%
Statistical results for algorithm 3:
Greater than 95%: Z31 = (number of X documents recalled / total) = 7%
95%~90%: Z32 = (number of X documents recalled / total) = 9%
90%~80%: Z33 = (number of X documents recalled / total) = 18%
80%~70%: Z34 = (number of X documents recalled / total) = 19%
70%~60%: Z35 = (number of X documents recalled / total) = 15%
60% or less: Z36 = (number of X documents recalled / total) = 32%
A weight allocation scheme is specified from the above statistics. For example, for patent J, if the similarities relative to the target patent O computed by the three algorithms are Rx, Ry, Rz, the intervals in which Rx, Ry, Rz fall can be determined and the corresponding weight proportions, say W1, W2, W3, looked up from the above statistics. The patents are then re-ranked after the calculation, with the composite similarity value J' given by J'=Rx*W1+Ry*W2+Rz*W3; the retrieval ranking result for the target patent is obtained from the composite similarity value of each patent.
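The text does not say exactly how a recall rate is turned into a weight proportion. The sketch below assumes one plausible reading: each algorithm's recall rate in the interval where its similarity falls is looked up, and the three rates are normalised so that W1 + W2 + W3 = 1. The names `EDGES`, `RECALL`, `interval_weights` and `composite_similarity`, and the handling of values that lie exactly on an interval boundary, are assumptions of this sketch.

```python
from bisect import bisect_right

# Boundaries between the six similarity intervals, in ascending order:
# below 60%, 60~70%, 70~80%, 80~90%, 90~95%, above 95%.
EDGES = [0.60, 0.70, 0.80, 0.90, 0.95]

# Recall statistics from the text, listed from the lowest interval to the highest.
RECALL = [
    [0.42, 0.19, 0.13, 0.11, 0.10, 0.05],  # algorithm 1: Z16 ... Z11
    [0.30, 0.23, 0.15, 0.17, 0.12, 0.03],  # algorithm 2: Z26 ... Z21
    [0.32, 0.15, 0.19, 0.18, 0.09, 0.07],  # algorithm 3: Z36 ... Z31
]

def interval_weights(sims):
    """Look up each algorithm's recall in its interval and normalise (assumed rule)."""
    raw = [RECALL[i][bisect_right(EDGES, s)] for i, s in enumerate(sims)]
    total = sum(raw)
    return [r / total for r in raw]

def composite_similarity(sims):
    """J' = Rx*W1 + Ry*W2 + Rz*W3 with interval-dependent weights."""
    return sum(s * w for s, w in zip(sims, interval_weights(sims)))

sims_j = (0.90, 0.85, 0.96)  # hypothetical Rx, Ry, Rz
print([round(w, 3) for w in interval_weights(sims_j)])  # [0.294, 0.5, 0.206]
print(round(composite_similarity(sims_j), 4))           # 0.8874
```

Other mappings from recall rate to weight are equally compatible with the text; only the final formula for J' is taken from the source.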
In one embodiment, the composite similarity is the highest of the similarities obtained by the three algorithms, that is, max(Rx, Ry, Rz). For example, if the similarities of a patent relative to the target patent obtained by the three algorithms are Rx=90%, Ry=85%, Rz=96%, the similarity of the patent relative to the target patent is directly set to R=96%.
In one embodiment, interleaved ordering may be used. For example, the similar-patent set may be sorted with each of the three ranking methods, giving three ordered sets X, Y, Z; the final order then interleaves them in the sequence X1, Y1, Z1, X2, Y2, Z2, X3, Y3, Z3, and so on. If a patent appears in more than one position, for example as X2, Y6 and Z53 at the same time, it is first placed at the "X2" position; when position Y6 is reached the patent is skipped and the following patent Y7 is selected (if Y7 has also been selected already, the next one is taken, and so on), and Z53 is handled in the same way.
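The interleaving with duplicate skipping can be sketched as a round-robin over the three ordered sets. The function name `interleave` and the sample patent ids are hypothetical.

```python
def interleave(*ordered_sets):
    """Take one not-yet-placed patent from each ordered set in turn.

    A patent already placed through another set is skipped and the next
    patent of the same set is taken instead.
    """
    iterators = [iter(s) for s in ordered_sets]
    seen, result = set(), []
    while iterators:
        remaining = []
        for it in iterators:
            for pid in it:
                if pid not in seen:
                    seen.add(pid)
                    result.append(pid)
                    break
            else:
                continue  # this ordered set is exhausted, drop it
            remaining.append(it)
        iterators = remaining
    return result

X = ["p1", "p2", "p3", "p4"]
Y = ["p2", "p5", "p1", "p3"]
Z = ["p6", "p1", "p2", "p4"]
print(interleave(X, Y, Z))  # ['p1', 'p2', 'p6', 'p3', 'p5', 'p4']
```

In the second round, set X skips p2 (already placed from Y) and contributes p3, which is the skipping behaviour described for Y6 and Y7 in the text.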
In one embodiment, N rankings of association degree value with respect to the target text are obtained with the N association degree measurement algorithms, a composite ranking value for each text in the candidate text collection is obtained from the N ranks, and the retrieval ranking result for the target text is obtained from the composite ranking values. For example, the user may sort the similar-patent set with each of three ranking methods, giving three ordered sets X, Y, Z. For a patent C whose ranks in the three sets are Cx, Cy, Cz, the patents may be re-ranked after a calculation; for example, the composite ranking value may be set to C'=Cx+Cy+Cz, and the patents are finally sorted by the size of C'. If several patents have the same C', they may be ordered by a preset rule: for example, the minimum of each group Cx, Cy, Cz may be compared and the patent with the smallest min(Cx, Cy, Cz) ranked first, or the maximum of each group may be compared and the patent with the smallest max(Cx, Cy, Cz) ranked first.
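A minimal sketch of the rank-sum ordering with its two tie-breaking options follows. The function name `rank_sum_order` and the sample ranks are assumptions; passing `min` or `max` as `tie_break` selects between the two preset rules mentioned in the text.

```python
def rank_sum_order(ranks, tie_break=min):
    """Sort patent ids by the sum of their ranks; equal sums are ordered by tie_break.

    With tie_break=min the patent with the smallest best rank comes first;
    with tie_break=max the patent with the smallest worst rank comes first.
    """
    return sorted(ranks, key=lambda pid: (sum(ranks[pid]), tie_break(ranks[pid])))

# Hypothetical ranks (Cx, Cy, Cz); A and B both sum to 12.
ranks = {"A": (1, 5, 6), "B": (4, 4, 4), "C": (2, 3, 1)}
print(rank_sum_order(ranks))                 # ['C', 'A', 'B']
print(rank_sum_order(ranks, tie_break=max))  # ['C', 'B', 'A']
```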
In one embodiment, the step of sorting according to the second preset rule to obtain the retrieval ranking result for the target text includes: using the N association degree measurement algorithms to obtain the association degree values between preset samples and each text in the candidate text collection; obtaining the rank, by association degree value within the candidate text collection, of the most relevant text corresponding to each preset sample; setting corresponding weights for the N algorithms according to the average recall rate of the preset samples at each rank or the recall rate in each preset interval; obtaining a composite ranking value for each text in the candidate text collection; and obtaining the retrieval ranking result for the target text from the composite ranking values. Specifically:
Weights are assigned according to the distribution of the ranking results. A batch of patent samples is taken from a preset patent set (for example, 100 patents that have X documents), and the most relevant document of each sample is found (for example, using the X-document information provided in patent examination records, with X-category documents defined as the closest documents of the patent) and mapped to the candidate patents. The similarity between each sample patent and its closest reference patent is calculated with each of the different similarity calculation methods. For each similarity calculation method, the rank of the X document of each sample patent within the entire candidate patent set, by relevance to the target patent, is calculated (if a sample patent corresponds to several X documents, the highest-ranked one is taken). In this way, the ranks Pix, Piy, Piz (i=1~100) of the X document Pi of each sample patent under the three algorithms are obtained. These data are analyzed to obtain the rank distribution of each calculation method, from which the advantageous intervals of each algorithm are determined, for example the comparison of the accuracy trends of the three algorithms in each interval shown in Figure 3.
From the above statistics, algorithm 1 and algorithm 2 have higher recall rates at the top ranks (top 10) and at the later ranks (101~1000), but no obvious advantage over the other algorithm in the 10~100 interval; algorithm 3 shows the opposite trend. Based on these statistics, the weights in the composite ranking formula can be assigned and adjusted accordingly. From the patent recall statistics, weights may be assigned by either of the following two methods:
Method one: the average recall rate of each algorithm at each rank is counted, and weights are assigned to the algorithms according to recall rate. For example, suppose that, according to the statistics, the proportions of closest documents ranked 6th by the three algorithms are 1.5%, 2.3% and 0.6% respectively. The relative shares of the three algorithms for closest documents ranked 6th are then:
Algorithm 1: share=1.5/(1.5+2.3+0.6)*100%=34%,
Algorithm 2: share=2.3/(1.5+2.3+0.6)*100%=52%,
Algorithm 3: share=0.6/(1.5+2.3+0.6)*100%=14%;
Then, for the case of rank 6, weights of 34%, 52% and 14% are given respectively. For patent C, if its ranks in the three sets are Cx, Cy, Cz, the corresponding weight proportions are found by the above method; if they are W1, W2, W3, the patents are re-ranked after the calculation, with the composite ranking value set to C'=Cx*W1+Cy*W2+Cz*W3.
Method two: the ranks at which search results are hit are divided into several intervals, the recall rate of each algorithm in each interval is counted, and weights are assigned to the algorithms according to recall rate. For example, suppose the proportions of closest patents ranked 6th~10th by the three algorithms are 5%, 3% and 11% respectively. The relative shares of the three algorithms for closest documents ranked 6th~10th are then:
Algorithm 1: share=5/(5+3+11)*100%=26%,
Algorithm 2: share=3/(5+3+11)*100%=16%,
Algorithm 3: share=11/(5+3+11)*100%=58%;
Then, for the case of ranks 6~10, weights of 26%, 16% and 58% are given respectively. For patent C, if its ranks in the three sets are Cx, Cy, Cz, the intervals in which Cx, Cy, Cz fall can be determined and the corresponding weight proportions found by the above method; if they are W1, W2, W3, the patents are re-ranked after the calculation, with the composite ranking value set to:
C'=Cx*W1+Cy*W2+Cz*W3.
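Both weighting methods reduce to normalising the recall proportions of the three algorithms so that they sum to 1. The sketch below reproduces the two sets of example weights; the function names and the sample ranks passed to `composite_rank` are assumptions.

```python
def relative_shares(rates):
    """Normalise the per-algorithm recall proportions so that they sum to 1."""
    total = sum(rates)
    return [r / total for r in rates]

def composite_rank(ranks, weights):
    """Weighted sum of the three ranks with weights W1, W2, W3."""
    return sum(c * w for c, w in zip(ranks, weights))

w_rank_6 = relative_shares([1.5, 2.3, 0.6])  # method one, closest document ranked 6th
w_6_to_10 = relative_shares([5, 3, 11])      # method two, ranks 6th to 10th
print([round(100 * w) for w in w_rank_6])    # [34, 52, 14]
print([round(100 * w) for w in w_6_to_10])   # [26, 16, 58]

# Hypothetical patent whose three ranks all fall in the 6th to 10th interval.
print(round(composite_rank((6, 8, 9), w_6_to_10), 2))  # 8.05
```

A smaller composite value means a better position, so the final order sorts patents by ascending composite rank.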
The above embodiments are only examples and are not limiting; in practical applications, other variations or changes may be made on the basis of the above description.
The retrieval ranking determination method provided by the embodiment of the present invention first obtains a target text and a candidate text collection to be retrieved, where the target text may be a patent; then obtains the association degree value between the target text and each text in the candidate text collection, where the association degree may be a similarity; then sorts each text in the candidate text collection according to the first preset rule using the association degree values and constructs the first text collection according to the first preset screening condition; and finally sorts each text in the first text collection according to the second preset rule to obtain the retrieval ranking result for the target text. The method provided by the embodiments of this application combines the advantages of multiple algorithms, improving the precision of patent search results and the retrieval efficiency of the user.
Embodiment 2
The embodiment of the present invention provides a system for determining the retrieval ranking of texts. As shown in Figure 7, the system includes:
A target text and candidate text collection obtaining module 1, for obtaining the target text and the candidate text collection to be retrieved. This module executes the method described in step S1 of Embodiment 1, which is not repeated here.
An association degree value obtaining module 2, for obtaining the association degree value between the target text and each text in the candidate text collection. This module executes the method described in step S2 of Embodiment 1, which is not repeated here.
A first text collection construction module 3, for sorting each text in the candidate text collection according to the first preset rule using the association degree values and constructing the first text collection according to the first preset screening condition. This module executes the method described in step S3 of Embodiment 1, which is not repeated here.
A retrieval ranking result obtaining module 4, for sorting each text in the first text collection according to the second preset rule to obtain the retrieval ranking result for the target text. This module executes the method described in step S4 of Embodiment 1, which is not repeated here.
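As a rough sketch of how the four modules could fit together, the following Python function chains them in order. Everything here is an assumption made for illustration: the two toy similarity measures merely stand in for the N association degree measurement algorithms, the "within the top n under any algorithm" screen is just one of the screening options described in Embodiment 1, and the fixed weights stand in for the second preset rule.

```python
def jaccard_words(a, b):
    """Toy measure: overlap of word sets (a stand-in, not from the patent)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def jaccard_chars(a, b):
    """Toy measure: overlap of character sets (a stand-in, not from the patent)."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(target, candidates, metrics, weights, top_n):
    # Module 1 corresponds to the arguments: target text and candidate collection.
    # Module 2: association degree value of every candidate under every algorithm.
    sims = {c: [m(target, c) for m in metrics] for c in candidates}
    # Module 3: rank under each algorithm, keep texts in the top_n of any ranking.
    selected = set()
    for i in range(len(metrics)):
        ordered = sorted(candidates, key=lambda c: sims[c][i], reverse=True)
        selected.update(ordered[:top_n])
    first_collection = [c for c in candidates if c in selected]
    # Module 4: final order by weighted sum of the association degree values.
    def score(c):
        return sum(s * w for s, w in zip(sims[c], weights))
    return sorted(first_collection, key=score, reverse=True)

docs = ["text retrieval ranking system", "image compression method", "cooking recipes"]
result = retrieve("text retrieval ranking method", docs,
                  [jaccard_words, jaccard_chars], (0.5, 0.5), top_n=2)
print(result)  # ['text retrieval ranking system', 'image compression method']
```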
The system for determining the retrieval ranking of texts provided by the embodiment of the present invention first obtains a target text and a candidate text collection to be retrieved, where the target text may be a patent; then obtains the association degree value between the target text and each text in the candidate text collection, where the association degree may be a similarity; then sorts each text in the candidate text collection according to the first preset rule using the association degree values and constructs the first text collection according to the first preset screening condition; and finally sorts each text in the first text collection according to the second preset rule to obtain the retrieval ranking result for the target text. The system provided by the embodiments of this application combines the advantages of multiple algorithms, improving the precision of patent search results and the retrieval efficiency of the user.
Embodiment 3
The embodiment of the present invention provides a computer device which, as shown in Figure 8, includes: at least one processor 401, such as a CPU (Central Processing Unit); at least one communication interface 403; a memory 404; and at least one communication bus 402. The communication bus 402 provides connection and communication between these components. The communication interface 403 may include a display and a keyboard, and optionally may also include a standard wired interface and a wireless interface. The memory 404 may be a high-speed RAM (Random Access Memory, a volatile random access memory) or a non-volatile memory, for example at least one disk memory. Optionally, the memory 404 may also be at least one storage device located remotely from the processor 401. The processor 401 can execute the text retrieval ranking determination method described with reference to Figure 1; a set of program code is stored in the memory 404, and the processor 401 calls the program code stored in the memory 404 to execute the text retrieval ranking determination method of Embodiment 1.
The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 402 may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one line is shown in Figure 8, but this does not mean that there is only one bus or one type of bus.
The memory 404 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 404 may also include a combination of the above kinds of memory.
The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used to store program instructions. The processor 401 can call the program instructions to implement the text retrieval ranking determination method provided in Embodiment 1 of this application.
The embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions that can execute the text retrieval ranking determination method of Embodiment 1 above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above kinds of memory.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here, and obvious variations or changes derived from them remain within the protection scope of the invention.
Claims (14)
1. a kind of retrieval ordering of text determines method, which comprises the steps of:
Obtain target text and candidate text collection to be retrieved;
Obtain the degree of association magnitude of each text in the target text and the candidate text collection;
Each text in the candidate text collection is ranked up according to the first preset rules using the degree of association magnitude,
The first text collection is constructed according to the first default screening conditions;
Each text in first text collection is ranked up according to the second preset rules, obtains the retrieval row of target text
Sequence result.
2. the retrieval ordering of text according to claim 1 determines method, which is characterized in that described by first text
The step of each text is ranked up according to the second preset rules in set, obtains the retrieval ordering result of target text, comprising:
Each text in first text collection is ranked up according to third preset rules, according to the second default screening conditions
Noise text is excluded, the second text collection is constructed;
Each text in second text collection is ranked up according to the second preset rules, obtains the retrieval of target text
Ranking results.
3. the retrieval ordering of text according to claim 1 or 2 determines method, which is characterized in that obtain the target text
This in the candidate text collection the step of degree of association magnitude of each text, comprising:
Each text in the target text and the candidate text collection is calculated separately using default N kind relevance metric algorithm
Degree of association magnitude, the N are the positive integer more than or equal to 2.
4. the retrieval ordering of text according to claim 3 determines method, which is characterized in that described to utilize the degree of association
Magnitude is ranked up each text in the candidate text collection according to the first preset rules, according to the first default screening item
Part constructs the step of the first text collection, comprising:
According to the degree of association magnitude that default N kind relevance metric algorithm obtains, respectively to each text in the candidate text collection
Originally it is ranked up, obtains N kind ordered set;
The N kind ordered set is carried out according to the first preset rules it is integrated ordered, according to the first default screening conditions building the
One text collection;Preferably, according to preset strategy to default N kind relevance metric algorithm calculate separately the target text with it is described
The degree of association magnitude of each text is analyzed in candidate text collection, obtains analysis result;Based on the analysis results described in judgement
Whether each text meets preset condition in candidate text collection, and the text of the preset condition will be met in candidate text collection
It is selected into first text collection.
5. The method for determining the retrieval ranking of texts according to claim 4, characterized in that the step of performing a comprehensive ranking of the N ranked sets according to the first preset rule, and constructing the first text collection according to the first preset screening condition, comprises:
assigning, according to the first preset rule, a weight to the association degree values obtained by each of the N preset metric algorithms; multiplying each association degree value by its corresponding weight and summing the products to obtain a comprehensive association degree value; determining the comprehensive ranking result according to the size of the comprehensive association degree value; and selecting the texts whose comprehensive association degree value is greater than a first preset comprehensive association degree threshold into the first text collection.
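The weighted combination in claim 5 can be sketched as follows. The function names, the dictionary layout, and the example weights are hypothetical; the claim does not specify how the weights or the threshold are chosen.

```python
def comprehensive_values(values_by_algorithm, weights):
    """Weighted sum of per-algorithm association values for each candidate."""
    n_candidates = len(next(iter(values_by_algorithm.values())))
    return [
        sum(weights[name] * vals[i] for name, vals in values_by_algorithm.items())
        for i in range(n_candidates)
    ]


def build_first_collection(candidates, values_by_algorithm, weights, threshold):
    """Rank by comprehensive value and keep the candidates above the threshold."""
    combined = comprehensive_values(values_by_algorithm, weights)
    ranked = sorted(zip(candidates, combined), key=lambda pair: pair[1], reverse=True)
    return [(text, value) for text, value in ranked if value > threshold]


# Illustrative data only: three candidates scored by two assumed algorithms.
example = build_first_collection(
    ["doc_a", "doc_b", "doc_c"],
    {"semantic": [0.9, 0.2, 0.6], "keyword": [0.5, 0.1, 0.8]},
    {"semantic": 0.6, "keyword": 0.4},
    threshold=0.5,
)
```

With these illustrative numbers, `doc_b` falls below the threshold and is left out of the first text collection, while `doc_a` ranks ahead of `doc_c`.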
6. The method for determining the retrieval ranking of texts according to claim 3, characterized in that the step of sorting each text in the candidate text collection according to the first preset rule using the association degree values, and constructing the first text collection according to the first preset screening condition, comprises:
sorting by size according to the association degree values obtained by each of the N preset metric algorithms, to obtain N ranked sets;
selecting into the first text collection those texts in the N ranked sets whose association degree value is greater than a first association degree threshold and/or whose rank position is less than a first rank position threshold.
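The "and/or" screening in claim 6 can be sketched as below. The claim does not say how the N ranked sets are reconciled; this sketch assumes a text is selected as soon as it passes the test in any one ranked set, and the `require_both` flag (a hypothetical name) switches between the "or" and "and" readings.

```python
def select_by_threshold_or_rank(candidates, values_by_algorithm,
                                value_threshold, rank_threshold,
                                require_both=False):
    """Select texts by association value and/or rank position per algorithm."""
    selected = set()
    for vals in values_by_algorithm.values():
        # One ranked set per algorithm, largest association value first.
        order = sorted(range(len(candidates)), key=lambda i: vals[i], reverse=True)
        for position, i in enumerate(order, start=1):
            above_value = vals[i] > value_threshold
            within_rank = position < rank_threshold
            if require_both:
                keep = above_value and within_rank
            else:
                keep = above_value or within_rank
            if keep:
                selected.add(candidates[i])
    # Preserve the original candidate order in the returned collection.
    return [c for c in candidates if c in selected]
```

Taking the union across algorithms favors recall at this stage, which fits a first-pass screen that a later step can prune.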
7. The method for determining the retrieval ranking of texts according to claim 3, characterized in that the step of sorting each text in the first text collection according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text collection, comprises:
for the texts in the first text collection, assigning, according to the third preset rule, a weight to the association degree values obtained by each of the N preset metric algorithms;
multiplying each association degree value by its corresponding weight and summing the products to obtain a comprehensive association degree value;
determining the comprehensive ranking result according to the size of the comprehensive association degree value;
treating the texts whose comprehensive association degree value is less than a second preset comprehensive association degree threshold as noise texts;
removing the noise texts from the first text collection to construct the second text collection.
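The noise exclusion in claim 7 can be sketched as follows, assuming the first text collection is held as a mapping from a text identifier to its N association degree values. The function name, the data layout, and the handling of values exactly equal to the threshold (kept, since the claim only marks values below it as noise) are illustrative choices.

```python
def exclude_noise(first_collection, weights, noise_threshold):
    """Split the first collection into a second collection and noise texts.

    first_collection maps a text id to its list of N association degree
    values; weights holds one weight per algorithm, in the same order.
    """
    scored = {
        text: sum(w * v for w, v in zip(weights, values))
        for text, values in first_collection.items()
    }
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    second_collection = [text for text, score in ranked if score >= noise_threshold]
    noise = [text for text, score in ranked if score < noise_threshold]
    return second_collection, noise
```

Returning the noise texts alongside the second text collection makes it easy to inspect what the threshold removed when tuning it.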
8. The method for determining the retrieval ranking of texts according to claim 3, characterized in that the step of sorting each text in the first text collection according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text collection, comprises:
obtaining, according to the N preset association metric algorithms, second association degree values between the texts in the first text collection and the target text;
sorting by size according to the second association degree values, to obtain N ranked sets;
treating as noise texts those texts in the N ranked sets whose association degree value is less than a second association degree threshold and/or whose rank position is greater than a second rank position;
removing the noise texts from the first text collection to construct the second text collection.
9. The method for determining the retrieval ranking of texts according to claim 1, characterized in that, for the texts in the first text collection, the second preset rule sets the rank position according to the size of the association degree value with the target text or according to the ranking of the association degree values, to obtain the retrieval ranking result of the target text; preferably, the association degree values between preset samples and each text in the candidate text collection are obtained using the N association metric algorithms, the recall rate of the association degree values of the preset samples over a preset interval is obtained, a corresponding weight is set for each of the N association metric algorithms according to the recall rate over the preset interval, a comprehensive ranking value is obtained for each text in the candidate text collection, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking values; preferably, N rank orders of the association degree values for the target text are obtained according to the N association metric algorithms, a comprehensive ranking value is obtained for each text in the candidate text collection according to the N rank orders, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking values; preferably, the association degree values between preset samples and each text in the candidate text collection are obtained using the N association metric algorithms, the rank position, by association degree value, of the most relevant text corresponding to each preset sample in the candidate text collection is obtained, a corresponding weight is set for each of the N association metric algorithms according to the average recall rate of the rank positions of the preset samples or the recall rate over a preset interval, a comprehensive ranking value is obtained for each text in the candidate text collection, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking values.
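One way to read the recall-based weighting in claim 9 is sketched below. It assumes the "preset interval" is the top k rank positions and that each preset sample has one known most relevant text whose rank position under each algorithm has already been computed. The normalization of recall rates into weights that sum to 1 is also an assumption; the claim only says the weights are set according to the recall rate.

```python
def recall_at_k(rank_positions, k):
    """Fraction of preset samples whose most relevant text ranks within the top k."""
    return sum(1 for position in rank_positions if position <= k) / len(rank_positions)


def weights_from_recall(positions_by_algorithm, k):
    """Turn per-algorithm recall rates into weights that sum to 1."""
    recalls = {
        name: recall_at_k(positions, k)
        for name, positions in positions_by_algorithm.items()
    }
    total = sum(recalls.values())
    if total == 0:
        # No algorithm retrieved anything in the interval: fall back to equal weights.
        return {name: 1 / len(recalls) for name in recalls}
    return {name: recall / total for name, recall in recalls.items()}
```

For example, if one algorithm places the relevant text in the top 10 for three of four samples and another for two of four, the first receives the larger weight in the comprehensive ranking value.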
10. The method for determining the retrieval ranking of texts according to claim 2, characterized in that, for the texts in the second text collection, the second preset rule sets the rank position according to the size of the association degree value with the target text or according to the ranking of the association degree values, and the texts are sorted to obtain the retrieval ranking result of the target text; preferably, the association degree values between preset samples and each text in the candidate text collection are obtained using the N association metric algorithms, the recall rate of the association degree values of the preset samples over a preset interval is obtained, a corresponding weight is set for each of the N association metric algorithms according to the recall rate over the preset interval, a comprehensive ranking value is obtained for each text in the candidate text collection, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking values; preferably, N rank orders of the association degree values for the target text are obtained according to the N association metric algorithms, a comprehensive ranking value is obtained for each text in the candidate text collection according to the N rank orders, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking values; preferably, the association degree values between preset samples and each text in the candidate text collection are obtained using the N association metric algorithms, the rank position, by association degree value, of the most relevant text corresponding to each preset sample in the candidate text collection is obtained, a corresponding weight is set for each of the N association metric algorithms according to the average recall rate of the rank positions of the preset samples or the recall rate over a preset interval, a comprehensive ranking value is obtained for each text in the candidate text collection, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking values.
11. The method for determining the retrieval ranking of texts according to claim 3, characterized in that the step of obtaining the association degree value between the target text and each text in the candidate text collection comprises:
obtaining, using one preset association metric algorithm or the N association metric algorithms, and based on the transformed text corresponding to the target text, the association degree value between the target text and each text in the candidate text collection or the transformed text corresponding to each text in the candidate text collection.
12. A system for determining the retrieval ranking of texts, characterized by comprising:
a target text and candidate text collection acquisition module, configured to obtain a target text and a candidate text collection to be retrieved;
an association degree value acquisition module, configured to obtain the association degree value between the target text and each text in the candidate text collection;
a first text collection construction module, configured to sort each text in the candidate text collection according to the first preset rule using the association degree values, and to construct the first text collection according to the first preset screening condition;
a retrieval ranking result acquisition module, configured to sort each text in the first text collection according to the second preset rule to obtain the retrieval ranking result of the target text.
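The module structure of claim 12 can be sketched as a single class whose methods stand in for the claimed modules. Everything below is an assumption for illustration: the class and method names, the use of a weighted sum as the first preset rule, a threshold as the first preset screening condition, and re-ranking by one designated metric as the second preset rule. Candidate texts are assumed to be distinct strings because they are used as dictionary keys.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metric = Callable[[str, str], float]


@dataclass
class RetrievalRankingSystem:
    metrics: Dict[str, Metric]   # the N association metric algorithms
    weights: Dict[str, float]    # assumed first preset rule: weighted sum
    first_threshold: float       # assumed first preset screening condition
    final_metric: str            # assumed second preset rule: re-rank by this metric

    def get_association_values(self, target: str, candidates: List[str]):
        """Association degree value acquisition module."""
        return {
            c: {name: fn(target, c) for name, fn in self.metrics.items()}
            for c in candidates
        }

    def build_first_collection(self, values) -> List[str]:
        """First text collection construction module."""
        combined = {
            c: sum(self.weights[name] * v for name, v in per_metric.items())
            for c, per_metric in values.items()
        }
        ranked = sorted(combined, key=combined.get, reverse=True)
        return [c for c in ranked if combined[c] > self.first_threshold]

    def get_ranking_result(self, first_collection: List[str], values) -> List[str]:
        """Retrieval ranking result acquisition module."""
        return sorted(
            first_collection,
            key=lambda c: values[c][self.final_metric],
            reverse=True,
        )

    def run(self, target: str, candidates: List[str]) -> List[str]:
        """Chain the modules; the inputs play the role of the acquisition module."""
        values = self.get_association_values(target, candidates)
        first_collection = self.build_first_collection(values)
        return self.get_ranking_result(first_collection, values)
```

Passing the metrics in as plain callables keeps the sketch independent of any particular similarity algorithm.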
13. A computer device, characterized by comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the method for determining the retrieval ranking of texts according to any one of claims 1-11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, and the computer instructions are used to cause the computer to perform the method for determining the retrieval ranking of texts according to any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910082601.4A CN109857856B (en) | 2019-01-28 | 2019-01-28 | Text retrieval sequencing determination method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910082601.4A CN109857856B (en) | 2019-01-28 | 2019-01-28 | Text retrieval sequencing determination method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109857856A (en) | 2019-06-07 |
CN109857856B (en) | 2020-05-22 |
Family ID: 66896589
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201910082601.4A CN109857856B (en) | Active | 2019-01-28 | 2019-01-28 | Text retrieval sequencing determination method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109857856B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112115340A (en) * | 2020-09-14 | 2020-12-22 | 深圳市欢太科技有限公司 | Search strategy selection method, mobile terminal and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130173610A1 (en) * | 2011-12-29 | 2013-07-04 | Microsoft Corporation | Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches |
CN105468790A (en) * | 2015-12-30 | 2016-04-06 | 北京奇艺世纪科技有限公司 | Comment information retrieval method and comment information retrieval apparatus |
CN106649650A (en) * | 2016-12-10 | 2017-05-10 | 宁波思库网络科技有限公司 | Demand information two-way matching method |
Non-Patent Citations (1)
Title |
---|
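Hu Xiaoguang (胡晓光): "Research on Language-Model-Based Text Retrieval Techniques and Re-ranking of Retrieval Results", China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology * |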
Also Published As
Publication number | Publication date |
---|---|
CN109857856B (en) | 2020-05-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109583468A (en) | Training sample acquisition methods, sample predictions method and corresponding intrument | |
CN107122382A (en) | A kind of patent classification method based on specification | |
CN109859054A (en) | Network community method for digging, device, computer equipment and storage medium | |
CN108021945A (en) | A kind of transformer state evaluation model method for building up and device | |
CN104317891B (en) | A kind of method and device that label is marked to the page | |
CN106777282B (en) | The sort method and device of relevant search | |
CN104517052B (en) | Invasion detection method and device | |
CN106485529A (en) | The sort method of advertisement position and device | |
CN108984708A (en) | Dirty data recognition methods and device, data cleaning method and device, controller | |
CN112463859B (en) | User data processing method and server based on big data and business analysis | |
CN106919957A (en) | The method and device of processing data | |
CN109598307A (en) | Data screening method, apparatus, server and storage medium | |
CN108733791A (en) | network event detection method | |
CN110232405A (en) | Method and device for personal credit file | |
CN109634997A (en) | A kind of acquisition methods, device and the electronic equipment of unusual fluctuation channel | |
CN103309857A (en) | Method and equipment for determining classified linguistic data | |
CN110209551A (en) | A kind of recognition methods of warping apparatus, device, electronic equipment and storage medium | |
CN107944487B (en) | Crop breeding variety recommendation method based on mixed collaborative filtering algorithm | |
CN109857856A (en) | A kind of retrieval ordering of text determines method and system | |
CN107908649A (en) | A kind of control method of text classification | |
CN106919587A (en) | Application program search system and method | |
CN107679174A (en) | Construction method, device and the server of Knowledge Organization System | |
CN106611339B (en) | Seed user screening method, and product user influence evaluation method and device | |
CN106815277A (en) | The appraisal procedure and device of search engine optimization | |
CN108170664A (en) | Keyword expanding method and device based on emphasis keyword |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||