CN109857856A - Method and system for determining the retrieval ranking of texts - Google Patents

Method and system for determining the retrieval ranking of texts

Info

Publication number
CN109857856A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910082601.4A
Other languages
Chinese (zh)
Other versions
CN109857856B (en)
Inventor
郭永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Enjoy Wisdom Technology Co Ltd
Original Assignee
Beijing Enjoy Wisdom Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Enjoy Wisdom Technology Co Ltd
Priority to CN201910082601.4A
Publication of CN109857856A
Application granted
Publication of CN109857856B
Legal status: Active

Abstract

The invention discloses a method and system for determining the retrieval ranking of texts. The method comprises the following steps: obtaining a target text to be retrieved and a candidate text set; obtaining a relevance value between the target text and each text in the candidate text set; ranking the texts in the candidate text set according to a first preset rule using the relevance values, and constructing a first text set according to a first preset screening condition; and ranking the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text. The embodiments of the invention combine the strengths of several algorithms, improving the precision of patent search results and the user's retrieval efficiency.

Description

Method and system for determining the retrieval ranking of texts
Technical field
The present invention relates to the field of data processing, and in particular to a method and system for determining the retrieval ranking of texts.
Background
In the prior art, when documents (such as journal articles or patents) are retrieved, several different kinds of similarity calculation methods (such as structural analysis, semantic analysis, and keyword analysis) are available, and ranking the candidate documents with each of them can produce different ranking results. Moreover, a single kind of similarity calculation method can itself give different results: with semantic analysis, for example, the similarity computed between a pair of patents in their original language differs from the similarity computed between their translations. For the same target patent, different solutions therefore rank the candidate patents by similarity in many different ways, each with its own ranking rule, and the resulting rankings can differ greatly. The most relevant patent that the user actually needs may be ranked in the top 10 by one solution and beyond position 1000 by another. In this situation the user cannot tell which search result is best, and browsing the various rankings one by one severely reduces retrieval efficiency.
Summary of the invention
Therefore, the present invention provides a method and system for selecting and ranking retrieved documents, which overcomes the deficiency in the prior art that differences between document-retrieval rankings prevent the best search result from being obtained.
In a first aspect, an embodiment of the present invention provides a method for determining the retrieval ranking of texts, comprising the following steps: obtaining a target text to be retrieved and a candidate text set; obtaining a relevance value between the target text and each text in the candidate text set; ranking the texts in the candidate text set according to a first preset rule using the relevance values, and constructing a first text set according to a first preset screening condition; and ranking the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text.
In one embodiment, the step of ranking the texts in the first text set according to the second preset rule to obtain the retrieval ranking result for the target text comprises: ranking the texts in the first text set according to a third preset rule, excluding noise texts according to a second preset screening condition, and constructing a second text set; and ranking the texts in the second text set according to the second preset rule to obtain the retrieval ranking result for the target text.
In one embodiment, the step of obtaining the relevance value between the target text and each text in the candidate text set comprises: calculating the relevance value between the target text and each text in the candidate text set with each of N preset relevance measurement algorithms, where N is a positive integer greater than or equal to 2.
In one embodiment, the step of ranking the texts in the candidate text set according to the first preset rule using the relevance values and constructing the first text set according to the first preset screening condition comprises: ranking the texts in the candidate text set separately by the relevance values obtained from each of the N preset relevance measurement algorithms, to obtain N ranked sets; and combining the N ranked sets into a composite ranking according to the first preset rule and constructing the first text set according to the first preset screening condition. Preferably, the step of constructing the first text set according to the first preset screening condition comprises: analyzing, according to a preset strategy, the relevance values between the target text and each text in the candidate text set calculated by the N preset relevance measurement algorithms, to obtain an analysis result; and judging from the analysis result whether each text in the candidate text set meets a preset condition, and selecting the texts that meet the preset condition into the first text set.
In one embodiment, the step of combining the N ranked sets into a composite ranking according to the first preset rule and constructing the first text set according to the first preset screening condition comprises: assigning a weight, according to the first preset rule, to each of the relevance values obtained from the N preset measurement algorithms; multiplying each relevance value by its weight and summing the products to obtain a composite relevance value; determining the composite ranking from the size of the composite relevance value; and selecting the texts whose composite relevance value is greater than a first preset composite relevance threshold into the first text set.
In one embodiment, the step of ranking the texts in the candidate text set according to the first preset rule using the relevance values and constructing the first text set according to the first preset screening condition comprises: ranking the texts by size separately for the relevance values obtained from each of the N preset measurement algorithms, to obtain N ranked sets; and selecting into the first text set the texts of the N ranked sets whose relevance value is greater than a first relevance threshold and/or whose rank position is smaller than a first rank threshold.
In one embodiment, the step of ranking the texts in the first text set according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text set comprises: for the texts in the first text set, assigning a weight, according to the third preset rule, to each of the relevance values obtained from the N preset measurement algorithms; multiplying each relevance value by its weight and summing the products to obtain a composite relevance value; determining the composite ranking from the size of the composite relevance value; treating the texts whose composite relevance value is less than a second preset composite relevance threshold as noise texts; and removing the noise texts from the first text set to construct the second text set.
In one embodiment, the step of ranking the texts in the first text set according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text set comprises: obtaining a second relevance value between each text in the first text set and the target text with the N preset relevance measurement algorithms; ranking the texts by size separately for each set of second relevance values, to obtain N ranked sets; treating as noise texts the texts of the N ranked sets whose relevance value is less than a second relevance threshold and/or whose rank position is greater than a second rank threshold; and removing the noise texts from the first text set to construct the second text set.
In one embodiment, the second preset rule ranks the texts in the first text set by the size of their relevance value with respect to the target text, or by the rank position of that relevance value, to obtain the retrieval ranking result for the target text. Preferably, the N relevance measurement algorithms are used to obtain the relevance values between preset samples and each text in the candidate text set; the recall of the preset samples' relevance values on preset intervals is obtained; a corresponding weight is set for each of the N relevance measurement algorithms according to the recall on the preset intervals; a composite ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the composite ranking values. Preferably, the N rank orders of the target text's relevance values are obtained from the N relevance measurement algorithms; a composite ranking value is obtained for each text in the candidate text set from the N rank orders; and the retrieval ranking result for the target text is obtained from the composite ranking values. Preferably, the N relevance measurement algorithms are used to obtain the relevance values between preset samples and each text in the candidate text set, and the rank position, by relevance value, of each preset sample's most relevant text within the candidate text set is obtained; a corresponding weight is set for each of the N relevance measurement algorithms according to the average recall of the preset samples' rank positions or the recall on preset intervals; a composite ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the composite ranking values.
In one embodiment, the second preset rule ranks the texts in the second text set by the size of their relevance value with respect to the target text, or by the rank position of that relevance value, to obtain the retrieval ranking result for the target text. Preferably, the N relevance measurement algorithms are used to obtain the relevance values between preset samples and each text in the candidate text set; the recall of the preset samples' relevance values on preset intervals is obtained; a corresponding weight is set for each of the N relevance measurement algorithms according to the recall on the preset intervals; a composite ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the composite ranking values. Preferably, the N rank orders of the target text's relevance values are obtained from the N relevance measurement algorithms; a composite ranking value is obtained for each text in the candidate text set from the N rank orders; and the retrieval ranking result for the target text is obtained from the composite ranking values. Preferably, the N relevance measurement algorithms are used to obtain the relevance values between preset samples and each text in the candidate text set, and the rank position, by relevance value, of each preset sample's most relevant text within the candidate text set is obtained; a corresponding weight is set for each of the N relevance measurement algorithms according to the average recall of the preset samples' rank positions or the recall on preset intervals; a composite ranking value is obtained for each text in the candidate text set; and the retrieval ranking result for the target text is obtained from the composite ranking values.
In one embodiment, the step of obtaining the relevance value between the target text and each text in the candidate text set comprises: using one preset relevance measurement algorithm or N preset relevance measurement algorithms, and based on a variant text corresponding to the target text, obtaining the relevance value between the target text and each text in the candidate text set, or between the target text and the variant text corresponding to each text in the candidate text set.
In a second aspect, an embodiment of the present invention provides a system for determining the retrieval ranking of texts, comprising: a target text and candidate text set obtaining module, for obtaining a target text to be retrieved and a candidate text set; a relevance value obtaining module, for obtaining the relevance value between the target text and each text in the candidate text set; a first text set construction module, for ranking the texts in the candidate text set according to a first preset rule using the relevance values and constructing a first text set according to a first preset screening condition; and a retrieval ranking result obtaining module, for ranking the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text.
In a third aspect, an embodiment of the present invention provides a computer device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the method for determining the retrieval ranking of texts provided in the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method for determining the retrieval ranking of texts provided in the first aspect of the present invention.
The technical solution of the present invention has the following advantages:
The method and system for determining the retrieval ranking of texts provided by the present invention first obtain a target text to be retrieved, which may be a patent, and a candidate text set; then obtain the relevance value between the target text and each text in the candidate text set, where the relevance may be a similarity; then rank the texts in the candidate text set according to a first preset rule using the relevance values and construct a first text set according to a first preset screening condition; and finally rank the texts in the first text set according to a second preset rule to obtain the retrieval ranking result for the target text. In the prior art, the user cannot tell which search result is best and has to browse the various rankings one by one, so retrieval efficiency is low. By contrast, the method provided by the embodiments of the present application combines the strengths of several algorithms, improving the precision of patent search results and the user's retrieval efficiency.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a specific example of the method for determining the retrieval ranking of texts provided in an embodiment of the present invention;
Fig. 2 is a flow diagram of the steps of one embodiment of constructing the first text set in the method for determining the retrieval ranking of texts provided in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the precision of three algorithms in each ranking interval provided in an embodiment of the present invention;
Fig. 4 is a flow chart of another specific example of the method for determining the retrieval ranking of texts provided in an embodiment of the present invention;
Fig. 5 is a flow diagram of the steps of one embodiment of constructing the second text set provided in an embodiment of the present invention;
Fig. 6 is a flow diagram of the steps of another embodiment of constructing the second text set provided in an embodiment of the present invention;
Fig. 7 is a composition diagram of a specific example of the system for determining the retrieval ranking of texts provided in an embodiment of the present invention;
Fig. 8 is a composition diagram of a specific example of the computer device provided in an embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
An embodiment of the present invention provides a method for determining the retrieval ranking of texts. The method can be applied to an electronic device, which may be a server or a terminal. As shown in Fig. 1, the method comprises the following steps:
Step S1: obtain a target text to be retrieved and a candidate text set.
In practical applications, the target text to be retrieved includes, but is not limited to, technical literature, patents, and academic papers. In this embodiment the target text is illustrated with a patent, and the candidate text set may be a candidate patent set. The server can receive the target patent to be retrieved that the user enters at the user terminal and obtain a candidate patent set from a patent database. Depending on the usage scenario, the candidate set may be the patents of the full database, or a patent set customized in some other way, such as a set containing only Chinese patents, or a subset of all patents of one technical field in the patent database. It may contain, for example, 10,000 patents; the number of patents in the candidate patent set is given only as an illustration and is not limiting.
Step S2: obtain the relevance value between the target text and each text in the candidate text set.
In practical applications, the relevance value between the target text and each text in the candidate text set may be any measure that indicates the degree of association between the two, such as similarity, novelty, importance, or action value. This embodiment is illustrated with similarity: the similarity between the target patent and each text in the candidate patent set can be obtained with N similarity algorithms, where N is not less than 2. In this embodiment, similarity values are obtained with three similarity calculation methods, namely structural analysis, keyword analysis, and semantic analysis. The invention is not limited to these; other embodiments may choose any two or more similarity calculation methods.
In practical applications, the step of obtaining the relevance value between the target text and each text in the candidate text set comprises: using one preset relevance measurement algorithm or N preset relevance measurement algorithms, and based on a variant text corresponding to the target text, obtaining the relevance value between the target text and each text in the candidate text set, or between the target text and the variant text corresponding to each text in the candidate text set.
A variant text in the embodiments of the present invention is a text in another form of expression that is associated with the original text, for example: a translation of the original text into another language; an abridgment, rewrite, or summary of the original text; part of the content of the original text (for a patent text, the abstract, the claims, or all or part of the description of the target text may be chosen); or another text corresponding to the content of the original text (for a patent text, a patent-family member of the original patent). These are examples only and are not limiting.
In one embodiment, the relevance value between the target text and each text in the candidate text set can be obtained with the N preset relevance measurement algorithms based on the Chinese text of the target text or on its English translation. For example, the N preset similarity algorithms can be used to calculate the similarity between the English text of the target patent and each patent in the candidate text set, or between the Chinese translation of an English patent and each patent in the candidate text set, thereby obtaining different rankings.
In one embodiment, the relevance values can be obtained with a single relevance measurement algorithm, which calculates both the relevance value between the target patent and each patent in the candidate patent set and the similarity value between the target patent and the translation, in another language, of each text in the candidate patent set, thereby obtaining different rankings.
In another embodiment, the relevance value between the target text and each text in the candidate text set can also be obtained with the N preset relevance measurement algorithms based on different portions of the target text. For example, the N preset similarity algorithms can be used to calculate the similarity between the abstract, the claims, or the description of the target patent and the abstract, the claims, or all or part of the description of each text in the candidate patent set, thereby obtaining different rankings.
Step S3: rank the texts in the candidate text set according to the first preset rule using the relevance values, and construct the first text set according to the first preset screening condition.
In this embodiment, the first text set is a preliminary patent set constructed by a preliminary screening of the patents in the candidate patent set.
In one embodiment, as shown in Fig. 2, constructing the preliminary patent set in step S3 may specifically comprise the following steps:
Step S31: rank the texts in the candidate text set separately by the relevance values obtained from each of the N preset relevance measurement algorithms, to obtain N ranked sets.
In one embodiment, the three similarity algorithms described above (structural analysis, semantic analysis, and keyword analysis) are each used to rank the patents in the preset patent set by their similarity to the target patent, from largest to smallest, giving three patent sets X, Y, and Z whose orderings correspond to the three similarity algorithms.
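Step S31 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the candidate ids, the score values, and the helper name `rank_by_similarity` are hypothetical, and the similarity algorithms themselves are assumed to have already produced one score per candidate.

```python
from typing import Dict, List


def rank_by_similarity(scores: Dict[str, float]) -> List[str]:
    """Return candidate ids ordered by similarity, highest first."""
    # Ties are broken by id so that the ordering is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical similarity scores of four candidates against one target patent.
structural = {"P1": 0.91, "P2": 0.40, "P3": 0.75, "P4": 0.10}
semantic = {"P1": 0.55, "P2": 0.80, "P3": 0.60, "P4": 0.20}
keyword = {"P1": 0.30, "P2": 0.65, "P3": 0.90, "P4": 0.05}

# One ranked set per algorithm, corresponding to X, Y and Z in the text.
X = rank_by_similarity(structural)
Y = rank_by_similarity(semantic)
Z = rank_by_similarity(keyword)
```

The three lists contain the same candidates in different orders, which is the disagreement between algorithms that the later screening steps try to reconcile.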
Step S32: combine the N ranked sets into a composite ranking according to the first preset rule, and construct the first text set according to the first preset screening condition.
In one embodiment, a weight is assigned, according to the first preset rule, to each of the relevance values obtained from the N preset measurement algorithms; each relevance value is multiplied by its weight and the products are summed to obtain a composite relevance value; the composite ranking is determined from the size of the composite relevance value; and the texts whose composite relevance value is greater than a first preset composite relevance threshold are selected into the first text set.
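The weighted combination in this embodiment can be sketched as below. The weights, the threshold, the scores, and the function names are assumptions chosen for illustration; the source does not fix how the first preset rule assigns weights.

```python
from typing import Dict, List, Sequence


def composite_relevance(per_algorithm: Sequence[Dict[str, float]],
                        weights: Sequence[float]) -> Dict[str, float]:
    """Weighted sum of the per-algorithm relevance values of each candidate."""
    if len(per_algorithm) != len(weights):
        raise ValueError("one weight is needed per algorithm")
    candidate_ids = per_algorithm[0].keys()
    return {
        pid: sum(w * scores[pid] for w, scores in zip(weights, per_algorithm))
        for pid in candidate_ids
    }


def build_first_text_set(composite: Dict[str, float],
                         threshold: float) -> List[str]:
    """Keep candidates above the threshold, ordered by composite value."""
    kept = [pid for pid, value in composite.items() if value > threshold]
    return sorted(kept, key=lambda pid: (-composite[pid], pid))


# Hypothetical scores from three algorithms and hypothetical weights.
structural = {"P1": 0.91, "P2": 0.40, "P3": 0.75, "P4": 0.10}
semantic = {"P1": 0.55, "P2": 0.80, "P3": 0.60, "P4": 0.20}
keyword = {"P1": 0.30, "P2": 0.65, "P3": 0.90, "P4": 0.05}

composite = composite_relevance([structural, semantic, keyword], [0.5, 0.3, 0.2])
first_set = build_first_text_set(composite, threshold=0.6)
```

With these numbers, P3 and P1 exceed the threshold and P2 and P4 are screened out; changing the weights changes which candidates survive, which is why the weight assignment rule matters.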
In one embodiment, known most relevant texts, for example the X documents cited during examination, can be used to compare and verify how well the three similarity algorithms perform, to find the interval in which each calculation method has an advantage, and from that to choose how many patents each calculation method contributes to the preliminary set. For example, 100 patents were sampled, ranked with each of the three similarity algorithms, and compared; 3 groups of the data are shown for illustration in Table 1:
Table 1
As shown in Table 1, for a sampled target patent (such as CN104983351A) and its most relevant patent (the X document cited by the examiner for that patent; for CN104983351A the most relevant patent is CN203247669U), keyword analysis (algorithm 1), structural analysis (algorithm 2), and semantic analysis (algorithm 3) are each used to obtain the rank position of the most relevant patent in the full database.
For the 100 sampled patents, the number of most relevant documents that fall in each ranking interval is obtained by counting; Table 2 shows these counts per ranking interval, and Fig. 3 shows the precision comparison curves of the three similarity algorithms formed from the data in Table 2. Recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library, and measures how completely a retrieval system finds relevant documents. The recall of the most relevant documents in each ranking interval, formed from the data in Table 2, is shown in Table 3:
Table 2
Table 3
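The counting behind Tables 2 and 3 can be sketched as follows. The rank values and interval boundaries are invented for illustration and are not the patent's data; the sketch also assumes each sampled target patent has exactly one most relevant document, so recall in an interval is the share of samples whose most relevant document is ranked inside it.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (lowest rank, highest rank)


def interval_counts(ranks: List[int],
                    intervals: List[Interval]) -> Dict[Interval, int]:
    """Number of most relevant documents ranked inside each interval."""
    return {(lo, hi): sum(lo <= r <= hi for r in ranks) for lo, hi in intervals}


def interval_recall(ranks: List[int],
                    intervals: List[Interval]) -> Dict[Interval, float]:
    """Share of the sampled patents whose most relevant document falls in each interval."""
    counts = interval_counts(ranks, intervals)
    return {interval: count / len(ranks) for interval, count in counts.items()}


# Hypothetical ranks of the most relevant document for ten sampled target
# patents under one similarity algorithm.
ranks_algorithm_1 = [3, 8, 15, 40, 70, 120, 5, 9, 600, 45]
intervals = [(1, 10), (11, 50), (51, 100), (101, 1000)]

counts = interval_counts(ranks_algorithm_1, intervals)
recall = interval_recall(ranks_algorithm_1, intervals)
```

Repeating this for each algorithm gives one recall profile per algorithm, which can then be compared interval by interval.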
Based on the above statistics, for a patent J in the candidate patent set, if the similarities to the target patent O calculated by the three algorithms are Rx, Ry, and Rz, the recall of the interval in which each of Rx, Ry, and Rz falls can be calculated and used to determine the weight for that interval. If the weights are W1, W2, and W3, the composite similarity value J' of the patent is: J' = Rx*W1 + Ry*W2 + Rz*W3. The candidates are re-ranked by this value, and the patents whose composite similarity value is greater than a preset value are selected into the preliminary patent set.
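One possible reading of this weighting is sketched below. The source does not state how recall is turned into W1, W2, and W3, so the sketch makes two labeled assumptions: the weight of an algorithm for patent J is the recall of the rank interval in which J falls under that algorithm, and the three weights are normalized to sum to 1. The recall tables and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

# (lowest rank, highest rank, recall) rows for one algorithm; values are invented.
RecallTable = List[Tuple[int, int, float]]


def recall_for_rank(rank: int, table: RecallTable) -> float:
    """Recall of the interval containing the rank, or 0.0 if no interval matches."""
    for lo, hi, recall in table:
        if lo <= rank <= hi:
            return recall
    return 0.0


def composite_similarity(similarities: Sequence[float],
                         ranks: Sequence[int],
                         tables: Sequence[RecallTable]) -> float:
    """J' = Rx*W1 + Ry*W2 + Rz*W3 with recall-derived, normalized weights (assumption)."""
    raw = [recall_for_rank(r, t) for r, t in zip(ranks, tables)]
    total = sum(raw)
    if total == 0:
        return 0.0
    weights = [w / total for w in raw]
    return sum(s * w for s, w in zip(similarities, weights))


tables = [
    [(1, 10, 0.4), (11, 100, 0.3)],  # algorithm 1
    [(1, 10, 0.2), (11, 100, 0.5)],  # algorithm 2
    [(1, 10, 0.1), (11, 100, 0.2)],  # algorithm 3
]
# Patent J: similarities Rx, Ry, Rz and its rank under each algorithm.
j_prime = composite_similarity([0.8, 0.6, 0.5], [5, 20, 50], tables)
```

Under this reading an algorithm contributes more to J' in the rank range where it has historically placed the most relevant document, which matches the idea of an advantageous interval per algorithm.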
In one embodiment, the texts are ranked by size separately for the relevance values obtained from each of the N preset measurement algorithms, giving N ranked sets; the texts of the N ranked sets whose relevance value is greater than a first relevance threshold and/or whose rank position is smaller than a first rank threshold are selected into the first text set.
In one embodiment, screening can be based on the sum of rank positions: the rank positions Kx, Ky, and Kz of patent K in the three sets X, Y, and Z are obtained and summed, and if the sum is less than a preset threshold the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if ∑(Kx, Ky, Kz) is within 500 (this range may default to an empirical value or be set by the user), the patent is selected into the preliminary patent set.
In one embodiment, screening can be based on the average rank position: the average mean(Kx, Ky, Kz) of the rank positions of patent K in the three sets X, Y, and Z is obtained, and if it is less than a preset threshold the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if mean(Kx, Ky, Kz) is within 100 (this range may default to an empirical value or be set by the user), the patent is selected into the preliminary patent set.
In one embodiment, screening can be based on the minimum rank position: the rank positions Kx, Ky, and Kz of patent K in the three sets X, Y, and Z are obtained and the smallest, min(Kx, Ky, Kz), is found; if it is less than a preset threshold the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if min(Kx, Ky, Kz) is within the top 50 (this range may default to an empirical value or be set by the user), the patent is selected into the preliminary patent set.
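The sum, average, and minimum criteria above differ only in how the three rank positions are aggregated, so they can share one sketch. The ranked sets, thresholds, and function names are hypothetical, and the strict `<` comparison is a choice; an inclusive bound would be equally consistent with the text.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# The three aggregation rules described in the text.
AGGREGATES: Dict[str, Callable[[Sequence[int]], float]] = {
    "sum": sum,
    "mean": mean,
    "min": min,
}


def ranks_of(patent_id: str, orderings: List[List[str]]) -> List[int]:
    """1-based rank position of the patent in each ranked set."""
    return [ordering.index(patent_id) + 1 for ordering in orderings]


def is_selected(ranks: Sequence[int], rule: str, threshold: float) -> bool:
    """True if the aggregated rank is below the preset threshold."""
    return AGGREGATES[rule](ranks) < threshold


# Hypothetical ranked sets X, Y, Z over four candidates.
X = ["A", "B", "C", "D"]
Y = ["B", "A", "D", "C"]
Z = ["C", "B", "A", "D"]

ranks_a = ranks_of("A", [X, Y, Z])
ranks_d = ranks_of("D", [X, Y, Z])
```

For example, `is_selected(ranks_a, "sum", 7)` accepts candidate A, whose ranks sum to 6, while the same call with `ranks_d` rejects candidate D, whose ranks sum to 11. The minimum rule is the most permissive of the three because one good rank is enough.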
In one embodiment, screening may require that several ranks fall below a preset rank threshold at the same time: obtain the ranks Kx, Ky, Kz of patent K in the three sets X, Y, Z; if two or more of Kx, Ky, Kz are within the top n, the patent is selected into the preliminary patent set. In practical applications, as the number of conditions that must be met increases, the preset threshold n can be raised accordingly.
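A minimal sketch of this "several ranks within the top n" rule, assuming ranks start at 1 and that rank n itself counts as being within the top n. The function name and the `min_votes` parameter are illustrative only.

```python
def in_top_n_votes(ranks, n, min_votes=2):
    """Keep a patent if at least `min_votes` of its ranks are within the top n."""
    votes = sum(1 for r in ranks if r <= n)
    return votes >= min_votes
```

For example, `in_top_n_votes((30, 45, 900), n=50)` keeps the patent because two of the three algorithms place it in the top 50, whereas `in_top_n_votes((30, 400, 900), n=50)` does not.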
In one embodiment, the ranks Kx, Ky, Kz of patent K in the three sets X, Y, Z are obtained, and the ranks are summed in pairs to give Σ(Kx, Ky), Σ(Kx, Kz) and Σ(Ky, Kz). The minimum of these sums is taken, and if it is below a preset threshold the patent is selected into the preliminary patent set. For example, for a patent K in the candidate text set, if min(Σ(Kx, Ky), Σ(Kx, Kz), Σ(Ky, Kz)) is less than 100 (a limit set by default from experience, or set by the user), the patent is selected into the preliminary patent set.
In one embodiment, the ranks Kx, Ky, Kz of patent K in the three sets X, Y, Z are obtained, the largest (worst) rank is discarded and the remaining ranks are summed; if the result is below a preset threshold, the patent is selected into the preliminary patent set. For example, for a patent K in the candidate set, if Σ(Kx, Ky, Kz) - max(Kx, Ky, Kz) is less than 70 (a limit set by default from experience, or set by the user), the patent is selected into the preliminary set.
In one embodiment, the ranks Kx, Ky, Kz of patent K in the three sets X, Y, Z are obtained, the largest rank max(Kx, Ky, Kz) is discarded, and the average of the other two ranks is taken; if the average is below a preset threshold, the patent is selected into the preliminary patent set. For example, for a patent K in the candidate set, if (Σ(Kx, Ky, Kz) - max(Kx, Ky, Kz))/2 is less than 70 (a limit set by default from experience, or set by the user), the patent is selected into the preliminary patent set.
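The pairwise-sum rule and the two "discard the worst rank" rules can be sketched as below. The helper names are hypothetical. Note that with exactly three ranks, the smallest pairwise sum and the sum with the worst rank removed are the same number, so those two rules differ only in the threshold chosen.

```python
from itertools import combinations


def best_pair_sum(ranks):
    """Smallest sum over all pairs of ranks."""
    return min(a + b for a, b in combinations(ranks, 2))


def sum_without_worst(ranks):
    """Sum of the ranks after discarding the largest (worst) one."""
    return sum(ranks) - max(ranks)


def mean_without_worst(ranks):
    """Average of the ranks after discarding the largest (worst) one."""
    return sum_without_worst(ranks) / (len(ranks) - 1)


k = (20, 35, 800)  # ranks (Kx, Ky, Kz), 1 = best
selected = sum_without_worst(k) < 70 and mean_without_worst(k) < 70
```

Here one algorithm ranks the patent very poorly (800), but it is still selected because the other two ranks are good.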
In one embodiment, the association degree values between the target text and each text in the candidate text set, computed with the N preset association metric algorithms, are analyzed according to a preset strategy to obtain an analysis result. The analysis result is used to judge whether each text in the candidate text set meets a preset condition, and the texts that meet the preset condition are selected into the first text set. For example, screening may be done by gap analysis: obtain the ranks Kx, Ky, Kz of patent K in the three sets X, Y, Z, take the maximum rank max(Kx, Ky, Kz) and the minimum rank min(Kx, Ky, Kz), and compute a rank gap coefficient. The rank gap coefficient may be computed by any of the following four optional schemes:
Optional scheme 1: C1 = (max(Kx, Ky, Kz) - min(Kx, Ky, Kz)) / max(Kx, Ky, Kz);
Optional scheme 2: C2 = (max(Kx, Ky, Kz) - min(Kx, Ky, Kz)) / min(Kx, Ky, Kz);
Optional scheme 3: C3 = max(Kx, Ky, Kz) / min(Kx, Ky, Kz);
Optional scheme 4: C4 = min(Kx, Ky, Kz) / max(Kx, Ky, Kz).
The rank gap coefficient can be obtained by any of the four optional schemes above (given as examples only, not as a limitation). It is compared with a preset threshold to judge whether the patent is a high-gap patent, meaning that two different ranking methods place it far apart. If it is, a preset strategy decides whether the patent is brought into the preliminary patent set. The preset strategy may be a patent selection scheme derived from big-data statistical analysis and practical experience. For example, statistical analysis and practical experience may indicate that when the rank Kx of patent K in set X is much smaller than its rank Ky in set Y, the patent is selected into the preliminary set if it meets condition 1 (for example, its technology belongs to technical field F1), and is not selected into the preliminary patent set if it meets condition 2 (for example, its technology belongs to technical field F2).
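The four optional rank gap coefficients can be computed as in the sketch below, where the keys C1 to C4 follow the order of the four schemes. The function names and the C3 threshold of 10 are assumptions for illustration; the text does not fix a threshold or say which coefficient is compared against it.

```python
def rank_gap_coefficients(ranks):
    """Compute the four optional gap coefficients from ranks (1 = best)."""
    hi, lo = max(ranks), min(ranks)
    return {
        "C1": (hi - lo) / hi,  # scheme 1: gap relative to the worst rank
        "C2": (hi - lo) / lo,  # scheme 2: gap relative to the best rank
        "C3": hi / lo,         # scheme 3: worst rank over best rank
        "C4": lo / hi,         # scheme 4: best rank over worst rank
    }


def is_high_gap(ranks, c3_threshold=10.0):
    """Flag a patent as high-gap using scheme 3 (threshold is hypothetical)."""
    return rank_gap_coefficients(ranks)["C3"] >= c3_threshold


coeffs = rank_gap_coefficients((5, 50, 400))
```

Because ranks start at 1, none of the divisions can hit zero. A patent flagged by `is_high_gap` would then be passed to the field-dependent selection strategy described above.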
In one embodiment, a preset minimum relevance threshold Rtx, Rty, Rtz is given for each relevance calculation method, and only patents above the minimum threshold can be selected into the preliminary patent set.
In one embodiment, a composite threshold Rt1 is preset. For patent K, the similarities Rx, Ry, Rz relative to the target patent O are obtained with the three relevance calculation methods, and the maximum similarity max(Rx, Ry, Rz) is taken. If max(Rx, Ry, Rz) is greater than the composite threshold Rt1, patent K is selected into the preliminary set.
In one embodiment, a composite threshold Rt2 is preset. For patent K, the similarities Rx, Ry, Rz relative to the target patent O are obtained with the three relevance calculation methods, and the average similarity mean(Rx, Ry, Rz) is taken. If mean(Rx, Ry, Rz) is greater than the composite threshold Rt2, patent K of the candidate text set is selected into the preliminary set.
In one embodiment, a minimum relevance threshold is set for each similarity algorithm; if two or more of the similarities Rx, Ry, Rz of patent K exceed their preset thresholds, the patent is brought into the preliminary patent set.
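The four similarity-threshold rules above can be sketched in one function. All names and default threshold values here are hypothetical. The first rule is interpreted as "every similarity must exceed its own minimum", which is an assumption, since the text does not say whether one or all of the thresholds must be met.

```python
def passes_similarity_screen(sims, rule, per_algo_min=(0.6, 0.6, 0.6),
                             rt1=0.9, rt2=0.8):
    """Apply one threshold rule to the similarities (Rx, Ry, Rz) of a patent."""
    above = [s > t for s, t in zip(sims, per_algo_min)]
    if rule == "all_above_min":    # per-algorithm minimum thresholds
        return all(above)
    if rule == "max_above_rt1":    # best similarity against Rt1
        return max(sims) > rt1
    if rule == "mean_above_rt2":   # average similarity against Rt2
        return sum(sims) / len(sims) > rt2
    if rule == "two_above_min":    # at least two algorithms above minimum
        return sum(above) >= 2
    raise ValueError(f"unknown rule: {rule}")


k_sims = (0.90, 0.55, 0.96)
```

With `k_sims`, the patent fails the strict per-algorithm rule because of the 0.55 value, yet passes the maximum, average and two-out-of-three rules.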
The above are optional embodiments given as examples only and are not limiting. In other embodiments, provided they do not conflict with each other, two or more of the above selection methods may be applied simultaneously to the patents in the candidate patent set to construct the preliminary patent set.
Step S4: sort each text in the first text set according to the second preset rule to obtain the retrieval ranking result of the target text. In embodiments of the present invention, the second preset rule is set according to the size of the association degree value relative to the target text, or according to the rank of that value.
In one embodiment, as shown in Figure 4, step S4 may specifically include the following steps:
Step S41: sort each text in the first text set according to a third preset rule, exclude noise texts according to a second preset screening condition, and construct the second text set.
In embodiments of the present invention, the second text set is the similar patent set obtained after the user further screens and denoises the preliminary patent set.
In one embodiment, as shown in Figure 5, the process of constructing the similar patent set may specifically include the following steps:
Step S411: for the texts in the first text set, assign weights according to the third preset rule to the association degree values obtained with the N preset metric algorithms.
In practical applications, embodiments of the present invention use the N preset metric algorithms to obtain, relative to the target patent, the novelty degree, similarity and the like of each patent in the preliminary patent set. The third preset rule may follow the first preset rule used when constructing the first text set, with the preset values adjusted as appropriate, or may be another preset rule, for example one set manually from experience. These are examples only and are not limiting.
Step S412: multiply each association degree value by its corresponding weight and add the products to obtain the composite association degree value.
In embodiments of the present invention, the weight corresponding to each association degree value may be determined from the recall rate of each analysis algorithm in the interval concerned, as shown in Figure 3, which gives the weight proportion for each interval.
Step S413: determine the integrated ranking result according to the size of the composite association degree values.
Step S414: treat texts whose composite association degree value is below a second preset composite association degree threshold as noise texts.
In embodiments of the present invention, patents whose composite association degree value is below a preset value, or whose rank is greater than a preset value, may be treated as noise patents. This is an example only and is not limiting.
Step S415: remove the noise texts from the first text set to construct the second text set.
In embodiments of the present invention, the similar patent set is constructed by removing the noise patents from the preliminary patent set.
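Steps S411 to S415 can be sketched as a weighted sum followed by a threshold filter. The function name, the weights and the threshold below are assumptions chosen for illustration, and texts exactly at the threshold are kept.

```python
def remove_noise_by_composite(scores, weights, threshold):
    """scores maps a patent id to its association degree values (R1..RN).

    Returns the ids that survive, best first, plus the composite values.
    """
    # S411/S412: weight each value and add the products.
    composite = {
        pid: sum(r * w for r, w in zip(values, weights))
        for pid, values in scores.items()
    }
    # S414/S415: drop texts below the composite threshold (noise).
    kept = [pid for pid, c in composite.items() if c >= threshold]
    # S413: order the remaining texts by composite value, largest first.
    kept.sort(key=lambda pid: composite[pid], reverse=True)
    return kept, composite


first_set = {
    "A": (0.90, 0.85, 0.96),
    "B": (0.40, 0.50, 0.30),
    "C": (0.70, 0.80, 0.75),
}
second_set, composite = remove_noise_by_composite(
    first_set, weights=(0.2, 0.3, 0.5), threshold=0.6
)
```

Patent B has a composite value of 0.38 and is removed as noise, leaving A (0.915) and C (0.755) in the second text set.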
In another embodiment, as shown in Figure 6, the process of constructing the similar patent set may specifically include the following steps:
Step S416: obtain second association degree values between the texts in the first text set and the target text using the N preset association metric algorithms.
Step S417: sort the second association degree values of each algorithm by size to obtain N ordered sets.
Step S418: in the N ordered sets, treat texts whose association degree value is below a second association degree threshold and/or whose rank is greater than a second rank threshold as noise texts.
Step S419: remove the noise texts from the first text set to construct the second text set.
In embodiments of the present invention, the approach used when constructing the preliminary patent set can be reused here: noise patents are removed by setting suitable thresholds on the similarity and/or on the rank given by each similarity algorithm, and the similar patent set is constructed. The details are not repeated here.
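Steps S416 to S419 might be sketched as follows. The text leaves open whether a text is noise when it fails in one ordered set or in all of them; this sketch assumes the conservative reading (noise only if it fails in every ordered set). Function names and the example thresholds are hypothetical.

```python
def rank_positions(scores, algo):
    """Rank ids by the value of one algorithm, 1 = highest value."""
    ordered = sorted(scores, key=lambda pid: scores[pid][algo], reverse=True)
    return {pid: i + 1 for i, pid in enumerate(ordered)}


def second_text_set(scores, min_score, max_rank):
    """Drop texts that look like noise under every algorithm."""
    n_algos = len(next(iter(scores.values())))
    ranks = [rank_positions(scores, a) for a in range(n_algos)]
    kept = []
    for pid, values in scores.items():
        noisy_everywhere = all(
            values[a] < min_score or ranks[a][pid] > max_rank
            for a in range(n_algos)
        )
        if not noisy_everywhere:
            kept.append(pid)
    return kept


scores = {
    "A": (0.9, 0.2, 0.3),
    "B": (0.3, 0.3, 0.2),
    "C": (0.8, 0.7, 0.9),
}
kept = second_text_set(scores, min_score=0.5, max_rank=2)
```

Patent A survives because one algorithm scores it highly, while B is below the threshold under all three algorithms and is removed.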
Step S42: sort each text in the second text set according to the second preset rule to obtain the retrieval ranking result of the target text.
In embodiments of the present invention, the second preset rule is set according to the size of the association degree value relative to the target text, or according to the rank of that value. In one embodiment, the weights are distributed equally (the three algorithms have the same weight), that is, Wx = Wy = Wz = 1/3. For example, if the similarities of a patent J relative to the target patent O under the three algorithms are Rx = 90%, Ry = 85% and Rz = 96%, the simple weighted average similarity of candidate patent J relative to the target patent is R = 90% * 1/3 + 85% * 1/3 + 96% * 1/3 = 90.3%. The retrieval ranking result of the target patent is obtained from the weighted average similarity of each patent.
In a specific implementation, a weight may be assigned manually to each algorithm from experience, for example Wx = 20%, Wy = 30%, Wz = 50%, and the retrieval ranking result of the target patent is obtained from the weighted value of each patent.
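The two weighting choices above can be checked with a few lines of arithmetic. The function name is illustrative; the similarities and weights are the example values from the text.

```python
def weighted_similarity(sims, weights):
    """Weighted sum of per-algorithm similarities; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(s * w for s, w in zip(sims, weights))


sims = (0.90, 0.85, 0.96)  # Rx, Ry, Rz for patent J
equal = weighted_similarity(sims, (1 / 3, 1 / 3, 1 / 3))  # about 0.903
manual = weighted_similarity(sims, (0.2, 0.3, 0.5))       # about 0.915
```

The equal-weight result reproduces the 90.3% figure, and the manual weights of 20%, 30% and 50% raise the value to 91.5% because the third algorithm gives the highest similarity.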
In one embodiment, the N association metric algorithms are used to obtain the association degree values between preset samples and each text in the candidate text set, and the recall rate of the preset samples in preset intervals of association degree value is obtained. Weights are set for the N association metric algorithms according to the recall rates in the preset intervals, the integrated ranking value of each text in the candidate text set is obtained, and the retrieval ranking result of the target text is obtained from the integrated ranking values. For example, the relevance computed by each method is divided into several intervals, and the recall rate of each interval is calculated from the number of X documents recalled in that interval and the total number of patents in that relevance interval. The relevance may, for instance, be divided into the following 6 intervals:
Statistics for algorithm 1:
Greater than 95%: Z11 = (X documents recalled / total) = 5%
95%~90%: Z12 = (X documents recalled / total) = 10%
90%~80%: Z13 = (X documents recalled / total) = 11%
80%~70%: Z14 = (X documents recalled / total) = 13%
70%~60%: Z15 = (X documents recalled / total) = 19%
60% or less: Z16 = (X documents recalled / total) = 42%
Statistics for algorithm 2:
Greater than 95%: Z21 = (X documents recalled / total) = 3%
95%~90%: Z22 = (X documents recalled / total) = 12%
90%~80%: Z23 = (X documents recalled / total) = 17%
80%~70%: Z24 = (X documents recalled / total) = 15%
70%~60%: Z25 = (X documents recalled / total) = 23%
60% or less: Z26 = (X documents recalled / total) = 30%
Statistics for algorithm 3:
Greater than 95%: Z31 = (X documents recalled / total) = 7%
95%~90%: Z32 = (X documents recalled / total) = 9%
90%~80%: Z33 = (X documents recalled / total) = 18%
80%~70%: Z34 = (X documents recalled / total) = 19%
70%~60%: Z35 = (X documents recalled / total) = 15%
60% or less: Z36 = (X documents recalled / total) = 32%
A weight allocation scheme is then specified from the above statistics. For example, for patent J, if its similarities relative to the target patent O under the three algorithms are Rx, Ry, Rz, the interval into which each of Rx, Ry, Rz falls is determined and the corresponding weight proportions W1, W2, W3 are looked up from the statistics. The patents are re-ranked after this calculation, with the comprehensive similarity value J' given by J' = Rx*W1 + Ry*W2 + Rz*W3. The retrieval ranking result of the target patent is obtained from the comprehensive similarity value of each patent.
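One possible reading of this interval-based weighting is sketched below. The recall figures are copied from the statistics above, but two things are assumptions, not statements of the text: the weights are taken as the looked-up recall rates normalized to sum to 1, and a similarity that sits exactly on a boundary is placed in the higher interval.

```python
from bisect import bisect_right

# Interval boundaries: <60%, 60-70%, 70-80%, 80-90%, 90-95%, >95%.
BOUNDS = [0.60, 0.70, 0.80, 0.90, 0.95]

# Recall rate per interval, lowest interval first.
RECALL = [
    [0.42, 0.19, 0.13, 0.11, 0.10, 0.05],  # algorithm 1 (Z16 .. Z11)
    [0.30, 0.23, 0.15, 0.17, 0.12, 0.03],  # algorithm 2 (Z26 .. Z21)
    [0.32, 0.15, 0.19, 0.18, 0.09, 0.07],  # algorithm 3 (Z36 .. Z31)
]


def interval_weights(sims):
    """Look up each algorithm's recall in its interval, then normalize."""
    raw = [RECALL[a][bisect_right(BOUNDS, s)] for a, s in enumerate(sims)]
    total = sum(raw)
    return [r / total for r in raw]


def comprehensive_similarity(sims):
    """J' = Rx*W1 + Ry*W2 + Rz*W3 with interval-dependent weights."""
    return sum(s * w for s, w in zip(sims, interval_weights(sims)))


w = interval_weights((0.92, 0.85, 0.96))
j_prime = comprehensive_similarity((0.92, 0.85, 0.96))
```

For these similarities the looked-up recall rates are 10%, 17% and 7%, so after normalization the second algorithm carries half of the total weight.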
In one embodiment, the comprehensive similarity is the highest of the similarities obtained by the three algorithms, that is, max(Rx, Ry, Rz). For example, if the similarities of a patent relative to the target patent under the three algorithms are Rx = 90%, Ry = 85% and Rz = 96%, the similarity of the patent relative to the target patent is directly set to R = 96%.
In one embodiment, interleaved ordering may be used. For example, the similar patent set is sorted with each of the three ranking methods to give three ordered sets X, Y, Z, and the final order takes entries from them in turn: X1, Y1, Z1, X2, Y2, Z2, X3, Y3, Z3, and so on. If a patent appears at X2, Y6 and Z53 at the same time, it is first placed at the position of X2; when position Y6 is reached, the patent is skipped and the following patent Y7 is taken instead (if Y7 has also already been placed, the next one is taken, and so on). Z53 is handled in the same way.
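The interleaved ordering with skipping of already-placed patents can be sketched as a round-robin merge. The function name is hypothetical, and each input list is assumed to be ordered best first.

```python
def interleave(*ordered_sets):
    """Take one not-yet-placed entry from each ordered set in turn."""
    positions = [0] * len(ordered_sets)
    seen, merged = set(), []
    total = len(set().union(*ordered_sets))
    while len(merged) < total:
        for i, ranking in enumerate(ordered_sets):
            # Skip entries already placed through another ranking.
            while positions[i] < len(ranking) and ranking[positions[i]] in seen:
                positions[i] += 1
            if positions[i] < len(ranking):
                pid = ranking[positions[i]]
                seen.add(pid)
                merged.append(pid)
                positions[i] += 1
    return merged


x = ["a", "b", "c"]
y = ["b", "a", "d"]
z = ["c", "d", "a"]
final_order = interleave(x, y, z)
```

In the first round each ranking contributes its top entry. In the second round X has nothing new left, so Y skips "a" and contributes "d", giving the order a, b, c, d.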
In one embodiment, N rank orders of the association degree values relative to the target text are obtained with the N association metric algorithms, the integrated ranking value of each text in the candidate text set is obtained from the N rank orders, and the retrieval ranking result of the target text is obtained from the integrated ranking values. For example, the user may sort the similar patent set with each of the three ranking methods to obtain three ordered sets X, Y, Z. For patent C, if its ranks in the three sets are Cx, Cy, Cz, the patents can be re-ranked after a calculation on these ranks; for example, the integrated ranking value may be set to C' = Cx + Cy + Cz, and the patents are finally ordered by the size of C'. If several patents have equal C', they can be ordered by a preset rule: for example, compare the minimum of Cx, Cy, Cz for each of them and rank first the patent with the smallest min(Cx, Cy, Cz), or compare the maximum of Cx, Cy, Cz and rank first the patent with the smallest max(Cx, Cy, Cz).
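A compact sketch of ranking by the rank sum C' with the two tie-break options described above. The function name and the `tie_break` parameter are illustrative only.

```python
def order_by_rank_sum(ranks, tie_break=min):
    """ranks maps a patent id to (Cx, Cy, Cz); smaller C' ranks first.

    Ties on C' are broken by the smallest tie_break(Cx, Cy, Cz),
    where tie_break is min or max.
    """
    return sorted(
        ranks,
        key=lambda pid: (sum(ranks[pid]), tie_break(ranks[pid])),
    )


ranks = {"P": (1, 9, 5), "Q": (4, 5, 6), "R": (2, 3, 4)}
by_min = order_by_rank_sum(ranks, tie_break=min)
by_max = order_by_rank_sum(ranks, tie_break=max)
```

P and Q both have C' = 15. Breaking the tie on the minimum rank favours P (best rank 1), while breaking it on the maximum rank favours Q (worst rank 6).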
In one embodiment, the step of sorting according to the second preset rule to obtain the retrieval ranking result of the target text comprises: using the N association metric algorithms to obtain the association degree values between preset samples and each text in the candidate text set; obtaining the rank, by association degree value, of each preset sample's most relevant text within the candidate text set; setting corresponding weights for the N association metric algorithms according to the average recall rate at each rank, or the recall rate in preset rank intervals, of the preset samples; obtaining the integrated ranking value of each text in the candidate text set; and obtaining the retrieval ranking result of the target text from the integrated ranking values. Specifically:
Weights are assigned according to the distribution of the ranking results. A batch of sample patents is taken from a preset patent set (for example, 100 patents that have X documents). The most relevant document of each sample patent is found (for example, using the X-document information provided in the patent examination records, with the X-category document defined as the closest prior-art document of the patent) and mapped to the candidate patents. The similarity between each sample patent and its closest reference patent is then calculated with each of the different similarity calculation methods. For each similarity calculation method, the relevance rank of the X document of each sample patent within the entire candidate patent set, relative to the target patent, is calculated (if a sample patent has several X documents, the one ranked nearest the front is taken). In this way, the ranks of the X document Pi of each sample patent under the three algorithms are obtained: Pix, Piy, Piz (i = 1~100). These data are analyzed to obtain the rank distribution of each algorithm and, from that distribution, the interval in which each algorithm has an advantage, for example the comparison of the accuracy trends of the three algorithms in each interval shown in Figure 3.
The above statistics show that algorithm 1 and algorithm 2 have higher recall rates near the front (top 10) and further back (101~1000), and have no clear advantage over the other algorithm in the 10~100 interval; algorithm 3 shows the opposite trend. Based on these statistics, the weights in the integrated ranking formula can be assigned and adjusted accordingly. Using the recall statistics, the weights can be assigned by either of the following two methods:
Method one: count the average recall rate of each algorithm at each rank position and assign weights to the algorithms according to recall rate. For example, according to the statistics, the proportions of cases in which the closest document is ranked 6th by the three algorithms are 1.5%, 2.3% and 0.6% respectively. The relative shares of the three algorithms for rank 6 are then:
Algorithm 1: share = 1.5 / (1.5 + 2.3 + 0.6) * 100% = 34%,
Algorithm 2: share = 2.3 / (1.5 + 2.3 + 0.6) * 100% = 52%,
Algorithm 3: share = 0.6 / (1.5 + 2.3 + 0.6) * 100% = 14%;
For rank 6, the weights 34%, 52% and 14% are therefore assigned respectively. For patent C, if its ranks in the three sets are Cx, Cy, Cz, the corresponding weight proportions W1, W2, W3 are found by the above method, and the patents are re-ranked after the calculation, with the integrated ranking value set to C' = Cx*W1 + Cy*W2 + Cz*W3.
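The relative shares are a simple normalization, which the following sketch reproduces for the rank-6 figures. The function name is illustrative.

```python
def relative_shares(rates):
    """Normalize per-algorithm hit rates so that they sum to 1."""
    total = sum(rates)
    return [r / total for r in rates]


# Proportion of closest documents ranked 6th by algorithms 1, 2, 3 (in %).
shares = relative_shares([1.5, 2.3, 0.6])
percentages = [round(s * 100) for s in shares]
```

Rounded to whole percentages, the shares come out as 34, 52 and 14, matching the weights in the text.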
Method two: divide the ranks at which search results are hit into several intervals, count the recall rate of each algorithm in each interval, and assign weights to the algorithms according to recall rate. For example, the proportions of cases in which the closest patent retrieved by the three algorithms is ranked 6th~10th are 5%, 3% and 11% respectively. The relative shares of the three algorithms for ranks 6~10 are then:
Algorithm 1: share = 5 / (5 + 3 + 11) * 100% = 26%,
Algorithm 2: share = 3 / (5 + 3 + 11) * 100% = 16%,
Algorithm 3: share = 11 / (5 + 3 + 11) * 100% = 58%;
For ranks 6~10, the weights 26%, 16% and 58% are therefore assigned respectively. For patent C, if its ranks in the three sets are Cx, Cy, Cz, the interval into which each of Cx, Cy, Cz falls is determined, the corresponding weight proportions W1, W2, W3 are found by the above method, and the patents are re-ranked after the calculation, with the integrated ranking value given by C' = Cx*W1 + Cy*W2 + Cz*W3.
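A sketch of the interval lookup in method two follows. Only the row for ranks 6~10 comes from the text; the interval boundaries and the other three rows are made-up placeholders so that the example runs, and the names `UPPER`, `SHARES`, `weight_for` and `integrated_rank_value` are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the rank intervals 1-5, 6-10, 11-100; anything larger
# falls into a final open interval. This split is an assumption.
UPPER = [5, 10, 100]

# Relative share per interval for algorithms 1, 2, 3.
SHARES = [
    (0.40, 0.40, 0.20),  # ranks 1-5       (placeholder values)
    (0.26, 0.16, 0.58),  # ranks 6-10      (from the text)
    (0.20, 0.20, 0.60),  # ranks 11-100    (placeholder values)
    (0.40, 0.40, 0.20),  # ranks above 100 (placeholder values)
]


def weight_for(rank, algo):
    """Weight of algorithm `algo` for a patent it places at `rank`."""
    return SHARES[bisect_left(UPPER, rank)][algo]


def integrated_rank_value(ranks):
    """C' = Cx*W1 + Cy*W2 + Cz*W3, each weight taken from its own interval."""
    return sum(r * weight_for(r, a) for a, r in enumerate(ranks))


c_prime = integrated_rank_value((6, 8, 10))
```

All three ranks fall in the 6~10 interval here, so C' = 6*0.26 + 8*0.16 + 10*0.58 = 8.64. Since each rank is looked up in its own interval, the three weights applied to one patent need not sum to 1.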
The above embodiments are given as examples only and are not limiting; in practical applications, other variations or changes in different forms may be made on the basis of the above description.
In the retrieval ranking determination method provided by the embodiment of the present invention, a target text and a candidate text set to be retrieved are first obtained, where the target text may be a patent. The association degree value between the target text and each text in the candidate text set is then obtained, where the association degree may be a similarity. Next, each text in the candidate text set is sorted using the association degree values according to the first preset rule, and the first text set is constructed according to the first preset screening condition. Finally, each text in the first text set is sorted according to the second preset rule to obtain the retrieval ranking result of the target text. The method provided by the embodiment combines the advantages of multiple algorithms, improves the precision of patent search results, and improves the user's retrieval efficiency.
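The overall flow of steps S1 to S4 can be illustrated end to end with a toy example. Everything concrete here is an assumption made for the sketch: the two word-overlap metrics stand in for the unspecified association metric algorithms, the first screening keeps the top entries of each ordered set, and the final order uses an equal-weight composite.

```python
def word_jaccard(a, b):
    """Shared words divided by all distinct words."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def word_overlap(a, b):
    """Shared words divided by the size of the smaller word set."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / min(len(sa), len(sb))


def retrieval_ranking(target, candidates, metrics, keep_top, weights):
    # S2: association degree values under each metric.
    scores = {cid: tuple(m(target, text) for m in metrics)
              for cid, text in candidates.items()}
    # S3: sort by each metric and keep the top entries (first text set).
    first_set = set()
    for a in range(len(metrics)):
        ordered = sorted(scores, key=lambda cid: scores[cid][a], reverse=True)
        first_set.update(ordered[:keep_top])

    # S4: order the first text set by a weighted composite value.
    def composite(cid):
        return sum(s * w for s, w in zip(scores[cid], weights))

    return sorted(first_set, key=composite, reverse=True)


# S1: target text and candidate text set.
target = "text retrieval ranking method"
candidates = {
    "d1": "text retrieval ranking system",
    "d2": "image compression method",
    "d3": "cooking recipes for pasta",
}
result = retrieval_ranking(target, candidates,
                           (word_jaccard, word_overlap),
                           keep_top=2, weights=(0.5, 0.5))
```

The unrelated candidate d3 is dropped at the first screening, and d1 is ranked ahead of d2 in the final result.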
Embodiment 2
An embodiment of the present invention provides a system for determining the retrieval ranking of texts. As shown in Figure 7, the system includes:
A target text and candidate text set acquisition module 1, for obtaining the target text and the candidate text set to be retrieved. This module executes the method described in step S1 of Embodiment 1, which is not repeated here.
An association degree value acquisition module 2, for obtaining the association degree value between the target text and each text in the candidate text set. This module executes the method described in step S2 of Embodiment 1, which is not repeated here.
A first text set construction module 3, for sorting each text in the candidate text set using the association degree values according to the first preset rule, and constructing the first text set according to the first preset screening condition. This module executes the method described in step S3 of Embodiment 1, which is not repeated here.
A retrieval ranking result acquisition module 4, for sorting each text in the first text set according to the second preset rule to obtain the retrieval ranking result of the target text. This module executes the method described in step S4 of Embodiment 1, which is not repeated here.
In the text retrieval ranking determination system provided by the embodiment of the present invention, a target text and a candidate text set to be retrieved are first obtained, where the target text may be a patent. The association degree value between the target text and each text in the candidate text set is then obtained, where the association degree may be a similarity. Next, each text in the candidate text set is sorted using the association degree values according to the first preset rule, and the first text set is constructed according to the first preset screening condition. Finally, each text in the first text set is sorted according to the second preset rule to obtain the retrieval ranking result of the target text. The system provided by the embodiment combines the advantages of multiple algorithms, improves the precision of patent search results, and improves the user's retrieval efficiency.
Embodiment 3
An embodiment of the present invention provides a computer device which, as shown in Figure 8, includes: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402. The communication bus 402 implements the connection and communication between these components. The communication interface 403 may include a display and a keyboard, and may optionally also include a standard wired interface and a wireless interface. The memory 404 may be a high-speed RAM (Random Access Memory, a volatile random access memory) or a non-volatile memory, for example at least one disk memory. The memory 404 may optionally also be at least one storage device located remotely from the processor 401. The processor 401 can execute the text retrieval ranking determination method described with reference to Figure 1: a set of program code is stored in the memory 404, and the processor 401 calls the program code stored in the memory 404 to execute the text retrieval ranking determination method of Embodiment 1.
The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 402 can be divided into an address bus, a data bus, a control bus and so on. For ease of illustration only one line is drawn in Figure 8, which does not mean that there is only one bus or one type of bus.
The memory 404 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 404 may also include a combination of the above kinds of memory.
The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used to store program instructions. The processor 401 can call the program instructions to implement the text retrieval ranking determination method provided in Embodiment 1 of the present application.
An embodiment of the present invention also provides a computer-readable storage medium on which computer-executable instructions are stored; the computer-executable instructions can execute the text retrieval ranking determination method of Embodiment 1 above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above kinds of memory.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or changes derived from the above remain within the protection scope of the invention.

Claims (14)

1. a kind of retrieval ordering of text determines method, which comprises the steps of:
Obtain target text and candidate text collection to be retrieved;
Obtain the degree of association magnitude of each text in the target text and the candidate text collection;
Each text in the candidate text collection is ranked up according to the first preset rules using the degree of association magnitude, The first text collection is constructed according to the first default screening conditions;
Each text in first text collection is ranked up according to the second preset rules, obtains the retrieval row of target text Sequence result.
2. the retrieval ordering of text according to claim 1 determines method, which is characterized in that described by first text The step of each text is ranked up according to the second preset rules in set, obtains the retrieval ordering result of target text, comprising:
Each text in first text collection is ranked up according to third preset rules, according to the second default screening conditions Noise text is excluded, the second text collection is constructed;
Each text in second text collection is ranked up according to the second preset rules, obtains the retrieval of target text Ranking results.
3. the retrieval ordering of text according to claim 1 or 2 determines method, which is characterized in that obtain the target text This in the candidate text collection the step of degree of association magnitude of each text, comprising:
Each text in the target text and the candidate text collection is calculated separately using default N kind relevance metric algorithm Degree of association magnitude, the N are the positive integer more than or equal to 2.
4. The method for determining the retrieval ordering of text according to claim 3, characterized in that the step of sorting each text in the candidate text collection according to the first preset rule using the degree of association magnitude, and constructing the first text collection according to the first preset screening condition, comprises:
Sorting each text in the candidate text collection separately according to the degree of association magnitudes obtained by each of the N preset relevance metric algorithms, to obtain N ordered sets;
Performing a comprehensive ranking of the N ordered sets according to the first preset rule, and constructing the first text collection according to the first preset screening condition; preferably, analyzing, according to a preset strategy, the degree of association magnitudes that the N preset relevance metric algorithms separately calculate between the target text and each text in the candidate text collection, to obtain an analysis result; judging, based on the analysis result, whether each text in the candidate text collection meets a preset condition; and selecting the texts in the candidate text collection that meet the preset condition into the first text collection.
5. The method for determining the retrieval ordering of text according to claim 4, characterized in that the step of performing a comprehensive ranking of the N ordered sets according to the first preset rule, and constructing the first text collection according to the first preset screening condition, comprises:
Assigning a weight, according to the first preset rule, to each of the degree of association magnitudes obtained with the N preset metric algorithms; multiplying each degree of association magnitude by its corresponding weight and summing the products to obtain a comprehensive degree of association magnitude; determining the comprehensive ranking result according to the size of the comprehensive degree of association magnitude; and selecting into the first text collection the texts whose comprehensive degree of association magnitude is greater than a first preset comprehensive degree of association magnitude threshold.
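The weighted combination in claim 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout and the sample weights and threshold are all assumptions.

```python
def comprehensive_magnitudes(magnitudes: dict, weights: dict) -> dict:
    """Weighted sum per text.

    magnitudes: {algorithm: {text_id: magnitude}}
    weights:    {algorithm: weight} (assumed to come from the first preset rule)
    """
    text_ids = next(iter(magnitudes.values())).keys()
    return {
        tid: sum(weights[alg] * magnitudes[alg][tid] for alg in magnitudes)
        for tid in text_ids
    }

def build_first_collection(magnitudes: dict, weights: dict, threshold: float) -> list:
    """Rank texts by comprehensive magnitude and keep those above the threshold."""
    combined = comprehensive_magnitudes(magnitudes, weights)
    ranked = sorted(combined, key=combined.get, reverse=True)
    return [tid for tid in ranked if combined[tid] > threshold]

# Hypothetical magnitudes from two algorithms for three candidate texts.
example = {
    "alg_a": {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    "alg_b": {"d1": 0.5, "d2": 0.8, "d3": 0.2},
}
first_collection = build_first_collection(example, {"alg_a": 0.5, "alg_b": 0.5}, 0.5)
```

With equal weights, d1 and d2 have comprehensive magnitudes of about 0.7 and 0.6 and are kept, while d3 (about 0.15) falls below the threshold.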
6. The method for determining the retrieval ordering of text according to claim 3, characterized in that the step of sorting each text in the candidate text collection according to the first preset rule using the degree of association magnitude, and constructing the first text collection according to the first preset screening condition, comprises:
Sorting by size, separately for each algorithm, the degree of association magnitudes obtained with the N preset metric algorithms, to obtain N ordered sets;
Selecting into the first text collection the texts in the N ordered sets whose degree of association magnitude is greater than a first degree of association magnitude threshold and/or whose rank position is less than a first rank position threshold.
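The per-algorithm screening in claim 6 might look like the sketch below. It assumes that rank positions start at 1 with the largest magnitude first, and it exposes the "and/or" of the claim as a hypothetical require_both flag; neither detail is fixed by the claim.

```python
def select_by_threshold_or_rank(magnitudes: dict,
                                magnitude_threshold: float,
                                rank_threshold: int,
                                require_both: bool = False) -> set:
    """Union, over all algorithms, of the texts passing the screening condition.

    magnitudes: {algorithm: {text_id: magnitude}}
    """
    selected = set()
    for scores in magnitudes.values():
        # One ordered set per algorithm, largest magnitude first.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, tid in enumerate(ordered, start=1):
            above = scores[tid] > magnitude_threshold
            near_top = rank < rank_threshold
            keep = (above and near_top) if require_both else (above or near_top)
            if keep:
                selected.add(tid)
    return selected

# Hypothetical magnitudes from two algorithms.
example = {
    "alg_a": {"d1": 0.9, "d2": 0.3, "d3": 0.2},
    "alg_b": {"d1": 0.1, "d2": 0.2, "d3": 0.95},
}
```

Taking the union across algorithms means a text is kept if any single algorithm rates it highly, which favors recall at this first stage.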
7. The method for determining the retrieval ordering of text according to claim 3, characterized in that the step of sorting each text in the first text collection according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text collection, comprises:
For the texts in the first text collection, assigning a weight, according to the third preset rule, to each of the degree of association magnitudes obtained with the N preset metric algorithms;
Multiplying each degree of association magnitude by its corresponding weight and summing the products to obtain a comprehensive degree of association magnitude;
Determining the comprehensive ranking result according to the size of the comprehensive degree of association magnitude;
Treating texts whose comprehensive degree of association magnitude is less than a second preset comprehensive degree of association magnitude threshold as noise texts;
Removing the noise texts from the first text collection to construct the second text collection.
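The noise-removal steps of claim 7 can be sketched as follows, assuming the weights have already been chosen under the third preset rule. The function name remove_noise and the sample values are hypothetical.

```python
def remove_noise(first_collection: list, magnitudes: dict,
                 weights: dict, noise_threshold: float):
    """Split the first text collection into (second collection, noise texts).

    magnitudes: {algorithm: {text_id: magnitude}}
    weights:    {algorithm: weight} (assumed to follow the third preset rule)
    """
    combined = {
        tid: sum(weights[alg] * magnitudes[alg][tid] for alg in weights)
        for tid in first_collection
    }
    # Comprehensive ranking: largest comprehensive magnitude first.
    ranked = sorted(first_collection, key=combined.get, reverse=True)
    noise = [tid for tid in ranked if combined[tid] < noise_threshold]
    second = [tid for tid in ranked if combined[tid] >= noise_threshold]
    return second, noise

# Hypothetical magnitudes for a first text collection of three texts.
example = {
    "alg_a": {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    "alg_b": {"d1": 0.5, "d2": 0.8, "d3": 0.2},
}
second, noise = remove_noise(["d1", "d2", "d3"], example,
                             {"alg_a": 0.7, "alg_b": 0.3}, 0.3)
```

Here d3 has a comprehensive magnitude of about 0.13, so it is treated as noise and only d1 and d2 remain in the second text collection.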
8. The method for determining the retrieval ordering of text according to claim 3, characterized in that the step of sorting each text in the first text collection according to the third preset rule, excluding noise texts according to the second preset screening condition, and constructing the second text collection, comprises:
Obtaining second degree of association magnitudes between the texts in the first text collection and the target text according to the N preset relevance metric algorithms;
Sorting by size, separately for each algorithm, according to the second degree of association magnitudes, to obtain N ordered sets;
Treating as noise texts those texts in the N ordered sets whose degree of association magnitude is less than a second degree of association magnitude threshold and/or whose rank position is greater than a second rank position;
Removing the noise texts from the first text collection to construct the second text collection.
9. The method for determining the retrieval ordering of text according to claim 1, characterized in that the texts in the first text collection are sorted according to the second preset rule, the second preset rule ranking by the size of each text's degree of association magnitude with the target text or by its rank position in terms of degree of association magnitude, to obtain the retrieval ordering result for the target text;
Preferably, the N relevance metric algorithms are used to obtain the degree of association magnitudes between preset samples and each text in the candidate text collection; the recall rate of the preset samples' degree of association magnitudes over a preset interval is obtained; a corresponding weight is set for each of the N relevance metric algorithms according to the recall rate over the preset interval; a comprehensive ranking value is obtained for each text in the candidate text collection; and the retrieval ordering result for the target text is obtained according to the comprehensive ranking values;
Preferably, N rank orderings are obtained from the degree of association magnitudes that the N relevance metric algorithms compute for the target text; a comprehensive ranking value is obtained for each text in the candidate text collection according to the N rank orderings; and the retrieval ordering result for the target text is obtained according to the comprehensive ranking values;
Preferably, the N relevance metric algorithms are used to obtain the degree of association magnitudes between preset samples and each text in the candidate text collection, and the rank position, by degree of association magnitude, of each preset sample's most related text within the candidate text collection is obtained; a corresponding weight is set for each of the N relevance metric algorithms according to the average recall rate of the preset samples' rank positions or the recall rate over a preset interval; a comprehensive ranking value is obtained for each text in the candidate text collection; and the retrieval ordering result for the target text is obtained according to the comprehensive ranking values.
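One way to read the recall-based weighting in claim 9 is sketched below. It rests on several assumptions that the claim does not state: the "preset interval" is taken to be the top k rank positions, each preset sample has exactly one known most related text, and the weights are the recall rates normalized to sum to 1. All names are hypothetical.

```python
def recall_at_k(rankings: dict, relevant: dict, k: int) -> float:
    """Fraction of preset samples whose known related text appears in the top k.

    rankings: {sample_id: [text ids, best first]} for one algorithm
    relevant: {sample_id: id of the sample's most related text}
    """
    hits = sum(1 for sid, ranked in rankings.items() if relevant[sid] in ranked[:k])
    return hits / len(rankings)

def weights_from_recall(per_algorithm_rankings: dict, relevant: dict, k: int) -> dict:
    """Set one weight per algorithm, proportional to its recall in the top k."""
    recalls = {alg: recall_at_k(r, relevant, k)
               for alg, r in per_algorithm_rankings.items()}
    total = sum(recalls.values())
    if total == 0:
        # No algorithm recalled anything: fall back to equal weights.
        return {alg: 1 / len(recalls) for alg in recalls}
    return {alg: r / total for alg, r in recalls.items()}

# Hypothetical rankings produced by two algorithms for two preset samples.
rankings = {
    "alg_a": {"s1": ["d1", "d2", "d3"], "s2": ["d3", "d2", "d1"]},
    "alg_b": {"s1": ["d2", "d3", "d1"], "s2": ["d2", "d1", "d3"]},
}
weights = weights_from_recall(rankings, {"s1": "d1", "s2": "d2"}, k=2)
```

In this example alg_a recalls both related texts within the top 2 and alg_b recalls one, so the weights come out as roughly 2/3 and 1/3.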
10. The method for determining the retrieval ordering of text according to claim 2, characterized in that the texts in the second text collection are sorted according to the second preset rule, the second preset rule ranking by the size of each text's degree of association magnitude with the target text or by its rank position in terms of degree of association magnitude, to obtain the retrieval ordering result for the target text;
Preferably, the N relevance metric algorithms are used to obtain the degree of association magnitudes between preset samples and each text in the candidate text collection; the recall rate of the preset samples' degree of association magnitudes over a preset interval is obtained; a corresponding weight is set for each of the N relevance metric algorithms according to the recall rate over the preset interval; a comprehensive ranking value is obtained for each text in the candidate text collection; and the retrieval ordering result for the target text is obtained according to the comprehensive ranking values;
Preferably, N rank orderings are obtained from the degree of association magnitudes that the N relevance metric algorithms compute for the target text; a comprehensive ranking value is obtained for each text in the candidate text collection according to the N rank orderings; and the retrieval ordering result for the target text is obtained according to the comprehensive ranking values;
Preferably, the N relevance metric algorithms are used to obtain the degree of association magnitudes between preset samples and each text in the candidate text collection, and the rank position, by degree of association magnitude, of each preset sample's most related text within the candidate text collection is obtained; a corresponding weight is set for each of the N relevance metric algorithms according to the average recall rate of the preset samples' rank positions or the recall rate over a preset interval; a comprehensive ranking value is obtained for each text in the candidate text collection; and the retrieval ordering result for the target text is obtained according to the comprehensive ranking values.
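The option of deriving a comprehensive ranking value from N rank orderings can be illustrated with a simple rank-averaging sketch. The claim does not specify how the orderings are combined; taking the mean rank position (lower is better) and breaking ties by text id are assumptions made only for this example, and the function names are hypothetical.

```python
def comprehensive_rank_values(magnitudes: dict) -> dict:
    """Mean rank position of each text across all algorithms.

    magnitudes: {algorithm: {text_id: magnitude}}
    """
    totals = {}
    for scores in magnitudes.values():
        # Rank order for this algorithm, largest magnitude first.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, tid in enumerate(ordered, start=1):
            totals[tid] = totals.get(tid, 0) + rank
    n = len(magnitudes)
    return {tid: total / n for tid, total in totals.items()}

def retrieval_ordering(magnitudes: dict) -> list:
    """Order texts by mean rank position; ties are broken by text id."""
    values = comprehensive_rank_values(magnitudes)
    return sorted(values, key=lambda tid: (values[tid], tid))

# Hypothetical magnitudes from two algorithms.
example = {
    "alg_a": {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    "alg_b": {"d1": 0.5, "d2": 0.3, "d3": 0.8},
}
ordering = retrieval_ordering(example)
```

Working on rank positions rather than raw magnitudes avoids having to put the different algorithms' scores on a common scale.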
11. The method for determining the retrieval ordering of text according to claim 3, characterized in that the step of obtaining the degree of association magnitude between the target text and each text in the candidate text collection comprises:
Obtaining, by using one or N preset relevance metric algorithms and according to a transformed text corresponding to the target text, the degree of association magnitude between the target text and each text in the candidate text collection, or the transformed text corresponding to each text in the candidate text collection.
12. A system for determining the retrieval ordering of text, characterized by comprising:
A target text and candidate text collection obtaining module, configured to obtain the target text and the candidate text collection to be retrieved;
A degree of association magnitude obtaining module, configured to obtain the degree of association magnitude between the target text and each text in the candidate text collection;
A first text collection construction module, configured to sort each text in the candidate text collection according to the first preset rule using the degree of association magnitude, and to construct the first text collection according to the first preset screening condition;
A retrieval ordering result obtaining module, configured to sort each text in the first text collection according to the second preset rule to obtain the retrieval ordering result for the target text.
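The module structure of claim 12 can be sketched as a small class whose methods stand in for the modules. This is a simplified illustration with a single toy metric: the class name, the method names, the word_overlap metric and the choice of sorting by magnitude as the second preset rule are all assumptions. The input-obtaining module is represented simply by the arguments passed to run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

def word_overlap(target: str, text: str) -> float:
    """Toy metric: fraction of the target's words that also occur in the text."""
    words = set(target.lower().split())
    return len(words & set(text.lower().split())) / len(words) if words else 0.0

@dataclass
class RetrievalOrderingSystem:
    metric: Callable[[str, str], float]
    first_threshold: float = 0.0

    def obtain_magnitudes(self, target: str, candidates: Dict[str, str]) -> Dict[str, float]:
        """Degree of association magnitude obtaining module."""
        return {cid: self.metric(target, text) for cid, text in candidates.items()}

    def build_first_collection(self, magnitudes: Dict[str, float]) -> List[str]:
        """First text collection construction module: sort, then screen."""
        ranked = sorted(magnitudes, key=magnitudes.get, reverse=True)
        return [cid for cid in ranked if magnitudes[cid] > self.first_threshold]

    def obtain_result(self, first_collection: List[str],
                      magnitudes: Dict[str, float]) -> List[str]:
        """Retrieval ordering result obtaining module (assumed second rule)."""
        return sorted(first_collection, key=lambda cid: (-magnitudes[cid], cid))

    def run(self, target: str, candidates: Dict[str, str]) -> List[str]:
        magnitudes = self.obtain_magnitudes(target, candidates)
        first = self.build_first_collection(magnitudes)
        return self.obtain_result(first, magnitudes)

system = RetrievalOrderingSystem(metric=word_overlap)
result = system.run("text ranking", {
    "d1": "text retrieval ranking",
    "d2": "image compression",
    "d3": "text ranking method",
})
```

In this run d2 shares no words with the target and is screened out, leaving d1 and d3 in the result.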
13. A computer device, characterized by comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the method for determining the retrieval ordering of text according to any one of claims 1-11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to perform the method for determining the retrieval ordering of text according to any one of claims 1-11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910082601.4A CN109857856B (en) 2019-01-28 2019-01-28 Text retrieval sequencing determination method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910082601.4A CN109857856B (en) 2019-01-28 2019-01-28 Text retrieval sequencing determination method and system

Publications (2)

Publication Number Publication Date
CN109857856A (en) 2019-06-07
CN109857856B (en) 2020-05-22

Family

ID=66896589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910082601.4A Active CN109857856B (en) 2019-01-28 2019-01-28 Text retrieval sequencing determination method and system

Country Status (1)

Country Link
CN (1) CN109857856B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115340A (en) * 2020-09-14 2020-12-22 深圳市欢太科技有限公司 Search strategy selection method, mobile terminal and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130173610A1 (en) * 2011-12-29 2013-07-04 Microsoft Corporation Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches
CN105468790A (en) * 2015-12-30 2016-04-06 北京奇艺世纪科技有限公司 Comment information retrieval method and comment information retrieval apparatus
CN106649650A (en) * 2016-12-10 2017-05-10 宁波思库网络科技有限公司 Demand information two-way matching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
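Hu Xiaoguang (胡晓光): "Research on language-model-based text retrieval technology and re-ranking of retrieval results" (基于语言模型的文本检索技术及检索结果重排序的研究), China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology series (《中国优秀博硕士学位论文全文数据库 (硕士) 信息科技辑》) *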

Also Published As

Publication number Publication date
CN109857856B (en) 2020-05-22

Similar Documents

Publication Number Title
CN109583468A (en) Training sample acquisition methods, sample predictions method and corresponding intrument
CN107122382A (en) A kind of patent classification method based on specification
CN109859054A (en) Network community method for digging, device, computer equipment and storage medium
CN108021945A (en) A kind of transformer state evaluation model method for building up and device
CN104317891B (en) A kind of method and device that label is marked to the page
CN106777282B (en) The sort method and device of relevant search
CN104517052B (en) Invasion detection method and device
CN106485529A (en) The sort method of advertisement position and device
CN108984708A (en) Dirty data recognition methods and device, data cleaning method and device, controller
CN112463859B (en) User data processing method and server based on big data and business analysis
CN106919957A (en) The method and device of processing data
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN108733791A (en) network event detection method
CN110232405A (en) Method and device for personal credit file
CN109634997A (en) A kind of acquisition methods, device and the electronic equipment of unusual fluctuation channel
CN103309857A (en) Method and equipment for determining classified linguistic data
CN110209551A (en) A kind of recognition methods of warping apparatus, device, electronic equipment and storage medium
CN107944487B (en) Crop breeding variety recommendation method based on mixed collaborative filtering algorithm
CN109857856A (en) A kind of retrieval ordering of text determines method and system
CN107908649A (en) A kind of control method of text classification
CN106919587A (en) Application program search system and method
CN107679174A (en) Construction method, device and the server of Knowledge Organization System
CN106611339B (en) Seed user screening method, and product user influence evaluation method and device
CN106815277A (en) The appraisal procedure and device of search engine optimization
CN108170664A (en) Keyword expanding method and device based on emphasis keyword

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant