CN106372226B - Information retrieval device and method - Google Patents
Information retrieval device and method
- Publication number
- CN106372226B (CN201610809109.9A)
- Authority
- CN
- China
- Prior art keywords
- retrieval
- keywords
- formula
- unit
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3349—Reuse of stored results of previous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/11—Patent retrieval
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information retrieval device and method. The device comprises: a receiving unit that receives a specific patent number input by a user; a keyword acquisition unit that automatically extracts keywords from the patent information corresponding to the specific patent number; a comparing unit that divides the extracted keywords into a plurality of priorities according to their degree of correlation with the specific patent; a classification number acquiring unit that extracts classification numbers from the patent information corresponding to the specific patent number and divides them into a plurality of priorities; and a retrieval formula construction unit that constructs a retrieval formula from the keywords and the classification numbers in order of priority from high to low, until a retrieval formula whose retrieval result meets a preset threshold condition is obtained. Because the keywords and classification numbers are extracted automatically, ranked by degree of correlation, and combined into the retrieval formula in priority order, patent information similar to the subject to be retrieved can be retrieved automatically with greater accuracy and efficiency.
Description
Technical Field
The present invention relates to an information retrieval apparatus and method, and more particularly, to an apparatus and method for retrieving technical information, such as patent information.
Background
Technical information, particularly patent information, is an essential resource for the development of enterprises and scientific research institutes. For example, before an enterprise or research institution undertakes research and development or makes an investment, it can survey the state of the art in a specific technical field, determine a sound research direction, avoid duplicated development, and save time and research expenditure.
In recent years, however, patent information has grown rapidly, and more than one million patent documents are published worldwide every year. Conventional patent retrieval is usually performed in a patent database: the searcher first enters related keywords and their synonyms, or the corresponding classification numbers, according to the subject to be queried and the searcher's own experience, constructs a retrieval formula, and then repeatedly adjusts the formula through manual review to obtain the required data. It is therefore desirable to provide an apparatus and method for automatically retrieving patent information similar to the subject to be retrieved.
Patent document 1 (publication No. JP2005-234868A) discloses a similar application specification retrieval system that retrieves similar patent specifications from a plurality of patent specifications stored in a database, using keywords extracted from the patent specification to be examined. The system includes: a retrieval language extraction section that extracts the language recited in the patent claim to be examined and outputs it as a retrieval language; a concept description text extracting unit that extracts a concept description text describing the concept underlying the retrieval language in the invention; a related language extracting unit that extracts the language used in the concept description text and outputs it as a related language; and a document searching part that searches the database for similar patent specifications using the retrieval language and the related language.
Patent document 1 can retrieve similar applications automatically without human help, but it has two drawbacks. First, words extracted automatically by a machine usually include words that carry no meaning for retrieval; for example, "function" is such a word in the computer field. Second, different words are not equally close to the subject. For example, when the patent to be examined relates to a lens of an image pickup device and the extracted retrieval language includes "CCD", the scheme of patent document 1 also uses "CCD" in constructing the retrieval formula, although "CCD" is clearly only weakly associated with "lens". If "CCD" is included in the retrieval formula, too many search terms may be introduced, with the result that relevant documents are missed instead.
Therefore, it is desirable to provide an apparatus and method for automatically retrieving patent information similar to the subject desired to be retrieved more accurately and efficiently.
Disclosure of Invention
The invention aims to provide an information retrieval device and an information retrieval method, in particular a patent information retrieval device and method, that can automatically retrieve patent information similar to the subject to be retrieved more accurately and efficiently.
The information retrieval device of the present invention includes: a receiving unit that receives a specific patent number input by a user; a keyword acquisition unit that automatically extracts keywords from the patent information corresponding to the specific patent number; a comparing unit that divides the extracted keywords into a plurality of priorities according to the degree of correlation between the keywords and the specific patent; a classification number acquiring unit that extracts classification numbers from the patent information corresponding to the specific patent number and divides the extracted classification numbers into a plurality of priorities; and a retrieval formula construction unit that constructs a retrieval formula from the keywords and/or the classification numbers in order of priority from high to low, until a retrieval formula whose retrieval result meets a preset threshold condition is constructed.
The information retrieval method of the invention comprises the following steps: a receiving step of receiving a specific patent number input by a user; a keyword obtaining step of automatically extracting keywords from the patent information corresponding to the specific patent number; a comparison step of dividing the extracted keywords into a plurality of priorities according to the degree of correlation between the keywords and the specific patent; a classification number obtaining step of extracting classification numbers from the patent information corresponding to the specific patent number and dividing the extracted classification numbers into a plurality of priorities; and a retrieval formula constructing step of constructing a retrieval formula from the keywords and/or the classification numbers in order of priority from high to low, until a retrieval formula whose retrieval result meets a preset threshold condition is constructed.
In the invention, the keywords are extracted automatically and ranked into priorities according to their degree of correlation, the classification numbers are likewise extracted automatically and ranked into priorities, and the retrieval formula is then constructed in the priority order of the keywords and classification numbers. Because the retrieval formula is built from the keywords and classification numbers closest to the specific patent input by the user, patent information similar to the subject to be retrieved can be retrieved automatically more accurately and efficiently than in the prior art.
In the invention, the keyword acquisition unit ranks keywords as follows: keywords acquired from manually processed data of the specific patent input by the user serve as high-priority words, and words obtained by segmenting the patent information according to semantics serve as general keywords (semantic keywords). The classification number acquisition unit divides the acquired classification numbers into a plurality of priorities according to whether a classification number is a manually processed classification number, whether it is a main classification number, or which classification system it belongs to. Compared with automatic semantic word segmentation, manually processed keywords better reflect the core concept of the patent, and manually determined classification numbers better reflect where the invention belongs, so using the manual data as high-priority words improves retrieval precision. In addition, the main classification embodies the core idea of the invention better than a secondary classification, and some classification systems are more finely subdivided than others, so classification numbers can also be prioritized on these grounds.
In the present invention, the predetermined threshold condition is that the number of search results is equal to or greater than a fourth threshold and equal to or less than a fifth threshold, and the fourth and fifth thresholds are dynamically variable. Since the amount of data differs between technical fields, making the thresholds dynamically variable further improves search accuracy.
The information retrieval device of the present invention further comprises: a similarity calculation unit that calculates the similarity between each document in the retrieval result and the specific patent input by the user, the retrieval result being the result of retrieval with the retrieval formula constructed by the retrieval formula construction unit; and a sorting unit that sorts the documents in the retrieval result according to the similarity. The search results can therefore be presented in order of similarity, which improves browsing efficiency.
Drawings
Embodiments of the invention are described in further detail below with reference to the attached drawing figures, wherein:
FIG. 1 schematically illustrates one embodiment of an information retrieval system in accordance with the present invention;
FIG. 2 schematically illustrates one embodiment of a keyword ranking process in an information retrieval system according to the present invention;
FIG. 3 schematically illustrates an embodiment of a classification number classification flow implemented by the classification number acquisition unit in the information retrieval system according to the present invention;
FIG. 4 schematically illustrates an example of a retrieval formula construction process that can be implemented by the information retrieval system according to the present invention;
FIG. 5 is a block diagram schematically showing the structure of a retrieval formula construction unit of the second embodiment;
FIG. 6(a) (b) (c) (d) schematically shows a retrieval formula construction process implemented by the retrieval formula construction unit of the second embodiment;
FIG. 7 schematically shows an example of a dynamic threshold determination unit in an information retrieval system according to the invention;
FIG. 8 schematically illustrates yet another embodiment of an information retrieval system according to the present invention;
FIG. 9 schematically illustrates an embodiment of a computer system according to the present invention.
Detailed Description
First embodiment
FIG. 1 illustrates one embodiment of an information retrieval system of the present invention. FIG. 2 illustrates one embodiment of a keyword ranking process implemented by the information retrieval system in accordance with the present invention. FIG. 3 illustrates one embodiment of a classification number ranking process implemented by the information retrieval system according to the present invention. Fig. 4 shows an embodiment of a retrieval formula construction process implemented by the retrieval formula construction unit in the information retrieval system according to the invention. The following description is made with reference to fig. 1 to 4.
As shown in fig. 1, the information retrieval system includes an input device 101, a data retrieval device 201, and an information database 301. The input device 101 receives information input by a user, and the input information is, for example, a specific patent number. The information database 301 stores a batch of technical document information including, but not limited to, patent publications, utility model publications, specific standards, core journal documents, and the like of each country in advance.
As shown in fig. 1, the data retrieval apparatus of the present invention includes a receiving unit 202, a patent information acquiring unit 203, a high-priority word acquiring unit 204, a semantic word segmentation unit 205, a filtering unit 206, a comparing unit 207, a classification number acquiring unit 208, a retrieval formula constructing unit 209, a synonym library 211, and a retrieval result storing unit 210. In fig. 1, the high-priority word acquiring unit 204, the semantic word segmentation unit 205, and the filtering unit 206 constitute a keyword acquisition unit 213 of the data retrieval apparatus. The semantic word segmentation unit 205 and the filtering unit 206 constitute a semantic word acquisition unit 212.
As shown in fig. 2, in step S2020, the receiving unit 202 in the data retrieval device 201 receives information input by a user, for example, a specific patent number.
In step S2030, the patent information acquisition unit 203 searches the information database 301 using the specific patent number received by the receiving unit 202, thereby acquiring the patent information corresponding to the specific patent number.
After the patent information corresponding to the specific patent number is obtained in step S2030, the keywords and the classification numbers are each ranked. Steps S2040 to S2074 of fig. 2 show a specific ranking scheme for the keywords, and fig. 3 shows a specific ranking scheme for the classification numbers.
In step S2040 of fig. 2, the high-priority word acquiring unit 204 acquires the high-priority words in the patent information of the specific patent. A high-priority word is, for example, a word extracted after the specific patent has been manually processed. For example, in a germant database each patent is associated with a keyword list record, and the high-priority word acquiring unit 204 acquires that record as the high-priority words. Alternatively, the high-priority words may be words in a dedicated search record made for the specific patent number.
In step S2050, the semantic word segmentation unit 205 segments the specific patent information into words according to their semantics. The patent information may be the specification and the claims. Optionally, because the claims carry more legal information, they can be segmented preferentially. Compared with the dependent claims, the independent claims define the scope of protection of the patent and better embody the inventive concept of the application, so the semantic word segmentation unit 205 may segment only the independent claims. Of course, those skilled in the art will understand that the sources may also be ranked during semantic word segmentation: for example, the invention name may be segmented first, then the independent claims, then the dependent claims, and finally the description. Because the description contains many sentences, word frequency information may be combined with the semantic segmentation when the description is segmented.
In step S2060, the filtering unit 206 compares the semantic segmentation result with a filtering dictionary to filter out stop words and single-character words, that is, words that carry no specific meaning in a search. For example, for "high-temperature camera with panoramic function", the semantic segmentation result of step S2050 is "having", "week", "view", "function", "high temperature", "camera". The filtering unit 206 filters out the words without technical meaning, namely "having" and "function" and the single characters "week" and "view", so that only the keywords "high temperature" and "camera" are retained.
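The filtering of step S2060 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Token` type, the `filter_tokens` function, the contents of `FILTER_DICTIONARY`, and the per-word character counts are all assumptions made for illustration (the character counts stand in for the word lengths in the original-language text, which the translated example does not show).

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str          # translated rendering of the segmented word
    source_chars: int  # ASSUMED character count of the word in the source language


# Hypothetical filter dictionary; the patent does not list its contents.
FILTER_DICTIONARY = {"having", "function"}

# Segmentation result for "high-temperature camera with panoramic function".
# The character counts are illustrative assumptions.
EXAMPLE_TOKENS = [
    Token("having", 2),
    Token("week", 1),
    Token("view", 1),
    Token("function", 2),
    Token("high temperature", 2),
    Token("camera", 3),
]


def filter_tokens(tokens, filter_dictionary):
    """Drop words listed in the filter dictionary and single-character words."""
    return [
        token.text
        for token in tokens
        if token.text not in filter_dictionary and token.source_chars > 1
    ]


print(filter_tokens(EXAMPLE_TOKENS, FILTER_DICTIONARY))
```

With the assumed inputs, only "high temperature" and "camera" survive, matching the example in the text.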
In step S2070, the comparison unit 207 compares the high-priority words obtained in step S2040 and the semantic segmentation words remaining after the filtering of step S2060 with the invention name of the specific patent and/or the subject name of the independent claim, and ranks the high-priority words and semantic segmentation words accordingly. The comparison may consist of a similarity analysis between each word and the invention name and/or the subject name, with the words ranked according to the result of that analysis.
In step S2071, it is determined whether the similarity between the high-priority words obtained in step S2040, or the semantic segmentation words obtained in step S2060, and the invention name and/or the subject name of the independent claim is equal to or greater than a first threshold.
In the similarity analysis, for example, the similarity is considered highest, namely 1, when a high-priority word or semantic segmentation word coincides completely with the invention name and/or the subject name of the independent claim. When the word being compared shares only a single character with the invention name and/or the subject name of the independent claim, the similarity is 1/(length of the word being compared). For example, when the high-priority word "shooting" is two characters long in the original text and shares one character with a word in the invention name, the similarity is 1/2, that is, 0.5.
Furthermore, the synonym library 211 may also be consulted in the similarity analysis. The synonyms in the synonym library 211 are assigned different similarities in advance according to their actual meanings. For example, a close synonym of the word "shoot" may have a similarity of 0.8 with it, whereas "camera", although related to shooting, denotes a shooting device rather than the act, so its similarity with "shoot" is lower and may be 0.4.
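The two similarity rules above can be sketched in Python as follows. This is a hedged illustration, not the patent's algorithm: the function names, the generalization from "one shared character" to "number of shared characters divided by word length", and the synonym entry "film" are assumptions. The strings "AB" and "BC" are placeholders for two-character words that share one character.

```python
def overlap_similarity(word: str, name: str) -> float:
    """Character-overlap similarity between a candidate word and a name.

    Returns 1.0 when the word occurs in the name in full; otherwise the
    fraction of the word's characters that also occur in the name
    (an assumed generalization of the single-character rule in the text).
    """
    if not word:
        return 0.0
    if word in name:
        return 1.0
    shared = sum(1 for ch in word if ch in name)
    return shared / len(word)


# Hypothetical synonym library with pre-assigned similarities.
# "film" is an illustrative stand-in for the close synonym scored 0.8.
SYNONYM_LIBRARY = {
    "shoot": {"film": 0.8, "camera": 0.4},
}


def synonym_similarity(word: str, other: str, library=SYNONYM_LIBRARY) -> float:
    """Look up the pre-assigned similarity of `other` to `word` (0.0 if absent)."""
    return library.get(word, {}).get(other, 0.0)


print(overlap_similarity("AB", "BC"))          # one of two characters shared
print(synonym_similarity("shoot", "camera"))   # pre-assigned library value
```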
When the determination in step S2071 is "yes", that is, when the similarity is equal to or greater than the first threshold, then in step S2072 the high-priority words whose similarity is equal to or greater than the first threshold are defined as first-priority words, i.e., class A words. The semantic segmentation words whose similarity is equal to or greater than the first threshold and that are not already class A words are defined as second-priority words, i.e., class B words.
When the determination result of step S2071 is "no", it is next determined in step S2073 whether the similarity of the high-priority words acquired in step S2040, or of the semantic segmentation words acquired in step S2060, to the invention name and/or the subject name of the independent claim lies between the first threshold and the second threshold, that is, is less than the first threshold but equal to or greater than the second threshold.
When the determination result of step S2073 is "yes", then in step S2074 the high-priority words whose similarity lies between the first and second thresholds are defined as third-priority words, i.e., class C words, and the semantic segmentation words whose similarity lies between the first and second thresholds are defined as fourth-priority words, i.e., class D words.
When the determination result in step S2073 is "no", the process ends. High-priority words and semantic segmentation words whose similarity is below the second threshold are considered likely to narrow the search result to an excessively small range, which does not help the subsequent search for similar subjects, so these words are not classified further, and construction of the retrieval formula then proceeds without them. Of course, these words may instead be compared with a third threshold (where third threshold < second threshold < first threshold), so that the words that do not satisfy steps S2071 and S2073 are classified further into, for example, class E, F, G, and so on.
Each of classes A, B, C, D, E, F, G, and so on may contain a plurality of keywords, and the classes do not overlap. That is, when a word qualifies as both a class A word and a class C word, it is assigned only to the higher priority, i.e., class A. In this example, for convenience of explanation, only the class A to D keywords, i.e., the first- to fourth-priority words, are set.
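The ranking of steps S2071 to S2074 can be summarized in a short Python sketch. It assumes that each candidate word arrives with a flag saying whether it is a high-priority word and with a precomputed similarity; the function name, the input layout, and the threshold values 0.8 and 0.4 are illustrative assumptions, since the patent does not give concrete values for the first and second thresholds.

```python
def classify_keywords(candidates, first_threshold, second_threshold):
    """Assign each word to class A, B, C, or D; drop words below the second threshold.

    candidates: iterable of (word, is_high_priority, similarity) tuples.
    A word that qualifies for several classes keeps only the highest one.
    """
    best = {}
    for word, is_high_priority, similarity in candidates:
        if similarity >= first_threshold:
            cls = "A" if is_high_priority else "B"
        elif similarity >= second_threshold:
            cls = "C" if is_high_priority else "D"
        else:
            continue  # below the second threshold: not classified
        # "A" < "B" < "C" < "D" as strings, so the smaller letter is the higher priority.
        if word not in best or cls < best[word]:
            best[word] = cls

    classes = {"A": [], "B": [], "C": [], "D": []}
    for word, cls in best.items():
        classes[cls].append(word)
    return classes


# Illustrative input; similarities and thresholds are assumed values.
EXAMPLE_CANDIDATES = [
    ("image capture", True, 1.0),
    ("rotation", False, 0.9),
    ("cooling", True, 0.5),
    ("lens", False, 0.5),
    ("bracket", False, 0.1),
    ("image capture", False, 0.5),  # would be class D, but class A wins
]

print(classify_keywords(EXAMPLE_CANDIDATES, first_threshold=0.8, second_threshold=0.4))
```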
After the patent information acquiring unit 203 has searched the information database 301 in step S2030 of fig. 2 and acquired the patent information corresponding to the specific patent number, the flow of fig. 3 proceeds to step S2080, in which the classification number acquiring unit 208 of fig. 1 determines whether a high-priority classification number exists in the patent information.
Classification numbers may be ranked according to the type of classification system. The high-priority classification number may be the CPC information in the patent information, since a CPC classification number is generally more precise than an IPC classification number. That is, when the patent information includes a CPC classification number, a high-priority classification number is considered to exist.
When it is determined in step S2080 that a high-priority classification number exists, that is, when the determination result in step S2080 is "yes", then in step S2081 the high-priority classification number is defined as the first-priority classification number, i.e., a class a classification number. There may be one or more class a classification numbers.
Then, in step S2082, a logical OR search is performed on the first-priority classification numbers, i.e., the class a classification numbers. The classification numbers appearing in the search results are counted and sorted in descending order of count, and those ranked in the top ten are defined as the second-priority classification numbers, i.e., the class b classification numbers. The class b classification numbers exclude the class a classification numbers. Of course, the number of top-ranked classification numbers taken can be customized as required, for example to fifteen or twenty.
When the determination result in step S2080 is "no", that is, when the patent information does not include a high-priority classification number, then in step S2083 the classification numbers included in the patent information of the specific patent number input by the user are defined as third-priority classification numbers, i.e., class c classification numbers. The class c classification numbers exclude the class a and class b classification numbers and are, for example, the classification numbers given in the patent publication. There may be one or more class c classification numbers.
In step S2084, a logical OR search is performed on the third-priority classification numbers, i.e., the class c classification numbers. The classification numbers appearing in the search results are counted and sorted in descending order of count, and those ranked in the top ten are defined as the fourth-priority classification numbers, i.e., the class d classification numbers. The class d classification numbers exclude the class a, class b, and class c classification numbers. Again, the number of top-ranked classification numbers taken can be customized as required, for example to fifteen or twenty.
In addition, the classification numbers can be classified further. For example, the main groups of the class a and class b classification numbers can be extracted and used as class e and class f classification numbers, respectively. In this example, for convenience of explanation, only the class a to d classification numbers, i.e., the first- to fourth-priority classification numbers, are set.
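The descending-order statistics used in steps S2082 and S2084 can be sketched in Python with `collections.Counter`. The sketch assumes that the logical OR search has already been run and that its result is available as a list of per-document classification number lists; the function name `top_cooccurring` and the sample data are hypothetical.

```python
from collections import Counter


def top_cooccurring(seed_numbers, result_documents, top_n=10):
    """Return the top_n classification numbers that occur most often in the
    documents returned by an OR search on seed_numbers, excluding the seeds.

    result_documents: list of lists, one list of classification numbers per
    document (ASSUMED shape of the search result).
    """
    seeds = set(seed_numbers)
    counts = Counter(
        number
        for doc_numbers in result_documents
        for number in dict.fromkeys(doc_numbers)  # count each number once per document
        if number not in seeds
    )
    return [number for number, _ in counts.most_common(top_n)]


# Illustrative data: documents found by an OR search on the class a number.
EXAMPLE_SEEDS = ["H04N5/225"]
EXAMPLE_RESULTS = [
    ["H04N5/225", "H04N5/222"],
    ["H04N5/225", "H04N5/222", "H04N5/235"],
    ["H04N5/225", "G03B17/55"],
]

print(top_cooccurring(EXAMPLE_SEEDS, EXAMPLE_RESULTS, top_n=2))
```

The same helper would serve for the class d numbers by passing the class c numbers as seeds; excluding higher-priority classes already assigned is left to the caller in this sketch.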
Fig. 4 shows an embodiment of a retrieval formula construction flow implemented by the retrieval formula constructing unit 209 in the information retrieval system according to the invention.
The retrieval formula construction criteria for this embodiment are as follows: regardless of priority level, keywords are combined with one another by logical AND, and classification numbers are combined with one another by logical OR. Of course, it should be clear to those skilled in the art that the synonyms of a keyword are combined by logical OR. For example, suppose there are three keyword levels, classes A, B, and C, and class A contains the keywords A1 and A2. The synonyms of A1 are combined by logical OR, as are the synonyms of A2, whereas A1 and A2, and keywords of different levels (classes A, B, and C), are combined by logical AND.
As shown in fig. 4, in step S20749, i and j are set to 1, respectively, and i and j are natural numbers.
In step S20750, the retrieval formula constructing unit 209 accesses the synonym library 211 and obtains the synonyms of the i-th, (i-1)-th, ..., first priority words (i is 1 or more). Since i is 1 in the initial state, only the synonyms of the first-priority words, i.e., of each of the class A keywords, are obtained at this time.
In step S20751, the i-th, (i-1)-th, ..., first priority words (here the class A words) and their synonyms are combined: a logical OR operation is performed among the synonyms of each word, and a logical AND operation among the words. For example, if the class A words are "image capture" and "rotation", the synonyms obtained for "image capture" are "image capture", "photograph", and the like, and the synonym of "rotation" is "rotation", then the keyword expression constructed in step S20751 is "(image capture or photograph) and (rotation or rotation)".
In step S20752, the j-th, (j-1)-th, ..., first priority classification numbers are obtained. Since j is 1 in the initial state, only the first-priority classification numbers, i.e., the class a classification numbers, are obtained at this time, and a logical OR operation is performed on them. For example, if the class a classification numbers obtained are H04N5/225, G03B17/55, and G03B17/02, the classification number expression constructed in step S20752 is "H04N5/225 or G03B17/55 or G03B17/02".
In step S20753, the expressions constructed in step S20751 and step S20752 are combined by logical AND to form the retrieval formula, and the information database 301 is searched using this retrieval formula.
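Steps S20751 to S20753 amount to simple string assembly, which the following Python sketch illustrates. The function names and the dictionary layout of the synonym library are assumptions, and the lowercase "and"/"or" operators merely follow the notation of the examples in the text rather than the syntax of any particular database.

```python
def keyword_expression(keywords, synonym_library):
    """OR together each keyword with its synonyms, then AND the groups."""
    groups = []
    for keyword in keywords:
        synonyms = [s for s in synonym_library.get(keyword, []) if s != keyword]
        groups.append("(" + " or ".join([keyword] + synonyms) + ")")
    return " and ".join(groups)


def classification_expression(numbers):
    """OR together all classification numbers."""
    return " or ".join(numbers)


def retrieval_formula(keywords, synonym_library, numbers):
    """AND the keyword expression with the classification number expression."""
    return (
        f"({keyword_expression(keywords, synonym_library)})"
        f" and ({classification_expression(numbers)})"
    )


# Illustrative inputs taken from the examples in the text; the synonym
# library contents are assumed.
EXAMPLE_SYNONYMS = {"image capture": ["photograph"]}
EXAMPLE_KEYWORDS = ["image capture", "rotation"]
EXAMPLE_NUMBERS = ["H04N5/225", "G03B17/55", "G03B17/02"]

print(retrieval_formula(EXAMPLE_KEYWORDS, EXAMPLE_SYNONYMS, EXAMPLE_NUMBERS))
```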
In step S20754, it is determined whether the number of search results from step S20753 is equal to or greater than the fourth threshold and equal to or less than the fifth threshold, or whether i > 4 or j > 4. The latter condition checks whether the keywords or classification numbers of all four levels have taken part in constructing the retrieval formula: when i > 4 or j > 4, all levels have been used, so the retrieval formula is saved and the process ends.
When the number of search results is equal to or greater than the fourth threshold and equal to or less than the fifth threshold, the search result is considered appropriate, that is, the retrieved documents are considered to be in similar fields and closely related to the content of the specific patent input by the user. In this case, too, the retrieval formula is saved and the process ends.
When there are too few search results, the field covered by the machine search is considered too narrow; when there are too many, the machine search is considered to have introduced noise, and the documents obtained are not strongly correlated. In this example, the fourth and fifth thresholds may be set according to the characteristics of the field; for example, the fourth threshold may be set to 1500 and the fifth threshold to 2000.
Therefore, in step S20755, it is determined whether the number of search results is smaller than the fourth threshold. If the result is "yes", j is set to j + 1 and the process returns to step S20752, where a logical OR operation is performed on the classification numbers of the j-th, (j-1)-th, ..., 1st priorities. Since j is now 2, the classification numbers of the second and first priorities are ORed together. For example, if the obtained second-priority classification numbers are H04N5/222 and H04N5/235, they are ORed together with the first-priority classification numbers H04N5/225, G03B17/55, and G03B17/02, so that the classification number expression constructed in step S20752 in this pass of the loop is "H04N5/225 or G03B17/55 or G03B17/02 or H04N5/222 or H04N5/235".
Then, the search formula is formed again in step S20753 and a search is performed in the information database 301, after which step S20754 again determines whether the number of search results is greater than or equal to the fourth threshold and less than or equal to the fifth threshold, or whether i > 4 or j > 4.
When the determination result in step S20755 is "no", that is, when the number of search results is greater than the fifth threshold, i is set to i + 1. Then, in step S20750, the synonym library is accessed to obtain the synonyms of the words of the i-th, (i-1)-th, ..., 1st priorities. Since i is now 2, the first- and second-priority words, that is, the class A words and the class B words, are obtained together with their synonyms.
Then, in step S20751, the words of the i-th, (i-1)-th, ..., 1st priorities are logically ANDed with one another, while a logical OR operation is performed between each keyword and its synonyms. For example, in this example, the first-priority words, i.e., the class A words, are "camera shooting" and "rotation", the second-priority word, i.e., the class B word, is "cooling", and querying the synonym library for "cooling" returns "high temperature resistance", "temperature reduction", and "cooling", so that the keyword expression constructed in step S20751 at this time is "(camera shooting or camera shooting) and (rotation or rotation) and (cooling or high temperature resistance or low temperature or cooling)".
Then, in step S20753, a search formula is again formed and a search is performed in the information database 301, and in step S20754 it is determined whether the number of search results is less than or equal to the fifth threshold and greater than or equal to the fourth threshold, or whether i > 4 or j > 4.
The loop then continues according to the determination result until the number of search results is less than or equal to the fifth threshold and greater than or equal to the fourth threshold, or until i > 4 or j > 4. The flow terminates when i > 4 or j > 4 because, in this example, only four priority levels (class A, class B, class C, and class D) are set for the keywords, and only four priority levels (class a, class b, class c, and class d) are set for the classification numbers. Of course, it will be understood by those of ordinary skill in the art that if more priority levels are set for the keywords and the classification numbers, the values of i and j that end the flow become correspondingly larger, matching the number of priority levels.
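The loop of steps S20750 to S20755 can be sketched as follows. This is only an illustration under stated assumptions: `count_hits` is a hypothetical callback standing in for a search in the information database 301, the function names are invented, and the default thresholds 1500 and 2000 are the example values given above.

```python
def keyword_expr(levels, synonyms):
    """AND all keywords of the given levels; OR each keyword with its synonyms."""
    groups = []
    for level in levels:
        for word in level:
            alternatives = [word] + synonyms.get(word, [])
            groups.append("(" + " or ".join(alternatives) + ")")
    return " and ".join(groups)


def class_expr(levels):
    """OR all classification numbers of the given levels."""
    return "(" + " or ".join(n for level in levels for n in level) + ")"


def construct(keyword_levels, class_levels, synonyms, count_hits,
              low=1500, high=2000):
    i = j = 1  # number of keyword levels / classification levels in use
    while True:
        # Slicing clamps, so i or j one past the last level means "all levels".
        formula = (keyword_expr(keyword_levels[:i], synonyms)
                   + " and " + class_expr(class_levels[:j]))
        hits = count_hits(formula)
        exhausted = i > len(keyword_levels) or j > len(class_levels)
        if low <= hits <= high or exhausted:
            return formula
        if hits < low:
            j += 1  # too few results: widen with the next classification level
        else:
            i += 1  # too many results: narrow with the next keyword level


# Hypothetical stand-in for the database: too many hits until "cooling" is used.
fake_hits = lambda f: 1800 if "cooling" in f else 3000
result = construct([["camera", "rotation"], ["cooling"]],
                   [["H04N5/225"], ["H04N5/222"]], {}, fake_hits)
print(result)
```

The exhaustion test plays the role of the "i > 4 or j > 4" condition, generalized to however many levels are supplied.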
After a search formula is constructed by the search formula constructing unit 209 according to the steps of fig. 2 to 4 and a search is performed in the information database 301, the search result is stored in the search result storing unit 210.
In this example, the words obtained by the keyword obtaining unit 213 are processed and segmented, then compared with the invention name and/or the subject names of the claims, and the keywords are classified according to the resulting similarity. The keywords may also be classified in other ways. For example, if other keywords arise during data processing or during big-data retrieval, the keywords may additionally be ranked by priority according to how accurate they are: they may first be divided into a plurality of levels according to how closely their meaning agrees with manual recognition, and then compared with the invention name and the claimed subject matter, so as to classify them into a plurality of priority levels.
In addition, as regards judging the relevance between a keyword and the invention in order to determine the keyword's priority, in this example the priority level is obtained from the similarity between the keyword and the invention name and/or the subject names of the claims. Other ways of determining the priority level may also be used. For example, the whole application file of the invention may be subjected to word frequency analysis to obtain a word frequency list, and the automatically obtained keywords may then be compared with that list to determine their priority levels.
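The word-frequency alternative can be illustrated with a short sketch. The text does not say how frequencies map to priority levels, so the equal-sized bucketing below, the function name, and the restriction to single-word lowercase keywords are all assumptions made for illustration.

```python
import re
from collections import Counter


def priority_levels_by_frequency(application_text, keywords, levels=4):
    """Rank keywords by how often they occur in the application text and
    split the ranking into at most `levels` equal-sized priority groups
    (the bucketing rule is an assumption, not taken from the patent)."""
    counts = Counter(re.findall(r"[a-z]+", application_text.lower()))
    ranked = sorted(keywords, key=lambda word: -counts[word.lower()])
    size = max(1, -(-len(ranked) // levels))  # ceiling division
    return [ranked[start:start + size] for start in range(0, len(ranked), size)]


text = "camera camera camera rotation rotation cooling"
groups = priority_levels_by_frequency(text, ["cooling", "camera", "rotation"], levels=3)
print(groups)
```

The first group would correspond to the highest-priority words, the last group to the lowest.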
In the above example, if a confirmed CPC classification number exists in the application, it is used as a high-priority classification number. Of course, other classification systems, such as the FT and UC classification systems, may also be used. Because the degree of subdivision of each classification system differs from field to field, the choice may depend on the field: in the camera field, for example, the FT classification system is finely subdivided, so when the patent information of the specific patent includes an FT classification number, that FT classification number may be set as a high-priority classification number.
Alternatively, some databases also include confirmed classification number information; for example, the database of the European Patent Office has a confirmed CPC classification number field. Some databases may also include the classification numbers used by the searcher of the specific patent. The confirmed classification numbers or the classification numbers used by the searcher may then be used as high-priority classification numbers.
Alternatively, if only IPC classification numbers exist in a certain application, the priority levels may also be determined from the IPC classification numbers. In this case, the classification numbers that have been verified by manual confirmation, or the classification numbers of the granted text, are usually used as the basis for the high-priority classification numbers, and the classification numbers in the published text are used as the basis for the lower priority levels.
In the above example, the search formula is constructed according to the priority levels of both the keywords and the classification numbers, but it may also be constructed according to the priority levels of only the keywords or only the classification numbers. When it is constructed according to the priority levels of the keywords only, the automatically extracted classification numbers may first be logically ORed, and the keywords then added step by step according to their priority levels. When it is constructed according to the levels of the classification numbers only, the keywords may be automatically extracted from specific content only, for example from the invention name or the independent claims, and, after the mask words have been removed, the search formula may be constructed by adding the classification numbers step by step.
Second embodiment
The overall structure of the second embodiment is the same as that shown in fig. 1 for the first embodiment. It differs from the first embodiment only in the way the search formula construction unit 209 constructs the search formula; therefore, only the differences from the first embodiment are described here, and other explanations are omitted.
Fig. 5 shows a block diagram of a search formula construction unit 209' according to the second embodiment of the present invention. Fig. 6(a) - (d) show another embodiment of the search formula construction flow implemented by the search formula construction unit 209' in the information retrieval system according to the invention.
In the second embodiment, the keywords and the classification numbers are classified in the same manner as in the first embodiment. That is, the keywords are divided into words of a plurality of priorities; here, for convenience of explanation, only the case of division into class A, class B, and class C priority words is described. The classification numbers are likewise divided into a plurality of priorities; for convenience of description, only the case of division into class a and class b is described. Each of class A, class B, and class C contains a plurality of keywords, and each of class a and class b contains a plurality of classification numbers: A1 and A2 are class A words; B1, B2, and B3 are class B words; C1, C2, and C3 are class C words; a1 and a2 are class a classification numbers; and b1 and b2 are class b classification numbers.
It should be noted that the division of the keywords into 3 priority levels and of the classification numbers into 2 priority levels is shown only as an example. It should be understood by those skilled in the art that the keywords and the classification numbers may be divided into more levels as needed, and that the numbers of levels for the keywords and for the classification numbers may be the same or different; for example, the keywords and the classification numbers may each be divided into 4 priority levels as in the first embodiment.
The search formula construction unit of the second embodiment differs from that of the first embodiment in the search formula construction criterion. In the first embodiment, regardless of priority levels, the keywords are logically ANDed with one another, the classification numbers are logically ORed with one another, and the keywords and the classification numbers are logically ANDed. In the second embodiment, the search formula construction unit constructs the search formula according to either a first search formula construction criterion or a second search formula construction criterion. Under the first criterion, the classification numbers are logically ORed regardless of their priority levels; keywords at the same level are logically ANDed, while keywords at different levels are logically ORed; and the keywords and the classification numbers are logically ANDed. Under the second criterion, the classification numbers are logically ORed, the keywords are logically ANDed regardless of their priority levels, and the keywords and the classification numbers are logically ANDed.
Of course, synonyms of the keywords A1 to A2, B1 to B3, and C1 to C3 should be taken into account when constructing the search formula, and it should be clear to those skilled in the art that a keyword and its synonyms are in a logical OR relationship.
Here, for convenience of explanation, when only a change of level is involved and not a change in the number of keywords or classification numbers within a level, a search formula constructed according to the first search formula construction criterion is written in abbreviated form: "A1 and A2" is written as "class A words", and "a1 or a2" is written as "class a classification numbers". The class B and class C words are treated in the same way as the class A words, and the class b classification numbers in the same way as the class a classification numbers.
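The two construction criteria of the second embodiment can be compared side by side in a short sketch. The function names are illustrative assumptions; keywords and classification numbers are passed as lists of levels, highest priority first, and synonyms are omitted for brevity.

```python
def or_join(terms):
    return "(" + " OR ".join(terms) + ")"


def and_join(terms):
    return "(" + " AND ".join(terms) + ")"


def first_criterion(keyword_levels, class_levels):
    """Same-level keywords ANDed, levels ORed; all classification numbers ORed."""
    keywords = or_join(and_join(level) for level in keyword_levels)
    numbers = or_join(n for level in class_levels for n in level)
    return keywords + " AND " + numbers


def second_criterion(keyword_levels, class_levels):
    """All keywords ANDed regardless of level; all classification numbers ORed."""
    keywords = and_join(w for level in keyword_levels for w in level)
    numbers = or_join(n for level in class_levels for n in level)
    return keywords + " AND " + numbers


kw = [["A1", "A2"], ["B1", "B2", "B3"]]
cls = [["a1", "a2"]]
print(first_criterion(kw, cls))
print(second_criterion(kw, cls))
```

Adding a keyword level under the first criterion introduces another OR branch and so can only widen the result set, whereas under the second criterion it adds AND terms and can only narrow it.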
As shown in fig. 5, the search formula construction unit 209' of the second embodiment includes a first unit 20913, a second unit 20914, a third unit 20915, a fourth unit 20916, a fifth unit 20917, a sixth unit 20918, a second comparison unit 2094, and a search formula obtaining unit 2093.
The first unit 20913 operates when, after a search formula has been constructed from the first-priority keywords (the class A keywords) and the first-priority classification numbers (the class a classification numbers) according to the first search formula construction criterion, that is, the search formula "class A words and class a classification numbers", the second comparison unit 2094 determines that the number of search results is smaller than the fourth threshold. The first unit 20913 first adds keywords of lower levels one level at a time in order of priority (since only three keyword priorities are used in this example, in the order class B, then class C). That is, when the search result is smaller than the fourth threshold, the class B keywords are added first and the search formula is constructed according to the first criterion, giving "(class A words OR class B words) and class a classification numbers". If the search result is still smaller than the fourth threshold, the class C keywords are added and, together with the class a classification numbers, a search formula is again constructed according to the first criterion. If the search result is still smaller than the fourth threshold, classification numbers of lower levels are then added in order of priority (only two classification number priorities are used in this example, so the second-priority class b classification numbers are added), and the search formula is constructed according to the first criterion. This continues until the second comparison unit 2094 determines that the search result satisfies the threshold condition, that is, it is greater than or equal to the fourth threshold and less than or equal to the fifth threshold.
The second unit 20914 operates when, after the first unit 20913 has added the keywords or classification numbers of a specific level, the second comparison unit 2094 determines that the search result is greater than the fifth threshold.
For example, in this embodiment, suppose that when the third-priority keywords, i.e., the class C words, are added, the search result is found to be greater than the fifth threshold. The search formula constructed by adding the class C keywords is then "(class A words OR class B words OR class C words) and class a classification numbers".
The second unit 20914 takes as its reference the search formula constructed before the keywords or classification numbers of that specific level were added, that is, the search formula constructed before the search result exceeded the fifth threshold, here "(class A words OR class B words) and class a classification numbers". Keeping the keywords in that search formula unchanged, it adds classification numbers of lower levels (the class b classification numbers) in order of priority and constructs the search formula according to the first search formula construction criterion, giving "(class A words OR class B words) and (class a classification numbers OR class b classification numbers)", until the second comparison unit 2094 determines that the search result satisfies the threshold condition, i.e., it is greater than or equal to the fourth threshold and less than or equal to the fifth threshold.
The third unit 20915 operates when, after the first unit 20913 has added the keywords and classification numbers of all levels, or after the second unit 20914 has added the classification numbers of all levels, the second comparison unit 2094 determines that the search result is still smaller than the fourth threshold. In this embodiment, suppose that a search with the search formula "(class A words OR class B words) and (class a classification numbers OR class b classification numbers)" still returns fewer results than the fourth threshold. Taking that search formula as its reference, the third unit 20915 then deletes keywords in order of priority from low to high (i.e., class C, then class B, then class A), and within a level deletes them one by one from back to front, until only a predetermined number of keywords remain at that level.
In this embodiment, since the reference search formula is "(class A words OR class B words) and (class a classification numbers OR class b classification numbers)", it contains no class C keywords, so deletion starts with the class B keywords, which have the lowest priority present. In this example, the keyword B3 is deleted first, and the search formula is constructed according to the first search formula construction criterion, giving "((A1 and A2) OR (B1 and B2)) and (a1 OR a2 OR b1 OR b2)". If the threshold condition is still not satisfied, B2 is deleted, so that only B1 remains among the class B keywords, and then the keyword A2 is deleted, so that the search formula becomes "(A1 OR B1) and (a1 OR a2 OR b1 OR b2)". Before each deletion, the second comparison unit 2094 determines whether the search result satisfies the threshold condition.
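The deletion order used by the third unit 20915 can be sketched as a generator that yields the keyword levels after each deletion. The function name and the `keep` parameter (the predetermined number of keywords retained per level, set to 1 as in the example) are illustrative assumptions.

```python
def deletion_steps(keyword_levels, keep=1):
    """Yield the keyword levels after each single deletion, working from the
    lowest-priority level upward and from back to front within a level,
    until only `keep` keywords remain at every level."""
    levels = [list(level) for level in keyword_levels]
    for index in range(len(levels) - 1, -1, -1):  # lowest priority first
        while len(levels[index]) > keep:
            levels[index].pop()  # delete the last keyword of this level
            yield [list(level) for level in levels]


steps = list(deletion_steps([["A1", "A2"], ["B1", "B2", "B3"]]))
for step in steps:
    print(step)
```

In a full implementation, the caller would rebuild the search formula from each yielded state and stop as soon as the threshold condition is satisfied.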
The search formula obtaining unit 2093 obtains the final search formula as follows.
When the second comparison unit 2094 determines that the search result satisfies the threshold condition, the search formula that produced that result is obtained. When the search result of the search formula constructed after the third unit 20915 has deleted keywords at all levels (with a predetermined number of keywords retained at each level according to a predetermined rule) is still smaller than the fourth threshold, the last-constructed search formula is obtained; that is, if the search result of the search formula "(A1 OR B1) and (a1 OR a2 OR b1 OR b2)" is still smaller than the fourth threshold, that search formula is obtained as the final search formula.
When the search result of the search formula constructed after the third unit 20915 has deleted a keyword at a specific level is greater than the fifth threshold, that is, if the search result of the search formula "(A1 OR B1) and (a1 OR a2 OR b1 OR b2)" is greater than the fifth threshold, the search formula constructed before that keyword was deleted, that is, "((A1 and A2) OR B1) and (a1 OR a2 OR b1 OR b2)", is obtained as the final search formula.
The fourth unit 20916 operates when, after a search formula has been constructed from the first-priority keywords (the class A keywords) and the first-priority classification numbers (the class a classification numbers) according to the first search formula construction criterion (i.e., the search formula "class A words and class a classification numbers"), the second comparison unit 2094 determines that the search result is greater than the fifth threshold. The fourth unit 20916 then deletes the first-priority classification numbers one at a time from back to front; in this embodiment, of the highest-priority classification numbers a1 and a2, the latter, a2, is deleted. Deletion continues until only a predetermined number of classification numbers remain in the class (set to 1 in this example). The keywords remain unchanged, and the search formula is constructed according to the second search formula construction criterion, giving "(A1 and A2) and a1". This continues until the second comparison unit 2094 determines that the search result satisfies the above threshold condition.
The fifth unit 20917 operates when, after the fourth unit 20916 has deleted a specific classification number, the second comparison unit 2094 determines that the search result is smaller than the fourth threshold. Assuming that the search result of the search formula "(A1 AND A2) AND a1" obtained after deleting the classification number a2 is smaller than the fourth threshold, the fifth unit takes the search formula from before that classification number was deleted, that is, "class A words AND class a classification numbers", keeps its classification numbers unchanged, adds keywords of lower levels in order of priority (in this example, in the order class A, class B, class C), and constructs the search formula according to the second search formula construction criterion, giving "(class A words AND class B words) AND class a classification numbers", until the second comparison unit 2094 determines that the search result satisfies the threshold condition.
The sixth unit 20918 operates when the fourth unit 20916 has deleted classification numbers until only the predetermined number remain and the second comparison unit 2094 determines that the search result of the resulting search formula is still greater than the fifth threshold. Assuming that the search result of the search formula "(A1 AND A2) AND a1" obtained after deleting the classification number a2 is greater than the fifth threshold, the sixth unit 20918 takes that search formula, keeps its classification numbers unchanged, and adds lower-level keywords one at a time: levels are taken in order of priority (class A, class B, class C), and within a level the keywords are added from front to back. The search formula is constructed according to the second search formula construction criterion. That is, when the class B words are added, B1 is added first, giving "(A1 AND A2 AND B1) AND a1", then B2, and then B3, until the second comparison unit 2094 determines that the search result satisfies the threshold condition.
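The keyword-by-keyword narrowing performed by the sixth unit 20918 can be sketched as a generator of successive search formulas under the second construction criterion. The function name and argument layout are illustrative assumptions; the single remaining classification number is passed as a string.

```python
def narrowing_formulas(kept_keywords, lower_levels, class_number):
    """Yield one search formula per added keyword. Lower levels are taken in
    priority order and, within a level, keywords are added front to back;
    all keywords are ANDed (second construction criterion)."""
    current = list(kept_keywords)
    for level in lower_levels:
        for word in level:
            current.append(word)
            yield "(" + " AND ".join(current) + ") AND " + class_number


formulas = list(narrowing_formulas(["A1", "A2"], [["B1", "B2", "B3"], ["C1"]], "a1"))
for formula in formulas:
    print(formula)
```

A caller would run each formula against the database in turn and stop at the first one whose result count satisfies the threshold condition.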
Then, the final search formula is obtained by the search formula obtaining unit 2093 as follows.
When the second comparison unit 2094 determines that the search result satisfies the threshold condition, the search formula that produced that result is obtained.
When, after the sixth unit 20918 has added the keywords of all levels, the second comparison unit 2094 determines that the search result of the resulting search formula is still greater than the fifth threshold, that is, in this embodiment, if the search result of "(class A words AND class B words AND class C words) AND a1" is still greater than the fifth threshold, this last-constructed search formula is obtained.
When, after the fifth unit 20917 has added a keyword at a specific level, the second comparison unit 2094 determines that the search result of the resulting search formula is smaller than the fourth threshold, the search formula constructed before that keyword was added is obtained. That is, in the above example, when the search result of the search formula "(class A words AND B1) AND class a classification numbers" is smaller than the fourth threshold, "(A1 AND A2) AND a1" is taken as the final search formula.
Fig. 6(a) - (d) show an embodiment of the search formula construction flow of the search formula construction unit of this embodiment.
In step S20760, the search formula construction unit 209' accesses the synonym library 211 to obtain the synonyms of each of the first-priority words, i.e., of each of the plurality of class A keywords.
In step S20761, for the first-priority words, i.e., the class A words, and their synonyms, a logical OR operation is performed between each word and its synonyms, while a logical AND operation is performed between the different class A words. That is, assuming that the class A words are A1 and A2, that the synonyms of A1 are A1' and A1'', and that the synonym of A2 is A2', the search formula constructed in step S20761 is "(A1 OR A1' OR A1'') AND (A2 OR A2')". For convenience of explanation, this expression is hereinafter written simply as "class A words", or, with the synonyms left implicit, as "A1 AND A2".
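Step S20761 can be sketched in a few lines. The synonym library 211 is modeled here as a plain dictionary and the function name is an illustrative assumption.

```python
def expand_with_synonyms(words, synonym_library):
    """OR each word with its synonyms, then AND the resulting groups."""
    groups = []
    for word in words:
        alternatives = [word] + synonym_library.get(word, [])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)


# Hypothetical synonym library matching the example in the text.
library = {"A1": ["A1'", "A1''"], "A2": ["A2'"]}
expression = expand_with_synonyms(["A1", "A2"], library)
print(expression)
```

A word with no entry in the library simply forms a single-term group.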
In step S20762, the first-priority classification numbers, i.e., the class a classification numbers, are obtained, and a logical OR operation is performed on them. If the class a classification numbers are a1 and a2, the expression constructed in this step is "a1 or a2"; for convenience of description, this expression is hereinafter written simply as "class a classification numbers".
In step S20763, the first-priority keywords (class A words) from step S20761 and the class a classification numbers from step S20762 are logically ANDed to form expression 1, i.e., "class A words and class a classification numbers", and a search is then performed in the information database 301 using this search formula.
In step S20764, it is determined whether the retrieval number of the retrieval results of expression 1 is equal to or greater than the fourth threshold value and equal to or less than the fifth threshold value, that is, whether a predetermined threshold condition is satisfied. If the determination result in step S20764 is yes, it indicates that the number of searches is appropriate, and the search formula is saved, and the process ends.
In step S20765, it is determined whether the retrieval number is smaller than the fourth threshold. When the determination result is yes, the classification numbers are kept unchanged, and the second-priority class B words are logically ORed with the first-priority class A words to obtain expression 2. That is, the search formula is constructed according to the first search formula construction criterion, and the expression 2 constructed at this time is "(class A words OR class B words) AND class a classification numbers".
Thereafter, it is determined in step S20766 whether the retrieval number of expression 2 satisfies the predetermined threshold condition. When it does, the search formula is saved and the process ends. If it is determined in step S20767 that the retrieval number is smaller than the fourth threshold, then in step S20768 the classification numbers are kept unchanged, the third-priority keywords (i.e., the class C words) are added, and the search formula is constructed according to the first search formula construction criterion to obtain expression 3, i.e., "(class A words OR class B words OR class C words) and class a classification numbers".
When the determination result in step S20767 is no, that is, the retrieval number is greater than the fifth threshold, the expression preceding expression 2, i.e., expression 1, is taken as expression 1' in step S20771.
Thereafter, steps S20769 and S20770 determine, respectively, whether the retrieval number satisfies the predetermined threshold condition and whether it is smaller than the fourth threshold. When the retrieval number in step S20769 satisfies the predetermined threshold condition, the search formula is saved and the flow ends. When the determination result in step S20770 is no, that is, the retrieval number is greater than the fifth threshold, the expression preceding expression 3, i.e., expression 2, is taken as expression 1' in step S20772.
When it is determined in step S20770 that the retrieval number is still smaller than the fourth threshold, expression 3 is taken as expression 1' in step S20773.
Thereafter, in step S20795, expression 1' is saved.
If the determination in step S20765 is no, that is, if the retrieval number is greater than the fifth threshold, then in step S20774 the keywords are kept unchanged and the first-priority classification numbers (class a classification numbers) are deleted one at a time in reverse order, until a predetermined number of classification numbers remain, to construct expression 4. In this example, the class a classification numbers are assumed to be a1 and a2, with a1 arranged before a2, so a2 is deleted, and the expression constructed in step S20774 is "class A words and a1".
Since there are only two class a classification numbers, a1 and a2, in this example, no further first-priority classification numbers are deleted; that is, the predetermined number of retained classification numbers is 1 here. Those skilled in the art will appreciate that the number of retained classification numbers may be chosen as needed, for example 2 or 3.
It is determined in step S20775 and step S20776 whether the search result satisfies a predetermined threshold condition and whether the search number is greater than a fifth threshold value, respectively. When it is determined in step S20775 that the search result satisfies the predetermined threshold condition, the search expression at that time is saved, and the flow ends.
When it is determined in step S20776 that the retrieval number is greater than the fifth threshold, since there is no classification number that can be further deleted at this time, expression 4 is taken as expression 1'' in step S20778.
When the determination in step S20776 is no, that is, when the retrieval number is smaller than the fourth threshold, the expression preceding expression 4, i.e., expression 1, is taken as expression 1'' in step S20777.
Thereafter, in step S20796, expression 1'' is saved.
In step S20779, starting from expression 1', the keywords are kept unchanged and classification numbers are added level by level in order of priority. In this example, the classification numbers are divided into only two priority levels, so in step S20779 the second-priority classification numbers, i.e., the class b classification numbers, are added and the search formula is constructed according to the first search formula construction criterion, resulting in expression 2'.
Thereafter, in step S20780 and step S20781, it is determined whether the search number satisfies a predetermined threshold condition and whether the search number is smaller than a fourth threshold, respectively. If the determination result in step S20780 is yes, the search expression at this time is saved, and the flow ends.
When it is determined in step S20781 that the retrieval number is smaller than the fourth threshold, expression 2' is taken as expression 1''' in step S20783. Of course, it should be understood by those skilled in the art that in this example the classification numbers are divided into only two priority levels, class a and class b. When there are more levels of classification numbers, e.g., four priority levels (class a, class b, class c, and class d), then in the above flow, as long as the retrieval number remains smaller than the fourth threshold, the class c and class d classification numbers are added in turn while the keywords are kept unchanged, and the search formula is constructed according to the first search formula construction criterion.
When the determination result in step S20781 is "no", that is, the retrieval number is greater than the fifth threshold, the expression preceding expression 2', i.e., expression 1', is taken as expression 1''' in step S20782.
Thereafter, in step S20797, expression 1''' is saved.
In step S20784, keywords are deleted while the classification numbers are kept unchanged, and a search formula is constructed according to the first search formula construction criterion. The keywords are deleted as follows: levels are processed in order from low priority to high priority, and among multiple keywords at the same level the keywords are deleted in reverse order, from back to front, until only a predetermined number of keywords are left at that level. Here, it is assumed that expression 1‴ is "(A-class words OR B-class words OR C-class words) AND (a-class classification number OR b-class classification number)". Suppose the A-class words are A1 and A2, the B-class words are B1, B2 and B3, and the C-class words are C1 and C2. Then, in step S20784, C2, the last keyword among the lowest-priority C-class words, is deleted first; since the C-class words consist of only the two keywords C1 and C2 in this example, the predetermined number retained here is 1. That is, the expression 2 constructed at this time is "((A1 AND A2) OR (B1 AND B2 AND B3) OR C1) AND (a-class classification number OR b-class classification number)".
During the deletion process, after each deletion it is determined whether the retrieval number of the above expression satisfies the predetermined threshold condition and whether the retrieval number is smaller than the fourth threshold value, as in steps S20785 and S20786. If the number of searches is still less than the fourth threshold, the keyword B3 of the second-priority B-class words is deleted, that is, the expression constructed at this time is "((A1 AND A2) OR (B1 AND B2) OR C1) AND (a-class classification number OR b-class classification number)". If the number of searches is still less than the fourth threshold, the keyword B2 is deleted next, and if it is still less than the fourth threshold, the keyword A2 is deleted. Thus, the resulting expression is "(A1 OR B1 OR C1) AND (a-class classification number OR b-class classification number)".
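The deletion order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_first_criterion` and `deletion_steps`, the dictionary layout keyed by level letter, and the placeholders `CN_a` and `CN_b` for the a-class and b-class classification numbers are all hypothetical and not part of the patent. The threshold judgments of steps S20785 and S20786 are left out, so the sketch lists every formula the flow could produce if the retrieval number stayed below the fourth threshold throughout.

```python
from typing import Dict, Iterator, List


def build_first_criterion(keywords: Dict[str, List[str]], class_numbers: List[str]) -> str:
    """AND keywords within a level, OR the levels together, OR the classification
    numbers, then AND the keyword part with the classification part."""
    groups = []
    for level in sorted(keywords):  # "A" sorts before "B": highest priority first
        words = keywords[level]
        if not words:
            continue
        joined = " AND ".join(words)
        groups.append(f"({joined})" if len(words) > 1 else joined)
    return f"({' OR '.join(groups)}) AND ({' OR '.join(class_numbers)})"


def deletion_steps(keywords: Dict[str, List[str]], keep: int = 1) -> Iterator[Dict[str, List[str]]]:
    """Yield the keyword set after each single deletion: lowest-priority level
    first, last keyword first, stopping at `keep` keywords per level."""
    current = {level: list(words) for level, words in keywords.items()}
    for level in sorted(current, reverse=True):  # lowest priority level first
        while len(current[level]) > keep:
            current[level].pop()  # drop the last keyword of this level
            yield {lvl: list(ws) for lvl, ws in current.items()}


keywords = {"A": ["A1", "A2"], "B": ["B1", "B2", "B3"], "C": ["C1", "C2"]}
class_numbers = ["CN_a", "CN_b"]  # hypothetical placeholders for the classification numbers
formulas = [build_first_criterion(step, class_numbers) for step in deletion_steps(keywords)]
for formula in formulas:
    print(formula)
```

In a full flow, each generated formula would be run against the database and the loop would stop as soon as the retrieval number satisfied the threshold condition or exceeded the fifth threshold.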
If it is determined in step S20785 that the expression satisfies the predetermined threshold condition, the search expression is saved, and the flow ends.
If the determination in step S20786 is "no", that is, if the retrieval number is greater than the fifth threshold value, the expression before the current expression is taken as expression 1′′′′. If the determination in step S20786 is "yes", that is, the number of searches is still less than the fourth threshold, the current expression is taken as expression 1′′′′. In step S20798, expression 1′′′′ is saved, and the flow ends.
In step S20787, in expression 1″, the classification numbers are kept unchanged, the number of keywords is increased, and the search formula is constructed in accordance with the second search formula construction criterion. Keywords are added in order of priority from high to low, and among multiple keywords at the same level they are added from front to back, that is, in ascending order. Here, it is assumed that expression 1″ is "A-class words AND a-class classification number" and that the A-class words comprise the two keywords A1 and A2. B1 of the B-class words is added first, so the expression constructed at this time is "((A1 AND A2) AND B1) AND a-class classification number", and the judgment is then made. When the number of searches is still greater than the fifth threshold, B2 of the B-class words is added, giving "((A1 AND A2) AND (B1 AND B2)) AND a-class classification number", and then B3 of the B-class words is added. Following the same rule, if the number of searches is still greater than the fifth threshold, the C-class words are added in the order C1, C2, C3.
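As a rough illustration of this narrowing step, the following Python sketch adds keywords one at a time and stops according to the threshold judgments. Everything named here is an assumption made for illustration: `build_second_criterion`, `narrow_by_adding`, the `count_hits` callback standing in for a database query, and the placeholder `CN_a`. The keyword part is written as a flat AND chain, which is logically equivalent to the nested grouping shown in the text because AND is associative.

```python
from typing import Callable, List


def build_second_criterion(words: List[str], class_numbers: List[str]) -> str:
    """AND all keywords together, OR the classification numbers, AND the two parts."""
    return f"({' AND '.join(words)}) AND ({' OR '.join(class_numbers)})"


def narrow_by_adding(
    base_words: List[str],
    extra_levels: List[List[str]],
    class_numbers: List[str],
    count_hits: Callable[[str], int],  # hypothetical: returns the retrieval number
    lower: int,                        # fourth threshold
    upper: int,                        # fifth threshold
) -> str:
    words = list(base_words)
    formula = build_second_criterion(words, class_numbers)
    for level in extra_levels:      # highest remaining priority first
        for word in level:          # front to back within a level
            candidate = build_second_criterion(words + [word], class_numbers)
            hits = count_hits(candidate)
            if hits < lower:
                return formula      # overshoot: keep the previous formula
            words.append(word)
            formula = candidate
            if hits <= upper:
                return formula      # threshold condition satisfied
    return formula                  # all keywords used, still above the fifth threshold
```

A caller would pass the A-class words as `base_words` and the B-class and C-class lists, in that order, as `extra_levels`; the three return points correspond to the three ways the flow can end.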
If a decision is made at step S20788 that the number of searches satisfies a predetermined threshold condition, the expression is saved and the flow ends.
If it is determined as "yes" in step S20789, that is, when all the keywords have been used and the number of searches is still greater than the fifth threshold, the current expression is taken as expression 1′′′′′. If it is determined as "no" in step S20789, that is, if the retrieval number is now smaller than the fourth threshold value, the expression preceding the current expression is taken as expression 1′′′′′. In step S20799, expression 1′′′′′ is saved, and the flow ends.
Third embodiment
In the first and second embodiments, the fourth and fifth thresholds are fixed values. However, the number of documents under a classification number differs from field to field; for example, judging from the distribution of applications across fields in recent years, the number of applications in the electrical field is significantly higher than that in the chemical field. It is therefore not reasonable to fix the fourth and fifth thresholds at the same values for every field.
Therefore, in the third embodiment, it is set so that the above-described fourth and fifth thresholds are dynamic values. The block diagram is the same as that of the first embodiment, and the third embodiment is different from the first embodiment only in that it has a dynamic threshold value determining unit for determining the fourth and fifth threshold values, and fig. 7 is a specific structure of the dynamic threshold value determining unit of the third embodiment. Therefore, in the third embodiment, the same reference numerals are referred to herein for the same structures and units as those of fig. 1, and the description is omitted, and only the differences from fig. 1 will be described herein.
As shown in fig. 7, the semantic segmentation unit 205 and the filtering unit 206 have the same structure as those of fig. 1. After filtering by the filtering unit 206, the filtered keywords with actual meanings are output, for example, in this example, as shown in fig. 1, only the words "high temperature" and "camera" are retained after filtering by the filtering unit 206.
The classification number acquisition unit 2081' in fig. 7 acquires only a plurality of classification numbers that a specific patent number contains, without performing classification. Thereafter, the keyword output by the filtering unit is used in the second search formula constructing unit 2091 of fig. 7, and the synonym of the filtered reserved term is obtained by referring to the synonym library 211.
Then, a logical OR operation is performed among the synonyms of each reserved keyword, and a logical OR operation is performed among the keywords, to form a keyword expression. A logical OR operation is likewise performed among the obtained classification numbers to form a classification number expression. Next, the second search formula constructing unit 2091 constructs a search formula by performing a logical AND operation on the keyword expression and the classification number expression.
For example, suppose the reserved keywords are "high temperature" and "camera", that querying the synonym library 211 gives "temperature high" as the synonym of "high temperature" and "camera" as the synonym of "camera", and that the classification numbers of the specific patent itself are H04N5/222 and H04N5/235. The search formula constructed by the second search formula constructing unit 2091 is then "((high temperature OR temperature high) OR (camera OR camera)) AND (H04N5/222 OR H04N5/235)".
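The construction just described can be sketched as a small Python function. This is an illustrative sketch only: the function name `build_baseline_formula` and the dictionary representation of the synonym library are assumptions, and the synonym "video camera" in the usage example is a hypothetical stand-in chosen so that the two terms in the group differ.

```python
from typing import Dict, List


def build_baseline_formula(
    keywords: List[str],
    synonyms: Dict[str, List[str]],  # hypothetical in-memory stand-in for the synonym library
    class_numbers: List[str],
) -> str:
    """OR each keyword with its synonyms, OR the keyword groups together,
    OR the classification numbers, then AND the two parts."""
    groups = []
    for keyword in keywords:
        # Skip synonyms identical to the keyword so a term is not repeated.
        terms = [keyword] + [s for s in synonyms.get(keyword, []) if s != keyword]
        groups.append("(" + " OR ".join(terms) + ")")
    return f"({' OR '.join(groups)}) AND ({' OR '.join(class_numbers)})"


formula = build_baseline_formula(
    ["high temperature", "camera"],
    {"high temperature": ["temperature high"], "camera": ["video camera"]},
    ["H04N5/222", "H04N5/235"],
)
print(formula)
```

Because every operator inside the keyword part is OR, this formula is deliberately broad; it serves only to estimate how many documents the field contains.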
Then, the search formula constructed by the second search formula constructing unit 2091 is run against the information database 301, and the resulting search hit amount is recorded. Thereafter, the data shift unit 2092 shifts the search hit amount down and up by a certain offset, and the resulting values are set as the lower and upper thresholds, i.e., the fourth and fifth thresholds. The offset may be, for example, 50% or 25%. When the number of hits is 5000 and the offset is 50%, the fourth threshold is 5000 × (1 − 0.5) = 2500 and the fifth threshold is 5000 × (1 + 0.5) = 7500.
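The offset arithmetic can be written as a short Python helper. The name `dynamic_thresholds`, the rounding to whole documents, and the restriction of the offset to the range 0 to 1 are assumptions added for this sketch; the text only specifies the shift itself.

```python
from typing import Tuple


def dynamic_thresholds(hit_count: int, offset: float) -> Tuple[int, int]:
    """Return (fourth threshold, fifth threshold) by shifting the hit count
    down and up by the given fractional offset."""
    if not 0 <= offset < 1:
        raise ValueError("offset is assumed to be a fraction in [0, 1)")
    lower = round(hit_count * (1 - offset))  # fourth threshold
    upper = round(hit_count * (1 + offset))  # fifth threshold
    return lower, upper


print(dynamic_thresholds(5000, 0.5))   # (2500, 7500)
print(dynamic_thresholds(5000, 0.25))  # (3750, 6250)
```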
Thereafter, after determining the above dynamic threshold range, when performing the construction of the search expression of fig. 4 to 6(a) - (d), the dynamic threshold of fig. 7 may be used for the determination of the fourth and fifth threshold of fig. 4 to 6(a) - (d), thereby constructing a suitable search expression, and automatically retrieving patent information similar to the desired subject.
Fourth embodiment
In the first to third embodiments described above, the retrieval result is simply stored by the retrieval result storage unit 210. However, each patent record in the final retrieval result can also be compared for similarity with the specific patent input by the user, and the records sorted by similarity, so that the user can browse the most similar files first, which improves browsing efficiency.
Fig. 8 shows a fourth embodiment of the information retrieval system of the present invention. In fig. 8, the same reference numerals are given to the same components and modules as those in fig. 1, and the description thereof will be omitted, and only the differences will be described.
The similarity between each file stored in the retrieval result storage unit 210 and the specific patent input by the user may be calculated by the similarity calculation unit 214. The similarity calculation may be performed by a vector comparison method commonly used in the art. For example, a weighted list of words and phrases may be regarded as a file vector. As a simplified example, assume the vector of the specific patent input by the user is [camera 1] [high temperature 0.5] [rotation 0.2], and the vector of one file in the retrieval result is [lens 1] [CCD 0.7] [high temperature 0.6] [rotation 0.5]. The words shared by the two file vectors are "high temperature" and "rotation", so multiplying the vectors gives a similarity of 0.5 × 0.6 + 0.2 × 0.5 = 0.4.
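The worked example amounts to a dot product over the terms the two vectors share. A minimal Python sketch, assuming each file vector is held as a dictionary from term to weight (the names `similarity` and `rank_by_similarity` are illustrative, not from the patent):

```python
from typing import Dict, List


def similarity(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Sum the products of weights for terms present in both vectors."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


def rank_by_similarity(query: Dict[str, float], docs: Dict[str, Dict[str, float]]) -> List[str]:
    """Return document identifiers ordered from most to least similar."""
    return sorted(docs, key=lambda doc_id: similarity(query, docs[doc_id]), reverse=True)


query = {"camera": 1.0, "high temperature": 0.5, "rotation": 0.2}
doc = {"lens": 1.0, "CCD": 0.7, "high temperature": 0.6, "rotation": 0.5}
score = similarity(query, doc)  # 0.5 * 0.6 + 0.2 * 0.5, approximately 0.4
```

This unnormalized product favors files with many heavily weighted terms; a practical system might normalize by vector length, but the text does not specify that.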
Then, the ranking unit 215 ranks the documents stored in the search result storage unit 210 according to the similarity, and displays the ranking result of each document in the search result.
Therefore, when browsing the search result, the related personnel can easily browse the search result from front to back according to the similarity of the files, and the efficiency can be greatly improved.
Of course, in the first and second embodiments of the present invention, when it is found that the predetermined threshold condition cannot be satisfied yet after the search formula of the multi-priority keywords and classification numbers is constructed according to the flow of fig. 4 or fig. 6(a) - (d), the predetermined files may also be deleted or added according to the similarity.
For example, when the flow of fig. 4 or fig. 6(a) - (d) has been completed and the number of search results returned by the search formula is still greater than the fifth threshold, the files in the search result storage unit 210 may be sorted according to similarity, and the files with the lowest similarity deleted so that the number of search results equals the fifth threshold.
Alternatively, when the flow of fig. 4 or fig. 6(a) - (d) has been completed and the number of search results is still less than the fourth threshold, a search formula may be constructed by the method of the second search formula constructing unit 2091 in fig. 7 of the third embodiment. After its search results are sorted according to similarity, files are added to the search result storage unit in order of similarity from high to low, that is, they supplement the search results of the search formula construction unit, with duplicate files removed, so that the total number of files finally equals the fourth threshold.
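Both fallback adjustments can be sketched in Python as follows. The function names, the use of document identifiers as strings, and the precomputed `scores` mapping are assumptions for illustration; the fallback list is assumed to be already sorted by similarity from high to low.

```python
from typing import Dict, List


def trim_to_upper(results: List[str], scores: Dict[str, float], upper: int) -> List[str]:
    """Keep only the `upper` most similar files (fifth threshold)."""
    ranked = sorted(results, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:upper]


def supplement_to_lower(results: List[str], fallback_ranked: List[str], lower: int) -> List[str]:
    """Append files from the broader fallback search, most similar first,
    skipping duplicates, until `lower` files are present (fourth threshold)."""
    merged = list(results)
    seen = set(results)
    for doc_id in fallback_ranked:
        if len(merged) >= lower:
            break
        if doc_id not in seen:
            merged.append(doc_id)
            seen.add(doc_id)
    return merged
```

If the fallback search itself yields too few new files, `supplement_to_lower` simply returns everything available, so the result can still fall short of the fourth threshold.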
Fifth embodiment
The information retrieval system of the present invention can be implemented by a computer system 501 shown in fig. 9. As shown in fig. 9, the computer system 501 includes an input device 5013 through which a user inputs information, a memory 5011 in which computer instruction information and a synonym library are stored, and a processor 5012. The computer instruction information is instruction information that can execute, for example, the flows of figs. 2 to 6 and the corresponding flows of the third and fourth embodiments. The processor 5012 reads the computer instruction information and the stored synonym library from the memory 5011 and processes them, so that the computer system: receives a specific patent number input by a user; automatically extracts keywords from the patent information corresponding to the specific patent number; divides the extracted keywords into a plurality of priorities according to the degree of correlation between the keywords and the specific patent; extracts classification numbers from the patent information corresponding to the specific patent number, and divides the extracted classification numbers into a plurality of priorities; and constructs a search expression from the keywords and the classification numbers in order of priority from high to low until a search expression whose search result satisfies the predetermined threshold condition is constructed.
The information retrieval device and method of the present invention can be applied in the following ways:
For example, in an enterprise, a technician who inputs a patent number can easily obtain files similar to the subject matter of that patent, so that the technician can quickly browse the related art and thereby raise the starting point of research and development.
For patent analysts, they can also easily analyze the above search results by relying on the technology, thereby making clear the inventors, applicants, major competitors, and the like of the related art.
For patent searching personnel, the file similar to the theme to be searched can be easily obtained through the method, so that the file can be preferentially browsed, and the searching efficiency is improved.
The embodiments of the present invention have been described above with reference to the drawings, but the scope of the present invention is not limited to the above-described embodiments, and structures appropriately combined with or replacing the embodiments are also included in the scope of the present invention. Those skilled in the art can combine or replace the structures or compositions of the above-described embodiments according to their knowledge, and these modified embodiments are also included in the scope of the present invention.
Claims (39)
1. An information retrieval apparatus, characterized by comprising:
a receiving unit which receives a specific patent number input by a user;
a keyword acquisition unit for automatically extracting keywords from the patent information corresponding to the specific patent number;
a comparing unit for dividing the extracted keywords into a plurality of priorities according to the degree of correlation between the keywords and the specific patents; a classification number acquiring unit for extracting classification numbers from the patent information corresponding to the specific patent numbers and dividing the extracted classification numbers into a plurality of priorities;
a retrieval formula construction unit which constructs a retrieval formula from the keywords and/or the classification numbers in order of priority from high to low until a retrieval formula whose retrieval result meets a preset threshold condition is constructed; when the retrieval result is less than a preset threshold value, adding a low-priority classification number or a low-priority keyword, or deleting the low-priority keyword or the keyword with the same priority level in a reverse order; and when the retrieval result is greater than a preset threshold value, adding the keywords with low priority levels or deleting the classification numbers with the same priority level in reverse order until the retrieval result meets a preset threshold value condition or the added or deleted classification numbers and keywords do not exist.
2. The information retrieval device according to claim 1, characterized in that:
the retrieval formula construction unit performs retrieval formula construction with reference to the synonym library of the keyword.
3. The information retrieval device according to claim 1, characterized in that:
the keyword acquisition unit comprises a high-priority word acquisition unit and a semantic word acquisition unit; wherein the high-priority word acquiring unit acquires a high-priority word from the manual processing data of the specific patent; the semantic word acquiring unit divides words from the patent information according to semantics, so as to acquire semantic keywords.
4. The information retrieval device according to claim 3, characterized in that:
the semantic word acquiring unit comprises a semantic word segmentation unit and a filtering unit, wherein the filtering unit removes shielding words and single words from semantic word segmentation results of the patent information.
5. The information retrieval device according to claim 1, characterized in that:
the classification number acquisition unit classifies the acquired classification number into a plurality of priorities according to whether the acquired classification number is one or more of a manually processed classification number, whether the classification number is a master classification number, and a type of a predetermined classification system.
6. The information retrieval device according to claim 1, characterized in that:
the predetermined threshold condition is that the search result is equal to or greater than a fourth threshold value and equal to or less than a fifth threshold value.
7. The information retrieval device according to claim 6, characterized in that:
the fourth threshold and the fifth threshold are dynamically variable.
8. The information retrieval device according to claim 7, characterized in that:
the information retrieval apparatus further includes a dynamic threshold value determination unit for adjusting the fourth threshold value and the fifth threshold value.
9. The information retrieval device according to claim 8, characterized in that:
the keyword acquisition unit comprises a semantic word acquisition unit, and is used for segmenting words from the patent information and removing shielding words and single words from the segmentation result so as to obtain semantic keywords;
the dynamic threshold determining unit comprises a second search formula constructing unit and a data shifting unit, wherein the second search formula constructing unit acquires the semantic keywords acquired by the semantic word acquiring unit and the plurality of classification numbers extracted by the classification number acquiring unit to construct a search formula, performs a search with it, and acquires the search hit amount;
and the data shifting unit is used for positively and negatively shifting the retrieval hit amount by a preset amount, taking the magnitude of the positive shift as a fifth threshold value and taking the magnitude of the negative shift as a fourth threshold value.
10. The information retrieval device according to claim 6, characterized in that:
when constructing the retrieval formula, the retrieval formula construction unit places keywords of the same or different levels in a logical AND relationship, classification numbers of the same or different levels in a logical OR relationship, and the keywords and the classification numbers in a logical AND relationship.
11. The information retrieval device according to claim 10, characterized in that:
when the retrieval result is smaller than a fourth threshold value, the retrieval formula construction unit adds the classification numbers with low priority in the priority order to construct the retrieval formula until the retrieval result meets a preset threshold value condition or no classification number capable of being further added exists;
and when the retrieval result is larger than a fifth threshold, sequentially adding the keywords with low priority in the priority order to construct a retrieval formula until the retrieval result meets a preset threshold condition or no keywords capable of being further added exist.
12. The information retrieval device according to claim 6, characterized in that:
a search formula construction unit for constructing a search formula according to a first search formula construction criterion or a second search formula construction criterion,
in the first search formula construction criterion, a logical OR operation is performed among classification numbers of the same or different levels, a logical AND operation is performed among keywords of the same level, a logical OR operation is performed among keywords of different levels, and a logical AND operation is performed between the keywords and the classification numbers;
in the second search formula construction criterion, a logical OR operation is performed among classification numbers of the same or different levels, a logical AND operation is performed among keywords of the same or different levels, and a logical AND operation is performed between the keywords and the classification numbers.
13. The information retrieval device according to claim 12, characterized in that:
when the retrieval result is smaller than a fourth threshold value, the retrieval formula construction unit constructs a retrieval formula according to a first retrieval formula construction criterion; and when the retrieval result is larger than a fifth threshold value, constructing the retrieval formula according to the second retrieval formula construction criterion.
14. The information retrieval device according to claim 12, characterized in that:
when constructing the search formula, the search formula constructing unit firstly constructs the search formula by using the keyword with the highest priority and the classification number with the highest priority, and judges whether the search result meets a preset threshold value condition.
15. The information retrieval device according to claim 13, characterized in that:
the retrieval formula constructing unit comprises a first unit, the first unit works when the retrieval result is judged to be smaller than a fourth threshold value by a second comparing unit after the keyword with the highest priority and the classification number with the highest priority construct the retrieval formula according to a first retrieval formula constructing criterion, the first unit adds the keywords with different levels in sequence according to the priority order, constructs the retrieval formula by adding the added keywords and the classification number with the highest priority according to the first retrieval formula constructing criterion, then adds the classification numbers with different levels in sequence according to the priority order, constructs the retrieval formula according to the first retrieval formula constructing criterion until the retrieval result is judged to meet the threshold value condition by the second comparing unit;
the second unit is used for working when the retrieval result is judged to be larger than a fifth threshold value by the second comparison unit after the keywords or the classification numbers of the specific level are added in the first unit, keeping the keywords in the retrieval formula unchanged according to the retrieval formula constructed before the retrieval result is larger than the fifth threshold value after the keywords or the classification numbers of the specific level are added, sequentially adding the classification numbers of lower levels according to the priority order, and constructing the retrieval formula according to the first retrieval formula construction criterion until the retrieval result is judged to meet the threshold value condition by the second comparison unit;
the third unit is used for working when the retrieval result is judged to be still smaller than a fourth threshold value after the keywords and the classification numbers of all levels are added in the first unit or the classification numbers of all levels are added in the second unit after the keywords and the classification numbers of all levels are compared by the second comparison unit, the third unit deletes the keywords in sequence from bottom to top in the multiple keywords of the same level according to the sequence from bottom to top until only a preset number of keywords are left in the level, and then constructs a retrieval formula according to the first retrieval formula construction criterion until the retrieval result is judged to meet the threshold value condition by the second comparison unit;
a search formula obtaining unit for obtaining a search formula, wherein the search formula meeting the search result is obtained when the second comparing unit judges that the search result meets the threshold condition; when the retrieval result of the retrieval formula constructed after the third unit deletes all the keywords of the levels is still smaller than a fourth threshold value, acquiring the retrieval formula constructed finally; when the search result of the search formula constructed after the third unit deletes the keyword at the specific level is larger than the fifth threshold, the search formula constructed before the search result is larger than the fifth threshold after the keyword at the specific level is deleted is acquired.
16. The information retrieval device according to claim 13, characterized in that:
the retrieval formula constructing unit comprises a fourth unit, the fourth unit works when the second comparing unit judges that the retrieval result is larger than a fifth threshold value after the keyword with the highest priority and the classification number with the highest priority construct the retrieval formula according to the first retrieval formula constructing criterion, the fourth unit deletes one of the classification numbers with the highest priority in sequence from back to front according to the classification numbers until only a preset number of classification numbers are left in the classification, the keyword keeps unchanged, and the retrieval formula is constructed according to the second retrieval formula constructing criterion until the retrieval result is judged to meet the threshold value condition after the comparison of the second comparing unit;
a fifth unit, which works when the retrieval result is judged to be smaller than a fourth threshold value after the specific classification number is deleted in the fourth unit and the retrieval result is smaller than the fourth threshold value after the specific classification number is deleted by the second comparing unit, wherein the fifth unit keeps the classification number in the retrieval formula unchanged according to the retrieval formula before the retrieval result is smaller than the fourth threshold value after the specific classification number is deleted, sequentially adds the keywords with lower levels according to the priority order, and constructs the retrieval formula according to a second retrieval formula constructing criterion until the retrieval result is judged to meet the threshold value condition by the second comparing unit;
a sixth unit, configured to operate when the fourth unit deletes the predetermined number of classification numbers to construct the search expression and the second comparison unit compares the classification numbers with each other and determines that the search result is still greater than a fifth threshold, wherein the sixth unit adds the keywords in the search expression in order from top to bottom according to the priority level according to the search expression, and constructs the search expression according to a second search expression constructing criterion until the second comparison unit determines that the search result satisfies the threshold condition;
a search formula obtaining unit for obtaining a search formula, wherein the search formula meeting the search result is obtained when the second comparing unit judges that the search result meets the threshold condition; when the retrieval result of the retrieval formula constructed by the sixth unit after adding all the keywords of the levels is judged to be still larger than the fifth threshold value by the second comparison unit, the retrieval formula constructed finally is obtained; when the search result of the search formula constructed after the fifth unit adds the keyword of the specific level is judged to be smaller than the fourth threshold value by the second comparing unit, the search formula constructed before the search result is smaller than the fourth threshold value after the keyword of the specific level is added is acquired.
17. The information retrieval device according to claim 1, 9, 11, 15, or 16, wherein:
the information retrieval device further comprises a similarity calculation unit which calculates the similarity of each file in the retrieval result with the specific patent input by the user, wherein the retrieval result is the result after the retrieval formula constructed by the retrieval formula construction unit;
and the sorting unit sorts each file in the search result according to the similarity.
18. The information retrieval device according to claim 17, characterized in that:
when the retrieval result of the retrieval formula constructed by the third unit after deleting all the keywords of the grades is judged to be still smaller than the fourth threshold value by the second comparison unit, the retrieval result obtained by the retrieval formula constructed by the second retrieval formula construction unit is obtained;
the similarity calculation unit calculates the similarity of each file in the retrieval result obtained by the retrieval formula constructed by the second retrieval formula construction unit;
and supplementing the files in the retrieval result of the retrieval formula constructed by the retrieval formula construction unit according to the sequence of the similarity from high to low, and performing file deduplication to ensure that the number of the supplemented files is equal to a fourth threshold value.
19. The information retrieval device according to claim 17, characterized in that:
when the sixth unit determines that the search result of the search formula constructed by adding the keywords of all the levels is still greater than the fifth threshold by the second comparing unit, the documents in the search result of the search formula constructed by the search formula constructing unit are sequentially deleted in the order of the degree of similarity calculated by the degree of similarity calculating unit from small to large until the number of documents is equal to the fifth threshold.
20. A computer system for information retrieval, comprising:
an input device for inputting a specific patent number by a user;
a memory having stored therein a thesaurus and predetermined computer instructions;
a processor for reading the corresponding computer instructions and synonyms from the memory, thereby enabling the computer system to receive the specific patent number input by the user; automatically extracting keywords from the patent information corresponding to the specific patent number; dividing the extracted keywords into a plurality of priorities according to the degree of correlation between the keywords and the specific patents; extracting classification numbers from the patent information corresponding to the specific patent number, and dividing the extracted classification numbers into a plurality of priorities; building a search expression by referring to synonyms read from a memory and the keywords and/or classification numbers in the order of priority from high to low until a search expression with search results meeting a preset threshold condition is built; when the retrieval result is less than a preset threshold value, adding a low-priority classification number or a low-priority keyword, or deleting the low-priority keyword or the keyword with the same priority level in a reverse order; and when the retrieval result is greater than a preset threshold value, adding the keywords with low priority levels or deleting the classification numbers with the same priority level in reverse order until the retrieval result meets a preset threshold value condition or the added or deleted classification numbers and keywords do not exist.
21. An information retrieval method, comprising:
a receiving step of receiving a specific patent number input by a user;
a keyword obtaining step of automatically extracting keywords from the patent information corresponding to the specific patent number;
a comparison step of dividing the extracted keywords into a plurality of priorities according to the degree of correlation between the keywords and the specific patent;
a classification number obtaining step of extracting classification numbers from the patent information corresponding to the specific patent number and dividing the extracted classification numbers into a plurality of priorities;
a retrieval formula construction step of constructing a retrieval formula from the keywords and/or the classification numbers in order of priority from high to low until a retrieval formula whose retrieval result meets a preset threshold condition is constructed; when the retrieval result is less than a preset threshold, adding a low-priority classification number or a low-priority keyword, or deleting low-priority keywords or keywords of the same priority level in reverse order; and when the retrieval result is greater than a preset threshold, adding low-priority keywords or deleting classification numbers of the same priority level in reverse order, until the retrieval result meets the preset threshold condition or no classification numbers or keywords remain to be added or deleted.
22. The information retrieval method according to claim 21, characterized in that:
the retrieval formula construction step refers to a synonym library of the keywords when constructing the retrieval formula.
23. The information retrieval method according to claim 21, characterized in that:
the keyword obtaining step comprises a high-priority word obtaining step and a semantic word obtaining step, wherein the high-priority word obtaining step obtains high-priority words from the manually processed data of the specific patent, and the semantic word obtaining step performs semantic word segmentation on the patent information to obtain semantic keywords.
24. The information retrieval method according to claim 23, characterized in that:
the semantic word obtaining step comprises a semantic word segmentation step and a filtering step, wherein the filtering step removes stop words and single-character words from the semantic word segmentation result of the patent information.
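The filtering step of claim 24 can be sketched as follows. This is only an illustration: it reads "shielding words" as stop words and "single words" as one-character tokens, and both the `STOP_WORDS` set and the function name `filter_semantic_keywords` are hypothetical, since the claim names neither a word list nor an implementation.

```python
# Hypothetical stop-word list; the claim does not enumerate one.
STOP_WORDS = {"the", "said", "wherein", "method", "device"}


def filter_semantic_keywords(tokens, stop_words=STOP_WORDS):
    """Drop stop words and single-character tokens, keeping the original order."""
    kept = []
    for token in tokens:
        word = token.strip().lower()
        # Skip one-character (or empty) tokens and anything on the stop list.
        if len(word) <= 1 or word in stop_words:
            continue
        kept.append(word)
    return kept
```

The input is assumed to be the output of whatever word segmenter is used upstream; segmentation itself is outside this sketch.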
25. The information retrieval method according to claim 21, characterized in that:
the classification number obtaining step divides the obtained classification numbers into a plurality of priorities according to one or more of: whether a classification number has been manually processed, whether it is a main classification number, and the type of classification system to which it belongs.
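One possible way to turn the three criteria of claim 25 into priority levels is sketched below. The claim only says the division uses "one or more" of the criteria; the specific ordering used here (manually processed first, then main classification, then classification system) and the default `system_rank` are assumptions made for illustration, as are the names `ClassNumber` and `split_into_priorities`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassNumber:
    code: str
    manually_processed: bool
    is_main: bool
    system: str  # name of the classification system, e.g. "IPC" or "CPC"


def split_into_priorities(numbers, system_rank=("CPC", "IPC")):
    """Group classification codes into levels, highest priority first.

    The sort key is an assumed ordering, not one stated in the claim.
    """
    def key(cn):
        if cn.system in system_rank:
            rank = system_rank.index(cn.system)
        else:
            rank = len(system_rank)  # unknown systems go last
        # False sorts before True, so preferred properties are negated.
        return (not cn.manually_processed, not cn.is_main, rank)

    levels = {}
    for cn in numbers:
        levels.setdefault(key(cn), []).append(cn.code)
    return [levels[k] for k in sorted(levels)]
```

Codes that share the same key land in the same level and keep their input order.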
26. The information retrieval method according to claim 21, characterized in that:
the predetermined threshold condition is that the search result is equal to or greater than a fourth threshold value and equal to or less than a fifth threshold value.
27. The information retrieval method according to claim 26, characterized in that:
the fourth threshold and the fifth threshold are dynamically variable.
28. The information retrieval method according to claim 27, characterized in that:
the information retrieval method further comprises a dynamic threshold determination step for adjusting the fourth threshold and the fifth threshold.
29. The information retrieval method according to claim 28, characterized in that:
the keyword obtaining step comprises a semantic word obtaining step of segmenting the patent information into words and removing stop words and single-character words from the segmentation result to obtain semantic keywords;
the dynamic threshold determination step comprises a second retrieval formula construction step and a data offset step, wherein the second retrieval formula construction step constructs a retrieval formula from the semantic keywords obtained in the semantic word obtaining step and the plurality of classification numbers extracted in the classification number obtaining step, performs retrieval with it, and obtains the retrieval hit count;
and the data offset step offsets the retrieval hit count upward and downward by a preset amount, taking the upward-offset value as the fifth threshold and the downward-offset value as the fourth threshold.
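The data shifting (offset) step of claim 29 reduces to simple arithmetic on the hit count. A minimal sketch, assuming the preset amount is an absolute number of documents and that the lower bound should not go below zero (the clamp is an assumption; the claim does not mention it). The function name `dynamic_thresholds` is hypothetical.

```python
def dynamic_thresholds(hit_count, offset):
    """Return (fourth_threshold, fifth_threshold) derived from a baseline hit count."""
    fourth = max(0, hit_count - offset)  # lower bound; clamping at 0 is an assumption
    fifth = hit_count + offset           # upper bound
    return fourth, fifth
```

For example, a baseline of 500 hits with a preset amount of 100 would give an acceptable range of 400 to 600 results.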
30. The information retrieval method according to claim 26, characterized in that:
in the retrieval formula construction step, when the retrieval formula is constructed, keywords of the same or different levels are combined with logical AND, classification numbers of the same or different levels are combined with logical OR, and the keywords and the classification numbers are combined with logical AND.
31. The information retrieval method according to claim 30, characterized in that:
in the retrieval formula construction step, when the retrieval result is less than the fourth threshold, classification numbers of lower priority are added in order of priority to construct the retrieval formula until the retrieval result meets the predetermined threshold condition or no further classification number can be added;
and when the retrieval result is greater than the fifth threshold, keywords of lower priority are added in order of priority to construct the retrieval formula until the retrieval result meets the predetermined threshold condition or no further keyword can be added.
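The adjustment loop of claim 31 can be sketched as below, under the combination rule of claim 30 (keywords ANDed, classification numbers ORed). Adding a classification number widens the result set and adding a keyword narrows it. The function `refine_formula`, the `count_hits` callback, and the toy corpus are all hypothetical stand-ins for a real patent search backend.

```python
def refine_formula(keyword_levels, class_levels, count_hits, lower, upper):
    """Widen with classification numbers or narrow with keywords until in range.

    keyword_levels, class_levels: lists of lists, highest priority first.
    count_hits: hypothetical callable (keywords, class_numbers) -> int.
    Returns (keywords, class_numbers, hits) for the last formula tried.
    """
    keywords = list(keyword_levels[0])
    classes = list(class_levels[0])
    next_kw, next_cls = 1, 1
    while True:
        hits = count_hits(keywords, classes)
        if lower <= hits <= upper:
            return keywords, classes, hits
        if hits < lower and next_cls < len(class_levels):
            # Too few results: OR in the next classification level.
            classes.extend(class_levels[next_cls])
            next_cls += 1
        elif hits > upper and next_kw < len(keyword_levels):
            # Too many results: AND in the next keyword level.
            keywords.extend(keyword_levels[next_kw])
            next_kw += 1
        else:
            # Nothing left to add in the needed direction.
            return keywords, classes, hits


# Toy in-memory corpus standing in for a patent database (illustrative only).
CORPUS = [
    {"words": {"search", "keyword"}, "cls": "A"},
    {"words": {"search", "keyword", "threshold"}, "cls": "A"},
    {"words": {"search"}, "cls": "B"},
    {"words": {"search", "threshold"}, "cls": "B"},
    {"words": {"search", "threshold"}, "cls": "C"},
]


def toy_count_hits(keywords, classes):
    """Count documents containing every keyword and carrying any listed class."""
    return sum(
        1
        for doc in CORPUS
        if doc["cls"] in classes and all(k in doc["words"] for k in keywords)
    )
```

Each pass either returns or consumes one priority level, so the loop ends after at most as many passes as there are levels in total.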
32. The information retrieval method according to claim 26, characterized in that:
in the retrieval formula construction step, the retrieval formula is constructed according to a first retrieval formula construction criterion or a second retrieval formula construction criterion, wherein,
in the first retrieval formula construction criterion, classification numbers of the same or different levels are combined with logical OR, keywords of the same level are combined with logical AND, keywords of different levels are combined with logical OR, and the keywords and the classification numbers are combined with logical AND;
in the second retrieval formula construction criterion, classification numbers of the same or different levels are combined with logical OR, keywords of the same or different levels are combined with logical AND, and the keywords and the classification numbers are combined with logical AND.
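The two construction criteria of claim 32 differ only in how keywords of different levels are joined. The sketch below renders both as Boolean query strings; the `AND`/`OR` text syntax and the function name `build_formula` are illustrative assumptions and do not correspond to any particular search engine's query language. Both inputs are assumed to be non-empty.

```python
def _join(terms, operator):
    """Parenthesise a list of terms joined by the given Boolean operator."""
    return "(" + f" {operator} ".join(terms) + ")"


def build_formula(keyword_levels, class_levels, criterion):
    """Build a Boolean query string under criterion 1 or criterion 2.

    keyword_levels, class_levels: non-empty lists of lists, highest priority first.
    """
    # Both criteria: OR across all classification numbers, regardless of level.
    class_part = _join([c for level in class_levels for c in level], "OR")
    if criterion == 1:
        # AND within a keyword level, OR across keyword levels.
        keyword_part = _join([_join(level, "AND") for level in keyword_levels], "OR")
    elif criterion == 2:
        # AND across all keywords, regardless of level.
        keyword_part = _join([k for level in keyword_levels for k in level], "AND")
    else:
        raise ValueError("criterion must be 1 or 2")
    # Both criteria: keywords AND classification numbers.
    return f"{keyword_part} AND {class_part}"
```

Because criterion 1 ORs keyword levels together, each added level can only widen the result set, whereas under criterion 2 each added keyword can only narrow it. This matches claim 33, which selects criterion 1 when results are too few and criterion 2 when they are too many.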
33. The information retrieval method according to claim 32, characterized in that:
in the retrieval formula construction step, when the retrieval result is smaller than a fourth threshold value, constructing a retrieval formula according to a first retrieval formula construction criterion; and when the retrieval result is larger than a fifth threshold value, constructing the retrieval formula according to the second retrieval formula construction criterion.
34. The information retrieval method according to claim 32, characterized in that:
in the retrieval formula construction step, when constructing the retrieval formula, the keyword with the highest priority and the classification number with the highest priority are firstly used for constructing the retrieval formula, and whether the retrieval result meets a preset threshold value condition or not is judged.
35. The information retrieval method according to claim 33, characterized in that:
the retrieval formula construction step comprises a first step, which operates when, after a retrieval formula has been constructed from the highest-priority keyword and the highest-priority classification number according to the first retrieval formula construction criterion, the second comparison step judges the retrieval result to be smaller than the fourth threshold; the first step first keeps the classification numbers unchanged, adds keywords of successive levels in order of priority, and constructs the retrieval formula from the added keywords and the classification numbers according to the first retrieval formula construction criterion; it then adds classification numbers of successive levels in order of priority and constructs the retrieval formula according to the first retrieval formula construction criterion, until the second comparison step judges that the retrieval result meets the threshold condition;
a second step, which operates when the second comparison step judges the retrieval result to be larger than the fifth threshold after keywords or classification numbers of a specific level are added in the first step; starting from the retrieval formula constructed immediately before that addition, the second step keeps the keywords in the retrieval formula unchanged, adds classification numbers of lower levels in order of priority, and constructs the retrieval formula according to the first retrieval formula construction criterion, until the second comparison step judges that the retrieval result meets the threshold condition;
a third step, which operates when the second comparison step judges the retrieval result to be still smaller than the fourth threshold after the keywords and classification numbers of all levels have been added in the first step, or after the classification numbers of all levels have been added in the second step; among the plurality of keywords of the same level, the third step deletes keywords one by one from lowest to highest priority until only a preset number of keywords remain in that level, and then constructs the retrieval formula according to the first retrieval formula construction criterion, until the second comparison step judges that the retrieval result meets the threshold condition;
a retrieval formula obtaining step, wherein: when the second comparison step judges that the retrieval result meets the threshold condition, the retrieval formula that produced that result is obtained; when the retrieval result of the retrieval formula constructed after the keywords of all levels have been deleted in the third step is still smaller than the fourth threshold, the last constructed retrieval formula is obtained; and when the retrieval result of the retrieval formula constructed after keywords of a specific level are deleted in the third step is larger than the fifth threshold, the retrieval formula constructed immediately before that deletion is obtained.
36. The information retrieval method according to claim 33, characterized in that:
the retrieval formula construction step comprises a fourth step, which operates when, after a retrieval formula has been constructed from the highest-priority keyword and the highest-priority classification number according to the first retrieval formula construction criterion, the second comparison step judges the retrieval result to be larger than the fifth threshold; the fourth step deletes the highest-priority classification numbers one by one, from last to first, until only a preset number of classification numbers remain in that level, keeps the keywords unchanged, and constructs the retrieval formula according to the second retrieval formula construction criterion, until the second comparison step judges that the retrieval result meets the threshold condition;
a fifth step, which operates when the second comparison step judges the retrieval result to be smaller than the fourth threshold after a specific classification number is deleted in the fourth step; starting from the retrieval formula as it stood before that classification number was deleted, the fifth step keeps the classification numbers in the retrieval formula unchanged, adds keywords of lower levels in order of priority, and constructs the retrieval formula according to the second retrieval formula construction criterion, until the second comparison step judges that the retrieval result meets the threshold condition;
a sixth step, which operates when the second comparison step judges the retrieval result to be still larger than the fifth threshold after the retrieval formula has been constructed in the fourth step with the classification numbers deleted down to the preset number; starting from that retrieval formula, the sixth step keeps the classification numbers unchanged, sequentially adds keywords of the same level in order from high priority to low priority, and constructs the retrieval formula according to the second retrieval formula construction criterion, until the second comparison step judges that the retrieval result meets the threshold condition;
a retrieval formula obtaining step, wherein: when the second comparison step judges that the retrieval result meets the threshold condition, the retrieval formula that produced that result is obtained; when the second comparison step judges that the retrieval result of the retrieval formula constructed after the keywords of all levels have been added in the sixth step is still larger than the fifth threshold, the last constructed retrieval formula is obtained; and when the second comparison step judges that the retrieval result of the retrieval formula constructed after keywords of a specific level are added in the fifth step is smaller than the fourth threshold, the retrieval formula constructed immediately before that addition is obtained.
37. The information retrieval method according to claim 21, 29, 31 or 36, wherein:
the information retrieval method further comprises a similarity calculation step of calculating the similarity between each file in the retrieval result and the specific patent input by the user, wherein the retrieval result is the result of retrieval with the retrieval formula constructed in the retrieval formula construction step;
and a sorting step of sorting the files in the retrieval result by similarity.
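Claim 37 does not say how similarity is computed. The sketch below swaps in Jaccard similarity over token sets purely as a stand-in measure, and sorts the results in descending order. The names `jaccard` and `rank_by_similarity`, and the representation of each file as a list of tokens, are assumptions made for illustration.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections (a stand-in similarity measure)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query_tokens, results):
    """Score each result against the query patent and sort, most similar first.

    results: dict mapping a document id to that document's tokens.
    Returns a list of (doc_id, score) pairs.
    """
    scored = [(doc_id, jaccard(query_tokens, tokens)) for doc_id, tokens in results.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Any other measure (for example a vector-space cosine score) could replace `jaccard` without changing the sorting step.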
38. The information retrieval method according to claim 37, characterized in that:
when the second comparison step judges that the retrieval result of the retrieval formula constructed in the third step after the keywords of all levels have been deleted is still smaller than the fourth threshold, the retrieval result of the retrieval formula constructed in the second retrieval formula construction step is obtained;
the similarity calculation step calculates the similarity of each file in the retrieval result obtained with the retrieval formula constructed in the second retrieval formula construction step;
and those files are added, in descending order of similarity and with duplicate files removed, to the retrieval result of the retrieval formula constructed in the retrieval formula construction step, until the number of files after supplementation equals the fourth threshold.
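The supplementing step of claim 38 can be sketched as a merge with deduplication. The function name `supplement` and the representation of candidates as `(doc_id, similarity)` pairs are hypothetical; the similarity scores are assumed to have been computed beforehand.

```python
def supplement(primary_ids, scored_candidates, fourth_threshold):
    """Top up a short result list from a broader candidate set.

    primary_ids: ids from the main retrieval formula, already too few.
    scored_candidates: (doc_id, similarity) pairs from the broader formula.
    Candidates are added most similar first, skipping ids already present,
    until the list reaches fourth_threshold or the candidates run out.
    """
    merged = list(primary_ids)
    seen = set(merged)
    for doc_id, _score in sorted(scored_candidates, key=lambda p: p[1], reverse=True):
        if len(merged) >= fourth_threshold:
            break
        if doc_id in seen:
            continue  # deduplicate against what is already in the list
        seen.add(doc_id)
        merged.append(doc_id)
    return merged
```

If the candidate set is itself too small, the returned list simply stays below the threshold; the claim does not say what should happen in that case.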
39. The information retrieval method according to claim 37, characterized in that:
when the second comparison step judges that the retrieval result of the retrieval formula constructed in the sixth step after the keywords of all levels have been added is still larger than the fifth threshold, the files in the retrieval result of the retrieval formula constructed in the retrieval formula construction step are deleted one by one in ascending order of similarity until the number of files equals the fifth threshold.
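Deleting the least similar files one at a time until the fifth threshold is reached, as in claim 39, gives the same result as keeping the most similar files up to that count. A minimal sketch, with the hypothetical name `trim_to_threshold` and results represented as `(doc_id, similarity)` pairs:

```python
def trim_to_threshold(scored_results, fifth_threshold):
    """Keep only the fifth_threshold most similar results.

    scored_results: (doc_id, similarity) pairs in any order.
    Returns the surviving pairs, most similar first.
    """
    ranked = sorted(scored_results, key=lambda pair: pair[1], reverse=True)
    return ranked[:fifth_threshold]
```

Python's sort is stable, so results with equal similarity keep their original relative order; how ties should be broken is not specified in the claim.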
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610809109.9A CN106372226B (en) | 2016-09-07 | 2016-09-07 | Information retrieval device and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106372226A (en) | 2017-02-01 |
CN106372226B (en) | 2020-08-25 |
Family
ID=57898935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610809109.9A Active CN106372226B (en) | 2016-09-07 | 2016-09-07 | Information retrieval device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106372226B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201832105A (en) * | 2017-02-17 | 2018-09-01 | 雲拓科技有限公司 | Method of suggesting keywords for patent searching |
WO2018161309A1 (en) * | 2017-03-09 | 2018-09-13 | 深圳市华第时代科技有限公司 | Automatic duplication checking method and device |
CN106934010A (en) * | 2017-03-09 | 2017-07-07 | 深圳市华第时代科技有限公司 | Automatic duplicate checking method and device |
CN108664508B (en) * | 2017-03-31 | 2021-12-24 | 百度在线网络技术(北京)有限公司 | Information pushing method and device |
CN108920484B (en) * | 2018-04-28 | 2022-06-10 | 广州市百果园网络科技有限公司 | Search content processing method and device, storage device and computer device |
CN110503281A (en) * | 2018-05-16 | 2019-11-26 | 北京牡丹电子集团有限责任公司 | Innovative product value-added tax function develops assistant system and its method |
CN110895556B (en) * | 2018-09-13 | 2023-07-28 | 北京蓝灯鱼智能科技有限公司 | Text retrieval method and device, storage medium and electronic device |
CN109344224A (en) * | 2018-09-18 | 2019-02-15 | 江苏润桐数据服务有限公司 | A kind of automatic denoising method of patent retrieval and device |
CN109359299A (en) * | 2018-09-28 | 2019-02-19 | 中国电子科技集团公司信息科学研究院 | A kind of internet of things equipment ability ontology based on commodity data is from construction method |
CN110083674B (en) * | 2019-03-04 | 2023-05-12 | 深圳云联智汇物联科技有限公司 | Intellectual property information processing method and device |
CN110597863B (en) * | 2019-09-25 | 2023-01-24 | 上海依图网络科技有限公司 | Retrieval system and method for keeping stable performance in control library through dynamic threshold |
CN111538880B (en) * | 2020-04-28 | 2022-08-05 | 中南林业科技大学 | Intelligent analysis and retrieval system for tenon structural design |
CN112131455B (en) * | 2020-09-28 | 2021-09-17 | 贝壳找房(北京)科技有限公司 | List page retrieval degradation method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005234868A (en) * | 2004-02-19 | 2005-09-02 | Ntt Data Corp | Similar patent specification retrieval system, method therefor and program |
CN101539916A (en) * | 2008-03-17 | 2009-09-23 | 亿维讯软件(北京)有限公司 | Initial patent retrieving device, secondary patent retrieving device and patent retrieving system |
CN101546306A (en) * | 2008-03-27 | 2009-09-30 | 上海市知识产权服务中心 | Method and system for searching patent documentation by utilizing IPC classification |
Also Published As
Publication number | Publication date |
---|---|
CN106372226A (en) | 2017-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106372226B (en) | Information retrieval device and method | |
CN108563773B (en) | Knowledge graph-based legal provision accurate search ordering method | |
KR100816923B1 (en) | System and method for classifying document | |
CN106815263B (en) | The searching method and device of legal provision | |
US7814105B2 (en) | Method for domain identification of documents in a document database | |
US8473532B1 (en) | Method and apparatus for automatic organization for computer files | |
AU2018349276A1 (en) | Methods and system for semantic search in large databases | |
JP2016532173A (en) | Semantic information, keyword expansion and related keyword search method and system | |
KR20070089449A (en) | Method of classifying documents, computer readable record medium on which program for executing the method is recorded | |
CN107844493B (en) | File association method and system | |
US20080228752A1 (en) | Technical correlation analysis method for evaluating patents | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN111506727B (en) | Text content category acquisition method, apparatus, computer device and storage medium | |
CN102012915A (en) | Keyword recommendation method and system for document sharing platform | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
CN106844482B (en) | Search engine-based retrieval information matching method and device | |
US7249122B1 (en) | Method and system for automatic harvesting and qualification of dynamic database content | |
CN117171331B (en) | Professional field information interaction method, device and equipment based on large language model | |
JP5418138B2 (en) | Document search system, information processing apparatus, and program | |
CN113656575A (en) | Training data generation method and device, electronic equipment and readable medium | |
KR100407081B1 (en) | Document retrieval and classification method and apparatus | |
CN115526601A (en) | File management method and device | |
CN107577690A (en) | The recommendation method and recommendation apparatus of magnanimity information data | |
WO2002037328A2 (en) | Integrating search, classification, scoring and ranking | |
CN112597106A (en) | Document page skipping method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||