CN115587162A - Method for converting patent retrieval expression into search engine query statement - Google Patents


Info

Publication number
CN115587162A
Authority
CN
China
Prior art keywords
node
executing
word segmentation
standard
root node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211201513.XA
Other languages
Chinese (zh)
Inventor
李扩拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Rongsheng Intellectual Property Platform Co ltd
Original Assignee
Shaanxi Rongsheng Intellectual Property Platform Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Rongsheng Intellectual Property Platform Co ltd
Priority to CN202211201513.XA
Publication of CN115587162A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for converting a patent retrieval expression into a search engine query statement, comprising the following steps: acquiring a patent retrieval expression to be processed; parsing the character string corresponding to the patent retrieval expression with a pre-constructed word segmenter to obtain a plurality of word segments; processing the word segments into a list of standard syntax nodes based on the patent retrieval expression syntax structure; generating a standard syntax tree from the list of standard syntax nodes; and converting the standard syntax tree into a query statement of the target search engine with a pre-constructed syntax converter matched to the target search engine. The patent retrieval expression and the word segmenter are both constructed based on a predefined, extensible patent retrieval expression syntax structure. The invention converts a patent retrieval expression into the corresponding query statement of a search engine, is applicable to a variety of search engines, supports heterogeneous retrieval of patent data, adapts to different data formats and data storage modes, and is highly adaptable and extensible.

Description

Method for converting patent retrieval expression into search engine query statement
Technical Field
The invention belongs to the field of data retrieval technology and language processing, and particularly relates to a method for converting a patent retrieval expression into a search engine query statement.
Background
With the rapid development of technology in every field of human society, countless products of human ingenuity have emerged. At the same time, awareness of intellectual property protection keeps growing. Patents are one form of intellectual property; hundreds of millions of them have accumulated worldwide, and this huge body of patents holds great value. Extracting and querying patent information effectively is therefore the basis for exploiting that value. As informatization in the intellectual property field has improved, managing patent data is no longer the main problem; multi-dimensional retrieval of patent data has instead become the key technology and core capability for patent information query.
Multi-dimensional retrieval of patent data generally requires parsing a patent retrieval expression and converting it into a query language that a search engine can recognize. For example, in the invention patent 'a method and a system for converting expression retrieval into an Elasticsearch statement' filed by Shanghai Information Technology (Shanghai) Limited, an expression composed of search terms and logical operators is parsed into a retrieval command that the Elasticsearch engine can recognize and execute, and the retrieval result is obtained through Elasticsearch.
However, the prior art is applicable only to one specific search engine, such as the aforementioned Elasticsearch. It also lacks adaptability and extensibility when retrieving heterogeneous data. Moreover, patent data formats differ from country to country and data storage modes vary, and the prior art cannot adapt to these different data formats and storage modes.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a method for converting a patent retrieval expression into a search engine query statement. The technical problem to be solved by the invention is addressed by the following technical scheme:
acquiring a patent retrieval expression to be processed;
parsing the character string corresponding to the patent retrieval expression with a pre-constructed word segmenter to obtain a plurality of word segments, wherein the patent retrieval expression and the word segmenter are constructed on the basis of a predefined, extensible patent retrieval expression syntax structure;
processing the plurality of word segments into a list of standard syntax nodes based on the patent retrieval expression syntax structure;
generating a standard syntax tree from the list of standard syntax nodes;
and converting the standard syntax tree into a query statement of the target search engine with a pre-constructed syntax converter matched to the target search engine.
The invention has the beneficial effects that:
In the scheme provided by the embodiment of the invention, an extensible patent retrieval expression syntax structure is predefined, so that any patent retrieval expression to be processed is written according to that syntax structure. On the basis of this syntax structure, a general-purpose word segmenter is constructed in advance, and a matching syntax converter is constructed in advance for each target search engine. The patent retrieval expression to be processed is then converted into the corresponding query statement of the target search engine using the word segmenter and the syntax converter.
Specifically, after a patent retrieval expression to be processed is acquired, the word segmenter parses the corresponding character string into a plurality of word segments; the word segments are processed into a list of standard syntax nodes based on the patent retrieval expression syntax structure; and a standard syntax tree is generated from that list, so that the patent retrieval expression is modeled as a standard data structure. The standard syntax tree carries general semantics and is independent of the actual data structure, data storage mode and data query engine; because it is highly standardized, it adapts well to various data formats, data storage modes, data query engines and heterogeneous retrieval. Finally, the syntax converter converts the standard syntax tree into a query statement of the target search engine so that the patent retrieval can be performed.
Because a syntax converter matched to the search engine actually used for patent queries can be constructed conveniently, and the standard syntax tree can thus be converted into the query language of that specific search engine, the embodiment of the invention is highly adaptable and extensible and makes patent retrieval considerably more convenient.
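The separation between an engine-independent syntax tree and per-engine converters can be illustrated with a minimal Python sketch. The node classes (`Term`, `BoolOp`), the two converter functions and both output formats are hypothetical illustrations and are not the structures defined by the embodiment; the dictionary only resembles an Elasticsearch bool query and the string only resembles Lucene query syntax.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Term:
    """Leaf node: a field name paired with a field value (hypothetical shape)."""
    field: str
    value: str


@dataclass
class BoolOp:
    """Inner node: a logical operator joining two sub-expressions."""
    op: str  # "AND" or "OR"
    left: "Node"
    right: "Node"


Node = Union[Term, BoolOp]


def to_es_like(node: Node) -> dict:
    """Convert the tree into a dict resembling an Elasticsearch bool query."""
    if isinstance(node, Term):
        return {"match": {node.field: node.value}}
    clause = "must" if node.op == "AND" else "should"
    return {"bool": {clause: [to_es_like(node.left), to_es_like(node.right)]}}


def to_lucene_like(node: Node) -> str:
    """Convert the same tree into a string resembling Lucene query syntax."""
    if isinstance(node, Term):
        return f'{node.field}:"{node.value}"'
    return f"({to_lucene_like(node.left)} {node.op} {to_lucene_like(node.right)})"


# Tree for the expression: TI = computer AND specification = computing device
tree = BoolOp("AND", Term("TI", "computer"), Term("specification", "computing device"))
es_query = to_es_like(tree)
lucene_query = to_lucene_like(tree)
```

In this sketch, supporting another engine means adding one more converter function over the same tree; the tree itself and everything that builds it stay unchanged.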
Drawings
FIG. 1 is a schematic flowchart of a method for converting a patent retrieval expression into a search engine query statement according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of the algorithm of the next function according to an embodiment of the present invention;
FIG. 3 is a diagram of the standard syntax tree generated in the first embodiment of the present invention;
FIGS. 4(a) to 4(c) show the contents of the different conversions of the first embodiment based on FIG. 3;
FIG. 5 is a diagram of the standard syntax tree generated in the second embodiment of the present invention;
FIGS. 6(a) to 6(b) show the contents of the different conversions of the second embodiment based on FIG. 5.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to be suitable for various search engines, provide adaptability and expandability of a retrieval expression for retrieval of heterogeneous data, and adapt to different data formats and data storage modes, the embodiment of the invention provides a method for converting a patent retrieval expression into a search engine query statement.
It should be noted that the execution subject of the method for converting the patent retrieval expression into the search engine query statement provided by the embodiment of the present invention may be a device for converting the patent retrieval expression into the search engine query statement, and the device may be run in an electronic device. The electronic device may be a server or a terminal device, but is not limited thereto.
As shown in fig. 1, a method for converting a patent retrieval expression into a search engine query statement according to an embodiment of the present invention may include the following steps:
S1, acquiring a patent retrieval expression to be processed;
In order to improve adaptability to various search engines, data structures, data formats and data storage modes, the embodiment of the invention predefines an extensible patent retrieval expression syntax structure and implements the method on that basis. The patent retrieval expression to be processed is constructed according to this predefined syntax structure. To aid understanding, the extensible patent retrieval expression syntax structure is introduced first.
In an optional embodiment, any patent search expression constructed based on a predefined extensible patent search expression syntax structure includes:
field names, operators, and field values.
The field name is the name of a patent field to be searched. Field names correspond to the content items that appear in patent records, such as bibliographic items and text content; for example, field names may include application number, title, description, current applicant (patentee), inventor, agency, application date, priority country, IPC classification number, application-domain classification, and so on. Field names may also cover content items relating to citations, patent families, legal status, licensing, litigation, transfer of rights, and the like; the list here is not exhaustive. A field name may consist of any characters, such as Chinese characters, English letters and symbols, but must not contain characters used by operators; for example, field names may be patent title, AP, IPC-MAIN, etc.
The field value is the content to be searched for in the named field; for example, to retrieve patents whose title contains "engine", the field name is the title and the field value is "engine". A field value may be a number, a date, a plain character string, a character string with wildcards, and so on. For example, field values may be car, comput*, 20190101, 7, etc. Here, car and 7 are plain character strings; 20190101 is a date; and comput* is a character string with a wildcard, where * is the wildcard and stands for zero or more characters, so that comput* matches patents in which the content of the named field contains a string beginning with comput.
The operator represents an operation on a field name and a field value, and an operation on a sub-expression in the patent retrieval expression.
In the embodiment of the present invention, the operators include at least the following 8 types:
1) Logical operators. For example, AND, OR, NOT, etc.
Specifically, AND indicates that the search terms on both sides of it must both be present; OR indicates that at least one of the search terms on its two sides must be present; NOT excludes the search term that follows it.
2) Truncation operators. For example, "*", "?", "$", etc.
The symbols within the quotation marks are exemplary truncation operators, which are used for fuzzy matching. "*" is unlimited truncation, typically used as a suffix to replace zero, one or more characters; "?" is often used in the middle of a word to replace exactly one character; "$" replaces zero or one character.
3) Position operators. For example, (w), (n), etc.
Specifically, (w) indicates that the search terms on its two sides must appear in the given order with no other word inserted between them; only a space or a punctuation mark may separate them. (n) indicates that the search terms on its two sides may appear in either order, again with no other word inserted between them, although a space or a punctuation mark is allowed.
4) Same-sentence operators. For example, (s), etc.
Specifically, (s) indicates that the connected terms must appear in the same sentence, without restricting their relative order or the number of intervening words.
5) Same-paragraph operators. For example, (p), etc.
Specifically, (p) indicates that the connected terms must appear in the same paragraph, without restricting their relative order or the number of intervening words.
6) Range operators. For example, to, >=, <>, ==, etc.
Specifically, = denotes an exact match of the content; the meanings of the remaining range operators are not described here.
7) Special characters. For example, ".", "/", etc.
The symbols within the double quotation marks are exemplary special characters. Special characters carry no meaning during retrieval and are ignored.
8) C-SETS operators. For example, %n, /HIGH, /LOW, /SEN, /FREC, etc.
Specifically, the %n operator restricts the frequency of each group of classification numbers: when n is a number greater than 1, documents in which the classification number occurs n times are retrieved; /HIGH expands the search to the classification number in the query and its upper-level classification numbers; /LOW expands the search to the classification number itself and its lower-level classification numbers; /SEN searches by the position of the classification number and supports setting a position range; /FREC searches by the number of occurrences of the entire C-SETS classification number.
For the specific meanings and usage rules of the above 8 operators, please refer to the related art for understanding, and detailed description is omitted here.
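As an illustration of the truncation operators described above, the following Python sketch translates a field value containing "*", "?" and "$" into a regular expression. The function names and the choice of a regular expression as the target are assumptions made for this example; an actual search engine would normally use its own wildcard query facilities.

```python
import re

# Mapping taken from the truncation-operator descriptions above.
TRUNCATION_TO_REGEX = {
    "*": ".*",   # zero, one or more characters
    "?": ".",    # exactly one character
    "$": ".?",   # zero or one character
}


def truncation_to_regex(value: str) -> re.Pattern:
    """Translate a field value with truncation operators into a regex."""
    parts = [TRUNCATION_TO_REGEX.get(ch, re.escape(ch)) for ch in value]
    return re.compile("".join(parts))


def matches(value: str, word: str) -> bool:
    """Return True if the whole word matches the truncated field value."""
    return truncation_to_regex(value).fullmatch(word) is not None
```

For instance, `matches("comput*", "computer")` and `matches("colo$r", "colour")` both hold under this mapping, while `matches("en??er", "enter")` does not, because only one character separates "en" and "er".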
Any of these operators can be combined with field names and field values, either alone or together with other operators according to certain rules, to form a patent retrieval expression. For ease of understanding, several examples of the patent retrieval expressions processed by the embodiment of the present invention are given below.
Example 1: "TI = (engine OR motor NOT engine)". This patent retrieval expression searches for patents whose title (TI is the abbreviation of the title field) contains the word "engine" or "motor" but does not contain the word "engine". In this example, the field name is TI; the field values are engine, motor and engine; the operators include the two logical operators OR and NOT.
Example 2: "TI = en??er". This patent retrieval expression requires part of the patent title to match the rule that exactly two characters appear between en and er. In this example, the field name is TI; the field values are en and er; the operator is the truncation operator "?", each occurrence of which stands for one character.
Example 3: "TI = (different (3w) focus (9w) photograph)". This patent retrieval expression requires the patent title to contain the three words "different", "focus" and "photograph", with 0 to 3 characters between "different" and "focus" and 0 to 9 characters between "focus" and "photograph". In this example, the field name is TI; the field values are different, focus and photograph; the operator is the position operator (w), and the number preceding w is the upper limit on the number of intervening characters.
Example 4: "specification = (transformer (p) capacitor (p) feedback loop)". This patent retrieval expression requires that a single paragraph of the patent specification contain all three terms "transformer", "capacitor" and "feedback loop". In this example, the field name is specification; the field values are transformer, capacitor and feedback loop; the operator is the same-paragraph operator (p).
Example 5: "TI = computer AND specification = computing device". This patent retrieval expression consists of two sub-expressions, "TI = computer" and "specification = computing device", each of which includes a field name, an operator and a field value. The two sub-expressions are joined by the logical operator AND, meaning that the patent title must contain the word "computer" and the patent specification must contain the term "computing device". In the two sub-expressions, the field names are TI and specification respectively; the field values are computer and computing device respectively; the operator joining them is the logical operator AND.
It should be noted that, in the embodiment of the present invention, both the field names and the operators are extensible. That is, the patent retrieval expression syntax structure defined by the embodiment can be freely extended according to actual needs and is not limited to the above examples.
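The numbered position operator of Example 3 can be sketched in Python as follows. The sketch reads the number before w as an upper limit on intervening characters, as Example 3 states, and treats a bare (w) as a limit of zero, which is a simplifying assumption. The function name and the use of a regular expression are illustrative only.

```python
import re

# Matches a position operator such as "(3w)" or "(w)", with optional spaces.
POSITION_OP = re.compile(r"\s*\(\s*(\d*)\s*w\s*\)\s*", re.IGNORECASE)


def proximity_to_regex(value: str) -> re.Pattern:
    """Build a regex from terms joined by (Nw) operators.

    N is the upper limit on intervening characters; a bare (w) is treated
    as N = 0 here, which is an assumption made for this sketch.
    """
    pieces = POSITION_OP.split(value.strip())
    # With one capture group, split yields: term, N, term, N, ..., term
    pattern = re.escape(pieces[0])
    for n, term in zip(pieces[1::2], pieces[2::2]):
        pattern += ".{0,%d}" % int(n or 0) + re.escape(term)
    return re.compile(pattern)


rx = proximity_to_regex("different (3w) focus (9w) photograph")
```

Under these assumptions, `rx.search("a different focus photograph")` finds a match, whereas a title such as "different kinds of focus photograph" does not match because more than three characters separate the first two terms.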
S2, parsing the character string corresponding to the patent retrieval expression with a pre-constructed word segmenter to obtain a plurality of word segments;
The word segmenter is constructed based on the predefined extensible patent retrieval expression syntax structure.
Specifically, once the patent retrieval expression syntax structure has been defined, a word segmenter dedicated to parsing patent retrieval expressions that conform to this syntax structure can be constructed.
The word segmenter works by splitting the character string corresponding to the patent retrieval expression into a plurality of word segments according to the blank-character set, symbol set and keyword set defined by the embodiment of the invention. The resulting word segments carry no logic; each is merely a basic unit obtained by dividing the character string.
The blank-character set consists of blank characters, which serve as separators. In the embodiment of the invention, the blank characters are the space, carriage return, line feed and tab characters; the carriage return, line feed and tab are written \r, \n and \t respectively.
The symbol set consists of a plurality of symbols, including various conventional symbols such as =, >=, <, (, ), ", and the like. The symbol set covers a wider range of symbols than those used in field names and in operators.
The keyword set consists of a plurality of keywords. Each keyword is a complete, meaningful unit and may be in Chinese, English or other characters, for example students, cars, bag, and, or, not, to, and the like.
When parsing, the word segmenter is first initialized with the blank-character set, the symbol set and the keyword set, and it maintains a pointer to the offset currently being processed in the patent retrieval expression; that is, the position the pointer indicates is the position of the expression to be processed next. After the operation at that position is finished, the pointer moves backward (that is, toward the end of the string) so that the subsequent characters can be processed, until the word segmentation of the whole patent retrieval expression is complete.
In an optional embodiment, parsing the character string corresponding to the patent retrieval expression with the pre-constructed word segmenter to obtain a plurality of word segments includes:
based on the predefined blank-character set, symbol set and keyword set, using a preset next function to obtain the next word segment of the patent retrieval expression, scanning backward from the current position of the pointer in the corresponding character string, and updating the pointer position; and repeating this process until no further word segment can be obtained, thereby obtaining the plurality of word segments of the patent retrieval expression.
To simplify processing and improve efficiency, the embodiment of the present invention exposes a next function that obtains the next word segment starting from the position the pointer currently indicates. The next function ignores blank characters, and the type of every word segment it returns is one of symbol, keyword or character string; a character string is any character or combination of characters that is not an element of the blank-character set, the symbol set or the keyword set.
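The three sets and the three word segment types can be written down in a few lines of Python. The set contents below are small illustrative samples drawn from the examples in the text, and treating keywords as case-insensitive is an assumption of this sketch; the function name `segment_type` is likewise hypothetical.

```python
# Illustrative samples only; the real sets are defined by the syntax structure.
BLANKS = {" ", "\r", "\n", "\t"}
SYMBOLS = {"=", ">=", "<", "(", ")", '"'}
KEYWORDS = {"and", "or", "not", "to"}


def segment_type(segment: str) -> str:
    """Classify a word segment as 'symbol', 'keyword' or 'string'."""
    if segment in BLANKS:
        raise ValueError("blank characters are never returned as word segments")
    if segment in SYMBOLS:
        return "symbol"
    if segment.lower() in KEYWORDS:  # assumption: keywords are case-insensitive
        return "keyword"
    return "string"


types = [segment_type(s) for s in ["TI", "=", "(", "engine", "OR", "motor", ")"]]
```

Here `types` classifies the segments of "TI = (engine OR motor)": the brackets and "=" are symbols, "OR" is a keyword, and everything else falls into the character string type.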
In an optional implementation, referring to FIG. 2, using the preset next function to obtain the next word segment of the patent retrieval expression, backward from the current position of the pointer in the corresponding character string and based on the predefined blank-character set, symbol set and keyword set, includes:
Step a1, judging, according to the current position pointed to by the pointer, whether the last matching results before the current position include an unprocessed matching result; if yes, executing step a2; if not, executing step a5;
At any moment, the position pointed to by the pointer is the position of the patent retrieval expression to be processed next. From this position, the last matching result before it can be obtained; there may be one such matching result or several. Each matching result means that, after part of the character string corresponding to the patent retrieval expression has been read, the content of that part has been confirmed to be a blank character, a symbol or a keyword. A matching result may be judged immediately to be, or not to be, a word segment, in which case it is a processed matching result. Alternatively, it may not yet be possible to decide whether the matching result is a word segment, so the decision is left for later, in which case it is an unprocessed matching result. In other words, an unprocessed matching result is one whose content type has been confirmed as blank character, symbol or keyword but for which no word segmentation judgment (word segment or not) has yet been made. For example, suppose the unprocessed matching result is the keyword "i", and the keyword set still contains keywords involving "i"; while the matching result of the subsequent characters is unknown, the keyword "i" cannot be immediately determined to be a word segment.
Step a2, obtaining the last unprocessed matching result;
If there are several unprocessed matching results, the one closest to the current position is obtained.
Step a3, judging whether the obtained unprocessed matching result is a blank character; if yes, executing step a1; if not, executing step a4;
Specifically, a blank character serves only as a separator and has no meaning of its own. If the obtained unprocessed matching result is a blank character, it cannot be a word segment; it is marked as processed, and the flow returns to step a1 to look for the next word segment. The pointer will have moved past the blank character before the flow returns to step a1.
Step a4, determining the obtained unprocessed matching result to be a word segment;
Specifically, if the obtained unprocessed matching result is not a blank character, it is determined to be a word segment.
Step a5, trying to obtain the next matching result by moving the pointer character by character backwards;
specifically, the pointer is moved back from the current position character by character, and whether the type of the moved content is any one of a blank character, a symbol and a keyword is judged at the stop position of each time; if yes, determining the content which is in accordance with the matching result; if not, the pointer is moved backwards to continue judgment. And when the matching result is determined, the stop position of the pointer is the end position corresponding to the matching result.
Step a6, judging whether the next matching result can be obtained or not; if not, executing the step a7; if yes, executing the step a8;
step a7, determining all residual texts after the current position pointed by the pointer before trying to obtain the next matching result as word segmentation;
specifically, if the pointer is moved backward one by one, the next matching result cannot be obtained, and all the remaining text after the current position pointed by the pointer before the next matching result is tried to be obtained is determined as the word segmentation.
Step a8, judging whether a text exists between the end position corresponding to the next matching result and the current position pointed by the pointer before trying to obtain the next matching result; if yes, executing step a9; if not, executing the step a10;
specifically, if the next matching result can be obtained by moving the pointer backward character by character, the content between the end position corresponding to the next matching result and the current position pointed by the pointer before attempting to obtain the next matching result can be obtained, and whether the content is a text or not is confirmed.
Step a9, determining a part corresponding to the text as a word segmentation, and temporarily storing the next acquired matching result;
specifically, if a text exists between the ending position corresponding to the next matching result and the current position pointed by the pointer before the next matching result is tried to be obtained, the part corresponding to the text is directly determined as a word segmentation, the word segmentation analysis of this time is completed, and meanwhile, the obtained next matching result is temporarily stored to be reserved for determining whether the word segmentation is performed or not at the next time. At this time, it can be understood that the pointer points to the end position of the next matching result.
Step a10, judging whether the next acquired matching result is a blank symbol; if yes, executing the step a1; if not, executing the step a11;
specifically, if there is no text between the ending position corresponding to the next matching result and the current position pointed by the pointer before attempting to obtain the next matching result, it is further necessary to determine whether the obtained next matching result is a blank character, if so, it cannot be determined as a word segmentation, and it is taken as a processed matching result, and the step a1 is returned to find the next word segmentation, and it can be understood that the pointer moves to behind the blank character before returning to the step a 1. If not, step a11 is performed.
Step a11, determining the next matching result as a word segmentation.
Specifically, the next matching result obtained at this time can be directly determined as a word segmentation.
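As a rough illustration of steps a8 to a11, the following Python sketch scans an expression for matching results and emits word segmentations. It is a minimal sketch under stated assumptions, not the word segmenter of the embodiment: the regular expression MATCH, the keyword set KEYWORDS, and the function names are hypothetical, and the text in front of a matching result is taken as the content lying between the pointer and the start of that match.

```python
import re

# Assumed matching rule: whitespace or one symbol from a hypothetical symbol set.
MATCH = re.compile(r'\s+|[=()\[\]":]')
KEYWORDS = {"AND", "OR", "NOT", "TO"}  # hypothetical keyword set


def classify(text):
    """Label a run of text as a keyword or an ordinary character string."""
    return ("keyword" if text.upper() in KEYWORDS else "string", text)


def segment(expr):
    """Yield (type, content) pairs for each word segmentation in expr."""
    pos = 0         # position currently pointed to by the pointer
    pending = None  # temporarily stored matching result (step a9)
    while True:
        if pending is not None:
            match, pending = pending, None
        else:
            match = MATCH.search(expr, pos)
            if match is None:
                if pos < len(expr):          # trailing text after the last match
                    yield classify(expr[pos:])
                return
            if match.start() > pos:          # step a8: text precedes the match
                yield classify(expr[pos:match.start()])   # step a9
                pending = match              # keep the match for the next round
                pos = match.end()
                continue
        pos = match.end()
        if match.group().isspace():          # step a10: blank symbols are skipped
            continue
        yield ("symbol", match.group())      # step a11
```

Calling list(segment("title = (computer OR algorithm)")) yields the field name, the "=" symbol, the left parenthesis, the two strings separated by the keyword OR, and the right parenthesis, in that order.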
S3, processing the multiple word segmentations into a list of standard grammar nodes based on the patent retrieval expression grammar structure;
Because the word segmentations obtained by the word segmenter do not reflect the grammar of the patent retrieval expression, they need to be examined and processed so that all of them become standard grammar nodes conforming to the grammar structure of the patent retrieval expression.
In an alternative embodiment, S3 may include the following steps:
Step b1, acquiring the next word segmentation from the word segmenter; obtaining a standard grammar node according to the word segmentation, and judging whether the node is empty; if yes, ending the process; if not, executing step b2;
When step b1 is executed for the first time, the next word segmentation is the first word segmentation. The content type of a standard grammar node is any one of a field name, an operator, or a field value. An empty node means that no further word segmentation can be obtained from the word segmenter. The process of obtaining a standard grammar node from a word segmentation is described later.
If the node is empty, all the word segmentations have been processed and the list of standard grammar nodes has been built, so the flow ends. If the node is not empty, processing continues with step b2.
Step b2, judging whether the standard grammar node is not a left bracket; if yes, executing step b3; if not, executing the step b4;
step b3, adding the standard syntax node to a list of standard syntax nodes, and repeating the step b1;
Specifically, in a patent retrieval expression, when a range character string needs to be searched, it is usually delimited by a left bracket and a right bracket. For example, when searching a date range or a number range, such as patents with application dates from 20220901 to 20220916, the patent retrieval expression is APD:[20220901 TO 20220916], where APD is the abbreviated field name for the application date. Likewise, to search for patents with 1 to 10 claims, the patent retrieval expression is CLAIM_COUNT:[1 TO 10], where CLAIM_COUNT is the abbreviated field name for the number of claims.
Therefore, it is necessary to determine whether each standard syntax node is a left bracket. If it is not a left bracket, it is an independent standard syntax node and can be added to the list of standard syntax nodes directly. If it is a left bracket, it cannot stand alone as a standard syntax node, and the content between the left and right brackets needs to be found and checked.
Step b4, acquiring the next word segmentation from the word segmentation device; obtaining standard grammar nodes according to the word segmentation; judging whether the node is empty or whether the content of the node is not a character string; if yes, throwing an exception and stopping the process; if not, executing the step b5;
specifically, if the node is empty, it indicates that the process needs to be ended, and if the content of the node is not a character string, it does not conform to the normal definition of the left and right brackets in the patent retrieval expression, which indicates that an exception occurs, and if any of the above cases occurs, the process needs to be terminated. If the node is not empty and the content of the node is a character string, the state is normal, and the step b5 needs to be executed continuously.
Step b5, acquiring the next word segmentation from the word segmentation device; obtaining standard grammar nodes according to the word segmentation; judging whether the node is empty or whether the content of the node is not the character 'to'; if yes, throwing out the exception and stopping the flow; if not, executing the step b6;
specifically, when the left and right brackets in the patent retrieval expression perform string retrieval, there should be a character "to" in the string according to the specification, and if the content of the node is not the character "to", it indicates that an exception occurs. If the node is not empty and the node content is the character "to", this indicates that the state is normal and step b6 needs to be continued.
It should be noted that the character "to" in the embodiment of the present invention is case-insensitive.
Step b6, acquiring the next word segmentation from the word segmentation device; obtaining standard grammar nodes according to the word segmentation; judging whether the node is empty or whether the content of the node is not a character string; if yes, throwing an exception and stopping the process; if not, executing the step b7;
it is understood that step b4 is used to determine the content between the left bracket and the character "to", and step b6 is used to determine the content between the character "to" and the right bracket, and the execution is similar.
Step b7, acquiring the next word segmentation from the word segmenter; obtaining a standard grammar node according to the word segmentation; judging whether the node is empty or whether the content of the node is not a right bracket; if yes, throwing an exception and stopping the process; if not, executing step b8;
Specifically, if the node is empty or its content is not a right bracket, an exception has occurred; if the node is not empty and its content is a right bracket, the state is normal and all the content between the left bracket and the right bracket has been found.
Step b8, combining all the previously acquired standard syntax nodes (from the left bracket to the right bracket) in sequence into one character string syntax node, adding the character string syntax node to the list of standard syntax nodes, and repeating step b1;
Specifically, the standard grammar nodes obtained in steps b1 to b7 are combined in sequence into one character string grammar node, which is a special type of standard grammar node. The character string grammar node is added to the list of standard grammar nodes, and step b1 is then repeated. When step b1 obtains the next word segmentation from the word segmenter and the resulting standard grammar node is judged to be empty, the process ends and the final list of standard grammar nodes is obtained.
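Steps b1 to b8 can be sketched as a single pass over the word segmentations. The snippet below is only an illustration under assumptions: nodes are modelled as hypothetical (type, content) tuples, the function names are invented, and the merged character string node is written back in the form "[low TO high]".

```python
def build_node_list(tokens):
    """Merge '[', low, 'to', high, ']' into one string node (steps b1 to b8)."""
    it = iter(tokens)
    nodes = []

    def expect(check, what):
        # Steps b4 to b7: an empty node or an unexpected node raises an exception.
        node = next(it, None)
        if node is None or not check(node):
            raise ValueError(f"expected {what}, got {node!r}")
        return node

    def is_string(node):
        return node[0] == "string"

    for node in it:                                   # step b1
        if node != ("symbol", "["):                   # step b2
            nodes.append(node)                        # step b3
            continue
        low = expect(is_string, "a string")           # step b4
        expect(lambda n: n[1].lower() == "to", "'to'")   # step b5, case-insensitive
        high = expect(is_string, "a string")          # step b6
        expect(lambda n: n == ("symbol", "]"), "']'")    # step b7
        nodes.append(("string", f"[{low[1]} TO {high[1]}]"))  # step b8
    return nodes
```

For the APD example above, the seven word segmentations collapse into three nodes: the field name, the ":" symbol, and one character string node holding the whole range.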
The following describes a process of obtaining a standard grammar node from the word segmentation.
In an optional embodiment, obtaining the standard grammar node according to the word segmentation includes:
step c1, if the type of the word segmentation is not a symbol, directly packaging the word segmentation into a common standard grammar node; otherwise, executing step c2;
step c2, if the participle is not an English-form double quotation mark, directly packaging the participle into a common standard grammar node; otherwise, executing step c3;
step c3, judging whether the next matched double quotation marks can be found backwards or not; if not, throwing the exception; if yes, executing step c4;
Step c4, packaging the content between the two double quotation marks into a character string grammar node.
Here, the symbols in step c1 are the elements of the symbol set. The next matched double quotation mark in step c3 is the closing one of a pair of English-form double quotation marks. A common standard grammar node is so called in contrast to the special standard grammar node, namely the character string grammar node.
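Steps c1 to c4 can be illustrated with the following Python sketch. It assumes word segmentations are (type, content) tuples and, as a simplification that is not stated in the source, rejoins the content between the quotation marks with single spaces; the "quoted" type name and the function name are hypothetical.

```python
QUOTE = ("symbol", '"')


def to_nodes(tokens):
    """Wrap tokens into nodes, merging quoted runs into one string node."""
    nodes, i = [], 0
    while i < len(tokens):
        if tokens[i] != QUOTE:                # steps c1 and c2: common node
            nodes.append(tokens[i])
            i += 1
            continue
        try:                                  # step c3: look for the closing quote
            j = tokens.index(QUOTE, i + 1)
        except ValueError:
            raise ValueError("unmatched double quotation mark") from None
        phrase = " ".join(content for _, content in tokens[i + 1:j])
        nodes.append(("quoted", phrase))      # step c4: character string node
        i = j + 1
    return nodes
```

With this sketch, the segmentations of title = "speed change" become three nodes, the last of which carries the whole quoted phrase.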
Through S3, all the word segmentations in the patent retrieval expression are extracted by the word segmenter and processed, according to the above rules, into a list of standard grammar nodes for use in the subsequent construction of the standard syntax tree.
S4, generating a standard syntax tree according to the list of the standard syntax nodes;
In S4, the list of standard grammar nodes obtained in S3 is processed into a standard syntax tree according to the defined patent retrieval expression grammar structure, taking into account the semantics and priority of the different operators. The standard syntax tree generated in S4 has exactly one root node; each node may have 0 to 2 child nodes; all leaf nodes are field names or field values, and all internal nodes are operators.
In an alternative embodiment, S4 includes:
Step d1, defining a stack valueStack for storing field name nodes, field value nodes, or sub-expression root nodes, and defining a stack symbolStack for storing operator nodes;
wherein the initial states of valueStack and symbolStack are empty.
Step d2, defining a pointer i to point to the position of the current node to be processed in the nodeList;
where nodeList represents a list of standard syntax nodes.
Step d3, if i is greater than or equal to the total number of nodes in nodeList, executing step d4; otherwise, executing step d5;
specifically, if i is greater than or equal to the total number of nodes in nodeList, it means that all nodes have completed processing.
Step d4, taking an operator node a from the current symbolStack and two nodes b and c from the current valueStack; with node a as the operator, node b as the left child node, and node c as the right child node, constructing the three nodes into a binary (two-operand) operation node and pushing it onto the current valueStack; repeating this step until the current symbolStack is empty, and then executing step d12;
wherein, a, b and c only represent the code number of the node and are not used for limiting the content of the node.
Step d5, acquiring the node n at position i from nodeList, and incrementing i by 1; if the word segmentation type of node n is a character string, executing step d6; otherwise, executing step d7;
Step d6, if the word segmentation type of the node preceding node n is also a character string, pushing node n and an AND operator onto the current valueStack, and then repeating step d3; if the word segmentation type of the node preceding node n is not a character string, pushing only node n onto the current valueStack, and then repeating step d3;
step d7, if the current symbolStack is empty, pressing the node n into the current symbolStack, and then repeating the step d3; otherwise, executing step d8;
Step d8, if node n is a left parenthesis, pushing node n onto the current symbolStack, and then repeating step d3; otherwise, executing step d9;
step d9, checking the priority of the stack top node t of the current symbolStack, and executing step d10 if the priority of the node n is less than or equal to the priority of the stack top node t; otherwise, executing step d11; wherein the priority of each operator is predefined;
Step d10, taking out the stack top node t and taking two nodes e and f from the current valueStack; with the stack top node t as the operator, node e as the left child node, and node f as the right child node, constructing the three nodes into a binary (two-operand) operation node, pushing it onto the current valueStack, and repeating step d9;
similarly, t, e, and f are merely symbols and are not used to limit the contents of the nodes.
Step d11, pressing the node n into the current symbolStack, and then repeating the step d3;
Step d12, judging whether only one node is left in the current valueStack; if so, taking the remaining node as the root node of the finally generated standard syntax tree; if not, throwing an exception and stopping the process.
It is understood that after the root node of the finally generated standard syntax tree is obtained, the standard syntax tree can be obtained according to the known relationship between the nodes.
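The two-stack procedure of steps d1 to d12 can be sketched in Python as follows. This is an illustrative sketch rather than the embodiment's implementation, and it rests on several assumptions: the priority values are invented because Table 1 is not reproduced in the text; the implicit AND of step d6 is routed through the same operator handling as an explicit AND; handling of the right parenthesis, which the steps above do not spell out, is added by popping operators back to the matching left parenthesis; and trees are represented as hypothetical (operator, left, right) tuples.

```python
# Assumed priorities for illustration only; the values in Table 1 may differ.
PRIORITY = {"OR": 1, "AND": 2, "=": 3}


def build_tree(node_list):
    """Build a tree of (operator, left, right) tuples from (type, content) nodes."""
    value_stack, symbol_stack = [], []            # step d1

    def reduce_top():                             # steps d4 and d10
        if len(value_stack) < 2:
            raise ValueError("operator is missing an operand")
        op = symbol_stack.pop()
        right = value_stack.pop()                 # pushed last, so popped first
        left = value_stack.pop()
        value_stack.append((op, left, right))

    def push_operator(op):                        # steps d7, d9, d10, d11
        if op not in PRIORITY:
            raise ValueError(f"unknown operator {op!r}")
        while (symbol_stack and symbol_stack[-1] != "("
               and PRIORITY[op] <= PRIORITY[symbol_stack[-1]]):
            reduce_top()
        symbol_stack.append(op)

    previous_type = None
    for node_type, content in node_list:          # steps d2, d3, d5
        if node_type == "string":                 # step d6
            if previous_type == "string":
                push_operator("AND")              # implicit AND between strings
            value_stack.append(content)
        elif content == "(":                      # step d8
            symbol_stack.append(content)
        elif content == ")":                      # assumption, see lead-in
            while symbol_stack and symbol_stack[-1] != "(":
                reduce_top()
            if not symbol_stack:
                raise ValueError("unbalanced right parenthesis")
            symbol_stack.pop()
        else:
            push_operator(content.upper())
        previous_type = node_type

    while symbol_stack:                           # step d4
        if symbol_stack[-1] == "(":
            raise ValueError("unbalanced left parenthesis")
        reduce_top()
    if len(value_stack) != 1:                     # step d12
        raise ValueError("expression does not reduce to a single root")
    return value_stack[0]
```

Under these assumed priorities, adjacent strings inside parentheses are grouped from the left, so three consecutive strings x y z reduce to ((x AND y) AND z).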
The standard syntax tree generated in the step S4 contains the logic semantics of the whole patent retrieval expression, and the semantics are general and are irrelevant to a specific data format, a data storage mode and a data query engine. Therefore, the method can adapt to various data formats, data storage modes, data query engines and heterogeneous retrieval.
For the concrete processing procedure of each step in S4, please refer to the related art for understanding, and will not be described in detail here.
For the priorities of the operators in the embodiment of the present invention, please refer to the contents illustrated in table 1.
Table 1 Predefined operator priorities (partial)
[Table 1 appears as an image in the original publication: Figures BDA0003872569070000151 and BDA0003872569070000161.]
And S5, converting the standard syntax tree into a query statement of the target search engine by utilizing a pre-constructed syntax converter matched with the target search engine.
In the embodiment of the invention, the target search engine includes Elasticsearch and mysql. For the concepts of both, please refer to the related art. Of course, the target search engine of the embodiment of the present invention is not limited to these two types.
In order to adapt to a specific retrieval environment, the embodiment of the invention needs to customize a matched grammar converter for each target search engine so as to convert the standard grammar tree into the query statement of the target search engine and realize retrieval in the specific environment.
The following describes the cases where the target search engines are Elasticsearch and mysql, respectively.
(1) The target search engine is Elasticsearch
In an optional embodiment, when the target search engine is Elasticsearch, converting the standard syntax tree into the query statement of the target search engine by using a pre-constructed syntax converter matched with the target search engine includes:
step e1, acquiring a currently processed root node r, and executing step e2;
When step e1 is executed for the first time, the currently processed root node r is the root node of the standard syntax tree. In later recursive executions, the currently processed root node r is the root node of a subtree one level below.
Step e2, if the currently processed root node r is not a binary (two-operand) operation node, throwing an exception and stopping the process; otherwise, executing step e3;
It is understood that, in a normal state, the currently processed root node r is a binary operation node.
Step e3, if the operator of the currently processed root node r is a logical operator, executing step e4; otherwise, executing step e7;
step e4, taking out the root node r-left of the left sub-tree of the currently processed root node r, and recursively executing the step e2 to obtain the left clause left of the currently processed root node r;
it will be appreciated that the root node r-left of the left sub-tree of the currently processed root node r recursively performs step e2 as the currently processed root node r.
Step e5, taking out the root node r-right of the right sub-tree of the currently processed root node r, and recursively executing the step e2 to obtain the right clause right of the currently processed root node r;
it will be appreciated that the root node r-right of the right sub-tree of the currently processed root node r recursively performs step e2 as the currently processed root node r.
The concept of the left and right subtrees is understood in conjunction with the related art and will not be described in detail herein.
Step e6, according to the logical operator of the currently processed root node r, combining its left clause left and right clause right into a bool query statement of Elasticsearch; and returning to step e3;
the concepts of the left clause left and the right clause right are understood in conjunction with the related art and will not be described in detail herein.
Step e7, taking the left sub-tree of the currently processed root node r as a field name k and the right sub-tree as a field value v, and converting k and v into a query statement in Elasticsearch according to the operator op of the currently processed root node r and the conversion algorithm corresponding to the operator op; and returning to step e3.
It will be appreciated that, when proceeding to step e7, there must be a left sub-tree of the currently processed root node r as the field name k and a right sub-tree as the field value v.
Since the conversion algorithms of different operators op follow different logic, only one example is described here.
When the operator op of the currently processed root node r is the matching symbol "=", converting k and v into a query statement in Elasticsearch according to the conversion algorithm corresponding to the operator op includes:
step g1, if the word segmentation type of v is a character string, executing step g2; otherwise, executing step g3;
Step g2, packaging k and v into a corresponding Elasticsearch query statement according to the content of v, and ending the process;
the query statement of the Elasticsearch includes term query statement, range query statement or wildcard query statement of the Elasticsearch, and specific concepts of the query statements are understood by referring to related technologies and are not described herein.
Step g3, if the operator of v is not a logical operator, throwing an exception, and ending the flow; otherwise, executing step g4;
Step g4, taking out the root node v-left of the left subtree of v, and recursively executing step g1 to obtain the left clause left of v;
Step g5, taking out the root node v-right of the right subtree of v, and recursively executing step g1 to obtain the right clause right of v;
Step g6, combining the left clause left and the right clause right of v into a bool query statement of Elasticsearch according to the logical operator of v, and ending the flow.
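A minimal Python sketch of steps e1 to e7 together with steps g1 to g6 is given below. It assumes the tree is made of hypothetical (operator, left, right) tuples with plain strings as leaves, supports only AND, OR, and "=", and chooses between a match_phrase query and a wildcard query by looking for "*" or "?" in the value; that selection rule and the function names are assumptions for illustration, since the source only says the statement is chosen according to the content of v.

```python
LOGIC = {"AND": "must", "OR": "should"}  # bool clause used for each logical operator


def to_es(node):
    """Convert a tree of (operator, left, right) tuples to an Elasticsearch query dict."""
    if not isinstance(node, tuple):                   # step e2
        raise ValueError("expected a binary operation node")
    op, left, right = node
    if op in LOGIC:                                   # steps e3 to e6
        return {"bool": {LOGIC[op]: [to_es(left), to_es(right)]}}
    if op == "=":                                     # step e7
        return match(left, right)
    raise ValueError(f"unsupported operator {op!r}")


def match(field, value):
    """Steps g1 to g6: expand the field value v against the field name k."""
    if isinstance(value, str):                        # steps g1 and g2
        if "*" in value or "?" in value:              # assumed selection rule
            return {"wildcard": {field: {"value": value}}}
        return {"match_phrase": {field: value}}
    op, left, right = value
    if op not in LOGIC:                               # step g3
        raise ValueError("field value must be a string or a logical expression")
    clauses = [match(field, left), match(field, right)]   # steps g4 and g5
    return {"bool": {LOGIC[op]: clauses}}             # step g6
```

The returned dictionary can be serialized with json.dumps and sent as the query part of a search request.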
(2) The target search engine is mysql
In an optional embodiment, when the target search engine is mysql, the method for converting the standard syntax tree into the query statement of the target search engine by using a pre-constructed syntax converter matched with the target search engine includes:
step f1, acquiring a currently processed root node r, and executing step f2; when the method is executed for the first time, the currently processed root node r is the root node of the standard syntax tree;
Step f2, if the currently processed root node r is not a binary (two-operand) operation node, throwing an exception and stopping the process; otherwise, executing step f3;
step f3, if the operator of the currently processed root node r is a logical operator, executing step f4; otherwise, executing step f7;
step f4, taking out the root node r-left of the left sub-tree of the currently processed root node r, and recursively executing the step f2 to obtain the left clause left of the currently processed root node r;
step f5, taking out the root node r-right of the right sub-tree of the currently processed root node r, and recursively executing the step f2 to obtain the right clause right of the currently processed root node r;
Step f6, according to the logical operator of the currently processed root node r, combining its left clause left and right clause right into an and, or, or not query statement of mysql; and returning to step f3;
Step f7, taking the left sub-tree of the currently processed root node r as a field name k and the right sub-tree as a field value v, and converting k and v into a query statement in mysql according to the operator op of the currently processed root node r and the conversion algorithm corresponding to the operator op; and returning to step f3.
The detailed steps can be understood by reference to the corresponding Elasticsearch steps above and are not described in detail here. The conversion algorithms of different operators op likewise follow different logic and are not illustrated here.
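By analogy with the Elasticsearch case, steps f1 to f7 can be sketched as below. The sketch makes assumptions that are not in the source: the tree uses hypothetical (operator, left, right) tuples, a string value is matched with LIKE and surrounding "%" wildcards, and the values are returned as a separate parameter list with "%s" placeholders instead of being written into the statement, on the further assumption that field names come from a trusted, predefined list.

```python
LOGIC = {"AND", "OR"}


def to_sql(node):
    """Return (where_clause, params) for a tree of (operator, left, right) tuples."""
    if not isinstance(node, tuple):                   # step f2
        raise ValueError("expected a binary operation node")
    op, left, right = node
    if op in LOGIC:                                   # steps f3 to f6
        left_sql, left_params = to_sql(left)
        right_sql, right_params = to_sql(right)
        return f"({left_sql} {op} {right_sql})", left_params + right_params
    if op == "=":                                     # step f7
        return field_clause(left, right)
    raise ValueError(f"unsupported operator {op!r}")


def field_clause(field, value):
    """Expand a field value (string or logical sub-tree) against one field name."""
    if isinstance(value, str):
        # The field name is assumed to be trusted; the value travels as a parameter.
        return f"{field} LIKE %s", [f"%{value}%"]
    op, left, right = value
    if op not in LOGIC:
        raise ValueError("field value must be a string or a logical expression")
    left_sql, left_params = field_clause(field, left)
    right_sql, right_params = field_clause(field, right)
    return f"({left_sql} {op} {right_sql})", left_params + right_params
```

Keeping the values out of the statement text avoids having to escape quotation marks that may appear in a search term.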
In the scheme provided by the embodiment of the invention, an extensible patent retrieval expression grammar structure is predefined, so that any patent retrieval expression to be processed is expressed according to the patent retrieval expression grammar structure; meanwhile, on the basis of the grammatical structure of the patent retrieval expression, a general word segmentation device is constructed in advance, and matched grammatical converters are constructed in advance aiming at different target search engines. And converting the patent retrieval expression to be processed into a corresponding query statement of a target search engine based on the constructed word segmenter and the grammar converter.
Specifically, after a patent retrieval expression to be processed is obtained, a word segmentation device is used for analyzing a character string corresponding to the patent retrieval expression to obtain a plurality of analyzed words; processing the multiple word segments into a list of standard grammar nodes based on a grammar structure of a patent retrieval expression; then generating a standard syntax tree according to the list of the standard syntax nodes to realize that the patent retrieval expression is modeled into a standard data structure; the standard syntax tree is a universal semantic, is irrelevant to an actual data structure, a data storage mode and a data query engine, and has high standardization degree, so that the standard syntax tree has high adaptation degree to various data formats, data storage modes, data query engines and heterogeneous retrieval. And finally, converting the standard syntax tree into a query statement of a target search engine by using a syntax converter so as to perform patent retrieval. Because the embodiment of the invention can conveniently construct the grammar converter matched with the search engine according to the search engine used by the actual patent query, and convert the standard grammar tree into the query language of the specific search engine, the grammar converter of the embodiment of the invention has strong adaptability and expandability and provides great convenience for the retrieval of patents.
In order to facilitate understanding of the implementation of the method according to the embodiment of the present invention, two specific embodiments are described below.
(1) Embodiment one:
For S1, the patent retrieval expression to be processed is the content inside the following double quotation marks:
"title = ((bicycle OR bicycle) AND two persons) OR specification = (riding speed change two wheels)"
For S2, the participles obtained after the patent search expression is parsed by the participler are shown in table 2:
Table 2 Word segmentations obtained by parsing the patent retrieval expression of embodiment one with the word segmenter
[Table 2 appears as an image in the original publication: Figures BDA0003872569070000201 and BDA0003872569070000211.]
For S3 to S4, the process of generating the standard syntax tree is shown in table 3.
Table 3 Process of generating the standard syntax tree in embodiment one
[Table 3 appears as an image in the original publication: Figure BDA0003872569070000212.]
Fig. 3 shows a graphical representation of the standard syntax tree finally generated in embodiment one of the present invention.
The character string expression form of the finally generated standard syntax tree is as follows:
"title = ((bicycle OR bicycle) AND two persons) OR specification = ((riding AND speed change) AND two wheels)"
For verification, the standard syntax tree may be compared with the patent retrieval expression to be processed in the first embodiment S1, and the standard syntax tree is found to completely conform to the original semantics of the patent retrieval expression, which proves that the standard syntax tree is indeed the standard syntax tree of the patent retrieval expression. It should be understood that the above-described verification process need not be performed during actual execution of the method of embodiments of the present invention. Moreover, in the embodiment of the present invention, each node in fig. 3 is provided with a unique number for convenience of description of the subsequent process.
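The character string expression form of a tree can be produced by a short recursive function, which makes this kind of comparison easy to repeat. The sketch below is illustrative only: it assumes hypothetical (operator, left, right) tuples with string leaves, and it uses "bike" as a stand-in for the second of the two synonyms that both appear as "bicycle" in the translated expression.

```python
LOGIC = {"AND", "OR"}


def render(node, top=True):
    """Render a tree as a string, parenthesizing nested logical nodes."""
    if isinstance(node, str):
        return node
    op, left, right = node
    text = f"{render(left, False)} {op} {render(right, False)}"
    # Only nested logical nodes need parentheses; "=" nodes are left bare.
    return f"({text})" if (op in LOGIC and not top) else text


tree = (
    "OR",
    ("=", "title", ("AND", ("OR", "bicycle", "bike"), "two persons")),
    ("=", "specification", ("AND", ("AND", "riding", "speed change"), "two wheels")),
)
rendered = render(tree)
```

Here rendered has the same shape as the string form given above, with every implicit AND made explicit.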
For S5, the target search engine is an Elasticsearch as an example. The specific process is as follows:
starting from the root node 1, since node 1 is a logical operator, it is necessary to recursively process its left and right subtrees, i.e., nodes 2 and 3.
Taking the processing of node 2 as an example: node 2 is the non-logical operator "=", which indicates a matching relationship; its left subtree is a field name and its right subtree is a field value. According to the conversion algorithm described in S5 for the case where the operator op of the currently processed root node r is the matching symbol "=" (steps g1 to g6), the right subtree, that is, node 5, needs to be processed first.
Since node 5 is a logical operator, it is necessary to recursively process its left and right subtrees, node 8 and node 9, first.
Taking the processing of node 8 as an example, since node 8 is a logical operator, it is necessary to recursively process the left and right subtrees, i.e., node 12 and node 13.
Taking the processing of node 12 as an example, since the type of node 12 is a character string, combining node 4, node 2, and node 12 gives the semantics: the title contains "bicycle". The corresponding Elasticsearch query statement is: {"match_phrase": {"title": "bicycle"}}.
The converted part of Fig. 3 is shown in Fig. 4(a).
Similarly, the Elasticsearch query statement converted from node 13 is: {"match_phrase": {"title": "bicycle"}}. The specific process is not described in detail.
After nodes 12 and 13 have been processed, node 8 can be processed. The logical operator of node 8 is OR, meaning that it is sufficient for either the condition of node 12 or the condition of node 13 to be satisfied. The contents of node 12 and node 13 are therefore combined into a bool query of Elasticsearch, with their relationship expressed by should; that is, node 8 is converted into the Elasticsearch query statement: {"bool": {"should": [{"match_phrase": {"title": "bicycle"}}, {"match_phrase": {"title": "bicycle"}}]}}.
The converted part of Fig. 3 is shown in Fig. 4(b).
The processing of node 9 is the same as that of node 12 and node 13, and will not be described herein.
After nodes 8 and 9 have been processed, node 5 can be processed. The logical operator of node 5 is AND, meaning that the conditions of node 8 and node 9 must be satisfied at the same time. The contents of node 8 and node 9 are therefore combined into a bool query of Elasticsearch, with their relationship expressed by must; that is, node 5 is converted into the Elasticsearch query statement:
{"bool": {"must": [{"bool": {"should": [{"match_phrase": {"title": "bicycle"}}, {"match_phrase": {"title": "bicycle"}}]}}, {"match_phrase": {"title": "two persons"}}]}}.
After node 5 has been processed, the processing of node 2 is complete; the converted part of Fig. 3 is shown in Fig. 4(c).
The processing of node 3 is the same as that of node 2 and is not described again here. The Elasticsearch query statement converted from node 3 is:
{"bool": {"must": [{"bool": {"must": [{"match_phrase": {"specification": "riding"}}, {"wildcard": {"specification": {"value": "speed change"}}}]}}, {"match_phrase": {"specification": "two wheels"}}]}}.
Finally, returning to the processing of node 1, the contents of node 2 and node 3 are merged into a bool query of Elasticsearch, with their relationship expressed by should, so that the whole standard syntax tree is converted into the Elasticsearch query statement:
{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"match_phrase": {"title": "bicycle"}}, {"match_phrase": {"title": "bicycle"}}]}}, {"match_phrase": {"title": "two persons"}}]}}, {"bool": {"must": [{"bool": {"must": [{"match_phrase": {"specification": "riding"}}, {"wildcard": {"specification": {"value": "speed change"}}}]}}, {"match_phrase": {"specification": "two wheels"}}]}}]}}.
At this point, the patent retrieval expression to be processed in the first embodiment S1 is converted into a query statement of a search engine Elasticsearch. The specific process is understood by referring to each specific step in S5, and will not be described in detail here.
(2) Embodiment two:
For S1, the patent retrieval expression to be processed is the content inside the following double quotation marks:
"application date = [20110101 to 20130505] AND title = (computer OR algorithm)"
For S2, the segmentation obtained after the patent search expression is analyzed by the segmenter is shown in table 4:
Table 4 Word segmentations obtained by parsing the patent retrieval expression of embodiment two with the word segmenter
Word segmentation type | Word segmentation content
Character string | application date
Symbol | =
Character string | [20110101 to 20130505]
Keyword | AND
Character string | title
Symbol | =
Symbol | (
Character string | computer
Keyword | OR
Character string | algorithm
Symbol | )
For S3 to S4, the process of generating the standard syntax tree is shown in table 5.
Table 5 Process of generating the standard syntax tree in embodiment two
[Table 5 appears as an image in the original publication: Figures BDA0003872569070000241 and BDA0003872569070000251.]
Fig. 5 shows a graphical representation of the standard syntax tree finally generated in embodiment two of the present invention.
The character string expression form of the finally generated standard syntax tree is as follows:
"application date = [20110101 to 20130505] AND title = (computer OR algorithm)"
Similarly, for verification, the standard syntax tree may be compared with the patent retrieval expression to be processed in example two S1, and the standard syntax tree is found to completely conform to the original semantics of the patent retrieval expression, which proves that the standard syntax tree is indeed the standard syntax tree of the patent retrieval expression.
Similarly, each node in FIG. 5 is provided with a unique number.
For S5, the target search engine is mysql as an example. The specific process is as follows:
Starting from root node 1: since node 1 is a logical operator, it is necessary to recursively process its left subtree and right subtree, that is, node 2 and node 3.
Taking the processing of node 2 as an example: node 2 is the non-logical operator "=", which indicates a matching relationship; its left subtree is a field name and its right subtree is a field value. By analogy with the conversion algorithm described in S5 for the matching symbol "=" (given there for Elasticsearch), the right subtree, that is, node 5, needs to be processed first.
Since node 5 represents a date range, combining node 4, node 2, and node 5 gives the semantics: "the application date is between 2011-01-01 and 2013-05-05". The query statement converted into mysql is: application date between '2011-01-01' and '2013-05-05'.
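The conversion of a bracketed date range into a between clause can be sketched as follows. The column name application_date, the function name, and the use of "%s" placeholders with a separate parameter list are assumptions made for illustration; the source writes the dates directly into the statement.

```python
import re
from datetime import datetime

# A range of the form [YYYYMMDD to YYYYMMDD]; "to" is matched case-insensitively.
RANGE = re.compile(r"^\[\s*(\d{8})\s*to\s*(\d{8})\s*\]$", re.IGNORECASE)


def range_to_between(column, text):
    """Turn '[20110101 to 20130505]' into a BETWEEN clause and its parameters."""
    found = RANGE.match(text)
    if found is None:
        raise ValueError(f"not a date range: {text!r}")
    # strptime also rejects impossible dates such as 20111301.
    low, high = (
        datetime.strptime(part, "%Y%m%d").date().isoformat()
        for part in found.groups()
    )
    return f"{column} BETWEEN %s AND %s", [low, high]
```

For the range of embodiment two, the two parameters come out as 2011-01-01 and 2013-05-05, matching the semantics stated above.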
After the processing of the node 5, the processing of the node 2 is completed, and the converted content is shown in fig. 5 as shown in fig. 6 (a).
After the processing of node 2, node 3 is processed, and node 3 is the non-logical operator "=", which indicates a matching relationship, and its left sub-tree is a field name and right sub-tree is a field value, so it is necessary to process the right sub-tree, i.e. node 7, first.
Since node 7 is a logical operator, it is necessary to recursively process its left and right subtrees, node 8 and node 9, first.
Take the processing of node 8 as an example. The type of node 8 is a character string, so combining node 6, node 3, and node 8, the semantics here are: the title contains "computer", and the query statement converted into mysql is: title like '%computer%'.
The processing of node 9 is the same as that of node 8 and will not be described in detail here.
After nodes 8 and 9 have been processed, node 7 can be processed. The logical operator of node 7 is OR, so it is sufficient for either the condition of node 8 or the condition of node 9 to be satisfied. The contents of node 8 and node 9 are therefore merged into a mysql OR query, that is, the query statement converted into mysql is: title like '%computer%' or title like '%algorithm%'.
After node 7 has been processed, the processing of node 3 is complete; the content converted from fig. 5 is shown in fig. 6 (b).
Finally, processing returns to node 1, and the contents of node 2 and node 3 are merged into a mysql AND query, that is, the whole standard syntax tree is converted into a mysql structured query statement:
(application date between '2011-01-01' and '2013-05-05') and (title like '%computer%' or title like '%algorithm%').
At this point, the patent retrieval expression to be processed in S1 of the second embodiment has been converted into a query statement of the search engine mysql. The specific process should be understood in combination with the specific steps in S5, and is not described in detail here.
Therefore, the method provided by the embodiment of the invention can convert a patent retrieval expression to be processed into the query statement of whichever search engine is targeted, so that patent retrieval can subsequently be performed in that search engine.
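The walkthrough of nodes 1 to 9 can be sketched in Python as below. This is a minimal illustration rather than the converter of the invention: the `BinaryNode` and `DateRange` classes, the function names, and the column names `application_date` and `title` are all assumptions introduced here, and values are interpolated directly into the statement only to keep the example short (a real system would use parameterized queries).

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class DateRange:
    """Hypothetical field value for '[start to end]', dates as 'YYYYMMDD'."""
    start: str
    end: str


@dataclass
class BinaryNode:
    """Hypothetical binary operation node: op is 'AND', 'OR' or '='."""
    op: str
    left: Union["BinaryNode", str]
    right: Union["BinaryNode", str, DateRange]


def iso(day: str) -> str:
    """Turn 'YYYYMMDD' into 'YYYY-MM-DD'."""
    return f"{day[:4]}-{day[4:6]}-{day[6:8]}"


def field_clause(column: str, value) -> str:
    """Convert the right subtree of '=' (the field value) for one column."""
    if isinstance(value, DateRange):               # node 5: date range
        return f"{column} between '{iso(value.start)}' and '{iso(value.end)}'"
    if isinstance(value, BinaryNode):              # node 7: logic inside the value
        if value.op not in ("AND", "OR"):
            raise ValueError(f"unsupported operator in field value: {value.op}")
        left = field_clause(column, value.left)
        right = field_clause(column, value.right)
        return f"{left} {value.op.lower()} {right}"
    return f"{column} like '%{value}%'"            # nodes 8 and 9: "contains"


def to_mysql(node) -> str:
    """Recursively convert a standard syntax tree into a mysql condition."""
    if not isinstance(node, BinaryNode):
        raise ValueError("root of a (sub)tree must be a binary operation node")
    if node.op in ("AND", "OR"):                   # node 1: logical operator
        return f"({to_mysql(node.left)}) {node.op.lower()} ({to_mysql(node.right)})"
    if node.op == "=":                             # nodes 2 and 3: match operator
        return field_clause(node.left, node.right)
    raise ValueError(f"unsupported operator: {node.op}")


# The tree of fig. 5, with assumed column names.
tree = BinaryNode(
    "AND",
    BinaryNode("=", "application_date", DateRange("20110101", "20130505")),
    BinaryNode("=", "title", BinaryNode("OR", "computer", "algorithm")),
)
print(to_mysql(tree))
```

With the assumed column names, the printed condition has the same shape as the statement derived above: a `between` clause for the date range joined by `and` to the two `like` clauses joined by `or`.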
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of converting a patent retrieval expression into a search engine query statement, comprising:
acquiring a patent retrieval expression to be processed;
parsing the character string corresponding to the patent retrieval expression by using a pre-constructed tokenizer to obtain a plurality of tokens (word segments); wherein the patent retrieval expression and the tokenizer are constructed on the basis of a predefined extensible patent retrieval expression syntax structure;
processing the plurality of tokens into a list of standard syntax nodes based on the patent retrieval expression syntax structure;
generating a standard syntax tree according to the list of standard syntax nodes;
and converting the standard syntax tree into a query statement of the target search engine by utilizing a pre-constructed syntax converter matched with the target search engine.
2. The method of claim 1, wherein any patent retrieval expression constructed based on a predefined extensible patent retrieval expression syntax structure comprises:
field names, operators, and field values;
wherein the field name represents a name of a search field for a patent; the field value includes retrieval contents for the field name; the operator represents an operation on the field name and the field value, and an operation on a sub-expression in the patent retrieval expression; any sub-expression includes the field name, the operator, and the field value; the field name does not contain characters to which the operator relates.
3. The method of claim 2, wherein the parsing the character string corresponding to the patent retrieval expression by using a pre-constructed tokenizer to obtain a plurality of tokens comprises:
based on a predefined blank character set, symbol set and keyword set, using a preset next function to obtain the next token in the patent retrieval expression, scanning toward the end of the string from the current position pointed to by a pointer in the character string corresponding to the patent retrieval expression, and updating the pointer position; and repeating this process until no further token can be obtained, thereby obtaining the plurality of tokens parsed from the patent retrieval expression;
wherein the type of any token is one of a symbol, a keyword or a character string; a character string is any character or combination of characters other than the elements of the blank character set, the symbol set and the keyword set.
4. The method as claimed in claim 3, wherein the using a preset next function, based on the predefined blank character set, symbol set and keyword set, to obtain the next token in the patent retrieval expression from the current position pointed to by the pointer in the character string corresponding to the patent retrieval expression comprises:
step a1, judging, according to the current position pointed to by the pointer, whether an unprocessed matching result remains from the last match before the current position; if yes, executing step a2; if not, executing step a5; wherein an unprocessed matching result is one whose content type has been confirmed to be any one of a blank character, a symbol and a keyword, but for which no token decision has yet been obtained; the token decision is either: is a token, or is not a token;
step a2, obtaining the last unprocessed matching result;
step a3, judging whether the obtained last unprocessed matching result is a blank character; if yes, executing step a1; if not, executing step a4;
step a4, determining the obtained last unprocessed matching result to be a token;
step a5, attempting to obtain the next matching result by moving the pointer character by character toward the end of the string;
step a6, judging whether the next matching result can be obtained; if not, executing step a7; if yes, executing step a8;
step a7, determining all of the remaining text after the position pointed to by the pointer before the attempt to obtain the next matching result to be a token;
step a8, judging whether text exists between the end position corresponding to the next matching result and the position pointed to by the pointer before the attempt to obtain the next matching result; if yes, executing step a9; if not, executing step a10;
step a9, determining the part corresponding to the text to be a token, and temporarily storing the obtained next matching result;
step a10, judging whether the obtained next matching result is a blank character; if yes, executing step a1; if not, executing step a11;
step a11, determining the next matching result to be a token.
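As an illustration of steps a1 to a11, the following Python sketch implements a `next` function over a string. It is a simplified reading under stated assumptions, not the tokenizer of the invention: the contents of the symbol and keyword sets, the use of a regular expression to find the next matching result, and the class name `Tokenizer` are all hypothetical, and step a8 is read here as checking for text between the pointer and the start of the next match.

```python
import re
from typing import Optional, Tuple

# Assumed sets: whitespace as blank characters, a few symbols, a few keywords.
MATCHER = re.compile(
    r'(?P<blank>\s+)'
    r'|(?P<symbol>[=()\[\]"])'
    r'|(?P<keyword>\b(?:AND|OR|NOT|to)\b)'
)

Token = Tuple[str, str]  # (type, text); type is "symbol", "keyword" or "string"


class Tokenizer:
    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0                           # the pointer
        self.pending: Optional[Token] = None   # match stored in step a9

    def next(self) -> Optional[Token]:
        while True:
            if self.pending is not None:       # a1, a2: unprocessed match exists
                kind, value = self.pending
                self.pending = None
                if kind == "blank":            # a3: blanks are never tokens
                    continue
                return kind, value             # a4
            if self.pos >= len(self.text):
                return None                    # nothing left to tokenize
            start = self.pos
            match = MATCHER.search(self.text, start)   # a5
            if match is None:                  # a6, a7: the rest is one string
                self.pos = len(self.text)
                return "string", self.text[start:]
            self.pos = match.end()
            kind = match.lastgroup
            if match.start() > start:          # a8, a9: text before the match
                self.pending = (kind, match.group())
                return "string", self.text[start:match.start()]
            if kind == "blank":                # a10
                continue
            return kind, match.group()         # a11


tokenizer = Tokenizer("title = (computer OR algorithm)")
tokens = list(iter(tokenizer.next, None))      # call next() until it returns None
print(tokens)
```

Keeping the pending match on the tokenizer object is what lets a single scan yield two tokens in turn: first the plain text that precedes a symbol or keyword, then the symbol or keyword itself on the following call.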
5. The method for converting a patent retrieval expression into a search engine query statement according to claim 4, wherein the processing the plurality of tokens into a list of standard syntax nodes based on the patent retrieval expression syntax structure comprises:
step b1, obtaining the next token from the tokenizer; obtaining a standard syntax node from the token, and judging whether the node is empty; if yes, ending the process; if not, executing step b2; wherein the first execution of obtaining the next token yields the first token; the content type of a standard syntax node is any one of a field name, an operator or a field value; the node being empty means that no next token can be obtained from the tokenizer;
step b2, judging whether the standard syntax node is not a left bracket; if yes, executing step b3; if not, executing step b4;
step b3, adding the standard syntax node to the list of standard syntax nodes, and repeating step b1;
step b4, obtaining the next token from the tokenizer; obtaining a standard syntax node from the token; judging whether the node is empty or the content of the node is not a character string; if yes, throwing an exception and stopping the process; if not, executing step b5;
step b5, obtaining the next token from the tokenizer; obtaining a standard syntax node from the token; judging whether the node is empty or the content of the node is not the keyword "to"; if yes, throwing an exception and stopping the process; if not, executing step b6;
step b6, obtaining the next token from the tokenizer; obtaining a standard syntax node from the token; judging whether the node is empty or the content of the node is not a character string; if yes, throwing an exception and stopping the process; if not, executing step b7;
step b7, obtaining the next token from the tokenizer; obtaining a standard syntax node from the token; judging whether the node is empty or the content of the node is not a right bracket; if yes, throwing an exception and stopping the process; if not, executing step b8;
step b8, combining all the previously obtained standard syntax nodes, in sequence, into one character string syntax node, adding the character string syntax node to the list of standard syntax nodes, and repeating step b1; wherein the character string syntax node is a special type of standard syntax node.
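Steps b1 to b8 amount to copying tokens into the node list while folding each bracketed range into a single string node. A minimal Python sketch follows, assuming tokens are `(type, text)` tuples and that the combined node is rendered as `[start to end]`; both the representation and the function name `build_node_list` are illustrative choices, not taken from the invention.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (type, text); assumed representation


def build_node_list(tokens: Iterable[Token]) -> List[Token]:
    """Fold every '[', string, 'to', string, ']' run into one string node."""
    stream = iter(tokens)
    nodes: List[Token] = []
    # What steps b4 to b7 expect, in order, after a left bracket.
    expected = [
        ("a character string", lambda t: t[0] == "string"),
        ("the keyword 'to'", lambda t: t == ("keyword", "to")),
        ("a character string", lambda t: t[0] == "string"),
        ("a right bracket", lambda t: t == ("symbol", "]")),
    ]
    for token in stream:                              # b1: next token, stop when empty
        if token != ("symbol", "["):                  # b2
            nodes.append(token)                       # b3
            continue
        parts = []
        for description, is_ok in expected:           # b4 to b7
            following = next(stream, None)
            if following is None or not is_ok(following):
                raise ValueError(f"malformed range: expected {description}")
            parts.append(following[1])
        start, _, end, _ = parts
        nodes.append(("string", f"[{start} to {end}]"))   # b8
    return nodes


tokens = [
    ("string", "date"), ("symbol", "="), ("symbol", "["),
    ("string", "20110101"), ("keyword", "to"),
    ("string", "20130505"), ("symbol", "]"),
]
print(build_node_list(tokens))
```

Treating the whole range as one string node means the later tree-building stage never sees the brackets or the `to` keyword, so it only has to deal with field names, operators and field values.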
6. The method of claim 5, wherein the obtaining a standard syntax node from the token comprises:
step c1, if the type of the token is not a symbol, directly wrapping the token into an ordinary standard syntax node; otherwise, executing step c2;
step c2, if the token is not an English double quotation mark, directly wrapping the token into an ordinary standard syntax node; otherwise, executing step c3;
step c3, judging whether the next matching double quotation mark can be found further along the string; if not, throwing an exception; if yes, executing step c4;
step c4, wrapping the content between the two quotation marks into a character string syntax node.
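Steps c1 to c4 can be pictured with the short Python sketch below. It assumes the caller knows the index in the source text just after the current token, and it returns the node together with the position from which tokenizing should resume; the function name `node_from_token` and this calling convention are hypothetical.

```python
from typing import Tuple

Token = Tuple[str, str]  # (type, text); assumed representation


def node_from_token(token: Token, text: str, pos: int) -> Tuple[Token, int]:
    """Wrap a token into a node; `pos` is the index just after the token."""
    kind, value = token
    if kind != "symbol" or value != '"':       # c1, c2: ordinary node
        return token, pos
    closing = text.find('"', pos)              # c3: look for the matching quote
    if closing == -1:
        raise ValueError("unmatched double quotation mark")
    # c4: everything between the two quotes becomes one string node.
    return ("string", text[pos:closing]), closing + 1


text = 'title = "machine learning" AND x'
after_quote = text.index('"') + 1
node, resume_at = node_from_token(("symbol", '"'), text, after_quote)
print(node, repr(text[resume_at:]))
```

Because the quoted content is taken verbatim from the text, blanks and keywords inside the quotation marks stay part of the field value instead of being split into separate tokens.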
7. The method of converting a patent retrieval expression into a search engine query statement as claimed in claim 6, wherein said generating a standard syntax tree from said list of standard syntax nodes comprises:
step d1, defining a stack valueStack for storing field name nodes, field value nodes or sub-expression root nodes, and defining a stack symbolStack for storing operator nodes;
step d2, defining a pointer i that points to the position of the current node to be processed in nodeList; wherein nodeList represents the list of standard syntax nodes;
step d3, if i is greater than or equal to the total number of nodes in nodeList, executing step d4; otherwise, executing step d5;
step d4, popping an operator node a from the current symbolStack, popping two nodes b and c from the current valueStack, constructing the three nodes into a binary operation node with node a as the operator, node b as the left child node and node c as the right child node, and pushing the binary operation node onto the current valueStack; repeating this step until the current symbolStack is empty, and then executing step d12;
step d5, obtaining the node n at position i from nodeList, and adding 1 to i; if the token type of node n is a character string, executing step d6; otherwise, executing step d7;
step d6, if the token type of the node preceding node n is also a character string, pushing node n and an AND operator onto the current valueStack, and then repeating step d3; if the token type of the node preceding node n is not a character string, pushing only node n onto the current valueStack, and then repeating step d3;
step d7, if the current symbolStack is empty, pushing node n onto the current symbolStack, and then repeating step d3; otherwise, executing step d8;
step d8, if node n is a left parenthesis, pushing node n onto the current symbolStack, and then repeating step d3; otherwise, executing step d9;
step d9, checking the priority of the top node t of the current symbolStack, and executing step d10 if the priority of node n is less than or equal to the priority of the top node t; otherwise, executing step d11; wherein the priority of each operator is predefined;
step d10, popping the top node t, popping two nodes e and f from the current valueStack, constructing the three nodes into a binary operation node with the top node t as the operator, node e as the left child node and node f as the right child node, pushing the binary operation node onto the current valueStack, and repeating step d9;
step d11, pushing node n onto the current symbolStack, and then repeating step d3;
step d12, judging whether only one node is left in the current valueStack; if so, taking the remaining node as the root node of the finally generated standard syntax tree; if not, throwing an exception and stopping the process.
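The two-stack procedure of steps d1 to d12 resembles classic operator-precedence parsing, and a compact Python sketch is given below. Several details are assumptions made for the example rather than statements of the invention: the priority table, the explicit handling of a right parenthesis (reduce until the matching left parenthesis), routing the implicit AND of step d6 through the operator stack, and the convention that the node popped second becomes the left child.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

PRIORITY = {"OR": 1, "AND": 2, "=": 3}   # assumed predefined priorities


@dataclass
class BinaryNode:
    op: str
    left: "Node"
    right: "Node"


Node = Union[str, BinaryNode]
Token = Tuple[str, str]  # (type, text); assumed representation


def build_tree(node_list: List[Token]) -> Node:
    value_stack: List[Node] = []     # d1: field names, values, sub-expression roots
    symbol_stack: List[str] = []     # d1: operators and left parentheses

    def reduce() -> None:
        """Pop one operator and two values, push the binary node (d4, d10)."""
        op = symbol_stack.pop()
        if len(value_stack) < 2:
            raise ValueError(f"operator {op!r} is missing an operand")
        right = value_stack.pop()
        left = value_stack.pop()
        value_stack.append(BinaryNode(op, left, right))

    def push_operator(op: str) -> None:
        if op not in PRIORITY:
            raise ValueError(f"unknown operator: {op!r}")
        # d9, d10: reduce while the stack top binds at least as tightly.
        while (symbol_stack and symbol_stack[-1] != "("
               and PRIORITY[op] <= PRIORITY[symbol_stack[-1]]):
            reduce()
        symbol_stack.append(op)      # d7, d11

    previous_kind = None
    for kind, text in node_list:     # d2, d3, d5
        if kind == "string":
            if previous_kind == "string":
                push_operator("AND") # d6: adjacent strings imply AND
            value_stack.append(text)
        elif text == "(":
            symbol_stack.append(text)            # d8
        elif text == ")":                        # assumed: close the group
            while symbol_stack and symbol_stack[-1] != "(":
                reduce()
            if not symbol_stack:
                raise ValueError("unbalanced right parenthesis")
            symbol_stack.pop()
        else:
            push_operator(text)
        previous_kind = kind

    while symbol_stack:              # d4: drain the remaining operators
        if symbol_stack[-1] == "(":
            raise ValueError("unbalanced left parenthesis")
        reduce()
    if len(value_stack) != 1:        # d12
        raise ValueError("expression does not reduce to a single root node")
    return value_stack[0]


nodes = [
    ("string", "date"), ("symbol", "="), ("string", "[20110101 to 20130505]"),
    ("keyword", "AND"), ("string", "title"), ("symbol", "="), ("symbol", "("),
    ("string", "computer"), ("keyword", "OR"), ("string", "algorithm"),
    ("symbol", ")"),
]
print(build_tree(nodes))
```

For the node list of the second embodiment this produces an AND root whose children are the two "=" nodes, with the OR node as the field value of the title, which is the shape of the tree described for fig. 5.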
8. The method for converting patent retrieval expressions into search engine query statements according to claim 1 or 7, wherein the target search engine comprises Elasticsearch and mysql.
9. The method of claim 8, wherein when the target search engine is Elasticsearch, the converting the standard syntax tree into the query statement of the target search engine by using a pre-constructed syntax converter matched with the target search engine comprises:
step e1, acquiring a currently processed root node r, and executing step e2; when executed for the first time, the currently processed root node r is the root node of the standard syntax tree;
step e2, if the currently processed root node r is not a binary operation node, throwing an exception and stopping the process; otherwise, executing step e3;
step e3, if the operator of the currently processed root node r is a logical operator, executing step e4; otherwise, executing step e7;
step e4, taking out the root node r-left of the left subtree of the currently processed root node r, and recursively executing step e2 to obtain the left clause left of the currently processed root node r;
step e5, taking out the root node r-right of the right subtree of the currently processed root node r, and recursively executing step e2 to obtain the right clause right of the currently processed root node r;
step e6, combining the left clause left and the right clause right of the currently processed root node r into a bool query statement of Elasticsearch according to the logical operator of the currently processed root node r; and returning to step e3;
step e7, taking the left subtree of the currently processed root node r as a field name k, taking the right subtree as a field value v, and converting k and v into a query statement in Elasticsearch according to the operator op of the currently processed root node r and the conversion algorithm corresponding to the operator op; and returning to step e3.
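A rough Python sketch of steps e1 to e7 is shown below, producing the query as a dictionary in the Elasticsearch query DSL. The particular mapping is an assumption for illustration: AND becomes a `bool` query with `must`, OR becomes `should` with `minimum_should_match`, a string value becomes a `match` query and a date range becomes a `range` query. The `BinaryNode` class, the tuple representation of a range, and the field names are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict

BOOL_OCCUR = {"AND": "must", "OR": "should"}   # assumed operator mapping


@dataclass
class BinaryNode:
    op: str       # "AND", "OR" or "="
    left: Any     # subtree, or field name for "="
    right: Any    # subtree, string value, or (start, end) tuple for a range


def combine(op: str, left: Dict, right: Dict) -> Dict:
    """e6: merge two clauses into one bool query."""
    if op not in BOOL_OCCUR:
        raise ValueError(f"unsupported logical operator: {op}")
    body: Dict[str, Any] = {BOOL_OCCUR[op]: [left, right]}
    if op == "OR":
        body["minimum_should_match"] = 1       # at least one branch must hit
    return {"bool": body}


def field_query(field: str, value: Any) -> Dict:
    """e7: convert field name k and field value v for the '=' operator."""
    if isinstance(value, BinaryNode):          # logic nested in the field value
        return combine(value.op,
                       field_query(field, value.left),
                       field_query(field, value.right))
    if isinstance(value, tuple):               # (start, end) date range
        start, end = value
        return {"range": {field: {"gte": start, "lte": end}}}
    return {"match": {field: value}}


def to_elasticsearch(node: Any) -> Dict:
    if not isinstance(node, BinaryNode):       # e2
        raise ValueError("root of a (sub)tree must be a binary operation node")
    if node.op in BOOL_OCCUR:                  # e3 to e6
        return combine(node.op,
                       to_elasticsearch(node.left),
                       to_elasticsearch(node.right))
    if node.op == "=":                         # e7
        return field_query(node.left, node.right)
    raise ValueError(f"unsupported operator: {node.op}")


tree = BinaryNode(
    "AND",
    BinaryNode("=", "application_date", ("2011-01-01", "2013-05-05")),
    BinaryNode("=", "title", BinaryNode("OR", "computer", "algorithm")),
)
query = {"query": to_elasticsearch(tree)}
print(query)
```

Building the query as nested dictionaries keeps the recursion simple: each call returns a complete clause, and the caller only decides under which `bool` key to place it.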
10. The method of claim 8, wherein when the target search engine is mysql, the converting the standard syntax tree into the query statement of the target search engine by using a pre-constructed syntax converter matched with the target search engine comprises:
step f1, acquiring a currently processed root node r, and executing step f2; when executed for the first time, the currently processed root node r is the root node of the standard syntax tree;
step f2, if the currently processed root node r is not a binary operation node, throwing an exception and stopping the process; otherwise, executing step f3;
step f3, if the operator of the currently processed root node r is a logical operator, executing step f4; otherwise, executing step f7;
step f4, taking out the root node r-left of the left subtree of the currently processed root node r, and recursively executing step f2 to obtain the left clause left of the currently processed root node r;
step f5, taking out the root node r-right of the right subtree of the currently processed root node r, and recursively executing step f2 to obtain the right clause right of the currently processed root node r;
step f6, combining the left clause left and the right clause right of the currently processed root node r into a mysql and, or, or not query statement according to the logical operator of the currently processed root node r; and returning to step f3;
step f7, taking the left subtree of the currently processed root node r as a field name k, taking the right subtree as a field value v, and converting k and v into a query statement in mysql according to the operator op of the currently processed root node r and the conversion algorithm corresponding to the operator op; and returning to step f3.
CN202211201513.XA 2022-09-29 2022-09-29 Method for converting patent retrieval expression into search engine query statement Pending CN115587162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211201513.XA CN115587162A (en) 2022-09-29 2022-09-29 Method for converting patent retrieval expression into search engine query statement

Publications (1)

Publication Number Publication Date
CN115587162A true CN115587162A (en) 2023-01-10

Family

ID=84778523

Country Status (1)

Country Link
CN (1) CN115587162A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117331926A (en) * 2023-12-01 2024-01-02 太平金融科技服务(上海)有限公司 Data auditing method and device, electronic equipment and storage medium
CN117331926B (en) * 2023-12-01 2024-03-01 太平金融科技服务(上海)有限公司 Data auditing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination