CN115146118A - Information retrieval method, device, equipment and storage medium - Google Patents

Information retrieval method, device, equipment and storage medium

Info

Publication number
CN115146118A
Authority
CN
China
Prior art keywords
character string
affix
character
information
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210832075.0A
Other languages
Chinese (zh)
Other versions
CN115146118B (en)
Inventor
黄佳恒
胡银洪
吴育人
庄伯金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210832075.0A priority Critical patent/CN115146118B/en
Publication of CN115146118A publication Critical patent/CN115146118A/en
Application granted granted Critical
Publication of CN115146118B publication Critical patent/CN115146118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of semantic parsing and discloses an information retrieval method, apparatus, device and storage medium. The method comprises: acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string; constructing an AC automaton according to the character string set; acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton, and acquiring a matching path of the information to be retrieved in the AC automaton; acquiring corresponding affix character strings according to the matching path, and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition; scoring the affix character string combinations according to a preset scoring rule, and determining a target affix character string combination according to the scoring result; and performing information retrieval according to the target affix character string combination. The method can improve retrieval accuracy and optimize the user's retrieval experience.

Description

Information retrieval method, device, equipment and storage medium
Technical Field
The present application relates to the field of semantic parsing, and in particular, to an information retrieval method, apparatus, device, and storage medium.
Background
At present, with the rapid growth of information resources, users in both general and vertical fields place higher demands on the accuracy of intelligent retrieval technology. In the retrieval process, a user inputs information to be retrieved; a retrieval engine decomposes the information to be retrieved, obtains the keywords in it, infers the user's retrieval intention from the keywords, and outputs the target information accordingly. Common keyword decomposition methods include the AC (Aho-Corasick) automaton, the trie, and regular expressions. Because users differ in grammar and word-usage habits, these methods have difficulty extracting the keywords that best reflect what the user wants, so retrieval accuracy is poor and the target information required by the user cannot be output accurately.
Disclosure of Invention
The application provides an information retrieval method, apparatus, device and storage medium for fuzzy retrieval under different language habits, which can handle information to be retrieved that is incomplete or in reverse order, improve retrieval accuracy, and optimize the user's retrieval experience.
In a first aspect, the present application provides an information retrieval method, including: acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string; constructing an AC automaton according to the character string set; acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton, and acquiring a matching path of the information to be retrieved in the AC automaton; acquiring corresponding affix character strings according to the matching path, and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition; scoring the affix character string combinations according to a preset scoring rule, and determining a target affix character string combination according to the scoring result; and performing information retrieval according to the target affix character string combination.
In a second aspect, the present application provides an information retrieval apparatus, the apparatus comprising: a set expansion module, a model building module, a data matching module, a data extraction module, a result scoring module and an information retrieval module;
the set expansion module is used for acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string;
the model building module is used for building an AC automaton according to the character string set;
the data matching module is used for acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton and acquiring a matching path of the information to be retrieved in the AC automaton;
the data extraction module is used for acquiring corresponding affix character strings according to the matching paths and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition;
the result scoring module is used for scoring the affix character string combination according to a preset scoring rule and determining a target affix character string combination according to a scoring result;
and the information retrieval module is used for retrieving information according to the target affix character string combination.
In a third aspect, the present application provides a computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement any one of the information retrieval methods provided in the embodiments of the present application when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when executed by a processor, the computer program causes the processor to implement any one of the information retrieval methods provided in the embodiments of the present application.
The application discloses an information retrieval method, apparatus, device and storage medium. The method comprises: acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string; constructing an AC automaton according to the character string set; acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton, and acquiring a matching path of the information to be retrieved in the AC automaton; acquiring corresponding affix character strings according to the matching path, and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition; scoring the affix character string combinations according to a preset scoring rule, and determining a target affix character string combination according to the scoring result; and performing information retrieval according to the target affix character string combination. The method addresses information to be retrieved that is incomplete or in reverse order, improves retrieval accuracy and retrieval speed, and optimizes the user's retrieval experience.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an application scenario diagram of an information retrieval method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of an information retrieval method provided in an embodiment of the present application;
Fig. 3 is a vocabulary diagram in the field of life insurance provided by an embodiment of the present application;
Fig. 4 is a schematic block diagram of an information retrieval apparatus provided in an embodiment of the present application;
Fig. 5 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a diagram illustrating an application scenario of an information retrieval method according to an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may be applied to a server, specifically to the server side of an information retrieval application program. The server side runs on an operation server and is configured to acquire the information to be retrieved that a user uploads through the client of the information retrieval application program, to match the corresponding target information from a storage server according to the information to be retrieved, and to send the target information to the user's terminal or make the user's terminal jump to a target application according to the information to be retrieved, thereby implementing the method. The client runs on a terminal device used by the user. The storage server stores the data documents to be retrieved. The terminal, the operation server and the storage server may be communicatively connected through a wireless network.
The user performs information retrieval, such as document searching or keyword searching, through an application program. The application program comprises a client and a server side: the client is installed on the terminal device used by the user, and the server side is installed on a server, which comprises an operation server and a storage server. The terminal device is communicatively connected with the server through a network.
When the information retrieval application program is installed on the terminal device, the terminal device must grant it the corresponding permissions, for example permission to obtain the user's mobile phone number, basic attribute information, device number, network information and the like.
The server may be an independent server, may also be a server cluster, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
It should be further noted that, in the embodiment of the present application, the relevant data may be acquired and processed based on artificial intelligence technology; for example, the matching of the information to be retrieved in the AC automaton may be realized through artificial intelligence. Artificial intelligence (AI) is the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Referring to fig. 2, fig. 2 is a schematic flowchart of an information retrieval method according to an embodiment of the present disclosure. As shown in fig. 2, the information retrieval method specifically includes the steps of: step S101 to step S106.
S101, acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string.
Specifically, a keyword document compiled by a user is obtained. The keyword document comprises a plurality of keywords, which include standard words (technical terms) and synonyms of the standard words. The first character string is the character string of a keyword; the characters of the first character string are moved backward in sequence to generate a plurality of second character strings, and a character string set is generated according to the first character string and the second character strings.
In some embodiments, the standard words are keywords found in professional literature of the vertical field by an information entropy algorithm.
Illustratively, the user can quickly find important standard words in professional literature of the vertical field through the information entropy algorithm. The key parameters of the information entropy include the left entropy and the right entropy. The larger the information entropy of a word, the more varied the words that appear around it, which means the word is used more freely and is more likely to be an independent standard word. The formula of the information entropy algorithm is as follows:
[Formula image BDA0003748836960000051, not reproduced in this text]
where p(x, y) is the frequency with which a compound word appears in the document, p(x) is the frequency with which the left word of the compound word appears in the document, and p(y) is the frequency with which the right word of the compound word appears in the document.
For example, suppose an article of 100 words in total describes the compound word "near-Ying", in which the compound word "near-Ying" occurs once, the left word "near" occurs once, and the right word "Ying" occurs once; then p(near, Ying) = 1/100, p(near) = 1/100, and p(Ying) = 1/100.
[Formula image BDA0003748836960000052, not reproduced in this text]
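The formula images are not reproduced in this text, so the exact expression cannot be confirmed. The symbol definitions above (the frequency of a compound word set against the frequencies of its left and right words) resemble pointwise mutual information, and the sketch below assumes that form purely for illustration. The function name `pmi` and the base-2 logarithm are assumptions, not taken from the patent.

```python
import math


def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Assumed form: log2 of the joint frequency over the product of the marginals."""
    return math.log2(p_xy / (p_x * p_y))


# Frequencies from the 100-word example: each term occurs once.
score = pmi(1 / 100, 1 / 100, 1 / 100)
print(round(score, 2))
```

Under this assumed form, a compound whose parts almost always occur together gets a high score, which fits the stated goal of picking out independent standard words.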
In some embodiments, after the standard words are found through the information entropy algorithm, further standard words can be added through auxiliary labeling by business-side users, and synonyms can be obtained according to the standard words, thereby increasing the number of keywords.
Exemplarily, the part of speech of each standard word is obtained, a corresponding synonym library is determined according to the part of speech, synonyms having the same meaning as the standard word are obtained from the synonym library, and the synonyms are stored in the keyword document.
It should be noted that part of speech classifies words by their grammatical characteristics. The parts of speech include content words (nouns, pronouns, verbs, adjectives, numerals and quantifiers) and function words (adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia).
In some embodiments, because standard words are complex and hard to remember, users often search using aliases or abbreviations of the standard words. To match these usage habits, the synonyms corresponding to the standard words can be obtained from historical search records.
Illustratively, historical search terms in the life insurance field, and the standard words to which those historical search terms link, can be obtained through a crawler tool. Referring to fig. 3, fig. 3 shows a vocabulary diagram in the field of life insurance. As shown in fig. 3, in the historical search records each standard word has a plurality of frequently used synonyms. For example, the standard word "annual fund check tax profit before tax (after risk reserve is deducted)" has the synonyms "annual fund check tax, annual fund check tax profit and profit"; the standard word "life insurance check operation profit increase speed" has the synonyms "operation profit, profit increase speed"; and the standard word "live deposit year average increase (north district)" has the synonyms "deposit year average, year average increase".
In some embodiments, the keywords in the keyword document are partially flipped: starting from the first character of the first character string, the leading character is moved to the end of the string one at a time; each move generates a second character string, and the shifting stops when the tail character of the first character string is reached.
Illustratively, a first character string is acquired, and a second character string is generated by moving the first character of the first character string behind its tail character. The first character of the newly generated second character string is then moved behind the tail character of that second character string, and the process is repeated until the tail character of the first character string becomes the leading character, at which point flipping stops. For convenience of description, the character strings generated by flipping are collectively referred to as second character strings, and the words formed by the second character strings are called flipped words.
Illustratively, the first character string ABCD is flipped: the first flip generates BCD#A, the second flip generates CD#AB, and the third flip generates D#ABC. This yields three second character strings: BCD#A, CD#AB and D#ABC, where BCD, CD and D are affixes of ABCD.
In some embodiments, flipping the first character string ABCD never produces out-of-sequence arrangements such as BD#CA, BCDA or CABD, because a character string whose characters are not kept in sequence easily loses the meaning of the original character string.
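The partial flipping of ABCD can be sketched in a few lines of Python. The function name `partial_flips` and the `sep` parameter are illustrative assumptions; the "#" separator follows the BCD#A notation used in the example.

```python
from typing import List


def partial_flips(keyword: str, sep: str = "#") -> List[str]:
    """Move leading characters behind the tail one at a time, keeping their order.

    The separator marks where the original string wrapped around.
    """
    return [keyword[i:] + sep + keyword[:i] for i in range(1, len(keyword))]


print(partial_flips("ABCD"))  # ['BCD#A', 'CD#AB', 'D#ABC']
```

Because every result is a rotation of the original, the relative order of the characters is preserved, and arrangements such as BD#CA cannot be produced.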
In the AC automaton, matching starts from the first character of a keyword's character string, so when the user inputs incomplete information, the keyword is likely to go unmatched because its first character is missing. Flipping the character string moves the first character to the back in sequence, so that each later character in turn becomes the first character; as a result, the corresponding keyword can still be matched when the information input by the user contains only part of the keyword.
For example, flipping the keyword ABCD generates BCD#A. If the information input by the user is BCDA, the AC automaton can match BCDA against the flipped string BCD#A and then obtain the keyword ABCD through the mapping relationship between BCD#A and the original keyword. This solves the reverse-order input problem that word matching methods such as the AC automaton and regular-expression matching find difficult to handle.
For example, flipping the keyword ABCD generates BCD#A. If the information input by the user is BCD, the AC automaton can match BCD against the flipped string BCD#A and then obtain the keyword ABCD through the mapping relationship between BCD#A and the original keyword. This solves the incomplete-information matching problem that word matching methods such as the AC automaton and regular-expression matching find difficult to handle.
A character string set is generated from the standard words, the synonyms and the flipped words, and a knowledge graph is generated from the correspondences among the three, thereby obtaining the mapping relationships among the standard words, the synonyms and the flipped words.
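The mapping from flipped words back to the original keyword can be illustrated with a plain dictionary. This is a minimal sketch that stands in for both the knowledge graph and the automaton: the names `flip_mapping` and `resolve` are hypothetical, and a simple prefix test replaces the automaton's matching. It also assumes keywords never contain the "#" separator.

```python
from typing import Dict, List

SEP = "#"  # separator used in flipped strings such as BCD#A


def flip_mapping(keywords: List[str]) -> Dict[str, str]:
    """Map every keyword and every flipped form back to the original keyword."""
    mapping: Dict[str, str] = {}
    for kw in keywords:
        mapping[kw] = kw
        for i in range(1, len(kw)):
            mapping[kw[i:] + SEP + kw[:i]] = kw
    return mapping


def resolve(query: str, mapping: Dict[str, str]) -> List[str]:
    """Return keywords whose stored form, read up to the separator, starts the query."""
    hits: List[str] = []
    for stored, keyword in mapping.items():
        head = stored.split(SEP)[0]
        if query.startswith(head) and keyword not in hits:
            hits.append(keyword)
    return hits


mapping = flip_mapping(["ABCD"])
print(resolve("BCDA", mapping))  # reverse-order input
print(resolve("BCD", mapping))   # incomplete input
```

Both the reverse-order input BCDA and the incomplete input BCD reach ABCD through the flipped form BCD#A, which is the behaviour the two examples above describe.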
And S102, constructing an AC automaton according to the character string set.
Specifically, a dictionary tree is generated according to a first character string and a second character string in a character string set; and constructing the AC automaton according to the dictionary tree.
In the AC automaton, the dictionary tree is composed of characters and paths, and includes initial characters and termination characters. The first character of a first character string or second character string constitutes an initial character, its last character constitutes a termination character, and the order of its characters constitutes a path of the dictionary tree.
S103, obtaining information to be retrieved input by a user, matching the information to be retrieved in the AC automaton, and obtaining a matching path of the information to be retrieved in the AC automaton.
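For orientation, the following is a minimal sketch of a textbook Aho-Corasick construction: a dictionary tree built from the character string set, with failure links added by breadth-first traversal. It is a generic illustration rather than the patent's implementation, and the names `Node`, `build_automaton` and `search` are assumptions.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}   # character -> child Node (a path step)
        self.fail = None     # failure link, followed on a mismatch
        self.outputs = []    # strings that terminate at this node


def build_automaton(strings):
    root = Node()
    # 1. Dictionary tree: one path per string, termination marked by outputs.
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        node.outputs.append(s)
    # 2. Failure links, computed level by level.
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail
            while f is not None and ch not in f.children:
                f = f.fail
            child.fail = f.children[ch] if f is not None else root
            child.outputs += child.fail.outputs
            queue.append(child)
    return root


def search(root, text):
    """Return (start_index, matched_string) for every complete match in text."""
    node, found = root, []
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail
        node = node.children.get(ch, root)
        for s in node.outputs:
            found.append((i - len(s) + 1, s))
    return found


root = build_automaton(["ABCD", "BCD#A", "CD#AB", "D#ABC"])
print(search(root, "XABCDY"))
print(search(root, "BCD"))
```

The second call returns no matches: a standard automaton reports only strings matched through to their termination character, which is the limitation the affix-output steps address.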
Specifically, information to be retrieved input by a user is acquired based on an application interface on a user terminal, the information to be retrieved is sent to a server through the user terminal, a corresponding AC automaton is called according to the information to be retrieved, and the information to be retrieved is matched with the dictionary tree; and acquiring a matching path of the successfully matched character string on the dictionary tree.
In some embodiments, if the information to be retrieved is only partially matched in the dictionary tree of the AC automaton, the matching path in the dictionary tree of the partially matched character string is recorded. If the information to be retrieved is completely matched in the dictionary tree of the AC automaton, the corresponding keyword is output.
When matching in the AC automaton, the initial characters are first matched against the characters of the information to be retrieved. If any initial character is matched in the information to be retrieved, matching continues backward on the dictionary tree in the character order of the information to be retrieved, until a character in the information to be retrieved mismatches the characters under the path of that initial character, or the termination character of the path is matched. If a character mismatch occurs during matching, the AC automaton cannot output a keyword; only when matching from the initial character to the termination character is completed can it output the keyword on that path. Therefore, a conventional AC automaton cannot output affixes of keywords. The affixes in the embodiments of the application consist of partial character strings of the keywords and can be divided into prefixes and suffixes according to their position in the keyword.
To output affixes through the AC automaton, in the embodiments of the application, a matching record is generated when any initial character is matched in the information to be retrieved; matching then continues backward through the dictionary tree in the character order of the information to be retrieved, and the matching record stores the path of the matched characters on the dictionary tree. The matching path is the path from the initial character to the character preceding the mismatched character. A reverse matching path is generated from the matching path, and a reverse dictionary tree is generated from the dictionary tree; the reverse matching path is input into the reverse dictionary tree to obtain a reverse affix character string; and the reverse affix character string is reversed to obtain the affix character string.
And S104, acquiring corresponding affix character strings according to the matching paths, and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition.
Specifically, after a matching path is obtained, secondary matching is performed in the AC automaton according to the matching path: the matching path is reversed to generate a reverse matching path; the original initial characters on the dictionary tree are set as termination characters and the original non-initial characters are set as initial characters, generating a reverse dictionary tree; and the reverse matching path is input into the AC automaton of the reverse dictionary tree, which can then complete matching from an initial character to a termination character along the reverse path and output the affix character string of the keyword.
In some embodiments, the keyword document from which the dictionary tree of the AC automaton is built includes the second character strings generated by partially flipping the first character string, and the first character of a second character string is a middle character or the tail character of the first character string. When the information to be retrieved does not include the first character of the first character string, the AC automaton can use the second character strings to perform partial matching starting from a middle character of the first character string. The suffix of the keyword can therefore be output, which addresses incomplete user input and improves the fuzzy retrieval capability of the retrieval system.
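The effect of recording a matching path can be sketched with a plain dictionary tree. This is a simplification that reads the affix directly off the recorded path and skips the reverse dictionary tree pass described above. The names `build_trie` and `match_paths`, and the `min_len` cut-off that drops single-character matches, are hypothetical.

```python
def build_trie(strings):
    """Nested-dict dictionary tree: each key is one character on a path."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def match_paths(trie, query, min_len=2):
    """From each start position, follow the tree until the first mismatch.

    Returns (start, end, text) with inclusive positions, where text is the
    path from the initial character to the character before the mismatch.
    """
    paths = []
    for start in range(len(query)):
        node, end = trie, start
        while end < len(query) and query[end] in node:
            node = node[query[end]]
            end += 1
        if end - start >= min_len:
            paths.append((start, end - 1, query[start:end]))
    return paths


trie = build_trie(["ABCD", "BCD#A", "CD#AB", "D#ABC"])
print(match_paths(trie, "BCDA"))
```

For the query BCDA, the walk along BCD#A stops at the separator, so the recorded path BCD is available as an affix even though no string was matched through to its termination character.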
Through the steps, the problem of reverse order input can be solved, and the problem of incomplete input information can be solved, so that fuzzy search can be realized when the input information of the user is not standard enough, and the use experience of the user is improved.
In some embodiments, before the permutation and combination of the affix character strings, the method further includes: determining whether, among the affix character strings, there is a first affix character string that includes a second affix character string, wherein the first affix character string and the second affix character string are each any one of the affix character strings; and deleting the second affix character string when such a first affix character string exists.
Illustratively, a screening operation is performed on the affix character strings before they are permuted and combined without character repetition. The screening operation includes: deleting repeated affix character strings; if the complete first character string is obtained, deleting the affix character strings corresponding to that first character string; and, if an inclusion relationship exists among the acquired affix character strings, deleting the short affix character string that is included in the long one. For example, if the acquired affix character strings are ACBD, CBD and ACB, then CBD and ACB are completely included in ACBD, so CBD and ACB are deleted and only ACBD is kept.
In some embodiments, the affix character strings are permuted and combined without overlap, that is, the positions occupied by the affixes within a combination do not coincide. If a brute-force permutation and combination algorithm is used, the required time becomes long when the number of affixes is large, so that approach is not a practical solution.
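The duplicate and inclusion rules of the screening operation can be sketched as below. This is a minimal illustration assuming that inclusion means plain substring containment; the rule about deleting affixes of a fully matched first character string is omitted because it needs a mapping from affixes to standard words. The function name `screen_affixes` is hypothetical.

```python
def screen_affixes(affixes):
    """Drop duplicate affixes and affixes contained in a longer affix."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(affixes))
    return [
        affix
        for affix in unique
        if not any(affix != other and affix in other for other in unique)
    ]
```

Applied to `["ACBD", "CBD", "ACB"]`, only `ACBD` survives, which matches the example in the text.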
Illustratively, a dynamic programming algorithm is used to reduce the time complexity. Using Python notation, dp[i] is the list of all non-overlapping combinations of affixes that appear in query[:i+1] and whose last affix ends at query[i]. dp[i+1] is obtained with the following pseudocode:
W = the set of all affixes ending at query[i+1]
for w in W:
    s = starting position of affix w
    dp[i+1].append([w])
    for j in range(s):
        for e in dp[j]:
            dp[i+1].append(e + [w])
return dp[i+1]
Example: when the input query is ABCDEFG, the matched affixes and their positions in the query are (0, 1, AB), (2, 4, CDE), (3, 5, DEF) and (5, 6, FG);
dp[0] = []
dp[1] = [[AB]]
dp[2] = []
dp[3] = []
dp[4] = [[CDE], [AB, CDE]]
dp[5] = [[DEF], [AB, DEF]]
dp[6] = [[FG], [AB, FG], [CDE, FG], [AB, CDE, FG]]
Permuting and combining the affix strings with the dynamic programming algorithm yields affix string combinations without repetition. For example, dp[6] contains the affix string combinations [FG], [AB, FG], [CDE, FG] and [AB, CDE, FG], and within each combination the affix strings occupy distinct characters of the query. Reusing the combinations already computed for earlier positions helps to reduce the time complexity of the permutation and combination process.
S105, scoring the affix character string combinations according to a preset scoring rule, and determining a target affix character string combination according to the scoring result.
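The worked example can be reproduced with the short sketch below. It is one way to realize the dynamic programming idea, under the assumptions that matches are given as `(start, end, affix)` tuples with inclusive positions and that dp[i] holds only the combinations whose last affix ends at position i; the function name `non_overlapping_combinations` is hypothetical.

```python
def non_overlapping_combinations(query_length, matches):
    """Return dp, where dp[i] lists combinations whose last affix ends at i."""
    dp = [[] for _ in range(query_length)]
    # Process affixes by end position so earlier dp entries are complete.
    for start, end, affix in sorted(matches, key=lambda m: (m[1], m[0])):
        dp[end].append([affix])  # the affix on its own
        for j in range(start):   # every position strictly before the affix
            for combination in dp[j]:
                dp[end].append(combination + [affix])
    return dp


matches = [(0, 1, "AB"), (2, 4, "CDE"), (3, 5, "DEF"), (5, 6, "FG")]
dp = non_overlapping_combinations(len("ABCDEFG"), matches)
all_combinations = [combo for entry in dp for combo in entry]
```

Concatenating all dp entries, as `all_combinations` does, gives every non-overlapping combination for the query. Each earlier combination is extended rather than rebuilt, which is where the saving over brute-force enumeration comes from.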
Illustratively, the affix character string combinations are scored. When there are fewer than 5 affix character string combinations, all of them are taken; when there are more than 50, the top 50 combinations are taken in descending order of score. The scoring formula is: score = (sum of affix lengths) / (query length).
Examples of affix combinations and their scores are:
['conglomerate', 'net profit'], with a score of 2.0.
['group net profit'], with a score of 2.0.
['clump', 'net profit'], with a score of 1.875.
['clump', 'net profit'], with a score of 1.875.
['clique', 'profit'], with a score of 1.8.
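The plain length-ratio rule stated above (sum of affix lengths divided by query length) and the top-50 cut can be sketched as follows. The example scores listed above exceed 1, so they presumably come from a different rule than this plain ratio; the sketch follows only the stated formula. The names `score` and `select_candidates` and the `limit` parameter are assumptions for illustration.

```python
def score(combination, query):
    """Sum of affix lengths divided by the query length."""
    return sum(len(affix) for affix in combination) / len(query)


def select_candidates(combinations, query, limit=50):
    """Keep at most `limit` combinations, highest score first."""
    ranked = sorted(combinations, key=lambda c: score(c, query), reverse=True)
    return ranked[:limit]
```

When fewer than `limit` combinations exist, the slice simply returns all of them, ordered by score.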
In one embodiment, before scoring, it is further required to recall the standard word through the affix string, and score the affix string combination according to a scoring formula, which is as follows:
Figure BDA0003748836960000101
wherein K is the score of the affix string combination, L_plus is the sum of the lengths of the affix strings within an affix string combination, L_q is the length of the character string of the information to be retrieved, α_0 and α_1 are scoring coefficients, L_i is the length of a single affix string in an affix string combination, and L_STD is the length of the standard word corresponding to the affix character string.
In some embodiments, the affix string combination with the highest score K is set as the target affix string combination.
S106, retrieving information according to the target affix character string combination.
Specifically, standard words corresponding to affix character strings in the target affix character string combination are obtained, and information retrieval is performed according to the standard words.
In some embodiments, the title and abstract of the target information are retrieved according to the obtained standard words.
In some other embodiments, an entry window of a function corresponding to the standard word is output according to the obtained standard word, so that a user can conveniently and quickly enter an application program corresponding to the standard word.
The embodiment of the application provides an information retrieval method, which obtains the affix character strings of the information to be retrieved by means of an improved AC (Aho-Corasick) automaton, obtains non-repeating affix character string combinations through a dynamic programming algorithm, finally obtains the standard words, and performs information retrieval based on the standard words.
Referring to fig. 4, fig. 4 is a schematic block diagram of an information retrieval apparatus 300 for executing the foregoing information retrieval method according to an embodiment of the present application. The information retrieval device may be configured in a server or a terminal.
The server may be an independent server, a server cluster, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal may be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant or a wearable device.
As shown in fig. 4, the information retrieval apparatus 300 includes: a set expansion module 301, a model building module 302, a data matching module 303, a data extraction module 304, a result scoring module 305 and an information retrieval module 306.
The set expansion module 301 is configured to obtain a first character string, partially flip the first character string to generate a second character string, and generate a character string set according to the first character string and the second character string.
In some embodiments, the set expansion module 301 is further configured to move backward one character at a time from the first character of the first character string toward its tail character, generate a second character string after each move, and stop when the tail character of the first character string is reached.
The model building module 302 is configured to build the AC automaton according to the character string set.
The data matching module 303 is configured to acquire information to be retrieved input by a user, match the information to be retrieved in the AC automaton, and acquire a matching path of the information to be retrieved in the AC automaton.
In some embodiments, the data matching module 303 is further configured to generate a dictionary tree from the set of character strings, and construct the AC automaton from the dictionary tree.
In some embodiments, the data matching module 303 is further configured to match the information to be retrieved with a dictionary tree; and acquiring a matching path of the successfully matched character string on the dictionary tree.
In some embodiments, the data matching module 303 is further configured to generate a reverse matching path from the matching path, and generate a reverse trie from the trie; inputting the reverse matching path into a reverse dictionary tree to obtain a reverse affix character string; and carrying out reverse order arrangement on the reverse affix character strings to obtain the affix character strings.
The data extraction module 304 is configured to obtain corresponding affix character strings according to the matching paths, and permute and combine the affix character strings to obtain one or more affix character string combinations without character repetition.
In some embodiments, the data extraction module 304 is further configured to determine whether, among the affix character strings, there is a first affix character string that includes a second affix character string, wherein the first affix character string and the second affix character string are each any one of the affix character strings; and to delete the second affix character string when such a first affix character string exists.
The result scoring module 305 is configured to score the affix character string combinations according to a preset scoring rule, and determine a target affix character string combination according to the scoring result.
The information retrieval module 306 is configured to perform information retrieval according to the target affix character string combination.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and simplicity of description, the specific working processes of the information retrieval apparatus and each module described above may refer to the corresponding processes in the foregoing information retrieval method embodiment, and are not described herein again.
The information retrieval apparatus described above may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 5, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a storage medium and an internal memory.
The storage medium may store an operating system and a computer program. The computer program comprises program instructions, which when executed, can make a processor execute any one of the information retrieval methods provided by the embodiments of the present application.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a storage medium, which when executed by a processor causes the processor to perform any of the information retrieval methods. The storage medium may be non-volatile or volatile.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Illustratively, in one embodiment, the processor is configured to execute a computer program stored in the memory to perform the steps of: acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string; constructing an AC automaton according to the character string set; acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton, and acquiring a matching path of the information to be retrieved in the AC automaton; acquiring corresponding affix character strings according to the matching paths, and permuting and combining the affix character strings to obtain one or more affix character string combinations without character repetition; scoring the affix character string combinations according to a preset scoring rule, and determining a target affix character string combination according to the scoring result; and performing information retrieval according to the target affix character string combination.
In some embodiments, the processor, when implementing building the AC automaton from the set of strings, is further specifically configured to implement: and generating a dictionary tree according to the character string set, and constructing the AC automaton according to the dictionary tree.
In some embodiments, when the processor matches the information to be retrieved in the AC automaton and acquires a matching path of the information to be retrieved in the AC automaton, the processor is further specifically configured to: matching the information to be retrieved with the dictionary tree; and acquiring a matching path of the successfully matched character string on the dictionary tree.
In some embodiments, when the processor obtains the corresponding affix character string according to the matching path, the processor is further specifically configured to: generating a reverse matching path according to the matching path, and generating a reverse dictionary tree according to the dictionary tree; inputting the reverse matching path into the reverse dictionary tree to obtain a reverse affix character string; and carrying out reverse order arrangement on the reverse affix character strings to obtain the affix character strings.
In some embodiments, before permuting and combining the affix character strings, the processor is further specifically configured to implement: determining whether, among the affix character strings, there is a first affix character string that includes a second affix character string, wherein the first affix character string and the second affix character string are each any one of the affix character strings; and deleting the second affix character string when such a first affix character string exists.
In some embodiments, when partially flipping the first character string to generate the second character string, the processor is further specifically configured to implement: moving backward one character at a time from the first character of the first character string toward its tail character, generating a second character string after each move, and stopping when the tail character of the first character string is reached.
In some embodiments, when performing information retrieval according to the target affix character string combination, the processor is further configured to implement: acquiring the standard words corresponding to the target affix character string combination; and retrieving information according to the standard words.
The embodiment of the application further provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program comprises program instructions, and the processor executes the program instructions to realize any information retrieval method provided by the embodiment of the application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An information retrieval method, the method comprising:
acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string;
constructing an AC automaton according to the character string set;
acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton, and acquiring a matching path of the information to be retrieved in the AC automaton;
acquiring corresponding affix character strings according to the matching paths, and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition;
scoring the affix character string combination according to a preset scoring rule, and determining a target affix character string combination according to a scoring result;
and performing information retrieval according to the target affix character string combination.
2. The information retrieval method of claim 1, wherein the constructing an AC automaton from the set of character strings comprises:
and generating a dictionary tree according to the character string set, and constructing the AC automaton according to the dictionary tree.
3. The information retrieval method of claim 2, wherein the matching the information to be retrieved in the AC automaton and obtaining the matching path of the information to be retrieved in the AC automaton comprises:
matching the information to be retrieved with the dictionary tree;
and acquiring a matching path of the successfully matched character string on the dictionary tree.
4. The information retrieval method of claim 3, wherein the obtaining of the corresponding affix string according to the matching path comprises:
generating a reverse matching path according to the matching path, and generating a reverse dictionary tree according to the dictionary tree;
inputting the reverse matching path into the reverse dictionary tree to obtain a reverse affix character string;
and carrying out reverse order arrangement on the reverse affix character strings to obtain the affix character strings.
5. The information retrieval method as set forth in claim 1, further comprising, before the permutation and combination of the affix character strings:
determining whether, among the affix character strings, there is a first affix character string that includes a second affix character string, wherein the first affix character string and the second affix character string are each any one of the affix character strings;
and deleting the second affix character string when the first affix character string exists among the affix character strings and includes the second affix character string.
6. The information retrieval method of claim 1, wherein the partially flipping the first character string to generate a second character string comprises:
moving backward one character at a time from the first character of the first character string toward the tail character of the first character string, generating a second character string after each move, and stopping when the tail character of the first character string is reached.
7. The information retrieval method of claim 1, wherein the first character string comprises a character string of a standard word, and the information retrieval based on the target affix character string combination comprises:
acquiring the standard words corresponding to the target affix character string combination;
and retrieving information according to the standard words.
8. An information retrieval apparatus, characterized by comprising:
the set expansion module is used for acquiring a first character string, partially flipping the first character string to generate a second character string, and generating a character string set according to the first character string and the second character string;
the model building module is used for building an AC automaton according to the character string set;
the data matching module is used for acquiring information to be retrieved input by a user, matching the information to be retrieved in the AC automaton and acquiring a matching path of the information to be retrieved in the AC automaton;
the data extraction module is used for acquiring corresponding affix character strings according to the matching paths and arranging and combining the affix character strings to obtain one or more affix character string combinations without character repetition;
the result scoring module is used for scoring the affix character string combination according to a preset scoring rule and determining a target affix character string combination according to a scoring result;
and the information retrieval module is used for retrieving information according to the target affix character string combination.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and implementing the information retrieval method as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the information retrieval method according to any one of claims 1 to 7.
CN202210832075.0A 2022-07-15 2022-07-15 Information retrieval method, device, equipment and storage medium Active CN115146118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210832075.0A CN115146118B (en) 2022-07-15 2022-07-15 Information retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210832075.0A CN115146118B (en) 2022-07-15 2022-07-15 Information retrieval method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115146118A true CN115146118A (en) 2022-10-04
CN115146118B CN115146118B (en) 2024-08-20

Family

ID=83412792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210832075.0A Active CN115146118B (en) 2022-07-15 2022-07-15 Information retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115146118B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0991297A (en) * 1995-09-26 1997-04-04 Nippon Steel Corp Method and device for character string retrieval
CN103544167A (en) * 2012-07-13 2014-01-29 江苏新瑞峰信息科技有限公司 Backward word segmentation method and device based on Chinese retrieval
CN107193843A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of character string selection method and device based on AC automatic machines and postfix expression

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611076A (en) * 2023-07-20 2023-08-18 北京微步在线科技有限公司 Domain name matching method and device, electronic equipment and storage medium
CN116611076B (en) * 2023-07-20 2023-10-27 北京微步在线科技有限公司 Domain name matching method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115146118B (en) 2024-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant