CN114691834A - Synonym retrieval method and device

Info

Publication number: CN114691834A
Application number: CN202210355598.0A
Authority: CN
Inventors: 于楠; 蔡玉柱; 闫学森; 杜波; 李舒嫒
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2022-04-06
Filing date: 2022-04-06
Publication date: 2022-07-01

Abstract

The application provides a synonym retrieval method and device. The method comprises: acquiring a query statement from a first application system; determining, from synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache, the synonym list corresponding to the first application system as a target synonym list; obtaining the synonyms and synonym weights corresponding to the search term from the target synonym list, and generating a new query statement according to the search term and its corresponding synonyms and synonym weights; and retrieving information on the first application system based on the new query statement and the index information corresponding to the first application system. Because the synonym lists are maintained in the synonym cache, index information only needs to be established for each word contained in the word segmentation device word bank when the index is built, which saves the disk space occupied by index information. The importance of the original word and of its synonyms can be distinguished based on the synonym weights, and the recall rate of the retrieval results is improved.

Description

Synonym retrieval method and device

Technical Field

The present application relates to the field of retrieval, and in particular, to a method and an apparatus for retrieving synonyms.

Background

With the rapid development of the internet, the service scenes of information retrieval are more and more abundant, wherein the role of synonym retrieval in information retrieval is increasingly important.

In the conventional synonym retrieval method, indexes are generated in a full-text retrieval engine for both an original word in a target document and its synonyms. That is, the index of the original word and the indexes of its synonyms point to the same position in the document, so that the document containing the original word can be matched whether a user searches for the original word or for a synonym.

The conventional synonym retrieval method can greatly improve the recall rate of retrieval results and the user experience. However, because indexes are established in the full-text retrieval engine for the synonyms as well as for the original words in the target document, the indexes of a large number of synonyms must be stored in addition. The disk space consumed is therefore proportional to the number of documents containing synonyms, and a large amount of disk space is wasted.

Disclosure of Invention

In view of this, the present application provides a synonym retrieval method and apparatus, for solving the problems in the prior art that the index of synonyms wastes disk space and the importance of original words and synonyms cannot be distinguished due to the same weight of original words and synonyms, and the technical solution is as follows:

a synonym retrieval method, comprising:

acquiring a query statement from a first application system, wherein the query statement comprises a search term;

determining, from synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache, the synonym list corresponding to the first application system as a target synonym list, wherein the synonym list maintains the correspondence among a target word, the synonyms corresponding to the target word and the synonym weights, and the target word is an original word on the application system or a synonym of the original word;

obtaining synonyms and synonym weights corresponding to the search terms from the target synonym list, and generating new query sentences according to the search terms, the synonyms and the synonym weights corresponding to the search terms;

and retrieving information on the first application system based on the new query statement and the index information corresponding to the first application system, wherein the index information corresponding to the first application system comprises index information established for each word corresponding to the first application system in the word segmentation device word bank.

Optionally, the method further includes: updating a synonym list maintained by the synonym cache;

updating the synonym list maintained by the synonym cache, comprising the following steps:

monitoring whether a synonym newly-added task exists or not, wherein the synonym newly-added task comprises a newly-added word and system indication information, the newly-added word is a synonym newly-added for an original word on a second application system, and the second application system is an application system indicated by the system indication information contained in the synonym newly-added task;

and if the synonym adding task exists, updating the synonym list corresponding to the second application system in the synonym cache based on the new added words and the synonym weight corresponding to the new added words.

Optionally, the method further includes: updating index information respectively corresponding to a plurality of application systems;

updating the index information respectively corresponding to the plurality of application systems, including:

if the synonym adding task exists, judging whether the new added word is an unregistered word based on the word stock of the word segmentation device;

if so, updating the word segmentation device word bank based on the newly added words to obtain an updated word segmentation device word bank;

and carrying out index reconstruction according to the updated word segmentation device word bank and the indexes in the original index bank corresponding to the second application system to obtain a reconstructed new index corresponding to the second application system.

Optionally, the word segmentation device word bank includes word segmentation device word banks corresponding to each node in the search engine cluster;

updating the word stock of the word segmentation device based on the newly added words comprises the following steps:

generating a newly added word bank file and a word bank updating identification file based on the newly added words, and writing the newly added words into the newly added word bank file;

monitoring whether the word bank updating identification file is updated, and if so, loading the newly added words in the newly added word bank file into the word segmentation device word banks respectively corresponding to the nodes of the search engine cluster.
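The file-based word bank update described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the file names `new_words.dic` and `lexicon_update.flag`, the version counter stored in the identification file, and the `NodeLexicon` class are all hypothetical, and the application does not specify how a node detects that the identification file has changed.

```python
from pathlib import Path
from typing import Iterable, Set

# Hypothetical file names; the application does not name these files.
LEXICON_FILE = "new_words.dic"
FLAG_FILE = "lexicon_update.flag"


def publish_new_words(directory: Path, new_words: Iterable[str]) -> None:
    """Write newly added words to the word bank file, then bump the flag file."""
    with (directory / LEXICON_FILE).open("a", encoding="utf-8") as fh:
        for word in new_words:
            fh.write(word + "\n")
    flag = directory / FLAG_FILE
    version = int(flag.read_text()) if flag.exists() else 0
    flag.write_text(str(version + 1))  # assumption: a counter marks each update


class NodeLexicon:
    """Stand-in for the word segmentation device word bank of one cluster node."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.words: Set[str] = set()
        self.seen_version = 0

    def poll(self) -> bool:
        """Load the newly added words if the flag file changed since the last poll."""
        flag = self.directory / FLAG_FILE
        if not flag.exists():
            return False
        version = int(flag.read_text())
        if version == self.seen_version:
            return False
        text = (self.directory / LEXICON_FILE).read_text(encoding="utf-8")
        self.words.update(line for line in text.splitlines() if line)
        self.seen_version = version
        return True
```

In this sketch each node calls `poll()` periodically, so every node of the cluster picks up the same newly added words without being contacted individually.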

Optionally, the index reconstruction is performed according to the updated word segmentation device word bank and the index in the original index bank corresponding to the second application system, and the index reconstruction includes:

creating a new index library corresponding to the second application system;

and according to the updated word segmentation device word bank and the index in the original index bank corresponding to the second application system, carrying out index reconstruction in the new index bank.

Optionally, the method further includes:

if an index needs to be newly added in the index reconstruction process, creating a transition index base, and writing the newly added index into the transition index base after setting the transition index base as a default attribute;

and after the index reconstruction is finished, switching the new index base to be a default attribute, and writing the new index in the transition index base into the new index base to obtain the reconstructed new index contained in the new index base.
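The interplay of the original, new and transition index bases can be illustrated with a toy in-memory model. This is a sketch under assumptions, not the described implementation: the `IndexLibraries` class and its method names are hypothetical, the transition index base is created eagerly at the start of the rebuild rather than only when a new index arrives, and re-segmentation with the updated word bank is replaced by a plain copy.

```python
from typing import Dict, Set

Index = Dict[str, Set[str]]  # word -> ids of documents containing it


class IndexLibraries:
    """Toy model of the original, new and transition index bases."""

    def __init__(self, original: Index) -> None:
        self.libraries: Dict[str, Index] = {"original": original}
        self.default = "original"  # the base that receives incoming writes

    def add_index(self, word: str, doc_id: str) -> None:
        """Write a newly added index into whichever base is the default."""
        self.libraries[self.default].setdefault(word, set()).add(doc_id)

    def begin_rebuild(self) -> None:
        # Create the new base and route incoming writes to a transition base.
        self.libraries["new"] = {}
        self.libraries["transition"] = {}
        self.default = "transition"

    def finish_rebuild(self) -> None:
        new = self.libraries["new"]
        # Copy the original entries (re-segmentation is omitted in this sketch).
        for word, doc_ids in self.libraries["original"].items():
            new.setdefault(word, set()).update(doc_ids)
        # Switch the default, then fold in what arrived during the rebuild.
        self.default = "new"
        for word, doc_ids in self.libraries.pop("transition").items():
            new.setdefault(word, set()).update(doc_ids)
```

The point of the transition base in this model is that writes arriving during the rebuild are neither lost nor applied to the base that is being replaced.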

A synonym retrieval device, comprising: a query statement acquisition module, a target synonym list determination module, a query statement generation module and a retrieval module;

the query statement acquisition module is used for acquiring a query statement from a first application system, wherein the query statement comprises a search term;

the target synonym list determination module is used for determining, from synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache, the synonym list corresponding to the first application system as a target synonym list, wherein the synonym list maintains the correspondence among a target word, the synonyms corresponding to the target word and the synonym weights, and the target word is an original word on the application system or a synonym of the original word;

the query sentence generating module is used for acquiring synonyms and synonym weights corresponding to the search terms from the target synonym list and generating new query sentences according to the search terms, the synonyms and the synonym weights corresponding to the search terms;

and the retrieval module is used for retrieving the information on the first application system based on the new query statement and the index information corresponding to the first application system, wherein the index information corresponding to the first application system comprises index information established for each word corresponding to the first application system in the word segmentation device word bank.

Optionally, the device further includes: a synonym list updating module, used for updating the synonym list maintained by the synonym cache;

the synonym list updating module comprises: a task monitoring sub-module and a synonym list updating sub-module;

the task monitoring submodule is used for monitoring whether a synonym newly-added task exists or not, wherein the synonym newly-added task comprises a newly-added word and system indication information, the newly-added word is a synonym newly-added for an original word on a second application system, and the second application system is an application system indicated by the system indication information contained in the synonym newly-added task;

and the synonym list updating submodule is used for updating the synonym list corresponding to the second application system in the synonym cache based on the new added words and the synonym weight corresponding to the new added words if the synonym new adding task exists.

Optionally, the device further includes: an index information updating module, used for updating the index information respectively corresponding to the plurality of application systems;

the index information updating module comprises: an unregistered word judgment sub-module, a word segmentation device word bank updating sub-module and an index reconstruction sub-module;

the unregistered word judgment sub-module is used for judging, if the synonym adding task exists, whether the newly added word is an unregistered word based on the word segmentation device word bank;

the word segmentation device word bank updating sub-module is used for updating the word segmentation device word bank based on the newly added word to obtain an updated word segmentation device word bank if the unregistered word judgment sub-module judges that the newly added word is an unregistered word;

and the index reconstruction submodule is used for reconstructing an index according to the updated word segmentation device word bank and the index in the original index bank corresponding to the second application system to obtain a reconstructed new index corresponding to the second application system.

Optionally, the word segmentation device word bank updating sub-module comprises: a file generation sub-module and a file monitoring sub-module;

the file generation submodule is used for generating a newly added word bank file and a word bank updating identification file based on the newly added words and writing the newly added words into the newly added word bank file;

and the file monitoring sub-module is used for monitoring whether the word bank updating identification file is updated, and if so, loading the newly added words in the newly added word bank file into the word segmentation device word banks respectively corresponding to the nodes of the search engine cluster.

According to the technical scheme, the synonym retrieval method comprises the steps of firstly obtaining query sentences from a first application system, then determining a synonym list corresponding to the first application system from synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache to serve as a target synonym list, then obtaining synonyms and synonym weights corresponding to the search words from the target synonym list, generating new query sentences according to the search words, the synonyms and the synonym weights corresponding to the search words, and finally retrieving information on the first application system based on the new query sentences and index information corresponding to the first application system. The method and the device can maintain the synonym lists corresponding to the application systems in the synonym cache, index information only needs to be established for each word contained in the word segmentation device word bank when the index is established, the disk space occupied by the index information is greatly saved, the retrieval words, the synonyms corresponding to the retrieval words and the synonym weight are taken into consideration when the index is retrieved, the important degree of the original words and the synonyms can be distinguished based on the synonym weight, and the recall rate of retrieval results is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow chart illustrating a synonym retrieval method according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a synonym retrieval device according to an embodiment of the present disclosure;

FIG. 3 is a diagram illustrating a synonym search architecture with configurable weights according to an embodiment of the present disclosure;

FIG. 4 is a block diagram of a hardware structure of a synonym retrieval device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The application provides a synonym retrieval method which can be realized based on a search engine cluster, so that when one or some search engine nodes in the search engine cluster can not provide service temporarily, synonym retrieval service can be provided through other search engine nodes. In order to make the present application more understandable to those skilled in the art, the following embodiments are provided to describe the synonym retrieval method provided in the present application in detail.

Referring to fig. 1, a schematic flow chart of a synonym retrieval method provided in an embodiment of the present application is shown, where the synonym retrieval method may include:

Step S101, acquiring a query statement from the first application system.

In the application, the search engine cluster may provide a synonym retrieval service for a plurality of application systems, where the plurality of application systems include the first application system, and when a user inputs a query statement through the first application system, the query statement from the first application system may be obtained in this step. Here, the query statement includes a search term, and information in the first application system may be searched based on the search term in the query statement.

It is understood that when the user inputs a query statement, the first application system may generate a query request based on the query statement. In an alternative embodiment, this step may include: acquiring a query request from the first application system, and parsing the query request to obtain a query statement that the underlying search engine can recognize.

Optionally, before the query request is parsed, permission verification may be performed on it, including verification of the read permission for the index in the index bank specified by the user and verification of the synonym retrieval permission; the query request is parsed after the permission verification passes.

Optionally, after the query statement is obtained in this step, it may be checked, including checks on the query mode, the query field, the return field, the highlight field, and the like; the following steps are performed after the check passes.
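One possible shape of this optional pre-processing is sketched below. Everything named here is an assumption made for illustration: the permission names, the request layout with a `query` entry, and the field names `query_mode`, `query_field`, `return_fields`, `search_term` and `highlight_fields` do not come from the application, which only lists the kinds of content that are checked.

```python
from typing import Any, Dict, Set

# Hypothetical permission names and request fields, chosen only for illustration.
REQUIRED_PERMISSIONS = {"index_read", "synonym_search"}
REQUIRED_FIELDS = ("query_mode", "query_field", "return_fields", "search_term")


def extract_query_statement(request: Dict[str, Any], granted: Set[str]) -> Dict[str, Any]:
    """Verify permissions, parse the request, then check the query statement."""
    missing = REQUIRED_PERMISSIONS - granted
    if missing:
        raise PermissionError(f"missing permissions: {sorted(missing)}")

    statement = request.get("query", {})
    for field in REQUIRED_FIELDS:
        if not statement.get(field):
            raise ValueError(f"query statement lacks '{field}'")
    # The highlight field is treated as optional in this sketch.
    statement.setdefault("highlight_fields", [])
    return statement
```

Rejecting the request before parsing keeps unauthorized callers from reaching the synonym cache or the index at all.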

Step S102, determining a synonym list corresponding to the first application system from synonym lists respectively corresponding to a plurality of application systems maintained by the synonym cache as a target synonym list.

In the prior art, if the original word contained in the document has the synonym, the index of the original word (the original word refers to the word contained in the information on the application system) and the index of the synonym need to be stored in the index library at the same time, and because the index of the original word and the index of the synonym point to the same position of the document, a large amount of disk space is wasted, and the retrieval efficiency is influenced to a certain extent.

To solve this problem, the inventors of the present application propose setting up a synonym cache in which the synonym lists respectively corresponding to a plurality of application systems are maintained, so that only the indexes of the original words need to be stored on disk, which greatly saves disk space.

Here, the synonym list maintains the correspondence among the target word, the synonyms corresponding to the target word, and the synonym weights. The synonym cache may store data in the form of key-value pairs, where the target word is the word at the "key" position of a key-value pair, and the synonyms corresponding to the target word are the words at the "value" position.

Alternatively, the target word may be only the original word. Preferably, considering that lookup by the "key" position of a key-value pair is fast, whereas it is difficult to look up a synonym and its weight by the "value" position in the cache, the target word may be either the original word or a synonym of the original word. In this way, whether the application system retrieves the original word or a synonym, the synonym list corresponding to the application system can be obtained quickly from the synonym cache, and the synonyms of different application systems do not affect each other.

For example, assume that "original word 1" on application system A has two synonyms, "synonym 1" and "synonym 2". The synonym cache then maintains three sets of synonym mappings with three sets of weights. Take as an example the case where, when the target word is "original word 1", the weight of "synonym 1" is 0.5 and the weight of "synonym 2" is 0.3; when the target word is "synonym 1", the weight of "original word 1" is 0.5 and the weight of "synonym 2" is 0.2; and when the target word is "synonym 2", the weight of "synonym 1" is 0.4 and the weight of "original word 1" is 0.1. The logical relationship is then as follows:

{ "application system A": [
    { "original word 1": [ { "synonym 1": 0.5 }, { "synonym 2": 0.3 } ] },
    { "synonym 1": [ { "original word 1": 0.5 }, { "synonym 2": 0.2 } ] },
    { "synonym 2": [ { "synonym 1": 0.4 }, { "original word 1": 0.1 } ] }
] }

It should be noted that the values 0.5, 0.3, 0.2, 0.4 and 0.1 given above are only examples and do not limit the synonym weights. In the synonym list, the weight of the target word is always 1, and if a synonym weight is not configured, it defaults to 1.
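The key-value layout of the synonym cache can be mirrored in a small in-memory structure. The following is a minimal sketch that assumes a plain Python dictionary stands in for the cache; the type aliases and the function names `target_synonym_list` and `lookup` are hypothetical, and a real deployment would presumably use a dedicated cache service.

```python
from typing import Dict, List, Optional, Tuple

# application system -> target word -> [(synonym, configured weight or None)]
SynonymList = Dict[str, List[Tuple[str, Optional[float]]]]
SynonymCache = Dict[str, SynonymList]

DEFAULT_WEIGHT = 1.0  # used when no synonym weight is configured

synonym_cache: SynonymCache = {
    "application system A": {
        "original word 1": [("synonym 1", 0.5), ("synonym 2", 0.3)],
        "synonym 1": [("original word 1", 0.5), ("synonym 2", 0.2)],
        "synonym 2": [("synonym 1", 0.4), ("original word 1", 0.1)],
    }
}


def target_synonym_list(cache: SynonymCache, system: str) -> SynonymList:
    """Select the synonym list that belongs to one application system."""
    return cache.get(system, {})


def lookup(cache: SynonymCache, system: str, target_word: str) -> List[Tuple[str, float]]:
    """Return (synonym, weight) pairs for a target word, applying the default weight."""
    entries = target_synonym_list(cache, system).get(target_word, [])
    return [(syn, DEFAULT_WEIGHT if w is None else w) for syn, w in entries]
```

Because both the original word and each synonym appear as keys, a lookup for any of the three words is a single dictionary access, and a word defined only for another application system returns nothing.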

It should be noted that the term "synonym" described in the present application includes synonyms commonly used in daily life, and also includes different terms used by products in different fields for the same concept, and synonyms defined by different application systems according to their own work fields and the search habits of users.

Because the synonym cache only needs to maintain the corresponding relation among the target words, the synonyms corresponding to the target words and the synonym weights, and the number of the synonyms of the target words and the target words is relatively limited, the cache space occupation of the application is relatively small. Meanwhile, the cache has the advantage of high query speed, so that the query can be carried out more efficiently, and the retrieval efficiency is improved.

Step S103, obtaining synonyms and synonym weights corresponding to the search terms from the target synonym list, and generating a new query sentence according to the search terms, the synonyms and the synonym weights corresponding to the search terms.

As introduced above, the term "synonym" as provided herein includes synonyms in the conventional sense and also includes different terms used for the same concept by products in different fields. It is understood that performing a search based on only the search term may result in incomplete search results or search results that are not information intended by the user.

To improve the recall rate of search results (recall rate is the ratio of the amount of relevant information returned in the search results to the total amount of relevant information in the index, and is one of the important metrics for evaluating a search engine), the search term and its synonyms can be used for searching at the same time. On this basis, the synonyms and synonym weights corresponding to the search term can be obtained from the target synonym list, and a new query statement is generated according to the search term (with a weight of 1) and its corresponding synonyms and synonym weights.

For example, assume that the original query statement (i.e., the query statement in step S101) is: query "search term: computer, weight: 1". Assuming that, in the target synonym list, synonym 1 corresponding to "computer" is "computer" with a weight of 0.5 and synonym 2 is "notebook" with a weight of 0.3, the new query statement is: simultaneously query the results for "search term: computer, weight: 1", "search term: computer, weight: 0.5", and "search term: notebook, weight: 0.3".
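The generation of the new query statement can be sketched as a small function that turns one search term into a list of weighted clauses. This is only an illustration: the function name `expand_query` and the representation of a query statement as a list of (term, weight) pairs are assumptions, and the placeholder words of the earlier cache example are used in place of real vocabulary.

```python
from typing import Dict, List, Optional, Tuple

SynonymList = Dict[str, List[Tuple[str, Optional[float]]]]


def expand_query(search_term: str, target_list: SynonymList) -> List[Tuple[str, float]]:
    """Build weighted clauses: the search term at weight 1, then each synonym."""
    clauses: List[Tuple[str, float]] = [(search_term, 1.0)]
    for synonym, weight in target_list.get(search_term, []):
        # An unconfigured synonym weight defaults to 1.
        clauses.append((synonym, 1.0 if weight is None else weight))
    return clauses


target_list: SynonymList = {
    "original word 1": [("synonym 1", 0.5), ("synonym 2", 0.3)],
}
new_query = expand_query("original word 1", target_list)
```

A search term without an entry in the target synonym list yields a single clause with weight 1, so the expanded query degrades to the original query.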

And step S104, retrieving information on the first application system based on the new query statement and the index information corresponding to the first application system.

In this step, the new query statement may be matched with the index information corresponding to the first application system, so as to obtain information required by the user on the first application system based on the matched index.

Here, the index information corresponding to the first application system includes index information established for each word corresponding to the first application system in the word segmentation device lexicon; correspondingly, the index information corresponding to any application system in the plurality of application systems comprises index information established for each word corresponding to the application system in the word segmentation device word bank. That is, the word segmentation device word bank may include words corresponding to a plurality of application systems, and in this step, index information may be established for each word in the word segmentation device word bank, so that index information corresponding to a plurality of application systems may be obtained.

According to the present application, index information is established separately for each word in the word segmentation device word bank, so that when retrieval is performed based on the new query statement, the search results matched by the search term and those matched by its synonyms may differ, which improves the recall rate of the search results.
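How weighted clauses might be matched against per-word index information is illustrated below with a toy inverted index. The scoring rule, which adds the weight of every matching clause to a document's score, is an assumption made for this sketch; the application does not state how the underlying search engine combines the weights. The names `build_index` and `search` are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_index(documents: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Index each segmented word once: word -> ids of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], clauses: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Score documents by summing the weights of the clauses they match."""
    scores: Dict[str, float] = defaultdict(float)
    for term, weight in clauses:
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


documents = {
    "doc1": ["original word 1", "other word"],
    "doc2": ["synonym 1"],
    "doc3": ["unrelated word"],
}
index = build_index(documents)
results = search(index, [("original word 1", 1.0), ("synonym 1", 0.5)])
```

Here `doc2` is recalled only because of the synonym clause, and it ranks below `doc1` because the synonym carries a lower weight than the search term.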

The synonym retrieval method comprises the steps of firstly obtaining query sentences from a first application system, then determining a synonym list corresponding to the first application system from synonym lists respectively corresponding to a plurality of application systems maintained by synonym cache to serve as a target synonym list, then obtaining synonyms and synonym weights corresponding to the search words from the target synonym list, generating new query sentences according to the search words, the synonyms and the synonym weights corresponding to the search words, and finally retrieving information on the first application system based on the new query sentences and index information corresponding to the first application system. The method and the device can maintain the synonym lists corresponding to the application systems in the synonym cache, index information only needs to be established for each word contained in the word segmentation device word bank when the index is established, the disk space occupied by the index information is greatly saved, the retrieval words, the synonyms corresponding to the retrieval words and the synonym weight are taken into consideration when the index is retrieved, the important degree of the original words and the synonyms can be distinguished based on the synonym weight, and the recall rate of retrieval results is improved.

An embodiment of the present application describes an update process of a synonym list maintained by a synonym cache and an update process of index information corresponding to each of a plurality of application systems.

It can be understood that, there may be a case that a new synonym needs to be added to an original word on an application system, optionally, in this embodiment, when a new synonym needs to be added to an original word on an application system, a new synonym addition task is generated, and when it is monitored that the new synonym addition task exists, the newly added synonym is updated to the synonym cache, that is, based on the newly added synonym and the corresponding synonym weight, the synonym list corresponding to the application system in the synonym cache is updated.

Optionally, generating a synonym adding task and updating the synonym list corresponding to the application system every time a synonym needs to be added is cumbersome. To address this, the present application may maintain a synonym dictionary. The synonym dictionary may store the target words maintained by the synonym cache together with their corresponding synonyms and synonym weights, and may also store the synonyms to be added and their corresponding synonym weights. After newly added synonyms have been stored in the synonym dictionary multiple times, the synonym list is updated in batch based on those newly added synonyms; of course, if necessary, the synonym list may also be updated after a newly added synonym has been stored only once.

Specifically, for convenience of description, the one or more newly added synonyms are referred to as newly added words, and the application system corresponding to the newly added words is referred to as a second application system (one of the plurality of application systems); that is, a newly added word is a synonym added for an original word on the second application system.

The process of performing synonym list update in batch based on the multiple newly added synonyms in the present application may include: and monitoring whether a synonym adding task exists, and if the synonym adding task exists, updating a synonym list corresponding to a second application system in the synonym cache based on the new synonym and the synonym weight corresponding to the new synonym. Here, the synonym addition task includes an addition word and system indication information, optionally, the addition word refers to a synonym added for an original word on the second application system, and the second application system is an application system indicated by the system indication information included in the synonym addition task.
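A batch update of the synonym cache from pending synonym adding tasks might look like the following sketch. The `SynonymAddTask` fields and the `apply_tasks` function are hypothetical. The sketch also registers the reverse mapping (newly added word as target word) with the same weight, which is a simplification: the earlier example shows that the weights in the two directions may be configured differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# application system -> target word -> {synonym: weight}
SynonymCache = Dict[str, Dict[str, Dict[str, float]]]


@dataclass
class SynonymAddTask:
    system: str                      # system indication information
    original_word: str               # original word on that application system
    new_word: str                    # newly added synonym
    weight: Optional[float] = None   # None means "not configured"


def apply_tasks(cache: SynonymCache, tasks: Iterable[SynonymAddTask]) -> SynonymCache:
    """Apply a batch of pending synonym adding tasks to the cache in place."""
    for task in tasks:
        weight = 1.0 if task.weight is None else task.weight
        system_list = cache.setdefault(task.system, {})
        system_list.setdefault(task.original_word, {})[task.new_word] = weight
        # Assumption: the reverse direction reuses the same weight.
        system_list.setdefault(task.new_word, {})[task.original_word] = weight
    return cache
```

Collecting tasks and applying them in one pass corresponds to the batch mode described above; passing a single task reproduces the update-after-one-addition case.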

It should be noted that, after the newly added word is updated to the synonym cache, if the newly added word is not an unregistered word, it takes effect directly, because its index already exists in the index bank corresponding to the second application system. If the newly added word is an unregistered word, it does not yet take effect, because its index is not in the index bank corresponding to the second application system; the index information corresponding to the second application system must be updated before the newly added word takes effect, after which information can be retrieved based on it. Here, an "unregistered word" is a word that is not included in the word segmentation device word bank but must nevertheless be segmented as a word.

Optionally, the process of updating the index information corresponding to the second application system may include: if the synonym adding task exists, judging whether the new added word is an unregistered word based on the word stock of the word segmentation device; if yes, updating the word segmentation device word stock based on the newly added words to obtain an updated word segmentation device word stock; and carrying out index reconstruction according to the updated word segmentation device word bank and the indexes in the original index bank corresponding to the second application system to obtain a reconstructed new index corresponding to the second application system.

Here, the word segmentation device word stock refers to a word stock maintained inside the word segmentation device, and the word stock may store original words on a plurality of application systems, and store synonyms of the original words when the original words have the synonyms.

It is to be noted that, in the embodiment, a process of updating the index information corresponding to the second application system is provided, and the updating processes of the index information corresponding to other application systems in the plurality of application systems are the same as the updating process of the index information corresponding to the second application system, and are not repeated in this application.

In addition, it should be further noted that, the sequence of each step mentioned in the introduction of the above-mentioned process of updating the synonym list maintained by the synonym cache and the process of updating the index information corresponding to each of the plurality of application systems is only an example, and is not limited to this application.

For example, in an actual application, the process of updating the synonym list maintained by the synonym cache and the process of updating the index information corresponding to each of the plurality of application systems may be as follows:

First, monitor whether a synonym adding task exists; if not, continue monitoring. If a synonym adding task exists, judge whether the newly added word is an unregistered word. If it is, update the word segmenter word bank based on the newly added word to obtain an updated word segmenter word bank, then perform index reconstruction according to the updated word segmenter word bank and the index in the original index library corresponding to the second application system to obtain a reconstructed new index corresponding to the second application system, and, after the index is reconstructed, update the synonym list corresponding to the second application system based on the newly added word and its synonym weight. If it is not, skip the steps of updating the word segmenter word bank and reconstructing the index, and directly add the newly added word to the synonym cache.
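
The flow above can be illustrated with a minimal Python sketch. All names here (`SynonymTask`, `SearchState`, `rebuild_index`, `handle_synonym_task`) are hypothetical and do not come from the application; the index reconstruction is reduced to a placeholder, and the synonym cache is simplified to a nested dictionary keyed by application system and original word.

```python
from dataclasses import dataclass, field


@dataclass
class SynonymTask:
    """A synonym adding task (field names are illustrative assumptions)."""
    system: str          # system indication information
    original_word: str   # original word on the indicated application system
    new_word: str        # newly added synonym
    weight: float        # synonym weight of the newly added word


@dataclass
class SearchState:
    segmenter_words: set = field(default_factory=set)    # word segmenter word bank
    synonym_cache: dict = field(default_factory=dict)    # system -> word -> {synonym: weight}
    rebuilt_systems: list = field(default_factory=list)  # systems whose index was rebuilt


def rebuild_index(state: SearchState, system: str) -> None:
    """Placeholder for index reconstruction; it only records the request."""
    state.rebuilt_systems.append(system)


def handle_synonym_task(state: SearchState, task: SynonymTask) -> None:
    # An unregistered word is one that the word segmenter word bank lacks.
    if task.new_word not in state.segmenter_words:
        state.segmenter_words.add(task.new_word)  # update the word bank first
        rebuild_index(state, task.system)         # then reconstruct the index
    # In both branches, finish by updating the synonym list in the cache.
    per_system = state.synonym_cache.setdefault(task.system, {})
    per_system.setdefault(task.original_word, {})[task.new_word] = task.weight


state = SearchState(segmenter_words={"loan", "credit"})
handle_synonym_task(state, SynonymTask("sys_b", "loan", "credit", 0.8))     # registered word
handle_synonym_task(state, SynonymTask("sys_b", "loan", "borrowing", 0.5))  # unregistered word
```

In this sketch only the second task triggers a word bank update and an index reconstruction, while both tasks end up in the synonym cache.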

The foregoing embodiments have introduced that the synonym retrieval method provided in the present application may be implemented based on a search engine cluster, and based on this, the segmenter thesaurus may include a segmenter thesaurus corresponding to each node in the search engine cluster.

Then, in an alternative embodiment, the process of "updating the word segmentation device word stock based on the newly added word" may include:

A1, generating a newly added word bank file and a word bank update identification file based on the newly added word, and writing the newly added word into the newly added word bank file.

Optionally, in this step, after the newly added word is written into the newly added word bank file, the newly added word bank file may be stored in the word bank directory of the specified word segmenter.

Optionally, in this step, when the word segmenter word bank needs to be updated based on the newly added word, a task for the newly added synonym word bank is created first, and then the newly added word bank file and the word bank update identification file are generated based on the newly added word.

A2, monitoring whether the word bank update identification file is updated, and if so, loading the newly added words in the newly added word bank file into the word segmenter word banks respectively corresponding to the nodes of the search engine cluster.

In this step, the word segmenter of each node in the search engine cluster may start a monitoring thread to monitor whether the word bank update identification file is updated. If it is (that is, A1 has generated a word bank update identification file based on newly added words), newly added words exist, and the newly added words in the newly added word bank file can be loaded into the word segmenter word banks respectively corresponding to the nodes in the search engine cluster, so that index reconstruction can subsequently be performed based on the words in the updated word segmenter word banks.

It should be noted that, in this step, the newly added words in the newly added word bank file are loaded into the word segmenter word banks respectively corresponding to the nodes of the search engine cluster, so that the word segmenter word banks respectively corresponding to the nodes of the search engine cluster are kept consistent.
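
As an illustration of A1 and A2, the following Python sketch uses a shared directory holding a word file and a flag file. The file names, the integer version stored in the flag file, and the `NodeSegmenter` class are assumptions made for this example; the application does not specify a file format. In a real deployment `poll` would run inside each node's monitoring thread, whereas here it is called explicitly.

```python
from pathlib import Path

WORD_FILE = "new_words.dic"  # hypothetical name of the newly added word bank file
FLAG_FILE = "update.flag"    # hypothetical name of the update identification file


def publish_new_words(directory: Path, words: list) -> None:
    """Step A1: write the newly added words, then bump the update flag."""
    (directory / WORD_FILE).write_text("\n".join(words) + "\n", encoding="utf-8")
    flag = directory / FLAG_FILE
    version = int(flag.read_text(encoding="utf-8")) + 1 if flag.exists() else 1
    flag.write_text(str(version), encoding="utf-8")


class NodeSegmenter:
    """The word segmenter of one cluster node, with its own word bank."""

    def __init__(self, directory: Path, words: set) -> None:
        self.directory = directory
        self.words = set(words)
        self.seen_version = 0  # last flag version this node has loaded

    def poll(self) -> bool:
        """Step A2: load the new words if the flag changed; True if reloaded."""
        flag = self.directory / FLAG_FILE
        if not flag.exists():
            return False
        version = int(flag.read_text(encoding="utf-8"))
        if version == self.seen_version:
            return False
        text = (self.directory / WORD_FILE).read_text(encoding="utf-8")
        self.words.update(line.strip() for line in text.splitlines() if line.strip())
        self.seen_version = version
        return True
```

Because every node reads the same word file, the word banks of all nodes contain the same words once each node has polled after an update.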

Optionally, on the basis of A1 and A2, loading may fail when the newly added words in the newly added word bank file are loaded into the word segmenter word bank corresponding to each node. Therefore, after the newly added words are loaded, each node in the search engine cluster returns a loading result. If the loading result returned by any node indicates that loading failed, the task state is updated to failed; if every node in the search engine cluster returns a loading result indicating that loading succeeded, the task state is updated to successful.

Optionally, when a node returns a loading result indicating that loading failed, the word segmenter word bank corresponding to that node may reload the newly added words; if loading still fails after a set number of retries, the node's word segmenter word bank is considered to have failed to load the newly added words.
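
The retry and task state rules can be sketched as follows. This is a minimal illustration under assumptions: each node's loading step is modeled as a callable returning `True` on success, the state strings `"success"` and `"failed"` are chosen for the example, and `flaky_loader` is only a test double.

```python
from typing import Callable, Dict


def load_with_retry(load: Callable[[], bool], max_retries: int) -> bool:
    """Try once, then retry up to max_retries more times while loading fails."""
    for _ in range(1 + max_retries):
        if load():
            return True
    return False


def run_load_task(loaders: Dict[str, Callable[[], bool]], max_retries: int = 2) -> str:
    """Return the task state: 'success' only if every node loaded the new words."""
    results = {node: load_with_retry(load, max_retries) for node, load in loaders.items()}
    return "success" if all(results.values()) else "failed"


def flaky_loader(failures: int) -> Callable[[], bool]:
    """Test double: a loader that fails `failures` times and then succeeds."""
    remaining = [failures]

    def load() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return load
```

With two retries allowed, a node that fails twice and then succeeds still counts as loaded, while a node that fails three times marks the whole task as failed.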

The process of "reconstructing the index according to the updated word segmenter lexicon and the index in the original index repository corresponding to the second application system" is described below.

Optionally, the process of this embodiment may include the following steps:

B1, creating a new index library corresponding to the second application system.

It should be understood by those skilled in the art that after a new added word is loaded in the word segmentation device word library, an index of the new added word cannot be simply created in the original index library corresponding to the second application system, but the index needs to be reconstructed, and therefore, a new index library corresponding to the second application system needs to be created through this step first to store the reconstructed new index.

Optionally, when index reconstruction is required, a reconstruction index task may be generated first, and then a new index library may be created.

B2, performing index reconstruction in the new index library according to the updated word segmenter word bank and the index in the original index library corresponding to the second application system.

Optionally, in this step, after the index reconstruction is performed in the new index base, the original index base corresponding to the second application system may also be closed.

The process of index reconstruction in this step is the same as that in the prior art, and will not be described in detail here.

If no index is newly added during execution of the index reconstruction task, the task is completed through B1 and B2; if an index is newly added during execution of the index reconstruction task, the newly added index can still be written into the new index library through B3 and B4 below.

B3, if an index needs to be added in the index reconstruction process, creating a transition index base, and after setting the transition index base as a default attribute, writing the added index into the transition index base.

In the prior art, when an index is newly added while an index reconstruction task is executing, the newly added index is written into the original index library. If index reconstruction has already passed the position where the newly added index is written, the newly added index never reaches the new index library, so the indexes in the new and old index libraries become inconsistent. To avoid this, the prior art usually shuts down the index adding service directly during index reconstruction.

Based on this, the inventor considers that a transition index library can be created and set to the default attribute (an external index can be written into an index library only when that index library has the default attribute). At this point, indexes are read through the transition index library and the original index library, and indexes are written through the transition index library.

Then, if a new index is needed in the index rebuilding process, the new index is written into the transition index library.

B4, after the index reconstruction is completed, switching the new index library to the default attribute, and writing the newly added index in the transition index library into the new index library to obtain the reconstructed new index contained in the new index library.

In this step, after the index reconstruction is completed, the new index library may be switched to the default attribute; at this point, indexes are read through the transition index library and the new index library, and indexes are written through the new index library. The newly added index in the transition index library is then written into the new index library to complete the whole index reconstruction task, and optionally the task state is updated.

Optionally, after the step is executed, the transition index library may be deleted.

Therefore, the index reconstruction method provided in this embodiment can reconstruct the index library without shutting down the index adding service, and can ensure the integrity and consistency of the indexes in the new and old index libraries after reconstruction.
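
Steps B1 to B4 can be illustrated with a small in-memory Python sketch. The class `IndexLibraries`, its method names, and the representation of an index library as a dictionary are assumptions for this example only; a real search engine would use its own index and alias mechanisms. The sketch is single-threaded and creates the transition library at the start of the rebuild, and it assumes entries written to the transition library were already analyzed with the updated word bank.

```python
from typing import Callable, Dict, List, Optional, Tuple


class IndexLibraries:
    """In-memory stand-in for the original, new, and transition index libraries."""

    def __init__(self, original: Dict[int, str]) -> None:
        self.libraries: Dict[str, Dict[int, str]] = {"original": dict(original)}
        self.default = "original"  # only the default library accepts writes
        self.read_order: List[str] = ["original"]

    def write(self, doc_id: int, entry: str) -> None:
        self.libraries[self.default][doc_id] = entry

    def read(self, doc_id: int) -> Optional[str]:
        for name in self.read_order:
            if doc_id in self.libraries[name]:
                return self.libraries[name][doc_id]
        return None

    def begin_rebuild(self) -> List[Tuple[int, str]]:
        self.libraries["new"] = {}         # B1: create the new index library
        self.libraries["transition"] = {}  # B3: create the transition library
        self.default = "transition"        # B3: writes now go to the transition library
        self.read_order = ["transition", "original"]
        return list(self.libraries["original"].items())

    def rebuild(self, snapshot: List[Tuple[int, str]], reindex: Callable[[str], str]) -> None:
        for doc_id, entry in snapshot:     # B2: reconstruct into the new library
            self.libraries["new"][doc_id] = reindex(entry)

    def finish_rebuild(self) -> None:
        self.default = "new"               # B4: switch the default attribute
        self.read_order = ["transition", "new"]
        self.libraries["new"].update(self.libraries["transition"])
        del self.libraries["transition"], self.libraries["original"]
        self.read_order = ["new"]
```

Entries written between `begin_rebuild` and `finish_rebuild` stay readable throughout and end up in the new library, which is the property the steps above are meant to guarantee.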

The embodiments of the present application further provide a synonym retrieval device, which is described below, and the synonym retrieval device described below and the synonym retrieval method described above may be referred to in correspondence.

Referring to fig. 2, a schematic structural diagram of a synonym retrieval device provided in an embodiment of the present application is shown, and as shown in fig. 2, the synonym retrieval device may include: a query sentence acquisition module 201, a target synonym list determination module 202, a query sentence generation module 203, and a retrieval module 204.

The query statement acquiring module 201 is configured to acquire a query statement from a first application system, where the query statement includes a search term.

The target synonym list determining module 202 is configured to determine, from synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache, a synonym list corresponding to a first application system, as a target synonym list, where the synonym list maintains a correspondence between a target word and a synonym corresponding to the target word and a synonym weight, and the target word is an original word on the application system or a synonym of the original word.

The query sentence generation module 203 is configured to obtain the synonym and the synonym weight corresponding to the search term from the target synonym list, and generate a new query sentence according to the search term, the synonym and the synonym weight corresponding to the search term.

The retrieval module 204 is configured to retrieve information on the first application system based on the new query statement and the index information corresponding to the first application system, where the index information corresponding to the first application system includes index information established for each word corresponding to the first application system in the word segmentation device thesaurus.

The synonym retrieval device provided by the present application first acquires a query statement from a first application system, then determines, from the synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache, the synonym list corresponding to the first application system as a target synonym list, then obtains the synonyms and synonym weights corresponding to the search term from the target synonym list and generates a new query statement according to the search term and its corresponding synonyms and synonym weights, and finally retrieves information on the first application system based on the new query statement and the index information corresponding to the first application system. Because the synonym lists corresponding to the application systems are maintained in the synonym cache, index information only needs to be established for each word contained in the word segmenter word bank when the index is built, which saves the disk space occupied by the index information. In addition, because the search term, its synonyms, and the synonym weights are all taken into account during retrieval, the relative importance of the original word and its synonyms can be distinguished based on the synonym weights, which improves the recall rate of the retrieval results.

In a possible implementation manner, the synonym search apparatus provided by the present application may further include: a synonym list update module.

The synonym list updating module is used for updating the synonym list maintained by the synonym cache.

Optionally, the synonym list updating module may include: a task monitoring sub-module and a synonym list updating sub-module.

The task monitoring submodule is used for monitoring whether a synonym newly-added task exists or not, wherein the synonym newly-added task comprises a newly-added word and system indication information, the newly-added word is a synonym newly-added for an original word on a second application system, and the second application system is an application system indicated by the system indication information contained in the synonym newly-added task.

In a possible implementation manner, the synonym retrieval device provided by the present application may further include: an index information updating module.

The index information updating module is used for updating the index information respectively corresponding to the plurality of application systems.

Optionally, the index information updating module may include: an unregistered word judgment sub-module, a word segmenter word bank updating sub-module and an index reconstruction sub-module.

The unregistered word judgment sub-module is used for judging whether the newly added word is an unregistered word based on the word segmenter word bank if the synonym adding task exists.

The word segmenter word bank updating sub-module is used for updating the word segmenter word bank based on the newly added word to obtain an updated word segmenter word bank if the unregistered word judgment sub-module judges that the newly added word is an unregistered word.

In one possible implementation, the word segmenter thesaurus includes a word segmenter thesaurus corresponding to each node in the search engine cluster.

Based on this, the word segmentation device word stock updating sub-module may include: the file generation sub-module and the file monitoring sub-module.

In a possible implementation manner, the index rebuilding sub-module may include: a new index base creating module and a reconstruction index writing module.

The new index base creating module is used for creating a new index base corresponding to the second application system.

The reconstruction index writing module is used for reconstructing the index in the new index base according to the updated word segmenter word bank and the index in the original index base corresponding to the second application system.

In a possible implementation manner, the index rebuilding sub-module may further include: a transition index base creating module and a newly added index writing module.

The transition index base creating module is used for creating a transition index base if an index needs to be newly added in the index reconstruction process, and writing the newly added index into the transition index base after the transition index base is set as a default attribute.

The newly added index writing module is used for switching the new index base to the default attribute after the index reconstruction is completed, and writing the newly added index in the transition index base into the new index base to obtain the reconstructed new index contained in the new index base.

In order to make those skilled in the art understand the present application more clearly, a specific synonym search scenario is described below as an example, and it should be noted that the following example is only an example and is not a limitation to the present application.

Optionally, the synonym search method and apparatus provided by the present application may be applied to the synonym search architecture with configurable weights shown in fig. 3. As shown in fig. 3, the synonym retrieval architecture with configurable weights includes a thesaurus management unit, a reconstruction index unit, a synonym cache, and a synonym retrieval unit.

The thesaurus management unit is mainly used for maintaining a synonym dictionary and loading the newly added words in the synonym dictionary into the word segmenter. The synonym cache is used for maintaining, for each application system, the target words, the synonyms corresponding to the target words, and the synonym weights in the cache, to serve subsequent synonym retrieval. Because newly added entries in the synonym dictionary may contain unregistered words, the reconstruction index unit is needed to reconstruct the index in the original index library based on the updated word segmenter word bank, and index adding operations remain available while the index is being reconstructed. The synonym retrieval unit is mainly used for parsing the query statement of the application system, querying the synonym cache, and generating a query statement containing the synonyms and the synonym weights.

As shown in fig. 3, the thesaurus managing unit may include: the system comprises a task generator, a word bank monitor, a word bank loader and a word segmentation device.

Optionally, the unregistered word judgment sub-module provided in the present application may be arranged on the task generator, and the word segmenter word bank updating sub-module provided in the present application may be arranged on the task generator, the word bank monitor, and the word bank loader.

The file generation sub-module contained in the word segmenter word bank updating sub-module may be arranged on the task generator, so that the task generator in the thesaurus management unit can be used to create a task for the newly added synonym word bank and, when a newly added word is judged to be an unregistered word, to generate a newly added word bank file and a word bank update identification file for the unregistered word and write the newly added word into the newly added word bank file.

The file monitoring sub-module contained in the word segmenter word bank updating sub-module may be arranged on the word bank monitor and the word bank loader. The word bank monitor can thus be used to make the word segmenter of each node in the cluster start a monitoring thread that monitors whether the word bank update identification file is updated, so as to judge whether newly added words exist. The word bank loader can be used to load the newly added words in the newly added word bank file into the word segmenter after an update of the word bank update identification file is detected, that is, to load the newly added words in the newly added word bank file into the word segmenter word banks respectively corresponding to the nodes of the search engine cluster.

The word segmenter in the thesaurus management unit is used for the subsequent index reconstruction.

As shown in fig. 3, the reconstruction index unit may include: a task generator and a task executor.

The task generator in the reconstruction index unit may be responsible for creating the index reconstruction task for an index library, and the index reconstruction task includes a task of migrating data from the transition index library to the new index library.

Optionally, the index reconstruction sub-module may be disposed on the task executor, that is, the new index base creation module, the reconstruction index writing module, the transition index base creation module, and the newly added index writing module may be disposed on the task executor, so that the task executor may be responsible for executing index reconstruction on the index library, creating the transition index library, migrating data from the transition index library to the new index library, and, after the index reconstruction task is executed, closing the original index library, deleting the transition index library, and updating the task state.

As shown in fig. 3, the synonym retrieval unit may include: an authority controller, a query parser, and a query generator.

The authority controller is used for performing permission verification on a user query request (that is, the query request containing the query statement), including verifying the user's read permission for the indexes in the specified index library and the user's synonym retrieval permission.

The query parser is used for parsing the user query request message and verifying the query statement obtained from the parsing result, including the query mode, the query field, the return field, the highlight field, and the like.

Optionally, the target synonym list determining module and the query sentence generating module provided by the application may be arranged on the query generator, so that the query generator may match a search term in the analyzed query sentence with a synonym list in the synonym cache, splice all synonyms and synonym weights in the matched synonym list into the query sentence, and generate a new query sentence.
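
A minimal Python sketch of this splicing step is given below. The function name `expand_query`, the nested-dictionary layout of the synonym cache, the default weight of 1.0 for the original search term, and the `term^weight` notation with `OR`/`AND` connectors are all assumptions made for illustration; the application does not specify a query syntax.

```python
from typing import Dict, List

# Assumed cache layout: application system -> target word -> {synonym: weight}
SynonymCache = Dict[str, Dict[str, Dict[str, float]]]


def expand_query(system: str, terms: List[str], cache: SynonymCache,
                 original_weight: float = 1.0) -> str:
    """Splice each search term's synonyms and weights into a new query string."""
    target_list = cache.get(system, {})  # target synonym list for this system
    clauses = []
    for term in terms:
        weighted = [(term, original_weight)]
        # Append the synonyms, highest weight first.
        weighted += sorted(target_list.get(term, {}).items(), key=lambda kv: -kv[1])
        clauses.append("(" + " OR ".join(f"{w}^{b}" for w, b in weighted) + ")")
    return " AND ".join(clauses)


cache = {"sys_a": {"loan": {"credit": 0.8, "borrowing": 0.5}}}
query = expand_query("sys_a", ["loan", "rate"], cache)
# query == "(loan^1.0 OR credit^0.8 OR borrowing^0.5) AND (rate^1.0)"
```

Because the synonym list is looked up per application system, the same search term issued from a system with no entry in the cache is left unexpanded.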

An embodiment of the present application further provides synonym retrieval equipment. Optionally, Fig. 4 shows a block diagram of a hardware structure of the synonym retrieval equipment. Referring to Fig. 4, the hardware structure of the synonym retrieval equipment may include: at least one processor 401, at least one communication interface 402, at least one memory 403 and at least one communication bus 404;

in the embodiment of the present application, the number of the processor 401, the communication interface 402, the memory 403 and the communication bus 404 is at least one, and the processor 401, the communication interface 402 and the memory 403 complete communication with each other through the communication bus 404;

the processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application, or the like;

the memory 403 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory 403 stores a program and the processor 401 may call the program stored in the memory 403 for:

determining a synonym list corresponding to a first application system from synonym lists respectively corresponding to a plurality of application systems maintained by synonym cache as a target synonym list, wherein the synonym list maintains the corresponding relationship among a target word, a synonym corresponding to the target word and a synonym weight, and the target word is an original word on the application systems or a synonym of the original word;

Optionally, for the refined functions and extended functions of the program, refer to the description above.

The embodiment of the present application further provides a readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for searching synonyms as described above is implemented.

Finally, it is further noted that, herein, relational terms such as first and second, and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A synonym retrieval method, comprising:

determining a synonym list corresponding to the first application system from synonym lists respectively corresponding to a plurality of application systems maintained by synonym cache as a target synonym list, wherein the synonym list maintains the corresponding relationship among a target word, a synonym corresponding to the target word and a synonym weight, and the target word is an original word on the application systems or a synonym of the original word;

and retrieving information on the first application system based on the new query statement and the index information corresponding to the first application system, wherein the index information corresponding to the first application system comprises index information established for each word corresponding to the first application system in a word segmentation device word bank.

2. The synonym retrieval method according to claim 1, further comprising: updating the synonym list maintained by the synonym cache;

the updating the synonym list maintained by the synonym cache comprises:

monitoring whether a synonym newly-added task exists, wherein the synonym newly-added task comprises a newly-added word and system indication information, the newly-added word is a synonym newly-added for an original word on a second application system, and the second application system is the application system indicated by the system indication information contained in the synonym newly-added task;

and if the synonym newly-added task exists, updating the synonym list corresponding to the second application system in the synonym cache based on the newly-added synonym and the synonym weight corresponding to the newly-added synonym.

3. The synonym retrieval method according to claim 2, further comprising: updating index information corresponding to the plurality of application systems respectively;

the updating the index information corresponding to the plurality of application systems respectively includes:

if the synonym newly-added task exists, judging whether the newly-added word is an unregistered word or not based on the word segmentation device word bank;

if yes, updating the word segmentation device word bank based on the newly-added word to obtain an updated word segmentation device word bank;

and performing index reconstruction according to the updated word segmentation device word bank and the indexes in the original index bank corresponding to the second application system to obtain a reconstructed new index corresponding to the second application system.

4. The synonym retrieval method of claim 3, wherein the word segmentation device word bank comprises word segmentation device word banks respectively corresponding to each node in a search engine cluster;

the updating the word segmentation device word bank based on the newly-added word comprises:

monitoring whether the word stock updating identification file is updated or not, and if so, respectively loading the new words in the new word stock file to word segmenter word stocks respectively corresponding to all nodes of the search engine cluster.

5. The synonym retrieval method of claim 3, wherein the index reconstruction according to the updated segmenter thesaurus and the index in the original index base corresponding to the second application system includes:

creating a new index library corresponding to the second application system;

6. The synonym retrieval method of claim 5, further comprising:

if an index needs to be newly added in the index reconstruction process, creating a transition index library, and writing the newly added index into the transition index library after setting the transition index library as a default attribute;

and after the index reconstruction is finished, switching the new index base to the default attribute, and writing the new index in the transition index base into the new index base to obtain the reconstructed new index contained in the new index base.

7. A synonym retrieval device, comprising: a query sentence acquisition module, a target synonym list determination module, a query sentence generation module and a retrieval module;

the target synonym list determining module is used for determining a synonym list corresponding to the first application system from synonym lists respectively corresponding to a plurality of application systems maintained by a synonym cache as a target synonym list, wherein the synonym list maintains the corresponding relation among a target word, a synonym corresponding to the target word and a synonym weight, and the target word is an original word on the application systems or a synonym of the original word;

the query sentence generation module is used for acquiring the synonyms and the synonym weights corresponding to the search terms from the target synonym list and generating new query sentences according to the search terms, the synonyms and the synonym weights corresponding to the search terms;

the retrieval module is configured to retrieve information on the first application system based on the new query statement and the index information corresponding to the first application system, where the index information corresponding to the first application system includes index information established for each word corresponding to the first application system in a word segmentation device thesaurus.

8. The synonym retrieval device according to claim 7, further comprising: a synonym list updating module, configured to update the synonym list maintained by the synonym cache;

the synonym list updating module comprises: a task monitoring sub-module and a synonym list updating sub-module;

and the synonym list updating sub-module is used for updating the synonym list corresponding to the second application system in the synonym cache based on the new added words and the synonym weight corresponding to the new added words if the synonym new adding task exists.

9. The synonym retrieval device according to claim 8, further comprising: an index information updating module, configured to update the index information respectively corresponding to the plurality of application systems;

the index information updating module comprises: an unregistered word judgment sub-module, a word segmentation device word bank updating sub-module and an index reconstruction sub-module;

the unregistered word judgment sub-module is used for judging whether the new added word is an unregistered word based on the word segmentation device word bank if the synonym new adding task exists;

the word segmentation device word stock updating sub-module is used for updating the word segmentation device word stock based on the new added words to obtain an updated word segmentation device word stock if the unregistered word judging sub-module judges that the new added words are unregistered words;

10. The synonym retrieval device of claim 9, wherein the word segmentation device word bank comprises word segmentation device word banks respectively corresponding to each node in a search engine cluster;

and the file monitoring submodule is used for monitoring whether the word stock updating identification file is updated or not, and if so, the new words in the new word stock file are loaded to word segmenter word stocks respectively corresponding to all nodes of the search engine cluster.