CN110069610B - Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium - Google Patents

Info

Publication number
CN110069610B
CN110069610B CN201910205809.0A CN201910205809A CN110069610B CN 110069610 B CN110069610 B CN 110069610B CN 201910205809 A CN201910205809 A CN 201910205809A CN 110069610 B CN110069610 B CN 110069610B
Authority
CN
China
Prior art keywords
word
index
chinese
preset
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910205809.0A
Other languages
Chinese (zh)
Other versions
CN110069610A (en
Inventor
杨昭
曾文韬
马兰
孙文宇
何维
王海君
刘菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910205809.0A
Publication of CN110069610A
Application granted
Publication of CN110069610B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Abstract

The invention discloses a retrieval method based on Solr, which comprises the following steps: receiving an information retrieval request, and acquiring retrieval information corresponding to the retrieval request; when the search information is Chinese search information, judging whether the number of characters of the Chinese search information exceeds a preset standard quantity; when the number of characters of the Chinese search information does not exceed a preset standard quantity, acquiring an index field corresponding to the Chinese search information in a preset search dictionary; and querying a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result. The invention also discloses a retrieval device, equipment and a storage medium based on Solr. According to the invention, the preset search dictionary is constructed by carrying out data analysis on a large amount of text data, and the search information is converted into the corresponding index field through the preset search dictionary in the search process, so that the information search is more comprehensive and accurate.

Description

Solr-based retrieval method, Solr-based retrieval device, Solr-based retrieval equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for retrieving data based on Solr.
Background
With the wide application of big data, our daily lives are filled with more and more data. The growth in data volume brings convenience to people but also brings corresponding problems.
For example, obtaining the information a user needs from a huge database is a major problem. Current information retrieval is generally based on exact matching or fuzzy matching, which cannot resolve the meaning of the retrieval information, so the information required by the user cannot be retrieved accurately and comprehensively. How to retrieve information more accurately and comprehensively has therefore become a technical problem to be solved.
Disclosure of Invention
The main purpose of the invention is to provide a Solr-based retrieval method, a Solr-based retrieval device, Solr-based retrieval equipment and a storage medium, so as to solve the technical problem that current information retrieval is inaccurate and incomplete.
In order to achieve the above object, the present invention provides a retrieval method based on Solr, the retrieval method based on Solr comprising the steps of:
receiving an information retrieval request, and acquiring retrieval information corresponding to the retrieval request;
when the search information is Chinese search information, judging whether the number of characters of the Chinese search information exceeds a preset standard quantity;
when the number of characters of the Chinese search information does not exceed a preset standard quantity, acquiring an index field corresponding to the Chinese search information in a preset search dictionary;
and querying a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result.
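The first two steps above (deciding whether the retrieval information is Chinese, then comparing its length with the preset standard quantity) could be sketched as follows. This is only an illustration under stated assumptions: the Unicode range test, the 0.5 ratio, the function names and the use of a character count are all hypothetical choices that the patent does not specify.

```python
# Hypothetical threshold; the description later gives 100 as an example value.
PRESET_STANDARD_AMOUNT = 100


def is_chinese_char(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def is_chinese_query(text: str, min_ratio: float = 0.5) -> bool:
    """Assumption: a query counts as Chinese when at least min_ratio of its
    non-space characters are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(is_chinese_char(c) for c in chars) / len(chars) >= min_ratio


def choose_mode(text: str) -> str:
    """Return which branch of the method would handle the query."""
    if not is_chinese_query(text):
        return "translate"  # foreign-language input is translated first
    if len(text) <= PRESET_STANDARD_AMOUNT:
        return "short"      # dictionary lookup on the whole query
    return "long"           # sentence splitting before the lookup
```

A short Chinese query takes the "short" branch, and one longer than the threshold takes the "long" branch.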
Optionally, before the step of obtaining the index field corresponding to the Chinese search information in the preset search dictionary when the number of characters of the Chinese search information does not exceed the preset standard amount, the method includes:
crawling text data from a network, performing word segmentation processing on the text data according to a preset Chinese word segmentation algorithm to obtain corresponding words, and aggregating the words to form a sample set;
counting the occurrence frequency of each word in the sample set, and sorting the words according to the occurrence frequency to form a word list;
selecting a preset number of the highest-ranked words in the word list as index words, forming a basic dictionary from the index words, and converting the index words in the basic dictionary into corresponding word vectors through a preset word vector model;
and determining the approximate words of each index word according to the word vector of each index word, and storing the index word in association with its approximate words to generate a preset search dictionary.
Optionally, the step of determining the approximate word of each index word according to the word vector of each index word, storing the index word and the corresponding approximate word in association, and generating a preset search dictionary includes:
taking each index word in the basic dictionary as a first index word, and calculating cosine values between word vectors of the first index word and word vectors of second index words except the first index word in the basic dictionary;
when a target cosine value larger than a preset cosine value exists, obtaining an approximate index word corresponding to the target cosine value, and taking the approximate index word as an approximate word of the first index word;
and associating and storing the first index word and the corresponding approximate word to generate a preset search dictionary.
Optionally, the step of obtaining an index field corresponding to the Chinese search information in a preset search dictionary when the number of characters of the Chinese search information does not exceed a preset standard amount includes:
when the number of characters of the Chinese search information does not exceed a preset standard quantity, word segmentation processing is carried out on the Chinese search information according to a preset Chinese word segmentation algorithm, so that a keyword set corresponding to the Chinese search information is obtained;
comparing the keywords in the keyword set with index words in a preset search dictionary to obtain target index words close to the keywords and approximate words associated with the target index words;
and taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information.
Optionally, the step of querying a preset search database, obtaining a target article corresponding to the index field, and outputting the target article as a search result includes:
combining the index fields to obtain a search formula corresponding to the Chinese search information, querying a preset search database, and obtaining a target article corresponding to the search formula;
setting the weight of each index field in the search formula according to a preset weight mapping table, and sorting each target article according to the weight of each index field to form an article sorting list;
and outputting the article ordered list as a search result.
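As a rough sketch of how index fields might be combined into a weighted search formula, the snippet below builds a query string in which each term carries a boost, using the term^boost syntax of Solr's standard query parser. The field name content, the OR combination, the default weight and the shape of the weight mapping table are assumptions made for illustration; the patent does not give them.

```python
def build_search_formula(index_fields, weights, field="content", default_weight=1.0):
    """Join index fields into one boosted query string.

    weights plays the role of the preset weight mapping table (term -> boost);
    terms missing from it fall back to default_weight.
    """
    clauses = []
    for term in index_fields:
        # Escape backslashes and double quotes so the term stays inside the phrase.
        safe = term.replace("\\", "\\\\").replace('"', '\\"')
        boost = weights.get(term, default_weight)
        clauses.append(f'{field}:"{safe}"^{boost}')
    return " OR ".join(clauses)


formula = build_search_formula(["房产", "楼盘"], {"房产": 2.0})
print(formula)  # content:"房产"^2.0 OR content:"楼盘"^1.0
```

With such a query, documents matching higher-weighted index fields score higher, which is one way to obtain the article ordered list described above.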
Optionally, after the step of querying a preset search database to obtain a target article corresponding to the index field and outputting the target article as a search result, the method includes:
receiving user behavior data based on the article ordered list, and determining and labeling the search articles that the user focuses on in the article ordered list according to the number of views and the browsing time in the user behavior data;
and outputting the marked search articles for the user to inquire when receiving a browse record inquiry command.
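One possible reading of the labeling step is sketched below. The record layout, the two thresholds and the rule that either signal is sufficient are all hypothetical; the patent only states that the number of views and the browsing time are used.

```python
def label_focused_articles(records, min_views=2, min_seconds=30.0):
    """Return the ids of articles the user appears to focus on.

    Each record is a dict with the hypothetical keys
    "article_id", "views" and "seconds".
    """
    return [
        r["article_id"]
        for r in records
        if r["views"] >= min_views or r["seconds"] >= min_seconds
    ]


behaviour = [
    {"article_id": "a1", "views": 3, "seconds": 12.0},
    {"article_id": "a2", "views": 1, "seconds": 5.0},
    {"article_id": "a3", "views": 1, "seconds": 95.0},
]
print(label_focused_articles(behaviour))  # ['a1', 'a3']
```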
Optionally, after the step of determining whether the number of characters of the Chinese search information exceeds a preset standard amount when the search information is Chinese search information, the method includes:
when the number of characters of the Chinese search information exceeds a preset standard quantity, sentence dividing processing is carried out on the Chinese search information to obtain a single sentence corresponding to the Chinese search information;
word segmentation is carried out on the single sentence obtained through the sentence segmentation to obtain a corresponding keyword, and a target index word similar to the keyword in a preset search dictionary and an approximate word related to the target index word are obtained;
taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information;
and querying a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result.
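The long-query branch above can be illustrated with a small sketch that splits the text into single sentences and then collects index fields sentence by sentence. The punctuation set and the to_index_fields callback (standing in for word segmentation plus dictionary lookup) are assumptions for illustration.

```python
import re

# Assumed sentence delimiters: Chinese and ASCII end marks, semicolons, newlines.
_SENTENCE_END = re.compile(r"[。！？!?；;\n]+")


def split_sentences(text):
    """Split long retrieval information into non-empty single sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def index_fields_for_long_query(text, to_index_fields):
    """Apply to_index_fields to every sentence and merge the results,
    keeping first-seen order and dropping duplicates."""
    fields = []
    for sentence in split_sentences(text):
        for f in to_index_fields(sentence):
            if f not in fields:
                fields.append(f)
    return fields
```

Here to_index_fields is any function that maps one sentence to a list of index fields, so the short-query logic can be reused unchanged for each sentence.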
In addition, in order to achieve the above object, the present invention also provides a search device based on Solr, the search device based on Solr includes:
the request receiving module is used for receiving an information retrieval request and acquiring retrieval information corresponding to the retrieval request;
the character number judging module is used for judging whether the character number of the Chinese search information exceeds a preset standard quantity when the search information is the Chinese search information;
the determining index module is used for acquiring an index field corresponding to the Chinese retrieval information in a preset retrieval dictionary when the number of characters of the Chinese retrieval information does not exceed a preset standard quantity;
and the result output module is used for querying a preset search database, acquiring a target article corresponding to the index field and outputting the target article as a search result.
In addition, in order to achieve the above purpose, the invention also provides a retrieval device based on Solr;
the Solr-based retrieval device comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein:
the computer program when executed by the processor implements the steps of the Solr-based retrieval method as described above.
In addition, in order to achieve the above object, the present invention also provides a computer storage medium;
the computer storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the Solr-based retrieval method as described above.
The embodiment of the invention provides a retrieval method, a retrieval device, retrieval equipment and a storage medium based on Solr. An information retrieval request is received and the retrieval information corresponding to the request is acquired; when the search information is Chinese search information, it is judged whether the number of characters of the Chinese search information exceeds a preset standard quantity; when the number of characters does not exceed the preset standard quantity, an index field corresponding to the Chinese search information is acquired from a preset search dictionary; and a preset retrieval database is queried, a target article corresponding to the index field is acquired, and the target article is output as the retrieval result. According to the invention, a server captures massive text data from a network and generates a preset search dictionary by analyzing that data. During information retrieval, the server acquires the search information and identifies its information type and number of characters. After determining that the search information is Chinese search information not exceeding the preset standard quantity, the server converts it into the corresponding index fields using the preset search dictionary and then searches with those index fields. This avoids the missed results that occur when the Chinese search information is used directly and, at the same time, effectively captures the semantics of the Chinese search information, so that information retrieval is more comprehensive and accurate.
Drawings
FIG. 1 is a schematic diagram of a device architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of the Solr-based search method of the present invention;
FIG. 3 is a schematic diagram of functional modules of an embodiment of the Solr-based search device of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a server (also called a retrieval device based on Solr) of a hardware running environment according to an embodiment of the present invention, where the retrieval device based on Solr may be formed by a single retrieval device based on Solr, or may be formed by a combination of other devices and a retrieval device based on Solr.
In the embodiments of the invention, a server is a computer that manages resources and provides services to users; servers are generally divided into file servers, database servers and application servers. A computer or computer system running such software is also referred to as a server. Compared with an ordinary personal computer (PC), a server has higher requirements for stability, security, performance and the like. As shown in fig. 1, the server may include: a processor 1001, such as a central processing unit (CPU), a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002, a chipset, a disk system, network hardware, and the like. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (WiFi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001.
Optionally, the server may further include a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, and a WiFi module; the input unit may be a display screen or a touch screen; besides WiFi, the wireless network interface may optionally be Bluetooth, a probe, and so on. Those skilled in the art will appreciate that the server architecture shown in fig. 1 does not limit the server, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 1, the computer software product is stored in a storage medium (also called a computer storage medium, computer medium, readable storage medium, computer-readable storage medium, or simply a medium; the storage medium may be a nonvolatile readable storage medium, such as RAM, a magnetic disk or an optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the method according to the embodiments of the present invention. The memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a computer program.
In the server shown in fig. 1, the network interface 1004 is mainly used for connecting to a background database and performing data communication with it. The user interface 1003 is mainly used for connecting to a client and performing data communication with the client. The client is also called a user or a terminal; in the embodiments of the invention the terminal may be a fixed terminal or a mobile terminal, for example a networked smart air conditioner, smart lamp, smart power supply, smart speaker, self-driving car, PC, smartphone, tablet computer, e-book reader or portable computer. The terminal contains sensors such as an optical sensor, a motion sensor and other sensors, which are not described further here. The processor 1001 may be configured to invoke the computer program stored in the memory 1005 and execute the steps of the Solr-based retrieval method provided in the following embodiments of the present invention.
The embodiment provides a Solr-based retrieval method, which is applied to a server as shown in FIG. 1. In the embodiments of the present invention, Solr is developed in Java and is implemented primarily on the basis of the Hypertext Transfer Protocol (HTTP) and Apache Lucene (an open-source full-text search engine toolkit). That is, Solr is a standalone search application server that exposes a Web-service-like API.
Referring to fig. 2, in a first embodiment of the retrieval method based on Solr of the present invention, the retrieval method based on Solr includes:
Step S10, receiving an information retrieval request and obtaining retrieval information corresponding to the retrieval request.
The server receives the information retrieval request and then acquires the retrieval information corresponding to it. The manner in which the request is triggered is not specifically limited. For example, a user enters the sentence "improve the accuracy of information retrieval" at the terminal and triggers an information retrieval request based on that sentence; the terminal sends the request to the server, and the server, on receiving it, takes "improve the accuracy of information retrieval" as the retrieval information corresponding to the request. For another example, the user inputs an article to be searched and triggers an information retrieval request by speaking "search for similar articles"; the terminal sends the request to the server, and the server, on receiving it, takes the article to be searched as the retrieval information corresponding to the request.
It should be noted that the search information may be of different types, for example Chinese or English. After acquiring the search information, the server first determines from its character type whether it is in Chinese. If the search information is in a foreign language, the server may translate it into Chinese search information; if it is Chinese search information, the server performs the following steps:
Step S20, judging whether the number of characters of the Chinese search information exceeds a preset standard quantity when the search information is the Chinese search information.
After determining from the character type that the search information is Chinese search information, the server obtains the number of characters (also called the information amount) of the search information and compares it with a preset standard amount. The preset standard amount is a preset threshold on the number of characters; for example, it may be set to 100 bytes. If the number of characters does not exceed the preset standard amount, that is, the search information is short, the server performs the search in word-conversion mode, specifically:
Step S30, when the number of characters of the Chinese search information does not exceed a preset standard quantity, acquiring an index field corresponding to the Chinese search information in a preset search dictionary.
When the server determines that the number of characters of the Chinese search information does not exceed the preset standard amount, the server needs to process the Chinese search information, that is:
Step S31, performing word segmentation processing on the Chinese search information according to a preset Chinese word segmentation algorithm to obtain a keyword set corresponding to the Chinese search information.
The server performs word segmentation processing on the Chinese search information according to a preset Chinese word segmentation algorithm to obtain a keyword set corresponding to the search information; the preset Chinese word segmentation algorithm refers to a preset algorithm for segmenting a Chinese character sequence into individual words, for example, the preset Chinese word segmentation algorithm can be a word segmentation algorithm based on character string matching or a word segmentation algorithm based on statistics.
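As one example of a segmentation algorithm based on character string matching, the sketch below uses forward maximum matching: at each position it takes the longest vocabulary word that matches, falling back to a single character. The vocabulary, the maximum word length and the function name are illustrative assumptions; the patent does not say which matching strategy is used.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text greedily from left to right using the longest match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


vocab = {"房产", "价格", "是", "多少"}
print(forward_max_match("房产价格是多少", vocab))  # ['房产', '价格', '是', '多少']
```

Characters that match no vocabulary word are emitted one at a time, so the function always terminates and never drops input.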
Step S32, comparing the keywords in the keyword set with index words in a preset search dictionary to obtain target index words close to the keywords and approximate words associated with the target index words; and taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information.
The server compares the keywords in the keyword set with each index word in the preset search dictionary. The preset search dictionary is a dictionary of index words built in advance: for example, the server crawls massive data from the network, selects 50000 index words to establish the search dictionary, and stores the approximate words of each index word in it. The server calculates the similarity between each keyword in the keyword set and each index word in the preset search dictionary, acquires the index word with the highest similarity to the keyword, and takes that index word as the target index word matched with the keyword. After the target index word is obtained, the server acquires the approximate words associated with it in the preset search dictionary, and takes the target index word and its associated approximate words as the index fields corresponding to the Chinese retrieval information.
In this embodiment, a search dictionary is preset, and the server converts the Chinese search information corresponding to the search request into the corresponding index fields through the search dictionary, so that the search is more comprehensive and missed results are prevented.
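The lookup described above might look like the following sketch. The patent does not say how the similarity between a keyword and an index word is computed, so the standard-library difflib.SequenceMatcher ratio is used here purely as a stand-in; the minimum-similarity cut-off and the dictionary layout (index word mapped to its list of approximate words) are also assumptions.

```python
from difflib import SequenceMatcher


def lookup_index_fields(keywords, search_dictionary, min_similarity=0.5):
    """For each keyword, pick the most similar index word and collect it
    together with its associated approximate words."""
    fields = []
    for kw in keywords:
        best_word, best_score = None, 0.0
        for index_word in search_dictionary:
            score = SequenceMatcher(None, kw, index_word).ratio()
            if score > best_score:
                best_word, best_score = index_word, score
        if best_word is None or best_score < min_similarity:
            continue  # no index word is close enough to this keyword
        for term in [best_word, *search_dictionary[best_word]]:
            if term not in fields:
                fields.append(term)
    return fields


dictionary = {"房产": ["楼盘", "地产"], "价格": ["售价"]}
print(lookup_index_fields(["房产", "价格", "是"], dictionary))
# ['房产', '楼盘', '地产', '价格', '售价']
```

A similarity computed on word vectors would be closer in spirit to the rest of the method; string similarity is used only to keep the example self-contained.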
Step S40, inquiring a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result.
After obtaining the index fields corresponding to the Chinese search information, the server queries a preset search database, that is, a database corresponding to the user's search information, for example Baidu Library (Baidu Wenku). The server acquires the target article corresponding to the index fields and outputs it as the retrieval result corresponding to the retrieval request.
The server queries the preset search database to obtain target articles containing the index fields. For example, the preset search dictionary of the server contains the related approximate words "property", "developer" and "property price"; when the server receives the Chinese search information "how much is the property price", it can convert that information into the index field "property world currently xxx". That is, even if the articles in the preset search database contain no identical words, the search engine can still find articles that are semantically associated with the index field, so searching with the preset search dictionary improves search accuracy.
In this embodiment, the server captures a huge amount of text data from the network, generates a preset search dictionary by performing data analysis on the huge amount of text data, and in the process of searching information, the server acquires the search information, then identifies the information type and the character number of the search information, and after the server determines that the search information is Chinese search information which does not exceed a preset standard amount, the server converts the Chinese search information into a corresponding index field through the preset search dictionary, and then searches by using the determined index field, so that information searching is more comprehensive and accurate.
Further, on the basis of the first embodiment of the present invention, a second embodiment of the Solr-based search method of the present invention is presented.
This embodiment describes steps performed before step S30 of the first embodiment. In step S30 the server obtains the index field corresponding to the Chinese search information from a preset search dictionary, so the server needs to establish the search dictionary beforehand. The steps of establishing the search dictionary are as follows:
Step S01, crawling text data from a network, performing word segmentation processing on the text data according to a preset Chinese word segmentation algorithm to obtain corresponding words, and aggregating the words to form a sample set.
The server crawls massive text data from the network and processes it to extract the words it contains, specifically: 1. the server preprocesses the text data, which includes converting between simplified and traditional Chinese, removing XML symbols, and processing the content of each entry into single-line data. Because word2vec learns the semantic relations between words from word co-occurrence, the content of different entries needs to be trained separately; 2. the server performs word segmentation processing on the text data according to a preset Chinese word segmentation algorithm to obtain the corresponding words. The preset Chinese word segmentation algorithm is the same as in the first embodiment and is not described again here. After segmentation, the server aggregates the resulting words to form a sample set.
Step S02, counting the occurrence frequency of the same words in the sample set, and sorting the same words according to the occurrence frequency to form a word list.
The server counts the occurrence frequency of each word in the sample set and sorts the words by frequency to form a word list. Chinese normally has a large number of words, including many rarely used ones, so the server selects the words with higher occurrence frequency for word vector training, specifically:
Step S03, selecting a preset number of the highest-ranked words in the word list as index words, forming a basic dictionary from the index words, and converting the index words in the basic dictionary into corresponding word vectors through a preset word vector model.
The server selects a preset number of index words in the word list, wherein the preset number is the number of index words in a preset retrieval dictionary, for example, the preset number is 5000, that is, the server selects 5000 common words with higher occurrence frequency from the word list, the server takes the 5000 common words as index words, and the server gathers the selected index words to form a basic dictionary.
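Counting word frequencies and keeping the most frequent words as the basic dictionary can be sketched with the standard library's collections.Counter. The function name and the toy sample set are illustrative; in the text's example the preset number would be 5000.

```python
from collections import Counter


def build_basic_dictionary(sample_set, preset_number):
    """Return the preset_number most frequent words, most frequent first."""
    counts = Counter(sample_set)
    return [word for word, _ in counts.most_common(preset_number)]


sample = ["房产", "价格", "房产", "开发商", "价格", "房产"]
print(build_basic_dictionary(sample, 2))  # ['房产', '价格']
```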
After the basic dictionary is obtained, the server performs feature processing, also called word vector encoding, on the index words in it. Common encodings include one-hot encoding (a discrete representation, as in the bag-of-words (BOW) model) and low-dimensional dense vectors obtained by training a deep learning model such as a word vector model; the representation produced by the word vector model word2vec is commonly called a word embedding, or distributed representation. The server then trains the word vectors by a machine learning method: once the words are encoded, the text data is converted into numerical data, which is input into a preset machine learning model for training.
That is, the server feeds the index words to the input layer of a preset word vector model, with each word expressed as a one-hot vector: if the vocabulary size is V, each word is a V-dimensional vector in which the element corresponding to the word is 1 and the rest are 0. Multiplying a one-hot vector by the weight matrix W1 simply selects one row of W1. If C word vectors are input, the hidden layer sums the rows of W1 selected by the one-hot vectors and divides by C to take the average. That is, the activation function of the hidden layer units is simply linear (the weighted sum is passed directly as the input of the next layer). From the hidden layer to the output layer, a weight matrix W2 is used to calculate a score for each word in the vocabulary, and the distributed word vector corresponding to each high-frequency word is thereby obtained.
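The forward pass just described (row selection in W1, averaging over C context words, then scoring with W2) can be written out in a few lines of plain Python. This is a sketch only: the tiny matrices are made up, training is omitted, and the softmax that turns scores into probabilities is the usual word2vec step rather than something stated in the text.

```python
import math


def cbow_forward(context_ids, W1, W2):
    """One forward pass of a CBOW-style model.

    W1 is V x N (row i is the vector of word i); W2 is N x V.
    context_ids are the vocabulary indices of the C context words.
    """
    n = len(W1[0])
    c = len(context_ids)
    # A one-hot vector times W1 selects a row; the hidden layer averages those rows.
    hidden = [sum(W1[i][d] for i in context_ids) / c for d in range(n)]
    # Score of every vocabulary word: hidden (1 x N) times W2 (N x V).
    vocab_size = len(W2[0])
    scores = [sum(hidden[d] * W2[d][j] for d in range(n)) for j in range(vocab_size)]
    # Softmax (assumed), shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # V = 3 words, N = 2 dimensions
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hidden, probs = cbow_forward([0, 1], W1, W2)
print(hidden)  # [0.5, 0.5]
```

After training, the rows of W1 serve as the distributed word vectors of the index words.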
Step S04, determining the approximate words of each index word according to the word vector of each index word, and storing each index word in association with its approximate words to generate a preset search dictionary.
After the word vector training is completed, the server generates the preset search dictionary according to the word vectors of the index words, which specifically includes:
Step a, taking each index word in the basic dictionary in turn as a first index word, and calculating the cosine values between the word vector of the first index word and the word vectors of the second index words, that is, the index words in the basic dictionary other than the first index word;
Step b, when a target cosine value larger than a preset cosine value exists, obtaining the approximate index word corresponding to the target cosine value, and taking the approximate index word as an approximate word of the first index word;
Step c, storing the first index word in association with its approximate words to generate the preset search dictionary.
That is, the server takes each index word in turn as a first index word and calculates the cosine values between the word vector of the first index word and the word vectors of the other index words in the basic dictionary (the second index words). The cosine value characterizes the similarity between the first index word and each second index word. The server compares each computed cosine value with a preset cosine value, which is a preset threshold, for example 0.9. When a target cosine value larger than the preset cosine value exists, the server obtains the approximate index word corresponding to the target cosine value and takes it as an approximate word of the first index word. The server then stores the first index word in association with its approximate words to generate the preset search dictionary.
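Steps a to c can be illustrated with a short sketch. The word vectors below are invented two-dimensional toys, the function names are hypothetical, and the code assumes that no vector is all zeros; only the threshold of 0.9 comes from the example in the text.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_search_dictionary(word_vectors, preset_cosine=0.9):
    """Associate each index word with every other word whose cosine exceeds the threshold."""
    dictionary = {}
    for first, first_vec in word_vectors.items():
        dictionary[first] = [
            second
            for second, second_vec in word_vectors.items()
            if second != first and cosine(first_vec, second_vec) > preset_cosine
        ]
    return dictionary

# Toy vectors; real ones would come from the trained word vector model.
word_vectors = {"house": [1.0, 0.1], "home": [0.9, 0.2], "river": [0.0, 1.0]}
search_dictionary = build_search_dictionary(word_vectors)
```

The pairwise comparison is quadratic in the dictionary size, which is manageable for a basic dictionary of a few thousand index words.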
In this embodiment, because the search dictionary is preset, the server can convert the search information into the corresponding index information according to the preset search dictionary; the server can thus analyze the meaning of the search information, which makes retrieval more accurate.
Further, on the basis of the above embodiment, a third embodiment of the Solr-based search method of the present invention is proposed.
The embodiment is a refinement of step S40 in the first embodiment, and specifically describes a step of determining the search information, where the search method based on Solr includes:
Step S41, combining the index fields to obtain a search formula corresponding to the Chinese search information, querying a preset search database, and obtaining a target article corresponding to the search formula.
The server combines the index fields to obtain a search formula corresponding to the Chinese search information, and then queries a preset search database to obtain the target articles corresponding to the search formula. That is, the server merges the index fields: for example, if the index field of an article is "property" and the server has obtained "property" and "developer" as its associated words, the server merges the index field with these associated words to generate the corresponding index XML. The server then needs to query only once to retrieve the target articles that include these terms.
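One way to picture the merging step is to join the index fields into a single OR clause in Lucene/Solr query syntax, so that one query covers every term. This is only a sketch under assumptions: the text speaks of generating an index XML rather than a query string, and the field name `content` and the function `build_search_formula` are hypothetical.

```python
def build_search_formula(index_fields, field_name="content"):
    """Join the index fields into one OR clause so a single query covers them all."""
    # Remove duplicates while keeping order (dict preserves insertion order).
    unique_terms = list(dict.fromkeys(index_fields))
    # Quote each term as a phrase and escape embedded double quotes.
    quoted = ['"' + term.replace('"', '\\"') + '"' for term in unique_terms]
    return field_name + ":(" + " OR ".join(quoted) + ")"

formula = build_search_formula(["property", "developer", "property"])
print(formula)
```

The resulting string could be passed as the `q` parameter of a Solr select request; how the patent's server actually submits the query is not stated.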
Step S42, setting the weight of each index field in the search formula according to a preset weight mapping table, and sorting each target article according to the weight of each index field to form an article sorting list; and outputting the article ordered list as a search result.
After obtaining the target articles, the server obtains the weight of each index field according to a preset weight mapping table. The preset weight mapping table maps preset word types to weights; for example, nouns have a weight of 50%, adjectives a weight of 30%, and pronouns a weight of 20%. The server sorts the target articles according to the weights of the index fields to form an article sorting list, and outputs the article sorting list as the retrieval result.
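A minimal sketch of weight-based sorting follows. The weights mirror the example in the text, but everything else is an assumption: the scoring rule (summing the weights of the index fields an article contains), the naive substring matching, and the names `WEIGHT_TABLE` and `rank_articles` are all hypothetical.

```python
# Word type -> weight, following the example percentages in the text.
WEIGHT_TABLE = {"noun": 0.5, "adjective": 0.3, "pronoun": 0.2}

def rank_articles(articles, index_fields):
    """Sort articles by the summed weight of the index fields each one contains.

    index_fields maps each index word to its word type.
    """
    def score(text):
        return sum(
            WEIGHT_TABLE[word_type]
            for word, word_type in index_fields.items()
            if word in text  # naive substring match, for illustration only
        )
    # sorted is stable, so equally scored articles keep their original order.
    return sorted(articles, key=score, reverse=True)

index_fields = {"property": "noun", "new": "adjective"}
articles = ["a new listing", "new property for sale", "weather report"]
ranked = rank_articles(articles, index_fields)
```

In a Solr deployment the same effect is usually achieved with per-term or per-field boosts at query time rather than by re-sorting results in application code.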
In this embodiment, the index fields may differ in importance when the server performs retrieval. To make the retrieved target articles more accurate, different weight rules can therefore be preset, so that the user can quickly view the required information.
Further, on the basis of the third embodiment, a fourth embodiment of the Solr-based search method of the present invention is proposed.
This embodiment describes steps subsequent to step S40 in the first embodiment, in which the server labels the search articles according to user behavior data, and specifically includes:
Step S50, receiving user behavior data based on the article sorting list, and determining and labeling the search articles focused on by the user in the article sorting list according to the number of views and the browsing time in the user behavior data.
The server receives user behavior data based on the article sorting list, namely, the article sorting list contains a plurality of articles, a user can check each article, the server collects the user behavior data, and search articles focused by the user in the article sorting list are determined and marked according to browsing times and browsing time in the user behavior data.
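The labeling rule is not spelled out in the text, so the sketch below is only one plausible reading: an article counts as focused when either its number of views or its total browsing time reaches a threshold. The thresholds, the data layout, and the function name `label_focused_articles` are hypothetical.

```python
def label_focused_articles(behavior, min_views=3, min_seconds=60):
    """Return the ids of articles the user appears to focus on.

    behavior maps article id -> (number of views, total browsing seconds).
    """
    return sorted(
        article_id
        for article_id, (views, seconds) in behavior.items()
        if views >= min_views or seconds >= min_seconds
    )

# Hypothetical behavior data collected for one user.
behavior = {"a1": (5, 20), "a2": (1, 10), "a3": (1, 300)}
focused = label_focused_articles(behavior)
```

Combining the two signals with `or` is an assumption; a weighted score or an `and` condition would fit the description equally well.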
Step S60, when a browsing record query instruction is received, outputting the labeled search articles for the user to query.
The user can trigger a browsing record query instruction at the terminal, and the terminal sends the instruction to the server. When the server receives the browsing record query instruction, it outputs the labeled search articles for the user to query. In this embodiment, the server stores the user's previous browsing records according to the user behavior data, so that the user can conveniently revisit them.
Further, on the basis of the above embodiment, a fifth embodiment of the Solr-based search method of the present invention is proposed.
This embodiment describes steps subsequent to step S20 in the first embodiment, namely how the server performs retrieval when the number of characters of the Chinese search information exceeds a preset standard quantity, and specifically includes:
Step S70, when the number of characters of the Chinese search information exceeds the preset standard quantity, performing sentence segmentation on the Chinese search information to obtain the single sentences corresponding to the Chinese search information.
When the server determines that the number of characters of the Chinese search information exceeds the preset standard quantity, the server performs sentence segmentation on the Chinese search information to obtain the single sentences corresponding to it. Two cases arise. In the first, the Chinese search information is a long compound sentence; to improve retrieval accuracy, the server divides the compound sentence into several parallel single sentences. In the second, the Chinese search information is an article or a paragraph, and the server divides it into several single sentences according to the punctuation that ends each sentence.
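Both cases can be approximated by splitting on punctuation, as in the sketch below. The punctuation sets and the `compound` flag are assumptions: the text does not say how a compound sentence is divided, and splitting on commas is only a rough stand-in for that step.

```python
import re

# Sentence-ending punctuation (Chinese and Western), plus line breaks.
SENTENCE_END = r"[。！？!?；;\n]+"
# For a long compound sentence, commas are treated as clause boundaries too.
CLAUSE_END = r"[。！？!?；;，,\n]+"

def split_sentences(text, compound=False):
    """Split text into single sentences, dropping empty pieces."""
    pattern = CLAUSE_END if compound else SENTENCE_END
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

paragraph_sentences = split_sentences("今天天气很好。我们去公园吧！你来吗？")
compound_clauses = split_sentences("天气很好，我们去公园", compound=True)
```

Each resulting single sentence would then go through the word segmentation of step S80.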
Step S80, performing word segmentation on the single sentence obtained by the sentence segmentation to obtain a corresponding keyword, and obtaining a target index word similar to the keyword in a preset search dictionary and an approximate word related to the target index word; and taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information.
The server performs word segmentation on the single sentences obtained by the sentence segmentation to obtain the corresponding keywords; for the word segmentation process, refer to the first embodiment, which is not repeated here. The server then obtains the target index words similar to the keywords in the preset search dictionary: it compares each keyword with the index words in the preset search dictionary, takes the index word similar to the keyword as the target index word, and obtains the approximate words associated with that target index word in the preset search dictionary. The server takes the target index words and the corresponding approximate words as the index fields corresponding to the Chinese search information.
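The lookup from keywords to index fields might be sketched as follows. The text does not define how similarity between a keyword and an index word is measured, so the string-similarity matching from Python's standard `difflib` module is a stand-in; the cutoff of 0.75, the toy dictionary, and the function name `index_fields_for` are hypothetical.

```python
import difflib

def index_fields_for(keywords, search_dictionary, cutoff=0.75):
    """Map each keyword to its closest index word, then collect that word
    together with its associated approximate words."""
    fields = []
    index_words = list(search_dictionary)
    for keyword in keywords:
        matches = difflib.get_close_matches(keyword, index_words, n=1, cutoff=cutoff)
        if not matches:
            continue  # no sufficiently similar index word for this keyword
        target = matches[0]
        for term in [target] + search_dictionary[target]:
            if term not in fields:
                fields.append(term)
    return fields

# Toy preset search dictionary: index word -> approximate words.
search_dictionary = {"property": ["developer"], "loan": ["mortgage"]}
fields = index_fields_for(["properties", "weather"], search_dictionary)
```

Keywords with no close index word are simply skipped here; whether the patented method falls back to the raw keyword in that case is not stated.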
Step S90, inquiring a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result.
After obtaining the target index words corresponding to the Chinese search information, the server queries a preset search database, where the preset search database is the database corresponding to the user's search information, for example, a Baidu library; the server acquires the target articles corresponding to the index fields and outputs them as the retrieval result corresponding to the retrieval request.
In this embodiment, the server performs sentence segmentation on Chinese search information whose number of characters exceeds the preset standard quantity, and then performs information retrieval according to the steps in the first embodiment, which makes the information retrieval more intelligent.
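The branching between the first and fifth embodiments can be summarized in a small dispatcher. This is a sketch under assumptions: the threshold of 20 characters is invented, Chinese is detected by checking for any character in the basic CJK Unified Ideographs block, and the branch labels and function names are hypothetical.

```python
def contains_chinese(text):
    """True if any character lies in the CJK Unified Ideographs block U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def choose_branch(search_info, preset_standard=20):
    """Pick the processing path for a piece of search information."""
    if not contains_chinese(search_info):
        return "translate_to_chinese"      # foreign-language input is translated first
    if len(search_info) > preset_standard:
        return "sentence_segmentation"     # long input: split into single sentences
    return "word_segmentation"             # short input: segment into keywords directly

short_branch = choose_branch("房产开发商")
long_branch = choose_branch("房产" * 15)
foreign_branch = choose_branch("solr search")
```

Mixed-language input is treated as Chinese here because one Chinese character is enough to pass the check; a stricter rule (for example, a majority of characters) is equally compatible with the description.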
In addition, referring to fig. 3, an embodiment of the present invention further provides a retrieval device based on Solr, where the retrieval device based on Solr includes:
the request receiving module 10 is configured to receive an information retrieval request, and obtain retrieval information corresponding to the retrieval request;
a character number judging module 20, configured to judge whether the number of characters of the Chinese search information exceeds a preset standard quantity when the search information is Chinese search information;
the determining index module 30 is configured to obtain an index field corresponding to the Chinese search information in a preset search dictionary when the number of characters of the Chinese search information does not exceed the preset standard quantity;
and the result output module 40 is used for querying a preset search database, acquiring a target article corresponding to the index field, and outputting the target article as a search result.
Optionally, the Solr-based retrieval device includes:
the sample processing module is used for crawling text data from a network, performing word segmentation processing on the text data according to a preset Chinese word segmentation algorithm to obtain corresponding words, and summarizing the words to form a sample set;
The frequency statistics module is used for counting the occurrence frequency of the same words in the sample set, and ordering the same words according to the occurrence frequency to form a word list;
the word training module is used for selecting a preset number of words which are ranked in front in the word list as index words, forming a basic dictionary by utilizing the index words, and converting the index words in the basic dictionary into corresponding word vectors through a preset word vector model;
and the dictionary generating module is used for determining the approximate word of each index word according to the word vector of each index word, and storing the index word and the corresponding approximate word in a correlated way to generate a preset search dictionary.
Optionally, the dictionary generating module includes:
the cosine calculation unit is used for taking each index word in the basic dictionary as a first index word respectively, and calculating cosine values between the word vector of the first index word and the word vectors of second index words other than the first index word in the basic dictionary;
the similar word query unit is used for acquiring an approximate index word corresponding to a target cosine value when the target cosine value larger than a preset cosine value exists, and taking the approximate index word as an approximate word of the first index word;
And the storage generating unit is used for storing the first index word and the corresponding approximate word in a correlated way to generate a preset search dictionary.
Optionally, the determining index module 30 includes:
the word segmentation unit is used for carrying out word segmentation processing on the Chinese search information according to a preset Chinese word segmentation algorithm when the number of characters of the Chinese search information does not exceed a preset standard quantity, so as to obtain a keyword set corresponding to the Chinese search information;
the word comparison unit is used for comparing the keywords in the keyword set with index words in a preset search dictionary to obtain target index words close to the keywords and approximate words associated with the target index words;
and the index field determining unit is used for taking the target index word and the corresponding approximate word as an index field corresponding to the Chinese retrieval information.
Optionally, the result output module 40 includes:
the information combination unit is used for combining the index fields to obtain a search formula corresponding to the Chinese search information, querying a preset search database and obtaining a target article corresponding to the search formula;
the article sorting unit is used for setting the weight of each index field in the search formula according to a preset weight mapping table, and sorting each target article according to the weight of each index field to form an article sorting list;
And the information output unit is used for outputting the article ordered list as a search result.
Optionally, the Solr-based retrieval device includes:
the article standard module is used for receiving user behavior data based on the article ordered list, and determining and labeling search articles focused by a user in the article ordered list according to browsing times and browsing time in the user behavior data;
and the standard output module is used for outputting the marked search articles for the user to inquire when receiving the browse record inquiry command.
Optionally, the Solr-based retrieval device includes:
the sentence processing module is used for performing sentence segmentation on the Chinese retrieval information when the number of characters of the Chinese retrieval information exceeds a preset standard quantity, to obtain the single sentences corresponding to the Chinese retrieval information;
the word comparison module is used for carrying out word segmentation on the single sentence obtained by the sentence segmentation processing to obtain a corresponding keyword, and obtaining a target index word similar to the keyword in a preset search dictionary and an approximate word associated with the target index word; taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information;
And the retrieval output module is used for querying a preset retrieval database, acquiring a target article corresponding to the index field and outputting the target article as a retrieval result.
The steps implemented by each functional module of the retrieval device based on Solr may refer to each embodiment of the retrieval method based on Solr of the present invention, and will not be described herein.
In addition, the embodiment of the invention also provides a computer storage medium.
The computer storage medium has stored thereon a computer program which, when executed by a processor, implements the operations in the Solr-based retrieval method provided by the above embodiment.
It should be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity/operation/object from another entity/operation/object, without necessarily requiring or implying any actual such relationship or order between such entities/operations/objects; the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the units illustrated as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the objectives of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (7)

1. The Solr-based retrieval method is characterized by comprising the following steps of:
receiving an information retrieval request, and acquiring retrieval information corresponding to the retrieval request;
when the search information is Chinese search information, judging whether the number of characters of the Chinese search information exceeds a preset standard quantity;
when the number of characters of the Chinese search information does not exceed a preset standard quantity, acquiring an index field corresponding to the Chinese search information in a preset search dictionary;
inquiring a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result;
wherein, after the step of receiving the information retrieval request and acquiring the retrieval information corresponding to the retrieval request, the method comprises:
judging whether the search information is Chinese search information or not according to the character type of the search information;
When the search information is not Chinese search information, translating foreign language search information into Chinese search information;
wherein, before the step of obtaining the index field corresponding to the Chinese search information in the preset search dictionary when the number of characters of the Chinese search information does not exceed the preset standard quantity, the method comprises the following steps:
the method comprises the steps of crawling text data from a network, performing word segmentation processing on the text data according to a preset Chinese word segmentation algorithm to obtain corresponding words, and summarizing each word to form a sample set, wherein the Chinese word segmentation algorithm is a word segmentation algorithm based on character string matching or a word segmentation algorithm based on statistics;
counting the occurrence frequency of the same words in the sample set, and sequencing the same words according to the occurrence frequency to form a word list;
selecting a preset number of words which are ranked in front in the word list as index words, forming a basic dictionary by using the index words, and converting the index words in the basic dictionary into corresponding word vectors through a preset word vector model;
according to the word vector of each index word, determining the approximate word of each index word, and storing the index word and the corresponding approximate word in a correlated way to generate a preset search dictionary;
The step of obtaining the index field corresponding to the Chinese search information in the preset search dictionary when the number of characters of the Chinese search information does not exceed the preset standard quantity comprises the following steps:
when the number of characters of the Chinese search information does not exceed a preset standard quantity, word segmentation processing is carried out on the Chinese search information according to a preset Chinese word segmentation algorithm, so that a keyword set corresponding to the Chinese search information is obtained;
comparing the keywords in the keyword set with index words in a preset search dictionary to obtain target index words close to the keywords and approximate words associated with the target index words;
taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information;
wherein, when the search information is Chinese search information, the step of judging whether the number of characters of the Chinese search information exceeds a preset standard amount comprises the following steps:
when the number of characters of the Chinese search information exceeds a preset standard quantity, sentence dividing processing is carried out on the Chinese search information to obtain a single sentence corresponding to the Chinese search information, wherein the Chinese search information is one of complex sentences, articles or paragraphs;
Word segmentation is carried out on the single sentence obtained through the sentence segmentation to obtain a corresponding keyword, and a target index word similar to the keyword in a preset search dictionary and an approximate word related to the target index word are obtained;
taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information;
and querying a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result.
2. The method of claim 1, wherein the step of determining an approximate word of each index word from the word vector of each index word, storing the index word in association with the corresponding approximate word, and generating a preset search dictionary comprises:
taking each index word in the basic dictionary as a first index word, and calculating cosine values between word vectors of the first index word and word vectors of second index words except the first index word in the basic dictionary;
when a target cosine value larger than a preset cosine value exists, obtaining an approximate index word corresponding to the target cosine value, and taking the approximate index word as an approximate word of the first index word;
And associating and storing the first index word and the corresponding approximate word to generate a preset search dictionary.
3. The retrieval method based on Solr as claimed in claim 1, wherein said step of querying a preset retrieval database, obtaining a target article corresponding to said index field, and outputting said target article as a retrieval result comprises:
combining the index fields to obtain a search formula corresponding to the Chinese search information, querying a preset search database, and obtaining a target article corresponding to the search formula;
setting the weight of each index field in the search formula according to a preset weight mapping table, and sorting each target article according to the weight of each index field to form an article sorting list;
and outputting the article ordered list as a search result.
4. The Solr-based retrieval method according to claim 3, wherein after the steps of querying a preset search database, obtaining a target article corresponding to the index field, and outputting the target article as a search result, the method comprises:
receiving user behavior data based on the article ordered list, and determining and labeling search articles focused by a user in the article ordered list according to browsing times and browsing time in the user behavior data;
And outputting the marked search articles for the user to inquire when receiving a browse record inquiry command.
5. A retrieval device based on Solr, wherein the retrieval device based on Solr comprises:
the request receiving module is used for receiving an information retrieval request and acquiring retrieval information corresponding to the retrieval request;
the character number judging module is used for judging whether the character number of the Chinese search information exceeds a preset standard quantity when the search information is the Chinese search information;
the determining index module is used for acquiring an index field corresponding to the Chinese retrieval information in a preset retrieval dictionary when the number of characters of the Chinese retrieval information does not exceed a preset standard quantity;
the result output module is used for inquiring a preset search database, acquiring a target article corresponding to the index field and outputting the target article as a search result;
the request receiving module is further used for judging whether the search information is Chinese search information according to the character type of the search information; when the search information is not Chinese search information, translating foreign language search information into Chinese search information;
The character quantity judging module is also used for crawling text data from a network, performing word segmentation processing on the text data according to a preset Chinese word segmentation algorithm to obtain corresponding words, and summarizing the words to form a sample set, wherein the Chinese word segmentation algorithm is a word segmentation algorithm based on character string matching or a word segmentation algorithm based on statistics; counting the occurrence frequency of the same words in the sample set, and sequencing the same words according to the occurrence frequency to form a word list; selecting a preset number of words which are ranked in front in the word list as index words, forming a basic dictionary by using the index words, and converting the index words in the basic dictionary into corresponding word vectors through a preset word vector model; according to the word vector of each index word, determining the approximate word of each index word, and storing the index word and the corresponding approximate word in a correlated way to generate a preset search dictionary;
the determining index module is further configured to perform word segmentation processing on the Chinese search information according to a preset Chinese word segmentation algorithm when the number of characters of the Chinese search information does not exceed a preset standard quantity, so as to obtain a keyword set corresponding to the Chinese search information; comparing the keywords in the keyword set with index words in a preset search dictionary to obtain target index words close to the keywords and approximate words associated with the target index words; taking the target index word and the corresponding approximate word as an index field corresponding to the Chinese retrieval information, wherein the Chinese retrieval information is one of a complex sentence, an article or a paragraph;
The character quantity judging module is further used for carrying out sentence segmentation on the Chinese search information when the character quantity of the Chinese search information exceeds a preset standard quantity to obtain a single sentence corresponding to the Chinese search information; word segmentation is carried out on the single sentence obtained through the sentence segmentation to obtain a corresponding keyword, and a target index word similar to the keyword in a preset search dictionary and an approximate word related to the target index word are obtained; taking the target index word and the corresponding approximate word as index fields corresponding to the Chinese retrieval information; and querying a preset retrieval database, acquiring a target article corresponding to the index field, and outputting the target article as a retrieval result.
6. A Solr-based retrieval device, the Solr-based retrieval device comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein:
the computer program, when executed by the processor, implements the steps of the Solr-based retrieval method as defined in any one of claims 1 to 4.
7. A computer storage medium, wherein a computer program is stored on the computer storage medium, which computer program, when being executed by a processor, implements the steps of the Solr-based retrieval method according to any one of claims 1 to 4.
CN201910205809.0A 2019-03-16 2019-03-16 Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium Active CN110069610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910205809.0A CN110069610B (en) 2019-03-16 2019-03-16 Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910205809.0A CN110069610B (en) 2019-03-16 2019-03-16 Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110069610A CN110069610A (en) 2019-07-30
CN110069610B true CN110069610B (en) 2024-03-19

Family

ID=67365343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910205809.0A Active CN110069610B (en) 2019-03-16 2019-03-16 Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110069610B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010267247A (en) * 2010-02-08 2010-11-25 Ntt Data Corp Device and method for retrieving information, terminal equipment, and program
WO2014087424A2 (en) * 2012-12-03 2014-06-12 Parthys Reverse Informatics Analytic Solutions (P) Ltd. Information retrieval, extraction and visualisation
CN108038096A (en) * 2017-11-10 2018-05-15 平安科技(深圳)有限公司 Knowledge database documents method for quickly retrieving, application server computer readable storage medium storing program for executing
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information

Also Published As

Publication number Publication date
CN110069610A (en) 2019-07-30

Similar Documents

Publication Publication Date Title
CN110069610B (en) Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium
US8880512B2 (en) Method, apparatus and system, for rewriting search queries
US9489401B1 (en) Methods and systems for object recognition
US20070219986A1 (en) Method and apparatus for extracting terms based on a displayed text
US20120179667A1 (en) Searching through content which is accessible through web-based forms
US20060224552A1 (en) Systems and methods for determining user interests
US20130013616A1 (en) Systems and Methods for Natural Language Searching of Structured Data
RU2005111001A (en) CHECKING RELEVANCE BETWEEN KEYWORDS AND WEBSITE CONTENT
US20180004838A1 (en) System and method for language sensitive contextual searching
WO2009039392A1 (en) A system for entity search and a method for entity scoring in a linked document database
Im et al. Linked tag: image annotation using semantic relationships between image tags
CN111428494A (en) Intelligent error correction method, device and equipment for proper nouns and storage medium
JP5057474B2 (en) Method and system for calculating competition index between objects
KR20140075428A (en) Method and system for semantic search keyword recommendation
KR20180097120A (en) Method for searching electronic document and apparatus thereof
CN108804409A (en) A kind of semantic retrieving method and device
CN108509449B (en) Information processing method and server
CN111666479A (en) Method for searching web page and computer readable storage medium
CN114117242A (en) Data query method and device, computer equipment and storage medium
CN114691845A (en) Semantic search method and device, electronic equipment, storage medium and product
JP2006139484A (en) Information retrieval method, system therefor and computer program
TWI483129B (en) Retrieval method and device
US20190303464A1 (en) Directed Data Indexing Based on Conceptual Relevance
CN111930954B (en) Intention recognition method and device, storage medium and electronic equipment
CN116610782B (en) Text retrieval method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant