CN108388635B - Data searching method, device, medium and computing equipment - Google Patents

Info

Publication number
CN108388635B
Authority
CN
China
Prior art keywords
data
characters
character string
lucene
mark group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810157579.0A
Other languages
Chinese (zh)
Other versions
CN108388635A (en
Inventor
黄�俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Langhe Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Langhe Technology Co Ltd filed Critical Hangzhou Langhe Technology Co Ltd
Priority to CN201810157579.0A priority Critical patent/CN108388635B/en
Publication of CN108388635A publication Critical patent/CN108388635A/en
Application granted granted Critical
Publication of CN108388635B publication Critical patent/CN108388635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Abstract

The embodiment of the invention provides a data search method, a data search device, a data search medium and a computing device. The data searching method comprises the following steps: acquiring a character string to be searched from a user; performing word segmentation on the character string to generate mark group data, wherein the mark group data at least comprises one of a first mark group and a second mark group, and the first mark group and the second mark group respectively comprise at least one key character; and providing a search result through the tag group data, wherein characters in the first tag group are searched through a first matching mode, and characters in the second tag group are searched through a second matching mode. The technical scheme of the embodiment of the invention can provide a simple, high-efficiency, customizable and good-compatibility full-text retrieval scheme based on the chat records.

Description

Data searching method, device, medium and computing equipment
Technical Field
The embodiment of the invention relates to the technical field of communication and computers, in particular to a data searching method, a data searching device, a data searching medium and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Data searching, such as chat record searching, is a function for globally searching messages in instant messaging software. When one or more keywords are entered in a dialog box, the instant messaging software immediately returns query results, such as which sessions contain the keywords and how many chat records match in each session; all matching chat records of a specific session can then be viewed, and the user can jump to the context of a record.
Various technical schemes exist in the prior art for implementing a chat record searching function, particularly in the Android system. However, the prior-art schemes suffer from one or more of the following problems: low search efficiency and weak customizability; high implementation cost, complex implementation, and high maintenance cost; and failure to meet the requirements of Chinese search.
Disclosure of Invention
Several prior-art solutions for data search suffer from low search efficiency and weak customizability; from high implementation cost, complex implementation, and high maintenance cost; or from failure to meet the requirements of Chinese search.
To this end, there is a great need for an improved data search scheme that provides simple, efficient, customizable, and compatible full-text search.
In this context, embodiments of the present invention are intended to provide a data search method, apparatus, medium, and computing device.
In a first aspect of embodiments of the present invention, there is provided a data search method, including: acquiring a character string to be searched from a user; performing word segmentation processing and grouping processing on the character string to generate mark group data, wherein the mark group data at least comprises one of a first mark group and a second mark group, and the first mark group and the second mark group respectively comprise at least one key character; and providing a search result through the tag group data, wherein characters in the first tag group are searched through a first matching mode, and characters in the second tag group are searched through a second matching mode.
In some embodiments of the present invention, based on the foregoing scheme, acquiring a character string to be searched from a user includes: acquiring the character string to be searched from the user through an intelligent electronic device.
In some embodiments of the present invention, based on the foregoing scheme, performing word segmentation processing on the character string to generate mark group data, where the mark group data includes at least one of a first mark group and a second mark group, includes: generating the first mark group from the Chinese characters and digits in the character string; and generating the second mark group from the English characters and symbols in the character string.
In some embodiments of the present invention, based on the foregoing scheme, generating the first mark group from the Chinese characters and digits in the character string includes: taking each Chinese character in the character string as a separate key character; and taking each digit in the character string as a separate key character.
In some embodiments of the present invention, based on the foregoing scheme, generating the second mark group from the English characters and symbols in the character string includes: taking each run of consecutive English characters in the character string as one key character; and taking each symbol in the character string as a separate key character; wherein the symbols include punctuation, characters of languages other than Chinese and English, and special characters.
In some embodiments of the present invention, based on the foregoing solution, providing search results through the tag group data includes: inputting the marking group data into a Lucene querier, and providing search results through the Lucene querier.
In some embodiments of the present invention, based on the foregoing solution, inputting the mark group data into a Lucene querier and providing search results through the Lucene querier includes: inputting the mark group data into the Lucene querier through an application terminal in the Android system, and providing a search result through the Lucene querier.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises: importing the Lucene source code package into the Android system.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises: trimming the source code when the Lucene source code package is imported into the Android system.
In some embodiments of the present invention, based on the foregoing scheme, searching for characters in the first mark group by a first matching manner includes: searching for the characters in the first mark group by phrase matching.
In some embodiments of the present invention, based on the foregoing scheme, searching for characters in the second mark group by a second matching manner includes: searching for the characters in the second mark group by prefix matching.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises: performing word segmentation processing on the source data to generate index data; and sending the index data to a Lucene engine so that the Lucene engine can carry out index processing.
In some embodiments of the present invention, based on the foregoing scheme, sending the index data to the Lucene engine includes: processing the index data with a WhitespaceAnalyzer to generate first data; and sending the first data to the Lucene engine.
In a second aspect of embodiments of the present invention, there is provided a data search apparatus comprising: the receiving module is used for acquiring a character string to be searched from a user; the word segmentation module is used for performing word segmentation processing and grouping processing on the character string to generate mark group data, wherein the mark group data at least comprises one of a first mark group and a second mark group, and the first mark group and the second mark group respectively comprise at least one key character; and the query module is used for providing a search result through the tag group data, wherein the first query submodule is used for searching the characters in the first tag group through a first matching mode, and the second query submodule is used for searching the characters in the second tag group through a second matching mode.
In some embodiments of the present invention, based on the foregoing scheme, the receiving module is further configured to obtain, by the intelligent electronic device, a character string to be searched from the user.
In some embodiments of the present invention, based on the foregoing solution, the word segmentation module includes: a first word segmentation submodule, used for generating the first mark group from the Chinese characters and digits in the character string; and a second word segmentation submodule, used for generating the second mark group from the English characters and symbols in the character string.
In some embodiments of the invention, based on the foregoing scheme, the first word segmentation submodule is configured to: take each Chinese character in the character string as a separate key character; and take each digit in the character string as a separate key character.
In some embodiments of the invention, based on the foregoing scheme, the second word segmentation submodule is configured to: take each run of consecutive English characters in the character string as one key character; and take each symbol in the character string as a separate key character; wherein the symbols include punctuation, characters of languages other than Chinese and English, and special characters.
In some embodiments of the present invention, based on the foregoing solution, the query module is configured to: inputting the marking group data into a Lucene querier, and providing search results through the Lucene querier.
In some embodiments of the present invention, based on the foregoing solution, the query module is further configured to: input the mark group data into the Lucene querier through an application terminal in the Android system, and provide a search result through the Lucene querier.
In some embodiments of the present invention, based on the foregoing scheme, the apparatus further comprises: an import module, used for importing the Lucene source code package into the Android system.
In some embodiments of the present invention, based on the foregoing scheme, the import module further includes: a trimming submodule, used for trimming the source code when the Lucene source code package is imported into the Android system.
In some embodiments of the present invention, based on the foregoing solution, the first query submodule includes: a phrase matching unit, used for searching for the characters in the first mark group by phrase matching.
In some embodiments of the present invention, based on the foregoing solution, the second query submodule includes: a prefix matching unit, used for searching for the characters in the second mark group by prefix matching.
In some embodiments of the present invention, based on the foregoing scheme, the apparatus further comprises: a data module, used for performing word segmentation processing on the source data to generate index data; and an index module, used for sending the index data to the Lucene engine so that the Lucene engine can carry out index processing.
In some embodiments of the present invention, based on the foregoing solution, the index module includes: a data submodule, used for processing the index data with a WhitespaceAnalyzer to generate first data; and a sending submodule, used for sending the first data to the Lucene engine.
In a third aspect of embodiments of the present invention, there is provided a medium having stored thereon a program which, when executed by a processor, implements the method as described in the first aspect of the embodiments above.
In a fourth aspect of embodiments of the present invention, there is provided a computing device comprising: a processor and a memory, the memory storing executable instructions, the processor being configured to invoke the memory stored executable instructions to perform the method according to the first aspect of the above embodiments.
According to the data searching method, the data searching device, the data searching medium and the computing equipment, word segmentation processing is carried out on a character string to be searched from a user according to a preset rule to generate mark group data, and then a searching result is provided through the mark group data, wherein characters in a first mark group are searched through a first matching mode, and characters in a second mark group are searched through a second matching mode. The chat record full-text retrieval scheme which is simple, efficient, customizable and good in compatibility can be provided.
According to the data searching method, the data searching device, the data searching medium and the computing equipment, the file path of the marking group data acquired by the application terminal in the Android system is converted into the file channel, so that the Lucene querier can be embedded into the Android system to run, and the searching efficiency of the chat software on the Android system is greatly improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a flow diagram of a data search method according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow diagram of another method of data searching in accordance with an embodiment of the present invention;
FIG. 3 schematically shows a block diagram of a data search apparatus according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Thus, the present invention may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a data searching method, a data searching device, a data searching medium and a computing device are provided.
In this context, it is to be understood that the term "IM" is an abbreviation of Instant Messaging, a real-time messaging service that allows users to establish private chats over a network. Chat record search is a very important function in instant messaging software.
The term "full-text search" (FTS) refers to the process of first building an index over unstructured data (also called full-text data) and then searching that index.
The term "Lucene" refers to an efficient Java-based full-text indexing library. Full-text retrieval is divided into two stages: creating the index (Indexing) and searching the index (Search). Indexing is the process of extracting information from structured and unstructured data and reorganizing it into a certain structure; its product is structured data, namely the index. Search queries the created index according to the user's query request and returns the matching result set.
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventor has found that various technical schemes for data search exist in the prior art. However, these schemes suffer from one or more of the following problems: low search efficiency and weak customizability; high implementation cost, complex implementation, and high maintenance cost; and failure to meet the requirements of Chinese search. In particular, for implementing chat record search in the Android system, the prior art offers the following schemes:
1. The SQL LIKE scheme. Its advantages are that it is simple, requires no extra storage space, and is efficient when the data volume is small; it is also the most common conventional way of implementing a search function. Its disadvantages are that the table is scanned sequentially during a search, so the whole table must be traversed each time the input string changes, and efficiency drops sharply with many keywords or a large data volume; in addition, customizability is weak and undesired results may be returned.
2. The FTS scheme. Its advantage is that Android SQLite supports FTS. Its disadvantages are as follows: for security reasons, most versions of Android do not support injecting a custom word segmenter, so SQLite for Android has to be compiled by the developer to enable the FTS function, and the custom word segmenter has to be implemented and injected. The scheme is therefore costly and complex to implement and expensive to maintain. Moreover, FTS creates a virtual table in which all primary keys and indexes are discarded. If the query service needs to query by a particular column in addition to searching, a meta table must be created and a join query used, which has a large impact on efficiency.
3. The SQLite scheme. SQLite is designed to be embedded and has a very low resource footprint, needing only a few hundred KB of memory on an embedded device. Its disadvantage is that the built-in word segmenter supports Chinese word segmentation poorly and cannot meet the requirements of Chinese search.
Therefore, the embodiment of the invention provides a data searching method, a data searching device, a data searching medium and a computing device, wherein character strings to be searched from a user are subjected to word segmentation processing according to preset rules to generate mark group data, and then search results are provided through the mark group data, wherein characters in a first mark group are searched through a first matching mode, and characters in a second mark group are searched through a second matching mode. The chat record full-text retrieval scheme which is simple, efficient, customizable and good in compatibility can be provided. And the file path of the marker group data acquired by the application terminal in the Android system is converted into the file channel, so that the Lucene querier can be embedded into the Android system to operate, and the search efficiency of the chat software on the Android system is greatly improved.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
It should be noted that the following application scenarios are merely illustrated to facilitate understanding of the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Application scenarios: the user searches in the instant communication software through the search box, and the searched content can be English, Chinese or special characters and the like. The search characters are searched by the scheme provided by the invention, and the search results are efficiently, quickly and accurately returned to the user side.
Exemplary method
In conjunction with the application scenarios described above, a data search method according to an exemplary embodiment of the present invention is described below with reference to FIG. 1 and FIG. 2.
FIG. 1 schematically illustrates a flow diagram of a data search method according to an embodiment of the present invention; referring to fig. 1, a data search method according to an embodiment of the present invention includes:
Step S10, acquiring a character string to be searched from a user. In the present application, the character string to be searched may be, for example, English, Chinese, characters of other languages, or special characters. For example, the character string to be searched is acquired from the user through an intelligent electronic device. The intelligent electronic device may, for example, run an intelligent operating system; without loss of generality, the following description takes an electronic device running the Android system as an example.
Step S12, performing word segmentation and grouping on the character string to generate tag group data, where the tag group data at least includes one of a first tag group and a second tag group, and the first tag group and the second tag group respectively include at least one key character.
In some embodiments, the first mark group is generated, for example, from the Chinese characters and digits in the character string, and the second mark group is generated from the English characters and symbols in the character string. In the first mark group, for example, each Chinese character in the character string is taken as a separate key character (token), and each digit in the character string is taken as a separate key character. In the second mark group, for example, each run of consecutive English characters in the character string is taken as one key character, and each symbol in the character string is taken as a separate key character; the symbols include punctuation, characters of languages other than Chinese and English, and special characters. Here, a token is a segment produced by the word segmenter.
In some embodiments, the method further includes segmenting the chat records with a custom word segmentation algorithm and sending the segmentation result to Lucene for indexing, forming the following indexing result: information such as the content, frequency of occurrence, and position of occurrence of the key characters in the segmentation result is maintained in specific data structures, and these data structures are stored in files. The content, frequency, and position information for the same key character may be stored in the same file or in different files. For example, the chat records are segmented with the custom word segmentation algorithm, the segmentation result is passed to a WhitespaceAnalyzer for processing, and the WhitespaceAnalyzer passes its output to the Lucene engine for indexing and storage; the character string to be searched is then segmented and grouped, and the search is completed using the queriers provided by Lucene.
In some embodiments, the custom word segmentation algorithm in the present application may, for example, examine and classify each character in the input string: some characters are combined into one token (key character), while others each form a token on their own. For Chinese, other foreign languages, and special characters, each single character is one token, so that single-character searches can be satisfied. For digits, each digit is one token, in consideration of user experience. It should be noted that a space is a separator/stop sign, while ' may be handled, for example, as English or as a symbol.
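The classification rules above can be illustrated with a minimal Java sketch. This is not the patent's implementation: the class name `ChatTokenizer`, the `TokenGroups` holder, and the choice to treat only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese" are assumptions made for illustration, and the apostrophe is simply treated as a symbol here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the character-class word segmentation described above. */
final class ChatTokenizer {

    /** first = Chinese characters and digits; second = English runs and symbols. */
    static final class TokenGroups {
        final List<String> first = new ArrayList<>();
        final List<String> second = new ArrayList<>();
    }

    static boolean isEnglish(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Assumption: "Chinese" means the basic CJK Unified Ideographs block.
    static boolean isChinese(char c) {
        return c >= '\u4e00' && c <= '\u9fff';
    }

    static TokenGroups tokenize(String input) {
        TokenGroups groups = new TokenGroups();
        StringBuilder englishRun = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isEnglish(c)) {
                englishRun.append(c); // consecutive English letters form one token
                continue;
            }
            flush(englishRun, groups.second);
            if (Character.isWhitespace(c)) {
                continue; // a space only separates tokens
            }
            if (isChinese(c) || isDigit(c)) {
                groups.first.add(String.valueOf(c)); // one token per character
            } else {
                groups.second.add(String.valueOf(c)); // each symbol is its own token
            }
        }
        flush(englishRun, groups.second);
        return groups;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }
}
```

Under these assumptions, the input `你好123 hello,world` yields the first group `[你, 好, 1, 2, 3]` and the second group `[hello, ",", world]`.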
Step S14, providing a search result through the mark group data, wherein the characters in the first mark group are searched through a first matching manner, and the characters in the second mark group are searched through a second matching manner. In the present application, the mark group data may, for example, be input to a Lucene querier, through which search results are provided. For the second mark group, a prefix matching query is adopted; for the first mark group, a phrase matching query is adopted; finally, the intersection of the two queries is taken as the query result.
In some embodiments, the characters in the first mark group are searched by phrase matching (PhraseQuery). Phrase matching requires that the keyword positions, in the original string, of the tokens in the query string be consecutive.
In some embodiments, the characters in the second mark group are searched by prefix matching (PrefixQuery).
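The combination of the two matching modes can be sketched in plain Java over already-segmented messages. This is an in-memory illustration of the semantics only, not Lucene's PhraseQuery or PrefixQuery; the class `GroupMatcher` and its methods are hypothetical, and it assumes that an empty group places no constraint and that every token of the English/symbol group must be a prefix of some token in the message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory illustration of phrase matching, prefix matching, and their intersection. */
final class GroupMatcher {

    /** Phrase match: the phrase tokens occur at consecutive token positions in the message. */
    static boolean phraseMatch(List<String> messageTokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return true; // assumption: an empty group places no constraint
        }
        int n = phrase.size();
        for (int start = 0; start + n <= messageTokens.size(); start++) {
            if (messageTokens.subList(start, start + n).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    /** Prefix match: every query token is a prefix of at least one message token. */
    static boolean prefixMatch(List<String> messageTokens, List<String> prefixes) {
        for (String prefix : prefixes) {
            boolean found = false;
            for (String token : messageTokens) {
                if (token.startsWith(prefix)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** Returns the indexes of the messages that satisfy both queries (the intersection). */
    static List<Integer> search(List<List<String>> messages,
                                List<String> chineseAndDigitGroup,
                                List<String> englishAndSymbolGroup) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < messages.size(); i++) {
            List<String> tokens = messages.get(i);
            if (phraseMatch(tokens, chineseAndDigitGroup)
                    && prefixMatch(tokens, englishAndSymbolGroup)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

A real deployment would evaluate both conditions against the index rather than scanning every message; the scan here only keeps the example short.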
According to the data searching method, the data searching device, the data searching medium and the computing equipment, word segmentation processing is carried out on a character string to be searched from a user according to a preset rule to generate mark group data, and then a searching result is provided through the mark group data, wherein characters in a first mark group are searched through a first matching mode, and characters in a second mark group are searched through a second matching mode. The chat record full-text retrieval scheme which is simple, efficient, customizable and good in compatibility can be provided.
FIG. 2 schematically illustrates a flow diagram of another method of data searching in accordance with an embodiment of the present invention; fig. 2 schematically illustrates a process of indexing the source data by the Lucene engine, which is not limited in this application.
Step S20, performing word segmentation processing on the source data through a custom word segmentation algorithm to generate index data. For example, the character strings in the chat records are segmented; during indexing, the chat records may be segmented using the custom word segmentation algorithm.
Step S22, sending the index data to a Lucene engine so that the Lucene engine can carry out index processing. The word segmentation result may, for example, be sent to Lucene for indexing, forming the following indexing result: information such as the content, frequency of occurrence, and position of occurrence of the key characters in the word segmentation result is maintained in specific data structures, and these data structures are then stored in files. The content, frequency, and position information for the same key character may be stored in the same file or in different files.
Class files that exist on the Java platform but not on the Android platform can be supplemented, for example, by modifying the Lucene source code, thereby enabling the Lucene engine to run on the Android platform.
In the present application, Lucene 5 is taken as an example for description. Lucene is a Java framework, and some problems arise when it is used on Android. When porting Lucene 5 to Android, the following Lucene JAR packages need to be ported as well:
lucene-core.jar
lucene-analyzers-common.jar
lucene-queryparser.jar
lucene-grouping.jar
In some embodiments, if the above JAR packages are placed directly under the libs directory of the Android project, a ClassNotFoundException is reported at Android runtime. This is because Lucene 5 uses NIO.2 from Java (J2SE 7): when running on the Java platform, the Java virtual machine can find the NIO.2 classes in the JDK, but the Android SDK does not provide them. Therefore, the source code of the JAR packages needs to be imported into the project and the missing NIO classes implemented by hand. The whole of Java NIO does not need to be brought in; only the classes the JAR packages actually need, such as Paths, Files, and StandardOpenOption, need simple implementations, and an opened Path needs to be converted into a FileChannel.
In some embodiments, the file path of the index data acquired through an application terminal in the Android system is converted into a file channel by means of the modified class files. The modified class files include: the path class file, org.apache.lucene.store.FSDirectory, which uses a self-implemented Path (a wrapper around File) and a self-implemented Files (a utility class for operating on Path); the document class file, org.apache.lucene.store.SimpleFSDirectory, which is the implementation of FSDirectory on Windows and whose openInput method must be modified to obtain a FileChannel from the Path by constructing a RandomAccessFile; and the standard open option class file, org.apache.lucene.util.AttributeFactory, which contains some reflection-handling functions.
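The path-to-channel conversion can be sketched with standard `java.io` classes only. The class `FileChannelOpener` and its methods are hypothetical and are not taken from the modified Lucene sources; the sketch only shows the general idea of obtaining a `FileChannel` through a `RandomAccessFile` instead of through `java.nio.file`.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;

/** Hypothetical helper: obtains a FileChannel without using java.nio.file. */
final class FileChannelOpener {

    /** Opens a read-only channel for the given file via RandomAccessFile. */
    static FileChannel openForRead(File file) throws IOException {
        return new RandomAccessFile(file, "r").getChannel();
    }

    /** Demo helper: writes bytes to a temp file and reads them back through the channel. */
    static boolean roundTrip(byte[] data) {
        try {
            File tmp = File.createTempFile("channel-sketch", ".bin");
            try {
                try (FileOutputStream out = new FileOutputStream(tmp)) {
                    out.write(data);
                }
                try (FileChannel channel = openForRead(tmp)) {
                    ByteBuffer buffer = ByteBuffer.allocate(data.length);
                    while (buffer.hasRemaining() && channel.read(buffer) >= 0) {
                        // keep reading until the buffer is full or end of file
                    }
                    return Arrays.equals(buffer.array(), data);
                }
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```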
For a clearer description of the process of indexing by Lucene, the basic principle of Lucene is presented below. An index is a collection of a series of documents, documents is a collection of a series of fields, a Field is a collection of a series of possible occurrences of Term, Term is a collection of a series of bytes, the same sequence of bytes is considered to be different terms in different fields, and thus Term consists of a pair of values: field name (string) and Field value (bytes). An index is composed of a plurality of sub-indexes, i.e., Segments (Segments), which are completely independent indexes and are to be searched individually.
The organization and storage structure of a Lucene index is an inverted index (Inverted Index). For example, given a series of documents that have a Field named content, every character string that may appear in Field: content is called a Term, such as Term (Field name: content, Field value: "lucene"). Each Term points to the list of Documents containing that Term, which is called the posting list (Posting List).
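The mapping from a Term to its posting list can be sketched as follows. TinyInvertedIndex is a hypothetical class written for illustration and is not part of Lucene; it assumes documents are added in increasing document-id order and keys each Term by field name plus field value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: a hypothetical in-memory inverted index, not Lucene's implementation. */
final class TinyInvertedIndex {
    // Key is "field:value", so the same value in two fields yields two distinct terms.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Adds the values of one field of one document; doc ids are expected in increasing order. */
    void add(int docId, String field, String... values) {
        for (String value : values) {
            List<Integer> list = postings.computeIfAbsent(field + ":" + value, k -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId); // record each document once per term
            }
        }
    }

    /** Returns the posting list (document ids) of the term, or an empty list. */
    List<Integer> postingList(String field, String value) {
        return postings.getOrDefault(field + ":" + value, Collections.emptyList());
    }
}
```

Looking up ("content", "lucene") and ("title", "lucene") returns two independent posting lists, which mirrors the statement that the same byte sequence in different Fields is a different Term.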
With an index, searching becomes much faster when the data volume is large. Compared with sequential scanning, the advantage of full-text retrieval is: index once, use many times. To locate a Term quickly in the Term dictionary, Lucene also introduces a Term index, structured as a word search tree (Trie tree), which uses the common prefixes of character strings to reduce query time and minimize unnecessary string comparisons. The Trie tree does not contain every Term; it contains some prefixes of the Terms. Through the Term index, an offset in the Term dictionary can be located quickly, and the search then proceeds sequentially from that position.
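The idea of a prefix tree that stores only short prefixes and points into a sorted Term dictionary can be sketched as below. TermIndexSketch and its maxDepth parameter are hypothetical; Lucene's actual Term index uses a different, more compact structure, so this is an illustration of the principle under simplified assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: a depth-limited trie over a sorted term dictionary (hypothetical class). */
final class TermIndexSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int offset = -1; // offset of the first dictionary term that has this prefix
    }

    private final String[] dictionary; // sorted term dictionary
    private final Node root = new Node();
    private final int maxDepth;        // the trie keeps only prefixes up to this length

    TermIndexSketch(String[] terms, int maxDepth) {
        this.dictionary = terms.clone();
        Arrays.sort(this.dictionary);
        this.maxDepth = maxDepth;
        for (int i = 0; i < dictionary.length; i++) {
            Node node = root;
            String term = dictionary[i];
            for (int d = 0; d < Math.min(maxDepth, term.length()); d++) {
                node = node.children.computeIfAbsent(term.charAt(d), c -> new Node());
                if (node.offset < 0) {
                    node.offset = i;
                }
            }
        }
    }

    /** Jumps to an offset through the trie, then scans the dictionary sequentially. */
    List<String> termsWithPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        if (prefix.isEmpty()) {
            result.addAll(Arrays.asList(dictionary));
            return result;
        }
        Node node = root;
        for (int d = 0; d < Math.min(maxDepth, prefix.length()); d++) {
            node = node.children.get(prefix.charAt(d));
            if (node == null) {
                return result; // no term shares even the short prefix
            }
        }
        for (int i = node.offset; i < dictionary.length; i++) {
            String term = dictionary[i];
            if (term.startsWith(prefix)) {
                result.add(term);
            } else if (term.compareTo(prefix) > 0) {
                break; // sorted order: no later term can match
            }
        }
        return result;
    }
}
```

Limiting the trie depth keeps the index small at the cost of a short sequential scan, which is the trade-off the paragraph above describes.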
In some embodiments, a keyword query on a field goes from the Term Index to the Term Dictionary and then to the Posting List. A query with multiple keywords requires linked-list operations such as intersection, union, and difference on the Posting Lists of those keywords.
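The three list operations can be sketched on sorted lists of document ids. PostingListOps is a hypothetical helper, and the sketch assumes each posting list is sorted in ascending order without duplicates.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: merge-style operations on sorted, duplicate-free posting lists. */
final class PostingListOps {
    /** Documents that contain both keywords. */
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    /** Documents that contain at least one of the keywords. */
    static List<Integer> union(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() || j < b.size()) {
            if (j >= b.size() || (i < a.size() && a.get(i) < b.get(j))) {
                out.add(a.get(i++));
            } else if (i >= a.size() || b.get(j) < a.get(i)) {
                out.add(b.get(j++));
            } else {
                out.add(a.get(i)); // same document in both lists
                i++;
                j++;
            }
        }
        return out;
    }

    /** Documents that contain the first keyword but not the second. */
    static List<Integer> difference(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int x : a) {
            while (j < b.size() && b.get(j) < x) {
                j++;
            }
            if (j >= b.size() || b.get(j) != x) {
                out.add(x);
            }
        }
        return out;
    }
}
```

Because both inputs are sorted, each operation makes a single pass over the two lists.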
In some embodiments, Lucene records Frequency, i.e., term frequency, during index creation; it represents the number of times a Term appears in a Document. In addition, the Position at which the Term appears in the Document is also recorded. There are two types of position:
1) Character position: records which character of the article the Term starts at; it is used for positioning when keywords are highlighted.
2) Keyword position: records which keyword of the article the Term is (the tokenizer may filter out some words, and the remaining valid words are numbered in order); it is needed for phrase queries (PhraseQuery), and it is the position that Lucene records.
A linked-list node in a Posting List contains DOC, Frequency, and Position, and stores information as follows:

Term    Document ID    Frequency    Position
nim     1              2            2, 6
nim     2              1            3

The node information shows that the Term "nim" appears 2 times in article 1, at the 2nd and 6th keyword positions (with three keywords in between), and 1 time in article 2, at the 3rd keyword position.
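A minimal sketch of how frequency and keyword positions could be collected while indexing is shown below. PositionalPostings is a hypothetical class, positions are 1-based as in the table above, and any keywords other than "nim" used with it are invented filler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: hypothetical positional posting lists, not Lucene's on-disk format. */
final class PositionalPostings {
    // term -> (document id -> 1-based keyword positions); frequency is the size of that list
    private final Map<String, Map<Integer, List<Integer>>> postings = new LinkedHashMap<>();

    /** Indexes one document given its keywords in order of appearance. */
    void addDocument(int docId, String... keywords) {
        for (int i = 0; i < keywords.length; i++) {
            postings.computeIfAbsent(keywords[i], t -> new LinkedHashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(i + 1);
        }
    }

    /** Keyword positions of the term in the document, or an empty list. */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return new ArrayList<>();
        }
        return docs.get(docId);
    }

    /** Number of times the term appears in the document. */
    int frequency(String term, int docId) {
        return positions(term, docId).size();
    }
}
```

Indexing a six-keyword document with "nim" in the 2nd and 6th places, and a three-keyword document with "nim" last, yields the frequencies and positions listed in the table.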
In some embodiments, the content, number of occurrences, positions of occurrence, and other information of the key characters in the word segmentation result of the chat records described above are maintained in a specific data structure. The information about one key character in this data structure may be stored in a single file or in different files. Given the file path of such a file, the file channel (FileChannel) corresponding to that path is returned and used by the Lucene engine for searching.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises: importing the Lucene source code package into the Android system. The method further comprises: trimming the source code when the Lucene source code package is imported into the Android system.
The Lucene source code package is very large. The four jar packages lucene-core.jar, lucene-analyzers-common.jar, lucene-queryparser.jar, and lucene-grouping.jar total at least 6 MB. If they are used directly in Android without trimming, the method count easily exceeds the 65K limit. During packaging, the ProGuard shrinker deletes classes that are not used, but the classes referenced in the configuration files under META-INF must be preserved with keep rules.
The ProGuard obfuscation configuration (cfg) is as follows:
    -keepclasseswithmembernames class org.apache.lucene.** {*;}
    -keep class org.apache.lucene.codecs.lucene53.** {*;}
After trimming by the above method, the packaged Lucene output is only about 1 MB.
According to the data searching method provided by the embodiment of the invention, the file path of the marker group data acquired by the application terminal in the Android system is converted into the file channel, so that the Lucene querier can be embedded into the Android system to run, and the searching efficiency of the chat software on the Android system is greatly improved.
In some embodiments, according to the data search method in the present application, a schematic example of the data indexing and searching process is as follows:
[1] Indexing stage:
The original string is "Allen Love 中华人民共和国" ("Allen Love People's Republic of China"). After the custom word segmentation algorithm plus WhitespaceAnalyzer, the word segmentation result is: allen/love/中/华/人/民/共/和/国. Nine Terms are constructed and written to the index: TermDictionary -> DocumentList.
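The segmentation rule used in the indexing stage can be sketched as below. ChatSegmenter is a hypothetical class; it assumes that a run of English letters forms one token, that every other non-whitespace character (Chinese character, digit, or symbol) forms a token by itself, and that English tokens are lowercased, which is inferred from the example result above. The tokens are joined with spaces so that a whitespace-based analyzer can split them again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch only: a hypothetical version of the custom word segmentation step. */
final class ChatSegmenter {
    /** Splits text into key characters: English runs as words, everything else one by one. */
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isEnglishLetter(cp)) {
                word.appendCodePoint(cp); // extend the current English word
                continue;
            }
            flush(word, tokens);
            if (!Character.isWhitespace(cp)) {
                tokens.add(new String(Character.toChars(cp))); // Chinese character, digit, or symbol
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Output in the form expected by a whitespace-based analyzer. */
    static String toWhitespaceSeparated(String text) {
        return String.join(" ", segment(text));
    }

    private static boolean isEnglishLetter(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString().toLowerCase(Locale.ROOT));
            word.setLength(0);
        }
    }
}
```

Under these assumptions, a string made of "Allen Love" followed by seven Chinese characters produces nine tokens, matching the nine Terms in the example.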
[2] Searching stage:
When searching for "all", the search string passes through the word segmentation algorithm to yield the token "all", and the search starts after a PrefixQuery is constructed: the Terms beginning with "a" are located quickly through the dictionary tree, prefix matching is then performed sequentially, the matching Term "allen" is found, and the docid of "allen" is taken from the inverted list.
When searching for "中国" ("China"), the search string yields the phrase "中国" through the word segmentation algorithm, and the search starts after a PhraseQuery "中 国" is constructed (each character in the phrase is a Term, separated by a space): first docid x and keyword position 3 are retrieved for Term1 "中", then docid x and keyword position 9 are retrieved for Term2 "国". The docid linked lists of all the Terms are then merged to obtain docid x, which contains all the Terms, and the keyword positions of the Terms within docid x are checked for continuity. Positions 3 and 9 are not continuous, so docid x does not satisfy the phrase match and is excluded.
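The continuity check on keyword positions can be sketched as follows. PhraseCheck is a hypothetical helper; it takes, for one document, the list of keyword positions of each phrase Term in phrase order, and reports whether the Terms occur at consecutive positions.

```java
import java.util.List;

/** Sketch only: checks whether phrase terms occupy consecutive keyword positions. */
final class PhraseCheck {
    /**
     * positionsPerTerm.get(k) holds the keyword positions of the k-th phrase term
     * in one document. The phrase matches if some start position p exists such
     * that term k occurs at position p + k for every k.
     */
    static boolean matchesPhrase(List<List<Integer>> positionsPerTerm) {
        if (positionsPerTerm.isEmpty()) {
            return false;
        }
        for (int start : positionsPerTerm.get(0)) {
            boolean ok = true;
            for (int k = 1; k < positionsPerTerm.size() && ok; k++) {
                ok = positionsPerTerm.get(k).contains(start + k);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With positions 3 and 9 as in the example above, the check fails; with positions 3 and 4 it succeeds.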
When searching for "中华love", the search string yields two groups of tokens through the word segmentation algorithm: "中华" and "love". The Chinese phrase "中华" uses phrase matching and is processed into "Term1 Term2", namely "中 华"; the English uses prefix matching on "love". The two queries, PhraseQuery and PrefixQuery, are combined in an AND relationship. Phrase matching yields docid x, prefix matching also yields docid x, and the AND of the two docid sets returns docid x, without regard to the order of the two groups of tokens in the original string.
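The grouping of a search string and the AND combination of the two partial results can be sketched as below. QueryGroups is a hypothetical class; it assumes that Chinese characters and digits go to the first group one character at a time, that runs of English letters and individual symbols go to the second group, and that an empty group places no constraint on the result. The two hit sets are supplied by the caller, so no real Lucene query is executed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: hypothetical grouping of a search string into two mark groups. */
final class QueryGroups {
    final List<String> first = new ArrayList<>();  // Chinese characters and digits (phrase matching)
    final List<String> second = new ArrayList<>(); // English words and symbols (prefix matching)

    static QueryGroups of(String query) {
        QueryGroups groups = new QueryGroups();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            i += Character.charCount(cp);
            if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')) {
                word.appendCodePoint(cp);
                continue;
            }
            groups.flushWord(word);
            String ch = new String(Character.toChars(cp));
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || (cp >= '0' && cp <= '9')) {
                groups.first.add(ch);
            } else if (!Character.isWhitespace(cp)) {
                groups.second.add(ch);
            }
        }
        groups.flushWord(word);
        return groups;
    }

    private void flushWord(StringBuilder word) {
        if (word.length() > 0) {
            second.add(word.toString().toLowerCase(Locale.ROOT));
            word.setLength(0);
        }
    }

    /** Phrase text with one Term per character, separated by spaces. */
    String phraseText() {
        return String.join(" ", first);
    }

    /** AND of the phrase hits and the prefix hits; an empty group is ignored. */
    static Set<Integer> and(QueryGroups groups, Set<Integer> phraseHits, Set<Integer> prefixHits) {
        if (groups.first.isEmpty()) {
            return new TreeSet<>(prefixHits);
        }
        if (groups.second.isEmpty()) {
            return new TreeSet<>(phraseHits);
        }
        Set<Integer> result = new TreeSet<>(phraseHits);
        result.retainAll(prefixHits);
        return result;
    }
}
```

For a query that mixes two Chinese characters with "love", the first group holds the two characters, the second group holds "love", and only documents present in both hit sets are returned.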
In the present application, the open-source Java full-text indexing library Lucene is ported and customized, so that the data search method in the embodiments of the application has the following advantages:
1) It is a professional open-source Java search framework that is widely used in J2EE and whose versions are iterated continuously.
2) It is written entirely at the Java layer; although it runs on the Java virtual machine, its measured efficiency is not poor.
3) It uses an inverted index structure and trades space for time: 300,000 (30w) chat records consume less than 30 MB of space, and write performance is about 2,000 chat records per second.
4) Compatibility is good: there is no native code, so compatibility problems such as 32/64-bit libraries need not be considered.
5) Customizability is strong, and result scoring is supported.
6) It is pure Java, which makes debugging and locating problems convenient.
In some embodiments, a performance comparison experiment on a large amount of data was carried out for the requirement "return how many sessions match and how many matching chat records each session contains". The experimental results are as follows:

1) Search for "me":

Total messages (w = ten thousand)    Number of hits    Lucene    SQL LIKE    FTS4
1w                                   2018              49        56          221
2w                                   4082              76        97          403
5w                                   10272             172       226         989
10w                                  20504             257       426         2101
15w                                  30758             373       653         3131
20w                                  41165             494       859         4221
30w                                  61671             762       1272        6440

2) Search for "i you":

Total messages (w = ten thousand)    Number of hits    Lucene    SQL LIKE    FTS4
1w                                   889               29        53          140
2w                                   1787              87        124         253
5w                                   4442              133       237         548
10w                                  8857              161       469         1071
15w                                  13262             244       707         1722
20w                                  17622             282       909         2299
30w                                  26485             375       1361        3474
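The above results show the advantage of Lucene in search efficiency: at 300,000 messages (30w), the Lucene figure is 762 against 1272 for SQL LIKE and 6440 for FTS4 in the first search, and 375 against 1361 and 3474 in the second. In terms of both implementation cost and search efficiency, the Lucene scheme is superior to the SQL LIKE scheme and the FTS4 scheme.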
Exemplary Medium
Having described the method of the exemplary embodiments of the present invention, the media of the exemplary embodiments of the present invention will be described next.
In some possible embodiments, aspects of the present invention may also be implemented as a medium having program code stored thereon; when the program code is executed by a processor of a device, it implements the steps of the data search method according to the various exemplary embodiments of the present invention described in the "Exemplary method" section above.
Specifically, when executing the program code, the processor of the device implements the following steps: acquiring a character string to be searched from a user; performing word segmentation processing and grouping processing on the character string to generate mark group data, wherein the mark group data comprises at least one of a first mark group and a second mark group, and the first mark group and the second mark group each comprise at least one key character; and providing a search result through the mark group data, wherein characters in the first mark group are searched in a first matching mode, and characters in the second mark group are searched in a second matching mode.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following step: acquiring, through an intelligent electronic device, a character string to be searched from a user.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following steps: generating the first mark group from the Chinese characters and digits in the character string; and generating the second mark group from the English characters and symbols in the character string.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following steps: taking each Chinese character in the character string as a key character; and taking each digit in the character string as a key character.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following steps: taking continuous English characters in the character string as one key character; and taking each symbol in the character string as a key character; wherein the symbols comprise: punctuation marks, characters of languages other than Chinese and English, and special characters.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following step: inputting the mark group data into a Lucene querier, and providing search results through the Lucene querier.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following step: searching for the characters in the first mark group in a phrase matching mode.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following step: searching for the characters in the second mark group in a prefix matching mode.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following steps: performing word segmentation processing on the source data through a custom word segmentation algorithm to generate index data; and sending the index data to a Lucene engine so that the Lucene engine performs index processing.
In some embodiments of the invention, the program code is executable by a processor of the device to perform the following steps: processing the index data by the WhitespaceAnalyzer method to generate first data; and sending the first data to a Lucene engine.
It should be noted that: the above-mentioned medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take a variety of forms, including, but not limited to: an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
Exemplary devices
Having described the medium of an exemplary embodiment of the present invention, a data search apparatus 300 of an exemplary embodiment of the present invention is explained next with reference to fig. 3.
Fig. 3 schematically shows a block diagram of a data search apparatus according to an embodiment of the present invention.
Referring to fig. 3, a data search apparatus 300 according to an embodiment of the present invention includes: a receiving module 302, a word segmentation module 304, and a query module 306.
Specifically, the receiving module 302 is configured to acquire a character string to be searched from a user; the word segmentation module 304 is configured to perform word segmentation processing and grouping processing on the character string to generate mark group data, wherein the mark group data comprises at least one of a first mark group and a second mark group, and the first mark group and the second mark group each comprise at least one key character; the query module 306 is configured to provide a search result through the mark group data, wherein a first query submodule is configured to search for characters in the first mark group in a first matching mode, and a second query submodule is configured to search for characters in the second mark group in a second matching mode.
In some embodiments of the present invention, based on the foregoing solution, the receiving module 302 is further configured to acquire, through an intelligent electronic device, a character string to be searched from the user.
In some embodiments of the present invention, based on the foregoing solution, the word segmentation module 304 comprises: a first word segmentation submodule, configured to generate the first mark group from the Chinese characters and digits in the character string; and a second word segmentation submodule, configured to generate the second mark group from the English characters and symbols in the character string.
In some embodiments of the present invention, based on the foregoing scheme, the first word segmentation submodule 3022 is configured to: take each Chinese character in the character string as a key character; and take each digit in the character string as a key character.
In some embodiments of the present invention, based on the foregoing scheme, the second word segmentation submodule 3024 is configured to: take continuous English characters in the character string as one key character; and take each symbol in the character string as a key character; wherein the symbols comprise: punctuation marks, characters of languages other than Chinese and English, and special characters.
In some embodiments of the present invention, based on the foregoing, the query module 306 is configured to: input the mark group data into a Lucene querier, and provide search results through the Lucene querier.
In some embodiments of the present invention, based on the foregoing solution, the query module 306 is further configured to: input the mark group data into the querier through an application end in an Android system, and provide a search result through the Lucene querier.
In some embodiments of the present invention, based on the foregoing scheme, the apparatus further comprises: an importing module 308, configured to import the Lucene source code package into the Android system.
In some embodiments of the present invention, based on the foregoing scheme, the importing module 308 further includes: a trimming submodule, configured to trim the source code when the Lucene source code package is imported into the Android system.
In some embodiments of the present invention, based on the foregoing scheme, the apparatus further comprises: a data module 310, configured to perform word segmentation on the source data to generate index data; and an indexing module 312, configured to send the index data to a Lucene engine so that the Lucene engine performs index processing.
In some embodiments of the present invention, based on the foregoing scheme, the indexing module 312 includes: a data submodule 3122, configured to process the index data by the WhitespaceAnalyzer method to generate first data; and a sending submodule 3124, configured to send the first data to the Lucene engine.
In some embodiments of the present invention, based on the foregoing solution, the first query submodule 3022 includes: a phrase matching unit, configured to search for the characters in the first mark group in a phrase matching mode.
In some embodiments of the present invention, based on the foregoing solution, the second query submodule 3024 includes: a prefix matching unit, configured to search for the characters in the second mark group in a prefix matching mode.
Exemplary computing device
Having described the method, medium, and apparatus of exemplary embodiments of the present invention, a computing device in accordance with another exemplary embodiment of the present invention is described.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
In some possible implementations, a computing device according to an embodiment of the invention may include at least one processor and at least one memory. The memory stores program code that, when executed by the processor, causes the processor to perform the steps of the data search method according to the various exemplary embodiments of the present invention described in the "Exemplary method" section above. For example, the processor may execute step S10 shown in fig. 1, acquiring a character string to be searched from a user; step S12, performing word segmentation and grouping on the character string to generate mark group data, where the mark group data comprises at least one of a first mark group and a second mark group, and the first mark group and the second mark group each comprise at least one key character; and step S14, providing a search result through the mark group data, wherein the characters in the first mark group are searched in a first matching mode, and the characters in the second mark group are searched in a second matching mode.
It should be noted that although several units or sub-units of the data search apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the particular embodiments disclosed, nor does the division into aspects mean that the features in these aspects cannot be combined to advantage; the division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (22)

1. A data search method for full-text search of chat records, the method comprising:
acquiring a character string to be searched from a user;
performing word segmentation processing and grouping processing on the character string to generate mark group data, wherein the mark group data comprises a first mark group and a second mark group, and the first mark group and the second mark group each comprise at least one key character; wherein the first mark group is generated from the Chinese characters and digits in the character string, and the second mark group is generated from the English characters and symbols in the character string;
inputting the mark group data into a Lucene querier through an application end in an Android system, and providing a search result through the Lucene querier, wherein characters in the first mark group are searched in a first matching mode, and characters in the second mark group are searched in a second matching mode; and
generating a query result for the character string according to the intersection of the search result of the first mark group and the search result of the second mark group.
2. The method of claim 1, wherein acquiring a character string to be searched from a user comprises: acquiring, through an intelligent electronic device, the character string to be searched from the user.
3. The method of claim 1, wherein generating the first mark group from the Chinese characters and digits in the character string comprises:
taking each Chinese character in the character string as a key character; and
taking each digit in the character string as a key character.
4. The method of claim 1, wherein generating the second mark group from the English characters and symbols in the character string comprises:
taking continuous English characters in the character string as one key character; and
taking each symbol in the character string as a key character;
wherein the symbols comprise: punctuation marks, characters of languages other than Chinese and English, and special characters.
5. The method of claim 1, further comprising:
importing the Lucene source code package into the Android system.
6. The method of claim 5, further comprising:
trimming the source code when the Lucene source code package is imported into the Android system.
7. The method of claim 1, wherein searching for characters in the first mark group in a first matching mode comprises:
searching for the characters in the first mark group in a phrase matching mode.
8. The method of claim 1, wherein searching for characters in the second mark group in a second matching mode comprises:
searching for the characters in the second mark group in a prefix matching mode.
9. The method of claim 1, further comprising:
performing word segmentation processing on the source data to generate index data; and
sending the index data to a Lucene engine so that the Lucene engine performs index processing.
10. The method of claim 9, wherein sending the index data to a Lucene engine comprises:
processing the index data by the WhitespaceAnalyzer method to generate first data; and
sending the first data to the Lucene engine.
11. A data search device for full-text search of chat records, the device comprising:
a receiving module, configured to acquire a character string to be searched from a user;
a word segmentation module, configured to perform word segmentation processing and grouping processing on the character string to generate mark group data, wherein the mark group data comprises a first mark group and a second mark group, and the first mark group and the second mark group each comprise at least one key character; the word segmentation module comprising: a first word segmentation submodule, configured to generate the first mark group from the Chinese characters and digits in the character string; and a second word segmentation submodule, configured to generate the second mark group from the English characters and symbols in the character string; and
a query module, configured to input the mark group data into a querier through an application end in an Android system and to provide a search result through a Lucene querier, wherein a first query submodule is configured to search for characters in the first mark group in a first matching mode, and a second query submodule is configured to search for characters in the second mark group in a second matching mode; and to generate a query result for the character string according to the intersection of the search result of the first mark group and the search result of the second mark group.
12. The apparatus of claim 11, wherein the receiving module is further configured to acquire, through an intelligent electronic device, a character string to be searched from the user.
13. The apparatus of claim 11, wherein the first word segmentation submodule is configured to:
take each Chinese character in the character string as a key character; and
take each digit in the character string as a key character.
14. The apparatus of claim 11, wherein the second word segmentation submodule is configured to:
take continuous English characters in the character string as one key character; and
take each symbol in the character string as a key character;
wherein the symbols comprise: punctuation marks, characters of languages other than Chinese and English, and special characters.
15. The apparatus of claim 11, further comprising:
an import module, configured to import the Lucene source code package into the Android system.
16. The apparatus of claim 15, wherein the import module further comprises:
a trimming submodule, configured to trim the source code when the Lucene source code package is imported into the Android system.
17. The apparatus of claim 11, wherein the query module comprises:
a phrase matching unit, configured to search for the characters in the first mark group in a phrase matching mode.
18. The apparatus of claim 11, wherein the query module comprises:
a prefix matching unit, configured to search for the characters in the second mark group in a prefix matching mode.
19. The apparatus of claim 11, further comprising:
a data module, configured to perform word segmentation processing on the source data to generate index data; and
an index module, configured to send the index data to the Lucene engine so that the Lucene engine performs index processing.
20. The apparatus of claim 19, wherein the index module comprises:
a data submodule, configured to process the index data by the WhitespaceAnalyzer method to generate first data; and
a sending submodule, configured to send the first data to the Lucene engine.
21. A medium having stored thereon a program which, when executed by a processor, carries out the method of any one of claims 1 to 10.
22. An electronic device, comprising: a processor and a memory, the memory storing executable instructions, the processor to invoke the memory-stored executable instructions to perform the method of any of claims 1 to 10.
Application CN201810157579.0A: Data searching method, device, medium and computing equipment. Priority date 2018-02-24; filing date 2018-02-24; status Active; granted as CN108388635B (en).

Publications (2)

Publication Number    Publication Date
CN108388635A          2018-08-10
CN108388635B          2021-08-03

Family ID: 63068960

Country Status (1)

CN (1): CN108388635B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241137A (en) * 2018-08-27 2019-01-18 中国建设银行股份有限公司 A kind of line number fuzzy query method and device
CN109800412A (en) * 2018-12-10 2019-05-24 鲁东大学 A kind of Chinese word segmentation and big data information retrieval method and device
CN111611471B (en) * 2019-02-25 2023-12-26 阿里巴巴集团控股有限公司 Searching method and device and electronic equipment
CN110069604B (en) * 2019-04-23 2022-04-08 北京字节跳动网络技术有限公司 Text search method, text search device and computer-readable storage medium
CN111159021B (en) * 2019-12-20 2022-12-23 苏宁云计算有限公司 Method and system for quickly positioning chat robot problem
CN111737288B (en) * 2020-06-05 2023-07-25 富途网络科技(深圳)有限公司 Search control method, device, terminal equipment, server and storage medium
CN112163207B (en) * 2020-10-30 2023-11-21 深圳平安智汇企业信息管理有限公司 Service data query method based on dynamic permission and related equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100476800C (en) * 2007-06-22 2009-04-08 腾讯科技(深圳)有限公司 Method and system for cutting index participle
CN101154241A (en) * 2007-10-11 2008-04-02 北京金山软件有限公司 Data searching method and data searching system
JP5492726B2 (en) * 2010-09-27 2014-05-14 株式会社日立システムズ Character string search support system, search support method, and program therefor, excluding specific character strings
CN103186588A (en) * 2011-12-30 2013-07-03 大连天维科技有限公司 Pinyin searching method
CN103336850B (en) * 2013-07-24 2016-09-21 昆明理工大学 Method and device for determining search terms in a database retrieval system
CN104699724A (en) * 2013-12-10 2015-06-10 北京先进数通信息技术股份公司 Lucene-based data searching method and device
CN107506413B (en) * 2017-08-11 2020-03-20 江苏科技大学 Lucene wrongly written character based query method

Also Published As

Publication number Publication date
CN108388635A (en) 2018-08-10

Similar Documents

Publication Publication Date Title
CN108388635B (en) Data searching method, device, medium and computing equipment
US10169337B2 (en) Converting data into natural language form
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US20060047500A1 (en) Named entity recognition using compiler methods
US20090019356A1 (en) Intelligent Text Annotation
US9311388B2 (en) Semantic and contextual searching of knowledge repositories
US20160292153A1 (en) Identification of examples in documents
CN111176650B (en) Parser generation method, search method, server, and storage medium
US20090222407A1 (en) Information search system, method and program
CN111309760A (en) Data retrieval method, system, device and storage medium
US20200242349A1 (en) Document retrieval through assertion analysis on entities and document fragments
CN101021851B (en) Text search device, text search method
CN111475196A (en) Compiling alarm tracing method and device, electronic equipment and computer readable medium
US8065283B2 (en) Term synonym generation
US9904674B2 (en) Augmented text search with syntactic information
US20070129932A1 (en) Chinese to english translation tool
CN104778232A (en) Searching result optimizing method and device based on long query
US9720910B2 (en) Using business process model to create machine translation dictionaries
US10678870B2 (en) System and method for search discovery
KR20130074176A (en) Korean morphological analysis apparatus and method based on tagged corpus
US20230186023A1 (en) Automatically assign term to text documents
CN111898762B (en) Deep learning model catalog creation
CN113868375A (en) Data query method, device, equipment and storage medium based on structured query language
CN113515907A (en) Pre-analysis method of VVP file and computer-readable storage medium
US11669555B2 (en) System and method of creating index

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210930

Address after: 310000 Room 408, building 3, No. 399, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Netease Zhiqi Technology Co.,Ltd.

Address before: 310052 Room 301, Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU LANGHE TECHNOLOGY Ltd.