
CN109885641B - Method and system for searching Chinese full text in database

Info

Publication number: CN109885641B
Application number: CN201910053622.3A
Authority: CN
Inventors: 卢健; 姜瑞海; 王硕; 张龙
Original assignee: Highgo Base Software Co ltd
Current assignee: Highgo Base Software Co ltd
Priority date: 2019-01-21
Filing date: 2019-01-21
Publication date: 2021-03-09
Anticipated expiration: 2039-01-21
Also published as: CN109885641A

Abstract

The invention discloses a method and a system for Chinese full-text retrieval in a database. The method comprises the following steps: receiving a text to be retrieved; performing binary word segmentation on every two Chinese characters of the text to be retrieved to obtain a plurality of binary phrases, and inserting the binary phrases into a data table file; creating an inverted index for the data table file, wherein the inverted index comprises a position index for each binary phrase, into which the position information of the corresponding phrase in each text data in the database is written during retrieval, the position information comprising the row containing the phrase and the position within that row; and performing full-text retrieval of the text to be retrieved in the database according to the binary phrases. The retrieval method requires no dictionary, retrieves new words more effectively, and achieves higher retrieval efficiency by introducing a multi-level indexing mechanism.

Description

Method and system for searching Chinese full text in database

Technical Field

The disclosure belongs to the technical field of data retrieval, and particularly relates to a method and a system for Chinese full-text retrieval of a database.

Background

Full-text retrieval is a very common information query technology and one of the core technologies of search engines on the network. A full-text retrieval product is a database product with embedded full-text retrieval technology. Chinese full-text retrieval involves Chinese word segmentation.

Current Chinese word segmentation methods fall mainly into two classes: methods based on character string matching and methods based on statistics. A method based on character string matching matches the Chinese character string to be analyzed against the entries of a dictionary; if a character string is found in the dictionary, a word is recognized. Such a method needs a sufficiently complete dictionary, but new network words appear very quickly, and dictionary updates can hardly keep pace with them. If the text to be retrieved contains new network words that the dictionary lacks, those words cannot be segmented correctly, so the text containing them cannot be retrieved, which results in missed detections.

A statistics-based method performs word segmentation by means of the frequency or probability with which characters co-occur adjacently in a text. It only needs phrase frequency statistics from a corpus and requires no dictionary. However, it often extracts common character groups that co-occur frequently but are not words. It recognizes new words to a certain extent, but its recognition precision for common words is poor, it is time-consuming to run, and it generates a larger amount of data, which affects the efficiency of subsequent retrieval.

On the basis of word segmentation, database products commonly use inverted indexes for data processing in order to speed up retrieval. Specifically, after a database receives a data file to be inserted, the data file is first read for Chinese word segmentation; after segmentation, the file must be read again to obtain the position of each phrase in the data file and write it into the inverted index. The data file is thus read twice, and when the data file is large or a large amount of data is inserted into the database, this approach is computationally expensive and inefficient. In addition, when a general inverted index stores word positions, it stores only the row in which each phrase occurs. When word frequencies are counted during retrieval, the data therefore has to be read again before the similarity is calculated, and the retrieval efficiency is low.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method and a system for searching the Chinese full text in a database.

In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:

a method for Chinese full text retrieval in a database comprises the following steps:

receiving text data to be inserted into a database;

performing binary word segmentation processing on every two adjacent Chinese characters of the text data as a group, and simultaneously creating an inverted index for the text data;

in the word segmentation process, for each binary phrase obtained by word segmentation, writing the binary phrase and the position information of the binary phrase in the text data into the inverted index;

receiving a text to be retrieved, and performing binary word segmentation processing to obtain a plurality of binary word groups to be retrieved;

and in the database, executing full-text retrieval based on the inverted index and the binary phrases to be retrieved.

Further, the inverted index includes a three-level index, where the first-level index is used to identify a position of each binary phrase in the second-level index, the second-level index is used to record each binary phrase and a position of the text in the third-level index, and the third-level index is used to record position information of the binary phrases in the text, where the position information includes a line of the binary phrase in the text and a position in the line.

Further, the primary index is a code; or

The primary index is a letter or letter combination and corresponds to a plurality of data table files indexed by the letter or letter combination.

Further, the letter combinations are based on statistics of commonly used words.

Further, the executing full-text retrieval based on the inverted index and the plurality of binary phrases to be retrieved includes:

receiving a text to be retrieved, and performing binary word segmentation on the text to be retrieved to obtain a plurality of binary word groups to be retrieved;

for each text data in a database, according to the inverted index corresponding to the text data, counting the frequency of the binary phrases to be retrieved according to lines;

calculating the similarity of the text to be retrieved and each line of the text data according to the frequency;

summarizing the similarity of a text to be retrieved and each line of the text data to obtain the similarity of the text to be retrieved and the text data;

and sorting and outputting the text data in the database from high to low according to the similarity.

One or more embodiments provide a method for full text retrieval in a database, comprising the steps of:

creating an inverted index structure in advance;

receiving text data to be inserted into a database;

performing binary word segmentation on every two adjacent Chinese characters of the text data as a group;

Further, the primary index is a code; or

Further, the full-text retrieval of the text to be retrieved in the database comprises:

the executing full-text retrieval based on the inverted index and the binary phrases to be retrieved comprises:

One or more embodiments provide a server, which is connected with a database system and used for executing the method for Chinese full text retrieval of the database.

One or more embodiments provide a database Chinese full text retrieval system, which comprises a client, a database system and the server; and the client receives the text to be retrieved and sends the text to the server.

The above one or more technical solutions have the following beneficial effects:

the text to be retrieved is split based on the binary word segmentation, all possible word groups in the text can be listed as much as possible, the problem of incomplete dictionaries in other Chinese word segmentation solutions is avoided, and network words and emerging words can be well recognized and retrieved;

after the text to be inserted into the database is received, binary word segmentation is performed and the inverted index file of the text is created at the same time, and each phrase is written into the inverted index file together with its position in the text as soon as it is split off during segmentation; compared with segmenting first and then counting the position of each phrase in the text, this saves one pass of reading the data and gives higher processing efficiency;

the present disclosure improves the general inverted index to include a three-level indexing mechanism: coding/letters-phrases-the position of the phrases in the document containing the phrases, and the position information is detailed as "the phrase is in the row + the position of the phrase in the row", when full-text retrieval is performed, the similarity between the text to be retrieved and each row of the text in the database can be directly and quickly counted according to the position information, so that the full-text similarity can be quickly calculated;

because binary word segmentation can bring the storage data volume to be large, in order to avoid the problem of cross-file writing and reading when the index file is written and when the retrieval is subsequently executed, the method also establishes a secondary index and a corresponding data table file based on common word statistics, can greatly increase the reading and writing speed of data, and improves the processing efficiency and the retrieval efficiency of the database text data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.

FIG. 1 is a flowchart illustrating an overall method for searching Chinese full text in a database according to an embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating a method for Chinese full-text search in a database according to an embodiment of the present disclosure;

FIG. 3 is a diagram illustrating a data structure of an inverted index according to an embodiment of the present disclosure;

FIG. 4 is a diagram illustrating an example of an inverted index with letters as a secondary index in the first embodiment of the disclosure;

FIG. 5 is a flowchart illustrating a method for searching Chinese full text in a database according to a second embodiment of the present disclosure;

FIG. 6 is a flowchart illustrating a method for full text search in a database according to a second embodiment of the present disclosure;

FIG. 7 is a block diagram of a Chinese full-text retrieval system for databases in the third and fourth embodiments of the present disclosure.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

Example one

The embodiment discloses a method for searching a Chinese full text in a database, which comprises the following steps as shown in fig. 1:

receiving text data to be inserted into a database;

Specifically, as shown in fig. 2, the method includes a process 1 of inserting new text data into a database, and a process 2 of performing full-text retrieval on a received text to be retrieved.

The process 1 of inserting new text data into a database comprises the following steps:

step 101: receiving text data to be inserted into a database; for example, enter "I am at Tiananmen, Beijing";

step 102: preprocessing the text data;

wherein the preprocessing includes removing non-text content, such as space, TAB, comma, and other special symbols.

Step 103: performing binary word segmentation on every two adjacent Chinese characters of the preprocessed text data into a group, and simultaneously creating an inverted index for the text data;

For example, if the text data is "Beijing Tiananmen", binary word segmentation yields {"Beijing", "Tianan", "Anmen"}.
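Steps 102 and 103 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that "Chinese character" means the CJK Unified Ideographs range U+4E00 to U+9FFF and that preprocessing discards everything outside that range; the source does not define either point. The function names `preprocess` and `bigrams` are hypothetical.

```python
from __future__ import annotations

import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
# Everything else (spaces, tabs, commas, other symbols) is treated as non-text.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def preprocess(text: str) -> str:
    """Step 102 (sketch): remove non-text content from the input."""
    return _NON_CHINESE.sub("", text)


def bigrams(text: str) -> list[str]:
    """Step 103 (sketch): group every two adjacent characters, in reading order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Usage: a five-character input yields four overlapping binary phrases.
phrases = bigrams(preprocess("北京 天安门，"))
```

Because the window slides one character at a time, a text of n characters yields n - 1 binary phrases, which is why the embodiment later has to deal with a comparatively large index.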

Take the PostgreSQL database as an example. The database natively provides word segmentation only for English and does not support Chinese word segmentation. PostgreSQL can support Chinese word segmentation through external plug-ins, and existing plug-ins such as zhparser and jieba currently do so, but segmentation with these plug-ins relies on the dictionary bundled with the plug-in. Because new network words appear quickly, a dictionary easily misses them, so dictionary-based segmentation can miss keywords and give poor retrieval results. Binary word segmentation, by contrast, lists all possible phrases in the text data to the greatest extent.

And creating an inverted index for the text data while performing binary word segmentation, wherein the inverted index structure comprises a tertiary index, the primary index is used for identifying the position of each binary phrase in the secondary index, the secondary index is used for recording each binary phrase and the position of the text in the tertiary index, and the tertiary index is used for recording the position information of the binary phrases in the text.

In one or more embodiments, the inverted index has a tree structure, and preferably, a B-tree structure is used in this embodiment.

In particular, the tree structure comprises a four-level tree structure. The first level tree structure is used for storing logic judgment conditions and judging the position of each binary phrase in the second level tree structure. The first level tree structure may correspond to a file in which logical decision conditions are stored. When the logic judgment condition is stored by adopting a file, the logic condition can be modified by calling the file, and the flexibility is high. The second-level tree structure corresponds to the first-level index and is used for identifying the position of each binary phrase in the second-level index; the third-level tree structure corresponds to the second-level index, and records each binary phrase and the position of the text in the third-level index; and the fourth-level tree structure corresponds to the third-level index, and records the text information of the corresponding binary phrase and the position information of the binary phrase in the text. The second-level tree structure and the third-level tree structure can be physically stored in the same data table file or different data table files. As shown in fig. 3, which is a schematic diagram of a four-level tree structure, in order to clearly express the relationship between the tree structures at different levels, it should be noted that the depth and the breadth of the tree structures at different levels in the diagram are merely examples, and the tree structures at different levels can be expanded according to specific data.

In one or more embodiments, the second level tree structure may be hexadecimal encoded. The second level tree structure and the third level tree structure may be physically stored in the same data table file, or may be stored in different data table files.
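The three-level index described above can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: nested dictionaries stand in for the B-tree, the primary key is taken to be the first two hexadecimal digits of the code point of the phrase's first character (one possible reading of "hexadecimal encoded"), and the class and method names are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


class ThreeLevelIndex:
    """Sketch: primary key -> binary phrase -> text id -> [(line, position)]."""

    def __init__(self) -> None:
        self.levels = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    @staticmethod
    def primary_key(phrase: str) -> str:
        # Assumed encoding: leading two hex digits of the first character's
        # code point. The source only says the level may be hexadecimal encoded.
        return format(ord(phrase[0]), "04x")[:2]

    def add(self, phrase: str, text_id: int, line: int, position: int) -> None:
        """Write one occurrence into the tertiary level."""
        key = self.primary_key(phrase)
        self.levels[key][phrase][text_id].append((line, position))

    def lookup(self, phrase: str, text_id: int) -> list[tuple[int, int]]:
        """Return the recorded (line, position) pairs without creating entries."""
        by_phrase = self.levels.get(self.primary_key(phrase), {})
        return list(by_phrase.get(phrase, {}).get(text_id, []))
```

The primary level only narrows the search to a group of phrases; the phrase itself is found at the secondary level, and the occurrences are read from the tertiary level.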

In the PostgreSQL database, the minimum management unit for disk storage and memory is the page, also known as a block. The size of a PostgreSQL page is generally 8 KB, which determines that the maximum size of a data table file is 1 GB. The word segmentation method adopted in this embodiment forms a group from every two adjacent Chinese characters, so the required storage capacity is large and a plurality of data table files are needed. In one or more embodiments, a plurality of data table files corresponding to the secondary index are created, and the primary index is a letter index (for example, the letter is included in the file name). For example, if a node at the third level corresponds to the binary phrase "i am" and its parent node is "w", the node "i am" is written into the data table file indexed by "w". With letter-indexed files, the corresponding phrase and its tertiary index can be located quickly during retrieval by using the letter as the entry point, so that the position information of the phrase in each text data can be written.

Although binary word segmentation allows this embodiment to cover all new words, it produces more phrases than dictionary-based segmentation. If the 26 letters are each used as a primary index, corresponding to 26 data table files, writing the position information of the phrases may involve frequent cross-file writes. For example, for "Tiananmen Square" (天安门广场), "Tianan" (天安) is written to the data table file corresponding to "t", "anmen" (安门) to the file corresponding to "a", "menguang" (门广) to the file corresponding to "m", and "guangchang" (广场) to the file corresponding to "g", which increases the operating burden on the system. To address this issue, in one or more embodiments, each data table file is indexed by a combination of letters, such as "a-d", "e-h", and the like.

Preferably, a letter combination may consist of non-consecutive letters. The combinations are set based on statistical analysis of the existing vocabulary: commonly used words are identified, and the letters they involve are placed in one group. For example, if "Tiananmen Square" is judged to be a commonly used word, the pinyin initials "t", "a", "m" and "g" of its binary phrases may be grouped together, written into the same primary index, and used to index the data table file corresponding to that primary index. This greatly reduces the frequency of cross-table writes and the data processing burden. Moreover, the indexes of the phrases related to commonly used words all reside in the same data table file, which improves search efficiency in the subsequent retrieval process.
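The effect of grouping letters by common words can be illustrated with a short sketch. It assumes a hypothetical lookup table `INITIAL` that maps each binary phrase to its pinyin initial (a real system would need a pinyin library or table, which the source does not name), and it merges letters with a simple union-find; the source does not prescribe a particular grouping algorithm.

```python
from __future__ import annotations

# Hypothetical pinyin-initial table, for illustration only.
INITIAL = {"天安": "t", "安门": "a", "门广": "m", "广场": "g", "我在": "w"}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def build_groups(common_words: list[list[str]]) -> dict[str, str]:
    """Map each letter to a group id so that the initials of all binary
    phrases of one common word share the same group (data table file)."""
    parent = {c: c for c in ALPHABET}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for phrases in common_words:
        root = find(INITIAL[phrases[0]])
        for phrase in phrases[1:]:
            parent[find(INITIAL[phrase])] = root
    return {c: find(c) for c in ALPHABET}


def files_touched(phrases: list[str], group_of: dict[str, str]) -> set[str]:
    """Return the set of data table files written for these phrases."""
    return {group_of[INITIAL[p]] for p in phrases}


word = ["天安", "安门", "门广", "广场"]           # binary phrases of 天安门广场
per_letter = {c: c for c in ALPHABET}            # one file per letter
grouped = build_groups([word])                   # letters merged by common word
```

With one file per letter the four phrases are spread over four files; with the grouped index they all land in a single file, which is the reduction in cross-file writes that the paragraph above describes.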

Step 104: and in the word segmentation process, for each binary phrase obtained by word segmentation, writing the binary phrase and the position information of the binary phrase in the text data into the inverted index.

Specifically, each binary phrase is written into the secondary index according to the primary index, and the position information of the binary phrase in the text data is written into the tertiary index corresponding to each binary phrase.

In one or more embodiments, a unique identification, such as a text ID, is created in the database for the text data as it is inserted.

In one or more embodiments, the position information of the binary phrase in the text includes: the corresponding line of the binary phrase in the text, and the position information in the line, can be represented by a two-dimensional array, for example, (1,4) represents the 4 th word of the 1 st line. If the third level tree structure includes multiple levels of nodes, the position information in the row may be used as a child node of the row.
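As a small illustration of the last sentence, the flat (line, position) pairs can be regrouped so that the positions become child nodes of their line. The function name `to_post_tree` and the sample values in the usage note are hypothetical; a plain dictionary stands in for the tree.

```python
from __future__ import annotations


def to_post_tree(positions: list[tuple[int, int]]) -> dict[int, list[int]]:
    """Group (line, position) pairs by line: line -> sorted positions."""
    tree: dict[int, list[int]] = {}
    for line, position in sorted(positions):
        tree.setdefault(line, []).append(position)
    return tree
```

For instance, `to_post_tree([(1, 7), (3, 4), (1, 1)])` groups the two occurrences on line 1 under one node and keeps line 3 as a separate node, so the number of children of a line node is the frequency of the phrase in that line.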

For example, for the text data "i am at Tiananmen Square": "(1, 1)" is written into the tertiary index corresponding to the phrase "i am", "(1, 2)" into the tertiary index corresponding to the phrase "Tianan", "(1, 3)" into the tertiary index corresponding to the phrase "Anmen", and so on.

Steps 101 to 104 above constitute the process of inserting new text data into the database. The received new text data is segmented, and the index information is written while the data is being inserted, which greatly improves data processing efficiency.

In the PostgreSQL database, one piece of metadata may store multiple index files. FIG. 4 shows an example of inserting data for a data file. For example, the text data is: "The full-text database is a main component of the full-text search system. The full-text database is a data set formed by converting the whole content of a complete information source into information units which can be recognized and processed by a computer. The full-text database stores not only information but also ……" (the layout of the text is shown in the figure). After the database receives the text data, symbols other than text are first filtered out; then an inverted index structure (Entry Tree1) is created and binary word segmentation starts at the same time. As just one example, the primary index of the inverted index structure uses the sequential letter combinations "a, b, …, g", "h, i, …, n", "o, q, …, z". During segmentation, the phrases obtained, such as "full text", "literal", "data" and "database", are written in sequence into the secondary index according to the primary index, and as each phrase is written, its position in the text is written into the tertiary index. Taking the first phrase "full text" as an example, the phrase is written into the secondary index under the primary index "o, q, …, z", and its position "line 1, word 1" is obtained and written into the tertiary index. As segmentation proceeds, the phrase "full text" appears a second time, and the position "line 1, word 7" is obtained and written into the same tertiary index ……, and so on until segmentation is complete and all phrases obtained by segmentation and their corresponding positions have been written into the index file.

In the figure, the tertiary index uses a multi-level tree structure (i.e. Post Tree): the first level represents all lines in which the phrase occurs (the phrase "full text" appears in lines 1, 3 and 8), the second level represents each line containing the phrase, and the third level represents the position of the phrase in the corresponding line (line 1, line 7, line 3, line 4, line 8, line 6). It should be noted that the drawing does not show the index positions of all phrases obtained by segmentation; those skilled in the art should be able to understand the process of inserting text data into the database in this embodiment from the description of the embodiment and the parts shown in the drawing. In this way, binary word segmentation and the writing of the inverted index are completed simultaneously.

Because the binary word segmentation divides two adjacent Chinese characters into a group in sequence, the word segmentation can be executed in the reading process without executing the word segmentation on the text data based on a dictionary or a statistical method after the whole text data is read in, and the position of the current binary word group can be simultaneously obtained and written into the inverted index, so that the processing efficiency of the data when new data is input into the database is greatly improved.
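The single-pass behaviour described above can be sketched as follows: each character is read once, and as soon as two adjacent Chinese characters are available the binary phrase and its (line, position) pair are written. The sketch makes several assumptions that the source does not state: positions are 1-based and count only the characters kept after preprocessing, binary phrases do not span line breaks, and a plain dictionary stands in for the tertiary index. The sample line is an assumed back-translation of the beginning of the FIG. 4 text.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def is_chinese(ch: str) -> bool:
    # Assumption: CJK Unified Ideographs U+4E00..U+9FFF only.
    return "\u4e00" <= ch <= "\u9fff"


def index_text(lines: Iterable[str]) -> dict[str, list[tuple[int, int]]]:
    """Segment and index in one pass: phrase -> [(line, position), ...]."""
    postings: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        previous = None   # last Chinese character seen on this line
        position = 0      # 1-based position of the current character
        for ch in line:
            if not is_chinese(ch):
                continue  # preprocessing folded into the same pass
            position += 1
            if previous is not None:
                # The phrase starts at the previous character's position.
                postings[previous + ch].append((line_no, position - 1))
            previous = ch
    return dict(postings)


# Assumed Chinese form of the opening of the sample text.
postings = index_text(["全文数据库是全文检索系统", "全文"])
```

Under these assumptions the phrase 全文 is recorded at position 1 and position 7 of line 1, matching the "line 1, word 1" and "line 1, word 7" entries of the example, without the text being read a second time.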

The process 2 for performing full-text retrieval on the received text to be retrieved comprises the following steps:

step 201: receiving a text to be retrieved, and performing binary word segmentation on the text to be retrieved to obtain a plurality of binary word groups to be retrieved;

step 202: for each text data in a database, according to the inverted index corresponding to the text data, counting the frequency of the binary phrases to be retrieved according to lines;

step 203: calculating the similarity of the text to be retrieved and each line of the text data according to the frequency;

step 204: and summarizing the similarity of the text to be retrieved and each line of the text data to obtain the similarity of the text to be retrieved and the text data.

Step 205: and sorting the text data in the database from high to low according to the similarity, and outputting.

In this embodiment, the inverted index has a three-level index structure, and the position information written in the three-level index has a data structure of "line + position", so that when the retrieval is performed, the position of the word group in the text to be retrieved in each text data in the database can be quickly found, and the frequency of each word group in each line of each text data can be quickly counted, thereby quickly calculating the similarity between the text to be retrieved and each text data.
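Steps 201 to 205 can be sketched as below. The source does not give a similarity formula, so the one used here is purely an assumption: the similarity of a line is the number of occurrences of the query's binary phrases in that line divided by the number of query phrases, and the similarity of a text is the sum over its lines. All function names are hypothetical, and a dictionary of postings per text stands in for the inverted index.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def bigrams(text: str) -> list[str]:
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_postings(lines: list[str]) -> dict[str, list[tuple[int, int]]]:
    """Index one text: phrase -> [(line, position), ...], both 1-based."""
    postings = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for position, phrase in enumerate(bigrams(line), start=1):
            postings[phrase].append((line_no, position))
    return postings


def line_frequencies(postings, query_phrases: list[str]) -> Counter:
    """Step 202: count query-phrase occurrences per line from positions alone."""
    freq = Counter()
    for phrase in set(query_phrases):
        for line_no, _position in postings.get(phrase, []):
            freq[line_no] += 1
    return freq


def rank(indexes: dict[int, dict], query: str) -> list[tuple[int, float]]:
    """Steps 201 and 203-205 with an assumed similarity formula."""
    query_phrases = bigrams(query)                       # step 201
    scores = {}
    for text_id, postings in indexes.items():
        freq = line_frequencies(postings, query_phrases)
        # Steps 203-204 (assumption): per-line hits / number of query phrases,
        # summed over all lines of the text.
        scores[text_id] = sum(n / len(query_phrases) for n in freq.values())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # step 205


indexes = {
    1: build_postings(["全文数据库", "数据集合"]),
    2: build_postings(["天安门广场"]),
}
ranked = rank(indexes, "全文数据")
```

The point of the sketch is that `line_frequencies` never touches the text itself: because the tertiary index already stores "line + position", the per-line counts are obtained from the index alone.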

Example two

As a variation of the first embodiment, the present embodiment provides a method for full text search in a database, as shown in fig. 5, including the following steps:

a method for searching Chinese full text in a database is characterized by comprising the following steps:

creating an inverted index structure in advance;

receiving text data to be inserted into a database;

Different from the first embodiment, in the present embodiment a data table file and a corresponding inverted index structure are created in advance. The inverted index structure includes a three-level index, where the first-level index is used to identify a position of each binary phrase in the second-level index, the second-level index is used to record each binary phrase and a position of the text in the third-level index, and the third-level index is used to record position information of the binary phrases in the text, where the position information includes a row including the binary phrase and a position in the row. The specific steps are shown in fig. 6.

The insert new data file process 3, comprising the steps of:

step 301: receiving text data to be inserted into a database;

step 302: preprocessing the text data;

step 303: performing binary word segmentation on every two adjacent Chinese characters of the preprocessed text data into a group, and simultaneously creating an inverted index for the text data;

step 304: and in the word segmentation process, writing the obtained multiple binary phrases into secondary indexes according to the primary indexes, and simultaneously writing the position information of the binary phrases in the text data into the tertiary indexes corresponding to the binary phrases.

The retrieval process 4 includes the following steps:

step 401: receiving a text to be retrieved, and performing binary word segmentation on the text to be retrieved to obtain a plurality of binary word groups to be retrieved;

step 402: for each text data in a database, according to the inverted index corresponding to the text data, counting the frequency of the binary phrases to be retrieved according to lines;

step 403: calculating the similarity of the text to be retrieved and each line of the text data according to the frequency;

step 404: summarizing the similarity of a text to be retrieved and each line of the text data to obtain the similarity of the text to be retrieved and the text data;

step 405: and sorting the text data in the database from high to low according to the similarity, and outputting.

For the specific implementation of the above steps, refer to the description of the corresponding parts of the first embodiment.

Example three

Based on the retrieval method of the first embodiment, the embodiment provides a Chinese full-text retrieval system of a database.

As shown in FIG. 7, the Chinese full-text retrieval system for a database comprises a client, a database system and a server, wherein:

the client receives a text to be retrieved input by a user, generates a retrieval request and sends the retrieval request to the server;

a server, coupled to the database system, configured to: receiving text data, inserting the text data into a database, and generating an inverted index corresponding to the text data, specifically comprising:

step 101: receiving text data to be inserted into a database;

step 102: preprocessing the text data;

step 103: performing binary word segmentation on every two adjacent Chinese characters of the preprocessed text data into a group, and simultaneously creating an inverted index for the text data; the inverted index structure comprises three levels of indexes, wherein the first level of indexes are used for identifying the positions of all binary phrases in the second level of indexes, the second level of indexes are used for recording each binary phrase and the position of the text in the third level of indexes, and the third level of indexes are used for recording the position information of the binary phrases in the text;

step 104: in the word segmentation process, for each binary word group obtained by word segmentation, respectively writing the secondary indexes according to the primary indexes, and simultaneously writing the position information of the binary word group in the text data in the tertiary indexes corresponding to the binary word groups.

The server further configured to: receiving the text to be retrieved, and executing full-text retrieval in the database, wherein the method specifically comprises the following steps:

step 204: summarizing the similarity of a text to be retrieved and each line of the text data to obtain the similarity of the text to be retrieved and the text data;

Example four

Based on the retrieval method of the second embodiment, the embodiment provides a Chinese full-text retrieval system for a database.

A data table file and a corresponding inverted index structure are created in advance in the server. The inverted index structure comprises a three-level index, wherein the first-level index is used for identifying the position of each binary phrase in the second-level index, the second-level index is used for recording each binary phrase and the position of the text in the third-level index, and the third-level index is used for recording the position information of the binary phrases in the text, the position information comprising the line containing the phrase and the position within that line.

Step 301: receiving text data to be inserted into a database;

step 302: preprocessing the text data;

Receiving the text to be retrieved, performing full-text retrieval in the database, and configured to:

One or more of the above embodiments have the following technical effects:

because binary word segmentation can bring the storage data volume to be large, in order to avoid the problem of cross-file writing and reading when the index file is written and when the retrieval is subsequently executed, the method and the system establish the secondary index and the corresponding data table file based on the common word statistics, can greatly increase the reading and writing speed of the data, and improve the processing efficiency and the retrieval efficiency of the database text data.

Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims

1. A method for searching Chinese full text in a database is characterized by comprising the following steps:

receiving text data to be inserted into a database;

in the database, full-text retrieval is executed based on the inverted index and the binary phrases to be retrieved;

the inverted index comprises three levels of indexes, wherein the first level of index is used for identifying the position of each binary phrase in the second level of index, the second level of index is used for recording each binary phrase and the position of the text in the third level of index, the third level of index is used for recording the position information of the binary phrases in the text, and the position information comprises the line of the binary phrases in the text and the position of the line.

2. The method of claim 1, wherein the primary index is a code; or

The primary index is a letter or a letter combination and corresponds to a plurality of data table files indexed by the letter or the letter combination;

3. The method of claim 1, wherein performing full-text search based on the inverted index and the plurality of binary phrases to be searched comprises:

4. A method for searching Chinese full text in a database is characterized by comprising the following steps:

creating an inverted index structure in advance;

receiving text data to be inserted into a database;

5. The method as claimed in claim 4, wherein the primary index is a code; or

6. The method as claimed in claim 4, wherein the full text search of the text to be searched in the database comprises:

7. A server connected to a database system, for performing a method for Chinese full text search in a database according to any of claims 1-3 or 4-6.

8. A database Chinese full text retrieval system comprising a client, a database system and a server as claimed in claim 7; and the client receives the text to be retrieved and sends the text to the server.