CN109885641A - Method and system for Chinese full-text retrieval in a database - Google Patents

Info

Publication number
CN109885641A
CN109885641A CN201910053622.3A CN201910053622A CN109885641A CN 109885641 A CN109885641 A CN 109885641A CN 201910053622 A CN201910053622 A CN 201910053622A CN 109885641 A CN109885641 A CN 109885641A
Authority
CN
China
Prior art keywords
text
retrieved
database
binary
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910053622.3A
Other languages
Chinese (zh)
Other versions
CN109885641B (en)
Inventor
卢健
姜瑞海
王硕
张龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Han Gao Foundation Software Ltd By Share Ltd
Original Assignee
Han Gao Foundation Software Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Han Gao Foundation Software Ltd By Share Ltd filed Critical Han Gao Foundation Software Ltd By Share Ltd
Priority to CN201910053622.3A priority Critical patent/CN109885641B/en
Publication of CN109885641A publication Critical patent/CN109885641A/en
Application granted granted Critical
Publication of CN109885641B publication Critical patent/CN109885641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for Chinese full-text retrieval in a database. The method comprises the following steps: receiving text to be retrieved; performing binary word segmentation on the text to be retrieved, taking every two Chinese characters as one group, to obtain multiple binary phrases, and inserting them into a data table file; creating an inverted index for the data table file, the inverted index including a position index for each binary phrase, into which the position information of the corresponding phrase in each text data item in the database is written during retrieval, the position information comprising the row containing the phrase and the position within that row; and performing full-text retrieval on the text to be retrieved in the database according to the multiple binary phrases. Because the retrieval method needs no dictionary, it retrieves new words better, and by introducing a multi-level index mechanism it achieves higher retrieval efficiency.

Description

Method and system for Chinese full-text retrieval in a database
Technical field
The disclosure belongs to the field of data retrieval technology, and in particular relates to a method and system for Chinese full-text retrieval in a database.
Background art
Full-text retrieval is a very common information query application and is one of the core technologies of online search engines. A full-text retrieval product is essentially a database product with embedded full-text retrieval technology. Chinese full-text retrieval involves Chinese word segmentation.
Current mainstream Chinese word segmentation methods fall into two main classes: methods based on string matching and methods based on statistics. A string-matching method matches the Chinese character string to be analyzed against the entries of a dictionary; if a string is found in the dictionary, a word is considered identified. This approach needs a "sufficiently complete" dictionary, but because new internet words appear quickly, dictionary updates struggle to keep pace. If the text to be retrieved contains a new internet word that is not in the dictionary, the word cannot be segmented correctly, so texts containing it cannot be retrieved, which leads to missed results.
A statistics-based method segments text according to the frequency or probability with which adjacent characters co-occur. It only needs to count character-group frequencies in a corpus and does not need a dictionary. However, it often extracts character groups that co-occur frequently but are not words. It recognizes new words to some extent, but its accuracy on common words is poor, the computation is time-consuming, and the segmentation produces a larger amount of data, which affects the efficiency of subsequent retrieval.
On top of segmentation, database products commonly use an inverted index to speed up retrieval. Specifically, after receiving a data file to be inserted, the database first reads the file to perform Chinese word segmentation, and after segmentation reads it a second time to obtain the position of each phrase in the file and write it to the inverted index. The data file is therefore read twice; when the file is large or a large volume of data is inserted into the database, this approach is computationally expensive and inefficient. Moreover, a general inverted index stores only the row in which a phrase occurs. In that case, counting phrase frequencies during retrieval requires reading the data out again before similarity can be computed, so retrieval efficiency is low.
Summary of the invention
To overcome the above deficiencies of the prior art, the disclosure provides a method and system for Chinese full-text retrieval in a database. The retrieval method uses binary word segmentation, which better identifies and retrieves new internet words, and it improves the inverted index so that a multi-level index mechanism can locate the corresponding phrase quickly, making retrieval more efficient.
To achieve the above object, one or more embodiments of the disclosure provide the following technical solutions:
A method for Chinese full-text retrieval in a database, comprising the following steps:
Receiving text data to be inserted into the database;
Performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
During segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data into the inverted index;
Receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
In the database, performing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Further, the inverted index includes a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the position information of the binary phrase in the text, the position information comprising the row of the text containing the binary phrase and the position within that row.
Further, the first-level index is a code; or
the first-level index is a letter or a letter combination, corresponding to multiple data table files indexed by that letter or letter combination.
Further, the letter combination is derived from statistics on common words.
Further, performing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved includes:
Receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
For each text data item in the database, counting by row, according to the inverted index corresponding to the text data, the frequency with which each of the multiple binary phrases to be retrieved occurs;
Calculating, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
Aggregating the similarities between the text to be retrieved and each row of the text data to obtain the similarity between the text to be retrieved and the text data;
Sorting the text data in the database by similarity from high to low and outputting it.
One or more embodiments provide a method for Chinese full-text retrieval in a database, comprising the following steps:
Creating an inverted index structure in advance;
Receiving text data to be inserted into the database;
Performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group;
During segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data into the inverted index;
Receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
In the database, performing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Further, the inverted index includes a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the position information of the binary phrase in the text, the position information comprising the row of the text containing the binary phrase and the position within that row.
Further, the first-level index is a code; or
the first-level index is a letter or a letter combination, corresponding to multiple data table files indexed by that letter or letter combination.
Further, the letter combination is derived from statistics on common words.
Further, performing full-text retrieval on the text to be retrieved in the database, based on the inverted index and the multiple binary phrases to be retrieved, includes:
Receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
For each text data item in the database, counting by row, according to the inverted index corresponding to the text data, the frequency with which each of the multiple binary phrases to be retrieved occurs;
Calculating, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
Aggregating the similarities between the text to be retrieved and each row of the text data to obtain the similarity between the text to be retrieved and the text data;
Sorting the text data in the database by similarity from high to low and outputting it.
One or more embodiments provide a server that is connected to a database system and executes the method for Chinese full-text retrieval in a database.
One or more embodiments provide a database Chinese full-text retrieval system, including a client, a database system, and the server; the client receives the text to be retrieved and sends it to the server.
The above one or more technical solutions have the following beneficial effects:
In the disclosure, the text to be retrieved is split by binary word segmentation, which sets out as far as possible all the phrases that may occur in the text. This avoids the incomplete-dictionary problem of other Chinese word segmentation solutions, so internet vocabulary and newly emerging words can also be identified and retrieved well;
After receiving text to be inserted into the database, the disclosure performs binary word segmentation while creating the inverted index file for the text, and during segmentation writes each phrase just split out, together with its position in the text, into the inverted index. Compared with segmenting first and then counting the position of each phrase after segmentation, writing the index file while segmenting saves one pass over the data, so processing is more efficient;
The disclosure improves the general inverted index so that it includes a three-level index mechanism: code or letter, then phrase, then the position of the phrase in each document containing it. The position information is refined to "the row containing the phrase + the position of the phrase in that row". During full-text retrieval, the similarity between the text to be retrieved and each row of a text in the database can be computed quickly and directly from this position information, so that the full-text similarity is computed quickly;
Because binary word segmentation produces a large amount of stored data, and in order to avoid cross-file writes when the index file is written and cross-file reads when retrieval is subsequently executed, the disclosure also establishes the second-level index and the corresponding data table files based on common-word statistics, which can greatly increase data read and write speed and improve the processing and retrieval efficiency of database text data.
Description of the drawings
The accompanying drawings, which form a part of the disclosure, are provided for further understanding of the disclosure. The illustrative embodiments of the disclosure and their descriptions are used to explain the disclosure and do not unduly limit it.
Fig. 1 is the overall flowchart of the database Chinese full-text retrieval method in Embodiment one of the disclosure;
Fig. 2 is the detailed flowchart of the database Chinese full-text retrieval method in Embodiment one of the disclosure;
Fig. 3 is a schematic diagram of the data structure of the inverted index in Embodiment one of the disclosure;
Fig. 4 is an example of an inverted index in which the second-level index is a letter, in Embodiment one of the disclosure;
Fig. 5 is the overall flowchart of the database Chinese full-text retrieval method in Embodiment two of the disclosure;
Fig. 6 is the detailed flowchart of the database Chinese full-text retrieval method in Embodiment two of the disclosure;
Fig. 7 is a block diagram of the database Chinese full-text retrieval system in Embodiments three and four of the disclosure.
Detailed description of the embodiments
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the disclosure belongs.
It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments according to the disclosure. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
Where no conflict arises, the embodiments of the disclosure and the features in the embodiments may be combined with one another.
Embodiment one
This embodiment discloses a method for Chinese full-text retrieval in a database which, as shown in Fig. 1, comprises the following steps:
Receiving text data to be inserted into the database;
Performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
During segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data into the inverted index;
Receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
In the database, performing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Specifically, as shown in Fig. 2, the method includes process 1, which inserts new text data into the database, and process 2, which executes full-text retrieval for a received text to be retrieved.
Process 1, inserting new text data into the database, includes:
Step 101: receive the text data to be inserted into the database, for example the input "I am in Beijing Tian An-men";
Step 102: preprocess the text data;
The preprocessing includes removing non-text content, such as spaces, tabs, commas, and other special characters.
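The preprocessing of step 102 can be illustrated with a minimal sketch. The function name `preprocess` and the choice of the Unicode range U+4E00 to U+9FFF as "Chinese characters" are assumptions made for illustration; the patent does not specify a character range. Rows are kept separate because the later index records row numbers.

```python
import re
from typing import List

# Assumption: "Chinese character" is approximated by the CJK Unified
# Ideographs block U+4E00..U+9FFF; the patent does not name a range.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def preprocess(text: str) -> List[str]:
    """Return the rows of `text` with every non-Chinese character removed."""
    return [NON_CHINESE.sub("", row) for row in text.splitlines()]

rows = preprocess("我在 北京,\t天安门\n全文 检索。")
print(rows)  # ['我在北京天安门', '全文检索']
```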
Step 103: perform binary word segmentation on the preprocessed text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
For example, if the text data is "北京天安门" ("Beijing Tian An-men"), binary word segmentation yields {"北京" (Beijing), "京天" (capital day), "天安" (Tian An), "安门" (peace door)}.
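Binary (bigram) segmentation as described above amounts to sliding a two-character window over the text. The following is a minimal sketch; the function name `bigrams` is hypothetical and the input is assumed to be already preprocessed.

```python
def bigrams(text):
    """Split `text` into overlapping two-character phrases."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("北京天安门"))  # ['北京', '京天', '天安', '安门']
```

A text of n characters yields n - 1 phrases, which is why this approach stores more data than dictionary-based segmentation.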
Taking the PostgreSQL database as an example, the database itself only provides segmentation of English words and does not support segmentation of Chinese words. PostgreSQL can support Chinese word segmentation through external plug-ins; existing plug-ins such as zhparser and jieba support it, but segmenting with these plug-ins relies on their built-in dictionaries. Because new internet words appear quickly, a dictionary easily misses them, so dictionary-based segmentation can miss key words and degrade retrieval quality. Binary word segmentation, by contrast, sets out as far as possible all the phrases that may occur in the text data. By segmenting with the binary method, this embodiment avoids using a dictionary and thus avoids the problem of new words missing from the dictionary.
The inverted index is created for the text data while binary word segmentation is performed. The inverted index structure includes a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the position information of the binary phrase in the text.
In one or more embodiments, the inverted index uses a tree structure; preferably, this embodiment uses a B-tree.
Specifically, the tree structure is a four-level tree. The first-level tree structure stores logical judgment conditions, which determine the position of each binary phrase in the second-level tree structure. The first-level tree structure may correspond to a file in which the logical judgment conditions are stored; when the conditions are stored in a file, they can be modified by opening that file, which gives high flexibility. The second-level tree structure corresponds to the first-level index and identifies the position of each binary phrase in the second-level index. The third-level tree structure corresponds to the second-level index and records each binary phrase and the position of the text in the third-level index. The fourth-level tree structure corresponds to the third-level index and records the text information of the corresponding binary phrase and the position information of the binary phrase in the text. The second-level and third-level tree structures may physically be stored in the same data table file or in different data table files. Fig. 3 is a schematic diagram of the four-level tree structure, intended to show clearly the relationship between the tree structures at each level; note that the depth and breadth of each level in the figure are only illustrative and can be extended according to the actual data.
In one or more embodiments, the second-level tree structure may use hexadecimal codes.
In a PostgreSQL database, the minimum unit of management both on disk and in memory is the page, also commonly called a block. A PG page is generally 8K in size, which determines that a data table file can be at most 1G. Because the segmentation method used in this embodiment groups every two adjacent Chinese characters, the required storage is large and multiple data table files are needed. For this reason, in one or more embodiments, multiple data table files are created for the second-level index, each indexed by a letter (for example, the file name contains the letter), and the letter serves as the first-level index. For example, if a node in the third layer corresponds to the binary phrase "I", its parent node is "w", and "I" is written into the data table file indexed by "w". Indexing the index files by letter lets retrieval enter by letter and quickly locate the corresponding phrase and its third-level index, so that the position information of the phrase in each text data item can be written.
Although binary word segmentation lets this embodiment cover all new words, it produces more phrases than dictionary-based segmentation. If the 26 letters each serve as a first-level index with 26 corresponding data table files, writing phrase position information may require frequent cross-file writes. For "Tian'anmen Square", for example, "Tian An" is written to the data table file for "t", "peace door" to the file for "a", "Men Guang" to the file for "m", and "square" to the file for "g", which increases the computational burden on the system. To solve this problem, in one or more embodiments each data table file is indexed by a combination of several letters, such as "a-d" or "e-h".
Preferably, the letter combination may consist of non-consecutive letters and is set based on statistical analysis of existing vocabulary: common words are identified, and the letters occurring in the more common words are placed in one group. For example, if "Tian'anmen Square" is judged to be a common word, the pinyin initials "t", "a", "m", and "g" of its binary phrases can be placed in one group and written to the same first-level index, and the data table file corresponding to that first-level index is indexed by it. This greatly reduces the frequency of cross-table writes and lowers the data processing load. Moreover, because the indexes of the phrases belonging to a common word are in the same data table file, lookup efficiency during subsequent retrieval improves.
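The effect of grouping letters can be sketched by counting how many data table files a single common word touches under each scheme. Everything named here is hypothetical: the `INITIAL` lookup is a hand-written stand-in for a real pinyin table, and the file names `tamg` and `rest` are invented for the demo. Only the initials "t", "a", "m", "g" come from the text above.

```python
# Hypothetical first-letter lookup for the demo only; a real system would
# derive the pinyin initial of each phrase from a pinyin table.
INITIAL = {"天安": "t", "安门": "a", "门广": "m", "广场": "g"}

def files_touched(phrases, groups):
    """Return the set of index files written for `phrases`.

    `groups` maps a file name to the set of initials it indexes.
    """
    touched = set()
    for phrase in phrases:
        letter = INITIAL[phrase]
        for name, letters in groups.items():
            if letter in letters:
                touched.add(name)
                break
    return touched

phrases = ["天安", "安门", "门广", "广场"]
# Scheme 1: one file per letter (26 files).
per_letter = {c: {c} for c in "abcdefghijklmnopqrstuvwxyz"}
# Scheme 2: the initials of a common word share one file.
grouped = {"tamg": set("tamg"), "rest": set("bcdefhijklnopqrsuvwxyz")}

print(len(files_touched(phrases, per_letter)))  # 4
print(len(files_touched(phrases, grouped)))     # 1
```

With one file per letter the four phrases land in four files; with the non-consecutive group they land in one, which is the reduction in cross-file writes the paragraph describes.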
Step 104: during segmentation, for each binary phrase obtained, write the binary phrase and its position information in the text data into the inverted index.
Specifically, each binary phrase is written to the second-level index according to the first-level index, and at the same time the position information of the binary phrase in the text data is written to the third-level index corresponding to that binary phrase.
In one or more embodiments, when text data is inserted, a unique identifier, such as a text ID, is created for it in the database.
In one or more embodiments, the position information of a binary phrase in the text includes the row of the text in which the binary phrase occurs and the position within that row. It can be represented as a two-element array; for example, (1,4) denotes the 4th word of row 1. If the third-level tree structure includes multiple levels of nodes, the position within the row can be a child node of the row.
For example, for the text data "I am in the Tian'anmen Square", "(1,1)" is written to the third-level index corresponding to the phrase "I", "(1,2)" is written to the third-level index corresponding to the phrase "Tian An", "(1,3)" is written to the third-level index corresponding to the phrase "peace door", and so on.
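As a rough illustration of step 104, the sketch below records a 1-based (row, position) pair for every binary phrase of a text, keyed by a text ID. The names `index_text` and `doc1` are hypothetical, and a nested dictionary stands in for the on-disk index levels, which the patent stores in data table files rather than in memory.

```python
from collections import defaultdict

def index_text(index, doc_id, text):
    """Segment `text` row by row and record (row, position) per phrase.

    `index` maps phrase -> doc_id -> list of (row, position), both 1-based.
    """
    for row_no, row in enumerate(text.splitlines(), start=1):
        for pos in range(len(row) - 1):
            index[row[pos:pos + 2]][doc_id].append((row_no, pos + 1))

index = defaultdict(lambda: defaultdict(list))
index_text(index, "doc1", "北京天安门")
print(index["安门"]["doc1"])  # [(1, 4)]
```

Here (1, 4) follows the convention stated above: row 1, 4th phrase position in that row.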
Steps 101 to 104 above are the process of inserting new text data into the database. The received text data is segmented and the index information is written as the data is inserted, which greatly improves data processing efficiency.
In a PostgreSQL database, one metapage can store multiple index files. Fig. 4 shows an example of inserting data into a data file. Suppose the text data is "A full-text database is the main component of a text retrieval system. A so-called full-text database is the data set formed by converting the entire content of a complete information source into information units that a computer can recognize and process. A full-text database not only stores information, but also ...." (the format of the text is as shown in the figure). After the database receives the text data, it first filters out all symbols other than Chinese characters, then creates the inverted index structure (Entry Tree1) and starts binary word segmentation at the same time. Purely as an example, the first-level index of this inverted index structure uses the consecutive letter combinations "a, b, ..., g", "h, i, ..., n", and "o, q, ..., z". The phrases obtained during segmentation, "full text", "text number", "data", "according to library", are written in turn to the second-level index according to the first-level index, and as each phrase is written, its position in the text is written to the third-level index. Take the first phrase, "full text": the phrase is written to the second-level index under the first-level index "o, q, ..., z", and its position, "row 1, word 1", is written to the third-level index. As segmentation proceeds, the phrase "full text" occurs a second time, and its position, "row 1, word 7", is written to the same third-level index, and so on until segmentation is complete and every phrase obtained and its positions have been written to the index file. In the figure the third-level index uses a multi-level tree structure (the Post Tree): the first level lists all rows in which the phrase occurs (the phrase "full text" appears in rows 1, 3, and 8), the second level represents each row in which the phrase occurs, and the third level gives the positions of the phrase in the corresponding row (words 1 and 7 of row 1, word 4 of row 3, and word 6 of row 8). Note that the figure does not show the index positions of all phrases obtained by segmentation; those skilled in the art should be able to understand the process of inserting text data into the database in this embodiment from the description and the partial illustration in the figure. At this point, binary word segmentation and the writing of the inverted index have been completed at the same time.
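The row-then-position layout of the Post Tree can be pictured with a small sketch that uses the positions given for the phrase "full text" in the example above. The function name `build_post_tree` is hypothetical, and a plain dictionary stands in for the tree nodes; the actual on-disk layout is not specified here.

```python
def build_post_tree(positions):
    """Group (row, position) pairs into {row: [positions...]}."""
    tree = {}
    for row, pos in positions:
        tree.setdefault(row, []).append(pos)
    return tree

# Positions of the phrase "full text" from the example: row 1 words 1 and 7,
# row 3 word 4, row 8 word 6.
post_tree = build_post_tree([(1, 1), (1, 7), (3, 4), (8, 6)])

print(sorted(post_tree))                            # [1, 3, 8]
print({r: len(p) for r, p in post_tree.items()})    # {1: 2, 3: 1, 8: 1}
```

Because positions are grouped under their row, the per-row frequency of a phrase is simply the number of children of the row node, with no need to reread the text.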
Because binary word segmentation simply splits the text in order, taking every two adjacent Chinese characters as one group, there is no need to read in the entire text data and then segment it with a dictionary-based or statistical method. Segmentation can instead be performed while the data is being read, and the position of the current binary phrase can be obtained and written to the inverted index at the same time, which greatly improves data processing efficiency when new data is entered into the database.
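The single-pass property can be sketched with a generator that emits each binary phrase together with its position as characters are read, so the input is never read a second time. The name `stream_bigrams` is hypothetical, and an in-memory `io.StringIO` stands in for the incoming data file.

```python
import io

def stream_bigrams(lines):
    """Yield (phrase, row, position) while reading `lines` once."""
    for row_no, line in enumerate(lines, start=1):
        prev = None   # previous character in the current row
        pos = 0       # 1-based position of the phrase within the row
        for ch in line.rstrip("\n"):
            if prev is not None:
                pos += 1
                yield prev + ch, row_no, pos
            prev = ch

source = io.StringIO("北京天安门\n全文检索\n")
for phrase, row, pos in stream_bigrams(source):
    print(phrase, row, pos)
```

Only the previous character needs to be remembered at any moment, so each phrase and its position are available for writing to the index as soon as the second character arrives.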
Process 2, executing full-text retrieval for a received text to be retrieved, comprises the following steps:
Step 201: receive the text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 202: for each text data item in the database, count by row, according to the inverted index corresponding to the text data, the frequency with which each of the multiple binary phrases to be retrieved occurs;
Step 203: calculate, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
Step 204: aggregate the similarities between the text to be retrieved and each row of the text data to obtain the similarity between the text to be retrieved and the text data.
Step 205: sort the text data in the database by similarity from high to low and output it.
The inverted index of this embodiment uses a three-level index structure, and the position information written in the third-level index uses a "row + position" data structure. When retrieval is executed, the positions of the phrases of the text to be retrieved in each text data item in the database can be found quickly, and the frequency of each phrase in every row of each text data item can be counted quickly, so that the similarity between the text to be retrieved and each text data item is computed quickly.
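Steps 201 to 205 can be sketched end to end as follows. The patent does not give a similarity formula, so this sketch assumes one purely for illustration: a row's similarity is the total number of query-phrase hits in that row, and a document's similarity is the sum over its rows. The names `build_index`, `search`, `d1`, and `d2` are hypothetical.

```python
from collections import defaultdict

def bigrams(text):
    """Split `text` into overlapping two-character phrases."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs):
    """Return phrase -> doc_id -> row -> list of 1-based positions."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_id, text in docs.items():
        for row_no, row in enumerate(text.splitlines(), start=1):
            for pos, phrase in enumerate(bigrams(row), start=1):
                index[phrase][doc_id][row_no].append(pos)
    return index

def search(index, query):
    """Rank documents by an ASSUMED similarity: summed per-row hit counts."""
    row_scores = defaultdict(lambda: defaultdict(int))  # doc -> row -> score
    for phrase in bigrams(query):                       # step 201
        for doc_id, rows in index.get(phrase, {}).items():
            for row_no, positions in rows.items():
                # steps 202/203: frequency per row feeds the row similarity
                row_scores[doc_id][row_no] += len(positions)
    # step 204: aggregate row similarities into a document similarity
    doc_scores = {d: sum(r.values()) for d, r in row_scores.items()}
    # step 205: sort from high to low
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {"d1": "全文检索系统\n数据库全文检索", "d2": "数据库系统"}
print(search(build_index(docs), "全文检索"))  # [('d1', 6)]
```

Documents with no matching phrase are simply absent from the result. The frequencies are read straight from the stored positions, so the document text itself is never reopened during retrieval.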
Embodiment two
As a variant of Embodiment one, this embodiment provides a method for Chinese full-text retrieval in a database which, as shown in Fig. 5, comprises the following steps:
Creating an inverted index structure in advance;
Receiving text data to be inserted into the database;
Performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group;
During segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data into the inverted index;
Receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
In the database, performing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Unlike Embodiment one, this embodiment creates the data table files and the corresponding inverted index structure in advance. The inverted index structure includes a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the position information of the binary phrase in the text, the position information comprising the row containing the phrase and the position within that row. The specific steps are shown in Fig. 6.
It is inserted into new data file process 3, comprising the following steps:
Step 301: receiving the text data that be inserted into database;
Step 302: the text data is pre-processed;
Step 303: being one group of carry out binary participle to pretreated text data each adjacent two Chinese character, simultaneously Inverted index is created for the text data;
Step 304: during participle, obtained multiple binary phrases are respectively written into secondary index according to level-one index, The location information of the binary phrase in the text data is written in each corresponding three level list of binary phrase simultaneously.
The retrieval process 4 comprises the following steps:
Step 401: receive the text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 402: for each text data item in the database, count row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
Step 403: calculate, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
Step 404: aggregate the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
Step 405: sort the text data in the database by similarity from high to low and output the result.
For the specific implementation of the above steps, see the corresponding description in Embodiment One.
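Steps 401 to 405 can be sketched in Python as follows. The similarity formulas here are placeholders chosen for illustration (row score = number of matching occurrences divided by the number of distinct query phrases; text score = sum of its row scores); the formulas actually used are those described in Embodiment One. The function names and the dictionary-based index are likewise assumptions, and the input is assumed to contain only Chinese characters after preprocessing.

```python
from collections import Counter, defaultdict


def bigrams(text):
    """All pairs of adjacent characters in a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(database):
    """phrase -> text_id -> list of (row, position); stands in for the inverted index."""
    index = defaultdict(lambda: defaultdict(list))
    for text_id, text in database.items():
        for row, line in enumerate(text.splitlines()):
            for pos, phrase in enumerate(bigrams(line)):
                index[phrase][text_id].append((row, pos))
    return index


def search(query, database, index):
    query_phrases = set(bigrams(query))  # step 401
    results = []
    for text_id in database:
        # Step 402: per-row frequency of the query phrases, read from the index.
        row_freq = Counter()
        for phrase in query_phrases:
            for row, _ in index.get(phrase, {}).get(text_id, []):
                row_freq[row] += 1
        # Steps 403-404 (assumed formulas): score each row, then sum the rows.
        score = sum(n / len(query_phrases) for n in row_freq.values())
        results.append((text_id, score))
    # Step 405: highest similarity first.
    return sorted(results, key=lambda item: item[1], reverse=True)


database = {"a": "数据库全文检索\n检索效率高", "b": "中文分词", "c": "全文检索"}
ranked = search("全文检索", database, build_index(database))
print([text_id for text_id, _ in ranked])  # ['a', 'c', 'b']
```

Since the index already stores the row of every occurrence, the per-row counts are obtained without rereading the stored texts.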
Embodiment Three
Based on the retrieval method of Embodiment One, this embodiment provides a database Chinese full-text retrieval system.
A database Chinese full-text retrieval system, as shown in Fig. 7, comprises a client, a database system, and a server, wherein:
The client receives the text to be retrieved entered by a user, generates a retrieval request, and sends it to the server;
The server is connected to the database system and is configured to receive text data, insert it into the database, and generate the inverted index corresponding to the text data, specifically:
Step 101: receive the text data to be inserted into the database;
Step 102: preprocess the text data;
Step 103: perform binary word segmentation on the preprocessed text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data; the inverted index structure comprises a three-level index, in which the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the location information of the binary phrase in the text;
Step 104: during segmentation, write each binary phrase obtained into the second-level index according to the first-level index, and at the same time write the location information of the binary phrase in the text data into the third-level index corresponding to that binary phrase.
The server is further configured to receive the text to be retrieved and execute full-text retrieval in the database, specifically:
Step 201: receive the text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 202: for each text data item in the database, count row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
Step 203: calculate, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
Step 204: aggregate the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
Step 205: sort the text data in the database by similarity from high to low and output the result.
For the specific implementation of the above steps, see the corresponding description in Embodiment One.
Embodiment Four
Based on the retrieval method of Embodiment Two, this embodiment provides a database Chinese full-text retrieval system.
A database Chinese full-text retrieval system, as shown in Fig. 7, comprises a client, a database system, and a server, wherein:
The client receives the text to be retrieved entered by a user, generates a retrieval request, and sends it to the server;
The server is connected to the database system and is configured to receive text data, insert it into the database, and generate the inverted index corresponding to the text data, specifically:
The data table files and the corresponding inverted index structure are created in advance in the server. The inverted index structure comprises a three-level index, in which the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the location information of the binary phrase in the text, namely the row containing the phrase and the position within that row.
Step 301: receive the text data to be inserted into the database;
Step 302: preprocess the text data;
Step 303: perform binary word segmentation on the preprocessed text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
Step 304: during segmentation, write each binary phrase obtained into the second-level index according to the first-level index, and at the same time write the location information of the binary phrase in the text data into the third-level index corresponding to that binary phrase.
The server is further configured to receive the text to be retrieved and execute full-text retrieval in the database, specifically:
Step 401: receive the text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 402: for each text data item in the database, count row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
Step 403: calculate, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
Step 404: aggregate the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
Step 405: sort the text data in the database by similarity from high to low and output the result.
For the specific implementation of the above steps, see the corresponding description in Embodiment One.
One or more of the above embodiments have the following technical effects:
In the present disclosure, the text to be retrieved is split by binary word segmentation, which produces as far as possible all phrases that may occur in the text. This avoids the incomplete-dictionary problem that can arise in other Chinese word segmentation solutions, so that internet vocabulary and newly coined words can also be identified and retrieved well;
After receiving the text to be inserted into the database, the present disclosure performs binary word segmentation and at the same time creates the inverted index file for the text; during segmentation, each phrase just split out and its position in the text are written to the inverted index. Compared with segmenting first and then counting the position of each phrase in the text afterwards, writing the index file while segmenting saves one pass of reading the data and is therefore more efficient;
The present disclosure improves on the general inverted index by giving it a three-level index mechanism: encoding/letter, phrase, and position of the phrase in the document containing it. The location information is refined into "the row containing the phrase + the position of the phrase in that row", so that during full-text retrieval the similarity between the text to be retrieved and each row of text in the database can be computed quickly and directly from the location information, and the full-text similarity can then be calculated quickly;
Because binary word segmentation produces a large volume of stored data, the present disclosure establishes the second-level index and the corresponding data table files based on common-word statistics, in order to avoid cross-file writes and reads when the index file is written and when retrieval is later executed. This can greatly increase the read and write speed of the data and improves the processing efficiency and retrieval efficiency for database text data.
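One possible way of deriving data table files from frequency statistics is sketched below in Python. The disclosure does not specify the procedure, so everything here is hypothetical: the function name plan_table_files, the file naming, the dedicated parameter, and the example key function (the hexadecimal code point of the first character, standing in for an encoding or letter combination).

```python
from collections import Counter


def plan_table_files(phrase_counts, key_of, dedicated=2):
    """Map each first-level key to a data table file name.

    phrase_counts: Counter of binary phrases taken from a reference corpus.
    key_of: function returning the first-level key of a phrase (assumption:
            supplied by the caller, e.g. an encoding or a letter combination).
    dedicated: number of most frequent keys that get a file of their own;
               all remaining keys share one file.
    """
    key_counts = Counter()
    for phrase, count in phrase_counts.items():
        key_counts[key_of(phrase)] += count
    plan = {}
    for rank, (key, _) in enumerate(key_counts.most_common()):
        plan[key] = f"table_{key}.dat" if rank < dedicated else "table_other.dat"
    return plan


def first_char_code(phrase):
    # Hypothetical first-level key: code point of the first character, in hex.
    return format(ord(phrase[0]), "04x")


counts = Counter({"数据": 5, "数学": 3, "检索": 4, "效率": 1})
plan = plan_table_files(counts, first_char_code, dedicated=2)
print(plan[first_char_code("效率")])  # table_other.dat
```

Under this sketch, all phrases sharing a frequent first-level key are kept in one file, so writing or looking up such a phrase touches a single data table file.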
Those skilled in the art will understand that the modules or steps of the present disclosure described above can be implemented with a general-purpose computing device. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as a separate integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present disclosure and are not intended to limit it; for those skilled in the art, the present disclosure may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.
Although specific embodiments of the present disclosure have been described above with reference to the drawings, they do not limit the scope of protection of the disclosure. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the present disclosure without creative effort remain within its scope of protection.

Claims (10)

1. A method for database Chinese full-text retrieval, characterized by comprising the following steps:
receiving text data to be inserted into a database;
performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
during segmentation, for each binary phrase obtained, writing the binary phrase and its location information in the text data into the inverted index;
receiving a text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
in the database, executing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
2. The method for database Chinese full-text retrieval according to claim 1, characterized in that the inverted index comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the location information of the binary phrase in the text, the location information comprising the row in which the binary phrase is located in the text and the position within that row.
3. The method for database Chinese full-text retrieval according to claim 2, characterized in that the first-level index is an encoding; or
the first-level index is a letter or a letter combination, and multiple data table files correspond to the index of that letter or letter combination;
further, the letter combination is obtained based on statistics of common words.
4. The method for database Chinese full-text retrieval according to claim 2, characterized in that executing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved comprises:
receiving the text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
for each text data item in the database, counting row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
calculating, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
aggregating the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
sorting the text data in the database by similarity from high to low and outputting the result.
5. A method for database Chinese full-text retrieval, characterized by comprising the following steps:
creating an inverted index structure in advance;
receiving text data to be inserted into a database;
performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group;
during segmentation, for each binary phrase obtained, writing the binary phrase and its location information in the text data into the inverted index;
receiving a text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
in the database, executing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
6. The method for database Chinese full-text retrieval according to claim 5, characterized in that the inverted index comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of the text in the third-level index, and the third-level index records the location information of the binary phrase in the text, the location information comprising the row in which the binary phrase is located in the text and the position within that row.
7. The method for database Chinese full-text retrieval according to claim 6, characterized in that the first-level index is an encoding; or
the first-level index is a letter or a letter combination, and multiple data table files correspond to the index of that letter or letter combination;
further, the letter combination is obtained based on statistics of common words.
8. The method for database Chinese full-text retrieval according to claim 6, characterized in that performing full-text retrieval on the text to be retrieved in the database comprises:
executing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved, which comprises:
receiving the text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
for each text data item in the database, counting row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
calculating, from the frequencies, the similarity between the text to be retrieved and each row of the text data;
aggregating the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
sorting the text data in the database by similarity from high to low and outputting the result.
9. A server connected to a database system, characterized in that it executes the method for database Chinese full-text retrieval according to any one of claims 1-4 or 5-8.
10. A database Chinese full-text retrieval system, characterized by comprising a client, a database system, and the server according to claim 9; the client receives the text to be retrieved and sends it to the server.
CN201910053622.3A 2019-01-21 2019-01-21 Method and system for searching Chinese full text in database Active CN109885641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910053622.3A CN109885641B (en) 2019-01-21 2019-01-21 Method and system for searching Chinese full text in database

Publications (2)

Publication Number Publication Date
CN109885641A true CN109885641A (en) 2019-06-14
CN109885641B CN109885641B (en) 2021-03-09

Family

ID=66926311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910053622.3A Active CN109885641B (en) 2019-01-21 2019-01-21 Method and system for searching Chinese full text in database

Country Status (1)

Country Link
CN (1) CN109885641B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765262A (en) * 2019-09-24 2020-02-07 北京嘀嘀无限科技发展有限公司 POI text retrieval method and device and electronic equipment
CN113127662A (en) * 2021-04-13 2021-07-16 广联达科技股份有限公司 Component searching method and device, electronic equipment and readable storage medium
CN113609249A (en) * 2021-09-09 2021-11-05 北京环境特性研究所 Target model simulation data storage method and device
CN115840799A (en) * 2023-02-24 2023-03-24 南通专猎网络科技有限公司 Intellectual property comprehensive management system based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200211A1 (en) * 1999-02-09 2003-10-23 Katsumi Tada Document retrieval method and document retrieval system
CN101393551A (en) * 2007-09-17 2009-03-25 鸿富锦精密工业(深圳)有限公司 Index establishing system and method for patent full text search
CN103823799A (en) * 2012-11-16 2014-05-28 镇江诺尼基智能技术有限公司 New-generation industry knowledge full-text search method
US20180300415A1 (en) * 2017-04-16 2018-10-18 Radim Rehurek Search engine system communicating with a full text search engine to retrieve most similar documents
CN108776705A (en) * 2018-06-12 2018-11-09 厦门市美亚柏科信息股份有限公司 A kind of method, apparatus, equipment and readable medium that text full text is accurately inquired

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765262A (en) * 2019-09-24 2020-02-07 北京嘀嘀无限科技发展有限公司 POI text retrieval method and device and electronic equipment
CN113127662A (en) * 2021-04-13 2021-07-16 广联达科技股份有限公司 Component searching method and device, electronic equipment and readable storage medium
CN113609249A (en) * 2021-09-09 2021-11-05 北京环境特性研究所 Target model simulation data storage method and device
CN113609249B (en) * 2021-09-09 2023-04-28 北京环境特性研究所 Target model simulation data storage method and device
CN115840799A (en) * 2023-02-24 2023-03-24 南通专猎网络科技有限公司 Intellectual property comprehensive management system based on deep learning
CN115840799B (en) * 2023-02-24 2023-05-02 南通专猎网络科技有限公司 Intellectual property comprehensive management system based on deep learning

Also Published As

Publication number Publication date
CN109885641B (en) 2021-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method and system of Chinese full text retrieval in database

Effective date of registration: 20220331

Granted publication date: 20210309

Pledgee: Bank of Beijing Co.,Ltd. Jinan Branch

Pledgor: HIGHGO BASE SOFTWARE Co.,Ltd.

Registration number: Y2022980003586

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20230619

Granted publication date: 20210309

Pledgee: Bank of Beijing Co.,Ltd. Jinan Branch

Pledgor: HIGHGO BASE SOFTWARE Co.,Ltd.

Registration number: Y2022980003586