Method and system for Chinese full-text retrieval in a database
Technical field
The present disclosure belongs to the technical field of data retrieval, and in particular relates to a method and system for Chinese full-text retrieval in a database.
Background
Full-text retrieval is a very common information query application, and it is one of the core technologies of the various online search engines. A full-text retrieval product is essentially a database product with embedded full-text retrieval technology. Chinese full-text retrieval involves Chinese word segmentation.
Current Chinese word segmentation methods fall mainly into two classes: methods based on string matching and methods based on statistics. A string-matching method matches the Chinese character string to be analyzed against the entries of a dictionary; if a string is found in the dictionary, a word is considered recognized. This method requires a "sufficiently complete" dictionary, but because new Internet words appear very quickly, dictionary updates struggle to keep pace. If the text to be retrieved contains a new Internet word that is absent from the dictionary, the word cannot be segmented correctly, so texts containing it cannot be retrieved, which leads to missed results.
A statistics-based method segments text according to the frequency or probability with which adjacent characters co-occur. It only needs to count character-group frequencies in a corpus and requires no dictionary. However, it often extracts character groups that co-occur frequently but are not actually words. It has some ability to recognize new words, but its recognition accuracy for everyday words is poor, it is computationally time-consuming, and the segmentation produces a larger volume of data, which reduces the efficiency of subsequent retrieval.
On top of segmentation, database products commonly use an inverted index to speed up retrieval. Specifically, after receiving a data file to be inserted, the database first reads the file to perform Chinese word segmentation, and after segmentation reads it again to obtain the position of each phrase in the file and write it to the inverted index. The data file is thus read twice; when the data file is large, or a large volume of data is being inserted into the database, this approach is computationally expensive and inefficient. Moreover, a general inverted index stores only the row in which a phrase occurs. In that case, counting phrase frequencies during retrieval requires reading the data out again before similarity can be calculated, so retrieval efficiency is low.
Summary of the invention
To overcome the above deficiencies of the prior art, the present disclosure provides a method and system for Chinese full-text retrieval in a database. The retrieval method uses binary word segmentation, which better recognizes and retrieves new Internet words, and it improves the inverted index so that a multi-level index mechanism can quickly locate the corresponding phrase, making retrieval more efficient.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
A method for Chinese full-text retrieval in a database, comprising the following steps:
receiving text data to be inserted into the database;
performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
during segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data to the inverted index;
receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
executing, in the database, full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Further, the inverted index comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of its text in the third-level index, and the third-level index records the position information of the binary phrase in the text; the position information comprises the row of the text in which the binary phrase appears and its position within that row.
Further, the first-level index is a code; or
the first-level index is a letter or a letter combination, and multiple data table files are correspondingly indexed by the letter or letter combination.
Further, the letter combination is obtained from statistics on common words.
Further, executing full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved comprises:
receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
for each text data in the database, counting row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
calculating the similarity between the text to be retrieved and each row of the text data from the frequencies;
aggregating the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
sorting the text data in the database by similarity from high to low and outputting the result.
One or more embodiments provide a method for Chinese full-text retrieval in a database, comprising the following steps:
pre-creating an inverted index structure;
receiving text data to be inserted into the database;
performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group;
during segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data to the inverted index;
receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
executing, in the database, full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Further, the inverted index comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of its text in the third-level index, and the third-level index records the position information of the binary phrase in the text; the position information comprises the row of the text in which the binary phrase appears and its position within that row.
Further, the first-level index is a code; or
the first-level index is a letter or a letter combination, and multiple data table files are correspondingly indexed by the letter or letter combination.
Further, the letter combination is obtained from statistics on common words.
Further, executing full-text retrieval in the database based on the inverted index and the multiple binary phrases to be retrieved comprises:
receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
for each text data in the database, counting row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
calculating the similarity between the text to be retrieved and each row of the text data from the frequencies;
aggregating the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
sorting the text data in the database by similarity from high to low and outputting the result.
One or more embodiments provide a server, connected to a database system, that executes the above method for Chinese full-text retrieval in a database.
One or more embodiments provide a system for Chinese full-text retrieval in a database, comprising a client, a database system, and the above server; the client receives text to be retrieved and sends it to the server.
The above one or more technical solutions have the following beneficial effects:
In the present disclosure, splitting text by binary word segmentation yields, as far as possible, all phrases that may occur in the text. This avoids the incomplete-dictionary problem of other Chinese word segmentation solutions, so that Internet vocabulary and newly emerging words can also be recognized and retrieved well;
After receiving text to be inserted into the database, the present disclosure performs binary word segmentation and at the same time creates the inverted index file for the text; during segmentation, the phrase currently split out and its position in the text are written to the inverted index. Compared with segmenting first and then counting the position of each phrase in the text, writing to the index file during segmentation saves one pass of reading the data and is more efficient;
The present disclosure improves the general inverted index so that it comprises a three-level index mechanism: code or letter, then phrase, then the position of the phrase in the documents containing it, with the position information refined to "row of the phrase + position of the phrase within that row". During full-text retrieval, the similarity between the text to be retrieved and each row of a text in the database can be counted quickly and directly from this position information, so that the full-text similarity is calculated quickly;
Because binary word segmentation results in a large volume of stored data, and in order to avoid cross-file writes when the index file is written and cross-file reads when retrieval is subsequently executed, the present disclosure also establishes the second-level index and the corresponding data table files based on statistics on common words, which can greatly increase data read and write speed and improve both the processing efficiency and the retrieval efficiency for text data in the database.
Brief description of the drawings
The accompanying drawings, which form a part of the present disclosure, are provided for further understanding of the disclosure; the illustrative embodiments of the disclosure and their descriptions serve to explain the disclosure and do not unduly limit it.
Fig. 1 is an overall flowchart of the method for Chinese full-text retrieval in a database in Embodiment One of the present disclosure;
Fig. 2 is a detailed flowchart of the method for Chinese full-text retrieval in a database in Embodiment One of the present disclosure;
Fig. 3 is a schematic diagram of the data structure of the inverted index in Embodiment One of the present disclosure;
Fig. 4 is an example diagram of an inverted index in which the second-level index is a letter, in Embodiment One of the present disclosure;
Fig. 5 is an overall flowchart of the method for Chinese full-text retrieval in a database in Embodiment Two of the present disclosure;
Fig. 6 is a detailed flowchart of the method for Chinese full-text retrieval in a database in Embodiment Two of the present disclosure;
Fig. 7 is a block diagram of the system for Chinese full-text retrieval in a database in Embodiments Three and Four of the present disclosure.
Specific embodiment
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present disclosure belongs.
It should be noted that the terminology used herein is for describing specific embodiments only and is not intended to limit the exemplary embodiments of the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
In the absence of conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.
Embodiment one
The present embodiment discloses a method for Chinese full-text retrieval in a database which, as shown in Fig. 1, comprises the following steps:
receiving text data to be inserted into the database;
performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
during segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data to the inverted index;
receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
executing, in the database, full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
Specifically, as shown in Fig. 2, the method comprises process 1, inserting new text data into the database, and process 2, executing full-text retrieval for received text to be retrieved.
Process 1, inserting new text data into the database, comprises:
Step 101: receive the text data to be inserted into the database; for example, the input "我在北京天安门" ("I am at Tiananmen in Beijing");
Step 102: preprocess the text data;
wherein the preprocessing includes removing non-text content such as spaces, tabs, and special characters such as commas.
Step 103: perform binary word segmentation on the preprocessed text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
For example, if the text data is "北京天安门" ("Beijing Tiananmen"), binary word segmentation yields {"北京", "京天", "天安", "安门"}.
Taking the PostgreSQL database as an example, the database natively provides segmentation only for English words and does not support segmentation of Chinese words. PostgreSQL can support Chinese word segmentation through external plug-ins; existing plug-ins such as zhparser and jieba can do so, but segmenting with them relies on their bundled dictionaries. Because new Internet words appear quickly, a dictionary easily misses them, so dictionary-based segmentation can miss key words and retrieval results suffer. Binary word segmentation, by contrast, splits out as far as possible all phrases that may occur in the text data. By using binary word segmentation, the present embodiment avoids the use of a dictionary and thereby avoids the problem of new words missing from the dictionary.
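The preprocessing of step 102 and the binary word segmentation of step 103 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function names are hypothetical, and "Chinese character" is approximated here by the CJK Unified Ideographs block, which is an assumption.

```python
import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def preprocess(line: str) -> str:
    """Remove spaces, tabs, punctuation and any other non-Chinese character."""
    return _NON_CHINESE.sub("", line)


def binary_segment(line: str) -> list[str]:
    """Return every pair of adjacent characters, in order of appearance."""
    return [line[i:i + 2] for i in range(len(line) - 1)]


phrases = binary_segment(preprocess("北京 天安门，"))
print(phrases)  # ['北京', '京天', '天安', '安门']
```

No dictionary is consulted at any point, which is why a newly coined word is covered as soon as its characters appear in the text.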
An inverted index is created for the text data while binary word segmentation is performed. The inverted index structure comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of its text in the third-level index, and the third-level index records the position information of the binary phrase in the text.
In one or more embodiments, the inverted index uses a tree structure; preferably, a B-tree is used in the present embodiment.
Specifically, the tree structure comprises a four-level tree structure. The first-level tree structure stores logical judgment conditions used to determine the position of each binary phrase in the second-level tree structure. The first-level tree structure may correspond to a file in which the logical judgment conditions are stored. When the conditions are stored in a file, the file can be opened and the conditions in it modified, which gives high flexibility. The second-level tree structure corresponds to the first-level index and identifies the position of each binary phrase in the second-level index; the third-level tree structure corresponds to the second-level index and records each binary phrase and the position of its text in the third-level index; the fourth-level tree structure corresponds to the third-level index and records the text information for the corresponding binary phrase and the position information of the binary phrase in that text. The second-level and third-level tree structures may physically be stored in the same data table file or in different data table files. Fig. 3 is a schematic diagram of the four-level tree structure and is intended to show clearly the relationships among the levels; it should be noted that the depth and breadth of each level in the figure are merely illustrative and can be extended according to the actual data.
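As a rough in-memory picture of the three index levels, the sketch below uses nested dictionaries instead of B-tree pages and data table files. All names are hypothetical, and deriving the first-level key from the hexadecimal code point of the phrase's first character is an assumption made only for illustration.

```python
from collections import defaultdict


def new_index():
    # first-level key -> {(phrase, text_id) -> [(row, position), ...]}
    return defaultdict(lambda: defaultdict(list))


def first_level_key(phrase: str) -> str:
    # Assumption: the code is the hex code point of the first character.
    return format(ord(phrase[0]), "x")


def add(index, text_id: int, phrase: str, row: int, pos: int) -> None:
    """Second level: (phrase, text_id); third level: list of (row, position)."""
    index[first_level_key(phrase)][(phrase, text_id)].append((row, pos))


def lookup(index, phrase: str, text_id: int):
    """Follow first level -> second level -> third level without creating entries."""
    return index.get(first_level_key(phrase), {}).get((phrase, text_id), [])


index = new_index()
add(index, 1, "北京", 1, 1)
add(index, 1, "安门", 1, 4)
print(lookup(index, "安门", 1))  # [(1, 4)]
```

A lookup touches one first-level bucket only, which mirrors the role of the first-level index as the entry point to the second-level index.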
In one or more embodiments, the second-level tree structure may use hexadecimal codes.
In a PostgreSQL database, the minimum unit of management for both disk storage and memory is the page, also commonly called a block. The page size in PG is typically 8 kB, and the size of a single data table file is at most 1 GB. Because the segmentation method used in the present embodiment is binary word segmentation, in which every two adjacent Chinese characters form one group, the required storage is large and multiple data table files are needed. Accordingly, in one or more embodiments, multiple data table files are created for the second-level index and are indexed by letter (for example, the file name contains the letter), with the letter serving as the first-level index. For example, if a node of the third layer corresponds to a binary phrase beginning with "我" ("I", pinyin initial "w"), its parent node is "w", and the phrase is written to the data table file indexed by "w". Because the index files are indexed by letter, the retrieval process can use the letter as the entry point to locate quickly the corresponding phrase and its third-level index, where the position information of the phrase in each text data is written.
Although binary word segmentation lets the present embodiment cover all new words, it produces more phrases than dictionary-based segmentation. If each of the 26 letters served as a first-level index with its own data table file, 26 files in all, writing phrase position information could involve frequent cross-file writes. For "天安门广场" ("Tiananmen Square"), for example, "天安" is written to the data table file for "t", "安门" to the file for "a", "门广" to the file for "m", and "广场" to the file for "g", which increases the computational burden on the system. To solve this problem, in one or more embodiments each data table file is indexed by a combination of multiple letters, such as "a-d" or "e-h".
Preferably, the letter combination may consist of non-contiguous letters and is set on the basis of statistical analysis of existing vocabulary: common words are identified, and the letters covered by the more common words are placed in one group. For example, if "天安门广场" ("Tiananmen Square") is judged to be a common word, the pinyin initials "t", "a", "m", and "g" of its binary phrases can be placed in one group and written to the same first-level index, and the data table file corresponding to that first-level index is indexed accordingly. This greatly reduces the frequency of cross-table writes and lowers the data-processing load. Moreover, because the indexes of the phrases belonging to a common word are in the same data table file, lookup efficiency during subsequent retrieval is improved.
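The effect of grouping letters can be illustrated with a small sketch that counts how many data table files the binary phrases of "天安门广场" touch under two groupings. The pinyin lookup table, the group strings, and the function names are hypothetical; a real system would need a complete character-to-pinyin mapping, and the second grouping is only an assumed outcome of the common-word statistics.

```python
# Hypothetical lookup: pinyin initial of the first character of a phrase.
PINYIN_INITIAL = {"天": "t", "安": "a", "门": "m", "广": "g"}


def group_of(phrase: str, groups: list[str]) -> str:
    """Return the letter group (one data table file) that holds the phrase."""
    letter = PINYIN_INITIAL[phrase[0]]
    for group in groups:
        if letter in group:
            return group
    raise KeyError(letter)


def files_touched(phrases: list[str], groups: list[str]) -> set[str]:
    """Set of distinct data table files written when indexing the phrases."""
    return {group_of(p, groups) for p in phrases}


phrases = ["天安", "安门", "门广", "广场"]
one_file_per_letter = list("abcdefghijklmnopqrstuvwxyz")
# Assumed result of common-word statistics: a, g, m, t share one file.
grouped_by_statistics = ["agmt", "bcdefhijklnopqrsuvwxyz"]

print(len(files_touched(phrases, one_file_per_letter)))    # 4
print(len(files_touched(phrases, grouped_by_statistics)))  # 1
```

With one file per letter, the four phrases are spread over four files; with the statistically chosen group they all land in the same file, which is the reduction in cross-file writes described above.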
Step 104: during segmentation, for each binary phrase obtained, write the binary phrase and its position information in the text data to the inverted index.
Specifically, each binary phrase is written to the second-level index according to the first-level index, and at the same time the position information of the binary phrase in the text data is written to the third-level index corresponding to that binary phrase.
In one or more embodiments, when text data is inserted, a unique identifier, such as a text ID, is created for it in the database.
In one or more embodiments, the position information of a binary phrase in the text comprises the row of the text in which the binary phrase appears and its position within that row. It can be represented as a two-element array; for example, (1,4) denotes the 4th word of the 1st row. If the third-level tree structure contains multiple levels of nodes, the position within the row can be stored as a child node of the row.
For example, for the text data "I am at Tiananmen Square", "(1,1)" is written to the third-level index corresponding to the phrase "I", "(1,2)" is written to the third-level index corresponding to the phrase "Tian An", "(1,3)" is written to the third-level index corresponding to the phrase "An Men", and so on.
Steps 101 to 104 above constitute the process of inserting new text data into the database. For newly received text data, the index information is written together with segmentation at insertion time, which greatly improves data-processing efficiency.
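Steps 103 and 104 can be sketched together as a single pass in which each binary phrase is written to the index, with its (row, position) pair, at the moment it is produced. The flat dictionary layout and the function name are assumptions made for brevity; the first-level index and the data table files are left out.

```python
def index_text(index: dict, text_id: int, lines: list[str]) -> None:
    """Segment preprocessed lines and record (row, position) in one pass.

    index maps phrase -> {text_id -> [(row, position), ...]}; rows and
    positions are 1-based, so (1, 4) is the 4th phrase of the 1st row.
    """
    for row, line in enumerate(lines, start=1):
        for pos in range(1, len(line)):
            phrase = line[pos - 1:pos + 1]
            index.setdefault(phrase, {}).setdefault(text_id, []).append((row, pos))


index = {}
index_text(index, 1, ["北京天安门"])
print(index["北京"][1])  # [(1, 1)]
print(index["安门"][1])  # [(1, 4)]
```

There is no second loop over the text to collect positions, which corresponds to the saved read pass that the disclosure describes.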
In a PostgreSQL database, one meta page can store multiple index files. Fig. 4 shows an example of inserting data from a data file. Suppose the text data is: "A full-text database is the main component of a full-text retrieval system. A so-called full-text database is a data collection formed by converting the entire content of a complete information source into information units that a computer can recognize and process. A full-text database not only stores information, but also ...." (the layout of the text is as shown in the figure). After the database receives the text data, it first filters out all symbols other than Chinese characters, then creates the inverted index structure (Entry Tree1) and starts binary word segmentation at the same time. Purely as an example, the first-level index of this inverted index structure uses the contiguous letter combinations "a, b ..., g", "h, i ..., n", and "o, q ..., z". The phrases obtained during segmentation, "全文", "文数", "数据", "据库", and so on, are written in turn to the second-level index according to the first-level index, and as each phrase is written, its position in the text is written to the third-level index at the same time. Taking the first phrase "全文" ("full text") as an example, the phrase is written to the second-level index under the first-level index "o, q ..., z", and its position, "row 1, word 1", is written to the third-level index. As segmentation proceeds, the phrase "全文" occurs a second time, and its position, "row 1, word 7", is written to the same third-level index; and so on, until segmentation is complete and every phrase obtained, together with its positions, has been written to the index file. The third-level index in the figure uses a multi-level tree structure (the Post Tree): the first level represents all rows in which the phrase appears (the phrase "全文" appears in rows 1, 3, and 8), the second level represents each individual row in which the phrase appears, and the third level represents the positions of the phrase within the corresponding row (words 1 and 7 of row 1, word 4 of row 3, and word 6 of row 8). It should be noted that the figure does not show the index positions of all phrases obtained by segmentation; a person skilled in the art should be able to understand the process of inserting text data into the database in the present embodiment from the description here and the partial illustration in the figure. At this point, binary word segmentation and the writing of the inverted index have been completed at the same time.
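The multi-level layout of the third-level index (the Post Tree) for the phrase "全文" can be mimicked with a dictionary keyed by row. The positions below are the ones stated in the example above; the function name and the dictionary representation are assumptions, not the on-disk format.

```python
def build_post_tree(positions: list[tuple[int, int]]) -> dict[int, list[int]]:
    """Group (row, position) pairs by row: row -> positions within that row."""
    tree: dict[int, list[int]] = {}
    for row, pos in positions:
        tree.setdefault(row, []).append(pos)
    return tree


# Occurrences of "全文" as given in the example: row 1 (words 1 and 7),
# row 3 (word 4), row 8 (word 6).
post_tree = build_post_tree([(1, 1), (1, 7), (3, 4), (8, 6)])

rows_with_phrase = sorted(post_tree)                           # first level
row_frequency = {r: len(p) for r, p in post_tree.items()}      # per-row count

print(rows_with_phrase)  # [1, 3, 8]
print(row_frequency)     # {1: 2, 3: 1, 8: 1}
```

Because positions are already grouped by row, the per-row frequency needed at retrieval time is just the length of each list and requires no access to the original text.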
Because binary word segmentation splits the text in order, grouping every two adjacent Chinese characters, there is no need to read in the entire text data first and then segment it with a dictionary-based or statistical method. Instead, segmentation can be performed while the data is being read, and the position of the current binary phrase can be obtained and written to the inverted index at the same time, which greatly improves data-processing efficiency when new data is entered into the database.
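Segmenting while reading can be sketched as a generator that consumes a character stream and only ever remembers the previous character. The function name is hypothetical, and two choices here are assumptions rather than statements of the disclosure: binary phrases do not span a line break, and non-Chinese characters are skipped on the fly in place of a separate preprocessing step.

```python
import io


def stream_bigrams(stream):
    """Yield (phrase, row, position) while reading one character at a time."""
    row, pos, prev = 1, 0, None
    while True:
        ch = stream.read(1)
        if not ch:                      # end of input
            break
        if ch == "\n":                  # assumption: phrases do not span rows
            row, pos, prev = row + 1, 0, None
            continue
        if not ("\u4e00" <= ch <= "\u9fff"):
            continue                    # skip spaces, punctuation, etc.
        if prev is not None:
            pos += 1
            yield prev + ch, row, pos
        prev = ch


for item in stream_bigrams(io.StringIO("北京天安门\n全文")):
    print(item)
```

Each yielded tuple could be written to the inverted index immediately, so the full text never has to be held in memory or read a second time.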
Process 2, executing full-text retrieval for received text to be retrieved, comprises the following steps:
Step 201: receive text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 202: for each text data in the database, count row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
Step 203: calculate the similarity between the text to be retrieved and each row of the text data from the frequencies;
Step 204: aggregate the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data.
Step 205: sort the text data in the database by similarity from high to low and output the result.
The inverted index of the present embodiment uses a three-level index structure, and the position information written to the third-level index uses a "row + position" data structure. When retrieval is executed, the positions of the phrases of the text to be retrieved can be found quickly in each text data in the database, and the frequency of each phrase in every row of each text data can be counted quickly, so that the similarity between the text to be retrieved and each text data is calculated quickly.
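Steps 201 to 205 can be sketched as below. The disclosure does not give a similarity formula, so this sketch assumes the simplest possible one: the similarity of a row is the total number of occurrences of the query's binary phrases in that row, and the similarity of a text is the sum over its rows. The index layout, its contents, and the function name are hypothetical.

```python
from collections import Counter

# Toy index: phrase -> {text_id -> [(row, position), ...]}
index = {
    "北京": {1: [(1, 1)]},
    "天安": {1: [(1, 3)], 2: [(1, 1), (2, 1)]},
    "安门": {1: [(1, 4)], 2: [(1, 2)]},
}


def search(index: dict, query: str) -> list[tuple[int, int]]:
    # Step 201: binary word segmentation of the text to be retrieved.
    q_phrases = [query[i:i + 2] for i in range(len(query) - 1)]
    # Steps 202-203: per-row frequency, used directly as the row similarity
    # (assumed formula, see the note above).
    row_scores: dict[int, Counter] = {}
    for phrase in q_phrases:
        for text_id, positions in index.get(phrase, {}).items():
            rows = row_scores.setdefault(text_id, Counter())
            for row, _pos in positions:
                rows[row] += 1
    # Step 204: aggregate row similarities into one score per text.
    totals = {text_id: sum(rows.values()) for text_id, rows in row_scores.items()}
    # Step 205: sort from high to low.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


print(search(index, "天安门"))  # [(2, 3), (1, 2)]
```

Only the index is consulted; the stored texts are never re-read, because the row of every occurrence is already part of the position information.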
Embodiment two
As a variation of Embodiment One, the present embodiment provides a method for Chinese full-text retrieval in a database which, as shown in Fig. 5, comprises the following steps:
pre-creating an inverted index structure;
receiving text data to be inserted into the database;
performing binary word segmentation on the text data, taking every two adjacent Chinese characters as one group;
during segmentation, for each binary phrase obtained, writing the binary phrase and its position information in the text data to the inverted index;
receiving text to be retrieved and performing binary word segmentation on it to obtain multiple binary phrases to be retrieved;
executing, in the database, full-text retrieval based on the inverted index and the multiple binary phrases to be retrieved.
The difference from Embodiment One is that the present embodiment pre-creates the data table files and the corresponding inverted index structure. The inverted index structure comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of its text in the third-level index, and the third-level index records the position information of the binary phrase in the text; the position information comprises the row containing the phrase and the position within that row. The specific steps are shown in Fig. 6.
Process 3, inserting a new data file, comprises the following steps:
Step 301: receive the text data to be inserted into the database;
Step 302: preprocess the text data;
Step 303: perform binary word segmentation on the preprocessed text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data;
Step 304: during segmentation, write the binary phrases obtained to the second-level index according to the first-level index, and at the same time write the position information of each binary phrase in the text data to the third-level index corresponding to that binary phrase.
Retrieval process 4 comprises the following steps:
Step 401: receive text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 402: for each text data in the database, count row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
Step 403: calculate the similarity between the text to be retrieved and each row of the text data from the frequencies;
Step 404: aggregate the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
Step 405: sort the text data in the database by similarity from high to low and output the result.
For the specific implementation of the above steps, see the description of the corresponding parts of Embodiment One.
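The distinguishing feature of this embodiment, creating the index structure before any text arrives, can be sketched as follows. The three letter groups spell out one possible reading of the illustrative combinations mentioned in Embodiment One and are an assumption, as are the function names and the dictionary layout.

```python
# Assumed letter groups; each group stands for one pre-created data table file.
LETTER_GROUPS = ["abcdefg", "hijklmn", "opqrstuvwxyz"]


def create_index_structure(groups: list[str]) -> dict[str, dict]:
    """Create one empty bucket per letter group before any insert happens."""
    return {group: {} for group in groups}


def insert_phrase(structure, group, text_id, phrase, row, pos) -> None:
    """Write into an existing bucket; a KeyError means it was never pre-created."""
    bucket = structure[group]
    bucket.setdefault(phrase, {}).setdefault(text_id, []).append((row, pos))


structure = create_index_structure(LETTER_GROUPS)
# "全文" (pinyin initial "q") goes to the bucket whose group contains "q".
insert_phrase(structure, "opqrstuvwxyz", 1, "全文", 1, 1)
print(structure["opqrstuvwxyz"])  # {'全文': {1: [(1, 1)]}}
```

Insertion then only fills existing buckets, so no structure has to be created on the insert path.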
Embodiment three
Based on the retrieval method of Embodiment One, the present embodiment provides a system for Chinese full-text retrieval in a database.
A system for Chinese full-text retrieval in a database, as shown in Fig. 7, comprises a client, a database system, and a server, wherein:
the client receives the text to be retrieved entered by a user, generates a retrieval request, and sends it to the server;
the server is connected to the database system and is configured to receive text data, insert it into the database, and generate the inverted index corresponding to the text data, specifically comprising:
Step 101: receive the text data to be inserted into the database;
Step 102: preprocess the text data;
Step 103: perform binary word segmentation on the preprocessed text data, taking every two adjacent Chinese characters as one group, while creating an inverted index for the text data; the inverted index structure comprises a three-level index, wherein the first-level index identifies the position of each binary phrase in the second-level index, the second-level index records each binary phrase and the position of its text in the third-level index, and the third-level index records the position information of the binary phrase in the text;
Step 104: during segmentation, write each binary phrase obtained to the second-level index according to the first-level index, and at the same time write the position information of the binary phrase in the text data to the third-level index corresponding to that binary phrase.
The server is further configured to receive the text to be retrieved and execute full-text retrieval in the database, specifically comprising:
Step 201: receive text to be retrieved and perform binary word segmentation on it to obtain multiple binary phrases to be retrieved;
Step 202: for each text data in the database, count row by row, according to the inverted index corresponding to that text data, the frequency with which each of the binary phrases to be retrieved occurs;
Step 203: calculate the similarity between the text to be retrieved and each row of the text data from the frequencies;
Step 204: aggregate the similarities between the text to be retrieved and the rows of the text data to obtain the similarity between the text to be retrieved and the text data;
Step 205: sort the text data in the database by similarity from high to low and output the result.
For the specific implementation of the above steps, see the description of the corresponding part of
Embodiment One.
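The retrieval flow of steps 201 to 205 can be illustrated with the sketch below. The similarity formula is an assumption made only for illustration (occurrences of query phrases in a row divided by the number of distinct query phrases, summed over rows); the formula actually used is the one described in Embodiment One. A flat phrase-to-positions mapping stands in for the three-level index, and the names `binary_phrases`, `build_postings`, and `rank` are hypothetical.

```python
import re
from collections import Counter, defaultdict

CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def binary_phrases(line):
    """Yield (phrase, column) for every two adjacent Chinese characters in `line`."""
    for run in CJK_RUN.finditer(line):
        chars = run.group()
        for i in range(len(chars) - 1):
            yield chars[i:i + 2], run.start() + i


def build_postings(text):
    """Flat stand-in for the inverted index: phrase -> list of (row, column)."""
    postings = defaultdict(list)
    for row, line in enumerate(text.splitlines()):
        for phrase, col in binary_phrases(line):
            postings[phrase].append((row, col))
    return postings


def rank(query, indexes):
    """Rank documents by an assumed frequency-based similarity (steps 201 to 205)."""
    query_phrases = {phrase for phrase, _ in binary_phrases(query)}   # step 201
    scored = []
    for doc_id, postings in indexes.items():
        hits_per_row = Counter()                                      # step 202
        for phrase in query_phrases:
            for row, _ in postings.get(phrase, ()):
                hits_per_row[row] += 1
        # Step 203 (assumed formula): hits in the row / number of distinct query phrases.
        row_scores = [n / len(query_phrases) for n in hits_per_row.values()]
        scored.append((doc_id, sum(row_scores)))                      # step 204
    return sorted(scored, key=lambda item: (-item[1], item[0]))       # step 205


documents = {"a": "数据库全文检索\n全文检索系统", "b": "中文分词方法", "c": "全文索引"}
indexes = {doc_id: build_postings(text) for doc_id, text in documents.items()}
ranking = rank("全文检索", indexes)
```

With these sample documents, `ranking` lists document `a` first (both of its rows contain all three query phrases), then `c` (one matching phrase), then `b` (no match).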
Embodiment Four
Based on the retrieval method of Embodiment Two, this embodiment provides a database Chinese full-text
retrieval system.
A database Chinese full-text retrieval system, as shown in Fig. 7, comprises a client, a database
system, and a server, wherein:
The client receives the text to be retrieved as entered by the user, generates a retrieval request, and
sends it to the server;
The server is connected to the database system and is configured to receive text data, insert it into
the database, and generate the inverted index corresponding to the text data, specifically including:
A data list file and a corresponding inverted index structure are pre-created in the server. The
inverted index structure comprises a three-level index, in which the first-level index identifies the
position of each binary phrase in the second-level index, the second-level index records each binary
phrase and the position of the text in the third-level index, and the third-level index records the
position information of the binary phrase in the text; the position information comprises the row
containing the phrase and the position of the phrase within that row.
Step 301: receive the text data to be inserted into the database;
Step 302: preprocess the text data;
Step 303: perform binary word segmentation on the preprocessed text data, taking every two adjacent
Chinese characters as one group, and at the same time create an inverted index for the text data;
Step 304: during segmentation, write the binary phrases obtained into the second-level index according
to the first-level index, and at the same time write the position information of each binary phrase in
the text data into the third-level index corresponding to that binary phrase.
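One way to picture the pre-created index structure of this embodiment is sketched below. It is only an illustration under several assumptions that the text does not fix: the index is pre-created from a list of frequent binary phrases, the first-level index is keyed by the first character of a phrase and stores a range of the second-level index, and phrases missing from the pre-created list go to a separate overflow area. The names `precreate_index` and `record`, and the dictionary layout, are hypothetical.

```python
from bisect import bisect_left


def precreate_index(common_phrases):
    """Pre-create the first and second levels from an assumed list of frequent phrases."""
    level2 = sorted(set(common_phrases))   # sorting groups phrases by first character
    level1 = {}                            # first character -> [start, end) range in level2
    for i, phrase in enumerate(level2):
        start, _ = level1.get(phrase[0], (i, i))
        level1[phrase[0]] = (start, i + 1)
    return {
        "level1": level1,
        "level2": level2,
        "level3": [[] for _ in level2],    # pre-allocated position lists, one per phrase
        "overflow": {},                    # phrases absent from the pre-created list
    }


def record(index, phrase, row, col):
    """Store (row, col) for `phrase`; return True if a pre-created slot was used."""
    start, end = index["level1"].get(phrase[0], (0, 0))
    # Binary search only inside the range that the first level points to.
    slot = bisect_left(index["level2"], phrase, start, end)
    if slot < end and index["level2"][slot] == phrase:
        index["level3"][slot].append((row, col))
        return True
    index["overflow"].setdefault(phrase, []).append((row, col))
    return False
```

Because the slots exist before any text arrives, writing a position for a frequent phrase during segmentation needs only a lookup in a small range and an append; the overflow dictionary is a simplification that keeps new words retrievable in this sketch.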
The server is further configured to receive the text to be retrieved and execute full-text retrieval in
the database, specifically including:
Step 401: receive the text to be retrieved and perform binary word segmentation on it to obtain
multiple binary phrases to be retrieved;
Step 402: for each piece of text data in the database, count by row, according to the inverted index
corresponding to that text data, the frequency with which each of the binary phrases to be retrieved
occurs;
Step 403: calculate, from the frequencies, the similarity between the text to be retrieved and each row
of the text data;
Step 404: aggregate the similarities between the text to be retrieved and the individual rows of the
text data to obtain the similarity between the text to be retrieved and the text data;
Step 405: sort the text data in the database by similarity from high to low, and output the result.
For the specific implementation of the above steps, see the description of the corresponding part of
Embodiment One.
One or more of the above embodiments have the following technical effects:
By splitting the text to be retrieved through binary word segmentation, the disclosure enumerates, as
far as possible, all phrases that may occur in the text. This avoids the incomplete-dictionary problem
that arises in other Chinese word segmentation solutions, so that network vocabulary and newly coined
words can also be identified and retrieved well;
After receiving the text to be inserted into the database, the disclosure performs binary word
segmentation and at the same time creates the inverted index file for the text; during segmentation,
each phrase as it is split out, together with its position in the text, is written into the inverted
index. Compared with segmenting first and only afterwards counting the position of each phrase in the
text, writing the index file while segmenting saves one pass of reading the data and is therefore more
efficient;
The disclosure improves on the general inverted index so that it comprises a three-level index
mechanism: code/letter, then phrase, then the position of the phrase in the document containing it,
with the position information refined into "row containing the phrase + position of the phrase within
that row". When full-text retrieval is performed, the similarity between the text to be retrieved and
each row of a text in the database can be counted quickly and directly from this position information,
so that the full-text similarity is calculated quickly;
Binary word segmentation produces a large volume of data to store. To avoid cross-file writes when the
index file is written and cross-file reads when retrieval is subsequently executed, the disclosure
establishes the second-level index and the corresponding data list files on the basis of common-word
statistics, which can greatly increase data read/write speed and improves both the processing
efficiency and the retrieval efficiency for text data in the database.
Those skilled in the art will understand that each module or step of the disclosure described above can
be implemented with a general-purpose computing device. Optionally, they can be implemented as program
code executable by a computing device, so that they can be stored in a storage device and executed by
the computing device; alternatively, they can each be fabricated as an individual integrated circuit
module, or multiple modules or steps among them can be fabricated as a single integrated circuit
module. The disclosure is not limited to any specific combination of hardware and software.
The foregoing are merely preferred embodiments of the disclosure and are not intended to limit it; for
those skilled in the art, the disclosure may have various modifications and variations. Any
modification, equivalent replacement, improvement, or the like made within the spirit and principles of
the disclosure shall fall within the scope of protection of the disclosure.
Although specific embodiments of the disclosure have been described above with reference to the
accompanying drawings, they do not limit the scope of protection of the disclosure. Those skilled in
the art should understand that various modifications or variations that can be made on the basis of
the technical solutions of the disclosure without creative effort still fall within the scope of
protection of the disclosure.