CN108897817A - Data storage method, detection method and system, storage medium and computer equipment - Google Patents
Data storage method, detection method and system, storage medium and computer equipment
- Publication number
- CN108897817A CN201810637185.5A
- Authority
- CN
- China
- Prior art keywords
- node
- data
- pointer
- data field
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a data storage method for storing data to be stored in an inverted index storage structure. The inverted index storage structure includes nodes, and each node includes a pointer field and a data field for storing multiple groups of data. The data storage method includes: judging whether the data to be stored can be written in full to the data field of the current node; if so, writing a null pointer to the pointer field of the current node; if not, generating a next node, writing a pointer to the next node into the pointer field of the current node, writing the remaining data to be stored to the data field of the next node, and taking the next node as the current node and returning to the judging step. The invention also discloses a content similarity detection method and system, a storage medium, and computer equipment. The data storage method, content similarity detection method and system, computer-readable storage medium, and computer equipment of the invention store multiple groups of data in the data field of a single node, which reduces the number of nodes and therefore the storage space required for pointers.
Description
Technical field
The present invention relates to storage technology, and in particular to a data storage method, a content similarity detection method, a content similarity detection system, a non-volatile computer-readable storage medium, and computer equipment.
Background
To find required content quickly in a large number of documents, a database generally builds an index table that associates the content of the documents with the documents themselves. However, existing index tables have problems such as requiring a large amount of storage space.
Summary of the invention
Embodiments of the present invention provide a data storage method, a content similarity detection method, a content similarity detection system, a non-volatile computer-readable storage medium, and computer equipment.
The data storage method of an embodiment of the present invention is used to store data to be stored in an inverted index storage structure. The inverted index storage structure includes at least one node, each node includes a data field and a pointer field, the data field is used to store multiple groups of data, and the nodes include a current node. The data storage method includes:
judging whether the data to be stored can be written in full to the data field of the current node;
when the data to be stored can be written in full to the data field of the current node, writing a null pointer to the pointer field of the current node;
when the data field of the current node cannot hold all of the data to be stored, generating a next node of the inverted index storage structure;
writing a pointer to the next node into the pointer field of the current node;
writing the remaining data to be stored to the data field of the next node; and
taking the next node as the current node and returning to the step of judging whether the data to be stored can be written in full to the data field of the current node.
The content similarity detection method of an embodiment of the present invention is used with an inverted index storage structure. The inverted index storage structure includes feature information and at least one node, and each node includes a data field and a pointer field. The data field is used to store multiple groups of data, each group of data corresponds to the information of one document, and the pointer field is used to store a pointer. The content similarity detection method includes:
obtaining, according to the correspondence between the feature information in the inverted index storage structure and the information of the documents, the number of identical feature information items shared by two documents;
judging whether the number is greater than or equal to a predetermined number;
when the number is greater than or equal to the predetermined number, judging that the two documents are similar; and
when the number is less than the predetermined number, judging that the two documents are dissimilar.
The content similarity detection system of an embodiment of the present invention is used with an inverted index storage structure. The inverted index storage structure includes feature information and at least one node, and each node includes a data field and a pointer field. The data field is used to store multiple groups of data, each group of data corresponds to the information of one document, and the pointer field is used to store a pointer. The content similarity detection system includes an obtaining module, a first judgment module, a second judgment module, and a third judgment module. The obtaining module is used to obtain, according to the correspondence between the feature information in the inverted index storage structure and the information of the documents, the number of identical feature information items shared by two documents. The first judgment module is used to judge whether the number is greater than or equal to a predetermined number. The second judgment module is used to judge that the two documents are similar when the number is greater than or equal to the predetermined number. The third judgment module is used to judge that the two documents are dissimilar when the number is less than the predetermined number.
One or more non-volatile computer-readable storage media of an embodiment of the present invention contain computer-executable instructions which, when executed by one or more processors, cause the processors to perform the above data storage method and/or the above content similarity detection method.
The computer equipment of an embodiment of the present invention includes a memory and a processor. Computer-readable instructions are stored in the memory, and when the instructions are executed by the processor, the processor performs the above data storage method and/or the above content similarity detection method.
The data storage method, content similarity detection method, content similarity detection system, computer-readable storage medium, and computer equipment of the embodiments of the present invention store multiple groups of data in the data field of a single node, which reduces the number of nodes and therefore the storage space required for pointers.
Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the inverted index storage structure of some embodiments of the present invention.
Fig. 2 is a schematic structural diagram of a traditional linked structure.
Fig. 3 is a schematic structural diagram of a traditional array structure.
Fig. 4 is a schematic structural diagram of the inverted index storage structure of some embodiments of the present invention.
Fig. 5 is a schematic structural diagram of the inverted index storage structure of some embodiments of the present invention.
Fig. 6 is a flow diagram of the data storage method of some embodiments of the present invention.
Fig. 7 is a flow diagram of the content similarity detection method of some embodiments of the present invention.
Fig. 8 is a schematic diagram of the content similarity detection system of some embodiments of the present invention.
Fig. 9 is a schematic diagram of an application scenario of the content similarity detection method of some embodiments of the present invention.
Fig. 10 is a schematic diagram of the computer-readable storage medium of some embodiments of the present invention.
Fig. 11 is a schematic diagram of the computer equipment of some embodiments of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined with "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the present invention, "multiple" means two or more, unless specifically defined otherwise.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "coupled" and "connected" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical, or the elements may be able to communicate with each other; it may be direct, or indirect through an intermediary, and it may be a connection inside two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
The following disclosure provides many different embodiments or examples for realizing different structures of the present invention. To simplify the disclosure, the components and arrangements of specific examples are described below. They are of course only examples and are not intended to limit the present invention. In addition, the present invention may repeat reference numerals and/or reference letters in different examples. This repetition is for simplicity and clarity and does not in itself indicate a relationship between the various embodiments and/or arrangements discussed. Furthermore, the present invention provides examples of various specific processes and materials, but those of ordinary skill in the art will recognize that other processes and/or other materials may be used.
Referring to Fig. 1, the inverted index storage structure 100 of an embodiment of the present invention includes at least one node 10. Each node 10 includes a data field 12 and a pointer field 14. The data field 12 of one node 10 is used to store multiple groups of data. The pointer field 14 is used to store a pointer.
The inverted index storage structure 100 of an embodiment of the present invention stores multiple groups of data in the data field 12 of a single node 10, where each group of data may correspond to the information of one document. This reduces the number of nodes 10 and therefore the storage space required for pointers.
The inverted index storage structure 100 of an embodiment of the present invention further includes feature information 20. The feature information 20 is, for example, a word, and the correspondence between the feature information 20 and the data (the information of documents) can be obtained through the inverted index storage structure 100.
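The structure described above can be pictured with a minimal Python sketch. The names `Node`, `capacity`, `data`, `next` and `index` are illustrative assumptions and do not come from the patent; a Python `None` stands in for the null pointer, and document ids stand in for the information of documents.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node 10: a data field holding several groups of data plus a pointer field."""
    capacity: int                                  # groups the data field can hold (assumed name)
    data: List[int] = field(default_factory=list)  # data field 12, here a list of document ids
    next: Optional["Node"] = None                  # pointer field 14, None plays the null pointer


# Hypothetical index: each feature information item (a word) maps to its first node.
index: Dict[str, Node] = {}

head = Node(capacity=4)
head.data.extend([1, 2, 3])  # three documents contain the word
index["people"] = head
```

A single node here already records three documents, where a one-document-per-node chain would need three nodes and three pointer fields.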
Referring to Fig. 2, existing inverted index storage structures generally use a traditional linked structure. In a traditional linked structure the data field of each node can store only one group of data, that is, the information of only one document. When the database contains many documents, each feature information item also corresponds to many documents. For example, if the feature information "people" appears in 1,000 documents, the linked structure needs 1,000 nodes (each node records the information of one document), which are then linked through the pointers in their pointer fields to give the correspondence between the feature information "people" and the documents. However, a pointer needs a certain amount of storage space (for example, 4 bytes on a 32-bit system and 8 bytes on a 64-bit system), so when there are many nodes the pointers occupy a large amount of storage space, and the index table formed from the traditional linked structure occupies a large amount of storage space.
Referring to Fig. 1, the inverted index storage structure 100 of an embodiment of the present invention stores multiple groups of data, that is, the information of multiple documents, in the data field 12 of a single node 10. Compared with the traditional linked structure, the inverted index storage structure 100 needs fewer nodes 10 to store the same document information for the same feature information 20, so the storage space required for the pointers of the nodes 10 is reduced.
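The saving can be illustrated with a short calculation. This is only a sketch: the function name `pointer_overhead`, the choice of 4 groups per node, and the 8-byte pointer size are assumptions made for the example.

```python
import math


def pointer_overhead(num_docs: int, groups_per_node: int, pointer_size: int) -> int:
    """Total space spent on pointer fields, in the same unit as pointer_size."""
    num_nodes = math.ceil(num_docs / groups_per_node)  # one pointer field per node
    return num_nodes * pointer_size


POINTER_SIZE = 8  # assumption: bytes per pointer on a typical 64-bit platform

chain = pointer_overhead(1000, groups_per_node=1, pointer_size=POINTER_SIZE)
grouped = pointer_overhead(1000, groups_per_node=4, pointer_size=POINTER_SIZE)
print(chain, grouped)  # 8000 2000
```

With one document per node the 1,000 documents need 1,000 pointer fields; with four groups per node they need 250.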
Referring to Fig. 3, existing inverted index storage structures also use a traditional array structure. A traditional array structure stores all data in a pre-allocated, contiguous, fixed-length storage space, so the data in a traditional array structure can be accessed directly without following pointers, and no storage space is spent on pointers. However, when the data to be stored exceed the contiguous fixed-length storage space, a traditional array structure cannot expand dynamically. When documents are added to the database in real time, a traditional array structure therefore struggles to meet business requirements.
Referring to Fig. 1, the inverted index storage structure 100 of an embodiment of the present invention follows pointers from node to node. After the storage space of the data field 12 of one node 10 is full, the pointer can be used to jump to the next node 10, and the data field 12 of the next node 10 then serves as storage space for further data.
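Reading the structure back follows the same pointer jumps. The sketch below is a minimal illustration under assumed names (`Node`, `iter_documents`); it reads every group in a node's data field and then follows the pointer field until it meets the null pointer.

```python
from typing import Iterator, List, Optional


class Node:
    def __init__(self, data: List[int], next_node: Optional["Node"] = None):
        self.data = data       # data field: several groups (document ids here)
        self.next = next_node  # pointer field: next node, or None as the null pointer


def iter_documents(head: Optional[Node]) -> Iterator[int]:
    """Yield every stored document id by walking the chain of nodes."""
    node = head
    while node is not None:
        yield from node.data  # read all groups in this node's data field
        node = node.next      # pointer jump to the next node


tail = Node([5, 6])
head = Node([1, 2, 3, 4], tail)
print(list(iter_documents(head)))  # [1, 2, 3, 4, 5, 6]
```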
Referring to Fig. 4, in some embodiments the inverted index storage structure 100 includes a single node 10. When the feature information 20 corresponds to only a small number of documents, the corresponding inverted index storage structure 100 may need only a single node 10 to store the corresponding document information.
Referring to Fig. 5, in some embodiments the inverted index storage structure 100 includes multiple nodes 10, the data field 12 of every node 10 can store the same number of groups of data, and the pointer of the current node 10 either points to the next node 10 or is a null pointer.
In one embodiment, the data field 12 of each node 10 can store 2 groups of data and the feature information 20 "people" appears in 1,000 documents. When the index table is formed, this feature information 20 then needs only 500 nodes 10, which is 500 fewer than the 1,000 nodes needed by the traditional linked structure. The number of groups of data that the data field 12 of each node 10 can store is, for example, 2, 4, 1024, or 2048, and is not specifically limited here. It should be noted that a larger group number, for example one greater than a set value, better suits high-frequency features (feature information 20 that appears in many documents), because only a small number of nodes 10 are needed to associate a high-frequency feature with its documents. A smaller group number, for example one less than the set value, better suits low-frequency features (feature information 20 that appears in few documents), because it avoids the waste caused by a data field 12 whose storage space is too large to be filled.
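The trade-off between large and small fixed group numbers can be checked numerically. The following sketch assumes a hypothetical helper `fixed_layout` that returns the node count and the number of unused slots for a given number of documents.

```python
import math


def fixed_layout(num_docs: int, groups_per_node: int):
    """Return (node count, unused slots) when every node has the same capacity."""
    nodes = max(1, math.ceil(num_docs / groups_per_node))  # the structure has at least one node
    unused = nodes * groups_per_node - num_docs
    return nodes, unused


print(fixed_layout(1000, 2))     # (500, 0)
print(fixed_layout(1000, 1024))  # (1, 24)
print(fixed_layout(3, 1024))     # (1, 1021)
```

A capacity of 1024 serves the 1,000-document feature with one node, but leaves 1,021 slots empty for a feature that appears in only three documents.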
When the current node 10 has a subsequent node 10, the pointer of the current node 10 points to the next node 10; specifically, the pointer may point to the storage address of the next node 10. When the current node 10 is the last node 10, the pointer of the current node 10 may be a null pointer (null). After the data field 12 of the current node 10 is full, a new node 10 can be allocated after the current node 10, and the pointer of the current node 10 changes from the original null pointer to a pointer to the new node 10.
In some embodiments, the inverted index storage structure 100 includes multiple nodes 10, the pointer of the current node 10 either points to the next node 10 or is a null pointer, and the number of groups of data that the data field 12 of a node 10 can store is positively correlated with the position of the node 10 in the sequence.
Specifically, the data field 12 of the first node 10 among the multiple nodes 10 may store relatively few groups of data, for example 2 groups. This suits low-frequency features and avoids wasting storage space by allocating a large data field 12 for a node 10 at the outset. As the position of the node 10 in the sequence increases, the number of groups of data that the corresponding data field 12 can store may increase gradually, for example by powers of 2, so that the data fields 12 of the nodes 10 quickly provide a large storage space to suit high-frequency features. In one embodiment, the first node 10 can store 2 groups of data, the second node 10 can store 4 groups, the third node 10 can store 8 groups, the fourth node 10 can store 16 groups, and so on. If the feature information 20 "people" appears in 1,000 documents, then when the index table is formed this feature information 20 needs only nine nodes 10 (2+4+8+16+32+64+128+256+512=1022>1000), which is more than 900 fewer than the 1,000 nodes needed by the traditional linked structure.
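The nine-node figure can be reproduced with a few lines of Python. The helper name `nodes_needed_doubling` is an assumption for illustration; it adds nodes of capacity 2, 4, 8, ... until the total capacity covers the documents.

```python
def nodes_needed_doubling(num_docs: int, first_capacity: int = 2) -> int:
    """Count nodes when each node holds twice as many groups as the previous one."""
    nodes, capacity, stored = 0, first_capacity, 0
    while stored < num_docs:
        stored += capacity  # this node contributes its whole data field
        capacity *= 2       # the next node is twice as large
        nodes += 1
    return max(nodes, 1)    # the structure always has at least one node


print(nodes_needed_doubling(1000))  # 9
```

The node count grows only logarithmically with the number of documents, which is why a high-frequency feature needs so few pointer fields.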
When the current node 10 has a subsequent node 10, the pointer of the current node 10 points to the next node 10; specifically, the pointer may point to the storage address of the next node 10. When the current node 10 is the last node 10, the pointer of the current node 10 may be a null pointer (null). After the data field 12 of the current node 10 is full, a new node 10 can be allocated after the current node 10, and the pointer of the current node 10 changes from the original null pointer to a pointer to the new node 10.
Referring to Fig. 1, in some embodiments the inverted index storage structure 100 includes multiple nodes 10, and the pointer of the current node 10 either points to the next node 10 or is a null pointer. While the number of groups of data that the data field 12 of a node 10 can store is less than a preset group number, that number is positively correlated with the position of the node 10 in the sequence. Once the number of groups of data that the data field 12 of a node 10 can store equals the preset group number, the data fields 12 of the nodes 10 after that node 10 can each store the preset group number of groups of data.
Specifically, the data field 12 of the first node 10 among the multiple nodes 10 may store relatively few groups of data, for example 2 groups. This suits low-frequency features and avoids wasting storage space by allocating a large data field 12 for a node 10 at the outset. As the position of the node 10 in the sequence increases, the number of groups of data that the corresponding data field 12 can store may increase gradually, for example by powers of 2, so that the data fields 12 of the nodes 10 quickly provide a large storage space to suit high-frequency features. When the number of groups of data that the data field 12 of a node 10 can store has grown to the preset group number, the data fields 12 of all nodes 10 after that node 10 can each store the preset group number of groups of data. This prevents the capacity of the data field 12 from growing without bound, which would make the storage space too large and waste it. In one embodiment, the first node 10 can store 2 groups of data, the second node 10 can store 4 groups, the third node 10 can store 8 groups, the fourth node 10 can store 16 groups, and so on until the number of groups of data that a node 10 can store equals the preset group number. If the feature information 20 "people" appears in 1,000 documents, then when the index table is formed this feature information 20 needs only nine nodes 10 (2+4+8+16+32+64+128+256+512=1022>1000), which is more than 900 fewer than the 1,000 nodes needed by the traditional linked structure.
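The capped growth rule can be sketched as a capacity schedule. The function name `capacity_schedule` and the 10,000-document case are assumptions added for illustration; the sketch also assumes the first capacity does not exceed the preset group number.

```python
from typing import List


def capacity_schedule(num_docs: int, first_capacity: int = 2, preset: int = 4096) -> List[int]:
    """Capacities of successive nodes: doubling until the preset group number, then constant."""
    capacities: List[int] = []
    capacity, stored = first_capacity, 0
    while stored < num_docs or not capacities:  # always create at least one node
        capacities.append(capacity)
        stored += capacity
        capacity = min(capacity * 2, preset)    # stop growing at the preset group number
    return capacities


print(capacity_schedule(1000))         # [2, 4, 8, 16, 32, 64, 128, 256, 512]
print(len(capacity_schedule(10000)))   # 13
print(capacity_schedule(10000)[-2:])   # [4096, 4096]
```

For 1,000 documents the cap is never reached, so the result matches the pure doubling case; for 10,000 documents the thirteenth node stays at 4096 groups instead of doubling to 8192.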
When the current node 10 has a subsequent node 10, the pointer of the current node 10 points to the next node 10; specifically, the pointer may point to the storage address of the next node 10. When the current node 10 is the last node 10, the pointer of the current node 10 may be a null pointer (null). After the data field 12 of the current node 10 is full, a new node 10 can be allocated after the current node 10, and the pointer of the current node 10 changes from the original null pointer to a pointer to the new node 10.
In some embodiments, the preset group number is 4096. When the system memory page of the environment in which the inverted index storage structure 100 runs is 4096 bytes (the default Linux page size), the capacity of the data field 12 is a multiple of the system memory page, which makes batch reads and writes of the data field 12 more efficient.
Of course, in other embodiments the preset group number may also be 1024, 2048, 8192, and so on. The preset group number may also be configured by the user as required, and is not specifically limited here.
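Whether a given preset group number yields a page-aligned data field can be checked with simple arithmetic. In this sketch the helper names, the 4-unit size of one group, and the 4096-unit page are assumptions; all sizes only need to be expressed in one consistent unit.

```python
def data_field_size(groups: int, size_per_group: int) -> int:
    """Capacity of one data field, in the same unit as size_per_group."""
    return groups * size_per_group


def is_page_multiple(groups: int, size_per_group: int, page_size: int = 4096) -> bool:
    """True when the data field capacity is a whole number of memory pages."""
    return data_field_size(groups, size_per_group) % page_size == 0


SIZE_PER_GROUP = 4  # assumption: one group is a 4-unit document number

print(is_page_multiple(4096, SIZE_PER_GROUP))  # True
print(is_page_multiple(1024, SIZE_PER_GROUP))  # True
print(is_page_multiple(1000, SIZE_PER_GROUP))  # False
```

On a real system the page size can be read at run time, for example from Python's `mmap.PAGESIZE`, rather than assumed.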
In some embodiments, the information of a document includes the number of the document (docid). Specifically, the docid occupies little storage space, generally 1kb, so having the information of a document include only the document number reduces the storage space of the inverted index storage structure 100.
Of course, in other embodiments the information of a document may also include information such as the number of times the feature information appears in the document (TF) and the positions in the document at which the feature information appears.
Referring to Fig. 6, the data storage method of an embodiment of the present invention can be used to store data to be stored in the inverted index storage structure 100 of any of the above embodiments. The inverted index storage structure 100 includes at least one node 10, and each node 10 includes a data field 12 and a pointer field 14. The data field 12 of one node 10 is used to store multiple groups of data, and the nodes 10 include a current node 10. The data storage method includes:
011: judging whether the data to be stored can be written in full to the data field 12 of the current node 10;
012: when the data to be stored can be written in full to the data field 12 of the current node 10, writing a null pointer to the pointer field 14 of the current node 10;
013: when the data field 12 of the current node 10 cannot hold all of the data to be stored, generating a next node 10 of the inverted index storage structure 100;
014: writing a pointer to the next node 10 into the pointer field 14 of the current node 10;
015: writing the remaining data to be stored to the data field 12 of the next node 10; and
016: taking the next node 10 as the current node 10 and returning to step 011.
The data storage method of an embodiment of the present invention stores multiple groups of data in the data field 12 of a single node 10, where each group of data may correspond to the information of one document. This reduces the number of nodes 10 and therefore the storage space required for pointers.
Specifically, when the inverted index storage structure 100 is constructed, a first node 10 can be generated for it. While data to be stored are being stored, the last node 10 of the inverted index storage structure 100 can be taken as the current node 10, and it is judged whether the data to be stored can be written in full to the data field 12 of the current node 10, that is, whether the remaining storage space of the data field 12 of the current node 10 is greater than or equal to the storage space required by the data to be stored. If so, the data to be stored can be written in full to the data field 12 of the current node 10; all of the data to be stored are written to the data field 12 of the current node 10 and the write is complete. If not, the data field 12 of the current node 10 cannot hold all of the data to be stored. In that case part of the data to be stored can be written according to the remaining storage space of the data field 12 of the current node 10, a next node 10 is generated, a pointer to the next node 10 is written to the pointer field 14 of the current node 10, and the remaining data to be stored are written to the next node 10. The next node 10 is now the last node 10 of the inverted index storage structure 100, so it can be taken as the current node and the data storage method of the embodiment of the present invention can be executed in a loop until all of the data to be stored have been written to the inverted index storage structure 100.
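Steps 011 to 016 can be sketched as a loop. This is a minimal illustration and not the patent's implementation: the names `Node`, `store` and `next_capacity` are assumptions, document ids stand in for groups of data, and the capacity rule for new nodes is passed in as a function so that fixed or growing capacities can both be tried.

```python
from typing import Callable, List, Optional


class Node:
    def __init__(self, capacity: int):
        self.capacity = capacity            # groups the data field can hold
        self.data: List[int] = []           # data field
        self.next: Optional["Node"] = None  # pointer field, None as the null pointer


def store(tail: Node, pending: List[int],
          next_capacity: Callable[[int], int] = lambda cap: cap) -> Node:
    """Write pending groups starting at the last node and return the new last node."""
    current = tail
    while True:
        free = current.capacity - len(current.data)
        if len(pending) <= free:                 # step 011: does everything fit?
            current.data.extend(pending)
            current.next = None                  # step 012: write the null pointer
            return current
        current.data.extend(pending[:free])      # fill the remaining space first
        pending = pending[free:]
        nxt = Node(next_capacity(current.capacity))  # step 013: generate the next node
        current.next = nxt                       # step 014: point to the next node
        current = nxt                            # steps 015-016: the next pass writes the remainder


head = Node(2)
tail = store(head, [1, 2, 3, 4, 5])
print(head.data, head.next.data, tail.data)  # [1, 2] [3, 4] [5]
```

Passing `next_capacity=lambda cap: cap * 2` would give the doubling variant, and `lambda cap: min(cap * 2, 4096)` the capped variant.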
Referring to Fig. 7, the content similarity detection method of an embodiment of the present invention can be used with the inverted index storage structure 100 of any of the above embodiments. The inverted index storage structure 100 includes at least one node 10 and feature information 20, and each node 10 includes a data field 12 and a pointer field 14. The data field 12 of one node 10 is used to store multiple groups of data, and each group of data corresponds to the information of one document. The pointer field 14 is used to store a pointer. The content similarity detection method includes:
02: obtaining, according to the correspondence between the feature information 20 in the inverted index storage structure 100 and the information of the documents, the number of identical feature information items 20 shared by two documents;
04: judging whether the number is greater than or equal to a predetermined number;
06: when the number is greater than or equal to the predetermined number, judging that the two documents are similar; and
08: when the number is less than the predetermined number, judging that the two documents are dissimilar.
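Steps 02 to 08 can be sketched as follows. For brevity the sketch assumes the document ids of each feature have already been read out of the nodes into a plain list; the function names, the sample index, the use of id 0 for the document being analyzed, and the threshold of 3 are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def shared_feature_counts(index: Dict[str, List[int]], doc_id: int) -> Counter:
    """Step 02: count, for every other document, the features it shares with doc_id."""
    counts: Counter = Counter()
    for doc_ids in index.values():
        if doc_id in doc_ids:          # this feature occurs in the analyzed document
            for other in doc_ids:
                if other != doc_id:
                    counts[other] += 1
    return counts


def is_similar(index: Dict[str, List[int]], doc_a: int, doc_b: int, predetermined: int) -> bool:
    """Steps 04-08: similar when the shared count reaches the predetermined number."""
    return shared_feature_counts(index, doc_a)[doc_b] >= predetermined


# Hypothetical index: feature -> ids of the documents containing it (0 is the analyzed document).
index = {"A": [0, 1, 2, 3], "B": [0, 1, 2], "C": [0, 1], "D": [0, 1]}

print(dict(shared_feature_counts(index, 0)))    # {1: 4, 2: 2, 3: 1}
print(is_similar(index, 0, 1, predetermined=3))  # True
print(is_similar(index, 0, 3, predetermined=3))  # False
```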
Referring to Fig. 8, the content similarity detection system 300 of an embodiment of the present invention can be used with the inverted index storage structure 100 of any of the above embodiments. The inverted index storage structure 100 includes at least one node 10 and feature information 20, and each node 10 includes a data field 12 and a pointer field 14. The data field 12 of one node 10 is used to store multiple groups of data, and each group of data corresponds to the information of one document. The pointer field 14 is used to store a pointer. The content similarity detection system 300 includes an obtaining module 310, a first judgment module 320, a second judgment module 330, and a third judgment module 340. The obtaining module 310 is used to obtain, according to the correspondence between the feature information 20 in the inverted index storage structure 100 and the information of the documents, the number of identical feature information items 20 shared by two documents. The first judgment module 320 is used to judge whether the number is greater than or equal to a predetermined number. The second judgment module 330 is used to judge that the two documents are similar when the number is greater than or equal to the predetermined number. The third judgment module 340 is used to judge that the two documents are dissimilar when the number is less than the predetermined number.
In other words, the content similarity detection method of an embodiment of the present invention can be implemented by the content similarity detection system 300 of an embodiment of the present invention, where step 02 can be implemented by the obtaining module 310, step 04 by the first judgment module 320, step 06 by the second judgment module 330, and step 08 by the third judgment module 340.
The content similarity detection method and content similarity detection system 300 of the embodiments of the present invention take advantage of the small storage space and high scalability of the inverted index storage structure 100 to judge whether two documents are similar within a small storage space, thereby providing reliable technical support for protecting original works.
In some embodiments, step 02 may proceed as follows: select one document as the document to be analyzed and obtain its characteristic information 20; according to the characteristic information 20 of the document to be analyzed, obtain the information of the documents corresponding to each piece of characteristic information 20; traverse the information of the documents corresponding to each piece of characteristic information 20 of the document to be analyzed and, for each document corresponding to a piece of characteristic information (other than the document to be analyzed itself), add one to the count of characteristic information that document shares with the document to be analyzed; finally, determine the count belonging to the document that shares the most characteristic information 20 with the document to be analyzed.
Referring to Fig. 9, in one embodiment the characteristic information 20 of the document to be analyzed includes, for example, A, B, C and D. The documents corresponding to characteristic information A are the document to be analyzed, document 1, document 2 and document 3; the documents corresponding to characteristic information B are the document to be analyzed, document 1 and document 2; the documents corresponding to characteristic information C are the document to be analyzed and document 1; and the documents corresponding to characteristic information D are the document to be analyzed and document 1. Traversing the characteristic information 20 gives the following running counts of characteristic information shared with the document to be analyzed: after characteristic information A, 1 for document 1, 1 for document 2 and 1 for document 3; after characteristic information B, 2 for document 1, 2 for document 2 and 1 for document 3; after characteristic information C, 3 for document 1, 2 for document 2 and 1 for document 3; after characteristic information D, 4 for document 1, 2 for document 2 and 1 for document 3. It follows that the document sharing the most characteristic information 20 with the document to be analyzed is document 1, and the number of identical pieces of characteristic information 20 is 4.
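The traversal in the Fig. 9 example can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary `index`, the label `doc0` for the document to be analyzed, and the function name `count_shared_features` are assumptions made for the example and do not come from the patent text.

```python
from collections import Counter

def count_shared_features(index, analyzed_doc, features):
    """For every other document, count the features it shares with analyzed_doc."""
    shared = Counter()
    for feature in features:
        for doc in index.get(feature, []):
            if doc != analyzed_doc:  # skip the document to be analyzed itself
                shared[doc] += 1
    return shared

# Posting lists from the Fig. 9 example; "doc0" stands for the document to be analyzed.
index = {
    "A": ["doc0", "doc1", "doc2", "doc3"],
    "B": ["doc0", "doc1", "doc2"],
    "C": ["doc0", "doc1"],
    "D": ["doc0", "doc1"],
}
shared = count_shared_features(index, "doc0", ["A", "B", "C", "D"])
best_doc, best_count = shared.most_common(1)[0]
print(best_doc, best_count)  # doc1 4
```

The final counts (4, 2 and 1 for documents 1, 2 and 3) match the running totals listed in the example.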
Referring to Fig. 10, an embodiment of the present invention further provides a computer-readable storage medium 500: one or more non-volatile computer-readable storage media 500 containing computer-executable instructions which, when executed by one or more processors 600, cause the processors 600 to perform the data storage method of any of the above embodiments and/or the content similarity detection method of any of the above embodiments.
For example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs the data storage method comprising the following steps:
011: Judge whether all of the data to be stored can be written into the data field 12 of the current node 10;
012: When all of the data to be stored can be written into the data field 12 of the current node 10, write a null pointer into the pointer field 14 of the current node 10;
013: When the data field 12 of the current node 10 cannot store all of the data to be stored, generate the next node 10 of the inverted index storage structure 100;
014: Write a pointer pointing to the next node 10 into the pointer field 14 of the current node 10;
015: Write the remaining data to be stored into the data field 12 of the next node 10; and
016: Take the next node 10 as the current node 10 and return to step 011.
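Steps 011 to 016 can be read as a loop over a chain of nodes. The following Python sketch is one possible interpretation, not the patented implementation: the class name `Node`, the function name `store`, the fixed per-node capacity, and the choice to fill the current node before linking a new one are all assumptions made for illustration.

```python
class Node:
    """One node of a posting chain: a data field for several groups plus a pointer field."""
    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical: maximum number of groups per node
        self.data = []            # plays the role of data field 12
        self.next = None          # plays the role of pointer field 14; None is the null pointer

def store(current, pending):
    """Write the groups in `pending` starting at `current`; return the last node written."""
    pending = list(pending)
    while True:
        free = current.capacity - len(current.data)
        # Step 011: can everything still pending fit into the current node?
        if len(pending) <= free:
            current.data.extend(pending)
            current.next = None            # step 012: write the null pointer
            return current
        # Fill whatever still fits, then generate and link the next node (steps 013, 014).
        current.data.extend(pending[:free])
        pending = pending[free:]
        nxt = Node(current.capacity)       # same capacity for every node (assumption)
        current.next = nxt
        current = nxt                      # step 016; step 015 is carried out on the next pass

head = Node(capacity=4)
store(head, range(1, 11))  # ten document numbers
node = head
while node is not None:
    print(node.data)
    node = node.next
```

With a capacity of 4, the ten groups occupy three nodes, so only three pointer fields are needed rather than one per group.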
As another example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs the content similarity detection method comprising the following steps:
02: Obtain the number of identical pieces of characteristic information 20 shared by two documents, according to the correspondence between the characteristic information 20 and the information of the documents in the inverted index storage structure 100;
04: Judge whether the number is greater than or equal to a predetermined number;
06: When the number is greater than or equal to the predetermined number, judge that the two documents are similar;
08: When the number is less than the predetermined number, judge that the two documents are dissimilar.
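Steps 02 to 08 amount to a count followed by a threshold comparison. The sketch below assumes the inverted index is available as a mapping from each piece of characteristic information to the set of documents that contain it; the function names and the threshold value of 3 are hypothetical and chosen only to make the example concrete.

```python
def shared_feature_count(index, doc_a, doc_b):
    """Step 02: number of features whose posting set contains both documents."""
    return sum(1 for docs in index.values() if doc_a in docs and doc_b in docs)

def is_similar(count, predetermined_number):
    """Steps 04 to 08: similar when count >= predetermined_number, dissimilar otherwise."""
    return count >= predetermined_number

index = {
    "A": {"doc0", "doc1", "doc2", "doc3"},
    "B": {"doc0", "doc1", "doc2"},
    "C": {"doc0", "doc1"},
    "D": {"doc0", "doc1"},
}
PREDETERMINED_NUMBER = 3  # hypothetical threshold; the text does not fix a value

for other in ("doc1", "doc2"):
    count = shared_feature_count(index, "doc0", other)
    print(other, count, is_similar(count, PREDETERMINED_NUMBER))
```

Under this threshold, the pair sharing four pieces of characteristic information is judged similar and the pair sharing two is judged dissimilar.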
Referring to Fig. 11, an embodiment of the present invention further provides a computer device 700. The computer device 700 includes a memory 720 and a processor 740. Computer-readable instructions are stored in the memory 720 and, when executed by the processor 740, cause the processor 740 to perform the data storage method of any of the above embodiments and/or the content similarity detection method of any of the above embodiments.
For example, when the computer-readable instructions are executed by the processor 740, the processor 740 performs the data storage method comprising the following steps:
011: Judge whether all of the data to be stored can be written into the data field 12 of the current node 10;
012: When all of the data to be stored can be written into the data field 12 of the current node 10, write a null pointer into the pointer field 14 of the current node 10;
013: When the data field 12 of the current node 10 cannot store all of the data to be stored, generate the next node 10 of the inverted index storage structure 100;
014: Write a pointer pointing to the next node 10 into the pointer field 14 of the current node 10;
015: Write the remaining data to be stored into the data field 12 of the next node 10; and
016: Take the next node 10 as the current node 10 and return to step 011.
As another example, when the computer-readable instructions are executed by the processor 740, the processor 740 performs the content similarity detection method comprising the following steps:
02: Obtain the number of identical pieces of characteristic information 20 shared by two documents, according to the correspondence between the characteristic information 20 and the information of the documents in the inverted index storage structure 100;
04: Judge whether the number is greater than or equal to a predetermined number;
06: When the number is greater than or equal to the predetermined number, judge that the two documents are similar;
08: When the number is less than the predetermined number, judge that the two documents are dissimilar.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise processing it in a suitable manner, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.
Claims (15)
1. A data storage method for storing data to be stored into an inverted index storage structure, characterized in that the inverted index storage structure comprises at least one node, each node comprises a data field and a pointer field, the data field is used to store multiple groups of data, the nodes include a current node, and the data storage method comprises:
judging whether all of the data to be stored can be written into the data field of the current node;
when all of the data to be stored can be written into the data field of the current node, writing a null pointer into the pointer field of the current node;
when the data field of the current node cannot store all of the data to be stored, generating a next node of the inverted index storage structure;
writing a pointer pointing to the next node into the pointer field of the current node;
writing the remaining data to be stored into the data field of the next node; and
taking the next node as the current node and returning to the step of judging whether all of the data to be stored can be written into the data field of the current node.
2. The data storage method according to claim 1, characterized in that the inverted index storage structure comprises a plurality of the nodes, and the number of groups of data stored in the data field of each node is the same.
3. The data storage method according to claim 1, characterized in that the inverted index storage structure comprises a plurality of the nodes, and the number of groups of data stored in the data field of a node is positively correlated with the order of the node.
4. The data storage method according to claim 1, characterized in that the inverted index storage structure comprises a plurality of the nodes; when the number of groups of data stored in the data field of a node is less than a preset group number, the number of groups of data stored in the data field of the node is positively correlated with the order of the node; and when the number of groups of data stored in the data field of a node is equal to the preset group number, the number of groups of data stored in the data field of each node after that node is the preset group number.
5. The data storage method according to claim 4, characterized in that the preset group number is 4096.
6. The data storage method according to claim 1, characterized in that each group of data corresponds to the information of one document, and the information of the document includes the number of the document.
7. A content similarity detection method for an inverted index storage structure, characterized in that the inverted index storage structure comprises characteristic information and at least one node, each node comprises a data field and a pointer field, the data field is used to store multiple groups of data, each group of data corresponds to the information of one document, the pointer field is used to store a pointer, and the content similarity detection method comprises:
obtaining the number of identical pieces of the characteristic information shared by two of the documents, according to the correspondence between the characteristic information and the information of the documents in the inverted index storage structure;
judging whether the number is greater than or equal to a predetermined number;
when the number is greater than or equal to the predetermined number, judging that the two documents are similar; and
when the number is less than the predetermined number, judging that the two documents are dissimilar.
8. The content similarity detection method according to claim 7, characterized in that the inverted index storage structure comprises a plurality of the nodes, and the number of groups of data stored in the data field of each node is the same.
9. The content similarity detection method according to claim 7, characterized in that the inverted index storage structure comprises a plurality of the nodes, and the number of groups of data stored in the data field of a node is positively correlated with the order of the node.
10. The content similarity detection method according to claim 7, characterized in that the inverted index storage structure comprises a plurality of the nodes; when the number of groups of data stored in the data field of a node is less than a preset group number, the number of groups of data stored in the data field of the node is positively correlated with the order of the node; and when the number of groups of data stored in the data field of a node is equal to the preset group number, the number of groups of data stored in the data field of each node after that node is the preset group number.
11. The content similarity detection method according to claim 10, characterized in that the preset group number is 4096.
12. The content similarity detection method according to claim 7, characterized in that the information of the document includes the number of the document.
13. A content similarity detection system for an inverted index storage structure, characterized in that the inverted index storage structure comprises characteristic information and at least one node, each node comprises a data field and a pointer field, the data field is used to store multiple groups of data, each group of data corresponds to the information of one document, the pointer field is used to store a pointer, and the content similarity detection system comprises:
an acquisition module, used to obtain the number of identical pieces of the characteristic information shared by two of the documents, according to the correspondence between the characteristic information and the information of the documents in the inverted index storage structure;
a first judgment module, used to judge whether the number is greater than or equal to a predetermined number;
a second judgment module, used to judge that the two documents are similar when the number is greater than or equal to the predetermined number; and
a third judgment module, used to judge that the two documents are dissimilar when the number is less than the predetermined number.
14. One or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the data storage method according to any one of claims 1 to 6 and/or the content similarity detection method according to any one of claims 7 to 12.
15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the data storage method according to any one of claims 1 to 6 and/or the content similarity detection method according to any one of claims 7 to 12.
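The sizing rule in claims 3 to 5 and 9 to 11 only requires that the number of groups per node grow with the node's order until it reaches the preset group number of 4096. The sketch below illustrates one schedule that satisfies this, assuming a starting size of 16 groups and doubling at each node; both of those choices, and the function names, are assumptions for illustration and are not stated in the claims.

```python
PRESET_GROUPS = 4096  # the preset group number named in claims 5 and 11

def node_capacity(position, first=16, preset=PRESET_GROUPS):
    """Groups held by the node at 0-based `position`: doubles each node, capped at `preset`."""
    return min(first << position, preset)

def nodes_needed(total_groups, capacity_of=node_capacity):
    """Count the nodes (and hence pointer fields) needed to hold `total_groups` groups."""
    nodes, stored = 0, 0
    while stored < total_groups:
        stored += capacity_of(nodes)
        nodes += 1
    return nodes

print([node_capacity(i) for i in range(10)])
print(nodes_needed(10_000))
```

Under these assumptions, 10,000 groups fit in 10 nodes, so 10 pointer fields are stored where a one-group-per-node list would need 10,000.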
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810637185.5A CN108897817B (en) | 2018-06-20 | 2018-06-20 | Data storage method, detection method and system, storage medium and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810637185.5A CN108897817B (en) | 2018-06-20 | 2018-06-20 | Data storage method, detection method and system, storage medium and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108897817A true CN108897817A (en) | 2018-11-27 |
CN108897817B CN108897817B (en) | 2023-04-07 |
Family
ID=64345119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810637185.5A Active CN108897817B (en) | 2018-06-20 | 2018-06-20 | Data storage method, detection method and system, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108897817B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113138902A (en) * | 2021-04-27 | 2021-07-20 | 上海英众信息科技有限公司 | Computer host heat dissipation system and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050182750A1 (en) * | 2004-02-13 | 2005-08-18 | Memento, Inc. | System and method for instrumenting a software application |
US20060117002A1 (en) * | 2004-11-26 | 2006-06-01 | Bing Swen | Method for search result clustering |
CN101136013A (en) * | 2006-09-01 | 2008-03-05 | 北大方正集团有限公司 | Method for quick updating data domain in full text retrieval system |
CN101206752A (en) * | 2007-12-25 | 2008-06-25 | 北京科文书业信息技术有限公司 | Electric commerce website related products recommendation system and method |
CN102750393A (en) * | 2012-07-13 | 2012-10-24 | 携程计算机技术(上海)有限公司 | Composite index structure and searching method based on same |
CN106991102A (en) * | 2016-01-21 | 2017-07-28 | 腾讯科技(深圳)有限公司 | The processing method and processing system of key-value pair in inverted index |
Also Published As
Publication number | Publication date |
---|---|
CN108897817B (en) | 2023-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||