CN108897817A - Data storage method, detection method and system, storage medium and computer equipment

Info

Publication number
CN108897817A
Authority
CN
China
Prior art keywords
node
data
pointer
data field
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810637185.5A
Other languages
Chinese (zh)
Other versions
CN108897817B (en)
Inventor
白帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810637185.5A priority Critical patent/CN108897817B/en
Publication of CN108897817A publication Critical patent/CN108897817A/en
Application granted granted Critical
Publication of CN108897817B publication Critical patent/CN108897817B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a data storage method for storing data to be stored into an inverted index storage structure. The inverted index storage structure includes nodes, and each node includes a pointer field and a data field for storing multiple groups of data. The data storage method includes: judging whether the data to be stored can all be written into the data field of the current node; if so, writing a null pointer into the pointer field of the current node; if not, generating a next node, writing a pointer to the next node into the pointer field of the current node, writing the remaining data to be stored into the data field of the next node, and taking the next node as the current node and returning to the judging step. The invention also discloses a content similarity detection method and system, a storage medium and computer equipment. The data storage method, content similarity detection method and system, computer-readable storage medium and computer equipment of the invention store multiple groups of data in the data field of a single node, which reduces the number of nodes and therefore the storage space required for pointers.

Description

Data storage method, detection method and system, storage medium and computer equipment
Technical field
The present invention relates to storage technology, and in particular to a data storage method, a content similarity detection method, a content similarity detection system, a non-volatile computer-readable storage medium and computer equipment.
Background
To find required content quickly in a large number of documents, a database generally builds an index table that associates the content of documents with the documents themselves. However, existing index tables have problems such as occupying a large storage space.
Summary of the invention
Embodiments of the present invention provide a data storage method, a content similarity detection method, a content similarity detection system, a non-volatile computer-readable storage medium and computer equipment.
The data storage method of embodiments of the present invention is used to store data to be stored into an inverted index storage structure. The inverted index storage structure includes at least one node, each node includes a data field and a pointer field, the data field is used to store multiple groups of data, and the nodes include a current node. The data storage method includes:
judging whether the data to be stored can all be written into the data field of the current node;
when the data to be stored can all be written into the data field of the current node, writing a null pointer into the pointer field of the current node;
when the data field of the current node cannot store all of the data to be stored, generating a next node of the inverted index storage structure;
writing a pointer to the next node into the pointer field of the current node;
writing the remaining data to be stored into the data field of the next node; and
taking the next node as the current node and returning to the step of judging whether the data to be stored can all be written into the data field of the current node.
The content similarity detection method of embodiments of the present invention is used with an inverted index storage structure. The inverted index storage structure includes characteristic information and at least one node, and each node includes a data field and a pointer field. The data field is used to store multiple groups of data, each group of data corresponds to the information of one document, and the pointer field is used to store a pointer. The content similarity detection method includes:
obtaining the number of identical pieces of characteristic information shared by two documents according to the correspondence between the characteristic information and the document information in the inverted index storage structure;
judging whether the number is greater than or equal to a predetermined number;
when the number is greater than or equal to the predetermined number, judging that the two documents are similar;
when the number is less than the predetermined number, judging that the two documents are dissimilar.
The content similarity detection system of embodiments of the present invention is used with an inverted index storage structure. The inverted index storage structure includes characteristic information and at least one node, and each node includes a data field and a pointer field. The data field is used to store multiple groups of data, each group of data corresponds to the information of one document, and the pointer field is used to store a pointer. The content similarity detection system includes an obtaining module, a first judgment module, a second judgment module and a third judgment module. The obtaining module is used to obtain the number of identical pieces of characteristic information shared by two documents according to the correspondence between the characteristic information and the document information in the inverted index storage structure. The first judgment module is used to judge whether the number is greater than or equal to a predetermined number. The second judgment module is used to judge that the two documents are similar when the number is greater than or equal to the predetermined number. The third judgment module is used to judge that the two documents are dissimilar when the number is less than the predetermined number.
Embodiments of the present invention provide one or more non-volatile computer-readable storage media containing computer-executable instructions. When the computer-executable instructions are executed by one or more processors, the processors perform the above data storage method and/or the above content similarity detection method.
The computer equipment of embodiments of the present invention includes a memory and a processor. Computer-readable instructions are stored in the memory, and when the instructions are executed by the processor, the processor performs the above data storage method and/or the above content similarity detection method.
The data storage method, content similarity detection method, content similarity detection system, computer-readable storage medium and computer equipment of embodiments of the present invention store multiple groups of data in the data field of a single node, which reduces the number of nodes and therefore the storage space required for pointers.
Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the inverted index storage structure of some embodiments of the present invention.
Fig. 2 is a schematic structural diagram of a traditional chain structure.
Fig. 3 is a schematic structural diagram of a traditional array structure.
Fig. 4 is a schematic structural diagram of the inverted index storage structure of some embodiments of the present invention.
Fig. 5 is a schematic structural diagram of the inverted index storage structure of some embodiments of the present invention.
Fig. 6 is a schematic flowchart of the data storage method of some embodiments of the present invention.
Fig. 7 is a schematic flowchart of the content similarity detection method of some embodiments of the present invention.
Fig. 8 is a schematic diagram of the content similarity detection system of some embodiments of the present invention.
Fig. 9 is a schematic diagram of an application scenario of the content similarity detection method of some embodiments of the present invention.
Fig. 10 is a schematic diagram of the computer-readable storage medium of some embodiments of the present invention.
Fig. 11 is a schematic diagram of the computer equipment of some embodiments of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the invention and should not be construed as limiting it.
In the description of the present invention, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless otherwise specifically defined.
In the description of the present invention, it should be noted that, unless otherwise clearly specified and limited, the terms "installation", "connected" and "connection" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical, electrical or communicative; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements or an interaction between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
The following disclosure provides many different embodiments or examples for realizing different structures of the invention. To simplify the disclosure, the components and arrangements of specific examples are described below. They are merely examples and are not intended to limit the invention. In addition, the present invention may repeat reference numerals and/or reference letters in different examples; this repetition is for simplicity and clarity and does not in itself indicate a relationship between the various embodiments and/or arrangements discussed. The present invention also provides examples of various specific processes and materials, but a person of ordinary skill in the art will recognize that other processes and/or other materials may be used.
Referring to Fig. 1, the inverted index storage structure 100 of embodiments of the present invention includes at least one node 10. Each node 10 includes a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple groups of data. The pointer field 14 is used to store a pointer.
The inverted index storage structure 100 of embodiments of the present invention stores multiple groups of data in the data field 12 of a single node 10, where each group of data can correspond to the information of one document. This reduces the number of nodes 10 and therefore the storage space required for pointers.
The inverted index storage structure 100 of embodiments of the present invention further includes characteristic information 20. The characteristic information 20 is, for example, a word. The correspondence between the characteristic information 20 and the data (the document information) can be obtained through the inverted index storage structure 100.
Referring to Fig. 2, existing inverted index storage structures generally use a traditional chain structure. In a traditional chain structure, the data field of each node can store only one group of data, that is, the information of only one document. When the database contains many documents, each piece of characteristic information also corresponds to many documents. For example, if the characteristic information "people" appears in 1,000 documents, the chain structure needs 1,000 nodes (each node records the information of one document), which are then linked by the pointers in their pointer fields to obtain the correspondence between the characteristic information "people" and the documents. However, a pointer itself needs a certain amount of storage space (for example, 1 kb on a 32-bit system and 2 kb on a 64-bit system). When the number of nodes is large, the pointers therefore occupy a large amount of storage space, so an index table built from a traditional chain structure occupies a large amount of storage space.
Referring to Fig. 1, the inverted index storage structure 100 of embodiments of the present invention stores multiple groups of data, that is, the information of multiple documents, in the data field 12 of a single node 10. To store the same document information for the same characteristic information 20, the inverted index storage structure 100 needs fewer nodes 10 than a traditional chain structure, so the storage space required for the pointers of the nodes 10 is reduced.
Referring to Fig. 3, existing inverted index storage structures also use a traditional array structure. A traditional array structure stores all data in a pre-allocated, contiguous, fixed-length storage space, so the data can be accessed directly without following pointers, and no storage space is spent on pointers. However, when the data to be stored exceeds the contiguous fixed-length storage space, a traditional array structure cannot expand dynamically. When the documents in the database increase in real time, a traditional array structure therefore has difficulty meeting business requirements.
Referring to Fig. 1, the inverted index storage structure 100 of embodiments of the present invention uses pointers to jump between nodes. After the storage space of the data field 12 of one node 10 is full, the pointer can be used to jump to the next node 10, and the data field 12 of the next node 10 then serves as storage space for continuing to store data.
Referring to Fig. 4, in some embodiments, the inverted index storage structure 100 includes a single node 10. When the characteristic information 20 corresponds to a small number of documents, the corresponding inverted index storage structure 100 may need only a single node 10 to store the corresponding document information.
Referring to Fig. 5, in some embodiments, the inverted index storage structure 100 includes multiple nodes 10, the data field 12 of every node 10 can store the same number of groups of data, and the pointer of the current node 10 either points to the next node 10 or is a null pointer.
In one embodiment, the data field 12 of each node 10 can store 2 groups of data, and the characteristic information 20 "people" appears in 1,000 documents. When the index table is built, this characteristic information 20 needs only 500 nodes 10, which is 500 fewer than the 1,000 nodes required by a traditional chain structure. The number of groups of data that the data field 12 of each node 10 can store is, for example, 2, 4, 1024 or 2048, and is not specifically limited here. It should be noted that when the group number is relatively large, for example greater than a set value, the use demand of high-frequency characteristic information (characteristic information 20 that appears in many documents) is better met, because only a small number of nodes 10 are needed to associate the high-frequency characteristic information with its documents. When the group number is relatively small, for example less than a set value, the use demand of low-frequency characteristic information (characteristic information 20 that appears in few documents) is better met, because it avoids the waste caused by a data field 12 whose storage space is too large to be filled.
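The node counts in this embodiment follow from a ceiling division. The sketch below is only an illustration of that arithmetic; the function name `nodes_needed_fixed` is hypothetical and does not come from the patent.

```python
import math


def nodes_needed_fixed(num_docs: int, groups_per_node: int) -> int:
    """Nodes required when every data field holds the same number of groups."""
    if groups_per_node <= 0:
        raise ValueError("groups_per_node must be positive")
    return math.ceil(num_docs / groups_per_node)


# Traditional chain structure: one group (one document) per node.
print(nodes_needed_fixed(1000, 1))  # 1000
# Embodiment in the text: two groups per node.
print(nodes_needed_fixed(1000, 2))  # 500
```

Each node carries exactly one pointer, so halving the node count also halves the space spent on pointers.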
When the current node 10 has a subsequent node 10, the pointer of the current node 10 points to the next node 10; specifically, the pointer can point to the storage address of the next node 10. When the current node 10 is the last node 10, its pointer can be a null pointer (null). After the data field 12 of the current node 10 is full, a new node 10 can be allocated after the current node 10, and the pointer of the current node 10 changes from the original null pointer to a pointer to the new node 10.
In some embodiments, the inverted index storage structure 100 includes multiple nodes 10, the pointer of the current node 10 either points to the next node 10 or is a null pointer, and the number of groups of data that the data field 12 of a node 10 can store is positively correlated with the position of the node 10 in the sequence.
Specifically, the data field 12 of the first node 10 can store relatively few groups of data, for example 2 groups. This meets the use demand of low-frequency characteristic information and avoids wasting storage space by allocating a large data field 12 for the node 10 at the outset. As the position of the node 10 in the sequence increases, the number of groups of data that the corresponding data field 12 can store can increase gradually, for example by powers of 2, so that the data fields 12 of the nodes 10 quickly provide a large storage space to meet the use demand of high-frequency characteristic information. In one embodiment, the first node 10 can store 2 groups of data, the second node 10 can store 4 groups, the third node 10 can store 8 groups, the fourth node 10 can store 16 groups, and so on. If the characteristic information 20 "people" appears in 1,000 documents, then when the index table is built this characteristic information 20 needs only nine nodes 10 (2+4+8+16+32+64+128+256+512=1022>1000), which is more than 900 fewer than the 1,000 nodes required by a traditional chain structure.
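The nine-node figure can be checked with a short loop. This is a minimal sketch that assumes the capacity doubles at every node starting from 2, as in the example; `doubling_capacities` is an illustrative name.

```python
from typing import List


def doubling_capacities(num_docs: int, first: int = 2) -> List[int]:
    """Capacities of the nodes needed when each node holds twice as many groups as the last."""
    if first <= 0:
        raise ValueError("first must be positive")
    capacities = []
    capacity = first
    remaining = num_docs
    while remaining > 0:
        capacities.append(capacity)
        remaining -= capacity
        capacity *= 2
    return capacities


caps = doubling_capacities(1000)
print(len(caps))  # 9
print(sum(caps))  # 1022
```

The node count grows only logarithmically with the number of documents, which is why high-frequency characteristic information needs so few pointers under this scheme.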
Referring to Fig. 1, in some embodiments, the inverted index storage structure 100 includes multiple nodes 10, and the pointer of the current node 10 either points to the next node 10 or is a null pointer. While the number of groups of data that the data field 12 of a node 10 can store is less than a preset group number, that number is positively correlated with the position of the node 10 in the sequence. Once the number of groups of data that the data field 12 of a node 10 can store equals the preset group number, the data fields 12 of all nodes 10 after that node 10 can store the preset group number of groups.
Specifically, the data field 12 of the first node 10 can store relatively few groups of data, for example 2 groups. This meets the use demand of low-frequency characteristic information and avoids wasting storage space by allocating a large data field 12 for the node 10 at the outset. As the position of the node 10 in the sequence increases, the number of groups of data that the corresponding data field 12 can store can increase gradually, for example by powers of 2, so that the data fields 12 of the nodes 10 quickly provide a large storage space to meet the use demand of high-frequency characteristic information. When the number of groups that the data field 12 of a node 10 can store has grown to the preset group number, the data fields 12 of all subsequent nodes 10 can store the preset group number of groups. This prevents the capacity of the data field 12 from growing without bound, which would make the storage space too large and waste it. In one embodiment, the first node 10 can store 2 groups of data, the second node 10 can store 4 groups, the third node 10 can store 8 groups, the fourth node 10 can store 16 groups, and so on until the number of groups that a node 10 can store reaches the preset group number. If the characteristic information 20 "people" appears in 1,000 documents, then when the index table is built this characteristic information 20 needs only nine nodes 10 (2+4+8+16+32+64+128+256+512=1022>1000), which is more than 900 fewer than the 1,000 nodes required by a traditional chain structure.
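The capped growth rule can be sketched the same way. The sketch assumes doubling up to the preset group number and a first capacity no larger than the preset; the function name and the small preset values in the examples are hypothetical and chosen only to make the cap visible.

```python
from typing import List


def capped_capacities(num_docs: int, first: int = 2, preset: int = 4096) -> List[int]:
    """Node capacities that double until they reach `preset`, then stay at `preset`."""
    if first <= 0 or preset < first:
        raise ValueError("require 0 < first <= preset")
    capacities = []
    capacity = first
    remaining = num_docs
    while remaining > 0:
        capacities.append(capacity)
        remaining -= capacity
        capacity = min(capacity * 2, preset)  # stop growing at the preset group number
    return capacities


print(len(capped_capacities(1000)))      # 9, the cap of 4096 is never reached
print(capped_capacities(10, preset=4))   # [2, 4, 4]
```

With a preset of 4096 the 1,000-document example behaves exactly like unbounded doubling; the cap only matters for characteristic information that appears in many thousands of documents.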
In some embodiments, the preset group number is 4096. When the memory page of the system in which the inverted index storage structure 100 runs is 4096 kb (the default page size of Linux), the capacity of the data field 12 is a multiple of the system memory page, which makes batch reading and writing of the data in the data field 12 more efficient.
Of course, in other embodiments, the preset group number may also be 1024, 2048, 8192 and so on. The preset group number can also be configured by the user as required, and is not specifically limited here.
In some embodiments, the document information includes the number of the document (docid). Specifically, a docid occupies little storage space, generally 1 kb, so document information that contains only the document number reduces the storage space of the inverted index storage structure 100.
Of course, in other embodiments, the document information can also include the number of times the characteristic information appears in the document (TF), the positions in the document at which the characteristic information appears, and similar information.
Referring to Fig. 6, the data storage method of embodiments of the present invention can be used to store data to be stored into the inverted index storage structure 100 of any of the above embodiments. The inverted index storage structure 100 includes at least one node 10, and each node 10 includes a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple groups of data, and the nodes 10 include a current node 10. The data storage method includes:
011: judging whether the data to be stored can all be written into the data field 12 of the current node 10;
012: when the data to be stored can all be written into the data field 12 of the current node 10, writing a null pointer into the pointer field 14 of the current node 10;
013: when the data field 12 of the current node 10 cannot store all of the data to be stored, generating a next node 10 of the inverted index storage structure 100;
014: writing a pointer to the next node 10 into the pointer field 14 of the current node 10;
015: writing the remaining data to be stored into the data field 12 of the next node 10; and
016: taking the next node 10 as the current node 10 and returning to step 011.
The data storage method of embodiments of the present invention stores multiple groups of data in the data field 12 of a single node 10, where each group of data can correspond to the information of one document. This reduces the number of nodes 10 and therefore the storage space required for pointers.
Specifically, when the inverted index storage structure 100 is built, a first node 10 can be generated for it. While data to be stored is being stored, the last node 10 of the inverted index storage structure 100 is taken as the current node 10, and it is judged whether the data to be stored can all be written into the data field 12 of the current node 10, that is, whether the remaining storage space of the data field 12 of the current node 10 is greater than or equal to the storage space required by the data to be stored. If so, the data field 12 of the current node 10 can hold all of the data to be stored; all of it is written into the data field 12 of the current node 10, and the writing is determined to be complete. If not, the data field 12 of the current node 10 cannot store all of the data to be stored. In this case, part of the data to be stored can be written according to the remaining storage space of the data field 12 of the current node 10, a next node 10 is generated, a pointer to the next node 10 is written into the pointer field 14 of the current node 10, and the remaining data to be stored is written into the next node 10. The next node 10 is now in effect the last node 10 of the inverted index storage structure 100, so it can be taken as the current node, and the data storage method of embodiments of the present invention is executed cyclically until all of the data to be stored has been written into the inverted index storage structure 100.
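Steps 011 to 016 can be sketched as a loop over a chain of fixed-capacity nodes. The following is a minimal illustration under stated assumptions, not the patented implementation: each group of data is represented by an integer document number, `None` stands in for the null pointer, and the names `Node`, `store`, `chain_length` and `next_capacity` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    capacity: int                                   # groups the data field can hold
    data: List[int] = field(default_factory=list)   # data field (document numbers)
    next: Optional["Node"] = None                   # pointer field; None = null pointer


def store(last: Node, pending: List[int],
          next_capacity: Callable[[int], int] = lambda cap: cap) -> Node:
    """Write `pending` starting at the last node of a chain; return the new last node."""
    current = last
    pending = list(pending)
    while True:
        free = current.capacity - len(current.data)
        if len(pending) <= free:                    # step 011: everything fits
            current.data.extend(pending)
            current.next = None                     # step 012: null pointer
            return current
        current.data.extend(pending[:free])         # fill the remaining space
        pending = pending[free:]
        new_capacity = next_capacity(current.capacity)
        if new_capacity <= 0:
            raise ValueError("next node must have a positive capacity")
        nxt = Node(capacity=new_capacity)           # step 013: generate next node
        current.next = nxt                          # step 014: point to it
        current = nxt                               # steps 015/016: continue there


def chain_length(head: Node) -> int:
    """Count the nodes reachable from `head`."""
    count, node = 0, head
    while node is not None:
        count += 1
        node = node.next
    return count


head = Node(capacity=2)
store(head, list(range(1000)), next_capacity=lambda cap: cap * 2)
print(chain_length(head))  # 9
```

Passing a different `next_capacity` function switches between the equal-capacity, doubling and capped embodiments without changing the storage loop itself.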
Referring to Fig. 7, the content similarity detection method of embodiments of the present invention can be used with the inverted index storage structure 100 of any of the above embodiments. The inverted index storage structure 100 includes at least one node 10 and characteristic information 20, and each node 10 includes a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple groups of data, and each group of data corresponds to the information of one document. The pointer field 14 is used to store a pointer. The content similarity detection method includes:
02: obtaining the number of identical pieces of characteristic information 20 shared by two documents according to the correspondence between the characteristic information 20 and the document information in the inverted index storage structure 100;
04: judging whether the number is greater than or equal to a predetermined number;
06: when the number is greater than or equal to the predetermined number, judging that the two documents are similar;
08: when the number is less than the predetermined number, judging that the two documents are dissimilar.
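Steps 02 to 08 can be sketched as a counting pass over the index. In this illustration a plain dictionary from characteristic information to document identifiers stands in for the inverted index storage structure; the function names, the sample documents and the predetermined number of 3 are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def shared_feature_counts(index: Dict[str, List[str]],
                          doc_features: Iterable[str],
                          doc_id: str) -> Counter:
    """Step 02: count, per other document, the characteristic information shared with doc_id."""
    counts: Counter = Counter()
    for feature in set(doc_features):          # each feature is counted once
        for other in index.get(feature, []):
            if other != doc_id:                # skip the document being analyzed
                counts[other] += 1
    return counts


def is_similar(shared: int, predetermined_number: int) -> bool:
    """Steps 04 to 08: similar when the shared count reaches the predetermined number."""
    return shared >= predetermined_number


# Illustrative index: characteristic information -> documents containing it.
index = {
    "A": ["query", "doc1", "doc2", "doc3"],
    "B": ["query", "doc1", "doc2"],
    "C": ["query", "doc1"],
    "D": ["query", "doc1"],
}
counts = shared_feature_counts(index, ["A", "B", "C", "D"], "query")
print(sorted(counts.items()))          # [('doc1', 4), ('doc2', 2), ('doc3', 1)]
print(is_similar(counts["doc1"], 3))   # True
print(is_similar(counts["doc2"], 3))   # False
```

Because only documents that share at least one piece of characteristic information are ever visited, the pass never compares the analyzed document against unrelated documents.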
Referring to Fig. 8, the content similarity detection system 300 of embodiments of the present invention can be used with the inverted index storage structure 100 of any of the above embodiments. The inverted index storage structure 100 includes at least one node 10 and characteristic information 20, and each node 10 includes a data field 12 and a pointer field 14. The data field 12 of a node 10 is used to store multiple groups of data, and each group of data corresponds to the information of one document. The pointer field 14 is used to store a pointer. The content similarity detection system 300 includes an obtaining module 310, a first judgment module 320, a second judgment module 330 and a third judgment module 340. The obtaining module 310 is used to obtain the number of identical pieces of characteristic information 20 shared by two documents according to the correspondence between the characteristic information 20 and the document information in the inverted index storage structure 100. The first judgment module 320 is used to judge whether the number is greater than or equal to a predetermined number. The second judgment module 330 is used to judge that the two documents are similar when the number is greater than or equal to the predetermined number. The third judgment module 340 is used to judge that the two documents are dissimilar when the number is less than the predetermined number.
In other words, the content similarity detection method of embodiments of the present invention can be implemented by the content similarity detection system 300 of embodiments of the present invention: step 02 can be implemented by the obtaining module 310, step 04 by the first judgment module 320, step 06 by the second judgment module 330, and step 08 by the third judgment module 340.
The content similarity detection method and the content similarity detection system 300 of embodiments of the present invention take advantage of the small storage space and high scalability of the inverted index storage structure 100. They can judge whether two documents are similar within a relatively small storage space, and thereby provide reliable technical support for protecting original works.
In some embodiments, step 02 can be as follows: select one document as the document to be analyzed and obtain its characteristic information 20; according to the characteristic information 20 of the document to be analyzed, obtain the information of the documents corresponding to each item of characteristic information 20; traverse the information of the documents corresponding to each item of characteristic information 20 of the document to be analyzed, and for each document corresponding to an item of characteristic information (other than the document to be analyzed itself), add one to the number of items of characteristic information that the document shares with the document to be analyzed; finally, determine the number corresponding to the document that shares the most items of characteristic information 20 with the document to be analyzed. Referring to Fig. 9, in one embodiment, the characteristic information 20 of the document to be analyzed includes, for example, A, B, C and D. The documents corresponding to characteristic information A are the document to be analyzed, document 1, document 2 and document 3; the documents corresponding to characteristic information B are the document to be analyzed, document 1 and document 2; the documents corresponding to characteristic information C are the document to be analyzed and document 1; and the documents corresponding to characteristic information D are the document to be analyzed and document 1. The characteristic information 20 is then traversed. After characteristic information A, the document to be analyzed shares 1 item of characteristic information with document 1, 1 with document 2 and 1 with document 3. After characteristic information B, it shares 2 with document 1, 2 with document 2 and 1 with document 3. After characteristic information C, it shares 3 with document 1, 2 with document 2 and 1 with document 3. After characteristic information D, it shares 4 with document 1, 2 with document 2 and 1 with document 3. It follows that the document sharing the most items of characteristic information 20 with the document to be analyzed is document 1, and the number of identical items of characteristic information 20 is 4.
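The traversal in the Fig. 9 example can be reproduced with a short Python sketch. It assumes the index is a plain dictionary from characteristic information to a list of document names; the function name `shared_feature_counts` and the label `"target"` for the document to be analyzed are hypothetical.

```python
from collections import Counter


def shared_feature_counts(index, target, target_features):
    """For each other document, count the features it shares with the target document."""
    counts = Counter()
    for feature in target_features:
        for doc in index.get(feature, ()):
            if doc != target:  # skip the document to be analyzed itself
                counts[doc] += 1
    return counts


# Data of the Fig. 9 example: characteristic information -> corresponding documents.
index = {
    "A": ["target", "doc1", "doc2", "doc3"],
    "B": ["target", "doc1", "doc2"],
    "C": ["target", "doc1"],
    "D": ["target", "doc1"],
}

counts = shared_feature_counts(index, "target", ["A", "B", "C", "D"])
best_doc, best_count = counts.most_common(1)[0]
print(best_doc, best_count)  # doc1 4
```

The final counts are 4 for document 1, 2 for document 2 and 1 for document 3, matching the walk-through above.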
Referring to Fig. 10, the embodiment of the present invention also provides a computer-readable storage medium 500: one or more non-volatile computer-readable storage media 500 containing computer-executable instructions which, when executed by one or more processors 600, cause the processors 600 to execute the data storage method of any of the above embodiments and/or the content similarity detection method of any of the above embodiments.
For example, when the computer-executable instructions are executed by the processor 600, the processor 600 executes the following steps of the data storage method:
011: Judge whether the data to be stored can all be written into the data field 12 of the current node 10;
012: When the data to be stored can all be written into the data field 12 of the current node 10, write a null pointer into the pointer field 14 of the current node 10;
013: When the data field 12 of the current node 10 cannot store all of the data to be stored, generate a next node 10 of the inverted index storage structure 100;
014: Write a pointer pointing to the next node 10 into the pointer field 14 of the current node 10;
015: Write the remaining data to be stored into the data field 12 of the next node 10; and
016: Take the next node 10 as the current node 10 and return to step 011.
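Steps 011 to 016 can be sketched in Python as a chain of fixed-capacity nodes. This is only an illustration under stated assumptions: the class `Node`, the function `store`, the fixed capacity of 4 groups per node and the use of `None` as the null pointer are all hypothetical choices, and the sketch fills the free space of the current node before spilling the remainder into the next node.

```python
class Node:
    """One node 10: a data field (list of groups) and a pointer field (next node)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.data = []    # data field 12
        self.next = None  # pointer field 14; None plays the role of the null pointer


def store(current, pending, capacity=4):
    """Write all pending groups into the chain starting at `current`; return the last node."""
    pending = list(pending)
    while True:
        free = current.capacity - len(current.data)
        if len(pending) <= free:                   # step 011: everything fits
            current.data.extend(pending)
            current.next = None                    # step 012: write the null pointer
            return current
        current.data.extend(pending[:free])        # fill what still fits (assumption)
        pending = pending[free:]
        nxt = Node(capacity)                       # step 013: generate the next node
        current.next = nxt                         # step 014: point to the next node
        nxt.data.extend(pending[:nxt.capacity])    # step 015: write remaining data
        pending = pending[nxt.capacity:]
        current = nxt                              # step 016: next node becomes current


head = Node(4)
last = store(head, range(10))
```

With ten groups and a capacity of four, the chain ends up holding groups 0 to 3, 4 to 7 and 8 to 9 in three nodes, and the pointer field of the last node is null.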
As another example, when the computer-executable instructions are executed by the processor 600, the processor 600 executes the following steps of the content similarity detection method:
02: Obtain the number of identical items of characteristic information 20 shared by two documents according to the correspondence between the characteristic information 20 and the information of the documents in the inverted index storage structure 100;
04: Judge whether the number is greater than or equal to a predetermined number;
06: When the number is greater than or equal to the predetermined number, judge that the two documents are similar;
08: When the number is less than the predetermined number, judge that the two documents are dissimilar.
Referring to Fig. 11, the embodiment of the present invention also provides a computer device 700. The computer device 700 includes a memory 720 and a processor 740. Computer-readable instructions are stored in the memory 720, and when the computer-readable instructions are executed by the processor 740, the processor 740 executes the data storage method of any of the above embodiments and/or the content similarity detection method of any of the above embodiments.
For example, when the computer-readable instructions are executed by the processor 740, the processor 740 executes the following steps of the data storage method:
011: Judge whether the data to be stored can all be written into the data field 12 of the current node 10;
012: When the data to be stored can all be written into the data field 12 of the current node 10, write a null pointer into the pointer field 14 of the current node 10;
013: When the data field 12 of the current node 10 cannot store all of the data to be stored, generate a next node 10 of the inverted index storage structure 100;
014: Write a pointer pointing to the next node 10 into the pointer field 14 of the current node 10;
015: Write the remaining data to be stored into the data field 12 of the next node 10; and
016: Take the next node 10 as the current node 10 and return to step 011.
As another example, when the computer-readable instructions are executed by the processor 740, the processor 740 executes the following steps of the content similarity detection method:
02: Obtain the number of identical items of characteristic information 20 shared by two documents according to the correspondence between the characteristic information 20 and the information of the documents in the inverted index storage structure 100;
04: Judge whether the number is greater than or equal to a predetermined number;
06: When the number is greater than or equal to the predetermined number, judge that the two documents are similar;
08: When the number is less than the predetermined number, judge that the two documents are dissimilar.
In the description of this specification, descriptions referring to the terms "an embodiment", "some embodiments", "an illustrative embodiment", "an example", "a specific example" or "some examples" mean that the particular features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise suitably processing it, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims (15)

1. A data storage method for storing data to be stored into an inverted index storage structure, characterized in that the inverted index storage structure includes at least one node, each node includes a data field and a pointer field, the data field is used to store multiple groups of data, the node includes a current node, and the data storage method includes:
judging whether the data to be stored can all be written into the data field of the current node;
when the data to be stored can all be written into the data field of the current node, writing a null pointer into the pointer field of the current node;
when the data field of the current node cannot store all of the data to be stored, generating a next node of the inverted index storage structure;
writing a pointer pointing to the next node into the pointer field of the current node;
writing the remaining data to be stored into the data field of the next node; and
taking the next node as the current node and returning to the step of judging whether the data to be stored can all be written into the data field of the current node.
2. The data storage method according to claim 1, characterized in that the inverted index storage structure includes a plurality of the nodes, and the number of groups of the data stored in the data field of each node is the same.
3. The data storage method according to claim 1, characterized in that the inverted index storage structure includes a plurality of the nodes, and the number of groups of the data stored in the data field of a node is positively correlated with the order of the node.
4. The data storage method according to claim 1, characterized in that the inverted index storage structure includes a plurality of the nodes; when the number of groups of the data stored in the data field of a node is less than a preset group number, the number of groups of the data stored in the data field of the node is positively correlated with the order of the node; and when the number of groups of the data stored in the data field of a node is equal to the preset group number, the number of groups of the data stored in the data field of each node subsequent to that node is the preset group number.
5. The data storage method according to claim 4, characterized in that the preset group number is 4096.
6. The data storage method according to claim 1, characterized in that each group of the data corresponds to the information of one document, and the information of the document includes the number of the document.
7. A content similarity detection method for an inverted index storage structure, characterized in that the inverted index storage structure includes characteristic information and at least one node, each node includes a data field and a pointer field, the data field is used to store multiple groups of data, each group of the data corresponds to the information of one document, the pointer field is used to store a pointer, and the content similarity detection method includes:
obtaining the number of identical items of the characteristic information shared by two of the documents according to the correspondence between the characteristic information and the information of the documents in the inverted index storage structure;
judging whether the number is greater than or equal to a predetermined number;
when the number is greater than or equal to the predetermined number, judging that the two documents are similar;
when the number is less than the predetermined number, judging that the two documents are dissimilar.
8. The content similarity detection method according to claim 7, characterized in that the inverted index storage structure includes a plurality of the nodes, and the number of groups of the data stored in the data field of each node is the same.
9. The content similarity detection method according to claim 7, characterized in that the inverted index storage structure includes a plurality of the nodes, and the number of groups of the data stored in the data field of a node is positively correlated with the order of the node.
10. The content similarity detection method according to claim 7, characterized in that the inverted index storage structure includes a plurality of the nodes; when the number of groups of the data stored in the data field of a node is less than a preset group number, the number of groups of the data stored in the data field of the node is positively correlated with the order of the node; and when the number of groups of the data stored in the data field of a node is equal to the preset group number, the number of groups of the data stored in the data field of each node subsequent to that node is the preset group number.
11. The content similarity detection method according to claim 10, characterized in that the preset group number is 4096.
12. The content similarity detection method according to claim 7, characterized in that the information of the document includes the number of the document.
13. A content similarity detection system for an inverted index storage structure, characterized in that the inverted index storage structure includes characteristic information and at least one node, each node includes a data field and a pointer field, the data field is used to store multiple groups of data, each group of the data corresponds to the information of one document, the pointer field is used to store a pointer, and the content similarity detection system includes:
an obtaining module, used to obtain the number of identical items of the characteristic information shared by two of the documents according to the correspondence between the characteristic information and the information of the documents in the inverted index storage structure;
a first judgment module, used to judge whether the number is greater than or equal to a predetermined number;
a second judgment module, used to judge that the two documents are similar when the number is greater than or equal to the predetermined number;
a third judgment module, used to judge that the two documents are dissimilar when the number is less than the predetermined number.
14. One or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to execute the data storage method according to any one of claims 1 to 6 and/or the content similarity detection method according to any one of claims 7 to 12.
15. A computer device, including a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the instructions are executed by the processor, the processor executes the data storage method according to any one of claims 1 to 6 and/or the content similarity detection method according to any one of claims 7 to 12.
CN201810637185.5A 2018-06-20 2018-06-20 Data storage method, detection method and system, storage medium and computer equipment Active CN108897817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810637185.5A CN108897817B (en) 2018-06-20 2018-06-20 Data storage method, detection method and system, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810637185.5A CN108897817B (en) 2018-06-20 2018-06-20 Data storage method, detection method and system, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN108897817A true CN108897817A (en) 2018-11-27
CN108897817B CN108897817B (en) 2023-04-07

Family

ID=64345119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810637185.5A Active CN108897817B (en) 2018-06-20 2018-06-20 Data storage method, detection method and system, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN108897817B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113138902A (en) * 2021-04-27 2021-07-20 上海英众信息科技有限公司 Computer host heat dissipation system and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182750A1 (en) * 2004-02-13 2005-08-18 Memento, Inc. System and method for instrumenting a software application
US20060117002A1 (en) * 2004-11-26 2006-06-01 Bing Swen Method for search result clustering
CN101136013A (en) * 2006-09-01 2008-03-05 北大方正集团有限公司 Method for quick updating data domain in full text retrieval system
CN101206752A (en) * 2007-12-25 2008-06-25 北京科文书业信息技术有限公司 Electric commerce website related products recommendation system and method
CN102750393A (en) * 2012-07-13 2012-10-24 携程计算机技术(上海)有限公司 Composite index structure and searching method based on same
CN106991102A (en) * 2016-01-21 2017-07-28 腾讯科技(深圳)有限公司 The processing method and processing system of key-value pair in inverted index

Also Published As

Publication number Publication date
CN108897817B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant