Embodiment
To solve the prior-art problem that duplicate manuscripts appear when manuscripts are published, the embodiments of the invention provide a method and a system for article duplicate checking. The description below takes newspaper production as an example, but the approach is not limited to it and applies equally to, for instance, Internet news publishing. When a manuscript on a page is operated on, the manuscript information in the production database is modified, and an event trigger obtains the manuscript information, which includes the manuscript content, from the production database. A duplicate-checking server then checks those obtained manuscripts whose content has not yet been compared for duplication and determines the duplicate-manuscript information. The scheme of the embodiments reduces the number of duplicate manuscripts that occur at publication.
The first embodiment of the invention is a method for article duplicate checking. It is described first using the production process of a newspaper office's production system as an example. As shown in Figure 1, the method comprises the following steps:
Step 101: When a manuscript on a page is operated on, the manuscript information in the production database is modified, and the event trigger obtains the manuscript information from the production database.
Step 102: The manuscript information obtained by the event trigger is processed by category and stored in the duplicate-checking database of the duplicate-checking server.
Step 103: The duplicate-checking server checks the manuscript information in the duplicate-checking database that has not yet been checked and determines the duplicate-manuscript information.
Step 104: The duplicate-checking server collates the data in the duplicate-checking database, the data in the newspaper production database, and the real-time duplicate-manuscript reminders, and transmits them synchronously.
Step 105: Real-time duplicate-manuscript reminders are sent to the user who signed off the manuscript, who views the duplicate-manuscript information through the duplicate-checking user workbench.
The manuscript information handled in real time covers new sign-offs, page transfers or manuscript-tag changes made after sign-off, and withdrawals of sign-off (that is, cancelling a sign-off after the manuscript has been signed off).
In the newspaper production system, when a manuscript on a page is operated on, the manuscript information stored in the manuscript information table of the newspaper production database is modified accordingly. An event trigger is created on the manuscript information table. When a manuscript is signed off, transferred to another page, has its manuscript tag changed (the manuscript tag is the manuscript information other than the content, for example title and author), or has its sign-off withdrawn, the event trigger obtains the modified manuscript information in real time and synchronizes (copies) it to Table 1, the signed-off manuscript cache table, in the newspaper production database. The manuscript information comprises the fields of Table 1 other than modify_status and duple_id.
| Primary key | Name | Data type | Length | Note |
|---|---|---|---|---|
| P | id | Bigint | | |
| | paper_code | Varchar | 32 | Newspaper to which the manuscript belongs |
| | column_code | Varchar | 32 | Column to which the manuscript belongs |
| | filecode | Varchar | 32 | Manuscript code |
| | author | Varchar | 32 | Author |
| | title | Varchar | 255 | Title |
| | sub_title | Varchar | 255 | Subtitle |
| | pull_title | Varchar | 255 | Kicker (eyebrow headline) |
| | content | Text | | Content |
| | words | Int | | Word count |
| | sign_time | Datetime | | Sign-off time |
| | sign_user_code | Varchar | 32 | Code of the signing user |
| | sign_user_name | Varchar | 32 | Name of the signing user |
| | column_date | Datetime | | Publication date |
| | layout_name | Varchar | 32 | Page name |
| | layout_code | Varchar | 32 | Page code |
| | flow_type | Int | | Manuscript workflow type: 0 = non-workflow manuscript, 1 = workflow manuscript |
| | modify_status | Int | | Manuscript status: 0 = signed off and duplicate-checked, related information not modified; 1 = signed off but not yet duplicate-checked, related information not modified; 2 = sign-off withdrawn after signing off; >=3 = signed off, but related information has been modified |
| | duple_id | Bigint | | Id assigned to the manuscript by duplicate checking |
Table 1
When a manuscript is signed off, the event trigger inserts a record for it into Table 1. The values of all fields other than modify_status and duple_id are identical to the manuscript information in the newspaper production database; modify_status is set to 1 and duple_id is null.
When the sign-off of a manuscript is withdrawn, the event trigger sets modify_status to 2.
When a manuscript is transferred to another page or its manuscript tag is changed, the event trigger synchronizes the changed manuscript information. If modify_status is less than 3, it is set to 3; if it is greater than or equal to 3, it is incremented by 1.
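The three trigger rules above amount to a small state machine on modify_status. The following is a minimal sketch of it, not the trigger itself; the event labels "sign", "withdraw", and "modify" and the function name are hypothetical names introduced for illustration.

```python
from typing import Optional

# Status codes as described for the modify_status field of Table 1.
CHECKED = 0        # signed off and duplicate-checked, nothing modified
NEW = 1            # signed off, not yet duplicate-checked
WITHDRAWN = 2      # sign-off withdrawn
MODIFIED_BASE = 3  # >= 3: signed off, then modified one or more times


def next_modify_status(current: Optional[int], event: str) -> int:
    """Return the modify_status value after an event on a manuscript.

    `event` is a hypothetical label: "sign", "withdraw", or "modify"
    (page transfer or manuscript-tag change). `current` is None when
    the manuscript has no row in Table 1 yet.
    """
    if event == "sign":
        return NEW
    if event == "withdraw":
        return WITHDRAWN
    if event == "modify":
        # Below 3 the status jumps to 3; from 3 upward it counts changes.
        if current is None or current < MODIFIED_BASE:
            return MODIFIED_BASE
        return current + 1
    raise ValueError(f"unknown event: {event}")
```

Because every further modification increments the value, a later reader can tell whether a manuscript changed again after its status was last read.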
Before step 101, all manuscripts in Table 1 whose publication date is earlier than the current day may also be deleted. Manuscripts of past publication dates have already been published and can enter the duplicate-checking database as historical manuscripts through the ingestion agent; they no longer belong to the current day's manuscripts that need real-time duplicate checking.
If all manuscripts in Table 1 are found to have been checked, that is, every modify_status value is 0, then all manuscripts have been duplicate-checked and no newly signed-off manuscript needs checking (duplicate checking means comparing manuscript content for duplication, i.e. comparing the content field of the manuscript information), and the process ends.
If the modify_status values of the manuscripts in Table 1 are not all 0, the manuscript information in each state is retrieved from Table 1.
First, the manuscripts whose sign-off has been withdrawn, i.e. those with modify_status equal to 2, are retrieved and held temporarily in the withdrawn-manuscript list (cancleDocumentList, in memory); the list may of course be held on disk or in a database instead.
Next, the newly signed-off manuscripts, i.e. those with modify_status equal to 1, are retrieved and held in the newly-signed-manuscript list (newDocumentList, in memory).
Finally, the manuscripts that were transferred to another page or had their manuscript tag changed after sign-off, i.e. those with modify_status >= 3, are retrieved and held in the modified-manuscript list (modifiedDocumentList, in memory).
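The retrieval by state can be sketched as follows, assuming each Table 1 row is available as a dictionary with a modify_status key. The function names and the row representation are assumptions for illustration; the three returned lists correspond to cancleDocumentList, newDocumentList, and modifiedDocumentList in the text.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]  # one Table 1 record, keyed by field name (assumed shape)


def all_checked(rows: List[Row]) -> bool:
    """True when every row has modify_status 0, so nothing needs checking."""
    return all(row["modify_status"] == 0 for row in rows)


def classify(rows: List[Row]) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split Table 1 rows into (withdrawn, new, modified) lists."""
    withdrawn: List[Row] = []  # modify_status == 2
    new: List[Row] = []        # modify_status == 1
    modified: List[Row] = []   # modify_status >= 3
    for row in rows:
        status = row["modify_status"]
        if status == 2:
            withdrawn.append(row)
        elif status == 1:
            new.append(row)
        elif status >= 3:
            modified.append(row)
        # status == 0: already checked, nothing to do
    return withdrawn, new, modified
```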
Before step 102, the contents of Table 2 (the duplicate-check manuscript information buffer table) and Table 3 (the duplicate-check result information buffer table) in the duplicate-checking database may also be cleared.
| Primary key | Name | Data type | Length | Note |
|---|---|---|---|---|
| P | id | Bigint | | |
| | column_code | Varchar | 32 | Column to which the manuscript belongs |
| | filecode | Varchar | 32 | Manuscript code |
| | author | Varchar | 32 | Author |
| | title | Varchar | 255 | Title |
| | sub_title | Varchar | 255 | Subtitle |
| | pull_title | Varchar | 255 | Kicker (eyebrow headline) |
| | content | Text | | Content |
| | words | Int | | Word count |
| | sign_time | Datetime | | Sign-off time |
| | sign_user_code | Varchar | 32 | Code of the signing user |
| | sign_user_name | Varchar | 32 | Name of the signing user |
| | column_date | Datetime | | Publication date |
| | fill_time | Datetime | | Entry time |
| | layout_name | Varchar | 32 | Page name |
| | layout_code | Varchar | 32 | Page code |
| | calculate_status | Int | | Whether the manuscript has been duplicate-checked: 0 = not checked, 1 = checked |
| | publish_status | Int | | Publication status: 0 = not published, 1 = published |
| | flow_type | Int | | Manuscript workflow type: 0 = non-workflow manuscript, 1 = workflow manuscript |
Table 2
| Primary key | Name | Data type | Length | Note |
|---|---|---|---|---|
| P | this_id | Bigint | | Id of manuscript 1 |
| | this_column_code | Varchar | 32 | Column of manuscript 1 |
| | this_filecode | Varchar | 32 | Manuscript code of manuscript 1 |
| | this_author | Varchar | 32 | Author of manuscript 1 |
| | this_title | Varchar | 255 | Title of manuscript 1 |
| | this_sub_title | Varchar | 255 | Subtitle of manuscript 1 |
| | this_pull_title | Varchar | 255 | Kicker (eyebrow headline) of manuscript 1 |
| | this_words | Int | | Word count of manuscript 1 |
| | this_sign_time | Datetime | | Sign-off time of manuscript 1 |
| | this_sign_user_code | Varchar | 32 | Code of the user who signed off manuscript 1 |
| | this_sign_user_name | Varchar | 32 | Name of the user who signed off manuscript 1 |
| | this_fill_time | Datetime | | Entry time of manuscript 1 |
| | this_layout_name | Varchar | 32 | Page name of manuscript 1 |
| | this_layout_code | Varchar | 32 | Page code of manuscript 1 |
| | this_publish_status | Int | | Publication status of manuscript 1: 0 = not published, 1 = published |
| | this_flow_type | Int | | Workflow type of manuscript 1: 0 = non-workflow manuscript, 1 = workflow manuscript |
| P | that_id | Bigint | | Id of manuscript 2 |
| | that_column_code | Varchar | 32 | Column of manuscript 2 |
| | that_filecode | Varchar | 32 | Manuscript code of manuscript 2 |
| | that_author | Varchar | 32 | Author of manuscript 2 |
| | that_title | Varchar | 255 | Title of manuscript 2 |
| | that_sub_title | Varchar | 255 | Subtitle of manuscript 2 |
| | that_pull_title | Varchar | 255 | Kicker (eyebrow headline) of manuscript 2 |
| | that_words | Int | | Word count of manuscript 2 |
| | that_sign_time | Datetime | | Sign-off time of manuscript 2 |
| | that_sign_user_code | Varchar | 32 | Code of the user who signed off manuscript 2 |
| | that_sign_user_name | Varchar | 32 | Name of the user who signed off manuscript 2 |
| | that_fill_time | Datetime | | Entry time of manuscript 2 |
| | that_layout_name | Varchar | 32 | Page name of manuscript 2 |
| | that_layout_code | Varchar | 32 | Page code of manuscript 2 |
| | that_publish_status | Int | | Publication status of manuscript 2: 0 = not published, 1 = published |
| | that_flow_type | Int | | Workflow type of manuscript 2: 0 = non-workflow manuscript, 1 = workflow manuscript |
| | duple_rate | Int | | Duplicate similarity between manuscript 1 and manuscript 2 |
Table 3
In step 102, first, manuscripts that were transferred to another page or had their manuscript tag changed after sign-off are handled as follows: if a manuscript has not been duplicate-checked, it is treated as a newly signed-off manuscript; if it has been checked, only its information in the duplicate-checking database needs to be synchronized, and the manuscript production database is notified to synchronize.
The list modifiedDocumentList is traversed. If a manuscript has a duple_id, it has already been duplicate-checked. Its information in modifiedDocumentList is then synchronized with the record in Table 4 (the duplicate-check manuscript information table of the duplicate-checking database) whose id equals the duple_id, and with the records in Table 5 (the duplicate-check result information table) whose this_id or that_id equals the duple_id. Each such result record carries duple_rate, the similarity between manuscript 1 and manuscript 2, for example 80%.
| Primary key | Name | Data type | Length | Note |
|---|---|---|---|---|
| P | id | Bigint | | |
| | column_code | Varchar | 32 | Column to which the manuscript belongs |
| | filecode | Varchar | 32 | Manuscript code |
| | author | Varchar | 32 | Author |
| | title | Varchar | 255 | Title |
| | sub_title | Varchar | 255 | Subtitle |
| | pull_title | Varchar | 255 | Kicker (eyebrow headline) |
| | content | Text | | Content |
| | words | Int | | Word count |
| | sign_time | Datetime | | Sign-off time |
| | sign_user_code | Varchar | 32 | Code of the signing user |
| | sign_user_name | Varchar | 32 | Name of the signing user |
| | column_date | Datetime | | Publication date |
| | fill_time | Datetime | | Entry time |
| | layout_name | Varchar | 32 | Page name |
| | layout_code | Varchar | 32 | Page code |
| | calculate_status | Int | | Whether the manuscript has been duplicate-checked: 0 = not checked, 1 = checked |
| | publish_status | Int | | Publication status: 0 = not published, 1 = published |
| | flow_type | Int | | Manuscript workflow type: 0 = non-workflow manuscript, 1 = workflow manuscript |
Table 4
| Primary key | Name | Data type | Length | Note |
|---|---|---|---|---|
| P | this_id | Bigint | | Id of manuscript 1 |
| | this_column_code | Varchar | 32 | Column of manuscript 1 |
| | this_filecode | Varchar | 32 | Manuscript code of manuscript 1 |
| | this_author | Varchar | 32 | Author of manuscript 1 |
| | this_title | Varchar | 255 | Title of manuscript 1 |
| | this_sub_title | Varchar | 255 | Subtitle of manuscript 1 |
| | this_pull_title | Varchar | 255 | Kicker (eyebrow headline) of manuscript 1 |
| | this_words | Int | | Word count of manuscript 1 |
| | this_sign_time | Datetime | | Sign-off time of manuscript 1 |
| | this_sign_user_code | Varchar | 32 | Code of the user who signed off manuscript 1 |
| | this_sign_user_name | Varchar | 32 | Name of the user who signed off manuscript 1 |
| | this_fill_time | Datetime | | Entry time of manuscript 1 |
| | this_layout_name | Varchar | 32 | Page name of manuscript 1 |
| | this_layout_code | Varchar | 32 | Page code of manuscript 1 |
| | this_publish_status | Int | | Publication status of manuscript 1: 0 = not published, 1 = published |
| | this_flow_type | Int | | Workflow type of manuscript 1: 0 = non-workflow manuscript, 1 = workflow manuscript |
| P | that_id | Bigint | | Id of manuscript 2 |
| | that_column_code | Varchar | 32 | Column of manuscript 2 |
| | that_filecode | Varchar | 32 | Manuscript code of manuscript 2 |
| | that_author | Varchar | 32 | Author of manuscript 2 |
| | that_title | Varchar | 255 | Title of manuscript 2 |
| | that_sub_title | Varchar | 255 | Subtitle of manuscript 2 |
| | that_pull_title | Varchar | 255 | Kicker (eyebrow headline) of manuscript 2 |
| | that_words | Int | | Word count of manuscript 2 |
| | that_sign_time | Datetime | | Sign-off time of manuscript 2 |
| | that_sign_user_code | Varchar | 32 | Code of the user who signed off manuscript 2 |
| | that_sign_user_name | Varchar | 32 | Name of the user who signed off manuscript 2 |
| | that_fill_time | Datetime | | Entry time of manuscript 2 |
| | that_layout_name | Varchar | 32 | Page name of manuscript 2 |
| | that_layout_code | Varchar | 32 | Page code of manuscript 2 |
| | that_publish_status | Int | | Publication status of manuscript 2: 0 = not published, 1 = published |
| | that_flow_type | Int | | Workflow type of manuscript 2: 0 = non-workflow manuscript, 1 = workflow manuscript |
| | duple_rate | Int | | Duplicate similarity between manuscript 1 and manuscript 2 |
Table 5
If a manuscript in modifiedDocumentList has no duple_id, it has not yet been duplicate-checked; its information is inserted into Table 2 and a duple_id is generated for it.
Second, the information of newly signed-off manuscripts is saved to the duplicate-checking database.
The list newDocumentList is traversed; each manuscript's information is inserted into Table 2 and a duple_id is generated for it.
Finally, for manuscripts whose sign-off has been withdrawn, the corresponding data in the duplicate-checking database is deleted.
The list cancleDocumentList is traversed. If a manuscript has a duple_id, its sign-off was withdrawn after it had been duplicate-checked; the record in Table 4 whose id equals the duple_id is deleted, together with the duplicate-check results in Table 5 whose this_id or that_id equals the duple_id.
If a manuscript has no duple_id, its sign-off was withdrawn before it was duplicate-checked, and no further processing is needed here.
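The three branches of step 102 can be sketched with in-memory stand-ins for the tables. This is an illustration under stated assumptions, not the patented implementation: Tables 2 and 4 are modelled as dictionaries keyed by id, Table 5 as a list of result dictionaries, the synchronized field subset SYNC_FIELDS is chosen arbitrarily, and the duple_id generator is a simple counter.

```python
import itertools
from typing import Any, Dict, List

Doc = Dict[str, Any]
# Illustrative subset of the fields that would be synchronized (assumption).
SYNC_FIELDS = ("title", "author", "layout_name", "layout_code")
_next_id = itertools.count(1000)  # hypothetical duple_id generator


def _insert_into_table2(doc: Doc, table2: Dict[int, Doc]) -> None:
    """Generate a duple_id for the manuscript and buffer it in Table 2."""
    doc["duple_id"] = next(_next_id)
    row = {k: v for k, v in doc.items() if k not in ("duple_id", "modify_status")}
    row["id"] = doc["duple_id"]
    row["calculate_status"] = 0  # not yet duplicate-checked
    table2[doc["duple_id"]] = row


def stage(modified: List[Doc], new: List[Doc], withdrawn: List[Doc],
          table2: Dict[int, Doc], table4: Dict[int, Doc],
          table5: List[Doc]) -> None:
    for doc in modified:
        did = doc.get("duple_id")
        if did is None:  # never checked: treat as newly signed off
            _insert_into_table2(doc, table2)
            continue
        for field in SYNC_FIELDS:  # already checked: synchronize only
            table4[did][field] = doc[field]
            for res in table5:
                if res["this_id"] == did:
                    res["this_" + field] = doc[field]
                if res["that_id"] == did:
                    res["that_" + field] = doc[field]
    for doc in new:
        _insert_into_table2(doc, table2)
    for doc in withdrawn:
        did = doc.get("duple_id")
        if did is None:  # withdrawn before checking: nothing stored yet
            continue
        table4.pop(did, None)
        table5[:] = [r for r in table5 if did not in (r["this_id"], r["that_id"])]
```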
Step 103 may be carried out with third-party duplicate-checking software, for example a third-party plug-in such as the mass-data deduplication base component, version 2.0. The content of the historical manuscripts is first input to the plug-in, which automatically builds a deduplication library in memory using efficient Chinese word segmentation. The content of the manuscript to be checked and a minimum similarity value are then input to the plug-in, which uses accurate Chinese word-segmentation comparison to compare the manuscript with the content of every manuscript in the deduplication library and returns all comparison results whose similarity exceeds the minimum. Before this, the duplicate-checking configuration parameters may be read from the duplicate-checking database, including the minimum similarity and the duplicate-checking keywords; the values of these parameters directly affect the duplicate-check information finally confirmed. All manuscripts that need checking are read from Table 2 and checked one by one, and the result for each manuscript is saved in a single duplicate-check result list (duplationList). Each manuscript is checked not only against the historical published manuscripts (those in Table 4) but also against the other manuscripts signed off in real time (those in Table 2).
After checking ends, duplationList is traversed and the finally confirmed duplicate-check information is inserted into Table 3.
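The comparison loop of step 103 can be illustrated as below. The third-party plug-in and its word-segmentation matching are not reproduced; Python's standard difflib.SequenceMatcher is swapped in as a plainly different, character-level similarity measure, and the function names and the dictionary-based inputs are assumptions. Ids in the two inputs are assumed not to collide.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity_percent(a: str, b: str) -> int:
    """Character-level similarity in percent (stand-in for the plug-in)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def check_duplicates(pending: Dict[int, str], published: Dict[int, str],
                     min_rate: int) -> List[dict]:
    """Build the result list (duplationList in the text).

    pending:   id -> content of manuscripts awaiting checking (Table 2)
    published: id -> content of historical manuscripts (Table 4)
    min_rate:  minimum similarity; only higher results are kept
    """
    results: List[dict] = []
    pending_ids = sorted(pending)
    for i, this_id in enumerate(pending_ids):
        # Compare against historical published manuscripts ...
        candidates = list(published.items())
        # ... and against the other pending manuscripts, each pair once.
        candidates += [(pid, pending[pid]) for pid in pending_ids[i + 1:]]
        for that_id, text in candidates:
            rate = similarity_percent(pending[this_id], text)
            if rate > min_rate:
                results.append(
                    {"this_id": this_id, "that_id": that_id, "duple_rate": rate}
                )
    return results
```

Each entry of the returned list carries the this_id, that_id, and duple_rate fields of a Table 3 record; the remaining Table 3 columns would be filled from the two manuscripts' metadata.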
In step 104, first, all data in Table 2 is imported into Table 4 and all data in Table 3 into Table 5, for use when the duplicate-checking user workbench retrieves duplicate-check results. The whole import is committed in a single operation, which guarantees the completeness and accuracy of the results retrieved by the user workbench.
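One way to make the import a single commit is to run both copies inside one database transaction. The sketch below uses Python's sqlite3 module purely for illustration; the source does not name a database product, and the table names doc_buffer, doc_info, result_buffer, and result_info are hypothetical stand-ins for Tables 2, 4, 3, and 5.

```python
import sqlite3


def publish_results(conn: sqlite3.Connection) -> None:
    """Copy both buffer tables into the permanent tables atomically.

    Used as a context manager, the connection commits if the block
    succeeds and rolls back if any statement raises, so the workbench
    never sees manuscripts without their results or the reverse.
    """
    with conn:
        conn.execute("INSERT INTO doc_info SELECT * FROM doc_buffer")        # Table 2 -> Table 4
        conn.execute("INSERT INTO result_info SELECT * FROM result_buffer")  # Table 3 -> Table 5
```

If the second statement fails, the rows already copied by the first are rolled back, which is the property the text relies on for complete and accurate retrieval.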
Second, all manuscripts in Table 1 of the newspaper production database that correspond to cancleDocumentList are deleted.
Finally, the manuscript information in Table 1 of the newspaper production database that corresponds to newDocumentList and modifiedDocumentList is synchronized, and modify_status is set to 0 (checked). Because the newspaper production system may operate on manuscripts in Table 1 while checking is in progress, the trigger may have changed modify_status in the meantime; therefore only manuscripts whose modify_status has not changed throughout are updated here. The manuscript identifier from the duplicate-checking database is also back-filled, that is, the duple_id field is assigned: for manuscripts that received a newly generated duple_id earlier (for example, a manuscript in modifiedDocumentList that had no duple_id in step 102 and was therefore inserted into Table 2 and given a duple_id), the duple_id value is synchronized to the duple_id field of the corresponding manuscript in Table 1.
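The rule that only unchanged manuscripts are marked as checked can be expressed as a conditional update that compares modify_status with the value read before checking began. The sketch below again uses sqlite3 for illustration only; the table name signed_cache (standing for Table 1), the snapshot tuple layout, and the choice to back-fill duple_id unconditionally are assumptions not stated in the source.

```python
import sqlite3
from typing import Iterable, Tuple


def mark_checked(conn: sqlite3.Connection,
                 snapshot: Iterable[Tuple[int, int, int]]) -> int:
    """Back-fill duple_id and reset modify_status where it is unchanged.

    snapshot holds (id, modify_status_when_read, duple_id) per manuscript.
    Returns the number of rows whose modify_status was reset to 0.
    """
    updated = 0
    with conn:  # one transaction for the whole batch
        for doc_id, seen_status, duple_id in snapshot:
            # Assumption: duple_id is back-filled even if the row changed.
            conn.execute(
                "UPDATE signed_cache SET duple_id = ? WHERE id = ?",
                (duple_id, doc_id),
            )
            # Reset only if the trigger has not touched the status since.
            cur = conn.execute(
                "UPDATE signed_cache SET modify_status = 0 "
                "WHERE id = ? AND modify_status = ?",
                (doc_id, seen_status),
            )
            updated += cur.rowcount
    return updated
```

A manuscript modified during checking keeps its higher modify_status and is therefore picked up again on the next pass.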
In step 105, the duplicate-manuscript information that satisfies the real-time reminder conditions is found in Table 2 and sent item by item to the duplicate-checking user workbench, and a duplicate-manuscript reminder is sent to the user who signed off the manuscript through the interface provided by the instant messaging tool. Through the duplicate-checking user workbench, the user selects specific conditions and views the list of all records that contain duplicate-manuscript information.
The user selects a record in the list and clicks View to see all duplicate-manuscript information for that particular manuscript.
The second embodiment of the invention is a system for article duplicate checking. As shown in Figure 2, it comprises:
Event trigger 202: configured to obtain the modified manuscript information after the manuscript information in the production database has been modified as a result of an operation on a manuscript on a page, the manuscript information including the manuscript content;
Duplicate-checking server 204: configured to compare, for duplication, the content of those obtained manuscripts that have not yet undergone duplicate-content comparison, and to determine the duplicate-manuscript information.
Further, event trigger 202 is also configured to obtain the status information of the modified manuscript;
Duplicate-checking server 204 is also configured to store the obtained manuscript information in the duplicate-checking database, determine from the status information which manuscripts have not undergone duplicate-content comparison, perform that comparison on those manuscripts in the duplicate-checking database, and save the determined duplicate-manuscript information in the duplicate-checking database.
Further, event trigger 202 is also configured to obtain the modified manuscript information after the manuscript information in the production database has been modified as a result of a sign-off, a page transfer or manuscript-tag change after sign-off, or a withdrawal of sign-off of a manuscript on a page, the manuscript information including the manuscript content;
Duplicate-checking server 204 is also configured, for a newly signed-off manuscript, to perform duplicate-content comparison, save the determined duplicate-manuscript information in the duplicate-checking database, save the manuscript information of the newly signed-off manuscript in the duplicate-checking database, and notify the production database that the manuscript has undergone duplicate-content comparison;
for a manuscript whose sign-off has been withdrawn, if it has undergone duplicate-content comparison, to delete the corresponding manuscript information and duplicate-manuscript information from the duplicate-checking database and notify the manuscript production database to delete it; if it has not been checked, to notify the manuscript production database directly to delete it;
for a manuscript that was transferred to another page or had its manuscript tag changed after sign-off, if it has not undergone duplicate-content comparison, to perform that comparison, save the determined duplicate-manuscript information in the duplicate-checking database, save the manuscript information in the duplicate-checking database, and notify the production database that the manuscript has undergone duplicate-content comparison; if it has undergone duplicate-content comparison, to save the manuscript information in the duplicate-checking database and notify the manuscript production database.
Further, duplicate-checking server 204 is also configured to send the duplicate-manuscript information to the duplicate-checking user workbench;
The system further comprises:
Duplicate-checking user workbench 206: configured to display the duplicate-manuscript information.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.