CN115221856A - Method for judging similarity of documents based on image, video and text contents simultaneously - Google Patents

Method for judging similarity of documents based on image, video and text contents simultaneously

Info

Publication number
CN115221856A
Authority
CN
China
Prior art keywords
document
similarity
video
image
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210861048.6A
Other languages
Chinese (zh)
Inventor
张宇
李秀芬
陈龙
程任华
郑金辉
朱庭俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-07-22
Filing date
2022-07-22
Publication date
2022-10-21
Application filed by China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202210861048.6A
Publication of CN115221856A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for judging document similarity based on image, video and text content simultaneously, comprising: selecting two documents D_1 and D_2 of the same type, performing a hash calculation on each, and judging whether D_1 and D_2 are duplicate documents; if D_1 and D_2 are not duplicates, calculating the text, image and video content similarity of D_1 and D_2 separately, assigning a weight to each of the three similarities, obtaining the document similarity by weighted calculation, and comparing it with a preset threshold to reach a conclusion on document similarity. The invention judges the similarity of images and videos as well as of text, combines them into an overall document similarity, and prompts manual re-checking and subsequent processing.

Description

Method for judging similarity of documents based on image, video and text contents simultaneously
Technical Field
The invention belongs to the technical field of platform document management, and particularly relates to a method for judging document similarity based on image, video and text contents.
Background
The world economy is moving toward economic integration and the knowledge economy. Networking, virtualization, digitization and knowledge orientation are becoming important characteristics of modern economic development, and the operating environment of enterprises is becoming increasingly complex and changeable. As market competition intensifies, knowledge becomes the primary resource for enterprise operation, and an enterprise's competitive advantage increasingly depends on whether it holds substantial knowledge capital and unique operating capabilities, so knowledge management is becoming a core part of enterprise management. Large enterprises and organizations have gradually begun to deploy knowledge management platforms to share explicit and tacit knowledge and to let employees voluntarily cooperate in sharing and developing knowledge resources, so that the enterprise or organization can reach higher goals and generate greater benefit.
As large knowledge management platforms go online and their user base grows rapidly, some problems have gradually emerged. Documents produced by slightly modifying an original file, together with large numbers of duplicate documents, make it very inconvenient for users to find valuable documents: each user has to download many documents and browse them one by one to weed out duplicates and highly similar documents on their own.
In particular, documents on a knowledge management platform that supports marketing scenarios are mostly customer-facing marketing materials. They embed many pictures and videos and contain a large amount of headline-style text. For presentation documents such as PowerPoint in particular, the text content on a given page may be unchanged while its position, size and so on change considerably. Conventional text-based document similarity algorithms are therefore not applicable.
For duplicate documents, a hash of each document can be calculated, for example with an algorithm such as MD5, and duplicates can be identified directly from the hash values. For judging similar documents, the industry has no complete technical solution. One approach is based on string comparison; it is fairly accurate for similar documents whose text has changed little, but inaccurate when the text content is unchanged and only its position and order have changed. Another approach is statistical and based on word frequency, for example using TF-IDF to calculate document similarity, but it cannot accurately distinguish the positional relationships of words within a document. Moreover, neither type of algorithm considers the similarity of the videos and images in a document.
The platform therefore urgently needs an automatic method to assist the manual cleaning of duplicate documents and of documents produced by simple modification of an original document.
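The contrast drawn above between string comparison and word-frequency statistics can be illustrated with a minimal sketch. The sample strings and the helper name `term_counts` are hypothetical and are not taken from the patent; the sketch only shows that a word-frequency representation ignores word order while an exact string comparison does not.

```python
from collections import Counter

def term_counts(text: str) -> Counter:
    """Count whitespace-separated tokens, ignoring their order and position."""
    return Counter(text.lower().split())

# Hypothetical sample texts: same words, different order.
a = "network plan upgrade offer"
b = "offer upgrade plan network"

same_counts = term_counts(a) == term_counts(b)  # word-frequency view
same_string = a == b                            # string-comparison view
print(same_counts, same_string)  # True False
```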
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method for judging document similarity based on image, video and text content simultaneously, which can automatically identify, from massive numbers of documents, duplicate documents and documents produced by simple modification of an original document, so that they can be handled appropriately after manual verification.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
A method for determining document similarity based on image, video and text content simultaneously, comprising:
S1: Select two documents D_1 and D_2 of the same type, perform a hash calculation on each, and judge whether D_1 and D_2 are duplicate documents;
S2: If D_1 and D_2 are not duplicates, decompress D_1 and D_2 to extract their text, images and videos;
S3: Perform word segmentation, part-of-speech tagging and preprocessing on the text of D_1 and D_2 respectively to form the sets W_1 and W_2 of nouns and proper nouns;
S4: Calculate the TF-IDF value of each word in W_1 and W_2 respectively, and sort the words by TF-IDF value to form the lists L_1 and L_2;
S5: Select the first N words from each of L_1 and L_2 and merge them into a set W; for each of D_1 and D_2, calculate the frequency of each word in W within that document and generate the document's word frequency vector;
S6: Calculate the cosine similarity Sim_txt of the word frequency vectors of D_1 and D_2; Sim_txt is the text similarity of D_1 and D_2;
S7: Perform a hash calculation on every picture in D_1 and D_2 respectively to obtain the hash value lists Hsh_1 and Hsh_2, and from them calculate the image similarity Sim_pic of D_1 and D_2;
S8: Calculate the similarity Sim_video of the video content in D_1 and D_2 by byte comparison;
S9: Set weights w_1, w_2 and w_3 for the text, image and video similarity of the documents respectively, where w_1 + w_2 + w_3 = 1, obtain the document similarity Sim by weighted calculation, and compare it with a preset threshold to reach a conclusion on document similarity.
To optimize the technical scheme, the following specific measures are further adopted:
The document type is Word or PowerPoint.
In S1, an algorithm such as MD5 is used to perform the hash calculation on D_1 and D_2; if the calculated hash values are equal, D_1 and D_2 are duplicate documents.
The preprocessing in S3 comprises filtering out all stop words as well as adverbs and adjectives, retaining only nouns and proper nouns, and removing duplicates.
When the IDF value is calculated in S4, the corpus consists of all documents of the same industry or the same subject on the whole system or platform.
In S7, the image similarity of D_1 and D_2 is calculated as Sim_pic = N_hsh / Max(Len_1, Len_2),
where N_hsh is the number of identical values in Hsh_1 and Hsh_2, and Len_1 and Len_2 are the lengths of the lists Hsh_1 and Hsh_2 respectively.
In S8, the video files in D_1 and D_2 form the lists LV_1 and LV_2. Each video in LV_1 is compared byte by byte with each video in LV_2; if the proportion of identical bytes exceeds a preset value T_1, the value 1 is appended to a list L, otherwise the value 0 is appended. L is thus a list whose values are 1 or 0;
The similarity of the video content in D_1 and D_2 is then calculated as
[Formula image for Sim_video not reproduced: Figure BDA0003758596580000031]
where Len(LV_1) and Len(L) are the lengths of the lists LV_1 and L respectively, and L_i is the value of the i-th element of list L.
In S9, if a document contains no text, image or video content, the corresponding similarity is set to 1;
The similarity of D_1 and D_2 is calculated as Sim = Sim_txt*w_1 + Sim_pic*w_2 + Sim_video*w_3.
Sim is compared with a preset threshold T_2; if Sim exceeds T_2, D_1 and D_2 are considered similar documents and are passed to manual review and subsequent processing.
The invention has the following beneficial effects:
The method judges the similarity of images and videos as well as of text, combines them into an overall document similarity, prompts manual re-checking, and supports subsequent processing.
1. On the knowledge management platform, most marketing documents contain a large amount of picture information. The method therefore considers the similarity of the images and videos in a document together with the text similarity, which gives better results when evaluating the similarity of Word and PowerPoint documents.
2. The method exploits the fact that, in Word and PowerPoint documents, scaling or cropping an image leaves the original image unchanged inside the document, and judges image similarity by means of a hash function.
3. Because only nouns and proper nouns carry real meaning in most marketing materials, the method performs part-of-speech tagging after word segmentation, filters out all stop words, adverbs, adjectives and the like, retains only nouns and proper nouns with duplicates removed to form the sets W_1 and W_2, and performs the TF-IDF calculation on these words only, which significantly reduces the later computation.
4. When the TF-IDF value is calculated, the corpus consists of all documents of the same industry or the same subject on the system or platform rather than all documents on the whole system or platform, which can further improve the accuracy of the IDF and reduces the computation.
Drawings
FIG. 1 is a flowchart of a method for determining document similarity based on image, video and text content in accordance with the present invention;
FIG. 2 is a flowchart of the text content similarity calculation according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
A method for determining document similarity based on image, video and text content simultaneously, comprising:
S1: Select two documents D_1 and D_2 of the same type, perform a hash calculation on each, and judge whether D_1 and D_2 are duplicate documents;
For example, the similarity judgment is performed on Word or PowerPoint documents.
A hash is calculated for D_1 and D_2, for example with an algorithm such as MD5; if the calculated hash values are equal, D_1 and D_2 are duplicate documents.
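The duplicate check of S1 can be sketched with Python's standard `hashlib` module. This is a minimal illustration rather than the patent's implementation; the function names `file_digest` and `is_duplicate` are hypothetical, and the documents are assumed to be already loaded as bytes.

```python
import hashlib

def file_digest(data: bytes, algorithm: str = "md5") -> str:
    """Hex digest of the raw document bytes (MD5 by default, as in the text)."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()

def is_duplicate(doc1: bytes, doc2: bytes) -> bool:
    """Two documents are treated as duplicates when their digests are equal."""
    return file_digest(doc1) == file_digest(doc2)
```

In practice the bytes would come from reading each file in binary mode, for example `open(path, "rb").read()`.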
S2: if D is 1 And D 2 If not, for document D 1 And D 2 Decompressing to extract text, image and video;
and if the Hash values calculated in the S1 are different, decompressing the document, and extracting a text, an image and a video.
Note: word in Microsoft Office suite, powerPoint document, the actual file format is zip file format, decompression can be completed by using standard zip decompression algorithm, thus extracting original text, image and video information.
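A minimal sketch of the media extraction in S2, using Python's standard `zipfile` module. It assumes the zip-based Office formats (.docx, .pptx), in which embedded media are stored under a `media/` folder inside the archive; the extension lists and the function name `extract_media` are illustrative assumptions, and text extraction (which requires parsing the XML parts) is omitted.

```python
import io
import zipfile

# Assumed extension lists; a real deployment may need more formats.
IMAGE_EXT = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".emf", ".wmf")
VIDEO_EXT = (".mp4", ".avi", ".wmv", ".mov")

def extract_media(doc_bytes: bytes) -> dict:
    """Return {'images': [...], 'videos': [...]} as lists of raw byte strings."""
    media = {"images": [], "videos": []}
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as zf:
        for name in sorted(zf.namelist()):
            # Assumption: embedded media live under a '/media/' folder,
            # e.g. 'word/media/image1.png' or 'ppt/media/media1.mp4'.
            if "/media/" not in name:
                continue
            lower = name.lower()
            if lower.endswith(IMAGE_EXT):
                media["images"].append(zf.read(name))
            elif lower.endswith(VIDEO_EXT):
                media["videos"].append(zf.read(name))
    return media
```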
S3: for document D 1 And D 2 Respectively using word segmentation tools to perform word segmentation, part-of-speech tagging and preprocessing on the text to form a noun and a set W of proper nouns 1 And W 2
Filtering out all stop words, various adverbs and adjectives, etc., only retaining nouns and proper nouns, and removing duplication to form a set W of nouns and proper nouns 1 And W 2
Since for the vast majority of marketing materials only nouns and proper nouns have practical significance and the amount of computation at a later stage is significantly reduced.
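The filtering in S3 can be sketched as follows, assuming that a segmentation tool has already produced (word, tag) pairs. The tag scheme (tags beginning with "n" for nouns and proper nouns), the sample tokens and the function name `noun_set` are hypothetical; the patent does not name a specific tool or tag set.

```python
def noun_set(tagged_tokens, stop_words=frozenset()):
    """Keep nouns and proper nouns only, dropping stop words and duplicates."""
    # Assumption: tags starting with 'n' mark nouns and proper nouns.
    return {
        word
        for word, tag in tagged_tokens
        if tag.startswith("n") and word not in stop_words
    }

# Hypothetical output of a segmentation and tagging tool.
tagged = [("broadband", "n"), ("fast", "a"), ("China Telecom", "nt"),
          ("very", "d"), ("broadband", "n"), ("thing", "n")]
W1 = noun_set(tagged, stop_words={"thing"})
```

Using a set removes duplicates automatically, which matches the de-duplication described in the text.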
S4: for set W 1 And W 2 Respectively calculating TD-IDF value of each word in the list, and sorting the words according to the size of the TD-IDF value to form a list L 1 And L 2
When calculating the IDF value, all documents in the same industry or the same subject on the whole system or platform in the corpus are adopted, but not all documents on the whole system or platform, so that the accuracy of the IDF can be further improved.
Figure BDA0003758596580000041
Figure BDA0003758596580000042
TD-IDF=TF*IDF
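Because the TF and IDF formula images are not available in this text, the sketch below assumes one common textbook form: TF as the word's count divided by the document's token count, and IDF as log(N_docs / (1 + df)). These definitions, the function name `tf_idf_ranking` and the descending sort order are assumptions, not the patent's confirmed formulas.

```python
import math
from collections import Counter

def tf_idf_ranking(doc_tokens, candidate_words, corpus):
    """Rank candidate_words by TF-IDF within doc_tokens, highest first.

    corpus is a non-empty list of token lists (documents of the same
    industry or subject, per the text).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens) or 1
    n_docs = len(corpus)
    scores = {}
    for w in candidate_words:
        tf = counts[w] / total                    # assumed TF definition
        df = sum(1 for d in corpus if w in d)     # documents containing w
        idf = math.log(n_docs / (1 + df))         # assumed smoothed IDF
        scores[w] = tf * idf
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))
```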
S5: from L 1 And L 2 The first N words are selected from the document D and combined into a set W with the length of M 1 And D 2 Calculating the word frequency of each document for each word in the set W, and generating respective word frequency vectors;
note: if the word is not repeated, M =2 × n.
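A minimal sketch of S5 under stated assumptions: the ranked lists are plain Python lists, word frequency is taken as a raw count (cosine similarity does not depend on vector scaling, so normalising by document length would give the same result in S6), and the function names are hypothetical.

```python
from collections import Counter

def merged_vocabulary(ranked1, ranked2, n):
    """Union of the top-n words of each ranked list, in a stable order."""
    vocab = []
    for w in ranked1[:n] + ranked2[:n]:
        if w not in vocab:
            vocab.append(w)
    return vocab

def frequency_vector(doc_tokens, vocab):
    """Raw count of each vocabulary word in the document."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]
```

With N = 2 and one shared word, the merged vocabulary has M = 3 entries, which is below the upper bound 2 * N.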
S6: computing document D 1 And D 2 Cosine similarity Sim of respective word frequency vectors txt ,Sim txt I.e. document D 1 And D 2 The text similarity of (2);
Sim txt the larger the value, the document D 1 And D 2 The more similar;
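Cosine similarity as used in S6 can be sketched in a few lines of standard Python. Returning 0.0 for a zero vector is an assumption made here to avoid division by zero; the patent text does not address that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector shares nothing measurable
    return dot / (norm_u * norm_v)
```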
s7: respectively to document D 1 And D 2 Hash calculation is carried out on all the pictures, and the algorithm can be selected from algorithms such as MD5 and the like to obtain a Hash value list Hsh 1 And Hsh 2 Each length of Len 1 And Len 2 And further calculate the document D 1 And D 2 Image similarity Sim of (2) pic
Because operations such as zooming, clipping and the like are carried out on the image in the Office document, the original image still remains in the document, and only information such as size, position and the like is digitally identified. Therefore, in different documents, operations such as scaling and cropping are performed on the same image, and the Hash value calculation is performed, so that the same value is obtained, and based on the operations:
hsh judgment 1 And Hsh 2 Number of median identity values N hsh Calculating the document D 1 And D 2 Image similarity Sim of pic =N hsh /Max(Len 1 ,Len 2 )
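The formula Sim_pic = N_hsh / Max(Len_1, Len_2) can be sketched as below. Counting identical hashes with multiplicity (a multiset intersection) is an assumption, since the text does not say how repeated images are counted; returning 1 when neither document has images follows the rule the patent gives for missing content. The function name is hypothetical.

```python
import hashlib
from collections import Counter

def image_similarity(images1, images2):
    """Sim_pic = N_hsh / max(Len_1, Len_2) over lists of raw image bytes."""
    if not images1 and not images2:
        return 1.0  # per the text: no images in either document -> 1
    hsh1 = [hashlib.md5(b).hexdigest() for b in images1]
    hsh2 = [hashlib.md5(b).hexdigest() for b in images2]
    # Assumption: N_hsh counts matching hashes with multiplicity.
    n_hsh = sum((Counter(hsh1) & Counter(hsh2)).values())
    return n_hsh / max(len(hsh1), len(hsh2))
```

For example, if one document holds three images and the other holds two of the same three, N_hsh is 2 and the similarity is 2/3.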
S8: comparing documents D by using byte 1 And D 2 Calculating similarity Sim of video contents in (1) video
Since the video may be edited, most of the videos with the same content have different Hash values due to different frames, and therefore, for the video content, a byte comparison method is required to determine the similarity.
Setting a threshold T, considering two video content identical parts exceeding T 1 The two videos are considered similar and may be set to 70% by default.
Document D 1 And D 2 Video file in (1) forms the LV 1 And LV 2 List, will LV 1 Each video in (1) is respectively associated with the LV 2 The video in (1) is subjected to byte comparison, and the proportion of the same content exceeds T 1 Then the value is increased to 1 in the L list, otherwise the value is increased to 0. Finally, L is a list of values 1 or 0, e.g., [1,0, 1, \ 8230; 0]Then document D 1 And D 2 Similarity of intermediate video content
Figure BDA0003758596580000051
Figure BDA0003758596580000052
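The exact Sim_video formula is contained in images that are not available here, so the following is only one possible reading, with every choice flagged as an assumption: bytes are compared position by position relative to the longer video, the list L gets one entry per video in LV_1 (1 if any video in LV_2 exceeds T_1), and Sim_video is the sum of L divided by Len(LV_1). The function names are hypothetical.

```python
def byte_match_ratio(a: bytes, b: bytes) -> float:
    """Fraction of positions holding the same byte, relative to the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / longest

def video_similarity(lv1, lv2, t1=0.7):
    """Assumed aggregation: share of videos in lv1 that match some video in lv2."""
    if not lv1 and not lv2:
        return 1.0  # per the text: no video in either document -> 1
    if not lv1 or not lv2:
        return 0.0  # assumption: nothing to match against
    # Assumption: one entry of L per video in LV_1.
    flags = [
        1 if any(byte_match_ratio(v1, v2) > t1 for v2 in lv2) else 0
        for v1 in lv1
    ]
    return sum(flags) / len(lv1)
```

A position-wise comparison is simple but sensitive to inserted or removed bytes, so a real system might prefer a more robust alignment; that choice is outside what the text specifies.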
S9: setting weights w for text, image and video similarity of documents respectively 1 ,w 2 ,w 3 Wherein w is 1 +w 2 +w 3 And =1, obtaining the document similarity Sim through weighting calculation, and comparing the document similarity Sim with a preset threshold value to obtain a document similarity conclusion.
If there is no corresponding content, the corresponding similarity is set to 1.
For example, if both documents do not contain video, then Sim video A value of 1; if no text is contained in the two documents, sim txt A value of 1; if no image is contained in both documents, then Sim pic The value is 1.
Then finally after comparison, document D 1 And D 2 The similarity Sim of (A) is Sim txt *w 1 +Sim pic *w 2 +Sim video *w 3 . Setting a threshold T 2 Document similarity exceeding T 2 And considering that the two documents are similar documents, and transferring to manual review and subsequent processing.
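The weighted combination and threshold test of S9 can be sketched as follows. The default weights (0.5, 0.3, 0.2) and the default threshold T_2 = 0.8 are hypothetical values chosen for illustration; the patent leaves both to configuration.

```python
import math

def document_similarity(sim_txt, sim_pic, sim_video, weights=(0.5, 0.3, 0.2)):
    """Sim = Sim_txt*w_1 + Sim_pic*w_2 + Sim_video*w_3 with w_1 + w_2 + w_3 = 1."""
    w1, w2, w3 = weights
    if not math.isclose(w1 + w2 + w3, 1.0):
        raise ValueError("weights must sum to 1")
    return sim_txt * w1 + sim_pic * w2 + sim_video * w3

def needs_review(sim, t2=0.8):
    """True when the pair should be passed to manual review (Sim exceeds T_2)."""
    return sim > t2
```

With Sim_txt = 0.9 and no images or videos in either document (both set to 1), the default weights give Sim = 0.95, which exceeds the assumed threshold.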
The above is only a preferred embodiment of the present invention. The scope of protection is not limited to the above embodiment; all technical solutions within the idea of the present invention fall within its scope of protection. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the present invention are also regarded as within its scope of protection.

Claims (8)

1. A method for determining document similarity based on image, video and text content simultaneously, comprising:
S1: Select two documents D_1 and D_2 of the same type, perform a hash calculation on each, and judge whether D_1 and D_2 are duplicate documents;
S2: If D_1 and D_2 are not duplicates, decompress D_1 and D_2 to extract their text, images and videos;
S3: Perform word segmentation, part-of-speech tagging and preprocessing on the text of D_1 and D_2 respectively to form the sets W_1 and W_2 of nouns and proper nouns;
S4: Calculate the TF-IDF value of each word in W_1 and W_2 respectively, and sort the words by TF-IDF value to form the lists L_1 and L_2;
S5: Select the first N words from each of L_1 and L_2 and merge them into a set W; for each of D_1 and D_2, calculate the frequency of each word in W within that document and generate the document's word frequency vector;
S6: Calculate the cosine similarity Sim_txt of the word frequency vectors of D_1 and D_2; Sim_txt is the text similarity of D_1 and D_2;
S7: Perform a hash calculation on every picture in D_1 and D_2 respectively to obtain the hash value lists Hsh_1 and Hsh_2, and from them calculate the image similarity Sim_pic of D_1 and D_2;
S8: Calculate the similarity Sim_video of the video content in D_1 and D_2 by byte comparison;
S9: Set weights w_1, w_2 and w_3 for the text, image and video similarity of the documents respectively, where w_1 + w_2 + w_3 = 1, obtain the document similarity Sim by weighted calculation, and compare it with a preset threshold to reach a conclusion on document similarity.
2. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein the document type is Word or PowerPoint.
3. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein in S1 an algorithm such as MD5 is used to perform the hash calculation on D_1 and D_2, and if the calculated hash values are equal, D_1 and D_2 are duplicate documents.
4. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein the preprocessing in S3 comprises filtering out all stop words as well as adverbs and adjectives, retaining only nouns and proper nouns, and removing duplicates.
5. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein when the IDF value is calculated in S4, the corpus consists of all documents of the same industry or the same subject on the whole system or platform.
6. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein in S7 the image similarity of D_1 and D_2 is calculated as Sim_pic = N_hsh / Max(Len_1, Len_2),
where N_hsh is the number of identical values in Hsh_1 and Hsh_2, and Len_1 and Len_2 are the lengths of the lists Hsh_1 and Hsh_2 respectively.
7. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein in S8 the video files in D_1 and D_2 form the lists LV_1 and LV_2; each video in LV_1 is compared byte by byte with each video in LV_2; if the proportion of identical bytes exceeds a preset value T_1, the value 1 is appended to a list L, otherwise the value 0 is appended; L is thus a list whose values are 1 or 0;
the similarity of the video content in D_1 and D_2 is calculated as
[Formula image for Sim_video not reproduced: Figure FDA0003758596570000021]
where Len(LV_1) and Len(L) are the lengths of the lists LV_1 and L respectively, and L_i is the value of the i-th element of list L.
8. The method for determining document similarity based on image, video and text content simultaneously according to claim 1, wherein in S9, if a document contains no text, image or video content, the corresponding similarity is set to 1;
the similarity of D_1 and D_2 is calculated as Sim = Sim_txt*w_1 + Sim_pic*w_2 + Sim_video*w_3;
Sim is compared with a preset threshold T_2; if Sim exceeds T_2, D_1 and D_2 are considered similar documents and are passed to manual review and subsequent processing.
CN202210861048.6A 2022-07-22 2022-07-22 Method for judging similarity of documents based on image, video and text contents simultaneously Pending CN115221856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210861048.6A CN115221856A (en) 2022-07-22 2022-07-22 Method for judging similarity of documents based on image, video and text contents simultaneously

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210861048.6A CN115221856A (en) 2022-07-22 2022-07-22 Method for judging similarity of documents based on image, video and text contents simultaneously

Publications (1)

Publication Number Publication Date
CN115221856A (en) 2022-10-21

Family

ID=83613789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210861048.6A Pending CN115221856A (en) 2022-07-22 2022-07-22 Method for judging similarity of documents based on image, video and text contents simultaneously

Country Status (1)

Country Link
CN (1) CN115221856A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117807404A (en) * 2024-02-29 2024-04-02 智广海联(天津)大数据技术有限公司 AI-based intelligent duplicate removal analysis method and device for studying and judging event
CN118429344A (en) * 2024-07-04 2024-08-02 上海恒等创享科技有限公司 Industrial defect detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111327945B (en) Method and apparatus for segmenting video
CN108009293B (en) Video tag generation method and device, computer equipment and storage medium
US6199103B1 (en) Electronic mail determination method and system and storage medium
US7930647B2 (en) System and method for selecting pictures for presentation with text content
US7844139B2 (en) Information management apparatus, information management method, and computer program product
US8032539B2 (en) Method and apparatus for semantic assisted rating of multimedia content
JP2940501B2 (en) Document classification apparatus and method
CN109783787A (en) A kind of generation method of structured document, device and storage medium
CN109740152B (en) Text category determination method and device, storage medium and computer equipment
US20070098259A1 (en) Method and mechanism for analyzing the texture of a digital image
JP2010073114A (en) Image information search device, image information search method, computer program for the same
JP3682529B2 (en) Summary automatic evaluation processing apparatus, summary automatic evaluation processing program, and summary automatic evaluation processing method
CN110008309B (en) Phrase mining method and device
WO2017113592A1 (en) Model generation method, word weighting method, apparatus, device and computer storage medium
CN111444387A (en) Video classification method and device, computer equipment and storage medium
CN115221856A (en) Method for judging similarity of documents based on image, video and text contents simultaneously
WO2024179575A1 (en) Data processing method, and device and computer-readable storage medium
CN110874526B (en) File similarity detection method and device, electronic equipment and storage medium
CN110895654A (en) Segmentation method, segmentation system and non-transitory computer readable medium
CN117493645A (en) Big data-based electronic archive recommendation system
CN108427769B (en) Character interest tag extraction method based on social network
CN111444364B (en) Image detection method and device
US8566366B2 (en) Format conversion apparatus and file search apparatus capable of searching for a file as based on an attribute provided prior to conversion
CN111930883A (en) Text clustering method and device, electronic equipment and computer storage medium
CN107169065B (en) Method and device for removing specific content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221230

Address after: No.31, Financial Street, Xicheng District, Beijing, 100033

Applicant after: CHINA TELECOM Corp.,Ltd.

Address before: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant before: China Telecom Digital Intelligence Technology Co.,Ltd.
