CN115221856A

CN115221856A - Method for judging similarity of documents based on image, video and text contents simultaneously

Info

Publication number: CN115221856A
Application number: CN202210861048.6A
Authority: CN
Inventors: 张宇; 李秀芬; 陈龙; 程任华; 郑金辉; 朱庭俊
Original assignee: China Telecom Digital Intelligence Technology Co Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-07-22
Filing date: 2022-07-22
Publication date: 2022-10-21

Abstract

The invention discloses a method for judging document similarity based on image, video and text content simultaneously, comprising the following steps. S1: select two documents D₁ and D₂ of the same type, perform a hash calculation on each, and judge whether D₁ and D₂ are duplicate documents. If D₁ and D₂ are not duplicates, calculate the text, image and video content similarity of D₁ and D₂ respectively, set a weight for each of the text, image and video similarities, obtain the document similarity through a weighted calculation, and compare it with a preset threshold to reach a conclusion on document similarity. The invention judges the similarity of images and videos as well as text, combines them into an overall document similarity, and prompts manual re-checking and subsequent processing.

Description

Method for judging similarity of documents based on image, video and text contents simultaneously

Technical Field

The invention belongs to the technical field of platform document management, and particularly relates to a method for judging document similarity based on image, video and text contents.

Background

The world economy is currently moving toward economic integration and the knowledge economy. Networking, virtualization, digitization and knowledge orientation are becoming important characteristics of modern economic development, which makes the operating environment of enterprises increasingly complex and changeable. Under increasingly intense market competition, knowledge has become the primary resource for enterprise operation, and an enterprise's competitive advantage increasingly depends on whether it possesses substantial knowledge capital and unique operating capabilities, so knowledge management is becoming the core of enterprise management. Large enterprises and organizations have gradually begun to deploy knowledge management platforms to share explicit and tacit knowledge and to let employees voluntarily cooperate in sharing and developing knowledge resources, so that the enterprises and organizations reach higher goals and generate better returns.

As large knowledge management platforms have gone online and their user bases have grown rapidly, some problems have gradually emerged. Documents produced by simply modifying an original file, together with a large number of duplicate documents, make it very inconvenient for users to find valuable documents: each user has to download many documents and browse them one by one to weed out the duplicate and highly similar documents themselves.

In particular, documents on a knowledge management platform that supports marketing scenarios are mostly customer-facing marketing materials. They embed a large number of pictures and videos and contain a large amount of headline text. For presentation documents such as PowerPoint in particular, the text on a given page may be unchanged while its position, size and so on change considerably. Conventional text-based document similarity algorithms are therefore not applicable.

For duplicate documents, a hash of each document can be computed, for example with an algorithm such as MD5, and the judgment made directly from the hash values. For judging similar documents, the industry has no complete technical solution. One approach is based on string comparison; it is fairly accurate for similar documents whose text changes little, but inaccurate when the text content is unchanged and only its position and order change. The other approach is statistical and based on word frequency, for example using TF-IDF to calculate document similarity, but it cannot accurately distinguish the positional relationships of words in the document. Moreover, neither type of algorithm considers the similarity of the videos and images in the documents.

Therefore, the platform urgently needs an automatic method to assist the manual cleanup of duplicate documents and of documents that are simple modifications of an original document.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method for judging document similarity based on image, video and text content simultaneously, which can automatically identify, within a massive collection, duplicate documents and documents that are simple modifications of an original document, so that they can be handled accordingly after manual verification.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

A method for judging document similarity based on image, video and text content simultaneously, comprising:

S1: select two documents D₁ and D₂ of the same type, perform a hash calculation on each, and judge whether D₁ and D₂ are duplicate documents;

S2: if D₁ and D₂ are not duplicates, decompress D₁ and D₂ and extract the text, images and videos;

S3: perform word segmentation, part-of-speech tagging and preprocessing on the text of D₁ and D₂ respectively to form the noun and proper-noun sets W₁ and W₂;

S4: calculate the TF-IDF value of each word in W₁ and W₂ respectively, and sort the words by TF-IDF value to form the lists L₁ and L₂;

S5: select the top N words from each of L₁ and L₂ and merge them into a set W; for D₁ and D₂ respectively, calculate the word frequency of each word in W and generate the respective word frequency vectors;

S6: calculate the cosine similarity Sim_txt of the word frequency vectors of D₁ and D₂; Sim_txt is the text similarity of D₁ and D₂;

S7: perform a hash calculation on all pictures in D₁ and D₂ respectively to obtain the hash value lists Hsh₁ and Hsh₂, and from them calculate the image similarity Sim_pic of D₁ and D₂;

S8: calculate the similarity Sim_video of the video content in D₁ and D₂ by byte comparison;

S9: set weights w₁, w₂, w₃ for the text, image and video similarity respectively, where w₁ + w₂ + w₃ = 1; obtain the document similarity Sim by weighted calculation and compare it with a preset threshold to reach a conclusion on document similarity.

To further optimize the technical scheme, the following specific measures are also adopted:

The document type is Word or PowerPoint.

In S1, an algorithm such as MD5 is used to perform a hash calculation on D₁ and D₂; if the calculated hash values are equal, D₁ and D₂ are duplicate documents.

The preprocessing in S3 comprises filtering out all stop words and all adverbs and adjectives, retaining only nouns and proper nouns, and removing duplicates.

In S4, when the IDF value is calculated, the corpus consists of all documents of the same industry or the same subject on the whole system or platform.

In S7, the image similarity of D₁ and D₂ is calculated as Sim_pic = N_hsh / Max(Len₁, Len₂),

where N_hsh is the number of identical values in Hsh₁ and Hsh₂, and Len₁ and Len₂ are the lengths of the lists Hsh₁ and Hsh₂ respectively.

In S8, the video files in D₁ and D₂ form the lists LV₁ and LV₂. Each video in LV₁ is compared byte by byte with each video in LV₂; if the proportion of identical bytes exceeds a preset value T₁, the value 1 is appended to a list L, otherwise the value 0 is appended. In the end, L is a list of values 1 or 0;

the similarity of the video content in D₁ and D₂ is calculated as Sim_video = (Σ_{i=1}^{Len(L)} L_i) / Len(LV₁),

where Len(LV₁) and Len(L) are the lengths of the lists LV₁ and L respectively, and L_i is the value of the i-th element of the list L.

In S9, if a document contains no text, image or video content, the corresponding similarity is set to 1;

the similarity of D₁ and D₂ is calculated as Sim = Sim_txt * w₁ + Sim_pic * w₂ + Sim_video * w₃;

Sim is compared with a preset threshold T₂; if Sim exceeds T₂, D₁ and D₂ are considered similar documents and are passed to manual review for subsequent processing.

The invention has the following beneficial effects:

The method judges not only text similarity but also image and video similarity, combines them into an overall document similarity, and prompts manual re-checking and subsequent processing.

1. Most marketing documents on a knowledge management platform contain a large amount of picture information. The method therefore considers the similarity of the images and videos in the documents in addition to the text similarity, which gives better results when evaluating the similarity of Word and PowerPoint documents.

2. The method exploits the fact that in Word and PowerPoint documents the original image is stored unchanged even after the image is scaled or cropped, and judges image similarity with a hash function.

3. Because only nouns and proper nouns carry real meaning in most marketing materials, the method performs part-of-speech tagging after word segmentation, filters out all stop words, adverbs, adjectives and the like, retains only nouns and proper nouns and removes duplicates to form the sets W₁ and W₂, and performs the TF-IDF calculation only on these words, which significantly reduces the later computation.

4. When the TF-IDF value is calculated, the corpus consists of all documents of the same industry or the same subject on the system or platform rather than all documents on the system or platform, which further improves the accuracy of the IDF and reduces the computation.

Drawings

FIG. 1 is a flowchart of a method for determining document similarity based on image, video and text content in accordance with the present invention;

FIG. 2 is a flowchart of the text content similarity calculation according to the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

S1: Select two documents D₁ and D₂ of the same type; for example, similarity is judged between Word documents or between PowerPoint documents.

A hash calculation is performed on D₁ and D₂, for example with an algorithm such as MD5; if the calculated hash values are equal, D₁ and D₂ are duplicate documents.
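The duplicate check can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `file_md5` and `is_duplicate` and the chunk size are assumptions introduced here, and only Python's standard `hashlib` module is used.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path_1, path_2):
    """Treat two documents as duplicates when their MD5 digests are equal."""
    return file_md5(path_1) == file_md5(path_2)
```

Only when `is_duplicate` returns False would the more expensive text, image and video comparisons of the later steps be needed.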

S2: If the hash values calculated in S1 differ, decompress the documents and extract the text, images and videos.

Note: Word and PowerPoint documents in the Microsoft Office suite are actually stored in the zip file format, so they can be decompressed with a standard zip algorithm to extract the original text, images and videos.
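A minimal sketch of this extraction step using Python's standard `zipfile` module is given below. The extension lists, the function names and the choice of XML parts to read (`word/document.xml` for Word, `ppt/slides/slideN.xml` for PowerPoint) are assumptions made for illustration; the patent text does not specify them.

```python
import re
import xml.etree.ElementTree as ET
import zipfile
from pathlib import PurePosixPath

# Assumed extension lists; a real platform would tune these.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".emf", ".wmf"}
VIDEO_EXTS = {".mp4", ".avi", ".wmv", ".mov", ".m4v"}
# Assumed locations of the main text parts in .docx and .pptx archives.
TEXT_PART = re.compile(r"^(word/document\.xml|ppt/slides/slide\d+\.xml)$")


def extract_media(doc):
    """Return (images, videos) as dicts mapping archive member name to raw bytes.

    `doc` is a path or a binary file-like object holding a .docx/.pptx file.
    """
    images, videos = {}, {}
    with zipfile.ZipFile(doc) as archive:
        for name in archive.namelist():
            ext = PurePosixPath(name).suffix.lower()
            if ext in IMAGE_EXTS:
                images[name] = archive.read(name)
            elif ext in VIDEO_EXTS:
                videos[name] = archive.read(name)
    return images, videos


def extract_text(doc):
    """Concatenate the text nodes of the main XML parts, one line per part."""
    parts = []
    with zipfile.ZipFile(doc) as archive:
        for name in archive.namelist():
            if TEXT_PART.match(name):
                root = ET.fromstring(archive.read(name))
                parts.append(" ".join(t for t in root.itertext() if t.strip()))
    return "\n".join(parts)
```

Because the media bytes are returned exactly as stored in the archive, they can be hashed or compared directly in the later image and video steps.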

S3: for document D ₁ And D ₂ Respectively using word segmentation tools to perform word segmentation, part-of-speech tagging and preprocessing on the text to form a noun and a set W of proper nouns ₁ And W ₂ ；

Filtering out all stop words, various adverbs and adjectives, etc., only retaining nouns and proper nouns, and removing duplication to form a set W of nouns and proper nouns ₁ And W ₂ 。

Since for the vast majority of marketing materials only nouns and proper nouns have practical significance and the amount of computation at a later stage is significantly reduced.
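The filtering in S3 can be illustrated with the sketch below. It assumes the text has already been segmented and tagged by an external tool, so the input is a list of (word, tag) pairs. The tag names in `NOUN_TAGS` follow a convention common in Chinese part-of-speech taggers and are an assumption, as is the function name `noun_set`.

```python
# Assumed tag names: "n" common noun, "nr" person name, "ns" place name,
# "nt" organization name, "nz" other proper noun.
NOUN_TAGS = {"n", "nr", "ns", "nt", "nz"}


def noun_set(tagged_tokens, stop_words=frozenset()):
    """Keep only nouns and proper nouns that are not stop words.

    Returning a set removes duplicates, as S3 requires.
    """
    return {
        word
        for word, tag in tagged_tokens
        if tag in NOUN_TAGS and word not in stop_words
    }


tagged = [
    ("China Telecom", "nt"),
    ("plan", "n"),
    ("very", "d"),    # adverb, filtered out
    ("cheap", "a"),   # adjective, filtered out
    ("plan", "n"),    # duplicate, collapsed by the set
    ("thing", "n"),   # noun, but listed as a stop word below
]
w1 = noun_set(tagged, stop_words={"thing"})
```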

S4: for set W ₁ And W ₂ Respectively calculating TD-IDF value of each word in the list, and sorting the words according to the size of the TD-IDF value to form a list L ₁ And L ₂ ；

When calculating the IDF value, all documents in the same industry or the same subject on the whole system or platform in the corpus are adopted, but not all documents on the whole system or platform, so that the accuracy of the IDF can be further improved.

TD-IDF＝TF*IDF
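The ranking in S4 might be sketched as follows. The patent text gives only TF-IDF = TF * IDF, so the concrete definitions here are assumptions: TF is the word count divided by the document's token count, IDF uses a common smoothed form, and the list is sorted in descending order with ties broken alphabetically. The function name `tf_idf_ranked` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_ranked(tokens, candidates, corpus):
    """Rank candidate words of one document by TF-IDF, highest first.

    tokens:     all tokens of the document (with repeats)
    candidates: the noun set W of the document
    corpus:     list of token sets, one per same-industry or same-subject document
    """
    counts = Counter(tokens)
    total = len(tokens) or 1
    n_docs = len(corpus)
    scores = {}
    for word in candidates:
        tf = counts[word] / total
        df = sum(1 for doc in corpus if word in doc)
        # Assumed smoothed IDF; avoids division by zero for unseen words.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    return sorted(candidates, key=lambda w: (-scores[w], w))


tokens = ["5G", "5G", "plan", "price", "plan", "5G"]
corpus = [{"plan", "price"}, {"plan"}, {"5G", "plan"}]
ranked = tf_idf_ranked(tokens, {"5G", "plan", "price"}, corpus)
```

Restricting `corpus` to documents of the same industry or subject, as the description suggests, only changes which token sets are passed in.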

S5: Select the top N words from each of L₁ and L₂ and merge them into a set W of length M; for D₁ and D₂ respectively, calculate the word frequency of each word in W and generate the respective word frequency vectors.

Note: if the two selections have no words in common, M = 2 × N.
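A possible sketch of S5 is shown below. It assumes raw occurrence counts serve as the word frequency and that the merged set W is put into sorted order so both vectors share the same component order; the function name `frequency_vectors` is introduced here for illustration.

```python
from collections import Counter


def frequency_vectors(ranked_1, ranked_2, tokens_1, tokens_2, n):
    """Merge the top-n words of both ranked lists and count them per document.

    Returns (vocab, vec_1, vec_2), where vocab is the merged set W in sorted
    order and each vector holds one count per vocab word.
    """
    vocab = sorted(set(ranked_1[:n]) | set(ranked_2[:n]))
    counts_1, counts_2 = Counter(tokens_1), Counter(tokens_2)
    vec_1 = [counts_1[word] for word in vocab]
    vec_2 = [counts_2[word] for word in vocab]
    return vocab, vec_1, vec_2


vocab, vec_1, vec_2 = frequency_vectors(
    ["5G", "plan", "price"],          # ranked list L1
    ["plan", "fiber", "5G"],          # ranked list L2
    ["5G", "5G", "plan", "price"],    # tokens of D1
    ["plan", "fiber", "fiber", "5G"], # tokens of D2
    n=2,
)
```

Here the two top-2 selections share the word "plan", so M is 3 rather than 2 × N = 4.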

S6: Calculate the cosine similarity Sim_txt of the word frequency vectors of D₁ and D₂. The larger the value of Sim_txt, the more similar D₁ and D₂ are.
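For reference, the cosine similarity of two word frequency vectors can be computed as in this short sketch. Returning 0 when either vector is all zeros is an assumption, since the description does not address that case.

```python
import math


def cosine_similarity(vec_1, vec_2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))
    if norm_1 == 0 or norm_2 == 0:
        # Assumption: a document with none of the selected words is dissimilar.
        return 0.0
    return dot / (norm_1 * norm_2)


sim_txt = cosine_similarity([2, 0, 1], [1, 2, 1])
```

Because cosine similarity depends only on the direction of the vectors, raw counts and length-normalized frequencies give the same result.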

S7: Perform a hash calculation on all pictures in D₁ and D₂ respectively, using an algorithm such as MD5, to obtain the hash value lists Hsh₁ and Hsh₂ of lengths Len₁ and Len₂, and from them calculate the image similarity Sim_pic of D₁ and D₂.

When an image in an Office document is scaled, cropped or otherwise adjusted, the original image is still stored in the document, and only information such as size and position is recorded numerically. Consequently, even if the same image is scaled or cropped differently in different documents, the hash calculation yields the same value. On this basis:

Determine the number N_hsh of identical values in Hsh₁ and Hsh₂, and calculate the image similarity of D₁ and D₂ as Sim_pic = N_hsh / Max(Len₁, Len₂).
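The image similarity can be sketched as below. Counting N_hsh as the size of the multiset intersection of the two hash lists is an assumption about how repeated images are handled, and returning 1 when neither document contains images follows the rule stated for S9. The function name `image_similarity` is hypothetical.

```python
import hashlib
from collections import Counter


def image_similarity(images_1, images_2):
    """Sim_pic = N_hsh / Max(Len1, Len2) over MD5 hashes of raw image bytes.

    images_1, images_2: iterables of bytes objects, one per embedded image.
    """
    hashes_1 = [hashlib.md5(data).hexdigest() for data in images_1]
    hashes_2 = [hashlib.md5(data).hexdigest() for data in images_2]
    if not hashes_1 and not hashes_2:
        return 1.0  # neither document contains images
    # Multiset intersection: each hash counts as often as it occurs in both lists.
    n_hsh = sum((Counter(hashes_1) & Counter(hashes_2)).values())
    return n_hsh / max(len(hashes_1), len(hashes_2))


sim_pic = image_similarity([b"a", b"b", b"c"], [b"a", b"c", b"d", b"e"])
```

In the usage line, two of the images are shared and the longer list has four entries, so the result is 2 / 4.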

S8: Calculate the similarity Sim_video of the video content in D₁ and D₂ by byte comparison.

Because a video may have been edited, most videos with the same content have different hash values due to differing frames; for video content, a byte comparison is therefore required to judge similarity.

Set a threshold T₁: two videos are considered similar when the proportion of identical content exceeds T₁, which may default to 70%.

The video files in D₁ and D₂ form the lists LV₁ and LV₂. Each video in LV₁ is compared byte by byte with each video in LV₂; if the proportion of identical content exceeds T₁, the value 1 is appended to a list L, otherwise the value 0 is appended. In the end, L is a list of values 1 or 0, e.g. [1, 0, 1, …, 0], from which the similarity Sim_video of the video content in D₁ and D₂ is calculated.
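One way to read S8 is sketched below, under several loudly stated assumptions that the patent text does not confirm: the proportion of identical content is measured position by position against the longer file; the list L holds one entry per video in LV₁, set to 1 when at least one video in LV₂ matches (the wording could also be read as one entry per pair); and Sim_video is taken as the sum of L divided by Len(LV₁). All function names are hypothetical.

```python
def byte_match_ratio(video_a, video_b):
    """Share of positions holding the same byte, relative to the longer file."""
    if not video_a and not video_b:
        return 1.0
    same = sum(1 for x, y in zip(video_a, video_b) if x == y)
    return same / max(len(video_a), len(video_b))


def video_similarity(videos_1, videos_2, t1=0.7):
    """Fraction of videos in LV1 that match some video in LV2 above threshold t1."""
    if not videos_1 and not videos_2:
        return 1.0  # neither document contains video
    if not videos_1 or not videos_2:
        return 0.0  # assumption: video present in only one document
    matches = [
        1 if any(byte_match_ratio(a, b) > t1 for b in videos_2) else 0
        for a in videos_1
    ]
    return sum(matches) / len(videos_1)


clip_a = bytes(range(10))
clip_b = bytes(range(8)) + b"\xff\xff"  # same as clip_a except the last two bytes
clip_c = b"\x00" * 10                   # unrelated content
sim_video = video_similarity([clip_a, clip_c], [clip_b])
```

A position-wise comparison is cheap but sensitive to inserted or removed bytes, so a production system might prefer a more tolerant comparison; that choice is outside what the description specifies.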

S9: Set weights w₁, w₂, w₃ for the text, image and video similarity respectively, where w₁ + w₂ + w₃ = 1; obtain the document similarity Sim by weighted calculation and compare it with a preset threshold to reach a conclusion on document similarity.

If a type of content is absent, the corresponding similarity is set to 1.

For example, if neither document contains video, Sim_video is 1; if neither document contains text, Sim_txt is 1; if neither document contains images, Sim_pic is 1.

The final similarity of D₁ and D₂ is then Sim = Sim_txt * w₁ + Sim_pic * w₂ + Sim_video * w₃. A threshold T₂ is set; if the document similarity exceeds T₂, the two documents are considered similar and are passed to manual review and subsequent processing.
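The weighted combination in S9 can be sketched as follows. The default weights (0.5, 0.3, 0.2) and the default threshold 0.8 are placeholder values chosen for illustration only; the description leaves w₁, w₂, w₃ and T₂ to be configured.

```python
def document_similarity(sim_txt, sim_pic, sim_video,
                        weights=(0.5, 0.3, 0.2), threshold=0.8):
    """Return (Sim, needs_review) for two documents.

    Sim = Sim_txt * w1 + Sim_pic * w2 + Sim_video * w3, with w1 + w2 + w3 = 1.
    needs_review is True when Sim exceeds the threshold T2.
    """
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    sim = sim_txt * w1 + sim_pic * w2 + sim_video * w3
    return sim, sim > threshold


# Neither document contains video, so Sim_video is set to 1 as S9 prescribes.
sim, needs_review = document_similarity(0.9, 1.0, 1.0)
```

Document pairs for which `needs_review` is True would then be passed to manual review, as the description states.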

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment, and all technical solutions that fall under the idea of the present invention belong to its protection scope. It should be noted that modifications and adaptations that those skilled in the art may make without departing from the principles of the present invention are also regarded as within the protection scope of the present invention.

Claims

1. A method for determining document similarity based on image, video and text content simultaneously, comprising:

S3: perform word segmentation, part-of-speech tagging and preprocessing on the text of D₁ and D₂ respectively to form the noun and proper-noun sets W₁ and W₂;

S4: calculate the TF-IDF value of each word in W₁ and W₂ respectively, and sort the words by TF-IDF value to form the lists L₁ and L₂;

S5: select the top N words from each of L₁ and L₂ and merge them into a set W; for D₁ and D₂ respectively, calculate the word frequency of each word in W and generate the respective word frequency vectors;

S7: perform a hash calculation on all pictures in D₁ and D₂ respectively to obtain the hash value lists Hsh₁ and Hsh₂, and from them calculate the image similarity Sim_pic of D₁ and D₂;

2. The method of claim 1, wherein the document type is Word or PowerPoint.

3. The method for judging document similarity based on image, video and text content simultaneously according to claim 1, wherein S1 uses an algorithm such as MD5 to perform a hash calculation on D₁ and D₂, and if the calculated hash values are equal, D₁ and D₂ are duplicate documents.

4. The method for judging document similarity based on image, video and text content simultaneously according to claim 1, wherein the preprocessing in S3 comprises filtering out all stop words and all adverbs and adjectives, retaining only nouns and proper nouns, and removing duplicates.

5. The method for judging document similarity based on image, video and text content simultaneously according to claim 1, wherein when the IDF value is calculated in S4, the corpus consists of all documents of the same industry or the same subject on the whole system or platform.

6. The method for judging document similarity based on image, video and text content simultaneously according to claim 1, wherein S7 calculates the image similarity of D₁ and D₂ as Sim_pic = N_hsh / Max(Len₁, Len₂),

wherein N_hsh is the number of identical values in Hsh₁ and Hsh₂, and Len₁ and Len₂ are the lengths of the lists Hsh₁ and Hsh₂ respectively.

7. The method for judging document similarity based on image, video and text content simultaneously according to claim 1, wherein in S8 the video files in D₁ and D₂ form the lists LV₁ and LV₂; each video in LV₁ is compared byte by byte with each video in LV₂, and if the proportion of identical bytes exceeds a preset value T₁, the value 1 is appended to a list L, otherwise the value 0 is appended; in the end, L is a list of values 1 or 0;

the similarity Sim_video of the video content in D₁ and D₂ is calculated from the list L.

8. The method for judging document similarity based on image, video and text content simultaneously according to claim 1, wherein in S9, if a document contains no text, image or video content, the corresponding similarity is set to 1;

the similarity of D₁ and D₂ is calculated as Sim = Sim_txt * w₁ + Sim_pic * w₂ + Sim_video * w₃;