CN116204612A - Text similarity calculation method and system - Google Patents

Text similarity calculation method and system

Info

Publication number
CN116204612A
CN116204612A
Authority
CN
China
Prior art keywords
text
word
register
simhash
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211286844.8A
Other languages
Chinese (zh)
Inventor
石林灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd filed Critical XFusion Digital Technologies Co Ltd
Priority to CN202211286844.8A priority Critical patent/CN116204612A/en
Publication of CN116204612A publication Critical patent/CN116204612A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text similarity calculation method and system. The method comprises the following steps: performing word segmentation on a first text to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first word segment and a word weight of each first word segment; for each first word segment, generating a weighted digit string of the first word segment according to the hash value and the word weight of the first word segment; performing vectorized accumulation calculation on the weighted digit strings of the first word segments to obtain a sequence string of the first text, and performing vectorized dimension reduction on the sequence string of the first text to obtain a Simhash signature of the first text; and acquiring a Simhash signature of a second text, and obtaining the similarity between the first text and the second text based on the Simhash signature of the first text and the Simhash signature of the second text. This improves the calculation speed and the overall efficiency.

Description

Text similarity calculation method and system
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text similarity calculation method and system.
Background
With the development of computer and internet technologies, massive amounts of data and information must be handled, which brings great challenges to data storage, data processing and data transmission. Much identical or similar content exists in this data and information, and the data size can be compressed through deduplication and duplicate-checking operations so as to be easier to store and process. In addition, identifying identical or substantially identical content facilitates classification and modeling and provides reference bases for statistics, plagiarism determination and the like. Furthermore, the party providing the raw data and the party collecting the data for text similarity calculation face challenges in data sharing and privacy protection.
In the prior art, there is a lack of an effective technical means that can efficiently judge the similarity between two objects, such as two pieces of text content, give a reliable quantitative index, and at the same time take data sharing, privacy protection and the like into account.
In summary, the problem to be solved at present is how to provide a technical solution that takes both data sharing and privacy protection into account, efficiently determines the similarity between two objects, such as two pieces of text content, and provides a reliable quantitative index.
Disclosure of Invention
The embodiment of the application provides a text similarity calculation method and a text similarity calculation system, which are used for solving the problems in the prior art.
In a first aspect, the present application provides a text similarity calculation method. The method comprises the following steps: performing word segmentation on a first text to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first word segment and a word weight of each first word segment; for each first word segment, generating a weighted digit string of the first word segment according to the hash value and the word weight of the first word segment; performing vectorized accumulation calculation on the weighted digit strings of the first word segments to obtain a sequence string of the first text, and performing vectorized dimension reduction on the sequence string of the first text to obtain a Simhash signature of the first text; and acquiring a Simhash signature of a second text, and obtaining the similarity between the first text and the second text based on the Simhash signature of the first text and the Simhash signature of the second text.
In a second aspect, the present application provides a system. The system comprises: a word segmentation unit, configured to segment a first text to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first word segment and a word weight of each first word segment; a hash calculation unit, configured to calculate a hash value of each word segment; a weighted digit string generating unit, configured to generate, for each first word segment, a weighted digit string of the first word segment according to the hash value and the word weight of the first word segment; an accumulation calculation unit, configured to perform vectorized accumulation calculation on the weighted digit strings of the first word segments to obtain a sequence string of the first text; and a dimension reduction unit, configured to perform vectorized dimension reduction on the sequence string of the first text to obtain a Simhash signature of the first text.
In a third aspect, embodiments of the present application further provide a computing device, where the computing device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and where the processor, when executing the computer program, implements the method according to any implementation of any of the above aspects.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed on a computing device, cause the computing device to perform a method according to any one of the implementations of any one of the above aspects.
In a fifth aspect, embodiments of the present application also provide a computer program product, characterized in that the computer program product comprises instructions stored on a computer-readable storage medium, which instructions, when run on a computing device, cause the computing device to perform a method according to any one of the implementation manners of any one of the above aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an example of a calculation flow of Simhash-based text similarity calculation according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an application scenario of Simhash-based text similarity calculation according to an embodiment of the present application;
Fig. 3 is a schematic diagram of another application scenario of Simhash-based text similarity calculation according to an embodiment of the present application;
Fig. 4 is a schematic diagram of another application scenario of Simhash-based text similarity calculation according to an embodiment of the present application;
Fig. 5 is a flow chart of a Simhash-based text similarity calculation method according to an embodiment of the present application;
Fig. 6 is a flowchart of the vectorized accelerated accumulation calculation and vectorized accelerated dimension reduction steps of the text similarity calculation method of fig. 5 according to an embodiment of the present application;
Fig. 7 is a schematic diagram of batch accumulation calculation performed by a vector addition instruction in the vectorized accelerated accumulation calculation step of fig. 6 according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a Simhash calculation module provided in an embodiment of the present application;
Fig. 9 is a schematic diagram of an accumulation calculation unit of the Simhash calculation module of fig. 8 according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that in the description of this application, "at least one" means one or more, and "a plurality" means two or more. In addition, the words "first", "second" and the like, unless otherwise indicated, are used solely for purposes of description and are not to be construed as indicating or implying relative importance or order.
It should be understood that in the description of this application, a bit is the smallest unit of information and is also a binary digit: one binary digit carries one bit of information, and the bit is the smallest unit of data storage in a computer. Eight bits form one byte, that is, a byte is a data unit 8 bits long. The main units of measure for the amount of information stored in a computer are based on the byte: 1 byte is 8 bits, 1 kilobyte (KB) is 1024 bytes, and 1 megabyte (MB) is 1024 KB.
Referring to fig. 1, fig. 1 is a schematic diagram of an example of a calculation flow of Simhash-based text similarity calculation according to an embodiment of the present application. Simhash belongs to the family of locality-sensitive hashing (LSH) algorithms, which map high-dimensional feature vectors into low-dimensional feature vectors, so that whether two texts are duplicates or highly similar can be determined from the Hamming distance between the two vectors. By applying the Simhash algorithm, representative keywords in articles can be selected through a word segmentation strategy and encoded into a character string of fixed length, so that articles are compared by comparing these fixed-length codes to judge whether they are identical or highly similar. The example of the calculation flow of the Simhash-based text similarity calculation shown in fig. 1 is described below. In addition, the Hamming distance between two equal-length character strings is the number of positions at which the corresponding characters differ, or equivalently, the number of characters that need to be replaced to transform one string into the other. For two numbers, the Hamming distance is the number of positions at which the binary representations of the two numbers differ. For example, assuming the two decimal numbers are 93 and 73, converting 93 and 73 into binary numbers yields 1011101 and 1001001, respectively; the two binary strings differ at 2 corresponding positions, and thus the Hamming distance between 93 and 73 is 2.
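As an illustration of the Hamming-distance example above, a minimal sketch (Python is used here purely for illustration; the patent prescribes no particular language or function names):

```python
def hamming_distance(a, b):
    """Number of bit positions at which the binary forms of a and b differ."""
    return bin(a ^ b).count("1")

# 93 = 0b1011101 and 73 = 0b1001001 differ at exactly two bit positions.
print(hamming_distance(93, 73))  # → 2
```

For equal-length Simhash signatures given as 0/1 strings, the same count can be applied to `int(sig_a, 2) ^ int(sig_b, 2)`.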
Step S102: acquire the text "staff members in N region of M country said they saw the transporter".
The text acquired in step S102 is the text content for text similarity calculation; it may be data or information expressed in any form and obtained from any suitable source, for example part of the content displayed on a web page, message feedback in a forum, or a paragraph in a paper. The text acquired in step S102 represents a small part of the massive information or data that needs to be checked for duplicates, deduplicated, or discriminated for similarity. If text similarity calculation were performed on such texts through word-by-word comparison, the calculation efficiency would be low and the accuracy of the judgment result would easily be affected by wrongly written characters, synonyms and the like. A traditional hash algorithm maps the original content as uniformly and randomly as possible to a signature value, and the signature values mapped from the texts are compared; if the signature values of the two texts under comparison are equal, the original contents of the two texts are identical with a certain probability, and otherwise they are not identical. However, a traditional hash algorithm can only reflect whether the original contents differ; it cannot measure the similarity of the original contents in the dimension of the signature, and even if the original contents differ by only one byte, the signature values may be completely different. The Simhash algorithm, by contrast, can provide, in the dimension of the signature, a reference that characterizes the degree of similarity of the original contents. For example, take two text strings: text string A is "your mom shouts for you to come home for dinner, come home now", and text string B is "your mom calls for you to come home for dinner, come home now". The two text strings differ by only one word. The results obtained by calculating these two text strings with a traditional hash algorithm and with the Simhash algorithm are shown in table 1 below.
Comparing the calculation results of text string A and text string B under the traditional hash algorithm, the signature values of two text strings that differ by only one word show a very obvious difference, that is, the binary digits differ at many corresponding positions. By contrast, comparing the calculation results of text string A and text string B under the Simhash algorithm, only a few corresponding positions have different binary digits, so the calculation result under the Simhash algorithm can provide a reference for the degree of similarity between text string A and text string B in the signature dimension. In other words, the Hamming distance computed under the Simhash algorithm provides a reference for the degree of similarity between text string A and text string B, which can be used to determine whether the two are duplicates or highly similar.
TABLE 1
(Table 1 is rendered as an image in the original document; it lists, for text string A and text string B, the signature value under the traditional hash algorithm and under the Simhash algorithm.)
Step S104: segment the text "staff members in N region of M country said they saw the transporter" to obtain the word segment "M country" and the word segment "N region", and mark the word weight of the word segment "M country" as 4 and the word weight of the word segment "N region" as 5.
In step S104, the word segmentation strategy is used to select representative keywords in the text, so the word segmentation strategy can be adjusted according to the actual application and needs. In this example, based on the word segmentation strategy, the text "staff members in N region of M country said they saw the transporter" is segmented to obtain the word segment "M country" and the word segment "N region". In other examples, segmenting the same text may yield the word segments "M country", "N region", "staff" and "transporter". In some embodiments, different texts from the same source may be based on the same word segmentation strategy or on different word segmentation strategies. For example, for different paragraphs of the same paper, one word segmentation strategy may be applied to one paragraph while another is applied to the next. In some embodiments, the word segmentation strategy may be based on a machine learning (ML) model, a natural language processing (NLP) model, or any suitable artificial intelligence model or algorithm. After the text is segmented, the word weight of each word segment is also identified, that is, a word weight identification is set. For example, assume that word weights are divided into 5 levels, level 1 to level 5 from lowest to highest. Here, the word weight of the word segment "M country" is marked as 4, that is, the word weight corresponding to level 4, and the word weight of the word segment "N region" is marked as 5, that is, the word weight corresponding to level 5.
Step S106: calculate, by a hash algorithm, the hash value "100101" of the word segment "M country" and the hash value "101011" of the word segment "N region".
In step S106, the hash value of each word segment obtained by the word segmentation strategy (e.g., by the NLP model) in step S104 is calculated by a hash algorithm, where the hash algorithm may be a common hash algorithm such as MurmurHash, xxHash, or CityHash.
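A sketch of this hashing step: the patent names MurmurHash, xxHash and CityHash, none of which is in the Python standard library, so MD5 truncated to 64 bits is used below purely as a stand-in; the function name `word_hash_bits` is an illustrative assumption, not from the patent.

```python
import hashlib

def word_hash_bits(word, n_bits=64):
    """Map a word segment to an n_bits-long binary 0/1 string (n_bits <= 64).

    Stand-in hash: MD5 (stdlib) truncated to 64 bits; the patent instead
    suggests MurmurHash, xxHash or CityHash for this step.
    """
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
    return format(value, "064b")[-n_bits:]

print(len(word_hash_bits("M country")))  # → 64
```

Any hash that spreads the words uniformly over fixed-length bit strings will do; the rest of the flow only consumes the 0/1 string.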
Step S108: generate a weighted digit string according to the hash value and the word weight of each word segment; the weighted digit string of the word segment "M country" is "4 -4 -4 4 -4 4", and the weighted digit string of the word segment "N region" is "5 -5 5 -5 5 5".
In step S108, for each word segment obtained in step S104 by the word segmentation strategy (for example, by the NLP model), a weighted digit string is obtained by a weighting calculation over the hash value from step S106 and the word weight: each binary 1 in the hash value contributes the positive word weight, and each binary 0 contributes the negative word weight. For example, the hash value of the word segment "M country" is "100101" and its word weight is 4, so the weighted digit string of the word segment "M country" obtained by the weighting calculation is "4 -4 -4 4 -4 4".
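The weighting rule just described (binary 1 contributes +weight, binary 0 contributes -weight) can be sketched as follows, with the weighted digit string modelled as a list of signed integers:

```python
def weighted_string(hash_bits, weight):
    """Step S108: a 1-bit contributes +weight, a 0-bit contributes -weight."""
    return [weight if bit == "1" else -weight for bit in hash_bits]

print(weighted_string("100101", 4))  # → [4, -4, -4, 4, -4, 4]
print(weighted_string("101011", 5))  # → [5, -5, 5, -5, 5, 5]
```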
Step S110: accumulate the weighted digit strings of the word segments of the text to obtain a sequence string of the text; the weighted digit string "4 -4 -4 4 -4 4" of the word segment "M country" and the weighted digit string "5 -5 5 -5 5 5" of the word segment "N region" are accumulated to obtain "9 -9 1 -1 1 9".
In step S110, the weighted digit strings of the word segments of the text are accumulated to obtain a sequence string, that is, the sequence string of the text. The accumulation calculation means that the numbers at corresponding positions on the weighted digit strings of the word segments are added together, so that the weighted digit string "4 -4 -4 4 -4 4" of the word segment "M country" and the weighted digit string "5 -5 5 -5 5 5" of the word segment "N region" are accumulated to obtain "9 -9 1 -1 1 9".
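A minimal sketch of the position-wise accumulation of step S110 (weighted digit strings again modelled as lists of signed integers; note that summing the two example strings position by position gives [9, -9, 1, -1, 1, 9]):

```python
def accumulate(weighted_strings):
    """Step S110: add the numbers at corresponding positions of every weighted string."""
    return [sum(column) for column in zip(*weighted_strings)]

print(accumulate([[4, -4, -4, 4, -4, 4],
                  [5, -5, 5, -5, 5, 5]]))  # → [9, -9, 1, -1, 1, 9]
```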
Step S112: perform dimension reduction on the sequence string "9 -9 1 -1 1 9" of the text to obtain the Simhash signature "101011" of the text.
In step S112, the sequence string "9 -9 1 -1 1 9" of the text obtained by the accumulation calculation in step S110 is converted into a binary digit string, that is, into a string expressed with 0s and 1s, thereby achieving dimension reduction and producing the Simhash signature. Specifically, the sequence string of the text is examined position by position: if the number at a position is greater than 0 it is scored as 1, and otherwise it is scored as 0. Thus the sequence string "9 -9 1 -1 1 9" of the text is dimension-reduced to obtain the Simhash signature "101011" of the text.
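The sign-based dimension reduction of step S112 can be sketched as:

```python
def reduce_to_signature(sequence_string):
    """Step S112: positions greater than 0 become 1, all others become 0."""
    return "".join("1" if n > 0 else "0" for n in sequence_string)

print(reduce_to_signature([9, -9, 1, -1, 1, 9]))  # → 101011
```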
It should be understood that the above steps S102 to S112 take the given text "staff members in N region of M country said they saw the transporter" as an example and illustrate step by step how to calculate the Simhash signature of a given text. The calculation flow of the Simhash-based text similarity calculation and its vectorization method mentioned in the embodiments of the present application are applicable to any input text and to any hash algorithm for calculating the hash values of the word segments.
Referring to the above steps S102 to S112: the input text is segmented and the word weights are labeled; the hash value of each word segment is calculated and weighted to obtain a hash-weighted result for each word segment; the hash-weighted results of all word segments are accumulated bit-aligned to obtain an accumulation result; and finally the accumulation result is processed bit by bit to obtain the Simhash signature of the text, each bit of which has the value 0 or 1. In practice, different input texts under different word segmentation strategies produce different data streams, and the step of bit-aligned accumulation over the hash-weighted results of all word segments in particular may face complex and variable calculation demands. For example, in step S110 the weighted digit strings of the word segments of the text are accumulated to obtain the sequence string of the text: word segments with higher word weight levels produce larger numbers, word segments with lower word weight levels produce smaller numbers, and the length of the weighted digit strings is hard to predict in advance, so performing this calculation with conventional scalar addition instructions or scalar addition operations suffers from a low calculation speed.
Therefore, an optimized acceleration method is needed that targets the characteristics of the data stream produced in the bit-aligned accumulation step over the hash-weighted results of all word segments, as well as the algorithmic characteristics of the calculation flow of the Simhash-based text similarity calculation, so as to improve the calculation speed and operation efficiency. This is further described below in connection with specific embodiments.
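As a conceptual sketch of the batched, bit-aligned accumulation that the later embodiments accelerate with vector addition instructions (fig. 6 and fig. 7): the helper `vector_add` below merely models a SIMD vector-add that processes one register's worth of lanes per call; the lane width of 4 and all function names are illustrative assumptions, not taken from the patent.

```python
def vector_add(lane_a, lane_b):
    """Stand-in for a SIMD vector-add instruction: one call adds a whole
    register's worth of lanes at once (here modelled as a Python list)."""
    return [x + y for x, y in zip(lane_a, lane_b)]

def accumulate_batched(weighted_strings, lanes=4):
    """Accumulate weighted digit strings block by block, the way the
    vectorized accumulation step batches the bit-aligned additions."""
    width = len(weighted_strings[0])
    total = [0] * width
    for ws in weighted_strings:
        for start in range(0, width, lanes):
            total[start:start + lanes] = vector_add(
                total[start:start + lanes], ws[start:start + lanes])
    return total

print(accumulate_batched([[4, -4, -4, 4, -4, 4],
                          [5, -5, 5, -5, 5, 5]], lanes=4))
# → [9, -9, 1, -1, 1, 9]
```

In real hardware one vector add replaces `lanes` scalar adds, which is the source of the speed-up this application pursues.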
Referring to fig. 2, fig. 2 is a schematic diagram of an application scenario of Simhash-based text similarity calculation according to an embodiment of the present application. Referring to the example of the calculation flow of the Simhash-based text similarity calculation shown in fig. 1, the Simhash algorithm can provide, in the signature dimension, a reference characterizing the similarity of the original content: respective Simhash signatures are calculated for the two objects to be compared, for example for text string A and text string B shown in table 1, and a reference for the similarity between the two objects is then provided in the signature dimension, for example by comparing the Hamming distance of the two Simhash signatures (that is, the number of corresponding positions at which the binary digits differ) to determine whether the two objects are duplicates or highly similar. In the application scenario shown in fig. 2, the two or more objects to be compared come from web pages, such as web page 202, web page 204 and web page 206 shown in fig. 2, which are collected and consolidated through the internet 210; the web page content of each web page is then obtained through operation 220, that is, through a search engine and web crawler, and the Simhash signatures of the web page content are obtained through operation 230, that is, the similarity calculation of the web page content. In operation 220, the search engine, web crawler or similar software constantly obtains various web pages in the internet 210, much of whose content is duplicate or highly similar. The Simhash signatures of the web page content of these web pages are computed in operation 230 to facilitate subsequent deduplication, duplicate checking, modeling, statistics and the like.
For example, by identifying large amounts of similar content in web page content, reference bases for trend judgment, public opinion analysis or business decision guidance can be provided. The web page content and the Simhash signatures calculated based on it are stored in the web page library 240, and the stored data in the web page library 240 can be used in operation 250, the web page deduplication service, so that storing identical or highly similar web page content can be avoided. Beyond the web page deduplication service, the data can also be used for modeling, statistics, analysis and other services. The Simhash signatures of the web page content provide a reference, in the signature dimension, for the degree of similarity between the web page content of different web pages, and can be used to quickly and effectively judge whether the web page content of different web pages is duplicate or highly similar. Also, since the web page deduplication service, that is, operation 250, is based on the Simhash signatures of the web page content, a server or host performing the web page deduplication service need not touch the original data of the web page content, which helps protect the data security and privacy of the party providing that original data (e.g., web page library 240).
Referring to fig. 3, fig. 3 is a schematic diagram of another application scenario of Simhash-based text similarity calculation according to an embodiment of the present application. As explained with reference to fig. 1 and table 1, the Simhash algorithm can provide, in the signature dimension, a reference characterizing the similarity of the original content: respective Simhash signatures are calculated for the two objects to be compared, and the Hamming distance between the two signatures is used to determine whether the two objects are duplicates or highly similar. In the application scenario shown in fig. 3, the two or more objects to be compared come from papers, such as paper 302, paper 304 and paper 306 shown in fig. 3, which are summarized and collated by journals, conferences and reports 310; the paper content of each paper is then obtained through operation 320, that is, through academic search and journal search, and the Simhash signatures of the paper content are obtained through operation 330, that is, the similarity calculation of the paper content. In operation 320, various papers from journals, conferences and reports 310 are obtained by an academic search, journal search or similar search service; their content needs to be checked for duplicates and deduplicated in order to ensure academic quality and identify academic plagiarism. Computing the Simhash signatures of the paper content of these papers in operation 330 facilitates subsequent deduplication, duplicate checking, modeling, statistics and the like.
For example, when the Hamming distance between the Simhash signatures of two papers is small, it may mean that the two papers are highly similar, and they can be marked for subsequent manual processing, so that large numbers of papers can be quickly checked for duplicates and deduplicated. The paper content and the Simhash signatures calculated based on it are stored in the paper library 340, and the stored data in the paper library 340 can be used in operation 350, the paper review service, so that the papers from journals, conferences and reports 310 stored in the paper library 340 can be quickly and efficiently reviewed and academic plagiarism can be identified. Beyond the paper review service, the data can also be used for modeling, statistics, analysis and other services. The Simhash signatures of paper content provide a reference, in the signature dimension, for the degree of similarity between the content of different papers, and can be used to quickly and effectively determine whether the content of different papers is duplicate or highly similar. Also, since the paper review service, that is, operation 350, is based on the Simhash signatures of the paper content, a server or host executing the paper review service need not touch the original data of the paper content, which helps protect the data security and privacy of the party providing that original data (e.g., paper library 340).
Referring to fig. 4, fig. 4 is a schematic diagram of another application scenario of Simhash-based text similarity calculation provided in the embodiments of the present application. As in the example calculation flow shown in fig. 1, the Simhash algorithm provides a reference for the similarity of the original content in the signature dimension: respective Simhash signatures are calculated for two objects to be compared, for example for the text string A and the text string B shown in table 1, and the hamming distance between the two Simhash signatures (i.e., the number of binary digits that differ at corresponding positions of the signatures) is then compared to determine whether the two objects are duplicated or highly similar. In the application scenario shown in fig. 4, the objects to be compared are derived from user feedback, such as user feedback 402, user feedback 404 and user feedback 406 shown in fig. 4, which is collected and collated from the user communities and forums 410 of products and e-commerce. The content of each item of user feedback is obtained in operation 420, i.e., by forum search or post search, and the Simhash signature of the user feedback content is then obtained in operation 430, i.e., by similarity calculation on the user feedback content. In operation 420, a forum search, post search or similar search service obtains user feedback from sources such as the user communities and forums of various products and e-commerce platforms; identifying similar content makes the feedback easier to categorize and supports public opinion processing, public opinion analysis and the like.
Computing Simhash signatures of the user feedback content in operation 430 facilitates subsequent deduplication, duplicate checking, modeling, statistics and the like. For example, by counting the number or scale of items of user feedback whose Simhash signatures have a sufficiently small hamming distance, representative feedback can be identified and analyzed to derive reference information such as public opinion and public opinion trends. The user feedback and the Simhash signatures calculated from its content are stored in the user feedback library 440, and the stored data in the user feedback library 440 can be used in operation 450, i.e., the user feedback deduplication service, so that user feedback from the user communities and forums of various products and e-commerce platforms stored in the user feedback library 440 can be deduplicated and categorized quickly and efficiently, facilitating subsequent data analysis. The Simhash signature of user feedback content provides a reference, in the signature dimension, for the degree of similarity between the contents of different items of feedback, and can be used to quickly and effectively determine whether those contents are duplicated or highly similar. Moreover, since the user feedback deduplication service, i.e., operation 450, operates on Simhash signatures of the user feedback content, a server or host executing the deduplication service need not touch the original user feedback content, which helps protect the data security and privacy of the party providing that original data (e.g., the user feedback library 440).
With reference to figs. 1 to 4, the example calculation flow of Simhash-based text similarity calculation uses the Simhash algorithm to provide a reference, in the signature dimension, for the degree of similarity of the original content, and provides a reliable quantitative index for measuring the similarity between two objects in the web page deduplication scenario shown in fig. 2, the paper review scenario shown in fig. 3, the user feedback deduplication scenario shown in fig. 4, and the like. The specific objects handled by the calculation flow are thus determined by the application scenario: the similarity between the web contents of different web pages must be determined in the web page deduplication scenario of fig. 2, the similarity between the contents of different papers in the paper review scenario of fig. 3, and the similarity between different items of user feedback in the user feedback deduplication scenario of fig. 4. Therefore, depending on the application scenario, the calculation flow of Simhash-based text similarity calculation may face different input texts or different objects for text similarity calculation, and may employ different word segmentation strategies or NLP models, so the data stream in the calculation flow is complex and variable, which makes it difficult to optimize with scalar addition instructions.
Therefore, an optimized acceleration method is needed that targets both the characteristics of the data stream generated in the step of performing aligned accumulation on the hash-weighted results of all segmented words, and the algorithmic characteristics of the Simhash-based text similarity calculation flow, so as to improve calculation speed and operation efficiency. This is further described below in conjunction with the embodiment of fig. 5.
Referring to fig. 5, fig. 5 is a flow chart of a text similarity calculation method based on Simhash according to an embodiment of the present application. As shown in fig. 5, the text similarity calculation method includes the following steps. It should be understood that the method and system for calculating text similarity provided in the embodiments of the present application may be used to calculate similarity for a plurality of texts, for example for a first text and a second text. The terms first text, second text and the like mentioned in the embodiments of the present application are used only for convenience of expression and do not necessarily mean that two or more texts are involved. The method and system may also involve only a single text and generate the Simhash signature of that single text, which can then be compared for similarity with the Simhash signatures of other texts, where those other signatures may be generated in any other suitable manner. That is, the method and system for calculating text similarity provided by the embodiments of the present application offer a technical solution that balances data sharing with privacy protection, can efficiently judge the similarity between two objects such as two pieces of text content, and gives a reliable quantitative index. A quantitative index generated by an embodiment of the present application, for example the first Simhash signature of the first text, can be compared with the second Simhash signature of the second text, also generated by an embodiment of the present application, to judge the similarity between the respective contents of the first text and the second text.
However, a quantitative index generated by an embodiment of the present application, for example the first Simhash signature of the first text, may also be compared with a Simhash signature generated in other manners. Unless the context indicates otherwise, the first text, the second text and so on mentioned below are only for convenience of expression and for distinguishing different texts.
Step S502: a first text for text similarity calculation is acquired.
In step S502, the first text used for text similarity calculation may be a web page, a paper, user feedback, or content in any other suitable carrier or form of expression. Text similarity calculation provides a reference basis for whether one text duplicates or closely approximates another, and a quantitative index of the degree of similarity between texts. For example, assuming that the texts acquired in step S502 are the web contents of different web pages, the text similarity calculation indicates whether those web contents are duplicated or highly similar, for example to provide a web page deduplication service.
Step S504: and performing word segmentation on the first text to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first word segmentation and word weight of each first word segmentation.
In step S504, the first text may be segmented according to a specific word segmentation strategy, NLP model or algorithm. For example, representative keywords may be extracted from the first text while portions containing no useful information are ignored. The specific manner of segmenting the first text, such as the word segmentation strategy or NLP model employed, may be specific to the first text. In some embodiments, different input texts from the same source, e.g., different paragraphs of the same paper, may each employ a different word segmentation strategy; for example, one strategy may be applied to an earlier paragraph or text and another strategy to a later paragraph or text. Furthermore, a word weight is identified for each first segmented word, and the specific manner of identifying word weights may also be specific to the first text. For example, assume that word weights are divided into 5 levels, from the lowest level 1 to the highest level 5, with the corresponding weights set to 1, 2, 3, 4 and 5.
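Step S504 can be sketched as follows. This is a deliberately simplified illustration: the whitespace tokenizer, the stopword list, and the keyword weight table are all assumptions made for the example, standing in for the word segmentation strategy or NLP model and the 5-level weight scheme described above:

```python
# Assumed for illustration only; a real system would use an NLP tokenizer
# and a learned or TF-IDF-style weighting scheme.
STOPWORDS = {"the", "a", "of", "in", "is"}
WEIGHTS = {"country": 4, "economy": 5, "report": 2}  # hypothetical 1-5 weights

def segment(text: str) -> list[tuple[str, int]]:
    # Split on whitespace, drop stopwords and punctuation, and attach
    # each remaining word's weight (defaulting to the lowest level, 1).
    tokens = [t.lower().strip(".,") for t in text.split()]
    return [(t, WEIGHTS.get(t, 1)) for t in tokens if t not in STOPWORDS]

print(segment("The economy of the country is in a report."))
# [('economy', 5), ('country', 4), ('report', 2)]
```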
Step S506: a hash value is computed for each of the at least one first word segment by traversing the word segments with a hash algorithm.
In step S506, the hash algorithm may be a common hash algorithm such as murmurhash, xxhash, cityhash, or the like.
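The per-word hashing of step S506 can be sketched as follows. Since murmurhash, xxhash and cityhash are not in the Python standard library, this sketch derives a fixed-width integer from an MD5 digest as a stand-in; the choice of MD5 here is an assumption for illustration, not part of the method:

```python
import hashlib

def token_hash(word: str, bits: int = 64) -> int:
    # Stand-in for murmurhash/xxhash/cityhash: take the first bits/8
    # bytes of the word's MD5 digest as a fixed-width integer hash.
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")

h = token_hash("economy")
print(f"{h:016x}")  # a 64-bit hash rendered as 16 hex digits
```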
Step S508: and for each first word, generating a weighted digital string of the first word according to the hash value and the word weight of the first word.
In step S508, the hash value of each first word segment is weighted by its word weight to obtain a weighted digit string. For example, assuming that the hash value of the word "M country" is "100101" and its word weight is 4, the weighted digit string of "M country" obtained by the weighting calculation is "4 -4 -4 4 -4 4": each bit of 1 maps to +4 and each bit of 0 maps to -4.
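The weighting rule of step S508 can be sketched directly from the example in the text (a minimal Python illustration):

```python
def weighted_digits(hash_bits: str, weight: int) -> list[int]:
    # Each hash bit of 1 contributes +weight; each bit of 0 contributes -weight.
    return [weight if b == "1" else -weight for b in hash_bits]

# The example from the text: hash value "100101" of "M country", word weight 4
print(weighted_digits("100101", 4))  # [4, -4, -4, 4, -4, 4]
```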
Step S510: and carrying out vectorization accumulation calculation on each weighted digital string of the first segmentation to obtain a sequence string of the first text, and carrying out vectorization dimension reduction on the sequence string of the first text to obtain a Simhash signature of the first text.
In step S510, an accumulation calculation is performed on the weighted digit strings of the first word segments: the digits at corresponding positions of the weighted digit strings are accumulated, and the result of the accumulation is then converted into a binary digit string, i.e., reduced in dimension. As mentioned above, depending on the application scenario, the calculation flow of Simhash-based text similarity calculation may face different input texts or objects and may employ different word segmentation strategies or NLP models, so the data stream in the calculation flow is complex and variable. For example, the length of the weighted digit strings to be accumulated in step S510 may vary, and the digits in the weighted digit strings may also vary over a wide range (affected by word weight identification and word weight levels). A scalar operation can only process one pair of data at a time, so accumulating two weighted digit strings requires many scalar operations, one per position. With vectorized accumulation, for example via an instruction set supporting vector operations such as a single instruction, multiple data (SIMD) instruction set, the same operation can be performed on each element of a data vector, achieving spatial parallelism. For example, when accumulating two weighted digit strings, the digits at multiple corresponding positions can be processed simultaneously by the vectorized accumulation calculation, such as by SIMD instructions.
In some embodiments, a single controller may control multiple processing units through SIMD instructions, operating on multiple registers, each holding smaller integers in a packed format.
After the vectorized accumulation calculation yields a sequence string of the first text, vectorized dimension reduction is performed on the sequence string to obtain the Simhash signature of the first text. Vectorized dimension reduction may likewise be based on instructions supporting vector operations, such as a SIMD instruction set. In this way, an instruction set supporting vector operations, such as SIMD, or a similar control mechanism converts the scalar operations in the aligned accumulation and dimension reduction links of the Simhash-based text similarity calculation flow into vector operations, which facilitates executing batch data addition and batch data comparison within a single cycle, thereby improving calculation speed and operation efficiency. The vectorized accumulation calculation and vectorized dimension reduction steps of the method of fig. 5 are described in further detail below in connection with the embodiment of fig. 6.
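The aligned accumulation and dimension reduction of step S510 can be sketched in scalar form as follows (a SIMD implementation performs the same logic but processes many positions per cycle; the sample weighted digit strings are made-up values):

```python
def accumulate(weighted_strings: list[list[int]]) -> list[int]:
    # Position-wise accumulation of all weighted digit strings:
    # sum the digits at each corresponding position.
    return [sum(col) for col in zip(*weighted_strings)]

def reduce_to_signature(sequence: list[int]) -> int:
    # Dimension reduction: a position scores 1 if positive, otherwise 0;
    # the 0/1 results are packed into a single integer signature.
    sig = 0
    for v in sequence:
        sig = (sig << 1) | (1 if v > 0 else 0)
    return sig

seq = accumulate([[4, -4, -4, 4], [-2, 2, -2, 2], [1, 1, 1, -1]])
print(seq)                             # [3, -1, -5, 5]
print(bin(reduce_to_signature(seq)))   # 0b1001
```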
In addition, after the Simhash signature of the first text is obtained by the text similarity calculation method shown in fig. 5, step S512 may further be performed: acquiring a Simhash signature of a second text, and obtaining the similarity between the first text and the second text based on the Simhash signature of the first text and the Simhash signature of the second text. In some embodiments, the Simhash signature of the second text may be obtained following the same steps or principles as that of the first text, which are not repeated here.
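The whole flow of fig. 5, through step S512, can be condensed into one scalar reference sketch. The MD5-based hash, the 16-bit signature width, and the token/weight pairs are assumptions made for the example; the vectorized steps described above accelerate the inner accumulation loop without changing its result:

```python
import hashlib

def simhash(tokens: list[tuple[str, int]], bits: int = 16) -> int:
    # Scalar reference of the fig. 5 flow: hash each token, weight its
    # bits, accumulate position-wise, then reduce to a 0/1 signature.
    acc = [0] * bits
    for word, weight in tokens:
        h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:bits // 8], "big")
        for i in range(bits):
            bit = (h >> (bits - 1 - i)) & 1
            acc[i] += weight if bit else -weight
    return int("".join("1" if v > 0 else "0" for v in acc), 2)

sig_a = simhash([("economy", 5), ("country", 4), ("report", 2)])
sig_b = simhash([("economy", 5), ("country", 4), ("summary", 2)])
# Step S512: overlapping token sets tend to yield a small hamming distance.
print(bin(sig_a ^ sig_b).count("1"))
```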
Referring to fig. 6, fig. 6 is a flowchart of the vectorized accumulation calculation and vectorized dimension reduction steps of the text similarity calculation method of fig. 5 according to an embodiment of the present application. As shown in fig. 6, step S510 of fig. 5 (performing a vectorized accumulation calculation on the weighted digit strings of the first word segments to obtain a sequence string of the first text, and performing vectorized dimension reduction on the sequence string to obtain the first Simhash signature of the first text) expands into the following steps.
Step S610: initializing a first register YMM0 according to a first integer-type value length, the first integer-type value length being based on a first word segmentation result, the first word segmentation result comprising at least one word, a weighted string of digits of each of the at least one word segmentation being converted into an array of bytes, the length of the array of bytes being the first integer-type value length.
An instruction set that supports vector operations, such as a SIMD instruction set, defines a plurality of registers through which a single controller may control multiple processing units. For example, the Advanced Vector Extensions (AVX) instruction set of x86 architecture microprocessors provides 16 registers that are 256 bits wide. Registers used for vector operations, i.e., vector registers, accept multiple data structures as operand types, so the registers must be initialized and the operand type specified before they are operated on by SIMD instructions. For example, the AVX instruction set provides registers supporting operand types including 32-bit (4-byte) single-precision floating point, 64-bit (8-byte) double-precision floating point, and integer types of 8, 16, 32 or 64 bits (1, 2, 4 or 8 bytes). In step S610, the first register YMM0 is initialized according to the first integer-type value length, i.e., the first register YMM0 used for vector operations may be initialized with a specified integer operand type (for example 1, 2, 4 or 8 bytes). The first integer-type value length is based on the first word segmentation result, which includes at least one word segment; the weighted digit string of each word segment is converted into a byte array whose length is the first integer-type value length.
As described above, the length of the weighted digit strings to be accumulated may vary, and the digits in them may also vary over a relatively wide range (affected by the word weight identification and word weight levels), so the operation requirements of the weighted digit strings of the word segments in the first word segmentation result can be expressed by setting a corresponding first integer-type value length. Moreover, converting the weighted digit string of each word segment in the first word segmentation result into a byte array of the first integer-type value length means that all word segments of the first word segmentation result can undergo subsequent operations through the first register YMM0 initialized according to that length.
Step S620 is performed next to step S610.
Step S620: traversing the weighted digit string of each of the at least one word, loading each time a byte array corresponding to a specific number of weighted digit strings into a second register YMM1, the specific number being determined based on the first integer number length and a width of the second register YMM1, the width of the second register YMM1 being equal to the width of the first register YMM 0.
In step S620, the width of the second register YMM1 is equal to the width of the first register YMM0, so all word segments of the first word segmentation result are equally applicable to the second register YMM1. Since all word segments of the first word segmentation result may not fit into the second register YMM1 at once, each traversal loads the byte arrays corresponding to a specific number of weighted digit strings into the second register YMM1, for example one, two, three or some other number of byte arrays per traversal. The specific number is determined from the first integer-type value length and the width of the second register YMM1; for example, if the width of the second register YMM1 is 256 bits and the first integer-type value length is 32 bits, each traversal loads the byte arrays of 8 weighted digit strings into the second register YMM1. For example, the SIMD instruction set supports loading 256 bits of integer data into vector registers without memory alignment. As mentioned above, the weighted digit string of each word segment is converted into a byte array whose length is the first integer-type value length, which expresses the operation requirements of the weighted digit strings in the first word segmentation result. Thus, by determining the specific number from the first integer-type value length and the width of the second register YMM1, the data-loading operation on the second register YMM1 also reflects those operation requirements, and the widths of the first register YMM0 and the second register YMM1 can be adapted flexibly.
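The per-traversal load size of step S620 can be sketched as follows. The 256-bit register width and 32-bit value length come from the example above; the chunking helper is a hypothetical illustration of how the byte arrays are grouped into register-sized loads:

```python
REGISTER_WIDTH_BITS = 256  # width of one AVX YMM register
VALUE_LENGTH_BITS = 32     # assumed first integer-type value length

# The "specific number" of weighted digit strings loaded per traversal:
lanes = REGISTER_WIDTH_BITS // VALUE_LENGTH_BITS
print(lanes)  # 8, matching the example in the text

def chunks(byte_arrays: list, size: int) -> list:
    # Group the byte arrays into register-sized loads; a final partial
    # group is handled by one more traversal step.
    return [byte_arrays[i:i + size] for i in range(0, len(byte_arrays), size)]

# 10 weighted digit strings need 2 loads: 8 strings, then the tail of 2.
print([len(c) for c in chunks(list(range(10)), lanes)])  # [8, 2]
```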
Step S630 is performed next to step S620.
Step S630: and performing batch accumulation calculation on the weighted digital strings of the specific number stored in each of the first register YMM0 and the second register YMM1 through a vector addition instruction to obtain an accumulation calculation result, and storing the accumulation calculation result into the first register YMM0, wherein the vector addition instruction is based on the first integer type numerical value length.
As mentioned above, the first register YMM0 is initialized according to the first integer-type value length in step S610, the byte arrays corresponding to the specific number of weighted digit strings (determined from the first integer-type value length and the width of the second register YMM1) are loaded into the second register YMM1 in step S620, and the width of the second register YMM1 equals that of the first register YMM0. This means that in step S630 the first register YMM0 and the second register YMM1 can each store the same (specific) number of weighted digit strings, and further that batch data addition can be performed on the weighted digit strings stored in the two registers by a vector addition instruction: the aligned addition is performed as a vector addition, i.e., one cycle can simultaneously process the accumulation of multiple weighted digit strings. It should be appreciated that the vector addition instruction is based on the first integer-type value length, which may serve as a parameter controlling the vector addition instruction. For example, the SIMD instruction set supports adding the values of two registers of the same width, interpreted as integer values of 1, 2, 4 or 8 bytes, and storing the result back to a register. Moreover, because the operation requirements of the weighted digit strings in the first word segmentation result are expressed by the corresponding first integer-type value length, the complex and variable data streams caused by different input texts, word segmentation strategies or NLP models can all be handled.
Step S640 is then performed after step S630 is performed.
Step S640: it is determined whether to traverse each of the at least one word segment. If yes in step S640, step S650 is executed; if no in step S640, step S620 is executed.
When it is determined in step S640 that not every one of the at least one word segment has been traversed, step S620 is performed again: the byte arrays corresponding to a new specific number of weighted digit strings are loaded into the second register YMM1, and the new accumulation result is then stored in the first register YMM0 in step S630. Thus, when the traversal ends, i.e., every one of the at least one word segment has been traversed, the first register YMM0 stores the accumulation result of all word segments of the at least one word segment.
Step S650: the second register YMM1 is initialized according to the first integer-type value length.
As mentioned above, after the traversal ends, the first register YMM0 stores the accumulation result of all word segments of the at least one word segment. In step S650, the second register YMM1 is initialized according to the first integer-type value length, i.e., the second register YMM1 used for vector operations may be initialized with a specified integer operand type (for example 1, 2, 4 or 8 bytes). In this manner, the initialized second register YMM1 is available for subsequent batch data comparison with the first register YMM0.
Step S660 is performed next to step S650.
Step S660: the first register YMM0 and the second register YMM1 are subjected to batch comparison by a batch comparison instruction, and the comparison result is stored in the first register YMM0, wherein the batch comparison instruction is based on the first integer type numerical length.
In step S660, the first register YMM0 stores the accumulation result of all word segments of the at least one word segment, and the second register YMM1 was initialized, i.e., cleared, in step S650. Referring to the example calculation flow of Simhash-based text similarity calculation in fig. 1, after the sequence string of a text is obtained, it must be converted into a binary digit string, i.e., a string expressed with 0s and 1s, to achieve dimension reduction and obtain the Simhash signature. Specifically, the sequence string is examined position by position: a position is scored 1 if its number is greater than 0, and 0 otherwise. Thus, in step S660, a batch comparison instruction, for example one based on a SIMD instruction set supporting vector operations, can divide the values of two registers of the same width into positions according to the first integer-type value length, compare the values of the two registers at each position, and yield 1 or 0 as the comparison result. In this way, performing the batch comparison by the batch comparison instruction and storing the result in the first register YMM0 in step S660 implements the dimension reduction link as a vector operation.
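The batch comparison of step S660 can be sketched as follows, simulating each register lane in Python (the sample accumulated values are made up; a SIMD comparison processes all lanes in one instruction):

```python
def batch_compare(acc: list[int]) -> list[int]:
    # Simulate comparing YMM0 (accumulated values) against the
    # zero-initialized YMM1 from step S650: each lane yields 1 if its
    # value is greater than 0, otherwise 0.
    zeros = [0] * len(acc)
    return [1 if a > z else 0 for a, z in zip(acc, zeros)]

print(batch_compare([7, -3, 1, -5]))  # [1, 0, 1, 0]
```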
Step S670 is performed next after step S660 is performed.
Step S670: and storing the stored result of the first register YMM0 to a byte array with the length of the first integer numerical value according to the first integer numerical value length for generating a Simhash signature of the first word segmentation result.
In step S670, the stored result of the first register YMM0 is saved, according to the first integer-type value length, to a byte array of that length. This may use an operation instruction provided by, for example, the SIMD instruction set for saving the value of a register to an integer array in memory, for example storing a 256-bit integer vector to a 256-bit memory-aligned location. It should be appreciated that after the register result of the first register YMM0 is saved to, for example, a byte array of length 32 bits (the first integer-type value length being 32 bits), left-shift and bitwise OR operations may be used to combine the byte array into a 32-bit integer Simhash value, i.e., the Simhash signature of the first word segmentation result.
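The final combining step can be sketched as follows: the per-position 0/1 results from the comparison are merged into one integer Simhash value with left shifts and bitwise OR (a minimal Python illustration with made-up bit values):

```python
def pack_bits(bit_array: list[int]) -> int:
    # Combine the 0/1 byte array into a single integer Simhash value
    # by shifting left and OR-ing in each bit in turn.
    value = 0
    for b in bit_array:
        value = (value << 1) | b
    return value

print(pack_bits([1, 0, 1, 0]))               # 10, i.e. 0b1010
print(pack_bits([1, 1, 1, 1, 1, 1, 1, 1]))   # 255, i.e. 0xff
```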
Referring to fig. 7, fig. 7 is a schematic diagram of the batch accumulation calculation performed by the vector addition instruction in the vectorized accumulation step of fig. 6 according to an embodiment of the present application. As shown in fig. 7, the vector stored in the first register YMM0 contains four data elements X1, X2, X3 and X4, and the vector stored in the second register YMM1 contains four data elements Y1, Y2, Y3 and Y4. The accumulation result obtained by batch accumulation with the vector addition instruction shows that four data elements can be processed simultaneously in one cycle, the result vector being X1+Y1, X2+Y2, X3+Y3 and X4+Y4.
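The operation of fig. 7 can be sketched as follows, with a Python list standing in for each register's four lanes (the data values are made up; a real vector addition instruction performs this element-wise add in a single cycle):

```python
def vector_add(ymm0: list[int], ymm1: list[int]) -> list[int]:
    # One element-wise addition over all lanes, as a single vector
    # addition instruction would compute X1+Y1, X2+Y2, X3+Y3, X4+Y4.
    return [x + y for x, y in zip(ymm0, ymm1)]

print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```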
Referring to figs. 6 and 7, vector operation instructions, for example the vector instructions of the AVX instruction set of x86 architecture microprocessors, realize batch loading of the hash-weighted results to be accumulated, i.e., the weighted digit strings, and batch data accumulation completed within one clock cycle. Vectorized optimization is thus achieved for both the characteristics of the data stream generated in the aligned-accumulation step of the Simhash-based text similarity calculation flow and the algorithmic characteristics of that calculation flow, greatly improving calculation speed. Furthermore, for application scenarios requiring Simhash calculation on massive texts such as web pages, papers and user feedback, whose operation requirements are complex and variable, the influence of specific word segmentation strategies, NLP models and the like on the data stream is expressed by setting corresponding integer value lengths, so that various application scenarios can be adapted flexibly and overall performance improved. In addition, batch data comparison via the batch comparison instruction vectorizes the dimension reduction link as well, reuses the register resources of the aligned-accumulation link, and improves resource utilization efficiency.
It should be appreciated that the vector addition instructions described above, as well as other instruction sets that support vector operations, are exemplified by the AVX instruction set of an x86 architecture microprocessor. Other instruction sets, provided they support the vectorization method provided in the embodiments of the present application, may also be applied to the present application, for example, an ARM architecture microprocessor (e.g., the NEON instruction set of an ARM architecture microprocessor), a PowerPC architecture microprocessor, a SPARC architecture microprocessor, a MIPS architecture microprocessor (e.g., the MSA instruction set of a MIPS architecture microprocessor), and so on.
It should be understood that the first register YMM0 and the second register YMM1 described above are only illustrative, and the names of the registers used in practical applications may be determined according to a specific instruction set used, and the specific registers used may be specified by a compiler. Thus, any register or register name specified by the instruction set or compiler may be used for the corresponding first and/or second registers YMM0 and YMM1 as long as the operation principle mentioned in the embodiments of the present application is satisfied.
Referring to figs. 1 to 7, the embodiments of the present application provide a text similarity calculation method comprising: performing word segmentation on a first text to obtain a first word segmentation result, the first word segmentation result including at least one first word segment and a word weight for each first word segment; for each first word segment, generating a weighted digit string of the first word segment according to its hash value and word weight; performing a vectorized accumulation calculation on the weighted digit strings of the first word segments to obtain a sequence string of the first text, and performing vectorized dimension reduction on the sequence string to obtain a Simhash signature of the first text; and acquiring a Simhash signature of a second text, and obtaining the similarity between the first text and the second text based on the Simhash signature of the first text and the Simhash signature of the second text. The method thus achieves vectorized optimization, through vectorized accumulation calculation and vectorized dimension reduction, targeted at the characteristics of the data stream generated in the aligned-accumulation link of the Simhash-based calculation flow and the algorithmic characteristics of that flow, greatly improving calculation speed.
It should be appreciated that the method embodiments of fig. 5 and 6 obtain a first text for text similarity calculation and generate a first Simhash signature of the first text through vectorized accelerated accumulation calculation and vectorized accelerated dimension reduction. The same method embodiments may also be applied to a second text to generate a second Simhash signature of the second text, and more generally to a plurality of texts, including the first text and the second text, generating a Simhash signature for each so that similarity comparisons can be performed. This provides a technical solution that accounts for data sharing and privacy protection, efficiently judges the similarity between two objects, such as two pieces of text content, and gives a reliable quantitative index.
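Once both signatures exist, the similarity comparison between the first text and the second text typically reduces to a Hamming distance between the two Simhash signatures. A minimal sketch follows; the normalization of the distance into a [0, 1] index is an illustrative choice, not mandated by the text.

```python
# Sketch of comparing two Simhash signatures (illustrative only).
def hamming(sig_a, sig_b):
    # Number of bit positions in which the two signatures differ.
    return bin(sig_a ^ sig_b).count("1")

def similarity(sig_a, sig_b, bits=64):
    # Quantitative index in [0, 1]: 1.0 means identical signatures
    # (normalization is an assumption made for illustration).
    return 1.0 - hamming(sig_a, sig_b) / bits
```

In practice a threshold on the raw Hamming distance (e.g. at most 3 differing bits out of 64) is a common way to decide "similar enough", though the text itself does not fix a threshold.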
In one possible implementation, performing the vectorized accelerated accumulation calculation to obtain the sequence string of the first text comprises: traversing the weighted digit string of each of the at least one word segment, loading a byte array corresponding to a specific number of weighted digit values into the second register each time, performing batch accumulation on the values stored in the first register and the second register through a vector addition instruction to obtain an accumulation result, and storing the accumulation result back into the first register.
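A scalar simulation of this batched accumulation may help. The `LANES` constant plays the role of the "specific number" of values loaded per iteration, and the accumulator list stands in for the first register; the 256-bit register width and 32-bit element width are assumptions chosen for illustration.

```python
# Scalar simulation of the batched accumulation: a "register" holds LANES
# values, and each vector-add step accumulates one batch of weighted values.
REGISTER_BITS = 256          # e.g. an AVX YMM register (assumption)
ELEMENT_BITS = 32            # first integer-type value length (assumption)
LANES = REGISTER_BITS // ELEMENT_BITS   # the "specific number" per load

def batch_accumulate(weighted_strings):
    # weighted_strings: list of equal-length lists of signed ints
    # (one weighted digit string per word segment).
    n = len(weighted_strings[0])
    acc = [0] * n            # stands in for the zeroed first register(s)
    for ws in weighted_strings:
        # Process the digit string in batches of LANES values, mimicking
        # "load into second register, vector-add into first register".
        for start in range(0, n, LANES):
            for i in range(start, min(start + LANES, n)):
                acc[i] += ws[i]
    return acc
```

With a 256-bit register and 32-bit elements, each vector addition would accumulate 8 sums at once instead of 1, which is where the speed-up over the scalar loop comes from.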
In one possible implementation, the specific number is determined based on a first integer-type value length and the width of the second register, the width of the second register being equal to the width of the first register.
In one possible implementation, the first integer-type value length is based on a first word segmentation result of the first text, and a length of the byte array corresponding to the weighted digit string of each of the at least one word is the first integer-type value length.
In one possible implementation, the first register is initialized according to the first integer-type value length at least before the vectorized accelerated accumulation calculation.
In one possible implementation, the vector addition instruction is based on the first integer-type value length.
In one possible implementation, the second register is initialized according to the first integer-type value length after the vectorized accelerated accumulation calculation and before the vectorized accelerated dimension reduction of the sequence string of the first text.
In one possible implementation, the vectorized accelerated dimension reduction of the sequence string of the first text comprises: performing batch comparison on the first register and the second register through a batch comparison instruction to obtain a comparison result, and storing the comparison result in the first register.
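The comparison step can be sketched as an elementwise compare-against-zero over the accumulated sums. A real implementation would compare one register's worth of elements per batch-comparison instruction; the per-element result is the same, so the sketch below works element by element under that assumption.

```python
# Sketch of the batched comparison step: each element of the accumulated
# sequence string is compared against zero, producing a 0/1 value per
# position. Mapping ties (sum == 0) to 0 is an illustrative convention.
def reduce_dimensions(seq):
    # seq: accumulated signed sums (the sequence string of the first text).
    return [1 if v > 0 else 0 for v in seq]
```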
In one possible implementation, the batch comparison instruction is based on the first integer-type value length.
In a possible implementation, the vectorized accelerated dimension reduction of the sequence string of the first text further comprises: storing the contents of the first register, according to the first integer-type value length, into a byte array of the first integer-type value length, for generating the Simhash signature of the first text.
In a possible implementation, the vectorized accelerated dimension reduction of the sequence string of the first text further comprises: using a left shift and a logical AND operation to combine the byte array of the first integer-type value length into the Simhash signature of the first text.
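Packing the per-position comparison results into a single signature can be sketched with left shifts. The bitwise-OR combination and the bit order below are illustrative assumptions made so the sketch is self-contained; the text itself describes a left shift together with a logical AND operation on the byte array.

```python
# Sketch of packing a 0/1 array into one integer signature via left shifts
# (shift-and-OR combination and bit order are illustrative assumptions).
def pack_signature(bits01):
    sig = 0
    for b in bits01:
        sig = (sig << 1) | (b & 1)   # shift in one signature bit at a time
    return sig
```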
In one possible implementation, the hash algorithm is murmurhash, xxhash, or cityhash.
In one possible implementation, the vector addition instruction is based on a SIMD instruction set.
In one possible implementation, the SIMD instruction set is an AVX instruction set of an x86 architecture microprocessor, an instruction set of a PowerPC architecture microprocessor, an instruction set of a SPARC architecture microprocessor, or an MSA instruction set of a MIPS architecture microprocessor.
In one possible implementation, the first integer-type value length is 1 byte, 2 bytes, 4 bytes, or 8 bytes.
In one possible implementation, the first text is from a web page, paper, or user feedback.
In one possible implementation, the Simhash signature of the first text is used for deduplication, modeling, or statistics.
Referring to fig. 8, fig. 8 is a schematic diagram of a Simhash calculation module provided in an embodiment of the present application. As shown in fig. 8, the Simhash calculation module 800 includes a plurality of units including a word segmentation unit 810, a hash calculation unit 820, a weighted digital string generation unit 830, an accumulation calculation unit 840, and a dimension reduction unit 850. The Simhash computation module 800 receives input text 802 and outputs a corresponding Simhash signature 804. Here, the input text 802 may refer to the first text or the second text or any text among a plurality of texts for text similarity calculation in the above-described method embodiment. The internal operation mechanism of the Simhash computation module 800 of fig. 8 is described below in conjunction with the above-described method embodiments.
The word segmentation unit 810 obtains the input text 802, performs word segmentation on it to obtain a word segmentation result comprising at least one word segment, and identifies a word weight for each of the at least one word segment. When the input text 802 is the first text, the word segmentation unit 810 is configured to segment the first text to obtain a first word segmentation result, where the first word segmentation result comprises at least one first word segment and a word weight of each first word segment. The word segmentation unit 810 passes the word segmentation result of the input text 802 to the hash calculation unit 820, which traverses the at least one word segment and calculates a hash value for each via a hash algorithm. The hash calculation unit 820 passes the calculated hash values to the weighted digital string generating unit 830, which also obtains the word weight of each word segment from the word segmentation unit 810. The weighted digital string generating unit 830 is configured to perform a weighted calculation on the hash value and the word weight of each word segment to obtain a weighted digit string; when the input text 802 is the first text, it generates, for each first word segment, a weighted digit string of that segment according to its hash value and word weight. The weighted digital string generating unit 830 passes the generated weighted digit strings to the accumulation calculation unit 840, which performs vectorized accelerated accumulation calculation on the weighted digit string of each of the at least one word segment to obtain a sequence string of the input text 802.
When the input text 802 is the first text, the accumulation calculation unit 840 is configured to perform vectorized accumulation calculation on the weighted digit strings of the first word segments to obtain the sequence string of the first text. The accumulation calculation unit 840 passes the sequence string of the input text 802 to the dimension reduction unit 850, which performs vectorized accelerated dimension reduction on the sequence string to obtain the Simhash signature 804 of the input text 802. When the input text 802 is the first text, the dimension reduction unit 850 performs vectorized dimension reduction on the sequence string of the first text to obtain the Simhash signature of the first text.
For details of the accumulation calculation unit 840 and the dimension reduction unit 850 in fig. 8, reference may be made to the above method embodiments, including the details in fig. 6 and fig. 7 on batch-loading the hash-weighted results to be accumulated via vector operation instructions, on batch data accumulation, and on batch data comparison, which are not repeated here. The Simhash calculation module 800 shown in fig. 8 achieves vectorization optimization tailored to the characteristics of the data stream produced when the hash-weighted results of all word segments are accumulated bit-wise in the Simhash-based text similarity calculation flow, and to the algorithmic characteristics of that flow, greatly improving calculation speed. Moreover, for the complex and variable operational requirements of application scenarios that require Simhash calculation over massive texts such as web pages, papers, and user feedback, the influence of specific word segmentation strategies, NLP models, and the like on the data stream is reflected by setting the corresponding integer-type value length, so that various application scenarios can be flexibly accommodated and overall performance improved.
It should be appreciated that the Simhash computation module 800 shown in fig. 8 may be considered a system. The embodiment of fig. 8 provides a system including the units of Simhash computation module 800 described above.
Referring to fig. 9, fig. 9 is a schematic diagram of the accumulation calculation unit of the Simhash calculation module of fig. 8 according to an embodiment of the present application. As shown in fig. 9, the accumulation calculation unit includes a first register 902, a second register 904, and an instruction controller 910. As described above, the first register YMM0 and the second register YMM1 mentioned in the embodiments of fig. 6 and 7 are merely illustrative; the register names used in practice may be determined by the specific instruction set employed, and the specific registers used may be designated by the compiler. In the embodiment of fig. 9, the first register 902 may correspond to the first register YMM0 and the second register 904 to the second register YMM1; alternatively, the first register 902 may correspond to the second register YMM1 and the second register 904 to the first register YMM0. The instruction controller 910 is configured to control the first register 902 and the second register 904 according to an instruction set supporting vector operations, such as the AVX instruction set of an x86 architecture microprocessor, so as to implement the operations described above with respect to the first register YMM0 and the second register YMM1.
Referring to fig. 8 and 9, an embodiment of the present application provides a system. The system comprises: a word segmentation unit for segmenting the first text to obtain a first word segmentation result comprising at least one word segment, and identifying a word weight for each of the at least one word segment; a hash calculation unit for calculating a hash value of each word segment; a weighted digital string generating unit for generating, for each word segment, a weighted digital string of the word segment according to the hash value and the word weight of the word segment; an accumulation calculation unit for performing vectorized accumulation calculation on the weighted digital strings of the word segments to obtain a sequence string of the first text; and a dimension reduction unit for performing vectorized dimension reduction on the sequence string of the first text to obtain a first Simhash signature of the first text.
In one possible implementation, the accumulation calculation unit includes a first register and a second register, and the accumulation calculation unit performing vectorized accelerated accumulation calculation to obtain the sequence string of the first text comprises: traversing the weighted digit string of each of the at least one word segment, loading a byte array corresponding to a specific number of weighted digit values into the second register each time, performing batch accumulation on the values stored in the first register and the second register through a vector addition instruction to obtain an accumulation result, and storing the accumulation result back into the first register.
In one possible implementation, the specific number is based on a first integer-type value length and the width of the second register, the width of the second register being equal to the width of the first register.
In one possible implementation, the first integer-type value length is based on a first word segmentation result of the first text, and a length of the byte array corresponding to the weighted digit string of each of the at least one word is the first integer-type value length.
In one possible implementation, the first register is initialized according to the first integer-type value length at least before the vectorized accelerated accumulation calculation.
In one possible implementation, the vector addition instruction is based on the first integer-type value length.
In one possible implementation, the second register is initialized according to the first integer-type value length after the vectorized accelerated accumulation calculation and before the vectorized accelerated dimension reduction of the sequence string of the first text.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computing device according to an embodiment of the present application, where the computing device 1000 includes: one or more processors 1010, a communication interface 1020, and a memory 1030. The processor 1010, the communication interface 1020, and the memory 1030 are interconnected by a bus 1040. Optionally, the computing device 1000 may further include an input/output interface 1050, where the input/output interface 1050 is connected to an input/output device for receiving parameters set by a user, etc. The computing device 1000 can be used to implement some or all of the functionality of the device embodiments or system embodiments described above in the embodiments of the present application; the processor 1010 can also be used to implement some or all of the operational steps of the method embodiments described above in the embodiments of the present application. For example, specific implementations of the computing device 1000 performing various operations may refer to specific details in the above-described embodiments, such as the processor 1010 performing some or all of the steps of the above-described method embodiments or some or all of the operations of the above-described method embodiments. For another example, in the embodiments of the present application, the computing device 1000 may be used to implement some or all of the functions of one or more components of the apparatus embodiments described above, and the communication interface 1020 may be used in particular for communication functions and the like necessary for implementing the functions of these apparatuses, components, and the processor 1010 may be used in particular for processing functions and the like necessary for implementing the functions of these apparatuses, components.
It should be appreciated that the computing device 1000 of fig. 10 may include one or more processors 1010; multiple processors 1010 may cooperatively provide processing power in a parallel, serial, serial-parallel, or arbitrary connection, may constitute a processor sequence or processor array, may be divided into primary and secondary processors, or may have different architectures, such as a heterogeneous computing architecture. In addition, for the computing device 1000 shown in fig. 10, the associated structural and functional descriptions are exemplary and not limiting. In some example embodiments, the computing device 1000 may include more or fewer components than shown in fig. 10, combine certain components, split certain components, or have a different arrangement of components.
The processor 1010 may have various specific implementations; for example, the processor 1010 may include one or more of a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), a tensor processing unit (TPU), or a data processing unit (DPU), which are not limited in this embodiment. The processor 1010 may be a single-core or multi-core processor, and may be formed by a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The processor 1010 may also be implemented solely as a logic device with built-in processing logic, such as an FPGA or a digital signal processor (DSP). The communication interface 1020 may be a wired interface, such as an Ethernet interface or a local interconnect network (LIN) interface, or a wireless interface, such as a cellular network interface or a wireless local area network interface, for communicating with other modules or devices.
The memory 1030 may be a nonvolatile memory, such as a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The memory 1030 may also be a volatile memory, such as a random access memory (RAM) used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synclink DRAM (SLDRAM), and direct rambus RAM (DR RAM). The memory 1030 may also be used to store program code and data, so that the processor 1010 can invoke the program code stored in the memory 1030 to perform some or all of the operational steps in the method embodiments described above, or to perform the corresponding functions in the apparatus embodiments described above. Moreover, the computing device 1000 may contain more or fewer components than shown in fig. 10, or may have a different arrangement of components.
The bus 1040 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, or the like. The bus 1040 may be divided into an address bus, a data bus, a control bus, and so on, and may include a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.
The method and the apparatus provided in the embodiments of the present application are based on the same inventive concept; because the principles by which the method and the apparatus solve the problem are similar, the embodiments, implementations, and examples of the method and the apparatus may refer to each other, and repeated content is not described again. Embodiments of the present application also provide a system that includes a plurality of computing devices, each of which may be structured as described above. The functions or operations implemented by the system may refer to the specific implementation steps in the above method embodiments and/or the specific functions described in the above apparatus embodiments, which are not repeated here.
Embodiments of the present application also provide a computer-readable storage medium having stored therein computer instructions which, when executed on a computing device (e.g., one or more processors), may implement the method steps in the method embodiments described above. The specific implementation of the processor of the computer readable storage medium in executing the above method steps may refer to specific operations described in the above method embodiments and/or specific functions described in the above apparatus embodiments, which are not described herein again.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects, and may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code therein. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, containing one or more collections of available media. Available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tape), optical media, or semiconductor media.
The semiconductor medium may be a solid state disk, or may be a random access memory, flash memory, read only memory, erasable programmable read only memory, electrically erasable programmable read only memory, register, or any other form of suitable storage medium.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. Each flow and/or block of the flowchart and/or block diagrams, and combinations of flows and/or blocks in the flowchart and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the foregoing embodiments, each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. The steps in the method of the embodiments of the application can be adjusted in order, combined, or deleted according to actual needs; the modules in the system of the embodiments of the application can be divided, combined, or deleted according to actual needs. Such modifications and variations of the embodiments of the present application are intended to be included herein, provided they fall within the scope of the claims and their equivalents.

Claims (12)

1. A text similarity calculation method, the method comprising:
performing word segmentation on a first text to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first word segment and a word weight of each first word segment;
for each first word, generating a weighted digital string of the first word according to the hash value and the word weight of the first word; and
carrying out vectorized accumulation calculation on the weighted digital string of each first word segment to obtain a sequence string of the first text, and carrying out vectorized dimension reduction on the sequence string of the first text to obtain a Simhash signature of the first text;
and acquiring a Simhash signature of a second text, and obtaining the similarity between the first text and the second text based on the Simhash signature of the first text and the Simhash signature of the second text.
2. The method of claim 1, wherein the carrying out vectorized accumulation calculation to obtain a sequence string of the first text comprises:
traversing the weighted digital string of each of the at least one word segment, loading a byte array corresponding to a specific number of weighted digit values into a second register each time, performing batch accumulation calculation on the values stored in a first register and the second register through a vector addition instruction to obtain an accumulation calculation result, and then storing the accumulation calculation result in the first register.
3. The method of claim 2, wherein the specific number is determined based on a first integer-type value length and a width of the second register, the width of the second register being equal to the width of the first register.
4. The method of claim 3, wherein the first integer-type value length is based on a first word segmentation result of the first text, and wherein a length of a byte array corresponding to a weighted digit string of each of the at least one word segment is the first integer-type value length.
5. The method of claim 4, wherein the first register is initialized according to the first integer-type value length at least prior to the vectorized accumulation calculation.
6. The method of claim 4, wherein the second register is initialized according to the first integer-type value length after the vectorized accumulation calculation and before the vectorized dimension reduction of the sequence string of the first text.
7. The method of claim 6, wherein the carrying out vectorized dimension reduction on the sequence string of the first text comprises:
and performing batch comparison on the first register and the second register through a batch comparison instruction to obtain a comparison result, and storing the comparison result into the first register.
8. The method of claim 7, wherein the carrying out vectorized dimension reduction on the sequence string of the first text further comprises:
and storing the contents of the first register, according to the first integer-type value length, into a byte array of the first integer-type value length, for generating a first Simhash signature of the first text.
9. The method of claim 8, wherein the carrying out vectorized dimension reduction on the sequence string of the first text further comprises:
using a left shift and a logical AND operation to combine the byte array of the first integer-type value length into a first Simhash signature of the first text.
10. The method of any of claims 2 to 9, wherein the vector addition instruction is based on a SIMD instruction set.
11. The method of claim 10, wherein the SIMD instruction set is an AVX instruction set of an x86 architecture microprocessor, an instruction set of a PowerPC architecture microprocessor, an instruction set of a SPARC architecture microprocessor, or an MSA instruction set of a MIPS architecture microprocessor.
12. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 11 when executing the computer program.
CN202211286844.8A 2022-10-20 2022-10-20 Text similarity calculation method and system Pending CN116204612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211286844.8A CN116204612A (en) 2022-10-20 2022-10-20 Text similarity calculation method and system

Publications (1)

Publication Number Publication Date
CN116204612A (en) 2023-06-02

Family

ID=86506644


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611897A (en) * 2023-07-19 2023-08-18 宜宾叙控科技有限公司 Message reminding method and system based on artificial intelligence
CN116611897B (en) * 2023-07-19 2023-10-13 北京快益通科技有限公司 Message reminding method and system based on artificial intelligence

Similar Documents

Publication Publication Date Title
Hassen et al. Scalable function call graph-based malware classification
Xiang et al. A linguistic steganography based on word indexing compression and candidate selection
US11507601B2 (en) Matching a first collection of strings with a second collection of strings
US8344916B2 (en) System and method for simplifying transmission in parallel computing system
Matsui et al. A survey of product quantization
JP2790466B2 (en) Character string search method and apparatus
Wei et al. Projected residual vector quantization for ANN search
CN110569629A (en) Binary code file tracing method
CN110990058B (en) Software similarity measurement method and device
CN113360803B (en) Data caching method, device, equipment and storage medium based on user behaviors
CN116149669B (en) Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium
CN116204612A (en) Text similarity calculation method and system
CN114741468A (en) Text duplicate removal method, device, equipment and storage medium
Duan et al. Distributed in-memory vocabulary tree for real-time retrieval of big data images
Pibiri On weighted k-mer dictionaries
CN111651695A (en) Method and device for generating and analyzing short link
Wan et al. Hdidx: High-dimensional indexing for efficient approximate nearest neighbor search
Wang et al. File fragment type identification with convolutional neural networks
Girotto et al. FSH: fast spaced seed hashing exploiting adjacent hashes
CN108334888B (en) Compression coding for bit sequences
CN113407693B (en) Text similarity comparison method and device for full-media reading
CN116150185A (en) Data standard extraction method, device, equipment and medium based on artificial intelligence
Fischer et al. Practical evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch tries
Rabea et al. A fast algorithm for constructing suffix arrays for DNA alphabets
CN113032523B (en) Extraction method and device of triple information, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination