WO2024082827A1 - Text similarity measurement method, apparatus, device, storage medium and program product - Google Patents

Text similarity measurement method, apparatus, device, storage medium and program product

Info

Publication number
WO2024082827A1
Authority
WO
WIPO (PCT)
Prior art keywords
string
distance
text string
text
distance matrix
Application number
PCT/CN2023/115292
Inventor
秦洋洋
邓淞
李黠
孙新澳
李铭晖
Original Assignee
抖音视界有限公司
Application filed by 抖音视界有限公司
Publication of WO2024082827A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Definitions

  • the present disclosure relates to the field of computer processing technology, and in particular to a text similarity measurement method, apparatus, device, storage medium and program product.
  • Text similarity measurement methods are used in more and more scenarios, such as understanding the search content in search engines and building indexes of web links; evaluating the duplication, plagiarism, homogeneity, etc. of various information flow articles. Text similarity measurement methods are involved in all of these.
  • the text similarity measurement methods in related technologies all have different degrees of deficiencies for batch processing of massive texts. Some methods are only applicable to the analysis of very small amounts of data; some methods have high computational costs and are difficult to apply to big data and massive text processing; and some methods require full text processing and are difficult to independently calculate the similarity between two texts.
  • the embodiments of the present disclosure provide a text similarity measurement method, device, equipment, storage medium and program product, which realize the dimension reduction calculation of the text similarity method and improve the calculation efficiency.
  • because the sampled string contains information from both text strings, information loss is reduced while the dimension is reduced.
  • an embodiment of the present disclosure provides a method for measuring text similarity, the method comprising:
  • the similarity between the first text string and the second text string is determined based on the first distance matrix and the second distance matrix.
  • an embodiment of the present disclosure provides a text similarity measurement device, the device comprising:
  • a text string acquisition module used to acquire a first text string and a second text string
  • a sampling character string determination module configured to construct a joint probability distribution of the first text character string and the second text character string based on the first text character string and the second text character string, and to sample the joint probability distribution to obtain a sampling character string;
  • a distance matrix calculation module configured to calculate the distance between the first text string and the sampled string to obtain a first distance matrix, and to calculate the distance between the second text string and the sampled string to obtain a second distance matrix;
  • the text similarity measurement module is used to determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • an embodiment of the present disclosure provides an electronic device, the electronic device comprising:
  • one or more processors;
  • a storage device for storing one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the text similarity measurement method as described in any one of the first aspects above.
  • an embodiment of the present disclosure provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the text similarity measurement method as described in any one of the above-mentioned first aspects.
  • an embodiment of the present disclosure provides a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the text similarity measurement method as described in any one of the above-mentioned first aspects is implemented.
  • the embodiment of the present disclosure samples the joint probability distribution of two text strings to obtain a sampled string, calculates the distance matrices between each of the two text strings and the sampled string, and then calculates the similarity of the two distance matrices. This realizes a dimension reduction calculation for the text similarity method and improves the calculation efficiency. In addition, since the sampled string contains the information of both text strings, the information loss is reduced while the dimension is reduced.
  • FIG1 is a flow chart of a text similarity measurement method in an embodiment of the present disclosure
  • FIG2 is a schematic diagram of a joint probability distribution in an embodiment of the present disclosure
  • FIG3a is a schematic diagram of an edit distance initialization in an embodiment of the present disclosure.
  • FIG3b is a schematic diagram of an actual calculation process of an edit distance in an embodiment of the present disclosure.
  • FIG4a is a schematic diagram of an edit distance calculation process in an embodiment of the present disclosure.
  • FIG4b is a schematic diagram of an edit distance calculation process in an embodiment of the present disclosure.
  • FIG5a is a schematic diagram of a complete edit distance calculation process in an embodiment of the present disclosure.
  • FIG5b is a schematic diagram of a complete edit distance calculation process in an embodiment of the present disclosure.
  • FIG6 is a schematic diagram of vector extraction of an edit distance matrix in an embodiment of the present disclosure.
  • FIG. 7 is a flow chart of a text similarity measurement in an embodiment of the present disclosure.
  • FIG8 is a schematic diagram of a vector calculation process in an embodiment of the present disclosure.
  • FIG9 is a schematic diagram of the structure of a text similarity measurement device in an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of the structure of an electronic device in an embodiment of the present disclosure.
  • Word frequency statistics measure similarity mainly by counting the number of identical characters in two texts. For example, the strings "abc" and "abd" share the characters "a" and "b". The more identical characters there are, the more similar the two strings are. Later, to evaluate the impact of character order on similarity in addition to the counts, the string arrangement number was introduced. When the word frequency is the same, the smaller the difference in arrangement number, the higher the similarity.
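  • As a rough illustration of the word frequency idea above, the sketch below counts the characters that two strings share. The function name `shared_char_count` is hypothetical, and a multiset intersection is only one simple way to realize the count; the source does not specify an implementation.

```python
from collections import Counter


def shared_char_count(a: str, b: str) -> int:
    """Number of characters common to both strings, counted with multiplicity."""
    overlap = Counter(a) & Counter(b)  # multiset intersection keeps the smaller count
    return sum(overlap.values())


print(shared_char_count("abc", "abd"))  # 2 ("a" and "b")
```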
  • TF-IDF (Term Frequency-Inverse Document Frequency) is an early word vector encoding method.
  • Later word vector encoding methods include CBOW (Continuous Bag-of-Words) and Skip-Gram (the Continuous Skip-Gram model).
  • CBOW infers the word vector of a specific word from the word vectors of its context words.
  • Skip-Gram works the opposite way: it calculates the context word vectors corresponding to a specific word from the word vector of that word. Both methods capture context semantics to a certain extent, but because they involve many context inferences, their computational cost is relatively large.
  • the main current text similarity technologies include word frequency calculation, dynamic programming solutions, statistical models, word vector encoding, neural network inference, etc., but they all have varying degrees of deficiencies in batch processing of massive texts.
  • the word frequency calculation method is currently only used for the analysis and characterization of very small amounts of data because it is too crude.
  • the edit distance algorithm is difficult to apply to big data and massive text processing due to its high computational cost.
  • the statistical model processes the entire text as its statistical probability source, so it is difficult to independently calculate the similarity between two texts.
  • almost all current word vector encoding technologies require a large amount of text corpus for pre-training, and need to calculate word encodings from the entire document. They cannot independently calculate the similarity between two texts, and the computational cost is also very high.
  • Data dimensionality reduction is to map data in a high-dimensional space to a low-dimensional space while minimizing information loss, thereby improving data processing efficiency.
  • the current data dimensionality reduction technologies mainly focus on factor analysis, autoencoders, topic models, local embedding, etc., but they all require a large amount of text corpus training, and are not effective for the calculation of two independent texts or batch processing of massive texts.
  • the basic algorithm relied on by the text similarity measurement method is the edit distance algorithm proposed by the Russian scientist Vladimir I. Levenshtein in 1965, also known as the Levenshtein Distance (LD).
  • This algorithm is intuitive and easy to explain, and works well for measuring the similarity of strings. After some optimization, it is still widely used today, and dedicated third-party packages exist for it in programming languages such as Python.
  • This algorithm must be solved by dynamic programming or recursion, and its computational cost is large, making it difficult to apply to text big data and long text similarity measurement.
  • The order of its time and space complexity is relatively high: for two strings of lengths m and n, the dynamic programming solution takes on the order of m × n time and space. This has little impact on small data volumes, but for big data and long texts the resources consumed are huge, which makes it difficult to apply this method at scale in current text big data calculations.
  • a text similarity measurement method is provided in an embodiment of the present disclosure.
  • the distance matrices between each of the two text strings and the sampled string are calculated, and the similarity of the two text strings is then calculated from the two distance matrices. This converts a single-stage calculation into a two-stage calculation and improves the calculation efficiency, especially for massive text big data.
  • the text similarity measurement method provided by the embodiment of the present disclosure can be widely used in fields such as precision medicine, quantitative finance, and intelligent speech. It has significant application value for web link analysis, password encoding, genome measurement, speech proofreading, etc.
  • FIG. 1 is a flow chart of a text similarity measurement method in an embodiment of the present disclosure.
  • the method is applicable to the case of calculating the similarity of two texts.
  • the method can be performed by a text similarity measurement device as described below in conjunction with FIG. 9 , and the text similarity measurement device can be implemented in software and/or hardware.
  • the method can also be performed by an electronic device (including a terminal device) as described below in conjunction with FIG. 10 .
  • the text similarity measurement method provided by the embodiment of the present disclosure mainly includes steps S101 - S104 .
  • S101 Obtain a first text string and a second text string.
  • a text string is a form of expression of a written language, which may be a combination of multiple characters, and the characters may include: one or more of English characters, Chinese characters, punctuation marks, Roman characters, Greek characters or other special characters.
  • the first text string and the second text string refer to two text strings whose similarity needs to be measured.
  • a first text string and a second text string may be acquired when a trigger event for text similarity measurement is detected.
  • the trigger event may be an event of receiving a text string input by a user.
  • when the terminal device receives the text string, it can be considered that a trigger event for text similarity measurement is detected.
  • the terminal device may use the received text string input by the user as the first text string, and obtain any text string from the database as the second text string.
  • the trigger event may also be an event of receiving a text similarity measurement instruction.
  • a similarity measurement instruction may be input to a terminal device, and the similarity measurement instruction may be a click instruction, a press instruction, or a voice instruction, etc.
  • when the terminal device receives the similarity measurement instruction, it may be considered that a trigger event for performing text similarity measurement is detected.
  • the terminal device may arbitrarily select two text strings from the database as the first text string and the second text string.
  • the trigger event may also be an event of receiving a target task completion instruction. For example, if a security risk is found in network interface A, and the terminal device receives the program code corresponding to network interface A, it can be considered that a trigger event for text similarity measurement is detected. At this time, the terminal device can use the received program code corresponding to network interface A as the first text string, and obtain the program code corresponding to any other network interface except network interface A as the second text string.
  • S102 construct a joint probability distribution of the first text string and the second text string based on the two, and sample the joint probability distribution to obtain a sampled string.
  • the characters included in the sample character string are all characters included in the first text character string and/or the second text character string.
  • a joint probability distribution of the first text string and the second text string is constructed, and sampling based on the joint probability distribution is the optimal solution for retaining the information of the two strings at the same time.
  • the above joint probability distribution can be a joint probability distribution between a normal distribution and a normal distribution, or a joint probability distribution between a normal distribution and an exponential distribution, which is not specifically limited in the embodiment of the present disclosure.
  • the sampled character string obtained by sampling the data of the joint probability distribution includes information in the first text character string and the second text character string, thereby reducing information loss. In addition, the sampling also reduces data noise.
  • sampling the joint probability distribution to obtain a sampled string includes: randomly sampling the joint probability distribution based on a preset sampling ratio to obtain the sampled string, wherein the preset sampling ratio is inversely proportional to the length of the text string.
  • the preset sampling ratio is the preset proportion of the joint probability distribution that is sampled to form the sampled string.
  • the preset sampling ratio is at most one quarter; that is, at most one quarter of the joint probability distribution is sampled. When the preset sampling ratio is one quarter, the loss of information can be minimized while the dimensionality reduction calculation is performed.
  • the preset sampling ratio is inversely proportional to the length of the text string. In other words, the longer the text string, the smaller the preset sampling ratio, and the better the performance optimization effect on big data. It should be noted that the preset sampling ratio can be as small as possible as long as it does not cause the center position of the joint probability distribution to shift, so that the dimensionality reduction can be maximized.
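  • The sampling step above can be sketched as follows. This is only one possible reading under stated assumptions, not the method of the disclosure: the joint distribution is approximated by the pooled characters of both strings, and the names `sample_string`, `max_ratio`, and `budget` are hypothetical. The `budget` cap makes the effective sampling ratio fall as the pooled length grows, while `max_ratio` keeps it at or below one quarter.

```python
import random
from typing import Optional


def sample_string(a: str, b: str, max_ratio: float = 0.25,
                  budget: int = 8, seed: Optional[int] = None) -> str:
    """Draw a short sampled string from the pooled characters of a and b.

    Assumption: the joint distribution is approximated by the empirical
    pool a + b. `budget` is a hypothetical cap on the sample length, so the
    effective ratio k / n shrinks as the pooled length n grows.
    """
    pool = a + b
    n = len(pool)
    if n == 0:
        return ""
    # At most one quarter of the pool (for n >= 4), and at least one character.
    k = max(1, min(int(max_ratio * n), budget))
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(n), k))  # keep the original character order
    return "".join(pool[i] for i in positions)


sampled = sample_string("/tts_sync", "tts/sync/", seed=0)
print(len(sampled))  # 4 characters drawn from a pool of 18
```

  • Every character of the result comes from one of the two inputs, which matches the requirement that the sampled string only contains characters of the first and/or second text string.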
  • S103 Calculate the distance between the first text string and the sample string to obtain a first distance matrix, and calculate the distance between the second text string and the sample string to obtain a second distance matrix.
  • the first distance matrix and the second distance matrix may be any distance matrix for measuring text similarity. It should be noted that the distance matrices obtained by using different similarity measurement strategies use different methods to generate distance representation vectors.
  • an edit distance calculation algorithm is used to calculate a first distance matrix and a second distance matrix, wherein the first distance matrix is used to represent the edit distance from the first text string to the sample string; and the second distance matrix is used to represent the edit distance from the second text string to the sample string.
  • the most representative dynamic programming approach is the edit distance, also known as the Levenshtein distance, which measures how many operations are needed to transform one string into another.
  • the operations here include deleting, inserting, and replacing characters. The fewer operations required, the higher the similarity.
  • the edit distance is the minimum number of such operations required to transform one string into the other.
  • the transformations are mainly deletion, insertion and substitution operations. If i and j represent the i-th position in string a and the j-th position in string b respectively, the edit distance between the first i characters of string a and the first j characters of string b is lev_{a,b}(i, j). It can be expressed as:
  • in the "otherwise" branch of the formula, the first line indicates a deletion operation, the second line indicates an insertion operation, and the third line indicates a replacement operation.
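  • The formula itself is not present in this text. The standard Levenshtein recurrence, which is consistent with the three "otherwise" cases described above, is reproduced here as an assumption; the indicator term equals 1 when the characters a_i and b_j differ and 0 when they are the same.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,\,j) + 1\\
\operatorname{lev}_{a,b}(i,\,j-1) + 1\\
\operatorname{lev}_{a,b}(i-1,\,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
```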
  • taking the calculation of the similarity between the strings "/tts_sync" and "tts/sync/" by the edit distance algorithm as an example, the calculation process of the edit distance and its existing problems are described in detail through Figures 3a, 3b, 4a, 4b, 5a, and 5b.
  • after initialization, the actual calculation of the edit distance begins, that is, the "otherwise" part of the above edit distance calculation formula.
  • the value in the 3rd row and 3rd column is the minimum of the following three values.
  • the first value: the value of the 2nd row and 3rd column + 1, which is equal to 2; the second value: the value of the 3rd row and 2nd column + 1, which is equal to 2; the third value: because "/" and "t" are different, the value is the value of the 2nd row and 2nd column + 1, which is equal to 1.
  • the minimum of the above three values is 1. That is, the value of the 3rd row and 3rd column is 1.
  • Figures 4a and 4b further show the calculation process of the edit distance.
  • the value of the 6th row and 3rd column is 3. Because both characters are "/", the value of the 6th row and 3rd column is the number of the 5th row and 2nd column on the diagonal. At this time, compared with the value of the 5th row and 3rd column + 1 and the value of the 6th row and 2nd column + 1, the smallest value is 3.
  • the value of the 4th row and 5th column is 1, which is less than the value of the 3rd row and 5th column 2, which further verifies the problem solved by the edit distance algorithm, that is, the (i, j) position measures the distance between the first i characters of one string and the first j characters of the other string.
  • the number of edits from "/tt" to "tt” is significantly less than the number of changes from "/tt" to "t”.
  • Figure 5a shows the intermediate process of the deduction
  • Figure 5b shows the entire calculation process, where the bold numbers are the change paths with the minimum distance.
  • the final calculated edit distance between the strings "/tts_sync" and "tts/sync/" is 3, which means that 3 operations are required: the "/" at the start position needs to be deleted, the "_" needs to be replaced with "/", and a "/" needs to be inserted at the end position.
  • the edit distance algorithm can be solved by two methods, namely recursive solution and dynamic programming solution. Any of the above solution methods can be used to calculate the distance from the first text string to the sample string to obtain a first distance matrix, and to calculate the distance from the second text string to the sample string to obtain a second distance matrix.
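  • A minimal sketch of the dynamic programming solution follows. It builds the full distance matrix, since later steps use the matrix rather than only the final distance. The function name `edit_distance_matrix` is an assumption for illustration; the rows correspond to prefixes of the first argument and the columns to prefixes of the second.

```python
from typing import List


def edit_distance_matrix(a: str, b: str) -> List[List[int]]:
    """Return the (len(a)+1) x (len(b)+1) Levenshtein distance matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    # Initialization: turning a prefix into the empty string costs its length.
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (free if characters match)
            )
    return d


matrix = edit_distance_matrix("/tts_sync", "tts/sync/")
print(matrix[-1][-1])  # 3
```

  • The bottom-right cell holds the edit distance of the two full strings, which is 3 for the example strings used in the text.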
  • S104 Determine similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • the similarity between the first distance matrix and the second distance matrix is calculated, and the similarity between the two distance matrices is used as the similarity between the first text string and the second text string.
  • the similarity between the two matrices is calculated using Euclidean distance or cosine similarity. It should be noted that other matrix similarity calculation methods may also be used to calculate the similarity between the first distance matrix and the second distance matrix, which is not specifically limited in this embodiment.
  • an eigenvector calculation method is used to calculate an eigenvector of the first distance matrix, recorded as a first eigenvector, and an eigenvector of the second distance matrix, recorded as a second eigenvector; the similarity between the two eigenvectors is calculated as the similarity between the first text string and the second text string.
  • the eigenvector calculation can be implemented by matrix decomposition. This is not specifically limited in this embodiment.
  • determining the similarity of a first text string and a second text string based on a first distance matrix and a second distance matrix includes: performing feature extraction on the first distance matrix to obtain a first distance representation vector, the first distance representation vector includes a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the first distance matrix; performing feature extraction on the second distance matrix to obtain a second distance representation vector, the second distance representation vector includes a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the second distance matrix; and calculating the vector similarity between the first distance representation vector and the second distance representation vector as the similarity between the first text string and the second text string.
  • the first distance representation vector can be understood as a vector that can be extracted from the first distance matrix and represents the first distance matrix for matrix similarity calculation.
  • the second distance representation vector can be understood as a vector that can be extracted from the second distance matrix and represents the second distance matrix for matrix similarity calculation.
  • feature extraction includes: extracting the last row sequence in the distance matrix as a row representation vector; extracting the last column sequence in the distance matrix as a column representation vector; extracting the diagonal sequence in the distance matrix as a diagonal representation vector.
  • the text similarity measurement method provided by the embodiment of the present disclosure is based on the design concept of edit distance.
  • each distance value further to the right and bottom of the distance matrix depends on the distance values before it.
  • the last row, the last column and the diagonal sequence change in the distance matrix can represent the similarity information of the two text strings. Taking the text strings "/tts_sync" and "tts/sync/" as examples, the shortest path on the diagonal of the distance matrix, the last column in the distance matrix, and the distance change of the last row in the distance matrix can be extracted to form a matrix representation vector.
  • the shortest path on the diagonal of the distance matrix (1, 1, 1, 1, 2, 2, 2, 2, 3), the last column in the distance matrix (9, 8, 7, 6, 6, 5, 4, 3, 2, 3), and the last row in the distance matrix (9, 8, 8, 7, 7, 6, 5, 4, 3) are extracted to obtain the distance representation vector of the distance matrix on the left side of Figure 6.
  • the extraction of the distance representation vector of the distance matrix directly converts the similarity calculation of the text into the numerical calculation of the vector, potentially completing the encoding of the text string to the word vector.
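  • The extraction step can be sketched as below. The function name `extract_representation` is hypothetical. As a simplifying assumption, the diagonal vector here is the main diagonal of the matrix, whereas the example above extracts the shortest path along the diagonal; the order column, diagonal, row follows the order in which the text lists the three vectors.

```python
from typing import List


def extract_representation(matrix: List[List[int]]) -> List[int]:
    """Concatenate last column, main diagonal, and last row of a distance matrix."""
    last_column = [row[-1] for row in matrix]
    # Simplification: main diagonal, truncated for non-square matrices.
    size = min(len(matrix), len(matrix[0]))
    diagonal = [matrix[i][i] for i in range(size)]
    last_row = list(matrix[-1])
    return last_column + diagonal + last_row


# Distance matrix of "ab" against "ab", written out by hand.
example = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]
print(extract_representation(example))  # [2, 1, 0, 0, 0, 0, 2, 1, 0]
```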
  • distance representation vectors are extracted from two distance matrices respectively, and the similarity between the two distance representation vectors is calculated as the similarity between the first text string and the second text string. Since the method for calculating vector similarity is less computationally intensive than the method for calculating matrix similarity, the technical solution provided by the embodiment of the present disclosure can further reduce the amount of computation and improve computational efficiency.
  • the Euclidean distance or cosine similarity is used to determine the vector similarity.
  • the vector similarity is the Euclidean distance between the first distance representation vector and the second distance representation vector, or the cosine of the angle between the first distance representation vector and the second distance representation vector.
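  • Both measures named above are short to write out. The sketch below assumes the two representation vectors have the same length, which the text does not state; how vectors of different lengths would be aligned is left open. The function names are chosen for illustration.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance; smaller values mean more similar vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; values near 1 mean more similar."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

  • Note that the two measures point in opposite directions: a smaller Euclidean distance and a larger cosine value both indicate higher similarity.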
  • the embodiments of the present disclosure provide a method, apparatus, device, storage medium and program product for measuring text similarity.
  • the method includes: obtaining a first text string and a second text string; constructing a joint probability distribution of the first text string and the second text string based on the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string; calculating the distance from the first text string to the sampled string to obtain a first distance matrix, and calculating the distance from the second text string to the sampled string to obtain a second distance matrix; determining the similarity of the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • by sampling the joint probability distribution of the two text strings to obtain a sampled string, calculating the distance matrices between each of the two text strings and the sampled string, and then calculating the similarity of the two distance matrices, the embodiments of the present disclosure realize a dimension reduction calculation for the text similarity method and improve the calculation efficiency.
  • because the sampled string contains the information of both text strings, the information loss is reduced while the dimension is reduced.
  • a schematic diagram of the architecture of text similarity measurement is provided, as shown in FIG7. First, a joint probability distribution is constructed for the two text strings to be compared, text string a and text string b. The joint probability distribution is then downsampled to obtain a sampled string. Next, the edit distance between text string a and the sampled string and the edit distance between text string b and the sampled string are calculated; each calculation generates a distance matrix like the one shown in FIG5b. A distance representation vector is extracted from each distance matrix. Finally, a numerical operation is performed on the distance representation vectors to obtain the similarity of the two distance representation vectors, that is, the similarity of text string a and text string b.
  • FIG8 is a schematic diagram of a vector calculation process in an embodiment of the present disclosure.
  • Figure 8 shows the process of calculating the edit distance between the text string a and the text string b and the sampled string.
  • a distance matrix with the sampled string is calculated for the first text string a and for the second text string b, the row, column and diagonal vectors of each matrix are extracted as its distance representation vector, and the numerical similarity of the two distance representation vectors is then calculated to obtain the similarity between the first text string a and the second text string b.
  • the two processes of calculating the distance matrix between the first text string a and the second text string b and the sampled string can be executed in parallel, further improving the batch processing performance for big data.
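  • The parallel execution mentioned above might look like the following sketch. The names `edit_distance_matrix` and `distance_matrices_parallel` are hypothetical, and the sampled string is passed in as a given input. A thread pool is used only to keep the example self-contained; for CPU-bound work in CPython, a process pool or a distributed engine would usually be needed to obtain a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Matrix = List[List[int]]


def edit_distance_matrix(a: str, b: str) -> Matrix:
    """Levenshtein distance matrix of a (rows) against b (columns)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d


def distance_matrices_parallel(text_a: str, text_b: str,
                               sampled: str) -> Tuple[Matrix, Matrix]:
    """Compute both distance matrices against the sampled string concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two tasks share no state, so they can run independently.
        future_a = pool.submit(edit_distance_matrix, text_a, sampled)
        future_b = pool.submit(edit_distance_matrix, text_b, sampled)
        return future_a.result(), future_b.result()


matrix_a, matrix_b = distance_matrices_parallel("abc", "abd", "ab")
print(matrix_a[-1][-1], matrix_b[-1][-1])  # 1 1
```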
  • the text similarity measurement method proposed in the present disclosure splits a single-stage high-dimensional calculation process into a two-stage low-dimensional calculation process, effectively improving the similarity measurement efficiency for massive texts. Its main technical effects are as follows. It makes batch processing of massive text similarity possible: dimensionality reduction of the data is completed through distribution sampling and representation vector extraction, which improves calculation efficiency while preserving as much information as possible. Compared with the traditional edit distance calculation method, the complexity is reduced by at least 70%, and the longer the text, the more significant the performance improvement. For text big data, the performance is expected to improve by more than one order of magnitude. Similarity measurement can be performed on any two strings or character sequences, and word vectors between the two specific strings being compared are generated during the calculation, without additional requirements for text segmentation or semantics.
  • a large amount of corpus is not required for supervised or unsupervised pre-training, and there is no additional calculation overhead except for the actual batch processing process. Thanks to the downsampling process of text distribution, the impact of data noise on similarity calculation is also reduced to a certain extent.
  • the text similarity measurement method proposed in the present invention is a universal big data processing technology that can be migrated between different application scenarios at almost zero cost and can be directly deployed to big data engines such as MapReduce, Spark, and Storm through code.
  • an application scenario of a text similarity measurement method further includes: obtaining request response data transmitted by a first network interface as a first text string; obtaining request response data transmitted by a second network interface as a second text string; and determining whether the first network interface and the second network interface have similar security risks based on the similarity between the first text string and the second text string.
  • the text similarity measurement method provided by the embodiment of the present disclosure can be applied to many scenarios of text similarity measurement under big data.
  • the text similarity measurement method provided by the embodiment of the present disclosure can be directly used to measure the similarity of data returned by different network interfaces, so as to discover more similar security risks.
  • Specifically, the first network interface is a network interface in which a security risk has been discovered, and the second network interface is any network interface other than the first network interface.
  • For example, after a security risk is discovered in a first network interface A, the request response data of interface A is obtained as the first text string; any other network interface is taken as a second network interface B, and the request response data of interface B is obtained as the second text string. The similarity between the first text string and the second text string is then calculated using the text similarity measurement method provided in the above embodiment. If the similarity exceeds a set value, the second network interface B has request and return parameters similar to those of the first network interface A, and interface B is likely to carry a similar security risk.
  • The text similarity measurement method provided in the embodiment of the present disclosure can complete this measurement task quickly.
  • The set value can be chosen according to actual conditions, for example 80% or 90%.
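The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_similar_interfaces`, the parameter `set_value`, the interface names, and the response payloads are hypothetical, and the `similarity` callable is a placeholder for the text similarity measurement method of the disclosure (a stub with fixed scores is used here so the example is self-contained).

```python
from typing import Callable, Dict, List


def flag_similar_interfaces(
    risky_response: str,
    other_responses: Dict[str, str],
    similarity: Callable[[str, str], float],
    set_value: float = 0.8,
) -> List[str]:
    """Return the names of interfaces whose response similarity exceeds set_value."""
    return [
        name
        for name, response in other_responses.items()
        if similarity(risky_response, response) > set_value
    ]


# Hypothetical data; the stub below stands in for the disclosed similarity measure.
stub_scores = {'{"uid": 2, "token": "y"}': 0.92, "OK": 0.10}
flagged = flag_similar_interfaces(
    '{"uid": 1, "token": "x"}',
    {"interface_B": '{"uid": 2, "token": "y"}', "interface_C": "OK"},
    similarity=lambda first, second: stub_scores[second],
)
print(flagged)
```

The comparison is strict because the text speaks of the similarity exceeding the set value; whether a score equal to the set value should also be flagged is left open by the disclosure.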
  • an application scenario of a text similarity measurement method is provided. Specifically, any test result obtained by white-box security testing is obtained as a first text string; Uniform Resource Locator (URL) interface information is obtained as a second text string; and based on the similarity between the first text string and the second text string, the URL interface information corresponding to the test result is determined.
  • Sources of risk findings include automated scanning tools, namely white-box security testing, which identifies potential vulnerability risks based on the data-flow dependencies of the application source code, and black-box security testing, which imitates hackers of various skill levels by initiating data requests from the outside, as well as manually discovered security vulnerabilities. To maximize the efficiency of collaboration among automated tools, and between automated tools and humans, the test results of each dimension need to be linked at least at the URL interface dimension.
  • white-box security testing is based on the source code of the application repository, and it is difficult to directly obtain the interface information of the URL dimension.
  • The test result above refers to any one item of information scanned by white-box security testing, such as a repository, route, file, or source code, and is used as the first text string. URL-dimension information such as the path, request body, and response body is obtained as the second text string. The similarity between the first text string and the second text string is calculated using the text similarity measurement method provided in the above embodiment. If the similarity exceeds the set value, the URL-dimension information serving as the second text string is considered to correspond to the test result, and a similarity association between the two is established.
  • Fig. 9 is a schematic diagram of the structure of a text similarity measurement device in an embodiment of the present disclosure.
  • the text similarity measurement device can be applied to the case of calculating the similarity between two texts.
  • The text similarity measurement device 90 mainly includes: a text string acquisition module 91, a sampled string determination module 92, a distance matrix calculation module 93, and a text similarity measurement module 94.
  • The text string acquisition module 91 is used to acquire a first text string and a second text string; the sampled string determination module 92 is used to construct a joint probability distribution of the first text string and the second text string based on the two strings, and to sample the joint probability distribution to obtain a sampled string; the distance matrix calculation module 93 is used to calculate the distance from the first text string to the sampled string to obtain a first distance matrix, and to calculate the distance from the second text string to the sampled string to obtain a second distance matrix; the text similarity measurement module 94 is used to determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • An edit distance algorithm is used to calculate the first distance matrix and the second distance matrix, wherein the first distance matrix represents the edit distance from the first text string to the sampled string, and the second distance matrix represents the edit distance from the second text string to the sampled string.
  • The text similarity measurement module 94 includes: a first distance representation vector extraction unit, which is used to perform feature extraction on the first distance matrix to obtain a first distance representation vector, the first distance representation vector including a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the first distance matrix; a second distance representation vector extraction unit, which is used to perform feature extraction on the second distance matrix to obtain a second distance representation vector, the second distance representation vector including a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the second distance matrix; and a text similarity calculation unit, which is used to calculate the vector similarity between the first distance representation vector and the second distance representation vector as the similarity between the first text string and the second text string.
  • Feature extraction includes: extracting the last row sequence in the distance matrix as a row representation vector; extracting the last column sequence in the distance matrix as a column representation vector; and extracting the diagonal sequence in the distance matrix as a diagonal representation vector.
  • The sampled string determination module 92 is specifically configured to randomly downsample the joint probability distribution based on a preset sampling ratio to obtain the sampled string, where the preset sampling ratio is inversely proportional to the length of the text string.
  • Euclidean distance or cosine similarity is used to determine vector similarity.
  • the device also includes: a first text string determination module, used to obtain request response data transmitted by the first network interface as a first text string; a second text string determination module, used to obtain request response data transmitted by the second network interface as a second text string; and a security risk similarity determination module, used to determine whether the first network interface and the second network interface have similar security risks based on the similarity between the first text string and the second text string.
  • The first text string determination module is also used to obtain a test result produced by white-box security testing as the first text string; the second text string determination module is also used to obtain Uniform Resource Locator (URL) interface information as the second text string; and a correspondence determination module is used to determine the URL interface information corresponding to the test result based on the similarity between the first text string and the second text string.
  • the modules included are only divided according to functional logic, but are not limited to the above-mentioned division.
  • the specific names of the functional modules are only for the convenience of distinguishing each other and are not used to limit the protection scope of the present disclosure.
  • the text similarity measurement device 90 can be implemented by software and/or hardware.
  • the various modules of the text similarity measurement device 90 can be implemented as software components executed on one or more general-purpose processors, or can be implemented as hardware such as a programmable logic device and/or an application-specific integrated circuit for performing certain functions.
  • these modules can be embodied in the form of a software product, which can be stored in a non-volatile storage medium.
  • The software product includes instructions that cause a computer device (such as a personal computer, server, network device, or mobile terminal) to execute the methods described in the embodiments of the present disclosure.
  • the above modules can also be implemented on a single device or distributed on multiple devices. The functions of these modules can be merged with each other or further split into multiple sub-modules.
  • the text similarity measurement device provided in the embodiment of the present disclosure can execute the steps executed in the text similarity measurement method provided in the method embodiment of the present disclosure, and the specific execution steps and beneficial effects are not repeated here.
  • FIG10 is a schematic diagram of the structure of an electronic device in an embodiment of the present disclosure.
  • the electronic device 1000 in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), wearable terminal devices, etc., and fixed terminals such as digital televisions (TVs), desktop computers, smart home devices, etc.
  • the electronic device 1000 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 1001, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 to a random access memory (RAM) 1003 to implement a text similarity measurement method as in an embodiment of the present disclosure.
  • Various programs and data required for the operation of the electronic device 1000 are also stored in the RAM 1003.
  • the processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004.
  • An input/output (I/O) interface 1005 is also connected to the bus 1004.
  • The following devices can be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 1007 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 1008 including, for example, a magnetic tape and hard disk; and communication devices 1009.
  • the communication device 1009 can allow the electronic device 1000 to communicate with other devices wirelessly or wired to exchange data.
  • Although FIG. 10 shows an electronic device 1000 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may be implemented or provided instead.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program contains a program code for executing the method shown in the flowchart, thereby implementing the text similarity measurement method as described above.
  • The computer program can be downloaded and installed from a network through the communication device 1009, installed from the storage device 1008, or installed from the ROM 1002.
  • When the computer program is executed by the processing device 1001, the above-mentioned functions defined in the method of the embodiment of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • A computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries a computer-readable program code. This propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.
  • the client and the server can communicate using any known or future developed network protocol such as HyperText Transfer Protocol (HTTP), and can be interconnected with any form or medium of digital data communication (e.g., a communication network).
  • Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internet (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any known or future developed network.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being installed in the electronic device.
  • The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains a first text string and a second text string; constructs a joint probability distribution of the first text string and the second text string based on the two strings, and samples the joint probability distribution to obtain a sampled string; calculates the distance from the first text string to the sampled string to obtain a first distance matrix, and calculates the distance from the second text string to the sampled string to obtain a second distance matrix; and determines the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • the electronic device may also execute other steps described in the above embodiments.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including, but not limited to, object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • The remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • each square box in the flow chart or block diagram can represent a module, a program segment or a part of a code, and the module, the program segment or a part of the code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the square box can also occur in a sequence different from that marked in the accompanying drawings. For example, two square boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each square box in the block diagram and/or flow chart, and the combination of the square boxes in the block diagram and/or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or hardware, wherein the name of a unit does not, in some cases, constitute a limitation on the unit itself.
  • Hardware implementations may use, for example, a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Part (ASSP), a System on Chip (SOC), or a Complex Programmable Logic Device (CPLD).
  • the present disclosure provides a method for measuring text similarity, including: obtaining a first text string and a second text string; constructing a joint probability distribution of the two based on the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string; calculating the distance from the first text string to the sampled string to obtain a first distance matrix, and calculating the distance from the second text string to the sampled string to obtain a second distance matrix; determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • The present disclosure provides a text similarity measurement device, including: a text string acquisition module, used to acquire a first text string and a second text string; a sampled string determination module, used to construct a joint probability distribution of the first text string and the second text string based on the two strings, and to sample the joint probability distribution to obtain a sampled string; a distance matrix calculation module, used to calculate the distance from the first text string to the sampled string to obtain a first distance matrix, and to calculate the distance from the second text string to the sampled string to obtain a second distance matrix; and a text similarity measurement module, used to determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  • The present disclosure provides an electronic device, including:
  • one or more processors; and
  • a memory for storing one or more programs,
  • wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement any of the text similarity measurement methods provided in the present disclosure.
  • the present disclosure provides a computer-readable storage medium having a computer program stored thereon, and when the program is executed by a processor, the program implements any of the text similarity measurement methods provided by the present disclosure.
  • The embodiments of the present disclosure also provide a computer program product, which includes a computer program or instructions that, when executed by a processor, implement the text similarity measurement method as described above.
  • the embodiments of the present disclosure also provide a computer program/instruction, which, when executed by a processor, implements the text similarity measurement method as described in any one of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

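The present disclosure relates to a text similarity measurement method, device, equipment, storage medium, and program product. The method includes: obtaining a first text string and a second text string; constructing a joint probability distribution of the two based on the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string; calculating the distance from the first text string to the sampled string to obtain a first distance matrix, and calculating the distance from the second text string to the sampled string to obtain a second distance matrix; and determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.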

Description

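Text Similarity Measurement Method, Device, Equipment, Storage Medium, and Program Product
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202211274116.5, filed on October 18, 2022, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to the field of computer processing technology, and in particular to a text similarity measurement method, device, equipment, storage medium, and program product.
Background
With the development of emerging technologies such as the Internet, the Internet of Things, artificial intelligence, and big data, massive amounts of text data continue to emerge in every field. Text similarity measurement is applied in more and more scenarios, for example understanding search content and indexing web links in search engines, and assessing information-feed articles for duplication, plagiarism, and homogenization.
Text similarity measurement methods in the related art all fall short, to varying degrees, when batch-processing massive text. Some methods are suitable only for analyzing very small amounts of data; some are computationally expensive and hard to apply to big data and massive text processing; and others must process the full corpus, which makes it hard to calculate the similarity between two texts independently.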
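Summary
To solve at least some of the above technical problems, embodiments of the present disclosure provide a text similarity measurement method, device, equipment, storage medium, and program product that reduce the dimensionality of the text similarity calculation and improve calculation efficiency. In addition, because the sampled string contains information from both text strings, information loss is reduced while the dimensionality is reduced.
In a first aspect, an embodiment of the present disclosure provides a text similarity measurement method, the method including:
obtaining a first text string and a second text string;
constructing a joint probability distribution of the two based on the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string;
calculating the distance from the first text string to the sampled string to obtain a first distance matrix, and calculating the distance from the second text string to the sampled string to obtain a second distance matrix; and
determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
In a second aspect, an embodiment of the present disclosure provides a text similarity measurement device, the device including:
a text string acquisition module, used to acquire a first text string and a second text string;
a sampled string determination module, used to construct a joint probability distribution of the two based on the first text string and the second text string, and to sample the joint probability distribution to obtain a sampled string;
a distance matrix calculation module, used to calculate the distance from the first text string to the sampled string to obtain a first distance matrix, and to calculate the distance from the second text string to the sampled string to obtain a second distance matrix; and
a text similarity measurement module, used to determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
In a third aspect, an embodiment of the present disclosure provides an electronic device, the electronic device including:
one or more processors; and
a storage device for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the text similarity measurement method according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the text similarity measurement method according to any implementation of the first aspect.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program or instructions that, when executed by a processor, implement the text similarity measurement method according to any implementation of the first aspect.
The embodiments of the present disclosure sample the joint probability distribution of the two text strings to obtain a sampled string, calculate the distance matrix between each text string and the sampled string, and then calculate the similarity of the two distance matrices. This reduces the dimensionality of the text similarity calculation and improves calculation efficiency. In addition, because the sampled string contains information from both text strings, information loss is reduced while the dimensionality is reduced.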
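Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flowchart of a text similarity measurement method in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a joint probability distribution in an embodiment of the present disclosure;
FIG. 3a is a schematic diagram of edit distance initialization in an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of the actual edit distance calculation process in an embodiment of the present disclosure;
FIG. 4a is a schematic diagram of an edit distance derivation process in an embodiment of the present disclosure;
FIG. 4b is a schematic diagram of an edit distance derivation process in an embodiment of the present disclosure;
FIG. 5a is a schematic diagram of a complete edit distance derivation process in an embodiment of the present disclosure;
FIG. 5b is a schematic diagram of a complete edit distance derivation process in an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of vector extraction from an edit distance matrix in an embodiment of the present disclosure;
FIG. 7 is a schematic flowchart of text similarity measurement in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a vector calculation process in an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a text similarity measurement device in an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure.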
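Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps recited in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit some of the steps shown. The scope of the present disclosure is not limited in this respect.
The term "include" and its variants as used herein are open-ended, that is, "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different devices, modules, or units, and are not used to limit the order of, or interdependence between, the functions performed by these devices, modules, or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Before the embodiments of the present disclosure are described in further detail, the nouns and terms involved in the embodiments are explained; the following explanations apply to them.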
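Many measurement methods have been proposed to address the various problems of text similarity measurement, evolving successively through word-frequency calculation, programming-based (optimization) solutions, statistical models, word-vector encoding, and neural-network inference.
The earliest text similarity method was word-frequency counting, which measures similarity mainly by counting the characters that two texts share. For example, the strings "abc" and "abd" share the characters "ab"; the more characters two strings share, the more similar they are. Later, to evaluate the effect of character order on similarity in addition to the counts, the position index of characters was introduced: for equal word frequency, the smaller the difference in position index, the higher the similarity.
An example of a statistical model is the Term Frequency-Inverse Document Frequency (TF-IDF) model, which consists of a TF part and an IDF part. It counts the proportion in which a character or word occurs in a single sentence or document, then counts the proportion in which that character or word occurs across all sentences or documents, and obtains a probability value for the character or word. This yields a vector of probability values for each sentence or document, and the vector similarity between two such vectors is calculated as the similarity of the two documents.
TF-IDF is an early form of word-vector encoding, but statistics alone can hardly capture semantic similarity. With the development of neural networks, many neural word-vector encoding techniques have emerged, the most commonly used being the Continuous Bag-of-Words (CBOW) model and the continuous skip-gram (Skip-Gram) model. CBOW infers the word vector of a specific word from the word vectors of its context words; Skip-Gram does the opposite, calculating the context word vectors from the word vector of a specific word. Both can capture contextual semantics to some extent, but because they involve a large amount of contextual inference, they are computationally expensive.
The main current text similarity techniques are word-frequency calculation, programming-based solutions, statistical models, word-vector encoding, and neural-network inference, but all of them fall short, to varying degrees, when batch-processing massive text.
Word-frequency calculation is too coarse and is currently used only to characterize very small amounts of data. The edit distance algorithm is computationally expensive and hard to apply to big data and massive text processing. Statistical models derive their probabilities from the full corpus, so they can hardly calculate the similarity between two texts independently. In addition, almost all current word-vector encoding techniques require pre-training on a large text corpus and calculate word encodings from the full set of documents; they cannot calculate the similarity between two texts independently, and their computational cost is also very high.
Many data preprocessing methods have also appeared to cope with massive data, for example dimensionality reduction. Dimensionality reduction maps data from a high-dimensional space to a low-dimensional space while minimizing information loss, which improves data processing efficiency.
Existing dimensionality reduction techniques are mainly factor analysis, autoencoders, topic models, local embedding, and the like, but all require training on a large text corpus and perform poorly both for calculating the similarity of two independent texts and for batch-processing massive text.
In the embodiments of the present disclosure, the basic algorithm on which the text similarity measurement method relies is the edit distance algorithm proposed in 1965 by the Russian scientist Vladimir I. Levenshtein, also called the Levenshtein distance (LD). Because it is intuitive, easy to explain, and effective at measuring string similarity, the algorithm has undergone some optimization and is still very widely used; programming languages such as Python even have dedicated third-party packages for it. The algorithm must be solved by dynamic programming or recursion and is computationally expensive, which makes it hard to apply to text big data and long-text similarity measurement.
The edit distance algorithm has a relatively high polynomial order (the dynamic-programming solution takes time and space proportional to the product of the two string lengths). Its time and space complexity matter little for small data, but for big data and long text the resources consumed are enormous, which is why the method is hard to apply widely in today's text big data calculations.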
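To solve the above technical problems, embodiments of the present disclosure provide a text similarity measurement method: a sampled string is selected, the distance matrices between each of the two text strings and the sampled string are calculated, and the similarity of the two text strings is then calculated from the two distance matrices. Turning the single-stage calculation into a two-stage calculation improves calculation efficiency, and the effect is especially pronounced on massive text big data.
In terms of application value, the text similarity measurement method provided by the embodiments of the present disclosure can be widely applied in fields such as precision medicine, quantitative finance, and intelligent speech. It has significant application value for web link analysis, cryptographic coding, genome measurement, speech proofreading, and the like.
The text similarity measurement method provided in the embodiments of the present disclosure is described in detail below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a text similarity measurement method in an embodiment of the present disclosure. The method is applicable to calculating the similarity between two texts. It can be executed by the text similarity measurement device described below with reference to FIG. 9, which can be implemented in software and/or hardware. It can also be executed by the electronic device (including a terminal device) described below with reference to FIG. 10.
As shown in FIG. 1, the text similarity measurement method provided by the embodiment of the present disclosure mainly includes steps S101 to S104.
S101. Obtain a first text string and a second text string.
In the embodiments of the present disclosure, a text string is a form of written language and can be a combination of multiple characters, which may include one or more of: English characters, Chinese characters, punctuation marks, Roman characters, Greek characters, or other special characters. The first text string and the second text string are the two text strings whose similarity is to be measured.
In some embodiments of the present disclosure, the first text string and the second text string may be obtained after a trigger event for text similarity measurement is detected.
In one implementation of the present disclosure, the trigger event may be the receipt of a text string entered by a user. For example, when a user enters a text string in a browser or an intelligent question-answering system and the terminal device receives it, a trigger event for text similarity measurement can be considered detected. The terminal device can then use the received user input as the first text string and obtain any text string from a database as the second text string.
In one implementation of the present disclosure, the trigger event may also be the receipt of a text similarity measurement instruction. For example, when a user wants to measure the similarity between any two text strings in the terminal database, the user can input a similarity measurement instruction to the terminal device; the instruction may be a click instruction, a press instruction, a voice instruction, or the like. When the terminal device receives the instruction, a trigger event for text similarity measurement can be considered detected. The terminal device can then select any two text strings from the database as the first text string and the second text string.
In one implementation of the present disclosure, the trigger event may also be the receipt of a target task completion instruction. For example, when a security risk is discovered in network interface A and the terminal device receives the program code corresponding to network interface A, a trigger event for text similarity measurement can be considered detected. The terminal device can then use the received program code corresponding to network interface A as the first text string and obtain the program code corresponding to any network interface other than network interface A as the second text string.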
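S102. Construct a joint probability distribution of the first text string and the second text string based on the two strings, and sample the joint probability distribution to obtain a sampled string.
In the embodiments of the present disclosure, every character in the sampled string is a character contained in the first text string and/or the second text string.
In the embodiments of the present disclosure, a joint probability distribution of the first text string and the second text string is constructed; sampling based on the joint probability distribution is regarded as the best way to retain the information of both strings at the same time. The joint probability distribution may be a joint distribution of two normal distributions or of a normal distribution and an exponential distribution; the embodiments of the present disclosure do not specifically limit this.
FIG. 2 shows schematic diagrams of a joint probability distribution of two normal distributions and of a joint probability distribution of a normal distribution and an exponential distribution. Random sampling from the joint probability distribution retains as much of the original text information as possible, and this random sampling adds no extra calculation overhead in big data computation, so it can be applied to massive data.
The sampled string obtained by sampling the joint probability distribution contains information from both the first text string and the second text string, which reduces information loss; sampling also reduces data noise.
In one implementation of the present disclosure, sampling the joint probability distribution to obtain a sampled string includes: randomly sampling the joint probability distribution based on a preset sampling ratio to obtain the sampled string, where the preset sampling ratio is inversely proportional to the length of the text string.
The preset sampling ratio is a predetermined proportion: the share that the sampled string drawn from the joint probability distribution accounts for. Preferably, the preset sampling ratio is at most one quarter. With a preset sampling ratio of one quarter, the dimensionality of the calculation is reduced while information loss is kept as low as possible. Specifically, in the embodiments of the present disclosure, sampling at most one quarter of the joint probability distribution is sufficient.
Further, the preset sampling ratio is inversely proportional to the length of the text string. In other words, the longer the text string, the smaller the preset sampling ratio and the greater the performance gain on big data. It should be noted that, as long as it does not shift the center of the joint probability distribution, the preset sampling ratio can be made as small as possible, which maximizes the dimensionality reduction.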
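The sampling step can be illustrated with a minimal Python sketch. It rests on assumptions that the disclosure does not spell out: the joint probability distribution is approximated here by the pooled empirical character frequencies of both strings, the sampled length is taken as the preset ratio times the combined length, and the names `sample_string`, `ratio`, and `seed` are illustrative only. The default ratio of 1/4 is the maximum stated above.

```python
import math
import random
from collections import Counter


def sample_string(a: str, b: str, ratio: float = 0.25, seed: int = 0) -> str:
    """Draw a sampled string from the pooled character distribution of a and b.

    Assumption: the "joint probability distribution" is modelled as the
    empirical character frequencies of a + b; the disclosure leaves the
    exact construction open.
    """
    if not 0 < ratio <= 0.25:
        raise ValueError("ratio is expected to lie in (0, 1/4]")
    if not a and not b:
        return ""
    counts = Counter(a + b)              # pooled character frequencies
    chars = sorted(counts)               # fixed order keeps runs reproducible
    weights = [counts[c] for c in chars]
    k = max(1, math.ceil(ratio * (len(a) + len(b))))  # sampled length
    rng = random.Random(seed)            # seeded generator, for repeatability
    return "".join(rng.choices(chars, weights=weights, k=k))


sampled = sample_string("/tts_sync", "tts/sync/")
print(len(sampled), sampled)
```

Because every drawn character comes from the pooled alphabet, the sampled string only contains characters of the first and/or second text string, as step S102 requires. Making the ratio shrink with text length, as the text suggests, would be a small extension of this sketch.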
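S103. Calculate the distance from the first text string to the sampled string to obtain a first distance matrix, and calculate the distance from the second text string to the sampled string to obtain a second distance matrix.
In the embodiments of the present disclosure, the first distance matrix and the second distance matrix may be any distance matrix that measures text similarity. It should be noted that distance matrices obtained with different similarity measurement strategies use different methods to generate the distance representation vectors.
In one implementation of the present disclosure, an edit distance algorithm is used to calculate the first distance matrix and the second distance matrix. The first distance matrix represents the edit distance from the first text string to the sampled string; the second distance matrix represents the edit distance from the second text string to the sampled string.
The most representative programming-based solution is the edit distance, also called the Levenshtein distance. It measures the minimum number of operations needed to turn one string into another, where the operations are deleting, inserting, and replacing a character. The fewer operations needed, the higher the similarity.
The edit distance is the minimum number of transformations needed to convert one string into another, where the transformations are deletion, insertion, and substitution. Let i and j denote the i-th position in string a and the j-th position in string b. The edit distance between the first i characters of string a and the first j characters of string b is $\mathrm{lev}_{a,b}(i,j)$, which can be expressed as:
$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$
If |a| and |b| denote the lengths of strings a and b, the edit distance between string a and string b is $\mathrm{lev}_{a,b}(|a|,|b|)$, that is, the edit distance at i = |a|, j = |b|.
In the "otherwise" branch of the formula above, the first line represents deletion, the second line insertion, and the third line substitution. The embodiments of the present disclosure take the calculation of the similarity between the strings "/tts_sync" and "tts/sync/" with the edit distance algorithm as an example and use FIGS. 3a, 3b, 4a, 4b, 5a, and 5b to describe the edit distance calculation and its problems in detail.
FIG. 3a shows the initialization of the distances, that is, the part of the edit distance formula that takes max(i, j) if min(i, j) = 0. An initial value is assigned to each position of string a and string b. In some embodiments, the initial values start at 0 and increase by 1 in the original order of the string, which gives the sequence 0-9 in the second row and the sequence 0-9 in the second column. FIG. 3b shows the start of the actual edit distance calculation, that is, the "otherwise" part of the formula. The value at row 3, column 3 is the minimum of the following three values. First value: the value at row 2, column 3 plus 1, which equals 2. Second value: the value at row 3, column 2 plus 1, which equals 2. Third value: because "/" and "t" differ, the value at row 2, column 2 plus 1, which equals 1. The minimum of the three is 1, so the value at row 3, column 3 is 1.
FIGS. 4a and 4b show further steps of the edit distance derivation. As shown in FIG. 4a, the value at row 6, column 3 is 3: because both characters are "/", the diagonal candidate is the number at row 5, column 2, and compared with the value at row 5, column 3 plus 1 and the value at row 6, column 2 plus 1, the minimum is 3. As shown in FIG. 4b, the value at row 4, column 5 is 1, which is smaller than the value 2 at row 3, column 5. This illustrates what the edit distance algorithm computes: position (i, j) measures the distance between the first i characters of one string and the first j characters of the other, and the number of edits needed to turn "/tt" into "tt" is clearly smaller than the number needed to turn "/tt" into "t".
FIG. 5a shows an intermediate stage of the derivation and FIG. 5b shows the complete calculation, in which the bold numbers mark the path of minimum distance. The final edit distance between "/tts_sync" and "tts/sync/" is 3, that is, three transformations are needed. Intuitively, the leading "/" must be deleted, "_" must be replaced with "/", and a "/" must be inserted at the end, for a total of three operations.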
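The worked example can be reproduced with a short dynamic-programming sketch in Python. The function name `edit_distance_matrix` is illustrative, and the orientation (characters of "tts/sync/" along the rows, characters of "/tts_sync" along the columns) is an assumption inferred from the cell values quoted in the text. The matrix indices here start at 0, so "row 3, column 3" of the figures corresponds to `matrix[1][1]`.

```python
from typing import List


def edit_distance_matrix(rows: str, cols: str) -> List[List[int]]:
    """Return the (len(rows) + 1) x (len(cols) + 1) Levenshtein distance matrix."""
    m, n = len(rows), len(cols)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # initialization: min(i, j) = 0 case
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if rows[i - 1] == cols[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (free when characters match)
            )
    return d


matrix = edit_distance_matrix("tts/sync/", "/tts_sync")
print(matrix[-1][-1])  # edit distance between the two example strings: 3
```

The bottom-right cell holds the edit distance of the full strings, and the last row and last column of this matrix are the sequences quoted later for FIG. 6.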
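From the derivation, the edit distance algorithm can be solved in two ways: recursively or by dynamic programming. Either method can be used to calculate the distance from the first text string to the sampled string to obtain the first distance matrix, and to calculate the distance from the second text string to the sampled string to obtain the second distance matrix.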
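For comparison with the dynamic-programming solution, the recursive solution can be sketched as a direct transcription of the recurrence. This is an illustrative Python sketch, not the disclosure's implementation; the name `edit_distance_recursive` is hypothetical, and memoization via `functools.lru_cache` is added so that each (i, j) pair is evaluated only once.

```python
from functools import lru_cache


def edit_distance_recursive(a: str, b: str) -> int:
    """Levenshtein distance via the memoized recurrence lev(i, j)."""

    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        if min(i, j) == 0:
            return max(i, j)             # one prefix is empty
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            lev(i - 1, j) + 1,           # deletion
            lev(i, j - 1) + 1,           # insertion
            lev(i - 1, j - 1) + cost,    # substitution
        )

    return lev(len(a), len(b))


print(edit_distance_recursive("/tts_sync", "tts/sync/"))  # 3
```

The recursive form returns only the final distance and its recursion depth grows with the combined string length, so the iterative matrix form is the more natural choice when the whole distance matrix is needed, as it is in step S104.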
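S104: Determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
In one embodiment of the present disclosure, the similarity between the first distance matrix and the second distance matrix is computed, and this matrix similarity is taken as the similarity between the first text string and the second text string.
Optionally, the similarity between the two matrices is computed using the Euclidean distance or cosine similarity. Other matrix similarity measures may also be used to compute the similarity between the first distance matrix and the second distance matrix; this embodiment imposes no specific limitation.
In one embodiment of the present disclosure, an eigenvector computation method is used to compute the eigenvector of the first distance matrix, denoted the first eigenvector, and the eigenvector of the second distance matrix, denoted the second eigenvector. The similarity between the two eigenvectors is computed and taken as the similarity between the first text string and the second text string. The eigenvectors may be computed by matrix decomposition; this embodiment imposes no specific limitation.
In one embodiment of the present disclosure, determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix includes: performing feature extraction on the first distance matrix to obtain a first distance representation vector, which includes the column representation vector, diagonal representation vector, and row representation vector of the first distance matrix; performing feature extraction on the second distance matrix to obtain a second distance representation vector, which includes the column representation vector, diagonal representation vector, and row representation vector of the second distance matrix; and computing the vector similarity between the first distance representation vector and the second distance representation vector as the similarity between the first text string and the second text string.
The first distance representation vector can be understood as a vector that is extracted from the first distance matrix and stands in for that matrix in the matrix similarity computation. Likewise, the second distance representation vector is extracted from the second distance matrix and stands in for it in the matrix similarity computation.
In one embodiment of the present disclosure, the feature extraction includes: extracting the last row of the distance matrix as the row representation vector; extracting the last column of the distance matrix as the column representation vector; and extracting the diagonal sequence of the distance matrix as the diagonal representation vector.
The text similarity measurement method provided by the embodiments of the present disclosure starts from the design idea of the edit distance: distance values toward the back of the distance matrix (further right and further down) depend on the earlier distance values. In theory, therefore, the last row, the last column, and the diagonal sequence of the distance matrix can represent the similarity information of the two text strings. Taking the text strings "/tts_sync" and "tts/sync/" as an example, the shortest path along the diagonal of the distance matrix, the last column, and the last row can be extracted, and their distance changes form the matrix representation vector.
As shown in Figure 6, extracting the shortest path along the diagonal of the distance matrix (1,1,1,1,2,2,2,2,2,3), the last column (9,8,7,6,6,5,4,3,2,3), and the last row (9,8,8,8,7,7,6,5,4,3) yields the distance representation vector of the distance matrix on the left of Figure 6. In the embodiments of the present disclosure, extracting the distance representation vector converts the text similarity computation directly into a numerical computation on vectors, which implicitly encodes the text strings as word vectors.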
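The extraction can be sketched in Python as follows. The function names are illustrative, and one point is an assumption: the "shortest path along the diagonal" is read here as the backtrace of the optimal edit path from the bottom-right cell, with the origin cell omitted. Under that reading, and with "tts/sync/" on the rows and "/tts_sync" on the columns, the sketch reproduces the three sequences quoted for Figure 6.

```python
def distance_matrix(row_text, col_text):
    """Full Levenshtein matrix; cell (i, j) compares the two prefixes."""
    rows, cols = len(row_text), len(col_text)
    d = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        d[i][0] = i
    for j in range(cols + 1):
        d[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = 0 if row_text[i - 1] == col_text[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d


def shortest_path_values(d, row_text, col_text):
    """Backtrace from the last cell to the origin, collecting cell values.

    Assumption: ties prefer the diagonal move, then up, then left.
    The origin cell (0, 0) is not included in the result.
    """
    i, j = len(row_text), len(col_text)
    values = []
    while i > 0 or j > 0:
        values.append(d[i][j])
        if i > 0 and j > 0:
            cost = 0 if row_text[i - 1] == col_text[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return values[::-1]


def extract_representation(row_text, col_text):
    """Return the row, column, and diagonal-path representation vectors."""
    d = distance_matrix(row_text, col_text)
    return {
        "row": d[-1],                       # last row of the matrix
        "column": [r[-1] for r in d],       # last column of the matrix
        "diagonal": shortest_path_values(d, row_text, col_text),
    }


rep = extract_representation("tts/sync/", "/tts_sync")
print(rep["diagonal"], rep["column"], rep["row"])
```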
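In the embodiments of the present disclosure, a distance representation vector is extracted from each of the two distance matrices, and the similarity between the two distance representation vectors is computed as the similarity between the first text string and the second text string. Because vector similarity is cheaper to compute than matrix similarity, the technical solution provided by the embodiments of the present disclosure further reduces the amount of computation and improves efficiency.
In one embodiment of the present disclosure, the vector similarity is determined using the Euclidean distance or cosine similarity.
Specifically, the vector similarity is the Euclidean distance between the first distance representation vector and the second distance representation vector, or the cosine of the angle between them.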
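Both measures are standard and can be sketched in a few lines of Python. The function names are illustrative. The sketch assumes the two vectors have equal length; how vectors of different lengths would be aligned is not specified in this passage and is not addressed here.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance; smaller values mean more similar vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means the same direction."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Note that the two measures point in opposite directions: a higher cosine value indicates greater similarity, whereas a higher Euclidean distance indicates less.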
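The embodiments of the present disclosure provide a text similarity measurement method, apparatus, device, storage medium, and program product. The method includes: obtaining a first text string and a second text string; constructing the joint probability distribution of the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string; computing the distances from the first text string to the sampled string to obtain a first distance matrix, and computing the distances from the second text string to the sampled string to obtain a second distance matrix; and determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix. By sampling the joint probability distribution of the two text strings to obtain a sampled string, computing the distance matrix between each text string and the sampled string, and then computing the similarity of the two distance matrices, the embodiments of the present disclosure reduce the dimensionality of the text similarity computation and improve computational efficiency. In addition, because the sampled string contains information from both text strings, information loss is reduced while the dimensionality is reduced.
The embodiments of the present disclosure provide a schematic architecture for text similarity measurement, shown in Figure 7. First, the joint probability distribution of the two text strings to be compared, text string a and text string b, is constructed. The joint probability distribution is then down-sampled to obtain the sampled string. Next, the edit distance is computed between text string a and the sampled string and between text string b and the sampled string; this computation produces a distance matrix like the one shown in Figure 5b. A distance representation vector is extracted from each distance matrix. Finally, a numerical computation on the distance representation vectors gives their similarity, which is the similarity between text string a and text string b.
Figure 8 is a schematic diagram of a vector computation flow in an embodiment of the present disclosure.
Figure 8 shows the process of computing the edit distance between each of text string a and text string b and the sampled string. The first text string a and the second text string b are each used with the sampled string to compute a distance matrix; the row, column, and diagonal vectors are extracted as the distance representation vector; and the numerical similarity of the distance representation vectors is computed to obtain the similarity between the first text string a and the second text string b. The two distance matrix computations can be executed in parallel, which further improves batch-processing performance on large data.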
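Because the two distance matrices depend only on their own text string and the shared sampled string, they can be dispatched as independent tasks. The following Python sketch illustrates this with the standard-library `concurrent.futures` module; the function names are illustrative. It uses a thread pool for brevity, and it should be noted that in CPython a process pool or a distributed engine would be needed for genuine CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def distance_matrix(text, sample):
    """Levenshtein matrix between `text` (rows) and `sample` (columns)."""
    d = [[0] * (len(sample) + 1) for _ in range(len(text) + 1)]
    for i in range(len(text) + 1):
        d[i][0] = i
    for j in range(len(sample) + 1):
        d[0][j] = j
    for i in range(1, len(text) + 1):
        for j in range(1, len(sample) + 1):
            cost = 0 if text[i - 1] == sample[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d


def both_matrices(a, b, sample):
    """Compute the two distance matrices as independent, concurrent tasks."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(distance_matrix, a, sample)
        future_b = pool.submit(distance_matrix, b, sample)
        return future_a.result(), future_b.result()
```

Neither task reads the other's result, so the same decomposition carries over to batch engines in which each (text string, sampled string) pair is one unit of work.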
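The text similarity measurement method proposed in the present disclosure splits a single-stage high-dimensional computation into a two-stage low-dimensional computation, which effectively improves the efficiency of measuring similarity over massive text. Its main technical effects are as follows. It makes batch processing of similarity over massive text feasible: distribution sampling and representation vector extraction reduce the dimensionality of the data, improving computational efficiency while preserving as much information as possible. Compared with the conventional edit distance computation, complexity is reduced by at least 70%, and the longer the text, the greater the gain; for large text data, performance is expected to improve by more than one order of magnitude. The method can measure the similarity of any two strings or character sequences, and during computation it generates word vectors specific to each pair of strings being compared, with no additional requirements on word segmentation or semantics. It needs no supervised or unsupervised pre-training on large corpora, and apart from the actual batch processing there is no additional computational overhead. Thanks to the down-sampling of the text distribution, it also reduces, to some extent, the effect of data noise on the similarity computation.
The text similarity measurement method proposed in the present disclosure is a general-purpose big data processing technique. It can be migrated between application scenarios at almost no cost and can be deployed directly as code on big data engines such as MapReduce, Spark, and Storm.
Some embodiments of the present disclosure provide an application scenario for the text similarity measurement method. Specifically, the method further includes: obtaining request/response data transmitted by a first network interface as the first text string; obtaining request/response data transmitted by a second network interface as the second text string; and determining, according to the similarity between the first text string and the second text string, whether the first network interface and the second network interface have similar security risks.
The text similarity measurement method provided by the embodiments of the present disclosure can be applied in many scenarios that measure text similarity over big data. In network security, for example, the method can be used directly to measure the similarity of the data returned by different network interfaces and thereby uncover more instances of similar security risks.
The first network interface is a network interface on which a security risk has been found, and the second network interface is any network interface other than the first. For example: a security risk is found on a first network interface A; the request/response data of interface A is obtained as the first text string; any other network interface is taken as a second network interface B, and its request/response data is obtained as the second text string; the similarity between the first text string and the second text string is then computed using the text similarity measurement method provided in the above embodiments. If the similarity exceeds a set value, the second network interface B has request and return parameters similar to those of the first network interface A, and interface B can be considered quite likely to have a similar security risk. When faced with traffic from hundreds of millions of network interfaces, the method provided by the embodiments of the present disclosure can complete this measurement task quickly. In some embodiments, the set value may be chosen according to the actual situation, for example 80% or 90%.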
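The screening step can be sketched in Python as a simple threshold filter. The function and interface names are hypothetical. To keep the sketch self-contained, the similarity function defaults to `difflib.SequenceMatcher` from the Python standard library; this is only a stand-in and is not the measurement method of the present disclosure, which would be passed in through the `similarity` parameter.

```python
from difflib import SequenceMatcher


def stand_in_similarity(x, y):
    """Placeholder similarity in [0, 1]; NOT the method of this disclosure."""
    return SequenceMatcher(None, x, y).ratio()


def find_similar_interfaces(risky_response, candidates, threshold=0.8,
                            similarity=stand_in_similarity):
    """Return names of interfaces whose data resembles the risky interface.

    `candidates` maps a hypothetical interface name to its request/response
    text; `threshold` corresponds to the set value (for example 80%).
    """
    return [
        name
        for name, response in candidates.items()
        if similarity(risky_response, response) > threshold
    ]


flagged = find_similar_interfaces(
    '{"user_id": 1, "token": "abc"}',
    {"B": '{"user_id": 2, "token": "abd"}', "C": "OK"},
)
print(flagged)
```

In this made-up example only interface B is flagged, because its response has nearly the same structure as the response of the interface with the known risk.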
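Some embodiments of the present disclosure provide a further application scenario for the text similarity measurement method. Specifically: any one test result obtained by white-box security testing is obtained as the first text string; Uniform Resource Locator (URL) interface information is obtained as the second text string; and the URL interface information corresponding to the test result is determined according to the similarity between the first text string and the second text string.
In the field of network security, many automated tools perform risk scanning, and a considerable amount of manual testing and verification is also carried out. Risk scanning approaches include: white-box security testing, which identifies potential vulnerabilities from the data-flow dependencies in application source code; black-box security testing, which imitates hackers of various skill levels by issuing data requests from outside; and security vulnerabilities discovered manually. To maximize efficiency among automated tools, and between automated tools and manual work, the test results from each of these dimensions must at least be linked at the URL interface level. However, white-box security testing works on the source code in the application repository, which makes it difficult to obtain URL-level interface information directly.
In the embodiments of the present disclosure, the test result refers to the test result for any item scanned by white-box security testing, such as a repository, route, file, or piece of source code, and is used as the first text string. URL-level information such as the path, request body, and response body is obtained as the second text string. The similarity between the first text string and the second text string is computed using the text similarity measurement method provided in the above embodiments. If the similarity exceeds a set value, the URL-level information serving as the second text string can be considered to correspond to the test result, and a similarity association is established between the two.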
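Figure 9 is a schematic structural diagram of a text similarity measurement apparatus in an embodiment of the present disclosure. The apparatus is applicable to computing the similarity of two texts.
As shown in Figure 9, the text similarity measurement apparatus 90 provided by the embodiments of the present disclosure mainly includes: a text string acquisition module 91, a sampled string determination module 92, a distance matrix calculation module 93, and a text similarity measurement module 94.
The text string acquisition module 91 is configured to obtain a first text string and a second text string. The sampled string determination module 92 is configured to construct the joint probability distribution of the first text string and the second text string and to sample the joint probability distribution to obtain a sampled string. The distance matrix calculation module 93 is configured to compute the distances from the first text string to the sampled string to obtain a first distance matrix, and to compute the distances from the second text string to the sampled string to obtain a second distance matrix. The text similarity measurement module 94 is configured to determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
In some embodiments of the present disclosure, an edit distance algorithm is used to compute the first distance matrix and the second distance matrix. The first distance matrix characterizes the edit distance from the first text string to the sampled string; the second distance matrix characterizes the edit distance from the second text string to the sampled string.
In some embodiments of the present disclosure, the text similarity measurement module 94 includes: a first distance representation vector extraction unit, configured to perform feature extraction on the first distance matrix to obtain a first distance representation vector, which includes the column representation vector, diagonal representation vector, and row representation vector of the first distance matrix; a second distance representation vector extraction unit, configured to perform feature extraction on the second distance matrix to obtain a second distance representation vector, which includes the column representation vector, diagonal representation vector, and row representation vector of the second distance matrix; and a text similarity calculation unit, configured to compute the vector similarity between the first distance representation vector and the second distance representation vector as the similarity between the first text string and the second text string.
In some embodiments of the present disclosure, the feature extraction includes: extracting the last row of the distance matrix as the row representation vector; extracting the last column of the distance matrix as the column representation vector; and extracting the diagonal sequence of the distance matrix as the diagonal representation vector.
In some embodiments of the present disclosure, the sampled string determination module 92 is specifically configured to randomly down-sample the joint probability distribution based on a preset sampling ratio to obtain the sampled string, where the preset sampling ratio is inversely proportional to the text string length.
In some embodiments of the present disclosure, the vector similarity is determined using the Euclidean distance or cosine similarity.
In some embodiments of the present disclosure, the apparatus further includes: a first text string determination module, configured to obtain request/response data transmitted by a first network interface as the first text string; a second text string determination module, configured to obtain request/response data transmitted by a second network interface as the second text string; and a similar security risk determination module, configured to determine, according to the similarity between the first text string and the second text string, whether the first network interface and the second network interface have similar security risks.
In some embodiments of the present disclosure, the first text string determination module is further configured to obtain a test result produced by white-box security testing as the first text string; the second text string determination module is further configured to obtain Uniform Resource Locator (URL) interface information as the second text string; and a correspondence determination module is configured to determine, according to the similarity between the first text string and the second text string, the URL interface information corresponding to the test result.
It is worth noting that, in the above embodiments of the text similarity measurement apparatus 90, the modules are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional modules serve only to distinguish them from one another and do not limit the scope of protection of the present disclosure. The text similarity measurement apparatus 90 may be implemented in software and/or hardware. Specifically, the modules of the apparatus 90 may be implemented as software components executed on one or more general-purpose processors, or as hardware that performs certain functions, such as programmable logic devices and/or application-specific integrated circuits. In some embodiments, the modules may be embodied as a software product, which may be stored in a non-volatile storage medium. The non-volatile storage medium contains instructions that cause a computer device (for example, a personal computer, server, network device, or mobile terminal) to execute the methods described in the embodiments of the present disclosure. In some embodiments, the modules may be implemented on a single device or distributed across multiple devices. The functions of the modules may be merged with one another or further split into multiple sub-modules.
The text similarity measurement apparatus provided by the embodiments of the present disclosure can execute the steps of the text similarity measurement method provided by the method embodiments of the present disclosure; the specific steps and beneficial effects are not repeated here.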
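Figure 10 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure. Referring to Figure 10, it shows the structure of an electronic device 1000 suitable for implementing the text similarity measurement method provided by the embodiments of the present disclosure. The electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable media players (PMPs), vehicle-mounted terminals (for example, vehicle navigation terminals), and wearable terminal devices, as well as fixed terminals such as digital televisions (TVs), desktop computers, and smart home devices. The electronic device shown in Figure 10 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in Figure 10, the electronic device 1000 may include a processing apparatus 1001 (for example, a central processing unit or a graphics processing unit), which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage apparatus 1008 into a random access memory (RAM) 1003, so as to implement the text similarity measurement method of the embodiments of the present disclosure. The RAM 1003 also stores the various programs and data required for the operation of the electronic device 1000. The processing apparatus 1001, the ROM 1002, and the RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
In general, the following apparatuses may be connected to the I/O interface 1005: an input apparatus 1006 including, for example, a touchscreen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output apparatus 1007 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage apparatus 1008 including, for example, a magnetic tape or hard disk; and a communication apparatus 1009. The communication apparatus 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire in order to exchange data. Although Figure 10 shows an electronic device 1000 with various apparatuses, it should be understood that not all of the illustrated apparatuses are required to be implemented or present. More or fewer apparatuses may alternatively be implemented or present.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart, thereby implementing the text similarity measurement method described above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 1009, installed from the storage apparatus 1008, or installed from the ROM 1002. When the computer program is executed by the processing apparatus 1001, the functions defined in the method of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, in the present disclosure, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to: electrical wire, optical cable, radio frequency (RF), or any suitable combination of the above.
In some implementations, clients and servers may communicate using any currently known or future-developed network protocol, such as the HyperText Transfer Protocol (HTTP), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer-readable medium may be contained in the electronic device described above, or it may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain a first text string and a second text string; construct the joint probability distribution of the first text string and the second text string, and sample the joint probability distribution to obtain a sampled string; compute the distances from the first text string to the sampled string to obtain a first distance matrix, and compute the distances from the second text string to the sampled string to obtain a second distance matrix; and determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
Optionally, when the one or more programs are executed by the electronic device, the electronic device may also perform the other steps described in the above embodiments.
Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.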
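According to one or more embodiments of the present disclosure, the present disclosure provides a text similarity measurement method, including: obtaining a first text string and a second text string; constructing the joint probability distribution of the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string; computing the distances from the first text string to the sampled string to obtain a first distance matrix, and computing the distances from the second text string to the sampled string to obtain a second distance matrix; and determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
According to one or more embodiments of the present disclosure, the present disclosure provides a text similarity measurement apparatus, including: a text string acquisition module, configured to obtain a first text string and a second text string; a sampled string determination module, configured to construct the joint probability distribution of the first text string and the second text string and to sample the joint probability distribution to obtain a sampled string; a distance matrix calculation module, configured to compute the distances from the first text string to the sampled string to obtain a first distance matrix, and to compute the distances from the second text string to the sampled string to obtain a second distance matrix; and a text similarity measurement module, configured to determine the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, including:
one or more processors;
a memory, configured to store one or more programs;
where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the text similarity measurement methods provided by the present disclosure.
According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the text similarity measurement methods provided by the present disclosure.
The embodiments of the present disclosure further provide a computer program product including a computer program or instructions which, when executed by a processor, implement the text similarity measurement method described above.
The embodiments of the present disclosure further provide a computer program/instructions which, when executed by a processor, implement the text similarity measurement method of any one of the above embodiments.
The above description covers only the preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several specific implementation details, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.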

Claims (15)

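  1. A text similarity measurement method, comprising:
    obtaining a first text string and a second text string;
    constructing a joint probability distribution of the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string;
    computing distances from the first text string to the sampled string to obtain a first distance matrix, and computing distances from the second text string to the sampled string to obtain a second distance matrix; and
    determining a similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  2. The method according to claim 1, wherein the first distance matrix and the second distance matrix are computed using an edit distance algorithm, the first distance matrix characterizes the edit distance from the first text string to the sampled string, and the second distance matrix characterizes the edit distance from the second text string to the sampled string.
  3. The method according to claim 2, wherein determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix comprises:
    performing feature extraction on the first distance matrix to obtain a first distance representation vector, the first distance representation vector comprising a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the first distance matrix;
    performing feature extraction on the second distance matrix to obtain a second distance representation vector, the second distance representation vector comprising a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the second distance matrix; and
    computing a vector similarity between the first distance representation vector and the second distance representation vector as the similarity between the first text string and the second text string.
  4. The method according to claim 3, wherein the feature extraction comprises:
    extracting the last row of the distance matrix as the row representation vector;
    extracting the last column of the distance matrix as the column representation vector; and
    extracting the diagonal sequence of the distance matrix as the diagonal representation vector.
  5. The method according to any one of claims 1 to 4, wherein sampling the joint probability distribution to obtain the sampled string comprises:
    randomly down-sampling the joint probability distribution based on a preset sampling ratio to obtain the sampled string, the preset sampling ratio being inversely proportional to the text string length.
  6. The method according to claim 3 or 4, wherein the vector similarity is determined using the Euclidean distance or cosine similarity.
  7. The method according to any one of claims 1 to 6, wherein obtaining the first text string and the second text string comprises:
    obtaining request/response data transmitted by a first network interface as the first text string; and
    obtaining request/response data transmitted by a second network interface as the second text string, the method further comprising:
    determining, according to the similarity between the first text string and the second text string, whether the first network interface and the second network interface have similar security risks.
  8. The method according to any one of claims 1 to 7, wherein obtaining the first text string and the second text string comprises:
    obtaining a test result produced by white-box security testing as the first text string; and
    obtaining Uniform Resource Locator (URL) interface information as the second text string, the method further comprising:
    determining, according to the similarity between the first text string and the second text string, the URL interface information corresponding to the test result.
  9. A text similarity measurement apparatus, comprising:
    a text string acquisition module, configured to obtain a first text string and a second text string;
    a sampled string determination module, configured to construct a joint probability distribution of the first text string and the second text string, and to sample the joint probability distribution to obtain a sampled string;
    a distance matrix calculation module, configured to compute distances from the first text string to the sampled string to obtain a first distance matrix, and to compute distances from the second text string to the sampled string to obtain a second distance matrix; and
    a text similarity measurement module, configured to determine a similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  10. An electronic device, comprising:
    one or more processors;
    a storage apparatus, configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to perform the following operations:
    obtaining a first text string and a second text string;
    constructing a joint probability distribution of the first text string and the second text string, and sampling the joint probability distribution to obtain a sampled string;
    computing distances from the first text string to the sampled string to obtain a first distance matrix, and computing distances from the second text string to the sampled string to obtain a second distance matrix; and
    determining a similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix.
  11. The electronic device according to claim 10, wherein the first distance matrix and the second distance matrix are computed using an edit distance algorithm, the first distance matrix characterizes the edit distance from the first text string to the sampled string, and the second distance matrix characterizes the edit distance from the second text string to the sampled string.
  12. The electronic device according to claim 11, wherein the operation of determining the similarity between the first text string and the second text string based on the first distance matrix and the second distance matrix further comprises the following operations:
    performing feature extraction on the first distance matrix to obtain a first distance representation vector, the first distance representation vector comprising a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the first distance matrix;
    performing feature extraction on the second distance matrix to obtain a second distance representation vector, the second distance representation vector comprising a column representation vector, a diagonal representation vector, and a row representation vector corresponding to the second distance matrix; and
    computing a vector similarity between the first distance representation vector and the second distance representation vector as the similarity between the first text string and the second text string.
  13. The electronic device according to any one of claims 10 to 12, wherein the operation of sampling the joint probability distribution to obtain the sampled string further comprises the following operation:
    randomly down-sampling the joint probability distribution based on a preset sampling ratio to obtain the sampled string, the preset sampling ratio being inversely proportional to the text string length.
  14. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 8.
  15. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the method according to any one of claims 1 to 8.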
PCT/CN2023/115292 2022-10-18 2023-08-28 文本相似性度量方法、装置、设备、存储介质和程序产品 WO2024082827A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211274116.5 2022-10-18
CN202211274116.5A CN115640523A (zh) 2022-10-18 2022-10-18 文本相似性度量方法、装置、设备、存储介质和程序产品

Publications (1)

Publication Number Publication Date
WO2024082827A1 true WO2024082827A1 (zh) 2024-04-25

Family

ID=84945375

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/115292 WO2024082827A1 (zh) 2022-10-18 2023-08-28 文本相似性度量方法、装置、设备、存储介质和程序产品

Country Status (2)

Country Link
CN (1) CN115640523A (zh)
WO (1) WO2024082827A1 (zh)


Also Published As

Publication number Publication date
CN115640523A (zh) 2023-01-24
