CN113535887B - Formula similarity detection method and device - Google Patents

Publication number
CN113535887B
Authority
CN
China
Prior art keywords
formula
vector
characters
position vector
character information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010296491.4A
Other languages
Chinese (zh)
Other versions
CN113535887A (en
Inventor
李长亮
史红亮
廖敏鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202010296491.4A
Publication of CN113535887A
Application granted
Publication of CN113535887B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model

Abstract

The application provides a formula similarity detection method and device. The method comprises: taking one formula from each of a formula set to be detected and a standard formula set; generating a first formula vector and a position vector for each formula; generating a second formula vector for each formula from its first formula vector and position vector; and determining the similarity of the two formulas from their second formula vectors. Because the position vector is added to the first formula vector to produce the second formula vector, the second formula vector carries both the formula information and the position of each character in the formula. This addresses the insensitivity of prior-art formula vectors to position information and makes the comparison between the formula to be detected and the standard formula more accurate.

Description

Formula similarity detection method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for detecting similarity of formulas, a computing device, and a computer readable storage medium.
Background
With the continued development of computer technology, large numbers of digital documents are stored on computers for users to consult at any time. In practice, the similarity between mathematical formulas often needs to be detected. In machine learning, for example, the similarity between a formula recognized by a model and a reference formula is calculated in order to evaluate the model.
Existing formula similarity detection generally uses term frequency-inverse document frequency (TF-IDF), a common weighting technique in information retrieval and data mining, to convert a mathematical formula into a corresponding vector; the similarity of two formulas is then determined from their vectors. In practice, however, vectors obtained by TF-IDF introduce errors into the subsequent similarity detection, because TF-IDF is insensitive to position. For example, for the formulas s=a-b and s=b-a, the TF-IDF method determines that the similarity is 1, that is, that the two formulas are identical, whereas in reality they are only similar, not identical.
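The position insensitivity described above can be illustrated with a minimal Python sketch. It uses plain character counts as a stand-in for TF-IDF weights (an assumption made for brevity: the IDF factor rescales each dimension but does not depend on character order), and the helper names are hypothetical.

```python
from collections import Counter
import math

def bag_of_characters(formula: str) -> Counter:
    """Count each non-space character; the order of characters is discarded."""
    return Counter(ch for ch in formula if not ch.isspace())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

f1 = bag_of_characters("s=a-b")
f2 = bag_of_characters("s=b-a")
similarity = cosine(f1, f2)   # the count vectors are equal, so this is 1 up to rounding
```

Both formulas contain the same five characters once each, so any order-free weighting maps them to the same vector.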
Therefore, how to solve the above problem is an urgent issue for those skilled in the art.
Disclosure of Invention
In view of the foregoing, embodiments of the present application provide a method and apparatus for detecting similarity of formulas, a computing device, and a computer readable storage medium, so as to solve the technical drawbacks in the prior art.
According to a first aspect of an embodiment of the present application, there is provided a formula similarity detection method, including:
taking one formula from the formula set to be detected and the standard formula set respectively;
generating a corresponding first formula vector and a corresponding position vector according to each formula;
generating a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula;
and determining the similarity of the two formulas according to a second formula vector corresponding to each formula.
Optionally, generating a corresponding first formula vector and a position vector according to each formula includes:
acquiring the total number n of characters of a formula in the standard formula set, wherein n is more than or equal to 1;
converting each formula into a corresponding first formula vector according to the total number n of characters;
and generating a position vector corresponding to each formula according to each formula and a preset character information table.
Optionally, converting each formula into a corresponding first formula vector according to the total number n of characters includes:
converting each formula into a corresponding first initial vector by a word frequency-inverse document frequency method;
generating an n-dimensional first formula vector corresponding to each formula according to the first initial vector corresponding to each formula and the total number n of characters by a bit filling method.
Optionally, generating a position vector corresponding to each formula according to each formula and a preset character information table includes:
acquiring the position information of each character in each formula;
acquiring corresponding character information from a preset character information table according to the position information of each character;
acquiring a preset position vector dimension m and the number e of characters in each formula, wherein m is more than or equal to 1, and e is more than or equal to 1;
under the condition that m is smaller than e, character information corresponding to the first m characters in each formula is selected as a position vector of each formula;
and under the condition that m is greater than or equal to e, generating a position vector of each formula by a bit filling method according to character information corresponding to e characters in each formula and the position vector dimension m.
Optionally, generating a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula includes:
and splicing the first formula vector corresponding to each formula and the position vector to generate a second formula vector corresponding to each formula.
According to a second aspect of embodiments of the present application, there is provided a formula similarity detection apparatus, including:
the selection module is configured to take one formula from the formula set to be detected and the standard formula set respectively;
a first generation module configured to generate a corresponding first formula vector and position vector from each of the formulas;
the second generation module is configured to generate a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula;
and the similarity determining module is configured to determine the similarity of the two formulas according to a second formula vector corresponding to each formula.
Optionally, the first generating module includes:
an acquisition unit configured to acquire a total number n of characters of a formula in the standard formula set, wherein n is not less than 1;
a first formula vector unit configured to convert each of the formulas into a corresponding first formula vector according to the total number of characters n;
and the position vector unit is configured to generate a position vector corresponding to each formula according to each formula and a preset character information table.
Optionally, the first formula vector unit includes:
a conversion subunit configured to convert each of the formulas into a corresponding first initial vector by a word frequency-inverse document frequency method;
and the first formula vector generation subunit is configured to generate an n-dimensional first formula vector corresponding to each formula by a bit filling method according to the first initial vector corresponding to each formula and the total number n of characters.
Optionally, the position vector unit includes:
a first obtaining subunit configured to obtain position information of each character in each formula;
a second obtaining subunit configured to obtain corresponding character information in a preset character information table according to the position information of each character;
the third acquisition subunit is configured to acquire a preset position vector dimension m and the number e of characters in each formula, wherein m is more than or equal to 1, and e is more than or equal to 1;
the generating position vector subunit is configured to select character information corresponding to the first m characters in each formula as a position vector of each formula under the condition that m is smaller than e; and under the condition that m is greater than or equal to e, generating a position vector of each formula by a bit filling method according to character information corresponding to e characters in each formula and the position vector dimension m.
Optionally, the second generating module is further configured to splice the first formula vector corresponding to each formula and the position vector to generate a second formula vector corresponding to each formula.
According to a third aspect of embodiments of the present application, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the formula similarity detection method when executing the instructions.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the formula similarity detection method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing computer instructions which, when executed by the chip, implement the steps of the formula similarity detection method.
In the embodiments of the application, the second formula vector is generated by adding the position vector to the first formula vector of a formula, so that the second formula vector contains both the formula information and the position information of each character in the formula. This resolves the insensitivity to position information when vector similarity is calculated and makes formula similarity detection more accurate.
Second, the dimension of the first formula vector is determined from the total number n of characters in the standard formulas of the standard formula set, so that the vector can accommodate the character information of the formula to be detected.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flowchart of a formula similarity detection method provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for generating the first formula vector and the position vector corresponding to a formula, provided in an embodiment of the present application;
FIG. 4 is a flowchart of a formula similarity detection method according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a formula similarity detection device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described here, and those skilled in the art may make similar generalizations without departing from its spirit; the application is therefore not limited to the specific embodiments disclosed below.
The terminology used in one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of one or more embodiments of the application. As used in this application in one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, terms related to one or more embodiments of the present invention will be explained.
LaTeX: a notation for writing mathematical formulas; for example, a = c × d² is written as a=c\times d^2.
Word frequency-inverse document frequency (term frequency-inverse document frequency, TF-IDF): a common weighting technique used in information retrieval and data mining. TF is the term (word) frequency and IDF is the inverse document frequency; TF-IDF is the product of the two.
Cosine similarity: a measure of how similar two vectors are, with values in the range [-1, 1].
Formula to be detected: a formula that is to be compared against the existing standard formulas in the database.
Standard formula: a formula in the database against which the formula to be detected is compared.
First initial vector: the vector obtained by converting a formula with TF-IDF.
First formula vector: the vector obtained by padding the first initial vector to a preset vector dimension.
Position vector: the vector obtained by converting the position of each character in a formula according to a preset character information table.
Second formula vector: the vector obtained by concatenating the first formula vector and the position vector.
Preset character information table: a table holding the vector value that corresponds to each character present in an existing database.
In the present application, a method and apparatus for detecting similarity of formulas, a computing device, and a computer-readable storage medium are provided, and are described in detail in the following embodiments.
FIG. 1 illustrates a block diagram of a computing device 100, according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. Processor 120 is coupled to memory 110 via bus 130 and database 150 is used to store data.
Computing device 100 also includes access device 140, which enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-described components of computing device 100, as well as other components not shown in FIG. 1, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 1 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the formula similarity detection method shown in FIG. 2. FIG. 2 shows a flowchart of a formula similarity detection method according to an embodiment of the present application, including steps 202 to 208.
Step 202: Take one formula from each of the formula set to be detected and the standard formula set.
The formula set to be detected is a set of formulas awaiting detection. It may be obtained through recognition by a neural network model or from the Internet; the present application does not limit how the set is composed.
The standard formulas in the standard formula set are the formulas against which a formula to be detected is compared. That is, the formula to be detected is compared with each formula in the standard formula set in turn, yielding its similarity to each standard formula.
One formula to be detected is taken from the formula set to be detected and one standard formula is taken from the standard formula set. Normally the similarity is calculated between the formula to be detected and every formula in the standard formula set, with one standard formula compared in each calculation.
The formulas referred to in this application may be conventional mathematical expressions, such as a = c × d², or LaTeX expressions of mathematical formulas, such as a=c\times d^2.
In the embodiment provided by the application, the formula to be detected is S=-T and the standard formula is T=-S.
Step 204: Generate a corresponding first formula vector and position vector for each formula.
Optionally, referring to FIG. 3, step 204 may be implemented by steps 302 through 306 described below.
Step 302: Acquire the total number n of characters of the formulas in the standard formula set.
When a formula vector is generated for each formula, its dimension must be fixed so that all vectors have the same dimension and can cover every existing character; the total number n of characters in the existing standard formula set is therefore obtained. In the method provided by the application, the dimension of the formula vector is determined by the total number n of characters of the standard formulas in the standard formula set, where n is greater than or equal to 1. For example, if the standard formula set contains the three formulas a=b+c, mx^2+nx+c=0 and super=\frac{q}{number}, then the characters in the set are 'a, b, c, m, n, q, x, super, number, 0, 2, +, =, ^, \frac, {, }', 17 distinct characters in total, so n is 17.
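The count in the three-formula example can be reproduced with a short Python sketch. The formulas are supplied as already tokenised lists, which is an assumption: the text does not describe how multi-letter symbols such as super, number or \frac are separated from single letters, and the function name is hypothetical.

```python
# Hypothetical pre-tokenised standard formula set (tokenisation is assumed, not specified).
standard_set = [
    ["a", "=", "b", "+", "c"],
    ["m", "x", "^", "2", "+", "n", "x", "+", "c", "=", "0"],
    ["super", "=", "\\frac", "{", "q", "}", "{", "number", "}"],
]

def vocabulary(formulas):
    """Return the distinct tokens of all formulas, in first-seen order."""
    seen = {}
    for tokens in formulas:
        for tok in tokens:
            seen.setdefault(tok, len(seen))
    return list(seen)

vocab = vocabulary(standard_set)
n = len(vocab)   # dimension of every first formula vector
```

Counting distinct tokens rather than token occurrences is what makes the example total 17.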
In the embodiment provided in the present application, the total number of characters of all the formulas in the standard formula set is obtained to be 80, that is, n is 80.
Step 304: Convert each formula into a corresponding first formula vector according to the total number n of characters.
Optionally, each formula is converted into a corresponding first initial vector by the word frequency-inverse document frequency method, and an n-dimensional first formula vector is generated for each formula by bit filling (padding), according to its first initial vector and the total number n of characters.
The first initial vector is the vector obtained directly from a formula by the word frequency-inverse document frequency method. Its dimension depends on the number of characters in the formula, which ensures that all character information in the formula is converted into the first initial vector without omission; for example, if a formula contains 5 characters, the dimension of its first initial vector is 5. In general the dimension of the first initial vector is smaller than n, so the first initial vector is padded into an n-dimensional first formula vector. This keeps the dimensions of all first formula vectors consistent and makes the subsequent calculation more accurate.
In the embodiment provided in the present application, the first initial vector of the formula to be detected S=-T is (W_1, W_2, W_3, W_4), the first initial vector of the standard formula T=-S is (S_1, S_2, S_3, S_4), and n is 80. By padding with zeros, the first initial vector (W_1, W_2, W_3, W_4) of the formula to be detected is extended to (W_1, W_2, W_3, W_4, W_5, W_6, ..., W_80) and the first initial vector (S_1, S_2, S_3, S_4) of the standard formula is extended to (S_1, S_2, S_3, S_4, S_5, S_6, ..., S_80), where W_5, W_6, ..., W_80 and S_5, S_6, ..., S_80 are all 0.
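As a rough sketch of step 304, the Python below computes one TF-IDF weight per distinct character and pads the result with zeros to n dimensions. The exact TF and IDF definitions, the first-seen ordering of components, the two-formula toy corpus, and all function names are assumptions; the text gives no numeric values for W_1 to W_4.

```python
import math
from collections import Counter

def first_initial_vector(tokens, corpus):
    """TF-IDF weight for each distinct token of `tokens`, in first-seen order."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    vector = []
    for tok in dict.fromkeys(tokens):                  # distinct tokens, order kept
        tf = counts[tok] / len(tokens)
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed form)
        vector.append(tf * idf)
    return vector

def pad_with_zeros(vector, n):
    """Extend `vector` with zeros up to dimension n (the first formula vector)."""
    if len(vector) > n:
        raise ValueError("vector is longer than the target dimension n")
    return vector + [0.0] * (n - len(vector))

corpus = [["S", "=", "-", "T"], ["T", "=", "-", "S"]]  # toy corpus, assumed
initial = first_initial_vector(corpus[0], corpus)
first_formula_vector = pad_with_zeros(initial, 80)
```

With this toy corpus every character occurs once in both formulas, so all four weights are equal and the remaining 76 components are zero.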
Step 306: Generate a position vector corresponding to each formula according to each formula and a preset character information table.
The position vector of a formula is generated from the positions of the characters in the formula and the preset character information table. It records where each character occurs in the formula, so that position information can be added in subsequent processing.
Optionally, step 306 may be implemented by steps S30602 to S30610 described below.
S30602: Acquire the position information of each character in each formula.
The position information of the characters in each formula is acquired; for example, in a=b+c the position information of the characters is (a, =, b, +, c).
In the embodiment provided in the present application, the position information of the characters in the formula to be detected S=-T is (S, =, -, T), and the position information of the characters in the standard formula T=-S is (T, =, -, S).
S30604: Acquire the corresponding character information from the preset character information table according to the position information of each character.
In practice, a character information table is maintained in advance, storing the character information corresponding to each character. Table 1 below shows a schematic excerpt of the character information table.
TABLE 1
Symbol                 a   b   c   number   0   2    +    ^
Character information  0   1   2   8        9   11   15   19   #
As shown in Table 1, the character information corresponding to the symbol a is 0 and the character information corresponding to the character 2 is 11. The character information of each character is obtained according to its position information.
In the embodiment provided in the application, the position information of the characters in the formula to be detected S=-T is (S, =, -, T), and the corresponding character information is (16, 55, 53, 19); the position information of the characters in the standard formula T=-S is (T, =, -, S), and the corresponding character information is (19, 55, 53, 16).
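The lookup in S30604 amounts to mapping each character, in formula order, to its table entry. A minimal Python sketch follows; the dictionary holds only the four values quoted in this embodiment, and the names are hypothetical. How characters missing from the table are handled is not stated in the text, so the sketch simply lets the lookup fail.

```python
# Excerpt of the preset character information table, limited to the
# values quoted in this embodiment (the full table is not given).
CHAR_INFO = {"S": 16, "T": 19, "=": 55, "-": 53}

def character_information(tokens, table=CHAR_INFO):
    """Map each character, in formula order, to its character information."""
    return [table[tok] for tok in tokens]   # raises KeyError for unknown characters

detected = character_information(["S", "=", "-", "T"])   # formula to be detected
standard = character_information(["T", "=", "-", "S"])   # standard formula
```

Because the order of the input is preserved, the two formulas yield different sequences even though they contain the same characters.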
S30606: Acquire a preset position vector dimension m and the number e of characters in each formula, where m is greater than or equal to 1 and e is greater than or equal to 1.
The position vector dimension m is acquired; its value is preset. The number e of characters in each formula is also acquired. Because formulas differ in length, the vectors that record the positions of all their characters differ in dimension as well; to keep the vectors uniform when similarity is calculated, the position vector of each formula is truncated or padded to the preset position vector dimension.
In the embodiment provided by the application, the preset position vector dimension is 10, the number of characters in the formula to be detected is 4, and the number of characters in the standard formula is 4.
S30608: When m is smaller than e, select the character information corresponding to the first m characters in each formula as the position vector of that formula.
That is, when the position vector dimension is smaller than the number of characters in a formula, only as many items of character information as the position vector dimension are taken, from the start of the formula, to form its position vector.
In a specific embodiment provided in the present application, the preset position vector dimension is 7 and the number of characters in the formula is 10, so the character information of the first 7 characters is selected as the position vector of the formula.
S30610: When m is greater than or equal to e, generate the position vector of each formula by bit filling, according to the character information corresponding to the e characters in the formula and the position vector dimension m.
That is, when the position vector dimension is greater than or equal to the number of characters in the formula, the m-dimensional position vector is generated from the character information of the formula by padding.
In the embodiment provided by the application, the preset position vector dimension is 10 and the number of characters in the formula to be detected is 4, so the position vector (16, 55, 53, 19, 0, 0, 0, 0, 0, 0) is generated from the character information of the formula to be detected by padding with zeros. Note that the position vector must be divided by a hyperparameter α before the final position vector is obtained. The number of characters in the standard formula is also 4, and its position vector (19, 55, 53, 16, 0, 0, 0, 0, 0, 0) is generated from its character information by padding with zeros.
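Steps S30606 to S30610 can be sketched in Python as a single truncate-or-pad function. The function name is hypothetical, and the default alpha=1.0 is a placeholder: the text says the vector is divided by a hyperparameter but does not give its value.

```python
def position_vector(char_info, m, alpha=1.0):
    """Fit character information to dimension m, then scale by 1/alpha.

    If m < e the first m entries are kept; otherwise zeros are appended.
    """
    e = len(char_info)
    if m < e:
        fitted = char_info[:m]
    else:
        fitted = char_info + [0] * (m - e)
    return [value / alpha for value in fitted]

padded = position_vector([16, 55, 53, 19], m=10)       # e = 4, m = 10: zero padding
truncated = position_vector(list(range(1, 11)), m=7)   # e = 10, m = 7: first 7 kept
```

A larger alpha shrinks the position part relative to the TF-IDF part, which presumably controls how strongly character order influences the final similarity.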
Step 206: Generate a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula.
Optionally, the first formula vector and the position vector corresponding to each formula are concatenated to generate the second formula vector corresponding to that formula.
Concatenating the first formula vector and the position vector of each formula produces a second formula vector that contains both the formula information and the position information of the formula.
In the embodiment provided in the present application, the first formula vector (W_1, W_2, W_3, W_4, W_5, W_6, ..., W_80) of the formula to be detected and its position vector (16, 55, 53, 19, 0, 0, 0, 0, 0, 0) are concatenated to generate the second formula vector (W_1, W_2, W_3, W_4, W_5, W_6, ..., W_80, 16, 55, 53, 19, 0, 0, 0, 0, 0, 0) of the formula to be detected. Likewise, the first formula vector (S_1, S_2, S_3, S_4, S_5, S_6, ..., S_80) of the standard formula and its position vector (19, 55, 53, 16, 0, 0, 0, 0, 0, 0) are concatenated to generate the second formula vector (S_1, S_2, S_3, S_4, S_5, S_6, ..., S_80, 19, 55, 53, 16, 0, 0, 0, 0, 0, 0) of the standard formula.
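The concatenation in step 206 is a plain join of the n-dimensional and m-dimensional parts. In the Python sketch below the TF-IDF weights are placeholders, since the text gives no numeric values for W_1 to W_4, and the function name is hypothetical.

```python
def second_formula_vector(first_formula_vector, position_vector):
    """Concatenate the first formula vector and the position vector."""
    return list(first_formula_vector) + list(position_vector)

# Placeholder weights standing in for W_1..W_4, followed by 76 zeros (n = 80).
first = [0.25, 0.25, 0.25, 0.25] + [0.0] * 76
position = [16, 55, 53, 19, 0, 0, 0, 0, 0, 0]          # m = 10
second = second_formula_vector(first, position)
```

The result has n + m = 90 components, with the position information occupying the last 10.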
Step 208: Determine the similarity of the two formulas according to the second formula vector corresponding to each formula.
The similarity between the two formulas is determined by calculating the similarity of their second formula vectors. Many methods can be used for this calculation, such as Manhattan distance or cosine similarity; the present application does not limit the method used to calculate the similarity between vectors.
In the embodiment provided in the present application, the similarity between the two vectors is calculated by cosine similarity, as shown in formula 1 below:
$$\cos\theta = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}} \qquad (1)$$
where A_i and B_i denote the i-th components of the two vectors and the sums run over all components. The cosine similarity between the second formula vector of the formula to be detected and the second formula vector of the standard formula is calculated to be 0.899815, so the similarity between the formula to be detected S=-T and the standard formula T=-S is 0.899815.
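Cosine similarity as used in step 208 can be written in a few lines of Python. The sketch applies it only to the two position vectors of this embodiment, so it does not reproduce the 0.899815 figure, which also depends on TF-IDF components and a hyperparameter value that the text does not give. The function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

p_detected = [16, 55, 53, 19, 0, 0, 0, 0, 0, 0]   # position vector of S=-T
p_standard = [19, 55, 53, 16, 0, 0, 0, 0, 0, 0]   # position vector of T=-S
position_only = cosine_similarity(p_detected, p_standard)
```

Here the dot product is 6442 and both squared norms are 6451, so the position parts alone already give a similarity below 1 for the two reordered formulas.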
According to the formula similarity detection method, the position vector is appended to the first formula vector corresponding to each formula to generate the second formula vector, so that the second formula vector contains both the formula information and the position information of each character in the formula. This solves the problem that vector similarity calculation is insensitive to position information, and makes formula similarity detection more accurate.
Secondly, the dimension of the first formula vector is determined according to the total number n of characters of the standard formulas in the standard formula set, so that this dimension is large enough to hold the character information of the formulas to be detected.
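The effect of the position vector can be sketched as follows. The two position vectors are the ones given in the embodiment; the identical four-component first vector is a hypothetical stand-in that models two formulas containing the same characters in a different order, and the helper `cosine` is an illustrative name.

```python
import math


def cosine(a, b):
    """Plain cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical first formula vector, identical for both formulas because an
# order-insensitive representation cannot tell them apart.
first = [1.0, 1.0, 1.0, 1.0]

# Position vectors taken from the embodiment.
pos_detected = [16, 55, 53, 19, 0, 0, 0, 0, 0, 0]
pos_standard = [19, 55, 53, 16, 0, 0, 0, 0, 0, 0]

without_position = cosine(first, first)
with_position = cosine(first + pos_detected, first + pos_standard)
print(without_position, with_position)
```

Without the position vector the two formulas are indistinguishable (similarity 1); with it, the similarity falls below 1 because the first and last characters are swapped. The magnitude of the drop depends on the assumed first vector and is not meant to match the figure in the embodiment.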
Fig. 4 shows a formula similarity detection method according to another embodiment of the present application, which includes steps 402 to 420.
Step 402: Take one formula from the formula set to be detected and one formula from the standard formula set.
In the embodiment provided in the present application, the formula to be detected mx^2+nx+c=0 is taken from the formula set to be detected, and the standard formula mx^2+ny^3-c=0 is taken from the standard formula set.
Step 404: Acquire the total number n of characters of the formulas in the standard formula set.
In the embodiment provided in the present application, the total number of characters contained in all the formulas in the standard formula set is 383.
Step 406: Convert each formula into a corresponding first initial vector by the term frequency-inverse document frequency (TF-IDF) method.
In the embodiment provided in the present application, the formula to be detected mx^2+nx+c=0 is converted into the corresponding first initial vector (A_1, A_2, …, A_9), and the standard formula mx^2+ny^3-c=0 is converted into the corresponding first initial vector (B_1, B_2, …, B_12).
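The source does not specify how TF-IDF is applied to a formula, so the following is only a sketch under stated assumptions: each formula is treated as a document, each distinct character as a term (in order of first appearance), and a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1 is used. The names `distinct_chars` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def distinct_chars(formula: str) -> List[str]:
    """Distinct characters of a formula, in order of first appearance."""
    return list(dict.fromkeys(formula))


def tfidf_vector(formula: str, corpus: List[str]) -> List[float]:
    """One TF-IDF weight per distinct character (assumed weighting scheme)."""
    counts = Counter(formula)
    total = len(formula)
    n_docs = len(corpus)
    vector = []
    for ch in distinct_chars(formula):
        tf = counts[ch] / total
        df = sum(1 for doc in corpus if ch in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF, an assumption
        vector.append(tf * idf)
    return vector


corpus = ["mx^2+nx+c=0", "mx^2+ny^3-c=0"]
a = tfidf_vector(corpus[0], corpus)
b = tfidf_vector(corpus[1], corpus)
print(len(a), len(b))  # 9 12
```

Under these assumptions the vector lengths agree with the 9 and 12 components of the first initial vectors in the embodiment; the weights themselves are illustrative.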
Step 408: Generate an n-dimensional first formula vector corresponding to each formula by a bit filling method, according to the first initial vector corresponding to each formula and the total number n of characters.
In the embodiment provided in the present application, the first initial vector (A_1, A_2, …, A_9) is padded with zeros to generate the corresponding first formula vector (A_1, A_2, …, A_383), where A_10, A_11, …, A_383 are 0. The first initial vector (B_1, B_2, …, B_12) is padded with zeros to generate the corresponding first formula vector (B_1, B_2, …, B_383), where B_13, B_14, …, B_383 are 0.
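The zero-filling step can be sketched as below. The function name `pad_to_dimension` is hypothetical, and raising an error when the initial vector is longer than n is an assumption, since the source does not describe that case.

```python
from typing import List, Sequence


def pad_to_dimension(initial: Sequence[float], n: int) -> List[float]:
    """Extend a first initial vector to n dimensions by appending zeros."""
    if len(initial) > n:
        # Assumption: the source does not cover an initial vector longer than n.
        raise ValueError("initial vector is longer than the target dimension")
    return list(initial) + [0.0] * (n - len(initial))


# Hypothetical 9-component first initial vector; n = 383 as in the embodiment.
initial = [0.1] * 9
first_formula_vector = pad_to_dimension(initial, 383)
print(len(first_formula_vector))  # 383
```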
Step 410: Acquire the position information of each character in each formula.
In the embodiment provided in the present application, the position information of the characters in the formula to be detected is (m, x, ^, 2, +, n, c, =, 0), and the position information of the characters in the standard formula is (m, x, ^, 2, +, n, y, 3, -, c, =, 0).
Step 412: Acquire the corresponding character information from a preset character information table according to the position information of each character.
In the embodiment provided in the present application, the preset character information table includes, but is not limited to, Table 2 shown below.
TABLE 2
Symbol                 m  x  n  c  y  ^  +  =  -  0   2   3
Character information  1  2  3  4  5  6  7  8  9  10  11  12
According to the preset character information table, the character information (1, 2, 6, 11, 7, 3, 4, 8, 10) corresponding to the formula to be detected and the character information (1, 2, 6, 11, 7, 3, 5, 12, 9, 4, 8, 10) corresponding to the standard formula are acquired.
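A minimal sketch of the table lookup in steps 410 to 412 follows. The dictionary `CHAR_INFO` and the function `character_information` are illustrative names. Two points are inferred from the character information lists in this embodiment rather than stated outright: the code 8 for `=`, and the fact that each character is recorded once, at its first appearance.

```python
from typing import Dict, List

# Character information table of the embodiment; "=" -> 8 is inferred from the
# character information lists given in the text.
CHAR_INFO: Dict[str, int] = {
    "m": 1, "x": 2, "n": 3, "c": 4, "y": 5, "^": 6,
    "+": 7, "=": 8, "-": 9, "0": 10, "2": 11, "3": 12,
}


def character_information(formula: str, table: Dict[str, int] = CHAR_INFO) -> List[int]:
    """Look up the code of each distinct character, in order of first appearance."""
    return [table[ch] for ch in dict.fromkeys(formula)]


print(character_information("mx^2+nx+c=0"))
# [1, 2, 6, 11, 7, 3, 4, 8, 10]
print(character_information("mx^2+ny^3-c=0"))
# [1, 2, 6, 11, 7, 3, 5, 12, 9, 4, 8, 10]
```

A character missing from the table raises `KeyError` in this sketch; the source does not say how unknown characters are handled.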
Step 414: Acquire a preset position vector dimension m and the number e of characters in each formula.
In the embodiment provided by the application, the dimension of the preset position vector is 10, the number of characters corresponding to the formula to be detected is 9, and the number of characters corresponding to the standard formula is 12.
Step 416: Generate an m-dimensional position vector corresponding to each formula according to the size relation between m and e and the character information in each formula.
When m is smaller than e, the character information corresponding to the first m characters in the formula is selected as the position vector of the formula.
In the embodiment provided in the present application, the position vector dimension 10 is smaller than the number of characters 12 of the standard formula, so the character information of the first 10 characters in the standard formula is selected as the position vector (1, 2, 6, 11, 7, 3, 5, 12, 9, 4) corresponding to the standard formula.
When m is greater than or equal to e, the position vector of the formula is generated by a bit filling method according to the character information corresponding to the e characters in the formula and the position vector dimension m.
In the embodiment provided in the present application, the position vector dimension 10 is larger than the number of characters 9 of the formula to be detected, so the position vector (1, 2, 6, 11, 7, 3, 4, 8, 10, 0) corresponding to the formula to be detected is generated from the character information of the 9 characters in the formula to be detected by zero filling.
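The two cases of step 416 (truncate when m < e, zero-fill when m ≥ e) can be sketched as a single function. The name `position_vector` is hypothetical; the input lists are the character information values from this embodiment.

```python
from typing import List, Sequence


def position_vector(char_info: Sequence[int], m: int) -> List[int]:
    """Build an m-dimensional position vector from e pieces of character information."""
    e = len(char_info)
    if m < e:
        # Keep only the character information of the first m characters.
        return list(char_info[:m])
    # m >= e: append zeros until the vector has m components.
    return list(char_info) + [0] * (m - e)


standard_info = [1, 2, 6, 11, 7, 3, 5, 12, 9, 4, 8, 10]  # e = 12
detected_info = [1, 2, 6, 11, 7, 3, 4, 8, 10]            # e = 9

print(position_vector(standard_info, 10))  # [1, 2, 6, 11, 7, 3, 5, 12, 9, 4]
print(position_vector(detected_info, 10))  # [1, 2, 6, 11, 7, 3, 4, 8, 10, 0]
```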
It should be noted that steps 404 to 408 and steps 410 to 416 may be performed in parallel.
Step 418: Splice the first formula vector and the position vector corresponding to each formula to generate a second formula vector corresponding to each formula.
In the embodiment provided in the present application, the first formula vector (A_1, A_2, …, A_383) of the formula to be detected and its position vector (1, 2, 6, 11, 7, 3, 4, 8, 10, 0) are spliced to generate the second formula vector (A_1, A_2, …, A_383, 1, 2, 6, 11, 7, 3, 4, 8, 10, 0) of the formula to be detected. The first formula vector (B_1, B_2, …, B_383) of the standard formula and its position vector (1, 2, 6, 11, 7, 3, 5, 12, 9, 4) are spliced to generate the second formula vector (B_1, B_2, …, B_383, 1, 2, 6, 11, 7, 3, 5, 12, 9, 4) of the standard formula.
Step 420: Determine the similarity of the two formulas according to the second formula vector corresponding to each formula.
In the embodiment provided in the present application, the cosine similarity between the second formula vector of the formula to be detected and the second formula vector of the standard formula is calculated to be 0.785215, so the similarity between the formula to be detected mx^2+nx+c=0 and the standard formula mx^2+ny^3-c=0 is determined to be 0.785215.
According to the formula similarity detection method, the position vector is appended to the first formula vector corresponding to each formula to generate the second formula vector, so that the second formula vector contains both the formula information and the position information of each character in the formula. This solves the problem that vector similarity calculation is insensitive to position information, and makes formula similarity detection more accurate.
Secondly, the dimension of the first formula vector is determined according to the total number n of characters of the standard formulas in the standard formula set, so that this dimension is large enough to hold the character information of the formulas to be detected.
Corresponding to the above method embodiment, the present application further provides an embodiment of a formula similarity detection device, and fig. 5 shows a schematic structural diagram of the formula similarity detection device according to one embodiment of the present application. As shown in fig. 5, the apparatus includes:
The selection module 502 is configured to take one formula from the formula set to be detected and one formula from the standard formula set.
A first generation module 504 is configured to generate a corresponding first formula vector and position vector from each of the formulas.
The second generating module 506 is configured to generate a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula.
The similarity determination module 508 is configured to determine the similarity of the two formulas according to a second formula vector corresponding to each formula.
Optionally, the first generation module 504 includes:
an acquisition unit configured to acquire the total number n of characters of the formulas in the standard formula set, wherein n is greater than or equal to 1;
a first formula vector unit configured to convert each of the formulas into a corresponding first formula vector according to the total number of characters n;
a position vector unit configured to generate a position vector corresponding to each formula according to each formula and a preset character information table.
Optionally, the first formula vector unit includes:
a conversion subunit configured to convert each of the formulas into a corresponding first initial vector by the term frequency-inverse document frequency (TF-IDF) method;
a first formula vector generation subunit configured to generate an n-dimensional first formula vector corresponding to each formula by a bit filling method, according to the first initial vector corresponding to each formula and the total number n of characters.
Optionally, the position vector unit includes:
a first acquisition subunit configured to acquire the position information of each character in each formula;
a second acquisition subunit configured to acquire the corresponding character information from a preset character information table according to the position information of each character;
a third acquisition subunit configured to acquire a preset position vector dimension m and the number e of characters in each formula, wherein m is greater than or equal to 1 and e is greater than or equal to 1;
a position vector generation subunit configured to select, when m is smaller than e, the character information corresponding to the first m characters in each formula as the position vector of that formula, and to generate, when m is greater than or equal to e, the position vector of each formula by a bit filling method according to the character information corresponding to the e characters in that formula and the position vector dimension m.
Optionally, the second generation module 506 is further configured to splice the first formula vector and the position vector corresponding to each formula to generate a second formula vector corresponding to each formula.
According to the formula similarity detection device provided by the embodiment of the present application, the position vector is appended to the first formula vector corresponding to each formula to generate the second formula vector, so that the second formula vector contains both the formula information and the position information of each character in the formula. This solves the problem that vector similarity calculation is insensitive to position information, and makes formula similarity detection more accurate.
Secondly, the dimension of the first formula vector is determined according to the total number n of characters of the standard formulas in the standard formula set, so that this dimension is large enough to hold the character information of the formulas to be detected.
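The module structure of the device can be sketched in software as a single class whose methods mirror modules 502 to 508. This is an illustration under assumptions, not the patented implementation: the class name `FormulaSimilarityDevice`, the injected `first_vector_fn` (a stand-in for the first formula vector unit), and the zero-vector placeholder used in the example are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class FormulaSimilarityDevice:
    first_vector_fn: Callable[[str], List[float]]  # stand-in for the first formula vector unit
    char_table: Dict[str, int]                     # preset character information table
    m: int                                         # preset position vector dimension

    def select(self, to_detect: Sequence[str], standard: Sequence[str]) -> Tuple[str, str]:
        """Selection module: take one formula from each set (here, the first)."""
        return to_detect[0], standard[0]

    def first_generation(self, formula: str) -> Tuple[List[float], List[float]]:
        """First generation module: first formula vector and position vector."""
        info = [self.char_table[ch] for ch in dict.fromkeys(formula)]
        pos = info[:self.m] if self.m < len(info) else info + [0] * (self.m - len(info))
        return self.first_vector_fn(formula), [float(v) for v in pos]

    def second_generation(self, first: List[float], position: List[float]) -> List[float]:
        """Second generation module: splice the two vectors."""
        return list(first) + list(position)

    def similarity(self, a: List[float], b: List[float]) -> float:
        """Similarity determination module: cosine similarity."""
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0


table = {"m": 1, "x": 2, "n": 3, "c": 4, "y": 5, "^": 6,
         "+": 7, "=": 8, "-": 9, "0": 10, "2": 11, "3": 12}
# Placeholder first formula vector: all zeros in 383 dimensions (hypothetical).
device = FormulaSimilarityDevice(lambda f: [0.0] * 383, table, m=10)
f1, f2 = device.select(["mx^2+nx+c=0"], ["mx^2+ny^3-c=0"])
v1 = device.second_generation(*device.first_generation(f1))
v2 = device.second_generation(*device.first_generation(f2))
score = device.similarity(v1, v2)
```

Injecting the first-vector function keeps the sketch independent of any particular TF-IDF variant, which the source leaves unspecified. With the all-zero placeholder, `score` reflects the position vectors only.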
An embodiment of the present application further provides a computing device, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the formula similarity detection method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the formula similarity detection method as described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the formula similarity detection method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the formula similarity detection method.
An embodiment of the present application further discloses a chip storing computer instructions that, when executed by a processor, implement the steps of the formula similarity detection method as described above.
The foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all necessary for the present application.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are provided only to help explain the present application. The alternative embodiments do not set out every detail exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of the content of this application. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and thereby to enable others skilled in the art to best understand and utilize the invention. This application is to be limited only by the claims and their full scope and equivalents.

Claims (10)

1. A formula similarity detection method, characterized by comprising the following steps:
taking one formula from the formula set to be detected and the standard formula set respectively;
acquiring the total number n of characters of formulas in the standard formula set, converting each formula into a corresponding first formula vector according to the total number n of characters, and generating a position vector corresponding to each formula according to each formula and a preset character information table, wherein n is more than or equal to 1;
generating a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula;
and determining the similarity of the two formulas according to a second formula vector corresponding to each formula.
2. The formula similarity detection method of claim 1, wherein converting each of the formulas into a corresponding first formula vector according to the total number of characters n comprises:
converting each formula into a corresponding first initial vector by a word frequency-inverse document frequency method;
generating an n-dimensional first formula vector corresponding to each formula according to the first initial vector corresponding to each formula and the total number n of characters by a bit filling method.
3. The formula similarity detection method according to claim 1, wherein generating a position vector corresponding to each formula according to each formula and a preset character information table comprises:
acquiring the position information of each character in each formula;
acquiring corresponding character information from a preset character information table according to the position information of each character;
acquiring a preset position vector dimension m and the number e of characters in each formula, wherein m is more than or equal to 1, and e is more than or equal to 1;
under the condition that m is smaller than e, character information corresponding to the first m characters in each formula is selected as a position vector of each formula;
and under the condition that m is greater than or equal to e, generating a position vector of each formula by a bit filling method according to character information corresponding to e characters in each formula and the position vector dimension m.
4. The formula similarity detection method of claim 1, wherein generating a second formula vector for each of the formulas from the first formula vector and the position vector for each of the formulas comprises:
and splicing the first formula vector corresponding to each formula and the position vector to generate a second formula vector corresponding to each formula.
5. A formula similarity detection device, comprising:
the selection module is configured to take one formula from the formula set to be detected and the standard formula set respectively;
the first generation module is configured to generate a corresponding first formula vector and a corresponding position vector according to each formula, wherein the first generation module comprises: an acquisition unit configured to acquire the total number n of characters of the formulas in the standard formula set, wherein n is more than or equal to 1; a first formula vector unit configured to convert each of the formulas into a corresponding first formula vector according to the total number of characters n; and a position vector unit configured to generate a position vector corresponding to each formula according to each formula and a preset character information table;
the second generation module is configured to generate a second formula vector corresponding to each formula according to the first formula vector and the position vector corresponding to each formula;
and the similarity determining module is configured to determine the similarity of the two formulas according to a second formula vector corresponding to each formula.
6. The formula similarity detection apparatus of claim 5,
the first formula vector unit includes:
a conversion subunit configured to convert each of the formulas into a corresponding first initial vector by a word frequency-inverse document frequency method;
and the first formula vector generation subunit is configured to generate an n-dimensional first formula vector corresponding to each formula by a bit filling method according to the first initial vector corresponding to each formula and the total number n of characters.
7. The formula similarity detection apparatus of claim 5,
the position vector unit includes:
a first obtaining subunit configured to obtain position information of each character in each formula;
a second obtaining subunit configured to obtain corresponding character information in a preset character information table according to the position information of each character;
the third acquisition subunit is configured to acquire a preset position vector dimension m and the number e of characters in each formula, wherein m is more than or equal to 1, and e is more than or equal to 1;
the generating position vector subunit is configured to select character information corresponding to the first m characters in each formula as a position vector of each formula under the condition that m is smaller than e; and under the condition that m is greater than or equal to e, generating a position vector of each formula by a bit filling method according to character information corresponding to e characters in each formula and the position vector dimension m.
8. The formula similarity detection apparatus of claim 5,
the second generation module is further configured to splice the first formula vector corresponding to each formula and the position vector to generate a second formula vector corresponding to each formula.
9. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method of any of claims 1-4.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-4.
CN202010296491.4A 2020-04-15 2020-04-15 Formula similarity detection method and device Active CN113535887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010296491.4A CN113535887B (en) 2020-04-15 2020-04-15 Formula similarity detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010296491.4A CN113535887B (en) 2020-04-15 2020-04-15 Formula similarity detection method and device

Publications (2)

Publication Number Publication Date
CN113535887A CN113535887A (en) 2021-10-22
CN113535887B true CN113535887B (en) 2024-04-02

Family

ID=78120149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010296491.4A Active CN113535887B (en) 2020-04-15 2020-04-15 Formula similarity detection method and device

Country Status (1)

Country Link
CN (1) CN113535887B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127222A (en) * 2016-06-13 2016-11-16 中国科学院信息工程研究所 The similarity of character string computational methods of a kind of view-based access control model and similarity determination methods
CN106651696A (en) * 2016-11-16 2017-05-10 福建天泉教育科技有限公司 Approximate question push method and system
US9990339B1 (en) * 2012-04-10 2018-06-05 Symantec Corporation Systems and methods for detecting character encodings of text streams
JP2018190358A (en) * 2017-05-12 2018-11-29 東日本旅客鉄道株式会社 Content selection method and content selection program
CN110032635A (en) * 2019-04-22 2019-07-19 齐鲁工业大学 One kind being based on the problem of depth characteristic fused neural network to matching process and device
CN110347802A (en) * 2019-07-17 2019-10-18 北京金山数字娱乐科技有限公司 A kind of text analyzing method and device
CN110516210A (en) * 2019-08-22 2019-11-29 北京影谱科技股份有限公司 The calculation method and device of text similarity

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11182395B2 (en) * 2018-05-15 2021-11-23 International Business Machines Corporation Similarity matching systems and methods for record linkage
US20200112475A1 (en) * 2018-10-08 2020-04-09 Ca, Inc. Real-time adaptive infrastructure scenario identification using syntactic grouping at varied similarity


Also Published As

Publication number Publication date
CN113535887A (en) 2021-10-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant