CN111125313B - Text identical content query method, device, equipment and storage medium - Google Patents

Text identical content query method, device, equipment and storage medium

Info

Publication number
CN111125313B
Authority
CN
China
Prior art keywords
text
target
string
matrix
target feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911354493.8A
Other languages
Chinese (zh)
Other versions
CN111125313A (en)
Inventor
王防修
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Polytechnic University
Original Assignee
Wuhan Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Polytechnic University filed Critical Wuhan Polytechnic University
Priority to CN201911354493.8A priority Critical patent/CN111125313B/en
Publication of CN111125313A publication Critical patent/CN111125313A/en
Application granted granted Critical
Publication of CN111125313B publication Critical patent/CN111125313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for querying identical text content, and belongs to the field of computer technology. The invention computes the longest common subsequence with the help of data compression, so that the memory used occupies only 1/8 of the space required by the uncompressed data. Furthermore, in order to handle longer common subsequences, the invention is further designed to store the data in an external memory instead of the internal memory, so that computing the longest common subsequence is no longer limited by the memory space and cannot fail because of insufficient memory; as long as the external memory space allows, a longer longest common subsequence can be computed, and the longest common subsequence of any two text files can therefore be obtained.

Description

Text identical content query method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for querying the same text content.
Background
With the widespread use of computers, people store various files, such as specifications, contracts and papers, on the computer. In practice, people often make various modifications to these files, and some modifications may affect only a small part of a file, so it is difficult to find the differences between files manually. A computer is therefore generally used to compare and search similar files. However, the related technology usually uses too much memory space when searching for identical content in files, and the limitation of the memory space makes the search for identical content incomplete.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the present invention is to provide a text identical content query method, aiming to solve the technical problem that the search for identical file content is incomplete because file searching occupies too much memory space.
To achieve the above object, the present invention provides a method for querying the same text content, the method comprising the steps of:
acquiring a first text character string and a second text character string;
comparing the first text character string with the second text character string, and obtaining a plurality of characteristic parameters according to a comparison result;
compressing the characteristic parameters to obtain target characteristic parameters;
constructing a target feature matrix according to the target feature parameters;
determining the longest common subsequence between the first text string and the second text string according to the target feature matrix;
and taking the content corresponding to the longest common subsequence as the identical text content.
Preferably, the step of comparing the first text string with the second text string to obtain a plurality of feature parameters according to the comparison result specifically includes:
comparing each element in the first text character string with each element in the second text character string, and determining the lengths of a plurality of common subsequences according to the comparison result;
and comparing the lengths of the common subsequences with each other, and determining a plurality of characteristic parameters according to the comparison result.
Preferably, the step of compressing the feature parameter to obtain a target feature parameter specifically includes:
acquiring position variables corresponding to elements in the first text character string and the second text character string;
calculating the position variable according to a preset algorithm to obtain target storage positions corresponding to the characteristic parameters;
and storing each characteristic parameter according to the target storage position to obtain the target characteristic parameter.
Preferably, the step of storing each feature parameter according to the target storage location to obtain a target feature parameter specifically includes:
carrying out numerical amplification on each characteristic parameter according to a preset proportion to obtain amplified characteristic parameters;
and storing the amplified characteristic parameters according to the target storage positions to obtain a target characteristic matrix.
Preferably, the step of constructing a target feature matrix according to the target feature parameters specifically includes:
taking the position variable as a row and a column of a matrix;
and assigning values to all variables in the matrix according to the target characteristic parameters to obtain a target characteristic matrix.
Preferably, after the step of constructing the target feature matrix according to the target feature parameters, the method further includes:
storing the target feature matrix to a preset position of an external memory;
the step of determining the longest common subsequence of the first text string and the second text string according to the target feature matrix specifically includes:
acquiring characteristic parameters corresponding to the target matrix from the external memory;
and determining the longest common subsequence of the first text character string and the second text character string according to the characteristic parameters.
Preferably, the step of obtaining the feature parameters corresponding to the target matrix from the external memory specifically includes:
acquiring byte data corresponding to the external memory;
acquiring corresponding reference characteristic parameters from the target characteristic matrix according to the byte data;
and carrying out numerical reduction on the reference characteristic parameters according to a preset proportion to obtain the characteristic parameters.
In addition, in order to achieve the above object, the present invention also provides a text identical content query device, which includes:
the acquisition module is used for acquiring the first text character string and the second text character string;
the comparison module is used for comparing the first text character string with the second text character string, and obtaining a plurality of characteristic parameters according to the comparison result;
the compression module is used for compressing the characteristic parameters to obtain target characteristic parameters;
the construction module is used for constructing a target feature matrix according to the target feature parameters;
a computing module for determining a longest common subsequence between the first text string and the second text string according to the target feature matrix;
and the output module is used for taking the content corresponding to the longest common subsequence as the identical text content.
In addition, to achieve the above object, the present invention also proposes a text identical content query device, the device comprising: a memory, a processor, and a text identical content query program stored on the memory and executable on the processor, the text identical content query program being configured to implement the steps of the text identical content query method as described above.
In addition, to achieve the above object, the present invention also proposes a storage medium having stored thereon a text identical content query program which, when executed by a processor, implements the steps of the method of text identical content query as described above.
According to the method, the first text character string and the second text character string are obtained; the first text character string is compared with the second text character string, and a plurality of characteristic parameters are obtained according to the comparison result; the characteristic parameters are compressed to obtain the target characteristic parameters; a target characteristic matrix is constructed according to the target characteristic parameters; the longest common subsequence between the first text character string and the second text character string is determined according to the target characteristic matrix; and the content corresponding to the longest common subsequence is taken as the identical text content. Because the characteristic parameters are compressed, the memory occupied by the target characteristic matrix is smaller; and because the longest common subsequence is obtained from the target characteristic matrix, the calculated longest common subsequence is more comprehensive and the accuracy of searching for identical text content is improved.
Drawings
FIG. 1 is a schematic diagram of a text-identical content query device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of the text identical content query method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of the text-identical content query method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of the text identical content query method according to the present invention;
fig. 5 is a block diagram showing the structure of a first embodiment of the text identical content query device according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a text-identical content query device in a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the text identical content querying device may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a stable Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the text-identical content querying device, and may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a text identical content query program may be included in the memory 1005 as one type of storage medium.
In the text-identical content querying device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 are provided in the text identical content query device of the present invention; the device invokes, through the processor 1001, the text identical content query program stored in the memory 1005 and executes the text identical content query method provided by the embodiments of the present invention.
The embodiment of the invention provides a text identical content query method, referring to fig. 2, fig. 2 is a flow chart of a first embodiment of the text identical content query method of the invention.
In this embodiment, the text identical content query method includes the following steps:
step S10: and acquiring the first text character string and the second text character string.
In this embodiment, a first text string and a second text string corresponding to the text content to be compared are obtained, where the first text string is X = (x1, x2, …, xm), i.e. a sequence of length m, and the second text string is Y = (y1, y2, …, yn), i.e. a sequence of length n.
Step S20: and comparing the first text character string with the second text character string, and obtaining a plurality of characteristic parameters according to a comparison result.
Step S30: compressing the characteristic parameters to obtain target characteristic parameters.
Step S40: and constructing a target feature matrix according to the target feature parameters.
In this embodiment, the step of comparing the first text string with the second text string and obtaining a plurality of characteristic parameters according to the comparison result specifically includes: comparing each element in the first text string with each element in the second text string, and determining the lengths of a plurality of common subsequences according to the comparison result; and comparing the lengths of the common subsequences with each other to determine a plurality of characteristic parameters. For example, the first text string X = (x1, x2, …, xm) has elements x1, x2, …, xm and the second text string Y = (y1, y2, …, yn) has elements y1, y2, …, yn; x1 is compared with y1, y2, …, yn respectively, x2 is compared with y1, y2, …, yn respectively, and so on, until each element in the first text string X has been compared with each element in the second text string Y. A plurality of common subsequence lengths are obtained from the comparison, these lengths are compared with each other, and the corresponding characteristic parameter is determined according to the comparison result. In this embodiment the comparison result falls into four cases, so there are four different characteristic parameters: when the common subsequence lengths are all 0, the corresponding characteristic parameter is t = 0; when the common subsequence length of the first i elements of X and the first j elements of Y equals the common subsequence length of the first i-1 elements of X and the first j-1 elements of Y plus 1, the corresponding characteristic parameter is t = 1; when the common subsequence length of the first i-1 elements of X and the first j elements of Y is greater than or equal to the common subsequence length of the first i elements of X and the first j-1 elements of Y, the corresponding characteristic parameter is t = 2; and when the common subsequence length of the first i-1 elements of X and the first j elements of Y is less than the common subsequence length of the first i elements of X and the first j-1 elements of Y, the corresponding characteristic parameter is t = 3. For example, assuming X = (6, 7, 8) and Y = (6, 8, 10), the common subsequence length of the first 3 elements of X, namely (6, 7, 8), and the first 2 elements of Y, namely (6, 8), is 2, since (6, 8) is a common subsequence of both.
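The four cases above correspond to the standard dynamic-programming recurrence for the longest common subsequence. The following is a minimal Python sketch of this comparison step; it assumes that recurrence, and the function and variable names (feature_parameters, c, t) are illustrative rather than taken from the patent.

    def feature_parameters(x, y):
        """Return an m x n matrix of characteristic parameters t in {0, 1, 2, 3}.

        c[i][j] holds the common subsequence length of the first i elements of x
        and the first j elements of y; t[i-1][j-1] records which case produced it.
        """
        m, n = len(x), len(y)
        c = [[0] * (n + 1) for _ in range(m + 1)]
        t = [[0] * n for _ in range(m)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:
                    c[i][j] = c[i - 1][j - 1] + 1
                    t[i - 1][j - 1] = 1              # the length grows by one
                elif c[i - 1][j] >= c[i][j - 1]:
                    c[i][j] = c[i - 1][j]
                    t[i - 1][j - 1] = 2              # drop the last element of x
                else:
                    c[i][j] = c[i][j - 1]
                    t[i - 1][j - 1] = 3              # drop the last element of y
                if c[i][j] == 0:
                    t[i - 1][j - 1] = 0              # no common subsequence yet
        return t

Each entry of t needs only two bits, which is what makes the compression described next possible.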
Further, after the characteristic parameters are obtained, they are compressed in the memory, with every four characteristic parameters stored in one byte, thereby reducing the occupied memory space, and a target characteristic matrix b_ij, i = 1, 2, 3, …, m, j = 1, 2, 3, …, n, is constructed from the compressed target characteristic parameters; the constructed target characteristic matrix has m rows and n columns, and the values of the elements in the matrix are the target characteristic parameters.
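A minimal Python sketch of this four-parameters-per-byte packing; the names (pack_row, pack_matrix) are illustrative and do not come from the patent.

    def pack_row(t_row):
        """Pack one row of 2-bit characteristic parameters into ceil(n/4) bytes."""
        packed = bytearray((len(t_row) + 3) // 4)
        for j, t in enumerate(t_row):                    # j is the 0-based column index
            byte_index, r = divmod(j, 4)                 # which byte, which 2-bit slot
            packed[byte_index] |= (t & 0b11) << (2 * r)  # occupy bits 2r+1 and 2r+2
        return packed

    def pack_matrix(t):
        """Pack every row of the characteristic parameter matrix."""
        return [pack_row(row) for row in t]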
Step S50: and determining the longest common subsequence between the first text character string and the second text character string according to the target feature matrix.
Step S60: and taking the content corresponding to the longest common subsequence as the identical text content.
In a specific implementation, the longest common subsequence between the first text string and the second text string can be determined according to the target feature matrix, by inferring it backwards from the common subsequence lengths of the first text string and the second text string corresponding to each target feature parameter in the matrix. For example, assume b_ij = 1 with i = m = 2 and j = n = 1, X = (A, B) and Y = (B). From the target feature matrix, b_21 gives the target feature parameter t = 1; from t = 1 it is known that the common subsequence length of the first 2 elements of X and the first 1 element of Y equals the common subsequence length of the first 1 element of X and the first 0 elements of Y plus 1, and since the latter length is 0, the common subsequence length of the first 2 elements of X and the first 1 element of Y is 1, i.e. the 2nd element of X is identical to the 1st element of Y. The length of the longest common subsequence is therefore 1, and because the 2nd element of X equals the 1st element of Y, the longest common subsequence of X and Y is (B), as indicated in the common subsequence length table of Table 1, where the figures represent the lengths of the common subsequences.
TABLE 1 Common subsequence length table
In addition, after the longest common subsequence is obtained, the text content corresponding to the longest common subsequence is found from the first text string or the second text string; it can be understood that, because the content is identical, the text content corresponding to the longest common subsequence can be found in both the first text string and the second text string. For example, two texts A and B are compared, with the first text string X = "today 40 °C, the weather is really hot" corresponding to text A and the second text string Y = "today the weather is very hot" corresponding to text B; the comparison yields the longest common subsequence "today the weather hot", so "today the weather hot" is the identical content found between texts A and B.
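A minimal sketch of the backtracking described above, assuming the t codes produced by the comparison sketch earlier and two Python strings as input; the names are illustrative.

    def longest_common_subsequence(x, y, t):
        """Walk the characteristic matrix from (m, n) backwards to recover one LCS."""
        i, j = len(x), len(y)
        lcs = []
        while i > 0 and j > 0:
            code = t[i - 1][j - 1]
            if code == 1:        # x[i-1] == y[j-1]: part of the subsequence
                lcs.append(x[i - 1])
                i -= 1
                j -= 1
            elif code == 2:      # the length came from the row above
                i -= 1
            elif code == 3:      # the length came from the column to the left
                j -= 1
            else:                # code == 0: no common content remains
                break
        return "".join(reversed(lcs))

    x, y = "ABCBDAB", "BDCABA"
    t = feature_parameters(x, y)                 # from the earlier sketch
    print(longest_common_subsequence(x, y, t))   # -> "BCBA", one LCS of x and y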
This embodiment obtains the first text string and the second text string corresponding to the texts, compares each element in the first text string with each element in the second text string, determines the lengths of a plurality of common subsequences according to the comparison results, compares these lengths with each other and determines a plurality of characteristic parameters according to the comparison results, and compresses the characteristic parameters to reduce the occupied memory space. A target feature matrix is then constructed from the compressed target feature parameters, and the longest common subsequence between the first text string and the second text string is obtained from the target feature matrix, so that the identical text content corresponding to the longest common subsequence is obtained. The memory space occupied when calculating the longest common subsequence is thereby greatly reduced, and the identical content between the texts is found more comprehensively and accurately.
The embodiment of the invention provides a text identical content query method, referring to fig. 3, fig. 3 is a flow chart of a second embodiment of the text identical content query method of the invention.
Based on the first embodiment, the step S30 specifically includes:
step S301: and acquiring position variables corresponding to each character string in the first text character string and the second text character string.
Step S302: and calculating the position variable according to a preset algorithm to obtain target storage positions corresponding to the characteristic parameters.
Step S303: and carrying out numerical amplification on each characteristic parameter according to a preset proportion to obtain amplified characteristic parameters.
Step S304: and storing the amplified characteristic parameters according to the target storage positions to obtain target characteristic parameters.
In the present embodiment, as in the first embodiment, the first text string is X = (x1, x2, …, xm) and the second text string is Y = (y1, y2, …, yn), so the position variables corresponding to the elements in the first text string and the second text string are m and n, with i = 1, 2, …, m and j = 1, 2, …, n. After the position variables are obtained, the target storage position of each characteristic parameter is calculated according to a preset algorithm; the calculation yields, for the element in row i and column j, the byte of row i in which the parameter is placed together with a remainder r, and r indicates that the characteristic parameter t should be stored in the two binary bits 2r+1 and 2r+2 of that byte.
Further, after the target storage position of the characteristic parameter t has been calculated, the characteristic parameter is compressed; specifically, each characteristic parameter is amplified according to a preset ratio, the characteristic parameter t being amplified to t' = 2^(2r) × t to obtain the amplified characteristic parameter. Through this amplification, the amplified value replaces the original characteristic parameter in the compressed storage; the amplified characteristic parameters are then stored according to the calculated target storage positions, with 4 target characteristic parameters stored in 1 byte.
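The exact storage-position formula is given in the patent as a figure and is not reproduced here; the sketch below is one consistent interpretation, assuming the byte index ceil(j/4) within row i and the remainder r = (j - 1) mod 4, so that the shifted parameter lands in bits 2r+1 and 2r+2. The names are illustrative.

    import math

    def target_storage_position(j):
        """Return (byte_index, r) for the 1-based column j of a row."""
        byte_index = math.ceil(j / 4)    # which byte of the row holds column j
        r = (j - 1) % 4                  # 2-bit slot inside that byte, 0..3
        return byte_index, r

    def amplify(t, r):
        """Scale t by 2**(2r) so it occupies bits 2r+1 and 2r+2 of its byte."""
        return t << (2 * r)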
Further, the step S40 specifically includes:
step S401: and taking the position variables corresponding to the character strings in the first text character string and the second text character string as rows and columns of a matrix respectively.
Step S402: and assigning values to all variables in the matrix according to the target characteristic parameters to obtain a target characteristic matrix.
In this embodiment, a matrix is constructed with the position variable corresponding to the first text string as the rows and the position variable corresponding to the second text string as the columns, and each variable in the matrix is assigned a value according to the target feature parameters to obtain the target feature matrix. For example, for a first text string X = (x1, x2, x3) and a second text string Y = (y1, y2, y3, y4, y5), m = 3 and n = 5; since every four target feature parameters are stored in one byte, the matrix as stored is a three-row, two-column matrix of bytes, and the values of the elements in the matrix are the stored target feature parameters.
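As a small usage illustration continuing the pack_row/pack_matrix sketch above, a hypothetical 3 x 5 parameter matrix (the values below are made up for illustration only) indeed packs into three rows of two bytes each:

    t = [
        [1, 2, 3, 0, 2],
        [2, 1, 3, 2, 0],
        [0, 2, 1, 3, 2],
    ]
    packed = pack_matrix(t)                 # pack_matrix from the earlier sketch
    print([len(row) for row in packed])     # [2, 2, 2]: ceil(5 / 4) = 2 bytes per row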
According to this embodiment, the position variables corresponding to the elements in the first text string and the second text string are obtained, the target storage positions are calculated according to a preset algorithm, each characteristic parameter is numerically amplified according to a preset ratio to obtain the amplified characteristic parameters, and the amplified characteristic parameters are stored according to the target storage positions to obtain the target characteristic parameters; a target characteristic matrix is then constructed according to the target characteristic parameters and the position variables. The characteristic parameters are thereby compressed and stored more accurately, and the memory space occupied by calculating the longest common subsequence is reduced.
The embodiment of the invention provides a text identical content query method, referring to fig. 4, fig. 4 is a flow chart of a third embodiment of the text identical content query method of the invention.
Based on the first embodiment, after the step S40, the method further includes:
step S403: and storing the target feature matrix to a preset position of an external memory.
In this embodiment, in order to further reduce the occupation of the memory space, the target feature matrix is stored in the external memory; the preset position is an offset, determined by the row index i and the number of columns n, from the start of the external memory file, and the target feature matrix is stored at this position.
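The precise byte-offset formula is again given as a figure in the patent; the sketch below assumes that row i of the packed matrix is written at offset (i - 1) × ceil(n / 4) from the start of the external memory file, which is consistent with the four-parameters-per-byte layout described above. The names are illustrative.

    import math

    def write_matrix_to_file(path, packed_rows, n):
        """Write the packed rows to external storage, one row per fixed-size slot."""
        row_bytes = math.ceil(n / 4)
        with open(path, "wb") as f:
            for i, row in enumerate(packed_rows, start=1):
                f.seek((i - 1) * row_bytes)   # preset position of row i in the file
                f.write(row)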
Further, the step S50 specifically includes:
step S501: and acquiring the byte data corresponding to the external memory.
Step S502: and acquiring corresponding reference characteristic parameters from the target characteristic matrix according to the byte data.
Step S503: and carrying out numerical reduction on the reference characteristic parameters according to a preset proportion to obtain the characteristic parameters.
Step S504: and determining the longest common subsequence of the first text character string and the second text character string according to the characteristic parameters.
In this embodiment, after the target feature matrix has been stored in the external memory, the corresponding byte data needs to be fetched from the external memory when determining the longest common subsequence, at a position offset from the start of the external memory file. The byte data is a binary digit string; the reference characteristic parameter t' is obtained from the target feature matrix according to the byte data, and t' is numerically reduced according to the preset ratio to recover the characteristic parameter t. Finally, the longest common subsequence between the first text string and the second text string can be determined on the basis of the characteristic parameters. For example, assuming the characteristic parameter is t = 1, it is known from t = 1 that the common subsequence length of the first 2 elements of X and the first 1 element of Y equals the common subsequence length of the first 1 element of X and the first 0 elements of Y plus 1; assuming X = (A, B) and Y = (B), the common subsequence length of the first 1 element of X and the first 0 elements of Y is 0, so the common subsequence length of the first 2 elements of X and the first 1 element of Y is 1, i.e. the 2nd element of X is identical to the 1st element of Y. The length of the longest common subsequence is therefore 1, and since the 2nd element of X equals the 1st element of Y, the longest common subsequence of X and Y is (B).
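A minimal sketch of this read-back step under the same assumed layout as the write sketch above: fetch the byte that holds column j of row i, then shift out the 2r low-order bits to recover the original two-bit parameter. The names are illustrative.

    import math

    def read_parameter(path, i, j, n):
        """Read the characteristic parameter for position (i, j) from the file."""
        row_bytes = math.ceil(n / 4)
        byte_index, r = (j - 1) // 4, (j - 1) % 4
        with open(path, "rb") as f:
            f.seek((i - 1) * row_bytes + byte_index)
            byte = f.read(1)[0]
        return (byte >> (2 * r)) & 0b11       # numerical reduction by 2**(2r)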
An embodiment of the present invention further provides a text identical content query device.
referring to fig. 5, fig. 5 is a block diagram showing the structure of a first embodiment of the connection device for text-identical content query according to the present invention.
As shown in fig. 5, the text identical content query device provided by the embodiment of the invention includes:
the acquiring module 10 is configured to acquire the first text string and the second text string.
In this embodiment, a first text string and a second text string corresponding to the text content to be compared are obtained, where the first text string is X = (x1, x2, …, xm), i.e. a sequence of length m, and the second text string is Y = (y1, y2, …, yn), i.e. a sequence of length n.
And the comparison module 20 is used for comparing the first text character string with the second text character string, and obtaining a plurality of characteristic parameters according to the comparison result.
And the compression module 30 is used for compressing the characteristic parameters to obtain target characteristic parameters.
A construction module 40, configured to construct a target feature matrix according to the target feature parameters.
In this embodiment, the step of comparing the first text string with the second text string and obtaining a plurality of characteristic parameters according to the comparison result specifically includes: comparing each element in the first text string with each element in the second text string, and determining the lengths of a plurality of common subsequences according to the comparison result; and comparing the lengths of the common subsequences with each other to determine a plurality of characteristic parameters. For example, the first text string X = (x1, x2, …, xm) has elements x1, x2, …, xm and the second text string Y = (y1, y2, …, yn) has elements y1, y2, …, yn; x1 is compared with y1, y2, …, yn respectively, x2 is compared with y1, y2, …, yn respectively, and so on, until each element in the first text string X has been compared with each element in the second text string Y. A plurality of common subsequence lengths are obtained from the comparison, these lengths are compared with each other, and the corresponding characteristic parameter is determined according to the comparison result. In this embodiment the comparison result falls into four cases, so there are four different characteristic parameters: when the common subsequence lengths are all 0, the corresponding characteristic parameter is t = 0; when the common subsequence length of the first i elements of X and the first j elements of Y equals the common subsequence length of the first i-1 elements of X and the first j-1 elements of Y plus 1, the corresponding characteristic parameter is t = 1; when the common subsequence length of the first i-1 elements of X and the first j elements of Y is greater than or equal to the common subsequence length of the first i elements of X and the first j-1 elements of Y, the corresponding characteristic parameter is t = 2; and when the common subsequence length of the first i-1 elements of X and the first j elements of Y is less than the common subsequence length of the first i elements of X and the first j-1 elements of Y, the corresponding characteristic parameter is t = 3. For example, assuming X = (6, 7, 8) and Y = (6, 8, 10), the common subsequence length of the first 3 elements of X, namely (6, 7, 8), and the first 2 elements of Y, namely (6, 8), is 2, since (6, 8) is a common subsequence of both.
Further, after the characteristic parameters are obtained, they are compressed in the memory, with every four characteristic parameters stored in one byte, thereby reducing the occupied memory space, and a target characteristic matrix b_ij, i = 1, 2, 3, …, m, j = 1, 2, 3, …, n, is constructed from the compressed target characteristic parameters; the constructed target characteristic matrix has m rows and n columns, and the values of the elements in the matrix are the target characteristic parameters.
A calculation module 50 is configured to determine a longest common subsequence between the first text string and the second text string according to the target feature matrix.
And the output module 60 is used for taking the content corresponding to the longest common subsequence as the identical text content.
In a specific implementation, the longest common subsequence between the first text string and the second text string can be determined according to the target feature matrix, by inferring it backwards from the common subsequence lengths of the first text string and the second text string corresponding to each target feature parameter in the matrix. For example, assume b_ij = 1 with i = m = 2 and j = n = 1, X = (A, B) and Y = (B). From the target feature matrix, b_21 gives the target feature parameter t = 1; from t = 1 it is known that the common subsequence length of the first 2 elements of X and the first 1 element of Y equals the common subsequence length of the first 1 element of X and the first 0 elements of Y plus 1, and since the latter length is 0, the common subsequence length of the first 2 elements of X and the first 1 element of Y is 1, i.e. the 2nd element of X is identical to the 1st element of Y. The length of the longest common subsequence is therefore 1, and because the 2nd element of X equals the 1st element of Y, the longest common subsequence of X and Y is (B), as indicated in Table 1, where the numbers represent the lengths of the common subsequences.
TABLE 1 Common subsequence length table
In addition, after the longest common subsequence is obtained, the text content corresponding to the longest common subsequence is found from the first text string or the second text string; it can be understood that, because the content is identical, the text content corresponding to the longest common subsequence can be found in both the first text string and the second text string. For example, two texts A and B are compared, with the first text string X = "today 40 °C, the weather is really hot" corresponding to text A and the second text string Y = "today the weather is very hot" corresponding to text B; the comparison yields the longest common subsequence "today the weather hot", so "today the weather hot" is the identical content found between texts A and B.
This embodiment obtains the first text string and the second text string corresponding to the texts, compares each element in the first text string with each element in the second text string, determines the lengths of a plurality of common subsequences according to the comparison results, compares these lengths with each other and determines a plurality of characteristic parameters according to the comparison results, and compresses the characteristic parameters to reduce the occupied memory space. A target feature matrix is then constructed from the compressed target feature parameters, and the longest common subsequence between the first text string and the second text string is obtained from the target feature matrix, so that the identical text content corresponding to the longest common subsequence is obtained. The memory space occupied when calculating the longest common subsequence is thereby greatly reduced, and the identical content between the texts is found more comprehensively and accurately.
In an embodiment, the comparison module 20 is further configured to compare each element in the first text string with each element in the second text string, and determine the lengths of a plurality of common subsequences according to the comparison result; and to compare the lengths of the common subsequences with each other and determine a plurality of characteristic parameters according to the comparison result.
In an embodiment, the compression module 30 is further configured to obtain a position variable corresponding to each of the first text string and the second text string; calculating the position variable according to a preset algorithm to obtain target storage positions corresponding to the characteristic parameters; and storing each characteristic parameter according to the target storage position to obtain the target characteristic parameter.
In an embodiment, the compression module 30 is further configured to perform numerical amplification on each of the feature parameters according to a preset ratio, so as to obtain amplified feature parameters; and storing the amplified characteristic parameters according to the target storage positions to obtain target characteristic parameters.
In an embodiment, the construction module 40 is further configured to use the position variables corresponding to the elements in the first text string and the second text string as the rows and columns of a matrix respectively; and to assign values to all variables in the matrix according to the target characteristic parameters to obtain a target characteristic matrix.
In an embodiment, the device further includes a storage module, configured to store the target feature matrix to a preset position of an external memory; the computing module 50 is further configured to obtain, from the external memory, the characteristic parameters corresponding to the target matrix, and to determine the longest common subsequence of the first text string and the second text string according to the characteristic parameters.
In an embodiment, the storage module is further configured to obtain byte data corresponding to the external memory; acquiring corresponding reference characteristic parameters from the target characteristic matrix according to the byte data; and carrying out numerical reduction on the reference characteristic parameters according to a preset proportion to obtain the characteristic parameters.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a text identical content query program, and the text identical content query program realizes the steps of the text identical content query method when being executed by a processor.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, technical details that are not described in detail in this embodiment may refer to the text-identical content query method provided in any embodiment of the present invention, and are not described herein again.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, and of course may also be implemented by means of hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium (e.g. ROM (Read-Only Memory)/RAM, magnetic disk, or optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (9)

1. A text identical content query method, the method comprising:
acquiring a first text character string and a second text character string;
comparing the first text character string with the second text character string, and obtaining a plurality of characteristic parameters according to a comparison result;
obtaining position variables m and n corresponding to the elements in the first text string X = (x1, x2, …, xm) and the second text string Y = (y1, y2, …, yn), wherein i = 1, 2, …, m and j = 1, 2, …, n;
calculating the position variable according to a preset algorithm to obtain target storage positions corresponding to the characteristic parameters, wherein the remainder r indicates that the characteristic parameter t should be stored in the two binary bits 2r+1 and 2r+2 of the corresponding byte of row i;
storing each characteristic parameter according to the target storage position to obtain a target characteristic parameter;
constructing a target feature matrix according to the target feature parameters;
determining the longest common subsequence between the first text string and the second text string according to the target feature matrix;
taking the content corresponding to the longest common subsequence as the identical text content;
after the target feature matrix is constructed according to the target feature parameters, the method further comprises the following steps:
storing the target feature matrix in the external memory at a position offset from the start of the external memory file, the offset being determined by i and n, wherein i represents the index of the i-th character of the first text string and n represents the number of characters of the second text string.
2. The text identical content query method according to claim 1, wherein the step of comparing the first text string with the second text string to obtain a plurality of characteristic parameters according to the comparison result specifically comprises:
comparing each element in the first text character string with each element in the second text character string, and determining the lengths of a plurality of common subsequences according to the comparison result;
and comparing the lengths of the common subsequences with each other, and determining a plurality of characteristic parameters according to the comparison result.
3. The text identical content query method according to claim 1, wherein the step of storing each characteristic parameter according to the target storage position to obtain the target characteristic parameter specifically comprises:
carrying out numerical amplification on each characteristic parameter according to a preset proportion to obtain amplified characteristic parameters;
and storing the amplified characteristic parameters according to the target storage positions to obtain a target characteristic matrix.
4. The text-identical content query method of claim 1, wherein the step of constructing a target feature matrix according to the target feature parameters specifically comprises:
taking the position variable as a row and a column of a matrix;
and assigning values to all variables in the matrix according to the target characteristic parameters to obtain a target characteristic matrix.
5. The text-identical content query method according to claim 1, wherein after the step of constructing a target feature matrix from the target feature parameters, further comprising:
storing the target feature matrix to a preset position of an external memory;
the step of determining the longest common subsequence of the first text string and the second text string according to the target feature matrix specifically includes:
acquiring characteristic parameters corresponding to the target characteristic matrix from the external memory;
and determining the longest common subsequence of the first text character string and the second text character string according to the characteristic parameters.
6. The method for querying the text-identical content according to claim 5, wherein the step of obtaining the feature parameters corresponding to the target feature matrix from the external memory specifically comprises:
acquiring byte data corresponding to the external memory;
acquiring corresponding reference characteristic parameters from the target characteristic matrix according to the byte data;
and carrying out numerical reduction on the reference characteristic parameters according to a preset proportion to obtain the characteristic parameters.
7. A text identical content query device, the device comprising:
the acquisition module is used for acquiring the first text character string and the second text character string;
the comparison module is used for comparing the first text character string with the second text character string, and obtaining a plurality of characteristic parameters according to the comparison result;
a compression module, configured to obtain position variables m and n corresponding to the elements in the first text string X = (x1, x2, …, xm) and the second text string Y = (y1, y2, …, yn), wherein i = 1, 2, …, m and j = 1, 2, …, n;
the compression module is also used for calculating the position variable according to a preset algorithm to obtain target storage positions corresponding to the characteristic parameters, wherein the remainder r indicates that the characteristic parameter t should be stored in the two binary bits 2r+1 and 2r+2 of the corresponding byte of row i;
the compression module is also used for storing each characteristic parameter according to the target storage position to obtain a target characteristic parameter;
the construction module is used for constructing a target feature matrix according to the target feature parameters;
a computing module for determining a longest common subsequence between the first text string and the second text string according to the target feature matrix;
the output module is used for taking the content corresponding to the longest common subsequence as the identical text content;
after the target feature matrix is constructed according to the target feature parameters, the method further comprises the following steps:
storing the target feature matrix in the external memory at a position offset from the start of the external memory file, the offset being determined by i and n, wherein i represents the index of the i-th character and n represents the number of characters of the second text string.
8. A text-identical content querying device, the device comprising: a memory, a processor and a text identical content query program stored on the memory and executable on the processor, the text identical content query program being configured to implement the steps of the text identical content query method as claimed in any one of claims 1 to 6.
9. A storage medium having stored thereon a text identical content query program which, when executed by a processor, implements the steps of the text identical content query method of any one of claims 1 to 6.
CN201911354493.8A 2019-12-24 2019-12-24 Text identical content query method, device, equipment and storage medium Active CN111125313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911354493.8A CN111125313B (en) 2019-12-24 2019-12-24 Text identical content query method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911354493.8A CN111125313B (en) 2019-12-24 2019-12-24 Text identical content query method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111125313A CN111125313A (en) 2020-05-08
CN111125313B true CN111125313B (en) 2023-12-01

Family

ID=70503277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911354493.8A Active CN111125313B (en) 2019-12-24 2019-12-24 Text identical content query method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111125313B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184165A (en) * 2011-04-22 2011-09-14 烽火通信科技股份有限公司 LCS (Longest Common Subsequence) algorithm for saving memory
WO2016041428A1 (en) * 2014-09-17 2016-03-24 北京搜狗科技发展有限公司 Method and device for inputting english
CN106610965A (en) * 2015-10-21 2017-05-03 北京瀚思安信科技有限公司 Text string common sub sequence determining method and equipment
CN106897258A (en) * 2017-02-27 2017-06-27 郑州云海信息技术有限公司 The computational methods and device of a kind of text otherness
WO2018094764A1 (en) * 2016-11-23 2018-05-31 深圳大学 Method and device for pattern string match verification based on cloud service
CN108763569A (en) * 2018-06-05 2018-11-06 北京玄科技有限公司 Text similarity computing method and device, intelligent robot
CN108829780A (en) * 2018-05-31 2018-11-16 北京万方数据股份有限公司 Method for text detection, calculates equipment and computer readable storage medium at device
CN109857366A (en) * 2019-02-20 2019-06-07 武汉轻工大学 Insertion sort method, system, equipment and storage medium based on external memory

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7487169B2 (en) * 2004-11-24 2009-02-03 International Business Machines Corporation Method for finding the longest common subsequences between files with applications to differential compression
US8521759B2 (en) * 2011-05-23 2013-08-27 Rovi Technologies Corporation Text-based fuzzy search
US20170116238A1 (en) * 2015-10-26 2017-04-27 Intelliresponse Systems Inc. System and method for determining common subsequences

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184165A (en) * 2011-04-22 2011-09-14 烽火通信科技股份有限公司 LCS (Longest Common Subsequence) algorithm for saving memory
WO2016041428A1 (en) * 2014-09-17 2016-03-24 北京搜狗科技发展有限公司 Method and device for inputting english
CN106610965A (en) * 2015-10-21 2017-05-03 北京瀚思安信科技有限公司 Text string common sub sequence determining method and equipment
WO2018094764A1 (en) * 2016-11-23 2018-05-31 深圳大学 Method and device for pattern string match verification based on cloud service
CN106897258A (en) * 2017-02-27 2017-06-27 郑州云海信息技术有限公司 The computational methods and device of a kind of text otherness
CN108829780A (en) * 2018-05-31 2018-11-16 北京万方数据股份有限公司 Method for text detection, calculates equipment and computer readable storage medium at device
CN108763569A (en) * 2018-06-05 2018-11-06 北京玄科技有限公司 Text similarity computing method and device, intelligent robot
CN109857366A (en) * 2019-02-20 2019-06-07 武汉轻工大学 Insertion sort method, system, equipment and storage medium based on external memory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a random path selection algorithm based on the longest common subsequence; Wang Fangxiu (王防修); Computer Engineering and Design (《计算机工程与设计》); Vol. 35, No. 6; pp. 2170-2173 *

Also Published As

Publication number Publication date
CN111125313A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
Wang et al. The spectrum of genomic signatures: from dinucleotides to chaos game representation
US8433714B2 (en) Data cell cluster identification and table transformation
JP4438448B2 (en) Structured document display processing device, structured document display method, structured document display program
CN108664582B (en) Enterprise relation query method and device, computer equipment and storage medium
WO2020258491A1 (en) Universal character recognition method, apparatus, computer device, and storage medium
JPWO2004062110A1 (en) Data compression method, program and apparatus
CN110968585B (en) Storage method, device, equipment and computer readable storage medium for alignment
CN111414379A (en) Serial number generation method, device, equipment and computer readable storage medium
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN113297269A (en) Data query method and device
CN111125313B (en) Text identical content query method, device, equipment and storage medium
CN109857366B (en) Insertion ordering method, system, equipment and storage medium based on external memory
CN109634955B (en) Data storage method, data retrieval method and device
CN113190549B (en) Multidimensional table data calling method, multidimensional table data calling device, server and storage medium
CN112698877B (en) Data processing method and system
CN111009247B (en) Speech recognition correction method, device and storage medium
CN112749539B (en) Text matching method, text matching device, computer readable storage medium and computer equipment
CN112989185A (en) Information pushing method and device, computer equipment and storage medium
CN105468603A (en) Data selection method and apparatus
JP7231024B2 (en) Information processing program, information processing method, and information processing apparatus
CN116360703A (en) Data compression method, electronic equipment and storage medium
CN116861472A (en) Data matching method, device, computer equipment and storage medium
JP6420728B2 (en) Mask processing system, mask processing method, user terminal, and server
CN114840871A (en) Data desensitization method and device, electronic equipment and storage medium
CN114218898A (en) Online collaborative document editing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant