CN111125308B

CN111125308B - Lightweight text fuzzy search method supporting semantic association

Info

Publication number: CN111125308B
Application number: CN201911331527.1A
Authority: CN
Inventors: 裴正奇; 黄梓忱; 段必超; 段朦丽; 朱斌斌
Original assignee: Shenzhen Qianhai Heidun Technology Co ltd
Current assignee: Shenzhen Qianhai Heidun Technology Co ltd
Priority date: 2019-12-21
Filing date: 2019-12-21
Publication date: 2023-02-21
Anticipated expiration: 2039-12-21
Also published as: CN111125308A

Abstract

The invention discloses a lightweight text fuzzy search method supporting semantic association. The method offers a higher degree of fuzziness: it improves on traditional sentence retrieval algorithms, can retrieve both sentences that exactly match the target sentence and sentences that are highly similar to it, and allows the required degree of similarity to the target sentence to be adjusted flexibly. It runs fast: brute-force enumeration is abandoned in favor of semantic association maps, convolution, dynamic programming and similar methods, which optimizes the search process and greatly increases search speed. The system is lightweight: its size is reduced, internal and external optimizations target lightweight users and usage scenarios, the whole calculation process is optimized, and the memory burden is reduced. The invention also provides an association mode that requires no on-site computation, so a user can call the association module during fuzzy search without occupying local computing power. The system is flexible and easy to call from different applications, because the whole algorithm module is packaged behind an interface.

Description

Lightweight text fuzzy search method supporting semantic association

Technical Field

The invention relates to the relevant field of text fuzzy search, in particular to a lightweight text fuzzy search method supporting semantic association.

Background

Fuzzy text search is used in many settings. As networks have developed, the amount of text generated online has grown explosively, and harmful or destabilizing information has spread with it, so much content must be reviewed before it can be displayed on a public network platform. In the early days, most such review was done manually, which is inefficient and cannot keep up with the speed at which online text is generated. Many scholars and companies have therefore turned their attention to fuzzy text search, that is, fuzzily finding (fuzzy matching) a given keyword or key sentence in a large amount of text. Early text matching relied mainly on algorithms such as BF (Brute Force), RK (Rabin-Karp), KMP (Knuth-Morris-Pratt) and BM (Boyer-Moore). With these, a match succeeds only if a character string identical to the keyword is found in the text; semantic information is not considered, so they cannot accomplish fuzzy matching. The main methods for fuzzy string matching include the bit-vector method and the filtering method. The bit-vector method requires a large amount of space, which is a problem for small-memory machines such as embedded systems.

The existing text fuzzy search has the following defects:

1. most current fuzzy text search methods are not truly fuzzy: the degree of fuzziness is low, and semantic associations such as synonyms and associated words of the search keywords are poorly supported. Synonyms of a keyword are therefore filtered out even though practical applications may need to retain them, which causes mis-filtering and lowers recall. In addition, when a keyword or key sentence is searched for in a relatively long text, the text is processed by brute force, so efficiency is low; in other words, the methods are not lightweight enough;

2. most current fuzzy text search methods do not adequately address the two main problems of fuzzy string matching, namely space and time. Processing text requires a large amount of computation and storage, and the time and space complexity of existing fuzzy matching algorithms often fails to meet real online requirements;

3. most current fuzzy text search methods cannot capture sentence-level features. Put simply, if the text being searched for does not occur in the text to be searched, the result is empty. Yet texts with a similar meaning may be present, and in practical applications the search is then expected to return those similar texts rather than an empty result.

Therefore, a lightweight text fuzzy search method supporting semantic association is provided.

Disclosure of Invention

The invention aims to provide a lightweight text fuzzy search method supporting semantic association to solve the problems in the background technology.

In order to achieve the purpose, the invention adopts the following technical scheme:

A lightweight text fuzzy search method supporting semantic association, the search method comprising the following steps:

S1, modeling the technical scenario: the text fuzzy search problem can be converted into the problem of querying a short text within a long text, where both the long text and the short text are sequences of characters;

S2, building the semantic association map in advance: to keep the computation lightweight, the semantic association map is built and stored beforehand so that it can be called directly, and is not computed on site;

S3, the fuzzy search scheme: a long text S = {S1, S2, S3, …, Sn} and a search request Q = {Q1, Q2, Q3, …, Qm} are given;

S4, automatic division of the search task: a long text S of considerable length is divided automatically by segmenting it at specific terminators, and the operation of S3 is then performed segment by segment;

S5, internal acceleration and multithread acceleration: internal acceleration processing is applied to each link of the algorithm scheme in S3;

S6, interface packaging: to make the text fuzzy search module easy to apply flexibly, it can be packaged as an interface product whose input parameter format is bluE(S, Q, autoSplit, isImagine, stop_words), where autoSplit and isImagine are both Boolean values; autoSplit determines whether the automatic task division mechanism is used, isImagine determines whether the association mode is enabled, and stop_words are the user-defined terminators used in autoSplit mode.

Preferably, the characters in S1 include Chinese characters, English letters, numerals, and special characters.

Preferably, the fuzzy search scheme in S3 depends on whether the user enables the semantic association function: if it is not enabled, the fuzzy search is character-based and the constituent units of S and Q are the characters themselves; if it is enabled, S and Q must first be segmented into words.

Preferably, the algorithm for fuzzy search in S3 includes a multi-level convolution character density weighted matching algorithm and a near-diagonal common subsequence matching algorithm.

Preferably, before performing the operation of S3, a "first glance" determination may be made, the idea being: blu(S, Q) == True if len(set(Q) & set(S)) > len(set(Q)) × 0.5.

Preferably, in the convolution operation of the multi-level convolution character density weighted matching algorithm, it may be determined in advance whether S_conv has enough non-zero cells; if not, the convolution operation is not performed on that S_conv.

Preferably, the convolution summation operation of the multi-level convolution character density weighted matching algorithm is assisted by an external tool such as numpy.
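The "first glance" determination described above can be sketched in a few lines of Python. This is only an illustration of the stated inequality; the function name first_glance and the ratio parameter are assumptions introduced here and are not part of the packaged interface.

```python
def first_glance(S, Q, ratio=0.5):
    """Return True when more than `ratio` of Q's distinct units also occur in S.

    S and Q may be strings (character units) or lists of words (word units).
    """
    q_units = set(Q)
    if not q_units:
        # An empty query can never pass the check.
        return False
    return len(q_units & set(S)) > len(q_units) * ratio


# Character-level check: every distinct character of "quick" occurs in S.
print(first_glance("the quick brown fox", "quick"))
# None of the characters of "zzzzy" occur in S, so the search can be skipped.
print(first_glance("the quick brown fox", "zzzzy"))
```

Because the check only builds two sets, it costs far less than the full matching step and can be used to skip segments that clearly cannot match.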

Compared with the prior art, the invention has the following beneficial effects:

1. a higher degree of fuzziness: traditional sentence retrieval algorithms are improved, so that both sentences that exactly match the target sentence and sentences highly similar to it can be retrieved, and the required degree of similarity to the target sentence can be adjusted flexibly;

2. fast operation: brute-force enumeration is abandoned in favor of semantic association maps, convolution, dynamic programming and similar methods, which optimizes the search process and greatly increases search speed;

3. a lightweight system: the system size is reduced, internal and external optimizations target lightweight users and usage scenarios, the whole calculation process is optimized, and the memory burden is reduced. The invention also provides an association mode that requires no on-site computation, so a user can call the association module during fuzzy search without occupying local computing power;

4. a flexible system that is easy to call from different applications: the whole algorithm module is packaged behind an interface, so users can directly call individual modules to meet their actual needs;

5. support for semantic association, with fuzzy matching at the level of synonyms, near-synonyms and associated words: compared with traditional semantic retrieval methods, the invention provides a fuzzy retrieval method better suited to everyday use, which intelligently matches the words of the target text against their synonyms, near-synonyms and associated words in the searched text;

6. text positioning: the fuzzy retrieval method sorts the text segments found by their similarity to the target text, from high to low, and gives the position of each similar segment and its degree of match with the target text.

Drawings

FIG. 1 is a table-cell representation for the lightweight text fuzzy search method supporting semantic association proposed by the present invention, where S_conv for Si is T[: i-k//2;

FIG. 2 is a diagram of the convolution score distribution in the lightweight text fuzzy search method supporting semantic association proposed by the present invention;

FIG. 3 is a diagram of the running result of S1 + Q1 in Example one of the lightweight text fuzzy search method supporting semantic association proposed by the present invention;

FIG. 4 is a diagram of the running result of S1 + Q2 in Example one of the lightweight text fuzzy search method supporting semantic association proposed by the present invention;

FIG. 5 is a diagram of the running result of S2 + Q3 in Example two of the lightweight text fuzzy search method supporting semantic association proposed by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

Referring to FIGS. 1-2, the invention provides a lightweight text fuzzy search method supporting semantic association, the search method comprising the following steps:

S1, modeling the technical scenario: the text fuzzy search problem can be converted into the problem of querying a short text within a long text, where both the long text and the short text are sequences of characters.

S2, building the semantic association map in advance: to keep the computation lightweight, the semantic association map is built and stored beforehand so that it can be called directly, and is not computed on site.

S3, the fuzzy search scheme: a long text S = {S1, S2, S3, …, Sn} and a search request Q = {Q1, Q2, Q3, …, Qm} are given. Word segmentation of S and Q can be performed by a word segmenter commonly used in the industry, such as the Chinese word segmenters jieba and pkuSeg. With a word segmenter, the character string "I family has a beautiful small flowered cat" is converted into {S1, S2, S3, …, Sn} = {"I family", "having", "one", "beautiful", "small", "spotted cat"}; without a word segmenter, the units are single characters, {S1, S2, S3, …, Sn} = {"me", "home", "have", "one", "only", …}. With or without a word segmenter, the converted text has the same structure for the algorithm operations that follow. In general, several different fuzzy search methods can be configured flexibly; for example, the common LCS (longest common subsequence) algorithm gives a good measure of the sequential similarity between two character strings.
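To make the LCS remark concrete, the sketch below computes a longest-common-subsequence similarity over either character units or word units. It is a generic textbook LCS, not the matching algorithm of the invention; the function names and the choice to normalize by the length of Q are assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (classic DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(S, Q):
    """Fraction of Q's units that appear, in order, somewhere in S (assumed normalization)."""
    return lcs_length(S, Q) / len(Q) if Q else 0.0


# Character units: a Python string is already a sequence of characters.
print(lcs_similarity("a pair of bright eyes", "bright eye"))

# Word units: the output list of any word segmenter is handled the same way.
words_S = ["I family", "having", "one", "beautiful", "small", "spotted cat"]
words_Q = ["one", "small", "cat"]
print(lcs_similarity(words_S, words_Q))
```

The same functions accept both unit types, which mirrors the point that the downstream algorithms do not care whether a word segmenter was used.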

S4, automatic division of the search task: a long text S of considerable length is divided automatically by segmenting it at specific terminators, such as periods, question marks and exclamation marks, and the operation of S3 is then performed segment by segment.
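The automatic division step can be illustrated with a short Python sketch that splits a long text after each terminator and records where every segment starts, so that match positions can later be mapped back to the original text. The function name auto_split, the default terminator set and the returned (segment, offset) pairs are assumptions for illustration only.

```python
import re


def auto_split(S, stop_words=("。", "？", "！", ".", "?", "!")):
    """Split S after each terminator; return (segment, start_offset) pairs."""
    if not stop_words:
        return [(S, 0)]
    # Alternation of escaped terminators, so multi-character terminators also work.
    pattern = "|".join(re.escape(w) for w in stop_words)
    segments, start = [], 0
    for m in re.finditer(pattern, S):
        end = m.end()
        if S[start:end].strip():
            segments.append((S[start:end], start))
        start = end
    # Keep any trailing text that has no terminator.
    if S[start:].strip():
        segments.append((S[start:], start))
    return segments


text = "Hello all! Today I introduce my father. He is a teacher"
for segment, offset in auto_split(text):
    print(offset, repr(segment))
```

Keeping the offset with each segment is a design choice made here so that a per-segment search can still report positions relative to the whole of S.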

S5, internal acceleration and multithread acceleration: internal acceleration processing is applied to each link of the algorithm scheme in S3, which can greatly reduce the time spent on summation operations. Many such acceleration methods exist and are not described further here. Multithread acceleration mainly targets the autoSplit method of S4: the automatically divided character-string segments are sent to different computing units (servers) for multithreaded computation, which can greatly improve efficiency. The multithreading currently used is implemented mainly with the third-party Python library Multiprocesses. First, the segmented sentences are passed as one of the inputs in the form of an array S = [Si, …], and the number of threads or processes to use is specified; then the library's map function, or a similar function, automatically dispatches work to whichever thread units are currently idle. For each clause, the model completes the matching process from association through fuzzy search. The key technical point is how to parallelize the conventional iterative process of associating and then fuzzy searching each clause in turn, that is, how to use the callable resource units to associate and search multiple clauses [Si, …, Sm] simultaneously.
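The parallel dispatch described in this step can be sketched with a pool and its map function. The sketch below uses ThreadPool from Python's standard multiprocessing.pool module as a stand-in for whichever library is actually used, and search_segment is a hypothetical placeholder that does an exact substring lookup rather than the association and fuzzy search of the invention.

```python
from multiprocessing.pool import ThreadPool


def search_segment(task):
    """Placeholder per-segment search (assumption): exact substring lookup."""
    segment, offset, Q = task
    idx = segment.find(Q)
    if idx < 0:
        return None
    # Report the position relative to the whole long text.
    return {"match_str": Q, "position": [offset + idx, offset + idx + len(Q)]}


def parallel_search(segments, Q, workers=4):
    """Run search_segment over (segment, offset) pairs using the pool's map."""
    tasks = [(seg, off, Q) for seg, off in segments]
    with ThreadPool(processes=workers) as pool:
        results = pool.map(search_segment, tasks)
    return [r for r in results if r is not None]


segments = [("Gauss was a mathematician.", 0), (" He made major progress.", 26)]
print(parallel_search(segments, "major progress"))
```

Since every segment is searched independently, the per-segment function can be swapped for a real matcher without changing the dispatch code; a process pool could be substituted for CPU-bound work.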

S6, interface packaging: to make the text fuzzy search module easy to apply flexibly, it can be packaged as an interface product whose input parameter format is bluE(S, Q, autoSplit, isImagine, stop_words), where autoSplit and isImagine are both Boolean values; autoSplit determines whether the automatic task division mechanism is used, isImagine determines whether the association mode is enabled, and stop_words are the user-defined terminators used in autoSplit mode.

The characters in S1 include Chinese characters, English letters, numerals, and special characters.

The fuzzy search scheme in S3 depends on whether the user enables the semantic association function: if it is not enabled, the fuzzy search is character-based and the constituent units of S and Q are the characters themselves; if it is enabled, S and Q must first be segmented into words.

The fuzzy search algorithm in S3 comprises a multi-level convolution character density weighted matching algorithm and a near-diagonal common subsequence matching algorithm.

The multi-level convolution character density weighted matching algorithm emphasizes support for a higher degree of fuzziness. With the help of the semantic association map, a weighted matching table {Tij} is built for the long text S = {S1, S2, S3, …, Sn} and the search request Q = {Q1, Q2, Q3, …, Qm}:

	S1	S2	S3	S4	S5	...	Sn
Q1	T11	T12
Q2	T21	T22
Q3
Q4
...
Qm							Tmn

where Tij represents the score of the corresponding relation in the semantic association map, with Sj as the index word and Qi as the subword. In general, a "synonym" can be assigned a score of 1, a "near-synonym" 0.75, and a "related word" 0.5; that is, if Qi is a near-synonym of Sj, Tij = 0.75, and if Qi does not belong to any group of Sj, Tij = 0. After the weighted matching table is built, the stride k must first be set: the larger k is, the greater the tolerance for gaps between characters and the fuzzier the matching; generally, k = 5. The weighted matching table {Tij} is then zero-padded so that it is uniformly surrounded by zero-valued cells, as shown below, where the cells in the padded part have been given zero values:

An area S_conv to be convolved is extracted for S; for example, S_conv for Si is T[: i-k//2;
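The table construction, zero padding and strip extraction can be sketched as follows. The semantic association map here is a tiny hypothetical dictionary, identical units are assumed to score 1, and, because the slice expression for S_conv is cut off in the text, the strip bounds (k columns centred on the unit in question) are an assumption.

```python
# Scores per relation type, as suggested in the description.
SCORES = {"synonym": 1.0, "near_synonym": 0.75, "related": 0.5}

# Hypothetical pre-built semantic association map: index word -> {subword: relation}.
SEMANTIC_MAP = {
    "eyes": {"eye": "near_synonym", "sight": "related"},
    "big": {"large": "near_synonym", "huge": "synonym"},
}


def build_table(S, Q, semantic_map=SEMANTIC_MAP):
    """T[i][j] = score of Q[i] as a subword of index word S[j], else 0."""
    table = []
    for q in Q:
        row = []
        for s in S:
            relation = semantic_map.get(s, {}).get(q)
            # Assumption: identical units count as a full match.
            row.append(1.0 if s == q else SCORES.get(relation, 0.0))
        table.append(row)
    return table


def zero_pad(table, k=5):
    """Surround the table with k // 2 rows and columns of zeros."""
    pad = k // 2
    width = len(table[0]) + 2 * pad
    padded = [[0.0] * width for _ in range(pad)]
    padded += [[0.0] * pad + list(row) + [0.0] * pad for row in table]
    padded += [[0.0] * width for _ in range(pad)]
    return padded


def s_conv(padded, j, k=5):
    """Strip of k padded columns centred on original column j (assumed bounds)."""
    return [row[j:j + k] for row in padded]


S = ["a", "pair", "of", "big", "eyes"]
Q = ["large", "eye"]
T = build_table(S, Q)
padded = zero_pad(T)
print(T)
print(s_conv(padded, 3))
```

After padding, original column j sits at padded column j + k // 2, so the slice j to j + k is centred on it without any boundary checks.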

A convolutional layer for Q, referred to as Q_conv, is set according to k. It is a k × k matrix whose purpose is to give a higher score where the cells in the diagonal region carry higher values, so Q_conv may be set as follows:

1	0.75	0.25	0.05	0
0.75	1	0.75	0.25	0.05
0.25	0.75	1	0.75	0.25
0.05	0.25	0.75	1	0.75
0	0.05	0.25	0.75	1
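Each entry of the matrix above depends only on its distance from the main diagonal, so it can be generated from a single list of weights. The helper below is an illustrative sketch; the name make_q_conv and the idea of parameterizing the weights are assumptions, while the default weights are taken from the matrix shown.

```python
# Weight by distance from the main diagonal: 0 -> 1, 1 -> 0.75, 2 -> 0.25, ...
WEIGHTS = [1, 0.75, 0.25, 0.05, 0]


def make_q_conv(weights=WEIGHTS):
    """Build a k x k kernel whose entry (x, y) depends only on |x - y|."""
    k = len(weights)
    return [[weights[abs(x - y)] for y in range(k)] for x in range(k)]


for row in make_q_conv():
    print(row)
```

With this form, changing k or the rate at which scores fall off away from the diagonal only requires a different weight list.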

Where more sophisticated deep learning model training is available, a dedicated Q_conv can be configured for any Si, i.e., different constituent units of S can have different convolutional layers. In addition, Si can be given a Q_conv that incorporates semantics by means of a sentence-level language model (e.g., ELMo, BERT); for example, the same word "love" differs between S = "I love Beijing Tiananmen" and S = "I love my wife". Each S_conv is then convolved vertically with Q_conv, and for Tij the result of the convolution is:

where x' and y' are the coordinates of the position on Q_conv corresponding to S_conv[x][y], written simply as x' and y' for convenience. The convolution scores of Si and Qj are thus obtained. For each Si, the highest convolution score over all Qj is retained as the approximate convolution score of Si with respect to the whole of Q, which gives the convolution score distribution of S shown in FIG. 2. From FIG. 2, the matched character strings, such as the character string [S7, S8, S9, S10] framed in the figure, can be extracted by various extraction methods. The basic idea is as follows: extract each continuous character string whose convolution scores exceed a certain threshold (for example, 0.5), allowing valleys of limited length (for example, no more than 2 characters), then sum and normalize the convolution scores corresponding to that character string. The position of the character string is output, and the summed and normalized convolution score is taken as the similarity between the character string and Q.
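The convolution and extraction steps can be sketched in plain Python as below. Several details are assumptions, because the convolution formula and the normalization are not reproduced in the text: the score is taken as the element-wise product sum of a k × k window with Q_conv, the similarity is the mean score over the extracted span, and positions are half-open index ranges. The scores are not rescaled, so the threshold has to be chosen to suit their scale.

```python
Q_CONV = [
    [1, 0.75, 0.25, 0.05, 0],
    [0.75, 1, 0.75, 0.25, 0.05],
    [0.25, 0.75, 1, 0.75, 0.25],
    [0.05, 0.25, 0.75, 1, 0.75],
    [0, 0.05, 0.25, 0.75, 1],
]


def conv_scores(table, q_conv=Q_CONV, min_nonzero=1):
    """For every unit of S, keep the best kernel response over all units of Q."""
    k = len(q_conv)
    pad = k // 2
    m, n = len(table), len(table[0])
    width = n + 2 * pad
    padded = ([[0.0] * width for _ in range(pad)]
              + [[0.0] * pad + list(row) + [0.0] * pad for row in table]
              + [[0.0] * width for _ in range(pad)])
    scores = []
    for j in range(n):
        strip = [row[j:j + k] for row in padded]  # S_conv for column j
        # Pre-screen: skip strips with too few non-zero cells.
        if sum(1 for row in strip for v in row if v) < min_nonzero:
            scores.append(0.0)
            continue
        best = 0.0
        for i in range(m):  # slide the kernel down the strip, one Q unit at a time
            total = sum(strip[i + x][y] * q_conv[x][y]
                        for x in range(k) for y in range(k))
            best = max(best, total)
        scores.append(best)
    return scores


def extract_runs(scores, threshold=0.5, max_valley=2):
    """Spans whose scores exceed threshold, bridging valleys up to max_valley long."""
    runs, start, last_hit = [], None, None
    for idx, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = idx
            elif idx - last_hit - 1 > max_valley:
                runs.append((start, last_hit))
                start = idx
            last_hit = idx
    if start is not None:
        runs.append((start, last_hit))
    # Assumption: normalize by span length; positions are [start, end).
    return [{"position": [a, b + 1],
             "similarity": sum(scores[a:b + 1]) / (b - a + 1)}
            for a, b in runs]


# Two query units matching S at positions 2 and 3, i.e. along a diagonal.
T = [[0, 0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0, 0]]
scores = conv_scores(T)
print(scores)
print(extract_runs(scores, threshold=1.0))
```

In this toy table the two matches lie on a diagonal, so the kernel response peaks at the matched units and falls off on either side. The summation could equally be delegated to an array library such as numpy.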

The near-diagonal common subsequence matching algorithm emphasizes computational efficiency. It relies on the weighted matching table {Tij} of the multi-level convolution character density weighted matching algorithm, and its specific scheme is a variant of LCS: for any Tij, taking Tij as the starting point, a search is made for the cell Txy with the maximum L value, or for the first cell whose L value exceeds a certain threshold (generally set to 0.25), in the matrix T[0. The cell found serves as the parent cell of Tij, and once it is found the query terminates. The L value is calculated as follows:

The cell Tij stores its own Tij value, its parent cell, and the corresponding L value, from which its weighted Tij value, called Yij, is obtained. Superscripts are used to describe the relationship between a cell and its parent cell, i.e., the parent cell of Tij^(t) is written as Tij^(t-1); the expression for Yij is then:

Example one

Long text S1=' great family! Today i want to introduce my father. My father is a teacher, son, etc. A pair of bright eyes and black and beautiful hair also have a few silvery threads, which appear to be very old but very vigorous! '.

Short texts: Q1 = 'double eyes' and Q2 = 'eyes great'.

Calling the packaged interface:

bluE(S, Q, autoSplit, isImagine, stop_words)

to search for the short text Q in the long text S: result1 = bluE(S = S1, Q = Q1), result2 = bluE(S = S1, Q = Q2)

The returned results are, respectively:

{'match_str': 'A bright large eye', 'position': [33, 43], 'similarity': 0.4238}

{'match_str': 'the large eye of', 'position': [39, 44], 'similarity': 0.3797}

The convolution score distributions of S1 for the two cases Q1 and Q2 are as follows (note: the convolution operation is pre-screened here, and S_conv strips with too few non-zero cells are not processed, so only some Si have a convolution score):

S1 + Q1: a single-threaded run on a MacBook 7 takes 0.002 seconds; the running result is shown in FIG. 3.

S1 + Q2: a single-threaded run on a MacBook 7 takes 0.0015 seconds; the running result is shown in FIG. 4.

Example two

For the sample using autoSplit:

Long text S2 = 'Johann Carl Friedrich Gauss (Johann Carl Friedrich Gauss) was a German mathematician who made major advances in many fields, including number theory, algebra, statistics, analysis, differential geometry, geodesy, geophysics, mechanics, electrostatics, astronomy, matrix theory and optics. Gauss pointed out that the regular triangle, square, pentagon and pentadecagon, and regular polygons with twice as many sides as these, can be constructed with compass and straightedge, but that not much progress had since been made on this problem. Gauss gave a number-theoretic criterion for determining whether a regular polygon with a given number of sides can be constructed geometrically. For example, a regular heptadecagon inscribed in a circle can be constructed with compass and straightedge. This was the first such discovery since Euclid. In classical differential geometry, curves and surfaces are usually placed in three-dimensional Euclidean space. The description and discussion of many geometric properties of curves and surfaces often depend on how they are embedded in the larger space. In fact, however, the important properties of many geometric objects are intrinsic, i.e., independent of the way they are embedded in the larger space. Early geometers rarely noticed this. Gauss and Riemann were the first to truly recognize the issue. In his famous lecture on geometry, Riemann formally re-examined many geometric concepts from an intrinsic point of view.'

Short text Q3 = 'major progress'.

Calling the packaged interface:

bluE(S, Q, autoSplit, isImagine, stop_words)

to search for the short text Q in the long text S: result1 = bluE(S = S2, Q = Q3, autoSplit = True)

The returned results are:

{

1: {'match_str': 'major progression', 'position': [110, 114], 'precision': 0.36},

2: {'match_str': 'large progression', 'position': [77, 80], 'precision': 0.26}

}

The corresponding convolution score distribution is as follows:

S2 + Q3: a single-threaded run on a MacBook 7 takes 0.0027 seconds; the running result is shown in FIG. 5.

The invention offers a higher degree of fuzziness: it improves on traditional sentence retrieval algorithms, can retrieve both sentences that exactly match the target sentence and sentences highly similar to it, and allows the required degree of similarity to the target sentence to be adjusted flexibly. It runs fast: brute-force enumeration is abandoned in favor of semantic association maps, convolution, dynamic programming and similar methods, which optimizes the search process and greatly increases search speed. The system is lightweight: its size is reduced, internal and external optimizations target lightweight users and usage scenarios, the whole calculation process is optimized, and the memory burden is reduced. The invention also provides an association mode that requires no on-site computation, so a user can call the association module during fuzzy search without occupying local computing power. The system is flexible and easy to call from different applications: the whole algorithm module is packaged behind an interface, so users can directly call individual modules to meet their actual needs. The method supports semantic association, with fuzzy matching at the level of synonyms, near-synonyms and associated words: compared with traditional semantic retrieval methods, the invention provides a fuzzy retrieval method better suited to everyday use, which intelligently matches the words of the target text against their synonyms, near-synonyms and associated words in the searched text. The method also supports text positioning: it sorts the text segments found by their similarity to the target text, from high to low, and gives the position of each similar segment and its degree of match with the target text.

The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A lightweight text fuzzy search method supporting semantic association, characterized in that the search method comprises the following steps:

S2, to keep the computation lightweight, the semantic association map is built and stored in advance so that it can be called directly, and is not computed on site;

S4, automatic division of the search task: a long text S of considerable length is divided automatically by segmenting it at specific terminators, and the operation of S3 is then performed segment by segment;

S5, internal acceleration processing is applied to each link of the algorithm scheme in S3;

2. The lightweight text fuzzy search method supporting semantic association according to claim 1, wherein the characters in S1 comprise Chinese characters, English letters, numerals and special characters.

3. The lightweight text fuzzy search method supporting semantic association according to claim 1, wherein the fuzzy search scheme in S3 depends on whether the user enables the semantic association function: if it is not enabled, the fuzzy search is character-based and the constituent units of S and Q are the characters themselves; if it is enabled, S and Q must first be segmented into words.

4. The method for lightweight fuzzy search supporting semantic association as claimed in claim 1, wherein said fuzzy search algorithm in S3 comprises a multi-level convolution character density weighted matching algorithm and a near-diagonal common subsequence matching algorithm.

5. The lightweight text fuzzy search method supporting semantic association according to claim 1, wherein a "first glance" determination may be made before performing the operation of S3, the idea being: blu(S, Q) == True if len(set(Q) & set(S)) > len(set(Q)) × 0.5.

6. The lightweight text fuzzy search method supporting semantic association according to claim 4, wherein in the convolution operation of the multi-level convolution character density weighted matching algorithm, it may be determined in advance whether S_conv has enough non-zero cells; if not, the convolution operation is not performed on that S_conv.

7. The method for lightweight fuzzy search supporting semantic association as claimed in claim 4, wherein the convolution summation operation of the multilevel convolution character density weighted matching algorithm is assisted by an external tool such as numpy.