CN114996439A - Text search method and device - Google Patents

Publication number
CN114996439A
Authority
CN
China
Prior art keywords
vector
text
preset
word segmentation
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210913444.9A
Other languages
Chinese (zh)
Inventor
陈轮
黄海峰
韩国权
祁纲
吕灏
张国强
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiji Computer Corp Ltd
Original Assignee
Taiji Computer Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd
Priority to CN202210913444.9A
Publication of CN114996439A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text search method and device in the technical field of data retrieval. Word segmentation is performed on a text to be retrieved to obtain a relatively comprehensive word segmentation text set; the word segmentation text set is vector-converted and encoded according to the standard vector of a preset corpus to obtain a vector of the word segmentation text set; and the address of the target text corresponding to the vector with the highest cosine similarity is acquired, so that the target text is obtained from that address. The method does not depend on runs of consecutive identical characters: whether words have been replaced or fields rearranged, comprehensive word segmentation still yields the segmented words, and the cosine similarity is calculated from the vector coordinates. Because vector conversion and encoding use the standard vector of a fixed preset corpus as the reference, the resulting vectors are more accurate. Once the vectors are determined, the corresponding cosine similarity is unique, so the calculation is simple and the accuracy of the similarity calculation is improved.

Description

Text search method and device
Technical Field
The invention belongs to the technical field of data retrieval, and particularly relates to a text search method and device.
Background
The growth of information technology has caused the volume of text data of all kinds to rise sharply, and how to quickly find the texts a user needs in massive text data has long been a topic of interest. For example, to query similar documents, the related art generally requires a text to be retrieved to be input into a search server, which searches for texts similar to it according to its content. When judging similarity, the similarity between the text to be retrieved and a text on the server is generally determined from the repetition rate of runs of consecutive characters (for example, if 7 consecutive characters are the same, the passage is judged to be repeated), and the similarity between the two texts is determined from that repetition rate.
However, this calculation requires that the two texts contain runs of consecutive identical characters. When fields of the text to be retrieved have been rearranged or words have been replaced, similar texts are likely to go undetected, and the detection accuracy is low.
Disclosure of Invention
In view of the above, the present invention provides a text search method and device to overcome the technical problem that similarity calculation requires runs of consecutive identical characters in the two texts, so that when fields of the text to be retrieved are rearranged or words are replaced, similar texts are likely to go undetected and the detection accuracy is low.
To achieve this purpose, the invention adopts the following technical solutions:
In one aspect, a text search method includes:
performing word segmentation on a text to be retrieved based on jieba word segmentation to obtain a word segmentation text set, wherein the word segmentation text set comprises at least two segmented words;
encoding each segmented word in the word segmentation text set according to a vector conversion method and a standard vector of a preset corpus to obtain a vector of the word segmentation text set;
calculating the cosine similarity between the vector of the word segmentation text set and each vector in a preset vector library;
determining the highest value among the cosine similarities, and taking the vector in the preset vector library corresponding to that highest value as a target vector;
acquiring the address of the target text corresponding to the target vector based on a correspondence between preset vectors and target text addresses;
and searching a text database for the target text corresponding to that address as the search target text.
Optionally, before performing word segmentation on the text to be retrieved based on jieba word segmentation to obtain the word segmentation text set, the method further includes:
acquiring a target text set, wherein the target text set comprises at least two target texts;
performing word segmentation on the target text set, and combining the obtained segmented words according to a preset rule to obtain the preset corpus;
performing vector conversion on each segmented word in the preset corpus according to the vector conversion method to obtain the standard vector, wherein the standard vector comprises each segmented word and its corresponding number;
encoding each target text according to the standard vector to obtain a target text vector of each target text;
constructing the preset vector library from the target text vectors; and setting the correspondence between preset vectors and target text addresses, and storing the target texts in the text database according to that correspondence.
Optionally, performing word segmentation on the target text set and combining the obtained segmented words according to a preset rule to obtain the preset corpus includes:
performing word segmentation on each target text to obtain a word segmentation set corresponding to each target text;
judging whether any segmented word is repeated across the word segmentation sets, and if so, adding a single copy of the repeated segmented word to the preset corpus; and adding the non-repeated segmented words to the preset corpus.
Optionally, after performing word segmentation on the text to be retrieved based on jieba word segmentation to obtain the word segmentation text set, the method further includes:
judging whether any preset forbidden words or preset symbols exist in the word segmentation text set;
and if preset forbidden words or preset symbols exist in the word segmentation text set, deleting them, and taking the text set from which they have been deleted as the word segmentation text set.
Optionally, the rule for calculating the cosine similarity between the vector of the word segmentation text set and each vector in the preset vector library is:
$$\cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}}\,\sqrt{\sum_{i=1}^{n} Y_i^{2}}}$$
where X_i is the i-th coordinate of the vector of the word segmentation text set, Y_i is the i-th coordinate of a vector in the preset vector library, i indexes the dimensions of the space, and n is the number of dimensions.
Optionally, the method further includes: determining the required keywords of the text to be retrieved and the weights of those required keywords;
in this case, determining the highest value among the cosine similarities and taking the vector in the preset vector library corresponding to that highest value as the target vector includes:
calculating a weighted similarity for each cosine similarity according to the weights of the required keywords;
sorting the weighted similarities by magnitude;
and determining the vector in the preset vector library corresponding to the highest weighted similarity as the target vector.
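The text does not state how the keyword weights are combined with the cosine similarities. The following Python sketch shows one possible reading, purely as an assumption: each candidate's cosine similarity is scaled by one plus the summed weights of the required keywords that the candidate contains. The function names and the scaling rule are hypothetical and are not taken from the source.

```python
def weighted_similarity(cosine, candidate_words, keyword_weights):
    """Scale a cosine similarity by the weights of the required keywords
    found in the candidate text (assumed rule, not from the source)."""
    bonus = sum(weight for keyword, weight in keyword_weights.items()
                if keyword in candidate_words)
    return cosine * (1.0 + bonus)


def pick_target(candidates, keyword_weights):
    """candidates: list of (vector_id, cosine, set_of_segmented_words).
    Returns the vector_id with the highest weighted similarity."""
    scored = [(weighted_similarity(cosine, words, keyword_weights), vector_id)
              for vector_id, cosine, words in candidates]
    scored.sort(reverse=True)  # sort the weighted similarities, largest first
    return scored[0][1]


candidates = [
    ("v1", 0.80, {"hat", "small"}),
    ("v2", 0.75, {"hat", "select"}),
]
print(pick_target(candidates, {"select": 0.5}))  # v2 (0.75 * 1.5 = 1.125 > 0.80)
```

With an empty weight table the ranking falls back to plain cosine similarity.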
In still another aspect, a text search apparatus includes:
a word segmentation module, used for performing word segmentation on a text to be retrieved based on jieba word segmentation to obtain a word segmentation text set, wherein the word segmentation text set comprises at least two segmented words;
an encoding module, used for encoding each segmented word in the word segmentation text set according to a vector conversion method and a standard vector of a preset corpus to obtain a vector of the word segmentation text set;
a calculation module, used for calculating the cosine similarity between the vector of the word segmentation text set and each vector in a preset vector library;
a determining module, used for determining the highest value among the cosine similarities and taking the vector in the preset vector library corresponding to that highest value as a target vector;
an acquisition module, used for acquiring the address of the target text corresponding to the target vector based on the correspondence between preset vectors and target text addresses;
and a searching module, used for searching a text database for the target text corresponding to that address as the search target text.
Optionally, the apparatus further includes a construction module. Before the text to be retrieved is segmented based on jieba word segmentation to obtain the word segmentation text set, the construction module is used for: acquiring a target text set, wherein the target text set comprises at least two target texts; performing word segmentation on the target text set, and combining the obtained segmented words according to a preset rule to obtain the preset corpus; performing vector conversion on each segmented word in the preset corpus according to the vector conversion method to obtain the standard vector, wherein the standard vector comprises each segmented word and its corresponding number; encoding each target text according to the standard vector to obtain a target text vector of each target text; constructing the preset vector library from the target text vectors; and setting the correspondence between preset vectors and target text addresses, and storing the target texts in the text database according to that correspondence.
Optionally, the construction module is specifically configured to perform word segmentation on each target text to obtain a word segmentation set corresponding to each target text; judge whether any segmented word is repeated across the word segmentation sets, and if so, add a single copy of the repeated segmented word to the preset corpus; and add the non-repeated segmented words to the preset corpus.
Optionally, the word segmentation module is further configured to judge whether any preset forbidden words or preset symbols exist in the word segmentation text set; and if so, to delete them and take the text set from which they have been deleted as the word segmentation text set.
According to the text search method and device provided by the embodiments of the invention, a relatively comprehensive word segmentation text set is obtained by segmenting the text to be retrieved, which secures the search range. The word segmentation text set of the text to be retrieved is vector-converted and encoded according to the standard vector of the preset corpus to obtain its vector, so a common dictionary vector for the text to be retrieved and each stored text does not need to be rebuilt, which saves time. The similarity between vectors is expressed by their cosine similarity; once the highest cosine similarity is found, the target text address corresponding to that vector is acquired, and the target text is obtained from that address. The method does not depend on runs of consecutive identical characters: whether words have been replaced or fields rearranged, comprehensive word segmentation still yields the segmented words, and the cosine similarity is calculated from the vector coordinates. Because vector conversion and encoding use the standard vector of a fixed preset corpus as the reference, the resulting vectors are more accurate. Once the vectors are determined, the corresponding cosine similarity is unique, so the calculation is simple and the accuracy of the similarity calculation is improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described here show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a text search method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating the construction steps of a text search method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a text search apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a text search apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Fig. 1 is a schematic flowchart of a text search method according to an embodiment of the present invention. In this embodiment, any server may serve as the execution subject that performs the text search; for example, the server may be an Elasticsearch search engine (characterized by being distributed, zero-configuration, self-discovering, automatically sharding its indexes, and easy to configure as a cluster), or another server, which is not specifically limited in this application. Elasticsearch is a search server based on Lucene (a search engine library) that provides a distributed, multi-user full-text search engine. Referring to Fig. 1, the method provided in this embodiment may include the following steps:
Step S1: performing word segmentation on the text to be retrieved based on jieba word segmentation to obtain a word segmentation text set, wherein the word segmentation text set comprises at least two segmented words.
In a specific search process, a user can submit a text to be retrieved to the server in order to obtain texts similar to it. For example, the method may be applied to scenarios such as duplicate checking, near-duplicate text acquisition, and retrieval, which are not specifically limited in this application.
After the user inputs the text to be retrieved, word segmentation can be performed on it: a Chinese word segmentation algorithm splits the text into words using a dictionary. After segmentation, the text to be retrieved is split into at least two segmented words, which form the word segmentation text set.
It can be understood that, in this application, jieba word segmentation is used to scan out every word that can be formed in the text to be retrieved. The segmentation is comprehensive, which secures the range of the subsequent search, and it is fast, which saves time.
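As a rough illustration of what "scanning out every word that can be formed" means, the following Python sketch returns every substring of the input that appears in a dictionary. It is a simplified stand-in and not jieba's actual algorithm; the function name, the toy dictionary, and the `max_len` parameter are assumptions made for illustration. With the jieba library itself, the comparable call would be its full mode, `jieba.cut(text, cut_all=True)`.

```python
def full_mode_segment(text, dictionary, max_len=4):
    """Return every substring of `text` (at most `max_len` characters long)
    that is present in `dictionary`, in left-to-right scan order."""
    words = []
    for start in range(len(text)):
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = text[start:end]
            if piece in dictionary:
                words.append(piece)
    return words


toy_dictionary = {"文本", "搜索", "方法", "文本搜索"}  # hypothetical dictionary
print(full_mode_segment("文本搜索方法", toy_dictionary))
# ['文本', '文本搜索', '搜索', '方法']
```

Overlapping words such as 文本 and 文本搜索 are both kept, which is what makes the resulting word set comprehensive.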
It should be noted that, in some embodiments, after the text to be retrieved is segmented based on jieba word segmentation to obtain the word segmentation text set, the method further includes: judging whether any preset forbidden words or preset symbols exist in the word segmentation text set; and if so, deleting them and taking the text set from which they have been deleted as the word segmentation text set.
In the embodiments of the application, to make processing easier for the user or to keep the text content appropriate, forbidden words or preset symbols can be set in advance so that they are excluded from the word segmentation result. For example, the preset forbidden words may include political words and words relating to pornography, gambling, and drugs; the preset symbols may include special medical symbols.
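The filtering step can be sketched in a few lines of Python. The function name and the sample forbidden word and symbol are hypothetical; the source does not specify a data structure for the preset lists.

```python
def remove_forbidden(words, forbidden_words, forbidden_symbols):
    """Drop every segmented word that is a preset forbidden word or symbol,
    keeping the remaining words in their original order."""
    blocked = set(forbidden_words) | set(forbidden_symbols)
    return [word for word in words if word not in blocked]


segmented = ["this", "hat", ",", "badword", "small"]
print(remove_forbidden(segmented, {"badword"}, {","}))
# ['this', 'hat', 'small']
```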
It should be noted that, in some embodiments, before the text to be retrieved is segmented based on jieba word segmentation to obtain the word segmentation text set, the preset corpus, the preset vector library, and the correspondence between preset vectors and target text addresses may be constructed. The construction process may include the following steps: acquiring a target text set, wherein the target text set comprises at least two target texts; performing word segmentation on the target text set, and combining the obtained segmented words according to a preset rule to obtain the preset corpus; performing vector conversion on each segmented word in the preset corpus according to the vector conversion method to obtain the standard vector, wherein the standard vector comprises each segmented word and its corresponding number; encoding each target text according to the standard vector to obtain a target text vector of each target text; constructing the preset vector library from the target text vectors; and setting the correspondence between preset vectors and target text addresses, and storing the target texts in the text database according to that correspondence.
In some embodiments, performing word segmentation on the target text set and combining the obtained segmented words according to a preset rule to obtain the preset corpus includes: performing word segmentation on each target text to obtain a word segmentation set corresponding to each target text; judging whether any segmented word is repeated across the word segmentation sets, and if so, adding a single copy of the repeated segmented word to the preset corpus; and adding the non-repeated segmented words to the preset corpus.
Specifically, a stored text against which the text to be retrieved is checked for duplication may be set as a target text, and at least two target texts form the target text set. After the target text set is determined, word segmentation is performed on the texts in it, and the obtained segmented words are then combined according to the preset rule to obtain the preset corpus.
For example, word segmentation may be performed on each target text to obtain a word segmentation set corresponding to each target text. If the same segmented word appears in several word segmentation sets, it is added to the preset corpus once; segmented words that differ are each added to the preset corpus, forming the final preset corpus. It can be understood that, because segmented words shared between word segmentation sets are deduplicated and each is added to the preset corpus only once, repeated calculation in the subsequent vectors is avoided and the accuracy of the similarity calculation is improved.
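The merging rule described above can be sketched as follows. The function name and the sample word lists are hypothetical, and keeping first-seen order is an assumption; the source only requires that each segmented word enter the corpus once.

```python
def build_corpus(word_sets):
    """Merge per-text segmented-word lists into one corpus in which every
    word appears exactly once, in first-seen order."""
    corpus = []
    seen = set()
    for words in word_sets:
        for word in words:
            if word not in seen:
                seen.add(word)
                corpus.append(word)
    return corpus


set_a = ["hat", "small"]
set_b = ["hat", "large"]
print(build_corpus([set_a, set_b]))  # ['hat', 'small', 'large']
```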
After the preset corpus is obtained, vector conversion is performed on each segmented word in the preset corpus according to the vector conversion method to obtain the standard vector. For example, each segmented word in the preset corpus is numbered, giving a standard vector that contains the segmented words and their corresponding numbers.
After the standard vector is obtained, the word segmentation set of each target text is encoded with the standard vector as the reference, giving the target text vector of each target text. The target text vectors are stored to construct the preset vector library, the correspondence between preset vectors and target text addresses is set, and the target texts are stored in the text database.
Fig. 2 is a schematic diagram of the construction steps of a text search method according to an embodiment of the present invention. Referring to Fig. 2, in the embodiment of the present invention, several target texts (for example, text A, text B, text C, and so on) may be selected to form a text list, that is, the target text set. Each target text in the target text set is segmented with a Chinese word segmentation tool (for example, text A yields word segmentation set a, text B yields word segmentation set b, and text C yields word segmentation set c), and the words are combined to obtain the preset corpus. Vector conversion and one-hot encoding are performed on each word in the preset corpus; for example, each word in the preset corpus may be assigned an integer index starting from 0, giving the standard vector. After the standard vector is obtained, the segmented words in word segmentation set a of target text A, word segmentation set b of target text B, and word segmentation set c of target text C are converted into one-hot encoded target text vectors based on the standard vector. All target text vectors are collected to construct the preset vector library; the correspondence between preset vectors and target text addresses is set, and the target texts are stored in the text database.
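The construction steps can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function names, the addresses `doc_a` and `doc_b`, and the choice to count repeated words and to skip words missing from the corpus are all hypothetical and not specified by the source.

```python
def build_standard_vector(corpus):
    """Number each corpus word with an integer index starting from 0."""
    return {word: index for index, word in enumerate(corpus)}


def encode_text(words, standard_vector):
    """Encode a segmented text as a vector over the corpus indices.
    Words absent from the corpus are skipped (an assumption)."""
    vector = [0] * len(standard_vector)
    for word in words:
        if word in standard_vector:
            vector[standard_vector[word]] += 1
    return vector


def build_vector_library(segmented_texts, standard_vector):
    """segmented_texts maps a text address to its segmented words.
    Returns a list of (target_text_vector, address) pairs."""
    return [(encode_text(words, standard_vector), address)
            for address, words in segmented_texts.items()]


standard_vector = build_standard_vector(["hat", "small", "large"])
library = build_vector_library(
    {"doc_a": ["hat", "small"], "doc_b": ["hat", "large"]},
    standard_vector,
)
print(library)  # [([1, 1, 0], 'doc_a'), ([1, 0, 1], 'doc_b')]
```

Keeping the address next to each vector is what later lets the search map a target vector back to its text.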
It can be understood that, in the embodiments of the present application, to facilitate text retrieval, each target text may be stored in the form of vector + text address. For example, the vector and the text id can be stored as an Elasticsearch document (doc); an index is initialized, and a custom named cosine similarity function is imported.
After these preparatory steps are completed, the user can input the text to be retrieved and search for a target text similar to it.
Step S2: encoding each segmented word in the word segmentation text set according to the vector conversion method and the standard vector of the preset corpus to obtain the vector of the word segmentation text set.
In the embodiments of the present application, the vector conversion algorithm may be word2vec (word to vector), which is used to encode the segmented text. word2vec is a method for converting words into vectors; the conversion process is not described in detail in this application, and reference may be made to the prior art.
It can be understood that, in the embodiments of the application, converting the text into a vector with word2vec enables the subsequent vector retrieval and solves the technical problem in the prior art that vector retrieval cannot process text content directly.
Specifically, after vector conversion, one-hot encoding may be performed based on the preset corpus to obtain a vector list. One-hot encoding uses an N-bit status register to encode N states; each state has its own register bit, and only one bit is active at any time. After encoding, the vector of the word segmentation text is obtained.
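A short Python sketch of one-hot encoding against a fixed word index follows. How the per-word one-hot vectors are combined into a single text vector is not spelled out in the source; summing them, so that a word occurring twice yields a 2, is an assumption made here, as are the function names and the sample index.

```python
def one_hot(index, size):
    """Return a vector of length `size` with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


def encode_by_summing(words, word_to_index):
    """Sum the one-hot vectors of the words that appear in `word_to_index`."""
    size = len(word_to_index)
    total = [0] * size
    for word in words:
        if word not in word_to_index:
            continue  # words outside the preset corpus are ignored (assumption)
        for position, bit in enumerate(one_hot(word_to_index[word], size)):
            total[position] += bit
    return total


word_to_index = {"hat": 0, "small": 1, "large": 2}  # hypothetical standard vector
print(one_hot(2, 4))                                              # [0, 0, 1, 0]
print(encode_by_summing(["small", "hat", "small"], word_to_index))  # [1, 2, 0]
```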
Step S3: calculating the cosine similarity between the vector of the word segmentation text set and each vector in the preset vector library.
The preset vector library is constructed in advance and stores the target text vectors.
In the embodiments of the present application, the cosine similarity may be calculated by calling a cosine similarity function through the query model provided by an ES (Elasticsearch, a highly scalable open-source full-text search and analytics engine) database. The ES database can store, search, and analyze massive data in near real time. The cosine similarities between the vector of the word segmentation text set and the vectors in the preset vector library can be calculated either one by one or all at once.
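The source mentions a custom cosine similarity function without giving its definition. As a hedged stand-in, the sketch below builds the request bodies for Elasticsearch's built-in `cosineSimilarity` script function over a `dense_vector` field. The field names `text_vector` and `text_id` and the helper names are hypothetical; only plain dictionaries are constructed, so no client or running cluster is assumed.

```python
def index_mapping(dims):
    """Index body storing each target text as vector + text id."""
    return {
        "mappings": {
            "properties": {
                "text_vector": {"type": "dense_vector", "dims": dims},
                "text_id": {"type": "keyword"},
            }
        }
    }


def cosine_query(query_vector, size=1):
    """Search body ranking documents by cosine similarity to query_vector.
    1.0 is added because Elasticsearch scores must not be negative."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


mapping = index_mapping(dims=5)
query = cosine_query([0, 0, 1, 2, 1])
```

With `size=1` the top hit corresponds to the vector with the highest cosine similarity, and its `text_id` would serve as the target text address.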
Step S4: determining the highest value among the cosine similarities, and taking the vector in the preset vector library corresponding to that highest value as the target vector.
Specifically, after the cosine similarity between the vector of the word segmentation text set and each vector in the preset vector library is calculated, the cosine similarities are sorted, and the vector in the preset vector library corresponding to the highest cosine similarity is selected as the target vector.
Step S5: acquiring the address of the target text corresponding to the target vector based on the correspondence between preset vectors and target text addresses.
Specifically, after the target vector is determined, the target text address corresponding to it is found according to the correspondence between preset vectors and target text addresses.
Step S6: finding the target text corresponding to that address in the text database as the search target text.
Specifically, after the target text address is determined, the corresponding target text is found by that address and used as the search target text, that is, the text with the highest similarity to the text to be retrieved.
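Steps S4 to S6 can be sketched as three lookups. The function name, the vector ids, and the in-memory dictionaries standing in for the vector-to-address correspondence and the text database are all hypothetical.

```python
def retrieve(similarities, vector_to_address, text_database):
    """similarities maps a vector id to its cosine similarity with the query."""
    target_id = max(similarities, key=similarities.get)  # S4: highest similarity
    address = vector_to_address[target_id]               # S5: vector -> address
    return text_database[address]                        # S6: address -> text


similarities = {"item2": 0.2357, "item3": 0.9129}
vector_to_address = {"item2": "doc_2", "item3": "doc_3"}
text_database = {"doc_2": "text two", "doc_3": "text three"}
print(retrieve(similarities, vector_to_address, text_database))  # text three
```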
In some embodiments, the rule for calculating the cosine similarity between the vector of the segmented text and a vector in the preset vector library is:
$$\cos\theta = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}}\,\sqrt{\sum_{i=1}^{n} Y_i^{2}}}$$
where X_i is the i-th coordinate of the vector of the segmented text, Y_i is the i-th coordinate of a vector in the preset vector library, i indexes the dimensions of the space, and n is the number of dimensions.
For example, in a five-dimensional space (n = 5), the vector item1 of a segmented text has coordinates [0,0,1,2,1], and the vectors in the preset vector library include item2 = [1,1,0,0,1] and item3 = [0,0,1,2,0]. The cosine similarity of item1 and item2 and that of item1 and item3 are calculated respectively.
The cosine similarity of item1 and item2 is:
$$\cos(\text{item1},\text{item2}) = \frac{0\cdot 1+0\cdot 1+1\cdot 0+2\cdot 0+1\cdot 1}{\sqrt{0^2+0^2+1^2+2^2+1^2}\,\sqrt{1^2+1^2+0^2+0^2+1^2}} = \frac{1}{\sqrt{6}\,\sqrt{3}} \approx 0.236$$
The cosine similarity of item1 and item3 is:
$$\cos(\text{item1},\text{item3}) = \frac{0\cdot 0+0\cdot 0+1\cdot 1+2\cdot 2+1\cdot 0}{\sqrt{0^2+0^2+1^2+2^2+1^2}\,\sqrt{0^2+0^2+1^2+2^2+0^2}} = \frac{5}{\sqrt{6}\,\sqrt{5}} \approx 0.913$$
The cosine similarity of item1 and item2 is smaller than that of item1 and item3, so the text corresponding to item3 can be determined to be the target text retrieved for item1.
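The comparison of item1 with item2 and item3 can be checked with a short Python sketch of the cosine similarity rule. The function name is hypothetical, and returning 0.0 for an all-zero vector is an added assumption that the source does not address.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0  # guard against all-zero vectors


item1 = [0, 0, 1, 2, 1]
item2 = [1, 1, 0, 0, 1]
item3 = [0, 0, 1, 2, 0]
print(round(cosine_similarity(item1, item2), 4))  # 0.2357
print(round(cosine_similarity(item1, item3), 4))  # 0.9129
```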
Specifically, after the cosine similarities are calculated, the vector with the highest cosine similarity is determined as the target vector and the text address corresponding to it is acquired, so that the target text is retrieved by that address. For example, the corresponding text content is looked up in the text database by the doc_id value and output to the user who issued the retrieval.
For example, the similarity calculation process is described for the case where the text to be retrieved is "this hat is a little small, another one is fine" and the target text is "this hat is not large, select another one".
First, the text to be retrieved is segmented to obtain the word segmentation text set [this, one, hat, little, small, in addition, one, right].
After word segmentation, the standard vector of the preset corpus is looked up: [not large: 0, select: 1, little: 2, small: 3, right: 4, this: 5, one: 6, hat: 7, in addition: 8, one: 9], where the number after each segmented word is its index (the two entries rendered as "one" are distinct segmented words with different indices). The segmented words in the word segmentation text set are encoded according to the standard vector, giving the codes [5,6,7,2,3,8,9,4]; from these codes, the vector of the word segmentation text set is [0,0,1,1,1,1,1].
And calculating the cosine similarity of the vector of the word segmentation text set and the vector of the target text described in the embodiment, wherein the code of the target text is [5,6,7,0,1,8,9], and the vector of the target text is [1,1,0,0,0,1,1,1,1 ]. According to the calculation rule of cosine similarity, the similarity between the two is calculated as follows:
[Formula image: cosine similarity of the text to be retrieved and the target text]
The cosine similarity between the text to be retrieved and the target text is therefore 0.91. If the cosine similarities between the text to be retrieved and all other vectors in the preset vector library are lower than 0.91, the vector in the preset vector library corresponding to 0.91 is determined as the target vector. After the target vector is determined, the target text address is obtained from the correspondence between preset vectors and target text addresses, and the target text found at that address is "the hat is not large, and the other one is selected".
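The encoding step used in this example (looking up each segmented word's number in the standard vector, then deriving a vector from the resulting code) might be sketched as below. The sketch assumes a 0/1 vector with one coordinate per corpus entry, which is one plausible reading of the vector conversion method and is not confirmed by the text; the function name `encode` and the small English vocabulary are hypothetical.

```python
def encode(tokens, standard_vector):
    """Map tokens to their numbers and to a 0/1 vector over the preset corpus.

    standard_vector: dict mapping each corpus word to its number (0..n-1).
    Tokens that are not in the corpus are skipped (an assumption of this sketch).
    """
    codes = [standard_vector[t] for t in tokens if t in standard_vector]
    vector = [0] * len(standard_vector)
    for code in codes:
        vector[code] = 1  # mark the coordinate of every word that occurs
    return codes, vector

# Hypothetical four-word corpus, not taken from the patent.
standard_vector = {"red": 0, "blue": 1, "hat": 2, "scarf": 3}

codes, vector = encode(["blue", "hat", "wool"], standard_vector)
print(codes, vector)
```

Because every text is encoded against the same fixed corpus, all vectors share one dimension and can be compared directly.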
It can be understood that, in the technical solution provided by this embodiment of the invention, performing word segmentation processing on the text to be retrieved yields a relatively comprehensive word segmentation text set, which preserves the search range. The word segmentation text set is converted and encoded against the standard vector of the preset corpus to obtain its vector, so a common dictionary vector for the text to be retrieved and each stored text does not need to be rebuilt, which saves time. The similarity between vectors is expressed by their cosine similarity; once the highest cosine similarity is found, the text address corresponding to that vector is obtained, and the target text is retrieved from that address. The method is not limited to matching consecutive repeated characters. Because vector conversion and encoding use the standard vector of a fixed preset corpus as the reference, the resulting vectors are more accurate; once the vectors are determined, the corresponding cosine similarity is unique and the calculation is simple, which improves the accuracy of the similarity calculation.
In some embodiments, optionally, the method further includes: determining a necessary keyword of the text to be retrieved and the weight of the necessary keyword. In this case, determining the highest cosine similarity among the cosine similarities and taking the corresponding vector in the preset vector library as the target vector includes: calculating a weighted similarity for each cosine similarity according to the weight of the necessary keyword; sorting the weighted similarities by size; and determining the vector in the preset vector library corresponding to the highest weighted similarity as the target vector.
For example, the user may set the necessary keyword and its weight as required; for instance, a user may, according to the professional field, set a certain technical term "a" as the necessary keyword with a weight of 40%. Suppose the cosine similarities between the vector of the segmented text and 4 vectors in the preset vector library are calculated to be M1, M2, M3 and M4, and the segmented words corresponding to the first three vectors include the necessary keyword. The weighted similarities are then 40% + M1 × 60%, 40% + M2 × 60%, 40% + M3 × 60%, and M4 × 60%, and the vector in the preset vector library corresponding to the highest weighted similarity is determined as the target vector.
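The weighting rule in this example can be sketched as follows: a candidate whose segmented words contain the necessary keyword receives the full keyword weight, and the cosine similarity contributes the remaining share. The function name `weighted_similarity` and the numeric values standing in for M1 to M4 are invented for illustration.

```python
def weighted_similarity(cos_sim, has_keyword, weight):
    """weight * (1 if the necessary keyword is present, else 0) + (1 - weight) * cos_sim."""
    keyword_score = 1.0 if has_keyword else 0.0
    return weight * keyword_score + (1.0 - weight) * cos_sim

# (vector key, cosine similarity, contains necessary keyword); values are made up.
candidates = [
    ("v1", 0.50, True),
    ("v2", 0.70, True),
    ("v3", 0.60, True),
    ("v4", 0.90, False),
]

WEIGHT = 0.4  # the 40% keyword weight from the example above
ranked = sorted(
    candidates,
    key=lambda c: weighted_similarity(c[1], c[2], WEIGHT),
    reverse=True,
)
best = ranked[0][0]
print([key for key, _, _ in ranked], best)
```

With these sample values the fourth vector has the highest raw cosine similarity but ranks last after weighting, which illustrates how the keyword weight shifts the retrieval emphasis.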
It can be understood that the technical solution provided in this embodiment of the application lets a user set the retrieval emphasis according to retrieval needs by setting the necessary keyword and its weight, so that retrieval results are obtained more accurately.
Based on a general inventive concept, the embodiment of the present invention further provides a text search apparatus, which is used for implementing the above method embodiment.
Fig. 3 is a schematic structural diagram of a text search apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus according to the embodiment of the present invention may include the following structures:
the word segmentation module 31 is configured to perform word segmentation processing on a text to be retrieved based on jieba word segmentation to obtain a word segmentation text set, where the word segmentation text set includes at least two segmented words;
the encoding module 32 is configured to encode the segmented words in the word segmentation text set respectively according to the vector conversion method and a standard vector of a preset corpus to obtain a vector of the word segmentation text set;
the calculating module 33 is configured to calculate cosine similarity between the vector of the word segmentation text set and each vector in the preset vector library;
the determining module 34 is configured to determine a highest cosine similarity value in the cosine similarity, and use a vector in a preset vector library corresponding to the highest cosine similarity value as a target vector;
the obtaining module 35 is configured to obtain an address of the target text corresponding to the target vector based on a corresponding relationship between the preset vector and the target text address;
and a searching module 36, configured to search the text database for a target text corresponding to the address of the target text as a search target text.
Optionally, the apparatus further includes a construction module. The construction module is configured to, before word segmentation processing is performed on the text to be retrieved based on jieba word segmentation to obtain the word segmentation text set: acquire a target text set, where the target text set includes at least two target texts; perform word segmentation processing on the target text set, and combine the obtained segmented words according to a preset rule to obtain a preset corpus; perform vector conversion on each segmented word in the preset corpus according to a vector conversion method to obtain a standard vector, where the standard vector includes each segmented word and its corresponding number; encode each target text according to the standard vector to obtain a target text vector of each target text; construct a preset vector library from the target text vectors; and set a correspondence between preset vectors and target text addresses, and store the target texts in a text database according to that correspondence.
Optionally, the construction module is specifically configured to: perform word segmentation processing on each target text to obtain a word segmentation set corresponding to each target text; determine whether repeated segmented words exist among the word segmentation sets and, if so, add only one copy of each repeated segmented word to the preset corpus; and add the non-repeated segmented words to the preset corpus.
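The corpus-building behaviour of the construction module (merge the word segmentation sets, keep one copy of each repeated word, and give every word a number) might look like the sketch below. Numbering words in order of first appearance is an assumption of this sketch; the text only refers to "a preset rule". The function name `build_preset_corpus` and the sample tokens are hypothetical.

```python
def build_preset_corpus(token_sets):
    """Merge per-text token lists into one numbered corpus without duplicates.

    Returns a dict mapping each distinct token to its number. Numbers follow
    the order of first appearance (an assumed rule, not stated in the source).
    """
    standard_vector = {}
    for tokens in token_sets:
        for token in tokens:
            if token not in standard_vector:  # repeated words are added only once
                standard_vector[token] = len(standard_vector)
    return standard_vector

# Two already-segmented target texts, invented for illustration.
token_sets = [["this", "hat", "small"], ["this", "hat", "large"]]

standard_vector = build_preset_corpus(token_sets)
print(standard_vector)
```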
Optionally, the word segmentation module is further configured to determine whether a preset forbidden word or a preset symbol exists in the word segmentation text set;
and, if a preset forbidden word or a preset symbol exists in the word segmentation text set, delete the preset forbidden word or the preset symbol, and take the text set from which it has been deleted as the word segmentation text set.
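The filtering step can be sketched as a simple membership test against preset lists. The contents of `FORBIDDEN_WORDS` and `FORBIDDEN_SYMBOLS` and the function name `filter_tokens` are placeholders chosen for this example; the source does not specify the actual lists.

```python
FORBIDDEN_WORDS = {"the", "a"}              # hypothetical preset forbidden words
FORBIDDEN_SYMBOLS = {",", ".", "!", "?"}    # hypothetical preset symbols

def filter_tokens(tokens,
                  forbidden_words=FORBIDDEN_WORDS,
                  forbidden_symbols=FORBIDDEN_SYMBOLS):
    """Return the token list with preset forbidden words and symbols removed."""
    return [t for t in tokens
            if t not in forbidden_words and t not in forbidden_symbols]

filtered = filter_tokens(["the", "hat", ",", "is", "small", "."])
print(filtered)
```

If nothing in the input matches either list, the word segmentation text set is returned unchanged.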
Optionally, the apparatus further includes a receiving module, configured to determine a necessary keyword of the text to be retrieved and the weight of the necessary keyword. The determining module 34 is specifically configured to: calculate a weighted similarity for each cosine similarity according to the weight of the necessary keyword; sort the weighted similarities by size; and determine the vector in the preset vector library corresponding to the highest weighted similarity as the target vector.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on a general inventive concept, the embodiment of the present invention further provides a text search device, which is used to implement the above method embodiments.
Fig. 4 is a schematic structural diagram of a text search device according to an embodiment of the present invention. Referring to Fig. 4, the text search device of this embodiment includes a processor 41 and a memory 42, where the processor 41 is connected to the memory 42. The processor 41 is configured to call and execute the program stored in the memory 42; the memory 42 is configured to store a program that at least performs the text search method of the above embodiments.
The specific implementation of the text search device provided in the embodiment of the present application may refer to the implementation of the text search method in any of the above embodiments, and details are not described here.
It can be understood that, in the technical solution provided by this embodiment of the invention, word segmentation processing is performed on the text to be retrieved to obtain a segmented text; the segmented text is encoded according to a vector conversion method and a preset corpus to obtain a vector of the segmented text; and the cosine similarity between the vector of the segmented text and each vector in the preset vector library is calculated, and the target text is obtained according to the cosine similarity. With the technical solution provided by the embodiments of the application, text content is converted into vectors for vector retrieval, and the cosine similarity between vectors is calculated to obtain a target text similar to the text to be retrieved, which is convenient, fast and highly accurate.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A text search method, comprising:
performing word segmentation processing on a text to be retrieved based on jieba word segmentation to obtain a word segmentation text set, wherein the word segmentation text set comprises at least two segmented words;
respectively encoding the segmented words in the word segmentation text set according to a vector conversion method and a standard vector of a preset corpus to obtain a vector of the word segmentation text set;
calculating the cosine similarity between the vector of the word segmentation text set and each vector in a preset vector library;
determining a cosine similarity highest value in the cosine similarities, and taking a vector in a preset vector library corresponding to the cosine similarity highest value as a target vector;
acquiring an address of a target text corresponding to a target vector based on a corresponding relation between a preset vector and a target text address;
and searching a text database for a target text corresponding to the address of the target text as a search target text.
2. The method according to claim 1, wherein before performing word segmentation processing on the text to be retrieved based on jieba word segmentation to obtain the word segmentation text set, the method further comprises:
acquiring a target text set, wherein the target text set comprises at least two target texts;
performing word segmentation processing on the target text set, and combining the obtained segmented words according to a preset rule to obtain a preset corpus;
performing vector conversion on each segmented word in the preset corpus according to a vector conversion method to obtain a standard vector, wherein the standard vector comprises each segmented word and its corresponding number;
coding each target text according to the standard vector to obtain a target text vector of each target text;
constructing the preset vector library according to the target text vector; and setting the corresponding relation between the preset vector and the target text address, and storing the target text in the text database according to the corresponding relation between the preset vector and the target text address.
3. The method according to claim 2, wherein performing word segmentation processing on the target text set and combining the obtained segmented words according to a preset rule to obtain the preset corpus comprises:
performing word segmentation processing on each target text to obtain a word segmentation set corresponding to each target text;
determining whether repeated segmented words exist among the word segmentation sets, and if repeated segmented words exist among the word segmentation sets, adding only one copy of each repeated segmented word to the preset corpus; and adding the non-repeated segmented words to the preset corpus.
4. The method of claim 1, wherein performing word segmentation processing on the text to be retrieved based on jieba word segmentation to obtain the word segmentation text set further comprises:
judging whether preset forbidden words or preset symbols exist in the word segmentation text set or not;
and if the word segmentation text set has preset forbidden words or preset symbols, deleting the preset forbidden words or the preset symbols, and taking the text set in which the preset forbidden words or the preset symbols are deleted as the word segmentation text set.
5. The method according to claim 1, wherein the rule for calculating the cosine similarity between the vector of the segmented text set and each vector in the preset vector library comprises:
$$\cos(\theta) = \frac{\sum_{i} X_i Y_i}{\sqrt{\sum_{i} X_i^{2}}\,\sqrt{\sum_{i} Y_i^{2}}}$$
wherein X_i is a coordinate value of the vector of the word segmentation text set, Y_i is a coordinate value of the vector in the preset vector library, and i is the spatial dimension.
6. The method of claim 1, further comprising: determining the necessary keywords and the weight of the necessary keywords of the text to be retrieved;
wherein determining a cosine similarity highest value among the cosine similarities, and taking a vector in the preset vector library corresponding to the cosine similarity highest value as a target vector, comprises:
calculating the weighted similarity of each cosine similarity according to the weight of the necessary keywords;
sorting the weighted similarity according to size;
and determining the vector in the preset vector library corresponding to the highest weighted similarity as a target vector.
7. A text search apparatus, comprising:
the word segmentation module is used for performing word segmentation processing on a text to be retrieved based on jieba word segmentation to obtain a word segmentation text set, wherein the word segmentation text set comprises at least two segmented words;
the encoding module is used for respectively encoding the segmented words in the word segmentation text set according to a vector conversion method and a standard vector of a preset corpus to obtain a vector of the word segmentation text set;
the calculation module is used for calculating the cosine similarity between the vector of the word segmentation text set and each vector in a preset vector library;
the determining module is used for determining the highest cosine similarity value in the cosine similarity, and taking a vector in a preset vector library corresponding to the highest cosine similarity value as a target vector;
the acquisition module is used for acquiring the address of the target text corresponding to the target vector based on the corresponding relation between the preset vector and the target text address;
and the searching module is used for searching a target text corresponding to the address of the target text in a text database to serve as a searching target text.
8. The apparatus of claim 7, further comprising a construction module; the construction module is used for, before the text to be retrieved is subjected to word segmentation processing based on jieba word segmentation to obtain the word segmentation text set: acquiring a target text set, wherein the target text set comprises at least two target texts; performing word segmentation processing on the target text set, and combining the obtained segmented words according to a preset rule to obtain a preset corpus; performing vector conversion on each segmented word in the preset corpus according to a vector conversion method to obtain a standard vector, wherein the standard vector comprises each segmented word and its corresponding number; encoding each target text according to the standard vector to obtain a target text vector of each target text; constructing the preset vector library according to the target text vectors; and setting the corresponding relation between the preset vector and the target text address, and storing the target text in the text database according to the corresponding relation between the preset vector and the target text address.
9. The apparatus according to claim 8, wherein the construction module is specifically configured to: perform word segmentation processing on each target text to obtain a word segmentation set corresponding to each target text; determine whether repeated segmented words exist among the word segmentation sets, and if repeated segmented words exist among the word segmentation sets, add only one copy of each repeated segmented word to the preset corpus; and add the non-repeated segmented words to the preset corpus.
10. The apparatus according to claim 7, wherein the word segmentation module is further configured to determine whether a preset forbidden word or a preset symbol exists in the word segmentation text set; and, if a preset forbidden word or a preset symbol exists in the word segmentation text set, delete the preset forbidden word or the preset symbol, and take the text set from which it has been deleted as the word segmentation text set.
CN202210913444.9A 2022-08-01 2022-08-01 Text search method and device Pending CN114996439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210913444.9A CN114996439A (en) 2022-08-01 2022-08-01 Text search method and device

Publications (1)

Publication Number Publication Date
CN114996439A true CN114996439A (en) 2022-09-02

Family

ID=83021653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210913444.9A Pending CN114996439A (en) 2022-08-01 2022-08-01 Text search method and device

Country Status (1)

Country Link
CN (1) CN114996439A (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220902)