CN111259118B - Text data retrieval method and device - Google Patents

Text data retrieval method and device

Info

Publication number
CN111259118B
Authority
CN
China
Prior art keywords
vector
feature
preset
retrieval
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010370839.XA
Other languages
Chinese (zh)
Other versions
CN111259118A (en)
Inventor
侯凯
李耀东
金波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202010370839.XA priority Critical patent/CN111259118B/en
Publication of CN111259118A publication Critical patent/CN111259118A/en
Application granted granted Critical
Publication of CN111259118B publication Critical patent/CN111259118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text data retrieval method and a text data retrieval device. The method comprises the following steps: first, constructing the feature vectors extracted from preset text data into a vector set, wherein each feature vector comprises a first keyword and a first feature weight; second, classifying the vector set according to a first similarity between a preset hotspot vector and each feature vector to obtain a feature vector class library; third, constructing a retrieval vector according to a preset retrieval hotspot, wherein the retrieval vector comprises a second keyword and a second feature weight; fourth, randomly selecting a category from the feature vector class library and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity; and finally, replacing the second feature weight with the first feature weight according to a preset condition and performing iterative retrieval to obtain a retrieval feature vector. This solves the technical problems of poor retrieval performance and failure to meet practical application requirements efficiently.

Description

Text data retrieval method and device
Technical Field
The present application relates to the field of text retrieval technologies, and in particular, to a text data retrieval method and apparatus.
Background
In recent years, the rapid development of the internet has brought about an era of explosive information growth. As daily life gradually and comprehensively shifts to the internet, the big-data era has become inevitable. Big data, as a leading-edge concept of the global internet, mainly has two characteristics: first, the total amount of information increases sharply; second, the amount of information available to an individual grows exponentially.
Artificial intelligence is the specialized study of how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to improve their performance. With its development, artificial intelligence has been applied to many fields, solving problems that computers could not previously handle.
Enterprise IT systems contain large amounts of text data, both structured and unstructured, such as log information, software text records for business, financial, and sales management, customer-service complaints and suggestions, and mail comments. Because text data is naturally scattered, crosses systems and fields, and is growing sharply in volume, existing text feature extraction and retrieval technology cannot meet practical application requirements.
Disclosure of Invention
The application provides a text data retrieval method and a text data retrieval device, which are used for solving the technical problems that text data is disordered and cross-domain, that its volume increases sharply, that retrieval performance is poor, and that practical application requirements cannot be met efficiently.
In view of this, a first aspect of the present application provides a text data retrieval method, including:
s1: constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight;
s2: classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
s3: constructing a retrieval vector according to a preset retrieval hotspot, wherein the retrieval vector comprises a second keyword and a second feature weight;
s4: randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
s5: and when the maximum similarity is greater than or equal to a threshold value, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step S4 until a unique retrieval feature vector is obtained.
Preferably, step S1 is preceded by:
acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
Preferably, step S2 includes:
constructing a plurality of preset hotspot vectors, wherein the preset hotspot vectors comprise a third keyword and a third feature weight, and the preset hotspot vectors are standard vectors with timeliness;
calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
dividing the feature vectors of which the first similarity exceeds a similarity threshold into hotspot categories corresponding to the preset hotspot vectors;
and constructing the classified feature vectors into the feature vector class library.
Preferably, step S1 is followed by:
calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
L_i = TF / C_total

where L_i is the word-frequency ratio, TF is the word frequency (the number of occurrences of the keyword), and C_total is the total number of words;
calculating an updating weight according to the word frequency and a preset part-of-speech weight;
and adjusting the first feature weight by adopting the updating weight to obtain the optimized feature vector.
Preferably, step S5 further includes:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
A second aspect of the present application provides a text data retrieval apparatus, including:
the system comprises a first construction module, a second construction module and a third construction module, wherein the first construction module is used for constructing a feature vector extracted from preset text data into a vector set, and the feature vector comprises a first keyword and a first feature weight;
the classification module is used for classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
the second construction module is used for constructing a retrieval vector according to the preset retrieval hot spot, and the retrieval vector comprises a second keyword and a second feature weight;
the calculation module is used for randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
and the iteration module is used for, when the maximum similarity is greater than or equal to a threshold value, replacing the second feature weight with the first feature weight if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, and triggering the calculation module until a unique retrieval feature vector is obtained.
Preferably, the method further comprises the following steps:
the preprocessing module is used for acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
Preferably, the classification module comprises:
the first construction submodule is used for constructing a plurality of preset hotspot vectors, the preset hotspot vectors comprise third key words and third feature weights, and the preset hotspot vectors are standard vectors with timeliness;
the calculation submodule is used for calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
the classification submodule is used for classifying the feature vectors of which the first similarity exceeds a similarity threshold into hot spot categories corresponding to the preset hot spot vectors;
and the second construction submodule is used for constructing the classified feature vectors into the feature vector category library.
Preferably, the method further comprises the following steps:
the word frequency module is used for calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
L_i = TF / C_total

where L_i is the word-frequency ratio, TF is the word frequency (the number of occurrences of the keyword), and C_total is the total number of words;
the part-of-speech weight module is used for calculating and updating the weight according to the word frequency and the preset part-of-speech weight;
and the adjusting module is used for adjusting the first feature weight by adopting the updated weight to obtain the optimized feature vector.
Preferably, the iteration module is further configured to:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a text data retrieval method, which comprises the following steps: s1: constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight; s2: classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness; s3: constructing a retrieval vector according to the preset retrieval hot spot, wherein the retrieval vector comprises a second keyword and a second feature weight; s4: randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity; s5: and when the maximum similarity is greater than or equal to the threshold, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step S4 until a unique retrieval feature vector is obtained.
According to the text data retrieval method, scattered, disordered, weakly regular text data is expressed in vector form, with keywords as the feature items of each vector together with their corresponding weights, so that an abstract text concept is converted into a concrete mathematical model. Similarity is calculated between preset hotspot vectors of the same form and the constructed feature vectors to achieve classification, which can improve retrieval efficiency to a large extent; and because the preset hotspot vectors have timeliness, using them as the classification standard better matches practical conditions. The preset retrieval hotspot is the text information entered into the system for retrieval, and the corresponding retrieval vector is consistent with the vector form in the feature vector class library, which facilitates calculation. The retrieval process is not a one-step search that merely calculates similarities within the library, but an iterative search that updates weights and continuously optimizes the retrieval vector until the unique retrieval feature vector satisfying the conditions is obtained as the final result. Therefore, the text data retrieval method provided by the application can solve the technical problems that text data is disordered and cross-domain, that its volume increases sharply, that retrieval performance is poor, and that practical application requirements cannot be met efficiently.
Drawings
Fig. 1 is a schematic flowchart of a text data retrieval method according to an embodiment of the present application;
fig. 2 is another schematic flow chart of a text data retrieval method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text data retrieval device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, referring to fig. 1, a first embodiment of a text data retrieval method provided by the present application includes:
step 101, constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight.
It should be noted that the preset text data is collected and processed text data of different levels, crossing domains and systems, such as log information, software text records for business, financial, and sales management, customer-service complaints and suggestions, and mail comments. The internal relations of such text data are difficult to discover at the abstract level, so the data needs to be converted into a mathematical model convenient for study, namely a feature vector. Extracting the feature vector is a feature-extraction process in which the choice of feature items is important. The feature vector in this embodiment differs from the common vector form in that it consists of keywords and the weights corresponding to those keywords, for example (keyword 1, weight 1; keyword 2, weight 2; ... keyword n, weight n). Expressing text data with keywords as feature items is more targeted, can reduce the redundancy of the text data, and can improve processing efficiency.
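As a concrete illustration of the (keyword, weight) form described above, the following sketch builds such feature vectors from raw texts. The whitespace tokenization, the term-frequency weighting, and the `top_n` cutoff are illustrative assumptions, since the embodiment does not fix a particular extraction method:

```python
from collections import Counter

def extract_feature_vector(text, top_n=5):
    """Build a feature vector as (keyword -> weight) pairs.

    Weights here are simple term-frequency ratios; the patent only
    requires that each keyword carries a first feature weight.
    """
    words = text.lower().split()          # naive tokenization (assumption)
    counts = Counter(words)
    total = sum(counts.values())
    # keep the top_n most frequent words as feature items
    return {w: c / total for w, c in counts.most_common(top_n)}

def build_vector_set(documents):
    """Step S1: collect the feature vectors of all texts into a vector set."""
    return [extract_feature_vector(doc) for doc in documents]
```

In practice the dictionary form makes the later per-keyword weight comparisons straightforward, since matching keywords can be looked up directly.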
And 102, classifying the vector set according to a first similarity between the preset hot spot vector and the feature vector to obtain a feature vector class library.
It should be noted that although the preset hotspot vector is preset, it is a standard vector with timeliness. A hotspot is a recent event or problem that occurred a short time ago or occurs at high frequency; text information defined as a hotspot serves as the classification standard, which can effectively condense disordered text data to a certain extent, giving it some regularity and enabling classification. For convenience of calculation, the preset hotspot vector has the same form as the feature vector; its dimension can be set according to specific conditions, and it is then initialized to participate in the calculation of the first similarity. There is more than one preset hotspot vector, and each one selects a category, keeping only the feature vectors with high similarity to it. By calculating the similarity between each preset hotspot vector and the feature vectors in the vector set one by one, the feature vectors of each hotspot category can be selected. The specific selection can be done by setting a threshold: a feature vector whose similarity exceeds the threshold is classified into the current category, and otherwise it is directly ignored. The specific selection process is a realizable technique and is not described in detail here.
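The threshold-based classification described above can be sketched as follows. The cosine similarity over shared keywords stands in for the patent's first-similarity formula (which also involves keyword update times), and the threshold value is an arbitrary example:

```python
import math

def similarity(vec_a, vec_b):
    """Similarity of two (keyword -> weight) vectors.

    A cosine over the shared keywords is used here as a stand-in for
    the patent's own first-similarity formula.
    """
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    na = math.sqrt(sum(v * v for v in vec_a.values()))
    nb = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(vector_set, hotspot_vectors, threshold=0.3):
    """Step S2: assign each feature vector to every hotspot category
    whose first similarity exceeds the threshold; the rest are ignored."""
    library = {i: [] for i in range(len(hotspot_vectors))}
    for fv in vector_set:
        for i, hv in enumerate(hotspot_vectors):
            if similarity(fv, hv) > threshold:
                library[i].append(fv)
    return library
```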
Step 103, constructing a retrieval vector according to the preset retrieval hot spot, wherein the retrieval vector comprises a second keyword and a second feature weight.
It should be noted that the preset retrieval hotspot is text information input into the system for retrieval during retrieval, the construction of the corresponding retrieval vector is a process of extracting the feature item of the retrieval hotspot, and the construction method and the form of the retrieval vector are consistent with those of the feature vector, so that subsequent calculation or analysis is facilitated.
And 104, randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain the maximum similarity.
It should be noted that the similarity calculation here searches for text information that conforms to the preset retrieval hotspot. Retrieving directly in disordered text data takes time, the number of retrieved target texts is large, and the effect is not good enough; this embodiment adopts a step-by-step optimized retrieval method to solve this problem. The similarity between the vectors expressing the texts is a very direct retrieval criterion, and the second similarity can be calculated with an existing similarity formula, which is not described here. The maximum similarity and the corresponding text data are selected from the multiple second similarities; there may be more than one, and even with only one, the data volume of the text is large, so further optimized retrieval is required.
And 105, when the maximum similarity is larger than or equal to the threshold, if the first feature weight of the feature vector corresponding to the maximum similarity is larger than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step 104 until a unique retrieval feature vector is obtained.
It should be noted that whether the similarity meets the condition of the retrieval target is judged through a threshold; if no feature vector exceeds the threshold, the selected maximum similarity is meaningless. If the weight corresponding to a keyword in the feature vector is larger than the weight corresponding to the same keyword in the retrieval vector, the weights are replaced on the principle of "the larger replaces the smaller" to update the retrieval vector, and the retrieval then continues until the complete feature vector class library has been iterated, gradually narrowing the feature vectors that meet the conditions to obtain the unique retrieval feature vector. Note that replacing the second feature weight with the first feature weight does not replace every second feature weight of the entire retrieval vector; only the weight corresponding to the same keyword is replaced. If the first feature weight is smaller than the second feature weight, no replacement is made and the original second feature weight remains unchanged.
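The per-keyword "larger replaces smaller" weight update described above can be sketched as follows (function and variable names are illustrative):

```python
def update_search_vector(search_vec, best_match):
    """Per-keyword weight replacement of step 105.

    For each keyword the search vector shares with the best-matching
    feature vector, keep the larger of the two weights; keywords not
    present in the search vector, and first feature weights smaller
    than the second, leave the search vector unchanged.
    """
    updated = dict(search_vec)            # do not mutate the caller's vector
    for kw, first_weight in best_match.items():
        if kw in updated and first_weight > updated[kw]:
            updated[kw] = first_weight    # first feature weight wins
    return updated
```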
The text data retrieval method provided by the embodiment expresses the text data which is scattered and disordered and has weak regularity into a vector form, the keywords are used as feature items in the vector, the corresponding weights of the keywords are also included, so that an abstract text concept is converted into an image-bearing mathematical model, and the similarity calculation is performed between the preset hot spot vectors in the same form and the established feature vectors, so that the classification is realized, the retrieval efficiency can be improved to a greater extent by the classification, and the preset hot spot vectors have timeliness, so that the classification is used as a classification standard to better meet the actual condition; the preset retrieval hotspot is text information retrieved in the input system, and a corresponding retrieval vector is consistent with a vector form in the feature vector category library, so that calculation is facilitated; the retrieval process is not one-step retrieval except that the similarity is calculated in the library, but the iterative retrieval of the weight is updated, the retrieval vector is continuously optimized, and the unique retrieval feature vector which meets the condition is obtained and is used as the final retrieval result. Therefore, the text data retrieval method provided by the embodiment can solve the technical problems that the text data is disordered and spans the fields, the data volume is increased rapidly, the retrieval effect is poor, and the actual application requirements cannot be met efficiently.
For easy understanding, please refer to fig. 2, an embodiment two of a text data retrieval method is provided in the embodiment of the present application, including:
step 201, collecting disordered original text data.
Step 202, performing data cleaning operation on the original text data to obtain preset text data.
It should be noted that the original text data includes structured and unstructured data; enterprise IT systems contain a large amount of text data in log information, software text records for business, financial, and sales management, customer-service complaints and suggestions, mail comments, and the like. The original text data spans systems and fields and is relatively disordered, with uneven data levels and no correlation; the purpose of retrieval is to find the most similar text information among this disordered data according to the existing text information. The specific collection method can be installing an Agent on site to collect, analyze, and process logs; for places where installing an Agent is inconvenient, SNMP TRAP and Syslog log collection can be used for collection and storage, followed by Agent processing; collection can also be carried out by remote reading. The collected original text data has high complexity, inevitable noise, and inconsistent quality, so it needs to be cleaned before the preset text data can be obtained.
Step 203, constructing a feature vector extracted from the preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight.
And 204, calculating the word frequency of the first keyword through a preset formula.
Wherein, the preset formula is as follows:
L_i = TF / C_total

where L_i is the word-frequency ratio, TF is the word frequency (the number of occurrences of the keyword), and C_total is the total number of words.
And step 205, calculating an updating weight according to the word frequency and the preset part-of-speech weight.
And step 206, adjusting the first feature weight by adopting the updated weight to obtain the optimized feature vector.
It should be noted that the extraction of the feature vectors is the feature-extraction process for the preset text data, with keywords as feature items: each text is expressed as a feature vector N(x1, y1; x2, y2; ... ; xm, ym), where xm is a feature item, i.e., a first keyword, and ym is the first feature weight corresponding to that item. The number of feature vectors is large, and simply extracting keywords does not fully express the text data, so the first feature weight can be adjusted through a part-of-speech weight. Keywords are usually nouns or verbs, and the part-of-speech weight measures the importance of a keyword in the text and reflects its influence, so the adjustment optimizes the feature vector and strengthens its expressive power. The specific method of adjusting the first feature weight for the extracted keywords is as follows: TF represents the word frequency, i.e., the number of times the keyword appears in the text data; the higher the TF of a word, the higher its word frequency and the more important it is in the text data. However, a threshold L is set, generally L = 0.8: if L_i exceeds 0.8, the word is low-importance filler (function words such as the Chinese particles "地" and "们"), and the word frequency is updated to TF_new accordingly. The preset part-of-speech weight is the weight of the part of speech of the first keyword, mainly set for meaningful word classes such as verbs and nouns. The word frequency and the part-of-speech weight of the keyword jointly determine its importance in the text data, and the update weight can be obtained through the following formula:

w_i = k1 * TF_new + k2 * weight

where k1 and k2 are adjustable parameters taking the values 0, 1, 2, and 3; adjusting them yields different update weights, which are used to update the first feature weight and obtain the optimized feature vector.
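Under the assumption that L_i is the ratio TF / C_total and that a filtered filler word simply contributes no frequency term, the update-weight formula above can be sketched as:

```python
def update_weight(tf, total_words, pos_weight, k1=1, k2=1, L=0.8):
    """Compute the update weight w_i = k1 * TF_new + k2 * weight.

    L_i = tf / total_words is the word-frequency ratio; words whose
    ratio exceeds the threshold L (0.8 per the text) are treated as
    low-importance filler, so their updated frequency TF_new is
    zeroed out -- an assumption, since the patent only says such
    words are useless information.
    """
    li = tf / total_words
    tf_new = 0 if li > L else tf
    return k1 * tf_new + k2 * pos_weight
```

Different choices of k1 and k2 (the text gives the values 0, 1, 2, 3) trade off raw frequency against part-of-speech importance.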
And step 207, constructing a plurality of preset hotspot vectors, wherein the preset hotspot vectors comprise third key words and third feature weights, and the preset hotspot vectors are standard vectors with timeliness.
And 208, calculating a first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula.
And 209, dividing each feature vector with the first similarity exceeding the similarity threshold into hot spot categories corresponding to preset hot spot vectors.
And step 210, constructing the classified feature vectors into a feature vector category library.
It should be noted that although the preset hotspot vector is preset, it is a standard vector with timeliness. A hotspot is a recent event or problem that occurred a short time ago or occurs at high frequency; text information defined as a hotspot serves as the classification standard, which can effectively condense disordered text data to a certain extent, giving it some regularity and enabling classification. For convenience of calculation, the preset hotspot vector is also consistent with the form of the feature vector; its dimension can be set according to specific conditions, and it is then initialized. E(e1, s1; e2, s2; ... ; en, sn) represents a preset hotspot vector and participates in the calculation of the first similarity. There is more than one preset hotspot vector, each selecting one category and keeping only the feature vectors with higher similarity. By calculating the similarity between each preset hotspot vector and the feature vectors in the vector set one by one, the feature vectors of each hotspot category can be selected; specific selection can be done by setting a threshold, with similarities exceeding the threshold assigned to the current category. The specific preset similarity formula for calculating the first similarity is as follows:

[The preset similarity formula is given only as an image in the original publication (Figure GDA0002557381190000091); it is defined over the sets P1–P6 and the weights w_i and s_j described below.]

Denote the keywords of N(x1, y1; x2, y2; ... ; xm, ym) as P1 = {x1, x2, ... xm} and its weights as P2 = {y1, y2, ... ym}, with P3 = {t1, t2, ... tm} denoting the latest update times of the keywords in N. Denote the keywords of E(e1, s1; e2, s2; ... ; en, sn) as P4 = {e1, e2, ... en} and its weights as P5 = {s1, s2, ... sn}, with P6 = {q1, q2, ... qn} denoting the latest update times of the keywords in E. Here w_i and s_j are the first feature weight and the third feature weight, respectively. The first similarity between the feature vector N and the preset hotspot vector E can be calculated by the formula. The similarity threshold is preset; feature vectors below it are ignored, and those above it are classified into the event category of the current preset hotspot vector. A preset hotspot vector is then given again, and classification is repeated until all feature vectors are classified, yielding the feature vector class library.
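Since the preset similarity formula itself is published only as an image, the following sketch is purely illustrative: it combines weight products over shared keywords (P2/P5) with an exponential freshness factor derived from the update times (P3/P6). Both the functional form and the decay constant `tau` are assumptions, not the patent's formula:

```python
import math
import time

def first_similarity(weights_n, times_n, weights_e, times_e, now=None, tau=86400.0):
    """Illustrative first similarity between N and E.

    weights_* map keyword -> feature weight (w_i / s_j in the text);
    times_* map keyword -> latest update timestamp (P3 / P6).
    Weight products over shared keywords are scaled by an exponential
    freshness factor, then normalized -- an assumed stand-in for the
    formula shown only as an image in the publication.
    """
    now = time.time() if now is None else now
    shared = set(weights_n) & set(weights_e)
    if not shared:
        return 0.0
    score = 0.0
    for k in shared:
        # more recently updated keywords contribute more (assumption)
        freshness = math.exp(-(now - max(times_n[k], times_e[k])) / tau)
        score += weights_n[k] * weights_e[k] * freshness
    norm = math.sqrt(sum(v * v for v in weights_n.values())) * \
           math.sqrt(sum(v * v for v in weights_e.values()))
    return score / norm if norm else 0.0
```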
And step 211, constructing a retrieval vector according to the preset retrieval hot spot, wherein the retrieval vector comprises a second keyword and a second feature weight.
It should be noted that the retrieval vector represents text information input into the system for retrieval and needs to be processed into the same form as the feature vectors in the feature vector class library. Therefore, the extraction and optimization of the second keyword of the retrieval vector follow the extraction and optimization of the first keyword in the feature vector: keywords are extracted and the second feature weight is adjusted to obtain the optimized retrieval vector. The specific process is not repeated here.
Step 212, randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity.
It should be noted that the retrieval vector may yield a plurality of second similarities against the feature vectors in the feature vector category library. The maximum similarity and its corresponding text data are selected from these second similarities; there may be more than one such result, and even with a single result the text data may be large, so a further, refined retrieval is needed. The second similarity may be calculated with the same method as the first similarity or with another similarity formula, which is not limited here.
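The per-category maximum-similarity selection can be sketched as follows (an illustrative rendering with assumed names; the dot product over shared keywords stands in for whichever second-similarity formula is chosen):

```python
def dot_similarity(vec_a, vec_b):
    """Simple stand-in similarity: dot product over shared keywords."""
    return sum(vec_a[k] * vec_b[k] for k in set(vec_a) & set(vec_b))

def max_similarity_in_category(category_vectors, retrieval_vector, similarity):
    """Compute the second similarity of the retrieval vector against every
    feature vector in one category and keep the maximum."""
    best_vec, best_sim = None, float("-inf")
    for vec in category_vectors:
        sim = similarity(vec, retrieval_vector)
        if sim > best_sim:
            best_vec, best_sim = vec, sim
    return best_vec, best_sim

# Hypothetical category with two feature vectors.
category = [{"grid": 1.0}, {"outage": 1.0, "grid": 0.2}]
query = {"outage": 1.0}
best_vec, best_sim = max_similarity_in_category(category, query, dot_similarity)
```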
And 213, when the maximum similarity is greater than or equal to the threshold, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step 212 until a unique retrieval feature vector is obtained.
It should be noted that whether a similarity satisfies the retrieval target is judged by the threshold; if no feature vector exceeds the threshold, the selected maximum similarity is meaningless. If the weight corresponding to a keyword in the feature vector is larger than the weight corresponding to the same keyword in the retrieval vector, the larger weight replaces the smaller one to update the retrieval vector; retrieval then continues, and as the whole feature vector category library is traversed, the feature vectors satisfying the conditions are continuously narrowed down until the unique retrieval feature vector is obtained. Note that replacing the second feature weight with the first feature weight does not mean replacing the second feature weights of all keywords in the entire retrieval vector; only the second feature weight of a keyword is replaced by the first feature weight of the same keyword, and if a first feature weight is smaller than the corresponding second feature weight, no replacement is made and the original second feature weight is kept unchanged. The range of each retrieval iteration is not the whole feature vector category library; retrieval proceeds category by category, each category serving as a block. After the retrieval vector is updated, the second similarities are calculated again in another category and the maximum similarity is selected for analysis, until the whole feature vector category library has been traversed. Compared with a one-pass retrieval, this speeds up retrieval and makes the result more reliable and closer to the actual situation.
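The per-keyword weight replacement described above can be sketched as follows (an illustrative rendering with assumed names; vectors are keyword-to-weight dictionaries, and only shared keywords whose first feature weight is larger are replaced):

```python
def update_retrieval_vector(retrieval_vector, best_feature_vector):
    """For keywords shared by both vectors, replace the second feature weight
    with the first feature weight only when the first is larger; all other
    second feature weights stay unchanged."""
    updated = dict(retrieval_vector)
    for keyword, first_weight in best_feature_vector.items():
        if keyword in updated and first_weight > updated[keyword]:
            updated[keyword] = first_weight
    return updated

# Hypothetical data: "outage" is shared and its first feature weight is larger,
# so it is replaced; "grid" is not in the feature vector, so it is untouched;
# "repair" is not in the retrieval vector, so it is not added.
query = {"outage": 0.3, "grid": 0.7}
best = {"outage": 0.9, "repair": 0.5}
updated = update_retrieval_vector(query, best)
```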
And step 214, when the maximum similarity is smaller than the threshold value, determining that the information is not target information, and skipping the retrieval.
It should be noted that when the maximum similarity is smaller than the threshold, the retrieval has not found a sufficiently close feature vector and fails, so this retrieval is skipped; skipping may either end the operation directly or continue with another retrieval, as set according to the actual situation.
For ease of understanding, referring to fig. 3, an embodiment of a text data retrieving apparatus is further provided in the present application, including:
a first constructing module 301, configured to construct a feature vector extracted from preset text data into a vector set, where the feature vector includes a first keyword and a first feature weight;
the classification module 302 is configured to classify the vector set according to a first similarity between a preset hotspot vector and a feature vector to obtain a feature vector class library, where the preset hotspot vector is a standard vector with timeliness;
the second construction module 303 is configured to construct a retrieval vector according to the preset retrieval hotspot, where the retrieval vector includes a second keyword and a second feature weight;
a calculating module 304, configured to randomly select a category from the feature vector category library, and calculate a second similarity between each feature vector in the category and the search vector to obtain a maximum similarity;
and the iteration module 305 is configured to, when the maximum similarity is greater than or equal to the threshold, replace the second feature weight with the first feature weight if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, and trigger the calculation module until the unique retrieval feature vector is obtained.
Further, the apparatus further comprises:
a preprocessing module 306 for collecting disordered original text data;
and carrying out data cleaning operation on the original text data to obtain preset text data.
Further, the classification module 302 includes:
the first construction submodule 3021 is configured to construct a plurality of preset hotspot vectors, where the preset hotspot vectors include a third keyword and a third feature weight, and the preset hotspot vectors are standard vectors with timeliness;
the calculating submodule 3022 is configured to calculate a first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
a classification submodule 3023, configured to classify each feature vector with the first similarity exceeding the similarity threshold into a hotspot category corresponding to a preset hotspot vector;
and the second constructing submodule 3024 is configured to construct the classified feature vectors into a feature vector category library.
Further, the apparatus further comprises:
a word frequency module 307, configured to calculate a word frequency of the first keyword through a preset formula, where the preset formula of the word frequency is:
Li = TF / Ctotal
wherein Li is the word frequency, TF is the number of occurrences of the first keyword, and Ctotal is the total number of words;
a part-of-speech weighting module 308 for calculating update weights according to the word frequency and preset part-of-speech weights;
and the adjusting module 309 is configured to adjust the first feature weight by using the update weight to obtain the optimized feature vector.
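The word-frequency, part-of-speech-weight, and adjusting modules can be sketched together as follows, assuming Li = TF / Ctotal per the formula above; the patent does not specify how the word frequency and the preset part-of-speech weight combine into the update weight, so the product used below is an assumption:

```python
def word_frequency(tf, c_total):
    """Li = TF / Ctotal: occurrences of the keyword over the total word count."""
    return tf / c_total

def update_weight(tf, c_total, pos_weight):
    """Hypothetical combination of word frequency and part-of-speech weight;
    the exact rule is not given in the text, so a product is assumed."""
    return word_frequency(tf, c_total) * pos_weight

def adjust_first_weights(feature_vector, counts, c_total, pos_weights):
    """Scale each first feature weight by its update weight to obtain
    the optimized feature vector (unknown parts of speech default to 1.0)."""
    return {kw: w * update_weight(counts[kw], c_total, pos_weights.get(kw, 1.0))
            for kw, w in feature_vector.items()}

# Hypothetical data: one keyword occurring 4 times in a 100-word text,
# with a part-of-speech weight of 2.0.
vec = {"outage": 0.5}
adjusted = adjust_first_weights(vec, {"outage": 4}, 100, {"outage": 2.0})
```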
Further, the iteration module 305 is further configured to:
and when the maximum similarity is smaller than the threshold value, judging that the information is not target information, and skipping the retrieval.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A text data retrieval method, comprising:
s1: constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight;
s2: classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
s3: constructing a retrieval vector according to a preset retrieval hotspot, wherein the retrieval vector comprises a second keyword and a second feature weight;
s4: randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
s5: and when the maximum similarity is greater than or equal to a threshold value, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step S4 until a unique retrieval feature vector is obtained.
2. The text data retrieval method according to claim 1, wherein step S1 is preceded by:
acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
3. The text data retrieval method according to claim 1, wherein step S2 includes:
constructing a plurality of preset hotspot vectors, wherein the preset hotspot vectors comprise a third keyword and a third feature weight, and the preset hotspot vectors are standard vectors with timeliness;
calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
dividing the feature vectors of which the first similarity exceeds a similarity threshold into hotspot categories corresponding to the preset hotspot vectors;
and constructing the classified feature vectors into the feature vector class library.
4. The text data retrieval method according to claim 1, wherein step S1 is followed by further comprising:
calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
Li = TF / Ctotal
wherein Li is the word frequency, TF is the number of occurrences of the first keyword, and Ctotal is the total number of words;
calculating an updating weight according to the word frequency and a preset part-of-speech weight;
and adjusting the first feature weight by adopting the updating weight to obtain the optimized feature vector.
5. The text data retrieval method according to claim 1, wherein step S5 further includes:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
6. A text data retrieval apparatus, comprising:
the system comprises a first construction module, a second construction module and a third construction module, wherein the first construction module is used for constructing a feature vector extracted from preset text data into a vector set, and the feature vector comprises a first keyword and a first feature weight;
the classification module is used for classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
the second construction module is used for constructing a retrieval vector according to the preset retrieval hot spot, and the retrieval vector comprises a second keyword and a second feature weight;
the calculation module is used for randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
and the iteration module is used for, when the maximum similarity is greater than or equal to a threshold value, replacing the second feature weight with the first feature weight if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, and triggering the calculation module until a unique retrieval feature vector is obtained.
7. The text data retrieval device according to claim 6, further comprising:
the preprocessing module is used for acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
8. The text data retrieval device of claim 6, wherein the classification module comprises:
the first construction submodule is used for constructing a plurality of preset hotspot vectors, the preset hotspot vectors comprise third key words and third feature weights, and the preset hotspot vectors are standard vectors with timeliness;
the calculation submodule is used for calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
the classification submodule is used for classifying the feature vectors of which the first similarity exceeds a similarity threshold into hot spot categories corresponding to the preset hot spot vectors;
and the second construction submodule is used for constructing the classified feature vectors into the feature vector category library.
9. The text data retrieval device according to claim 6, further comprising:
the word frequency module is used for calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
Li = TF / Ctotal
wherein Li is the word frequency, TF is the number of occurrences of the first keyword, and Ctotal is the total number of words;
the part-of-speech weight module is used for calculating and updating the weight according to the word frequency and the preset part-of-speech weight;
and the adjusting module is used for adjusting the first feature weight by adopting the updated weight to obtain the optimized feature vector.
10. The text data retrieval device of claim 6, wherein the iteration module is further configured to:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
CN202010370839.XA 2020-05-06 2020-05-06 Text data retrieval method and device Active CN111259118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010370839.XA CN111259118B (en) 2020-05-06 2020-05-06 Text data retrieval method and device


Publications (2)

Publication Number Publication Date
CN111259118A CN111259118A (en) 2020-06-09
CN111259118B true CN111259118B (en) 2020-09-01

Family

ID=70951693





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant