CN111259118B - Text data retrieval method and device - Google Patents

Text data retrieval method and device

Info

Publication number
CN111259118B
Authority
CN
China
Prior art keywords
vector
feature
preset
retrieval
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010370839.XA
Other languages
Chinese (zh)
Other versions
CN111259118A (en)
Inventor
侯凯
李耀东
金波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202010370839.XA priority Critical patent/CN111259118B/en
Publication of CN111259118A publication Critical patent/CN111259118A/en
Application granted granted Critical
Publication of CN111259118B publication Critical patent/CN111259118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text data retrieval method and a text data retrieval device. The method comprises the following steps: first, constructing the feature vectors extracted from preset text data into a vector set, wherein each feature vector comprises a first keyword and a first feature weight; second, classifying the vector set according to a first similarity between a preset hotspot vector and each feature vector to obtain a feature vector class library; third, constructing a retrieval vector according to a preset retrieval hotspot, wherein the retrieval vector comprises a second keyword and a second feature weight; fourth, randomly selecting a category from the feature vector class library and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity; and finally, replacing the second feature weight with the first feature weight according to a preset condition and performing iterative retrieval to obtain a retrieval feature vector. This solves the technical problems of poor retrieval performance and failure to meet practical application requirements efficiently.

Description

Text data retrieval method and device
Technical Field
The present application relates to the field of text retrieval technologies, and in particular, to a text data retrieval method and apparatus.
Background
In recent years, the rapid development of the internet has brought about an era of explosive information growth. As daily life gradually and comprehensively shifts to the internet, the big-data era has become inevitable. Big data, as a leading-edge concept of the global internet, mainly has two characteristics: first, the total amount of information increases sharply; second, the amount of information available to an individual grows exponentially.
Artificial intelligence is the specialized study of how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to improve their performance. With its development, artificial intelligence has been applied to many fields, solving problems that computers could not previously handle.
Enterprise IT systems contain large amounts of text data, both structured and unstructured, such as log information, software text records for business, financial, and sales management, customer-service complaints and suggestions, and mail comments. Because text data is naturally scattered, crosses systems and fields, and is growing sharply in volume, existing text feature extraction and retrieval technology cannot meet practical application requirements.
Disclosure of Invention
The application provides a text data retrieval method and a text data retrieval device, which are used for solving the technical problems that text data is disordered and cross-domain, that its volume increases sharply, that retrieval performance is poor, and that practical application requirements cannot be met efficiently.
In view of this, a first aspect of the present application provides a text data retrieval method, including:
s1: constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight;
s2: classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
s3: constructing a retrieval vector according to a preset retrieval hotspot, wherein the retrieval vector comprises a second keyword and a second feature weight;
s4: randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
s5: and when the maximum similarity is greater than or equal to a threshold value, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step S4 until a unique retrieval feature vector is obtained.
Preferably, step S1 is preceded by:
acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
Preferably, step S2 includes:
constructing a plurality of preset hotspot vectors, wherein the preset hotspot vectors comprise a third keyword and a third feature weight, and the preset hotspot vectors are standard vectors with timeliness;
calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
dividing the feature vectors of which the first similarity exceeds a similarity threshold into hotspot categories corresponding to the preset hotspot vectors;
and constructing the classified feature vectors into the feature vector class library.
Preferably, step S1 is followed by:
calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
L_i = TF / C_total

where L_i is the word-frequency ratio, TF is the word frequency (the number of occurrences of the keyword), and C_total is the total number of words;
calculating an updating weight according to the word frequency and a preset part-of-speech weight;
and adjusting the first feature weight by adopting the updating weight to obtain the optimized feature vector.
Preferably, step S5 further includes:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
A second aspect of the present application provides a text data retrieval apparatus, including:
the system comprises a first construction module, a second construction module and a third construction module, wherein the first construction module is used for constructing a feature vector extracted from preset text data into a vector set, and the feature vector comprises a first keyword and a first feature weight;
the classification module is used for classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
the second construction module is used for constructing a retrieval vector according to the preset retrieval hot spot, and the retrieval vector comprises a second keyword and a second feature weight;
the calculation module is used for randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
and the iteration module is used for, when the maximum similarity is greater than or equal to a threshold value, replacing the second feature weight with the first feature weight if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, and triggering the calculation module until a unique retrieval feature vector is obtained.
Preferably, the method further comprises the following steps:
the preprocessing module is used for acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
Preferably, the classification module comprises:
the first construction submodule is used for constructing a plurality of preset hotspot vectors, the preset hotspot vectors comprise third key words and third feature weights, and the preset hotspot vectors are standard vectors with timeliness;
the calculation submodule is used for calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
the classification submodule is used for classifying the feature vectors of which the first similarity exceeds a similarity threshold into hot spot categories corresponding to the preset hot spot vectors;
and the second construction submodule is used for constructing the classified feature vectors into the feature vector category library.
Preferably, the method further comprises the following steps:
the word frequency module is used for calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
L_i = TF / C_total

where L_i is the word-frequency ratio, TF is the word frequency (the number of occurrences of the keyword), and C_total is the total number of words;
the part-of-speech weight module is used for calculating and updating the weight according to the word frequency and the preset part-of-speech weight;
and the adjusting module is used for adjusting the first feature weight by adopting the updated weight to obtain the optimized feature vector.
Preferably, the iteration module is further configured to:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a text data retrieval method, which comprises the following steps: s1: constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight; s2: classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness; s3: constructing a retrieval vector according to the preset retrieval hot spot, wherein the retrieval vector comprises a second keyword and a second feature weight; s4: randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity; s5: and when the maximum similarity is greater than or equal to the threshold, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step S4 until a unique retrieval feature vector is obtained.
According to the text data retrieval method, scattered, disordered, weakly regular text data is expressed in vector form, with keywords as the feature items of each vector together with their corresponding weights, so that an abstract text concept is converted into a concrete mathematical model. Similarity is calculated between preset hotspot vectors of the same form and the constructed feature vectors to achieve classification, which can improve retrieval efficiency to a large extent; and because the preset hotspot vectors have timeliness, using them as the classification standard better matches practical conditions. The preset retrieval hotspot is the text information entered into the system for retrieval, and the corresponding retrieval vector is consistent with the vector form in the feature vector class library, which facilitates calculation. The retrieval process is not a one-step search that merely calculates similarities within the library, but an iterative search that updates weights and continuously optimizes the retrieval vector until the unique retrieval feature vector satisfying the conditions is obtained as the final result. Therefore, the text data retrieval method provided by the application can solve the technical problems that text data is disordered and cross-domain, that its volume increases sharply, that retrieval performance is poor, and that practical application requirements cannot be met efficiently.
Drawings
Fig. 1 is a schematic flowchart of a text data retrieval method according to an embodiment of the present application;
fig. 2 is another schematic flow chart of a text data retrieval method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text data retrieval device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, referring to fig. 1, a first embodiment of a text data retrieval method provided by the present application includes:
step 101, constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight.
It should be noted that the preset text data is collected and processed text data of different levels, crossing domains and systems, such as log information, software text records for business, financial, and sales management, customer-service complaints and suggestions, and mail comments. The internal relations of such text data are difficult to discover at the abstract level, so the data needs to be converted into a mathematical model convenient for study, namely a feature vector. Extracting the feature vector is a feature-extraction process in which the choice of feature items is important. The feature vector in this embodiment differs from the common vector form in that it consists of keywords and the weights corresponding to those keywords, for example (keyword 1, weight 1; keyword 2, weight 2; ... keyword n, weight n). Expressing text data with keywords as feature items is more targeted, can reduce the redundancy of the text data, and can improve processing efficiency.
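As a concrete illustration of the (keyword, weight) form described above, the following sketch builds such feature vectors from raw texts. The whitespace tokenization, the term-frequency weighting, and the `top_n` cutoff are illustrative assumptions, since the embodiment does not fix a particular extraction method:

```python
from collections import Counter

def extract_feature_vector(text, top_n=5):
    """Build a feature vector as (keyword -> weight) pairs.

    Weights here are simple term-frequency ratios; the patent only
    requires that each keyword carries a first feature weight.
    """
    words = text.lower().split()          # naive tokenization (assumption)
    counts = Counter(words)
    total = sum(counts.values())
    # keep the top_n most frequent words as feature items
    return {w: c / total for w, c in counts.most_common(top_n)}

def build_vector_set(documents):
    """Step S1: collect the feature vectors of all texts into a vector set."""
    return [extract_feature_vector(doc) for doc in documents]
```

In practice the dictionary form makes the later per-keyword weight comparisons straightforward, since matching keywords can be looked up directly.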
And 102, classifying the vector set according to a first similarity between the preset hot spot vector and the feature vector to obtain a feature vector class library.
It should be noted that although the preset hotspot vector is preset, it is a standard vector with timeliness. A hotspot is a recent event or problem that occurred a short time ago or occurs at high frequency; text information defined as a hotspot serves as the classification standard, which can effectively condense disordered text data to a certain extent, giving it some regularity and enabling classification. For convenience of calculation, the preset hotspot vector has the same form as the feature vector; its dimension can be set according to specific conditions, and it is then initialized to participate in the calculation of the first similarity. There is more than one preset hotspot vector, and each one selects a category, keeping only the feature vectors with high similarity to it. By calculating the similarity between each preset hotspot vector and the feature vectors in the vector set one by one, the feature vectors of each hotspot category can be selected. The specific selection can be done by setting a threshold: a feature vector whose similarity exceeds the threshold is classified into the current category, and otherwise it is directly ignored. The specific selection process is a realizable technique and is not described in detail here.
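The threshold-based classification described above can be sketched as follows. The cosine similarity over shared keywords stands in for the patent's first-similarity formula (which also involves keyword update times), and the threshold value is an arbitrary example:

```python
import math

def similarity(vec_a, vec_b):
    """Similarity of two (keyword -> weight) vectors.

    A cosine over the shared keywords is used here as a stand-in for
    the patent's own first-similarity formula.
    """
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    na = math.sqrt(sum(v * v for v in vec_a.values()))
    nb = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(vector_set, hotspot_vectors, threshold=0.3):
    """Step S2: assign each feature vector to every hotspot category
    whose first similarity exceeds the threshold; the rest are ignored."""
    library = {i: [] for i in range(len(hotspot_vectors))}
    for fv in vector_set:
        for i, hv in enumerate(hotspot_vectors):
            if similarity(fv, hv) > threshold:
                library[i].append(fv)
    return library
```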
Step 103, constructing a retrieval vector according to the preset retrieval hot spot, wherein the retrieval vector comprises a second keyword and a second feature weight.
It should be noted that the preset retrieval hotspot is text information input into the system for retrieval during retrieval, the construction of the corresponding retrieval vector is a process of extracting the feature item of the retrieval hotspot, and the construction method and the form of the retrieval vector are consistent with those of the feature vector, so that subsequent calculation or analysis is facilitated.
And 104, randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain the maximum similarity.
It should be noted that the similarity calculation here searches for text information that conforms to the preset retrieval hotspot. Retrieving directly in disordered text data takes time, the number of retrieved target texts is large, and the effect is not good enough; this embodiment adopts a step-by-step optimized retrieval method to solve this problem. The similarity between the vectors expressing the texts is a very direct retrieval criterion, and the second similarity can be calculated with an existing similarity formula, which is not described here. The maximum similarity and the corresponding text data are selected from the multiple second similarities; there may be more than one, and even with only one, the data volume of the text is large, so further optimized retrieval is required.
And 105, when the maximum similarity is larger than or equal to the threshold, if the first feature weight of the feature vector corresponding to the maximum similarity is larger than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step 104 until a unique retrieval feature vector is obtained.
It should be noted that whether the similarity meets the condition of the retrieval target is judged through a threshold; if no feature vector exceeds the threshold, the selected maximum similarity is meaningless. If the weight corresponding to a keyword in the feature vector is larger than the weight corresponding to the same keyword in the retrieval vector, the weights are replaced on the principle of "the larger replaces the smaller" to update the retrieval vector, and the retrieval then continues until the complete feature vector class library has been iterated, gradually narrowing the feature vectors that meet the conditions to obtain the unique retrieval feature vector. Note that replacing the second feature weight with the first feature weight does not replace every second feature weight of the entire retrieval vector; only the weight corresponding to the same keyword is replaced. If the first feature weight is smaller than the second feature weight, no replacement is made and the original second feature weight remains unchanged.
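The per-keyword "larger replaces smaller" weight update described above can be sketched as follows (function and variable names are illustrative):

```python
def update_search_vector(search_vec, best_match):
    """Per-keyword weight replacement of step 105.

    For each keyword the search vector shares with the best-matching
    feature vector, keep the larger of the two weights; keywords not
    present in the search vector, and first feature weights smaller
    than the second, leave the search vector unchanged.
    """
    updated = dict(search_vec)            # do not mutate the caller's vector
    for kw, first_weight in best_match.items():
        if kw in updated and first_weight > updated[kw]:
            updated[kw] = first_weight    # first feature weight wins
    return updated
```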
The text data retrieval method provided by the embodiment expresses the text data which is scattered and disordered and has weak regularity into a vector form, the keywords are used as feature items in the vector, the corresponding weights of the keywords are also included, so that an abstract text concept is converted into an image-bearing mathematical model, and the similarity calculation is performed between the preset hot spot vectors in the same form and the established feature vectors, so that the classification is realized, the retrieval efficiency can be improved to a greater extent by the classification, and the preset hot spot vectors have timeliness, so that the classification is used as a classification standard to better meet the actual condition; the preset retrieval hotspot is text information retrieved in the input system, and a corresponding retrieval vector is consistent with a vector form in the feature vector category library, so that calculation is facilitated; the retrieval process is not one-step retrieval except that the similarity is calculated in the library, but the iterative retrieval of the weight is updated, the retrieval vector is continuously optimized, and the unique retrieval feature vector which meets the condition is obtained and is used as the final retrieval result. Therefore, the text data retrieval method provided by the embodiment can solve the technical problems that the text data is disordered and spans the fields, the data volume is increased rapidly, the retrieval effect is poor, and the actual application requirements cannot be met efficiently.
For easy understanding, please refer to fig. 2, an embodiment two of a text data retrieval method is provided in the embodiment of the present application, including:
step 201, collecting disordered original text data.
Step 202, performing data cleaning operation on the original text data to obtain preset text data.
It should be noted that the original text data includes structured and unstructured data; enterprise IT systems contain a large amount of text data in log information, software text records for business, financial, and sales management, customer-service complaints and suggestions, mail comments, and the like. The original text data spans systems and fields and is relatively disordered, with uneven data levels and no correlation; the purpose of retrieval is to find the most similar text information among this disordered data according to the existing text information. The specific collection method can be installing an Agent on site to collect, analyze, and process logs; for places where installing an Agent is inconvenient, SNMP TRAP and Syslog log collection can be used for collection and storage, followed by Agent processing; collection can also be carried out by remote reading. The collected original text data has high complexity, inevitable noise, and inconsistent quality, so it needs to be cleaned before the preset text data can be obtained.
Step 203, constructing a feature vector extracted from the preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight.
And 204, calculating the word frequency of the first keyword through a preset formula.
Wherein, the preset formula is as follows:
L_i = TF / C_total

where L_i is the word-frequency ratio, TF is the word frequency (the number of occurrences of the keyword), and C_total is the total number of words.
And step 205, calculating an updating weight according to the word frequency and the preset part-of-speech weight.
And step 206, adjusting the first feature weight by adopting the updated weight to obtain the optimized feature vector.
It should be noted that the extraction of the feature vectors is the feature-extraction process for the preset text data, with keywords as feature items: each text is expressed as a feature vector N(x1, y1; x2, y2; ... ; xm, ym), where xm is a feature item, i.e., a first keyword, and ym is the first feature weight corresponding to that item. The number of feature vectors is large, and simply extracting keywords does not fully express the text data, so the first feature weight can be adjusted through a part-of-speech weight. Keywords are usually nouns or verbs, and the part-of-speech weight measures the importance of a keyword in the text and reflects its influence, so the adjustment optimizes the feature vector and strengthens its expressive power. The specific method of adjusting the first feature weight for the extracted keywords is as follows: TF represents the word frequency, i.e., the number of times the keyword appears in the text data; the higher the TF of a word, the higher its word frequency and the more important it is in the text data. However, a threshold L is set, generally L = 0.8: if L_i exceeds 0.8, the word is low-importance filler (function words such as the Chinese particles "地" and "们"), and the word frequency is updated to TF_new accordingly. The preset part-of-speech weight is the weight of the part of speech of the first keyword, mainly set for meaningful word classes such as verbs and nouns. The word frequency and the part-of-speech weight of the keyword jointly determine its importance in the text data, and the update weight can be obtained through the following formula:

w_i = k1 * TF_new + k2 * weight

where k1 and k2 are adjustable parameters taking the values 0, 1, 2, and 3; adjusting them yields different update weights, which are used to update the first feature weight and obtain the optimized feature vector.
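Under the assumption that L_i is the ratio TF / C_total and that a filtered filler word simply contributes no frequency term, the update-weight formula above can be sketched as:

```python
def update_weight(tf, total_words, pos_weight, k1=1, k2=1, L=0.8):
    """Compute the update weight w_i = k1 * TF_new + k2 * weight.

    L_i = tf / total_words is the word-frequency ratio; words whose
    ratio exceeds the threshold L (0.8 per the text) are treated as
    low-importance filler, so their updated frequency TF_new is
    zeroed out -- an assumption, since the patent only says such
    words are useless information.
    """
    li = tf / total_words
    tf_new = 0 if li > L else tf
    return k1 * tf_new + k2 * pos_weight
```

Different choices of k1 and k2 (the text gives the values 0, 1, 2, 3) trade off raw frequency against part-of-speech importance.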
And step 207, constructing a plurality of preset hotspot vectors, wherein the preset hotspot vectors comprise third key words and third feature weights, and the preset hotspot vectors are standard vectors with timeliness.
And 208, calculating a first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula.
And 209, dividing each feature vector with the first similarity exceeding the similarity threshold into hot spot categories corresponding to preset hot spot vectors.
And step 210, constructing the classified feature vectors into a feature vector category library.
It should be noted that although the preset hotspot vector is preset, it is a standard vector with timeliness. A hotspot is a recent event or problem that occurred a short time ago or occurs at high frequency; text information defined as a hotspot serves as the classification standard, which can effectively condense disordered text data to a certain extent, giving it some regularity and enabling classification. For convenience of calculation, the preset hotspot vector is also consistent with the form of the feature vector; its dimension can be set according to specific conditions, and it is then initialized. E(e1, s1; e2, s2; ... ; en, sn) represents a preset hotspot vector and participates in the calculation of the first similarity. There is more than one preset hotspot vector, each selecting one category and keeping only the feature vectors with higher similarity. By calculating the similarity between each preset hotspot vector and the feature vectors in the vector set one by one, the feature vectors of each hotspot category can be selected; specific selection can be done by setting a threshold, with similarities exceeding the threshold assigned to the current category. The specific preset similarity formula for calculating the first similarity is as follows:

[The preset similarity formula is given only as an image in the original publication (Figure GDA0002557381190000091); it is defined over the sets P1–P6 and the weights w_i and s_j described below.]

Denote the keywords of N(x1, y1; x2, y2; ... ; xm, ym) as P1 = {x1, x2, ... xm} and its weights as P2 = {y1, y2, ... ym}, with P3 = {t1, t2, ... tm} denoting the latest update times of the keywords in N. Denote the keywords of E(e1, s1; e2, s2; ... ; en, sn) as P4 = {e1, e2, ... en} and its weights as P5 = {s1, s2, ... sn}, with P6 = {q1, q2, ... qn} denoting the latest update times of the keywords in E. Here w_i and s_j are the first feature weight and the third feature weight, respectively. The first similarity between the feature vector N and the preset hotspot vector E can be calculated by the formula. The similarity threshold is preset; feature vectors below it are ignored, and those above it are classified into the event category of the current preset hotspot vector. A preset hotspot vector is then given again, and classification is repeated until all feature vectors are classified, yielding the feature vector class library.
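Since the preset similarity formula itself is published only as an image, the following sketch is purely illustrative: it combines weight products over shared keywords (P2/P5) with an exponential freshness factor derived from the update times (P3/P6). Both the functional form and the decay constant `tau` are assumptions, not the patent's formula:

```python
import math
import time

def first_similarity(weights_n, times_n, weights_e, times_e, now=None, tau=86400.0):
    """Illustrative first similarity between N and E.

    weights_* map keyword -> feature weight (w_i / s_j in the text);
    times_* map keyword -> latest update timestamp (P3 / P6).
    Weight products over shared keywords are scaled by an exponential
    freshness factor, then normalized -- an assumed stand-in for the
    formula shown only as an image in the publication.
    """
    now = time.time() if now is None else now
    shared = set(weights_n) & set(weights_e)
    if not shared:
        return 0.0
    score = 0.0
    for k in shared:
        # more recently updated keywords contribute more (assumption)
        freshness = math.exp(-(now - max(times_n[k], times_e[k])) / tau)
        score += weights_n[k] * weights_e[k] * freshness
    norm = math.sqrt(sum(v * v for v in weights_n.values())) * \
           math.sqrt(sum(v * v for v in weights_e.values()))
    return score / norm if norm else 0.0
```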
And step 211, constructing a retrieval vector according to the preset retrieval hot spot, wherein the retrieval vector comprises a second keyword and a second feature weight.
It should be noted that the retrieval vector represents text information input into the system for retrieval and needs to be processed into the same form as the feature vectors in the feature vector class library. Therefore, the extraction and optimization of the second keyword of the retrieval vector follow the extraction and optimization of the first keyword in the feature vector: keywords are extracted and the second feature weight is adjusted to obtain the optimized retrieval vector. The specific process is not repeated here.
Step 212, randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity.
It should be noted that the retrieval vector may yield a plurality of second similarities against the feature vectors in the feature vector category library. The maximum similarity and its corresponding text data are selected from these second similarities; there may be more than one such result, and even with a single result the text data may be large, so a further, refined retrieval is needed. The second similarity may be calculated with the same method as the first similarity or with another similarity formula, which is not limited here.
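The per-category maximum-similarity selection can be sketched as follows (an illustrative rendering with assumed names; the dot product over shared keywords stands in for whichever second-similarity formula is chosen):

```python
def dot_similarity(vec_a, vec_b):
    """Simple stand-in similarity: dot product over shared keywords."""
    return sum(vec_a[k] * vec_b[k] for k in set(vec_a) & set(vec_b))

def max_similarity_in_category(category_vectors, retrieval_vector, similarity):
    """Compute the second similarity of the retrieval vector against every
    feature vector in one category and keep the maximum."""
    best_vec, best_sim = None, float("-inf")
    for vec in category_vectors:
        sim = similarity(vec, retrieval_vector)
        if sim > best_sim:
            best_vec, best_sim = vec, sim
    return best_vec, best_sim

# Hypothetical category with two feature vectors.
category = [{"grid": 1.0}, {"outage": 1.0, "grid": 0.2}]
query = {"outage": 1.0}
best_vec, best_sim = max_similarity_in_category(category, query, dot_similarity)
```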
And 213, when the maximum similarity is greater than or equal to the threshold, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step 212 until a unique retrieval feature vector is obtained.
It should be noted that whether a similarity satisfies the retrieval target is judged by the threshold; if no feature vector exceeds the threshold, the selected maximum similarity is meaningless. If the weight corresponding to a keyword in the feature vector is larger than the weight corresponding to the same keyword in the retrieval vector, the larger weight replaces the smaller one to update the retrieval vector; retrieval then continues, and as the whole feature vector category library is traversed, the feature vectors satisfying the conditions are continuously narrowed down until the unique retrieval feature vector is obtained. Note that replacing the second feature weight with the first feature weight does not mean replacing the second feature weights of all keywords in the entire retrieval vector; only the second feature weight of a keyword is replaced by the first feature weight of the same keyword, and if a first feature weight is smaller than the corresponding second feature weight, no replacement is made and the original second feature weight is kept unchanged. The range of each retrieval iteration is not the whole feature vector category library; retrieval proceeds category by category, each category serving as a block. After the retrieval vector is updated, the second similarities are calculated again in another category and the maximum similarity is selected for analysis, until the whole feature vector category library has been traversed. Compared with a one-pass retrieval, this speeds up retrieval and makes the result more reliable and closer to the actual situation.
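The per-keyword weight replacement described above can be sketched as follows (an illustrative rendering with assumed names; vectors are keyword-to-weight dictionaries, and only shared keywords whose first feature weight is larger are replaced):

```python
def update_retrieval_vector(retrieval_vector, best_feature_vector):
    """For keywords shared by both vectors, replace the second feature weight
    with the first feature weight only when the first is larger; all other
    second feature weights stay unchanged."""
    updated = dict(retrieval_vector)
    for keyword, first_weight in best_feature_vector.items():
        if keyword in updated and first_weight > updated[keyword]:
            updated[keyword] = first_weight
    return updated

# Hypothetical data: "outage" is shared and its first feature weight is larger,
# so it is replaced; "grid" is not in the feature vector, so it is untouched;
# "repair" is not in the retrieval vector, so it is not added.
query = {"outage": 0.3, "grid": 0.7}
best = {"outage": 0.9, "repair": 0.5}
updated = update_retrieval_vector(query, best)
```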
And step 214, when the maximum similarity is smaller than the threshold value, determining that the information is not target information, and skipping the retrieval.
It should be noted that when the maximum similarity is smaller than the threshold, the retrieval has not found a sufficiently close feature vector and fails, so this retrieval is skipped; skipping may either end the operation directly or continue with another retrieval, as set according to the actual situation.
For ease of understanding, referring to fig. 3, an embodiment of a text data retrieving apparatus is further provided in the present application, including:
a first constructing module 301, configured to construct a feature vector extracted from preset text data into a vector set, where the feature vector includes a first keyword and a first feature weight;
the classification module 302 is configured to classify the vector set according to a first similarity between a preset hotspot vector and a feature vector to obtain a feature vector class library, where the preset hotspot vector is a standard vector with timeliness;
the second construction module 303 is configured to construct a retrieval vector according to the preset retrieval hotspot, where the retrieval vector includes a second keyword and a second feature weight;
a calculating module 304, configured to randomly select a category from the feature vector category library, and calculate a second similarity between each feature vector in the category and the search vector to obtain a maximum similarity;
and the iteration module 305 is configured to, when the maximum similarity is greater than or equal to the threshold, replace the second feature weight with the first feature weight if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, and trigger the calculation module until the unique retrieval feature vector is obtained.
Further, the apparatus further comprises:
a preprocessing module 306 for collecting disordered original text data;
and carrying out data cleaning operation on the original text data to obtain preset text data.
Further, the classification module 302 includes:
the first construction submodule 3021 is configured to construct a plurality of preset hotspot vectors, where the preset hotspot vectors include a third keyword and a third feature weight, and the preset hotspot vectors are standard vectors with timeliness;
the calculating submodule 3022 is configured to calculate a first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
a classification submodule 3023, configured to classify each feature vector with the first similarity exceeding the similarity threshold into a hotspot category corresponding to a preset hotspot vector;
and the second constructing submodule 3024 is configured to construct the classified feature vectors into a feature vector category library.
Further, the apparatus further comprises:
a word frequency module 307, configured to calculate a word frequency of the first keyword through a preset formula, where the preset formula of the word frequency is:
Li = TF / Ctotal
wherein Li is the word frequency, TF is the number of occurrences of the first keyword, and Ctotal is the total number of words;
a part-of-speech weighting module 308 for calculating update weights according to the word frequency and preset part-of-speech weights;
and the adjusting module 309 is configured to adjust the first feature weight by using the update weight to obtain the optimized feature vector.
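The word-frequency, part-of-speech-weight, and adjusting modules can be sketched together as follows, assuming Li = TF / Ctotal per the formula above; the patent does not specify how the word frequency and the preset part-of-speech weight combine into the update weight, so the product used below is an assumption:

```python
def word_frequency(tf, c_total):
    """Li = TF / Ctotal: occurrences of the keyword over the total word count."""
    return tf / c_total

def update_weight(tf, c_total, pos_weight):
    """Hypothetical combination of word frequency and part-of-speech weight;
    the exact rule is not given in the text, so a product is assumed."""
    return word_frequency(tf, c_total) * pos_weight

def adjust_first_weights(feature_vector, counts, c_total, pos_weights):
    """Scale each first feature weight by its update weight to obtain
    the optimized feature vector (unknown parts of speech default to 1.0)."""
    return {kw: w * update_weight(counts[kw], c_total, pos_weights.get(kw, 1.0))
            for kw, w in feature_vector.items()}

# Hypothetical data: one keyword occurring 4 times in a 100-word text,
# with a part-of-speech weight of 2.0.
vec = {"outage": 0.5}
adjusted = adjust_first_weights(vec, {"outage": 4}, 100, {"outage": 2.0})
```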
Further, the iteration module 305 is further configured to:
and when the maximum similarity is smaller than the threshold value, judging that the information is not target information, and skipping the retrieval.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A text data retrieval method, comprising:
s1: constructing a feature vector extracted from preset text data into a vector set, wherein the feature vector comprises a first keyword and a first feature weight;
s2: classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
s3: constructing a retrieval vector according to a preset retrieval hotspot, wherein the retrieval vector comprises a second keyword and a second feature weight;
s4: randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
s5: and when the maximum similarity is greater than or equal to a threshold value, if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, replacing the second feature weight with the first feature weight, and repeating the step S4 until a unique retrieval feature vector is obtained.
2. The text data retrieval method according to claim 1, wherein step S1 is preceded by:
acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
3. The text data retrieval method according to claim 1, wherein step S2 includes:
constructing a plurality of preset hotspot vectors, wherein the preset hotspot vectors comprise a third keyword and a third feature weight, and the preset hotspot vectors are standard vectors with timeliness;
calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
dividing the feature vectors of which the first similarity exceeds a similarity threshold into hotspot categories corresponding to the preset hotspot vectors;
and constructing the classified feature vectors into the feature vector class library.
4. The text data retrieval method according to claim 1, wherein step S1 is followed by further comprising:
calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
Li = TF / Ctotal
wherein Li is the word frequency, TF is the number of occurrences of the first keyword, and Ctotal is the total number of words;
calculating an updating weight according to the word frequency and a preset part-of-speech weight;
and adjusting the first feature weight by adopting the updating weight to obtain the optimized feature vector.
5. The text data retrieval method according to claim 1, wherein step S5 further includes:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
6. A text data retrieval apparatus, comprising:
the system comprises a first construction module, a second construction module and a third construction module, wherein the first construction module is used for constructing a feature vector extracted from preset text data into a vector set, and the feature vector comprises a first keyword and a first feature weight;
the classification module is used for classifying the vector set according to a first similarity between a preset hotspot vector and the feature vector to obtain a feature vector class library, wherein the preset hotspot vector is a standard vector with timeliness;
the second construction module is used for constructing a retrieval vector according to the preset retrieval hot spot, and the retrieval vector comprises a second keyword and a second feature weight;
the calculation module is used for randomly selecting a category from the feature vector category library, and calculating a second similarity between each feature vector in the category and the retrieval vector to obtain a maximum similarity;
and the iteration module is used for, when the maximum similarity is greater than or equal to a threshold value, replacing the second feature weight with the first feature weight if the first feature weight of the feature vector corresponding to the maximum similarity is greater than the second feature weight, and triggering the calculation module until a unique retrieval feature vector is obtained.
7. The text data retrieval device according to claim 6, further comprising:
the preprocessing module is used for acquiring disordered original text data;
and carrying out data cleaning operation on the original text data to obtain the preset text data.
8. The text data retrieval device of claim 6, wherein the classification module comprises:
the first construction submodule is used for constructing a plurality of preset hotspot vectors, the preset hotspot vectors comprise third key words and third feature weights, and the preset hotspot vectors are standard vectors with timeliness;
the calculation submodule is used for calculating the first similarity between the preset hotspot vector and each feature vector according to a preset similarity formula;
the classification submodule is used for classifying the feature vectors of which the first similarity exceeds a similarity threshold into hot spot categories corresponding to the preset hot spot vectors;
and the second construction submodule is used for constructing the classified feature vectors into the feature vector category library.
9. The text data retrieval device according to claim 6, further comprising:
the word frequency module is used for calculating the word frequency of the first keyword through a preset formula, wherein the preset formula is as follows:
Li = TF / Ctotal
wherein Li is the word frequency, TF is the number of occurrences of the first keyword, and Ctotal is the total number of words;
the part-of-speech weight module is used for calculating and updating the weight according to the word frequency and the preset part-of-speech weight;
and the adjusting module is used for adjusting the first feature weight by adopting the updated weight to obtain the optimized feature vector.
10. The text data retrieval device of claim 6, wherein the iteration module is further configured to:
and when the maximum similarity is smaller than a threshold value, judging that the information is not target information, and skipping the retrieval.
CN202010370839.XA 2020-05-06 2020-05-06 Text data retrieval method and device Active CN111259118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010370839.XA CN111259118B (en) 2020-05-06 2020-05-06 Text data retrieval method and device


Publications (2)

Publication Number Publication Date
CN111259118A CN111259118A (en) 2020-06-09
CN111259118B true CN111259118B (en) 2020-09-01

Family

ID=70951693





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant