CN110019668A - A kind of text searching method and device - Google Patents
- Publication number
- CN110019668A (application number CN201711043608.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- retrieved
- words
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a text retrieval method and device. The method includes: segmenting a retrieval text to obtain a retrieval word set; for each word in the retrieval word set, computing its TextRank value with the TextRank algorithm; selecting a preset number of words as a keyword set according to the TextRank values; determining the word vector of each word in the keyword set; obtaining a text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in each such text word set; computing the similarity between the word vector of each word in the keyword set and the word vector of each word in each text word set; and sorting and outputting the at least one text to be retrieved according to the similarities. The invention improves the accuracy of retrieval results.
Description
Technical field
The present invention relates to the field of text retrieval, and in particular to a text retrieval method and device.
Background
Pushing similar legal-document cases means taking one legal document as input and, using some algorithm, obtaining a series of other documents similar to it, so that historical documents (also called historical cases) relevant to the currently input document can be found quickly.
However, the algorithms in common use are generally based on simple screening rules, such as an identical cause of action or the same applicable statutes, to retrieve documents similar to the input. The results obtained in this way are often of poor accuracy.
Summary of the invention
In view of the above problems, the present invention provides a text retrieval method and device that overcome, or at least partially solve, those problems. The technical solution is as follows:
A text retrieval method, the method comprising:
segmenting a retrieval text to obtain a retrieval word set;
for each word in the retrieval word set, computing its TextRank value with the TextRank algorithm;
selecting a preset number of words as a keyword set according to the TextRank values;
determining the word vector of each word in the keyword set;
obtaining a text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in each such text word set;
computing the similarity between the word vector of each word in the keyword set and the word vector of each word in each text word set; and
sorting and outputting the at least one text to be retrieved according to the similarities.
Optionally, obtaining the at least one text to be retrieved comprises: determining at least one text to be retrieved based on the retrieval text using a text similarity algorithm.
Obtaining the text word set corresponding to each text to be retrieved comprises:
segmenting each text to be retrieved to obtain multiple words;
removing duplicate words and stop words from the multiple words to obtain a candidate word set;
for each word in the candidate word set, computing its TextRank value with the TextRank algorithm; and
determining the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
Optionally, sorting and outputting the at least one text to be retrieved according to the similarities comprises:
for any text to be retrieved, obtaining, from the computed similarities between each word in the keyword set and each word in that text's word set, the maximum similarity corresponding to each word in the keyword set;
sorting these maximum similarities in descending order and taking the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved; and
sorting and outputting the at least one text to be retrieved according to the ranking scores of the texts.
Optionally, a text to be retrieved comprises a title and a body.
Optionally, determining the word vector of a word comprises:
determining the word vector with a pre-trained word vector model;
wherein the pre-trained word vector model is any one of the following: a word2vector (word2vec) model, a latent semantic analysis (LSA) matrix factorization model, a probabilistic latent semantic analysis (PLSA) model, or a latent Dirichlet allocation (LDA) model.
A text retrieval device, the device comprising:
a segmentation unit, configured to segment a retrieval text to obtain a retrieval word set;
a TextRank value computing unit, configured to compute, for each word in the retrieval word set, the word's TextRank value with the TextRank algorithm;
a keyword set determination unit, configured to select a preset number of words as a keyword set according to the TextRank values;
a first word vector determination unit, configured to determine the word vector of each word in the keyword set;
a text word set acquisition unit, configured to obtain a text word set corresponding to each of at least one text to be retrieved;
a second word vector determination unit, configured to determine the word vector of each word in each such text word set;
a similarity computing unit, configured to compute the similarity between the word vector of each word in the keyword set and the word vector of each word in each text word set; and
a text sorting output unit, configured to sort and output the at least one text to be retrieved according to the similarities.
Optionally, the text word set acquisition unit comprises:
a text-to-be-retrieved determination subunit, configured to determine at least one text to be retrieved based on the retrieval text using a text similarity algorithm;
a segmentation subunit, configured to segment each text to be retrieved to obtain multiple words;
a candidate word set determination subunit, configured to remove duplicate words and stop words from the multiple words to obtain a candidate word set;
a TextRank value computing subunit, configured to compute, for each word in the candidate word set, the word's TextRank value with the TextRank algorithm; and
a text word set determination subunit, configured to determine the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
Optionally, the text sorting output unit comprises:
a maximum similarity determination subunit, configured to obtain, for any text to be retrieved, the maximum similarity corresponding to each word in the keyword set from the computed similarities between each word in the keyword set and each word in that text's word set;
a ranking score determination subunit, configured to sort these maximum similarities in descending order and take the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved; and
a text sorting output subunit, configured to sort and output the at least one text to be retrieved according to the ranking scores of the texts.
A storage medium storing a program, wherein the program, when executed by a processor, implements the text retrieval method described above.
A processor for running a program, wherein the program, when running, executes the text retrieval method described above.
With the above technical solution, the text retrieval method and device provided by the invention segment the retrieval text to obtain a retrieval word set; compute, for each word in the retrieval word set, the word's TextRank value with the TextRank algorithm; select a preset number of words as a keyword set according to the TextRank values; determine the word vector of each word in the keyword set; obtain a text word set corresponding to each of at least one text to be retrieved and determine the word vector of each word in each such set; compute the similarity between the word vector of each keyword and the word vector of each word in each text word set; and sort and output the at least one text to be retrieved according to the similarities.
By computing TextRank values for the retrieval word set and selecting a preset number of words as the keyword set according to those values, the invention obtains a keyword set that expresses the core content of the retrieval text relatively accurately and eliminates the interference of high-frequency but irrelevant words, which ensures the accuracy of the texts to be retrieved to a certain extent. Moreover, the application represents the relationships between words with word vectors and sorts the texts to be retrieved according to those relationships, which further improves the accuracy of the retrieval results.
The above is merely an overview of the technical solution of the present invention. To make the technical means of the invention easier to understand so that it can be implemented according to this specification, and to make the above and other objects, features, and advantages of the invention clearer, specific embodiments of the invention are given below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a text retrieval method provided in an embodiment of the present invention;
Fig. 2 shows a structural schematic diagram of a text retrieval device provided in an embodiment of the present invention.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth here. Rather, these embodiments are provided so that the present invention will be understood more thoroughly and its scope will be fully conveyed to those skilled in the art.
As shown in Fig. 1, a text retrieval method provided in an embodiment of the present invention may include:
Step 101: segment the retrieval text to obtain a retrieval word set.
Specifically, the present invention may segment the retrieval text input by the user using at least one of: a dictionary-matching segmentation method, a word-frequency-statistics segmentation method, a knowledge-based segmentation method, or the LTP (Language Technology Platform) segmentation tool of Harbin Institute of Technology.
Optionally, after segmenting the retrieval text input by the user, the present invention may also deduplicate all the words obtained by segmentation to obtain a deduplicated retrieval word set. For example, if the word "robbery" occurs N times among the words obtained by segmentation, the present invention may delete N-1 occurrences of "robbery", so that "robbery" appears only once in the retrieval word set, where N is a positive integer greater than 1.
For ease of description, the retrieval word set obtained by segmenting the user's retrieval text is denoted A.
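Step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy tokenizer splits on non-word characters, whereas a real system would use a Chinese segmenter such as jieba or HIT's LTP as described above.

```python
import re

def tokenize(text):
    """Toy tokenizer for illustration: split on non-word characters.
    A real system would use a dictionary- or statistics-based segmenter."""
    return [w for w in re.split(r"\W+", text) if w]

def build_retrieval_word_set(text):
    """Segment the retrieval text, then deduplicate while preserving order,
    so a word occurring N times survives exactly once (set A)."""
    return list(dict.fromkeys(tokenize(text)))

words = build_retrieval_word_set("armed robbery robbery case sentencing robbery")
print(words)  # ['armed', 'robbery', 'case', 'sentencing']
```

`dict.fromkeys` is used instead of `set` so that the first-occurrence order of the words is kept, which makes the later steps deterministic.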
Step 102: for each word in the retrieval word set, compute its TextRank value with the TextRank algorithm.
After obtaining the retrieval word set A, the present invention uses the TextRank algorithm to compute the TextRank value of each word in A.
Step 103: select a preset number of words as a keyword set according to the TextRank values.
Specifically, the present invention may select the preset number of words as the keyword set in descending order of TextRank value. The preset number is, for example, 8, 10, or 12; the invention is not limited in this regard.
More specifically, taking a preset number of 10 as an example, the present invention may select the top 10 words in descending order of TextRank value to form the keyword set. For ease of description, the keyword set is denoted K.
Optionally, before selecting the top 10 words in descending order of TextRank value to form keyword set K, the present invention may remove stop words and words whose part of speech is judged to be a conjunction, preposition, or the like.
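Steps 102 and 103 can be sketched together. The patent does not spell out its TextRank formulation, so this is a common textbook variant assumed here: words co-occurring within a small window are linked in an undirected graph, and the PageRank recurrence is iterated over that graph.

```python
from collections import defaultdict

def textrank_scores(words, window=2, d=0.85, iters=50):
    """Minimal TextRank sketch: build a co-occurrence graph over `words`
    (edges within `window` positions), then iterate the PageRank update
    score(w) = (1 - d) + d * sum(score(u) / degree(u)) over neighbors u."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return score

def top_keywords(words, k):
    """Step 103: keep the k words with the highest TextRank values (set K)."""
    s = textrank_scores(words)
    return sorted(s, key=s.get, reverse=True)[:k]
```

The window size, damping factor d, and iteration count are illustrative defaults, not values taken from the patent.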
Step 104: determine the word vector of each word in the keyword set.
Specifically, the present invention determines the word vector of each word in the keyword set with a pre-trained word vector model. The pre-trained word vector model may be any one of the following: a word2vector model, an LSA (Latent Semantic Analysis) matrix factorization model, a PLSA (Probabilistic Latent Semantic Analysis) model, or an LDA (Latent Dirichlet Allocation) model (commonly called a topic model). The pre-trained word vector model is used to determine the word vector of each word in keyword set K.
In practice, the present invention may train the word vector model in advance, for example on a fixed number of texts. In a practical application, a word2vector model can be trained on judgement documents at the scale of 100,000, and the trained model then yields the word vector of each word in the retrieval word set. The word vector of each word can represent the relationships (such as similarity) between words, and the dimension of a word vector can lie within a preset range, for example 50 to 300 dimensions, with the specific number determined by the application.
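To make "a word vector represents relationships between words" concrete without depending on a trained word2vector model, here is a toy distributional stand-in: each word's vector is its context-word count profile, so words used in similar contexts get similar vectors. This is only an illustration of the idea; the patent's models (word2vector, LSA, PLSA, LDA) are trained offline on a large corpus.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(docs, window=2):
    """Toy distributional word vectors: for each word, count the words
    appearing within `window` positions of it across all documents.
    The resulting count profiles play the role of word vectors here."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    counts[w][doc[j]] += 1
    return {w: [counts[w][v] for v in vocab] for w in vocab}, vocab

docs = [["court", "issued", "verdict"], ["court", "issued", "judgement"]]
vectors, vocab = cooccurrence_vectors(docs)
# "verdict" and "judgement" share identical contexts, so their vectors match
```

In a real deployment one would instead load a pre-trained model (e.g. gensim's `Word2Vec`) and look vectors up from it; the 50-300 dimension range mentioned above refers to such learned embeddings.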
Step 105: obtain a text word set corresponding to each of at least one text to be retrieved, and determine the word vector of each word in each such text word set.
Specifically, step 105 may be implemented by the following steps 1051-1056:
Step 1051: determine at least one text to be retrieved based on the retrieval text using a text similarity algorithm.
For example, the retrieval text input by the user is used as the input of a search engine (such as Elasticsearch), similarities between long texts are computed based on measures such as TF-IDF, and texts whose similarity meets a preset threshold are taken as texts to be retrieved.
Of course, the present invention may also simply treat all judgement documents in a judgement-document library as texts to be retrieved, without performing step 1051.
The number of texts to be retrieved in the embodiment of the present invention may be no less than some quantity, such as 100,000, and the texts to be retrieved are preferably judgement documents. Optionally, a text to be retrieved may include a title and a body. It will be appreciated that the words in the title are particularly important for a text to be retrieved, so the present invention treats the title and the body together as the text to be retrieved and can determine the text word set from both, which is more comprehensive and accurate.
Step 1052: segment each text to be retrieved to obtain multiple words.
Specifically, the present invention may segment each text to be retrieved using at least one of: a dictionary-matching segmentation method, a word-frequency-statistics segmentation method, a knowledge-based segmentation method, or the LTP segmentation tool of Harbin Institute of Technology.
Step 1053: remove duplicate words and stop words from the multiple words to obtain a candidate word set.
Specifically, removing duplicate words from the multiple words is a deduplication process.
Stop words are words that, in information retrieval, are filtered out automatically before or after processing natural language data (or text) in order to save storage space and improve search efficiency. Stop words fall into two classes. One class is the function words of human language, which are extremely common: a word like "net", for example, appears on almost every website, so a search engine cannot guarantee truly relevant results for it, it does little to narrow the search, and it also reduces search efficiency. The other class is words with no clear meaning of their own, such as modal particles, adverbs, prepositions, and conjunctions.
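Step 1053 reduces to a filter plus a deduplication pass. A minimal sketch, with a made-up stop-word list (real systems load a curated list for the target language):

```python
def candidate_word_set(words, stop_words):
    """Step 1053 sketch: drop stop words, then deduplicate while
    preserving first-occurrence order to form the candidate word set."""
    kept = [w for w in words if w not in stop_words]
    return list(dict.fromkeys(kept))

stop = {"the", "of", "and"}  # illustrative only
print(candidate_word_set(["the", "court", "of", "appeal", "court"], stop))
# ['court', 'appeal']
```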
Step 1054: for each word in the candidate word set, compute its TextRank value with the TextRank algorithm.
Step 1054 is implemented in the same way as step 102 above.
Step 1055: determine the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
After computing the TextRank value of each word in the candidate word set, the present invention may rank the words in the candidate word set by TextRank value and, for example, take the top N words as the text word set. When the retrieval text is a long text such as a judgement document, the keyword set of the retrieval text and the text word set of a text to be retrieved may optionally be obtained in the same way.
Step 1056: determine the word vector of each word in each text word set.
After the text word sets are determined, the word vector of each word in each text word set is determined with the pre-trained word vector model.
Note that the word vector model used in step 1056 must be the same as the one used in step 104.
Step 106: compute the similarity between the word vector of each word in the keyword set and the word vector of each word in each text word set corresponding to the at least one text to be retrieved.
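The patent does not pin down the similarity measure for step 106; cosine similarity between word vectors is assumed in this sketch, since it is the usual choice for embeddings.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two word vectors (assumed measure for step 106)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_similarities(keyword_vecs, text_vecs):
    """Similarity of every keyword vector against every text-word vector,
    keyed as sims[keyword][text_word]."""
    return {
        kw: {tw: cosine(kv, tv) for tw, tv in text_vecs.items()}
        for kw, kv in keyword_vecs.items()
    }
```

The nested-dict output feeds directly into the maximum-similarity selection of step 1071 below.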
Step 107: sort and output the at least one text to be retrieved according to the similarities.
Specifically, step 107 may be implemented by the following steps 1071-1073:
Step 1071: for any text to be retrieved, obtain, from the computed similarities between each word in the keyword set and each word in that text's word set, the maximum similarity corresponding to each word in the keyword set.
Step 1072: sort these maximum similarities in descending order and take the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved.
In the present invention, the predetermined position in the descending order of maximum similarities is, for example, the M-th position, the middle position, or the last position, with the middle position preferred; the invention is not limited in this regard. M is a positive integer.
Step 1073: sort and output the at least one text to be retrieved according to the ranking scores of the texts.
As an example, suppose keyword set K contains the three words A1, A2, and A3, that the text word set determined for some text to be retrieved contains the three words B1, B2, and B3, and that the similarities between each word in K and each word in that text word set are:
A1 and B1: 23%;
A1 and B2: 50%;
A1 and B3: 61%;
A2 and B1: 15%;
A2 and B2: 76%;
A2 and B3: 95%;
A3 and B1: 100%;
A3 and B2: 2%;
A3 and B3: 30%.
Then the maximum of the three similarities for A1 is 61%, i.e., A1 is most similar to B3 in the text word set. Likewise, the maximum for A2 is 95%, i.e., A2 is most similar to B3, and the maximum for A3 is 100%, i.e., A3 is most similar to B1. For the keyword set K of the three words A1, A2, and A3, the ranking score of this text to be retrieved can then be the 2nd of the three maxima in descending order, namely the middle-position value 95%, or the last of the three maxima, namely 61%.
It will be appreciated that each maximum similarity indicates that a word in keyword set K is highly relevant to some word in the text to be retrieved, and that taking the value at, say, the 5th, 7th, middle, or last position in the descending order of maximum similarities as the ranking score of the text to be retrieved lets every word in keyword set K be reflected in the similarity to the text word set determined in that text, which guarantees the accuracy of retrieval.
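The worked example above can be reproduced in a few lines. The function below implements steps 1071-1072: take each keyword's maximum similarity, sort the maxima in descending order, and read off the value at the predetermined (1-based) position.

```python
def ranking_score(sim, position):
    """sim maps each keyword to its similarities against the text's words.
    Returns the `position`-th largest of the per-keyword maxima, which
    serves as the ranking score of the text to be retrieved."""
    maxima = sorted((max(row.values()) for row in sim.values()), reverse=True)
    return maxima[position - 1]

# The A1/A2/A3 vs B1/B2/B3 similarities from the example, as fractions:
sim = {
    "A1": {"B1": 0.23, "B2": 0.50, "B3": 0.61},
    "A2": {"B1": 0.15, "B2": 0.76, "B3": 0.95},
    "A3": {"B1": 1.00, "B2": 0.02, "B3": 0.30},
}
print(ranking_score(sim, 2))  # middle position -> 0.95
print(ranking_score(sim, 3))  # last position -> 0.61
```

Sorting all candidate texts by this score in descending order then yields the final output of step 1073.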
Therefore, the text retrieval method provided by the present invention computes the TextRank value of each word in the retrieval word set with the TextRank algorithm and selects a preset number of words as a keyword set according to those values. The resulting keyword set expresses the core content of the retrieval text relatively accurately and eliminates the interference of high-frequency but irrelevant words, which ensures the accuracy of the texts to be retrieved to a certain extent. Moreover, the application represents the relationships between words with word vectors and sorts the texts to be retrieved according to those relationships, further improving the accuracy of the retrieval results.
Corresponding to the above method embodiment, the present invention also provides a text retrieval device.
As shown in Fig. 2, a text retrieval device provided in an embodiment of the present invention may include: a segmentation unit 10, a TextRank value computing unit 20, a keyword set determination unit 30, a first word vector determination unit 40, a text word set acquisition unit 50, a second word vector determination unit 60, a similarity computing unit 70, and a text sorting output unit 80. Of these:
the segmentation unit 10 is configured to segment a retrieval text to obtain a retrieval word set;
the TextRank value computing unit 20 is configured to compute, for each word in the retrieval word set, the word's TextRank value with the TextRank algorithm;
the keyword set determination unit 30 is configured to select a preset number of words as a keyword set according to the TextRank values;
the first word vector determination unit 40 is configured to determine the word vector of each word in the keyword set;
the text word set acquisition unit 50 is configured to obtain a text word set corresponding to each of at least one text to be retrieved;
the second word vector determination unit 60 is configured to determine the word vector of each word in each such text word set;
the similarity computing unit 70 is configured to compute the similarity between the word vector of each word in the keyword set and the word vector of each word in each text word set; and
the text sorting output unit 80 is configured to sort and output the at least one text to be retrieved according to the similarities.
Optionally, the text word set acquisition unit comprises:
a text-to-be-retrieved determination subunit, configured to determine at least one text to be retrieved based on the retrieval text using a text similarity algorithm;
a segmentation subunit, configured to segment each text to be retrieved to obtain multiple words;
a candidate word set determination subunit, configured to remove duplicate words and stop words from the multiple words to obtain a candidate word set;
a TextRank value computing subunit, configured to compute, for each word in the candidate word set, the word's TextRank value with the TextRank algorithm; and
a text word set determination subunit, configured to determine the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
Optionally, the text sorting output unit comprises:
a maximum similarity determination subunit, configured to obtain, for any text to be retrieved, the maximum similarity corresponding to each word in the keyword set from the computed similarities between each word in the keyword set and each word in that text's word set;
a ranking score determination subunit, configured to sort these maximum similarities in descending order and take the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved; and
a text sorting output subunit, configured to sort and output the at least one text to be retrieved according to the ranking scores of the texts.
The text retrieval device includes a processor and a memory. The segmentation unit 10, TextRank value computing unit 20, keyword set determination unit 30, first word vector determination unit 40, text word set acquisition unit 50, second word vector determination unit 60, similarity computing unit 70, text sorting output unit 80, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to realize the corresponding functions.
The processor contains a kernel, which fetches the corresponding program unit from the memory. One or more kernels may be provided, and text retrieval is performed by adjusting kernel parameters.
The memory may include forms of computer-readable media such as non-volatile memory, random access memory (RAM), and/or non-volatile storage such as read-only memory (ROM) or flash memory (flash RAM), and includes at least one memory chip.
An embodiment of the invention provides a storage medium storing a program that, when executed by a processor, implements the text retrieval method.
An embodiment of the invention provides a processor for running a program that, when running, executes the text retrieval method.
The embodiment of the invention provides a kind of equipment, equipment include processor, memory and storage on a memory and can
The program run on a processor, processor perform the steps of when executing program
Retrieval text is segmented, retrieval set of words is obtained;
For each word in the retrieval set of words, TextRank algorithm is respectively adopted and calculates each word
TextRank value;
According to the TextRank value of each word, the word of preset quantity is chosen as keyword set;
Determine the term vector of each word in the keyword set;
At least one corresponding text word set of text to be retrieved is obtained, and at least one is to be retrieved described in determination
The term vector of each word in the corresponding text word set of text;
The term vector for calculating each word in the keyword set is each at least one described text to be retrieved respectively
The similarity of the term vector of each word in self-corresponding text word set;
At least one described text to be retrieved is ranked up output according to the similarity.
Optionally, obtaining the at least one text to be retrieved comprises: determining the at least one text to be retrieved based on the retrieval text using a text similarity algorithm.
Obtaining the text word set corresponding to each text to be retrieved comprises:
segmenting each text to be retrieved to obtain a plurality of words;
removing duplicate words and stop words from the plurality of words to obtain a candidate word set;
for each word in the candidate word set, calculating a TextRank value of the word using the TextRank algorithm;
determining the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
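The preprocessing that produces the candidate word set (segmentation output with duplicate words and stop words removed) may be sketched as follows; the stop-word list is an illustrative placeholder, as the disclosure does not specify one:

```python
# Sketch: deduplicate segmented words and drop stop words, preserving
# first-occurrence order, to form the candidate word set. In the full
# pipeline the survivors would then be ranked by TextRank.
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}  # placeholder list

def candidate_word_set(tokens):
    seen, result = set(), []
    for tok in tokens:
        if tok in STOP_WORDS or tok in seen:
            continue  # remove stop words and duplicate words
        seen.add(tok)
        result.append(tok)
    return result
```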
Optionally, sorting and outputting the at least one text to be retrieved according to the similarities comprises:
for any text to be retrieved, obtaining, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
sorting the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved;
sorting and outputting the at least one text to be retrieved according to the ranking scores of the texts.
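A minimal sketch of this ranking rule, assuming cosine similarity between word vectors (the disclosure does not fix the similarity measure) and a zero-based predetermined position in the descending list of per-keyword maxima:

```python
# Sketch: per query keyword, take its maximum similarity over the
# candidate document's word vectors; sort those maxima descending; the
# value at a predetermined position is the document's ranking score.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranking_score(keyword_vecs, doc_vecs, position=0):
    maxima = sorted(
        (max(cosine(kv, dv) for dv in doc_vecs) for kv in keyword_vecs),
        reverse=True,
    )
    return maxima[min(position, len(maxima) - 1)]

def rank_documents(keyword_vecs, docs, position=0):
    # docs: mapping of document id -> list of word vectors
    scores = {d: ranking_score(keyword_vecs, vecs, position)
              for d, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```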
Optionally, the text to be retrieved comprises: a title of the text to be retrieved and a body of the text to be retrieved.
Optionally, determining the word vector of a word comprises: determining the word vector of the word using a pre-trained word vector model, wherein the pre-trained word vector model comprises any of the following: a word2vector model, a latent semantic analysis (LSA) matrix decomposition model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model. The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
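As an illustration of one of the listed options, the LSA matrix-decomposition model can produce word vectors via truncated SVD of a term-document count matrix; the toy vocabulary, counts, and dimensionality below are assumptions of this sketch:

```python
# Sketch of the LSA option: factor a term-document count matrix with
# truncated SVD so each row of U * S becomes that word's k-dimensional
# word vector.
import numpy as np

def lsa_term_vectors(term_doc_matrix, k=2):
    u, s, _vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return u[:, :k] * s[:k]  # one k-dimensional vector per word (row)

vocab = ["text", "retrieval", "vector", "music"]
counts = np.array([
    [2, 1, 0],   # "text" occurs twice in doc 1, once in doc 2
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 2],
], dtype=float)
vectors = dict(zip(vocab, lsa_term_vectors(counts)))
```

A word2vector model trained on a domain corpus would be a drop-in replacement here, as the method only requires a lookup from word to vector.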
The present invention further provides a computer program product which, when executed on a data processing device, is adapted to execute a program that initializes the following method steps:
segmenting a retrieval text to obtain a retrieval word set;
for each word in the retrieval word set, calculating a TextRank value of the word using the TextRank algorithm;
selecting a preset number of words as a keyword set according to the TextRank values of the words;
determining a word vector of each word in the keyword set;
obtaining a text word set corresponding to each of at least one text to be retrieved, and determining a word vector of each word in each of the text word sets;
calculating the similarity between the word vector of each word in the keyword set and the word vector of each word in each of the text word sets;
sorting and outputting the at least one text to be retrieved according to the similarities.
Optionally, obtaining the at least one text to be retrieved comprises: determining the at least one text to be retrieved based on the retrieval text using a text similarity algorithm.
Obtaining the text word set corresponding to each text to be retrieved comprises:
segmenting each text to be retrieved to obtain a plurality of words;
removing duplicate words and stop words from the plurality of words to obtain a candidate word set;
for each word in the candidate word set, calculating a TextRank value of the word using the TextRank algorithm;
determining the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
Optionally, sorting and outputting the at least one text to be retrieved according to the similarities comprises:
for any text to be retrieved, obtaining, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
sorting the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved;
sorting and outputting the at least one text to be retrieved according to the ranking scores of the texts.
Optionally, the text to be retrieved comprises: a title of the text to be retrieved and a body of the text to be retrieved.
Optionally, determining the word vector of a word comprises: determining the word vector of the word using a pre-trained word vector model, wherein the pre-trained word vector model comprises any of the following: a word2vector model, a latent semantic analysis (LSA) matrix decomposition model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
It will be understood by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of its claims.
Claims (10)
1. A text retrieval method, characterized in that the method comprises:
segmenting a retrieval text to obtain a retrieval word set;
for each word in the retrieval word set, calculating a TextRank value of the word using the TextRank algorithm;
selecting a preset number of words as a keyword set according to the TextRank values of the words;
determining a word vector of each word in the keyword set;
obtaining a text word set corresponding to each of at least one text to be retrieved, and determining a word vector of each word in each of the text word sets;
calculating the similarity between the word vector of each word in the keyword set and the word vector of each word in each of the text word sets;
sorting and outputting the at least one text to be retrieved according to the similarities.
2. The method according to claim 1, characterized in that:
obtaining the at least one text to be retrieved comprises: determining the at least one text to be retrieved based on the retrieval text using a text similarity algorithm; and
obtaining the text word set corresponding to each text to be retrieved comprises:
segmenting each text to be retrieved to obtain a plurality of words;
removing duplicate words and stop words from the plurality of words to obtain a candidate word set;
for each word in the candidate word set, calculating a TextRank value of the word using the TextRank algorithm;
determining the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
3. The method according to claim 1 or 2, characterized in that sorting and outputting the at least one text to be retrieved according to the similarities comprises:
for any text to be retrieved, obtaining, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
sorting the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved;
sorting and outputting the at least one text to be retrieved according to the ranking scores of the texts.
4. The method according to claim 3, characterized in that the text to be retrieved comprises: a title of the text to be retrieved and a body of the text to be retrieved.
5. The method according to claim 1, characterized in that determining the word vector of a word comprises:
determining the word vector of the word using a pre-trained word vector model;
wherein the pre-trained word vector model comprises any of the following: a word2vector model, a latent semantic analysis (LSA) matrix decomposition model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model.
6. A text retrieval device, characterized in that the device comprises:
a segmentation unit configured to segment a retrieval text to obtain a retrieval word set;
a TextRank value calculation unit configured to calculate, for each word in the retrieval word set, a TextRank value of the word using the TextRank algorithm;
a keyword set determination unit configured to select a preset number of words as a keyword set according to the TextRank values of the words;
a first word vector determination unit configured to determine a word vector of each word in the keyword set;
a text word set obtaining unit configured to obtain a text word set corresponding to each of at least one text to be retrieved;
a second word vector determination unit configured to determine a word vector of each word in each of the text word sets;
a similarity calculation unit configured to calculate the similarity between the word vector of each word in the keyword set and the word vector of each word in each of the text word sets;
a text sorting output unit configured to sort and output the at least one text to be retrieved according to the similarities.
7. The device according to claim 6, characterized in that the text word set obtaining unit comprises:
a text-to-be-retrieved determination subunit configured to determine the at least one text to be retrieved based on the retrieval text using a text similarity algorithm;
a segmentation subunit configured to segment each text to be retrieved to obtain a plurality of words;
a candidate word set determination subunit configured to remove duplicate words and stop words from the plurality of words to obtain a candidate word set;
a TextRank value calculation subunit configured to calculate, for each word in the candidate word set, a TextRank value of the word using the TextRank algorithm;
a text word set determination subunit configured to determine the text word set from the candidate word set according to the TextRank values of the words in the candidate word set.
8. The device according to claim 6 or 7, characterized in that the text sorting output unit comprises:
a maximum similarity determination subunit configured to obtain, for any text to be retrieved, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
a ranking score determination subunit configured to sort the maximum similarities corresponding to the words in the keyword set in descending order and take the maximum similarity at a predetermined position in the order as the ranking score of that text to be retrieved;
a text sorting output subunit configured to sort and output the at least one text to be retrieved according to the ranking scores of the texts.
9. A storage medium, characterized in that a program is stored thereon, wherein the program, when executed by a processor, implements the text retrieval method according to any one of claims 1 to 5.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the text retrieval method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711043608.2A CN110019668A (en) | 2017-10-31 | 2017-10-31 | A kind of text searching method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110019668A true CN110019668A (en) | 2019-07-16 |
Family
ID=67186714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711043608.2A Pending CN110019668A (en) | 2017-10-31 | 2017-10-31 | A kind of text searching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110019668A (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866144A (en) * | 2019-11-06 | 2020-03-06 | 腾讯音乐娱乐科技(深圳)有限公司 | Song retrieval method and device |
CN110968702A (en) * | 2019-11-29 | 2020-04-07 | 北京明略软件系统有限公司 | Method and device for extracting matter relationship |
CN111061869A (en) * | 2019-11-13 | 2020-04-24 | 北京数字联盟网络科技有限公司 | Application preference text classification method based on TextRank |
CN111159461A (en) * | 2019-12-30 | 2020-05-15 | 秒针信息技术有限公司 | Audio file determination method and device, storage medium and electronic device |
CN111274808A (en) * | 2020-02-11 | 2020-06-12 | 支付宝(杭州)信息技术有限公司 | Text retrieval method, model training method, text retrieval device, and storage medium |
CN111538830A (en) * | 2020-04-28 | 2020-08-14 | 清华大学 | French retrieval method, French retrieval device, computer equipment and storage medium |
CN111666461A (en) * | 2020-04-24 | 2020-09-15 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer storage medium for retrieving geographical location |
CN112257436A (en) * | 2020-09-29 | 2021-01-22 | 华为技术有限公司 | Text detection method and device |
CN112732870A (en) * | 2020-12-31 | 2021-04-30 | 平安科技(深圳)有限公司 | Searching method, device and equipment based on word vector and storage medium |
CN112988971A (en) * | 2021-03-15 | 2021-06-18 | 平安科技(深圳)有限公司 | Word vector-based search method, terminal, server and storage medium |
CN115203379A (en) * | 2022-09-15 | 2022-10-18 | 太平金融科技服务(上海)有限公司深圳分公司 | Retrieval method, retrieval apparatus, computer device, storage medium, and program product |
CN116186203A (en) * | 2023-03-01 | 2023-05-30 | 人民网股份有限公司 | Text retrieval method, text retrieval device, computing equipment and computer storage medium |
CN114036946B (en) * | 2021-11-26 | 2023-07-07 | 浪潮卓数大数据产业发展有限公司 | Text feature extraction and auxiliary retrieval system and method |
US11836174B2 (en) | 2020-04-24 | 2023-12-05 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus of establishing similarity model for retrieving geographic location |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120330978A1 (en) * | 2008-06-24 | 2012-12-27 | Microsoft Corporation | Consistent phrase relevance measures |
CN103886063A (en) * | 2014-03-18 | 2014-06-25 | 国家电网公司 | Text retrieval method and device |
US20160070803A1 (en) * | 2014-09-09 | 2016-03-10 | Funky Flick, Inc. | Conceptual product recommendation |
CN105653671A (en) * | 2015-12-29 | 2016-06-08 | 畅捷通信息技术股份有限公司 | Similar information recommendation method and system |
CN106021223A (en) * | 2016-05-09 | 2016-10-12 | Tcl集团股份有限公司 | Sentence similarity calculation method and system |
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
CN106991092A (en) * | 2016-01-20 | 2017-07-28 | 阿里巴巴集团控股有限公司 | The method and apparatus that similar judgement document is excavated based on big data |
CN107066621A (en) * | 2017-05-11 | 2017-08-18 | 腾讯科技(深圳)有限公司 | A kind of search method of similar video, device and storage medium |
CN107153689A (en) * | 2017-04-29 | 2017-09-12 | 安徽富驰信息技术有限公司 | A kind of case search method based on Topic Similarity |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
Non-Patent Citations (1)
Title |
---|
Yang Liping (杨丽萍), "Research and Implementation of a Natural-Language-Oriented Legal Retrieval System", China Excellent Master's Theses Full-text Database, Information Science and Technology Series *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110019668A (en) | A kind of text searching method and device | |
CN108241621B (en) | legal knowledge retrieval method and device | |
US9542477B2 (en) | Method of automated discovery of topics relatedness | |
US10268758B2 (en) | Method and system of acquiring semantic information, keyword expansion and keyword search thereof | |
TWI710917B (en) | Data processing method and device | |
CN110019669B (en) | Text retrieval method and device | |
CN109582948B (en) | Method and device for extracting evaluation viewpoints | |
CN110019670A (en) | A kind of text searching method and device | |
CN106610931B (en) | Topic name extraction method and device | |
CN112328544B (en) | Multidisciplinary simulation data classification method, device and storage medium | |
CN112329460A (en) | Text topic clustering method, device, equipment and storage medium | |
CN109597983A (en) | A kind of spelling error correction method and device | |
CN110019785A (en) | A kind of file classification method and device | |
CN106598997B (en) | Method and device for calculating text theme attribution degree | |
CN109597982A (en) | Summary texts recognition methods and device | |
CN109117434A (en) | Judgement document's search method, device, storage medium and processor | |
CN110019665A (en) | Text searching method and device | |
CN105786929B (en) | A kind of information monitoring method and device | |
CN115563268A (en) | Text abstract generation method and device, electronic equipment and storage medium | |
CN112487181A (en) | Keyword determination method and related equipment | |
Zimniewicz et al. | Scheduling aspects in keyword extraction problem | |
CN110019295A (en) | Database index method, device, system and storage medium | |
CN110895703A (en) | Legal document routing identification method and device | |
CN115391656A (en) | User demand determination method, device and equipment | |
CN112613320A (en) | Method and device for acquiring similar sentences, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190716 |