CN110019668A - A kind of text searching method and device - Google Patents

A kind of text searching method and device


Publication number
CN110019668A
Authority
CN
China
Prior art keywords
text
word
retrieved
words
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711043608.2A
Other languages
Chinese (zh)
Inventor
戴威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201711043608.2A
Publication of CN110019668A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a text retrieval method and device. The method includes: segmenting a retrieval text into words to obtain a retrieval word set; computing a TextRank value for each word in the retrieval word set using the TextRank algorithm; selecting a preset number of words as a keyword set according to the TextRank value of each word; determining the word vector of each word in the keyword set; obtaining the text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in each such text word set; computing the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set of each text to be retrieved; and sorting the at least one text to be retrieved according to the similarity and outputting the result. The invention improves the accuracy of retrieval results.

Description

A text retrieval method and device
Technical field
The present invention relates to the technical field of text retrieval, and in particular to a text retrieval method and device.
Background
Similar-case recommendation for legal documents means taking one legal document as input and using some algorithm to obtain a series of other documents similar to it, so that historical documents (also called historical cases) relevant to the currently input legal document can be found quickly.
However, commonly used algorithms generally rely on screening rules, such as an identical cause of action or the same applicable legal provisions, to retrieve other documents similar to the input legal document. Retrieval results obtained in this way are often inaccurate.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a text retrieval method and device that overcome the above problems or at least partially solve them. The technical solution is as follows:
A text retrieval method, the method comprising:
segmenting a retrieval text into words to obtain a retrieval word set;
for each word in the retrieval word set, computing the TextRank value of the word using the TextRank algorithm;
selecting a preset number of words as a keyword set according to the TextRank value of each word;
determining the word vector of each word in the keyword set;
obtaining the text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
computing the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
sorting the at least one text to be retrieved according to the similarity and outputting the result.
Optionally, obtaining the at least one text to be retrieved includes: determining at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
obtaining the text word set corresponding to each of the at least one text to be retrieved includes:
segmenting each text to be retrieved into words to obtain multiple words;
removing duplicate words and stop words from the multiple words to obtain a candidate word set;
for each word in the candidate word set, computing the TextRank value of the word using the TextRank algorithm;
determining the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
Optionally, sorting the at least one text to be retrieved according to the similarity and outputting the result includes:
for any one text to be retrieved, obtaining, from the computed similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
arranging the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a preset position in that order as the ranking score of the text to be retrieved;
sorting and outputting the at least one text to be retrieved according to the ranking score of each text to be retrieved.
Optionally, the text to be retrieved includes: the title of the text to be retrieved and the body of the text to be retrieved.
Optionally, determining the word vector of a word includes:
determining the word vector of the word using a pre-trained word vector model;
wherein the pre-trained word vector model includes any one of the following: a word2vector model, a latent semantic analysis (LSA) matrix decomposition model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model.
A text retrieval device, the device comprising:
a word segmentation unit, configured to segment a retrieval text into words to obtain a retrieval word set;
a TextRank value calculation unit, configured to compute, for each word in the retrieval word set, the TextRank value of the word using the TextRank algorithm;
a keyword set determination unit, configured to select a preset number of words as a keyword set according to the TextRank value of each word;
a first word vector determination unit, configured to determine the word vector of each word in the keyword set;
a text word set acquisition unit, configured to obtain the text word set corresponding to each of at least one text to be retrieved;
a second word vector determination unit, configured to determine the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
a similarity calculation unit, configured to compute the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
a text sorting output unit, configured to sort the at least one text to be retrieved according to the similarity and output the result.
Optionally, the text word set acquisition unit includes:
a to-be-retrieved text determination subunit, configured to determine at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
a word segmentation subunit, configured to segment each text to be retrieved into words to obtain multiple words;
a candidate word set determination subunit, configured to remove duplicate words and stop words from the multiple words to obtain a candidate word set;
a TextRank value calculation subunit, configured to compute, for each word in the candidate word set, the TextRank value of the word using the TextRank algorithm;
a text word set determination subunit, configured to determine the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
Optionally, the text sorting output unit includes:
a maximum similarity determination subunit, configured to obtain, for any one text to be retrieved, the maximum similarity corresponding to each word in the keyword set from the computed similarities between each word in the keyword set and each word in the text word set of that text;
a ranking score determination subunit, configured to arrange the maximum similarities corresponding to the words in the keyword set in descending order and take the maximum similarity at a preset position in that order as the ranking score of the text to be retrieved;
a text sorting output subunit, configured to sort and output the at least one text to be retrieved according to the ranking score of each text to be retrieved.
A storage medium on which a program is stored, wherein the program, when executed by a processor, implements the text retrieval method described above.
A processor configured to run a program, wherein the program, when run, executes the text retrieval method described above.
With the above technical solution, the text retrieval method and device provided by the present invention segment a retrieval text into words to obtain a retrieval word set; compute the TextRank value of each word in the retrieval word set using the TextRank algorithm; select a preset number of words as a keyword set according to the TextRank value of each word; determine the word vector of each word in the keyword set; obtain the text word set corresponding to each of at least one text to be retrieved and determine the word vector of each word in each such text word set; compute the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set of each text to be retrieved; and sort the at least one text to be retrieved according to the similarity and output the result.
The present invention computes the TextRank value of each word in the retrieval word set using the TextRank algorithm and selects a preset number of words as the keyword set according to those values. The resulting keyword set expresses the core content of the retrieval text relatively accurately and excludes the interference of some high-frequency but irrelevant words, which to a certain extent ensures the accuracy of the texts to be retrieved. In addition, the present application uses word vectors to represent the relationships between words and sorts the texts to be retrieved according to the words themselves and the relationships between them, which further improves the accuracy of the retrieval results.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present invention are easier to understand, specific embodiments of the present invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a text retrieval method provided by an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of a text retrieval device provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope fully conveyed to those skilled in the art.
As shown in Fig. 1, a text retrieval method provided by an embodiment of the present invention may include:
Step 101: segment the retrieval text into words to obtain a retrieval word set.
Specifically, the present invention may segment the retrieval text input by the user using at least one of: a segmentation method based on dictionary matching, a segmentation method based on word frequency statistics, a segmentation method based on knowledge understanding, and the LTP (Language Technology Platform) segmentation tool of Harbin Institute of Technology.
Optionally, after segmenting the retrieval text input by the user, the present invention may also deduplicate all the words obtained, yielding a deduplicated retrieval word set. For example, if the word "robbery" appears N times among the segmented words, the present invention may delete N-1 occurrences so that "robbery" appears only once in the retrieval word set, where N is a positive integer greater than 1.
For ease of description, the retrieval word set obtained by segmenting the retrieval text input by the user is denoted A.
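The optional deduplication in step 101 can be sketched as follows. This is a minimal illustration that assumes segmentation has already produced a list of words; the function name and the sample words are hypothetical and do not come from the patent.

```python
def dedupe_preserving_order(words):
    """Return each word once, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(words))


# Hypothetical output of a word segmenter for a short retrieval text.
segmented = ["robbery", "defendant", "robbery", "knife", "defendant"]
retrieval_word_set = dedupe_preserving_order(segmented)
```

Keeping first-appearance order is a design choice of this sketch; the patent only requires that each word occur once in the set.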
Step 102: for each word in the retrieval word set, compute the TextRank value of the word using the TextRank algorithm.
After obtaining the retrieval word set A, the present invention uses the TextRank algorithm to compute the TextRank value of each word in A.
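The patent does not spell out how the TextRank graph is built. The sketch below assumes one common formulation: an undirected, unweighted co-occurrence graph over the segmented word sequence, scored by the PageRank-style iteration. The window size, damping factor, iteration count, and sample tokens are all assumptions made for illustration.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Score words with an undirected, unweighted co-occurrence TextRank."""
    neighbors = defaultdict(set)
    # Link each word to the words that follow it within the window.
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    vocab = list(dict.fromkeys(tokens))
    scores = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        new_scores = {}
        for w in vocab:
            # Each neighbor shares its score equally among its own neighbors.
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


# Hypothetical segmented retrieval text.
tokens = ["defendant", "robbery", "knife", "defendant",
          "robbery", "victim", "robbery", "court"]
scores = textrank(tokens)
top_word = max(scores, key=scores.get)
```

In this sample, "robbery" co-occurs with every other word, so it receives the highest score.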
Step 103: select a preset number of words as the keyword set according to the TextRank value of each word.
Specifically, the present invention may select the preset number of words in descending order of TextRank value as the keyword set. The preset number is, for example, 8, 10, or 12; the invention is not limited in this regard.
More specifically, taking a preset number of 10 as an example, the present invention may select the first 10 words in descending order of TextRank value to form the keyword set. For ease of description, the keyword set is denoted K.
Optionally, the present invention may first remove stop words and words whose part of speech is judged to be a conjunction, a preposition, or the like, and then select the first 10 words in descending order of TextRank value to form the keyword set K.
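Step 103 can be sketched as a sort followed by a filter and a cut-off. The sketch assumes the TextRank values are available as a dictionary; the stop-word list and the scores are made up, and the part-of-speech filter is omitted because it would need a tagger that the patent does not name.

```python
def select_keywords(scores, preset_quantity=10, stop_words=frozenset()):
    """Pick the top-scoring words after dropping stop words."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [word for word, _ in ranked if word not in stop_words]
    return kept[:preset_quantity]


# Hypothetical TextRank values.
example_scores = {"robbery": 1.4, "the": 1.2, "defendant": 1.1,
                  "and": 0.9, "knife": 0.7}
keyword_set = select_keywords(example_scores, preset_quantity=2,
                              stop_words={"the", "and"})
```

Filtering before truncating matters: truncating first would let stop words consume slots in the keyword set.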
Step 104: determine the word vector of each word in the keyword set.
Specifically, the present invention uses a pre-trained word vector model to determine the word vector of each word in the keyword set K. The pre-trained word vector model may include any one of the following: a word2vector model, an LSA (Latent Semantic Analysis) matrix decomposition model, a PLSA (Probabilistic Latent Semantic Analysis) model, and an LDA (Latent Dirichlet Allocation) model, the last of which is commonly called a document topic generation model.
In practical applications, the word vector model may be trained in advance on a certain number of texts. For example, a word2vector model may be trained on judgment documents numbering on the order of 100,000, and the trained model then provides the word vector of each word in the retrieval word set. The word vector of each word can express the relationships (such as similarity) between words, and the dimension of the word vectors may lie within a preset range, such as 50 to 300 dimensions, with the specific value determined by the application.
Step 105: obtain the text word set corresponding to each of at least one text to be retrieved, and determine the word vector of each word in each such text word set.
Specifically, step 105 may be implemented by the following steps 1051 to 1056:
Step 1051: determine at least one text to be retrieved based on the retrieval text, using a text similarity algorithm.
For example, the retrieval text input by the user is used as the input to a search engine (such as Elasticsearch), the similarity between long texts is computed based on TF-IDF and similar features, and texts whose similarity meets a preset threshold are determined to be the texts to be retrieved.
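The preselection in step 1051 relies on a search engine in the patent. As a stand-in that shows the idea without one, the sketch below computes smoothed TF-IDF vectors in plain Python and keeps the library texts whose cosine similarity to the retrieval text reaches a threshold. The IDF smoothing, the threshold value, and the sample documents are assumptions; this is not how Elasticsearch scores documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (word -> weight) for tokenized docs."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def preselect(query, library, threshold=0.1):
    """Return indices of library texts similar enough to the query."""
    vectors = tfidf_vectors([query] + library)
    query_vec = vectors[0]
    return [i for i, vec in enumerate(vectors[1:])
            if cosine_sparse(query_vec, vec) >= threshold]


# Hypothetical, already segmented texts.
query = ["robbery", "knife", "defendant"]
library = [["robbery", "knife", "victim"],
           ["contract", "loan", "interest"],
           ["defendant", "robbery", "sentence"]]
selected = preselect(query, library)
```

The second library text shares no word with the query, so its similarity is zero and it is filtered out.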
Of course, the present invention may also directly take all judgment documents in a judgment document library as the texts to be retrieved, without performing step 1051.
The number of texts to be retrieved in the embodiment of the present invention may be no less than a certain quantity, such as 100,000. The texts to be retrieved are preferably judgment documents. Optionally, a text to be retrieved may include a title and a body. It will be understood that the words in the title are particularly important for a text to be retrieved; the present invention therefore treats the title and the body together as the text to be retrieved, so that the text word set is determined from both, which is more comprehensive and accurate.
Step 1052: segment each text to be retrieved into words to obtain multiple words.
Specifically, the present invention may segment each text to be retrieved using at least one of: a segmentation method based on dictionary matching, a segmentation method based on word frequency statistics, a segmentation method based on knowledge understanding, and the LTP segmentation tool of Harbin Institute of Technology.
Step 1053: remove duplicate words and stop words from the multiple words to obtain a candidate word set.
Specifically, removing duplicate words from the multiple words is the deduplication process.
Stop words are words or characters that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency. Stop words can be divided into two classes. One class consists of function words in human language, which are extremely common; for example, the word "net" appears on almost every website, so a search engine cannot guarantee truly relevant results for such a word, which does little to narrow the search and also reduces search efficiency. The other class consists of words without a clear meaning, such as modal particles, adverbs, prepositions, and conjunctions.
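Step 1053 can be sketched in a few lines. The stop-word list below is a made-up English sample; a real system would load a curated list for the language of the documents.

```python
def candidate_word_set(words, stop_words):
    """Drop duplicates (keeping first occurrences) and stop words."""
    return [w for w in dict.fromkeys(words) if w not in stop_words]


# Hypothetical segmented text and stop-word list.
words = ["the", "defendant", "and", "the", "victim", "defendant"]
candidates = candidate_word_set(words, stop_words={"the", "and"})
```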
Step 1054: for each word in the candidate word set, compute the TextRank value of the word using the TextRank algorithm.
Step 1054 is implemented in the same way as step 102 above.
Step 1055: determine the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
After computing the TextRank value of each word in the candidate word set, the present invention may sort the words in the candidate word set by TextRank value and, for example, determine the top N words as the text word set. When the retrieval text is itself a long text such as a judgment document, the keyword set of the retrieval text and the text word set of a text to be retrieved may optionally be obtained in the same way.
Step 1056: determine the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved.
After the text word set is determined, a pre-trained word vector model is used to determine the word vector of each word in the text word set.
It should be noted that the word vector model used in step 1056 is the same as the word vector model used in step 104.
Step 106: compute the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved.
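The patent does not name the similarity measure used in step 106. The sketch below assumes cosine similarity, a common choice for word vectors, and builds the full keyword-by-text-word table for one text to be retrieved. The two-dimensional vectors are toy values chosen for readability, not the output of a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_table(keyword_vectors, text_vectors):
    """Map each keyword to its similarity with every text word."""
    return {
        keyword: {word: cosine(kv, tv) for word, tv in text_vectors.items()}
        for keyword, kv in keyword_vectors.items()
    }


# Toy vectors; real ones would come from the shared word vector model.
keyword_vectors = {"robbery": [1.0, 0.0]}
text_vectors = {"theft": [1.0, 0.0], "loan": [0.0, 1.0]}
table = similarity_table(keyword_vectors, text_vectors)
```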
Step 107: sort the at least one text to be retrieved according to the similarity and output the result.
Specifically, step 107 may be implemented by the following steps 1071 to 1073:
Step 1071: for any one text to be retrieved, obtain, from the computed similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set.
Step 1072: arrange the maximum similarities corresponding to the words in the keyword set in descending order, and take the maximum similarity at a preset position in that order as the ranking score of the text to be retrieved.
In the present invention, the preset position is, for example, the M-th position, the middle position, or the last position in the descending order of maximum similarities, where M is a positive integer. The middle position is preferred, but the invention is not limited in this regard.
Step 1073: sort and output the at least one text to be retrieved according to the ranking score of each text to be retrieved.
As an example, assume that the keyword set K includes three words, A1, A2, and A3, that the text word set determined from a certain text to be retrieved includes three words, B1, B2, and B3, and that the similarities between each word in K and each word in that text word set are as follows:
The similarity between A1 and B1 is 23%;
The similarity between A1 and B2 is 50%;
The similarity between A1 and B3 is 61%;
The similarity between A2 and B1 is 15%;
The similarity between A2 and B2 is 76%;
The similarity between A2 and B3 is 95%;
The similarity between A3 and B1 is 100%;
The similarity between A3 and B2 is 2%;
The similarity between A3 and B3 is 30%.
It can then be determined that the maximum of the three similarities for A1 is 61%, that is, A1 is most similar to B3 in the text word set. Likewise, the maximum for A2 is 95% (A2 is most similar to B3), and the maximum for A3 is 100% (A3 is most similar to B1). For the keyword set K containing A1, A2, and A3, the three maxima in descending order are therefore 100%, 95%, and 61%. The maximum similarity at the 2nd, that is, the middle position, 95%, may be taken as the ranking score of the text to be retrieved; alternatively, the maximum similarity at the last position, 61%, may be taken as the ranking score.
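The worked example above can be reproduced with a short sketch of steps 1071 to 1073. The function names are hypothetical, and treating the middle of an even-length list as index len // 2 is an assumption, since the patent does not define it.

```python
def ranking_score(similarities, position="middle"):
    """Pick the per-keyword maximum at a preset position (descending order)."""
    maxima = sorted((max(row.values()) for row in similarities.values()),
                    reverse=True)
    if position == "middle":
        index = len(maxima) // 2      # assumption for even-length lists
    elif position == "last":
        index = len(maxima) - 1
    else:
        index = position - 1          # 1-based position M
    return maxima[index]


def rank_texts(similarities_by_text, position="middle"):
    """Order text identifiers by ranking score, highest first."""
    scores = {text_id: ranking_score(sims, position)
              for text_id, sims in similarities_by_text.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Similarities from the example, written as fractions.
sims = {
    "A1": {"B1": 0.23, "B2": 0.50, "B3": 0.61},
    "A2": {"B1": 0.15, "B2": 0.76, "B3": 0.95},
    "A3": {"B1": 1.00, "B2": 0.02, "B3": 0.30},
}
middle_score = ranking_score(sims)          # 0.95
last_score = ranking_score(sims, "last")    # 0.61
```

Using a middle or later position rather than the single best match means that one accidental perfect match, such as A3 and B1 here, cannot lift a text to the top on its own.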
It will be understood that a maximum similarity indicates that a word in the keyword set K is highly related to some word in the text to be retrieved. Taking the maximum similarity at the 5th, 7th, middle, or last position in descending order as the ranking score of the text to be retrieved allows each word in K to be reflected in the similarity to the text word set determined from the text to be retrieved, which safeguards retrieval accuracy.
Therefore, the text retrieval method provided by the present invention computes the TextRank value of each word in the retrieval word set using the TextRank algorithm and selects a preset number of words as the keyword set according to those values. The resulting keyword set expresses the core content of the retrieval text relatively accurately and excludes the interference of some high-frequency but irrelevant words, which to a certain extent ensures the accuracy of the texts to be retrieved. In addition, the present application uses word vectors to represent the relationships between words and sorts the texts to be retrieved according to the words themselves and the relationships between them, which further improves the accuracy of the retrieval results.
Corresponding to the above method embodiment, the present invention also provides a text retrieval device.
As shown in Fig. 2, a text retrieval device provided by an embodiment of the present invention may include: a word segmentation unit 10, a TextRank value calculation unit 20, a keyword set determination unit 30, a first word vector determination unit 40, a text word set acquisition unit 50, a second word vector determination unit 60, a similarity calculation unit 70, and a text sorting output unit 80, wherein:
the word segmentation unit 10 is configured to segment a retrieval text into words to obtain a retrieval word set;
the TextRank value calculation unit 20 is configured to compute, for each word in the retrieval word set, the TextRank value of the word using the TextRank algorithm;
the keyword set determination unit 30 is configured to select a preset number of words as a keyword set according to the TextRank value of each word;
the first word vector determination unit 40 is configured to determine the word vector of each word in the keyword set;
the text word set acquisition unit 50 is configured to obtain the text word set corresponding to each of at least one text to be retrieved;
the second word vector determination unit 60 is configured to determine the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
the similarity calculation unit 70 is configured to compute the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
the text sorting output unit 80 is configured to sort the at least one text to be retrieved according to the similarity and output the result.
Optionally, the text word set acquisition unit includes:
a to-be-retrieved text determination subunit, configured to determine at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
a word segmentation subunit, configured to segment each text to be retrieved into words to obtain multiple words;
a candidate word set determination subunit, configured to remove duplicate words and stop words from the multiple words to obtain a candidate word set;
a TextRank value calculation subunit, configured to compute, for each word in the candidate word set, the TextRank value of the word using the TextRank algorithm;
a text word set determination subunit, configured to determine the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
Optionally, the text sorting output unit includes:
a maximum similarity determination subunit, configured to obtain, for any one text to be retrieved, the maximum similarity corresponding to each word in the keyword set from the computed similarities between each word in the keyword set and each word in the text word set of that text;
a ranking score determination subunit, configured to arrange the maximum similarities corresponding to the words in the keyword set in descending order and take the maximum similarity at a preset position in that order as the ranking score of the text to be retrieved;
a text sorting output subunit, configured to sort and output the at least one text to be retrieved according to the ranking score of each text to be retrieved.
The text retrieval device includes a processor and a memory. The word segmentation unit 10, TextRank value calculation unit 20, keyword set determination unit 30, first word vector determination unit 40, text word set acquisition unit 50, second word vector determination unit 60, similarity calculation unit 70, text sorting output unit 80, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and text retrieval is carried out by adjusting kernel parameters.
The memory may take forms such as volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, wherein the program, when executed by a processor, implements the text retrieval method.
An embodiment of the present invention provides a processor configured to run a program, wherein the program, when run, executes the text retrieval method.
The embodiment of the invention provides a kind of equipment, equipment include processor, memory and storage on a memory and can The program run on a processor, processor perform the steps of when executing program
Segmenting a retrieval text into words to obtain a retrieval word set;
For each word in the retrieval word set, calculating the TextRank value of the word using the TextRank algorithm;
Selecting a preset number of words as a keyword set according to the TextRank value of each word;
Determining the word vector of each word in the keyword set;
Obtaining the text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
Calculating the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
Ranking and outputting the at least one text to be retrieved according to the similarity.
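As an illustration of the keyword selection steps, the following is a minimal Python sketch, not the claimed implementation. It assumes the retrieval text has already been segmented, links words that co-occur within a small window by undirected, unweighted edges, and uses a damping factor of 0.85; the window size, the damping factor, the iteration count, and the function names `textrank_scores` and `select_keywords` are all assumptions introduced here.

```python
from collections import defaultdict


def textrank_scores(words, window=2, damping=0.85, iterations=50):
    """Score each distinct word with a TextRank-style iteration.

    Words that co-occur within `window` positions are linked by an
    undirected, unweighted edge (an assumption; the graph construction
    is not specified in the text above).
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    vocab = sorted(set(words))
    scores = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        updated = {}
        for w in vocab:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            updated[w] = (1.0 - damping) + damping * incoming
        scores = updated
    return scores


def select_keywords(words, preset_quantity):
    """Return the `preset_quantity` words with the highest TextRank value."""
    scores = textrank_scores(words)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:preset_quantity]


# Already-segmented retrieval text (segmentation itself is out of scope here).
retrieval_words = ["contract", "dispute", "contract", "breach", "damages", "contract"]
keywords = select_keywords(retrieval_words, preset_quantity=2)
print(keywords)
```

In this toy input, "contract" and "breach" each have three distinct neighbors while the other words have two, so they receive the highest values and form the keyword set.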
Optionally, obtaining at least one text to be retrieved includes: determining at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
Obtaining the text word set corresponding to each of the at least one text to be retrieved includes:
Segmenting each text to be retrieved into words to obtain a plurality of words;
Removing repeated words and stop words from the plurality of words to obtain a candidate word set;
For each word in the candidate word set, calculating the TextRank value of the word using the TextRank algorithm;
Determining the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
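The construction of the text word set can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the stop word list, the TextRank values, and the rule of keeping the `top_n` highest-valued candidates are hypothetical, since the text only states that the text word set is determined according to the TextRank values.

```python
def build_candidate_word_set(words, stop_words):
    """Remove repeated words and stop words, keeping first-seen order."""
    seen = set()
    candidates = []
    for word in words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        candidates.append(word)
    return candidates


def select_text_word_set(candidates, textrank_values, top_n):
    """Keep the `top_n` candidates with the highest TextRank value (assumed rule)."""
    ranked = sorted(candidates, key=lambda w: -textrank_values.get(w, 0.0))
    return ranked[:top_n]


segmented = ["the", "court", "held", "the", "contract", "void", "contract"]
stop_words = {"the", "a", "of"}  # hypothetical stop word list
candidates = build_candidate_word_set(segmented, stop_words)

# Hypothetical TextRank values, as would be computed on the candidate set.
values = {"court": 0.9, "held": 0.4, "contract": 1.3, "void": 0.7}
text_word_set = select_text_word_set(candidates, values, top_n=3)
print(candidates)
print(text_word_set)
```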
Optionally, ranking and outputting the at least one text to be retrieved according to the similarity includes:
For any text to be retrieved, obtaining, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
Sorting the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a preset rank position as the ranking score of the text to be retrieved;
Ranking and outputting the at least one text to be retrieved according to the ranking score of each text to be retrieved.
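The ranking score described above can be sketched in Python as follows. This is a minimal illustration, assuming cosine similarity between word vectors and a 1-based rank position; the two-dimensional vectors and the names `cosine` and `ranking_score` are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranking_score(keyword_vectors, text_word_vectors, rank_position):
    """Maximum similarity per keyword, sorted descending, read at a 1-based rank."""
    best = [max(cosine(k, t) for t in text_word_vectors) for k in keyword_vectors]
    best.sort(reverse=True)
    return best[rank_position - 1]


# Toy word vectors: two keywords and two texts to be retrieved.
keyword_vectors = [[1.0, 0.0], [0.0, 1.0]]
texts = {
    "A": [[1.0, 0.0], [1.0, 1.0]],
    "B": [[1.0, 0.0]],
}

scores = {
    name: ranking_score(keyword_vectors, vectors, rank_position=2)
    for name, vectors in texts.items()
}
# Output order: highest ranking score first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Reading the score at a rank position other than the first means a text must match several keywords reasonably well to score highly; here text B matches only one of the two keywords, so its score at position 2 is 0.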
Optionally, the text to be retrieved includes: a title of the text to be retrieved and a body of the text to be retrieved.
Optionally, determining the word vector of a word includes:
Determining the word vector of the word using a pre-trained word vector model;
Wherein the pre-trained word vector model includes any one of the following: a word2vector model, a latent semantic analysis (LSA) matrix factorization model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model.
The device herein may be a server, PC, PAD, mobile phone, etc.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
Segmenting a retrieval text into words to obtain a retrieval word set;
For each word in the retrieval word set, calculating the TextRank value of the word using the TextRank algorithm;
Selecting a preset number of words as a keyword set according to the TextRank value of each word;
Determining the word vector of each word in the keyword set;
Obtaining the text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
Calculating the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
Ranking and outputting the at least one text to be retrieved according to the similarity.
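The similarity step can be written compactly as a matrix of pairwise similarities. The formula below is a sketch that assumes cosine similarity, which the steps above do not mandate; the symbols are introduced here for illustration. Let $\mathbf{q}_1,\dots,\mathbf{q}_n$ be the word vectors of the keyword set and $\mathbf{w}^{(d)}_1,\dots,\mathbf{w}^{(d)}_{m_d}$ the word vectors of the text word set of a text to be retrieved $d$:

```latex
s^{(d)}_{ij}
  = \cos\!\left(\mathbf{q}_i, \mathbf{w}^{(d)}_j\right)
  = \frac{\mathbf{q}_i \cdot \mathbf{w}^{(d)}_j}
         {\lVert \mathbf{q}_i \rVert \, \lVert \mathbf{w}^{(d)}_j \rVert},
\qquad i = 1,\dots,n, \quad j = 1,\dots,m_d .
```

Each text to be retrieved therefore yields an $n \times m_d$ matrix, one row per keyword and one column per word of its text word set.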
Optionally, obtaining at least one text to be retrieved includes: determining at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
Obtaining the text word set corresponding to each of the at least one text to be retrieved includes:
Segmenting each text to be retrieved into words to obtain a plurality of words;
Removing repeated words and stop words from the plurality of words to obtain a candidate word set;
For each word in the candidate word set, calculating the TextRank value of the word using the TextRank algorithm;
Determining the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
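For reference, the TextRank value of a word is commonly computed with the following recurrence, given here as background on the general algorithm rather than as a detail stated in this application. Each word is a vertex $V_i$ of a co-occurrence graph, $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$, and $d$ is a damping factor (0.85 is a customary choice, assumed here):

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, WS(V_j)
```

For an undirected co-occurrence graph, $\mathrm{In}(V_i)$ and $\mathrm{Out}(V_i)$ are both the set of neighbors of $V_i$, and the values are iterated from an arbitrary starting point until they stop changing appreciably.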
Optionally, ranking and outputting the at least one text to be retrieved according to the similarity includes:
For any text to be retrieved, obtaining, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
Sorting the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a preset rank position as the ranking score of the text to be retrieved;
Ranking and outputting the at least one text to be retrieved according to the ranking score of each text to be retrieved.
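In symbols, the ranking score is an order statistic of the per-keyword maxima. The notation below is introduced here for illustration: $\mathrm{sim}(\cdot,\cdot)$ is the word vector similarity, $\mathbf{q}_i$ is the vector of the $i$-th keyword, $\mathbf{w}^{(d)}_j$ is the vector of the $j$-th word in the text word set of text $d$, and $k$ is the preset rank position.

```latex
m^{(d)}_i = \max_{j} \, \mathrm{sim}\!\left(\mathbf{q}_i, \mathbf{w}^{(d)}_j\right),
\qquad
m^{(d)}_{(1)} \ge m^{(d)}_{(2)} \ge \dots \ge m^{(d)}_{(n)},
\qquad
\mathrm{score}(d) = m^{(d)}_{(k)} .
```

As a made-up worked example, if three keywords have maxima $0.91$, $0.40$, and $0.75$ against some text, the descending order is $0.91, 0.75, 0.40$, and with $k = 2$ the ranking score of that text is $0.75$.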
Optionally, the text to be retrieved includes: a title of the text to be retrieved and a body of the text to be retrieved.
Optionally, determining the word vector of a word includes:
Determining the word vector of the word using a pre-trained word vector model;
Wherein the pre-trained word vector model includes any one of the following: a word2vector model, a latent semantic analysis (LSA) matrix factorization model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as non-permanent (volatile) memory, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A text retrieval method, characterized in that the method comprises:
Segmenting a retrieval text into words to obtain a retrieval word set;
For each word in the retrieval word set, calculating the TextRank value of the word using the TextRank algorithm;
Selecting a preset number of words as a keyword set according to the TextRank value of each word;
Determining the word vector of each word in the keyword set;
Obtaining the text word set corresponding to each of at least one text to be retrieved, and determining the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
Calculating the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
Ranking and outputting the at least one text to be retrieved according to the similarity.
2. The method according to claim 1, characterized in that:
Obtaining at least one text to be retrieved comprises: determining at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
Obtaining the text word set corresponding to each of the at least one text to be retrieved comprises:
Segmenting each text to be retrieved into words to obtain a plurality of words;
Removing repeated words and stop words from the plurality of words to obtain a candidate word set;
For each word in the candidate word set, calculating the TextRank value of the word using the TextRank algorithm;
Determining the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
3. The method according to claim 1 or 2, characterized in that ranking and outputting the at least one text to be retrieved according to the similarity comprises:
For any text to be retrieved, obtaining, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
Sorting the maximum similarities corresponding to the words in the keyword set in descending order, and taking the maximum similarity at a preset rank position as the ranking score of the text to be retrieved;
Ranking and outputting the at least one text to be retrieved according to the ranking score of each text to be retrieved.
4. The method according to claim 3, characterized in that the text to be retrieved comprises: a title of the text to be retrieved and a body of the text to be retrieved.
5. The method according to claim 1, characterized in that determining the word vector of a word comprises:
Determining the word vector of the word using a pre-trained word vector model;
Wherein the pre-trained word vector model includes any one of the following: a word2vector model, a latent semantic analysis (LSA) matrix factorization model, a probabilistic latent semantic analysis (PLSA) model, and a latent Dirichlet allocation (LDA) model.
6. A text retrieval device, characterized in that the device comprises:
A word segmentation unit, configured to segment a retrieval text into words to obtain a retrieval word set;
A TextRank value calculation unit, configured to calculate, for each word in the retrieval word set, the TextRank value of the word using the TextRank algorithm;
A keyword set determination unit, configured to select a preset number of words as a keyword set according to the TextRank value of each word;
A first word vector determination unit, configured to determine the word vector of each word in the keyword set;
A text word set acquisition unit, configured to obtain the text word set corresponding to each of at least one text to be retrieved;
A second word vector determination unit, configured to determine the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
A similarity calculation unit, configured to calculate the similarity between the word vector of each word in the keyword set and the word vector of each word in the text word set corresponding to each of the at least one text to be retrieved;
A text ranking output unit, configured to rank and output the at least one text to be retrieved according to the similarity.
7. The device according to claim 6, characterized in that the text word set acquisition unit comprises:
A text-to-be-retrieved determination subunit, configured to determine at least one text to be retrieved based on the retrieval text, using a text similarity algorithm;
A word segmentation subunit, configured to segment each text to be retrieved into words to obtain a plurality of words;
A candidate word set determination subunit, configured to remove repeated words and stop words from the plurality of words to obtain a candidate word set;
A TextRank value calculation subunit, configured to calculate, for each word in the candidate word set, the TextRank value of the word using the TextRank algorithm;
A text word set determination subunit, configured to determine the text word set from the candidate word set according to the TextRank value of each word in the candidate word set.
8. The device according to claim 6 or 7, characterized in that the text ranking output unit comprises:
A maximum similarity determination subunit, configured to obtain, for any text to be retrieved, from the calculated similarities between each word in the keyword set and each word in the text word set of that text, the maximum similarity corresponding to each word in the keyword set;
A ranking score determination subunit, configured to sort the maximum similarities corresponding to the words in the keyword set in descending order, and to take the maximum similarity at a preset rank position as the ranking score of the text to be retrieved;
A text ranking output subunit, configured to rank and output the at least one text to be retrieved according to the ranking score of each text to be retrieved.
9. A storage medium, characterized in that a program is stored thereon, and the program, when executed by a processor, implements the text retrieval method according to any one of claims 1 to 5.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the text retrieval method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711043608.2A CN110019668A (en) 2017-10-31 2017-10-31 A kind of text searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711043608.2A CN110019668A (en) 2017-10-31 2017-10-31 A kind of text searching method and device

Publications (1)

Publication Number Publication Date
CN110019668A true CN110019668A (en) 2019-07-16

Family

ID=67186714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711043608.2A Pending CN110019668A (en) 2017-10-31 2017-10-31 A kind of text searching method and device

Country Status (1)

Country Link
CN (1) CN110019668A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120330978A1 (en) * 2008-06-24 2012-12-27 Microsoft Corporation Consistent phrase relevance measures
CN103886063A (en) * 2014-03-18 2014-06-25 国家电网公司 Text retrieval method and device
US20160070803A1 (en) * 2014-09-09 2016-03-10 Funky Flick, Inc. Conceptual product recommendation
CN105653671A (en) * 2015-12-29 2016-06-08 畅捷通信息技术股份有限公司 Similar information recommendation method and system
CN106021223A (en) * 2016-05-09 2016-10-12 Tcl集团股份有限公司 Sentence similarity calculation method and system
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN106991092A (en) * 2016-01-20 2017-07-28 阿里巴巴集团控股有限公司 The method and apparatus that similar judgement document is excavated based on big data
CN107066621A (en) * 2017-05-11 2017-08-18 腾讯科技(深圳)有限公司 A kind of search method of similar video, device and storage medium
CN107153689A (en) * 2017-04-29 2017-09-12 安徽富驰信息技术有限公司 A kind of case search method based on Topic Similarity
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨丽萍: "面向自然语言的法律检索系统的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866144A (en) * 2019-11-06 2020-03-06 腾讯音乐娱乐科技(深圳)有限公司 Song retrieval method and device
CN110866144B (en) * 2019-11-06 2022-08-05 腾讯音乐娱乐科技(深圳)有限公司 Song retrieval method and device
CN111061869A (en) * 2019-11-13 2020-04-24 北京数字联盟网络科技有限公司 Application preference text classification method based on TextRank
CN111061869B (en) * 2019-11-13 2024-01-26 北京数字联盟网络科技有限公司 Text classification method for application preference based on TextRank
CN110968702A (en) * 2019-11-29 2020-04-07 北京明略软件系统有限公司 Method and device for extracting matter relationship
CN110968702B (en) * 2019-11-29 2023-05-09 北京明略软件系统有限公司 Method and device for extracting rational relation
CN111159461A (en) * 2019-12-30 2020-05-15 秒针信息技术有限公司 Audio file determination method and device, storage medium and electronic device
CN111159461B (en) * 2019-12-30 2023-10-03 秒针信息技术有限公司 Audio file determining method and device, storage medium and electronic device
CN111274808B (en) * 2020-02-11 2023-07-04 支付宝(杭州)信息技术有限公司 Text retrieval method, model training method, text retrieval device, and storage medium
CN111274808A (en) * 2020-02-11 2020-06-12 支付宝(杭州)信息技术有限公司 Text retrieval method, model training method, text retrieval device, and storage medium
US11836174B2 (en) 2020-04-24 2023-12-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus of establishing similarity model for retrieving geographic location
CN111666461A (en) * 2020-04-24 2020-09-15 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for retrieving geographical location
CN111666461B (en) * 2020-04-24 2023-05-26 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for retrieving geographic location
CN111538830B (en) * 2020-04-28 2023-09-05 清华大学 French searching method, device, computer equipment and storage medium
CN111538830A (en) * 2020-04-28 2020-08-14 清华大学 French retrieval method, French retrieval device, computer equipment and storage medium
CN112257436A (en) * 2020-09-29 2021-01-22 华为技术有限公司 Text detection method and device
CN112257436B (en) * 2020-09-29 2024-04-02 华为技术有限公司 Text detection method and device
WO2022141876A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Word embedding-based search method, apparatus and device, and storage medium
CN112732870A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Searching method, device and equipment based on word vector and storage medium
CN112732870B (en) * 2020-12-31 2024-03-05 平安科技(深圳)有限公司 Word vector based search method, device, equipment and storage medium
CN112988971A (en) * 2021-03-15 2021-06-18 平安科技(深圳)有限公司 Word vector-based search method, terminal, server and storage medium
CN114036946B (en) * 2021-11-26 2023-07-07 浪潮卓数大数据产业发展有限公司 Text feature extraction and auxiliary retrieval system and method
CN115203379A (en) * 2022-09-15 2022-10-18 太平金融科技服务(上海)有限公司深圳分公司 Retrieval method, retrieval apparatus, computer device, storage medium, and program product
CN116186203A (en) * 2023-03-01 2023-05-30 人民网股份有限公司 Text retrieval method, text retrieval device, computing equipment and computer storage medium
CN116186203B (en) * 2023-03-01 2023-10-10 人民网股份有限公司 Text retrieval method, text retrieval device, computing equipment and computer storage medium

Similar Documents

Publication Publication Date Title
CN110019668A (en) A kind of text searching method and device
CN108241621B (en) legal knowledge retrieval method and device
US9542477B2 (en) Method of automated discovery of topics relatedness
US10268758B2 (en) Method and system of acquiring semantic information, keyword expansion and keyword search thereof
TWI710917B (en) Data processing method and device
CN110019669B (en) Text retrieval method and device
CN109582948B (en) Method and device for extracting evaluation viewpoints
CN110019670A (en) A kind of text searching method and device
CN106610931B (en) Topic name extraction method and device
CN112328544B (en) Multidisciplinary simulation data classification method, device and storage medium
CN112329460A (en) Text topic clustering method, device, equipment and storage medium
CN109597983A (en) A kind of spelling error correction method and device
CN110019785A (en) A kind of file classification method and device
CN106598997B (en) Method and device for calculating text theme attribution degree
CN109597982A (en) Summary texts recognition methods and device
CN109117434A (en) Judgement document's search method, device, storage medium and processor
CN110019665A (en) Text searching method and device
CN105786929B (en) A kind of information monitoring method and device
CN115563268A (en) Text abstract generation method and device, electronic equipment and storage medium
CN112487181A (en) Keyword determination method and related equipment
Zimniewicz et al. Scheduling aspects in keyword extraction problem
CN110019295A (en) Database index method, device, system and storage medium
CN110895703A (en) Legal document routing identification method and device
CN115391656A (en) User demand determination method, device and equipment
CN112613320A (en) Method and device for acquiring similar sentences, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190716