CN110413985B - Related text segment searching method and device - Google Patents

Publication number
CN110413985B
Authority
CN
China
Prior art keywords
word
similarity
word vector
search
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810394787.2A
Other languages
Chinese (zh)
Other versions
CN110413985A (en)
Inventor
何耀
蒋松岐
刘笑逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Haima Light Sail Entertainment Technology Co ltd
Original Assignee
Beijing Haima Light Sail Entertainment Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Haima Light Sail Entertainment Technology Co ltd
Priority to CN201810394787.2A
Publication of CN110413985A
Application granted
Publication of CN110413985B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a text segment retrieval method and a text segment retrieval device. A word vector matrix corresponding to a text segment is generated from the word vectors of the feature words in the text segment. Because the feature words reflect the main content of the text segment, the word vector matrix generated from their word vectors can be used to represent the text segment, while the word vector of a search word is used to represent the search word. The similarity between each word vector matrix and the word vector of the search word is calculated and taken as the similarity between the text segment and the search word; the higher this similarity, the higher the correlation between the text segment and the search word. Text segments whose similarity is greater than or equal to a first threshold are taken as the retrieval result, which improves the retrieval accuracy for related text segments.

Description

Related text segment searching method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for searching related text segments.
Background
For literary creators, source material is an important factor, and access to relevant literary material can effectively improve creative efficiency. For example, obtaining narrative passages related to a narrative literary work can be of great benefit to practitioners who adapt such works into drama. Every part of an existing literary work can serve as creative material, and the parts may cover different themes; how to obtain the parts relating to a given theme as material is a problem of particular concern to researchers of literary works.
Treating each part of a literary work as a text segment, the conventional way of retrieving theme-related text segments relies mainly on experienced content editors: they read the literary work, split it into text segments based on their subjective understanding, add theme labels to the resulting text segments, and then match search words against the theme labels to retrieve related text segments. This retrieval method incurs substantial labor and time costs, and manually added theme labels may be subjectively biased, which reduces the retrieval accuracy for related text segments.
Disclosure of Invention
In order to solve the problems of high cost and low accuracy of a retrieval result in a text segment retrieval method in the prior art, the embodiment of the application provides a text segment retrieval method and a text segment retrieval device.
The method for retrieving the text segments provided by the embodiment of the application comprises the following steps:
acquiring a set of text segments, wherein the set of text segments comprises at least one text segment, extracting feature words of the text segments, generating a word vector matrix corresponding to the text segments according to word vectors of the feature words, and acquiring word vectors of search words;
calculating the similarity of each word vector matrix and the word vector of the search word, and taking the text segment with the similarity larger than or equal to a first threshold value as a search result.
Optionally, the calculating the similarity between each word vector matrix and the word vector of the search word includes:
calculating a first similarity between a word vector of each feature word in each text segment and a word vector of the search word, and generating an adjusted word vector matrix corresponding to the text segment according to the word vector of the feature word of which the first similarity is greater than or equal to a second threshold;
and calculating the similarity between each adjusted word vector matrix and the word vector of the search word.
Optionally, the calculating the similarity between each adjusted word vector matrix and the word vector of the search word includes:
calculating an average vector of the adjusted word vector matrix by averaging the word vectors of the feature words in the adjusted word vector matrix;
and calculating the similarity between the average vector of the adjusted word vector matrix and the word vector of the search word.
Optionally, the calculating the similarity between each word vector matrix and the word vector of the search word includes:
calculating an average vector of the word vector matrix by averaging the word vectors of the feature words in the word vector matrix;
and calculating the similarity of the average vector of the word vector matrix and the word vector of the search word.
Optionally, the similarity includes a second similarity, and the calculating the similarity between each word vector matrix and the word vector of the search word includes:
calculating a third similarity between each word vector matrix and the word vector of the search word;
according to the formula:
Figure BDA0001644327830000021
calculating a second similarity between the word vector matrix and the word vector of the search word, wherein $\mathrm{sim}_2$ is the second similarity, $\mathrm{sim}_1$ is the third similarity, $\alpha$ is a first adjustment coefficient, $\beta$ is a second adjustment coefficient, $n_1$ is the number of word vectors of feature words in the adjusted word vector matrix, and $n_2$ is the number of word vectors of feature words in the word vector matrix;
the taking the text segments with the similarity greater than or equal to a first threshold as a retrieval result comprises:
and taking the text segments with the second similarity larger than or equal to a first threshold value as retrieval results.
Optionally, the calculating the similarity between each word vector matrix and the word vector of the search word includes:
acquiring the similarity between the word vector of each feature word in the word vector matrix and the word vector of the search word;
and obtaining the average value of these similarities for the word vector matrix.
Optionally, the search result is multiple, and the taking the text segment with the similarity greater than or equal to the first threshold as the search result includes:
and sorting the text segments with the similarity greater than or equal to a first threshold value from high to low according to the similarity value, and taking the first m text segments as search results, wherein m is the preset number of the search results.
The device for retrieving the text segments provided by the embodiment of the application comprises the following components:
the word vector and word vector matrix acquisition unit is used for acquiring a set of text segments, wherein the set of text segments comprises at least one text segment, extracting the characteristic words of the text segments, generating a word vector matrix corresponding to the text segments according to the word vectors of the characteristic words, and acquiring the word vectors of the search words;
and the retrieval result acquisition unit is used for calculating the similarity between each word vector matrix and the word vector of the retrieval word and taking the text segment with the similarity larger than or equal to a first threshold value as a retrieval result.
Optionally, the search result obtaining unit includes:
the adjusted word vector matrix obtaining subunit is configured to calculate a first similarity between a word vector of each feature word in each text segment and a word vector of the search word, and generate an adjusted word vector matrix corresponding to the text segment according to a word vector of a feature word of which the first similarity is greater than or equal to a second threshold;
the similarity calculation subunit is used for calculating the similarity between each adjusted word vector matrix and the word vector of the search word;
and the retrieval result acquisition subunit is used for taking the text segments with the similarity greater than or equal to a first threshold value as retrieval results.
Optionally, the similarity calculation subunit includes:
an average vector obtaining subunit, configured to calculate an average vector of the adjusted word vector matrix by averaging the word vectors of the feature words in the adjusted word vector matrix;
and the adjusted similarity calculation subunit is used for calculating the similarity between the average vector of the adjusted word vector matrix and the word vector of the search word.
The method and the device for retrieving text segments provided by the embodiments of the application obtain a set of text segments, extract the feature words of each text segment in the set, generate a word vector matrix corresponding to each text segment from the word vectors of its feature words, obtain the word vector of a search word, calculate the similarity between each word vector matrix and the word vector of the search word, and take the text segments whose similarity is greater than or equal to a first threshold as the search result. Because the feature words reflect the main content of a text segment, the word vector matrix generated from their word vectors can be used to represent the text segment, while the word vector of the search word represents the search word. The similarity between the text segment and the search word is thus represented by the similarity between the word vector matrix and the word vector of the search word; the higher this similarity, the higher the correlation between the text segment and the search word. Taking the text segments with higher similarity as the search result therefore improves the accuracy of retrieving related text segments.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for retrieving text segments according to an embodiment of the present disclosure;
Fig. 2 is a block diagram of an apparatus for retrieving text segments according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, theme tags are added to text segments manually, and text segments related to a search word are found by matching the search word against the theme tags. This approach depends heavily on the accuracy and comprehensiveness of the theme tags, and manual tagging may introduce subjective bias that further reduces their accuracy. For example, different people may assign different theme tags, such as "fight" or "conflict", to the same text segment, and these words match a given search word differently, which may lead to inaccurate search results.
In order to solve the above technical problem, an embodiment of the present application provides a method for retrieving a text segment, in which a word vector matrix of the text segment and a word vector of a search word are obtained. The word vector of the search word is obtained from the search word, and the word vector matrix of the text segment is generated from the word vectors of the feature words of the text segment, so the word vector matrix is related to the main content of the text segment. The similarity between the text segment and the search word is represented by the similarity between the word vector matrix and the word vector of the search word, and the text segments for which this similarity is greater than or equal to a first threshold are taken as the retrieval result, thereby improving the accuracy of text segment retrieval.
Referring to fig. 1, a flowchart of a method for text segment retrieval provided by an embodiment of the present application is shown, where the method includes the following steps.
S101, acquiring a set of text segments, wherein the set of text segments comprises at least one text segment, extracting feature words of the text segments, generating a word vector matrix corresponding to the text segments according to word vectors of the feature words, and acquiring word vectors of search words.
The text segment may be a part of the literary work, for example, a chapter or a paragraph, and each text segment may have a single theme or a plurality of themes.
In the embodiment of the application, the literary work can be acquired through a crawler technology and divided into a plurality of text segments. For example, the text content of a paragraph is used as a text segment by segmenting the paragraph, so that a large number of text segments are obtained.
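The paragraph-based segmentation described above can be sketched as follows. This is only a minimal illustration, assuming that the work is already available as plain text and that paragraphs are separated by blank lines; the function name `split_into_segments` is hypothetical and the crawling step is not shown.

```python
import re


def split_into_segments(work_text: str) -> list[str]:
    """Split a literary work into text segments, one segment per paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", work_text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]


work = "First paragraph of the story.\n\nSecond paragraph,\nstill the second.\n\n\nThird."
segments = split_into_segments(work)
```

A single line break inside a paragraph is kept as part of the same text segment; only blank lines start a new segment.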
In addition, to further improve the accuracy of the search, when literary works are acquired through the crawler technology, they can be classified according to a certain rule so that works of a single type are obtained, for example narrative literary works or descriptive literary works. This is because works of the same type use words in similar ways, whereas the same word may carry different meanings in works of different types, which may make the search result inaccurate. For example, in narrative literature the word "fight" may indicate that qi is raised through qigong to form a field of energy around the body, while in other kinds of literature it may express "opinion or joy". After the literary works are classified, the text segments obtained by segmentation are assigned to the category of the corresponding literary work and are processed according to the category to which they belong, which avoids inaccurate retrieval results caused by differences in word usage between categories.
And combining the acquired text segments to form a set of text segments, wherein the set of text segments at least comprises one text segment. A collection of text segments may contain only one type of text segment.
A word vector matrix is obtained for each text segment in the set of text segments. Each word vector matrix is formed by combining the word vectors of a plurality of words in the corresponding text segment, for example the word vectors of the feature words in the text segment.
In this embodiment of the present application, the obtaining manner of the word vector matrix of the text segment may specifically be: extracting the characteristic words of the text segments, and generating a word vector matrix corresponding to the text segments according to the word vectors of the characteristic words of the text segments, wherein the word vector matrix is composed of at least one word vector of the characteristic words.
The feature words of a text segment are the more important words in the text segment, mainly nouns, verbs and pronouns. Feature words can be extracted by a convolutional neural network algorithm: the text segment is first segmented into a plurality of words, the resulting words are analyzed, and the more important words in the text segment are extracted as feature words. Before the feature words are extracted, a maximum number of feature words may be set in advance, and the number of extracted feature words should be less than or equal to this maximum.
After the feature words of the text segment are obtained, the word vector of each feature word can be obtained according to the word vector model.
Word vectors are the result of digitizing human language. Each word may correspond to a word vector, and the correspondence between words and word vectors forms a word vector model. For example, the word "I" may correspond to an n-dimensional word vector: [0.3, 0.8, …, 0.7]. For words with similar meanings, the similarity between the word vectors determined according to the word vector model is also high.
The word vector model can be obtained in advance by training with a convolutional neural network algorithm, for example using the open-source word2vec package. During training, a large amount of literary text can be obtained and segmented into words, and an initial correspondence between each word and a word vector is preset. The similarity between words with similar meanings is then calculated, and the preset correspondence between words and word vectors is adjusted according to the calculated similarity so that the word vectors of words with similar meanings become more similar. The finally obtained, adjusted correspondence between words and word vectors is used as the word vector model.
Different word vector models can be obtained for different types of literary works, for example, narrative literary works can have corresponding word vector models. The word vector of each characteristic word is obtained according to the word vector model, the corresponding word vector model can be determined according to the literary works type of the text segment, and then the word vector of each characteristic word is obtained according to the determined word vector model.
The word vectors of the feature words are combined to form the word vector matrix of the text segment. For example, suppose there are k feature words in the text segment and each feature word corresponds to an n-dimensional row vector: the word vector of the first feature word is $[a_{11}, a_{12}, \ldots, a_{1n}]$, the word vector of the second feature word is $[a_{21}, a_{22}, \ldots, a_{2n}]$, …, and the word vector of the k-th feature word is $[a_{k1}, a_{k2}, \ldots, a_{kn}]$. Then the word vector matrix of the text segment may be represented as
$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kn} \end{bmatrix}$$
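As an illustration of how the word vector matrix may be assembled, the following sketch stacks the word vectors of the feature words row by row. The toy model `WORD_VECTORS`, its 3-dimensional values and the function name are hypothetical, and silently skipping feature words that are missing from the model is an assumption of this sketch rather than something stated in the embodiment.

```python
# Hypothetical toy word vector model: word -> 3-dimensional word vector.
WORD_VECTORS = {
    "sword": [0.9, 0.1, 0.0],
    "duel": [0.8, 0.2, 0.1],
    "garden": [0.0, 0.7, 0.6],
}


def build_word_vector_matrix(feature_words, model):
    """Return a k x n matrix (list of rows), one row per feature word.

    Assumption: feature words that are not in the model are skipped.
    """
    return [list(model[word]) for word in feature_words if word in model]


matrix = build_word_vector_matrix(["sword", "duel", "unknown"], WORD_VECTORS)
```

Here `matrix` has two rows, one for "sword" and one for "duel", each of dimension n = 3.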
The search word is a basis for the user to search the text segment, for example, the user wants to obtain the text segment related to "fight", and the "fight" may be used as the search word.
After the search term is obtained, a word vector of the search term may be obtained. Specifically, a word vector of the search word may be obtained by using a word vector model, and in the case of classifying the text segment, the word vector model is a word vector model corresponding to a type of the literary work to which the text segment that the user wants to obtain belongs, for example, the user wants to obtain a text segment of a narrative literary work, and the word vector of the search word may be obtained by using the word vector model of the narrative text segment.
The generation of the word vector matrix of the text segment and the acquisition of the word vector of the search word can be performed in any order, and the implementation of the embodiment of the application is not affected.
S102, calculating the similarity between the word vector matrix of the text segment and the word vector of the search word, and taking the text segment with the similarity larger than or equal to a first threshold value as a search result.
In the embodiments of the application, the similarity between the text segment and the search word can be represented by the similarity between the word vector matrix of the text segment and the word vector of the search word. If this similarity is high, the text segment is related to the search word and can therefore be used as a search result that meets the user's search requirement.
The similarity between the word vector matrix of the text segment and the word vector of the search word is obtained and can be realized through a similarity calculation formula. Specifically, there may be the following two modes:
As one possible implementation, an average vector of the word vector matrix of the text segment may be calculated. The average vector is obtained by averaging the values at corresponding positions of the word vectors of the feature words, which yields a vector with the same dimension n as each word vector. The similarity between the average vector and the word vector of the search word is then obtained and used as the similarity between the word vector matrix of the text segment and the word vector of the search word, representing the similarity between the text segment and the search word.
The word vectors of the feature words can be averaged directly. Taking the word vectors of three feature words as an example, if the word vector of the first feature word is $[a_{11}, a_{12}, \ldots, a_{1n}]$, the word vector of the second feature word is $[a_{21}, a_{22}, \ldots, a_{2n}]$, and the word vector of the third feature word is $[a_{31}, a_{32}, \ldots, a_{3n}]$, then the average vector may be $[(a_{11}+a_{21}+a_{31})/3, (a_{12}+a_{22}+a_{32})/3, \ldots, (a_{1n}+a_{2n}+a_{3n})/3]$.
The average vector can also be obtained as a weighted average according to the weight of each feature word in the text segment. Again taking the word vectors of three feature words as an example, if the word vector of the first feature word is $[a_{11}, a_{12}, \ldots, a_{1n}]$ with a weight of 0.3, the word vector of the second feature word is $[a_{21}, a_{22}, \ldots, a_{2n}]$ with a weight of 0.2, and the word vector of the third feature word is $[a_{31}, a_{32}, \ldots, a_{3n}]$ with a weight of 0.5, then the weighted average vector is $[0.3a_{11}+0.2a_{21}+0.5a_{31}, 0.3a_{12}+0.2a_{22}+0.5a_{32}, \ldots, 0.3a_{1n}+0.2a_{2n}+0.5a_{3n}]$.
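The two averaging modes described above can be sketched in a few lines. This is an illustrative sketch only: the function name `average_vector` is hypothetical, the 2-dimensional vectors are made-up values, and the weights are assumed to sum to 1 as in the example of weights 0.3, 0.2 and 0.5.

```python
def average_vector(matrix, weights=None):
    """Average the rows of a k x n word vector matrix into one n-dimensional vector.

    With weights=None every feature word counts equally; otherwise weights[i]
    is the weight of the i-th feature word (assumed here to sum to 1).
    """
    if not matrix:
        raise ValueError("matrix must contain at least one word vector")
    k, n = len(matrix), len(matrix[0])
    if weights is None:
        # Direct mean of each column.
        return [sum(row[j] for row in matrix) / k for j in range(n)]
    if len(weights) != k:
        raise ValueError("one weight is required per feature word")
    # Weighted sum of each column.
    return [sum(w * row[j] for w, row in zip(weights, matrix)) for j in range(n)]


matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
plain = average_vector(matrix)
weighted = average_vector(matrix, weights=[0.3, 0.2, 0.5])
```

In both cases the result has the dimension of a single word vector, so it can be compared directly with the word vector of the search word.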
The similarity between the average vector and the word vector of the search word can be obtained through various calculation methods, for example the Pearson correlation coefficient, the Euclidean distance, the cosine similarity, the Jaccard coefficient, etc.
For cosine similarity, assume that the word vector of the search word is $[b_{11}, b_{12}, \ldots, b_{1n}]$ and the average vector of the text segment is $[c_{11}, c_{12}, \ldots, c_{1n}]$. The average vector and the word vector of the search word are substituted into the cosine similarity formula:
$$\cos\theta = \frac{\vec{b} \cdot \vec{c}}{\lVert \vec{b} \rVert \, \lVert \vec{c} \rVert}$$
where the vector $\vec{b} = [b_{11}, b_{12}, \ldots, b_{1n}]$ and the vector $\vec{c} = [c_{11}, c_{12}, \ldots, c_{1n}]$.
Thus, the similarity between the average vector and the word vector of the search word is obtained:
$$\cos\theta = \frac{\sum_{i=1}^{n} b_{1i}\, c_{1i}}{\sqrt{\sum_{i=1}^{n} b_{1i}^{2}}\;\sqrt{\sum_{i=1}^{n} c_{1i}^{2}}}$$
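The cosine similarity between the word vector of the search word and the average vector of a text segment may be computed as in the following sketch. The function name is hypothetical, and returning 0.0 when either vector has zero length is an assumption made here to avoid division by zero; the embodiment does not address that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)


search_vector = [1.0, 0.0]    # word vector of the search word (made-up values)
segment_average = [3.0, 4.0]  # average vector of a text segment (made-up values)
similarity = cosine_similarity(search_vector, segment_average)
```

The value lies between -1 and 1, and vectors pointing in the same direction give a similarity of 1 regardless of their lengths.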
As another possible implementation, the similarity between the word vector of each feature word in the word vector matrix and the word vector of the search word may be obtained first. The average of the resulting similarities is then calculated and used as the similarity between the word vector matrix of the text segment and the word vector of the search word, representing the similarity between the text segment and the search word.
The similarity between the word vector of a feature word and the word vector of the search word may be calculated in the same way as the similarity between the average vector and the word vector of the search word, which is not repeated here. The similarities between the word vectors of the feature words and the word vector of the search word can be averaged either by directly calculating the mean or by calculating a weighted mean according to the weight of each feature word. Taking three feature words as an example, suppose the similarities between their word vectors and the word vector of the search word are $T_1$, $T_2$ and $T_3$, and the weights of the three feature words are 0.3, 0.2 and 0.5 respectively. The similarity obtained by directly calculating the mean is $(T_1+T_2+T_3)/3$, and the weighted mean obtained from the weights is $0.3T_1+0.2T_2+0.5T_3$. The average similarity calculated in either way can be used as the similarity between the word vector matrix of the text segment and the word vector of the search word, representing the similarity between the text segment and the search word.
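The second mode, averaging per-word similarities instead of averaging the vectors first, might look like the sketch below. Cosine similarity is used as the per-word measure purely as an example; the function names and the toy vectors are hypothetical, and the weights are assumed to sum to 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(matrix, search_vector, weights=None):
    """Average the similarities T_1..T_k between each row and the search word vector."""
    sims = [cosine_similarity(row, search_vector) for row in matrix]
    if not sims:
        return 0.0  # Assumption: an empty matrix is treated as unrelated.
    if weights is None:
        return sum(sims) / len(sims)                   # (T_1 + ... + T_k) / k
    return sum(w * t for w, t in zip(weights, sims))   # w_1*T_1 + ... + w_k*T_k


matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
search_vector = [1.0, 0.0]
direct = mean_similarity(matrix, search_vector)
weighted = mean_similarity(matrix, search_vector, weights=[0.3, 0.2, 0.5])
```

Note that this mode generally gives a different value from comparing the search word with the average vector, because averaging and the similarity function do not commute.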
After the similarity between the word vector matrix of the text segment and the word vector of the search word is obtained, the similarity between the word vector matrix of the text segment and the word vector of the search word can represent the similarity between the text segment and the search word, and the higher the similarity is, the higher the correlation between the text segment and the search word is, so that the text segment with the similarity larger than or equal to a first threshold value can be used as a search result according to a first threshold value obtained in advance. The first threshold may be determined according to the number of search results required by the user and the correlation between the search results and the search term, where the higher the first threshold is, the smaller the number of search results is, and the lower the first threshold is, the larger the number of search results is.
When the retrieval result is a plurality of text segments, the text segments with the similarity greater than or equal to the first threshold value can be ranked from high to low according to the similarity value, and the top m text segments are used as the search result, wherein m is the preset number of search results, so that the user can obtain the m text segments most relevant to the retrieval word.
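The thresholding and top-m ranking described above can be sketched as follows. The function name `select_results`, the segment identifiers and the similarity values are all hypothetical; the similarities are assumed to have been calculated beforehand by one of the modes described earlier.

```python
def select_results(similarities, first_threshold, m=None):
    """Keep segments whose similarity >= first_threshold, ranked high to low.

    similarities: dict mapping a segment identifier to its similarity.
    m: preset number of search results; None returns every segment kept.
    """
    kept = [(seg, sim) for seg, sim in similarities.items() if sim >= first_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept if m is None else kept[:m]


scores = {"seg_a": 0.91, "seg_b": 0.42, "seg_c": 0.77, "seg_d": 0.60}
top_two = select_results(scores, first_threshold=0.6, m=2)
all_kept = select_results(scores, first_threshold=0.6)
```

Raising `first_threshold` shrinks the list of kept segments, and `m` caps how many of them are returned to the user.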
In the text segment retrieval method provided by the embodiments of the application, a set of text segments is obtained, the feature words of each text segment in the set are extracted, a word vector matrix corresponding to each text segment is generated from the word vectors of its feature words, the word vector of a search word is obtained, the similarity between each word vector matrix and the word vector of the search word is calculated, and the text segments whose similarity is greater than or equal to a first threshold are taken as the search result. Because the feature words reflect the main content of a text segment, the word vector matrix generated from their word vectors can be used to represent the text segment, while the word vector of the search word represents the search word. The similarity between the text segment and the search word is thus represented by the similarity between the word vector matrix and the word vector of the search word; the higher this similarity, the higher the correlation between the text segment and the search word. Taking the text segments with higher similarity as the search result therefore improves the accuracy of retrieving related text segments.
In this embodiment of the present application, the word vector matrix may also be adjusted. As a possible implementation, the similarity between each word vector matrix and the word vector of the search word may be calculated as follows: calculate a first similarity between the word vector of each feature word in the text segment and the word vector of the search word, generate an adjusted word vector matrix corresponding to the text segment from the word vectors of the feature words whose first similarity is greater than or equal to a second threshold, and calculate the similarity between each adjusted word vector matrix and the word vector of the search word. The adjusted word vector matrix may consist of the word vector of at least one feature word whose first similarity is greater than or equal to the second threshold.
That is, after the word vector matrix is generated, the word vectors of the feature words forming it can be screened: the similarity between the word vector of each feature word and the word vector of the search word is calculated, the word vectors of feature words with low correlation to the search word are removed, and the remaining word vectors, whose similarity to the word vector of the search word is greater than or equal to the second threshold, form the adjusted word vector matrix.
Taking four feature words as an example, let the word vector of the first feature word be $[a_{11}, a_{12}, \ldots, a_{1n}]$, that of the second feature word be $[a_{21}, a_{22}, \ldots, a_{2n}]$, that of the third feature word be $[a_{31}, a_{32}, \ldots, a_{3n}]$, and that of the fourth feature word be $[a_{41}, a_{42}, \ldots, a_{4n}]$. The word vector matrix formed from them is then
$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ a_{31} & a_{32} & \cdots & a_{3n} \\ a_{41} & a_{42} & \cdots & a_{4n} \end{bmatrix}$$
If calculating the similarity between the word vector of each feature word and the word vector of the search word shows that the similarity for the third feature word is low, that is, the third feature word has low relevance to the search word, the word vector of the third feature word can be removed from the word vector matrix, giving the adjusted word vector matrix
$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ a_{41} & a_{42} & \cdots & a_{4n} \end{bmatrix}$$
It should be noted that if the similarity between the word vector of every feature word in the word vector matrix and the word vector of the search word is low, the text segment has low relevance to the search word and the adjusted word vector matrix may be empty. In that case the text segment is considered not to be a search result the user needs and is excluded.
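The screening step described above can be sketched in Python as follows. This is only an illustrative sketch: cosine similarity is assumed as the first similarity measure, and the function names `cosine` and `adjust_matrix`, the two-dimensional vectors, and the threshold value 0.5 are hypothetical and are not taken from the application.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjust_matrix(matrix, query_vec, second_threshold):
    """Keep only the rows (feature word vectors) whose first similarity
    to the search word vector is >= second_threshold."""
    return [row for row in matrix if cosine(row, query_vec) >= second_threshold]


# Hypothetical word vector matrix: one row per feature word.
matrix = [
    [1.0, 0.0],  # first feature word
    [0.8, 0.6],  # second feature word
    [0.0, 1.0],  # third feature word, unrelated to the search word
    [0.6, 0.8],  # fourth feature word
]
query = [1.0, 0.0]  # hypothetical word vector of the search word

adjusted = adjust_matrix(matrix, query, second_threshold=0.5)
print(adjusted)  # the row of the third feature word has been removed

# If no row passes the threshold, the adjusted matrix is empty and the
# text segment would be excluded from the search results.
print(adjust_matrix(matrix, query, second_threshold=1.1))
```

With these toy values the third row is dropped, mirroring the four-feature-word example, and an unreachable threshold yields an empty adjusted matrix.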
After the adjusted word vector matrices are obtained, the similarity between each adjusted word vector matrix and the word vector of the search word can be calculated, and each text segment whose similarity is greater than or equal to the first threshold is taken as a search result. Specifically, the average vector of the adjusted word vector matrix may be calculated as the average of the word vectors of the feature words in that matrix, and the similarity between this average vector and the word vector of the search word may then be calculated. This is done in the same way as for the unadjusted word vector matrix and is not illustrated again here.
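A minimal Python sketch of the average-vector calculation is given below. It assumes cosine similarity as the similarity measure; the names `mean_vector` and `segment_similarity` and the sample vectors are hypothetical, and returning `None` for an empty matrix is just one way of signalling that the text segment should be excluded.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(matrix):
    """Column-wise average of the rows of a non-empty word vector matrix."""
    n = len(matrix)
    return [sum(column) / n for column in zip(*matrix)]


def segment_similarity(matrix, query_vec):
    """Similarity between a (possibly adjusted) word vector matrix and the
    search word vector, computed via the average vector of the matrix.
    Returns None for an empty matrix."""
    if not matrix:
        return None
    return cosine(mean_vector(matrix), query_vec)


adjusted = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical adjusted matrix
query = [1.0, 0.0]                   # hypothetical search word vector
first_threshold = 0.6                # hypothetical first threshold

sim = segment_similarity(adjusted, query)
is_result = sim is not None and sim >= first_threshold
print(sim, is_result)
```

Here the average vector is `[0.5, 0.5]`, so the similarity is the cosine of a 45-degree angle, which passes the assumed threshold of 0.6.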
Screening the word vectors of the feature words that make up the word vector matrix and removing those with low relevance to the search word means that only the word vectors of feature words highly relevant to the search word are considered during retrieval, which reduces the amount of computation in the subsequent retrieval.
The number of feature words also affects how reliable the similarity is. For example, if there is only one feature word, the similarity between its word vector and the word vector of the search word is used directly as the similarity between the word vector matrix and the word vector of the search word. When there are many feature words, the final similarity combines the similarities between all of their word vectors and the word vector of the search word. The similarity is therefore more reliable when there are many feature words, and less reliable when there are few, because it is easily dominated by the word vector of a single feature word.
To further improve the reliability of the similarity, the similarity obtained between the adjusted word vector matrix and the word vector of the search word can itself be adjusted. For ease of distinction, the obtained similarity is referred to as the third similarity, and it is adjusted according to an adjustment formula to obtain the second similarity. For example, the adjustment formula may be
[Formula image BDA0001644327830000121 is not reproduced in this text; it expresses $sim_2$ in terms of $sim_1$, $\alpha$, $\beta$, $n_1$ and $n_2$.]
where $sim_2$ is the second similarity, $sim_1$ is the third similarity, $\alpha$ is the first adjustment coefficient, $\beta$ is the second adjustment coefficient, $n_1$ is the number of word vectors of feature words in the adjusted word vector matrix, and $n_2$ is the number of word vectors of feature words corresponding to the word vector matrix.
After the second similarity between the word vector matrix and the word vector of the search word is obtained, each text segment whose second similarity is greater than or equal to the first threshold is taken as a search result.
In the embodiment of the application, a good adjustment effect is obtained when the first adjustment coefficient α is 2 and the second adjustment coefficient β is 0.03. The adjusted second similarity then better weakens the influence of the number of feature words, which improves the accuracy of the search results.
Based on the text segment retrieval method provided by the above embodiment, the embodiment of the application further provides a text segment retrieval device, whose working principle is described in detail below with reference to the accompanying drawings.
Referring to Fig. 2, which is a structural block diagram of a text segment retrieval device according to an embodiment of the application, the device includes:
a word vector and word vector matrix obtaining unit 101, configured to obtain a set of text segments, where the set of text segments includes at least one text segment, extract a feature word of the text segment, generate a word vector matrix corresponding to the text segment according to a word vector of the feature word, and obtain a word vector of a search word;
the retrieval result obtaining unit 102 is configured to calculate the similarity between each word vector matrix and the word vector of the search word, and to take each text segment whose similarity is greater than or equal to a first threshold as a search result.
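The division of work between the two units can be sketched in Python as below. This is a rough illustration under several assumptions that are not taken from the application: the class names, the `embeddings` lookup table, the toy feature-word extraction (whitespace tokens that have a vector), the use of the average vector with cosine similarity, and the threshold value are all hypothetical.

```python
import math


class VectorAcquisitionUnit:
    """Sketch of unit 101: builds word vector matrices and the search word vector."""

    def __init__(self, embeddings):
        self.embeddings = embeddings  # hypothetical word -> vector lookup table

    def matrix_for(self, segment):
        # Toy feature-word extraction: whitespace tokens that have a vector.
        return [self.embeddings[w] for w in segment.split() if w in self.embeddings]

    def vector_for(self, search_word):
        return self.embeddings[search_word]


class RetrievalResultUnit:
    """Sketch of unit 102: scores each matrix and applies the first threshold."""

    def __init__(self, first_threshold):
        self.first_threshold = first_threshold

    @staticmethod
    def _cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    def similarity(self, matrix, query_vec):
        # Average the rows, then compare the average vector with the query.
        mean = [sum(column) / len(matrix) for column in zip(*matrix)]
        return self._cosine(mean, query_vec)

    def retrieve(self, segments, matrices, query_vec):
        return [
            segment
            for segment, matrix in zip(segments, matrices)
            if matrix and self.similarity(matrix, query_vec) >= self.first_threshold
        ]


embeddings = {  # hypothetical two-dimensional word vectors
    "cat": [1.0, 0.0], "kitten": [0.9, 0.1],
    "stock": [0.0, 1.0], "market": [0.1, 0.9],
}
unit_101 = VectorAcquisitionUnit(embeddings)
unit_102 = RetrievalResultUnit(first_threshold=0.8)

segments = ["cat kitten", "stock market"]
matrices = [unit_101.matrix_for(s) for s in segments]
results = unit_102.retrieve(segments, matrices, unit_101.vector_for("cat"))
print(results)
```

Keeping vector acquisition separate from scoring means the optional subunits described next (screening, averaging, similarity adjustment) would only change the second class.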
Optionally, the retrieval result obtaining unit 102 includes:
the adjusted word vector matrix obtaining subunit is configured to calculate a first similarity between a word vector of each feature word in each text segment and a word vector of the search word, and generate an adjusted word vector matrix corresponding to the text segment according to a word vector of a feature word of which the first similarity is greater than or equal to a second threshold;
the similarity calculation subunit is configured to calculate the similarity between each adjusted word vector matrix and the word vector of the search word;
and the retrieval result acquisition subunit is used for taking the text segments with the similarity greater than or equal to a first threshold value as retrieval results.
Optionally, the similarity calculation subunit includes:
the adjusted average vector obtaining subunit is configured to calculate the average vector of the adjusted word vector matrix as the average of the word vectors of the feature words in the adjusted word vector matrix;
and the adjusted similarity calculation subunit is configured to calculate the similarity between the average vector of the adjusted word vector matrix and the word vector of the search word.
Optionally, the retrieval result obtaining unit 102 includes:
the average vector obtaining subunit is configured to calculate the average vector of the word vector matrix as the average of the word vectors of the feature words in the word vector matrix;
the similarity calculation subunit is configured to calculate the similarity between the average vector of the word vector matrix and the word vector of the search word;
and the retrieval result acquisition subunit is used for taking the text segments with the similarity greater than or equal to a first threshold value as retrieval results.
Optionally, the similarity includes a second similarity, and the adjusted similarity calculation subunit includes:
the third similarity calculation subunit is configured to calculate a third similarity between each word vector matrix and the word vector of the search word;
the second similarity calculation subunit is configured to calculate, according to the formula
[Formula image BDA0001644327830000131 is not reproduced in this text; it expresses $sim_2$ in terms of $sim_1$, $\alpha$, $\beta$, $n_1$ and $n_2$.]
a second similarity between the word vector matrix and the word vector of the search word, where $sim_2$ is the second similarity, $sim_1$ is the third similarity, $\alpha$ is the first adjustment coefficient, $\beta$ is the second adjustment coefficient, $n_1$ is the number of word vectors of the feature words corresponding to the adjusted word vector matrix, and $n_2$ is the number of word vectors of the feature words corresponding to the word vector matrix;
and the retrieval result acquisition subunit is used for taking the text segments with the second similarity greater than or equal to a first threshold value as retrieval results.
Optionally, the retrieval result obtaining unit 102 includes:
the word vector similarity calculation subunit is configured to obtain the similarity between the word vector of each feature word in the word vector matrix and the word vector of the search word;
the average value calculation subunit is configured to obtain the average of these similarities over the word vector matrix;
and the retrieval result acquisition subunit is used for taking the text segments with the similarity greater than or equal to a first threshold value as retrieval results.
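The alternative of averaging per-word similarities, rather than comparing the average vector, can be sketched as follows. Cosine similarity is again an assumption, and `mean_of_similarities` and the sample vectors are hypothetical names and values used only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_of_similarities(matrix, query_vec):
    """Average of the similarities between each feature word vector (row)
    and the search word vector. Returns None for an empty matrix."""
    if not matrix:
        return None
    sims = [cosine(row, query_vec) for row in matrix]
    return sum(sims) / len(sims)


matrix = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical word vector matrix
query = [1.0, 0.0]                 # hypothetical search word vector

score = mean_of_similarities(matrix, query)
print(score)
```

For these two orthogonal feature words the averaged similarity is 0.5, whereas the cosine between the average vector `[0.5, 0.5]` and the query would be about 0.707, so the two optional implementations can score the same text segment differently.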
Optionally, there are multiple search results, and the retrieval result acquisition subunit is specifically configured to sort the text segments whose similarity is greater than or equal to the first threshold in descending order of similarity and to take the first m text segments as search results, where m is a preset number of search results.
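The threshold-then-rank step can be sketched in a few lines of Python. The function name `top_m`, the segment labels, the similarity values, and the chosen threshold and m are hypothetical; the similarities are assumed to have been computed already.

```python
def top_m(scored_segments, first_threshold, m):
    """Keep segments whose similarity is >= first_threshold, sort them from
    high to low similarity, and return the first m segments."""
    kept = [(seg, sim) for seg, sim in scored_segments if sim >= first_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [seg for seg, _ in kept[:m]]


# Hypothetical (text segment, similarity) pairs.
scored = [("A", 0.62), ("B", 0.91), ("C", 0.40), ("D", 0.75)]

results = top_m(scored, first_threshold=0.6, m=2)
print(results)  # ['B', 'D']
```

Segment C is discarded by the threshold, and of the remaining three only the two most similar are returned. If fewer than m segments pass the threshold, the slice simply returns all of them.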
The text segment retrieval device provided by the embodiment of the application obtains a set of text segments, extracts the feature words of each text segment in the set, generates a word vector matrix corresponding to each text segment from the word vectors of its feature words, obtains the word vector of the search word, calculates the similarity between each word vector matrix and the word vector of the search word, and takes each text segment whose similarity is greater than or equal to the first threshold as a search result. Because the feature words reflect the main content of a text segment, the word vector matrix generated from their word vectors can represent the text segment, just as the word vector of the search word represents the search word. The similarity between the word vector matrix and the word vector of the search word therefore represents the similarity between the text segment and the search word: the higher this similarity, the more relevant the text segment is to the search word. Taking the text segments with higher similarity as search results thus improves the accuracy of retrieving related text segments.
When introducing elements of various embodiments of the present application, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.
It should be noted that, as one of ordinary skill in the art would understand, all or part of the processes of the above method embodiments may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when executed, the computer program may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement without inventive effort.
The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims (8)

1. A method for text segment retrieval, the method comprising:
acquiring a set of text segments, wherein the set of text segments comprises at least one text segment, extracting feature words of the text segments, generating a word vector matrix corresponding to the text segments according to word vectors of the feature words, and acquiring word vectors of search words;
calculating the similarity of each word vector matrix and the word vector of the search word, and taking the text segment with the similarity larger than or equal to a first threshold value as a search result;
the calculating the similarity between each word vector matrix and the word vector of the search word comprises:
calculating a first similarity between a word vector of each feature word in each text segment and a word vector of the search word, and generating an adjusted word vector matrix corresponding to the text segment according to the word vector of the feature word of which the first similarity is greater than or equal to a second threshold;
and calculating the similarity between each adjusted word vector matrix and the word vector of the search word.
2. The method of claim 1, wherein said calculating the similarity between each of the adjusted word vector matrices and the word vector of the search word comprises:
calculating the average vector of the adjusted word vector matrix according to the average vector of the word vectors of the feature words in the adjusted word vector matrix;
and calculating the similarity between the average vector of the adjusted word vector matrix and the word vector of the search word.
3. The method of claim 2, wherein the similarity comprises a second similarity, and wherein calculating the similarity of each of the word vector matrices to the word vector of the search term comprises:
calculating a third similarity between each word vector matrix and the word vector of the search word;
according to the formula:
[Formula image DEST_PATH_IMAGE001 is not reproduced in this text; it expresses $sim_2$ in terms of $sim_1$, $\alpha$, $\beta$, $n_1$ and $n_2$.]
calculating a second similarity between the word vector matrix and the word vector of the search word, wherein $sim_2$ is the second similarity, $sim_1$ is the third similarity, $\alpha$ is a first adjustment coefficient, $\beta$ is a second adjustment coefficient, $n_1$ is the number of word vectors of the feature words corresponding to the adjusted word vector matrix, and $n_2$ is the number of word vectors of the feature words corresponding to the word vector matrix;
the taking the text segments with the similarity greater than or equal to a first threshold as a retrieval result comprises:
and taking the text segments with the second similarity larger than or equal to a first threshold value as retrieval results.
4. The method of claim 1, wherein the calculating the similarity between each word vector matrix and the word vector of the search word comprises:
calculating an average vector of the word vector matrix according to the average vector of the word vectors of the feature words in the word vector matrix;
and calculating the similarity of the average vector of the word vector matrix and the word vector of the search word.
5. The method of claim 1, wherein the calculating the similarity between each word vector matrix and the word vector of the search word comprises:
obtaining the similarity between the word vector of the characteristic word in the word vector matrix and the word vector of the search word;
and obtaining the average value of the similarity in the word vector matrix.
6. The method according to claim 1, wherein the search result is a plurality of search results, and the using the text segment with the similarity greater than or equal to a first threshold as the search result comprises:
and sorting the text segments with the similarity greater than or equal to a first threshold value from high to low according to the similarity value, and taking the first m text segments as retrieval results, wherein m is the preset number of the retrieval results.
7. An apparatus for text segment retrieval, the apparatus comprising:
the word vector and word vector matrix acquisition unit is used for acquiring a set of text segments, wherein the set of text segments comprises at least one text segment, extracting the characteristic words of the text segments, generating a word vector matrix corresponding to the text segments according to the word vectors of the characteristic words, and acquiring the word vectors of the search words;
the retrieval result acquisition unit is used for calculating the similarity between each word vector matrix and the word vector of the retrieval word and taking the text segment with the similarity larger than or equal to a first threshold value as a retrieval result;
the retrieval result acquisition unit includes:
the adjusted word vector matrix obtaining subunit is configured to calculate a first similarity between a word vector of each feature word in each text segment and a word vector of the search word, and generate an adjusted word vector matrix corresponding to the text segment according to a word vector of a feature word of which the first similarity is greater than or equal to a second threshold;
the similarity calculation subunit is configured to calculate the similarity between each adjusted word vector matrix and the word vector of the search word;
and the retrieval result acquisition subunit is used for taking the text segments with the similarity greater than or equal to a first threshold value as retrieval results.
8. The apparatus of claim 7, wherein the similarity calculation subunit comprises:
an average vector obtaining subunit, configured to calculate an average vector of the adjusted word vector matrix as the average of the word vectors of the feature words in the adjusted word vector matrix;
and an adjusted similarity calculation subunit, configured to calculate the similarity between the average vector of the adjusted word vector matrix and the word vector of the search word.
CN201810394787.2A 2018-04-27 2018-04-27 Related text segment searching method and device Active CN110413985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810394787.2A CN110413985B (en) 2018-04-27 2018-04-27 Related text segment searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810394787.2A CN110413985B (en) 2018-04-27 2018-04-27 Related text segment searching method and device

Publications (2)

Publication Number Publication Date
CN110413985A CN110413985A (en) 2019-11-05
CN110413985B true CN110413985B (en) 2022-09-16

Family

ID=68347013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810394787.2A Active CN110413985B (en) 2018-04-27 2018-04-27 Related text segment searching method and device

Country Status (1)

Country Link
CN (1) CN110413985B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282702B (en) * 2021-03-16 2023-12-19 广东医通软件有限公司 Intelligent retrieval method and retrieval system
CN117688140B (en) * 2024-02-04 2024-04-30 深圳竹云科技股份有限公司 Document query method, device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170051137A (en) * 2015-10-29 2017-05-11 한양대학교 산학협력단 Item recommendation apparatus and method for improving the reliability of the similarity calculated in collaborative filtering based recommendation system
CN107609101A (en) * 2017-09-11 2018-01-19 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium

Also Published As

Publication number Publication date
CN110413985A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN106503055B (en) A kind of generation method from structured text to iamge description
CN106156204B (en) Text label extraction method and device
CN107122352B (en) Method for extracting keywords based on K-MEANS and WORD2VEC
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN105786991B (en) In conjunction with the Chinese emotion new word identification method and system of user feeling expression way
WO2018066445A1 (en) Causal relationship recognition apparatus and computer program therefor
WO2017107566A1 (en) Retrieval method and system based on word vector similarity
WO2022121163A1 (en) User behavior tendency identification method, apparatus, and device, and storage medium
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
CN106294863A (en) A kind of abstract method for mass text fast understanding
CN107273474A (en) Autoabstract abstracting method and system based on latent semantic analysis
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN106599072B (en) Text clustering method and device
CN105930416A (en) Visualization processing method and system of user feedback information
CN110705247A (en) Based on x2-C text similarity calculation method
CN110866102A (en) Search processing method
CN108268470A (en) A kind of comment text classification extracting method based on the cluster that develops
CN116756347B (en) Semantic information retrieval method based on big data
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN108595411B (en) Method for acquiring multiple text abstracts in same subject text set
CN106681986A (en) Multi-dimensional sentiment analysis system
CN110413985B (en) Related text segment searching method and device
JP2009015796A (en) Apparatus and method for extracting multiplex topics in text, program, and recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant