CN104408036A - Correlated topic recognition method and device - Google Patents


Info

Publication number
CN104408036A
Authority
CN
China
Prior art keywords
target
topic
multidimensional
array corresponding
multidimensional array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410779602.1A
Other languages
Chinese (zh)
Other versions
CN104408036B (en)
Inventor
刘粉香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410779602.1A
Publication of CN104408036A
Application granted
Publication of CN104408036B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a correlated topic recognition method and device. The correlated topic recognition method comprises the steps of obtaining a target keyword; determining a multidimensional array corresponding to the target keyword, wherein each dimensionality number in the multidimensional array is used for representing one attribute of the target keyword; calculating the correlation index between the multidimensional array corresponding to the target keyword and multidimensional arrays corresponding to target topics, wherein the correlation index is used for representing the correlation between the target keyword and each target topic, the target topics are multiple pre-marked topics provided with the multidimensional arrays; determining the topics correlated with the target keyword according to the correlation index obtained through calculation. By means of the correlated topic recognition method and device, the problem of low topic recognition accuracy in the prior art is solved, and the effect of improving the topic recognition accuracy is achieved.

Description

Identification method and device of associated topics
Technical Field
The invention relates to the field of topic identification, in particular to a method and a device for identifying a related topic.
Background
Topic identification mainly refers to identifying, from a large amount of text, the topics related to a given keyword. For example, given the keyword "college entrance examination", the task is to identify the topics in the text that are relevant to it. A topic here may be a topic on the internet, such as a news topic or a microblog topic, and is mainly embodied in the form of text.
Currently, topic identification mainly matches the given keyword against the topics in the text: if the given keyword appears in a topic, the topic is considered to be related to the keyword. However, because language is flexible, a topic may be highly associated with a given keyword even though the keyword does not appear in it, and such a topic cannot be accurately identified by matching.
No effective solution has yet been proposed for the problem of low topic identification accuracy in the prior art.
Disclosure of Invention
The invention mainly aims to provide a method and a device for identifying a related topic, which aim to solve the problem of low accuracy of topic identification in the prior art.
In order to achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an identification method of a related topic. The identification method of the associated topics comprises the following steps: acquiring a target keyword; determining a multidimensional array corresponding to the target keyword by using a machine learning method, wherein each dimension number in the multidimensional array is used for representing one attribute of the target keyword; calculating a relevance index between a multidimensional array corresponding to the target keyword and a multidimensional array corresponding to a target topic, wherein the relevance index is used for representing relevance between the target keyword and each target topic, and the target topic is a plurality of topics which are marked in advance and have multidimensional arrays; and determining topics associated with the target keywords according to the calculated association index.
Further, calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic comprises: and calculating Euclidean distance between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, and taking the Euclidean distance as the association index, wherein the smaller the Euclidean distance between the target keyword and the topic is, the higher the association between the target keyword and the topic is.
Further, calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic comprises: acquiring a multidimensional array corresponding to the target topic; directly calculating the association index between the multidimensional arrays corresponding to the target topics and the multidimensional arrays corresponding to the target keywords, or acquiring the multidimensional arrays corresponding to each word in the target topics; calculating the association index between the multidimensional arrays corresponding to the target keywords and the multidimensional arrays corresponding to each word in the target topic; and calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic according to the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word.
Further, determining the topic associated with the target keyword according to the calculated association index includes: judging whether the calculated correlation index meets a preset condition or not; if the calculated association index is judged to meet the preset condition, determining that the target topic of which the calculated association index meets the preset condition is associated with the target keyword; and if the calculated association index is judged not to meet the preset condition, determining that the target topic of which the calculated association index does not meet the preset condition is not related to the target keyword.
Further, before obtaining the target keyword, the identification method further includes: acquiring a target text, wherein the target text comprises the target topic; utilizing a word segmentation tool to segment the target text and marking the part of speech of each word in the target text; determining the target topic according to the part of speech rule model established in advance and the part of speech of the word after word segmentation, and marking the target topic; and determining a multidimensional array corresponding to each word after word segmentation and a multidimensional array corresponding to the target topic according to a machine learning method.
In order to achieve the above object, according to another aspect of the embodiments of the present invention, there is provided an identification apparatus of a related topic. The related topic identification device according to the present invention includes: a first acquisition unit configured to acquire a target keyword; the first determining unit is used for determining a multidimensional array corresponding to the target keyword according to a machine learning method, wherein each dimension number in the multidimensional array is used for representing one attribute of the target keyword; the calculation unit is used for calculating a relevance index between a multidimensional array corresponding to the target keyword and a multidimensional array corresponding to a target topic, wherein the relevance index is used for representing the relevance between the target keyword and each target topic, and the target topic is a plurality of topics which are marked in advance and have multidimensional arrays; and a second determining unit, configured to determine, according to the calculated association index, a topic associated with the target keyword.
Further, the calculation unit includes: the first calculation module is configured to calculate a euclidean distance between a multidimensional array corresponding to the target keyword and a multidimensional array corresponding to the target topic, and use the euclidean distance as the association index, where a smaller euclidean distance between the target keyword and a topic indicates a higher association between the target keyword and the topic.
Further, the calculation unit includes: the first obtaining module is used for obtaining a multidimensional array corresponding to the target topic; a second calculating module, configured to directly calculate an association index between the multidimensional arrays corresponding to the target topic and the multidimensional arrays corresponding to the target keyword, or the calculating unit includes: the second acquisition module is used for acquiring a multi-dimensional array corresponding to each word in the target topic; the third calculation module is used for calculating the association index between the multidimensional arrays corresponding to the target keywords and the multidimensional arrays corresponding to each word in the target topic; and the fourth calculation module is used for calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word to obtain the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
Further, the second determination unit includes: the judging module is used for judging whether the calculated correlation index meets a preset condition or not; a determining module, configured to determine that the target topic whose correlation index meets the preset condition is associated with the target keyword if it is determined that the correlation index meets the preset condition; and if the calculated association index is judged not to meet the preset condition, determining that the target topic of which the calculated association index does not meet the preset condition is not related to the target keyword.
Further, the identification device further comprises: the second acquisition unit is used for acquiring a target text before acquiring a target keyword, wherein the target text comprises the target topic; the word segmentation unit is used for segmenting the target text by using a word segmentation tool and marking the part of speech of each word in the target text; the third determining unit is used for determining the target topic according to the part of speech rule model established in advance and the part of speech of the word after word segmentation, and marking the target topic; and the fourth determining unit is used for determining the multidimensional arrays corresponding to each word after word segmentation and the multidimensional arrays corresponding to the target topic.
In the embodiment of the invention, the target keyword is obtained, the multidimensional array corresponding to the target keyword is determined, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is calculated, the topic associated with the target keyword is determined according to the association index obtained by calculation, and the judgment of the correlation between the target keyword and the target topic is converted into the calculation of the association index between the multidimensional array used for representing the attribute of the target keyword and the multidimensional array used for representing the attribute of the target topic, so that the problem that the topic cannot be accurately identified in a keyword matching mode due to the fact that no keyword appears in the topic is solved, the problem of low topic identification accuracy in the prior art is solved, and the effect of improving the topic identification accuracy is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow diagram of a method of identifying associated topics in accordance with an embodiment of the present invention; and
FIG. 2 is a schematic diagram of an identification apparatus of related topics according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention provides a method for identifying a related topic.
Fig. 1 is a flowchart of an identification method of an associated topic according to an embodiment of the present invention. As shown in fig. 1, the method for identifying the associated topic includes the following steps:
Step S102: a target keyword is obtained.
There may be one or more target keywords, for example "2014" and "college entrance examination".
Step S104: the multidimensional array corresponding to the target keyword is determined by using a machine learning method. Each dimension value in the multidimensional array is used for representing one attribute of the target keyword.
Because each dimension value in the multidimensional array represents one attribute of the target keyword, the target keyword corresponds to a unique multidimensional array; that is, the target keyword is represented by the multidimensional array. For example, if 500-dimensional arrays are used, then once the target keyword is obtained, the unique 500-dimensional array corresponding to it is obtained through a machine learning method.
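The mapping from a keyword to its array can be pictured as a lookup in a table produced by training. The sketch below is only an illustration: the source does not name the machine learning method, so the table `EMBEDDINGS`, its values, and the helper `vector_for` are hypothetical, and 4 dimensions are used instead of 500 for readability.

```python
from typing import Dict, List

# Hypothetical pre-trained table: word -> fixed-length array.
# The values are made up; a real table might use 500 dimensions.
EMBEDDINGS: Dict[str, List[float]] = {
    "college entrance examination": [0.9, 0.1, 0.3, 0.0],
    "2014": [0.1, 0.8, 0.0, 0.2],
}
DIMENSIONS = 4


def vector_for(keyword: str) -> List[float]:
    """Return the array for a keyword and check its dimension count."""
    vector = EMBEDDINGS[keyword]  # raises KeyError for unknown keywords
    if len(vector) != DIMENSIONS:
        raise ValueError("unexpected number of dimensions")
    return vector
```

Every keyword and every topic must resolve to an array of the same length, which is why the dimension check is included.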
Step S106: the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each target topic is calculated. The association index is used for representing the relevance between the target keyword and each target topic, and the target topics are a plurality of topics which are marked in advance and have multidimensional arrays.
Step S108: the topics associated with the target keyword are determined according to the calculated association index.
Each target topic also corresponds to a unique multidimensional array, that is, each topic is represented by a unique multidimensional array, wherein each dimension number in the multidimensional array represents an attribute in the target topic. It should be noted that the number of dimensions of the multidimensional array used for representing the topic is the same as the number of dimensions representing the target keyword (the same applies below), so as to avoid calculation errors.
After the multidimensional array corresponding to the target keyword is determined, the association index between it and the multidimensional array corresponding to each target topic is calculated. Because a plurality of topics exist in the text, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each topic is calculated separately, so that the relevance between the target keyword and each topic is obtained. Finally, the topics associated with the target keyword are determined according to the calculated association index. Specifically, a corresponding threshold is set; for an index in which a larger value indicates higher relevance, a target topic is regarded as associated with the target keyword when its association index exceeds the threshold, and otherwise as not associated.
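As an illustration of the "exceeds the threshold" case, the sketch below uses cosine similarity as the association index. This is an assumption: the source does not name cosine similarity, and it is chosen here only because a larger value means higher relevance. The function names and the threshold value are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Association index in which a larger value means higher relevance."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero array as unrelated
    return dot / (norm_a * norm_b)


def associated_topics(
    keyword_vec: List[float],
    topic_vecs: Dict[str, List[float]],
    threshold: float,
) -> List[str]:
    """Keep the topics whose index against the keyword exceeds the threshold."""
    return [
        name
        for name, vec in topic_vecs.items()
        if cosine_similarity(keyword_vec, vec) > threshold
    ]
```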
In the embodiment of the invention, the target keyword is obtained, the multidimensional array corresponding to the target keyword is determined, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is calculated, the topic associated with the target keyword is determined according to the association index obtained by calculation, and the judgment of the correlation between the target keyword and the target topic is converted into the calculation of the association index between the multidimensional array used for representing the attribute of the target keyword and the multidimensional array used for representing the attribute of the target topic, so that the problem that the topic cannot be accurately identified in a keyword matching mode due to the fact that no keyword appears in the topic is solved, the problem of low topic identification accuracy in the prior art is solved, and the effect of improving the topic identification accuracy is achieved.
In the embodiment of the present invention, after the topics associated with the target keyword are determined, the topics may be ranked by relevance. For example, if a larger association index indicates higher relevance between the target keyword and a topic, the target topics may be sorted by the calculated association index from large to small to obtain a topic attention ranking table. If a smaller association index indicates higher relevance between the target keyword and a topic, the target topics may be sorted by the association index from small to large.
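The two sort directions can be captured with a single flag. This is a minimal sketch; the function name `rank_topics` and the flag `larger_is_closer` are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_topics(
    index_by_topic: Dict[str, float], larger_is_closer: bool
) -> List[Tuple[str, float]]:
    """Sort topics so that the most relevant one comes first.

    larger_is_closer=True  -> sort from large to small (e.g. a similarity)
    larger_is_closer=False -> sort from small to large (e.g. a distance)
    """
    return sorted(
        index_by_topic.items(),
        key=lambda item: item[1],
        reverse=larger_is_closer,
    )
```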
Optionally, calculating an association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic includes: and calculating the Euclidean distance between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, and taking the Euclidean distance as a correlation index.
In the embodiment of the invention, the relevance between the target keyword and the topic is represented by the Euclidean distance between the arrays: the smaller the Euclidean distance between the target keyword and the topic, the higher the relevance between them; the larger the Euclidean distance, the lower the relevance. In this way, when the topics are ranked by their relevance to the target keyword, the target topics in this embodiment are sorted by Euclidean distance from small to large to obtain the attention ranking table.
In the embodiment of the invention, the relevance between the target keyword and the target topic is judged by calculating the Euclidean distance between the arrays, so that the topic identification speed is improved.
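A minimal sketch of the Euclidean distance index and the small-to-large ranking follows. The topic names and array values are made up for illustration, and the helper names are hypothetical.

```python
import math
from typing import Dict, List


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Square root of the summed squared differences over all dimensions."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(
    keyword_vec: List[float], topic_vecs: Dict[str, List[float]]
) -> List[str]:
    """Topic names ordered from smallest to largest distance to the keyword."""
    return sorted(
        topic_vecs,
        key=lambda name: euclidean_distance(keyword_vec, topic_vecs[name]),
    )
```

For example, with the keyword array `[1.0, 2.0]`, a topic at `[1.0, 1.0]` has distance 1.0 and ranks ahead of a topic at `[5.0, 5.0]`, whose distance is 5.0.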
Optionally, calculating an association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic includes: acquiring a multi-dimensional array corresponding to each word in a target topic; calculating an association index between the multidimensional arrays corresponding to the target keywords and the multidimensional arrays corresponding to each word in the target topic; and calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word to obtain the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
Because a topic is composed of words according to a certain grammar, a topic comprises a plurality of words. When a machine learning method is used to calculate the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, the multidimensional array of each word in the target topic is also calculated. The association index for the target topic can therefore be derived from the association indexes between the multidimensional array corresponding to each word in the target topic and the array corresponding to the target keyword, and the relevance between the target keyword and the target topic is obtained from that result. For example, the Euclidean distance between the multidimensional array corresponding to each word in the target topic and the array corresponding to the target keyword is calculated separately, and the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is then calculated from these Euclidean distances. In this way, the relevance between the target topic and the target keyword is determined through the relevance between each word in the topic and the target keyword, which further improves the accuracy of the calculation for the topic and thus further ensures the accuracy of topic identification.
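The source does not state how the per-word indexes are combined into one index for the topic. The sketch below assumes the arithmetic mean of the per-word Euclidean distances; other choices (minimum, weighted mean) would fit the description equally well. The function name is hypothetical.

```python
import math
from typing import List


def topic_distance_from_words(
    keyword_vec: List[float], word_vecs: List[List[float]]
) -> float:
    """Combine per-word distances into one index for the topic.

    Assumption: the combination rule is the arithmetic mean.
    """
    if not word_vecs:
        raise ValueError("a topic must contain at least one word")
    # math.dist raises ValueError if the dimension counts differ.
    distances = [math.dist(keyword_vec, word_vec) for word_vec in word_vecs]
    return sum(distances) / len(distances)
```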
Optionally, calculating an association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic includes: acquiring the multidimensional array corresponding to the target topic; and directly calculating the association index between the multidimensional array corresponding to the target topic and the multidimensional array corresponding to the target keyword.
Because a topic is composed of a plurality of words, the multidimensional array corresponding to the topic can be obtained through machine learning from the multidimensional array corresponding to each word in the topic. Then, when the association index is calculated, the unique multidimensional array obtained in advance for the target topic through machine learning can be retrieved, and the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is calculated directly. Compared with calculating the association index between the target keyword and each word in the topic, this greatly improves the speed of calculating the association index between the target keyword and the topic.
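The direct variant needs one array per topic, computed ahead of time. How the topic array is learned from the word arrays is not specified in the source; the sketch below assumes an element-wise average purely for illustration, and the function names are hypothetical.

```python
import math
from typing import List


def topic_vector(word_vecs: List[List[float]]) -> List[float]:
    """Build one array for a topic from the arrays of its words.

    Assumption: element-wise average of the word arrays.
    """
    if not word_vecs:
        raise ValueError("a topic must contain at least one word")
    dims = len(word_vecs[0])
    if any(len(vec) != dims for vec in word_vecs):
        raise ValueError("all word arrays must have the same dimensions")
    count = len(word_vecs)
    return [sum(vec[i] for vec in word_vecs) / count for i in range(dims)]


def direct_index(keyword_vec: List[float], topic_vec: List[float]) -> float:
    """One distance computation per topic instead of one per word."""
    return math.dist(keyword_vec, topic_vec)
```

Because `topic_vector` can run once when the topics are marked, each later query costs a single distance computation per topic.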
Preferably, determining the topic associated with the target keyword according to the calculated association index includes: judging whether the calculated correlation index meets a preset condition or not; if the calculated association index meets the preset condition, determining that the target topic with the calculated association index meeting the preset condition is associated with the target keyword; and if the calculated association index is judged not to meet the preset condition, determining that the target topic of which the calculated association index does not meet the preset condition is not related to the target keyword.
In this embodiment, the preset condition may be a preset threshold. For example, when a larger association index indicates higher relevance between the target topic and the target keyword, judging whether the calculated association index meets the preset condition may be judging whether the calculated association index exceeds the preset threshold; if so, the topic is determined to be associated with the target keyword, and otherwise it is determined not to be associated.
If the correlation index is the Euclidean distance between the arrays, judging whether the calculated correlation index meets a preset condition can be judging whether the Euclidean distance is smaller than a preset threshold value, if so, determining that the topic is correlated with the target keyword, otherwise, determining that the topic is not correlated.
By setting a preset condition, topics related to the target keywords are quickly determined from the calculated result, and therefore the accuracy of topic identification is improved.
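For the Euclidean distance case, the preset condition reduces to a "smaller than the threshold" test. A minimal sketch, with a hypothetical function name and made-up values:

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    distances: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Separate topics into (associated, not associated) by distance."""
    associated = [name for name, d in distances.items() if d < threshold]
    not_associated = [name for name, d in distances.items() if d >= threshold]
    return associated, not_associated
```

A distance exactly equal to the threshold is treated as not associated here, since the text requires the distance to be smaller than the threshold.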
Preferably, before the target keyword is acquired, the identification method further includes: acquiring a target text, wherein the target text comprises a target topic; utilizing a word segmentation tool to segment the target text and marking the part of speech of each word in the target text; determining a target topic according to the part of speech rule model established in advance and the part of speech of the word after word segmentation, and marking the target topic; and determining the multidimensional arrays corresponding to each word after word segmentation and the multidimensional arrays corresponding to the target topics.
A target text containing topics is acquired, a text training set is established, and text word segmentation rules are set as needed. A part-of-speech rule model of the topic (such as noun + verb, or noun + verb + object) is constructed by using a semantic analysis method. Text analysis is performed by using a word segmentation tool (including the set text word segmentation rules), the part of speech of each word is labeled, and the topics are labeled. All the terms (including topics) are each represented by a multidimensional array, for example with 500 dimensions, and the unique multidimensional array corresponding to each term is obtained through a machine learning method. In this way, after the target keyword is acquired and its multidimensional array is determined, an association index such as the Euclidean distance can be calculated directly against the multidimensional array corresponding to each topic.
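The topic-marking step can be sketched as matching part-of-speech patterns over already segmented text. The sketch assumes the word segmentation tool has produced (word, tag) pairs; the tag names "n" and "v", the rule list `RULES`, and the function `find_topics` are hypothetical, and the "object" in noun + verb + object is approximated as a noun.

```python
from typing import List, Sequence, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical rule model, longest pattern first so it is preferred.
RULES: List[Tuple[str, ...]] = [("n", "v", "n"), ("n", "v")]


def find_topics(
    tagged: Sequence[Tagged], rules: Sequence[Tuple[str, ...]] = RULES
) -> List[List[str]]:
    """Return the word sequences whose tags match a rule, left to right."""
    topics: List[List[str]] = []
    i = 0
    while i < len(tagged):
        for rule in rules:
            window = tagged[i : i + len(rule)]
            tags = tuple(tag for _, tag in window)
            if tags == rule:
                topics.append([word for word, _ in window])
                i += len(rule)  # skip past the matched topic
                break
        else:
            i += 1  # no rule matched at this position
    return topics
```

Each marked topic, together with its words, would then be mapped to arrays as described above.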
In the embodiment of the invention, the topic is defined through the part of speech rule model, and the array corresponding to each word and topic is obtained by using a machine learning method, so that topic relevance judgment is converted into calculation of the correlation index between the arrays, and the speed and the accuracy of relevant topic identification are greatly improved.
The embodiment of the invention also provides a device for identifying the associated topics. The apparatus may implement its functionality via a computer device. It should be noted that the identification device of the related topic according to the embodiment of the present invention may be used to execute the identification method of the related topic provided by the embodiment of the present invention, and the identification method of the related topic according to the embodiment of the present invention may also be executed by the identification device of the related topic provided by the embodiment of the present invention.
Fig. 2 is a schematic diagram of an identification apparatus of related topics according to an embodiment of the present invention. As shown in fig. 2, the identification device of the related topic includes: a first acquisition unit 10, a first determination unit 20, a calculation unit 30 and a second determination unit 40.
The first acquisition unit 10 is used to acquire a target keyword.
There may be one or more target keywords, for example "2014" and "college entrance examination".
The first determining unit 20 is configured to determine a multidimensional array corresponding to the target keyword by using a machine learning method, where each dimension number in the multidimensional array is used to represent one attribute of the target keyword.
Because each dimension value in the multidimensional array represents one attribute of the target keyword, the target keyword corresponds to a unique multidimensional array; that is, the target keyword is represented by the multidimensional array. For example, if 500-dimensional arrays are used, then once the target keyword is obtained, the unique 500-dimensional array corresponding to it can be obtained through a machine learning method.
The calculating unit 30 is configured to calculate a relevance index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, where the relevance index is used to represent relevance between the target keyword and the target topic, and the target topic is a plurality of topics with multidimensional arrays marked in advance.
The second determining unit 40 is configured to determine a topic associated with the target keyword according to the calculated association index.
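The four units can be pictured as four methods of one object. This is only a sketch of how a computer device might wire them together: the class name, the `embed` callable, and the use of Euclidean distance with a "smaller than threshold" condition are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List


class TopicIdentifier:
    """Hypothetical wiring of units 10, 20, 30 and 40."""

    def __init__(
        self,
        embed: Callable[[str], List[float]],
        topic_vecs: Dict[str, List[float]],
        threshold: float,
    ) -> None:
        self.embed = embed
        self.topic_vecs = topic_vecs
        self.threshold = threshold

    def acquire(self, raw_keyword: str) -> str:
        # First acquisition unit 10: obtain the target keyword.
        return raw_keyword.strip()

    def determine_array(self, keyword: str) -> List[float]:
        # First determining unit 20: keyword -> multidimensional array.
        return self.embed(keyword)

    def calculate(self, keyword_vec: List[float]) -> Dict[str, float]:
        # Calculation unit 30: one association index per target topic.
        return {
            name: math.dist(keyword_vec, vec)
            for name, vec in self.topic_vecs.items()
        }

    def determine_topics(self, indexes: Dict[str, float]) -> List[str]:
        # Second determining unit 40: apply the preset condition.
        return [name for name, d in indexes.items() if d < self.threshold]

    def identify(self, raw_keyword: str) -> List[str]:
        keyword = self.acquire(raw_keyword)
        return self.determine_topics(self.calculate(self.determine_array(keyword)))
```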
Each target topic also corresponds to a unique multidimensional array, that is, each topic is represented by a unique multidimensional array, wherein each dimension number in the multidimensional array represents an attribute in the target topic. It should be noted that the number of dimensions of the multidimensional array used for representing the topic is the same as the number of dimensions representing the target keyword (the same applies below), so as to avoid calculation errors.
After the multidimensional array corresponding to the target keyword is determined, the association index between it and the multidimensional array corresponding to the target topic is calculated. Because a text contains a plurality of topics, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each topic is calculated separately, which yields the relevance between the target keyword and each topic. Finally, the topics associated with the target keyword are determined according to the calculated association indexes. Specifically, a threshold may be set: when the association index exceeds the threshold, the target topic is regarded as associated with the target keyword; otherwise, it is regarded as not associated.
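The flow described above (one index per topic, then a threshold test) can be sketched as follows. This is an illustration under stated assumptions: cosine similarity is used as the association index only because it is a common index for which larger values mean closer arrays; the text does not name it. The function names, topic strings, and array values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Assumed association index: larger values mean closer arrays."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither array is all zeros


def associated_topics(keyword_array, topic_arrays, threshold):
    """Return the topics whose index with the keyword exceeds the threshold."""
    return [
        topic
        for topic, topic_array in topic_arrays.items()
        if cosine_similarity(keyword_array, topic_array) > threshold
    ]


# Hypothetical three-dimensional arrays for two pre-marked topics.
topic_arrays = {
    "exam reform announced": [0.9, 0.1, 0.0],
    "football final result": [0.0, 0.2, 0.9],
}
print(associated_topics([1.0, 0.0, 0.0], topic_arrays, threshold=0.5))
```

Note that the keyword itself never has to appear in the topic text; only the arrays are compared.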
In the embodiment of the invention, the target keyword is acquired, the multidimensional array corresponding to the target keyword is determined, the association index between that array and the multidimensional array corresponding to the target topic is calculated, and the topic associated with the target keyword is determined according to the calculated association index. Judging the relevance between the target keyword and the target topic is thereby converted into calculating the association index between the multidimensional array representing the attributes of the target keyword and the multidimensional array representing the attributes of the target topic. This avoids the situation in which keyword matching fails to identify a topic because the keyword does not appear in it, which solves the problem of low topic identification accuracy in the prior art and improves the accuracy of topic identification.
In the embodiment of the present invention, after the topics associated with the target keyword are determined, the topics may be ranked by relevance. For example, if a larger association index indicates higher relevance between the target keyword and the topic, the target topics may be sorted by association index from large to small to obtain a topic attention ranking table. If a smaller association index indicates higher relevance, the target topics may be sorted by association index from small to large.
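The two sort directions can be captured with a single flag. The sketch below is illustrative; the function name `rank_topics` and the flag `larger_is_more_relevant` are hypothetical names, not terms from the embodiment.

```python
def rank_topics(index_by_topic, larger_is_more_relevant=True):
    """Order topics from most to least relevant to the target keyword.

    index_by_topic maps each topic to its calculated association index.
    """
    return sorted(
        index_by_topic,
        key=index_by_topic.get,
        reverse=larger_is_more_relevant,
    )
```

For a similarity-like index the flag stays at its default; for a distance-like index it is set to `False` so that the smallest value comes first.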
Preferably, the calculation unit includes: a first calculation module, used for calculating the Euclidean distance between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, and taking the Euclidean distance as the association index.
In the embodiment of the invention, the relevance between the target keyword and the topic is represented by the Euclidean distance between the arrays: the smaller the Euclidean distance, the higher the relevance between the target keyword and the topic; the larger the Euclidean distance, the lower the relevance. Accordingly, when the target topics are ranked by their relevance to the target keyword in this embodiment, they are sorted by Euclidean distance from small to large to obtain the attention ranking table.
In the embodiment of the invention, the relevance between the target keyword and the target topic is judged by calculating the Euclidean distance between the arrays, which improves the speed of topic identification.
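A minimal sketch of the Euclidean distance used as the association index is given below. The function name and the sample arrays are hypothetical; the explicit length check reflects the requirement that keyword and topic arrays have the same number of dimensions.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two arrays of equal dimension."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical two-dimensional arrays: the nearer topic is the more relevant one.
keyword = [0.0, 0.0]
near_topic = [0.3, 0.4]
far_topic = [3.0, 4.0]
assert euclidean_distance(keyword, near_topic) < euclidean_distance(keyword, far_topic)
```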
Preferably, the calculation unit includes: a second acquisition module, used for acquiring the multidimensional array corresponding to each word in the target topic; a third calculation module, used for calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word in the target topic; and a fourth calculation module, used for calculating, according to the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
Because a topic is composed of words arranged according to a certain grammar, a topic comprises a plurality of words. When a machine learning method is used to calculate the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, the multidimensional array of each word in the target topic is calculated. The association index for the target topic can then be derived from the association indexes between the multidimensional array corresponding to each word in the target topic and the array corresponding to the target keyword, and the relevance between the target keyword and the target topic is obtained from these indexes. For example, the Euclidean distance between the multidimensional array corresponding to each word in the target topic and the array corresponding to the target keyword is calculated separately, and the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is then calculated from these Euclidean distances. In this way, the relevance between the target topic and the target keyword is determined through the relevance between each word in the topic and the target keyword, which further improves the accuracy of the calculation for the topic and further ensures the accuracy of topic identification.
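The per-word variant can be sketched as follows. The text does not say how the per-word distances are combined into one index for the topic, so the arithmetic mean used here is an assumption, exposed through a hypothetical `aggregate` parameter so that another rule (for example `min`) can be substituted.

```python
import math
from statistics import mean


def topic_index_from_words(keyword_array, word_arrays, aggregate=mean):
    """Combine per-word Euclidean distances into one index for the topic.

    aggregate is an assumed combination rule; the mean is only one option.
    """
    distances = [math.dist(keyword_array, word_array) for word_array in word_arrays]
    return aggregate(distances)
```

With `aggregate=min`, a topic counts as close to the keyword as soon as any one of its words is close, which is a looser criterion than the mean.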
Optionally, the calculation unit includes: a first acquisition module, used for acquiring the multidimensional array corresponding to the target topic; and a second calculation module, used for directly calculating the association index between the multidimensional array corresponding to the target topic and the multidimensional array corresponding to the target keyword.
Because a topic is composed of a plurality of words, the multidimensional array corresponding to the topic can be obtained through machine learning from the multidimensional arrays corresponding to the words in the topic. When the association index is calculated, the unique multidimensional array obtained in advance for the target topic through machine learning can then be acquired, and the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is calculated directly. Compared with calculating the association index between the target keyword and each word in the topic, this greatly improves the speed of calculating the association index between the target keyword and the topic.
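The direct variant needs only one distance computation per topic once the topic array exists. In the sketch below, the component-wise average of the word arrays is an assumed stand-in for the learned topic array; the text says only that the topic array is obtained through machine learning and does not specify how. Both function names are hypothetical.

```python
import math


def topic_array_from_words(word_arrays):
    """Average the word arrays component-wise (assumed stand-in for learning)."""
    count = len(word_arrays)
    return [sum(components) / count for components in zip(*word_arrays)]


def direct_index(keyword_array, topic_array):
    """Single Euclidean distance between keyword array and topic array."""
    return math.dist(keyword_array, topic_array)


# The topic array is computed once in advance and reused for every keyword.
topic_array = topic_array_from_words([[1.0, 3.0], [3.0, 5.0]])
print(direct_index([2.0, 0.0], topic_array))
```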
Preferably, the second determination unit includes: a judging module, used for judging whether the calculated association index meets a preset condition; and a determining module, used for determining, if the calculated association index is judged to meet the preset condition, that the target topic whose association index meets the preset condition is associated with the target keyword, and for determining, if the calculated association index is judged not to meet the preset condition, that the target topic whose association index does not meet the preset condition is not associated with the target keyword.
In this embodiment, the preset condition may be a preset threshold. For example, when a larger association index indicates higher relevance between the target topic and the target keyword, judging whether the calculated association index meets the preset condition may be judging whether it exceeds the preset threshold: if so, the topic is determined to be associated with the target keyword; otherwise, it is determined not to be associated.
If the association index is the Euclidean distance between the arrays, judging whether the calculated association index meets the preset condition may be judging whether the Euclidean distance is smaller than a preset threshold: if so, the topic is determined to be associated with the target keyword; otherwise, it is determined not to be associated.
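Because the direction of the comparison depends on the kind of index, the preset condition can be expressed as a predicate built once and applied to every calculated index. This is an illustrative sketch; `make_condition` and its parameters are hypothetical names.

```python
def make_condition(threshold, smaller_is_more_relevant):
    """Build the preset condition as a predicate over an association index."""
    if smaller_is_more_relevant:
        # Distance-like index (e.g. Euclidean distance): must fall below the threshold.
        return lambda index: index < threshold
    # Similarity-like index: must exceed the threshold.
    return lambda index: index > threshold


# Hypothetical distances for two topics, judged against a threshold of 1.0.
is_associated = make_condition(threshold=1.0, smaller_is_more_relevant=True)
print(is_associated(0.4), is_associated(2.5))
```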
By setting a preset condition, topics related to the target keywords are quickly determined from the calculated result, and therefore the accuracy of topic identification is improved.
Preferably, the identification device further includes: a second acquisition unit, used for acquiring a target text before the target keyword is acquired, wherein the target text includes the target topic; a word segmentation unit, used for segmenting the target text by using a word segmentation tool and marking the part of speech of each word in the target text; a third determining unit, used for determining the target topic according to a part-of-speech rule model established in advance and the parts of speech of the segmented words, and marking the target topic; and a fourth determining unit, used for determining the multidimensional array corresponding to each segmented word and the multidimensional array corresponding to the target topic.
A target text containing topics is acquired, a text training set is established, and text word segmentation rules are set as needed. A part-of-speech rule model of the topic (such as noun + verb, or noun + verb + object) is constructed by using a semantic analysis method. Text analysis is performed by using a word segmentation tool (including the set text word segmentation rules), the part of speech of each word is labeled, and the topics are labeled. All the terms (including topics) are each represented by a multidimensional array, for example of 500 dimensions, and the unique multidimensional array corresponding to each term is obtained through a machine learning method. In this way, after the target keyword is acquired and its multidimensional array is determined, an association index such as the Euclidean distance can be calculated directly against the multidimensional array corresponding to the topic.
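The topic-marking step can be pictured as matching tag sequences against the rule model. The sketch below assumes the text has already been segmented and tagged by some word segmentation tool, so its input is a list of (word, tag) pairs. The tag names, the rule list `TOPIC_RULES`, and the function `find_topics` are hypothetical, and the object in "noun + verb + object" is approximated by a noun tag.

```python
# Hypothetical part-of-speech rules: each rule is a sequence of tags.
TOPIC_RULES = [("noun", "verb", "noun"), ("noun", "verb")]


def find_topics(tagged_words, rules=TOPIC_RULES):
    """Return word sequences whose tags match a rule, trying longer rules first."""
    ordered_rules = sorted(rules, key=len, reverse=True)
    topics = []
    position = 0
    while position < len(tagged_words):
        for rule in ordered_rules:
            window = tagged_words[position:position + len(rule)]
            if tuple(tag for _, tag in window) == rule:
                topics.append(" ".join(word for word, _ in window))
                position += len(rule)  # skip past the matched words
                break
        else:
            position += 1  # no rule matched at this position
    return topics


tagged = [
    ("ministry", "noun"), ("announces", "verb"), ("reform", "noun"),
    ("today", "adverb"), ("students", "noun"), ("react", "verb"),
]
print(find_topics(tagged))
```

Trying the longer rule first keeps "noun + verb + object" from being cut short by the "noun + verb" rule.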
In the embodiment of the invention, the topic is defined through the part of speech rule model, and the array corresponding to each word and topic is obtained by using a machine learning method, so that topic relevance judgment is converted into calculation of the correlation index between the arrays, and the speed and the accuracy of relevant topic identification are greatly improved.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for identifying a related topic is characterized by comprising the following steps:
acquiring a target keyword;
determining a multidimensional array corresponding to the target keyword by using a machine learning method, wherein each dimension number in the multidimensional array is used for representing one attribute of the target keyword;
calculating a relevance index between a multidimensional array corresponding to the target keyword and a multidimensional array corresponding to a target topic, wherein the relevance index is used for representing relevance between the target keyword and each target topic, and the target topic is a plurality of topics which are marked in advance and have multidimensional arrays; and
determining topics associated with the target keywords according to the calculated association index.
2. The identification method of claim 1, wherein calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic comprises:
calculating the Euclidean distance between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, and taking the Euclidean distance as the association index, wherein the smaller the Euclidean distance between the target keyword and the topic is, the higher the association between the target keyword and the topic is.
3. The identification method according to claim 1, wherein calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic comprises:
acquiring a multidimensional array corresponding to the target topic; directly calculating the association index between the multidimensional arrays corresponding to the target topics and the multidimensional arrays corresponding to the target keywords,
or,
acquiring a multidimensional array corresponding to each word in the target topic; calculating the association index between the multidimensional arrays corresponding to the target keywords and the multidimensional arrays corresponding to each word in the target topic; and calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic according to the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word.
4. The method of claim 1, wherein determining topics associated with the target keyword according to the calculated relevance index comprises:
judging whether the calculated correlation index meets a preset condition or not;
if the calculated association index is judged to meet the preset condition, determining that the target topic of which the calculated association index meets the preset condition is associated with the target keyword;
and if the calculated association index is judged not to meet the preset condition, determining that the target topic of which the calculated association index does not meet the preset condition is not related to the target keyword.
5. The recognition method according to claim 1, wherein before obtaining the target keyword, the recognition method further comprises:
acquiring a target text, wherein the target text comprises the target topic;
utilizing a word segmentation tool to segment the target text and marking the part of speech of each word in the target text;
determining the target topic according to the part of speech rule model established in advance and the part of speech of the word after word segmentation, and marking the target topic; and
determining a multidimensional array corresponding to each word after word segmentation and a multidimensional array corresponding to the target topic.
6. An apparatus for identifying related topics, comprising:
a first acquisition unit configured to acquire a target keyword;
the first determining unit is used for determining a multidimensional array corresponding to the target keyword by using a machine learning method, wherein each dimension number in the multidimensional array is used for representing one attribute of the target keyword;
the calculation unit is used for calculating a relevance index between a multidimensional array corresponding to the target keyword and a multidimensional array corresponding to a target topic, wherein the relevance index is used for representing the relevance between the target keyword and each target topic, and the target topic is a plurality of topics which are marked in advance and have multidimensional arrays; and
the second determining unit is used for determining the topics associated with the target keywords according to the calculated association index.
7. The recognition apparatus according to claim 6, wherein the calculation unit includes:
the first calculation module is configured to calculate a euclidean distance between a multidimensional array corresponding to the target keyword and a multidimensional array corresponding to the target topic, and use the euclidean distance as the association index, where a smaller euclidean distance between the target keyword and a topic indicates a higher association between the target keyword and the topic.
8. The recognition apparatus according to claim 6, wherein the calculation unit includes:
the first obtaining module is used for obtaining a multidimensional array corresponding to the target topic; a second calculating module for directly calculating the association index between the multidimensional arrays corresponding to the target topic and the multidimensional arrays corresponding to the target keyword,
alternatively, the calculation unit includes:
the second acquisition module is used for acquiring a multi-dimensional array corresponding to each word in the target topic; the third calculation module is used for calculating the association index between the multidimensional arrays corresponding to the target keywords and the multidimensional arrays corresponding to each word in the target topic; and the fourth calculation module is used for calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word to obtain the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
9. The identification device according to claim 6, wherein the second determination unit includes:
the judging module is used for judging whether the calculated correlation index meets a preset condition or not;
a determining module, configured to determine that the target topic whose correlation index meets the preset condition is associated with the target keyword if it is determined that the correlation index meets the preset condition; and if the calculated association index is judged not to meet the preset condition, determining that the target topic of which the calculated association index does not meet the preset condition is not related to the target keyword.
10. The identification device of claim 6, further comprising:
the second acquisition unit is used for acquiring a target text before acquiring a target keyword, wherein the target text comprises the target topic;
the word segmentation unit is used for segmenting the target text by using a word segmentation tool and marking the part of speech of each word in the target text;
the third determining unit is used for determining the target topic according to the part of speech rule model established in advance and the part of speech of the word after word segmentation, and marking the target topic; and
the fourth determining unit is used for determining the multidimensional arrays corresponding to each word after word segmentation and the multidimensional arrays corresponding to the target topic.
CN201410779602.1A 2014-12-15 2014-12-15 Correlated topic recognition method and device Active CN104408036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410779602.1A CN104408036B (en) 2014-12-15 2014-12-15 Correlated topic recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410779602.1A CN104408036B (en) 2014-12-15 2014-12-15 Correlated topic recognition method and device

Publications (2)

Publication Number Publication Date
CN104408036A true CN104408036A (en) 2015-03-11
CN104408036B CN104408036B (en) 2019-01-08

Family

ID=52645668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410779602.1A Active CN104408036B (en) 2014-12-15 2014-12-15 Correlated topic recognition method and device

Country Status (1)

Country Link
CN (1) CN104408036B (en)

Also Published As

Publication number Publication date
CN104408036B (en) 2019-01-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Correlated topic recognition method and device

Effective date of registration: 20190531

Granted publication date: 20190108

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20190108