Detailed Description
It should be noted that, provided there is no conflict, the embodiments of the present application and the features of those embodiments may be combined with each other. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To enable a better understanding of the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
An embodiment of the invention provides a method for identifying associated topics.
Fig. 1 is a flowchart of an identification method of an associated topic according to an embodiment of the present invention. As shown in fig. 1, the method for identifying the associated topic includes the following steps:
Step S102: acquire a target keyword.
There may be one or more target keywords, for example "2014", "college entrance examination", and the like.
Step S104: determine, by a machine learning method, the multidimensional array corresponding to the target keyword. Each dimension of the multidimensional array represents one attribute of the target keyword.
Because each dimension of the multidimensional array represents one attribute of the target keyword, the target keyword corresponds to a unique multidimensional array; that is, the target keyword is represented by that array. For example, if a 500-dimensional array is used to represent the target keyword, then once the target keyword has been acquired, the unique 500-dimensional array corresponding to it is obtained by the machine learning method.
Step S106: calculate the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each target topic. The association index represents the relevance between the target keyword and each target topic, and the target topics are a plurality of topics that have been marked in advance and have multidimensional arrays.
Step S108: determine the topics associated with the target keyword according to the calculated association index.
Each target topic likewise corresponds to a unique multidimensional array; that is, each topic is represented by a unique multidimensional array, each dimension of which represents one attribute of the target topic. It should be noted that the multidimensional array representing a topic has the same number of dimensions as the array representing the target keyword (the same applies below), so as to avoid calculation errors.
After the multidimensional array corresponding to the target keyword is determined, the association index between that array and the multidimensional array corresponding to each target topic is calculated. Because the text contains a plurality of topics, the association index between the array of the target keyword and the array of each topic is calculated separately, which yields the relevance between the target keyword and each topic. Finally, the topics associated with the target keyword are determined according to the calculated association indices. Specifically, a threshold may be set: for example, where a larger association index indicates higher relevance, a target topic is regarded as associated with the target keyword when its association index exceeds the threshold, and as not associated otherwise.
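Steps S102 to S108 can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the three-dimensional vectors stand in for arrays learned by the machine learning method, the lookup tables `KEYWORD_VECTORS` and `TOPIC_VECTORS` are hypothetical, and cosine similarity is used only as one possible association index for which a larger value indicates higher relevance (the description does not name it).

```python
import math

# Hypothetical toy vectors standing in for learned multidimensional arrays
# (the description mentions, for example, 500 dimensions).
KEYWORD_VECTORS = {"college entrance examination": [0.9, 0.1, 0.2]}
TOPIC_VECTORS = {
    "students sit the exam": [0.8, 0.2, 0.1],
    "stock market falls": [0.1, 0.9, 0.3],
}


def cosine_similarity(a, b):
    """Assumed association index: a larger value means higher relevance."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify_associated_topics(keyword, threshold):
    # S102 + S104: a table lookup stands in for the learned model.
    keyword_vec = KEYWORD_VECTORS[keyword]
    # S106: one association index per pre-marked target topic.
    indices = {
        topic: cosine_similarity(keyword_vec, vec)
        for topic, vec in TOPIC_VECTORS.items()
    }
    # S108: keep the topics whose index exceeds the threshold.
    return [topic for topic, index in indices.items() if index > threshold]


associated = identify_associated_topics("college entrance examination", 0.8)
```

The dimension check in `cosine_similarity` mirrors the requirement that the keyword array and the topic arrays have the same number of dimensions.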
In this embodiment of the invention, the target keyword is acquired, the multidimensional array corresponding to it is determined, the association index between that array and the multidimensional array corresponding to each target topic is calculated, and the topics associated with the target keyword are determined according to the calculated association index. Judging the relevance between the target keyword and a target topic is thereby converted into calculating an association index between the array representing the attributes of the target keyword and the array representing the attributes of the target topic. This solves the problem that keyword matching cannot accurately identify a topic in which the keyword does not appear, addresses the low topic identification accuracy of the prior art, and improves the accuracy of topic identification.
In this embodiment of the invention, after the topics associated with the target keyword are determined, the topics may be ranked by their degree of relevance. For example, if a larger association index indicates higher relevance between the target keyword and a topic, the target topics may be sorted by association index in descending order to obtain a topic attention ranking table. If instead a smaller association index indicates higher relevance, the target topics may be sorted by association index in ascending order.
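The two sorting directions can be sketched as follows. The function name `rank_topics`, its flag, and the sample values are hypothetical and serve only to show how the direction of the sort depends on how the association index is defined.

```python
def rank_topics(association_indices, larger_is_more_relevant=True):
    """Return topic names ordered from most to least relevant.

    `association_indices` maps a topic name to its association index.
    """
    return sorted(
        association_indices,
        key=association_indices.get,
        reverse=larger_is_more_relevant,
    )


# Hypothetical indices where a larger value means higher relevance.
similarity_like = {"topic A": 0.91, "topic B": 0.40, "topic C": 0.75}
# Hypothetical indices where a smaller value means higher relevance.
distance_like = {"topic A": 0.3, "topic B": 2.1, "topic C": 1.2}

ranking_by_similarity = rank_topics(similarity_like, larger_is_more_relevant=True)
ranking_by_distance = rank_topics(distance_like, larger_is_more_relevant=False)
```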
Optionally, calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic includes: calculating the Euclidean distance between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, and taking the Euclidean distance as the association index.
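For reference, the Euclidean distance can be written as follows, where the symbols are notation assumed here rather than taken from the description: k is the n-dimensional array of the target keyword, t is the n-dimensional array of the target topic, and k_i and t_i are their i-th dimensions.

```latex
d(\mathbf{k}, \mathbf{t}) = \sqrt{\sum_{i=1}^{n} \left( k_i - t_i \right)^2}
```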
In this embodiment of the invention, the relevance between the target keyword and a topic is represented by the Euclidean distance between their arrays: the smaller the Euclidean distance, the higher the relevance between the target keyword and the topic; the larger the Euclidean distance, the lower the relevance. Accordingly, when the target topics are ranked by their relevance to the target keyword, in this embodiment they are sorted by Euclidean distance in ascending order to obtain the attention ranking table.
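A minimal sketch of this embodiment is given below. The three-dimensional vectors and topic names are hypothetical; only the use of the Euclidean distance and the ascending sort come from the description.

```python
import math


def euclidean_distance(keyword_vec, topic_vec):
    """Euclidean distance between two arrays of equal dimension."""
    if len(keyword_vec) != len(topic_vec):
        raise ValueError("arrays must have the same number of dimensions")
    return math.sqrt(sum((k - t) ** 2 for k, t in zip(keyword_vec, topic_vec)))


# Hypothetical toy arrays.
keyword_vec = [1.0, 2.0, 2.0]
topic_vecs = {
    "topic A": [1.0, 2.0, 2.0],
    "topic B": [4.0, 6.0, 2.0],
    "topic C": [1.0, 2.0, 5.0],
}

distances = {
    name: euclidean_distance(keyword_vec, vec) for name, vec in topic_vecs.items()
}
# A smaller distance means higher relevance, so the sort is ascending.
attention_ranking = sorted(distances, key=distances.get)
```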
In this embodiment of the invention, the relevance between the target keyword and the target topic is judged by calculating the Euclidean distance between the arrays, which improves the speed of topic identification.
Optionally, calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic includes: acquiring the multidimensional array corresponding to each word in the target topic; calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word in the target topic; and calculating, from the association indices between the array of the target keyword and the array of each word, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
A topic is composed of words arranged according to a certain grammar and therefore contains a plurality of words. When the machine learning method is used to calculate the multidimensional arrays, the multidimensional array of each word in the target topic is calculated. The association index between the array corresponding to the target keyword and the array corresponding to the target topic may then be derived from the association indices between the array of each word in the target topic and the array of the target keyword, and the relevance between the target keyword and the target topic is obtained from that index. For example, the Euclidean distance between the array of each word in the target topic and the array of the target keyword is calculated separately, and the association index between the array corresponding to the target keyword and the array corresponding to the target topic is then calculated from these Euclidean distances. The relevance between the target topic and the target keyword is thus determined through the relevance between each word in the topic and the target keyword, which further improves the accuracy of the calculation for the topic and further ensures the accuracy of topic identification.
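The per-word approach can be sketched as below. The description does not state how the per-word Euclidean distances are combined into a single index, so the `combine` parameter, with the arithmetic mean as its default and the minimum as an alternative, is purely an assumption; the vectors are hypothetical.

```python
import math
from statistics import mean


def topic_index_from_words(keyword_vec, word_vecs, combine=mean):
    """Combine per-word Euclidean distances into one topic-level index.

    `combine` is an assumed aggregation (mean by default); the description
    leaves the exact calculation open.
    """
    if not word_vecs:
        raise ValueError("a topic must contain at least one word")
    per_word = [math.dist(keyword_vec, word_vec) for word_vec in word_vecs]
    return combine(per_word)


# Hypothetical two-dimensional arrays for a two-word topic.
keyword_vec = [0.0, 0.0]
topic_word_vecs = [[3.0, 4.0], [0.0, 1.0]]

mean_index = topic_index_from_words(keyword_vec, topic_word_vecs)
closest_word_index = topic_index_from_words(keyword_vec, topic_word_vecs, combine=min)
```

With the mean, every word in the topic contributes to the index; with the minimum, the topic is judged by its single closest word. Which behaviour is preferable depends on the application.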
Optionally, calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic includes: acquiring the multidimensional array corresponding to the target topic; and directly calculating the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
Because a topic is composed of a plurality of words, the multidimensional array corresponding to the topic can be obtained through machine learning from the multidimensional arrays corresponding to the words in the topic. When the association index is calculated, the unique multidimensional array obtained in advance for the target topic through machine learning is acquired, and the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is then calculated directly. Compared with calculating an association index between the target keyword and each word in the topic, this greatly improves the speed of calculating the association index between the target keyword and the topic.
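A minimal sketch of the direct approach follows. The description says only that the topic array is obtained through machine learning from the word arrays; the element-wise average used in `average_vector` is an assumed stand-in for that step, and the vectors are hypothetical.

```python
import math


def average_vector(word_vecs):
    """Assumed way to derive one topic array: element-wise mean of word arrays."""
    count = len(word_vecs)
    return [sum(column) / count for column in zip(*word_vecs)]


# Computed once in advance for each pre-marked topic (hypothetical values).
topic_word_vecs = [[1.0, 0.0], [3.0, 4.0]]
topic_vec = average_vector(topic_word_vecs)

# At query time a single distance is calculated per topic,
# instead of one distance per word in the topic.
keyword_vec = [2.0, 5.0]
association_index = math.dist(keyword_vec, topic_vec)
```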
Preferably, determining the topics associated with the target keyword according to the calculated association index includes: judging whether the calculated association index meets a preset condition; if the calculated association index meets the preset condition, determining that the target topic whose association index meets the preset condition is associated with the target keyword; and if the calculated association index does not meet the preset condition, determining that the target topic whose association index does not meet the preset condition is not associated with the target keyword.
In this embodiment, the preset condition may be a preset threshold. For example, where a larger association index indicates higher relevance between the target topic and the target keyword, judging whether the calculated association index meets the preset condition may consist of judging whether the calculated association index exceeds the preset threshold; if so, the topic is determined to be associated with the target keyword, and otherwise it is determined not to be associated.
If the association index is the Euclidean distance between the arrays, judging whether the calculated association index meets the preset condition may consist of judging whether the Euclidean distance is smaller than a preset threshold; if so, the topic is determined to be associated with the target keyword, and otherwise it is determined not to be associated.
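The two forms of the preset condition can be sketched together. The function names, the `smaller_is_more_relevant` flag, and the sample values are hypothetical.

```python
def meets_preset_condition(index, threshold, smaller_is_more_relevant):
    """Apply the threshold in the direction that matches the index."""
    if smaller_is_more_relevant:
        # For example, a Euclidean distance: associated when below the threshold.
        return index < threshold
    # An index where larger means more relevant: associated when above it.
    return index > threshold


def select_associated(indices, threshold, smaller_is_more_relevant):
    """Return the topics whose association index meets the preset condition."""
    return [
        topic
        for topic, index in indices.items()
        if meets_preset_condition(index, threshold, smaller_is_more_relevant)
    ]


# Hypothetical Euclidean distances for two topics.
euclidean_indices = {"topic A": 0.4, "topic B": 2.5}
selected = select_associated(
    euclidean_indices, threshold=1.0, smaller_is_more_relevant=True
)
```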
By setting a preset condition, topics related to the target keywords are quickly determined from the calculated result, and therefore the accuracy of topic identification is improved.
Preferably, before the target keyword is acquired, the identification method further includes: acquiring a target text, wherein the target text comprises a target topic; utilizing a word segmentation tool to segment the target text and marking the part of speech of each word in the target text; determining a target topic according to the part of speech rule model established in advance and the part of speech of the word after word segmentation, and marking the target topic; and determining the multidimensional arrays corresponding to each word after word segmentation and the multidimensional arrays corresponding to the target topics.
Specifically, a target text containing topics is acquired, a text training set is established, and text word segmentation rules are set as needed. A part-of-speech rule model of the topic (for example, noun + verb, or noun + verb + object) is constructed by a semantic analysis method. The text is analyzed using the word segmentation tool (including the set word segmentation rules), the part of speech of each word is labeled, and the topics are labeled. All words (including topics) are each represented by a multidimensional array, for example of 500 dimensions, and the unique multidimensional array corresponding to each word is obtained by the machine learning method. In this way, after the target keyword is acquired and its multidimensional array is determined, an association index such as the Euclidean distance can be calculated directly against the multidimensional array corresponding to each topic.
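Marking topics with a part-of-speech rule model can be sketched as follows. The sketch assumes that the text has already been segmented and tagged; the tag names ("n", "v", and so on), the tagged sample, and the treatment of the object in "noun + verb + object" as a noun tag are all assumptions, and no particular word segmentation tool is implied.

```python
# Assumed rule model: tag sequences that mark a topic, longest rule first.
RULES = [("n", "v", "n"), ("n", "v")]


def find_topics(tagged_words, rules=RULES):
    """Scan (word, tag) pairs and return the word sequences matching a rule."""
    topics = []
    position = 0
    while position < len(tagged_words):
        for rule in rules:
            window = tagged_words[position:position + len(rule)]
            if tuple(tag for _, tag in window) == rule:
                topics.append(" ".join(word for word, _ in window))
                position += len(rule)
                break
        else:
            # No rule matched at this position; move on by one word.
            position += 1
    return topics


# Hypothetical output of a word segmentation and tagging step.
tagged = [
    ("students", "n"), ("take", "v"), ("exam", "n"),
    ("in", "p"), ("June", "t"),
    ("prices", "n"), ("rise", "v"),
]
topics = find_topics(tagged)
```

Each marked topic and each word would then be given its multidimensional array by the machine learning method, which this sketch does not cover.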
In this embodiment of the invention, the topic is defined through the part-of-speech rule model, and the array corresponding to each word and each topic is obtained by a machine learning method, so that judging topic relevance is converted into calculating an association index between arrays, which greatly improves the speed and accuracy of associated topic identification.
The embodiment of the invention also provides a device for identifying the associated topics. The apparatus may implement its functionality via a computer device. It should be noted that the identification device of the related topic according to the embodiment of the present invention may be used to execute the identification method of the related topic provided by the embodiment of the present invention, and the identification method of the related topic according to the embodiment of the present invention may also be executed by the identification device of the related topic provided by the embodiment of the present invention.
Fig. 2 is a schematic diagram of an identification apparatus of related topics according to an embodiment of the present invention. As shown in fig. 2, the identification device of the related topic includes: a first acquisition unit 10, a first determination unit 20, a calculation unit 30 and a second determination unit 40.
The first acquisition unit 10 is used to acquire a target keyword.
There may be one or more target keywords, for example "2014", "college entrance examination", and the like.
The first determination unit 20 is configured to determine, by a machine learning method, the multidimensional array corresponding to the target keyword, where each dimension of the multidimensional array represents one attribute of the target keyword.
Because each dimension of the multidimensional array represents one attribute of the target keyword, the target keyword corresponds to a unique multidimensional array; that is, the target keyword is represented by that array. For example, if a 500-dimensional array is used to represent the target keyword, then once the target keyword has been acquired, the unique 500-dimensional array corresponding to it can be obtained by the machine learning method.
The calculation unit 30 is configured to calculate an association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each target topic, where the association index represents the relevance between the target keyword and the target topic, and the target topics are a plurality of topics that have been marked in advance and have multidimensional arrays.
The second determination unit 40 is configured to determine the topics associated with the target keyword according to the calculated association index.
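The division into four units can be sketched as a composition of four interchangeable callables. The class name `TopicIdentifier`, its field names, and the toy values are hypothetical; the sketch only illustrates how units 10, 20, 30 and 40 hand data to one another, here with the Euclidean distance as the association index.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


@dataclass
class TopicIdentifier:
    acquire_keyword: Callable[[], str]                     # first acquisition unit 10
    to_vector: Callable[[str], Vector]                     # first determination unit 20
    association_index: Callable[[Vector, Vector], float]   # calculation unit 30
    is_associated: Callable[[float], bool]                 # second determination unit 40

    def run(self, topic_vectors: Dict[str, Vector]) -> List[str]:
        keyword_vec = self.to_vector(self.acquire_keyword())
        return [
            topic
            for topic, vec in topic_vectors.items()
            if self.is_associated(self.association_index(keyword_vec, vec))
        ]


# Hypothetical wiring: a table lookup stands in for the learned model.
identifier = TopicIdentifier(
    acquire_keyword=lambda: "exam",
    to_vector=lambda word: {"exam": [1.0, 0.0]}[word],
    association_index=math.dist,
    is_associated=lambda distance: distance < 1.0,
)
result = identifier.run({"topic A": [1.0, 0.5], "topic B": [4.0, 4.0]})
```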
Each target topic likewise corresponds to a unique multidimensional array; that is, each topic is represented by a unique multidimensional array, each dimension of which represents one attribute of the target topic. It should be noted that the multidimensional array representing a topic has the same number of dimensions as the array representing the target keyword (the same applies below), so as to avoid calculation errors.
After the multidimensional array corresponding to the target keyword is determined, the association index between that array and the multidimensional array corresponding to each target topic is calculated. Because the text contains a plurality of topics, the association index between the array of the target keyword and the array of each topic is calculated separately, which yields the relevance between the target keyword and each topic. Finally, the topics associated with the target keyword are determined according to the calculated association indices. Specifically, a threshold may be set: for example, where a larger association index indicates higher relevance, a target topic is regarded as associated with the target keyword when its association index exceeds the threshold, and as not associated otherwise.
In this embodiment of the invention, the target keyword is acquired, the multidimensional array corresponding to it is determined, the association index between that array and the multidimensional array corresponding to each target topic is calculated, and the topics associated with the target keyword are determined according to the calculated association index. Judging the relevance between the target keyword and a target topic is thereby converted into calculating an association index between the array representing the attributes of the target keyword and the array representing the attributes of the target topic. This solves the problem that keyword matching cannot accurately identify a topic in which the keyword does not appear, addresses the low topic identification accuracy of the prior art, and improves the accuracy of topic identification.
In this embodiment of the invention, after the topics associated with the target keyword are determined, the topics may be ranked by their degree of relevance. For example, if a larger association index indicates higher relevance between the target keyword and a topic, the target topics may be sorted by association index in descending order to obtain a topic attention ranking table. If instead a smaller association index indicates higher relevance, the target topics may be sorted by association index in ascending order.
Preferably, the calculation unit includes: a first calculation module, configured to calculate the Euclidean distance between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic, and to take the Euclidean distance as the association index.
In this embodiment of the invention, the relevance between the target keyword and a topic is represented by the Euclidean distance between their arrays: the smaller the Euclidean distance, the higher the relevance between the target keyword and the topic; the larger the Euclidean distance, the lower the relevance. Accordingly, when the target topics are ranked by their relevance to the target keyword, in this embodiment they are sorted by Euclidean distance in ascending order to obtain the attention ranking table.
In this embodiment of the invention, the relevance between the target keyword and the target topic is judged by calculating the Euclidean distance between the arrays, which improves the speed of topic identification.
Preferably, the calculation unit includes: a second acquisition module, configured to acquire the multidimensional array corresponding to each word in the target topic; a third calculation module, configured to calculate the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to each word in the target topic; and a fourth calculation module, configured to calculate, from the association indices between the array of the target keyword and the array of each word, the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
A topic is composed of words arranged according to a certain grammar and therefore contains a plurality of words. When the machine learning method is used to calculate the multidimensional arrays, the multidimensional array of each word in the target topic is calculated. The association index between the array corresponding to the target keyword and the array corresponding to the target topic may then be derived from the association indices between the array of each word in the target topic and the array of the target keyword, and the relevance between the target keyword and the target topic is obtained from that index. For example, the Euclidean distance between the array of each word in the target topic and the array of the target keyword is calculated separately, and the association index between the array corresponding to the target keyword and the array corresponding to the target topic is then calculated from these Euclidean distances. The relevance between the target topic and the target keyword is thus determined through the relevance between each word in the topic and the target keyword, which further improves the accuracy of the calculation for the topic and further ensures the accuracy of topic identification.
Optionally, the calculation unit includes: a first acquisition module, configured to acquire the multidimensional array corresponding to the target topic; and a second calculation module, configured to directly calculate the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic.
Because a topic is composed of a plurality of words, the multidimensional array corresponding to the topic can be obtained through machine learning from the multidimensional arrays corresponding to the words in the topic. When the association index is calculated, the unique multidimensional array obtained in advance for the target topic through machine learning is acquired, and the association index between the multidimensional array corresponding to the target keyword and the multidimensional array corresponding to the target topic is then calculated directly. Compared with calculating an association index between the target keyword and each word in the topic, this greatly improves the speed of calculating the association index between the target keyword and the topic.
Preferably, the second determination unit includes: a judging module, configured to judge whether the calculated association index meets a preset condition; and a determining module, configured to determine, if the calculated association index meets the preset condition, that the target topic whose association index meets the preset condition is associated with the target keyword, and to determine, if the calculated association index does not meet the preset condition, that the target topic whose association index does not meet the preset condition is not associated with the target keyword.
In this embodiment, the preset condition may be a preset threshold. For example, where a larger association index indicates higher relevance between the target topic and the target keyword, judging whether the calculated association index meets the preset condition may consist of judging whether the calculated association index exceeds the preset threshold; if so, the topic is determined to be associated with the target keyword, and otherwise it is determined not to be associated.
If the association index is the Euclidean distance between the arrays, judging whether the calculated association index meets the preset condition may consist of judging whether the Euclidean distance is smaller than a preset threshold; if so, the topic is determined to be associated with the target keyword, and otherwise it is determined not to be associated.
By setting a preset condition, topics related to the target keywords are quickly determined from the calculated result, and therefore the accuracy of topic identification is improved.
Preferably, the identification device further includes: a second acquisition unit, configured to acquire a target text before the target keyword is acquired, where the target text contains target topics; a word segmentation unit, configured to segment the target text using a word segmentation tool and to mark the part of speech of each word in the target text; a third determination unit, configured to determine the target topics according to a part-of-speech rule model established in advance and the parts of speech of the segmented words, and to mark the target topics; and a fourth determination unit, configured to determine the multidimensional array corresponding to each segmented word and the multidimensional array corresponding to each target topic.
Specifically, a target text containing topics is acquired, a text training set is established, and text word segmentation rules are set as needed. A part-of-speech rule model of the topic (for example, noun + verb, or noun + verb + object) is constructed by a semantic analysis method. The text is analyzed using the word segmentation tool (including the set word segmentation rules), the part of speech of each word is labeled, and the topics are labeled. All words (including topics) are each represented by a multidimensional array, for example of 500 dimensions, and the unique multidimensional array corresponding to each word is obtained by the machine learning method. In this way, after the target keyword is acquired and its multidimensional array is determined, an association index such as the Euclidean distance can be calculated directly against the multidimensional array corresponding to each topic.
In this embodiment of the invention, the topic is defined through the part-of-speech rule model, and the array corresponding to each word and each topic is obtained by a machine learning method, so that judging topic relevance is converted into calculating an association index between arrays, which greatly improves the speed and accuracy of associated topic identification.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.