CN106610931B - Topic name extraction method and device - Google Patents

Publication number
CN106610931B
Authority
CN
China
Prior art keywords
topic
text data
target
words
mutual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510697984.8A
Other languages
Chinese (zh)
Other versions
CN106610931A (en
Inventor
朱波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510697984.8A
Publication of CN106610931A
Application granted
Publication of CN106610931B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a topic name extraction method and device, relates to the technical field of information, and solves the problem of low readability of topic names. The main technical scheme of the invention is as follows: the method comprises the steps of obtaining mutual information values corresponding to all co-occurring words in text data, extracting target co-occurring words with mutual information values larger than preset mutual information values from the co-occurring words, obtaining similarity values of the target co-occurring words and topic word clusters of the text data respectively, and determining the target co-occurring words with the similarity values larger than a preset threshold value as topic names of the text data. The method is mainly used for extracting the topic names from the text data.

Description

Topic name extraction method and device
Technical Field
The invention relates to the technical field of information, in particular to a topic name extraction method and device.
Background
A topic name is a word or phrase that represents the focus of a piece of text data such as a news article, microblog post, forum post, or blog post. Because the topic name represents the core content of the text data, extracting topic names from massive text data helps in analyzing that core content.
At present, there are mainly two kinds of topic name extraction methods: extraction based on clustering and extraction based on a topic model. Both represent a topic as a word cluster formed by a plurality of words, with each word cluster representing one topic.
However, the word-cluster representation has two drawbacks. First, extracting noun phrases from text data is difficult and is affected by Chinese word segmentation and part-of-speech tagging, so the extracted noun phrases contain errors and the word cluster cannot accurately represent the topic content. Second, the extracted phrases are sparse, so whether a topic model or a clustering method is used for topic identification, the noun phrases for part of the topic content cannot be displayed. As a result, topic names represented as word clusters have low readability.
Disclosure of Invention
The present invention has been made in view of the above problems, and aims to provide a topic name extraction method and apparatus that overcomes or at least partially solves the above problems.
In order to achieve the purpose, the invention mainly provides the following technical scheme:
In one aspect, an embodiment of the present invention provides a method for extracting a topic name, where the method includes:
acquiring mutual information values corresponding to the co-occurrence words in the text data;
extracting target co-occurring words with mutual information values larger than preset mutual information values from the co-occurring words;
respectively acquiring similarity values of the target co-occurrence words and topic word clusters of the text data;
and determining the target co-occurrence words with similarity values larger than a preset threshold value as the topic names of the text data.
In another aspect, an embodiment of the present invention further provides an apparatus for extracting a topic name, where the apparatus includes:
the acquiring unit is used for acquiring mutual information values corresponding to the co-occurrence words in the text data;
the extraction unit is used for extracting the target co-occurrence words with mutual information values larger than preset mutual information values from the co-occurrence words;
the acquiring unit is further configured to acquire similarity values of the target co-occurrence word and the topic word cluster of the text data respectively;
and the determining unit is used for determining the target co-occurrence words with the similarity values larger than a preset threshold value as the topic names of the text data.
The technical solution provided by the embodiments of the present invention has at least the following advantages:
the topic name extraction method and device provided by the invention first obtain the mutual information values corresponding to the co-occurring words in the text data, then extract from the co-occurring words the target co-occurring words whose mutual information values are larger than a preset mutual information value, then obtain the similarity values of the target co-occurring words and the topic word clusters of the text data, and finally determine the target co-occurring words whose similarity values are larger than a preset threshold as the topic names of the text data. Compared with topic names currently extracted by a clustering extraction method or a topic model extraction method, the topic names here are taken from the co-occurring words of the text data and meet a similarity requirement with its topic word clusters, which addresses the low readability of topic names.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 is a flowchart of a topic name extraction method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another topic name extraction method provided in the embodiment of the present invention;
Fig. 3 is a block diagram of a topic name extraction device according to an embodiment of the present invention;
Fig. 4 is a block diagram of another topic name extraction device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to make the advantages of the technical solutions of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and examples.
The embodiment of the invention provides a topic name extraction method, as shown in fig. 1, the method comprises the following steps:
S101, obtaining mutual information values corresponding to the co-occurrence words in the text data.
The text data is the text from which topic names are to be extracted, and may specifically be a document such as a news article, microblog post, forum post, or blog post. It should be noted that before the mutual information values corresponding to the co-occurring words in the text data are obtained, word segmentation needs to be performed on the text data, that is, the Chinese character sequence in the text data is divided into individual words. In the embodiment of the invention, co-occurrence words are words that are frequently collocated and occur together in the text data. The set of co-occurrence words of a given word describes the semantic environment of that word to a certain extent. The mutual information value of co-occurrence words represents the strength of association between them, which in turn reflects, to a certain extent, the strength of association between the meanings the words express: the larger the mutual information value, the stronger the association between the co-occurrence words; the smaller the value, the weaker the association.
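The patent does not give a formula for the mutual information value. As a minimal sketch, assuming pointwise mutual information (PMI) over sentence-level co-occurrence and assuming the text has already been segmented into words, the calculation might look like the following. The function name `mutual_information` and the choice of PMI are illustrative assumptions, not the patented method.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(sentences):
    """Return {(w1, w2): PMI} for word pairs that co-occur in a sentence.

    `sentences` is a list of already-segmented sentences (lists of words).
    PMI(w1, w2) = log2(p(w1, w2) / (p(w1) * p(w2))), where each probability
    is the fraction of sentences containing the word or the pair.
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for words in sentences:
        unique = sorted(set(words))          # count each word once per sentence
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): math.log2((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
        for (a, b), c in pair_count.items()
    }


# Hypothetical segmented sentences, for illustration only.
sentences = [
    ["data", "structure"],
    ["data", "structure"],
    ["data", "volume"],
    ["topic", "name"],
]
mi = mutual_information(sentences)
```

Pair keys are stored in alphabetical order, so `mi[("name", "topic")]` is the value for "topic name". Note that PMI tends to favor rare pairs, which is one reason a real system might prefer a different association measure.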
S102, extracting the target co-occurrence words with mutual information values larger than the preset mutual information value from the co-occurrence words.
The preset mutual information value may be set according to actual needs or may be a system default; the embodiment of the present invention places no particular limit on it. A target co-occurrence word is a co-occurrence word whose mutual information value is greater than the preset mutual information value. It should be noted that the larger the preset mutual information value, the fewer target co-occurring words are extracted from the co-occurring words; the smaller the preset mutual information value, the more target co-occurrence words are extracted.
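The effect of the preset mutual information value can be illustrated with a short sketch. The function name `extract_target_cooccurrences` and the numeric values are hypothetical and chosen only to show that a larger preset value keeps fewer target co-occurring words.

```python
def extract_target_cooccurrences(mi_values, preset_mi):
    """Keep only co-occurring words whose mutual information exceeds preset_mi."""
    return {pair: mi for pair, mi in mi_values.items() if mi > preset_mi}


# Hypothetical mutual information values, for illustration only.
mi_values = {
    ("data", "structure"): 3.2,
    ("data", "volume"): 1.1,
    ("topic", "name"): 2.4,
}
loose = extract_target_cooccurrences(mi_values, preset_mi=1.0)   # keeps all three pairs
strict = extract_target_cooccurrences(mi_values, preset_mi=3.0)  # keeps one pair
```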
S103, respectively obtaining similarity values of the target co-occurrence words and topic word clusters of the text data.
The topic word clusters of the text data can be obtained by performing topic identification on the text data with the topic model LDA (Latent Dirichlet Allocation), and each topic word cluster represents a topic of the text data. It should be noted that the number of topic word clusters of the text data may be set according to actual requirements, for example limited to 3, 5, 8, or 10; the embodiment of the present invention places no particular limit on it. When the text data has a plurality of topic word clusters, the similarity value between each target co-occurrence word and each topic word cluster needs to be calculated separately.
For example, suppose three target co-occurring words are obtained, namely "data output", "data structure", and "database", and the number of topic word clusters of the text data is limited to two, namely "database" and "data volume". Calculating the similarity values between the target co-occurrence words and the topic word clusters then requires one value per pair: the similarity of "data output" to the cluster "database" and to the cluster "data volume"; the similarity of "data structure" to each of the two clusters; and the similarity of "database" to each of the two clusters.
S104, determining the target co-occurrence word with the similarity value larger than a preset threshold value as the topic name of the text data.
The preset threshold value can be set according to actual requirements. The larger the preset threshold, the fewer topic names are determined for the text data; the smaller the preset threshold, the more topic names are determined. In the embodiment of the invention, the mutual information values corresponding to the co-occurring words in the text data are first obtained, the target co-occurring words whose mutual information values are larger than the preset mutual information value are then extracted from the co-occurring words, the similarity values of the target co-occurring words and the topic word clusters of the text data are obtained, and finally the target co-occurring words whose similarity values are larger than the preset threshold are determined as the topic names of the text data. Because the topic names are extracted from the co-occurrence words in the text data and meet a certain similarity with the topic word clusters of the text data, the accuracy and readability of the extracted topic names are higher.
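As a minimal sketch of step S104, the selection can be written as a filter over per-cluster similarity values. The patent does not say how multiple topic word clusters are combined, so this sketch assumes a target co-occurring word qualifies when its similarity to at least one cluster exceeds the threshold. The function name and the similarity values are hypothetical; the words and clusters reuse the "database" and "data volume" example.

```python
def select_topic_names(similarities, preset_threshold):
    """Return target co-occurring words whose similarity to at least one
    topic word cluster exceeds preset_threshold (an assumed combination rule)."""
    return [
        target
        for target, per_cluster in similarities.items()
        if any(value > preset_threshold for value in per_cluster.values())
    ]


# Hypothetical similarity values, for illustration only.
similarities = {
    "data output":    {"database": 0.35, "data volume": 0.40},
    "data structure": {"database": 0.62, "data volume": 0.30},
    "database":       {"database": 1.00, "data volume": 0.25},
}
low = select_topic_names(similarities, preset_threshold=0.3)   # all three words pass
high = select_topic_names(similarities, preset_threshold=0.6)  # two words pass
```

Raising the threshold from 0.3 to 0.6 drops "data output", so a larger threshold yields fewer topic names.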
In summary, the topic name extraction method provided by the embodiment of the invention first obtains the mutual information values corresponding to the co-occurring words in the text data, then extracts the target co-occurring words whose mutual information values are larger than the preset mutual information value, then obtains the similarity values of the target co-occurring words and the topic word clusters of the text data, and finally determines the target co-occurring words whose similarity values are larger than the preset threshold as the topic names of the text data. Compared with topic names currently extracted by a clustering extraction method or a topic model extraction method, the topic names obtained in this way are taken from the co-occurring words of the text data and meet a similarity requirement with its topic word clusters, which improves their readability.
The embodiment of the present invention provides another method for extracting a topic name, as shown in fig. 2, the method includes:
S201, dividing the text data into a plurality of data modules according to a preset dividing rule.
The text data is the text from which topic names are to be extracted, and may specifically be a document such as a news article, microblog post, forum post, or blog post. For the embodiment of the present invention, the preset division rule may be configured according to the actual requirements of topic name extraction. For example, the text data may be divided by paragraph, by sentence, or by number of words; the embodiment of the present invention places no particular limit on this. It should be noted that, when the text data is divided by paragraph or sentence, the length of each division may be selected according to the actual situation; for example, the length of the divided sentences may be 1, 2, or 3. Preferably, in the embodiment of the present invention, the text data is divided into a plurality of data modules sentence by sentence. After the division, the co-occurrence words are obtained within each sentence of the text data in the subsequent step, which ensures the relevance between the extracted co-occurrence words and the text data.
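A minimal sketch of sentence-based division follows. It assumes that sentence boundaries can be found from Chinese and Western terminators and interprets the division length as the number of sentences per data module; both the regular expression and the `sentences_per_module` parameter are illustrative assumptions.

```python
import re


def split_into_modules(text, sentences_per_module=1):
    """Divide text into data modules of `sentences_per_module` sentences each."""
    # A sentence is a run of non-terminator characters followed by an
    # optional Chinese or Western sentence terminator.
    sentences = [s.strip() for s in re.findall(r"[^。！？.!?]+[。！？.!?]?", text)]
    sentences = [s for s in sentences if s]
    return [
        " ".join(sentences[i:i + sentences_per_module])
        for i in range(0, len(sentences), sentences_per_module)
    ]


text = "Topic names matter. Data structure is basic! Why?"
one_per_module = split_into_modules(text)
two_per_module = split_into_modules(text, sentences_per_module=2)
```

This simple pattern also splits on periods inside numbers or abbreviations, so a production system would need a more careful sentence splitter.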
S202, mutual information values corresponding to the co-occurrence words in the data modules are obtained respectively.
For the embodiment of the invention, before the mutual information values corresponding to the co-occurring words in the data modules are obtained, word segmentation needs to be performed on the text data, that is, the Chinese character sequence in the text data is divided into individual words. In the embodiment of the invention, co-occurrence words are words that are frequently collocated and occur together in the text data. The set of co-occurrence words of a given word describes the semantic environment of that word to a certain extent. The mutual information value of co-occurrence words represents the strength of association between them, which in turn reflects, to a certain extent, the strength of association between the meanings the words express: the larger the mutual information value, the stronger the association between the co-occurrence words; the smaller the value, the weaker the association.
In the embodiment of the invention, each co-occurrence word in each data module is respectively obtained, so that the obtained co-occurrence word has certain relevance with the content in the data module, and the relevance between the co-occurrence word and the text data can be improved.
S203, extracting the target co-occurrence words with mutual information values larger than the preset mutual information values from the co-occurrence words.
The preset mutual information value may be set according to actual needs or may be a system default; the embodiment of the present invention places no particular limit on it. A target co-occurrence word is a co-occurrence word whose mutual information value is greater than the preset mutual information value. It should be noted that the larger the preset mutual information value, the fewer target co-occurring words are extracted from the co-occurring words; the smaller the preset mutual information value, the more target co-occurrence words are extracted.
S204, respectively obtaining similarity values of the target co-occurrence words and topic word clusters of the text data.
The topic word clusters of the text data can be obtained by performing topic identification on the text data with the topic model LDA (Latent Dirichlet Allocation), and each topic word cluster represents a topic of the text data. It should be noted that the number of topic word clusters of the text data may be set according to actual requirements, for example limited to 2, 4, 6, or 8; the embodiment of the present invention places no particular limit on it. When the text data has a plurality of topic word clusters, the similarity value between each target co-occurrence word and each topic word cluster needs to be calculated separately.
For the embodiment of the present invention, obtaining the similarity values of the target co-occurring words and the topic word clusters of the text data includes: obtaining the similarity values of the target co-occurrence words and the topic word clusters of the text data through a cosine similarity algorithm.
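The patent names cosine similarity but does not state how a target co-occurring word or a topic word cluster is turned into a vector. The sketch below assumes a sparse bag-of-words representation: the target co-occurring word is a count of its constituent words, and the cluster is a mapping from words to weights such as a topic model might produce. The weights shown are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {term: weight} mappings."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Assumed representations, for illustration only.
target = Counter(["data", "structure"])                   # target co-occurring word
cluster = {"data": 0.6, "structure": 0.3, "table": 0.1}   # hypothetical cluster weights
similarity = cosine_similarity(target, cluster)
```

With these inputs the similarity is roughly 0.94, since most of the cluster's weight falls on the two words of the target.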
S205, determining the target co-occurrence word with the similarity value larger than a preset threshold value as the topic name of the text data.
The preset threshold value can be set according to actual requirements. The larger the preset threshold, the fewer topic names are determined for the text data; the smaller the preset threshold, the more topic names are determined. In the embodiment of the invention, the mutual information values corresponding to the co-occurring words in the text data are first obtained, the target co-occurring words whose mutual information values are larger than the preset mutual information value are then extracted from the co-occurring words, the similarity values of the target co-occurring words and the topic word clusters of the text data are obtained, and finally the target co-occurring words whose similarity values are larger than the preset threshold are determined as the topic names of the text data.
For the embodiment of the present invention, after the target co-occurring words with similarity values greater than the preset threshold are determined as the topic names of the text data, the method further includes: obtaining the position information of each topic name in the topic word cluster of the text data, and sorting the topic names according to the order of the position information. For example, if the obtained topic names are "standard code" and "information exchange", and the topic word cluster of the text data is "american standard code for information exchange", the positions of "standard code" and "information exchange" within "american standard code for information exchange" are obtained, and the topic name arranged in the order of those positions is "standard code for information exchange". In the embodiment of the invention, ordering the topic names by their position information in the topic word cluster improves the readability of the topic name.
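The ordering step can be sketched as a sort keyed on where each topic name first appears in the topic word cluster. For simplicity this sketch treats the cluster as a single string, as in the "american standard code for information exchange" example; that representation and the function name are assumptions. Names not found in the cluster are placed last.

```python
def order_by_position(topic_names, topic_word_cluster):
    """Sort topic names by their first position in the topic word cluster."""
    def position(name):
        index = topic_word_cluster.find(name)
        # Names absent from the cluster sort after all names that are present.
        return index if index >= 0 else len(topic_word_cluster)

    return sorted(topic_names, key=position)


cluster = "american standard code for information exchange"
ordered = order_by_position(["information exchange", "standard code"], cluster)
# ordered == ["standard code", "information exchange"]
```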
For the embodiment of the present invention, the applicable scenarios are as follows, but not limited to the following scenarios: dividing the text data into a plurality of data modules according to each sentence in the text, then respectively obtaining the co-occurrence words in each sentence from each divided sentence, then calculating the mutual information value corresponding to each co-occurrence word, then extracting the target co-occurrence words with the mutual information value larger than the preset mutual information value from the co-occurrence words, respectively obtaining the similarity value of the target co-occurrence words and the topic word cluster of the text data, and finally determining the target co-occurrence words with the similarity value larger than the preset threshold value as the topic names of the text data. The topic name in the invention is extracted from the co-occurrence words in the text data, and the topic name and the topic word cluster of the text data meet certain similarity, so the accuracy and readability of the topic name extracted by the invention are higher.
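The scenario above can be sketched end to end as follows. This is only an illustration under several stated assumptions: sentences are split on Western terminators, whitespace tokenization stands in for Chinese word segmentation, mutual information is computed as sentence-level pointwise mutual information, topic word clusters are supplied as hypothetical word-to-weight mappings rather than produced by LDA, and a word pair qualifies if it is similar enough to any one cluster.

```python
import math
import re
from collections import Counter
from itertools import combinations


def extract_topic_names(text, topic_clusters, preset_mi, preset_threshold):
    # 1. Divide the text into data modules, one per sentence.
    sentences = re.findall(r"[^.!?]+[.!?]?", text)
    # 2. Segment each module into words (stand-in for a Chinese word segmenter).
    modules = [sorted(set(re.findall(r"\w+", s.lower()))) for s in sentences]
    modules = [m for m in modules if m]
    n = len(modules)
    words, pairs = Counter(), Counter()
    for module in modules:
        words.update(module)
        pairs.update(combinations(module, 2))
    # 3. Keep pairs whose PMI exceeds the preset mutual information value.
    #    PMI = log2(p(a, b) / (p(a) * p(b))) = log2(c * n / (count_a * count_b)).
    targets = [
        pair for pair, c in pairs.items()
        if math.log2(c * n / (words[pair[0]] * words[pair[1]])) > preset_mi
    ]

    # 4. Cosine similarity between a pair (unit counts) and a weighted cluster.
    def cosine(pair, cluster):
        dot = sum(cluster.get(w, 0.0) for w in pair)
        norm = math.sqrt(len(pair)) * math.sqrt(sum(v * v for v in cluster.values()))
        return dot / norm if norm else 0.0

    # 5. Pairs similar enough to at least one cluster become topic names.
    return [
        " ".join(pair) for pair in targets
        if any(cosine(pair, c) > preset_threshold for c in topic_clusters)
    ]


text = "Data structure matters. Data structure helps. Users like cats."
clusters = [{"data": 0.6, "structure": 0.4}]  # hypothetical topic word cluster
names = extract_topic_names(text, clusters, preset_mi=0.5, preset_threshold=0.9)
```

In this toy input every pair passes the mutual information filter, and the similarity threshold is what singles out "data structure".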
In summary, this further topic name extraction method provided in the embodiments of the present invention first obtains the mutual information values corresponding to the co-occurring words in the text data, then extracts the target co-occurring words whose mutual information values are larger than the preset mutual information value, then obtains the similarity values of the target co-occurring words and the topic word clusters of the text data, and finally determines the target co-occurring words whose similarity values are larger than the preset threshold as the topic names of the text data. Compared with topic names currently extracted by a clustering extraction method or a topic model extraction method, the topic names obtained in this way are taken from the co-occurring words of the text data and meet a similarity requirement with its topic word clusters, which improves their readability.
Further, an embodiment of the present invention provides an apparatus for extracting a topic name, as shown in fig. 3, the apparatus includes: an acquisition unit 31, an extraction unit 32, a determination unit 33.
The obtaining unit 31 is configured to obtain mutual information values corresponding to the co-occurring words in the text data.
The extracting unit 32 is configured to extract the target co-occurrence words with mutual information values larger than the preset mutual information value from the co-occurrence words.
The obtaining unit 31 is further configured to obtain similarity values of the target co-occurring words and the topic word clusters of the text data, respectively.
The determining unit 33 is configured to determine a target co-occurring word with a similarity value greater than a preset threshold as the topic name of the text data.
It should be noted that, for other corresponding descriptions of the functional units related to the apparatus for extracting a topic name provided in the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 1, which are not described herein again, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the foregoing method embodiments.
The topic name extraction device provided by the embodiment of the invention first obtains the mutual information values corresponding to the co-occurring words in the text data, then extracts the target co-occurring words whose mutual information values are larger than the preset mutual information value, then obtains the similarity values of the target co-occurring words and the topic word clusters of the text data, and finally determines the target co-occurring words whose similarity values are larger than the preset threshold as the topic names of the text data. Compared with topic names currently extracted by a clustering extraction method or a topic model extraction method, the topic names obtained in this way are taken from the co-occurring words of the text data and meet a similarity requirement with its topic word clusters, which improves their readability.
Further, an embodiment of the present invention provides another apparatus for extracting a topic name, as shown in fig. 4, the apparatus includes: an acquisition unit 41, an extraction unit 42, a determination unit 43.
The obtaining unit 41 is configured to obtain mutual information values corresponding to the co-occurring words in the text data.
The extracting unit 42 is configured to extract the target co-occurrence words with mutual information values larger than the preset mutual information value from the co-occurrence words.
The obtaining unit 41 is further configured to obtain similarity values of the target co-occurring words and the topic word clusters of the text data, respectively.
The determining unit 43 is configured to determine a target co-occurring word with a similarity value greater than a preset threshold as the topic name of the text data.
Further, the apparatus also comprises a dividing unit 44.
The dividing unit 44 is configured to divide the text data into a plurality of data modules according to a preset dividing rule.
The obtaining unit 41 is specifically configured to obtain mutual information values corresponding to the co-occurrence words in the data modules respectively.
The obtaining unit 41 is specifically configured to obtain similarity values of the target co-occurrence word and the topic word cluster of the text data respectively through a cosine similarity algorithm.
Further, the apparatus further comprises: a sorting unit 45.
The obtaining unit 41 is further configured to obtain position information of the topic names in the topic word clusters of the text data respectively.
The sorting unit 45 is configured to sort the topic names according to the sequence of the position information.
It should be noted that, for other corresponding descriptions of the functional units related to the another topic name extraction apparatus provided in the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 2, which are not described herein again, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the foregoing method embodiments.
This further topic name extraction device provided in the embodiment of the present invention first obtains the mutual information values corresponding to the co-occurring words in the text data, then extracts the target co-occurring words whose mutual information values are larger than the preset mutual information value, then obtains the similarity values of the target co-occurring words and the topic word clusters of the text data, and finally determines the target co-occurring words whose similarity values are larger than the preset threshold as the topic names of the text data. Compared with topic names currently extracted by a clustering extraction method or a topic model extraction method, the topic names obtained in this way are taken from the co-occurring words of the text data and meet a similarity requirement with its topic word clusters, which improves their readability.
The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method.
The topic name extracting device comprises a processor and a memory, wherein the acquiring unit, the extracting unit, the determining unit, the dividing unit, the sequencing unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the readability of the topic names is improved by adjusting the kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application further provides a computer program product adapted, when executed on a data processing device, to execute program code that initializes the following method steps: acquiring mutual information values corresponding to the co-occurrence words in the text data; extracting target co-occurring words with mutual information values larger than a preset mutual information value from the co-occurring words; respectively acquiring similarity values of the target co-occurrence words and topic word clusters of the text data; and determining the target co-occurrence words with similarity values larger than a preset threshold value as the topic names of the text data.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A topic name extraction method is characterized by comprising the following steps:
acquiring mutual information values corresponding to the co-occurrence words in the text data;
extracting target co-occurring words with mutual information values larger than preset mutual information values from the co-occurring words;
respectively acquiring similarity values of the target co-occurrence words and topic word clusters of the text data;
and determining the target co-occurrence words with similarity values larger than a preset threshold value as the topic names of the text data.
2. The method for extracting topic names according to claim 1, wherein before the obtaining mutual information values corresponding to the co-occurring words in the text data, the method further comprises:
and dividing the text data into a plurality of data modules according to a preset dividing rule.
3. The method for extracting the topic name according to claim 2, wherein the obtaining mutual information values corresponding to the co-occurring words in the text data respectively comprises:
and respectively acquiring mutual information values corresponding to the co-occurrence words in the data modules.
4. The method for extracting topic name according to claim 1, wherein the separately obtaining similarity values of the target co-occurrence word and the topic word cluster of the text data comprises:
and respectively obtaining similarity values of the target co-occurrence words and topic word clusters of the text data through a cosine similarity algorithm.
5. The method for extracting topic name according to claim 1, wherein after the target co-occurring word having a similarity value greater than a preset threshold is determined as the topic name of the text data, the method further comprises:
acquiring position information of the topic names in topic word clusters of the text data respectively;
and sorting the topic names according to the order of the position information.
6. An extraction device of topic names, characterized by comprising:
the acquiring unit is used for acquiring mutual information values corresponding to the co-occurrence words in the text data;
the extraction unit is used for extracting the target co-occurrence words with mutual information values larger than preset mutual information values from the co-occurrence words;
the acquiring unit is further configured to acquire similarity values of the target co-occurrence word and the topic word cluster of the text data respectively;
and the determining unit is used for determining the target co-occurrence words with the similarity values larger than a preset threshold value as the topic names of the text data.
7. The apparatus for extracting topic name according to claim 6, characterized by further comprising:
and the dividing unit is used for dividing the text data into a plurality of data modules according to a preset dividing rule.
8. The apparatus for extracting a topic name according to claim 7, wherein
the acquiring unit is specifically configured to acquire the mutual information values corresponding to the co-occurring words in the data modules respectively.
9. The apparatus for extracting a topic name according to claim 6, wherein
the acquiring unit is specifically configured to acquire the similarity values of the target co-occurring words and the topic word cluster of the text data respectively through a cosine similarity algorithm.
10. The apparatus for extracting topic name according to claim 6, characterized by further comprising: a sorting unit;
the acquiring unit is further configured to acquire position information of the topic names in topic word clusters of the text data respectively;
and the sorting unit is used for sorting the topic names according to the order of the position information.
11. A storage medium, characterized by comprising a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the topic name extraction method of any one of claims 1 to 5.
12. A processor, configured to run a program, wherein the program, when running, executes the topic name extraction method of any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697984.8A CN106610931B (en) 2015-10-23 2015-10-23 Topic name extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697984.8A CN106610931B (en) 2015-10-23 2015-10-23 Topic name extraction method and device

Publications (2)

Publication Number Publication Date
CN106610931A CN106610931A (en) 2017-05-03
CN106610931B true CN106610931B (en) 2019-12-31

Family

ID=58613183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697984.8A Active CN106610931B (en) 2015-10-23 2015-10-23 Topic name extraction method and device

Country Status (1)

Country Link
CN (1) CN106610931B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800428B (en) * 2018-12-28 2023-01-13 东软集团股份有限公司 Method, device and equipment for labeling segmentation result for corpus and storage medium
CN110245355B (en) * 2019-06-24 2024-02-13 深圳市腾讯网域计算机网络有限公司 Text topic detection method, device, server and storage medium
CN110704609B (en) * 2019-10-15 2022-03-15 中国科学技术信息研究所 Text theme visualization method and device based on community membership
CN111324725B (en) * 2020-02-17 2023-05-16 昆明理工大学 Topic acquisition method, terminal and computer readable storage medium
CN113821630B (en) * 2020-06-19 2023-10-17 菜鸟智能物流控股有限公司 Data clustering method and device
CN114938477B (en) * 2022-06-23 2024-05-03 阿里巴巴(中国)有限公司 Video topic determination method, device and equipment
CN114925692B (en) * 2022-07-21 2022-10-11 中科雨辰科技有限公司 Data processing system for acquiring target event

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399901A (en) * 2013-07-25 2013-11-20 三星电子(中国)研发中心 Keyword extraction method
CN104063387A (en) * 2013-03-19 2014-09-24 三星电子(中国)研发中心 Device and method abstracting keywords in text
CN104090929A (en) * 2014-06-23 2014-10-08 吕志雪 Recommendation method and device of personalized picture

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005101236A2 (en) * 2004-04-06 2005-10-27 Educational Testing Service Lexical association metric for knowledge-free extraction of phrasal terms
US8392175B2 (en) * 2010-02-01 2013-03-05 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063387A (en) * 2013-03-19 2014-09-24 三星电子(中国)研发中心 Device and method abstracting keywords in text
CN103399901A (en) * 2013-07-25 2013-11-20 三星电子(中国)研发中心 Keyword extraction method
CN104090929A (en) * 2014-06-23 2014-10-08 吕志雪 Recommendation method and device of personalized picture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
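基于互信息与词语共现的领域术语自动抽取方法研究 [Research on an automatic domain term extraction method based on mutual information and word co-occurrence]; 吴海燕 (Wu Haiyan); 《重庆邮电大学学报(自然科学版)》 [Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition)]; 2013-10-15; Vol. 25, No. 5; pp. 691-692 *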

Also Published As

Publication number Publication date
CN106610931A (en) 2017-05-03

Similar Documents

Publication Publication Date Title
CN106610931B (en) Topic name extraction method and device
KR20190085098A (en) Keyword extraction method, computer device, and storage medium
CN106598999B (en) Method and device for calculating text theme attribution degree
CN109344406B (en) Part-of-speech tagging method and device and electronic equipment
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
US20190272319A1 (en) Method and Device for Identifying Specific Text Information
CN110019668A (en) A kind of text searching method and device
CN109597983B (en) Spelling error correction method and device
CN111291177A (en) Information processing method and device and computer storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
CN109960815A (en) A kind of creation method and system of nerve machine translation NMT model
CN110232156B (en) Information recommendation method and device based on long text
CN106598997B (en) Method and device for calculating text theme attribution degree
CN111126060A (en) Method, device and equipment for extracting subject term and storage medium
CN111143551A (en) Text preprocessing method, classification method, device and equipment
CN109597982B (en) Abstract text recognition method and device
CN110019670A (en) A kind of text searching method and device
CN110019659B (en) Method and device for searching referee document
CN110008807A (en) A kind of training method, device and the equipment of treaty content identification model
CN112487181B (en) Keyword determination method and related equipment
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN110210030B (en) Statement analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
Applicant after: Beijing Guoshuang Technology Co.,Ltd.
Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing
Applicant before: Beijing Guoshuang Technology Co.,Ltd.
GR01 Patent grant