CN106776530B - Method and device for extracting subject term - Google Patents

Info

Publication number
CN106776530B
Authority
CN
China
Prior art keywords
word
documents
subject
words
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510819148.2A
Other languages
Chinese (zh)
Other versions
CN106776530A (en
Inventor
祁国晟
徐文斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510819148.2A priority Critical patent/CN106776530B/en
Publication of CN106776530A publication Critical patent/CN106776530A/en
Application granted granted Critical
Publication of CN106776530B publication Critical patent/CN106776530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for extracting subject terms. The method comprises: acquiring all documents from which subject terms need to be extracted and the words appearing in those documents; constructing a word document matrix based on the frequency with which each word appears in the documents, wherein each row of the word document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document; performing semantic analysis on the word document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting the subject terms of all the documents according to the latent semantic space. The invention solves the technical problem that subject term extraction quality suffers when one word has multiple meanings (polysemy) or multiple words share one meaning (synonymy).

Description

Method and device for extracting subject term
Technical Field
The invention relates to the field of natural language processing, in particular to a method and a device for extracting subject terms.
Background
A topic embodies the central idea expressed by a document and is one of the effective ways for a computer to represent the document. Extracting topic information helps in understanding the useful information in a document and improves the efficiency with which a computer processes it. Topic extraction is currently a popular technique in the field of natural language processing.
Taking Chinese topic extraction as an example, the task is generally divided into three levels: subject terms, topic concepts, and topic sentences. Although an individual subject term does not carry as definite a meaning as a topic concept or a topic sentence, a set of subject terms can clearly describe a topic and is easier for a computer to process.
The related art provides a subject term extraction method that performs the following process: (1) collecting a large number of documents to build a large document set, counting how often each word appears across all the documents, and constructing an Inverse Document Frequency (IDF) model; (2) for the document whose topic is to be extracted, counting the Term Frequency (TF) of each word in that document; (3) constructing a weight calculation model based on the word frequency information, determining the weight of each word in the document, and sorting all words by weight; (4) outputting the top-n words of the sorted list according to a preset threshold.
The inventors have found that the above process has the following disadvantages: (1) a subject term extraction model based on word frequency information depends on word frequencies and is easily affected by high-frequency noise words, so the extracted subject terms and their set are easily polluted by such words and extraction quality cannot be guaranteed; (2) extraction based on weight sorting cannot take the semantics of each word into account however the weight calculation model is changed, so it cannot handle polysemy (one word with several meanings) or synonymy (several words with one meaning) in Chinese; that is, the semantics of words cannot be distinguished effectively, which degrades the quality of the extracted subject terms and their set. In addition, the above scheme requires learning an IDF model. An IDF model works well on general, non-domain-specific web-wide data, but its effect drops markedly when the documents all come from one domain; a domain-specific IDF model then usually has to be retrained, which is inflexible.
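The four steps of this related-art process can be sketched as follows. This is only an illustration of a generic TF-IDF ranking, not the exact model the related art uses: the function name `tfidf_top_n`, the smoothed IDF formula, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_top_n(corpus, target, n):
    """Rank the words of `target` by TF * IDF and return the n best.

    corpus: list of tokenized documents (each a list of words), used for IDF.
    target: one tokenized document whose subject terms are wanted.
    """
    # (1) document frequency of every word over the whole corpus
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    total = len(corpus)

    # (2) term frequency inside the target document
    tf = Counter(target)

    # (3) weight = TF * smoothed IDF (the smoothing is an assumption)
    weights = {
        w: (c / len(target)) * math.log((1 + total) / (1 + df[w]))
        for w, c in tf.items()
    }

    # (4) sort by weight and keep the top n
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

corpus = [
    ["stock", "market", "price"],
    ["market", "trade", "price"],
    ["football", "match", "goal"],
    ["football", "team", "goal", "goal"],
]
top = tfidf_top_n(corpus, corpus[3], 2)
print([word for word, _ in top])  # ['goal', 'team']
```

Because the ranking depends only on counts, a frequent but uninformative word can still outrank a rarer, more descriptive one.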
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting a subject term, which at least solve the technical problem that the quality of extracting the subject term is influenced due to the fact that one word is ambiguous or multiple words are synonymous.
According to one aspect of the embodiments of the present invention, a subject term extraction method is provided, including: acquiring all documents from which subject terms need to be extracted and the words appearing in the documents; constructing a word document matrix based on the frequency of each word in the documents, wherein each row of the word document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document; performing semantic analysis on the word document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting the subject terms of all the documents according to the latent semantic space.
Further, performing semantic analysis on the word document matrix using the latent semantic analysis model to generate the latent semantic space includes: analyzing the correspondence between the words and the documents in the word document matrix using the latent semantic analysis model; and, according to the correspondence, mapping the words and the documents in the word document matrix into a vector space that satisfies a preset dimension condition, thereby generating the latent semantic space.
Further, performing semantic analysis on the word document matrix using the latent semantic analysis model to generate the latent semantic space includes: performing semantic analysis on the word document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic indexing model to generate the latent semantic space.
Further, extracting the subject terms of all the documents according to the latent semantic space includes: determining a subject term matrix according to the latent semantic space, wherein each row of the subject term matrix represents the semantic category of a subject term, and each column represents a word appearing in the documents; sorting the words in each row of the subject term matrix by weight; and extracting, from the sorted subject term matrix, the words whose weights are greater than a preset threshold as the subject terms of all the documents.
Further, acquiring all documents from which the subject term needs to be extracted and the term appearing in the document includes: acquiring all the documents needing to extract the subject term; and performing word segmentation processing on all the documents needing to extract the subject words to obtain the words appearing in the documents.
According to another aspect of the embodiments of the present invention, a subject term extraction device is also provided, including: an acquisition unit, configured to acquire all documents from which subject terms need to be extracted and the words appearing in the documents; a construction unit, configured to construct a word document matrix based on the frequency of each word in the documents, wherein each row of the word document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document; a generating unit, configured to perform semantic analysis on the word document matrix using a latent semantic analysis model to generate a latent semantic space; and an extraction unit, configured to extract the subject terms of all the documents according to the latent semantic space.
Further, the generating unit includes: an analysis module, configured to analyze the correspondence between the words and the documents in the word document matrix using the latent semantic analysis model; and a generating module, configured to map the words and the documents in the word document matrix, according to the correspondence, into a vector space that satisfies a preset dimension condition, thereby generating the latent semantic space.
Further, the generating unit is also configured to perform semantic analysis on the word document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic indexing model to generate the latent semantic space.
Further, the extraction unit includes: a determining module, configured to determine a subject term matrix according to the latent semantic space, wherein each row of the subject term matrix represents the semantic category of a subject term, and each column represents a word appearing in the documents; a sorting module, configured to sort the words in each row of the subject term matrix by weight; and an extraction module, configured to extract, from the sorted subject term matrix, the words whose weights are greater than a preset threshold as the subject terms of all the documents.
Further, the acquiring unit includes: the acquisition module is used for acquiring all the documents of which the subject terms need to be extracted; and the word segmentation module is used for carrying out word segmentation on all the documents needing to extract the subject words to obtain the words appearing in the documents.
In the embodiments of the invention, subject terms are extracted on the basis of semantic analysis results: all documents from which subject terms need to be extracted and the words appearing in the documents are acquired; a word document matrix is constructed based on the frequency of each word in the documents, wherein each row of the word document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document; semantic analysis is performed on the word document matrix using a latent semantic analysis model to generate a latent semantic space; and the subject terms of all the documents are extracted according to the latent semantic space. Subject terms are thus extracted from the semantic analysis result, which improves extraction quality and solves the technical problem that extraction quality suffers when one word has multiple meanings or multiple words share one meaning.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of an alternative topic word extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative topic word extraction apparatus according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present invention, a method embodiment of a subject term extraction method is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps may be performed in an order different from the one presented here.
Fig. 1 is a flowchart of an alternative topic word extraction method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step S102, acquiring all documents from which subject terms need to be extracted and the words appearing in the documents;
step S104, constructing a word document matrix based on the frequency of each word in the documents, wherein each row of the word document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document;
step S106, performing semantic analysis on the word document matrix using a latent semantic analysis model to generate a latent semantic space;
and step S108, extracting the subject terms of all the documents according to the latent semantic space.
For example, assume there are N documents from which subject terms need to be extracted and that these documents contain M words in total. Denote the document set by $D = \{D_1, D_2, D_3, \ldots, D_N\}$ and the word set by $W = \{W_1, W_2, W_3, \ldots, W_M\}$. An N × M word document matrix A (i.e., a word-document matrix) can then be built from the documents and words:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{pmatrix}$$
Each row of A corresponds to one document, and each element of the row is the word frequency information of the corresponding word in that document. Each column corresponds to one word, and each element of the column is the word frequency information of that word in the corresponding document. Specifically, the element $a_{ij}$ of A is obtained from D and W through the mapping $a_{ij} = D_i W_j$ and is the word frequency information of word j in document i.
Further, on the basis of the matrix A, a normalization factor may be calculated and each row vector normalized. The normalization factor can be calculated in various ways; for example, L2 normalization can be chosen for vector normalization. Specifically, the L2 normalization factor is calculated as follows:
$$\mathrm{Norm} = (d_1)^2 + \cdots + (d_n)^2$$
Through the above steps, the method processes whole documents using latent semantic analysis and remedies the shortcomings of extracting subject terms from word frequency information alone. Taking word semantics into account reduces the influence of noise words on subject term quality, so the subject terms used to represent a topic cover the document information better and represent the topic more completely. This effectively improves the quality of the extracted subject terms, makes the extracted topics more widely usable in later applications, and is significant for tasks such as similarity computation and document retrieval.
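As a minimal sketch of step S104 and the optional row normalization, the following builds a small word document matrix with documents as rows and words as columns. The helper names are hypothetical, and the sketch assumes the conventional form of L2 normalization, in which each row is divided by the square root of the sum of its squared elements.

```python
import math

def build_word_document_matrix(docs):
    """Return (vocabulary, A) where A[i][j] is the count of word j in document i."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc:
            matrix[i][index[w]] += 1.0
    return vocab, matrix

def l2_normalize_rows(matrix):
    """Divide every row by its L2 norm; all-zero rows are left unchanged."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out

# Two already-segmented toy documents (N = 2, M = 3).
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
vocab, A = build_word_document_matrix(docs)
A_norm = l2_normalize_rows(A)
print(vocab)  # ['apple', 'banana', 'cherry']
print(A)      # [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
```

After normalization every non-zero row has unit length, so long documents no longer dominate short ones purely through their raw counts.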
Optionally, performing semantic analysis on the word document matrix using the latent semantic analysis model to generate the latent semantic space includes:
S2, analyzing the correspondence between the words and the documents in the word document matrix using the latent semantic analysis model;
and S4, mapping the words and the documents in the word document matrix, according to the correspondence, into a vector space that satisfies a preset dimension condition, thereby generating the latent semantic space.
The purpose of latent semantic analysis is to find the true meaning of each word in the documents, i.e., its latent semantics, and thereby obtain the semantic information of the words and the relationship between words and topics. Specifically, generating the latent semantic space means modeling a large document collection in a vector space of reasonable dimension and representing both words and documents in that space. For example, given 2000 documents containing 7000 words, latent semantic analysis represents the words and the documents, according to their correspondence, in a vector space of dimension 100.
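The mapping of words and documents into one low-dimensional space can be illustrated with a truncated singular value decomposition. This is a sketch under stated assumptions rather than the patented procedure: it uses NumPy's `numpy.linalg.svd`, a random 20 × 50 count matrix in place of the 2000 × 7000 example, and k = 5 in place of 100; the function name `latent_space` is hypothetical.

```python
import numpy as np

def latent_space(A, k):
    """Map documents (rows of A) and words (columns of A) into k dimensions."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = U[:, :k] * s[:k]        # one k-dimensional row per document
    word_vecs = Vt[:k, :].T * s[:k]    # one k-dimensional row per word
    return doc_vecs, word_vecs, s[:k]

# Toy stand-in for a word document matrix: 20 documents, 50 words.
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(20, 50)).astype(float)

doc_vecs, word_vecs, s = latent_space(A, k=5)
print(doc_vecs.shape, word_vecs.shape)  # (20, 5) (50, 5)
```

Documents and words now live in the same k-dimensional space, so they can be compared with each other directly, for example by cosine similarity.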
By the embodiment of the invention, the topic is extracted based on the latent semantic analysis model, so that the influence of noise words can be reduced, and the extracted topic words can better describe the topic of the document.
Based on the above embodiment, optionally, performing semantic analysis on the word document matrix using the latent semantic analysis model to generate the latent semantic space includes:
S6, performing semantic analysis on the word document matrix using a singular value decomposition model, a non-negative matrix factorization (NMF) model, or a probabilistic latent semantic indexing (pLSI) model to generate the latent semantic space.
The following describes the process of generating the latent semantic space in detail by taking the singular value decomposition K-SVD model as an example:
Singular Value Decomposition (SVD) is an important matrix decomposition in linear algebra. It generalizes the unitary diagonalization of normal matrices in matrix analysis and has important applications in signal processing, statistics, and other fields. A unitary matrix U is an n × n complex matrix satisfying $U^{\mathrm T}U = UU^{\mathrm T} = E_n$, where $U^{\mathrm T}$ is the conjugate transpose of U and $E_n$ is the identity matrix of order n. In linear algebra, the column rank of a matrix is the maximum number of linearly independent columns of the matrix; similarly, the row rank is the maximum number of linearly independent rows.
In implementation, the word document matrix is processed using SVD: the matrix A is decomposed as $A = U\Sigma V^{\mathrm T}$ into the three matrices U, Σ and $V^{\mathrm T}$, where Σ is a diagonal matrix whose diagonal elements are the singular values of A (i.e., the square roots of the eigenvalues of $A^{\mathrm T}A$). A simple way of solving $A = U\Sigma V^{\mathrm T}$ is as follows:
(1) Find the unitarily similar diagonal matrix of $A^{\mathrm T}A$ and the corresponding unitary matrix V:
$$V^{\mathrm T}(A^{\mathrm T}A)V = \begin{pmatrix} \Delta^2 & 0 \\ 0 & 0 \end{pmatrix}$$
(2) Write $V = (V_1, V_2)$, with $V_1 \in C^{n \times r}$ and $V_2 \in C^{n \times (n-r)}$.
(3) Let $U_1 = AV_1\Delta^{-1}$, with $U_1 \in C^{m \times r}$.
(4) Extend $U_1$ to a unitary matrix $U = (U_1, U_2)$.
(5) Construct the singular value decomposition:
$$A = U \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix} V^{\mathrm T}$$
Wherein, each singular value in Σ corresponds to a weight value of each "semantic" dimension. Further, a less important weight value may be configured to be 0, that is, all dimension values smaller than a certain weight threshold value are configured to be 0, and only the most important dimension information is retained, so that some noise words may be filtered out from the obtained potential semantic space.
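Steps (1) to (3) of the solution above, followed by the zeroing of small singular values, can be sketched with NumPy as below. The sketch assumes a real matrix (so the conjugate transpose is the plain transpose), builds only the reduced factors $U_1$, Δ and $V_1$ without the extension to a full unitary U in step (4), and uses a hypothetical 4 × 3 matrix and threshold. In practice `numpy.linalg.svd` is the numerically preferable route; the eigendecomposition is shown only because it mirrors the steps in the text.

```python
import numpy as np

def svd_via_gram(A, tol=1e-10):
    """Reduced SVD of a real matrix A from the eigendecomposition of A^T A."""
    # (1) diagonalize the symmetric matrix A^T A
    eigvals, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, V = eigvals[order], V[:, order]
    # (2) keep the r columns of V that belong to positive eigenvalues
    r = int(np.sum(eigvals > tol))
    V1 = V[:, :r]
    delta = np.sqrt(eigvals[:r])               # the r non-zero singular values
    # (3) U1 = A V1 Delta^{-1} (column j is divided by delta[j])
    U1 = (A @ V1) / delta
    return U1, delta, V1

def drop_small_singular_values(delta, threshold):
    """Set every singular value below `threshold` to zero."""
    return np.where(delta >= threshold, delta, 0.0)

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
U1, delta, V1 = svd_via_gram(A)
filtered = drop_small_singular_values(delta, threshold=2.0)

# Rebuilding A from the filtered factors keeps only the dominant dimensions.
A_filtered = U1 @ np.diag(filtered) @ V1.T
```

For this matrix the singular values are √10, 3 and 1, so a threshold of 2.0 removes only the weakest dimension.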
With singular value decomposition combined with singular value filtering and membership filtering, the semantic categories of subject terms with small weights and the words with low degrees of membership can be filtered out. This eliminates the influence of high-frequency noise words, so the extracted subject terms better describe the topic of the document.
Optionally, extracting the subject terms of all the documents according to the latent semantic space includes:
S8, determining a subject term matrix according to the latent semantic space, wherein each row of the subject term matrix represents the semantic category of a subject term, and each column represents a word appearing in the documents;
S10, sorting the words in each row of the subject term matrix by weight;
and S12, extracting, from the sorted subject term matrix, the words whose weights are greater than a preset threshold as the subject terms of all the documents.
Based on the foregoing embodiment, after singular value decomposition of the word document matrix A, the diagonal matrix Σ and the matrix $V^{\mathrm T}$ among the three resulting matrices are taken, and the intermediate matrix $T_1$ is obtained by the multiplication $T_1 = \Sigma V^{\mathrm T}$. The all-zero rows and all-zero columns of $T_1$ are filtered out to obtain the final subject term matrix $T_2$. The rows of $T_2$ represent the semantic categories of the extracted subject terms, the columns represent the words in the documents, and each element of $T_2$ represents the degree of membership between the word of its column and the subject term category of its row. Each row of $T_2$ is then sorted by weight, and the words whose weights are greater than the weight threshold are added, together with their weights, to a topic set as subject terms and topic information, forming the subject term set that represents the topic of each document.
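The construction of $T_1$ and $T_2$ and the per-row extraction can be sketched as follows. Several choices here are assumptions, not part of the source: the function and parameter names are hypothetical, small singular values are zeroed before the multiplication so that all-zero rows actually arise, and words are ranked by the absolute value of their weight because the signs of singular vectors returned by `numpy.linalg.svd` are arbitrary.

```python
import numpy as np

def topic_word_matrix(A, sigma_threshold=0.0):
    """Return T2 = Sigma V^T with all-zero rows and columns removed."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.where(s > sigma_threshold, s, 0.0)   # optional singular value filter
    T1 = s[:, None] * Vt                        # same as diag(s) @ Vt
    keep_rows = np.any(T1 != 0.0, axis=1)
    keep_cols = np.any(T1 != 0.0, axis=0)
    return T1[keep_rows][:, keep_cols], keep_cols

def topic_words(T2, vocab, weight_threshold):
    """For each row of T2, list (word, weight) pairs above the threshold, sorted."""
    topics = []
    for row in np.abs(T2):                      # assumption: rank by magnitude
        order = np.argsort(row)[::-1]
        topics.append([(vocab[j], float(row[j])) for j in order
                       if row[j] > weight_threshold])
    return topics

# Toy word document matrix: 3 documents (rows) over 4 words (columns).
vocab = ["goal", "match", "price", "stock"]
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0]])

T2, keep_cols = topic_word_matrix(A, sigma_threshold=1.5)
kept_vocab = [w for w, keep in zip(vocab, keep_cols) if keep]
topics = topic_words(T2, kept_vocab, weight_threshold=1.5)
for words in topics:
    print([w for w, _ in words])
```

On this toy input two semantic categories survive the filter: one carried by "price" and one carried by "goal" and "match".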
It should be noted that, depending on the task requirements, the weight threshold can be of two types. The first is an integer m, which indicates that the first m subject terms related to a topic are to be extracted to represent the topic of a document. The second is a decimal f, which indicates that all words whose weights are greater than f are to be extracted as subject terms to represent the topics of the documents.
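The two threshold types can be handled by one small selection routine. This is a sketch: the function name `select_topic_words` and the sample weights are made up for illustration, and distinguishing the two cases by Python type is a design choice of the example, not something the source prescribes.

```python
def select_topic_words(weighted_words, threshold):
    """Pick subject terms from (word, weight) pairs.

    threshold: an int m keeps the first m words after sorting by weight;
               a float f keeps every word whose weight is greater than f.
    """
    ranked = sorted(weighted_words, key=lambda kv: kv[1], reverse=True)
    if isinstance(threshold, bool):             # bool is a subclass of int
        raise TypeError("threshold must be an int or a float")
    if isinstance(threshold, int):
        return ranked[:max(threshold, 0)]
    if isinstance(threshold, float):
        return [(w, x) for w, x in ranked if x > threshold]
    raise TypeError("threshold must be an int or a float")

row = [("stock", 1.0), ("price", 3.0), ("match", 0.1), ("goal", 0.2)]
top_two = select_topic_words(row, 2)       # integer type: first m words
above = select_topic_words(row, 1.5)       # decimal type: weight > f
print(top_two)  # [('price', 3.0), ('stock', 1.0)]
print(above)    # [('price', 3.0)]
```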
Optionally, acquiring all documents from which the subject term needs to be extracted and the term appearing in the document includes:
s14, acquiring all documents of which the subject terms need to be extracted;
and S16, performing word segmentation processing on all the documents needing to extract the subject words to obtain the words appearing in the documents.
That is, after all the documents from which subject terms need to be extracted have been acquired, the documents need to be preprocessed: word segmentation is performed on the documents to obtain the words they contain, and the word frequency information of those words is counted. For a Chinese document, a Chinese word segmentation tool may be used, turning a long text document into a set of words. To improve the quality of the extracted subject terms and reduce the influence of high-frequency noise words, common Chinese stop words such as 'ones', 'kanes', etc. can be filtered out after word segmentation is finished.
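The preprocessing can be sketched as segmentation, stop word filtering, and frequency counting. The segmenter below is only a stand-in that splits English text on non-word characters; a Chinese document would need a dedicated word segmentation tool in its place. The stop word list and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real deployment would load a full list.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def segment(text):
    """Stand-in for a word segmenter: lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def preprocess(text, stop_words=STOP_WORDS):
    """Segment a document, drop stop words, and count word frequencies."""
    words = [w for w in segment(text) if w not in stop_words]
    return words, Counter(words)

words, freq = preprocess("The price of the stock is the price of risk.")
print(words)  # ['price', 'stock', 'price', 'risk']
print(freq["price"])  # 2
```

The resulting word lists are what the word document matrix in step S104 is built from.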
By the embodiment of the invention, a large corpus model does not need to be trained in advance, the use is flexible, and the method has universality on document sets or whole network data in different fields.
Example 2
According to an embodiment of the present invention, an embodiment of an apparatus for extracting a topic word is provided.
FIG. 2 is a schematic diagram of an alternative topic word extraction apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes: an obtaining unit 202, configured to obtain all documents from which subject terms need to be extracted and the words appearing in the documents; a constructing unit 204, configured to construct a word document matrix based on the frequency of each word in the documents, wherein each row of the word document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document; a generating unit 206, configured to perform semantic analysis on the word document matrix using the latent semantic analysis model to generate a latent semantic space; and an extracting unit 208, configured to extract the subject terms of all the documents according to the latent semantic space.
For example, assume there are N documents from which subject terms need to be extracted and that these documents contain M words in total. Denote the document set by $D = \{D_1, D_2, D_3, \ldots, D_N\}$ and the word set by $W = \{W_1, W_2, W_3, \ldots, W_M\}$. An N × M word document matrix A (i.e., a word-document matrix) can then be built from the documents and words:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{pmatrix}$$
Each row of A corresponds to one document, and each element of the row is the word frequency information of the corresponding word in that document. Each column corresponds to one word, and each element of the column is the word frequency information of that word in the corresponding document. Specifically, the element $a_{ij}$ of A is obtained from D and W through the mapping $a_{ij} = D_i W_j$ and is the word frequency information of word j in document i.
Further, on the basis of the matrix A, a normalization factor may be calculated and each row vector normalized. The normalization factor can be calculated in various ways; for example, L2 normalization can be chosen for vector normalization. Specifically, the L2 normalization factor is calculated as follows:
$$\mathrm{Norm} = (d_1)^2 + \cdots + (d_n)^2$$
Through this embodiment, whole documents are processed using latent semantic analysis, which remedies the shortcomings of extracting subject terms from word frequency information alone. Taking word semantics into account reduces the influence of noise words on subject term quality, so the subject terms used to represent a topic cover the document information better and represent the topic more completely. This effectively improves the quality of the extracted subject terms, makes the extracted topics more widely usable in later applications, and is significant for tasks such as similarity computation and document retrieval.
Optionally, the generating unit includes: an analysis module, configured to analyze the correspondence between the words and the documents in the word document matrix using the latent semantic analysis model; and a generating module, configured to map the words and the documents in the word document matrix, according to the correspondence, into a vector space that satisfies a preset dimension condition, thereby generating the latent semantic space.
The purpose of latent semantic analysis is to find the true meaning of each word in the documents, i.e., its latent semantics, and thereby obtain the semantic information of the words and the relationship between words and topics. Specifically, generating the latent semantic space means modeling a large document collection in a vector space of reasonable dimension and representing both words and documents in that space. For example, given 2000 documents containing 7000 words, latent semantic analysis represents the words and the documents, according to their correspondence, in a vector space of dimension 100.
By the embodiment of the invention, the topic is extracted based on the latent semantic analysis model, so that the influence of noise words can be reduced, and the extracted topic words can better describe the topic of the document.
Based on the above embodiment, optionally, the generating unit is further configured to perform semantic analysis on the word document matrix by using a singular value decomposition model or a non-negative matrix decomposition model or a probabilistic latent semantic index model, so as to generate a latent semantic space.
The process of generating the latent semantic space by using the singular value decomposition K-SVD model is the same as the process described in embodiment 1, and is not described herein again.
By combining singular value decomposition with singular value filtering and membership degree filtering, the semantic categories of subject words with small weight values and the words with a low degree of membership can be filtered out, and the influence of high-frequency noise words is eliminated, so that the extracted subject words can better describe the subject of the document.
Optionally, the extracting unit includes: the determining module is used for determining a subject term matrix according to the latent semantic space, wherein each row of the subject term matrix represents the semantic category of a subject term, and each column represents a word appearing in all the documents of which the subject terms need to be extracted; the sorting module is used for sorting the words in each row of the subject term matrix according to their weight values; and the extraction module is used for extracting the words whose weight values are larger than a preset threshold value in the sorted subject term matrix as the subject terms of all the documents of which the subject terms need to be extracted.
Based on the foregoing embodiment, after singular value decomposition is performed on the word document matrix $A$, the diagonal matrix $\Sigma$ and the matrix $V^T$ are obtained from among the three resulting matrices. The intermediate matrix $T_1$ is obtained by the multiplication $T_1 = \Sigma V^T$, and the all-zero rows and all-zero columns of $T_1$ are filtered out to obtain the final subject term matrix $T_2$. The rows of $T_2$ represent the semantic categories of the extracted subject words, the columns represent the words in the documents, and each element of $T_2$ represents the membership (i.e., degree of membership) between the word represented by the column in which the element is located and the subject word represented by the row in which the element is located. Each row of the matrix $T_2$ is then sorted by weight value, and the words whose weight values are larger than the weight threshold, together with their weights, are added to a subject set as subject words and subject information, forming a subject word set for representing the subject of each document.
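The computation of the subject term matrix can be sketched as follows, assuming NumPy and a toy matrix. The function names `topic_word_matrix` and `top_words`, the tolerance `tol`, and the vocabulary are hypothetical. Ranking by the absolute value of each weight is also an assumption, made here because the signs of singular vectors are arbitrary; the text only says that rows are sorted by weight value.

```python
import numpy as np

def topic_word_matrix(A, tol=1e-10):
    """Return T2 (Sigma @ V^T without all-zero rows/columns) and the kept column indices."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    T1 = np.diag(s) @ Vt                 # intermediate matrix T1 = Sigma V^T
    T1[np.abs(T1) < tol] = 0.0           # treat floating-point noise as exact zero
    keep_rows = np.any(T1 != 0, axis=1)  # drop all-zero rows
    keep_cols = np.any(T1 != 0, axis=0)  # drop all-zero columns
    T2 = T1[keep_rows][:, keep_cols]
    return T2, np.flatnonzero(keep_cols)

def top_words(T2, kept_cols, vocab, row, m):
    """Rank the words of one semantic category (row) by weight magnitude."""
    weights = np.abs(T2[row])
    order = np.argsort(-weights)[:m]
    return [(vocab[kept_cols[j]], float(weights[j])) for j in order]

# Rows are documents, columns are words; "unused" never occurs.
vocab = ["cat", "dog", "stock", "market", "unused"]
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 3, 0],
], dtype=float)

T2, kept = topic_word_matrix(A)
for row in range(2):
    print(row, top_words(T2, kept, vocab, row, m=2))
```

In this toy case the column for "unused" is all zero in $T_1$ and is filtered out, so $T_2$ has one column fewer than the word document matrix.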
It should be noted that, depending on the task requirements, the weight threshold may be of two types: the first is an integer m, indicating that the first m subject words related to a subject need to be extracted to represent the subject of a document; the second is a decimal f, indicating that all words with weight values larger than f need to be extracted as subject words to represent the subjects of the documents.
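The two threshold types can be handled by a single selection routine, sketched below in plain Python. The function name `select_subject_words` and the sample weights are hypothetical and only illustrate the distinction between an integer m and a decimal f.

```python
def select_subject_words(weighted_words, threshold):
    """Select subject words from (word, weight) pairs of one semantic category.

    An int threshold m keeps the first m words after sorting by weight;
    a float threshold f keeps every word whose weight is larger than f.
    """
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise TypeError("threshold must be an int (m) or a float (f)")
    ranked = sorted(weighted_words, key=lambda pair: pair[1], reverse=True)
    if isinstance(threshold, int):
        return ranked[:max(threshold, 0)]
    return [(word, weight) for word, weight in ranked if weight > threshold]

# Hypothetical weights for one row of the subject term matrix.
row = [("price", 0.30), ("market", 0.71), ("the", 0.05), ("stock", 0.65)]

print(select_subject_words(row, 2))     # integer m: top two words
print(select_subject_words(row, 0.25))  # decimal f: words with weight > 0.25
```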
Optionally, the obtaining unit includes: the acquisition module is used for acquiring all documents of which the subject terms need to be extracted; and the word segmentation module is used for carrying out word segmentation on all the documents of which the subject words need to be extracted to obtain the words appearing in the documents.
That is, after all the documents whose subject words need to be extracted have been acquired, the documents need to be preprocessed: word segmentation is performed on the documents to obtain the words they contain, and the word frequency information of those words is counted. For a Chinese document, a Chinese word segmentation tool may be used for word segmentation, thereby processing a long text document into a set of words. In order to improve the quality of the extracted subject words and reduce the influence of high-frequency noise words, common Chinese stop words such as 'ones', 'kanes', etc. can be filtered out after word segmentation is finished.
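A minimal sketch of this preprocessing step is given below. It is not a Chinese segmenter: the hypothetical `segment` function simply splits on whitespace as a stand-in, and a real Chinese word segmentation tool would be substituted for it. The stop word list is likewise an illustrative assumption.

```python
from collections import Counter

# Hypothetical stop word list; a real system would load a full list for its language.
STOP_WORDS = {"the", "a", "of", "and"}

def segment(text):
    """Stand-in segmenter: lowercases and splits on whitespace."""
    return text.lower().split()

def preprocess(documents, stop_words=STOP_WORDS):
    """Segment each document, drop stop words, and count word frequencies."""
    counts = []
    for text in documents:
        words = [w for w in segment(text) if w not in stop_words]
        counts.append(Counter(words))
    return counts

docs = ["The price of the stock", "A stock and a bond"]
counts = preprocess(docs)
print(counts)
```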
By the embodiment of the invention, a large corpus model does not need to be trained in advance, the use is flexible, and the method has universality on document sets or whole network data in different fields.
The subject word extraction device includes a processor and a memory, and the acquisition unit, the construction unit, the generation unit, the extraction unit, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory.
The processor comprises one or more kernels, and a kernel calls the corresponding program unit from the memory. The text content is analyzed by adjusting the kernel parameters.
The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application further provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: acquiring all documents of which the subject terms need to be extracted and the words appearing in the documents; constructing a word document matrix based on the frequency of each word appearing in the documents, wherein each row of the word document matrix represents the word frequency information of each word in one document, and each column represents the word frequency information of one word in each document; performing semantic analysis on the word document matrix by using a latent semantic analysis model to generate a latent semantic space; and extracting the subject terms of all the documents of which the subject terms need to be extracted according to the latent semantic space.
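The matrix construction step can be sketched in plain Python as follows, using the orientation stated above (one row per document, one column per word). The function name `build_word_document_matrix` and the tokenized sample documents are hypothetical.

```python
from collections import Counter

def build_word_document_matrix(tokenized_docs):
    """Build a matrix with one row per document and one column per word.

    Entry [i][j] is the number of times word j occurs in document i.
    """
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocab)
        for word, freq in Counter(doc).items():
            row[index[word]] = freq
        matrix.append(row)
    return vocab, matrix

# Documents are assumed to be already segmented into words.
tokenized = [["stock", "market", "stock"], ["market", "bond"]]
vocab, matrix = build_word_document_matrix(tokenized)
print(vocab)
print(matrix)
```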
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A method for extracting a subject term, comprising:
acquiring all documents of which the subject words need to be extracted and words appearing in the documents;
constructing a word document matrix based on the frequency of each word appearing in the document, wherein each row of the word document matrix represents the word frequency information of each word in one document, and each column represents the word frequency information of one word in each document;
performing semantic analysis on the word document matrix by using a latent semantic analysis model to generate a latent semantic space;
extracting the subject terms of all the documents of which the subject terms need to be extracted according to the latent semantic space;
wherein performing semantic analysis on the word document matrix by using the latent semantic analysis model to generate the latent semantic space comprises: analyzing the corresponding relation between the words and the documents in the word document matrix by utilizing the latent semantic analysis model; and mapping the words and the documents in the word document matrix to a vector space meeting a preset dimension condition according to the corresponding relation, to generate the latent semantic space;
wherein extracting the subject terms of all the documents of which the subject terms need to be extracted according to the latent semantic space comprises: determining a subject term matrix according to the latent semantic space, wherein each row of the subject term matrix represents the semantic category of a subject term, and each column represents a word appearing in all the documents of which the subject terms need to be extracted; sorting the words in each row of the subject term matrix according to their weight values; and extracting the words whose weight values are larger than a preset threshold value in the sorted subject term matrix as the subject terms of all the documents of which the subject terms need to be extracted.
2. The method of claim 1, wherein performing semantic analysis on the word document matrix by using a latent semantic analysis model to generate a latent semantic space comprises:
performing semantic analysis on the word document matrix by using a singular value decomposition model or a non-negative matrix decomposition model or a probabilistic latent semantic indexing model to generate the latent semantic space.
3. The method of claim 1, wherein obtaining all documents from which subject terms need to be extracted and terms appearing in the documents comprises:
acquiring all the documents of which the subject terms need to be extracted;
and performing word segmentation processing on all the documents needing to extract the subject words to obtain the words appearing in the documents.
4. An apparatus for extracting a subject term, comprising:
the acquisition unit is used for acquiring all documents of which the subject terms need to be extracted and terms appearing in the documents;
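the construction unit is used for constructing a word document matrix based on the frequency of each word appearing in the documents, wherein each row of the word document matrix represents the word frequency information of each word in one document, and each column represents the word frequency information of one word in each document;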
the generating unit is used for carrying out semantic analysis on the word document matrix by utilizing a latent semantic analysis model to generate a latent semantic space;
the extraction unit is used for extracting the subject terms of all the documents of which the subject terms need to be extracted according to the latent semantic space;
wherein the generating unit includes: the analysis module is used for analyzing the corresponding relation between the words and the documents in the word document matrix by utilizing the latent semantic analysis model; the generating module is used for mapping the words and the documents in the word document matrix to a vector space meeting a preset dimension condition according to the corresponding relation to generate the latent semantic space;
wherein the extraction unit includes: a determining module, configured to determine a subject term matrix according to the latent semantic space, where each row of the subject term matrix represents a semantic category of a subject term, and each column represents a word appearing in all the documents of which the subject terms need to be extracted; the sorting module is used for sorting the words in each row of the subject term matrix according to their weight values; and the extraction module is used for extracting the words whose weight values are larger than a preset threshold value in the sorted subject term matrix as the subject terms of all the documents of which the subject terms need to be extracted.
5. The apparatus according to claim 4, wherein the generating unit is further configured to perform semantic analysis on the word document matrix using a singular value decomposition model or a non-negative matrix decomposition model or a probabilistic latent semantic indexing model to generate a latent semantic space.
6. The apparatus of claim 4, wherein the obtaining unit comprises:
the acquisition module is used for acquiring all the documents of which the subject terms need to be extracted;
and the word segmentation module is used for carrying out word segmentation on all the documents needing to extract the subject words to obtain the words appearing in the documents.
CN201510819148.2A 2015-11-23 2015-11-23 Method and device for extracting subject term Active CN106776530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510819148.2A CN106776530B (en) 2015-11-23 2015-11-23 Method and device for extracting subject term

Publications (2)

Publication Number Publication Date
CN106776530A CN106776530A (en) 2017-05-31
CN106776530B true CN106776530B (en) 2020-07-03

Family

ID=58963111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510819148.2A Active CN106776530B (en) 2015-11-23 2015-11-23 Method and device for extracting subject term

Country Status (1)

Country Link
CN (1) CN106776530B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117494726B (en) * 2023-12-29 2024-04-12 成都航空职业技术学院 Information keyword extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于衰减词共现图的多文档摘要研究;周进华 等;《小型微型计算机系统》;20090131;第30卷(第1期);第174页 *

Also Published As

Publication number Publication date
CN106776530A (en) 2017-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant