CN115935977A - Text theme recognition method and device and electronic equipment - Google Patents
- Publication number
- CN115935977A (application number CN202211409921.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- subject
- text
- candidate
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
The disclosure provides a text topic identification method and apparatus, and an electronic device. The text topic identification method comprises the following steps: acquiring a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word; matching the first candidate word with a second candidate word in a word segmentation dictionary; determining a first subject word from the first candidate words that fail to match; determining, from the first candidate words that match successfully, a first candidate word that matches a keyword in a knowledge base as a second subject word; and determining the subject of the text to be recognized based on the first subject word and the second subject word. When the text includes novel words, the present disclosure can identify them and extract an accurate text topic.
Description
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text topic identification method and apparatus, and an electronic device.
Background
The theme is the central idea of a text; it summarizes and reflects the body and core of the text content. A theme label briefly summarizes the main content of the text in a few words. In an era of information overload and rapid data growth, enterprises accumulate massive text resources. When the volume of text is large and the source channels are varied, the data sets contain content of different fields and types, so enterprises face the practical problems of obtaining the subjects of texts from multiple angles and understanding the relationships among the texts.
With the rapid development of the internet, new vocabulary is generated frequently and comes into wide use. For text theme extraction, the prior art mainly adopts knowledge-base-based recognition methods, which cannot recognize novel vocabulary. As a result, when the theme of a text involves novel vocabulary, the prior art cannot extract an accurate text theme.
Disclosure of Invention
The embodiment of the disclosure provides a text theme identification method and device and electronic equipment.
An embodiment of a first aspect of the present disclosure provides a text topic identification method, including: acquiring a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
matching the first candidate word with a second candidate word in a word segmentation dictionary;
determining a first subject word from the first candidate words with failed matching;
determining a first candidate word matched with the keyword in the knowledge base as a second subject word from the successfully matched first candidate words;
and determining the subject of the text to be recognized based on the first subject term and the second subject term.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words that fail to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that match successfully, and a subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure, the novel word is considered as a subject word of the text, so that the accuracy of text subject recognition is improved.
In one embodiment of the present disclosure, the determining a first subject word from the first candidate words with failed matching includes: identifying, from the first candidate words that fail to match, the first candidate word that appears repeatedly as the first subject word.
In one embodiment of the disclosure, after the identifying the repeatedly appearing first candidate word as the first subject word, the method includes: storing the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
In an embodiment of the disclosure, the determining the topic of the text to be recognized based on the first topic word and the second topic word includes: filtering the first subject term and the second subject term to obtain filtered subject terms; obtaining the score value of the filtered subject term, wherein the score value represents the importance degree of the subject term; and sequencing the filtered subject words according to the score value to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
In an embodiment of the present disclosure, the obtaining the score value of the filtered topic word includes: acquiring the word frequency of the filtered subject word; and acquiring the score value of the filtered subject term according to the term frequency.
In an embodiment of the present disclosure, the obtaining the score value of the filtered topic word includes: acquiring the information entropy of the filtered subject term; and acquiring the score value of the filtered subject term according to the information entropy.
An embodiment of a second aspect of the present disclosure provides a text topic identification apparatus, including: the acquisition module is used for acquiring a text to be recognized and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
the matching module is used for matching the first candidate word with a second candidate word in a word segmentation dictionary;
the first determining module is used for determining a first subject word from the first candidate words with failed matching;
the second determining module is used for determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words;
and the third determining module is used for determining the theme of the text to be recognized based on the first theme word and the second theme word.
An embodiment of a third aspect of the present disclosure provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the text topic identification method provided by the embodiment of the first aspect of the disclosure.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a text topic identification method provided in an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 3 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 5 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a text topic identification apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the appended claims.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information in the embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the like or similar elements throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present disclosure, and should not be construed as limiting the present disclosure.
The text topic identification method, the text topic identification device and the electronic device according to the embodiments of the present disclosure are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a text topic identification method provided in an embodiment of the present disclosure. As shown in fig. 1, the method comprises the following steps:
S101, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
It should be noted that the text to be recognized described in this embodiment may be a text expressed in various written languages, for example, a Chinese text, an English text, a Russian text, a Malay text, a mixed Chinese and English text, and the like. The text to be recognized may be a sentence, a paragraph, or a full document, such as a news article.
Before word segmentation processing is performed on the text to be recognized, the text may be preprocessed. The preprocessing includes but is not limited to cleaning and normalizing the characters in the text to be recognized, such as punctuation, whitespace, Chinese and English characters, simplified and traditional Chinese characters, and the like.
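The cleaning step can be illustrated with a minimal Python sketch. The function name preprocess and the specific choices (NFKC normalization, replacing punctuation with spaces) are assumptions made for illustration and are not specified by the disclosure; conversion between simplified and traditional characters would need a mapping table and is not shown.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Hypothetical cleaning step applied before word segmentation."""
    # NFKC folds full-width letters, digits, punctuation and spaces
    # into their ordinary half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace every character that is neither a word character nor
    # whitespace (i.e. punctuation and symbols) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

For instance, under these assumptions a string containing full-width punctuation and repeated blanks is reduced to words separated by single spaces.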
In some embodiments, a dictionary-based word segmentation method, a statistics-based word segmentation method, a machine learning word segmentation method, or the like may be used to perform word segmentation on the text to be recognized, so as to obtain the first candidate words. There are multiple first candidate words.
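As one possible instance of a dictionary-based word segmentation method, the following Python sketch uses forward maximum matching. The disclosure does not name a specific algorithm; the function forward_max_match and the max_len parameter are hypothetical. Note that with this particular method, text not covered by the dictionary falls back to single characters, so a statistics-based or machine learning segmenter may be better suited to surfacing novel multi-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, preferring the longest dictionary entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary
        # entry is found; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```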
S102, matching the first candidate word with a second candidate word in a word segmentation dictionary.
The word segmentation dictionary in the embodiment of the present disclosure may be a word segmentation dictionary commonly used in the technical field of natural language processing, or may be a word segmentation dictionary pre-constructed based on rules.
S103, determining a first subject word from the first candidate words with failed matching.
When the first candidate word is matched with the second candidate word in the word segmentation dictionary, the first candidate word which fails in matching can be regarded as a novel vocabulary, namely, a vocabulary which does not exist in the word segmentation dictionary, and the novel vocabulary is often related to the theme of the text to be recognized. In some embodiments, a first candidate word with a higher degree of correlation with the topic of the text to be recognized may be selected as the first topic word from the first candidate words with failed matching. In other embodiments, the first candidate word that fails to match is directly used as the first subject word.
S104, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
The knowledge base comprises named entities and other set keywords. Named entities generally refer to entities in a text that have a specific meaning or strong referential value, and generally comprise names of people, names of places, names of organizations, dates and times, proper nouns and the like. The set keywords can be added to the knowledge base as required.
The first candidate words that successfully match a second candidate word in the word segmentation dictionary are then matched against the keywords in the knowledge base, and the first candidate words that match a keyword are taken as second subject words.
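Steps S102 to S104 can be sketched in Python as follows. This is a minimal illustration that assumes exact string matching and represents the word segmentation dictionary and the knowledge base as plain sets; the function names are hypothetical, and it follows the variant in which every unmatched candidate is used directly as a first subject word.

```python
def split_by_dictionary(candidates, dictionary):
    """S102: separate candidates found in the dictionary from the rest."""
    matched = [w for w in candidates if w in dictionary]
    unmatched = [w for w in candidates if w not in dictionary]
    return matched, unmatched


def second_subject_words(matched, knowledge_base):
    """S104: keep matched candidates that are knowledge-base keywords."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(w for w in matched if w in knowledge_base))


def identify_subject_words(candidates, dictionary, knowledge_base):
    matched, unmatched = split_by_dictionary(candidates, dictionary)
    # S103 (simplest variant): every unmatched candidate is treated
    # as a novel word and used as a first subject word.
    first = list(dict.fromkeys(unmatched))
    second = second_subject_words(matched, knowledge_base)
    return first, second
```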
S105, determining the theme of the text to be recognized based on the first subject term and the second subject term.
After the first subject term and the second subject term are obtained, the subject term more fitting the subject of the text to be recognized can be selected from the first subject term and the second subject term to represent the subject of the text to be recognized.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words that fail to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that match successfully, and a subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure, the novel word is considered as a subject word of the text, so that the accuracy of text subject recognition is improved.
Fig. 2 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 2, the method comprises the following steps:
S201, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S202, matching the first candidate word with a second candidate word in the word segmentation dictionary.
For specific descriptions of steps S201 to S202, reference may be made to the description of relevant contents in the above embodiments, and details are not repeated herein.
S203, identifying the first candidate word which repeatedly appears as a first subject word from the first candidate words which fail to be matched.
In some embodiments, the repeated first candidate word may be identified as the first subject word from the first candidate words that have failed to be matched through a text similarity algorithm, for example, a Simhash algorithm, a word2vec algorithm, or the like.
Further, after the repeatedly appearing first candidate word is identified as the first subject word, the method includes: storing the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
After the repeated first candidate word is identified as the first subject word from the first candidate words which are failed to be matched, the first subject word can be stored in the word segmentation dictionary, and the word segmentation dictionary is updated, so that the updated word segmentation dictionary can be directly used when the text topic is identified next time, and the identification efficiency of the text topic is improved.
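The selection of repeatedly appearing novel words and the dictionary update can be sketched as follows. This illustration substitutes plain exact-match counting for the text similarity algorithms mentioned above (Simhash, word2vec); the function name and the min_count threshold are assumptions and are not part of the disclosure.

```python
from collections import Counter


def repeated_first_subject_words(unmatched, dictionary, min_count=2):
    """Pick unmatched candidates that repeat, then update the dictionary."""
    counts = Counter(unmatched)
    # Counter keeps first-seen order, so the result is deterministic.
    repeated = [w for w, c in counts.items() if c >= min_count]
    # Store the new subject words so that later texts match them directly.
    dictionary.update(repeated)
    return repeated
```

After the call, the set passed as dictionary contains the newly identified words, so a subsequent run on another text would treat them as known entries.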
S204, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
S205, determining the theme of the text to be recognized based on the first subject term and the second subject term.
For specific descriptions of steps S204 to S205, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, the first candidate word that appears repeatedly is identified as a first subject word from the first candidate words that fail to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that match successfully, and a subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure, the novel word that appears repeatedly is used as a subject word of the text, so that the accuracy of text subject recognition is improved.
Fig. 3 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 3, the method comprises the following steps:
S301, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S302, the first candidate word is matched with a second candidate word in the word segmentation dictionary.
S303, determining a first subject word from the first candidate words with failed matching.
S304, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
For specific descriptions of steps S301 to S304, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
S305, filtering the first subject term and the second subject term to obtain filtered subject terms.
In some embodiments, the first subject term and the second subject term may be filtered by a filtering algorithm, for example, a string matching algorithm, a regular expression matching algorithm, a Deterministic Finite Automaton (DFA) algorithm, or the like, to obtain a filtered subject term, where the filtered subject term is a term that is more fit to a subject of the text to be recognized.
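A regular-expression-based filter could look like the following Python sketch. The disclosure does not state what is filtered out; the rules used here (dropping pure numbers, single-character tokens, and an optional stop-word list, then removing duplicates) are assumptions chosen only to show the shape of the step.

```python
import re

# Hypothetical noise pattern: a token made only of digits, or any
# single character.
_NOISE = re.compile(r"\d+|.")


def filter_subject_words(first, second, stop_words=frozenset()):
    """Merge both subject-word lists and drop noisy or duplicate entries."""
    kept = []
    seen = set()
    for word in list(first) + list(second):
        if word in seen or word in stop_words:
            continue
        if _NOISE.fullmatch(word):
            continue
        seen.add(word)
        kept.append(word)
    return kept
```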
S306, obtaining the score value of the filtered subject term, wherein the score value represents the importance degree of the subject term.
The score value of each filtered subject term is obtained according to the importance degree of the filtered subject term in the text to be recognized. Optionally, the importance degree of a subject term may be determined by its word frequency, information entropy, context information, and the like, which is not limited herein.
S307, the filtered subject words are sequenced according to the score values to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized.
After the score values of the filtered subject terms are obtained, the filtered subject terms can be sorted according to the score values to obtain corresponding subject term sequences, and the subject term sequences can represent subjects of the text to be recognized.
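The sorting step can be sketched as below. The sketch assumes the scores are already available as a mapping from subject term to numeric score; the descending order, the alphabetical tie-break, and the optional top_k cut-off are illustrative choices, not requirements of the disclosure.

```python
def rank_subject_words(scores, top_k=None):
    """Return subject terms ordered from highest to lowest score."""
    # Sort by descending score; break ties alphabetically so that the
    # resulting sequence is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [word for word, _ in ranked]
    return words if top_k is None else words[:top_k]
```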
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words which are failed to be matched, the first candidate word which is matched with a keyword in a knowledge base is determined as a second subject word from the first candidate words which are successfully matched, the first subject word and the second subject word are filtered to obtain a filtered subject word, a score value of the filtered subject word is obtained, the score value represents the importance degree of the subject word, the filtered subject word is sorted according to the score value to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized. In the embodiment of the disclosure, the first subject term and the second subject term are filtered, scored and sorted, so that a more refined text subject can be obtained, and the accuracy of the text subject is further improved.
Fig. 4 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 4, the method comprises the following steps:
S401, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S402, matching the first candidate word with a second candidate word in a word segmentation dictionary.
S403, determining a first subject word from the first candidate words with failed matching.
S404, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
S405, filtering the first subject term and the second subject term to obtain the filtered subject terms.
For specific descriptions of steps S401 to S405, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
S406, acquiring the word frequency of the filtered subject word.
The word frequency is the number of times that the filtered subject word appears in the text to be recognized.
S407, obtaining the score value of the filtered subject term according to the term frequency.
In some embodiments, the word frequency may be directly used as the score value of the filtered subject word.
In other embodiments, the score value of the filtered subject term may be calculated according to the Term Frequency (TF) and the Inverse Document Frequency (IDF) of the filtered subject term in the text to be recognized. Specifically, the term frequency of the filtered subject term in the text to be recognized may be multiplied by its inverse document frequency to obtain the term frequency-inverse document frequency (TF-IDF), which is used as the score value of the filtered subject term.
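A minimal TF-IDF scoring sketch is given below. The disclosure does not specify how the inverse document frequency is computed or which reference corpus is used; here the corpus is assumed to be a list of token sets, and a smoothed IDF of the form log((1 + N) / (1 + df)) + 1 is used, which is one common variant. All names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(subject_words, doc_tokens, corpus):
    """Score each subject term as term frequency times smoothed IDF."""
    tf = Counter(doc_tokens)          # raw counts in the text to be recognized
    n_docs = len(corpus)              # corpus: list of sets of tokens
    scores = {}
    for word in subject_words:
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = tf[word] * idf
    return scores
```

With this weighting, a term that is frequent in the text but rare in the reference corpus receives a higher score than an equally frequent term that appears in most documents.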
S408, sorting the filtered subject words according to the score values to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
For a specific description of step S408, reference may be made to the description of relevant contents in the above embodiments, and details are not repeated here.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words which are failed to be matched, the first candidate word which is matched with a keyword in a knowledge base is determined as a second subject word from the first candidate words which are successfully matched, the first subject word and the second subject word are filtered to obtain a filtered subject word, the word frequency of the filtered subject word is obtained, the score value of the filtered subject word is obtained according to the word frequency, the filtered subject word is sequenced according to the score value to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized. In the embodiment of the disclosure, the first subject term and the second subject term are filtered, scored and sequenced, so that a more refined text subject can be obtained, and the accuracy of the text subject is further improved.
Fig. 5 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 5, the method comprises the following steps:
S501, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S502, matching the first candidate word with a second candidate word in a word segmentation dictionary.
S503, determining a first subject word from the first candidate words with failed matching.
S504, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
S505, filtering the first subject term and the second subject term to obtain the filtered subject terms.
For specific descriptions of steps S501 to S505, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
S506, acquiring the information entropy of the filtered subject term.
In some embodiments, the information entropy of the filtered subject term may be obtained using an information-entropy-based dictionary-free word segmentation algorithm.
S507, acquiring the score value of the filtered subject term according to the information entropy.
The more information a subject term carries, the larger its information entropy and the higher its importance degree. The importance degree of each filtered subject term can therefore be scored according to its information entropy to obtain the score value of the subject term.
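The disclosure does not define which entropy is computed. One quantity commonly used in dictionary-free segmentation is the entropy of the tokens adjacent to a word (left and right boundary entropy); the Python sketch below scores each subject term by the smaller of the two. This choice, and all names in the code, are assumptions for illustration only.

```python
import math
from collections import Counter


def _entropy(counter):
    """Shannon entropy (base 2) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def boundary_entropy_scores(subject_words, tokens):
    """Score each subject term by min(left entropy, right entropy)."""
    left = {w: Counter() for w in subject_words}
    right = {w: Counter() for w in subject_words}
    for i, tok in enumerate(tokens):
        if tok in left:
            if i > 0:
                left[tok][tokens[i - 1]] += 1      # neighbour on the left
            if i + 1 < len(tokens):
                right[tok][tokens[i + 1]] += 1     # neighbour on the right
    return {w: min(_entropy(left[w]), _entropy(right[w])) for w in subject_words}
```

Under this assumption, a term that appears in varied contexts scores higher than a term that always appears between the same two neighbours.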
S508, sorting the filtered subject terms according to the score values to obtain a subject term sequence, wherein the subject term sequence represents the subject of the text to be recognized.
For a detailed description of step S508, reference may be made to the description of relevant contents in the above embodiments, which are not repeated herein.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words that fail to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that match successfully, the first subject word and the second subject word are filtered to obtain a filtered subject word, the information entropy of the filtered subject word is obtained, the score value of the filtered subject word is obtained according to the information entropy, the filtered subject word is sorted according to the score value to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized. In the embodiment of the disclosure, the first subject term and the second subject term are filtered, scored and sorted, so that a more refined text subject can be obtained, and the accuracy of the text subject is further improved.
In order to implement the above embodiments, the embodiments of the present disclosure further provide a text topic identification apparatus. Fig. 6 is a schematic structural diagram of a text topic identification apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the text topic identification apparatus 600 includes:
an acquiring module 610, configured to acquire a text to be recognized and perform word segmentation on the text to be recognized to obtain first candidate words;
a matching module 620, configured to match the first candidate words against second candidate words in a word segmentation dictionary;
a first determining module 630, configured to determine a first subject word from the first candidate words that fail to match;
a second determining module 640, configured to determine, from the first candidate words that match successfully, a first candidate word that matches a keyword in the knowledge base as a second subject word; and
a third determining module 650, configured to determine the subject of the text to be recognized based on the first subject word and the second subject word.
In this embodiment of the disclosure, a text to be recognized is obtained and segmented to obtain first candidate words, and the first candidate words are matched against the second candidate words in a word segmentation dictionary. A first subject word is determined from the first candidate words that fail to match, the first candidate words that match successfully and also match a keyword in the knowledge base are determined as second subject words, and the subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a new word. In this embodiment, such new words are also considered as subject words of the text, which improves the accuracy of text subject recognition.
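The matching stage described above can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function name `match_candidates` is hypothetical, the dictionary and knowledge base are modeled as plain Python sets, and whitespace splitting stands in for a real word segmenter.

```python
def match_candidates(text, segmentation_dict, knowledge_base):
    """Split first candidate words into unmatched words and second subject words."""
    # Assumption: whitespace splitting stands in for a real word segmenter.
    first_candidates = text.split()
    # Candidates absent from the segmentation dictionary fail to match (possible new words).
    unmatched = [w for w in first_candidates if w not in segmentation_dict]
    # Candidates present in the dictionary match successfully.
    matched = [w for w in first_candidates if w in segmentation_dict]
    # Matched candidates that also hit a knowledge-base keyword become second subject words.
    second_subject_words = [w for w in matched if w in knowledge_base]
    return unmatched, second_subject_words


unmatched, second = match_candidates(
    "the metaverse uses neural rendering",
    segmentation_dict={"the", "uses", "neural", "rendering"},
    knowledge_base={"neural", "rendering"},
)
print(unmatched)  # ['metaverse']
print(second)     # ['neural', 'rendering']
```

The unmatched list is the pool from which first subject words are later chosen; the sketch deliberately stops before that selection.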
In some embodiments, the first determining module 630 is further configured to: identify, from the first candidate words that fail to match, a first candidate word that appears repeatedly as a first subject word.
In some embodiments, the first determining module 630 is further configured to: after a first candidate word that appears repeatedly is identified as the first subject word, store the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
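A minimal sketch of these two behaviors is given below, under stated assumptions: "appears repeatedly" is interpreted as occurring at least `min_count` times (a hypothetical threshold, defaulting to 2), and the word segmentation dictionary is modeled as a mutable Python set. The function name is illustrative.

```python
from collections import Counter


def find_first_subject_words(unmatched, segmentation_dict, min_count=2):
    """Pick repeated unmatched candidates and add them to the dictionary."""
    counts = Counter(unmatched)
    # Assumption: a word "appears repeatedly" if it occurs at least min_count times.
    first_subject_words = [w for w, c in counts.items() if c >= min_count]
    # Store the new subject words so later texts match them directly.
    segmentation_dict.update(first_subject_words)
    return first_subject_words


seg_dict = {"text", "topic"}
first = find_first_subject_words(["metaverse", "blorp", "metaverse"], seg_dict)
print(first)                    # ['metaverse']
print("metaverse" in seg_dict)  # True
print("blorp" in seg_dict)      # False
```

Updating the dictionary in place means a word counted as new in one text would match successfully in the next one, which appears to be the intent of the update step.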
In some embodiments, the third determining module 650 is further configured to: filter the first subject word and the second subject word to obtain filtered subject words; obtain a score value for each filtered subject word, wherein the score value represents the importance of the subject word; and sort the filtered subject words according to the score values to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
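The filter, score and sort sequence might look like the sketch below. The filtering criteria are not spelled out in this passage, so stop-word removal and de-duplication are assumptions made purely for illustration; `rank_subject_words`, `score_fn` and `stopwords` are hypothetical names, and the scores in the usage example are made-up values.

```python
def rank_subject_words(first_words, second_words, score_fn, stopwords=frozenset()):
    """Filter the combined subject words, then sort them by descending score."""
    seen = set()
    filtered = []
    for word in list(first_words) + list(second_words):
        # Assumed filtering rule: drop stop words and duplicates.
        if word in stopwords or word in seen:
            continue
        seen.add(word)
        filtered.append(word)
    # A higher score means a more important subject word, so sort descending.
    return sorted(filtered, key=score_fn, reverse=True)


# Made-up scores, for illustration only.
scores = {"metaverse": 0.9, "neural": 0.7, "quantum": 0.4}
ranked = rank_subject_words(
    ["metaverse"], ["neural", "quantum", "the"], scores.get, stopwords={"the"}
)
print(ranked)  # ['metaverse', 'neural', 'quantum']
```

Passing the scoring function as a parameter keeps the ranking step independent of whether scores come from word frequency or from information entropy.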
In some embodiments, the third determining module 650 is further configured to: obtain the word frequency of each filtered subject word; and obtain the score value of the filtered subject word according to the word frequency.
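As a rough illustration of frequency-based scoring, the sketch below uses the relative frequency of each subject word in the segmented text as its score. The mapping from word frequency to score value is not specified here, so using the relative frequency directly is an assumption, and `term_frequency_scores` is a hypothetical name.

```python
from collections import Counter


def term_frequency_scores(tokens, subject_words):
    """Score each subject word by its relative frequency among the tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in subject_words}
    # Assumption: the score value is simply count / total number of tokens.
    return {w: counts[w] / total for w in subject_words}


tokens = ["metaverse", "uses", "metaverse", "rendering"]
print(term_frequency_scores(tokens, ["metaverse", "rendering"]))
# {'metaverse': 0.5, 'rendering': 0.25}
```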
In some embodiments, the third determining module 650 is further configured to: obtain the information entropy of each filtered subject word; and obtain the score value of the filtered subject word according to the information entropy.
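This passage does not say how the information entropy of a subject word is computed. One common choice in new-word discovery, used here purely as an assumption, is the Shannon entropy of the distribution of tokens that appear immediately next to the word. The function name `neighbor_entropy` is hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(tokens, word):
    """Shannon entropy (in bits) of the tokens adjacent to `word`."""
    neighbors = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        if i > 0:
            neighbors[tokens[i - 1]] += 1  # left neighbor
        if i + 1 < len(tokens):
            neighbors[tokens[i + 1]] += 1  # right neighbor
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the neighbor distribution.
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


tokens = ["x", "w", "y", "z", "w", "y"]
print(neighbor_entropy(tokens, "w"))  # 1.5
```

Under this assumed definition, a word that occurs in varied contexts gets a higher entropy, and the entropy could be used directly as the score value or combined with other signals.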
The text topic identification apparatus provided in this embodiment of the present disclosure may be used to implement the technical solution of the text topic identification method in the foregoing embodiments; its implementation principle and technical effect are similar and are not described again here.
In order to implement the foregoing embodiments, as shown in fig. 7, the present disclosure further provides an electronic device 700, which includes a memory 710, a processor 720 and a computer program stored in the memory 710 and executable on the processor 720, wherein the processor 720 executes the computer program to implement the text topic identification method proposed by the foregoing embodiments of the present disclosure.
In the description of the present specification, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality" means at least two, e.g., two, three, etc., unless expressly and specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, such uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided that they do not contradict one another.
Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.
Claims (9)
1. A text topic identification method is characterized by comprising the following steps:
acquiring a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
matching the first candidate word with a second candidate word in a word segmentation dictionary;
determining a first subject word from the first candidate words with failed matching;
determining a first candidate word matched with the keyword in the knowledge base as a second subject word from the successfully matched first candidate words;
and determining the subject of the text to be recognized based on the first subject term and the second subject term.
2. The method of claim 1, wherein determining the first subject word from the first candidate words that failed matching comprises:
identifying, from the first candidate words with failed matching, a first candidate word that appears repeatedly as the first subject word.
3. The method of claim 2, further comprising, after identifying the first candidate word that appears repeatedly as the first subject word:
storing the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
4. The method of claim 1, wherein determining the subject of the text to be recognized based on the first subject term and the second subject term comprises:
filtering the first subject term and the second subject term to obtain filtered subject terms;
obtaining the score value of the filtered subject term, wherein the score value represents the importance degree of the subject term;
and sorting the filtered subject words according to the score values to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
5. The method of claim 4, wherein obtaining the score value of the filtered subject term comprises:
acquiring the word frequency of the filtered subject word;
and acquiring the score value of the filtered subject term according to the term frequency.
6. The method of claim 4, wherein obtaining the score value of the filtered subject term comprises:
acquiring the information entropy of the filtered subject term;
and acquiring the score value of the filtered subject term according to the information entropy.
7. A text topic identification apparatus, comprising:
the acquisition module is used for acquiring a text to be recognized and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
the matching module is used for matching the first candidate word with a second candidate word in a word segmentation dictionary;
the first determining module is used for determining a first subject word from the first candidate words which fail to be matched;
the second determining module is used for determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words;
and the third determining module is used for determining the subject of the text to be recognized based on the first subject word and the second subject word.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
9. A computer-readable storage medium having computer instructions stored thereon for causing a computer to perform the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211409921.4A CN115935977A (en) | 2022-11-10 | 2022-11-10 | Text theme recognition method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211409921.4A CN115935977A (en) | 2022-11-10 | 2022-11-10 | Text theme recognition method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115935977A true CN115935977A (en) | 2023-04-07 |
Family
ID=86696696
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211409921.4A Pending CN115935977A (en) | 2022-11-10 | 2022-11-10 | Text theme recognition method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115935977A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116431814A (en) * | 2023-06-06 | 2023-07-14 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
CN116431814B (en) * | 2023-06-06 | 2023-09-05 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110502621B (en) | Question answering method, question answering device, computer equipment and storage medium | |
US11544459B2 (en) | Method and apparatus for determining feature words and server | |
US11100124B2 (en) | Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
JP7028858B2 (en) | Systems and methods for contextual search of electronic records | |
US10565533B2 (en) | Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches | |
US8452763B1 (en) | Extracting and scoring class-instance pairs | |
JP5751253B2 (en) | Information extraction system, method and program | |
US11625537B2 (en) | Analysis of theme coverage of documents | |
CN107577663B (en) | Key phrase extraction method and device | |
CN102214189B (en) | Data mining-based word usage knowledge acquisition system and method | |
US20140289238A1 (en) | Document creation support apparatus, method and program | |
US9063923B2 (en) | Method for identifying the integrity of information | |
TW201826145A (en) | Method and system for knowledge extraction from Chinese corpus useful for extracting knowledge from source corpuses mainly written in Chinese | |
CN110032622B (en) | Keyword determination method, keyword determination device, keyword determination equipment and computer readable storage medium | |
CN110597978A (en) | Article abstract generation method and system, electronic equipment and readable storage medium | |
JP2008152522A (en) | Data mining system, data mining method and data retrieval system | |
CN111369980A (en) | Voice detection method and device, electronic equipment and storage medium | |
WO2021017951A1 (en) | Dual monolingual cross-entropy-delta filtering of noisy parallel data and use thereof | |
CN105653553B (en) | Word weight generation method and device | |
CN115935977A (en) | Text theme recognition method and device and electronic equipment | |
CN109344397B (en) | Text feature word extraction method and device, storage medium and program product | |
Balog et al. | The university of amsterdam at weps2 | |
EP4270238A1 (en) | Extracting content from freeform text samples into custom fields in a software application | |
CN111061924A (en) | Phrase extraction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||