CN115935977A - Text theme recognition method and device and electronic equipment - Google Patents

Info

Publication number
CN115935977A
Authority
CN
China
Prior art keywords
word
subject
text
candidate
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211409921.4A
Other languages
Chinese (zh)
Inventor
梁玉晨
杨加畅
郭家义
朱芳
朱蓉华
林峰璞
石志国
肖益
李宝东
刘韶辉
张菁
穆显显
贾若
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Big Data Center
Taiji Computer Corp Ltd
Original Assignee
Beijing Big Data Center
Taiji Computer Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Big Data Center, Taiji Computer Corp Ltd filed Critical Beijing Big Data Center
Priority to CN202211409921.4A priority Critical patent/CN115935977A/en
Publication of CN115935977A publication Critical patent/CN115935977A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure provides a text theme identification method and device and electronic equipment. The text topic identification method comprises the following steps: acquiring a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word; matching the first candidate word with a second candidate word in a word segmentation dictionary; determining a first subject word from the first candidate words with failed matching; determining a first candidate word matched with the keyword in the knowledge base as a second subject word from the successfully matched first candidate words; and determining the subject of the text to be recognized based on the first subject term and the second subject term. When new types of words are included in the text, the present disclosure can identify the new types of words and extract accurate text topics.

Description

Text theme recognition method and device and electronic equipment
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text topic identification method and apparatus, and an electronic device.
Background
The theme is the central idea of a text; it summarizes and reflects the body and core of the text content. A theme label summarizes the main content of the text in a few words. In an era of information overload and rapid data growth, enterprises accumulate massive text resources. When the volume of text is large and the source channels are varied, a data set contains content from different fields and of different types, so obtaining the themes of texts from multiple angles and understanding the relationships among the texts become practical problems.
With the rapid development of the internet, new words are coined frequently and come into wide use. For text theme extraction, the prior art mainly adopts a knowledge-base-based recognition method, which cannot recognize new words. As a result, when the theme of a text involves new words, the prior art cannot extract an accurate text theme.
Disclosure of Invention
The embodiment of the disclosure provides a text theme identification method and device and electronic equipment.
An embodiment of a first aspect of the present disclosure provides a text topic identification method, including: acquiring a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
matching the first candidate word with a second candidate word in a word segmentation dictionary;
determining a first subject word from the first candidate words with failed matching;
determining a first candidate word matched with the keyword in the knowledge base as a second subject word from the successfully matched first candidate words;
and determining the subject of the text to be recognized based on the first subject term and the second subject term.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words that failed to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that matched successfully, and a subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure, such novel words are considered as subject words of the text, so that the accuracy of text subject recognition is improved.
In one embodiment of the present disclosure, the determining a first subject word from the first candidate words with failed matching includes: and identifying the first candidate word which repeatedly appears as the first subject word from the first candidate words which fail to be matched.
In one embodiment of the disclosure, after the identifying the repeatedly appearing first candidate word as the first subject word, the method includes: and storing the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
In an embodiment of the disclosure, the determining the topic of the text to be recognized based on the first topic word and the second topic word includes: filtering the first subject term and the second subject term to obtain filtered subject terms; obtaining the score value of the filtered subject term, wherein the score value represents the importance degree of the subject term; and sequencing the filtered subject words according to the score value to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
In an embodiment of the present disclosure, the obtaining the score value of the filtered topic word includes: acquiring the word frequency of the filtered subject word; and acquiring the score value of the filtered subject term according to the term frequency.
In an embodiment of the present disclosure, the obtaining the score value of the filtered topic word includes: acquiring the information entropy of the filtered subject term; and acquiring the score value of the filtered subject term according to the information entropy.
An embodiment of a second aspect of the present disclosure provides a text topic identification apparatus, including: the acquisition module is used for acquiring a text to be recognized and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
the matching module is used for matching the first candidate word with a second candidate word in a word segmentation dictionary;
the first determining module is used for determining a first subject word from the first candidate words with failed matching;
the second determining module is used for determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words;
and the third determining module is used for determining the theme of the text to be recognized based on the first theme word and the second theme word.
An embodiment of a third aspect of the present disclosure provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the text topic identification method provided by the embodiment of the first aspect of the disclosure.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a text topic identification method provided in an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 3 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 5 is a schematic flowchart of another text topic identification method provided in the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a text topic identification apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the appended claims.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information in the embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the like or similar elements throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present disclosure, and should not be construed as limiting the present disclosure.
The text topic identification method, the text topic identification device and the electronic device according to the embodiments of the present disclosure are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a text topic identification method provided in an embodiment of the present disclosure. As shown in fig. 1, the method comprises the following steps:
S101, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
It should be noted that the text to be recognized described in this embodiment may be a text expressed in any of various written languages, for example, a Chinese text, an English text, a Russian text, a Malay text, a mixed Chinese and English text, and the like. The text to be recognized may be a sentence, a paragraph, or a whole document, such as a news article.
Before word segmentation processing is performed on the text to be recognized, the text may be preprocessed. The preprocessing includes but is not limited to cleaning and normalizing the characters in the text to be recognized, such as punctuation, whitespace, Chinese and English characters, simplified and traditional characters, and the like.
In some embodiments, a dictionary-based word segmentation method, a statistics-based word segmentation method, a machine learning word segmentation method, or the like may be used to perform word segmentation on the text to be recognized, so as to obtain the first candidate words. There are multiple first candidate words.
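The cleaning step described above can be illustrated with a minimal Python sketch. The function name `preprocess` and the specific rules (Unicode NFKC normalization, replacing punctuation with spaces, collapsing whitespace) are assumptions chosen for illustration; the disclosure does not prescribe particular rules, and conversion between simplified and traditional characters is not covered here because it needs a mapping table.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Illustrative cleaning before segmentation (rules are assumptions)."""
    # Fold full-width letters, digits and punctuation into their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Hello,  world！"))
print(preprocess("ＡＢＣ 北京，你好。"))
```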
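As one example of the dictionary-based option mentioned above, the following sketch uses forward maximum matching, a standard greedy technique. It is not stated in the disclosure which segmenter is used; `forward_max_match`, `max_len`, and the sample dictionary are hypothetical. Note that a purely dictionary-based segmenter breaks unknown strings into single characters, so a statistical or learned segmenter would be needed to surface whole new words as candidates.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary-based segmentation (forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


sample_dictionary = {"北京", "大数据", "中心"}  # hypothetical entries
print(forward_max_match("北京大数据中心", sample_dictionary))
```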
S102, matching the first candidate word with a second candidate word in a word segmentation dictionary.
The word segmentation dictionary in the embodiment of the present disclosure may be a word segmentation dictionary commonly used in the technical field of natural language processing, or may be a word segmentation dictionary pre-constructed based on rules.
S103, determining a first subject word from the first candidate words with failed matching.
When the first candidate word is matched with the second candidate word in the word segmentation dictionary, the first candidate word which fails in matching can be regarded as a novel vocabulary, namely, a vocabulary which does not exist in the word segmentation dictionary, and the novel vocabulary is often related to the theme of the text to be recognized. In some embodiments, a first candidate word with a higher degree of correlation with the topic of the text to be recognized may be selected as the first topic word from the first candidate words with failed matching. In other embodiments, the first candidate word that fails to match is directly used as the first subject word.
S104, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
The knowledge base comprises named entities and other set keywords. Named entities generally refer to entities in a text that have a specific meaning or are strongly referential, and generally include names of people, names of places, names of organizations, dates and times, proper nouns, and the like. The set keywords can be added to the knowledge base as required.
The first candidate words that successfully matched a second candidate word in the word segmentation dictionary are then matched against the keywords in the knowledge base, and those that match are taken as second subject words.
S105, determining the theme of the text to be recognized based on the first subject term and the second subject term.
After the first subject term and the second subject term are obtained, the subject term more fitting the subject of the text to be recognized can be selected from the first subject term and the second subject term to represent the subject of the text to be recognized.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words that failed to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that matched successfully, and a subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure, such novel words are considered as subject words of the text, so that the accuracy of text subject recognition is improved.
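Steps S101 to S105 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: whitespace splitting stands in for a real word segmenter, S103 uses the variant in which every unmatched candidate becomes a first subject word, S105 simply merges the two groups, and the names `recognize_topic`, `seg_dictionary`, and `knowledge_base` are hypothetical.

```python
def recognize_topic(text, seg_dictionary, knowledge_base):
    """Sketch of S101-S105 with sets standing in for both resources."""
    # S101: whitespace split is a stand-in for word segmentation (assumption).
    first_candidates = text.lower().split()
    # S102: match each candidate against the word segmentation dictionary.
    matched = [w for w in first_candidates if w in seg_dictionary]
    unmatched = [w for w in first_candidates if w not in seg_dictionary]
    # S103: unmatched candidates are treated as novel words (first subject words).
    first_subject_words = unmatched
    # S104: matched candidates that are also knowledge-base keywords.
    second_subject_words = [w for w in matched if w in knowledge_base]
    # S105: merge both groups, dropping duplicates but keeping first-seen order.
    return list(dict.fromkeys(first_subject_words + second_subject_words))


topic = recognize_topic(
    "Beijing launches metaverse pilot in Beijing",
    seg_dictionary={"beijing", "launches", "pilot", "in"},
    knowledge_base={"beijing"},
)
print(topic)  # ['metaverse', 'beijing']
```

Here "metaverse" is absent from the dictionary and is kept as a new word, while "beijing" is kept because it is both in the dictionary and in the knowledge base.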
Fig. 2 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 2, the method comprises the following steps:
S201, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S202, matching the first candidate word with a second candidate word in the word segmentation dictionary.
For specific descriptions of steps S201 to S202, reference may be made to the description of relevant contents in the above embodiments, and details are not repeated herein.
S203, identifying the first candidate word which repeatedly appears as a first subject word from the first candidate words which fail to be matched.
In some embodiments, the repeated first candidate word may be identified as the first subject word from the first candidate words that have failed to be matched through a text similarity algorithm, for example, a Simhash algorithm, a word2vec algorithm, or the like.
Further, after the first subject word is determined from the first candidate words that failed to match, the method includes: storing the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
After the repeated first candidate word is identified as the first subject word from the first candidate words which are failed to be matched, the first subject word can be stored in the word segmentation dictionary, and the word segmentation dictionary is updated, so that the updated word segmentation dictionary can be directly used when the text topic is identified next time, and the identification efficiency of the text topic is improved.
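The repetition check and dictionary update can be sketched as below. The text names similarity algorithms such as Simhash and word2vec for spotting repeats; this sketch deliberately substitutes plain exact-match counting, which is simpler and will miss near-duplicates. The function name `repeated_new_words` and the threshold `min_count` are hypothetical.

```python
from collections import Counter


def repeated_new_words(unmatched_candidates, seg_dictionary, min_count=2):
    """Keep unmatched candidates seen at least min_count times and
    store them in the segmentation dictionary (exact-match counting)."""
    counts = Counter(unmatched_candidates)
    first_subject_words = [w for w, c in counts.items() if c >= min_count]
    # Update the dictionary so the next run can match these words directly.
    seg_dictionary.update(first_subject_words)
    return first_subject_words


dictionary = {"beijing", "pilot"}
print(repeated_new_words(["metaverse", "typo", "metaverse"], dictionary))
print("metaverse" in dictionary)
```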
S204, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
S205, determining the theme of the text to be recognized based on the first subject term and the second subject term.
For specific descriptions of steps S204 to S205, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, the first candidate word that appears repeatedly is identified as a first subject word from the first candidate words that failed to match, the first candidate word that matches a keyword in a knowledge base is determined as a second subject word from the first candidate words that matched successfully, and a subject of the text to be recognized is determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure, a novel word that appears repeatedly is used as a subject word of the text, so that the accuracy of text subject recognition is improved.
Fig. 3 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 3, the method comprises the following steps:
S301, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S302, the first candidate word is matched with a second candidate word in the word segmentation dictionary.
S303, determining a first subject word from the first candidate words with failed matching.
S304, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
For specific descriptions of steps S301 to S304, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
S305, filtering the first subject term and the second subject term to obtain filtered subject terms.
In some embodiments, the first subject term and the second subject term may be filtered by a filtering algorithm, for example, a string matching algorithm, a regular expression matching algorithm, a Deterministic Finite Automaton (DFA) algorithm, or the like, to obtain a filtered subject term, where the filtered subject term is a term that is more fit to a subject of the text to be recognized.
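A regular-expression filter of the kind mentioned above might look like the following sketch. The disclosure does not say what is filtered out, so the rules here are assumptions: drop words that consist only of digits, only of non-word characters, or of a single character, plus anything in a caller-supplied stopword set. `NOISE` and `filter_subject_words` are hypothetical names. The single-character rule would also discard one-character Chinese words, which may or may not be desirable.

```python
import re

# Hypothetical noise pattern: all digits, all punctuation, or one character.
NOISE = re.compile(r"\d+|\W+|.")


def filter_subject_words(words, stopwords=frozenset()):
    """Drop noisy or stop words and remove duplicates, keeping order."""
    kept = []
    for w in words:
        if w in stopwords or NOISE.fullmatch(w):
            continue
        if w not in kept:
            kept.append(w)
    return kept


print(filter_subject_words(
    ["metaverse", "2022", "!!", "a", "beijing", "metaverse", "the"],
    stopwords={"the"},
))  # ['metaverse', 'beijing']
```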
S306, obtaining the score value of the filtered subject term, wherein the score value represents the importance degree of the subject term.
And obtaining the score value of the filtered subject term according to the importance degree of the filtered subject term in the text to be recognized. Alternatively, the degree of importance of the subject word may be determined by the word frequency, the information entropy, the context information, and the like of the subject word, which is not limited herein.
S307, the filtered subject words are sequenced according to the score values to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized.
After the score values of the filtered subject terms are obtained, the filtered subject terms can be sorted according to the score values to obtain corresponding subject term sequences, and the subject term sequences can represent subjects of the text to be recognized.
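The sorting step is straightforward once scores exist. In this sketch the scores are supplied as a dictionary; descending order and alphabetical tie-breaking are assumptions made so the output is deterministic, and `rank_subject_words` is a hypothetical name.

```python
def rank_subject_words(scores):
    """Return subject words ordered by descending score.
    Ties are broken alphabetically (an assumption for determinism)."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered]


print(rank_subject_words({"beijing": 2.0, "metaverse": 3.5, "pilot": 2.0}))
# ['metaverse', 'beijing', 'pilot']
```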
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words which are failed to be matched, the first candidate word which is matched with a keyword in a knowledge base is determined as a second subject word from the first candidate words which are successfully matched, the first subject word and the second subject word are filtered to obtain a filtered subject word, a score value of the filtered subject word is obtained, the score value represents the importance degree of the subject word, the filtered subject word is sorted according to the score value to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized. In the embodiment of the disclosure, the first subject term and the second subject term are filtered, scored and sorted, so that a more refined text subject can be obtained, and the accuracy of the text subject is further improved.
Fig. 4 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 4, the method comprises the following steps:
S401, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S402, matching the first candidate word with a second candidate word in a word segmentation dictionary.
S403, determining a first subject word from the first candidate words with failed matching.
S404, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
S405, filtering the first subject term and the second subject term to obtain the filtered subject terms.
For specific descriptions of steps S401 to S405, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
S406, acquiring the word frequency of the filtered subject word.
The word frequency is the number of times that the filtered subject word appears in the text to be recognized.
S407, obtaining the score value of the filtered subject term according to the term frequency.
In some embodiments, the word frequency may be directly used as the score value of the filtered subject word.
In other embodiments, the score value of the filtered subject term may be calculated from its term frequency (TF) in the text to be recognized and its inverse document frequency (IDF). Specifically, the term frequency and the inverse document frequency of the filtered subject term may be multiplied to obtain the term frequency-inverse document frequency (TF-IDF), which is used as the score value of the filtered subject term.
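A minimal TF-IDF scoring sketch follows. Term frequency is the raw count in the text to be recognized, as defined above. The inverse document frequency requires a reference corpus, which the disclosure does not describe; the `corpus` argument (a list of token sets) and the smoothed formula ln((1 + N) / (1 + df)) + 1 are assumptions, the latter being one common variant that avoids division by zero for words absent from the corpus.

```python
import math
from collections import Counter


def tf_idf_scores(subject_words, doc_tokens, corpus):
    """Score each subject word as TF (raw count in doc_tokens) times a
    smoothed IDF computed over corpus, a list of token sets (assumption)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word in subject_words:
        # Number of reference documents that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf[word] * idf
    return scores


reference_corpus = [{"beijing", "pilot"}, {"beijing"}, {"metaverse", "pilot"}]
tokens = ["beijing", "metaverse", "metaverse"]
print(tf_idf_scores(["beijing", "metaverse"], tokens, reference_corpus))
```

In this toy input, "metaverse" scores higher than "beijing" because it occurs more often in the text and in fewer reference documents.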
S408, sorting the filtered subject words according to the score values to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
For a specific description of step S408, reference may be made to the description of relevant contents in the above embodiments, and details are not repeated here.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words which are failed to be matched, the first candidate word which is matched with a keyword in a knowledge base is determined as a second subject word from the first candidate words which are successfully matched, the first subject word and the second subject word are filtered to obtain a filtered subject word, the word frequency of the filtered subject word is obtained, the score value of the filtered subject word is obtained according to the word frequency, the filtered subject word is sequenced according to the score value to obtain a subject word sequence, and the subject word sequence represents the subject of the text to be recognized. In the embodiment of the disclosure, the first subject term and the second subject term are filtered, scored and sequenced, so that a more refined text subject can be obtained, and the accuracy of the text subject is further improved.
Fig. 5 is a schematic flowchart of a text topic identification method according to an embodiment of the present disclosure. As shown in fig. 5, the method comprises the following steps:
S501, obtaining a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word.
S502, matching the first candidate word with a second candidate word in a word segmentation dictionary.
S503, determining a first subject word from the first candidate words with failed matching.
S504, determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words.
S505, filtering the first subject term and the second subject term to obtain the filtered subject terms.
For specific descriptions of steps S501 to S505, reference may be made to the descriptions of relevant contents in the above embodiments, and details are not repeated here.
S506, acquiring the information entropy of the filtered subject term.
In some embodiments, the information entropy of the filtered subject word may be obtained using a dictionary-free word segmentation algorithm based on information entropy.
S507, acquiring the score value of the filtered subject term according to the information entropy.
The more information a subject term carries, the larger its information entropy and the higher its importance. The filtered subject terms can therefore be scored according to their information entropy to obtain their score values.
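The disclosure does not give a formula for the information entropy of a subject word. One common reading in dictionary-free new word discovery is the entropy of the word's left and right neighbors: a word that appears in many different contexts has high entropy. The sketch below follows that reading as an assumption; `entropy`, `neighbor_entropy_score`, and the choice of taking the smaller of the two entropies are illustrative, not taken from the source.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def neighbor_entropy_score(word, tokens):
    """Score a word by the smaller of its left and right neighbor entropies."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return min(entropy(left), entropy(right))


tokens = ["a", "x", "b", "c", "x", "d", "a", "y", "b", "a", "y", "b"]
print(neighbor_entropy_score("x", tokens))  # varied neighbors on both sides
print(neighbor_entropy_score("y", tokens))  # always between "a" and "b"
```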
S508, sorting the filtered subject terms according to the score values to obtain a subject term sequence, wherein the subject term sequence represents the subject of the text to be recognized.
For a detailed description of step S508, reference may be made to the description of relevant contents in the above embodiments, which are not repeated herein.
In the embodiment of the disclosure, a text to be recognized is obtained, word segmentation processing is performed on the text to be recognized to obtain a first candidate word, the first candidate word is matched with a second candidate word in a word segmentation dictionary, a first subject word is determined from the first candidate words which are failed to be matched, the first candidate word which is matched with a keyword in a knowledge base is determined as a second subject word from the first candidate words which are successfully matched, the first subject word and the second subject word are filtered to obtain a filtered subject word, the information entropy of the filtered subject word is obtained, the score value of the filtered subject word is obtained according to the information entropy, and the filtered subject word is sorted according to the score value to obtain a subject word sequence, where the subject word sequence represents the subject of the text to be recognized. In the embodiment of the disclosure, the first subject term and the second subject term are filtered, scored and sorted, so that a more refined text subject can be obtained, and the accuracy of the text subject is further improved.
In order to implement the above embodiments, the embodiments of the present disclosure further provide a text topic identification apparatus. Fig. 6 is a schematic structural diagram of a text topic identification apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the text topic identification apparatus 600 includes:
the acquiring module 610 is configured to acquire a text to be recognized, and perform word segmentation processing on the text to be recognized to obtain a first candidate word;
a matching module 620, configured to match the first candidate word with a second candidate word in a word segmentation dictionary;
a first determining module 630, configured to determine a first subject word from the first candidate words with failed matching;
the second determining module 640 is configured to determine, from the successfully matched first candidate words, that the first candidate word matched with the keyword in the knowledge base is a second subject word;
and a third determining module 650, configured to determine a topic of the text to be recognized based on the first subject term and the second subject term.
In the embodiment of the disclosure, a text to be recognized is obtained, and word segmentation processing is performed on it to obtain first candidate words. The first candidate words are matched against the second candidate words in a word segmentation dictionary. A first subject word is determined from the first candidate words that fail to match, and, among the first candidate words that match successfully, those that also match a keyword in a knowledge base are determined as second subject words. The subject of the text to be recognized is then determined based on the first subject word and the second subject word. A first candidate word that fails to match any second candidate word in the word segmentation dictionary is a novel word, and in the embodiment of the disclosure such a novel word is considered as a subject word of the text, so that the accuracy of text subject recognition is improved.
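The matching performed by modules 620 to 640 can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function name `split_candidates`, the use of Python sets for the word segmentation dictionary and the knowledge base, and the sample tokens are all assumptions introduced here, and the output of the word segmentation step is taken as given.

```python
def split_candidates(first_candidates, segmentation_dict, knowledge_base):
    """Separate segmented candidates into unmatched words and second subject words.

    first_candidates: list of words produced by word segmentation (assumed given).
    segmentation_dict: set of second candidate words (hypothetical representation).
    knowledge_base: set of keywords (hypothetical representation).
    """
    unmatched = []             # failed matching: possible novel words
    second_subject_words = []  # matched in the dictionary and in the knowledge base
    for word in first_candidates:
        if word not in segmentation_dict:
            unmatched.append(word)
        elif word in knowledge_base:
            second_subject_words.append(word)
    return unmatched, second_subject_words


# Hypothetical sample data, for illustration only.
candidates = ["data", "datahub", "government", "the", "datahub"]
dictionary = {"data", "government", "the"}
keywords = {"data", "government"}
unmatched, second = split_candidates(candidates, dictionary, keywords)
print(unmatched)  # ['datahub', 'datahub']
print(second)     # ['data', 'government']
```

Words that match the dictionary but not the knowledge base ("the" in the sample) are dropped, which mirrors the statement that only matched candidates that also match a keyword become second subject words.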
In some embodiments, the first determining module 630 is further configured to: identify, from the first candidate words that fail to match, a first candidate word that appears repeatedly as the first subject word.
In some embodiments, the first determining module 630 is further configured to: after the first candidate word that appears repeatedly is identified as the first subject word, store the first subject word in the word segmentation dictionary to update the word segmentation dictionary.
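A minimal sketch of this behaviour of the first determining module 630 is given below. The threshold `min_count=2` is an assumed reading of "appears repeatedly", and the function name and the set-based dictionary are hypothetical choices made for illustration.

```python
from collections import Counter


def find_first_subject_words(unmatched, segmentation_dict, min_count=2):
    """Return unmatched candidates that repeat, and add them to the dictionary.

    min_count is an assumption: the text only says the word "appears repeatedly".
    """
    counts = Counter(unmatched)
    # Counter keeps first-seen order, so the result order is deterministic.
    first_subject_words = [w for w, c in counts.items() if c >= min_count]
    # Update the word segmentation dictionary with the newly found words.
    segmentation_dict.update(first_subject_words)
    return first_subject_words


dictionary = {"data", "government"}
found = find_first_subject_words(["datahub", "typo", "datahub"], dictionary)
print(found)                    # ['datahub']
print("datahub" in dictionary)  # True
```

Updating the dictionary in place means that a later text containing the same word would match it in the dictionary instead of being treated as a novel word again.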
In some embodiments, the third determining module 650 is further configured to: filter the first subject word and the second subject word to obtain filtered subject words; obtain a score value of each filtered subject word, wherein the score value represents the importance degree of the subject word; and sort the filtered subject words according to the score values to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be recognized.
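The filter, score and sort steps can be sketched as follows. The filtering rule used here (dropping stop words and single-character words) is an assumption, since this passage does not state the filtering criteria; the scoring function is passed in as a parameter so that either a word-frequency or an information-entropy score can be plugged in. All names are hypothetical.

```python
def rank_subject_words(first_words, second_words, score_fn, stop_words=frozenset()):
    """Filter, score and sort subject words; highest score first."""
    # Merge both lists and remove duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(list(first_words) + list(second_words)))
    # Assumed filter: drop stop words and single-character words.
    filtered = [w for w in merged if w not in stop_words and len(w) > 1]
    # sorted() is stable, so equal scores keep their original order.
    return sorted(filtered, key=score_fn, reverse=True)


# Hypothetical scores, for illustration only.
scores = {"datahub": 3.0, "government": 2.0, "data": 1.0}
sequence = rank_subject_words(["datahub"], ["data", "government", "a"], scores.get)
print(sequence)  # ['datahub', 'government', 'data']
```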
In some embodiments, the third determining module 650 is further configured to: acquire the word frequency of each filtered subject word; and obtain the score value of the filtered subject word according to the word frequency.
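One possible word-frequency score is sketched below. Normalizing the count by the total number of tokens in the text is an assumption made for illustration; the passage only states that the score value is obtained according to the word frequency.

```python
from collections import Counter


def term_frequency_scores(tokens, subject_words):
    """Score each subject word by its relative frequency in the token list."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in subject_words}
    # Assumed scoring rule: occurrences divided by the total token count.
    return {w: counts[w] / total for w in subject_words}


tokens = ["data", "hub", "data", "policy"]
print(term_frequency_scores(tokens, ["data", "policy"]))
# {'data': 0.5, 'policy': 0.25}
```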
In some embodiments, the third determining module 650 is further configured to: acquire the information entropy of each filtered subject word; and obtain the score value of the filtered subject word according to the information entropy.
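This passage does not define how the information entropy of a subject word is computed. As one hedged illustration, the sketch below uses the Shannon entropy of the distribution of tokens that immediately follow the word, a measure often used in new-word discovery; this choice, and the function name, are assumptions and not the formula of the disclosure.

```python
import math
from collections import Counter


def right_neighbor_entropy(tokens, word):
    """Shannon entropy (in bits) of the tokens that directly follow `word`."""
    neighbors = Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == word
    )
    total = sum(neighbors.values())
    if total == 0:
        return 0.0  # word absent, or it only appears as the last token
    # H = sum over neighbors of p * log2(1 / p), with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in neighbors.values())


tokens = ["x", "a", "x", "b", "x", "a", "x", "b"]
print(right_neighbor_entropy(tokens, "x"))  # 1.0
print(right_neighbor_entropy(tokens, "a"))  # 0.0
```

Under this assumed measure, a word followed by many different tokens receives a higher entropy and would therefore rank higher when the entropy is used as the score value.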
The text topic identification apparatus provided in the embodiment of the present disclosure may be used to implement the technical solution of the text topic identification method of the first-aspect embodiments described above; its implementation principle and technical effect are similar and are not described herein again.
In order to implement the foregoing embodiments, as shown in fig. 7, the present disclosure further provides an electronic device 700, which includes a memory 710, a processor 720 and a computer program stored in the memory 710 and executable on the processor 720, wherein the processor 720 executes the computer program to implement the text topic identification method proposed by the foregoing embodiments of the present disclosure.
In the description of the present specification, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples, and features of different embodiments or examples, described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.

Claims (9)

1. A text topic identification method is characterized by comprising the following steps:
acquiring a text to be recognized, and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
matching the first candidate word with a second candidate word in a word segmentation dictionary;
determining a first subject word from the first candidate words with failed matching;
determining a first candidate word matched with the keyword in the knowledge base as a second subject word from the successfully matched first candidate words;
and determining the subject of the text to be recognized based on the first subject term and the second subject term.
2. The method of claim 1, wherein determining the first subject word from the first candidate words that failed matching comprises:
and identifying the repeated first candidate word as the first subject word from the first candidate words with failed matching.
3. The method of claim 2, wherein, after identifying the first candidate word that appears repeatedly as the first subject word, the method further comprises:
storing the first subject term in the word segmentation dictionary to update the word segmentation dictionary.
4. The method of claim 1, wherein determining the subject of the text to be recognized based on the first subject term and the second subject term comprises:
filtering the first subject term and the second subject term to obtain filtered subject terms;
obtaining the score value of the filtered subject term, wherein the score value represents the importance degree of the subject term;
and sequencing the filtered subject words according to the score values to obtain a subject word sequence, wherein the subject word sequence represents the subject of the text to be identified.
5. The method of claim 4, wherein obtaining the score value of the filtered subject term comprises:
acquiring the word frequency of the filtered subject word;
and acquiring the score value of the filtered subject term according to the term frequency.
6. The method of claim 4, wherein obtaining the score value of the filtered subject term comprises:
acquiring the information entropy of the filtered subject term;
and acquiring the score value of the filtered subject term according to the information entropy.
7. A text topic identification apparatus, comprising:
the acquisition module is used for acquiring a text to be recognized and performing word segmentation processing on the text to be recognized to obtain a first candidate word;
the matching module is used for matching the first candidate word with a second candidate word in a word segmentation dictionary;
the first determining module is used for determining a first subject word from the first candidate words which fail to be matched;
the second determining module is used for determining the first candidate words matched with the keywords in the knowledge base as second subject words from the successfully matched first candidate words;
and the third determining module is used for determining the subject of the text to be recognized based on the first subject word and the second subject word.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
9. A computer-readable storage medium having computer instructions stored thereon for causing a computer to perform the method of any one of claims 1-6.
CN202211409921.4A 2022-11-10 2022-11-10 Text theme recognition method and device and electronic equipment Pending CN115935977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211409921.4A CN115935977A (en) 2022-11-10 2022-11-10 Text theme recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211409921.4A CN115935977A (en) 2022-11-10 2022-11-10 Text theme recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115935977A true CN115935977A (en) 2023-04-07

Family

ID=86696696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211409921.4A Pending CN115935977A (en) 2022-11-10 2022-11-10 Text theme recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115935977A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116431814A (en) * 2023-06-06 2023-07-14 北京中关村科金技术有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN116431814B (en) * 2023-06-06 2023-09-05 北京中关村科金技术有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
US11544459B2 (en) Method and apparatus for determining feature words and server
US11100124B2 (en) Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
CN106649818B (en) Application search intention identification method and device, application search method and server
JP7028858B2 (en) Systems and methods for contextual search of electronic records
US10565533B2 (en) Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
US8452763B1 (en) Extracting and scoring class-instance pairs
JP5751253B2 (en) Information extraction system, method and program
US11625537B2 (en) Analysis of theme coverage of documents
CN107577663B (en) Key phrase extraction method and device
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
US20140289238A1 (en) Document creation support apparatus, method and program
US9063923B2 (en) Method for identifying the integrity of information
TW201826145A (en) Method and system for knowledge extraction from Chinese corpus useful for extracting knowledge from source corpuses mainly written in Chinese
CN110032622B (en) Keyword determination method, keyword determination device, keyword determination equipment and computer readable storage medium
CN110597978A (en) Article abstract generation method and system, electronic equipment and readable storage medium
JP2008152522A (en) Data mining system, data mining method and data retrieval system
CN111369980A (en) Voice detection method and device, electronic equipment and storage medium
WO2021017951A1 (en) Dual monolingual cross-entropy-delta filtering of noisy parallel data and use thereof
CN105653553B (en) Word weight generation method and device
CN115935977A (en) Text theme recognition method and device and electronic equipment
CN109344397B (en) Text feature word extraction method and device, storage medium and program product
Balog et al. The university of amsterdam at weps2
EP4270238A1 (en) Extracting content from freeform text samples into custom fields in a software application
CN111061924A (en) Phrase extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination