CN112711940B - Information processing system, information processing method and non-transitory computer readable recording medium - Google Patents


Info

Publication number
CN112711940B
CN112711940B (application CN201910950217.1A)
Authority
CN
China
Prior art keywords
list
text
keywords
category
words
Prior art date
Legal status
Active
Application number
CN201910950217.1A
Other languages
Chinese (zh)
Other versions
CN112711940A (en)
Inventor
曾俋颖
汤珮茹
Current Assignee
Delta Electronics Inc
Original Assignee
Delta Electronics Inc
Priority date
Filing date
Publication date
Application filed by Delta Electronics Inc
Priority to CN201910950217.1A
Publication of CN112711940A
Application granted
Publication of CN112711940B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An information processing system, an information processing method, and a non-transitory computer readable recording medium are provided. The information processing system includes at least one processor, a communication interface, and a database. The communication interface is coupled to the at least one processor. The database is coupled to the at least one processor and is configured to store at least one text received from the communication interface. The at least one processor is configured to: obtain a plurality of training words using basic feature information of a plurality of words of the at least one text; classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; match a plurality of keywords of the first list and the second list in a text to be annotated, and calculate confidence values of the text to be annotated with respect to the first list and the second list respectively; and label the text to be annotated as the first category or the second category according to the confidence values.

Description

Information processing system, information processing method and non-transitory computer readable recording medium
Technical Field
The present disclosure relates to a processing system and a processing method, and more particularly, to an information processing system and an information processing method.
Background
Conventionally, text labeling is performed manually: an analyst reads the articles one by one and labels them based on experience. However, such an approach is quite time consuming, and the labeling results depend heavily on the analyst's experience. Furthermore, since the articles must be read by an analyst, there is considerable risk in terms of data security.
On the other hand, training a classification model by machine learning requires a large number of accurately labeled articles to ensure the model's accuracy. If the labeled articles are too few or of poor quality, the accuracy will be low. Accordingly, how to improve classification accuracy and data confidentiality at the same time is a technical problem to be solved in the field of text classification.
Disclosure of Invention
This summary is intended to provide a simplified overview of the disclosure so that the reader has a basic understanding of it. This summary is not an extensive overview of the disclosure, and is intended neither to identify key or critical elements of the embodiments nor to delineate the scope of the disclosure.
According to one embodiment of the present disclosure, an information processing system is disclosed, comprising at least one processor, a communication interface, and a database. The communication interface is coupled to the at least one processor. The database is coupled to the at least one processor and is configured to store at least one text received from the communication interface. The at least one processor is configured to: obtain a plurality of training words using basic feature information of a plurality of words of the at least one text; classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; match a plurality of keywords of the first list and the second list in a text to be annotated, and calculate confidence values of the text to be annotated with respect to the first list and the second list respectively; and label the text to be annotated as the first category or the second category according to the confidence values.
According to another embodiment, an information processing method is disclosed, comprising: obtaining a plurality of training words using basic feature information of a plurality of words of at least one text; classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; matching a plurality of keywords of the first list and the second list in a text to be annotated, and calculating confidence values of the text to be annotated with respect to the first list and the second list respectively; and labeling the text to be annotated as the first category or the second category according to the confidence values.
According to another embodiment, a non-transitory computer readable recording medium storing a plurality of program codes is disclosed. When the program codes are loaded into at least one processor, the at least one processor executes the program codes to perform the following steps: obtaining a plurality of training words using basic feature information of a plurality of words of at least one text; classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; matching a plurality of keywords of the first list and the second list in a text to be annotated, and calculating confidence values of the text to be annotated with respect to the first list and the second list respectively; and labeling the text to be annotated as the first category or the second category according to the confidence values.
Drawings
The following detailed description, when read in conjunction with the accompanying drawings, will facilitate a better understanding of embodiments of the present disclosure. It should be noted that the features in the drawings are not necessarily drawn to scale; in fact, the dimensions of the various features may be arbitrarily increased or decreased for clarity of discussion.
FIG. 1 is a functional block diagram illustrating an information processing system according to some embodiments of the present disclosure.
Fig. 2 is a flowchart illustrating an information processing method according to some embodiments of the present disclosure.
Fig. 3 is a flowchart illustrating an information processing method according to further embodiments of the present disclosure.
Reference numerals:
100: information processing system
110: processor
120: communication interface
130: database
140: user interface
S210-S240, S310-S330: steps
Detailed Description
The following disclosure provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of elements and arrangements are described below to simplify the present disclosure. Of course, these examples are merely illustrative and are not intended to be limiting. For example, forming a first feature over or on a second feature in the description below may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features are formed between the first and second features such that the first and second features are not in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Referring to FIG. 1, a functional block diagram of an information processing system 100 according to some embodiments of the present disclosure is shown. As shown in FIG. 1, the information processing system 100 includes a processor 110, a communication interface 120, and a database 130. In some embodiments, data processing may be performed by at least one processor 110, such that the information processing system 100 operates in a multithreading environment. For ease of illustration, the present disclosure is described below in terms of an embodiment with a single processor 110.
The communication interface 120 is coupled to the processor 110 and is configured to transmit text data to and receive text data from another device or system (not shown). In some embodiments, the communication interface 120 may be, but is not limited to, a communication chip supporting Global System for Mobile Communications (GSM), Long Term Evolution (LTE), Worldwide Interoperability for Microwave Access (WiMAX), wireless fidelity (Wi-Fi), Bluetooth, wired networks, and the like.
The database 130 is coupled to the processor 110. In some embodiments, the information processing system 100 may be provided with an external database (not shown) outside the system, communicatively coupled to the processor 110 via the communication interface 120, to access data outside the system.
In some embodiments, the database 130 is configured to store at least one text received via the communication interface 120. The text may be a file in any language.
Referring to fig. 2, a flowchart illustrating an information processing method according to some embodiments of the present disclosure is shown. The information processing method of fig. 2 may be performed by the information processing system 100 of fig. 1. For convenience of explanation of the information processing method of fig. 2, various related terms or elements will be explained with reference to fig. 1.
In step S210, a plurality of training words are obtained using basic feature information of a plurality of words of at least one text.
In some embodiments, the processor 110 uses words in text as a basis for training the keywords of the dictionary.
First, the processor 110 parses the words in the text through natural language processing techniques, for example by segmenting the text into vocabulary terms or word breaks. The processor 110 then obtains the basic feature information of each word according to a pre-established database (not shown). The basic feature information may be, but is not limited to, the word's mutual information (MI), entropy, term frequency (TF), combination change value (accessor variety, AV), and context relation value (position). In some embodiments, the processor 110 calculates a reference value for each word using a comprehensive weighting formula such as formula (1).
W(new word) = α×W_MI + β×W_entropy + γ×W_TF + δ×W_AV + ε×W_position, where 0 < α, β, γ, δ, ε < 1 (1)
In formula (1), W(new word) is the reference value of a word; W_MI is the mutual information of the word; W_entropy is the entropy of the word; W_TF is the term frequency of the word; W_AV is the combination change value between the word and the words to its left and right; W_position is the relative relation value between the word and its context; and α, β, γ, δ, and ε are probability values. Mutual information estimates the degree of tightness or relevance between the word and its adjacent words, while entropy estimates the degree of freedom between the word and its adjacent words. Both are standard quantities in information theory and are not described in detail here.
Therefore, by adjusting the probability value of each piece of basic feature information in formula (1), different weightings can be used to find candidate keywords.
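As a concrete illustration, the following is a minimal Python sketch of the reference-value calculation of formula (1); the feature values, probability weights, and threshold comparison are hypothetical placeholders rather than values taken from the present disclosure.

    def reference_value(features, weights):
        # Weighted sum of the five basic feature values of a candidate word,
        # per formula (1): W(new word) = alpha*W_MI + beta*W_entropy + ...
        keys = ('MI', 'entropy', 'TF', 'AV', 'position')
        return sum(weights[k] * features[k] for k in keys)

    # Hypothetical probability weights (alpha..epsilon) and feature values:
    weights = {'MI': 0.3, 'entropy': 0.2, 'TF': 0.25, 'AV': 0.15, 'position': 0.1}
    features = {'MI': 0.8, 'entropy': 0.6, 'TF': 0.9, 'AV': 0.4, 'position': 0.5}
    w = reference_value(features, weights)  # 0.695 with the values above
    # In step S220 below, a word whose reference value exceeds a category's
    # threshold would be added to that category's keyword list.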
In step S220, the processor 110 classifies the training words to create a plurality of lists corresponding to the plurality of categories, respectively.
In some embodiments, the processor 110 may set different thresholds to determine the classification of keywords. For example, suppose the training words detected in the text are "artificial intelligence server", "intelligent robot", "virtual assistant", "natural language", and "home appliance", and only the first four have reference values greater than a first threshold; those four training words are then set as keywords in the first list, which relates to artificial intelligence (the first category). As another example, suppose the training words detected in the text are "financial transaction", "smart contract", and "bank", and their reference values are greater than a second threshold; those training words are then set as keywords in the second list, which relates to blockchain (the second category). By analogy, the processor 110 may build many different lists.
In some embodiments, the keywords of the first list are organized into a dictionary for artificial intelligence, and the keywords of the second list are organized into a dictionary for blockchain. In this way, the information processing system 100 may classify or label the content of texts to be classified based on these dictionary files. Note that the terms "list" and "dictionary" are used interchangeably in this disclosure.
In step S225, the processor 110 determines whether the training of the dictionary is completed.
In some embodiments, steps S210 to S220 may be regarded as one loop. In the list-building method of the present disclosure, this loop may be performed repeatedly to obtain training words from the same or different texts multiple times, so that the keywords classified into the lists of the various categories become more accurate. For example, the training word "bank" may be classified into the second list (the blockchain category) as one of its keywords in the first loop, and then, in the second loop, be removed from the second list because it fits the "blockchain" classification relatively poorly. In this way, the lists of keywords can be continuously updated and optimized by executing multiple loops.
In some embodiments, the information processing method of the present disclosure uses a word extraction algorithm to reduce the time required to train words and to improve their accuracy. For example, the word extraction algorithm is the TextRank algorithm, shown in equation (2):
WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ W_ji / Σ_{V_k ∈ Out(V_j)} W_jk ] × WS(V_j) (2)
In equation (2), V_i, V_j, and V_k are different nodes; WS(V_i) is the weight of node V_i; W_ji is the edge weight from node V_j to node V_i; In(V_i) is the set of all nodes pointing to node V_i; Out(V_j) is the set of all nodes pointed to by node V_j; and d is an adjustment coefficient (e.g., 0.85).
In some embodiments, when the word extraction algorithm is executed, the edge weight W_ji in equation (2) incorporates the occurrence-frequency and popularity information from the term frequency-inverse document frequency (TF-IDF) technique. The occurrence frequency and popularity of different words can thus be considered when calculating each node's weight value, and the iterative calculation of equation (2) converges faster. For example, the processor 110 calculates the weight values of N training words using equation (2). After sorting the weight values (e.g., from largest to smallest), the top few (e.g., 50) training words are set as keywords and can be added to the lists.
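The following is a minimal Python sketch of the TextRank iteration of equation (2) on a small hypothetical graph whose edge weights are assumed to have already been derived from TF-IDF; it illustrates the technique rather than the exact implementation of the present disclosure.

    def textrank(out_edges, d=0.85, iters=50, tol=1e-6):
        # out_edges: {node: {neighbor: edge weight W_ij}} for a directed graph.
        nodes = set(out_edges)
        for nbrs in out_edges.values():
            nodes |= set(nbrs)
        ws = {v: 1.0 for v in nodes}  # initial WS(V_i)
        # Total outgoing weight of each node: the inner sum over Out(V_j) in (2).
        out_sum = {v: sum(out_edges.get(v, {}).values()) or 1.0 for v in nodes}
        # Invert the graph to obtain In(V_i).
        in_edges = {v: {} for v in nodes}
        for j, nbrs in out_edges.items():
            for i, w in nbrs.items():
                in_edges[i][j] = w
        for _ in range(iters):
            new_ws = {i: (1 - d) + d * sum(w / out_sum[j] * ws[j]
                                           for j, w in in_edges[i].items())
                      for i in nodes}
            if max(abs(new_ws[v] - ws[v]) for v in nodes) < tol:
                return new_ws
            ws = new_ws
        return ws

    # Hypothetical graph: TF-IDF-weighted co-occurrence edges between candidates.
    graph = {'ai': {'robot': 1.2, 'nlp': 0.8},
             'robot': {'ai': 1.2},
             'nlp': {'ai': 0.8}}
    scores = textrank(graph)
    keywords = sorted(scores, key=scores.get, reverse=True)  # keep the top words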
In step S230, the processor 110 matches the keywords of the lists against the text to be annotated to calculate a confidence value for each list.
In some embodiments, the present disclosure uses a multi-string multi-dictionary (MSMD) matching algorithm for labeling the text. For example, the plurality of lists obtained in step S220 serve as a plurality of dictionaries D[1, …, d], where the dictionaries (e.g., dictionary 1 to dictionary d) are of mutually exclusive types. Each dictionary contains a plurality of word strings S[1, …, s]. In the matching procedure, the processor 110 may take a main string T from the text to be annotated and determine, one dictionary at a time, whether that dictionary is a matching category for T, for example by searching each dictionary for keywords that exactly match the main string T.
For example, the processor 110 sets the keywords in the first list as a plurality of first node values (or first template strings) of a dictionary tree (trie), and sets the keywords in the second list as a plurality of second node values (or second template strings) of the same dictionary tree. In other words, all keywords are integrated into one dictionary tree.
Then, the processor 110 compares the words of the text to be annotated against the first node values and the second node values simultaneously. During the matching procedure, each main string T taken from the text to be annotated is automatically searched against the first template strings of the dictionary tree, the words of the main string T being aligned one by one with each first template string. In one embodiment, when the main string T completely matches any of the first template strings, the processor 110 records that template string, the number of times it occurs in the text to be annotated, and the locations where it occurs. Similarly, the words of the main string T are aligned one by one with each second template string, and when the main string T completely matches any of the second template strings, the processor 110 records that template string, its number of occurrences in the text to be annotated, and the locations where it occurs.
In some embodiments, the data structure of the dictionary tree stores strings sharing the same prefix along shared nodes (e.g., each character is stored in one node, so the height of the dictionary tree is the longest string length plus one); each string therefore corresponds to a unique node. When the dictionary tree is searched for the main string T, the search starts from the root node and proceeds layer by layer toward the child nodes. Since pointers (indexes) are used to record the strings in the dictionary tree, the processor 110 uses finite-state-machine control (e.g., the Aho-Corasick algorithm), together with the pre-constructed template strings, to update the pointer while traversing the dictionary tree. When matching fails at any character of the main string T, the search returns to a fallback state of the finite state machine and turns to another branch of the dictionary tree, avoiding repeated matching of the same prefix. This reduces the time needed to search for the main string T and improves the efficiency of searching the dictionary tree.
It should be noted that the present disclosure is not limited to dictionary tree algorithms, and any multi-string search algorithm is within the scope of the present disclosure.
Further, the present disclosure builds one dictionary tree from all the keywords of all the dictionaries according to the same-prefix rule. Because this single dictionary tree contains all keywords of all dictionaries, a main string T can be matched against all dictionaries simultaneously in the matching procedure. Compared with the common approach, in which only one dictionary can be matched at a time, the simultaneous multi-dictionary matching of the present disclosure greatly improves keyword-matching efficiency.
In the following, two dictionaries (lists) are integrated into one dictionary tree, where the keywords of the first list correspond to a plurality of first nodes and the keywords of the second list correspond to a plurality of second nodes.
In some embodiments, the processor 110 records the number of words of the text to be annotated that match the first node values (i.e., a first matching number), and records the number of words of the text to be annotated that match the second node values (i.e., a second matching number). The processor 110 then sets the first matching number as the confidence value of the first list and the second matching number as the confidence value of the second list.
In step S240, the processor 110 labels the text to be annotated as at least one of the categories according to the confidence values.
In some embodiments, the processor 110 takes the maximum of the confidence value of the first list and the confidence value of the second list. For example, if the confidence value of the first list is the maximum, the text to be annotated is labeled with the category corresponding to the first list (e.g., artificial intelligence); if the confidence value of the second list is the maximum, the text to be annotated is labeled with the category corresponding to the second list (e.g., blockchain). In another embodiment, the text to be annotated may also be labeled with more than one category.
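The following Python sketch illustrates the core of the multi-string multi-dictionary matching idea of steps S230 and S240: the keywords of all lists are merged into a single dictionary tree whose terminal nodes remember their category, and one scan of the text accumulates per-category match counts that serve as the confidence values. The keyword lists are hypothetical, and the Aho-Corasick failure links are omitted for brevity (this version simply restarts from each position).

    from collections import Counter

    def build_trie(dictionaries):
        # dictionaries: {category: [keyword, ...]}. Keywords sharing a prefix
        # share nodes; a '$' marker records which category a keyword belongs to.
        root = {}
        for category, keywords in dictionaries.items():
            for kw in keywords:
                node = root
                for ch in kw:
                    node = node.setdefault(ch, {})
                node.setdefault('$', []).append(category)
        return root

    def match_counts(trie, text):
        # Walk the trie from every start position and count keyword hits per
        # category; the counts serve as the confidence values of the lists.
        counts = Counter()
        for start in range(len(text)):
            node = trie
            for ch in text[start:]:
                if ch not in node:
                    break
                node = node[ch]
                for category in node.get('$', []):
                    counts[category] += 1
        return counts

    dictionaries = {'artificial intelligence': ['virtual assistant', 'natural language'],
                    'blockchain': ['smart contract', 'financial transaction']}
    trie = build_trie(dictionaries)
    text = 'the bank recorded each financial transaction in a smart contract'
    counts = match_counts(trie, text)  # Counter({'blockchain': 2})
    label = counts.most_common(1)[0][0] if counts else None  # step S240: max wins

A production implementation would add the Aho-Corasick failure links described above so the text is scanned in a single pass rather than restarting at every position.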
Referring to FIG. 3, a flowchart illustrating an information processing method according to further embodiments of the present disclosure is shown. This information processing method can further update the existing lists so that the keywords of each category become more accurate.
In step S310, the processor 110 obtains at least one of a plurality of first keywords, a plurality of second keywords, and a plurality of third keywords using the basic feature information of the words in a new text. The step of obtaining the keywords is as described in the foregoing steps S210 to S220 and is not repeated here.
In some embodiments, the processor 110 may receive the new text through the communication interface 120. The new text may be any text usable for training the lists, such as text already stored in the database 130, a previously labeled text, or text that has not yet been used in a training procedure.
In some embodiments, if keywords that can be classified into existing categories are found in the new text, step S320 is performed.
In step S320, the processor 110 updates the first list corresponding to the first category according to the first keywords and/or updates the second list corresponding to the second category according to the second keywords.
In another embodiment, if keywords that cannot be classified into any existing category (e.g., the third keywords) are found in the new text, step S330 is performed.
In step S330, the processor 110 creates a third list corresponding to a third category according to the third keywords.
For example, keywords detected in the text such as "tablet", "display", "optical film", and "glass screen" belong neither to artificial intelligence (the first category) nor to blockchain (the second category). The processor 110 therefore creates a third list corresponding to electronic information (the third category).
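A minimal Python sketch of the update flow of steps S320 and S330 might look as follows; the category names and keywords are illustrative, and the keyword-mining step itself is assumed to be the training procedure of steps S210 to S220.

    def update_dictionaries(dictionaries, mined):
        # mined: {category: [keywords]} extracted from a new text. An existing
        # category is updated in place (step S320); an unknown category gets a
        # brand-new list (step S330).
        for category, keywords in mined.items():
            merged = set(dictionaries.get(category, ())) | set(keywords)
            dictionaries[category] = sorted(merged)
        return dictionaries

    dicts = {'artificial intelligence': ['virtual assistant'],
             'blockchain': ['smart contract']}
    # 'electronic information' is not an existing category, so a third list is created.
    update_dictionaries(dicts, {'electronic information': ['tablet', 'optical film']})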
Referring back to FIG. 1, the information processing system 100 further includes a user interface 140 coupled to the processor 110. The user interface 140 may be a graphical user interface, a keyboard, a screen, a mouse, or the like, providing the user with related operations. For example, a graphical user interface is provided for building the plurality of lists and their keywords.
Referring to Table 1, a plurality of lists and their keywords are shown.
Table 1: multiple lists (hereinafter, the dictionary file)
In some embodiments, the multiple lists of the present disclosure may serve different labeling requirements. For example, if the texts to be annotated are a plurality of YAHOO news texts, the information processing system 100 may use a dictionary file such as Table 1 to label all the YAHOO news texts as described above: a first news article may be labeled as related to "blockchain" and "big data", while a second article is labeled as related to "semiconductor".
In other embodiments, if the texts to be annotated are multiple texts of Eastern news, the user interface 140 may be configured to receive an operation instruction for the processor 110 to modify the categories. For example, artificial intelligence (the first category) may be modified into smart home appliances (a fourth category), so that the smart home appliance category contains all the keywords of artificial intelligence. Similarly, blockchain (the second category) may be modified into e-commerce (a fifth category), so that the e-commerce category contains all the keywords of blockchain.
In other embodiments, the user interface 140 allows a user (e.g., a domain expert) to evaluate whether each list of the dictionary file and its keywords are correct, and whether the classified texts are labeled correctly. If an unsuitable portion is found, the domain expert can correct the erroneous portion through the user interface 140, avoiding repeated labeling or inconsistent standards.
In this way, after one stage of training is completed and the dictionary file is created, the information processing system 100 of the present disclosure is compatible with text providers having different labeling requirements. When providing annotation services to different text providers, the dictionary file need not be retrained for each provider (perhaps only fine-tuned), so the existing dictionary file can be applied to different text providers. In other words, by replacing the dictionary categories and the input texts, the system can be adapted rapidly to different fields and data sources, improving working efficiency.
In some embodiments, the texts of multiple (e.g., 195) company websites are labeled based on the five category tags in the dictionary file of Table 1. The texts of a predetermined portion (e.g., 15) of the company websites have been classified into tags in advance, so the text labeling steps described above are performed on another portion (e.g., 80) of the company websites. For example, the training steps (e.g., steps S210 to S225) are performed on the texts of the 15 labeled company websites to obtain a dictionary file (e.g., Table 1). Then, the labeling steps (e.g., the foregoing steps S230 to S240) are performed on the texts of the 80 company websites to obtain a labeling result with a first accuracy.
Alternatively, the optimization steps (e.g., steps S310 to S330) may be performed using the texts of the 80 company websites, training the categories of the dictionary file and their keywords again to obtain an optimized dictionary file. The text labeling steps (e.g., steps S230 to S240) are then performed on the remaining (e.g., 100) company websites to obtain a labeling result with a second accuracy, the second accuracy being higher than the first. In the same manner, the dictionary file can be optimized after each round of text labeling, continuously improving the accuracy of the next round.
In summary, the information processing system and information processing method of the present disclosure provide a highly flexible text labeling approach: basic feature information is used to find new words, and term frequency-inverse document frequency is combined with a word extraction algorithm to make keyword selection more efficient. Compared with the manpower required for conventional text annotation, the present disclosure can continually train and refine the dictionary categories. In addition, the automatic labeling approach achieves online data labeling and data protection at the same time, avoiding the data leakage risk of manual labeling.
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the embodiments of the present disclosure. Those skilled in the art should appreciate that the present disclosure may be readily utilized as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.

Claims (17)

1. An information processing system, comprising:
at least one processor;
a communication interface coupled to the at least one processor; and
a database coupled to the at least one processor, the database being configured to store at least one text received from the communication interface, wherein the at least one processor is configured to:
obtaining a plurality of training words from the at least one text using basic feature information of the plurality of words of the at least one text, the basic feature information including mutual information of words, entropy values, word frequencies, combination change values, and context relation values;
classifying the plurality of training words to create a first list corresponding to a first category and a second list corresponding to a second category;
matching a plurality of keywords in the first list and the second list in a text to be annotated, and calculating a confidence value of the text to be annotated with respect to the first list and the second list respectively; and
labeling the text to be annotated as the first category or the second category according to the confidence value,
wherein the at least one processor is further configured to:
using the basic feature information and a probability value of the basic feature information to calculate a reference value of the plurality of training words,
wherein the at least one processor is further configured to:
setting the plurality of training words as the plurality of keywords of the first list in response to the reference value meeting a first threshold; and
setting the plurality of training words as the plurality of keywords of the second list in response to the reference value meeting a second threshold.
2. The information processing system of claim 1, wherein the at least one processor is further configured to:
calculating the reference value of each training word using the occurrence-frequency and popularity information of the plurality of training words; and
setting the training words whose reference values meet the first threshold as the plurality of keywords of the first list, and setting the training words whose reference values meet the second threshold as the plurality of keywords of the second list.
3. The information processing system of claim 1, wherein the at least one processor is further configured to:
setting the keywords of the first list and the keywords of the second list as a plurality of node values of a dictionary tree; and
comparing a plurality of words of the text to be annotated using the plurality of node values.
4. The information processing system of claim 3, wherein the at least one processor is further configured to:
recording a first matching number, the first matching number being the number of matches between the node values corresponding to the first list and the words of the text to be annotated, and setting the first matching number as the confidence value of the first list; and
recording a second matching number, the second matching number being the number of matches between the node values corresponding to the second list and the words of the text to be annotated, and setting the second matching number as the confidence value of the second list.
5. The information processing system of claim 4, wherein the at least one processor is further configured to:
labeling the text to be annotated as the first category or the second category according to the maximum of the confidence values of the first list and the second list.
6. The information processing system of claim 1, wherein the at least one processor is further configured to:
receiving a new text through the communication interface;
obtaining a plurality of first keywords and/or a plurality of second keywords in the new text using the basic feature information of the plurality of words in the new text; and
updating the first list corresponding to the first category according to the first keywords and/or updating the second list corresponding to the second category according to the second keywords.
7. The information processing system of claim 1, wherein the at least one processor is further configured to:
receiving a new text through the communication interface;
obtaining a plurality of third keywords in the new text using the basic feature information of the plurality of words in the new text; and
establishing a third list corresponding to a third category according to the third keywords in the new text.
8. The information processing system of claim 6, further comprising:
a user interface coupled to the at least one processor, wherein the user interface is configured to receive an operation instruction for the at least one processor to execute the operation instruction to:
modify the first category into a fourth category, such that the fourth category includes the plurality of first keywords; and/or
modify the second category into a fifth category, such that the second list corresponding to the fifth category includes the plurality of second keywords.
9. An information processing method, comprising:
obtaining a plurality of training words from at least one text using basic feature information of the plurality of words of the at least one text, the basic feature information including mutual information of words, entropy values, word frequencies, combination change values, and context relation values;
classifying the plurality of training words to create a first list corresponding to a first category and a second list corresponding to a second category;
matching a plurality of keywords in the first list and the second list in a text to be annotated, and calculating a confidence value of the text to be annotated with respect to the first list and the second list respectively; and
labeling the text to be annotated as the first category or the second category according to the confidence value,
the method further comprising:
using the basic feature information and a probability value of the basic feature information to calculate a reference value of the plurality of training words,
the method further comprising:
setting the plurality of training words as the plurality of keywords of the first list in response to the reference value meeting a first threshold; and
setting the plurality of training words as the plurality of keywords of the second list in response to the reference value meeting a second threshold.
10. The information processing method of claim 9, further comprising:
calculating the reference value of each training word using the occurrence-frequency and popularity information of the plurality of training words; and
setting the training words whose reference values meet the first threshold as the plurality of keywords of the first list, and setting the training words whose reference values meet the second threshold as the plurality of keywords of the second list.
11. The information processing method of claim 9, further comprising:
setting the keywords of the first list and the keywords of the second list as a plurality of node values of a dictionary tree; and
comparing a plurality of words of the text to be annotated using the plurality of node values.
12. The information processing method of claim 11, further comprising:
recording a first matching number, the first matching number being the number of matches between the node values corresponding to the first list and the words of the text to be annotated, and setting the first matching number as the confidence value of the first list; and
recording a second matching number, the second matching number being the number of matches between the node values corresponding to the second list and the words of the text to be annotated, and setting the second matching number as the confidence value of the second list.
13. The information processing method of claim 12, further comprising:
labeling the text to be annotated as the first category or the second category according to the maximum of the confidence values of the first list and the second list.
14. The information processing method of claim 9, further comprising:
obtaining a plurality of first keywords and/or a plurality of second keywords in a new text using the basic feature information of a plurality of words in the new text; and
updating the first list corresponding to the first category according to the first keywords and/or updating the second list corresponding to the second category according to the second keywords.
15. The information processing method of claim 9, further comprising:
obtaining a plurality of third keywords in a new text using the basic feature information of a plurality of words in the new text; and
establishing a third list corresponding to a third category according to the third keywords in the new text.
16. The information processing method of claim 14, further comprising:
modifying the first category into a fourth category, such that the fourth category includes the plurality of first keywords; and/or
modifying the second category into a fifth category, such that the second list corresponding to the fifth category includes the plurality of second keywords.
17. A non-transitory computer readable recording medium storing a plurality of program codes, wherein when the program codes are loaded into at least one processor, the at least one processor executes the program codes to perform the following steps:
obtaining a plurality of training words from at least one text using basic feature information of the plurality of words of the at least one text, the basic feature information including mutual information of words, entropy values, word frequencies, combination change values, and context relation values;
classifying the plurality of training words to create a first list corresponding to a first category and a second list corresponding to a second category;
matching a plurality of keywords in the first list and the second list in a text to be annotated, and calculating a confidence value of the text to be annotated with respect to the first list and the second list respectively; and
labeling the text to be annotated as the first category or the second category according to the confidence value,
the steps further comprising:
using the basic feature information and a probability value of the basic feature information to calculate a reference value of the plurality of training words,
the steps further comprising:
setting the plurality of training words as the plurality of keywords of the first list in response to the reference value meeting a first threshold; and
setting the plurality of training words as the plurality of keywords of the second list in response to the reference value meeting a second threshold.
CN201910950217.1A 2019-10-08 2019-10-08 Information processing system, information processing method and non-transitory computer readable recording medium Active CN112711940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910950217.1A CN112711940B (en) 2019-10-08 2019-10-08 Information processing system, information processing method and non-transitory computer readable recording medium


Publications (2)

Publication Number Publication Date
CN112711940A CN112711940A (en) 2021-04-27
CN112711940B 2024-06-11

Family

ID=75540122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910950217.1A Active CN112711940B (en) 2019-10-08 2019-10-08 Information processing system, information processing method and non-transitory computer readable recording medium

Country Status (1)

Country Link
CN (1) CN112711940B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1584883A (en) * 2004-05-27 2005-02-23 威盛电子股份有限公司 Related document connecting managing system, method and recording media
JP2011170786A (en) * 2010-02-22 2011-09-01 Nomura Research Institute Ltd Document classification system, document classification program, and document classification method
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN107168992A (en) * 2017-03-29 2017-09-15 北京百度网讯科技有限公司 Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence
CN107436875A (en) * 2016-05-25 2017-12-05 华为技术有限公司 File classification method and device
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107967311A (en) * 2017-11-20 2018-04-27 阿里巴巴集团控股有限公司 A kind of method and apparatus classified to network data flow
CN109145097A (en) * 2018-06-11 2019-01-04 人民法院信息技术服务中心 A kind of judgement document's classification method based on information extraction
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN110245232A (en) * 2019-06-03 2019-09-17 网易传媒科技(北京)有限公司 File classification method, device, medium and calculating equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7016895B2 (en) * 2002-07-05 2006-03-21 Word Data Corp. Text-classification system and method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于信息熵的主动学习半监督分类研究";陈锦禾等;《计算机技术与发展》;20100210;第20卷(第2期);第110-113页 *

Also Published As

Publication number Publication date
CN112711940A (en) 2021-04-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant