CN112711940B - Information processing system, information processing method and non-transitory computer readable recording medium - Google Patents


Info

Publication number
CN112711940B
CN112711940B (application CN201910950217.1A)
Authority
CN
China
Prior art keywords
list
text
keywords
category
words
Prior art date
Legal status
Active
Application number
CN201910950217.1A
Other languages
Chinese (zh)
Other versions
CN112711940A (en)
Inventor
曾俋颖
汤珮茹
Current Assignee
Delta Electronics Inc
Original Assignee
Delta Electronics Inc
Priority date
Filing date
Publication date
Application filed by Delta Electronics Inc
Priority to CN201910950217.1A
Publication of CN112711940A
Application granted
Publication of CN112711940B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An information processing system, an information processing method, and a non-transitory computer readable recording medium are provided. The information processing system includes at least one processor, a communication interface, and a database. The communication interface is coupled to the at least one processor. The database is coupled to the at least one processor and is configured to store at least one text received from the communication interface. The at least one processor is configured to: obtain a plurality of training words using basic feature information of a plurality of words of the at least one text; classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; match a plurality of keywords of the first list and the second list in a text to be annotated, and calculate confidence values of the text to be annotated with respect to the first list and the second list respectively; and label the text to be annotated as the first category or the second category according to the confidence values.

Description

Information processing system, information processing method and non-transitory computer readable recording medium
Technical Field
The present disclosure relates to a processing system and a processing method, and more particularly, to an information processing system and an information processing method.
Background
Conventionally, text labeling is performed manually: an analyst reads the articles one by one and labels them based on experience. However, such an approach is quite time consuming, and the labeling results depend heavily on the analyst's experience. Furthermore, since the articles must be read by an analyst, there is considerable risk in terms of data security.
On the other hand, training a classification model by machine learning requires a large number of accurately labeled articles to ensure the model's accuracy. If the labeled articles are too few or of poor quality, the accuracy will be low. Accordingly, how to improve classification accuracy and data confidentiality at the same time is a technical problem to be solved in the field of text classification.
Disclosure of Invention
This summary is intended to provide a simplified overview of the disclosure so that the reader has a basic understanding of it. This summary is not an extensive overview of the disclosure, and is intended neither to identify key or critical elements of the embodiments nor to delineate the scope of the disclosure.
According to one embodiment of the present disclosure, an information processing system is disclosed, comprising at least one processor, a communication interface, and a database. The communication interface is coupled to the at least one processor. The database is coupled to the at least one processor and is configured to store at least one text received from the communication interface. The at least one processor is configured to: obtain a plurality of training words using basic feature information of a plurality of words of the at least one text; classify the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; match a plurality of keywords of the first list and the second list in a text to be annotated, and calculate confidence values of the text to be annotated with respect to the first list and the second list respectively; and label the text to be annotated as the first category or the second category according to the confidence values.
According to another embodiment, an information processing method is disclosed, comprising: obtaining a plurality of training words using basic feature information of a plurality of words of at least one text; classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; matching a plurality of keywords of the first list and the second list in a text to be annotated, and calculating confidence values of the text to be annotated with respect to the first list and the second list respectively; and labeling the text to be annotated as the first category or the second category according to the confidence values.
According to another embodiment, a non-transitory computer readable recording medium storing a plurality of program codes is disclosed. When the program codes are loaded into at least one processor, the at least one processor executes the program codes to perform the following steps: obtaining a plurality of training words using basic feature information of a plurality of words of at least one text; classifying the training words to establish a first list corresponding to a first category and a second list corresponding to a second category; matching a plurality of keywords of the first list and the second list in a text to be annotated, and calculating confidence values of the text to be annotated with respect to the first list and the second list respectively; and labeling the text to be annotated as the first category or the second category according to the confidence values.
Drawings
The following detailed description, when read in conjunction with the accompanying drawings, will facilitate a better understanding of embodiments of the present disclosure. It should be noted that the features in the drawings are not necessarily drawn to scale; in fact, the dimensions of the various features may be arbitrarily increased or decreased for clarity of discussion.
FIG. 1 is a functional block diagram illustrating an information processing system according to some embodiments of the present disclosure.
Fig. 2 is a flowchart illustrating an information processing method according to some embodiments of the present disclosure.
Fig. 3 is a flowchart illustrating an information processing method according to further embodiments of the present disclosure.
Reference numerals:
100: information processing system
110: processor
120: communication interface
130: database
140: user interface
S210-S240, S310-S330: steps
Detailed Description
The following disclosure provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of elements and arrangements are described below to simplify the present disclosure. Of course, these examples are merely illustrative and are not intended to be limiting. For example, forming a first feature over or on a second feature in the description below may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features are formed between the first and second features such that the first and second features are not in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Referring to FIG. 1, a functional block diagram of an information processing system 100 according to some embodiments of the present disclosure is shown. As shown in FIG. 1, the information processing system 100 includes a processor 110, a communication interface 120, and a database 130. In some embodiments, data processing may be performed by at least one processor 110, such that the information processing system 100 operates in a multithreading environment. For ease of illustration, the present disclosure is described below in terms of an embodiment with a single processor 110.
The communication interface 120 is coupled to the processor 110 and is configured to transmit text data to and receive text data from another device or system (not shown). In some embodiments, the communication interface 120 may be, but is not limited to, a communication chip supporting Global System for Mobile Communications (GSM), Long Term Evolution (LTE), Worldwide Interoperability for Microwave Access (WiMAX), wireless fidelity (Wi-Fi), Bluetooth, wired networks, and the like.
The database 130 is coupled to the processor 110. In some embodiments, the information processing system 100 may be provided with an external database (not shown) outside the system, communicatively coupled to the processor 110 via the communication interface 120, to access data outside the system.
In some embodiments, the database 130 is configured to store at least one text received via the communication interface 120. The text may be a file in any language.
Referring to fig. 2, a flowchart illustrating an information processing method according to some embodiments of the present disclosure is shown. The information processing method of fig. 2 may be performed by the information processing system 100 of fig. 1. For convenience of explanation of the information processing method of fig. 2, various related terms or elements will be explained with reference to fig. 1.
In step S210, a plurality of training words are obtained using basic feature information of a plurality of words of at least one text.
In some embodiments, the processor 110 uses words in text as a basis for training the keywords of the dictionary.
First, the processor 110 parses the words in the text through natural language processing techniques, for example by segmenting the text into vocabulary terms or word breaks. The processor 110 then obtains the basic feature information of each word according to a pre-established database (not shown). The basic feature information may be, but is not limited to, the word's mutual information (MI), entropy, term frequency (TF), combination change value (accessor variety, AV), and context relation value (position). In some embodiments, the processor 110 calculates a reference value for each word using a comprehensive weighting formula such as formula (1).
W(new word) = α×W_MI + β×W_entropy + γ×W_TF + δ×W_AV + ε×W_position, where 0 < α, β, γ, δ, ε < 1 (1)
In formula (1), W(new word) is the reference value of a word; W_MI is the mutual information of the word; W_entropy is the entropy of the word; W_TF is the term frequency of the word; W_AV is the combination change value between the word and the words to its left and right; W_position is the relative relation value between the word and its context; and α, β, γ, δ, and ε are probability values. Mutual information estimates the degree of tightness or relevance between the word and its adjacent words, while entropy estimates the degree of freedom between the word and its adjacent words. Both are standard quantities in information theory and are not described in detail here.
Therefore, by adjusting the probability value of each piece of basic feature information in formula (1), different weightings can be used to find candidate keywords.
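As a concrete illustration, the following is a minimal Python sketch of the reference-value calculation of formula (1); the feature values, probability weights, and threshold comparison are hypothetical placeholders rather than values taken from the present disclosure.

    def reference_value(features, weights):
        # Weighted sum of the five basic feature values of a candidate word,
        # per formula (1): W(new word) = alpha*W_MI + beta*W_entropy + ...
        keys = ('MI', 'entropy', 'TF', 'AV', 'position')
        return sum(weights[k] * features[k] for k in keys)

    # Hypothetical probability weights (alpha..epsilon) and feature values:
    weights = {'MI': 0.3, 'entropy': 0.2, 'TF': 0.25, 'AV': 0.15, 'position': 0.1}
    features = {'MI': 0.8, 'entropy': 0.6, 'TF': 0.9, 'AV': 0.4, 'position': 0.5}
    w = reference_value(features, weights)  # 0.695 with the values above
    # In step S220 below, a word whose reference value exceeds a category's
    # threshold would be added to that category's keyword list.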
In step S220, the processor 110 classifies the training words to create a plurality of lists corresponding to the plurality of categories, respectively.
In some embodiments, the processor 110 may set different thresholds to determine the classification of keywords. For example, suppose the training words detected in the text are "artificial intelligence server", "intelligent robot", "virtual assistant", "natural language", and "home appliance", and only the first four have reference values greater than a first threshold; those four training words are then set as keywords in the first list, which relates to artificial intelligence (the first category). As another example, suppose the training words detected in the text are "financial transaction", "smart contract", and "bank", and their reference values are greater than a second threshold; those training words are then set as keywords in the second list, which relates to blockchain (the second category). By analogy, the processor 110 may build many different lists.
In some embodiments, the keywords of the first list are organized into a dictionary for artificial intelligence, and the keywords of the second list are organized into a dictionary for blockchain. In this way, the information processing system 100 may classify or label the content of texts to be classified based on these dictionary files. Note that the terms "list" and "dictionary" are used interchangeably in this disclosure.
In step S225, the processor 110 determines whether the training of the dictionary is completed.
In some embodiments, steps S210 to S220 may be regarded as one loop. In the list-building method of the present disclosure, this loop may be performed repeatedly to obtain training words from the same or different texts multiple times, so that the keywords classified into the lists of the various categories become more accurate. For example, the training word "bank" may be classified into the second list (the blockchain category) as one of its keywords in the first loop, and then, in the second loop, be removed from the second list because it fits the "blockchain" classification relatively poorly. In this way, the lists of keywords can be continuously updated and optimized by executing multiple loops.
In some embodiments, the information processing method of the present disclosure uses a word extraction algorithm to reduce the time required to train words and to improve their accuracy. For example, the word extraction algorithm is the TextRank algorithm, shown in equation (2):
WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ W_ji / Σ_{V_k ∈ Out(V_j)} W_jk ] × WS(V_j) (2)
In equation (2), V_i, V_j, and V_k are different nodes; WS(V_i) is the weight of node V_i; W_ji is the edge weight from node V_j to node V_i; In(V_i) is the set of all nodes pointing to node V_i; Out(V_j) is the set of all nodes pointed to by node V_j; and d is an adjustment coefficient (e.g., 0.85).
In some embodiments, when the word extraction algorithm is executed, the edge weight W_ji in equation (2) incorporates the occurrence-frequency and popularity information from the term frequency-inverse document frequency (TF-IDF) technique. The occurrence frequency and popularity of different words can thus be considered when calculating each node's weight value, and the iterative calculation of equation (2) converges faster. For example, the processor 110 calculates the weight values of N training words using equation (2). After sorting the weight values (e.g., from largest to smallest), the top few (e.g., 50) training words are set as keywords and can be added to the lists.
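The following is a minimal Python sketch of the TextRank iteration of equation (2) on a small hypothetical graph whose edge weights are assumed to have already been derived from TF-IDF; it illustrates the technique rather than the exact implementation of the present disclosure.

    def textrank(out_edges, d=0.85, iters=50, tol=1e-6):
        # out_edges: {node: {neighbor: edge weight W_ij}} for a directed graph.
        nodes = set(out_edges)
        for nbrs in out_edges.values():
            nodes |= set(nbrs)
        ws = {v: 1.0 for v in nodes}  # initial WS(V_i)
        # Total outgoing weight of each node: the inner sum over Out(V_j) in (2).
        out_sum = {v: sum(out_edges.get(v, {}).values()) or 1.0 for v in nodes}
        # Invert the graph to obtain In(V_i).
        in_edges = {v: {} for v in nodes}
        for j, nbrs in out_edges.items():
            for i, w in nbrs.items():
                in_edges[i][j] = w
        for _ in range(iters):
            new_ws = {i: (1 - d) + d * sum(w / out_sum[j] * ws[j]
                                           for j, w in in_edges[i].items())
                      for i in nodes}
            if max(abs(new_ws[v] - ws[v]) for v in nodes) < tol:
                return new_ws
            ws = new_ws
        return ws

    # Hypothetical graph: TF-IDF-weighted co-occurrence edges between candidates.
    graph = {'ai': {'robot': 1.2, 'nlp': 0.8},
             'robot': {'ai': 1.2},
             'nlp': {'ai': 0.8}}
    scores = textrank(graph)
    keywords = sorted(scores, key=scores.get, reverse=True)  # keep the top words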
In step S230, the processor 110 matches the keywords of the lists against the text to be annotated to calculate a confidence value for each list.
In some embodiments, the present disclosure uses a multi-string multi-dictionary (MSMD) matching algorithm for labeling the text. For example, the plurality of lists obtained in step S220 serve as a plurality of dictionaries D[1, …, d], where the dictionaries (e.g., dictionary 1 to dictionary d) are of mutually exclusive types. Each dictionary contains a plurality of word strings S[1, …, s]. In the matching procedure, the processor 110 may take a main string T from the text to be annotated and determine, one dictionary at a time, whether that dictionary is a matching category for T, for example by searching each dictionary for keywords that exactly match the main string T.
For example, the processor 110 sets the keywords in the first list as a plurality of first node values (or first template strings) of a dictionary tree (trie), and sets the keywords in the second list as a plurality of second node values (or second template strings) of the same dictionary tree. In other words, all keywords are integrated into one dictionary tree.
Then, the processor 110 compares the words of the text to be annotated against the first node values and the second node values simultaneously. During the matching procedure, each main string T taken from the text to be annotated is automatically searched against the first template strings of the dictionary tree, the words of the main string T being aligned one by one with each first template string. In one embodiment, when the main string T completely matches any of the first template strings, the processor 110 records that template string, the number of times it occurs in the text to be annotated, and the locations where it occurs. Similarly, the words of the main string T are aligned one by one with each second template string, and when the main string T completely matches any of the second template strings, the processor 110 records that template string, its number of occurrences in the text to be annotated, and the locations where it occurs.
In some embodiments, the data structure of the dictionary tree stores strings sharing the same prefix along shared nodes (e.g., each character is stored in one node, so the height of the dictionary tree is the longest string length plus one); each string therefore corresponds to a unique node. When the dictionary tree is searched for the main string T, the search starts from the root node and proceeds layer by layer toward the child nodes. Since pointers (indexes) are used to record the strings in the dictionary tree, the processor 110 uses finite-state-machine control (e.g., the Aho-Corasick algorithm), together with the pre-constructed template strings, to update the pointer while traversing the dictionary tree. When matching fails at any character of the main string T, the search returns to a fallback state of the finite state machine and turns to another branch of the dictionary tree, avoiding repeated matching of the same prefix. This reduces the time needed to search for the main string T and improves the efficiency of searching the dictionary tree.
It should be noted that the present disclosure is not limited to dictionary tree algorithms, and any multi-string search algorithm is within the scope of the present disclosure.
Further, the present disclosure builds one dictionary tree from all the keywords of all the dictionaries according to the same-prefix rule. Because this single dictionary tree contains all keywords of all dictionaries, a main string T can be matched against all dictionaries simultaneously in the matching procedure. Compared with the common approach, in which only one dictionary can be matched at a time, the simultaneous multi-dictionary matching of the present disclosure greatly improves keyword-matching efficiency.
In the following, two dictionaries (lists) are integrated into one dictionary tree, where the keywords of the first list correspond to a plurality of first nodes and the keywords of the second list correspond to a plurality of second nodes.
In some embodiments, the processor 110 records the number of words of the text to be annotated that match the first node values (i.e., a first matching number), and records the number of words of the text to be annotated that match the second node values (i.e., a second matching number). The processor 110 then sets the first matching number as the confidence value of the first list and the second matching number as the confidence value of the second list.
In step S240, the processor 110 labels the text to be annotated as at least one of the categories according to the confidence values.
In some embodiments, the processor 110 takes the maximum of the confidence value of the first list and the confidence value of the second list. For example, if the confidence value of the first list is the maximum, the text to be annotated is labeled with the category corresponding to the first list (e.g., artificial intelligence); if the confidence value of the second list is the maximum, the text to be annotated is labeled with the category corresponding to the second list (e.g., blockchain). In another embodiment, the text to be annotated may also be labeled with more than one category.
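The following Python sketch illustrates the core of the multi-string multi-dictionary matching idea of steps S230 and S240: the keywords of all lists are merged into a single dictionary tree whose terminal nodes remember their category, and one scan of the text accumulates per-category match counts that serve as the confidence values. The keyword lists are hypothetical, and the Aho-Corasick failure links are omitted for brevity (this version simply restarts from each position).

    from collections import Counter

    def build_trie(dictionaries):
        # dictionaries: {category: [keyword, ...]}. Keywords sharing a prefix
        # share nodes; a '$' marker records which category a keyword belongs to.
        root = {}
        for category, keywords in dictionaries.items():
            for kw in keywords:
                node = root
                for ch in kw:
                    node = node.setdefault(ch, {})
                node.setdefault('$', []).append(category)
        return root

    def match_counts(trie, text):
        # Walk the trie from every start position and count keyword hits per
        # category; the counts serve as the confidence values of the lists.
        counts = Counter()
        for start in range(len(text)):
            node = trie
            for ch in text[start:]:
                if ch not in node:
                    break
                node = node[ch]
                for category in node.get('$', []):
                    counts[category] += 1
        return counts

    dictionaries = {'artificial intelligence': ['virtual assistant', 'natural language'],
                    'blockchain': ['smart contract', 'financial transaction']}
    trie = build_trie(dictionaries)
    text = 'the bank recorded each financial transaction in a smart contract'
    counts = match_counts(trie, text)  # Counter({'blockchain': 2})
    label = counts.most_common(1)[0][0] if counts else None  # step S240: max wins

A production implementation would add the Aho-Corasick failure links described above so the text is scanned in a single pass rather than restarting at every position.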
Referring to FIG. 3, a flowchart illustrating an information processing method according to further embodiments of the present disclosure is shown. This information processing method can further update the existing lists so that the keywords of each category become more accurate.
In step S310, the processor 110 obtains at least one of a plurality of first keywords, a plurality of second keywords, and a plurality of third keywords using the basic feature information of the words in a new text. The step of obtaining the keywords is as described in the foregoing steps S210 to S220 and is not repeated here.
In some embodiments, the processor 110 may receive the new text through the communication interface 120. The new text may be any text usable for training the lists, such as text already stored in the database 130, a previously labeled text, or text that has not yet been used in a training procedure.
In some embodiments, if keywords that can be classified into existing categories are found in the new text, step S320 is performed.
In step S320, the processor 110 updates the first list corresponding to the first category according to the first keywords and/or updates the second list corresponding to the second category according to the second keywords.
In another embodiment, if keywords that cannot be classified into any existing category (e.g., the third keywords) are found in the new text, step S330 is performed.
In step S330, the processor 110 creates a third list corresponding to a third category according to the third keywords.
For example, keywords detected in the text such as "tablet", "display", "optical film", and "glass screen" belong neither to artificial intelligence (the first category) nor to blockchain (the second category). The processor 110 therefore creates a third list corresponding to electronic information (the third category).
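A minimal Python sketch of the update flow of steps S320 and S330 might look as follows; the category names and keywords are illustrative, and the keyword-mining step itself is assumed to be the training procedure of steps S210 to S220.

    def update_dictionaries(dictionaries, mined):
        # mined: {category: [keywords]} extracted from a new text. An existing
        # category is updated in place (step S320); an unknown category gets a
        # brand-new list (step S330).
        for category, keywords in mined.items():
            merged = set(dictionaries.get(category, ())) | set(keywords)
            dictionaries[category] = sorted(merged)
        return dictionaries

    dicts = {'artificial intelligence': ['virtual assistant'],
             'blockchain': ['smart contract']}
    # 'electronic information' is not an existing category, so a third list is created.
    update_dictionaries(dicts, {'electronic information': ['tablet', 'optical film']})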
Referring back to FIG. 1, the information processing system 100 further includes a user interface 140 coupled to the processor 110. The user interface 140 may be a graphical user interface, a keyboard, a screen, a mouse, or the like, providing the user with related operations. For example, a graphical user interface is provided for building the plurality of lists and their keywords.
Referring to Table 1, a plurality of lists and their keywords are shown.
Table 1: multiple lists (hereinafter, the dictionary file)
In some embodiments, the multiple lists of the present disclosure may serve different labeling requirements. For example, if the texts to be annotated are a plurality of YAHOO news texts, the information processing system 100 may use a dictionary file such as Table 1 to label all the YAHOO news texts as described above: a first news article may be labeled as related to "blockchain" and "big data", while a second article is labeled as related to "semiconductor".
In other embodiments, if the texts to be annotated are multiple texts of Eastern news, the user interface 140 may be configured to receive an operation instruction for the processor 110 to modify the categories. For example, artificial intelligence (the first category) may be modified into smart home appliances (a fourth category), so that the smart home appliance category contains all the keywords of artificial intelligence. Similarly, blockchain (the second category) may be modified into e-commerce (a fifth category), so that the e-commerce category contains all the keywords of blockchain.
In other embodiments, the user interface 140 allows a user (e.g., a domain expert) to evaluate whether each list of the dictionary file and its keywords are correct, and whether the classified texts are labeled correctly. If an unsuitable portion is found, the domain expert can correct the erroneous portion through the user interface 140, avoiding repeated labeling or inconsistent standards.
In this way, after one stage of training is completed and the dictionary file is created, the information processing system 100 of the present disclosure is compatible with text providers having different labeling requirements. When providing annotation services to different text providers, the dictionary file need not be retrained for each provider (perhaps only fine-tuned), so the existing dictionary file can be applied to different text providers. In other words, by replacing the dictionary categories and the input texts, the system can be adapted rapidly to different fields and data sources, improving working efficiency.
In some embodiments, the texts of multiple (e.g., 195) company websites are labeled based on the five category tags in the dictionary file of Table 1. The texts of a predetermined portion (e.g., 15) of the company websites have been classified into tags in advance, so the text labeling steps described above are performed on another portion (e.g., 80) of the company websites. For example, the training steps (e.g., steps S210 to S225) are performed on the texts of the 15 labeled company websites to obtain a dictionary file (e.g., Table 1). Then, the labeling steps (e.g., the foregoing steps S230 to S240) are performed on the texts of the 80 company websites to obtain a labeling result with a first accuracy.
Alternatively, the optimization steps (e.g., steps S310 to S330) may be performed using the texts of the 80 company websites, training the categories of the dictionary file and their keywords again to obtain an optimized dictionary file. The text labeling steps (e.g., steps S230 to S240) are then performed on the remaining (e.g., 100) company websites to obtain a labeling result with a second accuracy, the second accuracy being higher than the first. In the same manner, the dictionary file can be optimized after each round of text labeling, continuously improving the accuracy of the next round.
In summary, the information processing system and information processing method of the present disclosure provide a highly flexible text labeling approach: basic feature information is used to find new words, and term frequency-inverse document frequency is combined with a word extraction algorithm to make keyword selection more efficient. Compared with the manpower required for conventional text annotation, the present disclosure can continually train and refine the dictionary categories. In addition, the automatic labeling approach achieves online data labeling and data protection at the same time, avoiding the data leakage risk of manual labeling.
The foregoing outlines features of several embodiments so that those skilled in the art may better understand the embodiments of the present disclosure. Those skilled in the art should appreciate that the present disclosure may be readily utilized as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.

Claims (17)

1. An information processing system, comprising:
at least one processor;
a communication interface coupled to the at least one processor; and
a database coupled to the at least one processor, the database being configured to store at least one text received from the communication interface, wherein the at least one processor is configured to:
obtaining a plurality of training words from the at least one text using basic feature information of the plurality of words of the at least one text, the basic feature information including mutual information of words, entropy values, word frequencies, combination change values, and context relation values;
classifying the plurality of training words to create a first list corresponding to a first category and a second list corresponding to a second category;
matching a plurality of keywords in the first list and the second list in a text to be annotated, and calculating a confidence value of the text to be annotated with respect to the first list and the second list respectively; and
labeling the text to be annotated as the first category or the second category according to the confidence value,
wherein the at least one processor is further configured to:
using the basic feature information and a probability value of the basic feature information to calculate a reference value of the plurality of training words,
wherein the at least one processor is further configured to:
setting the plurality of training words as the plurality of keywords of the first list in response to the reference value meeting a first threshold; and
setting the plurality of training words as the plurality of keywords of the second list in response to the reference value meeting a second threshold.
2. The information processing system of claim 1, wherein the at least one processor is further configured to:
calculating the reference value of each training word using the occurrence-frequency and popularity information of the plurality of training words; and
setting the training words whose reference values meet the first threshold as the plurality of keywords of the first list, and setting the training words whose reference values meet the second threshold as the plurality of keywords of the second list.
3. The information processing system of claim 1, wherein the at least one processor is further configured to:
setting the keywords of the first list and the keywords of the second list as a plurality of node values of a dictionary tree; and
comparing a plurality of words of the text to be annotated using the plurality of node values.
4. The information processing system of claim 3, wherein the at least one processor is further configured to:
recording a first matching number, the first matching number being the number of matches between the node values corresponding to the first list and the words of the text to be annotated, and setting the first matching number as the confidence value of the first list; and
recording a second matching number, the second matching number being the number of matches between the node values corresponding to the second list and the words of the text to be annotated, and setting the second matching number as the confidence value of the second list.
5. The information processing system of claim 4, wherein the at least one processor is further configured to:
labeling the text to be annotated as the first category or the second category according to the maximum of the confidence values of the first list and the second list.
6. The information processing system of claim 1, wherein the at least one processor is further configured to:
receiving a new text through the communication interface;
obtaining a plurality of first keywords and/or a plurality of second keywords in the new text using the basic feature information of the plurality of words in the new text; and
updating the first list corresponding to the first category according to the first keywords and/or updating the second list corresponding to the second category according to the second keywords.
7. The information processing system of claim 1, wherein the at least one processor is further configured to:
receiving a new text through the communication interface;
obtaining a plurality of third keywords in the new text using the basic feature information of the plurality of words in the new text; and
establishing a third list corresponding to a third category according to the third keywords in the new text.
8. The information processing system of claim 6, further comprising:
a user interface coupled to the at least one processor, wherein the user interface is configured to receive an operation instruction for the at least one processor to execute the operation instruction to:
modify the first category into a fourth category, such that the fourth category includes the plurality of first keywords; and/or
modify the second category into a fifth category, such that the second list corresponding to the fifth category includes the plurality of second keywords.
9. An information processing method, comprising:
obtaining a plurality of training words from at least one text using basic feature information of the plurality of words of the at least one text, the basic feature information including mutual information of words, entropy values, word frequencies, combination change values, and context relation values;
classifying the plurality of training words to create a first list corresponding to a first category and a second list corresponding to a second category;
matching a plurality of keywords in the first list and the second list in a text to be annotated, and calculating a confidence value of the text to be annotated with respect to the first list and the second list respectively; and
labeling the text to be annotated as the first category or the second category according to the confidence value,
the method further comprising:
using the basic feature information and a probability value of the basic feature information to calculate a reference value of the plurality of training words,
the method further comprising:
setting the plurality of training words as the plurality of keywords of the first list in response to the reference value meeting a first threshold; and
setting the plurality of training words as the plurality of keywords of the second list in response to the reference value meeting a second threshold.
10. The information processing method of claim 9, further comprising:
calculating the reference value of each training word using the occurrence-frequency and popularity information of the plurality of training words; and
setting the training words whose reference values meet the first threshold as the plurality of keywords of the first list, and setting the training words whose reference values meet the second threshold as the plurality of keywords of the second list.
11. The information processing method of claim 9, further comprising:
setting the keywords of the first list and the keywords of the second list as a plurality of node values of a dictionary tree; and
comparing a plurality of words of the text to be annotated using the plurality of node values.
12. The information processing method of claim 11, further comprising:
recording a first matching number, the first matching number being the number of matches between the node values corresponding to the first list and the words of the text to be annotated, and setting the first matching number as the confidence value of the first list; and
recording a second matching number, the second matching number being the number of matches between the node values corresponding to the second list and the words of the text to be annotated, and setting the second matching number as the confidence value of the second list.
13. The information processing method of claim 12, further comprising:
labeling the text to be annotated as the first category or the second category according to the maximum of the confidence values of the first list and the second list.
14. The information processing method of claim 9, further comprising:
obtaining a plurality of first keywords and/or a plurality of second keywords in a new text using the basic feature information of a plurality of words in the new text; and
updating the first list corresponding to the first category according to the first keywords and/or updating the second list corresponding to the second category according to the second keywords.
15. The information processing method of claim 9, further comprising:
obtaining a plurality of third keywords in a new text using the basic feature information of a plurality of words in the new text; and
establishing a third list corresponding to a third category according to the third keywords in the new text.
16. The information processing method of claim 14, further comprising:
modifying the first category into a fourth category, such that the fourth category includes the plurality of first keywords; and/or
modifying the second category into a fifth category, such that the second list corresponding to the fifth category includes the plurality of second keywords.
17. A non-transitory computer readable recording medium storing a plurality of program codes, wherein when the program codes are loaded into at least one processor, the at least one processor executes the program codes to perform the following steps:
obtaining a plurality of training words from at least one text using basic feature information of the plurality of words of the at least one text, the basic feature information including mutual information of words, entropy values, word frequencies, combination change values, and context relation values;
classifying the plurality of training words to create a first list corresponding to a first category and a second list corresponding to a second category;
matching a plurality of keywords in the first list and the second list in a text to be annotated, and calculating a confidence value of the text to be annotated with respect to the first list and the second list respectively; and
labeling the text to be annotated as the first category or the second category according to the confidence value,
the steps further comprising:
using the basic feature information and a probability value of the basic feature information to calculate a reference value of the plurality of training words,
the steps further comprising:
setting the plurality of training words as the plurality of keywords of the first list in response to the reference value meeting a first threshold; and
setting the plurality of training words as the plurality of keywords of the second list in response to the reference value meeting a second threshold.
CN201910950217.1A 2019-10-08 2019-10-08 Information processing system, information processing method and non-transitory computer readable recording medium Active CN112711940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910950217.1A CN112711940B (en) 2019-10-08 2019-10-08 Information processing system, information processing method and non-transitory computer readable recording medium


Publications (2)

Publication Number Publication Date
CN112711940A CN112711940A (en) 2021-04-27
CN112711940B 2024-06-11

Family

ID=75540122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910950217.1A Active CN112711940B (en) 2019-10-08 2019-10-08 Information processing system, information processing method and non-transitory computer readable recording medium

Country Status (1)

Country Link
CN (1) CN112711940B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1584883A (en) * 2004-05-27 2005-02-23 威盛电子股份有限公司 Related document connecting managing system, method and recording media
JP2011170786A (en) * 2010-02-22 2011-09-01 Nomura Research Institute Ltd Document classification system, document classification program, and document classification method
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN107168992A (en) * 2017-03-29 2017-09-15 北京百度网讯科技有限公司 Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence
CN107436875A (en) * 2016-05-25 2017-12-05 华为技术有限公司 File classification method and device
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107967311A (en) * 2017-11-20 2018-04-27 阿里巴巴集团控股有限公司 A kind of method and apparatus classified to network data flow
CN109145097A (en) * 2018-06-11 2019-01-04 人民法院信息技术服务中心 A kind of judgement document's classification method based on information extraction
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN110245232A (en) * 2019-06-03 2019-09-17 网易传媒科技(北京)有限公司 File classification method, device, medium and calculating equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7016895B2 (en) * 2002-07-05 2006-03-21 Word Data Corp. Text-classification system and method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于信息熵的主动学习半监督分类研究";陈锦禾等;《计算机技术与发展》;20100210;第20卷(第2期);第110-113页 *

Also Published As

Publication number Publication date
CN112711940A (en) 2021-04-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant