WO2020044558A1 - Classification rule generation program, classification rule generation method, and classification rule generation device - Google Patents

Classification rule generation program, classification rule generation method, and classification rule generation device

Info

Publication number
WO2020044558A1
WO2020044558A1 (PCT/JP2018/032449)
Authority
WO
WIPO (PCT)
Prior art keywords
classification
text data
character strings
character string
appearance frequency
Prior art date
Application number
PCT/JP2018/032449
Other languages
English (en)
Japanese (ja)
Inventor
智哉 野呂
謙介 馬場
茂紀 福田
清司 大倉
太田 唯子
隆夫 毛利
靖 岩崎
祐太郎 木田
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社 filed Critical 富士通株式会社
Priority to PCT/JP2018/032449 priority Critical patent/WO2020044558A1/fr
Priority to JP2020540004A priority patent/JP7044162B2/ja
Publication of WO2020044558A1 publication Critical patent/WO2020044558A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • the present invention relates to a classification rule generation program, a classification rule generation method, and a classification rule generation device.
  • Conventionally, business logs such as the titles of applications and windows used on a computer, calendar schedule titles, and the titles of sent and received e-mails have been classified by business content. For example, each piece of text data associated with a business content is divided to extract partial character strings, and the partial character string with the highest appearance frequency is associated with that business content as its characteristic word.
  • For example, partial character strings are extracted from each of a plurality of texts used in “product planning”, and the partial character string “minutes”, which has the highest appearance frequency, is specified. Then, “characteristic word (minutes), business content (product planning)” is generated as a classification rule. Thereafter, if a newly generated business log includes “minutes”, that business log is classified as “product planning”.
  • However, the classification rule generated by the above technique is based simply on a character string obtained from the text data, which is not necessarily suitable for the subsequent classification, so the classification accuracy is low.
  • For example, the character string “regular meeting” appears in many business logs, so a classification rule based on whether or not “regular meeting” is included cannot identify the business content.
  • One object of the present invention is to provide a classification rule generation program, a classification rule generation method, and a classification rule generation device capable of generating a classification rule with high classification accuracy.
  • the classification rule generation program causes the computer to execute a process of extracting a plurality of connected character strings obtained by connecting words obtained by dividing text data into words.
  • The classification rule generation program also causes the computer to execute a process of, for each of the plurality of connected character strings, referring to a storage unit that stores the classification category into which each of a plurality of text data items including the text data is classified, and calculating the distribution of classification destinations of the text data items that include the connected character string.
  • the classification rule generation program causes a computer to execute a process of selecting a characteristic word from the plurality of connected character strings based on the distribution of the classification destination.
  • the classification rule generation program causes a computer to execute a process of generating a classification rule in which the classification category in which the text data is classified is associated with the characteristic word.
  • a classification rule with high classification accuracy can be generated.
  • FIG. 1 is a diagram illustrating an example of the overall configuration of a classification device according to the first embodiment.
  • FIG. 2 is a functional block diagram illustrating the functional configuration of the classification device according to the first embodiment.
  • FIG. 3 is a diagram illustrating an example of a business log stored in the business log DB.
  • FIG. 4 is a diagram illustrating an example of the classification information stored in the business classification DB.
  • FIG. 5 is a diagram illustrating an example of information stored in the character string DB.
  • FIG. 6 is a diagram illustrating a processing example of morphological analysis.
  • FIG. 7 is a diagram illustrating calculation of the appearance frequency and correction of the appearance frequency.
  • FIG. 8 is a diagram illustrating a correction result of the appearance frequency.
  • FIG. 9 is a diagram illustrating the identification of a character string boundary based on the corrected appearance frequency.
  • FIG. 10 is a diagram illustrating text division based on the corrected appearance frequency.
  • FIG. 11 is a diagram illustrating an example of generating a classification rule.
  • FIG. 12 is a flowchart illustrating the flow of the appearance frequency process.
  • FIG. 13 is a flowchart illustrating the flow of the rule generation process.
  • FIG. 14 is a diagram illustrating an example of a hardware configuration.
  • FIG. 1 is a diagram illustrating an example of the overall configuration of a classification device 10 according to the first embodiment.
  • The classification device 10 illustrated in FIG. 1 is an example of a classification rule generation device that divides text data such as log information to extract characteristic words, generates a business classification model associating the characteristic words with prepared categories, and classifies log information into categories according to the generated business classification model.
  • The classification device 10 includes a learning device that learns the classification rules to be applied to the business classification model, and a classifier to which the learning result of the learning device is applied.
  • the classification device 10 collects, as log information, business data such as mail, schedule, and telephone, and operation logs such as window titles and application files. Then, the learning device of the classification device 10 divides the text data included in the log information into words, and calculates the appearance frequency of a character string (word string) of N consecutive words.
  • The learning device of the classification device 10 calculates the distribution (variation) of the classification destinations of business content over the business logs in which each character string appears, and corrects the appearance frequency accordingly; that is, the learning device lowers the appearance frequency of a character string whose classification destinations vary widely. After that, the learning device of the classification device 10 determines the division units of the text data based on the corrected appearance frequency of each character string, extracts character strings suitable for classification from the character strings divided in the determined units, and adopts them for the classification rules. For example, the learning device generates “estimate, customer correspondence”, “business trip application, paperwork”, and the like as classification rules of the form “characteristic word, category”.
  • The classifier of the classification device 10 constructs a business classification model to which the classification rules generated in this way are applied, and classifies newly generated business data and operation logs into categories according to the classification rules. For example, the classifier classifies a business log whose window title includes “estimate” into the category “customer correspondence”.
  • The learning device of the classification device 10 periodically executes the above-described learning and updates the classification rules. For example, by repeating learning in response to feedback, the learning device newly adds “ABC, customer correspondence” and “XY system, customer correspondence” as classification rules of the form “characteristic word, category”. In this way, the classification device 10 can generate classification rules with high classification accuracy.
  • FIG. 2 is a functional block diagram illustrating the functional configuration of the classification device 10 according to the first embodiment.
  • the classification device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.
  • the communication unit 11 is a processing unit that controls communication with another device, and is, for example, a communication interface.
  • the communication unit 11 receives log information such as business data and operation logs from another device, and transmits a classification result and the like to a management terminal and the like.
  • the storage unit 12 is an example of a storage device that stores data, programs executed by the control unit 20, and the like, and is, for example, a hard disk or a memory.
  • the storage unit 12 stores a business log DB 13, a business classification DB 14, a character string DB 15, and a classification rule DB 16.
  • the business log DB 13 is a database that stores log information such as business data to be learned and operation logs.
  • the log information stored here is text data, which can be periodically stored by the administrator, or can be acquired and stored by the control unit 20.
  • FIG. 3 is a diagram showing an example of a business log stored in the business log DB 13.
  • The business log DB 13 stores business logs such as the titles of sent and received e-mails, operation logs such as application files and window titles, schedules generated by a scheduler, and the like.
  • the business log is log information in which “creation time” indicating the time at which the log was created is associated with “file name” that is the log file name.
  • For example, a business log “X Pro regular meeting_minutes” created at 9:00:00 is stored.
  • The operation log is log information in which “start time and end time” indicating the start and end of an operation, “window title” indicating the operation target, and “start process” indicating the process that started the operation target are associated with one another.
  • For example, an operation log indicating that “business negotiation report material.bbb”, started by the BBB process, was operated by the user from 9:35:06 to 9:38:43 is stored.
  • the schedule table is log information in which “start time and end time” indicating the start and end of the schedule are associated with “subject” indicating the content of the schedule.
  • For example, a schedule whose subject is “section” and which starts at 9:30:00 and ends at 10:30:30 is stored as a business log.
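  • To make the three log formats above concrete, the following is a minimal illustrative sketch in Python. The class and field names are assumptions chosen to mirror the columns described above; they are not part of the patent disclosure.

```python
from dataclasses import dataclass

# Illustrative record types for the three kinds of log information described
# above (field names are assumptions, not the patent's actual schema).

@dataclass
class BusinessLog:
    creation_time: str   # "creation time" at which the log was created
    file_name: str       # "file name" of the log file

@dataclass
class OperationLog:
    start_time: str      # start of the operation
    end_time: str        # end of the operation
    window_title: str    # "window title" indicating the operation target
    start_process: str   # "start process" that started the operation target

@dataclass
class ScheduleEntry:
    start_time: str      # start of the schedule
    end_time: str        # end of the schedule
    subject: str         # "subject" indicating the content of the schedule

# Example records corresponding to the entries mentioned above
business = BusinessLog("9:00:00", "X Pro regular meeting_minutes")
operation = OperationLog("9:35:06", "9:38:43",
                         "business negotiation report material.bbb", "BBB")
schedule = ScheduleEntry("9:30:00", "10:30:30", "section")
```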
  • the business classification DB 14 is a database that stores the classification destination category and the classified business logs in association with each other.
  • FIG. 4 is a diagram showing an example of the classification information stored in the business classification DB 14. As shown in FIG. 4, the business classification DB 14 stores “classification category, applicable log” in association with each other.
  • The “classification category” indicates a category into which business logs are classified, and can be arbitrarily changed by a user or the like.
  • The “applicable log” is a list of the business logs classified into that classification category.
  • For example, FIG. 4 indicates that the business log “business negotiation report material.bbb” is classified into the classification category “customer correspondence”, and that the business log “X Pro regular meeting_minutes” is classified into the classification category “product planning”. Examples of classification categories include “customer correspondence”, which covers meetings with clients and the creation of materials; “product planning”, which covers surveys and review meetings for new products; “design/development”, which covers product design, development, and testing; and “sales expansion”, which covers the planning and implementation of product sales events.
  • the character string DB 15 is a database that stores information on the frequency of appearance of character strings extracted from the business log. The information stored here is generated by the control unit 20.
  • FIG. 5 is a diagram illustrating an example of information stored in the character string DB 15. As shown in FIG. 5, the character string DB 15 stores “character string (w), appearance frequency (F(w)), corrected appearance frequency (Fe(w))” in association with one another.
  • The “character string (w)” is a character string extracted by the control unit 20, and the “appearance frequency (F(w))” is the appearance frequency of that character string in all business logs to be learned.
  • The “corrected appearance frequency (Fe(w))” is the appearance frequency of the character string after correction by the control unit 20.
  • For example, the appearance frequency of the character string “X Pro regular” over the entire set of business logs is “6”, which is corrected to “2.35” by the control unit 20. The calculation method of each item will be described later.
  • the classification rule DB 16 is a database that stores the classification rules generated by the control unit 20. Specifically, the classification rule DB 16 stores the character string and the category of the classification destination in association with each other. The classification rules stored here are generated by the control unit 20, which will be described later, are applied to a business classification model, and are used for classification of business logs.
  • the control unit 20 is a processing unit that performs overall processing of the classification device 10, and is, for example, a processor.
  • the control unit 20 includes a learning unit 30 and a classification unit 60.
  • The learning unit 30 and the classification unit 60 are examples of an electronic circuit included in a processor or of a process executed by a processor or the like.
  • the learning unit 30 includes an appearance frequency processing unit 40 and a rule processing unit 50.
  • the learning unit 30 learns the relationship between a business log and the frequency of occurrence of a character string appearing in the business log, and generates a classification rule.
  • the appearance frequency processing unit 40 includes a morphological analysis unit 41, a frequency calculation unit 42, and a frequency correction unit 43, and is a processing unit that extracts a relationship between an operation log and an appearance frequency of a character string appearing in the operation log.
  • The morphological analysis unit 41 divides the text data of each business log into words and generates connected character strings of N consecutive words (N is an arbitrary natural number; hereinafter, these may be simply referred to as character strings).
  • Specifically, the morphological analysis unit 41 divides the text data into words using a general morphological analysis technique. Then, from the extracted words, the morphological analysis unit 41 extracts connected character strings of every two consecutive words, every three consecutive words, every four consecutive words, and every five consecutive words, and outputs them to the frequency calculation unit 42.
  • FIG. 6 is a diagram for explaining a processing example of morphological analysis.
  • Here, the text data of the business log to be learned is “X Pro regular meeting_minutes.xxx”.
  • In this case, the morphological analysis unit 41 divides the text data “X Pro regular meeting_minutes.xxx” into words and extracts “X”, “Pro”, “regular”, “meeting”, “_”, “proceedings”, “record”, “.”, and “xxx”.
  • The frequency calculation unit 42 is a processing unit that calculates the appearance frequency of each character string generated by the morphological analysis unit 41. Specifically, the frequency calculation unit 42 counts how many times each character string extracted as N consecutive words appears in all the business logs to be learned, and outputs the count to the frequency correction unit 43. For example, when the character string “X Pro” appears in the business-log text data “X Pro regular meeting_minutes.xxx”, “X Pro meal meeting”, and “X Pro member minutes.yyy”, the frequency calculation unit 42 counts the appearance frequency of the character string “X Pro” as “3”. In this way, the frequency calculation unit 42 calculates the appearance frequency of each character string extracted as N words from the text data of each business log. The frequency calculation unit 42 stores each character string in the character string DB 15 in association with its appearance frequency.
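  • As an illustration of the extraction and counting steps described above, the following Python sketch connects every run of N consecutive words (here N = 2 to 5) and counts how often each connected character string appears across the business logs. It assumes word division has already been performed by a morphological analyzer; the function names and the pre-tokenized example are illustrative, not the patent's implementation.

```python
from collections import Counter
from typing import Iterable

def connected_strings(words: list[str],
                      n_values: Iterable[int] = (2, 3, 4, 5)) -> list[str]:
    """Connect every run of N consecutive words into one character string,
    for each N (2 to 5 words, as in the description above)."""
    strings = []
    for n in n_values:
        for i in range(len(words) - n + 1):
            strings.append("".join(words[i:i + n]))
    return strings

def count_appearance_frequency(tokenized_logs: list[list[str]]) -> Counter:
    """Count F(w): how many times each connected character string appears
    across all business logs to be learned."""
    freq = Counter()
    for words in tokenized_logs:
        freq.update(connected_strings(words))
    return freq

# e.g. the pre-tokenized business log "X Pro regular meeting_minutes.xxx"
logs = [["X", "Pro", "regular", "meeting", "_", "proceedings", "record", ".", "xxx"]]
print(count_appearance_frequency(logs)["XPro"])  # -> 1
```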
  • The frequency correction unit 43 is a processing unit that totals the classification destinations of the business logs including each character string, calculates entropy as an index of their variation, and corrects the appearance frequency of each character string. Specifically, the frequency correction unit 43 specifies into which classification categories the business logs containing each character string are classified under the current classification method, and thereby specifies the distribution of classification destinations. Then, the frequency correction unit 43 performs a correction that reduces the appearance frequency of character strings whose logs are classified into many different classification categories. That is, the frequency correction unit 43 lowers the appearance frequency of a character string whose classification destinations vary.
  • FIG. 7 is a diagram for explaining calculation of the appearance frequency and correction of the appearance frequency.
  • Here, the appearance frequencies of character strings when N is 2 will be described as an example.
  • For example, the frequency calculation unit 42 extracts the pairs “character string (w), appearance frequency (F(w))” such as “X Pro, 8”, “Pro regular, 10”, “regular meeting, 144”, “meeting_, 88”, “_proceedings, 37”, “minutes, 94”, “record., 22”, and “.xxx, 540”.
  • Next, the frequency correction unit 43 specifies the classification destination distribution of each character string by referring to the business classification DB 14 and the character string DB 15. For example, the frequency correction unit 43 specifies that, among the business logs (text data) including the character string “regular meeting”, “66” are classified into the classification category “workplace activity”, “20” into the classification category “product planning”, “13” into the classification category “sales expansion”, “7” into the classification category “design/development”, “4” into the classification category “customer correspondence”, and “4” into the classification category “other”.
  • Then, the frequency correction unit 43 calculates the entropy using Expression (1), based on the ratio at which the classification destination of a business log including the character string (w) is the category c.
  • "w” is a character string "regular meeting (144)”
  • "c” is a business classification category of "workplace planning (66), product planning (20)”. ), Sales expansion (13), design / development (7), customer service (4), and other (4) ".
  • Subsequently, the frequency correction unit 43 corrects the appearance frequency calculated for each character string by using Expression (2). “F(w)” in Expression (2) is the appearance frequency calculated by the frequency calculation unit 42, and “b” is a value larger than 1.0, which is set to 8.0 here.
  • the frequency correction unit 43 executes the specification of the distribution of the classification destination, the calculation of the entropy, and the correction of the appearance frequency for each character string extracted as N consecutive words. Then, the frequency correction unit 43 stores the corrected appearance frequency in the character string DB 15.
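  • Expressions (1) and (2) themselves appear only in the drawings and are not reproduced in this text, so the following Python sketch uses an assumed standard entropy H(w) = -Σc P(c|w)·log P(c|w) and an assumed correction Fe(w) = F(w) / b^H(w) with b = 8.0. These forms are consistent with the numerical example above (6 / 8^0.45 ≈ 2.35 for “X Pro regular”), but the exact expressions and the logarithm base should be treated as assumptions.

```python
import math
from collections import Counter

def entropy(class_counts: Counter) -> float:
    """Assumed form of Expression (1): H(w) = -sum_c P(c|w) * log P(c|w),
    where P(c|w) is the ratio of business logs containing w whose
    classification destination is category c."""
    total = sum(class_counts.values())
    h = 0.0
    for count in class_counts.values():
        p = count / total
        h -= p * math.log(p)
    return h

def corrected_frequency(freq: float, class_counts: Counter, b: float = 8.0) -> float:
    """Assumed form of Expression (2): Fe(w) = F(w) / b**H(w), with b > 1.0
    (8.0 in this embodiment), so strings whose classification destinations
    vary widely get their appearance frequency lowered.
    Sanity check against the example above: 6 / 8**0.45 is about 2.35."""
    return freq / (b ** entropy(class_counts))

# "regular meeting": workplace activity 66, product planning 20,
# sales expansion 13, design/development 7, customer correspondence 4, other 4
dist = Counter({"workplace activity": 66, "product planning": 20,
                "sales expansion": 13, "design/development": 7,
                "customer correspondence": 4, "other": 4})
print(round(entropy(dist), 2), round(corrected_frequency(144, dist), 2))
```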
  • FIG. 8 is a diagram illustrating a correction result of the appearance frequency.
  • As shown in FIG. 8, the appearance frequency of the character string “X Pro” is corrected from “8” to “1.23”, and the appearance frequency of the character string “Pro regular” is corrected from “10” to “1.41”.
  • Likewise, the appearance frequency of the character string “X Pro regular” is corrected from “6” to “2.35”, and the appearance frequency of the character string “Pro regular meeting” is corrected from “10” to “1.41”.
  • Although “X Pro regular” has a lower appearance frequency than “X Pro”, “Pro regular”, and “Pro regular meeting”, the variation (entropy) of its classification destination distribution is small, so its corrected appearance frequency becomes the highest of the four.
  • The rule processing unit 50 includes a text division unit 51 and a rule generation unit 52, and is a processing unit that extracts character strings suitable for categorization from the text data of each business log using the corrected appearance frequencies generated by the appearance frequency processing unit 40, and generates classification rules.
  • the text division unit 51 is a processing unit that divides a business log (text data), which is learning data, into words using the corrected appearance frequency. Specifically, the text division unit 51 searches for a division unit of text data based on the corrected appearance frequency of the character string. Then, the text division unit 51 divides the text data in the searched unit and outputs the division result to the rule generation unit 52.
  • FIG. 9 is a diagram for explaining the specification of a character string boundary based on the corrected appearance frequency.
  • FIG. 9 shows an example of word division when N is 3.
  • As shown in FIG. 9, the text division unit 51 divides the text data “X Pro regular meeting_minutes.xxx” into words and extracts “X”, “Pro”, “regular”, “meeting”, “_”, “proceedings”, “record”, “.”, and “xxx”.
  • Then, for the attention boundary 1, the text division unit 51 acquires (1) the corrected appearance frequency “2.35” of the character string “X Pro regular”, (2) the corrected appearance frequency “2.87” of the character string “meeting_proceedings”, (3) the corrected appearance frequency “1.41” of the character string “Pro regular meeting”, and (4) the corrected appearance frequency “1.58” of the character string “regular meeting_”.
  • That is, the text division unit 51 specifies the corrected appearance frequencies “2.35” of (1) and “2.87” of (2) for the character strings on either side of attention boundary 1, and the corrected appearance frequencies “1.41” of (3) and “1.58” of (4) for the character strings that straddle attention boundary 1.
  • FIG. 10 is a diagram illustrating text division based on the corrected appearance frequency.
  • As shown in FIG. 10, for the text data “X Pro regular meeting_minutes.xxx”, the text division unit 51 specifies a score indicating the ease of division at each word boundary, from the top of the text, based on the ratios of the corrected appearance frequencies of the character strings that straddle each boundary calculated as in FIG. 9: for example, a score of “0.50” between the words “X” and “Pro”, a score of “0.50” between the words “Pro” and “regular”, and a score of “0.67” between the words “regular” and “meeting”. Then, the text division unit 51 sets, as a division boundary, any boundary that satisfies either condition 1, “the score exceeds a threshold value (for example, 0.5)”, or condition 2, “the score is higher than the scores of both adjacent boundaries”.
  • For example, since the score “0.67” between the word “regular” and the word “meeting” satisfies condition 1, the text division unit 51 determines that boundary as a division boundary.
  • Similarly, the text division unit 51 determines the boundary between the word “_” and the word “proceedings” as a division boundary, because its score “0.75” satisfies condition 1.
  • In this way, the text division unit 51 determines the division boundaries, and outputs information on the determined division boundaries to the rule generation unit 52.
  • As a result, division boundaries different from those determined by the simple appearance frequency are obtained. For example, as shown in FIG. 10(B), when boundaries whose score based on the uncorrected appearance frequency exceeds the threshold value (for example, 0.5) are set as division boundaries, division boundaries are determined between the word “X” and the word “Pro”, between the word “Pro” and the word “regular”, between the word “_” and the word “proceedings”, and between the word “record” and the word “.”.
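  • The score itself is defined with reference to FIGS. 9 and 10 and its exact formula is not reproduced here, so the following Python sketch takes the per-boundary scores as given and only illustrates the boundary selection by condition 1 and condition 2 described above, together with the division of the word sequence at the selected boundaries. The edge handling for the first and last boundaries, and all names, are assumptions.

```python
def select_division_boundaries(scores: list[float], threshold: float = 0.5) -> list[int]:
    """Given a score for each candidate boundary between adjacent words
    (scores[i] = boundary between word i and word i+1), return the indices
    chosen as division boundaries.

    Condition 1: the score exceeds the threshold (e.g. 0.5).
    Condition 2: the score is higher than both adjacent boundary scores.
    A boundary satisfying either condition becomes a division boundary."""
    boundaries = []
    for i, score in enumerate(scores):
        # Missing neighbors at the edges are treated as -inf (assumption).
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if score > threshold or (score > left and score > right):
            boundaries.append(i)
    return boundaries

def split_by_boundaries(words: list[str], boundaries: list[int]) -> list[str]:
    """Divide the word sequence at the selected boundaries and join each
    segment back into a divided character string."""
    segments, start = [], 0
    for b in sorted(boundaries):
        segments.append("".join(words[start:b + 1]))
        start = b + 1
    segments.append("".join(words[start:]))
    return segments

# Division boundaries corresponding to FIG. 10(A) in this English rendering:
# after "regular" (index 2), after "_" (index 4), and after "record" (index 6).
words = ["X", "Pro", "regular", "meeting", "_", "proceedings", "record", ".", "xxx"]
print(split_by_boundaries(words, [2, 4, 6]))
# -> ['XProregular', 'meeting_', 'proceedingsrecord', '.xxx']
# ("proceedingsrecord" corresponds to the divided character string "minutes")
```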
  • The rule generation unit 52 is a processing unit that extracts divided character strings based on the division boundaries determined by the text division unit 51, and determines, from the extracted divided character strings, the divided character strings suitable for the classification categories. Specifically, the rule generation unit 52 divides the text data (business log), which is the learning data, according to the division boundaries determined by the method shown in FIG. 10(A), and extracts the corresponding divided character strings. Then, the rule generation unit 52 specifies, from among the extracted divided character strings, the divided character string most suitable for classification, based on the above-described entropy and the appearance frequency before correction. After that, the rule generation unit 52 generates a classification rule that associates the specified divided character string with the classification category into which the text data serving as the learning data is classified, and stores the generated classification rule in the classification rule DB 16.
  • FIG. 11 is a diagram illustrating an example of generating a classification rule.
  • As shown in FIG. 11, the rule generation unit 52 divides the business log “X Pro regular meeting_minutes.xxx”, which is the learning data, into the divided character strings “X Pro regular”, “meeting_”, “minutes”, and “.xxx” according to the division boundaries determined in FIG. 10(A). Subsequently, the rule generation unit 52 acquires the uncorrected appearance frequency and the entropy, calculated by the appearance frequency processing unit 40, of each of the divided character strings “X Pro regular”, “meeting_”, “minutes”, and “.xxx” from the character string DB 15.
  • Then, the rule generation unit 52 extracts the divided character strings that satisfy a predetermined condition as characteristic words of the learning data.
  • Specifically, the rule generation unit 52 acquires the appearance frequency “6” and the entropy “0.45” for the divided character string “X Pro regular”, the appearance frequency “88” and the entropy “1.44” for the divided character string “meeting_”, the appearance frequency “94” and the entropy “1.23” for the divided character string “minutes”, and the appearance frequency “540” and the entropy “1.38” for the divided character string “.xxx”. Then, the rule generation unit 52 specifies the divided character string “X Pro regular”, which satisfies the condition “appearance frequency > 3 and entropy < 0.5”.
  • Subsequently, the rule generation unit 52 specifies, from the business classification DB 14, the classification category “product planning” of the business log “X Pro regular meeting_minutes.xxx” serving as the learning data. Then, the rule generation unit 52 generates a classification rule that associates the divided character string “X Pro regular” with the classification category “product planning”.
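  • A minimal sketch of this characteristic-word selection and rule generation step follows, assuming the condition quoted above (“appearance frequency > 3 and entropy < 0.5”) and that the per-string (appearance frequency, entropy) pairs of this example are available; the thresholds are those of the example, not fixed parameters of the method, and the data layout is an assumption.

```python
def generate_classification_rules(divided_strings, stats, category,
                                  min_freq=3, max_entropy=0.5):
    """Pick divided character strings whose uncorrected appearance frequency
    exceeds min_freq and whose entropy is below max_entropy, and pair each
    with the classification category of the business log (learning data)."""
    rules = []
    for w in divided_strings:
        freq, h = stats[w]          # (appearance frequency F(w), entropy H(w))
        if freq > min_freq and h < max_entropy:
            rules.append((w, category))
    return rules

# Values from the example above for "X Pro regular meeting_minutes.xxx"
stats = {"X Pro regular": (6, 0.45), "meeting_": (88, 1.44),
         "minutes": (94, 1.23), ".xxx": (540, 1.38)}
rules = generate_classification_rules(stats.keys(), stats, "product planning")
print(rules)  # [('X Pro regular', 'product planning')]
```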
  • The classification unit 60 is a processing unit that classifies business logs according to the classification rules in the classification rule DB 16. Specifically, the classification unit 60 acquires a newly generated classification target business log, and determines whether it includes any of the divided character strings of the classification rules stored in the classification rule DB 16. Then, the classification unit 60 classifies the classification target business log into the classification category associated with the divided character string included in that log, and stores the result in the business classification DB 14.
  • For example, when the classification target business log includes a characteristic word associated with “sales expansion”, the classification unit 60 classifies the classification target business log as “sales expansion” in accordance with the classification rules shown in FIG. 11. Similarly, when the classification target business log includes “Z system administrator regular”, the classification unit 60 classifies the classification target business log as “design/development” in accordance with the classification rules shown in FIG. 11.
  • Note that when a classification target business log includes a plurality of divided character strings, the classification unit 60 may select any one of the divided character strings, or may associate the business log with a plurality of classification categories.
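  • As a sketch of this classification step, the following matches the characteristic words of the classification rules against a classification target business log by simple substring search and returns every matching category; the “Z system administrator regular” rule is taken from the example above, while the example target log, the matching strategy, and all names are assumptions.

```python
def classify(business_log_text: str, rules: list[tuple[str, str]]) -> list[str]:
    """Return the classification categories whose characteristic word
    (divided character string) appears in the classification target
    business log; more than one category may match."""
    matched = []
    for characteristic_word, category in rules:
        if characteristic_word in business_log_text and category not in matched:
            matched.append(category)
    return matched

rules = [("X Pro regular", "product planning"),
         ("Z system administrator regular", "design/development")]
# Hypothetical classification target log used only for illustration
print(classify("Z system administrator regular meeting agenda.zzz", rules))
# -> ['design/development']
```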
  • FIG. 12 is a flowchart illustrating the flow of the appearance frequency process. As illustrated in FIG. 12, the appearance frequency processing unit 40 acquires all the business logs (S101), and determines whether the word division processing has been performed on all the business logs (S102).
  • When the word division processing has not been performed on all the business logs (S102: No), the appearance frequency processing unit 40 extracts one business log (S103), divides the business log into words (S104), and extracts the character strings of N consecutive words (S105).
  • When the word division processing has been performed on all the business logs (S102: Yes), the appearance frequency processing unit 40 acquires all the extracted character strings (S106). Subsequently, the appearance frequency processing unit 40 determines whether the appearance frequency correction processing has been completed for all the character strings (S107).
  • When the correction processing has not been completed for all the character strings (S107: No), the appearance frequency processing unit 40 extracts one character string (S108), and acquires the classification destinations of the business logs including that character string from the business classification DB 14 (S109). After that, the appearance frequency processing unit 40 calculates the appearance frequency and the corrected appearance frequency, and updates the appearance frequency (S110). On the other hand, when the correction processing has been completed for all the character strings (S107: Yes), the appearance frequency processing unit 40 ends the processing.
  • FIG. 13 is a flowchart illustrating the flow of the rule generation process.
  • the rule processing unit 50 acquires all the business logs (S201), and determines whether or not the character string extraction processing has been performed for all the business logs (S202).
  • When the character string extraction processing has not been performed for all the business logs (S202: No), the rule processing unit 50 extracts one business log (S203), divides the business log based on the corrected appearance frequencies, and extracts each divided character string (S204).
  • When the character string extraction processing has been performed for all the business logs (S202: Yes), the rule processing unit 50 acquires all the extracted divided character strings (S205). Subsequently, the rule processing unit 50 determines whether the classification rule generation processing has been completed for all the divided character strings (S206).
  • When the classification rule generation processing has not been completed for all the divided character strings (S206: No), the rule processing unit 50 extracts one divided character string (S207), and acquires the appearance frequency of the divided character string and the distribution of the classification destinations of the business logs including it from the respective DBs (S208). After that, when the processing target divided character string satisfies the condition (S209: Yes), the rule processing unit 50 adds it to the classification rules (S210) and repeats S206 and the subsequent steps; when the processing target divided character string does not satisfy the condition (S209: No), the rule processing unit 50 repeats S206 and the subsequent steps without executing S210. On the other hand, when the classification rule generation processing has been completed for all the divided character strings (S206: Yes), the rule processing unit 50 ends the processing.
  • As described above, the classification device 10 corrects the appearance frequency of each character string based on the distribution of the classification destinations of the business logs including that character string, and performs text division using the corrected appearance frequency information. Accordingly, character strings in units suitable for the subsequent classification of business content can be acquired as characteristic words, and classification accuracy is improved by using rules based on these characteristic words. That is, because the classification device 10 takes the subsequent classification into account at the stage of dividing the text data and extracts characteristic words using the distribution (variation) of classification destinations for each character string, it can generate classification rules with high classification accuracy.
  • the “appearance frequency” of a word, a character string, a divided character string, or the like in the above-described embodiment can be replaced with “sum of business operation hours (total business hours)”.
  • The components of each device shown in the drawings are functionally conceptual and do not necessarily need to be physically configured as shown. That is, the specific form of distribution and integration of each device is not limited to the illustrated form; all or part of the components can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, all or any part of each processing function performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or can be realized as hardware by wired logic.
  • FIG. 14 is a diagram illustrating an example of a hardware configuration.
  • the classification device 10 includes a communication device 10a, a hard disk drive (HDD) 10b, a memory 10c, and a processor 10d.
  • the units shown in FIG. 14 are mutually connected by a bus or the like.
  • the communication device 10a is a network interface card or the like, and performs communication with another server.
  • The HDD 10b stores programs and DBs for operating the functions shown in FIG. 2.
  • The processor 10d reads a program that executes the same processing as each processing unit illustrated in FIG. 2 from the HDD 10b or the like and expands it in the memory 10c, thereby operating a process that executes each function described with reference to FIG. 2 and the like. That is, this process performs the same functions as each processing unit included in the classification device 10. Specifically, the processor 10d reads, from the HDD 10b or the like, a program having the same functions as the appearance frequency processing unit 40 and the rule processing unit 50. Then, the processor 10d executes a process that performs the same processing as the appearance frequency processing unit 40, the rule processing unit 50, and the like.
  • the classification device 10 operates as an information processing device that executes a classification method by reading and executing a program.
  • The classification device 10 can also realize the same functions as those of the above-described embodiment by reading the program from a recording medium with a medium reading device and executing the read program.
  • The program referred to in the embodiments is not limited to being executed by the classification device 10. The present invention can be similarly applied to a case where another computer or server executes the program, or a case where such devices execute the program in cooperation with each other.
  • This program can be distributed via networks such as the Internet.
  • This program can be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (magneto-optical disk), or a DVD (Digital Versatile Disc), and can be executed by being read from the recording medium by a computer.
  • Reference Signs List: 10 classification device; 11 communication unit; 12 storage unit; 13 business log DB; 14 business classification DB; 15 character string DB; 16 classification rule DB; 20 control unit; 30 learning unit; 40 appearance frequency processing unit; 41 morphological analysis unit; 42 frequency calculation unit; 43 frequency correction unit; 50 rule processing unit; 51 text division unit; 52 rule generation unit; 60 classification unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a classification device that extracts a plurality of connected character strings in which words, obtained by dividing text data into word units, are connected. For each of the plurality of connected character strings, the classification device refers to a storage unit that stores the classification categories into which each of a plurality of text data items, including said text data, is classified, and calculates the distribution of the categories into which the text data items that include the relevant character string are classified. On the basis of this distribution, the classification device selects characteristic words from among the plurality of connected character strings, and generates a classification rule in which the classification categories into which the text data is classified are associated with the characteristic words.
PCT/JP2018/032449 2018-08-31 2018-08-31 Programme de génération de règle de classification, procédé de génération de règle de classification, et dispositif de génération de règle de classification WO2020044558A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2018/032449 WO2020044558A1 (fr) 2018-08-31 2018-08-31 Programme de génération de règle de classification, procédé de génération de règle de classification, et dispositif de génération de règle de classification
JP2020540004A JP7044162B2 (ja) 2018-08-31 2018-08-31 分類規則生成プログラム、分類規則生成方法および分類規則生成装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2018/032449 WO2020044558A1 (fr) 2018-08-31 2018-08-31 Programme de génération de règle de classification, procédé de génération de règle de classification, et dispositif de génération de règle de classification

Publications (1)

Publication Number Publication Date
WO2020044558A1 true WO2020044558A1 (fr) 2020-03-05

Family

ID=69642878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/032449 WO2020044558A1 (fr) 2018-08-31 2018-08-31 Programme de génération de règle de classification, procédé de génération de règle de classification, et dispositif de génération de règle de classification

Country Status (2)

Country Link
JP (1) JP7044162B2 (fr)
WO (1) WO2020044558A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7472652B2 (ja) 2020-05-21 2024-04-23 富士通株式会社 分類プログラム、分類方法、及び分類装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003085170A (ja) * 2001-09-11 2003-03-20 Nippon Hoso Kyokai <Nhk> 定型文クラスタリング装置および方法
WO2011071174A1 (fr) * 2009-12-10 2011-06-16 日本電気株式会社 Procédé d'exploration de texte, dispositif d'exploration de texte et programme d'exploration de texte
JP2011123706A (ja) * 2009-12-11 2011-06-23 Advanced Media Inc 文章分類装置および文章分類方法
WO2014208298A1 (fr) * 2013-06-28 2014-12-31 日本電気株式会社 Dispositif de classification de texte, procédé de classification de texte et support d'enregistrement

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003085170A (ja) * 2001-09-11 2003-03-20 Nippon Hoso Kyokai <Nhk> 定型文クラスタリング装置および方法
WO2011071174A1 (fr) * 2009-12-10 2011-06-16 日本電気株式会社 Procédé d'exploration de texte, dispositif d'exploration de texte et programme d'exploration de texte
JP2011123706A (ja) * 2009-12-11 2011-06-23 Advanced Media Inc 文章分類装置および文章分類方法
WO2014208298A1 (fr) * 2013-06-28 2014-12-31 日本電気株式会社 Dispositif de classification de texte, procédé de classification de texte et support d'enregistrement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ISHIDA, EMI: "An overview of text categorization", THE JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY ASSOCIATION, vol. 56, no. 10, 1 October 2006 (2006-10-01), pages 469 - 474 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7472652B2 (ja) 2020-05-21 2024-04-23 富士通株式会社 分類プログラム、分類方法、及び分類装置

Also Published As

Publication number Publication date
JP7044162B2 (ja) 2022-03-30
JPWO2020044558A1 (ja) 2021-04-30

Similar Documents

Publication Publication Date Title
US10318617B2 (en) Methods and apparatus for extraction of content from an email or email threads for use in providing implicit profile attributes and content for recommendation engines
US20130035929A1 (en) Information processing apparatus and method
CN102945246B (zh) 网络信息数据的处理方法及装置
US8433666B2 (en) Link information extracting apparatus, link information extracting method, and recording medium
CN111026961A (zh) 标引多重数据元素内的感兴趣的数据的方法及系统
US20070092857A1 (en) Method and apparatus for supporting training, and computer product
WO2020044558A1 (fr) Programme de génération de règle de classification, procédé de génération de règle de classification, et dispositif de génération de règle de classification
US20120254166A1 (en) Signature Detection in E-Mails
JP2021092925A (ja) データ生成装置およびデータ生成方法
CN110209780B (zh) 一种问题模板生成方法、装置、服务器及存储介质
US10474700B2 (en) Robust stream filtering based on reference document
CN112567364B (zh) 知识信息创建支援装置
JP5271863B2 (ja) 情報分析装置、情報分析方法および情報分析プログラム
JPWO2020111074A1 (ja) メール分類装置、メール分類方法、およびコンピュータプログラム
US20190392005A1 (en) Speech dialogue system, model creating device, model creating method
US10169418B2 (en) Deriving a multi-pass matching algorithm for data de-duplication
US9824140B2 (en) Method of creating classification pattern, apparatus, and recording medium
JP2008027431A (ja) 情報解析装置、情報解析方法、及び情報解析プログラム
CN102236652A (zh) 一种信息的分类方法和装置
JP2001022727A (ja) テキスト分類学習方法及び装置及びテキスト分類学習プログラムを格納した記憶媒体
US20230032143A1 (en) Log generation apparatus, log generation method, and computer readable recording medium
KR20080026931A (ko) 약어 생성 유형을 고려하는 약어 사전 자동 구축 방법, 그기록 매체 및 약어 생성 유형을 고려하는 약어 사전 자동구축 장치
JP6040138B2 (ja) 文書分類装置、文書分類方法および文書分類プログラム
WO2020017037A1 (fr) Dispositif d&#39;analyse de journal, procédé d&#39;analyse de journal et programme
JP2012022443A (ja) 文書検索装置、文書検索方法及び文書検索プログラム

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020540004

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18931488

Country of ref document: EP

Kind code of ref document: A1