WO2021234799A1 - Data processing device, data processing method, and data processing program - Google Patents

Data processing device, data processing method, and data processing program

Info

Publication number
WO2021234799A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
compound word
data processing
word candidate
data
Prior art date
Application number
PCT/JP2020/019700
Other languages
English (en)
Japanese (ja)
Inventor
聡 須永
一宏 菊間
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/019700
Publication of WO2021234799A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • the present invention relates to a data processing apparatus, a data processing method, and a data processing program.
  • the present invention has been made in view of the above, and an object of the present invention is to provide a data processing apparatus, a data processing method, and a data processing program capable of automatically generating technical terms from document data with high accuracy.
  • the data processing apparatus has an extraction unit that extracts morphemes of a specific part of speech and part-of-speech type from part-of-speech information and part-of-speech type information, which are analysis results obtained by morphological analysis of document data; a generation unit that generates, as a compound word candidate, a word obtained by connecting the extracted morphemes; and a determination unit that searches whether the compound word candidate exactly matches a headword of the Japanese dictionary electronic dictionary data, determines that the compound word candidate is not a technical term if it matches exactly, and determines that the compound word candidate is a technical term if there is no exact match.
  • the data processing method includes a step of extracting morphemes of a specific part of speech and part-of-speech type from part-of-speech information and part-of-speech type information, which are analysis results obtained by morphological analysis of document data; a step of generating, as a compound word candidate, a word obtained by connecting the extracted morphemes; and a step of searching whether the compound word candidate exactly matches a headword of the Japanese dictionary electronic dictionary data, determining that the compound word candidate is not a technical term if it matches exactly, and determining that the compound word candidate is a technical term if there is no exact match.
  • the data processing program causes a computer to execute a step of extracting morphemes of a specific part of speech and part-of-speech type from part-of-speech information and part-of-speech type information, which are analysis results obtained by morphological analysis of document data; a step of generating, as a compound word candidate, a word obtained by connecting the extracted morphemes; and a step of searching whether the compound word candidate exactly matches a headword of the Japanese dictionary electronic dictionary data, determining that the compound word candidate is not a technical term if it matches exactly, and determining that the compound word candidate is a technical term if there is no exact match.
  • FIG. 1 is a diagram schematically showing an example of a configuration of a data processing device according to an embodiment.
  • FIG. 2 is a diagram showing the results of verification in which nouns appearing in the requirement specifications of a certain organization are combined and connected.
  • FIG. 3 is a diagram illustrating the first half of the processing flow of the data processing apparatus shown in FIG.
  • FIG. 4 is a diagram illustrating the latter half of the processing flow of the data processing apparatus shown in FIG.
  • FIG. 5 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • FIG. 6 is a diagram illustrating a conventional technical term generation method.
  • FIG. 7 is a diagram showing the relationship between compound words and technical terms.
  • FIG. 8 is a diagram showing an example of a computer in which a data processing device is realized by executing a program.
  • FIG. 1 is a diagram schematically showing an example of a configuration of a data processing device according to an embodiment.
  • the data processing device 10 performs information processing including natural language processing, and has an input unit 11, a communication unit 12, an output unit 13, a storage unit 14, and a control unit 15.
  • the input unit 11 is an input interface that receives various operations from the operator of the data processing device 10.
  • the input unit 11 is composed of an input device such as a touch panel, a voice input device, a keyboard, or a mouse.
  • the communication unit 12 is a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like.
  • the communication unit 12 is realized by a NIC (Network Interface Card) or the like, and handles communication between the control unit 15 (described later) and other devices via a telecommunication line such as a LAN (Local Area Network) or the Internet.
  • the communication unit 12 receives the data of the electronic file document via the network and outputs it to the control unit 15. Further, the communication unit 12 outputs the information indicating the technical terms generated by the control unit 15 to an external device via the network.
  • the output unit 13 is realized by, for example, a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like, and outputs information or the like indicating technical terms generated by the control unit 15.
  • the storage unit 14 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disk.
  • the storage unit 14 may be a rewritable semiconductor memory such as RAM (Random Access Memory), flash memory, or NVSRAM (Non Volatile Static Random Access Memory).
  • the storage unit 14 stores an OS (Operating System) and various programs executed by the data processing device 10. Further, the storage unit 14 stores various information used in executing the program.
  • the storage unit 14 stores electronic file document data 141, which is a text document to be processed, Japanese dictionary electronic dictionary data 142, and technical term data 143 including technical terms generated by the control unit 15.
  • the control unit 15 controls the entire data processing device 10.
  • the control unit 15 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
  • the control unit 15 has an internal memory for storing programs and control data that specify various processing procedures, and executes each process using the internal memory. Further, the control unit 15 functions as various processing units by operating various programs.
  • the control unit 15 has a morphological analysis unit 151, an extraction unit 152, a compound word candidate generation unit 153 (generation unit), and a determination unit 154.
  • the morphological analysis unit 151 analyzes the sentence or sentences from which compound words are to be extracted in the electronic file document data 141, decomposes them into morphemes, and acquires the part-of-speech information and part-of-speech type (part-of-speech subclassification) information of each morpheme.
  • the morphological analysis unit 151 performs morphological analysis using, for example, the morphological analysis tool MeCab.
  • the part-of-speech types of nouns are classified into general, proper noun, number, pronoun, sa-irregular connection, adverbial, adjectival-verb stem, nai-adjective stem, non-independent, suffix, and so on.
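As a minimal sketch of the data that the later steps consume, the following Python snippet parses analyzer output into (surface, part of speech, part-of-speech type) records. It assumes the text layout that MeCab prints with an IPAdic-style dictionary (surface form, a tab, then comma-separated features, with `EOS` closing each sentence). The `Morpheme` type, the `parse_analyzer_output` helper, and the sample lines are hypothetical and written by hand; they are not taken from the patent or from a real analyzer run.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str   # the token as it appears in the text
    pos: str       # part of speech, e.g. "名詞" (noun)
    pos_type: str  # part-of-speech subclassification, e.g. "一般" (general)


def parse_analyzer_output(text: str) -> list[Morpheme]:
    """Parse lines of the form 'surface<TAB>pos,pos_type,...'; skip 'EOS'."""
    morphemes = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        pos = fields[0]
        pos_type = fields[1] if len(fields) > 1 else "*"
        morphemes.append(Morpheme(surface, pos, pos_type))
    return morphemes


# Hand-written sample in the assumed layout (not real analyzer output).
SAMPLE = (
    "1\t名詞,数,*,*,*,*,*\n"
    "セッション\t名詞,一般,*,*,*,*,セッション,セッション,セッション\n"
    "制御\t名詞,サ変接続,*,*,*,*,制御,セイギョ,セイギョ\n"
    "サーバ\t名詞,一般,*,*,*,*,サーバ,サーバ,サーバ\n"
    "EOS\n"
)

morphemes = parse_analyzer_output(SAMPLE)
for m in morphemes:
    print(m.surface, m.pos, m.pos_type)
```

Only the first two feature fields are kept here, because the description relies on the part of speech and its subclassification alone.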
  • FIG. 2 is a diagram showing the results of verification in which nouns appearing in the requirement specifications of a certain organization are combined and linked.
  • the control unit 15 extracts specific nouns, extracts points where those words are connected, and connects them to generate technical term candidates.
  • the extraction unit 152 extracts morphemes of a specific part of speech and part-of-speech type from the part-of-speech information and the part-of-speech type information obtained by the morphological analysis.
  • the extraction unit 152 extracts morphemes whose part of speech is a noun and whose part-of-speech type is a specific type.
  • the extraction unit 152 extracts morphemes whose part of speech is a noun and whose part-of-speech type is general, proper noun, or sa-irregular connection.
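The extraction rule described above can be sketched as a simple filter. This is an illustration under assumptions, not the patent's implementation: the English labels ("noun", "general", and so on), the `Morpheme` type, and the `extract_targets` helper are hypothetical stand-ins for whatever labels the morphological analyzer actually emits.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str
    pos: str       # part of speech
    pos_type: str  # part-of-speech type (subclassification)


# Assumed English labels for the three noun subtypes named in the description.
TARGET_POS = "noun"
TARGET_TYPES = frozenset({"general", "proper noun", "sa-irregular connection"})


def extract_targets(morphemes):
    """Return (position, morpheme) pairs for nouns of a targeted subtype."""
    return [
        (i, m)
        for i, m in enumerate(morphemes)
        if m.pos == TARGET_POS and m.pos_type in TARGET_TYPES
    ]


# Hand-tagged English glosses, for illustration only.
analysed = [
    Morpheme("1", "noun", "number"),
    Morpheme("session", "noun", "general"),
    Morpheme("control", "noun", "sa-irregular connection"),
    Morpheme("server", "noun", "general"),
    Morpheme("of", "particle", "adnominal"),
]

hits = extract_targets(analysed)
print([(i, m.surface) for i, m in hits])
```

Positions are returned alongside the morphemes so that a later step can tell which extracted morphemes were adjacent in the original sentence.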
  • the compound word candidate generation unit 153 generates a word obtained by connecting the morphemes extracted by the extraction unit 152 as a compound word candidate.
  • Compound word candidates are technical term candidates.
  • a compound word is a word (one word) in which a plurality of nouns are connected.
  • examples of compound words in a specialized field include the word formed by connecting "tracing", "collection", and "function", and "operation data backup file", which is a combination of "operation", "data", "backup", and "file".
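A minimal sketch of candidate generation follows: consecutive extracted morphemes are joined into one word, and any non-extracted morpheme ends the current run. The `generate_candidates` helper, the boolean flags, and the `min_len` parameter are assumptions made for illustration. The minimum of two morphemes reflects the statement that a compound word connects a plurality of nouns, and the space separator is used only because the sample uses English glosses (Japanese text would be joined with an empty separator).

```python
def generate_candidates(tokens, sep="", min_len=2):
    """Join runs of consecutive flagged tokens into compound word candidates.

    tokens: iterable of (surface, is_target) pairs in sentence order, where
    is_target marks morphemes kept by the extraction step.
    """
    candidates = []
    run = []
    for surface, is_target in tokens:
        if is_target:
            run.append(surface)
            continue
        # A non-target morpheme closes the current run.
        if len(run) >= min_len:
            candidates.append(sep.join(run))
        run = []
    # Flush a run that reaches the end of the sentence.
    if len(run) >= min_len:
        candidates.append(sep.join(run))
    return candidates


# Hand-flagged English glosses, for illustration only.
tokens = [
    ("operation", True), ("data", True), ("backup", True), ("file", True),
    ("and", False),
    ("log", True),
    ("for", False),
    ("tracing", True), ("collection", True), ("function", True),
]

candidates = generate_candidates(tokens, sep=" ")
print(candidates)
```

In this sketch the isolated noun "log" is dropped because it is not connected to another extracted morpheme.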
  • the determination unit 154 searches for each of the compound word candidates generated by the compound word candidate generation unit 153 whether or not the compound word candidate completely matches the headword of the Japanese dictionary electronic dictionary data 142. When the compound word candidate completely matches the headword of the Japanese dictionary electronic dictionary data 142, the determination unit 154 determines that the compound word candidate is not a technical term. When the compound word candidate does not completely match the headword of the Japanese dictionary electronic dictionary data 142, the determination unit 154 determines that the compound word candidate is a technical term. Then, the determination unit 154 stores the compound word candidate determined to be a technical term in the storage unit 14, and outputs the compound word candidate to the output unit 13, an external device, or the like.
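The determination rule can be sketched as an exact-match lookup against a set of headwords. The `determine` helper, the tiny `HEADWORDS` set, and the candidate list are hypothetical examples; a real system would load the headwords from the electronic dictionary data.

```python
def determine(candidates, headwords):
    """Split candidates into (technical_terms, general_words) by exact match."""
    technical_terms = []
    general_words = []
    for candidate in candidates:
        if candidate in headwords:
            # Exact match with a dictionary headword: not a technical term.
            general_words.append(candidate)
        else:
            # No exact match: treated as a technical term.
            technical_terms.append(candidate)
    return technical_terms, general_words


# Made-up headwords standing in for the electronic dictionary data.
HEADWORDS = frozenset({"server", "alarm", "railway station"})

candidates = ["session control server", "new alarm", "railway station"]
technical_terms, general_words = determine(candidates, HEADWORDS)
print(technical_terms)
print(general_words)
```

Because the match must be exact, "session control server" is kept as a technical term even though "server" alone is a headword in this made-up set.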
  • FIG. 3 is a diagram illustrating the first half of the processing flow of the data processing apparatus shown in FIG.
  • FIG. 4 is a diagram illustrating the latter half of the processing flow of the data processing apparatus shown in FIG.
  • the morphological analysis unit 151 decomposes, by morphological analysis, the sentence or sentences from which compound words are to be extracted in the electronic file document data 141 to be processed into morphemes, and acquires the part-of-speech information and the part-of-speech type information of each morpheme (see (1) in FIG. 3). For example, for the sentence "a new alarm No. 5 is output when the maximum number of simultaneous connections of one session control server is 300", the morphological analysis unit 151 acquires, for the morpheme "1", that the part-of-speech information is "noun" and the part-of-speech type information is "number".
  • based on the part-of-speech information and part-of-speech type information obtained by the morphological analysis, the extraction unit 152 limits extraction to points where morphemes that are "noun" and "general", "noun" and "proper noun", or "noun" and "sa-irregular connection" are connected (see (2) in FIG. 3). Then, the compound word candidate generation unit 153 generates compound word candidates by connecting the morphemes extracted by the extraction unit 152 (see (2) in FIG. 3). For example, the compound word candidate generation unit 153 connects "session", which is "noun" and "general", "control", which is "noun" and "sa-irregular connection", and "server", which is "noun" and "general", to generate the compound word candidate "session control server".
  • the determination unit 154 takes the generated compound word candidates one by one and searches whether each candidate exactly matches any of the headwords (electronic dictionary headwords) of the Japanese dictionary electronic dictionary data 142 (see (3) in FIG. 4).
  • when the compound word candidate matches one of the electronic dictionary headwords, the determination unit 154 determines that the compound word candidate is a noun (general) listed in the Japanese dictionary electronic dictionary and is not a technical term (see (4) in FIG. 4). When the compound word candidate does not match any of the electronic dictionary headwords, the determination unit 154 determines that the compound word candidate is not a noun (general) but a technical term (see (4) in FIG. 4). The determination unit 154 outputs the compound words that do not match any of the electronic dictionary headwords, for example "session control server", "maximum simultaneous connection", "phone number", and "new alarm", as technical terms, and updates the technical term dictionary.
  • FIG. 5 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • the control unit 15 analyzes, by morphological analysis, the sentence or sentences from which compound words are to be extracted in the electronic file document data 141, decomposes them into morphemes, and acquires the part-of-speech information and the part-of-speech type information of each morpheme (step S1).
  • the extraction unit 152 extracts morphemes whose part of speech is a noun and whose part-of-speech type is general, proper noun, or sa-irregular connection from the part-of-speech information and the part-of-speech type information obtained by the morphological analysis (step S2).
  • the compound word candidate generation unit 153 generates a word obtained by connecting the morphemes extracted by the extraction unit 152 as a compound word candidate (step S3).
  • the determination unit 154 extracts the generated compound word candidate, and searches whether the extracted compound word candidate exactly matches any of the headwords of the Japanese dictionary electronic dictionary data 142 (step S4). The determination unit 154 determines whether the compound word candidate completely matches the headword of the Japanese dictionary electronic dictionary (step S5).
  • when the compound word candidate does not match any of the electronic dictionary headwords (step S5: No), the determination unit 154 determines that the compound word candidate is a technical term (step S6). When the compound word candidate matches one of the electronic dictionary headwords (step S5: Yes), the determination unit 154 determines that the compound word candidate is not a technical term (step S7).
  • the determination unit 154 determines whether or not an undetermined compound word candidate remains (step S8).
  • if an undetermined compound word candidate remains, the determination unit 154 performs the process of step S4 for the next compound word candidate to be searched.
  • if no undetermined compound word candidate remains, the data processing device 10 outputs the compound word candidates determined to be technical terms and ends the process.
  • FIG. 6 is a diagram illustrating a conventional technical term generation method.
  • in the conventional method, the text is decomposed into words by morphological analysis, part-of-speech information is acquired (see (1) in FIG. 6), points where nouns are concatenated are extracted, and the nouns that appear before a word other than a noun are connected together to generate compound words (candidates) (see (2) in FIG. 6).
  • as a result, words that do not hold as technical terms because they include numbers and suffixes (unestablished words), such as "1 session control server", "maximum number of simultaneous connections 300 phone numbers", and "new alarm number 5", were generated, and the accuracy was low.
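The difference between joining every noun and joining only selected noun subtypes can be illustrated with a short sketch. The token list is a hand-tagged English gloss that loosely mirrors the example sentence; the tags, the `join_runs` helper, and the predicate names are assumptions made for illustration, not output of a real analyzer.

```python
from itertools import groupby

# (surface, part of speech, part-of-speech type); hand-tagged glosses.
TOKENS = [
    ("1", "noun", "number"),
    ("session", "noun", "general"),
    ("control", "noun", "sa-irregular connection"),
    ("server", "noun", "general"),
    ("of", "particle", "adnominal"),
    ("maximum", "noun", "general"),
    ("simultaneous", "noun", "general"),
    ("connection", "noun", "sa-irregular connection"),
    ("count", "noun", "suffix"),
    ("300", "noun", "number"),
    ("phone", "noun", "general"),
    ("number", "noun", "general"),
]

RESTRICTED_TYPES = {"general", "proper noun", "sa-irregular connection"}


def any_noun(token):
    """Conventional rule: every noun is joined, whatever its subtype."""
    return token[1] == "noun"


def restricted_noun(token):
    """Restricted rule: only the three targeted noun subtypes are joined."""
    return token[1] == "noun" and token[2] in RESTRICTED_TYPES


def join_runs(tokens, keep, sep=" "):
    """Join each run of at least two consecutive tokens accepted by keep."""
    joined = []
    for accepted, run in groupby(tokens, key=keep):
        words = [token[0] for token in run]
        if accepted and len(words) >= 2:
            joined.append(sep.join(words))
    return joined


conventional = join_runs(TOKENS, any_noun)
restricted = join_runs(TOKENS, restricted_noun)
print(conventional)
print(restricted)
```

Under these assumed tags, the conventional rule yields candidates that carry numbers and a suffix, while the restricted rule splits at those morphemes and leaves only the noun sequences.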
  • the data processing apparatus 10 limits and extracts the morphemes constituting the technical terms.
  • the data processing device 10 extracts a specific part of speech and part of speech type morphology from the part of speech information and the part of speech type information which are the analysis results obtained by the morphological analysis of the document data.
  • the data processing apparatus 10 extracts only morphemes whose part of speech is a noun and whose part-of-speech type is a specific type, specifically general, proper noun, or sa-irregular connection.
  • the data processing device 10 generates a word obtained by connecting the extracted morphemes as a compound word candidate.
  • the data processing device 10 searches whether or not the compound word candidate exactly matches a headword of the Japanese dictionary electronic dictionary data 142. If it matches exactly, the data processing device 10 determines that the compound word candidate is a noun (general) and is not a technical term; if there is no exact match, it determines that the compound word candidate is a technical term.
  • in this way, from the compound word candidates generated from the morphemes extracted under the restriction to morphemes that constitute technical terms, the data processing device 10 excludes the candidates that exactly match a headword of the Japanese dictionary electronic dictionary data 142 and treats the remaining candidates as technical terms, so technical terms can be generated accurately.
  • FIG. 7 is a diagram showing the relationship between compound words and technical terms. As shown in FIG. 7, because the data processing device 10 according to the embodiment excludes compound word candidates that exactly match a headword of the Japanese dictionary electronic dictionary data 142, it generates technical terms that are compound words of a specialized field, not compound words generally listed in a Japanese dictionary.
  • each component of each of the illustrated devices is functional and conceptual, and does not necessarily have to be physically configured as shown in the figure. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figure, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
  • FIG. 8 is a diagram showing an example of a computer in which the data processing device 10 is realized by executing a program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the data processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • a program module 1093 for executing processing similar to the functional configuration in the data processing device 10 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This data processing device (10) comprises: an extraction unit (152) that extracts morphemes of a specific part of speech and part-of-speech type from part-of-speech information and part-of-speech type information that are an analysis result obtained by morphological analysis of document data; a compound word candidate generation unit (153) that generates, as a compound word candidate, a word obtained by concatenating the morphemes extracted by the extraction unit (152); and a determination unit (154) that searches whether the compound word candidate exactly matches a headword of Japanese dictionary electronic dictionary data, determines that the compound word candidate is not a technical term in the case of an exact match, and determines that the compound word candidate is a technical term when there is no exact match.
PCT/JP2020/019700 2020-05-18 2020-05-18 Data processing device, data processing method, and data processing program WO2021234799A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/019700 WO2021234799A1 (fr) 2020-05-18 2020-05-18 Data processing device, data processing method, and data processing program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/019700 WO2021234799A1 (fr) 2020-05-18 2020-05-18 Data processing device, data processing method, and data processing program

Publications (1)

Publication Number Publication Date
WO2021234799A1 (fr) 2021-11-25

Family

ID=78708414

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/019700 WO2021234799A1 (fr) 2020-05-18 2020-05-18 Data processing device, data processing method, and data processing program

Country Status (1)

Country Link
WO (1) WO2021234799A1 (fr)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KOJI KINAMI, TETSUO IKEDA, YOSHITOSHI MURATA, TSUYOSHI TAKAYAMA, TOSHIAKI TAKEDA: "Research on Technical Term Extraction in the Nursing Domain", JOURNAL OF NATURAL LANGUAGE PROCESSING, vol. 15, no. 3, 10 July 2008 (2008-07-10), pages 3 - 20, XP009532378, ISSN: 1340-7619, DOI: 10.5715/jnlp.15.3_3 *
YANAGIDAIRA TAKAYA, KOKI SATO, SATOSHI SUNAGA, KOUJI HOSHINO, KAZUHIRO KIKUMA, KIYOSHI UEDA: "Improvement of compound word generation accuracy in automatic test item extraction method for large-scale software development", IEICE TECHNICAL REPORT, vol. 118, no. 250 (NS2018-112), 11 October 2018 (2018-10-11), JP, pages 39 - 42, XP009532377, ISSN: 0913-5685 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936261

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20936261

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP