WO2023152914A1 - Embedding device, embedding method, and embedding program - Google Patents
Embedding device, embedding method, and embedding program
- Publication number
- WO2023152914A1 (PCT/JP2022/005474)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- token
- unit
- vectorization
- user
- tokens
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
Definitions
- the present invention relates to an embedding device, an embedding method, and an embedding program.
- although there are technologies for vectorizing individual knowledge and interest characteristics (see Non-Patent Document 1) and techniques for vectorizing interaction characteristics (see Non-Patent Document 2), no vector construction technology covering both has been established. In other words, there is no established technique for constructing a vector in which both the individual knowledge characteristics and the interaction characteristics needed for matching, teaming, and the like are embedded.
- the object of the present invention is to construct a vector that embeds both individual knowledge characteristics and interaction characteristics.
- the present invention provides an embedding device that takes as input a chat history including, for each chat channel, the channel ID of the chat, the user IDs of the users who spoke on the channel, and information showing the contents of the chat in chronological order.
- the embedding device comprises a division unit that divides a chat patch into tokens of minimal constituent units, a mask unit that masks a portion of the tokens, and a token vectorization unit that vectorizes each token of the patch, including the masked tokens, using a first vectorization unit and a second vectorization unit that vectorizes the other tokens.
- the embedding device further comprises a restoration unit that uses a restoration model to restore the original value of a masked token from the result of vectorizing each token, and a learning unit that trains the token vectorization unit and the restoration unit, using the token values before masking as correct data, so that the original values of the masked tokens can be accurately restored.
- FIG. 1 is a diagram for explaining the flow of model learning by an embedding device.
- FIG. 2 is a diagram for explaining the flow of user ID vectorization by the embedding device.
- FIG. 3 is a diagram showing a configuration example of an embedding device.
- FIGS. 4A and 4B are diagrams showing examples of input data and output data of the patch extraction unit in FIG. 3.
- FIG. 5 is a diagram showing an example of input data and output data of the token division unit in FIG. 3.
- FIGS. 6A and 6B are diagrams showing examples of input data and output data of the mask unit in FIG. 3.
- FIG. 9 is a diagram showing an example of input data and output data to the user ID vectorization unit of FIG. 3 after learning.
- FIG. 10 is a flowchart illustrating an example of a model learning processing procedure by the embedding device.
- FIG. 11 is a flowchart illustrating an example of a processing procedure for vectorization of user IDs by an embedding device.
- FIG. 12 is a diagram showing a configuration example of a computer that executes an embedded program.
- the embedding device of this embodiment creates and outputs, from a user's chat history, a vector in which both the user's personal knowledge and interest characteristics and the user's interaction characteristics with other users are quantified and embedded.
- the chat history includes, for each chat channel, the channel ID of the chat, the name of the channel, the user IDs of the users who spoke on the channel, and the contents of the chat, arranged in chronological order.
- the embedding device extracts chat patches from a chat database in which chat histories are accumulated.
- a chat patch is, for example, a part of the utterances in the same chat channel extracted from the chat history.
- the embedding device divides the extracted patch into tokens and masks part of the tokens (for example, the user ID token).
- the embedding device trains a machine learning model to infer (restore) the original values of the masked tokens. That is, the embedding device compares the restoration result output by the machine learning model with the pre-mask value (the correct answer) and trains the model so that the restoration result approaches the correct answer.
- the above machine learning model includes, for example, a token vectorization unit 13d that vectorizes each token, and a restoration unit 13h that restores the value of the masked token from the vector of each token.
- the token vectorization unit 13d includes a user ID vectorization unit 13e, which outputs a vector of the features of the user identified by a user ID, and a subword vectorization unit 13f, which vectorizes information other than user IDs (for example, the contents of chat comments).
- by training the machine learning model that restores the original values of tokens masked in the chat patch, the embedding device learns the user ID vectorization unit 13e so that it outputs a feature vector for each user participating in the chat.
- by inputting a user's user ID into the user ID vectorization unit 13e trained as described above, the embedding device outputs a vector representing the user of the input user ID.
- because the embedding device trains the machine learning model including the user ID vectorization unit 13e on a chat history that includes, for each chat channel, the channel ID of the chat, the name of the channel, the user IDs of the users who spoke on the channel, and the message contents arranged in chronological order, the following information about the user is embedded in the vector output by the user ID vectorization unit 13e after learning.
- the embedding device can output a vector that embeds both the user's individual knowledge characteristics and interaction characteristics.
- the embedding device 10 includes an input/output unit 11 , a storage unit 12 and a control unit 13 .
- the input/output unit 11 is an interface that controls input/output of various information.
- the storage unit 12 stores data referred to when the control unit 13 executes various processes.
- the storage unit 12 includes a chat database that stores chat histories. Note that the chat database may be installed outside the embedding device 10.
- the storage unit 12 also stores model parameters used when the token vectorization unit 13d performs vectorization, model parameters used when the restoration unit 13h restores token values, and the like.
- the control unit 13 controls the embedding device 10 as a whole.
- the control unit 13 includes a patch extraction unit 13a, a token division unit 13b, a mask unit 13c, a token vectorization unit 13d, a restoration unit 13h, a learning unit 13i, a vectorization unit 13j, and an output processing unit 13k. Prepare.
- the patch extraction unit 13a extracts part of the chat history of the chat database as a chat patch.
- a chat patch is, for example, a part of utterances in the same chat channel extracted from the chat history.
- the patch extraction unit 13a receives information from the chat database as input and, as shown in FIG. 4, outputs information arranged in the order in which the utterances were said.
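The patch-extraction step described above can be sketched as follows. This is a minimal illustration assuming a hypothetical chat-history layout of (channel ID, user ID, text) rows in chronological order; the helper name and the sample data are not from the patent.

```python
def extract_chat_patch(chat_history, channel_id, start, length):
    """Extract a chat patch: a contiguous slice of the utterances of one
    chat channel, kept in the order they were said.  `chat_history` is
    assumed to be a list of (channel_id, user_id, text) rows."""
    channel_rows = [r for r in chat_history if r[0] == channel_id]
    return channel_rows[start:start + length]

history = [
    ("#ch01", "@user_a", "hello"),
    ("#ch02", "@user_b", "deploy done"),
    ("#ch01", "@user_b", "hi there"),
    ("#ch01", "@user_a", "any bugs?"),
]
print(extract_chat_patch(history, "#ch01", 0, 2))
```

A window over one channel's rows corresponds to "a part of the utterances in the same chat channel" mentioned above.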
- the token division unit 13b divides the components of the chat patch extracted by the patch extraction unit 13a into tokens of minimum units. For example, the token division unit 13b extracts the patch's channel ID and user IDs using regular expressions and tokenizes them, and tokenizes the other sentences of the patch using SentencePiece. The token division unit 13b divides the patch into tokens using, for example, the technique described in Reference 1 below.
- for example, when the token division unit 13b receives a chat patch shown in the "input example" of FIG. 5, it divides the patch into tokens and inserts a separator [sep] after the channel ID, after the channel name, after each user ID, after each utterance, and so on. The token division unit 13b then outputs the tokens, including the separators [sep], arranged in a line, as shown in the "output example" of FIG. 5.
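The division step can be sketched roughly as follows. The ID patterns (`#...` for channel IDs, `@...` for user IDs) are hypothetical conventions for illustration, and a plain whitespace split stands in for SentencePiece subword tokenization.

```python
import re

def split_patch_into_tokens(patch_lines):
    """Split a chat patch into minimal-unit tokens, inserting a [sep]
    separator after the channel ID, the channel name, each user ID,
    and each utterance, as the token division unit 13b does."""
    tokens = []
    for line in patch_lines:
        # Channel IDs and user IDs are extracted with regular expressions
        # and kept as single tokens; these patterns are invented examples.
        if re.fullmatch(r"#\w+", line) or re.fullmatch(r"@\w+", line):
            tokens.append(line)
        else:
            # The patent tokenizes other sentences with SentencePiece;
            # a whitespace split stands in for it here.
            tokens.extend(line.split())
        tokens.append("[sep]")
    return tokens

patch = ["#channel01", "dev-chat", "@user_a", "hello everyone"]
print(split_patch_into_tokens(patch))
```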
- the masking unit 13c masks some tokens of the chat patch. For example, the masking unit 13c replaces the token of the user ID among the tokens of the chat patch with a special token indicating that it has been masked. Note that the masking unit 13c may mask randomly selected tokens among the tokens of the chat patch.
- for example, when the masking unit 13c receives the token group shown in the "input example" of FIG. 6, it replaces some of the tokens with the special token [mask] and outputs the result.
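A minimal sketch of the masking step, assuming user-ID tokens start with "@" (a hypothetical convention) and keeping the pre-mask values as the correct answers for later training:

```python
def mask_user_ids(tokens, mask_token="[mask]"):
    """Replace every user-ID token with the special [mask] token and
    record the original values as correct data for training."""
    masked, answers = [], []
    for tok in tokens:
        if tok.startswith("@"):       # hypothetical user-ID convention
            masked.append(mask_token)
            answers.append(tok)
        else:
            masked.append(tok)
    return masked, answers

tokens = ["#channel01", "[sep]", "@user_a", "[sep]", "hello", "[sep]"]
masked, answers = mask_user_ids(tokens)
print(masked)
print(answers)
```

Masking randomly selected tokens instead, as the note above allows, would only change the selection condition.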
- the token vectorization unit 13d vectorizes each input token.
- the token vectorization unit 13d vectorizes tokens using an embedding layer, which is a basic neural network structure.
- the token vectorization unit 13d vectorizes tokens using, for example, the technique described in Reference 2 below.
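The embedding-layer lookup can be illustrated as below; the vocabulary, dimension, and random initialization are arbitrary toy values, not taken from the patent.

```python
import numpy as np

# A minimal embedding layer: one learnable row vector per token ID.
rng = np.random.default_rng(0)
vocab = ["[mask]", "[sep]", "@user_a", "@user_b", "hello"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
dim = 8
embedding = rng.normal(size=(len(vocab), dim))   # the learnable table

def vectorize(tokens):
    """Look up the vector of each token, as the token vectorization
    unit 13d does with its embedding layer."""
    ids = np.array([token_to_id[t] for t in tokens])
    return embedding[ids]                        # shape (len(tokens), dim)

vectors = vectorize(["@user_a", "[sep]", "hello"])
print(vectors.shape)
```

During learning, the rows of this table are the parameters that get adjusted, which is what makes the user-ID rows end up encoding user features.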
- the token vectorization unit 13d includes, for example, a user ID vectorization unit (user ID vectorization unit) 13e and a subword vectorization unit (subword vectorization unit) 13f.
- the channel ID vectorization unit 13g, indicated by a dashed line, may or may not be installed in the token vectorization unit 13d; the case where it is installed is described later.
- the user ID vectorization unit 13e converts the user ID token into a vector. For example, the user ID vectorization unit 13e receives a user ID token as input and converts the user ID token into a vector using a user ID vectorization model that outputs a vector of the user ID token.
- the subword vectorization unit 13f converts subword tokens (e.g., channel name tokens and utterance content tokens) into vectors.
- the subword vectorization unit 13f converts the subword token into a vector using a subword vectorization model that receives a subword token and outputs a vector of the token.
- the token vectorization unit 13d converts each input token into a vector using the appropriate vectorization unit (the user ID vectorization unit 13e or the subword vectorization unit 13f). For example, if the token is a user ID token, the token vectorization unit 13d converts it into a vector with the user ID vectorization unit 13e; otherwise, it converts the token into a vector with the subword vectorization unit 13f.
- for example, the token vectorization unit 13d converts each of the tokens shown in the "input example" into a vector, as shown in the "output example".
- the restoration unit 13h restores the original values of the masked tokens in the patch.
- the restoration unit 13h receives the vector of each token of the patch output from the token vectorization unit 13d, estimates the value of the masked token, and restores the masked token with the estimated value.
- for example, the restoration unit 13h estimates the value of the masked token from the vector of each token shown in the "input example" and outputs the tokens shown in the "output example".
- the restoration unit 13h uses a model (restoration model) that receives the vector of each token as input and outputs an estimate of the masked token's value; it restores the masked token with this estimated value.
- the restoration unit 13h uses a BERT model (see Reference 3), a neural network for natural language processing, to estimate the value of the masked token.
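A rough sketch of the restoration step follows. The patent uses a BERT model here; for illustration this sketch replaces it with a much simpler stand-in that averages the context vectors and scores the result against the (tied) embedding table via a softmax. All sizes and values are invented.

```python
import numpy as np

def restore_masked_token(token_vectors, mask_position, embedding):
    """Estimate the value of the masked token from the vectors of all
    tokens in the patch: average the non-masked vectors, then score
    that context vector against every vocabulary embedding."""
    context = np.delete(token_vectors, mask_position, axis=0).mean(axis=0)
    logits = embedding @ context                 # one score per vocab entry
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(probs.argmax()), probs            # predicted vocab ID

rng = np.random.default_rng(1)
emb = rng.normal(size=(5, 8))                    # toy 5-entry vocabulary
patch_vectors = np.vstack([emb[2], np.zeros(8), emb[4]])  # position 1 masked
pred_id, probs = restore_masked_token(patch_vectors, 1, emb)
print(pred_id, probs.shape)
```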
- the learning unit 13i uses the patch's token group before masking as correct data and trains the token vectorization unit 13d and the restoration unit 13h (the machine learning model) so that they can accurately estimate the original values of the masked tokens.
- the learning unit 13i learns the token vectorization unit 13d and the restoration unit 13h so that the masked token estimated value by the restoration unit 13h approaches the correct value (value before masking) as much as possible.
- for example, the learning unit 13i trains the user ID vectorization model and the subword vectorization model of the token vectorization unit 13d and the restoration model of the restoration unit 13h so that the user ID value estimated by the restoration unit 13h is as close as possible to the correct user ID value.
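Putting the pieces together, the mask-and-restore training can be sketched end to end. This is a toy stand-in, not the patent's implementation: plain SGD with a mean-of-context encoder and a linear restoration head replaces the BERT-based model, and the vocabulary and training data are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["[mask]", "[sep]", "hello", "deploy", "bug", "@user_a", "@user_b"]
tok = {t: i for i, t in enumerate(vocab)}
dim, lr = 8, 0.5

E = rng.normal(scale=0.1, size=(len(vocab), dim))  # token vectorization unit
W = rng.normal(scale=0.1, size=(len(vocab), dim))  # restoration model head

# Toy data: (context token IDs, correct value of the masked user-ID token).
# @user_a tends to say "hello"; @user_b talks about deploys and bugs.
data = [([tok["hello"], tok["[sep]"]], tok["@user_a"]),
        ([tok["bug"], tok["deploy"]], tok["@user_b"])] * 50

def forward(ctx):
    h = E[ctx].mean(axis=0)                # encode the unmasked context
    logits = W @ h                         # one score per vocab entry
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

for ctx, correct in data:                  # one pass of plain SGD
    h, p = forward(ctx)
    d = p.copy()
    d[correct] -= 1.0                      # gradient of cross-entropy loss
    E[ctx] -= lr * (W.T @ d) / len(ctx)    # update token vectorization unit
    W -= lr * np.outer(d, h)               # update restoration model

# After learning, the restoration unit recovers the masked user ID
# for the "hello" context.
_, p = forward([tok["hello"], tok["[sep]"]])
print(vocab[int(p.argmax())])
```

The rows of `E` for user-ID tokens are what the trained user ID vectorization unit 13e would output as user feature vectors.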
- the vectorization unit 13j obtains a vector expressing the user's characteristics by the user ID vectorization unit 13e of the token vectorization unit 13d learned by the learning unit 13i. For example, the vectorization unit 13j selects user IDs to be vectorized from user IDs included in chat histories used for learning. Then, the vectorization unit 13j inputs the selected user ID to the post-learning user ID vectorization unit 13e to obtain information obtained by vectorizing the user ID.
- for example, when the trained user ID vectorization unit 13e receives a user ID (a token indicating a user ID) shown in the "input example" of FIG. 9, it outputs the vector shown in the "output example".
- in this way, the vectorization unit 13j obtains a vector expressing the features of the user identified by a user ID taken from the chat history.
- the output processing unit 13k outputs the information vectorized by the vectorization unit 13j.
- the output processing unit 13k outputs information obtained by vectorizing the user ID by the vectorization unit 13j.
- the patch extraction unit 13a of the embedding device 10 extracts a chat patch from the chat history of the chat database (S1 in FIG. 10: Extract chat patch).
- the token dividing unit 13b divides the components of the chat patch extracted in S1 into tokens of minimum units (S2: divide into tokens). Thereafter, the masking unit 13c masks part of the token of the chat patch (S3: mask part of the token).
- the learning unit 13i learns the token vectorization unit 13d and the restoration unit 13h so that the original values of the tokens masked in S3 can be restored (S4).
- the vectorization unit 13j receives an input of the user ID of the user to be vectorized (S11 in FIG. 11).
- the user ID input here is a user ID selected from user IDs included in the chat history of the chat database.
- the vectorization unit 13j uses the user ID vectorization unit 13e after learning to vectorize the user ID received in S11 (S12). After that, the output processing unit 13k outputs information obtained by vectorizing the user ID (S13).
- the embedding device 10 can output a vector in which both the user's individual knowledge characteristics and interaction characteristics are embedded.
- the token vectorization unit 13d of the embedding device 10 may further include, for example, a channel ID vectorization unit 13g (channel ID vectorizer) shown in FIG.
- the channel ID vectorization unit 13g converts the channel ID token into a vector.
- the channel ID vectorization unit 13g receives a channel ID token and converts the channel ID token into a vector using a channel ID vectorization model that outputs a vector of the channel ID token.
- in this case, the learning unit 13i trains the token vectorization unit 13d including the channel ID vectorization unit 13g. The vectorization unit 13j then inputs the channel ID of the channel to be vectorized into the channel ID vectorization unit 13g of the trained token vectorization unit 13d, thereby obtaining a vector of the features of the chat channel.
- the token masked by the mask unit 13c may be the channel ID token.
- in this case, the following information regarding the chat channel is embedded in the vector output by the channel ID vectorization unit 13g after learning.
- the embedding device 10 can output a vector in which the features of the chat channel are embedded.
- although the embedding device 10 trains the token vectorization unit 13d and the restoration unit 13h using each user's chat history, the invention is not limited to this.
- the embedding device 10 may perform the above learning using each user's comments on SNS or the like.
- each constituent element of each part shown in the figure is functionally conceptual, and does not necessarily need to be physically configured as shown in the figure.
- the specific form of distribution and integration of each device is not limited to the illustrated one; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
- all or any part of each processing function performed by each device can be implemented by a CPU and a program executed by the CPU, or implemented as hardware based on wired logic.
- the embedding device 10 described above can be implemented by installing a program (embedding program) as packaged software or online software on a desired computer.
- an information processing device can be made to function as the embedding device 10 by causing it to execute the above program.
- the information processing apparatus referred to here includes mobile communication terminals such as smart phones, cellular phones, PHS (Personal Handyphone System), and terminals such as PDA (Personal Digital Assistant).
- FIG. 12 is a diagram showing an example of a computer that executes the embedding program.
- the computer 1000 has a memory 1010 and a CPU 1020, for example.
- Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
- the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012 .
- the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
- Hard disk drive interface 1030 is connected to hard disk drive 1090 .
- a disk drive interface 1040 is connected to the disk drive 1100 .
- a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 .
- Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example.
- Video adapter 1060 is connected to display 1130, for example.
- the hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, the program that defines each process executed by the embedding device 10 is implemented as a program module 1093 in which computer-executable code is described.
- Program modules 1093 are stored, for example, on hard disk drive 1090 .
- for example, the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the embedding device 10.
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the data used in the processes of the above-described embodiments are stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary and executes them.
- the program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.
Abstract
An embedding device according to the present invention extracts a chat patch from a chat history accumulated in a chat database. The embedding device then divides the extracted patch into tokens and creates a fill-in-the-blank problem in which some of the tokens are masked. Next, the embedding device trains a machine learning model that predicts (restores) the original values of the masked portion of the fill-in-the-blank problem. Finally, using a user ID vectorization unit (13e) included in the trained machine learning model, the embedding device outputs a vector representing the features of a user included in the chat history.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023579991A JPWO2023152914A1 (fr) | 2022-02-10 | 2022-02-10 | |
PCT/JP2022/005474 WO2023152914A1 (fr) | 2022-02-10 | 2022-02-10 | Dispositif d'incorporation, procédé d'incorporation et programme d'incorporation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/005474 WO2023152914A1 (fr) | 2022-02-10 | 2022-02-10 | Dispositif d'incorporation, procédé d'incorporation et programme d'incorporation |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023152914A1 true WO2023152914A1 (fr) | 2023-08-17 |
Family
ID=87563922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/005474 WO2023152914A1 (fr) | 2022-02-10 | 2022-02-10 | Dispositif d'incorporation, procédé d'incorporation et programme d'incorporation |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2023152914A1 (fr) |
WO (1) | WO2023152914A1 (fr) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020086566A (ja) * | 2018-11-16 | 2020-06-04 | 富士通株式会社 | 知識補完プログラム、知識補完方法および知識補完装置 |
JP2020135457A (ja) * | 2019-02-20 | 2020-08-31 | 日本電信電話株式会社 | 生成装置、学習装置、生成方法及びプログラム |
US20200401661A1 (en) * | 2019-06-19 | 2020-12-24 | Microsoft Technology Licensing, Llc | Session embeddings for summarizing activity |
JP2021197132A (ja) * | 2020-06-12 | 2021-12-27 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | 知識表現学習方法、装置、電子機器、記憶媒体及びコンピュータプログラム |
- 2022-02-10 WO PCT/JP2022/005474 patent/WO2023152914A1 active Application Filing
- 2022-02-10 JP JP2023579991A patent/JPWO2023152914A1 active Pending
Also Published As
Publication number | Publication date |
---|---|
JPWO2023152914A1 (fr) | 2023-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mihalcea | Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling | |
US9317501B2 (en) | Data security system for natural language translation | |
CN111930914B (zh) | 问题生成方法和装置、电子设备以及计算机可读存储介质 | |
León-Paredes et al. | Presumptive detection of cyberbullying on twitter through natural language processing and machine learning in the Spanish language | |
CN110347802B (zh) | 一种文本分析方法及装置 | |
Zhang et al. | ESCOXLM-R: Multilingual taxonomy-driven pre-training for the job market domain | |
CN112667791A (zh) | 潜在事件预测方法、装置、设备及存储介质 | |
Lamsal et al. | CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts | |
JP2018205945A (ja) | 対話応答文書自動作成人工知能装置 | |
Hranický et al. | Distributed pcfg password cracking | |
Barany et al. | Choosing units of analysis in temporal discourse | |
WO2023152914A1 (fr) | Dispositif d'incorporation, procédé d'incorporation et programme d'incorporation | |
CN112989797A (zh) | 模型训练、文本扩展方法,装置,设备以及存储介质 | |
Khan et al. | End-to-end natural language understanding pipeline for bangla conversational agents | |
Prianto et al. | The Covid-19 chatbot application using a natural language processing approach | |
Liu et al. | Autoencoder based API recommendation system for android programming | |
CN115795028A (zh) | 一种公文智能生成方法及系统 | |
CN110929517A (zh) | 地理位置定位方法、系统、计算机设备和存储介质 | |
WO2024161480A1 (fr) | Dispositif de traitement d'informations, procédé de traitement d'informations et programme de traitement d'informations | |
CN114610576A (zh) | 一种日志生成监控方法和装置 | |
Azhan et al. | MeToo: sentiment analysis using neural networks (grand challenge) | |
Agarwal et al. | TrICy: Trigger-guided Data-to-text Generation with Intent aware Attention-Copy | |
US20240078999A1 (en) | Learning method, learning system and learning program | |
Mannekote et al. | Agreement Tracking for Multi-Issue Negotiation Dialogues | |
CN114942980B (zh) | 一种确定文本匹配方法及装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22925935 Country of ref document: EP Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase |
Ref document number: 2023579991 Country of ref document: JP |
NENP | Non-entry into the national phase |
Ref country code: DE |