WO2014030258A1 - Morphological analysis device, text analysis method, and program therefor - Google Patents

Morphological analysis device, text analysis method, and program therefor

Info

Publication number
WO2014030258A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
text
word
learning
analysis
Prior art date
Application number
PCT/JP2012/071485
Other languages
English (en)
Japanese (ja)
Inventor
要 小島
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 filed Critical 株式会社日立製作所
Priority to JP2014531472A priority Critical patent/JPWO2014030258A1/ja
Priority to PCT/JP2012/071485 priority patent/WO2014030258A1/fr
Publication of WO2014030258A1 publication Critical patent/WO2014030258A1/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/268 - Morphological analysis

Definitions

  • the present invention relates to a morphological analyzer.
  • TF-IDF, which indicates the frequency of appearance of words in a document, is widely used for extracting the similarity between the contents of different documents and the topics within a document.
  • unsupervised word segmentation methods are roughly divided into two: an MDL method, which obtains a word segmentation that improves the compression ratio of a document based on a data compression criterion (MDL, minimum description length), and a probabilistic model method, which divides text into words so that the probability of a model built from word occurrence probabilities obtained from character sequences and word-linking probabilities increases.
  • the former MDL method achieves word division accuracy comparable to the latter probabilistic model method and can be processed at high speed, but it has the problem that its word division accuracy is difficult to improve by supplying correct data in which words have been divided manually.
  • in the probabilistic model method, the connections between words obtained from the word sequence are considered, but part-of-speech information is not. For this reason, there is a problem that text may be divided into words in a way that makes the connection between parts of speech inappropriate when parts of speech are taken into account.
  • a typical example of the invention disclosed in the present application is as follows. That is, a morphological analysis device for determining the parts of speech of words included in input text data comprises at least one processor for executing a program, a memory in which the program is stored, and an input device to which the text data is input. The input device receives input of a learning text and an analysis text; the device comprises a morphological analyzer learning unit that analyzes the learning text and a morphological analysis unit that analyzes the analysis text, divides the analysis text into words, and assigns parts of speech to the divided words. The morphological analyzer learning unit obtains, from the part of speech of a word included in the learning text, the occurrence probability of the next word and the occurrence probability of the part of speech of the next word, and constructs a part-of-speech model incorporating the obtained occurrence probabilities.
  • the morphological analysis unit divides the analysis text into words by referring to the constructed part-of-speech model, and determines the parts of speech of the divided words.
  • by using part-of-speech information in this way, it is possible to divide text into words accurately and perform high-precision morphological analysis.
  • the morphological analyzer divides a text document into words and determines the part of speech of each word.
  • FIG. 1 is a block diagram showing a configuration example of a morphological analyzer 400 according to the first embodiment of the present invention.
  • the morphological analyzer 400 is a computer having a CPU (Central Processing Unit) 401, a main storage device (memory) 402, an auxiliary storage device 403, and a user interface unit 407.
  • the morphological analyzer 400 may be constructed physically on one computer, or may be constructed on a logical partition configured on one or a plurality of computers.
  • the morphological analyzer 400 is connected to an external network via a network 406 such as a LAN (Local Area Network).
  • the CPU 401 is a central processing unit that executes a program stored in the main storage device 402.
  • the morpheme analyzer learning unit 408, the morpheme analysis unit 409, the word / part of speech deletion unit 410, the word / part of speech addition unit 411, the word / part of speech sampling unit 412 and the parameter sampling unit 413 are executed by the CPU 401.
  • the main storage device 402 is a storage device such as a RAM (Random Access Memory) that stores a program executed by the CPU 401 and data (part of speech model 414, etc.) used when the program is executed.
  • the main storage device 402 temporarily stores the text data for learning 423 as necessary.
  • the auxiliary storage device 403 and the external storage device 404 are storage devices or storage media such as a magnetic disk drive and flash memory for storing text data and the program.
  • the auxiliary storage device 403 stores the part-of-speech model 414, the initial state probability model 415, the transition probability model 416, the word output probability model 417, the character N-gram model 418, the initial state count C0 (419), the transition state count C (420), the hyperparameter A (421), and the word / part of speech list 422.
  • the removable medium 405 is a non-volatile recording medium such as a CD-ROM or a DVD on which text data is recorded, and data is read by a predetermined reading device (such as an optical disk drive or a USB interface). Data recorded in the auxiliary storage device 403, the external storage device 404, and the removable medium 405 is read out as necessary and stored in the main storage device 402 when the morphological analyzer 400 is activated.
  • the program executed by the CPU 401 is provided to the computer via the removable medium 405 or the network, and is stored in the auxiliary storage device 403 that is a non-temporary storage medium. That is, the program executed by the CPU 401 is read from the auxiliary storage device 403, loaded into the main storage device 402, and executed by the CPU 401.
  • the user interface unit 407 is an input / output device (for example, a keyboard, a mouse, a display) that provides a user interface.
  • the CPU 401 acquires text data as needed from the main storage device 402, the auxiliary storage device 403, the removable medium 405, or the external storage device 404 via the network 406. Thereafter, the CPU 401 activates the morphological analyzer learning unit 408 and learns the part of speech model 414 based on the acquired text data.
  • FIG. 1 shows an example in which text data is stored in a device on the main storage device 402, auxiliary storage device 403, removable media 405, and network 406.
  • the text data may be stored in a device that the CPU 401 can read and write.
  • the CPU 401 operates as a functional unit that realizes a predetermined function by executing the program implementing each unit.
  • the CPU 401 functions as the morphological analyzer learning unit 408 by executing a morphological analyzer learning program.
  • similarly, the CPU 401 functions as the word / part of speech deletion unit 410 by executing the word / part of speech deletion program, as the word / part of speech addition unit 411 by executing the word / part of speech addition program, as the word / part of speech sampling unit 412 by executing the word / part of speech sampling program, as the parameter sampling unit 413 by executing the parameter sampling program, and as the morphological analysis unit 409 by executing the morphological analysis program.
  • data such as the programs and tables for realizing each function of the morphological analyzer learning unit 408 and the other units can be stored in a storage device such as the auxiliary storage device 403, the removable medium 405, a nonvolatile semiconductor memory, a magnetic disk drive, or an SSD (Solid State Drive), or in a non-transitory data storage medium readable by a computer, such as an IC card, an SD card, or a DVD.
  • the character N-gram is a model that gives the probability that a character will occur following an (N−1)-character string. For example, for the word “watashi”, the character 3-gram probability P(shi | wata) is the probability that the character “shi” follows the two-character string “wata”.
  • the probability based on the character N-gram is calculated from the appearance frequency of sequences of characters in a document. For example, if the character string “wata” appears x times in the document and the character “shi” comes after “wata” y times, the 3-gram probability is P(shi | wata) = y / x.
  • the character N-gram is used to predict a character that appears after a certain character string.
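  • as an illustration of this frequency-based calculation, the following sketch (hypothetical names, not code from the patent) estimates a character N-gram probability by counting, i.e. y / x in the example above:

```python
def char_ngram_probability(corpus, context, char):
    """Estimate P(char | context) from raw counts: joint_count / context_count,
    i.e. y / x in the running example. Returns 0.0 for an unseen context."""
    n = len(context)
    context_count = 0  # x: how often the context appears followed by any character
    joint_count = 0    # y: how often the context is immediately followed by char
    for sentence in corpus:
        for i in range(len(sentence) - n):
            if sentence[i:i + n] == context:
                context_count += 1
                if sentence[i + n] == char:
                    joint_count += 1
    return joint_count / context_count if context_count else 0.0

# toy usage: the 3-gram probability P(shi | wata) from a one-sentence corpus
print(char_ngram_probability(["わたしはわたしです"], "わた", "し"))  # -> 1.0
```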
  • the word N-gram is a model that gives a probability that a word following a word string will occur.
  • N-gram probability smoothing: the character N-gram probability can be calculated from the appearance frequency in the text, but since the amount of text is finite, character sequences that are not unnatural as language may nevertheless not exist in the learning data, and their N-gram probability would be zero. For this reason, smoothing of the N-gram probability is performed by taking a weighted average of the original N-gram probability and the (N−1)-gram probability.
  • the (N−1)-gram probability is itself smoothed by a weighted average with the (N−2)-gram probability, so the smoothing is applied hierarchically.
  • a method using a Bayesian statistical model based on the Pitman-Yor process described in Non-Patent Document 2 can be adopted as a method of taking a weighted average.
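  • as a concrete illustration of this weighted averaging, the sketch below uses a fixed interpolation weight lam in place of the data-dependent weights that the Pitman-Yor process of Non-Patent Document 2 would estimate (the function name, the counts layout, and the uniform base distribution are assumptions for the example):

```python
def smoothed_prob(counts, history, char, lam=0.7, vocab_size=8000):
    """P_N(char | history) = lam * MLE + (1 - lam) * P_{N-1}(char | shorter history).

    The recursion drops the oldest history character at each level and bottoms
    out in a uniform distribution over an assumed character vocabulary."""
    if not history:
        return 1.0 / vocab_size
    successors = counts.get(history, {})
    total = sum(successors.values())
    mle = successors.get(char, 0) / total if total else 0.0
    return lam * mle + (1 - lam) * smoothed_prob(counts, history[1:], char, lam, vocab_size)

# counts maps a history string to a {next character: frequency} table
counts = {"わた": {"し": 2}, "た": {"し": 2, "ば": 1}}
print(smoothed_prob(counts, "わた", "し"))
```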
  • the character N-gram model 418 is a model that gives a probability that c appears after s when a character string s and a character c are given.
  • the character N-gram model is learned by adding and deleting words.
  • the character N-gram model 418 can be constructed by the methods described in Patent Document 2 and Non-Patent Document 2. From the character N-gram model 418, the probability P(d[1], ..., d[K], $) of a word consisting of the character string d[1], ..., d[K], called the word 0-gram probability, can be calculated from character features. Note that $ is a special character representing the end of a word. Specifically, P(d[1], ..., d[K], $) is the product P(d[1]) × P(d[2] | d[1]) × ... × P(d[K] | d[1], ..., d[K−1]) × P($ | d[1], ..., d[K]). In practice, the length of the N-gram is limited to, for example, 4-gram, and each character N-gram probability P(d[k] | ...) is therefore conditioned on at most the three preceding characters.
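  • a sketch of this word 0-gram calculation, reusing smoothed_prob from the previous sketch (the end-of-word symbol and the truncation to a 4-gram history follow the description above; the rest is an assumption for illustration):

```python
EOW = "$"  # special character representing the end of a word

def word_zerogram_prob(counts, word, order=4):
    """P(d[1], ..., d[K], $) as a product of character N-gram probabilities,
    each conditioned on at most (order - 1) preceding characters."""
    prob = 1.0
    for k, ch in enumerate(list(word) + [EOW]):
        history = word[max(0, k - (order - 1)):k]  # truncated character history
        prob *= smoothed_prob(counts, history, ch)
    return prob
```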
  • the part of speech model 414 is a probability model based on the hidden Markov model.
  • the hidden Markov model is a probabilistic model in which each state of observed series data is output from a hidden state that is not actually observed, and state transitions between hidden states are taken into account.
  • the hidden state corresponds to the part of speech
  • each state of the observed series data corresponds to the word.
  • FIG. 2 is a diagram for explaining an example of the part of speech model 414.
  • the part-of-speech model 414 shown in FIG. 2 consists of words such as “matte (wait)”, “shi”, “te”, “ori”, “masu”, and “.”, together with the part of speech corresponding to each word. State transitions are considered between the unobserved parts of speech, apart from a special part of speech called “end of sentence” representing the end of the sentence, and each word is modeled as being output from its part of speech.
  • labels such as “noun”, “verb”, and “end of sentence” are shown as parts of speech, but in reality, when the number of parts of speech given in advance is M, the parts of speech are represented by M + 1 numbers: 0, ..., M−1, plus M representing the end of the sentence.
  • the initial state probability P(m[1]), the transition probabilities P(m[t] | m[t−1]), and the output probabilities P(w[t] | m[t]) of the word w[t] from the part of speech m[t] each constitute a probability distribution. Therefore, the probability P(w[1], ..., w[T], m[1], ..., m[T]) of the entire model is P(m[1]) × P(w[1] | m[1]) × P(m[2] | m[1]) × P(w[2] | m[2]) × ... × P(m[T] | m[T−1]) × P(w[T] | m[T]). The methods of calculating the initial state probability, the transition probability, and the output probability of a word from a part of speech are described later.
  • these probabilities are given by the initial state probability model 415, the transition probability model 416, and the word output probability model 417, respectively.
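  • written as code, the joint probability above is a simple accumulation of terms; in the minimal sketch below the probability tables stand in for the models 415 to 417, and all names are hypothetical:

```python
import math

def model_log_probability(words, tags, init_prob, trans_prob, output_prob):
    """log P(w[1..T], m[1..T]) for the hidden Markov part-of-speech model:
    one initial-state term, then alternating transition and word-output terms."""
    logp = math.log(init_prob[tags[0]]) + math.log(output_prob(words[0], tags[0]))
    for t in range(1, len(words)):
        logp += math.log(trans_prob[tags[t - 1]][tags[t]])  # P(m[t] | m[t-1])
        logp += math.log(output_prob(words[t], tags[t]))    # P(w[t] | m[t])
    return logp
```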
  • the initial state probability model 415 can be defined by a multinomial distribution taking values from 0 to M.
  • the transition probability model 416 from p to the next part of speech for each part of speech p can be defined by a multinomial distribution taking values from 0 to M.
  • for these multinomial distributions, the parameter A (421) of a Dirichlet distribution, which is the conjugate prior distribution of the multinomial distribution, is given as a hyperparameter.
  • <Word output probability model 417> The output probability P(w | m) of the word w from the part of speech m is obtained by smoothing with the word 1-gram probability P(w), and the word 1-gram probability is obtained by smoothing the word 0-gram probability obtained from the character N-gram model 418.
  • for this smoothing, the Pitman-Yor process described in Non-Patent Document 1 can be used in the present invention. Within the sentences, the event that the word w belongs to the part of speech m may occur a plurality of times. A part of these events is used to calculate the word 1-gram probability P(w) according to the stochastic process called the Pitman-Yor process, and a part of the data is further registered, in the form of the character strings constituting the word w, as data for calculating the character N-gram model.
  • each word is dynamically assigned to the part of speech.
  • registration and deletion of words in the part of speech are performed hierarchically as described above according to the Pitman-Yor process.
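  • a heavily simplified sketch of this hierarchical register/delete bookkeeping (the discount and strength parameters of the actual Pitman-Yor process are omitted, and the class and callback names are assumptions):

```python
class HierarchicalWordTable:
    """Tracks how many times each word is registered; a word whose count drops
    to zero is also removed from the base (character N-gram) level, mirroring
    the hierarchical registration and deletion described above."""

    def __init__(self, base_add, base_remove):
        self.counts = {}
        self.base_add = base_add        # called when a word is first registered
        self.base_remove = base_remove  # called when its last registration is removed

    def add(self, word):
        if self.counts.get(word, 0) == 0:
            self.base_add(word)  # register the word's character string one level down
        self.counts[word] = self.counts.get(word, 0) + 1

    def delete(self, word):
        self.counts[word] -= 1
        if self.counts[word] == 0:
            del self.counts[word]
            self.base_remove(word)  # propagate the deletion one level down
```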
  • <Morphological analyzer learning unit 408> FIGS. 3A and 3B are flowcharts of the processing executed by the morphological analyzer learning unit 408 according to the first embodiment.
  • first, an input of a sentence list S and a repetition number N is accepted (step S301). Thereafter, the elements of the initial state count C0 (419), which is a one-dimensional array, and of the transition state count C (420), which is a two-dimensional array, are initialized to 0 (step S302), and the variable i is initialized to 1 (step S303).
  • thereafter, the following steps S305 to S316 are repeated until the variable i exceeds N (step S304).
  • the elements of the sentence list S are copied to the empty list T (step S305), and the elements of the list T are rearranged randomly (step S306).
  • the following steps S309 to S316 are repeated until the list T becomes empty; when the list T becomes empty, the variable i is incremented by 1 and the process returns to step S304 (steps S307 and S308).
  • the top element of the list T is deleted, and this is used as a sentence s (step S309).
  • if the word string and the part of speech string corresponding to the sentence s are in the word / part of speech list 422, the word string and the part of speech string are deleted from the word / part of speech list 422 (steps S310 and S311).
  • the word / part of speech deletion unit 410 deletes the word string and the part of speech string of the sentence s from the part of speech model 414 (step S312).
  • the word / part of speech sampling unit 412 samples a word string and a part of speech string from the sentence s (step S313).
  • the sampled word string and part of speech string are added to the word / part of speech list 422 as the word string and part of speech string corresponding to the sentence s (step S314).
  • the word / part of speech adding unit 411 adds the sampled word sequence and part of speech sequence to the part of speech model 414 (step S315)
  • the parameter sampling unit 413 samples the parameters of the part of speech model 414, and the process returns to step S307 (step S316).
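  • read as a whole, steps S301 to S316 form a Gibbs-sampling loop over sentences; the compressed sketch below restates the flowchart only, with hypothetical interfaces for the model and the word / part of speech list:

```python
import random

def learn_pos_model(sentences, n_iterations, model, word_pos_list):
    """One reading of FIGS. 3A and 3B: for each sentence, remove its current
    analysis, resample a segmentation and tagging, add it back, and resample
    the model parameters."""
    for _ in range(n_iterations):                       # steps S303, S304
        pending = list(sentences)                       # step S305
        random.shuffle(pending)                         # step S306
        while pending:                                  # steps S307, S308
            s = pending.pop(0)                          # step S309
            if s in word_pos_list:                      # steps S310, S311
                words, tags = word_pos_list.pop(s)
                model.delete(words, tags)               # step S312 (unit 410)
            words, tags = model.sample_segmentation(s)  # step S313 (unit 412)
            word_pos_list[s] = (words, tags)            # step S314
            model.add(words, tags)                      # step S315 (unit 411)
            model.sample_parameters()                   # step S316 (unit 413)
```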
  • the word / part of speech deletion unit 410 accepts as input the word string w[1], ..., w[T] and the part of speech string m[1], ..., m[T] of a sentence.
  • when the value of the part of speech m[1] is p, the word / part of speech deletion unit 410 decrements the p-th element of the initial state count C0 (419), which is a one-dimensional integer array, by one.
  • when the values of the parts of speech m[t] and m[t−1] are p and q, respectively, the word / part of speech deletion unit 410 decrements the element in the q-th row and p-th column of the transition state count C (420), which is a two-dimensional integer array, by one.
  • the word / part of speech deletion unit 410 deletes each word w[t] from the word output probability model 417 corresponding to the part of speech m[t] according to the Pitman-Yor process, for example by the method described in Non-Patent Document 2.
  • the word / part of speech addition unit 411 accepts as input the word string w[1], ..., w[T] and the part of speech string m[1], ..., m[T] of a sentence.
  • when the value of the part of speech m[1] is p, the word / part of speech addition unit 411 increments the p-th element of the initial state count C0 (419), which is a one-dimensional integer array, by one.
  • when the values of the parts of speech m[t] and m[t−1] are p and q, respectively, the word / part of speech addition unit 411 increments the element in the q-th row and p-th column of the transition state count C (420), which is a two-dimensional integer array, by one.
  • the word / part of speech addition unit 411 adds each word w[t] to the word output probability model 417 corresponding to the part of speech m[t] according to the Pitman-Yor process, for example by the method described in Non-Patent Document 2.
  • the parameter sampling unit 413 gives the parameters of the multinomial distribution of the initial state probability model 415 by sampling from the Dirichlet distribution whose parameters are the sums of the initial state counts C0[p] (419) and the hyperparameters A[p] (421).
  • similarly, the parameter sampling unit 413 gives the parameters of the transition probability model 416 corresponding to each part of speech p by sampling from the Dirichlet distribution whose parameters are the sums of the transition state counts C[p][q] (420) and the hyperparameters A[q] (421).
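  • because the Dirichlet distribution is conjugate to the multinomial, this step is a standard posterior draw; a sketch using NumPy (the array shapes are assumptions; the counts-plus-hyperparameters form follows the description above):

```python
import numpy as np

def sample_model_parameters(C0, C, A, rng=np.random.default_rng()):
    """Draw new multinomial parameters: initial-state probabilities from
    Dirichlet(C0 + A), and one transition row per part of speech p from
    Dirichlet(C[p] + A)."""
    initial_probs = rng.dirichlet(C0 + A)  # initial state probability model 415
    transition_probs = np.stack(
        [rng.dirichlet(C[p] + A) for p in range(C.shape[0])])  # transition model 416
    return initial_probs, transition_probs
```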
  • the word / part-of-speech sampling unit 412 uses the forward-filtering backward-sampling method described in Non-Patent Document 3 to sample the word division of each sentence and the parts of speech of the divided words according to the probability structure of the part-of-speech model 414.
  • FIGS. 4A and 4B are flowcharts of the processing executed by the word / part-of-speech sampling unit 412 according to the first embodiment.
  • first, the input of the sentence s, the number of parts of speech M, and the maximum word length L is accepted (step S401). Further, the number of characters of s is set to N, the word length list WL and the part-of-speech list PL are initialized to be empty (step S402), and the variable i is initialized to 1 (step S403).
  • steps S405 to S412 are repeated until the variable i becomes larger than N + 1 (step S404).
  • it is determined whether N − i + 1 is greater than L: if so, L is set to the variable K; otherwise, N − i + 1 is set to the variable K (steps S405, S406, S407).
  • likewise, it is determined whether i − 1 is greater than L: if so, L is set to the variable J; otherwise, i − 1 is set to the variable J (steps S408, S409, S410).
  • the character string c[t], ..., c[t+k−1] is set to the word w, and P(w | m) × G[m] is set to E[i+k][k][m] (step S415).
  • variable i is incremented by 1, and the process returns to step S404 (step S416).
  • N + 2 is set to the variable i, 1 is set to the variable k, and M + 1, which represents the end of the sentence, is set to the part of speech m (step S417). Thereafter, the processes in steps S419 to S424 are repeated while k is i or less (step S418).
  • k is subtracted from i (step S419), and i is compared with N + 1 (step S420). When i is larger than N + 1, L is set to the variable J, and when i is N + 1 or less, N − 1 is set to the variable J (steps S420, S421, and S422).
  • while the variable j is changed from 1 to J and the part of speech n is changed from 1 to M, j and n are sampled with probability proportional to P(n | m) × E[i][j][n], and the sampled values are set to k and m, respectively (step S423).
  • k is added to the head of the word length list WL, m is added to the head of the part-of-speech list PL, and the process returns to step S418 (step S424).
  • in the word length list WL obtained by the processing executed by the word / part-of-speech sampling unit 412, the lengths of the sampled words are held in order from the top, so the words can be obtained by matching these lengths against the character string of the sentence s.
  • the part of speech corresponding to the obtained word is stored in order from the top in the part of speech list PL.
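  • the forward-filtering backward-sampling idea is easiest to see for a fixed word sequence; the sketch below samples only a part-of-speech string for given words, whereas the flowchart above additionally sums over candidate word boundaries up to the maximum word length L (all names are hypothetical):

```python
import numpy as np

def ffbs_tags(words, init_probs, trans_probs, output_prob, rng=np.random.default_rng()):
    """Forward filtering: alpha[t, m] proportional to P(w[1..t], m[t] = m).
    Backward sampling: draw m[T] from alpha[T], then each earlier m[t] in
    proportion to alpha[t, m] * P(m[t+1] | m)."""
    init_probs, trans_probs = np.asarray(init_probs), np.asarray(trans_probs)
    T, M = len(words), len(init_probs)
    alpha = np.zeros((T, M))
    alpha[0] = init_probs * np.array([output_prob(words[0], m) for m in range(M)])
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        out = np.array([output_prob(words[t], m) for m in range(M)])
        alpha[t] = (alpha[t - 1] @ trans_probs) * out  # filter forward
        alpha[t] /= alpha[t].sum()
    tags = [int(rng.choice(M, p=alpha[T - 1]))]
    for t in range(T - 2, -1, -1):                     # sample backward
        weights = alpha[t] * trans_probs[:, tags[0]]
        tags.insert(0, int(rng.choice(M, p=weights / weights.sum())))
    return tags
```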
  • the morpheme analysis unit 409 uses the Viterbi algorithm to acquire, for each sentence, the word division that maximizes the probability under the probability structure of the part-of-speech model 414 and the part of speech corresponding to each word.
  • FIGS. 5A and 5B are flowcharts of the processing executed by the morpheme analysis unit 409 according to the first embodiment.
  • first, the input of the sentence s, the number of parts of speech M, and the maximum word length L is accepted (step S501). Further, the number of characters of s is set to N, the word length list WL and the part-of-speech list PL are initialized to be empty (step S502), and the variable i is initialized to 1 (step S503).
  • steps S505 to S512 are repeated until the variable i becomes larger than N + 1 (step S504).
  • it is determined whether N − i + 1 is greater than L: if so, L is set to the variable K; otherwise, N − i + 1 is set to the variable K (steps S505, S506, S507).
  • likewise, it is determined whether i − 1 is greater than L: if so, L is set to the variable J; otherwise, i − 1 is set to the variable J (steps S508, S509, S510).
  • the part of speech n that maximizes P(m | n) × E[i][j][n] is set to Y[m], and the maximum value is set to G[m] (step S514).
  • the character string c[t], ..., c[t+k−1] is set to the word w, and P(w | m) × G[m] is set to E[i+k][k][m] (step S515).
  • variable i is incremented by 1, and the process returns to step S504 (step S516).
  • N + 2 is set to the variable i, 1 is set to the variable k, and M + 1, which represents the end of the sentence, is set to the part of speech m (step S517). Thereafter, the processes in steps S519 to S524 are repeated while k is i or less (step S518).
  • k is subtracted from i (step S519), and i is compared with N + 1 (step S520). When i is larger than N + 1, L is set to the variable J, and when i is N + 1 or less, N − 1 is set to the variable J (steps S520, S521, and S522).
  • while the variable j is changed from 1 to J and the part of speech n is changed from 1 to M, the j and n that maximize P(n | m) × E[i][j][n] are set to k and m, respectively (step S523). Then k is added to the head of the word length list WL, m is added to the head of the part-of-speech list PL, and the process returns to step S518 (step S524).
  • in the word length list WL obtained by the process executed by the morphological analysis unit 409, the lengths of the obtained words are held in order from the top, so the words can be acquired by matching these lengths against the character string of the sentence s.
  • the part of speech corresponding to the obtained word is stored in order from the top in the part of speech list PL.
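  • for comparison, a Viterbi sketch over a fixed word sequence (the max and argmax below play the roles of G[m] and Y[m] in the flowchart above; the search over word boundaries is again omitted, and all names are hypothetical):

```python
import numpy as np

def viterbi_tags(words, init_probs, trans_probs, output_prob):
    """Most probable part-of-speech string for the given words under the HMM."""
    init_probs, trans_probs = np.asarray(init_probs), np.asarray(trans_probs)
    T, M = len(words), len(init_probs)
    delta = np.zeros((T, M))            # best log-score ending in tag m at step t
    back = np.zeros((T, M), dtype=int)  # argmax over the previous tag
    delta[0] = np.log(init_probs) + np.log([output_prob(words[0], m) for m in range(M)])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans_probs)  # (previous, current)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log([output_prob(words[t], m) for m in range(M)])
    tags = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):       # follow the back-pointers
        tags.insert(0, int(back[t][tags[0]]))
    return tags
```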
  • FIG. 6 is a sequence diagram illustrating the flow of the learning process of the morphological analyzer 400 according to the first embodiment.
  • the CPU 401 waits for input of text data for learning. When the learning text data is input (step S602), the CPU 401 executes the learning process by the morphological analyzer learning unit 408 (step S603).
  • the learning text data to be input is ordinary text data that has not been divided into separated words.
  • the morphological analyzer learning unit 408 samples words and parts of speech from each sentence of the text data for learning using the part of speech model 414, and repeatedly learns the part of speech model 414 from the obtained words and parts of speech.
  • when the learning process has been repeated for the number of iterations given in advance, the CPU 401 outputs the part of speech model 414 to the auxiliary storage device 403 (step S604).
  • FIG. 7 is a sequence diagram showing a flow of processing in which the morphological analysis apparatus 400 according to the first embodiment divides words by morphological analysis and assigns parts of speech.
  • the CPU 401 reads the text sentences for word division and part-of-speech assignment from the auxiliary storage device 403 and stores them in the main storage device 402 before the sequence shown in FIG. 7 is executed.
  • the CPU 401 reads the part-of-speech model 414 from the auxiliary storage device 403 and stores it in the main storage device 402 (step S701). The user then inputs a text sentence to the morphological analyzer 400 through the user interface unit 407 (step S702). Thereafter, the morphological analysis unit 409 divides each sentence of the text into words and gives parts of speech to the obtained words (step S703). Finally, the CPU 401 outputs the obtained result (step S704).
  • as described above, the part of speech is estimated for each word obtained by dividing a sentence, and the connections between the parts of speech of the words are included in the probability model.
  • therefore, even without preparing a dictionary, text data in a language whose words are not separated can be divided into words with high accuracy, and word divisions that make the connection between parts of speech inappropriate can be avoided.
  • in addition, the part of speech of each divided word can be determined with high accuracy. That is, the word segmentation and the parts of speech of the segmented words can be obtained regardless of the language and without being hindered by unknown words.
  • in the first embodiment, the output probability of a word from a part of speech is hierarchically smoothed from the word output probability and the word output probability obtained from the character N-gram model. For this reason, the word output probability from the part of speech is affected by the word output probability obtained from the character N-gram model.
  • in the second embodiment, the word output probability model 417 of the first embodiment is changed. Specifically, first, when k is a word length, the word length probability P(k) can be calculated from the word length distribution registered in the character N-gram model 418, and the probability P(k | m) of the word length k from the part of speech m can likewise be considered. The output probability P(w | m) of the word w from the part of speech m is then a distribution obtained by smoothing the 1-gram probability of the word w and the word 0-gram probability specific to the part of speech m, using the graphical Pitman-Yor process described in Non-Patent Document 4.
  • the 1-gram probability of the word w is obtained by smoothing the word 0-gram probability directly obtained from the character N-gram model 418.
  • the specific difference from the first embodiment is the calculation of the output probability P(w | m).
  • high-precision morphological analysis can be performed by using the property that the word length differs depending on the part of speech.
  • by multiplying the word occurrence probability from the part of speech by a value larger than 1, the word occurrence probability is increased, the tendency to reduce the number of word divisions is alleviated, and the problem of an excessive reduction in the number of divisions can be avoided.
  • as the value by which the word occurrence probability is multiplied, a value that maximizes the probability of the entire probability model under the final part-of-speech model may be selected.
  • the morphological analyzer learning unit 408 performs, for each sentence of the learning text in turn, deletion of the words and parts of speech from the part of speech model, sampling of words and parts of speech from the sentence, and addition of the sampled words and parts of speech to the part of speech model.
  • because of this structure, each sentence can be processed independently by using a plurality of CPUs, thereby achieving highly efficient parallelization and speeding up the process.
  • a new part-of-speech model is acquired from the analysis text using the part-of-speech model already obtained from the learning text.
  • the word / part-of-speech sampling unit 412 performs the calculation that is performed with a single part-of-speech model in the first embodiment using a mixture distribution of the part-of-speech model already obtained and the newly acquired part-of-speech model. Then, the obtained word strings and part-of-speech strings are registered in and deleted from the newly acquired part-of-speech model according to the probability with which the new part-of-speech model contributed.
  • when performing morphological analysis after learning, the morpheme analysis unit 409 likewise performs the calculation using a mixture distribution of the already obtained part-of-speech model and the newly obtained part-of-speech model instead of a single part-of-speech model, and thereby obtains the word strings and part-of-speech strings.
  • in this way, highly accurate morphological analysis can be performed because the newly acquired part-of-speech model complements the parts where the already obtained part-of-speech model does not match the newly input analysis text.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A morphological analysis device comprising: a processor; a memory; an input device; a morphological analyzer learning unit that analyzes a text used for learning; and a morphological analysis unit that analyzes a text used for analysis, divides the analysis text into words, and assigns parts of speech to the divided words. The morphological analyzer learning unit obtains, from the part of speech of a word included in the learning text, the occurrence probability of the next word and, from the part of speech of a word included in the learning text, the occurrence probability of the part of speech of the next word, and constructs a part-of-speech model incorporating the obtained occurrence probabilities. The morphological analysis unit refers to the constructed part-of-speech model, divides the analysis text into words, and determines the parts of speech of the divided words.
PCT/JP2012/071485 2012-08-24 2012-08-24 Morphological analysis device, text analysis method, and program therefor WO2014030258A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2014531472A JPWO2014030258A1 (ja) 2012-08-24 2012-08-24 Morphological analysis device, text analysis method, and program therefor
PCT/JP2012/071485 WO2014030258A1 (fr) 2012-08-24 2012-08-24 Morphological analysis device, text analysis method, and program therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2012/071485 WO2014030258A1 (fr) 2012-08-24 2012-08-24 Morphological analysis device, text analysis method, and program therefor

Publications (1)

Publication Number Publication Date
WO2014030258A1 true WO2014030258A1 (fr) 2014-02-27

Family

ID=50149591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/071485 WO2014030258A1 (fr) 2012-08-24 2012-08-24 Morphological analysis device, text analysis method, and program therefor

Country Status (2)

Country Link
JP (1) JPWO2014030258A1 (fr)
WO (1) WO2014030258A1 (fr)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07271792A (ja) * 1994-03-30 1995-10-20 Nippon Telegr &amp; Teleph Corp <Ntt> Japanese morphological analysis device and Japanese morphological analysis method
JPH09288673A (ja) * 1996-04-23 1997-11-04 Nippon Telegr &amp; Teleph Corp <Ntt> Japanese morphological analysis method and device, and method and device for collecting words not registered in a dictionary
JP2004355483A (ja) * 2003-05-30 2004-12-16 Oki Electric Ind Co Ltd Morphological analysis device, morphological analysis method, and morphological analysis program
JP2007087070A (ja) * 2005-09-21 2007-04-05 Oki Electric Ind Co Ltd Morphological analysis device, morphological analysis method, and morphological analysis program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TETSUJI NAKAGAWA ET AL.: "Chinese and Japanese Word Segmentation Using Word-level and Character-level Information", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 46, no. 11, 15 November 2005 (2005-11-15), pages 2714 - 2727 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200026351A (ko) * 2018-08-29 2020-03-11 동국대학교 산학협력단 Apparatus and method for topic analysis using an improved latent Dirichlet allocation model
KR102181744B1 (ko) 2018-08-29 2020-11-25 동국대학교 산학협력단 Apparatus and method for topic analysis using an improved latent Dirichlet allocation model
WO2021082637A1 (fr) * 2019-10-31 2021-05-06 北京字节跳动网络技术有限公司 Audio information processing method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
JPWO2014030258A1 (ja) 2016-07-28

Similar Documents

Publication Publication Date Title
Smith Linguistic structure prediction
Roark et al. Processing South Asian languages written in the Latin script: the Dakshina dataset
Khan et al. A novel natural language processing (NLP)–based machine translation model for English to Pakistan sign language translation
US9824085B2 (en) Personal language model for input method editor
JP5513898B2 (ja) Shared language model
Hagiwara Real-world natural language processing: practical applications with deep learning
JP5071373B2 (ja) Language processing device, language processing method, and language processing program
JP2010520531A (ja) Integrated pinyin and stroke input
JP6312467B2 (ja) Information processing device, information processing method, and program
JP5809381B1 (ja) Natural language processing system, natural language processing method, and natural language processing program
WO2020170912A1 (fr) Generation device, learning device, generation method, and program
Islam et al. Bangla sentence correction using deep neural network based sequence to sequence learning
Almutiri et al. Markov models applications in natural language processing: a survey
EP3598321A1 (fr) Method for analysing natural language text with constituent construction links
WO2023088309A1 (fr) Narrative text rewriting method, device, apparatus, and medium
JP7103264B2 (ja) Generation device, learning device, generation method, and program
WO2014030258A1 (fr) 2014-02-27 Morphological analysis device, text analysis method, and program therefor
CN111680146A (zh) Method and apparatus for determining new words, electronic device, and readable storage medium
Othmane et al. POS-tagging Arabic texts: A novel approach based on ant colony
Thu et al. Integrating dictionaries into an unsupervised model for Myanmar word segmentation
Tsarfaty Syntax and parsing of semitic languages
US20220092260A1 (en) Information output apparatus, question generation apparatus, and non-transitory computer readable medium
JP7385900B2 (ja) Inference device, inference program, and learning method
Ciosici Improving quality of hierarchical clustering for large data series
Langlais Issues in Analogical Learning over Sequences of Symbols: a Case Study with Named Entity Transliteration.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12883252

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014531472

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12883252

Country of ref document: EP

Kind code of ref document: A1