WO2013046590A1 - Information processing device, information processing method, and program

Info

Publication number: WO2013046590A1
Application number: PCT/JP2012/005906
Authority: WO
Inventors: 長田　誠也; 健花沢; 岡部　浩司
Original assignee: 日本電気株式会社
Priority date: 2011-09-26
Filing date: 2012-09-14
Publication date: 2013-04-04

Abstract

A candidate-vector generator (10) selects, on the basis of statistical data for a first language, a simultaneously used word that is a word used at the same time as a candidate word, the simultaneously used word being selected together with the usage count thereof. The candidate vector generator (10) then generates a candidate vector for individual candidate words according to at least one of the simultaneously used words and the usage count thereof. A context vector generator (20) generates a context vector for a second word by selecting, on the basis of statistical data for the first language, at least one simultaneously used word together with the usage count thereof. A selection unit (30) calculates the degree of similarity between the context vector and the candidate vector for individual candidate words. The selection unit (30) then selects the candidate word with the highest calculated degree of similarity as a word for the first language corresponding to a first word.

Description

Information processing apparatus, information processing method, and program

The present invention relates to an information processing apparatus, an information processing method, and a program for converting input data in which two words are continuous into a first language.

In machine translation and speech recognition, input data in which two words are continuous is converted into a first language using statistical data. The language model most often used for this purpose at present is the N-gram model. Based on the statistics of N-word chains, this model gives the appearance probability of the N-th word following a chain of up to (N−1) words. Because the N-gram model must obtain appearance probabilities from learning data, a word that does not appear in the learning data has an appearance probability of zero. To avoid this, Non-Patent Document 1 uses a class N-gram based on a thesaurus.
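The zero-probability problem described above can be illustrated with a minimal sketch for the case N = 2. The toy corpus and the function name `bigram_probability` are assumptions made for illustration and do not come from this document.

```python
from collections import Counter

# Hypothetical toy learning data: observed two-word chains (not from the source).
bigrams = [
    ("board", "breaks"),
    ("board", "breaks"),
    ("board", "bends"),
    ("cup", "breaks"),
]

bigram_counts = Counter(bigrams)
history_counts = Counter(first for first, _ in bigrams)


def bigram_probability(first, second):
    """Maximum-likelihood estimate: count(first, second) / count(first)."""
    if history_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / history_counts[first]


seen = bigram_probability("board", "breaks")  # chain observed in the data
unseen = bigram_probability("cup", "bends")   # chain never observed, so 0.0
```

A chain that never occurs in the learning data receives probability zero, even though it may be perfectly plausible, which is the gap that class N-grams and the present invention each try to address.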

The technology described in Non-Patent Document 1 incurs a high cost for constructing the thesaurus. An object of the present invention is to reduce the cost of converting input data in which two words are continuous into a first language.

According to the present invention, there is provided an information processing apparatus that converts input data in which at least two words are continuous into a first language,
wherein a first word has a plurality of candidate words that are candidates for the word in the first language, and a second word has one candidate word,
the information processing apparatus comprising:
candidate vector generation means for generating, for the first word, a candidate vector for each candidate word by selecting, based on statistical data of the first language, a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses;
context vector generation means for generating a context vector for the second word by selecting the simultaneously used word of the second word together with its number of uses, based on the statistical data of the first language; and
selection means for selecting the candidate word whose candidate vector has the highest similarity with the context vector as the word of the first language corresponding to the first word.

According to the present invention, there is provided an information processing method for converting input data in which at least two words are continuous into a first language,
wherein a first word has a plurality of candidate words that are candidates for the word in the first language, and a second word has one candidate word,
a computer generates, for the first word, a candidate vector for each candidate word by selecting, based on statistical data of the first language, a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses,
the computer generates a context vector for the second word by selecting the simultaneously used word of the second word together with its number of uses, based on the statistical data of the first language, and
the computer selects the candidate word whose candidate vector has the highest similarity with the context vector as the word of the first language corresponding to the first word.

According to the present invention, there is provided a program for converting input data in which at least two words are continuous into a first language,
wherein a first word has a plurality of candidate words that are candidates for the word in the first language, and a second word has one candidate word,
the program causing a computer to realize:
a function of generating, for the first word, a candidate vector for each candidate word by selecting, based on statistical data of the first language, a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses;
a function of generating a context vector for the second word by selecting the simultaneously used word of the second word together with its number of uses, based on the statistical data of the first language; and
a function of selecting the candidate word whose candidate vector has the highest similarity with the context vector as the word of the first language corresponding to the first word.

According to the present invention, it is possible to reduce the cost when converting input data in which two words are continuous into the first language.

The above-described object and other objects, features, and advantages will be further clarified by a preferred embodiment described below and the following drawings attached thereto.

FIG. 1 is a block diagram showing the functional configuration of the information processing apparatus according to the first embodiment.
FIG. 2 is a flowchart of processing performed by the information processing apparatus shown in FIG. 1.
FIG. 3 is a block diagram showing the functional configuration of the information processing apparatus according to the second embodiment.
FIG. 4 is a flowchart of processing performed by the information processing apparatus shown in FIG. 3.
FIGS. 5 to 7 are diagrams for explaining Example 1.
FIGS. 8 to 10 are diagrams for explaining Example 2.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In all the drawings, the same reference numerals are given to the same components, and the description will be omitted as appropriate.

(First embodiment)
FIG. 1 is a block diagram illustrating a functional configuration of the information processing apparatus according to the first embodiment. This information processing apparatus converts input data in which two words are continuous into character data of a first language. The first word has a plurality of candidates for the converted word in the first language (hereinafter referred to as candidate words). The second word has one candidate word. The information processing apparatus is, for example, a machine translation apparatus or a speech recognition apparatus. When the information processing apparatus is a machine translation apparatus, the input data is character data in a second language, which is a language different from the first language. Whether a word has a plurality of candidate words is determined using, for example, external dictionary data.

This information processing apparatus includes a candidate vector generation unit 10, a context vector generation unit 20, and a selection unit 30.

The candidate vector generation unit 10 performs the following processing for each candidate word of the first word. First, based on the statistical data of the first language, the candidate vector generation unit 10 selects a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses. Then, the candidate vector generation unit 10 generates a candidate vector for each candidate word from at least one simultaneously used word and its number of uses. The statistical data used here is, for example, statistical data of two-word chain information in the first language.
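As a minimal sketch of this step, a candidate vector can be held as a mapping from each simultaneously used word to its number of uses. The function names, the placeholder labels `break_A`, `break_B`, and `break_C` (standing in for three different first-language candidate words), and the chain counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical two-word chain counts: (word, word) -> number of uses.
chain_counts = {
    ("bat", "break_A"): 1,
    ("bone", "break_A"): 1,
    ("cable", "break_B"): 2,
    ("tv", "break_C"): 2,
}


def build_cooccurrence_index(chain_counts):
    """Map every word to a Counter of the words it is chained with."""
    index = defaultdict(Counter)
    for (left, right), count in chain_counts.items():
        index[left][right] += count
        index[right][left] += count
    return index


def candidate_vectors(candidate_words, index):
    """Return one sparse vector (word -> number of uses) per candidate word."""
    return {word: dict(index.get(word, {})) for word in candidate_words}


index = build_cooccurrence_index(chain_counts)
vectors = candidate_vectors(["break_A", "break_B", "break_C"], index)
```

A sparse dictionary is used here because most words never co-occur with a given candidate; a dense vector over the whole vocabulary would be an equally valid representation.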

The context vector generation unit 20 generates a context vector for the second word by selecting at least one simultaneously used word together with the number of uses based on the statistical data of the first language.

The selection unit 30 calculates the similarity between the context vector and the candidate vector for each candidate word. Then, the selection unit 30 selects the candidate word having the highest calculated similarity as the first language word corresponding to the first word.

Note that each component of the information processing apparatus shown in FIG. 1 represents a functional block, not a hardware unit. Each component is realized by an arbitrary combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that realizes the components shown in the figure, a storage unit such as a hard disk that stores the program, and a network connection interface. There are various modifications of the implementation method and apparatus.

FIG. 2 is a flowchart of processing performed by the information processing apparatus shown in FIG. 1. First, input data is input to the information processing apparatus. Then, using externally stored dictionary data, the candidate vector generation unit 10 selects the word that is the first word (a word having a plurality of candidate words) from among the words included in the input data. The candidate vector generation unit 10 then uses the statistical data to generate a candidate vector for each candidate word of the selected first word (step S20).

Further, the context vector generation unit 20 uses the dictionary data to select a word that is the second word (a word having one candidate word) among the words included in the input data. Then, the context vector generation unit 20 generates a context vector for the selected second word (step S40).

Next, the selection unit 30 calculates the similarity between the context vector and the candidate vector for each candidate word (step S60). Then, the selection unit 30 selects the candidate word with the highest calculated similarity as the word of the first language corresponding to the first word (step S80).

As described above, according to the present embodiment, input data in which two words are continuous can be converted into the first language with high accuracy without constructing a thesaurus. In addition, as described below, the accuracy of conversion may be higher than when using a thesaurus.

For example, when a thesaurus is used, accuracy may deteriorate for content that falls under a thesaurus node whose members the thesaurus does not distinguish. Suppose that the thesaurus classifies things by material, such as “wood” and “metal”, and that both “branch” and “board” belong to the “wood” node. The appearance probability of “branch breaks”, which rarely appears in the real world, may then be estimated from the probability of “board breaks” in the learning data. In contrast, the present embodiment does not calculate appearance probabilities in this way, so its conversion accuracy is higher than when the thesaurus is used.

(Second Embodiment)
FIG. 3 is a block diagram illustrating a functional configuration of the information processing apparatus according to the second embodiment. FIG. 4 is a flowchart of processing performed by the information processing apparatus illustrated in FIG. 3. The information processing apparatus according to the present embodiment has the same configuration as the information processing apparatus according to the first embodiment, except that it includes a conversion unit 40.

The conversion unit 40 converts the input data into the first language based on the statistical data. The statistical data used here is, for example, statistical data of two-word chain information in the first language. Specifically, the conversion unit 40 converts each of the two words constituting the input data into words of the first language using dictionary data. Then, the conversion unit 40 generates chains of the converted first-language words and selects, as the conversion result in the first language, the chain with the largest count in the statistical data among the generated chains (step S10 in FIG. 4).

The candidate vector generation unit 10, the context vector generation unit 20, and the selection unit 30 perform the processing described in the first embodiment on input data that could not be converted by the conversion unit 40, that is, input data for which none of the first-language word chains is included in the statistical data.
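The division of labor between the conversion unit 40 and the vector-based units can be sketched roughly as follows. The function name `convert_by_chain_count`, the placeholder labels `break_A` and `break_B`, and the counts are assumptions made for illustration; returning `None` is simply one way to signal that the vector-based processing should take over.

```python
from itertools import product


def convert_by_chain_count(candidates_per_word, chain_counts):
    """Return the candidate chain with the largest count, or None if none was observed."""
    best_chain, best_count = None, 0
    for chain in product(*candidates_per_word):
        count = chain_counts.get(chain, 0)
        if count > best_count:
            best_chain, best_count = chain, count
    return best_chain


# Hypothetical dictionary lookup result: one candidate for the second word,
# two candidates for the first word.
candidates_per_word = [["radio"], ["break_A", "break_B"]]

# Case 1: some chains are present in the statistical data.
observed = {("radio", "break_A"): 3, ("radio", "break_B"): 1}
direct_result = convert_by_chain_count(candidates_per_word, observed)

# Case 2: no chain is present, so the caller falls back to the
# candidate-vector / context-vector processing of the first embodiment.
unobserved = {("tv", "break_B"): 2}
fallback_needed = convert_by_chain_count(candidates_per_word, unobserved) is None
```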

This embodiment also provides the same effect as the first embodiment. In addition, since the conversion unit 40 first performs conversion processing based on the statistical data, the conversion accuracy is further increased.

(Example 1)
An operation in which the information processing apparatus according to the second embodiment selects a Japanese translation of “break” for the original data “radio breaks” will be described. Here, “break” is an intransitive verb used in the English SV syntax.

As shown in FIG. 5, when translating the original data “radio breaks”, the conversion unit 40 refers to the dictionary data and recognizes that “radio” has a single translated word. In this case, “radio” forms the context. The conversion unit 40 further recognizes that “breaks” has the base form “break” and that “break” has a plurality of translated words. Here, the selection candidates are pairs consisting of the verb base form “break” and one of its translated words, and the context is the pair consisting of the subject base form “radio” and its translated word. These pairs are used as the input data, as shown in FIG. 5.

The statistical data holds, as two-word chain information, the subject and the verb base form of sentences with SV syntax, each paired with its translated word. In step S10 of FIG. 4, the conversion unit 40 checks whether any of the sets formed by combining “radio” and its translated word with “break” and one of its translated words exists in the statistical data. Here, it is assumed that none of the sets exists in the statistical data.

Then, in step S20 of FIG. 4, the candidate vector generation unit 10 refers to the statistical data for each of the selection candidates, that is, for each translated word of “break”. Assume that the statistical data is as shown in FIG. 6. In this example, for the first translated word of “break”, “bat break” appears once and “bone break” appears once. For the second translated word, “cable break” appears twice. For the third translated word, “tv break” appears twice.

In step S20 of FIG. 4, the candidate vector generation unit 10 then generates candidate vectors as shown in FIG. 6. The candidate vector corresponding to the first translated word of “break” is “bat: 1, bone: 1, ...”. The candidate vector corresponding to the second translated word is “cable: 2, ...”. The candidate vector corresponding to the third translated word is “tv: 2, ...”.

Further, in step S40 of FIG. 4, the context vector generation unit 20 refers to the statistical data for “radio”. Suppose that “radio start”, “radio close”, and “radio open” each appear once in the statistical data.

Then, the context vector generation unit 20 extracts “start”, “close”, “open”, and so on, and extracts the statistical data regarding these words. As a result, as shown in FIG. 7, suppose that “radio start” appears once, “tv start” twice, “radio close” once, “bakery close” twice, “radio open” once, “opera open” once, and so on. The context vector generation unit 20 generates the context vector “radio: 1, tv: 2, ...” for “start”, “radio: 1, bakery: 2, ...” for “close”, and “radio: 1, opera: 1, ...” for “open”.
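The two-step lookup described for the context vectors can be sketched as follows, using only the counts stated in this example; entries elided with “...” in the text are omitted. The function name `context_vectors` and the (subject, verb) tuple layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Subject-verb chain counts taken from the example text (elided entries omitted).
sv_counts = {
    ("radio", "start"): 1,
    ("tv", "start"): 2,
    ("radio", "close"): 1,
    ("bakery", "close"): 2,
    ("radio", "open"): 1,
    ("opera", "open"): 1,
}


def context_vectors(context_word, sv_counts):
    """Build one vector (subject -> count) per verb used together with context_word."""
    # Step 1: verbs that are used together with the context word.
    co_used_verbs = {
        verb
        for (subject, verb), count in sv_counts.items()
        if subject == context_word and count > 0
    }
    # Step 2: for each of those verbs, collect every subject and its count.
    vectors = defaultdict(Counter)
    for (subject, verb), count in sv_counts.items():
        if verb in co_used_verbs:
            vectors[verb][subject] += count
    return {verb: dict(counter) for verb, counter in vectors.items()}


radio_context = context_vectors("radio", sv_counts)
```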

Then, in step S60 of FIG. 4, the selection unit 30 calculates the similarity between each candidate vector and each context vector. For example, using cosine similarity, the selection unit 30 obtains the similarity between the candidate vector of each translated word of “break” and each of the context vectors of “start”, “close”, and “open”.

Then, in step S80 of FIG. 4, the selection unit 30 selects the candidate word whose candidate vector has the highest similarity with a context vector. Here, it is assumed that the cosine similarity of the pair consisting of the candidate vector of one translated word of “break” and the context vector of “start” is higher than that of every other candidate vector / context vector pair. The selection unit 30 therefore selects that translated word as the translation of “break”.
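The comparison in steps S60 and S80 can be worked through with the counts given in this example. In the sketch below, the three translated words of “break” are given the placeholder labels `break_A`, `break_B`, and `break_C`, which are assumptions, and entries elided with “...” in the text are left out, so the resulting scores are illustrative only.

```python
import math

# Candidate vectors (simultaneously used subjects and their counts).
candidate_vecs = {
    "break_A": {"bat": 1, "bone": 1},
    "break_B": {"cable": 2},
    "break_C": {"tv": 2},
}

# Context vectors obtained for the verbs used together with "radio".
context_vecs = {
    "start": {"radio": 1, "tv": 2},
    "close": {"radio": 1, "bakery": 2},
    "open": {"radio": 1, "opera": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Step S60: similarity of every candidate vector / context vector pair.
scores = {
    (cand, ctx): cosine_similarity(cand_vec, ctx_vec)
    for cand, cand_vec in candidate_vecs.items()
    for ctx, ctx_vec in context_vecs.items()
}

# Step S80: the candidate belonging to the highest-scoring pair is selected.
best_pair = max(scores, key=scores.get)
selected_candidate = best_pair[0]
```

With these counts, only the candidate whose vector contains “tv” shares an element with any context vector, so it forms the highest-scoring pair together with the “start” context vector.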

In summary, in this example, the translated word of “break” is selected in the context “radio” for the original data “radio breaks”. Even when the statistical data contains no data for “radio break”, the appropriate translated word can be selected accurately from among the plurality of translated words of “break”.

(Example 2)
An operation when the voice input “ema breaks” is input will be described for the case where the information processing apparatus according to the second embodiment is a Japanese speech recognition apparatus. The input data is a network (voice data) in the state before the language model is applied.

As shown in FIG. 8, assume that the candidates for the speech recognition result are “bait breaks”, “ema breaks”, and “branch breaks”. In this case, the candidate words for the first word are “bait”, “ema”, and “branch”, and the second words, each of which has only one candidate word, are the particle “ga” and “break”. These two words form the context. None of these three-word chains is included in the statistical data.

In step S10 of FIG. 4, the conversion unit 40 confirms that the statistical data does not include any of the three-word chains “bait breaks”, “ema breaks”, and “branch breaks”.

Next, in step S20 of FIG. 4, the candidate vector generation unit 10 extracts data including each candidate word (“bait”, “ema”, and “branch”) from the statistical data. Here, it is assumed that “eat bait” and “bait is low” are extracted for “bait”, that “write ema” and “ema is thin” are extracted for “ema”, and that chains such as “branch is long” are extracted for “branch”. A candidate vector is generated as shown in FIG. 9 from the portion of the extracted data other than the candidate word (that is, the simultaneously used word) and its frequency.

Next, in step S40 of FIG. 4, the context vector generation unit 20 extracts data including the context part “break” from the statistical data. Here, it is assumed that “plate breaks” and “cup breaks” can be extracted. Then, the context vector generation unit 20 recognizes “plate” and “cup”, which are the words other than the context part (“break”), as simultaneously used words, and extracts data including them from the statistical data. Here, it is assumed that “plate breaks” and “plate is thin” are extracted for “plate”, and that “cup breaks” and “drink with a cup” are extracted for “cup”. Then, the context vector generation unit 20 generates context vectors as shown in FIG. 10.

In step S60 of FIG. 4, the selection unit 30 calculates the similarity between each candidate vector and each context vector. The selection unit 30 calculates the similarity between two vectors based on, for example, the Hamming distance, by counting the number of elements that have a frequency of 1 or more in both vectors.

Then, in step S80 of FIG. 4, the selection unit 30 selects the candidate word whose candidate vector has the highest similarity with a context vector. In this example, the pair consisting of the “plate” context vector and the “ema” candidate vector, which have the common element “thin”, is selected.
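The selection in this example can be sketched with a simple shared-element count, which is a simplification of the Hamming-distance-based measure mentioned in the text rather than a full Hamming distance. The English keys, the frequency of 1 assigned to every element, and the partial “branch” vector are assumptions, since the actual values appear only in the figures.

```python
# Candidate vectors: simultaneously used words of each candidate (frequencies assumed).
candidate_vecs = {
    "bait": {"eat": 1, "low": 1},
    "ema": {"write": 1, "thin": 1},
    "branch": {"long": 1},  # partial: only the element recoverable from the text
}

# Context vectors: simultaneously used words of "plate" and "cup" (frequencies assumed).
context_vecs = {
    "plate": {"break": 1, "thin": 1},
    "cup": {"break": 1, "drink": 1},
}


def shared_element_count(u, v):
    """Count the elements whose frequency is 1 or more in both vectors."""
    return sum(1 for word, count in u.items() if count >= 1 and v.get(word, 0) >= 1)


# Score every candidate vector / context vector pair.
pairs = [
    (shared_element_count(cand_vec, ctx_vec), cand, ctx)
    for cand, cand_vec in candidate_vecs.items()
    for ctx, ctx_vec in context_vecs.items()
]

# The candidate in the highest-scoring pair becomes the recognition result.
best_score, best_candidate, best_context = max(pairs, key=lambda pair: pair[0])
```

Under these assumed vectors, “ema” and “plate” are the only pair with a common element (“thin”), which matches the selection described above.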

As described above, in this example, even when the statistical data contains no data for “ema breaks”, the speech recognition result “ema breaks” can be output with high accuracy using the statistical data.

Although the embodiments and examples of the present invention have been described above with reference to the drawings, these are illustrations of the present invention, and various configurations other than the above can also be adopted. For example, the information processing apparatus is not limited to a language processing system such as a machine translation apparatus or a speech recognition apparatus, and may be a recommendation system that uses an action history, such as “people who bought this product also bought that one”.

In the above embodiments, a single context is described, but there may be a plurality of contexts. In this case, the similarity is calculated for the context vector of each of the plurality of contexts, and the selection unit may select one context by extracting the candidate vector / context vector pair having the highest similarity.

This application claims priority based on Japanese Patent Application No. 2011-208706 filed on September 26, 2011, the entire disclosure of which is incorporated herein.

Claims

1. An information processing apparatus that converts input data in which at least two words are continuous into a first language,
wherein a first word has a plurality of candidate words that are candidates for the word in the first language, and a second word has one candidate word,
the information processing apparatus comprising:
candidate vector generation means for generating, for the first word, a candidate vector for each candidate word by selecting, based on statistical data of the first language, a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses;
context vector generation means for generating a context vector for the second word by selecting the simultaneously used word of the second word together with its number of uses, based on the statistical data of the first language; and
selection means for selecting the candidate word whose candidate vector has the highest similarity with the context vector as the word of the first language corresponding to the first word.
2. The information processing apparatus according to claim 1,
wherein the information processing apparatus is a translation apparatus, and
the input data is character data of a second language that is a language different from the first language.
3. The information processing apparatus according to claim 1,
wherein the information processing apparatus is a speech recognition apparatus.
4. The information processing apparatus according to any one of claims 1 to 3,
further comprising conversion means for converting the input data into the first language based on statistical data,
wherein the candidate vector generation means, the context vector generation means, and the selection means process the input data that could not be converted by the conversion means.
5. An information processing method for converting input data in which at least two words are continuous into a first language,
wherein a first word has a plurality of candidate words that are candidates for the word in the first language, and a second word has one candidate word,
a computer generates, for the first word, a candidate vector for each candidate word by selecting, based on statistical data of the first language, a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses,
the computer generates a context vector for the second word by selecting the simultaneously used word of the second word together with its number of uses, based on the statistical data of the first language, and
the computer selects the candidate word whose candidate vector has the highest similarity with the context vector as the word of the first language corresponding to the first word.
6. A program for converting input data in which at least two words are continuous into a first language,
wherein a first word has a plurality of candidate words that are candidates for the word in the first language, and a second word has one candidate word,
the program causing a computer to realize:
a function of generating, for the first word, a candidate vector for each candidate word by selecting, based on statistical data of the first language, a simultaneously used word, which is a word used at the same time as the candidate word, together with its number of uses;
a function of generating a context vector for the second word by selecting the simultaneously used word of the second word together with its number of uses, based on the statistical data of the first language; and
a function of selecting the candidate word whose candidate vector has the highest similarity with the context vector as the word of the first language corresponding to the first word.
7. The program according to claim 6,
wherein the program is a machine translation program, and
the input data is character data of a second language that is a language different from the first language.
8. The program according to claim 6,
wherein the program is a program for speech recognition.