WO2022123695A1

WO2022123695A1 - Learning device, search device, learning method, search method, and program

Info

Publication number: WO2022123695A1
Application number: PCT/JP2020/045898
Authority: WO
Inventors: 拓長谷川; 京介西田; 宗一郎加来; 準二富田; 仙吉田
Original assignee: 日本電信電話株式会社
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2022-06-16

Abstract

A learning device according to one embodiment is characterized by having: a feature quantity generation unit that accepts, as input, a plurality of training data in which a search query, a first document associated with the search query, and a second document not associated with the search query are included, and that, by using model parameters of a first neural network, generates a plurality of first feature quantity vectors which respectively represent a plurality of features of the search query, a plurality of second feature quantity vectors which respectively represent a plurality of features of the first document, and a plurality of third feature quantity vectors which respectively represent a plurality of features of the second document; a conversion unit that, by using the model parameters of a second neural network, converts the plurality of first feature quantity vectors, the plurality of second feature quantity vectors, and the plurality of third feature quantity vectors into a plurality of first sparse feature quantity vectors for learning, a plurality of second sparse feature quantity vectors for learning, and a plurality of third sparse feature quantity vectors for learning, respectively, the sparse feature quantity vectors for learning being made sparse by adjusting the ratio of elements that take a 0 in each dimension by normalization and mean shift; and an update unit for updating the model parameters of the first neural network and the model parameters of the second neural network by using the plurality of first sparse feature quantity vectors for learning, the plurality of second sparse feature quantity vectors for learning, and the plurality of third sparse feature quantity vectors for learning.

Description

Learning device, search device, learning method, search method and program

The present invention relates to a learning device, a search device, a learning method, a search method, and a program.

Document search requires high-speed retrieval of documents related to search queries from a large number of documents. As a technology to realize this requirement, for example, after creating an inverted index whose key is a word contained in a document and whose value is the document number of the document containing the word, a search query is performed using this inverted index. A technique for searching a document using the words contained in is known.

In addition, when a document search is performed with an exact word match, search omission may occur due to vocabulary ambiguity or notational fluctuations. For this reason, as a technique that can perform document retrieval even if words do not exactly match, there is a technique that regards the vector obtained by the neural network as a potential word vector, creates an inverted index, and performs document retrieval. It is known (for example, Non-Patent Document 1).

However, in the prior art described in Non-Patent Document 1 and the like described above, in order to perform a high-speed search using an inverted index, a constraint term relating to the vector norm is added to the loss function at the time of training in a high dimension. It realized a sparse vector. For this reason, it is often difficult to explicitly control the sparsity of the obtained vector, and there is a possibility that the feature space is represented by a specific low-dimensional subspace.

One embodiment of the present invention has been made in view of the above points, and an object thereof is to acquire a vector that can be regarded as a pseudo sparse in a document search using an inverted index by a neural network.

In order to achieve the above object, the learning device according to the embodiment includes a search query, a first document related to the search query, and a plurality of second documents not related to the search query. Using the training data as input and using the model parameters of the first neural network, the features of the plurality of first features and the features of the first document are represented, respectively, which represent the features of the plurality of search queries. Using a feature amount generator that generates a plurality of second feature amount vectors and a plurality of third feature amount vectors representing the features of the second document, and model parameters of the second neural network. The elements that take 0 in each dimension by normalization and average shift for each of the plurality of the first feature quantity vectors, the plurality of the second feature quantity vectors, and the plurality of the third feature quantity vectors. A conversion unit that converts a plurality of first learning sparse feature quantities vectors, a plurality of second learning sparse feature quantities vectors, and a plurality of third learning sparse feature quantity vectors that have been sparsed by adjusting the ratio. And the first neural using the plurality of the first learning sparse feature quantity vectors, the plurality of the second learning sparse feature quantity vectors, and the plurality of the third learning sparse feature quantity vectors. It is characterized by having an update unit for updating the model parameters of the network and the model parameters of the second neural network.

In a document search using an inverted index, a vector that can be regarded as a pseudo sparse can be acquired by a neural network.

It is a figure which shows an example of the whole structure of the search apparatus which concerns on 1st Embodiment. It is a flowchart which shows an example of the search process which concerns on 1st Embodiment. It is a figure which shows an example of the whole structure of the inverted index generation apparatus which concerns on 1st Embodiment. It is a flowchart which shows an example of the inverted index generation processing which concerns on 1st Embodiment. It is a figure which shows an example of the whole structure of the learning apparatus which concerns on 1st Embodiment. It is a flowchart which shows an example of the learning process which concerns on 1st Embodiment. It is a flowchart which shows an example of the model parameter update process which concerns on 1st Embodiment. It is a figure which shows the comparative example of a frequency distribution. It is a figure which shows the comparative example of a frequency distribution. It is a figure which shows an example of the whole structure of the search apparatus which concerns on 2nd Embodiment. It is a flowchart which shows an example of the search process which concerns on 2nd Embodiment. It is a figure which shows an example of the whole structure of the inverted index generation apparatus which concerns on 2nd Embodiment. It is a flowchart which shows an example of the inverted index generation processing which concerns on 2nd Embodiment. It is a figure which shows an example of the whole structure of the learning apparatus which concerns on 2nd Embodiment. It is a flowchart which shows an example of the model parameter update process which concerns on 2nd Embodiment. It is a figure which shows an example of the whole structure of the search apparatus which concerns on 3rd Embodiment. It is a figure which shows an example of the whole structure of the inverted index generation apparatus which concerns on 3rd Embodiment. It is a figure which shows an example of the whole structure of the learning apparatus which concerns on 3rd Embodiment. It is a figure which shows an example of the functions g ₁ and g ₂ and the partial derivative thereof. It is a flowchart which shows an example of the model parameter update process which concerns on 3rd Embodiment. It is a figure which shows the modification of the whole structure of the learning apparatus which concerns on 3rd Embodiment. It is a figure which shows an example of the hardware configuration of a computer.

Hereinafter, an embodiment of the present invention will be described.

[First Embodiment]
In the present embodiment, a search device 10 for searching a document related to a search query from among the documents to be searched will be described using a vector obtained by a neural network and an inverted index. Further, the inverted index generation device 20 for generating (or creating) the inverted index and the learning device 30 for learning the neural network will also be described.

In the present embodiment, the search device 10, the inverted index generation device 20, and the learning device 30 are described as different devices, but two or more of these devices are realized by the same device. You may be. For example, the search device 10 and the inverted index generation device 20 may be realized by the same device, the inverted index generation device 20 and the learning device 30 may be realized by the same device, or the learning device 30 and the search device may be realized. 10 may be realized by the same device, or the search device 10, the inverted index generation device 20, and the learning device 30 may be realized by the same device.

-At the time of search First, a case where a document search is performed by the search device 10 will be described. Here, the search target document set is {D ₁ , ..., D _m }, the search device 10 inputs the search query Q, and the sequence set of documents related to the search query Q {D ₁ , ... ..., D _k } and its relevance degree {S ₁ , ..., _Sk } shall be output. m is the number of documents to be searched, and k (where k ≦ m) is the number of documents related to the search query Q.

The search query Q and each search target document Di ( _i = 1, ..., M) are texts (character strings). Further, the document related to the search query Q is a document obtained as a search result for the search query Q.

<Overall configuration of search device 10>
The overall configuration of the search device 10 according to the present embodiment will be described with reference to FIG. FIG. 1 is a diagram showing an example of the overall configuration of the search device 10 according to the first embodiment.

As shown in FIG. 1, the search device 10 according to the present embodiment has a context coding unit 101, a pseudo sparse coding unit 102, an inverted index utilization unit 103, and a ranking unit 104. Here, it is assumed that the context coding unit 101 and the pseudo-sparse coding unit 102 are realized by a neural network, and their parameters have been learned in advance. Hereinafter, the parameters of the neural network that realizes the context coding unit 101 and the pseudo sparse coding unit 102 are referred to as “model parameters”. The trained model parameters are stored in an auxiliary storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), for example.

The context coding unit 101 takes the search query Q as an input and outputs the feature amount U of the search query Q using the trained model parameters.

Here, as the neural network that realizes the context coding unit 101, for example, BERT (Bidirectional Encoder Representations from Transformers) or the like can be used. BERT is a context-aware pre-learning model using Transformer, which takes text as input and outputs d-dimensional features. By converting this feature quantity with one layer of a fully coupled neural network, it demonstrates high performance in various tasks of natural language processing. For more information on BERT, see, for example, Reference 1 "J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv preprint arXiv: 1810.4805 , 2018. ”etc. For details on Transformer, see, for example, Reference 2 "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Please refer to "1706.03762, 2017." etc.

When BERT is used as a neural network to realize the context coding unit 101, the CLS tag is added to the beginning of the search query Q and the SEP tag is added to the end of the sentence, and then input to the context coding unit 101.

Note that BERT is an example, and another context-considered pre-learning model using Transformer may be used as the neural network that realizes the context coding unit 101. More generally, as the neural network that realizes the context coding unit 101, any neural network capable of encoding text may be used. However, by realizing the context coding unit 101 with a context-aware pre-learning model such as BERT, it is possible to obtain a feature quantity considering the entire context. Hereinafter, it is assumed that the context coding unit 101 is realized by BERT, and the feature quantity U is a d-dimensional vector.

The pseudo-sparse coding unit 102 takes the feature amount U of the search query Q as an input and uses the trained model parameters to use the pseudo-sparse feature amount U'of the search query Q (that is, the search query Q which can be regarded as pseudo-sparse). Feature quantity U') is output.

Here, as the neural network that realizes the pseudo-sparse coding unit 102, for example, the model of the fully connected layer described in Non-Patent Document 1 can be used. More specifically, several fully connected layers (for example, about 3 to 5 layers) are stacked so that the dimension number d'of the pseudo-sparse feature amount U'is larger than the dimension number d of the feature amount U. After using a general ignition function such as the ReLu function as the ignition function of the final layer of these fully connected layers, divide each output value of the ignition function by the L2 norm of the entire output value to move onto a hypersphere with a radius of 1. A projecting model can be used. By using the ReLu function as the firing function of the final layer, it is possible to obtain a pseudo-sparse feature quantity U'having 0 as an element (that is, by using the ReLu function as the firing function of the final layer, other functions. Compared to the case of using, it is possible to acquire more sparse expressive ability).

The model described in Non-Patent Document 1 is an example, and as a neural network for realizing the pseudo-sparse coding unit 102, the output dimension is higher than the input dimension and the output dimension is higher than the input dimension. Any model can be used as long as the final layer uses a general ignition function f: R → R that satisfies all of the following conditions 1-1 to 1-3.

Condition 1-1: f (x) ≧ 0 for all x Conditions 1-2: f is a monotonous increase Condition 1-3: a ∈ R such that f (a) = 0 Existence Further, it is preferable that the dimension number d'of the pseudo-sparse feature quantity U'is as high as possible. However, the higher the number of dimensions d', the higher the expressive power of the pseudo-sparse feature quantity U', while the higher the calculation cost for calculating the pseudo-sparse feature quantity U'and the higher the learning cost for learning the model parameters. Become. Further, the amount of information and the allowable calculation cost of the document set to be searched may differ depending on the situation, and the dimension of the space where the codomain of the map by the neural network that realizes the dimension number d'and the pseudo-sparse coding unit 102 extends. It does not always match the number (that is, the rank of the representation matrix of the map). Therefore, the degree of the dimension number d'may differ, for example, depending on the amount of information possessed by the document set to be searched, the available computational resources, and the like.

Also, the projection on the above-mentioned hypersphere is not essential and may not be necessary. However, it is preferable to perform this projection because it is highly expected to promote the learning of pseudo-sparse features.

In the present embodiment, the context coding unit 101 and the pseudo sparse coding unit 102 are expressed as different functional units, but this is for convenience, and the context coding unit 101 and the pseudo sparse coding unit 102 are represented. May be one functional unit. For example, the context coding unit 101 and the pseudo-sparse coding unit 102 may be collectively referred to as the coding unit 100.

The inverted index utilization unit 103 receives a pseudo-sparse feature quantity U'as an input, and obtains a subset {V'i | _i ∈ K} of the pseudo-sparse feature quantity of the search target document by using the inverted index generated in advance. K is | K | = k, and is a set of indexes (or document numbers, document IDs, etc.) of documents related to the search query Q. Further, the pseudo-sparse feature quantity _V'i of the search target document is a d'dimensional vector obtained by inputting the search target document _Di to the context coding unit 101 and the pseudo sparse coding unit 102. Hereinafter, for _i = 1, ..., M, it is expressed as V'i = ( _v'i1 , _v'i2 , ..., _v'id' ). The index of a document is also referred to as a "document index". The inverted index is stored in an auxiliary storage device such as an HDD or SSD, for example.

Here, the inverted index according to the present embodiment uses each dimension 1, 2, ..., D'(that is, a dimension index or a dimension number) of the pseudo-sparse feature quantity as a key, and Cr = {with respect to the key _r . (I, _v'ir ) | _v'ir ∈ Wr} ⁱ _{∈ {1, ..., m}} is the information set as the value. Wr is a set { _v'1r , _v'2r , ..., _Which is a collection of elements of the ^r -th dimension of the pseudo-sparse features _V'1 , _V'2 , ... For _v'mr }, it is a subset of the top t% in descending order of the value. In addition, t is a preset threshold value (however, 0 <t ≦ 100), and may be a value different from the threshold value t _pas or t _que described later, or may be the same value.

At this time, the inverted index utilization unit 103 transposes using each dimension _r where u'r ≠ 0 with respect to the sparse feature quantity U'= (u'1, _u'2 , ..., _u'd _' ) as a key. Get value from the index. At this time, if it is desired to perform document retrieval at a higher speed, the inverted index utilization unit 103 uses each element of the pseudo-sparse feature quantity U'= (u'1, _u'2 , ..., _U'd _' ). Of these, an element whose value is smaller than a predetermined threshold value may be approximated to 0, and then the value may be acquired (in this case, the value is not acquired for the dimension corresponding to the element approximated to 0). ).

Then, the inverted index utilization unit 103 obtains a subset {V'i | _i ∈ K} of the sparse features of the search target document, with the set of all document indexes included in the set of acquired values as K. In the following, by renumbering the document index, the subset {V'i | _i ∈ K} of the sparse features of the document to be searched is also expressed as { _V'1 , ..., _V'k }.

The ranking unit 104 is a subset of the pseudo-sparse feature quantity U'of the search query Q and the pseudo-sparse feature quantity of the search target document {V'i | _i ∈ K} = { _V'1 , ..., _V'k }. Is input to output the sequence set {D _i | i ∈ K} of the documents related to the search query Q (hereinafter, also referred to as "related documents") and the degree of relevance {S _i | i ∈ K}. .. The ordered set of related documents {D _i | i ∈ K} is a set ordered in ascending or descending order of the degree of relevance S _i . In addition, the ordered set of related documents {D _i | i ∈ K} and its relevance degree {S _i | i ∈ K} are renumbered from the document index to {D ₁ , ..., D _k } and, respectively. It can be expressed as _{ S ₁ , ..., Sk}.

At this time, the ranking unit 104 converts the subset { _V'1 , ..., _V'k } of the pseudo-sparse features of the search target document into { _V''1 , ..., _V''k }. Then, using an appropriate similarity function s that measures the similarity between vectors, the degree of association S _i between the search query Q and the document _Di is calculated by S _i = s (U', _V''i ). .. As the similarity function s, for example, an inner product or the like can be used. However, as the similarity function s, any function capable of measuring the similarity between vectors can be used. Further, for example, the similarity function s may be defined by s = 1 / d using an arbitrary distance function d capable of measuring the distance between vectors.

Here, assuming that _V''i = ( _{v''i1, v''i2} _, ..., V'' _ik ), the ranking unit 104 is subjected to _V'i = ( _v'i1 ) according to the following equation (1). , _V'i2 , ..., _v'id' ) is converted to _V''i .

As described above, the ranking unit 104 has v'' _ir = _v'ir if (i, _v'ir ) is included in the value Cr for _r = 1, ..., K, otherwise. Converts _V'i to _V''i by setting v'' _ir = 0. In addition, depending on the method of generating the value Cr of the inverted index, in this conversion, the value of each element of _V'i ₌ ( _v'i1 , _v'2 , ..., _V'ik ) is less than the upper t%. If it is, it means that it is set to 0.

<Search process>
The search process for obtaining the ordered set {D _i | i ∈ K} of the related documents of the input search query Q and the degree of relevance {S _i | i ∈ K} will be described with reference to FIG. FIG. 2 is a flowchart showing an example of the search process according to the first embodiment.

Step S101: First, the context coding unit 101 inputs the search query Q and outputs the feature amount U of the search query Q using the trained model parameters.

Step S102: Next, the pseudo-sparse coding unit 102 outputs the pseudo-sparse feature amount U'of the search query Q using the trained model parameters with the feature amount U obtained in the above step S101 as an input. ..

Step S103: Next, the inverted index utilization unit 103 uses the inverted index generated in advance with the pseudo-sparse feature amount U'obtained in the above step S102 as an input to obtain the pseudo-sparse feature amount of the search target document. Obtain a subset {V'i | _i ∈ K}.

Step S104: Then, the ranking unit 104 inputs {V'i | i ∈ K} of the pseudo-sparse feature quantity U'obtained in the above step S102 and the set {V'i | _i ∈ K} obtained in the above step S103. After converting V'i | _i ∈ K} to {V''i | _i ∈ K}, search using this {V''i | _i ∈ K} and the pseudo-sparse feature U'. The ordered set of related documents of query Q {D _i | i ∈ K} and its relevance degree {S _i | i ∈ K} are output.

As described above, the search device 10 according to the present embodiment obtains an ordered set {D _i | i ∈ K} of documents related to the input search query Q and a degree of relevance {S _i | i ∈ K}. Can be done. At this time, the search device 10 according to the present embodiment uses the pseudo-sparse feature amount U'of the search query Q and the translocation index generated in advance by the translocation index generation device 20 to reduce the document amount of the search target document. It is possible to obtain a related document and its relevance in consideration of the context of the search query Q and the entire search target document after satisfying the search speed required for the document search without depending on the order.

-At the time of generating an inverted index Next, a case where an inverted index is generated by the inverted index generator 20 will be described. Here, it is assumed that the inverted index generation device 20 inputs a set of documents to be searched {D ₁ , ..., D _m } and outputs an inverted index.

<Overall configuration of inverted index generator 20>
The overall configuration of the inverted index generation device 20 according to the present embodiment will be described with reference to FIG. FIG. 3 is a diagram showing an example of the overall configuration of the inverted index generation device 20 according to the first embodiment.

As shown in FIG. 3, the inverted index generation device 20 according to the present embodiment has a context coding unit 101, a pseudo sparse coding unit 102, and an inverted index generation unit 105. Here, the context coding unit 101 and the pseudo sparse coding unit 102 are realized by the same neural network as the context coding unit 101 and the pseudo sparse coding unit 102 described at the time of the above search, and their model parameters are realized. Is pre-learned.

The context coding unit 101 takes the search target document _Di as an input and _outputs the feature amount _Vi of the search target document Di using the trained model parameters.

The pseudo-sparse coding unit 102 inputs the feature amount V _i of the search target document _Di and outputs the pseudo sparse feature amount _V'i of the search target document _Di using the trained model parameters.

The inverted index generation unit 105 inputs the set { _V'1 , ..., _V'm } of the pseudo-sparse features of each search target document Di ( _i = 1, ..., M) as an inverted index. Is generated and output. As described above, the inverted index uses the index or dimension number of the dimension of the pseudo-sparse feature as a key, and for the key ^r , C _r = {(i, _v'ir ) | _v'ir ∈ Wr} _{i ∈ {1 , ..., m}} is the information set as the value. Therefore, in the inverted index generation unit 105, each element _v'ir (r = 1, ..., D') of each pseudo-sparse feature quantity V'i ( _i = 1, ..., M) is {v'. It is determined whether or not it is included in the upper t% of _1r , _v'2r , ..., _V'mr }, and if it is included in the upper t%, it is included in the set Cr of values whose key is _r (i, v). 'By adding _ir ), an inverted index is generated. The search speed at the time of document retrieval is determined according to the number of elements (that is, the number of values) of the set Cr of the values of the inverted index, and the number of elements can be adjusted by the value of the threshold value _t . Therefore, if the calculation speed of the processor or the like is known, it is possible to adjust the search speed (in other words, the search amount) so as to satisfy the search time required for document retrieval by adjusting the value of t. Become.

<Inverted index generation process>
The inverted index generation process for generating an inverted index from the input set of search target documents {D ₁ , ..., D _m } will be described with reference to FIG. FIG. 4 is a flowchart showing an example of the inverted index generation process according to the first embodiment. The inverted index generation process is executed after the learning process described later is completed and before the search process described above is executed.

Step S201: First, the context coding unit 101 inputs the search target document _Di and _outputs the feature amount _Vi of the search target document Di using the trained model parameters.

Step S202: Next, the pseudo-sparse coding unit 102 inputs the feature amount Vi of the search target document _Di and uses the _trained model parameters to generate the pseudo-sparse feature amount _V'i of the search target document _Di. Output.

The above steps S201 to S202 are repeatedly executed for all the search target documents Di ( _i = 1, ..., M).

Step S203: Then, the inverted index generation unit 105 sets { _V'1 , ..., _V'm } of pseudo-sparse features of each search target document Di ( _i = 1, ..., M). Generates and outputs an inverted index as input.

As described above, the inverted index generation device 20 according to the present embodiment can generate an inverted index from the set of input search target documents {D ₁ , ..., D _m }. As described above, by using this transposed index, the search device 10 satisfies the search speed required for the document search without depending on the order of the document amount of the search target document, and then performs the search query Q and the search query Q. It is possible to obtain a related document considering the context of the entire search target document and its degree of relevance (that is, a document related to the search query Q can be searched).

-During learning Next, a case where the neural network (neural network that realizes the context coding unit 101 and the pseudo-sparse coding unit 102) is learned by the learning device 30 will be described. Here, it is assumed that the model parameters have not been learned at the time of learning, and the learning device 30 inputs the training data set and learns the model parameters. A training data set is a set of training data used for training (training) model parameters.

In this embodiment, for example, Reference 3 “Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen A training dataset shall be created in advance from the dataset described in "Saurabh Tiwary, Tong Wang. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv preprint arXiv: 1611.09268, 2018.".

The data set described in Reference 3 above includes a search query set R = {Q ₁ , ..., Q _c } and a set of search target documents G = {D ₁ , ..., D _m' }. Consists of. c is the number of search queries, and m'is the number of documents to be searched. In addition, m'= m may be used, or m'≠ m. However, it is preferable that m'≧ m.

Further, for the search query Q _i ( _{i = 1, ..., C), a set of documents related to this search query Q i} ₌ {D _j | D _j is a document related to Q _i }. It shall be labeled as correct answer data.

At this time, one document randomly extracted from the set of documents G _i related to the search query Q _i is D _i ⁺ , and one document randomly extracted from the set G \ G _i of the documents not related to the search query Q _i . Let _Di- ^, and _{let (Qi, Di +, Di-} ₎ ^be _the ^training data (that is, let the data composed of the search query _Qi , its positive example, and its negative example be the training data. ). Then, a set of these training data {(Q _i , Di ⁺ , _Di ⁻ ) | _i = 1, ..., C} is used as a training data set.

<Overall configuration of learning device 30>
The overall configuration of the learning device 30 according to the present embodiment will be described with reference to FIG. FIG. 5 is a diagram showing an example of the overall configuration of the learning device 30 according to the first embodiment.

As shown in FIG. 5, the learning device 30 according to the present embodiment has a context coding unit 101, a pseudo sparse coding unit 102, a ranking unit 104, a division unit 106, an update unit 107, and a determination unit 108. And have. Here, the context coding unit 101 and the pseudo-sparse coding unit 102 are realized by the same neural network as the context coding unit 101 and the pseudo-sparse coding unit 102 described in the above-mentioned search time and inverted index generation time. However, it is assumed that the model parameters have not been trained.

The division unit 106 takes the training data set as an input and randomly divides this training data set into a plurality of mini-batch. In this embodiment, it is assumed that the model parameters are repeatedly updated (learned) for each mini-batch.

The determination unit 108 determines whether or not the end condition for ending the repeated update of the model parameter is satisfied. The number of times one training data is repeatedly learned is called an epoch, and the number of repetitions is called the number of epochs.

The context coding unit 101 takes the training data (Q _i , _Di ⁺ , _Di ⁻ ) as an input, and uses the untrained model parameters to generate the training data (Q _i , _Di ⁺ , _Di ⁻ ). The feature quantity (U _i , V _i ⁺ , V _i- ⁾ is output. That is, the context coding unit 101 takes the search query Q _i , the positive example _Di ⁺ , and the negative example _Di ⁻ as inputs, and outputs the respective feature quantities U _i , V _i ⁺ , and V _i ⁻ .

The pseudo-sparse coding unit 102 takes the feature quantities (U _i , V _i ⁺ , V _i- ) of the training data (Q _i , _Di ⁺ , _Di ⁻ ) as inputs, ^and uses the model parameters that have not been trained. Obtain pseudo ^- sparse features ( _U'i , _V'i ⁺ , _V'i- ⁾ of training data (Qi, _Di ⁺ , _Di- ₎ . That is, the pseudo-sparse coding unit 102 takes the feature quantities U _i , V _i ⁺ , and V _i ⁻ as inputs, and obtains the respective pseudo sparse feature quantities _U'i , _V'i ⁺ , and _V'i ⁻ . _U'i = (u'i1, _u'i2 , ..., _{u'id'), V'i + = (v'+ i1} _, _v ^' ⁺ _i2 ^, _... , v' ⁺ _id') ) And _{V'i- = (v'-i1} ^, _v' ^- _i2 , ..., v' ^- _id' ⁾ .

Then, the pseudo-sparse coding unit 102 sets the pseudo-sparse feature quantities _U'i , _V'i ⁺ and _V'i ^- for learning pseudo-sparse feature quantities _U''i , _V''i ⁺ and _V''i , respectively. Convert ^to- .

Here, the subset of the set of pseudo-sparse features U'i ( _i = 1, ..., C) is Z _tr ¹ = { _{U'tr, 1} , _{U'tr, 2} , ..., U'. _{If tr, m''} }, U'i = _{U'tr, i} ₌ ( _u'i1 , _u'i2 , ..., _u'id' ), the pseudo-sparse coding unit 102 uses the following equation ( 2) Converts the pseudo-sparse feature quantity _U'i into the pseudo-sparse feature quantity _U''i for learning.

W'1r is a set { _u'1r , _u'2r , ..., u'm _' ' that collects the elements of the ^r -th dimension of _each pseudo-sparse feature quantity _{U'tr, i} contained in Z _tr ¹ . For _r }, it is a subset of the top t _que % in descending order of its value. This means that only the elements with large values are used for learning. Further, m'' is an arbitrary natural number satisfying m''≤c, and is, for example, the number of training data included in the mini-batch. Note that t _que is a preset threshold value (however, 0 <t _que ≦ 100).

Similarly, a subset of the set of pseudo-sparse features V'i ⁺ ( _i = 1, ..., C) is Z _tr ² = {V' ⁺ _{tr, 1} , V' ⁺ _{tr, 2} , ... , V' ⁺ _{tr, m''} }, V' ⁺ _i = V' ⁺ _{tr, i} = (v' ⁺ _i1 , v' ⁺ _i2 , ..., v' ⁺ _id' ) The conversion unit 102 converts the pseudo-sparse feature amount V' ⁺ _i for learning pseudo-sparse feature amount V'' ⁺ _i by the following equation (3).

W'2r is a set {v' ⁺ _1r , v' ₊ _2r , ..., V'that collects the elements of the ^r -th dimension of each pseudo-sparse feature quantity V' ⁺ _{tr, i} contained in Z _tr ² ^. For ⁺ _m''r }, it is a subset of the top _tpas % in descending order of the value. This means that only the elements with large values are used for learning. Although t _pas is a preset threshold value (however, 0 <t _pas ≦ 100), t _que and t _pas may have the same value or different values. Further, t _que = 100 and t _pas = 100 may be set, but in this case, it is equivalent to normal learning. Alternatively, either one of t _que and t _pas may be set to 100, and in this case, it is experimentally known that good results can be obtained by setting t _pas = 100.

Similarly, a subset of the set of pseudo-sparse features ^V'i- ( _i = 1, ..., C) is Z _tr ³ = {V' ^- _{tr, 1} , V' ^- _{tr, 2} , ... , V' ^- _{tr, m''} }, V' ^- _i = V' ^- _{tr, i} = (v' ^- _i1 , v' ^- _i2 , ..., v' ^- _id' ) The conversion unit 102 converts the pseudo-sparse feature amount V' ^{-i into the learning pseudo-sparse feature amount V''-} _i ^by _the following equation (4).

W'3r is a set {v' ^- _1r , v' ^- _2r , ..., V'that collects the elements of the ^r - _th dimension of each pseudo-sparse feature quantity V' ^- _{tr, i} contained in Z _tr ³ . _{-For m''r} ^} , it is a subset of the top _tpas % in descending order of the value.

It is preferable that each element of the above subset Z _tr ¹ is a pseudo-sparse feature obtained with the same model parameters, and can be realized by, for example, a set of pseudo-sparse features obtained in the same mini-batch. be. This also applies to Z _tr ² and Z _tr ³ . However, for example, when t _que <(1 / m'') × 100 or t _pas <(1 / m'') × 100, a subset in which the upper t _que % is collected or the upper t _pas % is collected. Since the subset can be an empty set, it is necessary to have t _que > (1 / m'') × 100 and t _pas ＞ (1 / m'') × 100 in order to avoid this. Further, when the norm of the output value is 1 (that is, when the output value of the firing function of the final layer of the neural network that realizes the pseudo-sparse coding unit 102 is projected onto a hypersphere having a radius of 1), t. It is necessary to satisfy _que > (2 / m'') × 100 and t _pas > (2 / m'') × 100. Therefore, for example, when it is difficult to obtain a subset of pseudo-sparse features having the same model parameters and having a magnitude satisfying the same, it can be obtained between the latest fixed learning steps only when the learning coefficient is not large. The pseudo-sparse features obtained may be added to the subset.

In addition, when the pseudo-sparse features obtained between the most recent fixed learning steps are added to the subset and the calculation graph cannot be saved in the memory, the pseudo-sparse features obtained in the past learning steps are in the upper t. It may only be used to calculate _que % (or _tpas %).

Further, in the present embodiment, a case where the training data set is divided into mini-batch units and model parameters are repeatedly learned for each mini-batch (that is, mini-batch learning) will be described, but it is not always necessary to use mini-batch learning, and online learning is performed. The model parameters may be learned by any other learning method such as batch learning or batch learning. However, as described above, since a subset of pseudo-sparse features is important, it is preferable to learn model parameters by mini-batch learning.

The ranking unit 104 inputs the pseudo-sparse feature quantities _U''i , _{V''i +, and V''i-for learning, and the relevance degree S i + of the regular example D i} ⁺ _to _the ^search _{query Q i} ^and _the ^search query. Outputs the degree of association S _i ^- of the negative example D _i ^- with respect to Q _i . Here, the relevance S _i ⁺ and S _i ⁻ are S _i ⁺ = s ( _{U''i, V''i +) and S i} _, ^respectively _, using the similarity function s described at the time of the above search. ^- = S ( _U''i ^, _V''i- ) is calculated.

The update unit 107 updates the model parameters by the supervised learning method with the relevance degrees S _i ⁺ and S _i ⁻ as inputs. Here, as the error function of supervised learning, the error function in ranking learning may be used.

More specifically, the hinge loss described in Non-Patent Document 1 (that is, the formula (3) described in Non-Patent Document 1) may be used. Hinge loss is expressed by hinge loss = max {0, ε- (S _i ⁺ -S _i- ⁾ } using an arbitrarily set parameter ε.

<Learning process>
The learning process for learning the model parameters from the input training data set will be described with reference to FIG. FIG. 6 is a flowchart showing an example of the learning process according to the first embodiment. It is assumed that the model parameters are initialized with appropriate values.

Step S301: First, the division unit 106 takes the training data set as an input and randomly divides this training data set into a plurality of mini-batch.

Step S302: Next, the learning device 30 executes model parameter update processing for each mini-batch. As a result, the model parameters are updated by the model parameter update process. Details of the model parameter update process will be described later. This model parameter update process is also called a learning step.

Step S303: Then, the determination unit 108 determines whether or not the predetermined end condition is satisfied. The learning device 30 ends the learning process when it is determined that the end condition is satisfied (YES in step S303), while the learning device 30 ends the learning process when it is determined that the end condition is not satisfied (NO in step S303). return. As a result, steps S301 to S302 are repeatedly executed until a predetermined end condition is satisfied.

The predetermined end conditions are, for example, that the number of epochs is equal to or greater than the predetermined first threshold value and that the error function has converged (for example, the value of the error function is less than the predetermined second threshold value). That, the amount of change in the error function before and after updating the model parameters is less than the predetermined third threshold value, etc.).

<Model parameter update process>
The model parameter update process in step S302 will be described with reference to FIG. 7. FIG. 7 is a flowchart showing an example of the model parameter update process according to the first embodiment. In the following, a case where model parameters are updated using a certain mini-batch will be described.

Step S401: First, the context coding unit 101 takes the training data (Q _i , _Di ⁺ , _Di ⁻ ) in the mini-batch as an input, and uses the untrained model parameters to use the training data (Q _i ,). Outputs the features (U _i , V _i ⁺ , V _i- ⁾ of D _i ⁺ , D _i- ⁾ .

Step S402: Next, the pseudo-sparse coding unit 102 has been trained by inputting the features (U _i , Vi ⁺ , Vi ⁻ ₎ of the training data (Q _i , _Di ⁺ , _Di ⁻ ₎ . Pseudo ^- sparse features ( _U'i , _V'i ⁺ , _V'i- ⁾ of the training data (Qi, _Di ⁺ , _Di- ₎ are obtained using model parameters that are not.

Step S403: Next, the pseudo-sparse coding unit 102 uses the pseudo-sparse features ( _U'i , _V'i ⁺ , _V'i- ) for learning pseudo ^- sparse features ( _U''i , _V''i . It is converted to ⁺ , _V''i- ⁾ and the learning pseudosparse features ( _U''i , _V''i ⁺ , _V''i- ⁾ are output.

Step S404: Next, the ranking unit 104 inputs the learning pseudo-sparse features ( _U''i , _V''i ⁺ , _V''i ⁻ ), and inputs the normal example D _i ⁺ to the search query Q _i . The relevance S _i ⁺ and the relevance S _i ^- of the negative example _Di ^- for the search query Q _i are output.

The above steps S401 to S404 are repeatedly executed for all the training data ( _Qi , _Di ⁺ , _Di ⁻ ) included in the mini-batch.

Step S405: Subsequently, the update unit 107 takes the respective relevance S _i ⁺ and S _i- obtained in the above step S404 as inputs, ^and inputs the value of the error function (for example, hinge loss) and the error function regarding the model parameter. Calculate with the gradient. The gradient of the error function related to the model parameters may be calculated by, for example, the error back propagation method.

Step S406: Then, the update unit 107 updates the model parameters by an arbitrary optimization method using the value of the error function calculated in step S405 above and its gradient.

As described above, the learning device 30 according to the present embodiment can learn the model parameters of the neural network that realizes the context coding unit 101 and the pseudo-sparse coding unit 102 by using the input training data set. At this time, the learning device 30 according to the present embodiment sets the pseudo-sparse feature amount to 0 by setting 0 as an element whose value is not included in the upper t _que or t _pas % among the elements of the pseudo-sparse feature amount. After converting to a pseudo-sparse feature for learning, model parameters are learned using this pseudo-sparse feature for learning (in other words, only the element with a large value among the elements of the pseudo-sparse feature is learned. Used for.). This makes it possible to stably acquire features that can be regarded as sparse in a pseudo manner when searching a document. Hereinafter, this learning method will also be referred to as "top t learning".

<Evaluation experiment>
Next, an evaluation experiment for evaluating the method of the present embodiment (hereinafter referred to as “proposal method”) will be described.

≪Data set≫
Experiments were performed using the MS MARCO Passage and Document Retrieval task described in reference 3 above. In this task, ReRanking and Full Ranking exist in Passage unit and Document unit respectively, and in this experiment, evaluation was performed using Passage Ranking. Details of the Passage Ranking data are shown in Table 1 below.

In the ReRanking task, the top 1000 passages narrowed down using BM25 are given in advance, but in the Full Ranking, it is required to search from about 8.8 million documents to be searched. Mean Reciprocal Rank (MRR) is used as a ranking index. For details of MRR, refer to Reference 4 “Craswell, N .: Mean Reciprocal Rank, in Encyclopedia of Database Systems, p. 1703 (2009)” and the like.

≪Settings during learning≫
Using the Train Triples Small set, learning was performed using four GPUs (Graphics Processing Units) with a batch size (the number of training data included in the mini-batch) of 100 and an epoch number of 1. Adam (Adaptive Moment Estimation) was used as the optimization method, and β ₁ = 0.9, β ₂ = 0.999, and ε = ^10-8 . For details of Adam, refer to Reference 5 “Kingma, D.P. and Ba, J .: Adam: A Method for Stochastic Optimization, in ICLR (2015)” and the like.

The model parameters were initialized according to the normal distribution N (0, 0.02) except for the bias initialized at 0. The learning rate was set to 5 × ^10-5 , and was linearly attenuated so as to be 0 in the final step. The gradient was clipped with a maximum norm of 1. BERT used the base model (768 dimensions), and the number of dimensions of the intermediate layer of the two output layers was 1000, and the number of dimensions D of the final layer (that is, the pseudo-sparse feature amount) was 30,000. The margin ε of hinge loss was 1.0. A BERT Wordpiece tokenizer (vocabulary number 30K) was used for tokenizing.

As a hyperparameter for top t learning, t _que = 0.1%. Further, when the batch size is 100, when the upper T% is calculated in the mini-batch with a certain threshold as T, the threshold T is too strongly dependent on the training data in the mini-batch, so the current learning step is used to stabilize the calculation. Pseudo-sparse feature quantities of the training data for the past 20 steps including this were saved, and by using these together, the calculation of the top T% was stabilized. Here, the pseudo-sparse features for 19 steps excluding the current learning step are used only for determining the threshold value T when calculating the upper T%, and the calculation graph is not retained and the parameters are not updated. ..

≪Evaluation of ranking accuracy and search speed≫
Hereinafter, in the proposed method, it is assumed that a two-step search is performed at the time of document retrieval. That is, in the proposed method, after the upper k related document set {D ₁ , ..., D _k } is output in the first stage, the final relation of the upper k'is set as t = 100 in the second stage. A set of documents shall be output.

・ Does the proposed method exceed the accuracy of the conventional method in Passage Full Ranking?
The results of the search accuracy of the proposed method and the conventional method are shown in Table 2 below. In the proposed method, k = 1000 and k'= 10.

As shown in Table 2 above, it was confirmed that the proposed method greatly exceeded the accuracy of BM25 in MRR @ 10, and that the search accuracy increased when t at the time of search was increased. Furthermore, since the result of t _que = 0.1% during training was better than the result of t _que = t _pas = 0.5%, the pseudo-sparse feature of the search query and the pseudo-sparse feature of the search target document. It was found that the effect can be obtained without performing the operation of calculating the upper T% in both cases.

・ Is the proposed method able to search a large number of documents at high speed and with high accuracy?
Table 3 below shows the results of the average search speed of the proposed method when t _que = 0.1%. The 1st top-T is the number of output of related documents in the first stage, the 1st RT is the response time in the first stage, and the 2st RT is the response time in the second stage.

As shown in Table 3 above, at t = 0.1%, the total calculation time for searching about 8.8 million passages is extremely fast, less than 0.4 seconds, and the proposed method is fast from a large number of documents. It can be said that the search is possible with high accuracy. Further, at t = 0.5%, the search takes about 8.4 seconds, but the accuracy is good accordingly. From this result, it is possible to maximize the accuracy by setting t from the time allowed in the search. In addition, since it is not necessary to calculate the pseudo-sparse feature amount in the second stage search (that is, the pseudo-sparse feature amount calculated in the first stage can be diverted), it can be calculated at a very high speed in the second stage. Was confirmed.

・ Does top t learning contribute to speeding up?
8A and 8B show the comparison results of the pseudo-sparse features of the search target document when top t learning is used and when top t learning is not used. FIG. 8A is a result of calculating the pseudo-sparse features of 80,000 documents out of all the search target documents using the trained model parameters. However, dimensions without non-zero elements are excluded. In the left figure of FIG. 8A (without top learning), the dimension without non-zero elements was 24769 out of 30000 dimensions. Further, it can be seen that many of the 80,000 pseudo-sparse features have non-zero elements, and a very biased pseudo-sparse feature set is obtained. On the other hand, in the right figure of FIG. 8A (with top t learning), the dimension without non-zero elements is 4438 dimensions, and it can be seen that the distribution of the number of non-zero elements is less biased than that without top t learning. .. From this result, it can be seen that top learning has the effect of suppressing convergence to model parameters that map to a biased subspace.

FIG. 8B calculates the pseudo-sparse features of all the searched target documents using the trained model parameters, and stores them in each index (that is, each dimension r) when the inverted index is created at t = 0.1%. This is the frequency distribution of the amount of documents that have been created. However, as in FIG. 8A, the dimension having no non-zero element is excluded. As shown in the left figure of FIG. 8B (without top t learning), when a pseudo-sparse feature is calculated and an inverted index is created without using top t learning, most documents are stored in any index of the inverted index. You can see that it has not been done. On the other hand, as shown in the right figure of FIG. 8B (with top t learning), when a pseudo-sparse feature is calculated using top t learning and an inverted index is created, the document is stored in 90% or more of the indexes. Moreover, since the number of documents per index is less than 8900, it can be seen that high-speed search is possible. From the above, it can be said that top t learning contributes to speeding up.

≪Summary of evaluation experiment≫
In this evaluation experiment, using a neural network that can consider the context, a search from a large document of 8.84 million documents was performed in less than 0.4 seconds per query in the passage ranking task of MS MARCO 2.1, and BM25 was used. Achieved superior performance.

While the conventional keyword search model such as BM25 made it possible to search this book on a large scale by using an inverted index, there is a problem that the context cannot be taken into consideration. In addition, the conventional technique SNRM (standalone neural ranking model) tried to realize a high-speed and high-precision search model by creating a sparse vector with a neural network and creating an inverted index, but the sparse constraint by the L1 norm is a vector. There is a problem that the set is biased to a specific subspace. To solve these problems, the proposed method realized a large-scale neural search that can consider the context by introducing top learning.

[Second embodiment]
Next, the second embodiment will be described. In this embodiment, a case where the sparsity can be adjusted by performing normalization and mean shift when the feature quantities of the search query Q and the search target document Di are _sparsified will be described.

In the second embodiment, the differences from the first embodiment will be mainly described, and the description of the components substantially the same as those of the first embodiment will be omitted.

-At the time of search First, a case where a document search is performed by the search device 10 will be described.

<Overall configuration of search device 10>
The overall configuration of the search device 10 according to the present embodiment will be described with reference to FIG. FIG. 9 is a diagram showing an example of the overall configuration of the search device 10 according to the second embodiment.

As shown in FIG. 9, the search device 10 according to the present embodiment has a context coding unit 101, a normalized sparse coding unit 109, an inverted index utilization unit 103, and a ranking unit 104. That is, the search device 10 according to the present embodiment has a normalized sparse coding unit 109 instead of the pseudo sparse coding unit 102. For example, the context coding unit 101 and the normalized sparse coding unit 109 may be combined into the coding unit 100A.

The normalized sparse coding unit 109 takes the feature amount U of the search query Q as an input, uses the trained model parameters, and uses the pseudo-sparse feature amount U'of the search query Q (this pseudo-sparse feature amount is used in the present embodiment. It is also called "normalized sparse feature quantity").

Here, the neural network that realizes the normalized sparse coding unit 109 is normalized and averaged before using the firing function of the final layer of the neural network that realizes the pseudo-sparse coding unit 102 described in the first embodiment. It is a model that shifts.

Specifically, when the feature quantity _Ui of the search query Qi included in the training data at the time of learning is input to the normalized _sparse coding unit 109, it is output in the fully connected layer before using the firing function of the final layer. Let the d'dimensional vector be x _i = (x _i1 , ..., x _ij , ..., x _id' ). Further, let X = {x ₁ , ..., x _i , ... x _s' ,} as these appropriate subsets. Note that s'is the number of elements of the subset X. Such a subset may be determined arbitrarily, but it is preferable that the subsets are uniformly sampled from each _xi (i = 1, 2, ...). For example, it is conceivable to use each _xi obtained from the training data included in a certain mini-batch at the time of learning as a subset X.

At this time, when the feature quantity U of the search query Q is input to the normalized sparse coding unit 109, the d'vector output in the fully connected layer before using the firing function of the final layer is z = (z ₁ , ...・・, Z _j , ..., z _d' ), the normalized sparse feature of the search query Q U'= (u' ₁ , ..., _u'j , ..., _{u'd '} ) Is calculated by the following equation (5).

However, μ and σ are hyperparameters set in advance. Since the sum part (that is, (1 / s') (x _1j + ... + x _s'j )) in the above equation (5) can be calculated in advance at the time of learning, the calculation result. May be used.

In this way, the output of the fully connected layer before using the firing function in the final layer of the neural network that realizes the normalized sparse coding unit 109 is normalized, and the average is shifted using μ. It is possible to adjust the sparseness of the sparse feature (that is, the proportion of non-zero elements).

For example, when μ = -3 and σ = 1, the value of each element of the vector output in the fully connected layer before using the ignition function of the final layer follows a normal distribution of μ = -3 and σ = 1. If so, the number of elements for which the output of the ReLu function takes a positive value is expected to be about 0.3% as a whole.

Although the ReLu function is used as an example in the above equation (5), as described in the first embodiment, it is possible to use a general ignition function that satisfies all of the conditions 1-1 to 1-3. It is possible.

<Search process>
The search process for obtaining the ordered set {D _i | i ∈ K} of the related documents of the input search query Q and the degree of relevance {S _i | i ∈ K} will be described with reference to FIG. FIG. 10 is a flowchart showing an example of the search process according to the second embodiment.

Step S501: First, the context coding unit 101 inputs the search query Q and outputs the feature amount U of the search query Q using the trained model parameters.

Step S502: Next, the normalized sparse coding unit 109 inputs the feature amount U obtained in step S501 above, and uses the trained model parameters to generate the normalized sparse feature amount U'of the search query Q. Output.

Step S503: Next, the inverted index utilization unit 103 takes the normalized sparse feature amount U'obtained in the above step S502 as an input, and uses the inverted index generated in advance to use the normalized sparse feature of the search target document. Obtain a subset of quantities {V'i | _i ∈ K}. The inverted index according to the present embodiment is obtained by replacing "pseudo-sparse feature amount" with "normalized sparse feature amount" in the explanation of the inverted index in the first embodiment, and its configuration and the like are the first. It is the same as the embodiment.

Step S504: Then, the ranking unit 104 inputs the normalized sparse feature quantity U'obtained in the above step S502 and the set {V'i | _i ∈ K} obtained in the above step S503. After converting {V'i | _i ∈ K} to {V''i | _i ∈ K}, this {V''i | _i ∈ K} and the normalized sparse feature U'are used. , The ordered set of related documents of the search query Q {D _i | i ∈ K} and its relevance degree {S _i | i ∈ K} are output.

As described above, the search device 10 according to the present embodiment obtains an ordered set {D _i | i ∈ K} of documents related to the input search query Q and a degree of relevance {S _i | i ∈ K}. Can be done.

-At the time of generating an inverted index Next, a case where an inverted index is generated by the inverted index generator 20 will be described.

<Overall configuration of inverted index generator 20>
The overall configuration of the inverted index generation device 20 according to the present embodiment will be described with reference to FIG. FIG. 11 is a diagram showing an example of the overall configuration of the inverted index generation device 20 according to the second embodiment.

As shown in FIG. 11, the inverted index generation device 20 according to the present embodiment has a context coding unit 101, a normalized sparse coding unit 109, and an inverted index generation unit 105. That is, the inverted index generation device 20 according to the present embodiment has a normalized sparse coding unit 109 instead of the pseudo sparse coding unit 102.

The normalized sparse coding unit 109 takes the feature amount V _i of the search target document _Di as an input, and outputs the normalized sparse feature amount _V'i of the search target document _Di using the trained model parameters.

Here, the normalized sparse coding unit 109 uses the firing function after performing normalization and mean shift on the d'vector output in the fully connected layer of the final layer, as explained at the time of retrieval. , The normalized sparse feature quantity _V'i is obtained.

Specifically, when a certain document Di is input to the normalized sparse coding unit 109, the d'dimensional vector output in the fully connected layer before using the firing function of the final layer is y _i = (y _i ₁ ). , ..., y _ij , ..., y _id' ). Further, let these appropriate subsets be Y = {y ₁ , ..., y _i , ... y _s' ,}. In addition, s'is arbitrarily set in advance.

At this time, when the feature quantity _Ui of the search target document Di is input to the normalized sparse coding unit 109, the d'vector output in the fully connected layer before using the firing function of the final layer is _wi ₌ (. If w _i1 , ..., wij, ..., _wid _' ), then the normalized sparse feature amount of the search target document _{Di i} _' = ( _v'i1 , ..., _v'ij , ..., _V'id' ) is calculated by the following equation (6).

Here, μ and σ are hyperparameters that are set in advance, but they may be different values at the time of search (of course, they may be the same values at the time of search and at the time of inverted index generation).

The above-mentioned document _Di may be a search target document _Di or another document (for example, a document used as training data). Further, the sum part (that is, (1 / s') (y _1j + ... + y _s'j )) in the above equation (6) may be calculated in advance and the calculation result may be used. .. In this case, _wi does not have to be included in Y. However, if _∃W ; wi ∈ W, then Y⊂W is preferable.

As described above, μ and σ make it possible to adjust the sparseness of the normalized sparse features, so that the amount of documents stored in each index of the inverted index can be adjusted accordingly, and the search can be performed. The speed (that is, the amount of search) can be adjusted.

Note that the inverted index is similarly generated by replacing "pseudo-sparse feature amount" with "normalized sparse feature amount" in the description of the inverted index in the first embodiment.

<Inverted index generation process>
The inverted index generation process for generating an inverted index from the input set of search target documents {D ₁ , ..., D _m } will be described with reference to FIG. FIG. 12 is a flowchart showing an example of the inverted index generation process according to the second embodiment.

Step S601: First, the context coding unit 101 inputs the search target document _Di and _outputs the feature amount _Vi of the search target document Di using the trained model parameters.

Step S602: Next, the normalized sparse coding unit 109 takes the feature amount V _i of the search target document _{Di as an input and uses the trained model parameters to search the search target document D i} _the normalized sparse feature amount _V'i . Is output.

The above steps S601 to S602 are repeatedly executed for all the search target documents Di ( _i = 1, ..., M).

Step S603: Then, the inverted index generation unit 105 sets the normalized sparse features of each search target document Di ( _i = 1, ..., M) { _V'1 , ..., _V'm }. Is used as an input to generate and output an inverted index.

As described above, the inverted index generation device 20 according to the present embodiment can generate an inverted index from the set of input search target documents {D ₁ , ..., D _m }. As described above, in the present embodiment, the amount of documents stored in each index of the inverted index can be adjusted by adjusting μ and σ, so that the search time required for document retrieval is satisfied. It is possible to adjust the amount of documents. The values of μ and σ can be set independently at the time of searching, at the time of generating an inverted index, and at the time of learning described later.

-During learning Next, a case where the neural network (neural network that realizes the context coding unit 101 and the normalized sparse coding unit 109) is learned by the learning device 30 will be described.

<Overall configuration of learning device 30>
The overall configuration of the learning device 30 according to the present embodiment will be described with reference to FIG. FIG. 13 is a diagram showing an example of the overall configuration of the learning device 30 according to the second embodiment.

As shown in FIG. 13, the learning device 30 according to the present embodiment includes a context coding unit 101, a normalized sparse coding unit 109, a ranking unit 104, a division unit 106, an update unit 107, and a determination unit. It has 108 and. Here, the context coding unit 101 and the normalized sparse coding unit 109 are realized by the same neural network as the context coding unit 101 and the normalized sparse coding unit 109 described in the search time and the inverted index generation time. However, it is assumed that the model parameters have not been trained.

The normalized sparse coding unit 109 takes the feature quantities (U _i , V _i ⁺ , V _i ⁻ ) of the training data (Q _i , _Di ⁺ , _Di ⁻ ) as inputs, and uses the model parameters that have not been trained. , The normalized sparse features ( _U'i ^, _V'i ⁺ , _V'i- ⁾ of the training data (Qi, _Di ⁺ , _Di- ₎ are obtained. That is, the normalized sparse coding unit 109 takes the feature quantities U _i , _Vi ⁺ , and _Vi ⁻ as inputs, and as described above at the time of searching and at the time of generating the inverted index, each normalized sparse feature quantity U ' _i ^, _V'i ⁺ and _V'i- are obtained. However, the values of μ and σ may be changed according to the learning stage (for example, learning step).

<Learning process>
Next, the learning process will be described. In this embodiment, as shown by the above equations (5) and (6), since the mean and variance are calculated from an arbitrary subset for each dimension, it is basically preferable to use mini-batch learning. However, it is not always necessary to use mini-batch learning if statistics such as mean and variance of each dimension can be approximated or estimated. When the mini-batch learning is used, the flow of the process is the same as that of FIG. 6, and therefore, the parameter update process of step S302 of FIG. 6 will be described below.

<Model parameter update process>
The parameter update process according to the present embodiment will be described with reference to FIG. FIG. 14 is a flowchart showing an example of the model parameter update process according to the second embodiment.

Step S701: First, the context coding unit 101 takes the training data (Qi, _Di ⁺ , _Di- ) in the _mini ^- batch as an input, and uses the untrained model parameters to use the training data ( _Qi ,). Outputs the features (U _i , V _i ⁺ , V _i- ⁾ of D _i ⁺ , D _i- ⁾ .

Step S702: Next, the normalized sparse coding unit 109 learns by inputting the features (U _i , V _i ⁺ , V _i ⁻ ) of the training data (Q _i , _Di ⁺ , _Di ⁻ ). The normalized sparse features ( _U'i ^, _V'i ⁺ , _V'i- ⁾ of the training data (Qi, _Di ⁺ , _Di- ₎ are obtained using the model parameters that have not been completed.

Step S703: Next, the normalized sparse coding unit 109 uses the normalized sparse features ( _U'i , _V'i ⁺ , _V'i- ) for learning pseudo ^- sparse features ( _U''i , V'. ' _i ⁺ ^, _V''i- ) is converted, ^and the learning pseudosparse features ( _U''i , _V''i ⁺ , _V''i- ) are output.

Subsequent steps S704 to S706 are the same as steps S404 to S406 in FIG. 6, and their description thereof will be omitted.

As described above, the learning device 30 according to the present embodiment can learn the model parameters of the neural network that realizes the context coding unit 101 and the normalized sparse coding unit 109 by using the input training data set. ..

[Third embodiment]
Next, the third embodiment will be described. In the present embodiment, when updating the model parameters by the gradient estimation type error back propagation method, the case where the elements having the gradient of 0 is reduced and the learning is performed stably and efficiently will be described.

In the third embodiment, the differences from the first embodiment will be mainly described, and the description of the components substantially the same as those of the first embodiment will be omitted.

<Overall configuration of search device 10>
The overall configuration of the search device 10 according to the present embodiment will be described with reference to FIG. FIG. 15 is a diagram showing an example of the overall configuration of the search device 10 according to the third embodiment.

As shown in FIG. 15, the search device 10 according to the present embodiment has a context coding unit 101, a gradient estimation type pseudo-sparse coding unit 110, an inverted index utilization unit 103, and a ranking unit 104. That is, the search device 10 according to the present embodiment has a gradient estimation type pseudo-sparse coding unit 110 instead of the pseudo-sparse coding unit 102. For example, the context coding unit 101 and the gradient estimation type pseudo-sparse coding unit 110 may be collectively referred to as the coding unit 100B.

Similar to the pseudo-sparse coding unit 102 described in the first embodiment, the gradient estimation type pseudo-sparse coding unit 110 uses the feature quantity U of the search query Q as an input and uses the trained model parameters to make a search query. Outputs the pseudo-sparse feature amount U'of Q.

The name "gradient estimation type pseudo-sparse coding unit" is a threshold value ^t _{2, ru} described later in the forward propagation process of the neural network by the gradient estimation type pseudo sparse coding unit 110 when performing gradient estimation during learning. Strictly speaking, the processing content is different from that of the pseudo-sparse coding unit 102 in order to calculate the above. Since the gradient estimation type pseudo-sparse coding unit 110 at the time of search and at the time of generating an inverted index performs the same processing as the pseudo-sparse coding unit 102 described in the first embodiment, the gradient at the time of searching and at the time of generating an inverted index The estimation type pseudo-sparse coding unit 110 may be referred to as a “pseudo-sparse coding unit 102”. Therefore, the search process and the inverted index generation process according to the present embodiment are the same as those of the first embodiment.

<Overall configuration of inverted index generator 20>
The overall configuration of the inverted index generation device 20 according to the present embodiment will be described with reference to FIG. FIG. 16 is a diagram showing an example of the overall configuration of the inverted index generation device 20 according to the third embodiment.

As shown in FIG. 16, the inverted index generation device 20 according to the present embodiment has a context coding unit 101, a gradient estimation type pseudo-sparse coding unit 110, and an inverted index generation unit 105. That is, the inverted index generation device 20 according to the present embodiment has a gradient estimation type pseudo-sparse coding unit 110 instead of the pseudo-sparse coding unit 102. However, as described above, since the gradient estimation type pseudo-sparse coding unit 110 performs the same processing as the pseudo-sparse coding unit 102 at the time of generating the inverted index, the gradient estimation type pseudo-sparse coding unit 110 is referred to as “pseudo-sparse code”. It may be "sparse part 102". Further, as described above, the inverted index generation process according to the present embodiment is the same as that of the first embodiment.

-During learning Next, a case where the neural network (neural network that realizes the context coding unit 101 and the gradient estimation type pseudo-sparse coding unit 110) is learned by the learning device 30 will be described.

<Overall configuration of learning device 30>
The overall configuration of the learning device 30 according to the present embodiment will be described with reference to FIG. FIG. 17 is a diagram showing an example of the overall configuration of the learning device 30 according to the third embodiment.

As shown in FIG. 17, the learning device 30 according to the present embodiment includes a context coding unit 101, a gradient estimation type pseudo-sparse coding unit 110, a ranking unit 104, a division unit 106, and an update unit 107A. It has a determination unit 108. Here, the context coding unit 101 and the gradient estimation type pseudo-sparse coding unit 110 are the same as the context coding unit 101 and the gradient estimation type pseudo-sparse coding unit 110 described in the above-mentioned search time and inverted index generation time. It is realized by a neural network, but its model parameters are not trained.

The gradient estimation type pseudo-sparse coding unit 110 obtains the features (U _i , V _i ⁺ , V _i ⁻ ) of the training data (Q _i , _Di ⁺ , _Di ⁻ ) as in the first embodiment. Pseudo ^- sparse features for training ( _U''i ^, _V''i ⁺ , _V''i- of training data (Qi, _Di ⁺ , _Di- ₎ using untrained model parameters as input. ) Is output. At this time, the gradient estimation type pseudo-sparse coding unit 110 calculates the threshold values ^t _2, ru, etc., which will be described later.

Here, in the present embodiment, the transformation in the forward propagation of the neural network that realizes the gradient estimation type pseudo _- sparse coding unit 110 is represented by the function g1. As a result, for example, the element _ir (r = 1, ..., D') of the feature quantity U _i of the search query Q _i is input, and the learning pseudo-sparse feature quantity _U''i of the search query Q _i is input. The transformation g ₁ to obtain the element u'' _ir = g ₁ (u _ir ) of is expressed by the following equation (7).

Here, t _1, ru is _a threshold value, and ^t _1, ru = ^minW'1 ^r (that is, the smallest element among the elements included in _W'1 ^r ).

It should be noted that, with V _i ⁺ = (v ⁺ _i1 , ..., v ⁺ _ir , ..., v ⁺ _id' ), " ^u _ir " is "v ⁺ _ir " and "t _1, ru" is "t 1, ru". When read as "t _{1, r} ^{v +} ", the above equation (7) represents a transformation (function) for obtaining the element v'' ⁺ _ir of the learning pseudo-sparse feature quantity _V''i ⁺ . However, t _{1, r} ^{v +} is t _{1, r} ^{v +} = _minW'2 ^r .

Similarly, as _{Vi- = (v-i1} ^, _... , v ^- _ir , ... ^{, v-} ^id _' ), " ^u _ir " is changed to "v ^- _ir ", "t _1, ru". When read as "t _{1, r} ^v- ", the above equation (7) represents a transformation (function) for obtaining the element v'' ^- _ir of the learning pseudo ^- sparse feature quantity _V''i- . However, t _{1, r} ^v- is t _{1, r} ^v- = _minW'3 ^r .

Similar to the first embodiment, the update unit 107A updates the model parameters by the supervised learning method with the relevance degrees S _i ⁺ and S _i ⁻ as inputs. At this time, when the update unit 107A calculates (estimates) the gradient of the error function (for example, hinte loss) by the error back propagation method, instead of the partial differential of the function g1 shown in the above equation ( ₇ ), the update unit 107A replaces the partial differential of the function g1. The gradient of the error function is obtained by back _- propagating the error using the partial derivative of the function g2 described later.

_The partial differential of the function g1 shown in the above equation (7) is the following equation (8).

In the present embodiment, when the error is back-propagated by the error back-propagation method, the partial differential of the function g ₂ shown in the following equation (9) is replaced with the partial derivative of the function g ₁ shown in the above equation (8). Is used.

Here, ^t _2, ru is a threshold value, and is a value to be set so as to satisfy ^t _1, ^ru > t _{2, ru} .

FIG. 18 shows _a graph showing the functions g1 and g2 and their partial derivatives when _b = ^t _1, ru and a = ^t _{2, ru} . The upper left of FIG. 18 is a graph showing the function g ₁ , the lower left is the function g ₂ , the upper right is the partial derivative of the function g ₁ , and the lower right is the partial derivative of the function g ₂ . That is, the partial derivative of the function g ₂ is represented by a linear function that passes through g ₂ (a) = 0 and g ₂ (b) = 1 when a <u <b.

As shown in the upper right and lower right of FIG. 18, in the partial differential of the function g ₁ , the element of b or less is 0, whereas in the partial differential of the function g ₂ , the element of a or more and b or less is 0. It does not become. Therefore, by using the partial derivative of the function g ₂ at the time of back propagation, it is possible to reduce the elements in which the gradient of the error function becomes 0 (in other words, increase the elements in which the error can be back-propagated. It is possible to learn stably and efficiently.

Regarding the transformation (function) that obtains the element v'' ⁺ _ir of the pseudo-sparse feature quantity for learning _V''i ⁺ , instead of the partial differential, "u _ir " is changed to "v" in the above equation (8). Partial differentiation is used in which " ⁺ _ir " and "t _1, ru" are read as " ^t _{1, r} ^{v +} " and "t _2, ru" is read as " ^t _{2, r} ^{v +} ". However, t _{2, r} ^{v +} is a threshold value, and is a value that should be set so as to satisfy t _{1, r} ^{v +} > t _{2, r} ^{v +} .

Similarly, for the transformation (function) that obtains the element v'' ^- _ir of the pseudo-sparse feature quantity _V''i ^- for learning, instead of the partial differential, "u ir" is replaced with "u _ir " in the above equation (8). Partial differentiation is used in which "v ^- _ir " and "t _1, ru" are read as " ^t _{1, r} ^v- " and "t ₂ , ru" is read as " ^t _{2, r} ^v- ". However, t _{2, r} ^v- is a threshold value and is a value that should be set so as to satisfy t _{1, r} ^v- > t _{2, r} ^v- .

Here, how to determine the threshold values t _2, ru, ^t _{2, r} ^{v +} and t _{2, r} ^v- will be described. For example, the threshold values t _{2 and} _{ru are a set {u'1r, u'2r, ..., U'm''r} _} _in ^which the elements of the r-th dimension of the pseudo-sparse features contained in Z _tr ¹ are collected. It is conceivable to set the minimum value of the subset obtained by collecting the top 2 × t _que % in descending order of the value. Such threshold values ^t _{2 and} ru are calculated at the time of forward propagation of the neural network that realizes the gradient estimation type pseudo-sparse coding unit 110.

Similarly, the threshold values t _{2, r} ^{v +} are a set {v' ⁺ _1r , v' ⁺ _2r , ..., v' ⁺ _m , which is a collection of elements of the r-th dimension of the pseudo-sparse features contained in Z _tr ² . _For''r }, it is conceivable to set the minimum value of the subset obtained by collecting the top 2 × t _pas % in descending order of the value. Such threshold values t _{2, r} ^{v +} are calculated at the time of forward propagation of the neural network that realizes the gradient estimation type pseudo-sparse coding unit 110.

Similarly, the threshold values t _{2, r} ^v- are a set {v' ^- _1r , v' ^- _2r , ..., V'-, which is a collection of elements of the r-th dimension of the pseudo ^- sparse features contained in Z _tr ³ . For _m''r }, it is conceivable to set the minimum value of the subset obtained by collecting the top 2 × t _pas % in descending order of the value. Such threshold values t _{2, r} ^v- are calculated at the time of forward propagation of the neural network that realizes the gradient estimation type pseudo-sparse coding unit 110.

As a result, t _{1, r} ^u > t _{2, r} ^for the thresholds t _1, ru, t _{1, r} ^{v +} and t _{1, r} ^v- , which change depending on the dimension r and how to take a mini-batch at the time of learning. The thresholds t _{2, r} ^u , t _{2, r} ^{v +} and t _{2, r} ^v- satisfying ^u , t _{1, r} ^{v +} > t _{2, r} ^{v +} and t _{1, r} ^v- > t _{2, r} ^v- , respectively. It can be calculated and determined automatically (this determination method is referred to as "determination method 1"). The above 2 × t _que % and 2 × t _pas % are examples, and an arbitrary value L (where L> 1) can be used instead of 2.

However, the above-mentioned method of determining the threshold values t 1, ru, ^t _{1, r} _v ⁺ and t _{1, r} ^v- is an example, and the threshold value may be determined by other methods. For example, as b = t _1, ru (or ^t _{1, r} ^{v +} or t _{1, r} ^v- ), a = t _2, ru (or ^t _{2, r} ^{v +} or t _{2, r} ^v- ). , The maximum value c of the element of the dimension r in the mini-batch may be used to set a = 2bc (this determination method is referred to as "determination method 2"). For c, for example, when the number of training data included in the mini-batch is m'', when b = t _1, ru, a = t _2, ^ru , c = max { _u'1r , ^u '. _2r , ..., _u'm''r }. Similarly, when b = t _{1, r} ^{v +} and a = t _{2, r} ^{v +} , c = max {v' ⁺ _1r , v' ⁺ _2r , ..., V' ⁺ _m''r }. When b = t _{1, r} ^v- and a = t _{2, r} ^v- , c = max {v' ^- _1r , v' ^- _2r , ..., v' ^- _m''r }.

<Learning process>
Next, the learning process will be described. In this embodiment, the case of using mini-batch learning will be described as in the first embodiment. However, it is also possible to use learning methods other than mini-batch learning. Since the flow of mini-batch learning is the same as that of FIG. 6, the parameter update process of step S302 of FIG. 6 will be described below.

<Model parameter update process>
The parameter update process according to the present embodiment will be described with reference to FIG. FIG. 19 is a flowchart showing an example of the model parameter update process according to the third embodiment. Since steps S801 to S804 are the same as steps S401 to S404 in FIG. 7, the description thereof will be omitted.

Step S805: The gradient estimation type pseudo-sparse coding unit 110 calculates the threshold values t _2, ru, ^t _{2, r} ^{v +} and t _{2, r} ^v- .

Step S806: Subsequently, the update unit 107A takes the relevance S _i ⁺ and S _i ⁻ obtained in step S804 as inputs, and sets the value of the error function (for example, hinge loss) and the gradient of the error function with respect to the model parameter. To calculate. At this time, when the update unit 107A calculates (estimates) the gradient of the error function by the error back propagation method, the update unit 107A back-propagates the error by using the partial differential of the function g ₂ instead of the partial differential of the function g ₁ . Obtain the gradient of the error function.

Step S807: Then, the update unit 107A updates the model parameters by an arbitrary optimization method using the value of the error function calculated in step S806 and the gradient thereof.

As described above, the learning device 30 according to the present embodiment learns the model parameters of the neural network that realizes the context coding unit 101 and the gradient estimation type pseudo-sparse coding unit 110 by using the input training data set. Can be done. At this time, in the present embodiment, by using the gradient estimation type backpropagation method, the threshold value does not depend on how to take the subset during learning (that is, Z _tr ¹ , Z _tr ² and Z _tr ³ ). Stability can be taken into consideration, and learning can be further stabilized and promoted. That is, for example, in the first embodiment, the error can be back-propagated to the element whose gradient becomes 0 or does not become 0 depending on how the subset is taken.

In the present embodiment, the threshold values t _2, ru, ^t _{2, r} ^{v +} and t _{2, r} ^v- are calculated at the time of forward propagation of the neural network that realizes the gradient estimation type pseudo-sparse coding unit 110, but for example. , May be calculated by the update unit 107A in step S806 above. In this case, as shown in FIG. 21, the learning device 30 according to the present embodiment may have the coding unit 100 instead of the coding unit 100B (that is, the gradient estimation type pseudo-sparse coding unit 110). Instead, it may have a pseudo-sparse coding unit 102).

<Evaluation experiment>
Next, an evaluation experiment for evaluating the method of the present embodiment (hereinafter referred to as “proposal method”) will be described. The points not specifically mentioned below are the same settings (data set, learning settings, etc.) as in the evaluation experiment described in the first embodiment.

Assuming that the proposed method performs a two-step search as in the first embodiment, the first step does not perform gradient estimation, and the second step is "no gradient estimation", "gradient estimation pattern 1", and " There are three types of gradient estimation pattern 2 ”. The gradient estimation pattern 1 is a case where the threshold values t 2, ru, ^t _{2, r} _v ⁺ and t _{2, r} ^v- are determined by the above determination method 2. On the other hand, the gradient estimation pattern 2 is a case where the threshold values t 2, ru, ^t _{2, r} _v ⁺ and t _{2, r} ^v- are determined by the above determination method 1 with L = 2.

In addition, BM25 was used as the conventional method. The evaluation results are shown below.

MRR represents the average reverse rank, and P represents the recall rate. Latency represents the average value of search time (unit is ms).

As shown in Table 4 above, it can be seen that the proposed method (particularly, the gradient estimation pattern 1) can achieve higher performance than the conventional method.

[Hardware configuration]
Finally, the hardware configurations of the search device 10, the inverted index generation device 20, and the learning device 30 according to the first to third embodiments will be described. The search device 10, the inverted index generation device 20, and the learning device 30 can be realized by the hardware configuration of a general computer or computer system, and can be realized by, for example, the hardware configuration of the computer 500 shown in FIG. FIG. 21 is a diagram showing an example of the hardware configuration of the computer 500.

The computer 500 shown in FIG. 21 has an input device 501, a display device 502, an external I / F 503, a communication I / F 504, a processor 505, and a memory device 506. Each of these hardware is connected so as to be communicable via the bus 507.

The input device 501 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 502 is, for example, a display or the like. The computer 500 may not have at least one of the input device 501 and the display device 502.

The external I / F 503 is an interface with an external device. The external device includes a recording medium 503a and the like. The computer 500 can read and write the recording medium 503a via the external I / F 503. In the recording medium 503a, each functional unit (context coding unit 101, pseudo-sparse coding unit 102 (or normalized sparse coding unit 109 or gradient estimation type pseudo-sparse coding unit 110), and translocation of the search device 10 are included. One or more programs that realize the index utilization unit 103 and the ranking unit 104) may be stored. Similarly, the recording medium 503a has each functional unit (context coding unit 101, pseudo-sparse coding unit 102 (or normalized sparse coding unit 109 or gradient estimation type pseudo-sparse coding)) included in the inverted index generator 20. One or more programs that realize the unit 110) and the inverted index generation unit 105) may be stored. Similarly, the recording medium 503a has each functional unit (context coding unit 101, pseudo-sparse coding unit 102 (or normalized sparse coding unit 109 or gradient estimation type pseudo-sparse coding unit 110) included in the learning device 30. ), The ranking unit 104, the division unit 106, the update unit 107 (or the update unit 107A), and the determination unit 108) may be stored.

The recording medium 503a includes, for example, a CD (Compact Disc), a DVD (Digital Versatile Disc), an SD memory card (Secure Digital memory card), a USB (Universal Serial Bus) memory card, and the like.

The communication I / F 504 is an interface for connecting the computer 500 to the communication network. One or more programs that realize each functional unit of the search device 10 may be acquired (downloaded) from a predetermined server device or the like via the communication I / F 504. Similarly, one or more programs that realize each functional unit of the inverted index generation device 20 may be acquired from a predetermined server device or the like via the communication I / F 504. Similarly, one or more programs that realize each functional unit of the learning device 30 may be acquired from a predetermined server device or the like via the communication I / F 504.

The processor 505 is, for example, various arithmetic units such as a CPU (Central Processing Unit) and a GPU. Each functional unit included in the search device 10 is realized, for example, by a process in which one or more programs stored in the memory device 506 are executed by the processor 505. Similarly, each functional unit of the inverted index generation device 20 is realized by, for example, a process of causing the processor 505 to execute one or more programs stored in the memory device 506. Similarly, each functional unit included in the learning device 30 is realized by, for example, a process of causing the processor 505 to execute one or more programs stored in the memory device 506.

The memory device 506 is, for example, various storage devices such as HDD, SSD, RAM (RandomAccessMemory), ROM (ReadOnlyMemory), and flash memory.

The search device 10 according to the first to third embodiments can realize the above-mentioned search process by having the hardware configuration of the computer 500 shown in FIG. 21. Similarly, the inverted index generation device 20 according to the first to third embodiments can realize the above-mentioned inverted index generation process by having the hardware configuration of the computer 500 shown in FIG. 21. .. Similarly, the learning device 30 according to the first to third embodiments can realize the above-mentioned learning process by having the hardware configuration of the computer 500 shown in FIG. 21. The hardware configuration of the computer 500 shown in FIG. 21 is an example, and the computer 500 may have another hardware configuration. For example, the computer 500 may have a plurality of processors 505 or may have a plurality of memory devices 506.

Regarding the above embodiments, the following additional notes will be further disclosed.

(Appendix 1)
With memory
With at least one processor connected to the memory
Including
The processor
Using the model parameters of the first neural network as input to a plurality of training data including the search query, the first document related to the search query, and the second document not related to the search query. A plurality of first feature quantity vectors representing the features of the plurality of search queries, a plurality of second feature quantity vectors representing the features of the plurality of first documents, and a plurality of the second features. Generate a plurality of third feature vectors representing each feature of the document,
Normalization and averaging of each of the plurality of the first feature vector, the plurality of the second feature vector, and the plurality of the third feature vector using the model parameters of the second neural network. A plurality of first learning sparse feature vectors, a plurality of second learning sparse feature vectors, and a plurality of third learning sparses by adjusting the ratio of elements that take 0 in each dimension by shifting. Converted to a sparse feature vector and
Using the plurality of the first learning sparse feature quantity vectors, the plurality of the second learning sparse feature quantity vectors, and the plurality of the third learning sparse feature quantity vectors of the first neural network. A learning device that updates a model parameter and a model parameter of the second neural network.

(Appendix 2)
The processor
For each of the first feature amount vector, the second feature amount vector, and the third feature amount vector, each of the output vectors of the fully connected layer included in the final layer of the second neural network. After normalizing and averaging the element values in the dimension, the first sparse feature vector, the second sparse feature vector, and the second sparse feature vector are calculated by calculating the value of the firing function that satisfies the predetermined condition of the final layer. Converted to 3 sparse feature vectors,
By setting the value of the element satisfying a predetermined condition in each dimension of each of the first sparse feature amount vector, the second sparse feature amount vector, and the third sparse feature amount vector to 0. The learning device according to Appendix 1, which converts the first learning sparse feature amount vector, the second learning sparse feature amount vector, and the third learning sparse feature amount vector into the third learning sparse feature amount vector.

(Appendix 3)
The processor
As preset parameters μ and σ,
In each dimension of the plurality of output vectors relating to the plurality of the first feature vector, the normalization is performed by the subset of the set of the elements of the dimension and the parameter σ, and the average shift is performed by the parameter μ. After that, by calculating the value of the firing function, the first feature quantity vector is converted into the first sparse feature quantity vector.
In each dimension of the plurality of output vectors relating to the plurality of the second feature vectors, the normalization is performed by the subset of the set of the elements of the dimension and the parameter σ, and the average shift is performed by the parameter μ. After that, by calculating the value of the firing function, the second feature amount vector is converted into the second sparse feature amount vector.
In each dimension of the plurality of output vectors relating to the plurality of the third feature vector, the normalization is performed by the subset of the set of the elements of the dimension and the parameter σ, and the average shift is performed by the parameter μ. The learning device according to Appendix 2, wherein the third feature amount vector is converted into the third sparse feature amount vector by calculating the value of the firing function.

(Appendix 4)
With memory
With at least one processor connected to the memory
Including
The processor
Using the search query as an input and using the trained model parameters of the first neural network, a feature vector representing the characteristics of the search query is generated.
By using the trained model parameters of the second neural network, the output vector of the fully connected layer with respect to the feature quantity vector is normalized and mean-shifted in each dimension, and then sparsed by an ignition function satisfying a predetermined condition. , Converted to the first sparse feature vector,
The second sparse feature of the document related to the search query is sparse using the inverted index created in advance, using the index of the dimension corresponding to the non-zero element included in the first sparse feature vector as a key. Get the set of sparse feature vectors as a value and
t is set as a preset value satisfying 0 <t ≦ 100, and the element not included in the upper t% in the set of the elements of the same dimension of the second sparse feature vector is set to 0. A search device that calculates the degree of association between the search query and a document related to the search query using a third sparse feature vector.

(Appendix 5)
A non-temporary storage medium that stores a program that can be executed by a computer to perform a learning process.
The learning process is
Using the model parameters of the first neural network as input to a plurality of training data including the search query, the first document related to the search query, and the second document not related to the search query. A plurality of first feature quantity vectors representing the features of the plurality of search queries, a plurality of second feature quantity vectors representing the features of the plurality of first documents, and a plurality of the second features. Generate a plurality of third feature vectors representing each feature of the document,
Normalization and averaging of each of the plurality of the first feature vector, the plurality of the second feature vector, and the plurality of the third feature vector using the model parameters of the second neural network. A plurality of first learning sparse feature vectors, a plurality of second learning sparse feature vectors, and a plurality of third learning sparses by adjusting the ratio of elements that take 0 in each dimension by shifting. Converted to a sparse feature vector and
Using the plurality of the first learning sparse feature vector, the plurality of the second learning sparse feature vector, and the plurality of the third learning sparse feature vector of the first neural network. A non-temporary storage medium that updates the model parameters and the model parameters of the second neural network.

(Appendix 6)
A non-temporary storage medium that stores a program that can be executed by a computer to perform a search process.
The search process is
Using the search query as an input and using the trained model parameters of the first neural network, a feature vector representing the characteristics of the search query is generated.
By using the trained model parameters of the second neural network, the output vector of the fully connected layer with respect to the feature quantity vector is normalized and mean-shifted in each dimension, and then sparsed by an ignition function satisfying a predetermined condition. , Converted to the first sparse feature vector,
The second sparse feature of the document related to the search query is sparse using the inverted index created in advance, using the index of the dimension corresponding to the non-zero element included in the first sparse feature vector as a key. Get the set of sparse feature vectors as a value and
t is set as a preset value satisfying 0 <t ≦ 100, and the element not included in the upper t% in the set of the elements of the same dimension of the second sparse feature vector is set to 0. A non-temporary storage medium for calculating the degree of association between the search query and a document related to the search query using a third sparse feature vector.

The present invention is not limited to the above-described embodiment specifically disclosed, and various modifications and modifications, combinations with known techniques, and the like are possible without departing from the description of the claims. ..

10 Search device 20 Inverted index generator 30 Learning device 100 Coding unit

100A Coding unit

100B Coding unit 101 Contextual coding unit 102 Pseudo-sparse coding unit 103 Inverted index utilization unit 104 Ranking unit 105 Inverted index generation unit 106 Division 107 Update part 107A Update part 108 Judgment part 109 Normalized sparse coding part 110 Gradient estimation type pseudo sparse coding part

Claims

Using the model parameters of the first neural network as input to a plurality of training data including the search query, the first document related to the search query, and the second document not related to the search query. A plurality of first feature quantity vectors representing the features of the plurality of search queries, a plurality of second feature quantity vectors representing the features of the plurality of first documents, and a plurality of the second features. A feature amount generator that generates a plurality of third feature amount vectors representing the features of the document, respectively.
Normalization and averaging of each of the plurality of the first feature vector, the plurality of the second feature vector, and the plurality of the third feature vector using the model parameters of the second neural network. A plurality of first learning sparse feature vectors, a plurality of second learning sparse feature vectors, and a plurality of third learning sparses by adjusting the ratio of elements that take 0 in each dimension by shifting. A converter that converts to a sparse feature vector,
Using the plurality of the first learning sparse feature vector, the plurality of the second learning sparse feature vector, and the plurality of the third learning sparse feature vector of the first neural network. An updater that updates the model parameters and the model parameters of the second neural network,
A learning device characterized by having.
The conversion unit
For each of the first feature amount vector, the second feature amount vector, and the third feature amount vector, each of the output vectors of the fully connected layer included in the final layer of the second neural network. After normalizing and averaging the element values in the dimension, the first sparse feature vector, the second sparse feature vector, and the second sparse feature vector are calculated by calculating the value of the firing function that satisfies the predetermined condition of the final layer. Converted to 3 sparse feature vectors,
By setting the value of the element satisfying a predetermined condition in each dimension of each of the first sparse feature amount vector, the second sparse feature amount vector, and the third sparse feature amount vector to 0. The first aspect of claim 1, wherein the first learning sparse feature amount vector is converted into the second learning sparse feature amount vector and the third learning sparse feature amount vector. Learning device.
The conversion unit
As preset parameters μ and σ,
In each dimension of the plurality of output vectors relating to the plurality of the first feature vector, the normalization is performed by the subset of the set of the elements of the dimension and the parameter σ, and the average shift is performed by the parameter μ. After that, by calculating the value of the firing function, the first feature quantity vector is converted into the first sparse feature quantity vector.
In each dimension of the plurality of output vectors relating to the plurality of the second feature vectors, the normalization is performed by the subset of the set of the elements of the dimension and the parameter σ, and the average shift is performed by the parameter μ. After that, by calculating the value of the firing function, the second feature amount vector is converted into the second sparse feature amount vector.
In each dimension of the plurality of output vectors relating to the plurality of the third feature vector, the normalization is performed by the subset of the set of the elements of the dimension and the parameter σ, and the average shift is performed by the parameter μ. The learning device according to claim 2, wherein the third feature amount vector is converted into the third sparse feature amount vector by calculating the value of the firing function.
A feature quantity generator that generates a feature quantity vector representing the characteristics of the search query using the trained model parameters of the first neural network with the search query as an input.
By using the trained model parameters of the second neural network, the output vector of the fully connected layer with respect to the feature quantity vector is normalized and mean-shifted in each dimension, and then sparsed by an ignition function satisfying a predetermined condition. , A conversion unit that converts to the first sparse feature vector,
The second sparse feature of the document related to the search query is sparse using the inverted index created in advance, using the index of the dimension corresponding to the non-zero element included in the first sparse feature vector as a key. Inverted index utilization unit that acquires a set of sparse feature vectors as a value,
t is set as a preset value satisfying 0 <t ≦ 100, and the element not included in the upper t% in the set of the elements of the same dimension of the second sparse feature vector is set to 0. A calculation unit that calculates the degree of association between the search query and the document related to the search query using the third sparse feature vector.
A search device characterized by having.
Using the model parameters of the first neural network as input to a plurality of training data including the search query, the first document related to the search query, and the second document not related to the search query. A plurality of first feature quantity vectors representing the features of the plurality of search queries, a plurality of second feature quantity vectors representing the features of the plurality of first documents, and a plurality of the second features. A feature amount generation procedure for generating a plurality of third feature amount vectors representing each feature of a document, and a feature amount generation procedure.
Normalization and averaging of each of the plurality of the first feature vector, the plurality of the second feature vector, and the plurality of the third feature vector using the model parameters of the second neural network. A plurality of first learning sparse feature vectors, a plurality of second learning sparse feature vectors, and a plurality of third learning sparses by adjusting the ratio of elements that take 0 in each dimension by shifting. Conversion procedure part to convert to sparse feature vector,
Using the plurality of the first learning sparse feature vector, the plurality of the second learning sparse feature vector, and the plurality of the third learning sparse feature vector of the first neural network. An update procedure for updating the model parameters and the model parameters of the second neural network, and
A learning method characterized by a computer performing.
A feature quantity generation procedure for generating a feature quantity vector representing the characteristics of the search query using the trained model parameters of the first neural network with the search query as an input.
By using the trained model parameters of the second neural network, the output vector of the fully connected layer with respect to the feature quantity vector is normalized and mean-shifted in each dimension, and then sparsed by an ignition function satisfying a predetermined condition. , The conversion procedure to convert to the first sparse feature vector,
The second sparse feature of the document related to the search query is sparse using the inverted index created in advance, using the index of the dimension corresponding to the non-zero element included in the first sparse feature vector as a key. Inverted index utilization procedure to acquire a set of sparse feature vectors as a value,
t is set as a preset value satisfying 0 <t ≦ 100, and the element not included in the upper t% in the set of the elements of the same dimension of the second sparse feature vector is set to 0. A calculation procedure for calculating the degree of association between the search query and the document related to the search query using the third sparse feature vector.
A search method characterized by a computer running.
A program that causes a computer to function as the learning device according to any one of claims 1 to 3 or the search device according to claim 4.