WO2020182122A1 - Method and apparatus for generating text matching model - Google Patents

Method and apparatus for generating text matching model (用于生成文本匹配模型的方法和装置)

Info

Publication number
WO2020182122A1
WO2020182122A1 (PCT/CN2020/078584)
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
sample
preset number
matching
Application number
PCT/CN2020/078584
Other languages
English (en)
French (fr)
Inventor
万圣贤
陈诗妮
Original Assignee
北京字节跳动网络技术有限公司
Application filed by 北京字节跳动网络技术有限公司
Publication of WO2020182122A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation

Description

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for generating a text matching model.
  • The problem of text semantic matching is that of determining the similarity between two pieces of text (for example, a query text and the text contained in a web page).
  • Typical applications include search engines, question-answering systems, and intelligent customer service systems. For example, a search engine can rank candidate documents by this similarity, and an intelligent customer service system can find the closest question-answer pair in its database according to the user's question.
  • Related text matching techniques mainly include the following: methods based on exact keyword hits (such as the BM25 algorithm and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm), deep learning models based on implicit semantic representations, and deep learning models based on deep interaction. A baseline of the first kind is sketched below.
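  • For context, the following is a minimal, runnable sketch of an exact-keyword-hit baseline of the kind mentioned above, using standard TF-IDF weighting with cosine scoring. It illustrates the related technology, not the method of this disclosure; the tokenized documents are illustrative.

```python
# Standard TF-IDF cosine baseline of the "exact keyword hit" family.
# Not the patent's model; all data here is illustrative.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (dict: term -> weight) per tokenized doc."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["text", "matching", "model"], ["text", "matching"], ["image", "model"]]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # higher: shares two terms with doc 0
print(cosine(vecs[0], vecs[2]))  # lower: shares one term with doc 0
```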
  • In view of this, embodiments of the present disclosure propose a method and apparatus for generating a text matching model, and a method and apparatus for outputting text.
  • In a first aspect, an embodiment of the present disclosure provides a method for generating a text matching model. The method includes: obtaining a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; selecting a training sample from the training sample set and performing the following training steps: inputting the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input sample word sequences and the text indicated by the matching sample word sequences; inputting the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences; comparing the first similarity value with the second similarity value, and determining from the comparison result whether the initial model has reached a preset optimization goal; and in response to determining that the optimization goal has been reached, determining the initial model to be the text matching model.
  • In some embodiments, obtaining the training sample set includes: obtaining a sample text, a matching text that matches the obtained sample text, and a non-matching text that does not match the obtained sample text; segmenting the obtained sample text, matching text, and non-matching text according to a preset number of word segmentation granularities to obtain a preset number of sample word sequences corresponding to the sample text, a preset number of matching sample word sequences corresponding to the matching text, and a preset number of non-matching sample word sequences corresponding to the non-matching text; and determining word alignment information corresponding to the obtained preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences, where the word alignment information is used to characterize the correspondence between words in word sequences obtained from the same text at different word segmentation granularities.
  • In some embodiments, the initial model includes a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network, and obtaining the first similarity value and the second similarity value includes: inputting the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model to obtain sample-aligned word vector sequences corresponding to the input sample word sequences and matching-sample-aligned word vector sequences corresponding to the input matching sample word sequences, where the vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to that word sequence, to vector-align the corresponding word vector sequence to obtain the aligned word vector sequence; inputting the sample-aligned word vector sequences and the matching-sample-aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix; inputting the obtained similarity matrix into the convolutional neural network to obtain the first similarity value; and processing the sample word sequences and the non-matching sample word sequences included in the selected training sample in the same way to obtain the second similarity value.
  • In some embodiments, the convolutional neural network includes at least one convolution sub-network and a similarity value generation layer, where a convolution sub-network is used to perform convolution operations on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer is used to generate the similarity value based on the sub-similarity values.
  • In some embodiments, the at least one convolution sub-network includes a proximity convolution sub-network, the proximity convolution sub-network includes a proximity convolution kernel, and the proximity convolution kernel includes weights. The weights are used to characterize the degree of influence, on the determined similarity value, of the distance between the positions, in the text for matching, of the words that match the words included in the sample word sequence.
  • In some embodiments, the similarity matrix generation layer includes a word weight generation layer, where the word weight generation layer is used to determine the weights, in the text indicated by the sample word sequence, of the sample words in the sample word sequence corresponding to a pre-specified word segmentation granularity, and the similarity matrix generation layer is used to generate a weighted similarity matrix using the weights generated by the word weight generation layer and the generated similarity matrix.
  • In some embodiments, the method further includes: in response to determining that the optimization goal has not been reached, adjusting the parameters of the initial model, reselecting a training sample from the training samples in the training sample set that have not been selected, and continuing to perform the training steps using the reselected training sample and the most recently adjusted initial model.
  • In a second aspect, an embodiment of the present disclosure provides a method for outputting text. The method includes: obtaining a target text and a set of texts to be matched, where the target text is text input by a user; segmenting the target text and the texts to be matched in the set according to a preset number of word segmentation granularities to generate a preset number of target word sequences corresponding to the target text and a preset number of to-be-matched word sequences corresponding to each text to be matched; for each text to be matched, inputting the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model to obtain a similarity value characterizing the degree of similarity between the text to be matched and the target text, where the text matching model is generated according to the method described in any one of the embodiments of the first aspect; and selecting and outputting texts to be matched from the set based on the magnitudes of the obtained similarity values.
  • In some embodiments, the word segmentation processing includes: segmenting the target text and the texts to be matched in the set according to a preset number of word segmentation granularities to obtain a preset number of target word sequences corresponding to the target text and a preset number of to-be-matched word sequences corresponding to each text to be matched; and determining word alignment information corresponding to the preset number of target word sequences and the preset number of to-be-matched word sequences, so that the text matching model uses the word alignment information to generate similarity values.
  • In some embodiments, selecting and outputting texts to be matched from the set includes: selecting texts to be matched from the set based on the magnitudes of the obtained similarity values; and displaying the selected texts to be matched on a target display screen.
  • In a third aspect, an embodiment of the present disclosure provides an apparatus for generating a text matching model. The apparatus includes: a training sample acquisition unit configured to acquire a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; and a training unit configured to select training samples from the training sample set and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input sample word sequences and the text indicated by the matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences; compare the two similarity values and determine from the comparison result whether the initial model has reached a preset optimization goal; and in response to determining that the goal has been reached, determine the initial model to be the text matching model.
  • In a fourth aspect, an embodiment of the present disclosure provides an apparatus for outputting text. The apparatus includes: a text acquisition unit configured to obtain a target text and a set of texts to be matched, where the target text is text input by a user; a word segmentation unit configured to segment the target text and the texts to be matched according to a preset number of word segmentation granularities, generating a preset number of target word sequences corresponding to the target text and a preset number of to-be-matched word sequences corresponding to each text to be matched; a matching unit configured to, for each text to be matched, input the corresponding preset number of to-be-matched word sequences and the preset number of target word sequences into a pre-trained text matching model to obtain a similarity value characterizing the degree of similarity between the text to be matched and the target text, where the text matching model is generated according to the method described in any one of the embodiments of the first aspect; and an output unit configured to select and output texts to be matched from the set based on the magnitudes of the obtained similarity values.
  • In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first or second aspect.
  • In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of the first or second aspect.
  • The method and apparatus for generating a text matching model provided by the embodiments of the present disclosure obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; select at least one training sample from the set; use the selected training sample and an initial model to obtain a first similarity value characterizing the similarity between the text indicated by the sample word sequences and the text indicated by the matching sample word sequences, and a second similarity value characterizing the similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences; and train the initial model according to the comparison of the first and second similarity values to obtain the text matching model. Training with a preset number of word sequences corresponding to the same text allows the resulting model to process those word sequences more comprehensively and determine the similarity between two texts more accurately, which helps improve the accuracy of text matching.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for generating a text matching model according to an embodiment of the present disclosure;
  • FIG. 3 is an exemplary schematic diagram of generating a similarity matrix in the method for generating a text matching model according to an embodiment of the present disclosure;
  • FIG. 4 is an exemplary schematic diagram of a proximity convolution sub-network generating a sub-similarity value in the method for generating a text matching model according to an embodiment of the present disclosure;
  • FIG. 5 is a schematic diagram of an application scenario of the method for generating a text matching model according to an embodiment of the present disclosure;
  • FIG. 6 is a flowchart of an embodiment of a method for outputting text according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic structural diagram of an embodiment of an apparatus for generating a text matching model according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic structural diagram of an embodiment of an apparatus for outputting text according to an embodiment of the present disclosure;
  • FIG. 9 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
  • FIG. 1 shows an exemplary system architecture 100 to which embodiments of the method and apparatus for generating a text matching model, and of the method and apparatus for outputting text, of the present disclosure can be applied.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, etc.
  • Various communication client applications, such as search applications, web browser applications, shopping applications, instant messaging tools, email clients, and social platform software, can be installed on the terminal devices 101, 102, and 103.
  • The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices. When they are software, they can be installed in the electronic devices listed above and implemented either as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
  • The server 105 may be a server that provides various services, for example a back-end server that performs model training on training sample sets uploaded by the terminal devices 101, 102, and 103, or a back-end server that processes text uploaded by the terminal devices 101, 102, and 103.
  • The back-end server can use the acquired training sample set to train a text matching model, or use the text matching model to generate similarity values between texts and output texts according to those similarity values.
  • It should be noted that the method for generating a text matching model provided by the embodiments of the present disclosure can be executed by the server 105 or by the terminal devices 101, 102, 103; accordingly, the apparatus for generating a text matching model can be set in the server 105 or in the terminal devices 101, 102, 103.
  • Likewise, the method for outputting text provided by the embodiments of the present disclosure can be executed by the server 105 or by the terminal devices 101, 102, 103, and accordingly the apparatus for outputting text can be set in the server 105 or in the terminal devices 101, 102, 103.
  • The server can be hardware or software. When the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it can be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
  • It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs. In some cases, the system architecture may not include a network, but only servers or terminal devices.
  • With continued reference to FIG. 2, the method for generating a text matching model includes the following steps:
  • Step 201: Obtain a training sample set.
  • In this embodiment, the execution body of the method for generating a text matching model can obtain the training sample set remotely or locally through a wired or wireless connection.
  • A training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences.
  • The words in each of the aforementioned word sequences may include, but are not limited to, at least one of the following: single-character words, multi-character words, and phrases. Generally, the aforementioned preset number is greater than or equal to two.
  • The preset number of sample word sequences may correspond to a sample text, the preset number of matching sample word sequences to a matching sample text, and the preset number of non-matching sample word sequences to a non-matching sample text. A matching sample text may be a text with a higher degree of correlation with the sample text, and a non-matching sample text may be a text with a lower degree of correlation with the sample text.
  • As an example, the sample text may be a search sentence entered by a user; the execution body used to generate the training samples may set text that is included in the search results and was clicked by the user as matching sample text, and text that was not clicked by the user as non-matching sample text.
  • The sample word sequences in the preset number of sample word sequences may be word sequences obtained by segmenting the sample text. Specifically, the execution body that generates the sample word sequences may segment the sample text using a preset number of different word segmentation granularities to obtain the preset number of sample word sequences.
  • The word segmentation granularity represents the number of characters included in a single word when the text is segmented: when the granularity is large, a single word includes more characters, and when it is small, a single word includes fewer characters. For example, the words obtained by large-granularity segmentation may include "boyfriend", while the words obtained by small-granularity segmentation include "male" and "friend".
  • Methods for segmenting text at different word segmentation granularities are well known in the art and are not repeated here; a toy sketch follows for illustration.
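  • As an illustration only, the following sketch segments the same text at two granularities using a tiny dictionary and greedy longest-match. Real systems would use a proper segmenter; the dictionary here is an assumption for demonstration.

```python
# Toy multi-granularity segmentation: the same text yields different word
# sequences depending on the maximum allowed word length (the "granularity").
def segment(text, vocab, max_len):
    """Greedy longest-match segmentation with words up to max_len characters."""
    words, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + L]
            if L == 1 or cand in vocab:   # fall back to a single character
                words.append(cand)
                i += L
                break
    return words

vocab = {"男朋友", "朋友"}                 # toy dictionary
text = "男朋友"
coarse = segment(text, vocab, max_len=3)  # ['男朋友']  ("boyfriend")
fine = segment(text, vocab, max_len=2)    # ['男', '朋友']  ("male" + "friend")
print(coarse, fine)
```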
  • Optionally, the execution body may also use a preset number of different word segmentation algorithms to segment the sample text to obtain the preset number of sample word sequences.
  • The execution body that generates the sample word sequences can segment the matching text and the non-matching text by the same method used to segment the sample text, obtaining the preset number of matching sample word sequences and the preset number of non-matching sample word sequences.
  • The methods for segmenting text in this embodiment may include, but are not limited to, at least one of the following: dictionary-based methods, statistics-based methods, and semantics-based methods.
  • In some optional implementations of this embodiment, the execution body may perform the following steps:
  • Step 1: Obtain a sample text, a matching text that matches the obtained sample text, and a non-matching text that does not match the obtained sample text.
  • As an example, the sample text may be a search sentence entered by a user; the matching text may be text included in a search result that the user clicked (or that has the highest click-through rate); and the non-matching text may be text included in a search result that the user did not click (or that has the lowest click-through rate).
  • Step 2: Segment the obtained sample text, matching text, and non-matching text according to a preset number of word segmentation granularities, obtaining a preset number of sample word sequences corresponding to the sample text, a preset number of matching sample word sequences corresponding to the matching text, and a preset number of non-matching sample word sequences corresponding to the non-matching text.
  • Compared with segmenting at a single word segmentation granularity, segmenting at a preset number of granularities can reduce the probability of matching failure, which helps improve the accuracy of the similarity values generated by the finally trained text matching model.
  • Step 3: Determine word alignment information corresponding to the obtained preset number of sample word sequences, preset number of matching sample word sequences, and preset number of non-matching sample word sequences. The word alignment information is used to characterize the correspondence between words in the word sequences obtained from the same text at different word segmentation granularities.
  • Specifically, for a preset number of word sequences (which may be any one of the preset number of sample word sequences, the preset number of matching sample word sequences, or the preset number of non-matching sample word sequences), the execution body may determine the word sequence obtained at a pre-specified word segmentation granularity as the reference word sequence, and obtain the word alignment information from the words included in the reference word sequence.
  • As an example, suppose three sample word sequences (hereinafter sequence 1, sequence 2, and sequence 3) are characterized by the following information, where each letter or letter combination represents a word: "A, B, C, D", "A, BC, D", and "A, BCD". The word segmentation granularity increases from sequence 1 to sequence 3, and sequence 2 is the reference word sequence. The generated word alignment information can then include "B, C - BC" and "BCD - BC, D": "B, C - BC" corresponds to sequence 1 and characterizes that the words B and C in sequence 1 correspond to the word BC in sequence 2, while "BCD - BC, D" corresponds to sequence 3 and characterizes that the word BCD in sequence 3 corresponds to the words BC and D in sequence 2. It should be understood that the above example applies equally to the preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences. A sketch of deriving such alignment information is given below.
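  • The following sketch derives such alignment information from character offsets for the example above. The record format (a list of (words, reference word) pairs) is an illustrative assumption; the disclosure only requires that corresponding words be recorded.

```python
# Derive word alignment info between a word sequence and the reference word
# sequence by comparing character spans.
def spans(words):
    """Character span (start, end) of each word in the concatenated text."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out

def align(seq, ref):
    """For each reference word, list the words of seq overlapping its span."""
    s_spans, r_spans = spans(seq), spans(ref)
    info = []
    for (rs, re), rw in zip(r_spans, ref):
        group = [w for (ss, se), w in zip(s_spans, seq)
                 if ss < re and se > rs]          # overlapping character spans
        info.append((group, rw))
    return info

seq1, seq2, seq3 = ["A", "B", "C", "D"], ["A", "BC", "D"], ["A", "BCD"]
print(align(seq1, seq2))  # [(['A'], 'A'), (['B', 'C'], 'BC'), (['D'], 'D')]
print(align(seq3, seq2))  # [(['A'], 'A'), (['BCD'], 'BC'), (['BCD'], 'D')]
```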
  • By segmenting a text at different word segmentation granularities to obtain a preset number of word sequences, this implementation trains the model with multiple word sequences for the same text. Because multiple word sequences can represent a text more fully, this implementation helps the generated model match two texts more comprehensively, thereby improving the accuracy of the generated similarity values.
  • Step 202: Select a training sample from the training sample set and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into the initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input sample word sequences and the text indicated by the matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences; compare the first similarity value with the second similarity value, and determine from the comparison result whether the initial model has reached the preset optimization goal; in response to determining that the optimization goal has been reached, determine the initial model to be the text matching model.
  • In this embodiment, the execution body may select a training sample from the training sample set and perform the following training steps (step 2021 to step 2024):
  • Step 2021: Input the sample word sequences and the matching sample word sequences included in the selected training sample into the initial model to obtain the first similarity value characterizing the degree of similarity between the text indicated by the preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences.
  • The initial model may include neural networks of various structures, such as a neural network with a Siamese structure or an LSF-SCNN (Lexical Semantic Feature based Skip Convolution Neural Network).
  • The initial model can be an untrained model with initial parameters or a model that has already been trained. The initial model can convert the words included in the input word sequences into vector form and determine the similarity value from the resulting vectors; generally, the larger the similarity value, the more similar the two texts. The similarity value can be determined from the distance between vectors (for example, the Euclidean distance or the cosine similarity): for example, the cosine similarity may be determined as the similarity value, or the reciprocal of the Euclidean distance may be determined as the similarity value. A sketch of both options follows.
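  • A minimal sketch of the two vector-distance-based similarity values just mentioned (cosine similarity, and the reciprocal of the Euclidean distance); the vectors are illustrative.

```python
# Two ways of turning a vector distance into a similarity value.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def inverse_euclidean(u, v, eps=1e-8):
    return 1.0 / (np.linalg.norm(u - v) + eps)  # eps avoids division by zero

u, v = np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.5, 0.5])
print(cosine_similarity(u, v), inverse_euclidean(u, v))
```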
  • In this embodiment, the input to the initial model is usually the preset number of sample word sequences and the preset number of matching sample word sequences included in one training sample. The initial model can perform processing such as vector conversion and distance calculation on these inputs to obtain the first similarity value.
  • Step 2022: Input the sample word sequences and the non-matching sample word sequences included in the selected training sample into the initial model to obtain the second similarity value characterizing the degree of similarity between the text indicated by the preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences.
  • Here, the input to the initial model is usually the preset number of sample word sequences and the preset number of non-matching sample word sequences included in one training sample, and the initial model can obtain the second similarity value by the same method as in step 2021.
  • In some optional implementations of this embodiment, the initial model may include a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network. In these implementations, the execution body may determine the first similarity value according to the following steps: First, input the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample-aligned word vector sequences corresponding to the input sample word sequences and the matching-sample-aligned word vector sequences corresponding to the input matching sample word sequences. The vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to that word sequence, to vector-align the corresponding word vector sequence, obtaining the aligned word vector sequence corresponding to the input word sequence.
  • Here, the word alignment information is obtained according to the method described in the optional implementation of step 201 above. The vector alignment sub-model may include related-technology models for determining word vectors (for example, the Word2Vec model or the n-gram model).
  • The word vector of each word includes the same number of elements, and the sequence of word vectors corresponding to the words included in a word sequence is the word vector sequence corresponding to that word sequence.
  • The vector alignment sub-model can perform vector alignment on the word vector sequences corresponding to the input preset number of sample word sequences, and likewise on the word vector sequences corresponding to the input preset number of matching sample word sequences. Specifically, the vector alignment sub-model can perform vector alignment by merging or expanding word vectors.
  • Continuing the example above, where sequence 2 is the reference word sequence, the vector alignment sub-model can merge the word vectors of word B and word C according to the word alignment information "B, C - BC" corresponding to sequence 1, so that the merged word vector includes the same number of elements as the word vector corresponding to the word BC in the reference word sequence.
  • As an example, a mean pooling algorithm can be used to merge the word vectors; that is, the elements at the same positions in the two word vectors are averaged, and the resulting new word vector is the merged word vector.
  • Likewise, the vector alignment sub-model can expand the word vector corresponding to the word BCD according to the word alignment information "BCD - BC, D" corresponding to sequence 3, so that the number of elements in the expanded word vector equals the sum of the numbers of elements in the word vectors of the words BC and D in the reference word sequence. For example, the word vector corresponding to BCD can be copied; that is, two copies of the BCD word vector together serve as the expanded word vector.
  • In this way, the aligned word vector sequences of all samples can include the same number of word vectors, and the aligned word vector sequences of all matching samples can likewise include the same number of word vectors. Both alignment operations are sketched below.
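  • A minimal sketch of the two alignment operations described above, under the assumptions of the running example: mean pooling merges the vectors of B and C (many-to-one), and copying expands the vector of BCD (one-to-many). The vector values are illustrative.

```python
# Merging and expanding word vectors during vector alignment.
import numpy as np

def merge_mean(vectors):
    """Mean-pool several word vectors into one (many-to-one alignment)."""
    return np.mean(vectors, axis=0)

def expand_copy(vector, times):
    """Copy one word vector `times` times (one-to-many alignment)."""
    return [vector.copy() for _ in range(times)]

vb, vc = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(merge_mean([vb, vc]))   # [2. 3.] -- aligned with reference word "BC"
vbcd = np.array([5.0, 6.0])
print(expand_copy(vbcd, 2))   # two copies -- aligned with words "BC" and "D"
```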
  • Then, the similarity matrix generation layer may combine each obtained sample-aligned word vector sequence with each obtained matching-sample-aligned word vector sequence. For each combination, the similarity matrix generation layer performs pairwise similarity calculations between the word vectors in the sample-aligned word vector sequence and the word vectors in the matching-sample-aligned word vector sequence, obtaining a corresponding pending similarity matrix. Each element of a pending similarity matrix corresponds to one sample-aligned word vector and one matching-sample-aligned word vector; that is, each element is the similarity value (for example, the cosine similarity) between the corresponding pair of word vectors. The similarity matrix generation layer may then obtain the final similarity matrix from the pending similarity matrices, for example by taking, at each element position, the maximum value across the pending similarity matrices.
  • As an example, as shown in FIG. 3, the matrices A1 and A2 correspond to the first and second word segmentation granularities respectively, and each row of A1 and A2 is a sample-aligned word vector; the matrices B1 and B2 likewise correspond to the first and second word segmentation granularities, and each row of B1 and B2 is a matching-sample-aligned word vector. Pairing A1 and A2 with B1 and B2 yields four combinations (A1-B1, A1-B2, A2-B1, A2-B2). For the combination A1-B1, the similarity between each row of A1 and each row of B1 is determined, giving the pending similarity matrix X1: the element in the first row and first column of X1 is the similarity between the first row of A1 and the first row of B1, the element in the first row and second column of X1 is the similarity between the first row of A1 and the second row of B1, and so on. The pending similarity matrices X2, X3, and X4 corresponding to the other combinations are obtained in the same way. This procedure is sketched below.
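  • The following sketch reproduces this procedure under assumed shapes: cosine similarities between every pair of aligned word vectors for each combination, followed by an elementwise maximum over the pending matrices.

```python
# Pending similarity matrices for all combinations, then an elementwise max.
import numpy as np

def pending_matrix(A, B):
    """Cosine similarity between every row of A and every row of B."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

rng = np.random.default_rng(0)
A1, A2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))  # sample-aligned
B1, B2 = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))  # match-aligned
pending = [pending_matrix(A, B) for A in (A1, A2) for B in (B1, B2)]
Y = np.maximum.reduce(pending)   # elementwise max over X1..X4
print(Y.shape)                   # (4, 5): final similarity matrix
```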
  • In some optional implementations of this embodiment, the similarity matrix generation layer may include a word weight generation layer. The word weight generation layer is used to determine the weight, in the sample text indicated by the sample word sequence, of each sample word in the sample word sequence corresponding to the pre-specified word segmentation granularity, and the similarity matrix generation layer is used to generate a weighted similarity matrix from the weights generated by the word weight generation layer and the generated similarity matrix.
  • The word weight generation layer can use various related-technology methods for determining the weight of a word in a text to determine the weight of each sample word in the sample text. As an example, the TF-IDF algorithm can be used to determine the TF-IDF value of each sample word, after which the ratio of each word's TF-IDF value to the total (i.e., the sum of all the TF-IDF values) is determined as that word's weight.
  • The similarity matrix generation layer may then use the weights generated by the word weight generation layer and the generated similarity matrix to generate the weighted similarity matrix. Specifically, the elements in each row of the similarity matrix Y shown in FIG. 3 can be multiplied by the weight of the sample word indicated by that row, yielding the final weighted similarity matrix.
  • In this way, a weighted similarity matrix can be generated according to the weight of each word, so that the elements of the final similarity matrix represent the similarity between two words more accurately, which helps the finally trained text matching model determine the degree of similarity between two texts more accurately. A sketch of this weighting follows.
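  • A minimal sketch of the row weighting just described; the TF-IDF values are placeholders for whatever the word weight generation layer actually computes.

```python
# Scale each row of the similarity matrix Y by its sample word's weight.
import numpy as np

tfidf = np.array([1.2, 0.3, 0.9, 0.6])  # one value per sample word (assumed)
weights = tfidf / tfidf.sum()           # ratio to the total, as described above
Y = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]])
weighted_Y = weights[:, None] * Y       # multiply row i by weight of word i
print(weighted_Y)
```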
  • Then, the convolutional neural network can perform convolution operations, fully connected operations, and the like on the similarity matrix to obtain the first similarity value. The structure of the convolutional neural network can be any of various related-technology structures, such as the LSF-SCNN structure.
  • In some optional implementations of this embodiment, the convolutional neural network may include at least one convolution sub-network and a similarity value generation layer, where a convolution sub-network is used to perform convolution operations on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer is used to generate the similarity value based on the sub-similarity values.
  • The at least one convolution sub-network may include a convolution sub-network that performs convolution operations using a conventional two-dimensional convolution kernel (for example, a kernel of size 5×5).
  • Specifically, each convolution sub-network in the at least one convolution sub-network can generate a sub-similarity value. The sub-similarity values are input to the similarity value generation layer, which computes the similarity value from them. As an example, the similarity value generation layer may perform a weighted summation of the sub-similarity values using preset weights corresponding to each sub-similarity value, obtaining the similarity value. This pipeline is sketched below.
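  • The following sketch shows one plain convolution sub-network feeding the similarity value generation layer: a 2-D kernel slides over the similarity matrix, the responses are max-pooled into a single sub-similarity value, and the layer combines the sub-similarity values by a preset weighted sum. The kernel contents and layer weights are illustrative assumptions.

```python
# One convolution sub-network plus the similarity value generation layer.
import numpy as np

def conv2d_valid(M, K):
    """Plain 2-D valid convolution (no padding, stride 1)."""
    kh, kw = K.shape
    H, W = M.shape[0] - kh + 1, M.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(M[i:i + kh, j:j + kw] * K)
    return out

def sub_similarity(M, K):
    return float(conv2d_valid(M, K).max())   # global max pooling

rng = np.random.default_rng(1)
M = rng.random((6, 10))                      # similarity matrix
kernels = [rng.random((3, 3)), rng.random((5, 5))]
subs = np.array([sub_similarity(M, K) for K in kernels])
layer_weights = np.array([0.6, 0.4])         # preset weights (assumed)
similarity_value = float(subs @ layer_weights)
print(similarity_value)
```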
  • In some optional implementations of this embodiment, the above initial model may also include other sub-models for determining the similarity between two texts, which may include, but are not limited to, at least one of the following: a Bag-of-Words (BOW) model, a Recurrent Neural Network (RNN) model, etc.
  • Specifically, a sample word sequence can be selected from the input preset number of sample word sequences (for example, the word sequence corresponding to the pre-specified word segmentation granularity), and a matching sample word sequence can likewise be selected from the input preset number of matching sample word sequences; the selected sample word sequence and matching sample word sequence are input into such a sub-model to obtain a sub-similarity value. The execution body may then input the obtained sub-similarity value into the above similarity value generation layer, so that the similarity value generation layer computes the similarity value from the input sub-similarity values.
  • In some optional implementations of this embodiment, the aforementioned at least one convolution sub-network may include a proximity convolution sub-network, the proximity convolution sub-network includes a proximity convolution kernel, and the proximity convolution kernel includes weights. The weights are used to characterize the degree of influence, on the determined similarity value, of the distance between the positions, in the text for matching, of the words that match the words included in the sample word sequence.
  • Here, the text for matching is the text whose similarity with the text indicated by the input sample word sequences is being calculated; it may be the matching text indicated by the input matching sample word sequences, or the non-matching text indicated by the input non-matching sample word sequences.
  • As an example, as shown in FIG. 4, the similarity matrix 401 is a matrix with 3 rows and 10 columns. A, B, and C in the figure represent the words included in the sample word sequence, and D, E, F, G, ..., M, N represent the word sequence determined from the text for matching. The element in the first row and first column of the similarity matrix 401 is the similarity value between the words A and D, the element in the first row and second column is the similarity value between the words A and E, and so on.
  • 402 is the proximity convolution kernel; as can be seen from the figure, the weight in the middle column of the kernel is the largest and decreases toward both sides. Suppose the proximity convolution kernel 402 has slid to the position shown in the figure, i.e., its middle column is aligned with the third column of the similarity matrix 401. Multiplying elementwise yields the result matrix 403; taking the maximum of each row of the result matrix 403 (i.e., 0.8, 0.8, 0.9) and adding these maxima gives the similarity value corresponding to the third column of the similarity matrix 401 (i.e., 2.5). Similarly, the similarity value corresponding to every column of the similarity matrix 401 can be obtained, and the maximum of these values is the sub-similarity value determined by the proximity convolution sub-network. As can be seen from FIG. 4, when calculating the similarity value corresponding to the third column, if the positions in the text for matching of the words matching A, B, and C are close to the position of the word corresponding to the third column (i.e., the word corresponding to F), their corresponding weights are large, so the calculated similarity value is large; when they are farther away, the calculated similarity value is smaller (for example, the similarity value corresponding to A-J in the figure is 1, the same as that corresponding to A-E, but because the position of J is farther from the third column, its corresponding weight is only 0.4, so the product with the weight, 0.4, is smaller). Because the proximity convolution sub-network includes the proximity convolution kernel, the sub-similarity value it computes reflects the distance between the positions of the matched words in the text for matching, so that the calculated similarity value represents the similarity between the two texts more accurately. This computation is sketched below.
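  • A sketch of the proximity convolution just described, under assumed weights: the kernel's column weights peak at the center and decay outward, row maxima of the weighted window are summed per column position, and the best column position gives the sub-similarity value.

```python
# Proximity convolution over a similarity matrix, as described for FIG. 4.
import numpy as np

def proximity_sub_similarity(sim, col_weights):
    rows, cols = sim.shape
    half = len(col_weights) // 2
    padded = np.zeros((rows, cols + 2 * half))   # zero-pad the column borders
    padded[:, half:half + cols] = sim
    kernel = np.tile(col_weights, (rows, 1))     # same weights in every row
    col_scores = []
    for c in range(cols):                        # center kernel on each column
        window = padded[:, c:c + len(col_weights)]
        col_scores.append((window * kernel).max(axis=1).sum())  # row maxima, summed
    return max(col_scores)                       # best-scoring column position

sim = np.random.default_rng(2).random((3, 10))       # 3x10 similarity matrix
col_weights = np.array([0.4, 0.7, 1.0, 0.7, 0.4])    # decays away from center
print(proximity_sub_similarity(sim, col_weights))
```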
  • Then, the execution body can determine the second similarity value by continuing with analogous steps: input the sample word sequences and the non-matching sample word sequences included in the selected training sample into the vector alignment sub-model to obtain the sample-aligned word vector sequences and the non-matching-sample-aligned word vector sequences; input the obtained sample-aligned word vector sequences and non-matching-sample-aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix; and input the obtained similarity matrix into the convolutional neural network. In other words, the execution body can determine the second similarity value by the same method used to determine the first similarity value, which is not repeated here.
  • Step 2023: Compare the first similarity value with the second similarity value, and determine from the comparison result whether the initial model has reached the preset optimization goal.
  • In this embodiment, the execution body can compare the first similarity value with the second similarity value using a preset loss function (for example, a hinge loss function or a squared hinge loss function) and calculate a loss value with it. If the loss value meets a preset condition (for example, the loss value is less than or equal to a preset value, or the loss value no longer decreases), it is determined that the initial model has reached the optimization goal. As an example, the preset loss function may be a hinge loss of the form max(0, s2 - s1 + sigma), where s1 is the first similarity value, s2 is the second similarity value, and sigma is a preset margin value. Training aims to minimize s2 - s1 + sigma; when s2 - s1 + sigma meets the above preset condition, it is determined that the initial model has reached the optimization goal. A sketch follows.
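  • A minimal sketch of this comparison, assuming the hinge loss reconstruction max(0, s2 - s1 + sigma); the margin and similarity values are illustrative.

```python
# Pairwise hinge loss: zero once the matching pair outscores the
# non-matching pair by at least the margin sigma.
def hinge_loss(s1, s2, sigma=0.5):
    """s1: first (matching) similarity, s2: second (non-matching) similarity."""
    return max(0.0, s2 - s1 + sigma)

print(hinge_loss(s1=0.9, s2=0.2))  # 0.0 -> optimization goal condition met
print(hinge_loss(s1=0.4, s2=0.6))  # 0.7 -> keep adjusting parameters
```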
  • Step 2024: In response to determining that the optimization goal has been reached, determine the initial model to be the text matching model.
  • In some optional implementations of this embodiment, after step 202 the execution body may also perform the following step: in response to determining, based on the above comparison result, that the initial model has not reached the optimization goal, adjust the parameters of the initial model, reselect a training sample from the training samples in the training sample set that have not yet been selected, and continue to perform the above training steps (i.e., step 2021 to step 2024) using the reselected training sample and the most recently adjusted initial model.
  • Here, the execution body may adopt various methods to adjust the parameters of the initial model according to the comparison result. For example, the BP (Back Propagation) algorithm or the SGD (Stochastic Gradient Descent) algorithm can be used to adjust the parameters of the initial model.
  • With continued reference to FIG. 5, FIG. 5 is a schematic diagram of an application scenario of the method for generating a text matching model according to this embodiment. In the application scenario of FIG. 5, the electronic device 501 first obtains a training sample set 502.
  • Each training sample includes a preset number (for example, 3) of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences. The preset number of sample word sequences may be word sequences extracted in advance from a sample text, each sample word sequence corresponding to one word segmentation granularity; the preset number of matching sample word sequences may be word sequences extracted in advance from a matching sample text, and the preset number of non-matching sample word sequences may be word sequences extracted in advance from a non-matching sample text.
  • Then the electronic device 501 selects a training sample 5021 from the training sample set 502 and executes the following training steps: input the sample word sequences 50211 and the matching sample word sequences 50212 included in the selected training sample 5021 into the initial model 503 to obtain a first similarity value 504 characterizing the similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; input the sample word sequences 50211 and the non-matching sample word sequences 50213 included in the selected training sample 5021 into the initial model 503 to obtain a second similarity value 505 characterizing the similarity between the text indicated by the preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; compare the first similarity value 504 with the second similarity value 505 (for example, calculating a loss value using a hinge loss function), and determine from the comparison result (for example, the loss value) whether the initial model 503 has reached a preset optimization goal; in response to determining that it has, determine the initial model 503 to be the text matching model.
  • The method provided by the foregoing embodiment of the present disclosure obtains a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; selects at least one training sample from the set; and uses the selected training sample and the initial model to obtain a first similarity value characterizing the similarity between the text indicated by the sample word sequences and the text indicated by the matching sample word sequences, and a second similarity value characterizing the similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences. The initial model is trained according to the comparison of these two values to obtain the text matching model. Model training thus uses a preset number of word sequences corresponding to the same text, so that the resulting text matching model can process these word sequences more comprehensively and determine the similarity between two texts more accurately, which helps improve the accuracy of text matching.
  • With further reference to FIG. 6, FIG. 6 shows a flow 600 of an embodiment of a method for outputting text. The flow 600 of the method for outputting text includes the following steps:
  • Step 601: Obtain a target text and a set of texts to be matched.
  • In this embodiment, the execution body of the method for outputting text can obtain the target text and the set of texts to be matched remotely or locally through a wired or wireless connection.
  • The target text is text input by a user; specifically, it may be text used to search for information, for example text entered by the user in a search field displayed on the screen of the execution body. The set of texts to be matched may be a text set pre-stored in the execution body, or a text set pre-stored on an electronic device communicatively connected to the execution body.
  • Step 602: Segment the target text and the texts to be matched in the set according to a preset number of word segmentation granularities, generating a preset number of target word sequences corresponding to the target text and a preset number of to-be-matched word sequences corresponding to each text to be matched. In this embodiment, the execution body performs this word segmentation processing on the target text and on each text to be matched in the set respectively.
  • Here, the word segmentation granularity and the methods for segmenting text at different granularities are the same as described above for step 201 (see the "boyfriend" example there), and are not repeated here.
  • Step 603: For each text to be matched in the set, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text. In this embodiment, the text matching model is generated according to the method described in the embodiment corresponding to FIG. 2 above.
  • In some optional implementations of this embodiment, the word segmentation processing in step 602 includes: segmenting the target text and the texts to be matched according to the preset number of word segmentation granularities to obtain the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; and determining the word alignment information corresponding to these word sequences, so that the text matching model uses the word alignment information to generate similarity values. The word alignment information is used to represent the correspondence between words in word sequences obtained at different word segmentation granularities; for its description, refer to the optional implementation in the embodiment corresponding to FIG. 2, which is not repeated here.
  • Correspondingly, the text matching model may use the word alignment information to generate a similarity value. Specifically, the text matching model may include a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network, where the vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the corresponding word alignment information, to vector-align the corresponding word vector sequence to obtain the aligned word vector sequence; the similarity matrix generation layer is used to generate a similarity matrix from the aligned word vector sequences corresponding to the target word sequences and those corresponding to the to-be-matched word sequences; and the convolutional neural network is used to generate, from the obtained similarity matrix, a similarity value characterizing the similarity between the text to be matched and the target text. For details of the vector alignment sub-model, the similarity matrix generation layer, and the convolutional neural network, refer to the corresponding embodiment of FIG. 2, which is not repeated here.
  • Step 604: Based on the magnitudes of the obtained similarity values, select texts to be matched from the set and output them.
  • In this embodiment, the execution body may select and output texts to be matched from the set based on the magnitudes of the obtained similarity values. Specifically, it may select texts to be matched in descending order of similarity value and then output the selected texts in various ways; for example, when the execution body is the server shown in FIG. 1, the server can send the selected texts to be matched, in descending order of similarity value, to the terminal device shown in FIG. 1, so that the selected texts are displayed on the screen of the terminal device. A sketch of this selection follows.
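  • A minimal sketch of this selection step: rank the texts to be matched by similarity value in descending order and take the top k. The scores and k are illustrative.

```python
# Rank candidate texts by model similarity and keep the top k.
scores = {"text_a": 0.91, "text_b": 0.34, "text_c": 0.77}
k = 2
top_k = sorted(scores, key=scores.get, reverse=True)[:k]
print(top_k)  # ['text_a', 'text_c']
```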
  • In some optional implementations of this embodiment, the execution subject may select and output texts to be matched as follows. First, select texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values, generally in descending order of the corresponding similarity value. Then, display the selected texts on a target display screen, i.e., the display screen on which the texts are to be shown; the target display screen may be a display screen included in the execution subject, or a display screen of another electronic device communicatively connected with it. In this way, the texts most similar to the target text can be displayed in a targeted manner. Since the display screen of the electronic device presenting the texts is of limited size, this implementation makes full use of that limited size, presents texts to the user selectively, and saves both the display resources of the screen and the storage resources used for the displayed texts.
  • The method provided by the above embodiment of the present disclosure obtains a target text and a text set to be matched, performs word segmentation on the target text and on each text to be matched according to a preset number of word segmentation granularities to generate the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched, inputs these word sequences into the pre-trained text matching model to obtain similarity values characterizing the degree of similarity between each text to be matched and the target text, and finally selects and outputs texts to be matched based on the obtained similarity values. The text matching model is thereby used effectively, the accuracy of the similarity values determined between texts is improved, and the texts matching the target text are output in a targeted manner.
  • Further referring to FIG. 7, as an implementation of the method shown in FIG. 2 above, the present disclosure provides an embodiment of an apparatus for generating a text matching model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be specifically applied to various electronic devices.
  • As shown in FIG. 7, the apparatus 700 for generating a text matching model of this embodiment includes: a training sample obtaining unit 701, configured to obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; and a training unit 702, configured to select a training sample from the training sample set and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model, obtaining a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; compare the first similarity value with the second similarity value, and determine, according to the comparison result, whether the initial model has reached a preset optimization target; and in response to determining that the optimization target is reached, determine that the initial model is the text matching model.
  • the training sample obtaining unit 701 may obtain the training sample set remotely or locally through a wired connection or a wireless connection.
  • the training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences.
  • the words in each of the aforementioned word sequences may include but are not limited to at least one of the following: single-character words, multi-character words, and phrases. Generally, the aforementioned preset number is greater than or equal to two.
  • Specifically, the preset number of sample word sequences may correspond to a sample text, the preset number of matching sample word sequences to a matching sample text, and the preset number of non-matching sample word sequences to a non-matching sample text. The matching sample text may be a text highly correlated with the sample text, and the non-matching sample text a text weakly correlated with it. For example, the sample text may be a search sentence entered by a user; the execution body used to generate the training samples may set the texts included in the search results that the user clicked as matching sample texts, and the texts the user did not click as non-matching texts.
  • The sample word sequences in the preset number of sample word sequences may be word sequences obtained by segmenting the sample text. The training sample acquisition unit 701 may also use a preset number of different word segmentation algorithms to segment the sample text into the preset number of sample word sequences. It should be understood that the executive body generating the sample word sequences can segment the matching text and the non-matching text with the same methods used for the sample text, obtaining the preset number of matching sample word sequences and the preset number of non-matching sample word sequences. The methods for segmenting text in this embodiment may include, but are not limited to, at least one of the following: dictionary-based methods, statistics-based methods, and semantics-based methods.
  • In this embodiment, the training unit 702 can select a training sample from the training sample set and perform the following training steps (steps 7021 to 7024). Step 7021: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into the initial model, obtaining a first similarity value that characterizes the degree of similarity between the text indicated by the sample word sequences and the text indicated by the matching sample word sequences.
  • the initial model may include neural networks with various structures, such as a neural network with a Siamese structure, an LSF-SCNN (Lexical Semantic Feature based Skip Convolution Neural Network, a jumping convolutional neural network based on lexical semantic features), etc.
  • the initial model can be an untrained model with initial parameters, or a trained model.
  • the initial model can convert the words included in the input word sequence into the form of vectors, and the similarity value can be determined according to each vector. Generally, the larger the similarity value, the higher the similarity between two texts.
  • the similarity value can be determined according to the distance between the vectors (for example, Euclidean distance, cosine distance, etc.). For example, the cosine distance is determined as the similarity value, or the reciprocal of the Euclidean distance is determined as the similarity value.
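  • As a minimal sketch of the two distance-based options just mentioned (the cosine distance taken as the similarity value, or the reciprocal of the Euclidean distance), the following Python helpers illustrate the idea; the epsilon guard is our own addition to avoid division by zero and is not part of the described method.

      import numpy as np

      def cosine_similarity(u, v):
          # Larger value means the two word vectors (and texts) are more similar.
          return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

      def reciprocal_euclidean_similarity(u, v, eps=1e-8):
          # Reciprocal of the Euclidean distance, as described above.
          return 1.0 / (np.linalg.norm(u - v) + eps)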
  • the input to the initial model is usually a preset number of sample word sequences and a preset number of matching sample word sequences included in a training sample.
  • the initial model can perform processing such as vector conversion and distance calculation on the input preset number of sample word sequences and preset number of matching sample word sequences to obtain the first similarity value.
  • Step 7022: input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value that characterizes the degree of similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences. The input to the initial model is usually the preset number of sample word sequences and the preset number of non-matching sample word sequences included in one training sample; the initial model can obtain the second similarity value by the same method as in step 7021.
  • Step 7023: compare the first similarity value with the second similarity value, and determine whether the initial model has reached the preset optimization target according to the comparison result. Specifically, the training unit 702 may use a preset loss function (for example, a hinge loss or squared hinge loss function) to compare the first and second similarity values; a loss value is computed with this loss function, and if the loss value meets a preset condition (for example, it is less than or equal to a preset value, or it no longer decreases), the initial model is determined to have reached the optimization target.
  • Step 7024: in response to determining that the optimization target is reached, determine that the initial model is the text matching model.
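  • The following sketch spells out the comparison in steps 7023 and 7024, using the concrete hinge form L = max(0, s2 - s1 + sigma) that the description gives later; sigma is the preset margin, s1 and s2 the matching and non-matching similarity values, and the stopping conditions mirror the preset conditions named above.

      def hinge_loss(s1, s2, sigma=1.0):
          # L = max(0, s2 - s1 + sigma): zero once the matching pair beats the
          # non-matching pair by at least the margin sigma.
          return max(0.0, s2 - s1 + sigma)

      def reached_optimization_target(losses, threshold=1e-3):
          # Preset condition: loss small enough, or no longer decreasing.
          return losses[-1] <= threshold or (
              len(losses) > 1 and losses[-1] >= losses[-2])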
  • In some optional implementations of this embodiment, the training sample obtaining unit 701 may include: an obtaining module, configured to obtain a sample text, a matching text that matches the obtained sample text, and a non-matching text that does not match the obtained sample text; a word segmentation module, configured to segment the obtained sample text, matching text, and non-matching text according to the preset number of word segmentation granularities, obtaining the preset number of sample word sequences corresponding to the sample text, the preset number of matching sample word sequences corresponding to the matching text, and the preset number of non-matching sample word sequences corresponding to the non-matching text; and a determining module, configured to determine the word alignment information corresponding to each of the obtained preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences, where the word alignment information is used to represent the correspondence between words in word sequences obtained with different word segmentation granularities.
  • In some optional implementations of this embodiment, the initial model may include a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network, and the training unit 702 may include a first generation module and a second generation module (not shown in the figure). The first generation module is configured to input the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences corresponding to the input sample word sequences and the matching-sample aligned word vector sequences corresponding to the input matching sample word sequences, where the vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to the word sequence, perform vector alignment on the corresponding word vector sequence to obtain the aligned word vector sequence; to input the obtained sample aligned word vector sequences and matching-sample aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix; and to input the obtained similarity matrix into the convolutional neural network to obtain the first similarity value. The second generation module is configured to perform the same processing with the sample word sequences and the non-matching sample word sequences, obtaining the second similarity value.
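  • A hedged sketch of the vector alignment step follows: word vectors of a fine-grained sequence are merged by mean pooling when several of its words correspond to one word of the reference sequence, and a coarse-grained vector is duplicated when one word spans several reference words, as the detailed description below explains. The `alignment` argument is a hypothetical encoding of the word alignment information: one index list per aligned output vector.

      import numpy as np

      def align_word_vectors(vectors, alignment):
          # vectors: (k, d) word vectors of the input sequence.
          # alignment: list of index lists, e.g. [[0], [1, 2], [3]] merges
          # words 1-2 by mean pooling; [[0], [1], [1]] duplicates word 1.
          aligned = []
          for idx in alignment:
              group = vectors[idx]
              if len(idx) > 1:
                  aligned.append(group.mean(axis=0))   # merge by mean pooling
              else:
                  aligned.append(group[0])             # keep or duplicate as-is
          return np.stack(aligned)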
  • In some optional implementations of this embodiment, the convolutional neural network includes at least one convolution sub-network and a similarity value generation layer. Each convolution sub-network is used to perform convolution operations on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer is used to generate the similarity value based on the sub-similarity values.
  • In some optional implementations of this embodiment, the at least one convolution sub-network includes a proximity convolution sub-network. The proximity convolution sub-network includes a proximity convolution kernel, and the proximity convolution kernel includes weights. The weights are used to characterize how strongly the distances between the positions, in the matching text, of the words that match the words included in the sample word sequence influence the determination of the similarity value.
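  • The following is an illustrative sketch of the proximity convolution, following the FIG. 4 walk-through given in the description: the kernel's weights peak at its centre column and decay toward the edges, the kernel slides over the columns of the similarity matrix, each row of the windowed product contributes its maximum, and the largest column score becomes the sub-similarity value. The kernel values and the zero padding at the edges are our own assumptions.

      import numpy as np

      def proximity_sub_similarity(sim, kernel):
          # sim: (rows, cols) similarity matrix; kernel: (rows, width), width odd.
          rows, cols = sim.shape
          half = kernel.shape[1] // 2
          padded = np.pad(sim, ((0, 0), (half, half)))  # centre kernel at every column
          scores = []
          for c in range(cols):
              window = padded[:, c:c + kernel.shape[1]]
              scores.append((window * kernel).max(axis=1).sum())  # row maxima, summed
          return max(scores)                                      # best column score

      # Illustrative kernel: weights largest in the middle column, decaying outward.
      kernel = np.tile(np.array([0.4, 0.7, 1.0, 0.7, 0.4]), (3, 1))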
  • In some optional implementations of this embodiment, the similarity matrix generation layer includes a word weight generation layer. The word weight generation layer is used to determine the weight, within the text indicated by the sample word sequence, of each sample word in the sample word sequence corresponding to a pre-specified word segmentation granularity; the similarity matrix generation layer then uses the weights generated by the word weight generation layer and the generated similarity matrix to produce a weighted similarity matrix.
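  • As a sketch of one weighting scheme the description mentions, the snippet below normalises each word's TF-IDF value by the sum of all TF-IDF values and scales each row of the similarity matrix by the weight of the sample word that row represents; this is one possible realisation, not the only one.

      import numpy as np

      def tfidf_weights(tfidf_values):
          # Each weight is the word's share of the total TF-IDF mass.
          v = np.asarray(tfidf_values, dtype=float)
          return v / v.sum()

      def weighted_similarity_matrix(sim, tfidf_values):
          w = tfidf_weights(tfidf_values)   # one weight per row / sample word
          return sim * w[:, None]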
  • In some optional implementations of this embodiment, the above-mentioned apparatus 700 may further include a selection unit (not shown in the figure), configured to, in response to determining that the optimization target has not been reached, adjust the parameters of the initial model, reselect a training sample from the training samples in the training sample set that have not yet been selected, and continue the training steps with the reselected training sample and the most recently adjusted initial model.
  • The apparatus 700 provided by the above embodiment of the present disclosure obtains a training sample set, where each training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; it then selects at least one training sample from the set and uses the selected training sample and the initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the matching sample word sequences, and a second similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences; finally, the initial model is trained according to the comparison of the first and second similarity values, yielding a text matching model. Model training thus uses the preset number of word sequences corresponding to the same text, so the resulting text matching model can process these sequences more comprehensively and determine the similarity between two texts more accurately, which helps to improve the accuracy of text matching.
  • Further referring to FIG. 8, as an implementation of the method shown in FIG. 6 above, the present disclosure provides an embodiment of an apparatus for outputting text. The apparatus embodiment corresponds to the method embodiment shown in FIG. 6, and the apparatus can be specifically applied to various electronic devices.
  • As shown in FIG. 8, the apparatus 800 for outputting text of this embodiment includes: a text obtaining unit 801, configured to obtain a target text and a text set to be matched, where the target text is text input by a user; a word segmentation unit 802, configured to perform word segmentation processing on the target text and on each text to be matched in the text set to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; a matching unit 803, configured to, for each text to be matched in the text set to be matched, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, obtaining a similarity value representing the degree of similarity between the text to be matched and the target text, where the text matching model is generated according to the method described in the embodiment corresponding to FIG. 2; and an output unit 804, configured to select and output texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values.
  • In this embodiment, the text obtaining unit 801 may obtain the target text and the text set to be matched remotely or locally through a wired or wireless connection. The target text is text input by the user; typically it is text used to search for information, for example text the user entered in a search field displayed on the screen of the apparatus 800. The text set to be matched may be a text set pre-stored in the apparatus 800, or pre-stored on an electronic device communicatively connected with the apparatus 800.
  • In this embodiment, the word segmentation unit 802 may perform word segmentation processing on the target text and on each text to be matched according to the preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched. The word segmentation granularity characterizes the number of characters a word includes when the text is segmented: with a large granularity a single word includes more characters, and with a small granularity it includes fewer. For example, large-granularity segmentation yields words such as "男朋友" (boyfriend), while small-granularity segmentation yields "男" (male) and "朋友" (friend). Methods for segmenting text at different granularities are well known in the art and are not repeated here.
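  • The following is a hedged sketch of producing the preset number of word sequences for one text, one tokenizer per granularity. The two lambda tokenizers are illustrative stand-ins for a coarse segmenter that keeps "男朋友" whole and a fine one that splits it; in practice a segmentation library (for example jieba) or granularity-specific dictionaries would be used.

      def segment_all_granularities(text, tokenizers):
          # tokenizers: list of callables, one per word segmentation granularity;
          # returns the preset number of word sequences for this text.
          return [tok(text) for tok in tokenizers]

      coarse = lambda t: t.replace("男朋友", " 男朋友 ").split()   # keeps 男朋友 whole
      fine   = lambda t: t.replace("男朋友", " 男 朋友 ").split()  # splits 男 / 朋友
      sequences = segment_all_granularities("我的男朋友", [coarse, fine])
      # sequences -> [['我的', '男朋友'], ['我的', '男', '朋友']]  (illustrative)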
  • In this embodiment, for each text to be matched in the text set to be matched, the matching unit 803 can input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into the pre-trained text matching model, obtaining the similarity value used to characterize the degree of similarity between the text to be matched and the target text. The text matching model is generated according to the method described in the embodiment corresponding to FIG. 2 above.
  • In this embodiment, the output unit 804 may select and output texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values. Generally, the output unit 804 selects texts in descending order of similarity value and then outputs them in various ways. For example, when the apparatus 800 is set in the server shown in FIG. 1, the apparatus 800 may send the selected texts, in descending order of similarity value, to the terminal device shown in FIG. 1, so that they are displayed on the screen of the terminal device.
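  • A minimal sketch of this selection logic: rank candidate texts by similarity value in descending order and keep the top k before sending them to the terminal device. The function and parameter names are illustrative, not taken from the disclosure.

      def select_texts(scored_texts, k=10):
          # scored_texts: iterable of (similarity_value, text) pairs.
          ranked = sorted(scored_texts, key=lambda p: p[0], reverse=True)
          return [text for _, text in ranked[:k]]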
  • In some optional implementations of this embodiment, the word segmentation unit 802 may include: a word segmentation module (not shown in the figure), configured to segment the target text and each text to be matched according to the preset number of word segmentation granularities, obtaining the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; and a determining module (not shown in the figure), configured to determine the word alignment information corresponding to the preset number of target word sequences and to the preset number of to-be-matched word sequences, so that the text matching model uses the word alignment information to generate the similarity value.
  • In some optional implementations of this embodiment, the output unit 804 may include: a selection module (not shown in the figure), configured to select texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values; and a display module (not shown in the figure), configured to display the selected texts on the target display screen.
  • The apparatus 800 provided by the above embodiment of the present disclosure obtains a target text and a text set to be matched, performs word segmentation processing on them according to the preset number of word segmentation granularities to generate the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched, inputs these word sequences into the pre-trained text matching model to obtain similarity values characterizing the degree of similarity between each text to be matched and the target text, and finally selects and outputs texts to be matched based on the obtained similarity values. The text matching model is thus used effectively, the accuracy of the similarity values determined between texts is improved, and texts matching the target text are output in a targeted manner, which helps to save the hardware resources of the electronic device used to display them.
  • Referring now to FIG. 9, it shows a schematic structural diagram of an electronic device (such as the server or terminal device in FIG. 1) 900 suitable for implementing the embodiments of the present disclosure. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 9 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 9, the electronic device 900 may include a processing device (such as a central processing unit, a graphics processor, etc.) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
  • Generally, the following devices can be connected to the I/O interface 905: input devices 906 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 907 such as a liquid crystal display (LCD), speakers, and vibrators; storage devices 908 such as memory; and a communication device 909. The communication device 909 may allow the electronic device 900 to perform wireless or wired communication with other devices to exchange data. Although FIG. 9 shows an electronic device 900 having various devices, it should be understood that it is not required to implement or possess all of them; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 9 may represent one device or, as needed, multiple devices.
  • In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowcharts can be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication device 909, installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable medium, or any combination of the two. The computer-readable medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. The computer-readable medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code; such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable medium described above, and it may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wires, optical cables, RF (radio frequency), and the like, or any suitable combination of the above.
  • The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or it may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; select a training sample from the training sample set, and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model, obtaining a first similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value characterizing the degree of similarity between the text indicated by the sample word sequences and the text indicated by the non-matching sample word sequences; compare the first similarity value with the second similarity value, and determine whether the initial model has reached a preset optimization target according to the comparison result; and in response to determining that the optimization target is reached, determine that the initial model is the text matching model.
  • In addition, when the one or more programs are executed by the electronic device, the electronic device may also be caused to: obtain a target text and a text set to be matched, where the target text is text input by a user; perform word segmentation processing on the target text and on each text to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; for each text to be matched in the text set to be matched, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into the pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text; and select and output texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values.
  • The computer program code for performing the operations of the embodiments of the present disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram can represent a module, program segment, or part of code containing one or more executable instructions for realizing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in a different order from that marked in the drawings; for example, two blocks shown in succession can actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Each block in the block diagrams and/or flowcharts, and any combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments described in the present disclosure can be implemented in software or hardware. The described units can also be provided in a processor; for example, a processor may be described as including a training sample acquisition unit and a training unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves; for example, the training sample acquisition unit can also be described as "a unit for acquiring a training sample set."


Abstract

The embodiments of the present disclosure disclose a method and apparatus for generating a text matching model. A specific implementation of the method includes: obtaining a training sample set; selecting a training sample from the training sample set, and performing the following training steps: inputting a preset number of sample word sequences and a preset number of matching sample word sequences included in the selected training sample into an initial model to obtain a first similarity value; inputting the preset number of sample word sequences and a preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value; comparing the first similarity value with the second similarity value, and determining, according to the comparison result, whether the initial model has reached an optimization target; and in response to determining that the optimization target is reached, determining that the initial model is the text matching model. The text matching model obtained by this implementation can determine the similarity between two texts more accurately, which helps to improve the accuracy of text matching.

Description

Method and apparatus for generating a text matching model

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on the Chinese patent application No. 201910184893.2 filed on March 12, 2019, and claims priority to that Chinese patent application, the entire contents of which are incorporated into this application by reference.

TECHNICAL FIELD

The embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for generating a text matching model.

BACKGROUND

The problem of text semantic matching refers to determining, given two pieces of text (for example, a query text and the text included in a web page), the degree of similarity between them. Typical applications include search engines, question answering systems, and intelligent customer service systems. In a search engine, for example, candidate documents can be ranked according to this degree of similarity; in an intelligent customer service system, the closest question and answer in the database can be found for a user's question.

Related-art methods for matching text mainly include the following: methods based on exact keyword hits (for example, the BM25 algorithm and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm), deep learning models based on implicit semantic representation, and deep learning models based on deep interaction.
SUMMARY

The embodiments of the present disclosure propose a method and apparatus for generating a text matching model, and a method and apparatus for outputting text.

In a first aspect, the embodiments of the present disclosure provide a method for generating a text matching model, the method including: obtaining a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; selecting a training sample from the training sample set, and performing the following training steps: inputting the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; inputting the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; comparing the first similarity value with the second similarity value, and determining, according to the comparison result, whether the initial model has reached a preset optimization target; and in response to determining that the optimization target is reached, determining that the initial model is the text matching model.

In some embodiments, obtaining the training sample set includes: obtaining a sample text, a matching text that matches the obtained sample text, and a non-matching text that does not match the obtained sample text; segmenting the obtained sample text, matching text, and non-matching text according to a preset number of word segmentation granularities, obtaining the preset number of sample word sequences corresponding to the sample text, the preset number of matching sample word sequences corresponding to the matching text, and the preset number of non-matching sample word sequences corresponding to the non-matching text; and determining the word alignment information corresponding to each of the obtained preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences, where the word alignment information is used to represent the correspondence between words in word sequences obtained from the same text with different word segmentation granularities.

In some embodiments, the initial model includes a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network; and obtaining the first similarity value and obtaining the second similarity value include: inputting the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences corresponding to the input sample word sequences and the matching-sample aligned word vector sequences corresponding to the input matching sample word sequences, where the vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to the word sequence, perform vector alignment on the word vector sequence corresponding to the input word sequence to obtain the aligned word vector sequence corresponding to the input word sequence; inputting the obtained sample aligned word vector sequences and matching-sample aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix; inputting the obtained similarity matrix into the convolutional neural network to obtain the first similarity value; inputting the sample word sequences and the non-matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences corresponding to the input sample word sequences and the non-matching-sample aligned word vector sequences corresponding to the input non-matching sample word sequences; inputting the obtained sample aligned word vector sequences and non-matching-sample aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix; and inputting the obtained similarity matrix into the convolutional neural network to obtain the second similarity value.

In some embodiments, the convolutional neural network includes at least one convolution sub-network and a similarity value generation layer; the convolution sub-network is used to perform convolution operations on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer is used to generate the similarity value based on the sub-similarity values.

In some embodiments, the at least one convolution sub-network includes a proximity convolution sub-network; the proximity convolution sub-network includes a proximity convolution kernel, and the proximity convolution kernel includes weights, the weights being used to characterize how strongly the distances between the positions, in the matching-use text, of the words matching the words included in the sample word sequence influence the determination of the similarity value.

In some embodiments, the similarity matrix generation layer includes a word weight generation layer; the word weight generation layer is used to determine the weight, within the text indicated by the sample word sequence, of each sample word in the sample word sequence corresponding to a pre-specified word segmentation granularity, and the similarity matrix generation layer is used to generate a weighted similarity matrix from the weights generated by the word weight generation layer and the generated similarity matrix.

In some embodiments, the method further includes: in response to determining that the optimization target has not been reached, adjusting the parameters of the initial model, reselecting a training sample from the training samples in the training sample set that have not been selected, and continuing the training steps with the reselected training sample and the most recently adjusted initial model.

In a second aspect, the embodiments of the present disclosure provide a method for outputting text, the method including: obtaining a target text and a text set to be matched, where the target text is text input by a user; performing word segmentation processing on the target text and on each text to be matched in the text set to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; for each text to be matched in the text set to be matched, inputting the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text, where the text matching model is generated according to the method described in any embodiment of the first aspect; and selecting and outputting texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values.

In some embodiments, the word segmentation processing includes: segmenting the target text and each text to be matched according to the preset number of word segmentation granularities, obtaining the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; and determining the word alignment information corresponding to the preset number of target word sequences and to the preset number of to-be-matched word sequences corresponding to each text to be matched, so that the text matching model uses the word alignment information to generate the similarity value.

In some embodiments, selecting and outputting texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values includes: selecting texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values; and displaying the selected texts to be matched on a target display screen.

In a third aspect, the embodiments of the present disclosure provide an apparatus for generating a text matching model, the apparatus including: a training sample obtaining unit, configured to obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; and a training unit, configured to select a training sample from the training sample set and perform the following training steps: inputting the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; inputting the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; comparing the first similarity value with the second similarity value, and determining, according to the comparison result, whether the initial model has reached a preset optimization target; and in response to determining that the optimization target is reached, determining that the initial model is the text matching model.

In a fourth aspect, the embodiments of the present disclosure provide an apparatus for outputting text, the apparatus including: a text obtaining unit, configured to obtain a target text and a text set to be matched, where the target text is text input by a user; a word segmentation unit, configured to perform word segmentation processing on the target text and on each text to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; a matching unit, configured to, for each text to be matched in the text set to be matched, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text, where the text matching model is generated according to the method described in any embodiment of the first aspect; and an output unit, configured to select and output texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values.

In a fifth aspect, the embodiments of the present disclosure provide an electronic device including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first or second aspect.

In a sixth aspect, the embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored; when the computer program is executed by a processor, the method described in any implementation of the first or second aspect is implemented.

The method and apparatus for generating a text matching model provided by the embodiments of the present disclosure obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; then select at least one training sample from the training sample set and use the selected training sample and an initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences, and a second similarity value characterizing the degree of similarity between the text indicated by the input sample word sequences and the text indicated by the non-matching sample word sequences; and train the initial model according to the comparison of the first and second similarity values to obtain a text matching model. Model training thus uses the preset number of word sequences corresponding to the same text, so that the resulting text matching model can process these sequences more comprehensively and determine the similarity between two texts more accurately, which helps to improve the accuracy of text matching.
BRIEF DESCRIPTION OF THE DRAWINGS

Other features, objects, and advantages of the present disclosure will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 is a diagram of an exemplary system architecture to which an embodiment of the present disclosure may be applied;

FIG. 2 is a flowchart of an embodiment of the method for generating a text matching model according to an embodiment of the present disclosure;

FIG. 3 is an exemplary schematic diagram of generating a similarity matrix in the method for generating a text matching model according to an embodiment of the present disclosure;

FIG. 4 is an exemplary schematic diagram of generating a sub-similarity value by the proximity convolution sub-network in the method for generating a text matching model according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of an application scenario of the method for generating a text matching model according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of an embodiment of the method for outputting text according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an embodiment of the apparatus for generating a text matching model according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of an embodiment of the apparatus for outputting text according to an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.
DETAILED DESCRIPTION

The present disclosure is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the relevant disclosure, not to limit it. It should also be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the drawings.

It should be noted that, unless they conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.

FIG. 1 shows an exemplary system architecture 100 to which the method or apparatus for generating a text matching model, and the method and apparatus for outputting text, of embodiments of the present disclosure may be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or optical fiber cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as search applications, web browser applications, shopping applications, instant messaging tools, email clients, and social platform software.

The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices; when they are software, they may be installed in the above electronic devices and may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.

The server 105 may be a server providing various services, for example a background server that performs model training on training sample sets uploaded by the terminal devices 101, 102, 103, or a background server that processes texts uploaded by the terminal devices 101, 102, 103. The background server may use the obtained training sample set to perform model training to obtain a text matching model, or use the text matching model to generate similarity values between texts and output texts according to the similarity values.

It should be noted that the method for generating a text matching model provided by the embodiments of the present disclosure may be executed by the server 105 or by the terminal devices 101, 102, 103; accordingly, the apparatus for generating a text matching model may be provided in the server 105 or in the terminal devices 101, 102, 103. Likewise, the method for outputting text provided by the embodiments of the present disclosure may be executed by the server 105 or by the terminal devices 101, 102, 103, and the apparatus for outputting text may be provided in the server 105 or in the terminal devices 101, 102, 103.

It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server; when the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there may be any number of terminal devices, networks, and servers as required by the implementation. When the training samples required for model training, or the target text and the text set to be matched, do not need to be obtained remotely, the above system architecture may include no network and only a server or terminal device.
Continuing to refer to FIG. 2, a flow 200 of an embodiment of the method for generating a text matching model according to the present disclosure is shown. The method for generating a text matching model includes the following steps.

Step 201: obtain a training sample set.

In this embodiment, the execution body of the method for generating a text matching model (for example, the server or terminal device shown in FIG. 1) may obtain the training sample set remotely or locally through a wired or wireless connection. A training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences. The words in each of these word sequences may include, but are not limited to, at least one of the following: single-character words, multi-character words, and phrases. Generally, the preset number is greater than or equal to two.

Specifically, the preset number of sample word sequences may correspond to a sample text, the preset number of matching sample word sequences to a matching sample text, and the preset number of non-matching sample word sequences to a non-matching sample text. The matching sample text may be a text highly correlated with the sample text, and the non-matching sample text a text weakly correlated with it. For example, the sample text may be a search sentence input by a user; the execution body used to generate the training samples may set the texts included in the search results that the user clicked as matching sample texts, and the texts the user did not click as non-matching texts.

The sample word sequences in the preset number of sample word sequences may be word sequences obtained by segmenting the sample text. As an example, the execution body generating the sample word sequences may segment the sample text with a preset number of different word segmentation granularities to obtain the preset number of sample word sequences. The word segmentation granularity characterizes the number of characters a word includes when the text is segmented: with a large granularity a single word includes more characters, and with a small granularity it includes fewer. For example, large-granularity segmentation yields words such as "男朋友" (boyfriend), while small-granularity segmentation yields "男" (male) and "朋友" (friend). Methods for segmenting text at different granularities are well known in the art and are not repeated here.

In addition, the execution body may also segment the sample text with a preset number of different word segmentation algorithms to obtain the preset number of sample word sequences.

It should be understood that the execution body generating the sample word sequences may segment the matching text and the non-matching text with the same methods used for the sample text, obtaining the preset number of matching sample word sequences and the preset number of non-matching sample word sequences. The methods for segmenting text in this embodiment may include, but are not limited to, at least one of the following: dictionary-based methods, statistics-based methods, and semantics-based methods.

In some optional implementations of this embodiment, the execution body may perform the following steps.

Step 1: obtain a sample text, a matching text that matches the obtained sample text, and a non-matching text that does not match it. Specifically, as an example, the sample text may be a search sentence input by a user, the matching text may be a text included in the search results that the user clicked (or with the highest click-through rate), and the non-matching text may be a text included in the search results that the user did not click (or with the lowest click-through rate).

Step 2: segment the obtained sample text, matching text, and non-matching text according to a preset number of word segmentation granularities, obtaining the preset number of sample word sequences corresponding to the sample text, the preset number of matching sample word sequences corresponding to the matching text, and the preset number of non-matching sample word sequences corresponding to the non-matching text. For segmenting text at different granularities, reference may be made to the content described in step 201, which is not repeated here. Segmenting at a preset number of granularities in this step can reduce the probability of matching failures caused by a single granularity, which helps to improve the accuracy of the similarity values generated by the finally trained text matching model.

Step 3: determine the word alignment information corresponding to each of the obtained preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences. The word alignment information is used to represent the correspondence between words in word sequences obtained from the same text with different word segmentation granularities.

Specifically, from the preset number of word sequences (which may be any of the preset number of sample word sequences, matching sample word sequences, or non-matching sample word sequences), the execution body may determine the word sequence obtained with a pre-specified word segmentation granularity as the reference word sequence, and obtain the word alignment information according to the words included in the reference word sequence. As an example, suppose the preset number is three and the three sample word sequences (hereafter sequence 1, sequence 2, sequence 3) are represented as "A, B, C, D", "A, BC, D", and "A, BCD", where a letter or letter combination represents a word. The granularities of sequences 1 to 3 increase in turn, and sequence 2 is the reference word sequence. The generated word alignment information may include "B, C - BC" and "BCD - BC, D", where "B, C - BC" corresponds to sequence 1 and represents that words B and C in sequence 1 correspond to word BC in sequence 2, and "BCD - BC, D" corresponds to sequence 3 and represents that word BCD in sequence 3 corresponds to words BC and D in sequence 2. It should be understood that this example applies to the preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences alike.

By segmenting a text at different granularities into a preset number of word sequences, this implementation trains the model with multiple word sequences for the same text. Since multiple word sequences for the same text characterize it comprehensively, this implementation helps the generated model match two texts more comprehensively and thus improves the accuracy of the generated similarity values.
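To illustrate step 3, the following is a hedged sketch of deriving word alignment information against the reference word sequence by comparing character spans; it reproduces the example above, and the span-overlap rule is our own assumption about how the correspondence could be computed.

    def char_spans(words):
        # Character start/end offsets of each word within the original text.
        spans, pos = [], 0
        for w in words:
            spans.append((pos, pos + len(w)))
            pos += len(w)
        return spans

    def word_alignment_info(seq, reference):
        ref_spans = char_spans(reference)
        pairs = []
        for w, (s, e) in zip(seq, char_spans(seq)):
            covered = [r for r, (rs, re) in zip(reference, ref_spans)
                       if s < re and rs < e]   # overlapping character spans
            pairs.append((w, covered))
        return pairs

    # word_alignment_info(["A", "B", "C", "D"], ["A", "BC", "D"])
    #   -> [("A", ["A"]), ("B", ["BC"]), ("C", ["BC"]), ("D", ["D"])]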
Step 202: select a training sample from the training sample set, and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model to obtain a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; compare the first similarity value with the second similarity value, and determine, according to the comparison result, whether the initial model has reached a preset optimization target; in response to determining that the optimization target is reached, determine that the initial model is the text matching model.

In this embodiment, the execution body may select a training sample from the training sample set and perform the following training steps (steps 2021 to 2024).

Step 2021: input the sample word sequences and the matching sample word sequences included in the selected training sample into the initial model, obtaining a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences.

Specifically, the initial model may include neural networks of various structures, for example a neural network with a Siamese structure or an LSF-SCNN (Lexical Semantic Feature based Skip Convolution Neural Network). The initial model may be an untrained model with initialized parameters or an already trained model. Generally, the initial model can convert the words included in an input word sequence into vector form, and the similarity value can be determined from the vectors: the larger the similarity value, the higher the similarity between the two texts. In practice, the similarity value can be determined from the distance between vectors (for example, the Euclidean distance or the cosine distance); for example, the cosine distance may be determined as the similarity value, or the reciprocal of the Euclidean distance may be determined as the similarity value.

In this step, the input to the initial model is usually the preset number of sample word sequences and the preset number of matching sample word sequences included in one training sample. The initial model can perform processing such as vector conversion and distance calculation on them to obtain the first similarity value.

Step 2022: input the sample word sequences and the non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences.

Specifically, in this step the input to the initial model is usually the preset number of sample word sequences and the preset number of non-matching sample word sequences included in one training sample. The initial model can obtain the second similarity value by the same method as in step 2021.

In some optional implementations of this embodiment, the initial model may include a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network. The execution body may determine the first similarity value as follows.

First, input the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences corresponding to the input sample word sequences and the matching-sample aligned word vector sequences corresponding to the input matching sample word sequences. The vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to the word sequence, perform vector alignment on the corresponding word vector sequence to obtain the aligned word vector sequence. The word alignment information is obtained by the method described in the optional implementation of step 201.

The vector alignment sub-model may include related-art models for determining word vectors (for example, the Word2Vec model or an n-gram model). Generally, the word vectors of all words include the same number of elements. For a given word sequence, the word vectors of its words constitute the word vector sequence corresponding to that word sequence. The vector alignment sub-model can then perform vector alignment on the word vector sequences corresponding to the input preset number of sample word sequences, and on those corresponding to the input preset number of matching sample word sequences.

The vector alignment sub-model may perform vector alignment by merging or expanding word vectors. Continuing the example in the optional implementation of step 201, sequence 2 is the reference word sequence. According to the word alignment information "B, C - BC" corresponding to sequence 1, the vector alignment sub-model may merge the word vectors of words B and C so that the merged word vector includes the same number of elements as the word vector of word BC in the reference word sequence. For example, the word vectors can be merged with a mean pooling algorithm, i.e., the elements at the same positions in the two word vectors are averaged to obtain the merged word vector. According to the word alignment information "BCD - BC, D" corresponding to sequence 3, the vector alignment sub-model may expand the word vector of word BCD so that the expanded vectors include, in total, as many elements as the word vectors of words BC and D in the reference word sequence; for example, the word vector of BCD can be copied, so that two word vectors corresponding to BCD serve as the expanded vectors. Through vector alignment, the sample aligned word vector sequences all include the same number of word vectors, and likewise for the matching-sample aligned word vector sequences.

Then, input the obtained sample aligned word vector sequences and matching-sample aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix. Specifically, the similarity matrix generation layer may combine each obtained sample aligned word vector sequence with each obtained matching-sample aligned word vector sequence pairwise. For each combination, the similarity matrix generation layer computes pairwise similarities between the word vectors in the sample aligned word vector sequence and those in the matching-sample aligned word vector sequence, obtaining a candidate similarity matrix for the combination. Each element of the candidate similarity matrix corresponds to one sample aligned word vector and one matching-sample aligned word vector, i.e., each element is the similarity value (for example, the cosine distance) between the corresponding pair of vectors. The similarity matrix generation layer may then obtain the similarity matrix from the candidate similarity matrices, for example by taking, at each element position, the maximum over the candidate similarity matrices.

As an example, as shown in FIG. 3, suppose the preset number is two. Matrices A1 and A2 correspond to the first and second word segmentation granularities, and each row of A1 and A2 is a sample aligned word vector. Matrices B1 and B2 correspond to the first and second granularities, and each row of B1 and B2 is a matching-sample aligned word vector. Combining A1, A2 with B1, B2 pairwise yields four combinations (A1-B1, A1-B2, A2-B1, A2-B2). Taking combination A1-B1 as an example, the similarity between each row of A1 and each row of B1 is determined, yielding the candidate similarity matrix X1 for A1-B1: the element in the first row and first column of X1 is the similarity between the first row of A1 and the first row of B1, the element in the first row and second column is the similarity between the first row of A1 and the second row of B1, and so on. Similarly, the candidate similarity matrices X2, X3, X4 are obtained for the other combinations. Finally, taking the maximum of the elements at the same positions in X1, X2, X3, X4 yields the similarity matrix Y.

In some optional implementations of this embodiment, the similarity matrix generation layer may include a word weight generation layer. The word weight generation layer is used to determine the weight, within the sample text indicated by the sample word sequence, of each sample word in the sample word sequence corresponding to a pre-specified word segmentation granularity. The similarity matrix generation layer uses the weights generated by the word weight generation layer and the generated similarity matrix to generate a weighted similarity matrix. Specifically, the word weight generation layer may use various related-art methods for determining the weight of a word in a text. For example, the TF-IDF algorithm may be used to determine the TF-IDF value of each sample word, and the proportion of each TF-IDF value in the total TF-IDF value (i.e., the sum of all TF-IDF values) is determined as the word's weight. As an example, each row of the similarity matrix Y shown in FIG. 3 may be multiplied by the weight of the sample word indicated by that row, yielding the final weighted similarity matrix. This implementation generates a weighted similarity matrix according to the weights of the words, so that the elements of the final similarity matrix characterize the similarity between two words more accurately, which in turn helps the finally trained text matching model determine the degree of similarity between two texts more accurately.

Finally, input the obtained similarity matrix into the convolutional neural network to obtain the first similarity value.

Specifically, the convolutional neural network can perform convolution operations, fully connected operations, and the like on the similarity matrix to obtain the first similarity value. The structure of the convolutional neural network may be any of various related-art structures, for example the LSF-SCNN structure.

In some optional implementations of this embodiment, the convolutional neural network may include at least one convolution sub-network and a similarity value generation layer. A convolution sub-network is used to perform convolution operations on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer is used to generate the similarity value based on the sub-similarity values. Specifically, the at least one convolution sub-network may include a convolution sub-network that performs convolution operations with a related-art two-dimensional convolution kernel (for example, of size 5x5). Generally, each convolution sub-network generates one sub-similarity value; the sub-similarity values are input into the similarity value generation layer, which operates on them to obtain the similarity value. For example, the similarity value generation layer may compute a weighted sum of the sub-similarity values using preset weights corresponding to them.

In addition, optionally, the initial model may further include other sub-models for determining the similarity between two texts, which may include, but are not limited to, at least one of the following: a Bag-of-Words (BOW) model, a Recurrent Neural Network (RNN) model, and the like. Generally, a sample word sequence can be selected from the input preset number of sample word sequences (for example, the sequence corresponding to a pre-specified word segmentation granularity), and a matching sample word sequence from the input preset number of matching sample word sequences; the selected sequences are input into the sub-model to obtain a sub-similarity value. The execution body may input the obtained sub-similarity value into the similarity value generation layer, so that the similarity value generation layer operates on the input sub-similarity values to obtain the similarity value.
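As a hedged sketch of one such optional sub-model, the following bag-of-words cosine similarity could produce a sub-similarity value that is fed into the similarity value generation layer alongside the convolution sub-networks' outputs; the exact form of the BOW sub-model is not specified by the disclosure.

    from collections import Counter
    import math

    def bow_sub_similarity(words_a, words_b):
        # Cosine similarity between the two word-count vectors.
        ca, cb = Counter(words_a), Counter(words_b)
        dot = sum(ca[w] * cb[w] for w in ca)
        na = math.sqrt(sum(v * v for v in ca.values()))
        nb = math.sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb) if na and nb else 0.0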
In some optional implementations of this embodiment, the at least one convolution sub-network may include a proximity convolution sub-network. The proximity convolution sub-network includes a proximity convolution kernel, and the proximity convolution kernel includes weights. The weights are used to characterize how strongly the distances between the positions, in the matching-use text, of the words matching the words included in the sample word sequence influence the determination of the similarity value. Here, the matching-use text is the text whose similarity with the text indicated by the input sample word sequences is being computed; when training the text matching model, it may be the matching text indicated by the input matching sample word sequences, or the non-matching text indicated by the input non-matching sample word sequences.

As an example, as shown in FIG. 4, suppose the similarity matrix 401 is a matrix with 3 rows and 10 columns, where A, B, C represent the words included in the sample word sequence and D, E, F, G, ..., M, N represent the word sequence determined from the matching-use text. The element in the first row and first column of the similarity matrix 401 is the similarity value between words A and D, the element in the first row and second column is the similarity value between words A and E, and so on. 402 is the proximity convolution kernel; as can be seen from the figure, the weights of its middle column are the largest and decrease toward both sides. The proximity convolution kernel 402 currently slides to the position shown in the figure, i.e., its middle column is aligned with the third column of the similarity matrix 401. After the elements at corresponding positions of the two matrices are multiplied, the result matrix 403 is obtained; taking the maximum of each row of 403 (i.e., 0.8, 0.8, 0.9) and adding these maxima yields the similarity value corresponding to the third column of the similarity matrix 401 (i.e., 2.5). As the proximity convolution kernel 402 slides, the similarity value corresponding to each column of the similarity matrix 401 is obtained, and the maximum of these values is the sub-similarity value determined by the proximity convolution sub-network. As FIG. 4 shows, when computing the similarity value for the third column, if the positions in the matching-use text of the words matching A, B, C are close to the position of the word corresponding to the third column (i.e., the word corresponding to F), the corresponding weights are large and the computed similarity value is large. Conversely, if they are far from the word corresponding to the third column, the computed similarity value is small (for example, although the similarity value for A-J in the figure equals that for A-E, both being 1, J is far from the third column, its weight 0.4 is small, and multiplying by the weight yields the small value 0.4).

As can be seen from FIG. 4, because the proximity convolution sub-network includes a proximity convolution kernel, the sub-similarity value it computes can reflect the distances between the positions of the matched words in the matching-use text, so that the computed similarity value characterizes the degree of similarity between two texts more accurately.

The execution body may continue to determine the second similarity value as follows. First, input the sample word sequences and the non-matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences corresponding to the input sample word sequences and the non-matching-sample aligned word vector sequences corresponding to the input non-matching sample word sequences. Then, input the obtained sample aligned word vector sequences and non-matching-sample aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix. Finally, input the obtained similarity matrix into the convolutional neural network to obtain the second similarity value. It should be noted that the execution body may determine the second similarity value by the same method used for the first similarity value, which is not repeated here.

Step 2023: compare the first similarity value with the second similarity value, and determine, according to the comparison result, whether the initial model has reached the preset optimization target.

Specifically, the execution body may use a preset loss function (for example, a hinge loss or squared hinge loss function) to compare the first and second similarity values. A loss value is computed with the loss function, and if the loss value meets a preset condition (for example, it is less than or equal to a preset value, or it no longer decreases), the initial model is determined to have reached the optimization target.

As an example, the preset loss function may be the hinge loss function, whose specific form in this embodiment is L = max(0, s2 - s1 + sigma), where L represents the loss value, max() takes the maximum of the values in parentheses, s2 is the second similarity value, s1 is the first similarity value, and sigma is a preset value. Training minimizes s2 - s1 + sigma; when s2 - s1 + sigma meets the preset condition, the initial model is determined to have reached the optimization target.

Step 2024: in response to determining that the optimization target is reached, determine that the initial model is the text matching model.

In some optional implementations of this embodiment, after step 202 the execution body may further perform the following step: in response to determining, according to the comparison result, that the initial model has not reached the optimization target, adjust the parameters of the initial model, reselect a training sample from the training samples in the training sample set that have not been selected, and continue the above training steps (steps 2021 to 2024) with the reselected training sample and the most recently adjusted initial model.

Here, the execution body may adjust the parameters of the initial model according to the comparison result in various ways, for example with the BP (Back Propagation) algorithm or the SGD (Stochastic Gradient Descent) algorithm.

Continuing to refer to FIG. 5, FIG. 5 is a schematic diagram of an application scenario of the method for generating a text matching model according to this embodiment. In the application scenario of FIG. 5, the electronic device 501 first obtains a training sample set 502, where a training sample includes a preset number (for example, three) of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences. For example, the preset number of sample word sequences may be word sequences extracted in advance from a sample text, each corresponding to one word segmentation granularity; likewise, the matching and non-matching sample word sequences may be extracted in advance from the matching and non-matching sample texts.

Then, the electronic device 501 selects a training sample 5021 from the training sample set 502 and performs the following training steps: input the sample word sequences 50211 and the matching sample word sequences 50212 included in the selected training sample 5021 into the initial model 503, obtaining a first similarity value 504 characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; input the sample word sequences 50211 and the non-matching sample word sequences 50213 included in the selected training sample 5021 into the initial model 503, obtaining a second similarity value 505 characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; compare the first similarity value 504 with the second similarity value 505 (for example, compute a loss value with the hinge loss function), and determine, according to the comparison result (for example, the loss value), whether the initial model 503 has reached the preset optimization target. In response to determining that the optimization target is reached (for example, when the loss value is less than or equal to a preset value), the initial model 503 is determined to be the text matching model 506.

The method provided by the above embodiment of the present disclosure obtains a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; then selects at least one training sample from the set and uses the selected training sample and the initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences, and a second similarity value characterizing the degree of similarity between the text indicated by the input sample word sequences and the text indicated by the non-matching sample word sequences; and trains the initial model according to the comparison of the first and second similarity values to obtain a text matching model. Model training thus uses the preset number of word sequences corresponding to the same text, so that the resulting text matching model can process these sequences more comprehensively and determine the similarity between two texts more accurately, which helps to improve the accuracy of text matching.
Referring further to FIG. 6, a flow 600 of an embodiment of the method for outputting text is shown. The flow 600 of the method for outputting text includes the following steps.

Step 601: obtain a target text and a text set to be matched.

In this embodiment, the execution body of the method for outputting text (for example, the server or terminal device shown in FIG. 1) may obtain the target text and the text set to be matched remotely or locally through a wired or wireless connection. The target text is text input by a user; typically it is text used to search for information, for example text the user entered in a search field displayed on the screen of the execution body. The text set to be matched may be a text set pre-stored in the execution body, or pre-stored on an electronic device communicatively connected with the execution body.

Step 602: perform word segmentation processing on the target text and on each text to be matched in the text set to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched.

In this embodiment, the execution body may perform word segmentation processing on the target text and on each text to be matched according to the preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched. The word segmentation granularity characterizes the number of characters a word includes when the text is segmented: with a large granularity a single word includes more characters, and with a small granularity it includes fewer. For example, large-granularity segmentation yields words such as "男朋友" (boyfriend), while small-granularity segmentation yields "男" (male) and "朋友" (friend). Methods for segmenting text at different granularities are well known in the art and are not repeated here.

Step 603: for each text to be matched in the text set to be matched, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text.

In this embodiment, for each text to be matched in the text set to be matched, the execution body may input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into the pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text. The text matching model is generated according to the method described in the embodiment corresponding to FIG. 2 above.

In some optional implementations of this embodiment, the word segmentation processing in step 602 includes: first, segmenting the target text and each text to be matched according to the preset number of word segmentation granularities, obtaining the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; then, determining the word alignment information corresponding to the preset number of target word sequences and to the preset number of to-be-matched word sequences corresponding to each text to be matched, so that the text matching model uses the word alignment information to generate the similarity value. The word alignment information is used to represent the correspondence between words in word sequences obtained with different word segmentation granularities. For a description of the word alignment information, reference may be made to the optional implementation in the embodiment corresponding to FIG. 2, which is not repeated here.

In this optional implementation, the text matching model may use the word alignment information to generate the similarity value. Specifically, the text matching model may include a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network. The vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to the word sequence, perform vector alignment on the corresponding word vector sequence to obtain the aligned word vector sequence. The similarity matrix generation layer is used to generate a similarity matrix from the aligned word vector sequences corresponding to the target word sequences and to the to-be-matched word sequences. The convolutional neural network is used to generate, from the obtained similarity matrix, a similarity value characterizing the degree of similarity between the text to be matched and the target text. For details of these components, reference may be made to the optional implementations in the embodiment corresponding to FIG. 2, which are not repeated here.

Step 604: based on the obtained similarity values, select texts to be matched from the text set to be matched and output them.

In this embodiment, the execution body may select and output texts to be matched based on the magnitudes of the obtained similarity values. Generally, the execution body may select texts in descending order of similarity value and then output them in various ways; for example, when the execution body is the server shown in FIG. 1, the server may send the selected texts, in descending order of similarity value, to the terminal device shown in FIG. 1, so that they are displayed on the screen of the terminal device.

In some optional implementations of this embodiment, the execution body may select and output texts to be matched as follows. First, select texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values, generally in descending order of the corresponding similarity value. Then, display the selected texts on a target display screen, i.e., the display screen on which the texts are to be shown; the target display screen may be a display screen included in the execution body, or a display screen of another electronic device communicatively connected with it. In this way, the texts most similar to the target text can be displayed in a targeted manner; since the display screen of the electronic device presenting the texts is of limited size, this implementation makes full use of that limited size, presents texts to the user selectively, and saves both the display resources of the screen and the storage resources used for the displayed texts.

The method provided by the above embodiment of the present disclosure obtains a target text and a text set to be matched, performs word segmentation processing on them according to the preset number of word segmentation granularities to generate the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched, inputs the to-be-matched word sequences and the target word sequences into the pre-trained text matching model to obtain similarity values characterizing the degree of similarity between each text to be matched and the target text, and finally selects and outputs texts to be matched based on the obtained similarity values. The text matching model is thus used effectively, the accuracy of the similarity values determined between texts is improved, texts matching the target text are output in a targeted manner, and the hardware resources of the electronic device used to display the texts matching the target text are saved.
Referring further to FIG. 7, as an implementation of the method shown in FIG. 2 above, the present disclosure provides an embodiment of an apparatus for generating a text matching model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.

As shown in FIG. 7, the apparatus 700 for generating a text matching model of this embodiment includes: a training sample obtaining unit 701, configured to obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; and a training unit 702, configured to select a training sample from the training sample set and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model, obtaining a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; compare the first similarity value with the second similarity value, and determine, according to the comparison result, whether the initial model has reached a preset optimization target; and in response to determining that the optimization target is reached, determine that the initial model is the text matching model.

In this embodiment, the training sample obtaining unit 701 may obtain the training sample set remotely or locally through a wired or wireless connection. A training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences. The words in each of these word sequences may include, but are not limited to, at least one of the following: single-character words, multi-character words, and phrases. Generally, the preset number is greater than or equal to two.

Specifically, the preset number of sample word sequences may correspond to a sample text, the preset number of matching sample word sequences to a matching sample text, and the preset number of non-matching sample word sequences to a non-matching sample text. The matching sample text may be a text highly correlated with the sample text, and the non-matching sample text a text weakly correlated with it. For example, the sample text may be a search sentence input by a user; the execution body used to generate the training samples may set the texts included in the search results that the user clicked as matching sample texts, and the texts the user did not click as non-matching texts.

The sample word sequences in the preset number of sample word sequences may be word sequences obtained by segmenting the sample text. The training sample obtaining unit 701 may also segment the sample text with a preset number of different word segmentation algorithms to obtain the preset number of sample word sequences. It should be understood that the execution body generating the sample word sequences may segment the matching text and the non-matching text with the same methods used for the sample text, obtaining the preset number of matching sample word sequences and the preset number of non-matching sample word sequences. The methods for segmenting text in this embodiment may include, but are not limited to, at least one of the following: dictionary-based methods, statistics-based methods, and semantics-based methods.

In this embodiment, the training unit 702 can select a training sample from the training sample set and perform the following training steps (steps 7021 to 7024).

Step 7021: input the sample word sequences and the matching sample word sequences included in the selected training sample into the initial model, obtaining a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences. Specifically, the initial model may include neural networks of various structures, for example a neural network with a Siamese structure or an LSF-SCNN (Lexical Semantic Feature based Skip Convolution Neural Network); it may be an untrained model with initialized parameters or an already trained model. Generally, the initial model can convert the words included in an input word sequence into vector form and determine the similarity value from the vectors: the larger the similarity value, the higher the similarity between the two texts. In practice, the similarity value can be determined from the distance between vectors (for example, the Euclidean distance or the cosine distance); for example, the cosine distance may be determined as the similarity value, or the reciprocal of the Euclidean distance may be determined as the similarity value. In this step, the input to the initial model is usually the preset number of sample word sequences and the preset number of matching sample word sequences included in one training sample; the initial model can perform processing such as vector conversion and distance calculation on them to obtain the first similarity value.

Step 7022: input the sample word sequences and the non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences. Specifically, in this step the input to the initial model is usually the preset number of sample word sequences and the preset number of non-matching sample word sequences included in one training sample; the initial model can obtain the second similarity value by the same method as in step 7021.

Step 7023: compare the first similarity value with the second similarity value, and determine, according to the comparison result, whether the initial model has reached the preset optimization target. Specifically, the training unit 702 may use a preset loss function (for example, a hinge loss or squared hinge loss function) to compare the first and second similarity values; a loss value is computed with the loss function, and if the loss value meets a preset condition (for example, it is less than or equal to a preset value, or it no longer decreases), the initial model is determined to have reached the optimization target.

Step 7024: in response to determining that the optimization target is reached, determine that the initial model is the text matching model.

In some optional implementations of this embodiment, the training sample obtaining unit 701 may include: an obtaining module, configured to obtain a sample text, a matching text that matches the obtained sample text, and a non-matching text that does not match it; a word segmentation module, configured to segment the obtained sample text, matching text, and non-matching text according to a preset number of word segmentation granularities, obtaining the preset number of sample word sequences corresponding to the sample text, the preset number of matching sample word sequences corresponding to the matching text, and the preset number of non-matching sample word sequences corresponding to the non-matching text; and a determining module, configured to determine the word alignment information corresponding to each of the obtained preset number of sample word sequences, matching sample word sequences, and non-matching sample word sequences, where the word alignment information is used to represent the correspondence between words in word sequences obtained with different word segmentation granularities.

In some optional implementations of this embodiment, the initial model may include a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network; and the training unit 702 may include: a first generation module (not shown in the figure), configured to input the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences corresponding to the input sample word sequences and the matching-sample aligned word vector sequences corresponding to the input matching sample word sequences, where the vector alignment sub-model is used to determine the word vectors of the words included in an input word sequence and, based on the word alignment information corresponding to the word sequence, perform vector alignment on the corresponding word vector sequence to obtain the aligned word vector sequence; to input the obtained sample aligned word vector sequences and matching-sample aligned word vector sequences into the similarity matrix generation layer to obtain a similarity matrix; and to input the obtained similarity matrix into the convolutional neural network to obtain the first similarity value; and a second generation module (not shown in the figure), configured to input the sample word sequences and the non-matching sample word sequences included in the selected training sample into the vector alignment sub-model, obtaining the sample aligned word vector sequences and the non-matching-sample aligned word vector sequences; to input them into the similarity matrix generation layer to obtain a similarity matrix; and to input the obtained similarity matrix into the convolutional neural network to obtain the second similarity value.

In some optional implementations of this embodiment, the convolutional neural network includes at least one convolution sub-network and a similarity value generation layer; each convolution sub-network is used to perform convolution operations on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer is used to generate the similarity value based on the sub-similarity values.

In some optional implementations of this embodiment, the at least one convolution sub-network includes a proximity convolution sub-network; the proximity convolution sub-network includes a proximity convolution kernel, the proximity convolution kernel includes weights, and the weights are used to characterize how strongly the distances between the positions, in the matching-use text, of the words matching the words included in the sample word sequence influence the determination of the similarity value.

In some optional implementations of this embodiment, the similarity matrix generation layer includes a word weight generation layer; the word weight generation layer is used to determine the weight, within the text indicated by the sample word sequence, of each sample word in the sample word sequence corresponding to a pre-specified word segmentation granularity, and the similarity matrix generation layer is used to generate a weighted similarity matrix from the weights generated by the word weight generation layer and the generated similarity matrix.

In some optional implementations of this embodiment, the apparatus 700 may further include a selection unit (not shown in the figure), configured to, in response to determining that the optimization target has not been reached, adjust the parameters of the initial model, reselect a training sample from the training samples in the training sample set that have not been selected, and continue the training steps with the reselected training sample and the most recently adjusted initial model.

The apparatus 700 provided by the above embodiment of the present disclosure obtains a training sample set; selects at least one training sample from the set and uses the selected training sample and the initial model to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences, and a second similarity value characterizing the degree of similarity between the text indicated by the input sample word sequences and the text indicated by the non-matching sample word sequences; and trains the initial model according to the comparison of the first and second similarity values to obtain a text matching model. Model training thus uses the preset number of word sequences corresponding to the same text, so that the resulting text matching model can process these sequences more comprehensively and determine the similarity between two texts more accurately, which helps to improve the accuracy of text matching.
Referring further to FIG. 8, as an implementation of the method shown in FIG. 6 above, the present disclosure provides an embodiment of an apparatus for outputting text. The apparatus embodiment corresponds to the method embodiment shown in FIG. 6, and the apparatus can be applied to various electronic devices.

As shown in FIG. 8, the apparatus 800 for outputting text of this embodiment includes: a text obtaining unit 801, configured to obtain a target text and a text set to be matched, where the target text is text input by a user; a word segmentation unit 802, configured to perform word segmentation processing on the target text and on each text to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; a matching unit 803, configured to, for each text to be matched in the text set to be matched, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text, where the text matching model is generated according to the method described in the embodiment corresponding to FIG. 2; and an output unit 804, configured to select and output texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values.

In this embodiment, the text obtaining unit 801 may obtain the target text and the text set to be matched remotely or locally through a wired or wireless connection. The target text is text input by the user; typically it is text used to search for information, for example text the user entered in a search field displayed on the screen of the apparatus 800. The text set to be matched may be a text set pre-stored in the apparatus 800, or pre-stored on an electronic device communicatively connected with the apparatus 800.

In this embodiment, the word segmentation unit 802 may perform word segmentation processing on the target text and on each text to be matched according to the preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched. The word segmentation granularity characterizes the number of characters a word includes when the text is segmented: with a large granularity a single word includes more characters, and with a small granularity it includes fewer. For example, large-granularity segmentation yields words such as "男朋友" (boyfriend), while small-granularity segmentation yields "男" (male) and "朋友" (friend). Methods for segmenting text at different granularities are well known in the art and are not repeated here.

In this embodiment, for each text to be matched in the text set to be matched, the matching unit 803 may input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into the pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text. The text matching model is generated according to the method described in the embodiment corresponding to FIG. 2 above.

In this embodiment, the output unit 804 may select and output texts to be matched based on the magnitudes of the obtained similarity values. Generally, the output unit 804 selects texts in descending order of similarity value and then outputs them in various ways; for example, when the apparatus 800 is set in the server shown in FIG. 1, the apparatus 800 may send the selected texts, in descending order of similarity value, to the terminal device shown in FIG. 1, so that they are displayed on the screen of the terminal device.

In some optional implementations of this embodiment, the word segmentation unit 802 may include: a word segmentation module (not shown in the figure), configured to segment the target text and each text to be matched according to the preset number of word segmentation granularities, obtaining the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; and a determining module (not shown in the figure), configured to determine the word alignment information corresponding to the preset number of target word sequences and to the preset number of to-be-matched word sequences corresponding to each text to be matched, so that the text matching model uses the word alignment information to generate the similarity value.

In some optional implementations of this embodiment, the output unit 804 may include: a selection module (not shown in the figure), configured to select texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values; and a display module (not shown in the figure), configured to display the selected texts on the target display screen.

The apparatus 800 provided by the above embodiment of the present disclosure obtains a target text and a text set to be matched, performs word segmentation processing on them according to the preset number of word segmentation granularities to generate the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched, inputs the to-be-matched word sequences and the target word sequences into the pre-trained text matching model to obtain similarity values characterizing the degree of similarity between each text to be matched and the target text, and finally selects and outputs texts to be matched based on the obtained similarity values. The text matching model is thus used effectively, the accuracy of the similarity values determined between texts is improved, texts matching the target text are output in a targeted manner, and the hardware resources of the electronic device used to display the texts matching the target text are saved.
Referring now to FIG. 9, it shows a schematic structural diagram of an electronic device (for example, the server or terminal device in FIG. 1) 900 suitable for implementing the embodiments of the present disclosure. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 9 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 9, the electronic device 900 may include a processing device (for example, a central processing unit or graphics processor) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

Generally, the following devices can be connected to the I/O interface 905: input devices 906 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 907 such as a liquid crystal display (LCD), speakers, and vibrators; storage devices 908 such as memory; and a communication device 909. The communication device 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 9 shows an electronic device 900 having various devices, it should be understood that it is not required to implement or possess all of them; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 9 may represent one device or, as needed, multiple devices.

In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowcharts can be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication device 909, installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed. It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable medium, or any combination of the two. The computer-readable medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

In the embodiments of the present disclosure, the computer-readable medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code; such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable medium described above, and it can send, propagate, or transmit the program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to wires, optical cables, RF (radio frequency), and the like, or any suitable combination of the above.

The above computer-readable medium may be included in the above electronic device, or it may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: obtain a training sample set, where a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences; select a training sample from the training sample set, and perform the following training steps: input the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model, obtaining a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences; input the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, obtaining a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences; compare the first similarity value with the second similarity value, and determine, according to the comparison result, whether the initial model has reached a preset optimization target; and in response to determining that the optimization target is reached, determine that the initial model is the text matching model.

In addition, when the one or more programs are executed by the electronic device, the electronic device may also be caused to: obtain a target text and a text set to be matched, where the target text is text input by a user; perform word segmentation processing on the target text and on each text to be matched according to a preset number of word segmentation granularities, generating the preset number of target word sequences corresponding to the target text and the preset number of to-be-matched word sequences corresponding to each text to be matched; for each text to be matched in the text set to be matched, input the preset number of to-be-matched word sequences corresponding to that text and the preset number of target word sequences into the pre-trained text matching model, obtaining a similarity value characterizing the degree of similarity between the text to be matched and the target text; and select and output texts to be matched from the text set to be matched based on the magnitudes of the obtained similarity values.

The computer program code for performing the operations of the embodiments of the present disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram can represent a module, program segment, or part of code containing one or more executable instructions for realizing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in a different order from that marked in the drawings; for example, two blocks shown in succession can actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Each block in the block diagrams and/or flowcharts, and any combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments described in the present disclosure can be implemented in software or hardware. The described units can also be provided in a processor; for example, a processor may be described as including a training sample acquisition unit and a training unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves; for example, the training sample acquisition unit can also be described as "a unit for acquiring a training sample set."

The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (14)

  1. A method for generating a text matching model, comprising:
    acquiring a training sample set, wherein a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences;
    selecting a training sample from the training sample set, and performing the following training steps:
    inputting the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model, to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences;
    inputting the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, to obtain a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences;
    comparing the first similarity value with the second similarity value, and determining, according to the comparison result, whether the initial model has reached a preset optimization objective;
    in response to determining that the optimization objective has been reached, determining the initial model to be the text matching model.
  2. The method according to claim 1, wherein said acquiring a training sample set comprises:
    acquiring a sample text, a matching text that matches the acquired sample text, and a non-matching text that does not match the acquired sample text;
    segmenting the acquired sample text, matching text, and non-matching text, respectively, according to a preset number of word segmentation granularities, to obtain a preset number of sample word sequences corresponding to the sample text, a preset number of matching sample word sequences corresponding to the matching text, and a preset number of non-matching sample word sequences corresponding to the non-matching text;
    determining word alignment information respectively corresponding to the obtained preset number of sample word sequences, the preset number of matching sample word sequences, and the preset number of non-matching sample word sequences, wherein the word alignment information characterizes the correspondence between words in the word sequences corresponding to different word segmentation granularities for the same text.
  3. The method according to claim 2, wherein the initial model comprises a vector alignment sub-model, a similarity matrix generation layer, and a convolutional neural network; and
    said obtaining a first similarity value and obtaining a second similarity value comprises:
    inputting the sample word sequences and the matching sample word sequences included in the selected training sample into the vector alignment sub-model, to obtain aligned sample word vector sequences corresponding to the input sample word sequences and aligned matching sample word vector sequences corresponding to the input matching sample word sequences, wherein the vector alignment sub-model is used to determine word vectors of the words included in the input word sequences and, based on the word alignment information corresponding to the word sequences, to perform vector alignment on the word vector sequences corresponding to the input word sequences, obtaining aligned word vector sequences corresponding to the input word sequences; inputting the obtained aligned sample word vector sequences and aligned matching sample word vector sequences into the similarity matrix generation layer, to obtain a similarity matrix; and inputting the obtained similarity matrix into the convolutional neural network, to obtain the first similarity value;
    inputting the sample word sequences and the non-matching sample word sequences included in the selected training sample into the vector alignment sub-model, to obtain aligned sample word vector sequences corresponding to the input sample word sequences and aligned non-matching sample word vector sequences corresponding to the input non-matching sample word sequences; inputting the obtained aligned sample word vector sequences and aligned non-matching sample word vector sequences into the similarity matrix generation layer, to obtain a similarity matrix; and inputting the obtained similarity matrix into the convolutional neural network, to obtain the second similarity value.
  4. The method according to claim 3, wherein the convolutional neural network comprises at least one convolution sub-network and a similarity value generation layer, the convolution sub-network being used to perform a convolution operation on the input similarity matrix to generate a sub-similarity value, and the similarity value generation layer being used to generate a similarity value based on the sub-similarity values.
  5. The method according to claim 4, wherein the at least one convolution sub-network comprises a proximity convolution sub-network, the proximity convolution sub-network comprises a proximity convolution kernel, and the proximity convolution kernel comprises weights, the weights characterizing the degree to which the distances between the positions, in the text used for matching, of the words that match the words included in the sample word sequences influence the determination of the similarity value.
  6. The method according to claim 3, wherein the similarity matrix generation layer comprises a word weight generation layer, the word weight generation layer being used to determine the weights, in the text indicated by the sample word sequence, of the sample words in the sample word sequence corresponding to a pre-specified word segmentation granularity, and the similarity matrix generation layer being used to generate a weighted similarity matrix using the weights generated by the word weight generation layer and the similarity matrix already generated.
  7. The method according to any one of claims 1-6, wherein the method further comprises:
    in response to determining that the optimization objective has not been reached, adjusting parameters of the initial model, reselecting a training sample from training samples in the training sample set that have not been selected, and continuing to perform the training steps using the reselected training sample and the initial model with the most recently adjusted parameters.
  8. A method for outputting text, comprising:
    acquiring a target text and a set of texts to be matched, wherein the target text is text input by a user;
    performing word segmentation processing on the target text and on each text to be matched in the set of texts to be matched, respectively, according to a preset number of word segmentation granularities, to generate a preset number of target word sequences corresponding to the target text and a preset number of word sequences to be matched corresponding to each text to be matched in the set;
    for each text to be matched in the set of texts to be matched, inputting the preset number of word sequences to be matched corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, to obtain a similarity value characterizing the degree of similarity between that text to be matched and the target text, wherein the text matching model is generated according to the method of any one of claims 1-7;
    selecting and outputting texts to be matched from the set of texts to be matched based on the magnitudes of the obtained similarity values.
  9. The method according to claim 8, wherein the word segmentation processing comprises:
    segmenting the target text and each text to be matched in the set of texts to be matched, respectively, according to the preset number of word segmentation granularities, to obtain the preset number of target word sequences corresponding to the target text and the preset number of word sequences to be matched corresponding to each text to be matched in the set;
    determining word alignment information respectively corresponding to the preset number of target word sequences and to the preset number of word sequences to be matched corresponding to each text to be matched in the set, so that the text matching model generates similarity values using the word alignment information.
  10. The method according to claim 8 or 9, wherein said selecting and outputting texts to be matched from the set of texts to be matched based on the magnitudes of the obtained similarity values comprises:
    selecting texts to be matched from the set of texts to be matched based on the magnitudes of the obtained similarity values;
    displaying the selected texts to be matched on a target display screen.
  11. An apparatus for generating a text matching model, comprising:
    a training sample acquisition unit, configured to acquire a training sample set, wherein a training sample includes a preset number of sample word sequences, a preset number of matching sample word sequences, and a preset number of non-matching sample word sequences;
    a training unit, configured to select a training sample from the training sample set and to perform the following training steps:
    inputting the preset number of sample word sequences and the preset number of matching sample word sequences included in the selected training sample into an initial model, to obtain a first similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of matching sample word sequences;
    inputting the preset number of sample word sequences and the preset number of non-matching sample word sequences included in the selected training sample into the initial model, to obtain a second similarity value characterizing the degree of similarity between the text indicated by the input preset number of sample word sequences and the text indicated by the preset number of non-matching sample word sequences;
    comparing the first similarity value with the second similarity value, and determining, according to the comparison result, whether the initial model has reached a preset optimization objective;
    in response to determining that the optimization objective has been reached, determining the initial model to be the text matching model.
  12. An apparatus for outputting text, comprising:
    a text acquisition unit, configured to acquire a target text and a set of texts to be matched, wherein the target text is text input by a user;
    a word segmentation unit, configured to perform word segmentation processing on the target text and on each text to be matched in the set of texts to be matched, respectively, according to a preset number of word segmentation granularities, to generate a preset number of target word sequences corresponding to the target text and a preset number of word sequences to be matched corresponding to each text to be matched in the set;
    a matching unit, configured to, for each text to be matched in the set of texts to be matched, input the preset number of word sequences to be matched corresponding to that text and the preset number of target word sequences into a pre-trained text matching model, to obtain a similarity value characterizing the degree of similarity between that text to be matched and the target text, wherein the text matching model is generated according to the method of any one of claims 1-7;
    an output unit, configured to select and output texts to be matched from the set of texts to be matched based on the magnitudes of the obtained similarity values.
  13. An electronic device, comprising:
    one or more processors; and
    a storage apparatus on which one or more programs are stored,
    which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-10.
  14. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-10.
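As a rough illustration of the scoring pipeline in claims 3 to 6 above (aligned word vectors, a similarity matrix, optional word weighting, then a convolutional network), the following numpy sketch computes a cosine similarity matrix and applies per-word weights. The embeddings, the fixed weights, and the cosine choice are assumptions for illustration, and the convolution sub-networks are elided.

```python
# A minimal sketch: aligned word vectors -> similarity matrix ->
# word-weighted matrix, which would then feed the convolutional network.

import numpy as np

def similarity_matrix(sample_vecs: np.ndarray, match_vecs: np.ndarray) -> np.ndarray:
    """Entry (i, j) is the cosine similarity of sample word i and match word j."""
    a = sample_vecs / np.linalg.norm(sample_vecs, axis=1, keepdims=True)
    b = match_vecs / np.linalg.norm(match_vecs, axis=1, keepdims=True)
    return a @ b.T

# Aligned word vector sequences (rows are words, already vector-aligned).
sample_vecs = np.array([[0.2, 0.8, 0.1], [0.9, 0.1, 0.3]])
match_vecs = np.array([[0.3, 0.7, 0.2], [0.1, 0.2, 0.9]])

sim = similarity_matrix(sample_vecs, match_vecs)

# Claim 6: weight rows by per-word importance from a word weight generation
# layer (fixed numbers here stand in for learned weights).
word_weights = np.array([0.7, 0.3])
weighted_sim = sim * word_weights[:, None]

# Claims 4-5: the (weighted) matrix would then enter one or more convolution
# sub-networks whose sub-similarity values are combined into the final value.
print(weighted_sim)
```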
PCT/CN2020/078584 2019-03-12 2020-03-10 Method and apparatus for generating a text matching model WO2020182122A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910184893.2A CN109947919B (zh) 2019-03-12 2019-03-12 Method and apparatus for generating a text matching model
CN201910184893.2 2019-03-12

Publications (1)

Publication Number Publication Date
WO2020182122A1 (zh) 2020-09-17

Family

ID=67009743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078584 WO2020182122A1 (zh) 2019-03-12 2020-03-10 Method and apparatus for generating a text matching model

Country Status (2)

Country Link
CN (1) CN109947919B (zh)
WO (1) WO2020182122A1 (zh)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947919B (zh) 2019-03-12 2020-05-15 北京字节跳动网络技术有限公司 Method and apparatus for generating a text matching model
CN112446405A (zh) 2019-09-04 2021-03-05 杭州九阳小家电有限公司 User intention guiding method for household appliance customer service, and smart household appliance
CN110633360B (zh) 2019-09-16 2023-06-20 腾讯科技(上海)有限公司 Semantic matching method and related apparatus
CN110795913B (zh) 2019-09-30 2024-04-12 北京大米科技有限公司 Text encoding method and apparatus, storage medium, and terminal
CN111225227A (zh) 2020-01-03 2020-06-02 网易(杭州)网络有限公司 Bullet-screen comment publishing method, model generation method, and apparatus
CN111291563B (zh) 2020-01-20 2023-09-01 腾讯科技(深圳)有限公司 Word vector alignment method and word vector alignment model training method
CN113221550B (zh) 2020-02-06 2023-09-29 百度在线网络技术(北京)有限公司 Text filtering method, apparatus, device, and medium
CN111310478B (zh) 2020-03-18 2023-09-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vectors
CN111783424B (zh) 2020-06-17 2024-02-13 泰康保险集团股份有限公司 Text sentence segmentation method and apparatus
CN111950272B (zh) 2020-06-23 2023-06-27 北京百度网讯科技有限公司 Text similarity generation method and apparatus, and electronic device
CN111897950A (zh) 2020-07-29 2020-11-06 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN111897951A (zh) 2020-07-29 2020-11-06 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN111984814B (zh) 2020-08-10 2024-04-12 广联达科技股份有限公司 Stirrup matching method and apparatus for construction drawings
CN112668664B (zh) 2021-01-06 2022-11-15 安徽迪科数金科技有限公司 Intelligent-speech-based conversational script training method
CN112765960B (zh) 2021-02-07 2022-11-25 成都新潮传媒集团有限公司 Text matching method and apparatus, and computer device
CN113283351B (zh) 2021-05-31 2024-02-06 深圳神目信息技术有限公司 Video plagiarism detection method using a CNN-optimized similarity matrix
CN115238049B (zh) 2022-06-17 2023-08-04 北京优酷科技有限公司 Script annotation method and electronic device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897403B (zh) 2017-02-14 2019-03-26 中国科学院电子学研究所 Fine-grained Chinese attribute alignment method for knowledge graph construction
CN109299262B (zh) 2018-10-09 2022-04-15 中山大学 Text entailment relation recognition method fusing multi-granularity information

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424279A (zh) 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance computation method and apparatus
CN104715063A (zh) 2015-03-31 2015-06-17 百度在线网络技术(北京)有限公司 Search ranking method and apparatus
US9852648B2 (en) 2015-07-10 2017-12-26 Fujitsu Limited Extraction of knowledge points and relations from learning materials
CN108509407A (zh) 2017-02-27 2018-09-07 广东神马搜索科技有限公司 Text semantic similarity computation method, apparatus, and user terminal
CN107315772A (zh) 2017-05-24 2017-11-03 北京邮电大学 Deep-learning-based question matching method and apparatus
CN107239574A (zh) 2017-06-29 2017-10-10 北京神州泰岳软件股份有限公司 Knowledge-question matching method and apparatus for an intelligent question answering system
CN109947919A (zh) 2019-03-12 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for generating a text matching model

Also Published As

Publication number Publication date
CN109947919B (zh) 2020-05-15
CN109947919A (zh) 2019-06-28

Legal Events

121  Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20769054; Country of ref document: EP; Kind code of ref document: A1.
NENP  Non-entry into the national phase. Ref country code: DE.
32PN  Ep: public notification in the EP bulletin as address of the addressee cannot be established. Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21/01/2022).
122  Ep: PCT application non-entry in European phase. Ref document number: 20769054; Country of ref document: EP; Kind code of ref document: A1.