WO2022001888A1 - Information generation method and apparatus based on a word vector generation model (基于词向量生成模型的信息生成方法和装置) - Google Patents

Information generation method and apparatus based on a word vector generation model (基于词向量生成模型的信息生成方法和装置)

Info

Publication number: WO2022001888A1
Application number: PCT/CN2021/102487
Authority: WIPO (PCT)
Prior art keywords: word, word vector, matched, vector generation, generation model
Other languages: English (en), French (fr)
Inventors: 彭宗徽, 张永华
Original assignee: 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Application filed by 北京字节跳动网络技术有限公司
Publication of WO2022001888A1

Classifications

    • G06F40/30: Handling natural language data; Semantic analysis
    • G06F16/335: Information retrieval; Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F18/22: Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
    • G06F40/216: Natural language analysis; Parsing using statistical methods
    • G06F40/237: Natural language analysis; Lexical tools
    • G06F40/284: Recognition of textual entities; Lexical analysis, e.g. tokenisation or collocates

Definitions

  • The embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for generating information based on a word vector generation model.
  • In search technology, the construction of word vectors and the determination of similarity are the basis for matching candidate information against search words.
  • A related approach is usually to pre-build word-level word vectors and then generate recall information based on matching of prefixes, characters, and the like.
  • The embodiments of the present application propose an information generation method and apparatus based on a word vector generation model.
  • In a first aspect, an embodiment of the present application provides an information generation method based on a word vector generation model. The method includes: obtaining a search word; inputting the search word into a pre-trained word vector generation model to obtain a word vector corresponding to the search word, where the word vector generation model is used to generate word vectors based on non-semantic similarity, and the non-semantic similarity includes at least one of the following: similar in sound, similar in shape; and generating the similarity between the word vector corresponding to the search word and the word vectors to be matched in a preset set of word vectors to be matched, where the word vectors to be matched in the set are obtained based on the word vector generation model.
  • In some embodiments, the method further includes: selecting a first target number of word vectors to be matched from the set of word vectors to be matched according to the determined similarity; reordering based on the selected first target number of word vectors to be matched to generate a returned word sequence, where the order of the words in the returned word sequence corresponds to the order of the reordered word vectors to be matched; and sending the returned word sequence to a target device.
  • In a second aspect, an embodiment of the present application provides a method for training a word vector generation model. The method includes: obtaining an initial model, where the initial model includes an initial word vector generation model and an output layer; obtaining a training sample set, where the training samples in the training sample set include at least two similar words, and the similarity includes at least one of the following: similar in sound, similar in shape; and using the first word of a training sample in the training sample set as the input of the initial model, using the second word corresponding to the input first word as the expected output, and determining the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
  • In some embodiments, obtaining the training sample set includes: obtaining a first historical matching word, where the first historical matching word includes a word selected from the search results fed back according to a first historical search word; obtaining at least one first historical search word corresponding to the first historical matching word; and combining the at least one first historical search word corresponding to the first historical matching word into a training sample.
  • In some embodiments, obtaining the training sample set includes: obtaining a second historical search word; obtaining at least one second historical matching word corresponding to the second historical search word, where the second historical matching word includes a search result fed back according to the second historical search word; selecting, according to the click-through rate, a second target number of second historical matching words from the at least one second historical matching word; and combining the second historical search word and the selected second target number of second historical matching words into a training sample.
  • In some embodiments, the words included in the training samples in the training sample set include words in a phonetic script and at least one n-gram word corresponding to those words.
  • In some embodiments, obtaining the training sample set includes: obtaining a target word; generating at least one n-gram word corresponding to the target word; performing morphological transformation on the target word and the corresponding at least one n-gram word to generate a transformed word set; and generating training samples based on the transformed word set.
  • In some embodiments, the morphological transformation includes character replacement; and performing morphological transformation on the target word and the corresponding at least one n-gram word to generate a transformed word set includes: selecting a word to be replaced from the target word and the corresponding at least one n-gram word; and replacing characters in the word to be replaced according to a preset probability to generate a transformed word, where the preset probability is associated with the arrangement positions of keys representing different characters on the keyboard.
  • In a third aspect, an embodiment of the present application provides an information generation apparatus based on a word vector generation model. The apparatus includes: a word acquisition unit, configured to acquire a search word; a vector generation unit, configured to input the search word into a pre-trained word vector generation model to obtain a word vector corresponding to the search word, where the word vector generation model is used to generate word vectors based on non-semantic similarity, and the non-semantic similarity includes at least one of the following: similar in sound, similar in shape; and a similarity generation unit, configured to generate the similarity between the word vector corresponding to the search word and the word vectors to be matched in a preset set of word vectors to be matched, where the word vectors to be matched in the set are obtained based on the word vector generation model.
  • In some embodiments, the apparatus further includes: a selection unit, configured to select a first target number of word vectors to be matched from the set of word vectors to be matched according to the determined similarity; a sorting unit, configured to reorder based on the selected first target number of word vectors to be matched and generate a returned word sequence, where the order of the words in the returned word sequence corresponds to the order of the reordered word vectors to be matched; and a sending unit, configured to send the returned word sequence to a target device.
  • In a fourth aspect, an embodiment of the present application provides an apparatus for training a word vector generation model. The apparatus includes: a model acquisition unit, configured to acquire an initial model, where the initial model includes an initial word vector generation model and an output layer; a sample acquisition unit, configured to acquire a training sample set, where the training samples in the training sample set include at least two similar words, and the similarity includes at least one of the following: similar in sound, similar in shape; and a training unit, configured to use the first word of a training sample in the training sample set as the input of the initial model, use the second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
  • In some embodiments, the sample acquisition unit includes: a first acquisition subunit, configured to acquire a first historical matching word, where the first historical matching word includes a word selected from the search results fed back according to a first historical search word; a second acquisition subunit, configured to acquire at least one first historical search word corresponding to the first historical matching word; and a first combination subunit, configured to combine the at least one first historical search word corresponding to the first historical matching word into a training sample.
  • In some embodiments, the sample acquisition unit includes: a third acquisition subunit, configured to acquire a second historical search word; a fourth acquisition subunit, configured to acquire at least one second historical matching word corresponding to the second historical search word, where the second historical matching word includes a search result fed back according to the second historical search word; a selection subunit, configured to select, according to the click-through rate, a second target number of second historical matching words from the at least one second historical matching word; and a second combination subunit, configured to combine the second historical search word and the selected second target number of second historical matching words into a training sample.
  • In some embodiments, the words included in the training samples in the training sample set include words in a phonetic script and at least one n-gram word corresponding to those words.
  • In some embodiments, the sample acquisition unit includes: a fifth acquisition subunit, configured to acquire a target word; a first generation subunit, configured to generate at least one n-gram word corresponding to the target word; a second generation subunit, configured to perform morphological transformation on the target word and the corresponding at least one n-gram word to generate a transformed word set; and a third generation subunit, configured to generate training samples based on the transformed word set.
  • In some embodiments, the morphological transformation includes character replacement.
  • In some embodiments, the second generation subunit includes: a selection module, configured to select a word to be replaced from the target word and the corresponding at least one n-gram word; and a generation module, configured to replace characters in the word to be replaced according to a preset probability to generate a transformed word, where the preset probability is associated with the arrangement positions of keys representing different characters on the keyboard.
  • An embodiment of the present application also provides a server. The server includes: one or more processors; and a storage device on which one or more programs are stored. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any one of the implementations of the first aspect.
  • An embodiment of the present application also provides a computer-readable medium on which a computer program is stored; when the program is executed by a processor, the method described in any one of the implementations of the first aspect is implemented.
  • In the information generation method and apparatus based on a word vector generation model provided by the embodiments of the present application, a search word is first obtained; then, the search word is input into a pre-trained word vector generation model to obtain a word vector corresponding to the search word, where the word vector generation model is used to generate word vectors based on non-semantic similarity, and the non-semantic similarity includes at least one of the following: similar in sound, similar in shape; finally, the similarity between the word vector corresponding to the search word and the word vectors to be matched in a preset set of word vectors to be matched is generated, where the word vectors to be matched in the set are obtained based on the word vector generation model.
  • In this way, the similarity relationships, learned by the word vector generation model, among features other than the meaning of the word itself can be fully utilized, significantly improving the quality of fuzzy retrieval in scenarios where the input contains errors and in searches without obvious semantics (such as person names).
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
  • FIG. 2 is a flowchart of an embodiment of an information generation method based on a word vector generation model according to the present application
  • FIG. 3 is a flowchart of an embodiment of a method for training a word vector generation model according to the present application
  • FIG. 4 is a schematic diagram of an application scenario of a method for training a word vector generation model according to an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of an embodiment of an information generation device based on a word vector generation model according to the present application
  • FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for training a word vector generation model according to the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.
  • FIG. 1 shows an exemplary architecture 100 to which the word vector generation model-based information generation method or word vector generation model-based information generation apparatus of the present application can be applied.
  • The system architecture 100 may include terminal devices 101, 102 and 103, networks 104 and 106, and servers 105 and 107.
  • the networks 104, 106 are used to provide a medium of communication links between the terminal devices 101, 102, 103 and the server 105, and between the server 105 and the server 107, respectively.
  • the networks 104, 106 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the terminal devices 101, 102, and 103 interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, reading applications, and the like.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • The terminal devices 101, 102, and 103 can be various electronic devices that have display screens and support searching, including but not limited to smart phones, tablet computers, e-book readers, laptop computers, desktop computers, and the like.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
  • The server 105 may be a server that provides various services, for example, a background server that provides support for web pages displayed on the terminal devices 101, 102, and 103.
  • the server 105 may be configured to execute the above-mentioned information generation method based on the word vector generation model.
  • the server 107 may be a server for training a word vector generation model.
  • The background server 105 can obtain the trained word vector generation model from the server 107, and then use the acquired word vector generation model to analyze search words received from the terminal devices and generate processing results (for example, search results matching the search words) to feed back to the terminal devices.
  • The above-mentioned server 105 can also be used to train the word vector generation model, in which case the trained word vector generation model can be stored locally on the server 105 and extracted directly by it; in this case, the network 106 and the server 107 may be omitted.
  • the server may be hardware or software.
  • the server can be implemented as a distributed server cluster composed of multiple servers, or can be implemented as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
  • the information generation method based on the word vector generation model provided by the embodiments of the present application is generally executed by the server 105 , and accordingly, the information generation device based on the word vector generation model is generally set in the server 105 .
  • The numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
  • Continuing to refer to FIG. 2, it shows a process 200 of an embodiment of the information generation method based on the word vector generation model.
  • the process 200 of the information generation method based on the word vector generation model includes the following steps:
  • Step 201: acquire a search word.
  • In this embodiment, the execution body of the information generation method based on the word vector generation model (for example, the server 105 shown in FIG. 1) can acquire the search word through a wired or wireless connection.
  • The above-mentioned search word may generally be a word sent by a communicatively connected terminal device in order to obtain search results.
  • As an example, the above search word may include a user name, such as "Tutu", or a hashtag, such as "blue sky".
  • Step 202: input the search word into a pre-trained word vector generation model to obtain a word vector corresponding to the search word.
  • In this embodiment, the above-mentioned execution body may input the search word obtained in step 201 into a pre-trained word vector generation model, and extract the word vector corresponding to the search word from the hidden layer (i.e., the input of the output layer) of the word vector generation model.
  • The above word vector generation model can be used to generate word vectors based on non-semantic similarity.
  • In this embodiment, the above-mentioned non-semantic similarity includes at least one of the following: similar in sound, similar in shape.
  • The above word vector generation model may include various language models obtained by training with training samples that exhibit the above non-semantic similarity.
  • Step 203: generate the similarity between the word vector corresponding to the search word and the word vectors to be matched in the preset set of word vectors to be matched.
  • In this embodiment, the above-mentioned execution body can use various vector similarity generation methods to generate the similarity between the word vector corresponding to the search word and the word vectors to be matched in the preset set of word vectors to be matched.
  • The word vectors to be matched in the above-mentioned set are obtained based on the above-mentioned word vector generation model.
  • The above-mentioned set of word vectors to be matched may include the word vectors obtained by inputting a preset set of words to be matched into the above-mentioned word vector generation model.
  • The above-mentioned words to be matched may include various historical data, which may include, but are not limited to, user names of registered users, published topic names, and the like.
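As an illustration of steps 202 and 203, the sketch below computes cosine similarity between the word vector of a search word and a precomputed matrix of word vectors to be matched. Cosine similarity is only one possible vector similarity generation method (the text above does not fix one), and the names `model.word_vector` and `to_be_matched` are hypothetical.

```python
import numpy as np

def cosine_similarities(query_vec: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query word vector and each row of a
    precomputed matrix of word vectors to be matched."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

# Hypothetical usage: `model` is the trained word vector generation model,
# `to_be_matched` the preset words (e.g., registered user names, topic names).
# query_vec = model.word_vector("Tutu")
# candidates = np.stack([model.word_vector(w) for w in to_be_matched])
# sims = cosine_similarities(query_vec, candidates)
```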
  • the above-mentioned execution body may further perform the following steps:
  • In the first step, a first target number of word vectors to be matched are selected from the set of word vectors to be matched.
  • the execution subject may select the first target number of word vectors to be matched from the set of word vectors to be matched in various ways.
  • As an example, the above-mentioned execution body may select the first target number of word vectors to be matched from the above-mentioned set in descending order of the determined similarity.
  • the execution subject may select from the set of word vectors to be matched, the first target number of word vectors to be matched whose similarity is greater than a preset threshold.
  • the above-mentioned first target number may be any value pre-specified according to actual application requirements, or may be a value satisfying a preset condition (for example, the number of word vectors to be matched whose similarity is greater than a preset threshold).
  • In the second step, reordering is performed based on the selected first target number of word vectors to be matched to generate a returned word sequence.
  • the above-mentioned execution body may reorder the first target number of word vectors to be matched selected in the above-mentioned first step in various ways.
  • the above reordering basis may include, but is not limited to, at least one of the following: edit distance, prefix matching, and the like.
  • a sequence of return words can be generated.
  • the order of the words in the returned word sequence generally corresponds to the order of the reordered word vectors to be matched.
  • the above-mentioned execution body may introduce non-semantic dimensions such as word form and word sound when sorting the word vectors to be matched, so as to improve the matching degree of the fuzzy search results.
  • the third step is to send the returned word sequence to the target device.
  • the above-mentioned execution subject may send the return word sequence generated in the above-mentioned second step to the target device in various ways.
  • The above-mentioned target device may include the terminal device that sent the search word, and may also include a background server for further sorting the above-mentioned returned word sequence; this is not limited here.
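A sketch of the optional steps above (select the first target number of candidates, reorder, return), assuming the similarities from step 203 are available. `difflib.SequenceMatcher` stands in for an edit-distance measure, and combining it with prefix matching in a single sort key is an assumption; the text names both as possible reordering bases without fixing a formula.

```python
from difflib import SequenceMatcher  # stand-in for an edit-distance measure

def select_and_rerank(search_word: str, words: list[str],
                      sims: list[float], k: int) -> list[str]:
    """Select the k most similar candidates, then reorder them by
    non-semantic keys (prefix match first, then string similarity to the
    search word) to produce the returned word sequence."""
    top = sorted(range(len(words)), key=lambda i: sims[i], reverse=True)[:k]
    def rerank_key(i: int):
        return (words[i].startswith(search_word),
                SequenceMatcher(None, search_word, words[i]).ratio())
    top.sort(key=rerank_key, reverse=True)
    return [words[i] for i in top]
```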
  • As can be seen, the process 200 of the information generation method based on the word vector generation model in this embodiment embodies the step of matching the words to be matched against the search word using the above word vector generation model. Therefore, the solution described in this embodiment can make full use of the similarity relationships, learned by the word vector generation model, among features other than the meaning of the word itself, and can significantly improve the quality of fuzzy retrieval in scenarios where the input contains errors and in searches without obvious semantics (for example, person names).
  • a flow 300 of one embodiment of a method for training a word vector generation model according to the present application is shown.
  • the method for training a word vector generation model includes the following steps:
  • Step 301: obtain an initial model.
  • In this embodiment, the execution body of the method for training the word vector generation model can obtain the initial model through a wired or wireless connection.
  • the above-mentioned initial model may include an initial word vector generation model and an output layer.
  • the above-mentioned initial word vector generation model may include various artificial neural networks (Artificial Neural Network, ANN) including hidden layers, for example, a neural network based on a combination of a skip-gram model and a fasttext model.
  • The above-mentioned execution body may acquire a pre-stored initial model locally, or may acquire the initial model from a communicatively connected electronic device; this is not limited here.
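As a purely illustrative instance of such an initial model, gensim's FastText class combines a skip-gram objective with fastText-style character n-grams; the text names the model family but not an implementation, so the library choice and all parameter values below are assumptions.

```python
from gensim.models import FastText

# Each training sample is a group of similar words (see step 302 below).
samples = [["ear", "year", "Yeer"]]

model = FastText(
    sentences=samples,
    vector_size=100,   # dimensionality of the word vectors
    window=5,
    min_count=1,
    sg=1,              # skip-gram
    min_n=2, max_n=4,  # character n-gram lengths
    epochs=10,
)
vec = model.wv["year"]  # word vector taken from the hidden layer
```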
  • Step 302: acquire a training sample set.
  • each training sample in the above-mentioned training sample set may include at least two similar words.
  • the above-mentioned similarity includes at least one of the following: similar in sound, similar in shape.
  • The above-mentioned words may include a single character in an ideographic script or a phrase composed of multiple characters, and may also include words in a phonetic script; this is not limited here.
  • For a phrase composed of multiple characters, similar words may include spellings with misspelled characters in the word, and may also include alternative writings of several characters in the word.
  • As an example of shape-similar words, the above training samples may be [keep up, keep up, keep up].
  • As an example of sound-similar words, the above training samples may be [ear, year, Yeer].
  • the above-mentioned execution body may acquire the training sample set according to the following steps:
  • the first step is to obtain the first historical matching word.
  • For example, the aforementioned execution body may acquire the first historical matching word from a local source or a communicatively connected electronic device through a wired or wireless connection.
  • the above-mentioned first historical matching words may include words selected in the search results fed back according to the first historical search words.
  • In the second step, at least one first historical search word corresponding to the first historical matching word is acquired.
  • For example, the execution body may acquire, from a local source or a communicatively connected electronic device through a wired or wireless connection, at least one first historical search word corresponding to the first historical matching word acquired in the first step.
  • As an example, the above-mentioned execution body may extract from the historical search data the word selected by the terminal device from the fed-back search results (as indicated by receiving a content acquisition request corresponding to the word) as the above-mentioned first historical matching word (for example, "weather forecast"). Then, the execution body may extract from the historical search data the search words (e.g., "weather", "weather forecast") used in searches by terminals that also selected the first historical matching word, as the first historical search words. It can be understood that the above-mentioned first historical matching word may generally correspond to at least one first historical search word.
  • In the third step, the at least one first historical search word corresponding to the first historical matching word is combined into a training sample.
  • In this way, the above-mentioned execution body may collect, from real historical search data, different search words corresponding to the same selected word as training samples.
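A minimal sketch of this first sample-construction strategy, assuming the historical search data is available as (search word, selected word) click pairs; this log format is hypothetical.

```python
from collections import defaultdict

def samples_from_click_logs(click_logs):
    """Group historical search words by the result word that was selected:
    each group of different queries leading to the same selection becomes
    one training sample. `click_logs` is an iterable of
    (search_word, selected_word) pairs."""
    queries_by_selection = defaultdict(set)
    for search_word, selected_word in click_logs:
        queries_by_selection[selected_word].add(search_word)
    return [sorted(queries) for queries in queries_by_selection.values()
            if len(queries) >= 2]  # a sample needs at least two similar words

# [("weather", "weather forecast"), ("weather forecast", "weather forecast")]
# yields one training sample: ["weather", "weather forecast"]
```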
  • the above-mentioned execution body may acquire the training sample set according to the following steps:
  • In the first step, a second historical search word is obtained.
  • For example, the above-mentioned execution body may acquire the second historical search word from a local source or a communicatively connected electronic device through a wired or wireless connection.
  • As an example, the above-mentioned execution body may extract, from the above-mentioned historical search data, a search word (for example, "weather") used by a terminal when searching, as the second historical search word.
  • In the second step, at least one second historical matching word corresponding to the second historical search word is acquired.
  • For example, the execution body may acquire, from a local source or a communicatively connected electronic device through a wired or wireless connection, at least one second historical matching word corresponding to the second historical search word acquired in the first step.
  • The above-mentioned second historical matching words may include search results fed back according to the second historical search word (for example, "weather forecast", "weather query", "weather radar").
  • In the third step, a second target number of second historical matching words are selected from the at least one second historical matching word according to the click-through rate.
  • In this embodiment, the execution body may select, according to the click-through rate, a second target number of second historical matching words from the at least one second historical matching word acquired in the second step.
  • As an example, the execution body may select second historical matching words in descending order of click-through rate.
  • As another example, the execution body may select, from the at least one second historical matching word acquired in the second step, second historical matching words whose click-through rate is greater than a preset threshold.
  • The above-mentioned second target number may be any value pre-specified according to actual application requirements, or may be a value satisfying a preset condition (for example, the number of second historical matching words whose click-through rate is greater than a preset threshold).
  • In this way, the above-mentioned execution body may collect, from real historical search data, matching words with a high click-through rate and the corresponding different search words as training samples.
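A sketch of this second strategy under the same assumptions; it supports both selection rules mentioned above (the highest click-through rates in descending order, or a threshold), and the CTR values in the usage comment are invented for illustration.

```python
def sample_from_ctr(search_word, matched_words, ctr, k=None, threshold=None):
    """Combine a historical search word with its high click-through-rate
    matched words into one training sample. `ctr` maps each matched word
    to its click-through rate."""
    ranked = sorted(matched_words, key=lambda w: ctr[w], reverse=True)
    chosen = ranked[:k] if k is not None else [w for w in ranked if ctr[w] > threshold]
    return [search_word] + chosen

# sample_from_ctr("weather",
#                 ["weather forecast", "weather query", "weather radar"],
#                 ctr={"weather forecast": 0.4, "weather query": 0.2,
#                      "weather radar": 0.05},
#                 k=2)
# -> ["weather", "weather forecast", "weather query"]
```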
  • In some optional implementations, the training samples in the above training sample set may include words in a phonetic script and at least one n-gram word corresponding to those words.
  • The above-mentioned phonetic scripts may include syllabic scripts (such as Japanese kana) and phonemic scripts (such as the Latin letters used in English and French, the Cyrillic letters used in Russian, and the Arabic letters used in Arabic and Uyghur).
  • The above-mentioned n-gram word may include a character string composed of n consecutive letters selected from the above phonetic-script word.
  • As an example, the above phonetic-script word may be "happy". The n-gram words corresponding to it (for example, with n being 3) may include "hap", "app", and "ppy".
  • the above-mentioned execution subject may acquire a training sample set according to the following steps:
  • In the first step, a target word is obtained.
  • The above-mentioned target word may include any word selected from the above-mentioned historical search data. As an example, the target word may be "happy".
  • In the second step, at least one n-gram word corresponding to the target word is generated.
  • the above-mentioned execution body may generate at least one n-gram word corresponding to the target word obtained in the above-mentioned first step in various ways.
  • For example, when n is 2, the at least one n-gram word corresponding to the above "happy" may include at least one of "ha", "ap", "pp", and "py".
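The n-gram generation itself is a one-liner; the following sketch reproduces both "happy" examples given above.

```python
def ngram_words(word: str, n: int) -> list[str]:
    """All strings of n consecutive letters taken from a phonetic-script word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

assert ngram_words("happy", 3) == ["hap", "app", "ppy"]
assert ngram_words("happy", 2) == ["ha", "ap", "pp", "py"]
```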
  • In the third step, morphological transformation is performed on the target word and the corresponding at least one n-gram word to generate a transformed word set.
  • the execution body may perform various morphological transformations on the target word obtained in the first step and the at least one n-gram word generated in the second step to generate a transformed word set.
  • the above-mentioned morphological transformation may include, but is not limited to, at least one of the following: character deletion, character repetition, and character exchange.
  • As an example, the transformed word set generated by morphological transformation of "happy" and its n-gram words may include, but is not limited to, at least two of the following: "hapy", "hhappy", "hyppa", "hha", "p", "pa".
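A simplified sketch of the three morphological transformations named above (character deletion, character repetition, character exchange), applied at a random position; applying exactly one transformation per call is an assumption, since the text does not specify how transformations are combined.

```python
import random

def morph_transform(word: str, rng: random.Random) -> str:
    """Apply one randomly chosen morphological transformation to `word`."""
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "repeat", "exchange"])
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]        # e.g. "happy" -> "hapy"
    if op == "repeat":
        return word[:i] + word[i] + word[i:]  # e.g. "happy" -> "hhappy"
    j = rng.randrange(len(word))              # exchange two characters
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```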
  • the above-mentioned morphological transformation may include character replacement.
  • In some optional implementations, the above-mentioned execution body may perform morphological transformation on the target word and the corresponding at least one n-gram word and generate a transformed word set according to the following steps:
  • In the first step, the execution body may select, in various ways, at least one word from the target word obtained above and the corresponding at least one n-gram word as the word to be replaced, for example, by random selection or by selecting words whose number of characters is greater than a preset value.
  • In the second step, the above-mentioned execution body may replace characters in the word to be replaced according to the preset probability to generate a transformed word.
  • the above preset probability may be associated with the arrangement positions of keys representing different characters on the keyboard.
  • As an example, the keys most adjacent to the "S" key may include the keys corresponding to the characters "A", "W", "D", and "X"; the keys next nearest to the "S" key may include the keys corresponding to the characters "Q", "E", "C", and "Z".
  • Accordingly, the above-mentioned preset probabilities corresponding to "A", "W", "D", and "X" (e.g., 0.3) are generally higher than those corresponding to "Q", "E", "C", and "Z".
  • In this way, the execution body may replace the character "s" in the word to be replaced according to the preset probability; for example, the word to be replaced "smile" may be transformed into "amile" as the transformed word.
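A sketch of the keyboard-aware character replacement; only the neighbourhood of "s" is filled in, and the 0.1 probability assigned to the next-nearest keys is an assumed value (the text only gives 0.3 for the nearest keys).

```python
import random

# Preset replacement probabilities keyed by keyboard adjacency.
NEIGHBOURS = {"s": {"a": 0.3, "w": 0.3, "d": 0.3, "x": 0.3,
                    "q": 0.1, "e": 0.1, "c": 0.1, "z": 0.1}}

def replace_by_keyboard(word: str, rng: random.Random) -> str:
    """Replace one character with a key drawn according to the preset
    probabilities, which grow with keyboard adjacency."""
    positions = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
    if not positions:
        return word
    i = rng.choice(positions)
    chars, weights = zip(*NEIGHBOURS[word[i]].items())
    repl = rng.choices(chars, weights=weights, k=1)[0]
    return word[:i] + repl + word[i + 1:]

# replace_by_keyboard("smile", random.Random(0)) may yield "amile"
```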
  • the fourth step is to generate training samples based on the transformed word set.
  • the above-mentioned training samples may include words that have undergone morphological transformation through the above-mentioned third step.
  • the above-mentioned training sample may also include the above-mentioned target word and the corresponding at least one n-gram word that has not been transformed.
  • the above-mentioned execution body may further de-duplicate the above-mentioned target word, the corresponding at least one n-gram word, and the transformed word, thereby generating the above-mentioned training sample.
  • In this way, the above-mentioned execution body can construct different training samples from real historical search data, thereby compensating for insufficient coverage of the historical data, which helps to reduce model overfitting and improve the model's generalization ability.
  • Step 303: take the first word of a training sample in the training sample set as the input of the initial model, take the second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model.
  • In this embodiment, the above-mentioned execution body can use a machine learning method to take the first word of a training sample in the training sample set as the input of the initial model and the second word corresponding to the input first word as the expected output, and train to obtain the above word vector generation model.
  • the above-mentioned first word and the above-mentioned second word usually belong to the same training sample.
  • the hidden layer of the initial word vector generation model in the above initial model can be used to output word vectors. In each iteration, the above-mentioned first word and second word can be selected in various ways.
  • the above-mentioned execution body may randomly select any two words from the training sample that have not been selected at the same time as the above-mentioned first word and second word.
  • the above-mentioned execution body may also select two words from the training sample according to a preset sliding window as the above-mentioned first word and second word.
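A sketch covering both pair-selection strategies for step 303: exhaustive random pairing and a sliding window. Interpreting the window as a maximum distance between word positions within the sample is an assumption.

```python
import random
from itertools import combinations

def training_pairs(sample, window=None, rng=random):
    """Return (first_word, second_word) pairs, i.e. (model input, expected
    output), from one training sample. With window=None, every unordered
    pair of distinct words is used once, in random order; with a sliding
    window, only words at most `window` positions apart are paired."""
    idx_pairs = [(i, j) for i, j in combinations(range(len(sample)), 2)
                 if window is None or j - i <= window]
    rng.shuffle(idx_pairs)
    return [(sample[i], sample[j]) for i, j in idx_pairs]

# training_pairs(["ear", "year", "Yeer"]) might return
# [("year", "Yeer"), ("ear", "year"), ("ear", "Yeer")]
```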
  • FIG. 4 is a schematic diagram of an application scenario of the method for training a word vector generation model according to an embodiment of the present application.
  • the backend server can obtain the initial model 401 and the training sample set 402 .
  • the initial model 401 may include an initial word vector generation model 4011 and an output layer 4012.
  • The above training sample set may include a training sample composed of similar spellings of "Lucy".
  • the background server may use "Lucy" 4031 in the training sample as the first word, and "Lucy" 4032 as the second word.
  • the background server may input "Lucy" 4031 into the above-mentioned initial model 401, and use "Lucy” 4032 as the expected output of the above-mentioned initial model 401.
  • Finally, the background server may stop training when a training end condition is satisfied, and determine the initial word vector generation model 4011 in the resulting initial model 401 as the word vector generation model.
  • Currently, one existing approach usually constructs word-level word vectors based on word meaning in advance and then generates recall information based on matching of prefixes, characters, and the like, so that the existing word vectors cannot exploit word-shape or pronunciation features.
  • In contrast, in the method provided by the above embodiments of the present application, the word vector generation model is trained with training samples that are similar in sound and/or shape, so that the generated word vectors can reflect features other than the meaning of the word itself, providing a solid data basis for improving the quality of fuzzy search.
  • With further reference to FIG. 5, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of an information generation apparatus based on a word vector generation model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
  • The information generation apparatus 500 based on the word vector generation model includes a word acquisition unit 501, a vector generation unit 502 and a similarity generation unit 503.
  • The word acquisition unit 501 is configured to acquire a search word; the vector generation unit 502 is configured to input the search word into a pre-trained word vector generation model to obtain a word vector corresponding to the search word, where the word vector generation model is used to generate word vectors based on non-semantic similarity, and the non-semantic similarity includes at least one of the following: similar in sound, similar in shape; and the similarity generation unit 503 is configured to generate the similarity between the word vector corresponding to the search word and the word vectors to be matched in a preset set of word vectors to be matched, where the word vectors to be matched in the set are obtained based on the word vector generation model.
  • In this embodiment, for the specific processing of the word acquisition unit 501, the vector generation unit 502 and the similarity generation unit 503 and the technical effects they bring, reference may be made to the related descriptions of step 201, step 202 and step 203 in the embodiment corresponding to FIG. 2, which are not repeated here.
  • the above-mentioned information generating apparatus 500 based on a word vector generating model may further include: a selecting unit (not shown in the figure), a sorting unit (not shown in the figure), and a sending unit (not shown in the figure).
  • the above selection unit may be configured to select a first target number of word vectors to be matched from the set of word vectors to be matched according to the determined similarity.
  • the above sorting unit may be configured to perform reordering based on the selected first target number of word vectors to be matched to generate a returned word sequence. The order of the words in the returned word sequence may correspond to the order of the reordered word vectors to be matched.
  • the above-mentioned sending unit may be configured to send the returned word sequence to the target device.
  • the search words are acquired through the word acquisition unit 501 .
  • the vector generating unit 502 inputs the search term into the pre-trained word vector generating model to obtain the word vector corresponding to the search term.
  • Wherein, the word vector generation model may be obtained by training with the above-described method for training a word vector generation model.
  • the similarity generating unit 503 generates the similarity between the word vector corresponding to the search word and the word vector to be matched in the preset set of word vectors to be matched.
  • the to-be-matched word vectors in the to-be-matched word vector set may be obtained based on a word vector generation model.
  • With further reference to FIG. 6, as an implementation of the method shown in FIG. 3, the present application provides an embodiment of an apparatus for training a word vector generation model. The apparatus embodiment corresponds to the method embodiment shown in FIG. 3, and the apparatus can be applied to various electronic devices.
  • The apparatus 600 for training a word vector generation model includes a model obtaining unit 601, a sample obtaining unit 602 and a training unit 603.
  • The model obtaining unit 601 is configured to obtain an initial model, where the initial model includes an initial word vector generation model and an output layer; the sample obtaining unit 602 is configured to obtain a training sample set, where the training samples in the training sample set include at least two similar words, and the similarity includes at least one of the following: similar in sound, similar in shape; and the training unit 603 is configured to use the first word of a training sample in the training sample set as the input of the initial model, use the second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
  • In this embodiment, for the specific processing of the model obtaining unit 601, the sample obtaining unit 602 and the training unit 603 and the technical effects they bring, reference may be made to the related descriptions of step 301, step 302 and step 303 in the embodiment corresponding to FIG. 3, which are not repeated here.
  • the above-mentioned sample acquisition unit 602 may include: a first acquisition subunit (not shown in the figure), a second acquisition subunit (not shown in the figure), a first combination Subunits (not shown in the figure).
  • the above-mentioned first obtaining subunit may be configured to obtain the first historical matching word.
  • the above-mentioned first historical matching words may include words selected in the search results fed back according to the first historical search words.
  • the above-mentioned second obtaining subunit may be configured to obtain at least one first historical search word corresponding to the first historical matching word.
  • the above-mentioned first combining subunit may be configured to combine at least one first historical search word corresponding to the first historical matching word into a training sample.
  • the above-mentioned sample acquisition unit 602 may include: a third acquisition subunit (not shown in the figure), a fourth acquisition subunit (not shown in the figure), and a selection subunit (not shown in the figure), a second combination subunit (not shown in the figure).
  • the above-mentioned third obtaining subunit may be configured to obtain the second historical search term.
  • the above-mentioned fourth obtaining subunit may be configured to obtain at least one second historical matching word corresponding to the second historical search word.
  • the above-mentioned second historical matching words may include search results fed back according to the second historical search words.
  • The above-mentioned selection subunit may be configured to select, according to the click-through rate, a second target number of second historical matching words from the at least one second historical matching word.
  • the above-mentioned second combining subunit may be configured to combine the second historical search word and the selected second target number of second historical matching words into a training sample.
  • the words included in the training samples in the above-mentioned training sample set may include words in phonetic characters and at least one n-gram word corresponding to the words in phonetic characters.
  • the above-mentioned sample acquisition unit 602 may include: a fifth acquisition subunit (not shown in the figure), a first generation subunit (not shown in the figure), a second generation subunit (not shown in the figure) Subunit (not shown in the figure), a third generation subunit (not shown in the figure).
  • the above-mentioned fifth obtaining subunit may be configured to obtain the target word.
  • the above-mentioned first generating subunit may be configured to generate at least one n-gram word corresponding to the target word.
  • the above-mentioned second generating subunit may be configured to perform morphological transformation on the target word and the corresponding at least one n-gram word to generate a transformed word set.
  • the above-mentioned third generating subunit may be configured to generate training samples based on the transformed word set.
  • the above-mentioned morphological transformation may include character replacement.
  • the above-mentioned second generating subunit may include: a selection module (not shown in the figure) and a generation module (not shown in the figure).
  • the above selection module may be configured to select the word to be replaced from the target word and the corresponding at least one n-gram word.
  • the above-mentioned generating module may be configured to replace the characters in the word to be replaced according to a preset probability, and generate a transformed word.
  • the above preset probability may be associated with the arrangement positions of keys representing different characters on the keyboard.
  • an initial model is obtained through the model obtaining unit 601, where the initial model includes an initial word vector generation model and an output layer.
  • the sample obtaining unit 602 obtains a training sample set.
  • the training samples in the training sample set include at least two similar words. Similarity includes at least one of the following: similar in sound, similar in shape.
  • The training unit 603 takes the first word of a training sample in the training sample set as the input of the initial model, takes the second word corresponding to the input first word as the expected output, and determines the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
  • the word vector generation model is trained by training samples with similar pronunciation and/or glyph, so that the generated word vector can reflect features other than the meaning of the word itself, thereby providing a reliable data basis for improving the quality of fuzzy search.
  • Referring now to FIG. 7, it shows a schematic structural diagram of an electronic device (e.g., the server in FIG. 1) 700 suitable for implementing embodiments of the present application.
  • Terminal devices in the embodiments of the present application may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (such as in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • the server shown in FIG. 7 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • As shown in FIG. 7, an electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor) 701, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700.
  • The processing device 701, the ROM 702 and the RAM 703 are connected to each other through a bus 704.
  • An input/output (I/O) interface 705 is also connected to the bus 704.
  • Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 707 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709.
  • Communication means 709 may allow electronic device 700 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 7 shows an electronic device 700 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 7 may represent one device or, as required, multiple devices.
  • embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 709, or installed from the storage device 708, or installed from the ROM 702.
  • When the computer program is executed by the processing device 701, the above-mentioned functions defined in the methods of the embodiments of the present application are executed.
  • the computer-readable medium described in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, RF (Radio Frequency, radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned server; or may exist alone without being assembled into the server.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the server, the server is caused to: obtain a search word; input the search word into a pre-trained word vector generation model to obtain a word vector corresponding to the search word, where the word vector generation model is used to generate word vectors based on non-semantic similarity, and the non-semantic similarity includes at least one of the following: similar in sound, similar in shape; and generate the similarity between the word vector corresponding to the search word and the word vectors to be matched in a preset set of word vectors to be matched, where the word vectors to be matched in the set are obtained based on the word vector generation model.
  • Computer program code for performing the operations of the embodiments of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present application may be implemented in a software manner, and may also be implemented in a hardware manner.
  • the described unit can also be set in the processor, for example, it can be described as: a processor, including a word obtaining unit, a vector generating unit, and a similarity generating unit. Wherein, the names of these units do not constitute a limitation on the unit itself under certain circumstances, for example, the word acquisition unit may also be described as a "unit for acquiring search words".


Abstract

An information generation method and apparatus based on a word vector generation model. The method includes: obtaining a search term (201); inputting the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term (202), where the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity including at least one of the following: similarity in sound, similarity in shape; and generating similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors (203), where the to-be-matched word vectors in the set are obtained based on the word vector generation model. The method can make full use of the affinities, learned by the word vector generation model, between features other than the meanings of the words themselves, and can therefore significantly improve the quality of fuzzy retrieval when the input contains errors or when the search carries no obvious semantics (for example, a person's name).

Description

Information generation method and apparatus based on a word vector generation model
Cross-reference to related application
This application is based on, and claims priority to, the Chinese patent application No. 202010604164.0, filed on June 29, 2020 and entitled "Information generation method and apparatus based on a word vector generation model", the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of the present application relate to the field of computer technology, and in particular to an information generation method and apparatus based on a word vector generation model.
Background
With the development of computer technology, text search has found increasingly wide application. In search technology, the construction of word vectors and the determination of similarity are the basis for matching candidate information against search terms.
A related approach is usually to pre-build word-level word vectors and then to generate recall information based on matching of prefixes, characters, and the like.
Summary of the invention
The embodiments of the present application propose an information generation method and apparatus based on a word vector generation model.
In a first aspect, an embodiment of the present application provides an information generation method based on a word vector generation model, the method including: obtaining a search term; inputting the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, where the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity including at least one of the following: similarity in sound, similarity in shape; and generating similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, where the to-be-matched word vectors in the set are obtained based on the word vector generation model.
In some embodiments, the method further includes: selecting a first target number of to-be-matched word vectors from the set of to-be-matched word vectors according to the magnitudes of the determined similarities; reranking the selected first target number of to-be-matched word vectors to generate a returned word sequence, where the order of the words in the returned word sequence corresponds to the order of the reranked to-be-matched word vectors; and sending the returned word sequence to a target device.
In a second aspect, an embodiment of the present application provides a method for training a word vector generation model, the method including: obtaining an initial model, where the initial model includes an initial word vector generation model and an output layer; obtaining a set of training samples, where a training sample in the set includes at least two similar words, the similarity including at least one of the following: similarity in sound, similarity in shape; and taking a first word of a training sample in the set as the input of the initial model, taking a second word corresponding to the input first word as the expected output, and determining the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
In some embodiments, obtaining the set of training samples includes: obtaining a first historical matched word, where the first historical matched word includes a word selected from the search results returned for a first historical search term; obtaining at least one first historical search term corresponding to the first historical matched word; and combining the at least one first historical search term corresponding to the first historical matched word into a training sample.
In some embodiments, obtaining the set of training samples includes: obtaining a second historical search term; obtaining at least one second historical matched word corresponding to the second historical search term, where the second historical matched word includes a search result returned for the second historical search term; selecting a second target number of second historical matched words from the at least one second historical matched word according to click-through rate; and combining the second historical search term and the selected second target number of second historical matched words into a training sample.
In some embodiments, the words included in the training samples in the set include a word in a phonographic script and at least one n-gram word corresponding to the word in the phonographic script.
In some embodiments, obtaining the set of training samples includes: obtaining a target word; generating at least one n-gram word corresponding to the target word; performing morphological transformation on the target word and the corresponding at least one n-gram word to generate a set of transformed words; and generating training samples based on the set of transformed words.
In some embodiments, the morphological transformation includes character replacement; and performing morphological transformation on the target word and the corresponding at least one n-gram word to generate a set of transformed words includes: selecting a to-be-replaced word from the target word and the corresponding at least one n-gram word; and replacing characters in the to-be-replaced word according to preset probabilities to generate transformed words, where the preset probabilities are associated with the positions of the keys representing different characters on a keyboard.
In a third aspect, an embodiment of the present application provides an information generation apparatus based on a word vector generation model, the apparatus including: a word obtaining unit configured to obtain a search term; a vector generation unit configured to input the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, where the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity including at least one of the following: similarity in sound, similarity in shape; and a similarity generation unit configured to generate similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, where the to-be-matched word vectors in the set are obtained based on the word vector generation model.
In some embodiments, the apparatus further includes: a selection unit configured to select a first target number of to-be-matched word vectors from the set of to-be-matched word vectors according to the magnitudes of the determined similarities; a ranking unit configured to rerank the selected first target number of to-be-matched word vectors to generate a returned word sequence, where the order of the words in the returned word sequence corresponds to the order of the reranked to-be-matched word vectors; and a sending unit configured to send the returned word sequence to a target device.
In a fourth aspect, an embodiment of the present application provides an apparatus for training a word vector generation model, the apparatus including: a model obtaining unit configured to obtain an initial model, where the initial model includes an initial word vector generation model and an output layer; a sample obtaining unit configured to obtain a set of training samples, where a training sample in the set includes at least two similar words, the similarity including at least one of the following: similarity in sound, similarity in shape; and a training unit configured to take a first word of a training sample in the set as the input of the initial model, take a second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
In some embodiments, the sample obtaining unit includes: a first obtaining subunit configured to obtain a first historical matched word, where the first historical matched word includes a word selected from the search results returned for a first historical search term; a second obtaining subunit configured to obtain at least one first historical search term corresponding to the first historical matched word; and a first combining subunit configured to combine the at least one first historical search term corresponding to the first historical matched word into a training sample.
In some embodiments, the sample obtaining unit includes: a third obtaining subunit configured to obtain a second historical search term; a fourth obtaining subunit configured to obtain at least one second historical matched word corresponding to the second historical search term, where the second historical matched word includes a search result returned for the second historical search term; a selection subunit configured to select a second target number of second historical matched words from the at least one second historical matched word according to click-through rate; and a second combining subunit configured to combine the second historical search term and the selected second target number of second historical matched words into a training sample.
In some embodiments, the words included in the training samples in the set include a word in a phonographic script and at least one n-gram word corresponding to the word in the phonographic script.
In some embodiments, the sample obtaining unit includes: a fifth obtaining subunit configured to obtain a target word; a first generation subunit configured to generate at least one n-gram word corresponding to the target word; a second generation subunit configured to perform morphological transformation on the target word and the corresponding at least one n-gram word to generate a set of transformed words; and a third generation subunit configured to generate training samples based on the set of transformed words.
In some embodiments, the morphological transformation includes character replacement. The second generation subunit includes: a selection module configured to select a to-be-replaced word from the target word and the corresponding at least one n-gram word; and a generation module configured to replace characters in the to-be-replaced word according to preset probabilities to generate transformed words, where the preset probabilities are associated with the positions of the keys representing different characters on a keyboard.
In a fifth aspect, an embodiment of the present application provides a server, the server including: one or more processors; and a storage device storing one or more programs thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program thereon, where the program, when executed by a processor, implements the method described in any implementation of the first aspect.
In the information generation method and apparatus based on a word vector generation model provided by the embodiments of the present application, a search term is first obtained; the search term is then input into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, where the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity including at least one of the following: similarity in sound, similarity in shape; finally, similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors are generated, where the to-be-matched word vectors in the set are obtained based on the word vector generation model. The method can thus make full use of the affinities, learned by the word vector generation model, between features other than the meanings of the words themselves, and can significantly improve the quality of fuzzy retrieval when the input contains errors or when the search carries no obvious semantics (for example, a person's name).
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of an embodiment of the information generation method based on a word vector generation model according to the present application;
Fig. 3 is a flowchart of an embodiment of the method for training a word vector generation model according to the present application;
Fig. 4 is a schematic diagram of an application scenario of the method for training a word vector generation model according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an embodiment of the information generation apparatus based on a word vector generation model according to the present application;
Fig. 6 is a schematic structural diagram of an embodiment of the apparatus for training a word vector generation model according to the present application;
Fig. 7 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely used to explain the relevant invention rather than to limit it. It should also be noted that, for ease of description, only the parts related to the relevant invention are shown in the drawings.
It should be noted that, provided they do not conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary architecture 100 to which the information generation method based on a word vector generation model or the information generation apparatus based on a word vector generation model of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, networks 104, 106, and servers 105, 107. The networks 104, 106 serve as media providing communication links between the terminal devices 101, 102, 103 and the server 105, and between the server 105 and the server 107, respectively. The networks 104, 106 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
The terminal devices 101, 102, 103 interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and reading applications.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices that have a display screen and support search, including but not limited to smartphones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above; they may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a back-end server supporting the web pages displayed on the terminal devices 101, 102, 103. The server 105 may be used to execute the above information generation method based on a word vector generation model. The server 107 may be a server for training the word vector generation model. The back-end server 105 may obtain the trained word vector generation model from the server 107, then use the obtained word vector generation model to analyze and otherwise process the search terms received from the terminal devices, and generate processing results (for example, search results matching the search terms) to feed back to the terminal devices.
It should be noted that the server 105 may also be used to train the word vector generation model, so that the trained word vector generation model may also be stored directly in the local storage of the server 105 and extracted from there directly; in that case, the network 106 and the server 107 may be absent.
It should be noted that a server may be hardware or software. When a server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When a server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the information generation method based on a word vector generation model provided by the embodiments of the present application is generally executed by the server 105; correspondingly, the information generation apparatus based on a word vector generation model is generally arranged in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to Fig. 2, it shows a flow 200 of an embodiment of the information generation method based on a word vector generation model. The flow 200 of the information generation method based on a word vector generation model includes the following steps:
Step 201: obtain a search term.
In this embodiment, the executing body of the information generation method based on a word vector generation model (for example the server 105 shown in Fig. 1) may obtain the search term through a wired or wireless connection. The search term may generally be a word sent by a communicatively connected terminal device to obtain retrieval results. As an example, the search term may include a user name, such as "兔兔"; it may also include a topic name (hashtag), such as "blue sky".
Step 202: input the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term.
In this embodiment, the executing body may input the search term obtained in step 201 into the pre-trained word vector generation model, and extract the word vector corresponding to the search term from the hidden layer of the word vector generation model (that is, the input of the output layer). The word vector generation model may be used to generate word vectors based on non-semantic similarity, where the non-semantic similarity includes at least one of the following: similarity in sound, similarity in shape. As an example, the word vector generation model may include various language models trained with training samples that have the above non-semantic similarity.
In some optional implementations of this embodiment, the word vector generation model may also be trained by the method for training a word vector generation model shown in the subsequent Fig. 3 and Fig. 4; see the subsequent description for details.
Step 203: generate similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors.
In this embodiment, the executing body may use various vector similarity generation methods to generate the similarities between the word vector corresponding to the search term and the to-be-matched word vectors in the preset set of to-be-matched word vectors. The to-be-matched word vectors in the set are obtained based on the word vector generation model; the set may include the set of word vectors obtained by inputting a preset set of to-be-matched words into the word vector generation model. The to-be-matched words may include various historical data, which may include but is not limited to the user names of registered users, the names of published topics, and so on.
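As an illustration only: one common vector similarity generation method is cosine similarity. The following is a minimal sketch, not part of the original disclosure, that scores a hypothetical query word vector against a hypothetical matrix of to-be-matched word vectors; the values and dimensionality are made up.

```python
import numpy as np

def cosine_similarities(query_vec, candidate_matrix):
    # Normalize the query and each candidate row, then take dot products,
    # yielding one similarity score per to-be-matched word vector.
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
    return c @ q

query = np.array([0.1, 0.3, 0.5])            # word vector for the search term
candidates = np.array([[0.1, 0.3, 0.5],      # to-be-matched word vectors
                       [0.9, 0.1, 0.0]])
print(cosine_similarities(query, candidates))  # [1.0, ~0.22]
```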
In some optional implementations of this embodiment, the executing body may further continue to perform the following steps:
In a first step, a first target number of to-be-matched word vectors are selected from the set of to-be-matched word vectors according to the magnitudes of the determined similarities.
In these implementations, the executing body may select, in various ways, the first target number of to-be-matched word vectors from the set according to the magnitudes of the determined similarities. As an example, the executing body may select the first target number of to-be-matched word vectors from the set in descending order of the determined similarities. As another example, the executing body may select from the set a first target number of to-be-matched word vectors whose determined similarities are greater than a preset threshold. The first target number may be an arbitrary value specified in advance according to actual application requirements, or a value satisfying a preset condition (for example, the number of to-be-matched word vectors whose similarities are greater than the preset threshold).
In a second step, the selected first target number of to-be-matched word vectors are reranked to generate a returned word sequence.
In these implementations, the executing body may rerank, in various ways, the first target number of to-be-matched word vectors selected in the first step. The basis for the reranking may include but is not limited to at least one of the following: edit distance, prefix matching, and the like. A returned word sequence can thereby be generated, in which the order of the words generally corresponds to the order of the reranked to-be-matched word vectors (a sketch of this selection and reranking follows these steps).
Based on these optional implementations, the executing body can introduce non-semantic dimensions such as word shape and word sound when ranking the to-be-matched word vectors, thereby improving how well the results of a fuzzy search match the query.
In a third step, the returned word sequence is sent to a target device.
In these implementations, the executing body may send the returned word sequence generated in the second step to the target device in various ways. The target device may include the terminal device that sent the search term, or a back-end server for further ranking the returned word sequence; no limitation is made here.
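Putting the first and second steps together, here is a minimal sketch assuming similarity scores have already been computed as above; the candidate words, the value of the first target number k, and the prefix tie-break are illustrative assumptions, and sending to the target device is reduced to returning the list.

```python
def edit_distance(a, b):
    # One-row dynamic-programming Levenshtein distance; prev holds the
    # previous row's value diagonally above-left of the current cell.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def select_and_rerank(query, words, scores, k=3):
    # First step: take the k candidates with the largest similarity scores.
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:k]
    # Second step: rerank by edit distance to the query, preferring shared prefixes.
    return sorted((words[i] for i in top),
                  key=lambda w: (edit_distance(query, w), not w.startswith(query)))

print(select_and_rerank("yeer", ["year", "ear", "yeast"], [0.97, 0.95, 0.60], k=2))
# ['year', 'ear'] — both recalled; 'year' ranks first (edit distance 1 vs 2)
```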
As can be seen from Fig. 2, the flow 200 of the information generation method based on a word vector generation model in this embodiment embodies the step of matching to-be-matched words against the search term using the word vector generation model. The solution described in this embodiment can therefore make full use of the affinities, learned by the word vector generation model, between features other than the meanings of the words themselves, and can significantly improve the quality of fuzzy retrieval when the input contains errors or when the search carries no obvious semantics (for example, a person's name).
With further reference to Fig. 3, a flow 300 of an embodiment of the method for training a word vector generation model according to the present application is shown. The method for training a word vector generation model includes the following steps:
Step 301: obtain an initial model.
In this embodiment, the executing body of the method for training a word vector generation model (the server 105 or 107 shown in Fig. 1) may obtain the initial model through a wired or wireless connection. The initial model may include an initial word vector generation model and an output layer. The initial word vector generation model may include various artificial neural networks (ANN) containing a hidden layer, for example a neural network combining a skip-gram model and a fastText model.
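As a sketch only, an initial model of this shape could look as follows; PyTorch, the vocabulary size, and the vector dimension are assumptions not taken from the original text, which only requires some hidden-layer network in the spirit of skip-gram/fastText.

```python
import torch.nn as nn

class InitialModel(nn.Module):
    """Initial word vector generation model (the embedding, i.e. hidden layer)
    plus an output layer that is discarded once training finishes."""

    def __init__(self, vocab_size=10000, dim=128):  # hypothetical sizes
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)  # word vectors live here
        self.output = nn.Linear(dim, vocab_size)        # predicts the second word

    def forward(self, word_ids):
        return self.output(self.embedding(word_ids))    # logits over the vocabulary
```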
In this embodiment, the executing body may also obtain a pre-stored initial model locally, or obtain the initial model from a communicatively connected electronic device; no limitation is made here.
Step 302: obtain a set of training samples.
In this embodiment, the executing body may obtain the set of training samples in various ways. Each training sample in the set may include at least two similar words, where the similarity includes at least one of the following: similarity in sound, similarity in shape. The words may include single characters or multi-character phrases in a logographic script, or words in a phonographic script, and so on; no limitation is made here.
It should be noted that the multi-character phrases may include miswritten forms of a phrase, and words in which several characters have been replaced.
As an example, a training sample may be [再接再厉, 再接再励, 再接再历] (an idiom and two sound- and shape-similar miswritings). As another example, a training sample may be [ear, year, yeer].
In some optional implementations of this embodiment, the executing body may obtain the set of training samples according to the following steps:
In a first step, a first historical matched word is obtained.
In these implementations, the executing body may obtain the first historical matched word locally or from a communicatively connected electronic device through a wired or wireless connection. The first historical matched word may include a word selected from the search results returned for a first historical search term.
In a second step, at least one first historical search term corresponding to the first historical matched word is obtained.
In these implementations, the executing body may obtain, locally or from a communicatively connected electronic device through a wired or wireless connection, at least one first historical search term corresponding to the first historical matched word obtained in the first step.
As an example, the executing body may extract from historical search data a word that a terminal device selected from the returned search results (reflected in the receipt of a content acquisition request corresponding to that word) as the first historical matched word (for example "天气预报" (weather forecast)). The executing body may then extract from the historical search data the search terms used by the terminals that also selected this first historical matched word (for example "天气" (weather), "天气预") as first historical search terms. It can be understood that a first historical matched word may generally correspond to at least one first historical search term.
In a third step, the at least one first historical search term corresponding to the first historical matched word is combined into a training sample.
Based on these optional implementations, the executing body can collect, from real historical search data, the different search terms corresponding to the same selected word as training samples.
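A minimal sketch of this collection step, using made-up log records (the tuples and field names are hypothetical): group historical search terms by the word the user ultimately selected.

```python
from collections import defaultdict

# Hypothetical (search term, selected word) records from historical search data.
click_log = [
    ("weather", "weather forecast"),
    ("weathr", "weather forecast"),
    ("weather fore", "weather forecast"),
    ("lucy", "Lucy"),
]

queries_by_selected = defaultdict(set)
for search_term, selected_word in click_log:
    queries_by_selected[selected_word].add(search_term)

# Each sample combines the different search terms that led to the same selected word.
training_samples = [sorted(q) for q in queries_by_selected.values() if len(q) >= 2]
print(training_samples)  # [['weather', 'weather fore', 'weathr']]
```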
In some optional implementations of this embodiment, the executing body may obtain the set of training samples according to the following steps:
In a first step, a second historical search term is obtained.
In these implementations, the executing body may obtain the second historical search term locally or from a communicatively connected electronic device through a wired or wireless connection. As an example, the executing body may extract from the historical search data a search term used by a terminal (for example "天气" (weather)) as the second historical search term.
In a second step, at least one second historical matched word corresponding to the second historical search term is obtained.
In these implementations, the executing body may obtain, locally or from a communicatively connected electronic device through a wired or wireless connection, at least one second historical matched word corresponding to the second historical search term obtained in the first step. The second historical matched word may include the search results returned for the second historical search term (for example "天气预报" (weather forecast), "天气查询" (weather inquiry), "天气雷达" (weather radar)).
In a third step, a second target number of second historical matched words are selected from the at least one second historical matched word according to click-through rate (CTR).
In these implementations, the executing body may select, according to click-through rate, the second target number of second historical matched words from the at least one second historical matched word obtained in the second step. As an example, the executing body may select second historical matched words in descending order of click-through rate. As another example, the executing body may select second historical matched words whose click-through rates are greater than a preset threshold. The second target number may be an arbitrary value specified in advance according to actual application requirements, or a value satisfying a preset condition (for example, the number of second historical matched words whose click-through rates are greater than the preset threshold).
In a fourth step, the second historical search term and the selected second target number of second historical matched words are combined into a training sample.
Based on these optional implementations, the executing body can collect, from real historical search data, matched words with relatively high click-through rates and the corresponding different search terms as training samples.
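A minimal sketch of the click-through-rate selection, with hypothetical words and CTR values and a second target number of 2:

```python
matched = [("weather forecast", 0.42),  # (second historical matched word, CTR)
           ("weather radar", 0.07),
           ("weather lookup", 0.19)]

second_target_number = 2
selected = sorted(matched, key=lambda m: m[1], reverse=True)[:second_target_number]

# Combine the search term with the selected matched words into one sample.
sample = ["weather"] + [word for word, _ in selected]
print(sample)  # ['weather', 'weather forecast', 'weather lookup']
```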
In some optional implementations of this embodiment, the words included in the training samples in the set may include a word in a phonographic script and at least one n-gram word corresponding to that word. The phonographic script may include syllabaries (for example Japanese kana) and alphabets (for example the Latin alphabet used by English, French, etc., the Cyrillic alphabet used by Russian, and the Arabic alphabet used by Arabic and Uyghur). An n-gram word may include a string composed of n consecutive letters selected from the word in the phonographic script. As an example, the word in the phonographic script may be "happy"; the n-gram words corresponding to it (for example with n equal to 3) may include "hap", "app", "ppy".
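The character n-gram extraction described here is straightforward; a minimal sketch reproducing the document's own "happy" examples:

```python
def char_ngrams(word, n):
    # All substrings of n consecutive letters of the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("happy", 3))  # ['hap', 'app', 'ppy']
print(char_ngrams("happy", 2))  # ['ha', 'ap', 'pp', 'py']
```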
In some optional implementations of this embodiment, building on the above optional implementations, the executing body may obtain the set of training samples according to the following steps:
In a first step, a target word is obtained.
In these implementations, the executing body may obtain a target word, which may include any word selected from the historical search data. As an example, the target word may be "happy".
In a second step, at least one n-gram word corresponding to the target word is generated.
In these implementations, the executing body may generate, in various ways, at least one n-gram word corresponding to the target word obtained in the first step. As an example, with n equal to 2, the n-gram words corresponding to "happy" may include at least one of "ha", "ap", "pp", "py".
In a third step, morphological transformation is performed on the target word and the corresponding at least one n-gram word to generate a set of transformed words.
In these implementations, the executing body may perform various morphological transformations on the target word obtained in the first step and the at least one n-gram word generated in the second step to generate a set of transformed words. The morphological transformations may include but are not limited to at least one of the following: character deletion, character repetition, character swapping. As an example, the set of transformed words generated by morphologically transforming "happy" may include but is not limited to at least two of the following: "hapy", "hhappy", "hyppa", "hha", "p", "pa".
Optionally, the morphological transformation may include character replacement. The executing body may perform the morphological transformation on the target word and the corresponding at least one n-gram word to generate the set of transformed words according to the following steps:
S1: a to-be-replaced word is selected from the target word and the corresponding at least one n-gram word.
In these implementations, the executing body may select, in various ways, at least one word from the target word obtained in the first step and the corresponding at least one n-gram word as the to-be-replaced word, for example by random selection, or by selecting words whose number of characters is greater than a preset value.
S2: characters in the to-be-replaced word are replaced according to preset probabilities to generate transformed words.
In these implementations, the executing body may replace characters in the to-be-replaced word according to preset probabilities to generate transformed words, where the preset probabilities may be associated with the positions of the keys representing different characters on a keyboard. As an example, on a keyboard, the keys immediately adjacent to the "S" key may include those corresponding to the characters "A", "W", "D", "X", and the next-nearest keys may include those corresponding to "Q", "E", "C", "Z". Accordingly, the preset probability corresponding to "A", "W", "D", "X" (for example 0.7) is generally higher than that corresponding to "Q", "E", "C", "Z" (for example 0.3). When the to-be-replaced word contains the character "s", the executing body may replace it according to these preset probabilities, for example transforming the to-be-replaced word "smile" into "amile" as a transformed word. (A sketch of this replacement follows.)
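A minimal sketch of such keyboard-aware replacement; the neighbour table (only "s" is filled in), the 0.7/0.3 weights from the example above, and the overall replacement probability are illustrative assumptions:

```python
import random

# Keys near 's' on a QWERTY keyboard, weighted by the preset probabilities.
NEIGHBOURS = {"s": [("a", 0.7), ("w", 0.7), ("d", 0.7), ("x", 0.7),
                    ("q", 0.3), ("e", 0.3), ("c", 0.3), ("z", 0.3)]}

def replace_chars(word, replace_prob=0.5):
    out = []
    for ch in word:
        options = NEIGHBOURS.get(ch)
        if options and random.random() < replace_prob:
            chars, weights = zip(*options)
            out.append(random.choices(chars, weights=weights)[0])  # nearer keys more likely
        else:
            out.append(ch)
    return "".join(out)

print(replace_chars("smile"))  # e.g. 'amile' (output varies run to run)
```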
In a fourth step, training samples are generated based on the set of transformed words.
In these implementations, the training samples may include the words morphologically transformed in the third step. Optionally, the training samples may also include the untransformed words among the target word and the corresponding at least one n-gram word. Optionally, the executing body may also deduplicate the target word, the corresponding at least one n-gram word, and the transformed words to generate the training samples.
Based on these optional implementations, the executing body can construct training samples different from those found in real historical search data, thereby compensating for the limited coverage of historical data, which helps reduce overfitting of the model and improves its generalization ability.
Step 303: take a first word of a training sample in the set as the input of the initial model, take a second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model.
In this embodiment, the executing body may use machine learning methods to train the word vector generation model, taking a first word of a training sample as the input of the initial model and the second word corresponding to the input first word as the expected output. The first word and the second word generally belong to the same training sample. The hidden layer of the initial word vector generation model in the initial model may be used to output word vectors. In each iteration, the first word and the second word may be chosen in multiple ways. As an example, the executing body may randomly select from a training sample any two words that have not been selected together before as the first word and the second word. As another example, the executing body may select two words from a training sample according to a preset sliding window as the first word and the second word.
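A minimal sketch of forming (first word, second word) pairs from one training sample with a sliding window, where the window size is an arbitrary assumption; each pair then serves as (input, expected output) for the initial model, and only the embedding layer is kept after training:

```python
def make_pairs(sample, window=1):
    # For each word, pair it with the words inside the sliding window around it.
    pairs = []
    for i, first in enumerate(sample):
        neighbours = sample[max(0, i - window):i] + sample[i + 1:i + 1 + window]
        pairs.extend((first, second) for second in neighbours)
    return pairs

print(make_pairs(["ear", "year", "yeer"]))
# [('ear', 'year'), ('year', 'ear'), ('year', 'yeer'), ('yeer', 'year')]
```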
Continuing to refer to Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the method for training a word vector generation model according to an embodiment of the present application. In the application scenario of Fig. 4, a back-end server may obtain an initial model 401 and a set of training samples 402. The initial model 401 may include an initial word vector generation model 4011 and an output layer 4012. The set of training samples may include a training sample of "露西", "露茜", "璐西" (variant transliterations of "Lucy" that are similar in sound and shape). The back-end server may take "露茜" 4031 in the training sample as the first word and "露西" 4032 as the second word, input "露茜" 4031 into the initial model 401, and take "露西" 4032 as the expected output of the initial model 401. The back-end server may stop training once a training end condition is satisfied, and determine the initial word vector generation model 4011 in the resulting initial model 401 as the word vector generation model. The training end condition may include, for example, the training samples having been trained over 10 times in total (epoch = 10).
At present, one of the existing techniques usually pre-builds word-level word vectors based on the meanings of words and then generates recall information based on matching of prefixes, characters, and the like, so the existing word vectors cannot exploit word-shape or pronunciation features. In practical applications, input errors during text search, or searches such as person names that cannot make full use of word meanings, fail to yield returned results that satisfy application requirements. The method provided by the above embodiments of the present application trains the word vector generation model with training samples that are similar in sound and/or shape, so that the generated word vectors can reflect features other than the meanings of the words themselves, providing a reliable data basis for improving the quality of fuzzy search.
With further reference to Fig. 5, as an implementation of the method shown in Fig. 2 above, the present application provides an embodiment of an information generation apparatus based on a word vector generation model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may specifically be applied in various electronic devices.
As shown in Fig. 5, the information generation apparatus 500 based on a word vector generation model provided by this embodiment includes a word obtaining unit 501, a vector generation unit 502, and a similarity generation unit 503. The word obtaining unit 501 is configured to obtain a search term; the vector generation unit 502 is configured to input the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, where the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity including at least one of the following: similarity in sound, similarity in shape; the similarity generation unit 503 is configured to generate similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, where the to-be-matched word vectors in the set are obtained based on the word vector generation model.
In this embodiment, for the specific processing of the word obtaining unit 501, the vector generation unit 502, and the similarity generation unit 503 in the information generation apparatus 500 and the technical effects they bring, reference may be made to the descriptions of step 201, step 202, and step 203 in the embodiment corresponding to Fig. 2, respectively, which are not repeated here.
In some optional implementations of this embodiment, the information generation apparatus 500 may further include: a selection unit (not shown), a ranking unit (not shown), and a sending unit (not shown). The selection unit may be configured to select a first target number of to-be-matched word vectors from the set of to-be-matched word vectors according to the magnitudes of the determined similarities. The ranking unit may be configured to rerank the selected first target number of to-be-matched word vectors to generate a returned word sequence, where the order of the words in the returned word sequence may correspond to the order of the reranked to-be-matched word vectors. The sending unit may be configured to send the returned word sequence to a target device.
In the apparatus provided by the above embodiment of the present application, the word obtaining unit 501 obtains a search term; the vector generation unit 502 then inputs the search term into a pre-trained word vector generation model, trained by the method for training a word vector generation model described above, to obtain a word vector corresponding to the search term; the similarity generation unit 503 then generates similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, where the to-be-matched word vectors in the set may be obtained based on the word vector generation model. The apparatus can thus make full use of the affinities, learned by the word vector generation model, between features other than the meanings of the words themselves, and can significantly improve the quality of fuzzy retrieval when the input contains errors or when the search carries no obvious semantics (for example, a person's name).
With further reference to Fig. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for training a word vector generation model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 3, and the apparatus may specifically be applied in various electronic devices.
As shown in Fig. 6, the apparatus 600 for training a word vector generation model provided by this embodiment includes a model obtaining unit 601, a sample obtaining unit 602, and a training unit 603. The model obtaining unit 601 is configured to obtain an initial model, where the initial model includes an initial word vector generation model and an output layer; the sample obtaining unit 602 is configured to obtain a set of training samples, where a training sample in the set includes at least two similar words, the similarity including at least one of the following: similarity in sound, similarity in shape; the training unit 603 is configured to take a first word of a training sample in the set as the input of the initial model, take a second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample.
In this embodiment, for the specific processing of the model obtaining unit 601, the sample obtaining unit 602, and the training unit 603 in the apparatus 600 and the technical effects they bring, reference may be made to the descriptions of step 301, step 302, and step 303 in the embodiment corresponding to Fig. 3, respectively, which are not repeated here.
In some optional implementations of this embodiment, the sample obtaining unit 602 may include: a first obtaining subunit (not shown), a second obtaining subunit (not shown), and a first combining subunit (not shown). The first obtaining subunit may be configured to obtain a first historical matched word, which may include a word selected from the search results returned for a first historical search term. The second obtaining subunit may be configured to obtain at least one first historical search term corresponding to the first historical matched word. The first combining subunit may be configured to combine the at least one first historical search term corresponding to the first historical matched word into a training sample.
In some optional implementations of this embodiment, the sample obtaining unit 602 may include: a third obtaining subunit (not shown), a fourth obtaining subunit (not shown), a selection subunit (not shown), and a second combining subunit (not shown). The third obtaining subunit may be configured to obtain a second historical search term. The fourth obtaining subunit may be configured to obtain at least one second historical matched word corresponding to the second historical search term, where the second historical matched word may include the search results returned for the second historical search term. The selection subunit may be configured to select a second target number of second historical matched words from the at least one second historical matched word according to click-through rate. The second combining subunit may be configured to combine the second historical search term and the selected second target number of second historical matched words into a training sample.
In some optional implementations of this embodiment, the words included in the training samples in the set may include a word in a phonographic script and at least one n-gram word corresponding to the word in the phonographic script.
In some optional implementations of this embodiment, the sample obtaining unit 602 may include: a fifth obtaining subunit (not shown), a first generation subunit (not shown), a second generation subunit (not shown), and a third generation subunit (not shown). The fifth obtaining subunit may be configured to obtain a target word. The first generation subunit may be configured to generate at least one n-gram word corresponding to the target word. The second generation subunit may be configured to perform morphological transformation on the target word and the corresponding at least one n-gram word to generate a set of transformed words. The third generation subunit may be configured to generate training samples based on the set of transformed words.
In some optional implementations of this embodiment, the morphological transformation may include character replacement. The second generation subunit may include: a selection module (not shown) and a generation module (not shown). The selection module may be configured to select a to-be-replaced word from the target word and the corresponding at least one n-gram word. The generation module may be configured to replace characters in the to-be-replaced word according to preset probabilities to generate transformed words, where the preset probabilities may be associated with the positions of the keys representing different characters on a keyboard.
In the apparatus provided by the above embodiment of the present application, the model obtaining unit 601 obtains an initial model including an initial word vector generation model and an output layer; the sample obtaining unit 602 then obtains a set of training samples in which each training sample includes at least two similar words, the similarity including at least one of the following: similarity in sound, similarity in shape; the training unit 603 then takes a first word of a training sample as the input of the initial model, takes the second word corresponding to the input first word as the expected output, and determines the initial word vector generation model of the trained initial model as the word vector generation model, where the first word and the second word belong to the same training sample. The word vector generation model is thus trained with training samples that are similar in sound and/or shape, so that the generated word vectors can reflect features other than the meanings of the words themselves, providing a reliable data basis for improving the quality of fuzzy search.
Referring now to Fig. 7, it shows a schematic structural diagram of an electronic device (for example the server in Fig. 1) 700 suitable for implementing the embodiments of the present application. Terminal devices in the embodiments of the present application may include but are not limited to mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The server shown in Fig. 7 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 7, the electronic device 700 may include a processing apparatus (for example a central processing unit, a graphics processing unit, etc.) 701, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage apparatus 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing apparatus 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following apparatuses may be connected to the I/O interface 705: input apparatuses 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output apparatuses 707 including, for example, a liquid crystal display (LCD), speakers, and vibrators; storage apparatuses 708 including, for example, magnetic tapes and hard disks; and a communication apparatus 709. The communication apparatus 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 7 shows the electronic device 700 with various apparatuses, it should be understood that it is not required to implement or possess all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or possessed. Each block shown in Fig. 7 may represent one apparatus, or multiple apparatuses as needed.
In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present application include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication apparatus 709, or installed from the storage apparatus 708, or installed from the ROM 702. When the computer program is executed by the processing apparatus 701, the above functions defined in the methods of the embodiments of the present application are executed.
It should be noted that the computer-readable medium described in the embodiments of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present application, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the embodiments of the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit the program for use by or in conjunction with the instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: electric wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
The computer-readable medium may be included in the server, or it may exist alone without being assembled into the server. The computer-readable medium carries one or more programs which, when executed by the server, cause the server to: obtain a search term; input the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, where the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity including at least one of the following: similarity in sound, similarity in shape; and generate similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, where the to-be-matched word vectors in the set are obtained based on the word vector generation model.
Computer program code for executing the operations of the embodiments of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code containing one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or in a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described units may also be arranged in a processor; for example, they may be described as: a processor including a word obtaining unit, a vector generation unit, and a similarity generation unit. The names of these units do not, in certain cases, limit the units themselves; for example, the word obtaining unit may also be described as "a unit for obtaining a search term".
The above description is merely a preferred embodiment of the present application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of invention involved in the embodiments of the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present application.

Claims (12)

  1. An information generation method based on a word vector generation model, comprising:
    obtaining a search term;
    inputting the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, wherein the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity comprising at least one of the following: similarity in sound, similarity in shape;
    generating similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, wherein the to-be-matched word vectors in the set of to-be-matched word vectors are obtained based on the word vector generation model.
  2. The method according to claim 1, wherein the method further comprises:
    selecting a first target number of to-be-matched word vectors from the set of to-be-matched word vectors according to the magnitudes of the determined similarities;
    reranking the selected first target number of to-be-matched word vectors to generate a returned word sequence, wherein the order of the words in the returned word sequence corresponds to the order of the reranked to-be-matched word vectors;
    sending the returned word sequence to a target device.
  3. A method for training a word vector generation model, comprising:
    obtaining an initial model, wherein the initial model comprises an initial word vector generation model and an output layer;
    obtaining a set of training samples, wherein a training sample in the set comprises at least two similar words, the similarity comprising at least one of the following: similarity in sound, similarity in shape;
    taking a first word of a training sample in the set as the input of the initial model, taking a second word corresponding to the input first word as the expected output, and determining the initial word vector generation model of the trained initial model as the word vector generation model, wherein the first word and the second word belong to the same training sample.
  4. The method according to claim 3, wherein obtaining the set of training samples comprises:
    obtaining a first historical matched word, wherein the first historical matched word comprises a word selected from the search results returned for a first historical search term;
    obtaining at least one first historical search term corresponding to the first historical matched word;
    combining the at least one first historical search term corresponding to the first historical matched word into a training sample.
  5. The method according to claim 3, wherein obtaining the set of training samples comprises:
    obtaining a second historical search term;
    obtaining at least one second historical matched word corresponding to the second historical search term, wherein the second historical matched word comprises a search result returned for the second historical search term;
    selecting a second target number of second historical matched words from the at least one second historical matched word according to click-through rate;
    combining the second historical search term and the selected second target number of second historical matched words into a training sample.
  6. The method according to any one of claims 3-5, wherein the words comprised in the training samples in the set comprise a word in a phonographic script and at least one n-gram word corresponding to the word in the phonographic script.
  7. The method according to claim 6, wherein obtaining the set of training samples comprises:
    obtaining a target word;
    generating at least one n-gram word corresponding to the target word;
    performing morphological transformation on the target word and the corresponding at least one n-gram word to generate a set of transformed words;
    generating training samples based on the set of transformed words.
  8. The method according to claim 7, wherein the morphological transformation comprises character replacement; and
    performing morphological transformation on the target word and the corresponding at least one n-gram word to generate a set of transformed words comprises:
    selecting a to-be-replaced word from the target word and the corresponding at least one n-gram word;
    replacing characters in the to-be-replaced word according to preset probabilities to generate transformed words, wherein the preset probabilities are associated with the positions of the keys representing different characters on a keyboard.
  9. An information generation apparatus based on a word vector generation model, comprising:
    a word obtaining unit configured to obtain a search term;
    a vector generation unit configured to input the search term into a pre-trained word vector generation model to obtain a word vector corresponding to the search term, wherein the word vector generation model is used to generate word vectors based on non-semantic similarity, the non-semantic similarity comprising at least one of the following: similarity in sound, similarity in shape;
    a similarity generation unit configured to generate similarities between the word vector corresponding to the search term and the to-be-matched word vectors in a preset set of to-be-matched word vectors, wherein the to-be-matched word vectors in the set of to-be-matched word vectors are obtained based on the word vector generation model.
  10. An apparatus for training a word vector generation model, comprising:
    a model obtaining unit configured to obtain an initial model, wherein the initial model comprises an initial word vector generation model and an output layer;
    a sample obtaining unit configured to obtain a set of training samples, wherein a training sample in the set comprises at least two similar words, the similarity comprising at least one of the following: similarity in sound, similarity in shape;
    a training unit configured to take a first word of a training sample in the set as the input of the initial model, take a second word corresponding to the input first word as the expected output, and determine the initial word vector generation model of the trained initial model as the word vector generation model, wherein the first word and the second word belong to the same training sample.
  11. A server, comprising:
    one or more processors;
    a storage device storing one or more programs thereon;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-8.
  12. A computer-readable medium storing a computer program thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-8.
PCT/CN2021/102487 2020-06-29 2021-06-25 Information generation method and apparatus based on word vector generation model WO2022001888A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010604164.0A CN111753551B (zh) 2020-06-29 2020-06-29 Information generation method and apparatus based on word vector generation model
CN202010604164.0 2020-06-29

Publications (1)

Publication Number Publication Date
WO2022001888A1 WO2022001888A1 (zh)

Family

ID=72676772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102487 WO2022001888A1 (zh) 2020-06-29 2021-06-25 Information generation method and apparatus based on word vector generation model

Country Status (2)

Country Link
CN (1) CN111753551B (zh)
WO (1) WO2022001888A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111198965B (zh) * 2019-12-31 2024-04-19 Tencent Technology (Shenzhen) Co., Ltd. Song retrieval method and apparatus, server, and storage medium
CN111753551B (zh) * 2020-06-29 2022-06-14 Beijing ByteDance Network Technology Co., Ltd. Information generation method and apparatus based on word vector generation model
CN113239257B (zh) * 2021-06-07 2024-05-14 Beijing Zitiao Network Technology Co., Ltd. Information processing method and apparatus, electronic device, and storage medium
CN113407814B (zh) * 2021-06-29 2023-06-16 Douyin Vision Co., Ltd. Text search method and apparatus, readable medium, and electronic device


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335391A (zh) * 2014-07-09 2016-02-17 Alibaba Group Holding Ltd. Method and apparatus for processing search requests based on a search engine
US20180365231A1 (en) 2017-06-19 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for generating parallel text in same language
CN109460461A (zh) * 2018-11-13 2019-03-12 AISpeech Co., Ltd. (Suzhou) Text matching method and system based on a text similarity model
CN110879832A (zh) * 2019-10-23 2020-03-13 Alipay (Hangzhou) Information Technology Co., Ltd. Target text detection method, model training method, apparatus, and device
CN111753551A (zh) * 2020-06-29 2020-10-09 Beijing ByteDance Network Technology Co., Ltd. Information generation method and apparatus based on word vector generation model

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722816A (zh) * 2022-06-09 2022-07-08 Shenzhen Shunyuan Technology Co., Ltd. Intelligent assembly method and system for analog signal isolators
CN114722816B (zh) * 2022-06-09 2022-08-19 Shenzhen Shunyuan Technology Co., Ltd. Intelligent assembly method and system for analog signal isolators
CN116820986A (zh) * 2023-06-30 2023-09-29 Nanjing Shurui Data Technology Co., Ltd. Mobile application test script generation method and apparatus, electronic device, and medium
CN116820986B (zh) * 2023-06-30 2024-02-27 Nanjing Shurui Data Technology Co., Ltd. Mobile application test script generation method and apparatus, electronic device, and medium
CN117725414A (zh) * 2023-12-13 2024-03-19 Beijing Haitai Fangyuan Technology Co., Ltd. Method for training a content generation model, method for determining output content, apparatus, and device

Also Published As

Publication number Publication date
CN111753551A (zh) 2020-10-09
CN111753551B (zh) 2022-06-14


Legal Events

  • 121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21833292; Country of ref document: EP; Kind code of ref document: A1)
  • NENP — Non-entry into the national phase (Ref country code: DE)
  • 32PN — EP: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.04.2023))
  • 122 — EP: PCT application non-entry in European phase (Ref document number: 21833292; Country of ref document: EP; Kind code of ref document: A1)