CN112507726A - Training method and device for embedding sense item into vector - Google Patents

Training method and device for embedding sense item into vector

Info

Publication number
CN112507726A
Authority
CN
China
Prior art keywords
item
semantic
meaning
keywords
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011465969.8A
Other languages
Chinese (zh)
Inventor
赵佰承
弓利鹏
宫兆汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202011465969.8A priority Critical patent/CN112507726A/en
Publication of CN112507726A publication Critical patent/CN112507726A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application discloses a training method for the embedded vector of a sense item. Specifically, a first search term can be obtained, where the first search term comprises at least two sense items; for convenience of description, any one of the at least two sense items is called the first sense item, and the first sense item is provided with an encyclopedia entry. In this application, when training the embedded vector of the first sense item, keywords related to the first sense item may be determined based on the web pages related to the first sense item and the encyclopedia entry of the first sense item, and the embedded vector of the first sense item may then be obtained by training with these keywords. Therefore, in this application, when the embedded vector of the first sense item is trained, not only the encyclopedia entry of the first sense item but also the web pages related to the first sense item are taken into account, which improves the accuracy of the trained embedded vector of the first sense item.

Description

Training method and device for embedding sense item into vector
Technical Field
The present application relates to the field of data processing, and in particular, to a method and an apparatus for training a semantic item embedded vector.
Background
A word may have multiple meanings, and each meaning may also be referred to as a semantic item. In some scenarios, it is desirable to train the embedded vector of a semantic item so that the embedded vector can be used in further analysis and processing. The embedded vector of a semantic item is a vectorized representation of the semantic information of that semantic item.
At present, in some scenarios the embedded vector of a semantic item cannot be trained accurately, so a solution to this problem is urgently needed.
Disclosure of Invention
The technical problem to be solved by the present application is how to accurately train the embedded vector of a semantic item; to this end, a training method and device for the embedded vector of a semantic item are provided.
In a first aspect, an embodiment of the present application provides a method for training a semantic item embedded vector, where the method includes:
acquiring a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
determining keywords related to the first semantic item according to the webpage related to the first semantic item and the encyclopedia entry of the first semantic item;
and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
Optionally, the determining, according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item, a keyword related to the first semantic item includes:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
Optionally, the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training to obtain the embedded vector of the first meaning item by using the keywords related to the first meaning item comprises:
training to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
Optionally, the method further includes:
acquiring a webpage corresponding to the first search term;
determining word embedding vectors corresponding to all the webpages in the webpages corresponding to the first search word;
and determining the web pages corresponding to each semantic item in the web pages corresponding to the first search term according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page.
Optionally, the method further includes:
responding to a search operation triggered by a user aiming at the first meaning item;
acquiring a webpage corresponding to the first meaning item in the webpages corresponding to the first search term;
and displaying the webpage corresponding to the first meaning item.
In a second aspect, an embodiment of the present application provides a training apparatus for embedding a sense item into a vector, where the apparatus includes:
a first obtaining unit, configured to obtain a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
a first determining unit, configured to determine a keyword related to the first semantic item according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item;
and the training unit is used for training to obtain an embedded vector of the first search term corresponding to the first meaning item by utilizing the keywords related to the first meaning item.
Optionally, the first determining unit is configured to:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
Optionally, the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training unit is configured to:
train to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
Optionally, the apparatus further comprises:
the second acquisition unit is used for acquiring a webpage corresponding to the first search term;
a second determining unit, configured to determine word embedding vectors corresponding to respective web pages in the web pages corresponding to the first search word;
and a third determining unit, configured to determine, according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page, a web page corresponding to each semantic item in the web pages corresponding to the first search term.
Optionally, the apparatus further comprises:
a response unit, configured to respond to a search operation triggered by a user for the first meaning item;
a third obtaining unit, configured to obtain a web page corresponding to the first meaning item from web pages corresponding to the first search term;
and the display unit is used for displaying the webpage corresponding to the first meaning item.
In a third aspect, an embodiment of the present application provides a training apparatus for the embedded vector of a meaning item, including a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, and include instructions for:
acquiring a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
determining keywords related to the first semantic item according to the webpage related to the first semantic item and the encyclopedia entry of the first semantic item;
and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
Optionally, the determining, according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item, a keyword related to the first semantic item includes:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
Optionally, the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training to obtain the embedded vector of the first meaning item by using the keywords related to the first meaning item comprises:
training to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
Optionally, the operations further include:
acquiring a webpage corresponding to the first search term;
determining word embedding vectors corresponding to all the webpages in the webpages corresponding to the first search word;
and determining the web pages corresponding to each semantic item in the web pages corresponding to the first search term according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page.
Optionally, the operations further include:
responding to a search operation triggered by a user aiming at the first meaning item;
acquiring a webpage corresponding to the first meaning item in the webpages corresponding to the first search term;
and displaying the webpage corresponding to the first meaning item.
In a fourth aspect, embodiments of the present application provide a computer-readable medium having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform the method according to any one of the implementations of the first aspect above.
Compared with the prior art, the embodiment of the application has the following advantages:
the embodiment of the application provides a training method for embedding a semantic item into a vector, and specifically, a first search term can be obtained, wherein the first search term comprises at least two semantic items, for convenience of description, any one of the at least two semantic items is called a first semantic item, and the first semantic item is provided with an encyclopedia entry. In this application, when training the embedded vector of the first meaning item, a keyword related to the first meaning item may be determined based on the webpage related to the first meaning item and the encyclopedia entry of the first meaning item, and the embedded vector of the first meaning item may be obtained by further training using the keyword related to the first meaning item. Therefore, in the application, when the embedded vector of the first meaning item is trained, not only the encyclopedia entries of the first meaning item but also the webpages related to the first meaning item are considered, so that the accuracy of the embedded vector of the first meaning item obtained by training is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments described in the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flowchart of a method for training a semantic item embedding vector according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a training apparatus for embedding a sense item into a vector according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a client according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The inventor of the present application has found through research that, at present, when the embedded vector of a semantic item is trained, the training may be performed in combination with the encyclopedia entry of the semantic item. In one example, the embedded vector of the semantic item may be trained using the context corresponding to the hyperlinks in the encyclopedia entry of the semantic item. As a result, the accuracy of the trained embedded vector depends on the accuracy of the context corresponding to the hyperlinks in the encyclopedia entry of the semantic item; if that context is not accurate, the accuracy of the trained embedded vector of the semantic item will not be high either.
In order to solve the above problem, an embodiment of the present application provides a method for training a semantic item embedded vector, which can improve the accuracy of the semantic item embedded vector obtained by training.
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Exemplary method
Referring to fig. 1, the figure is a schematic flowchart of a method for training a semantic item embedding vector according to an embodiment of the present application.
The method for training a semantic item embedded vector provided in the embodiment of the present application may be executed by a controller or a processor having a data processing function, or may be executed by a device including the controller or the processor, which is not particularly limited in the embodiment of the present application. The device including the controller or the processor includes, but is not limited to, a terminal device and a server.
In the present embodiment, the training method of the semantic item embedding vector shown in fig. 1 may include the following steps S101 to S103, for example.
S101: the method comprises the steps of obtaining a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry.
In an embodiment of the present application, the first search term may be input by a user in a search input field provided by a search engine. In one example, a user may input a search sentence in a search input area provided by a search engine, and the first search word may be obtained by segmenting the search sentence. In yet another example, the content entered by the user in the search input field provided by the search engine may also include only the first search term.
In an embodiment of the present application, the first search term includes a plurality of semantic items. For convenience of description, any one of the plurality of semantic items will be referred to as the first semantic item. The first search term is now illustrated with an example: the first search term may be "Venus", which includes at least two semantic items, one of which is "Chinese talk show host" and the other of which is "planet in the solar system".
S102: and determining keywords related to the first meaning item according to the webpage related to the first meaning item and the encyclopedia entry of the first meaning item.
In embodiments of the present application, when the embedded vector of the first meaning item is trained, the training may be performed using the keywords related to the first meaning item. Thus, the keywords related to the first meaning item can affect the accuracy of the trained embedded vector of the first meaning item.
In the embodiment of the present application, it is considered that the web pages related to the first meaning item and the encyclopedia entry of the first meaning item both include words with a relatively high degree of correlation with the first meaning item; therefore, the keywords related to the first meaning item may be determined according to the web pages related to the first meaning item and the encyclopedia entry of the first meaning item. In this way, the accuracy of the trained embedded vector of the first meaning item no longer depends solely on the encyclopedia entry of the first meaning item. Correspondingly, any negative influence on the accuracy of the trained embedded vector caused by inaccuracies in the encyclopedia entry of the first meaning item can, to a certain extent, be compensated by the web pages related to the first meaning item, so that the accuracy of the trained embedded vector of the first meaning item is higher.
In one implementation manner of the embodiment of the present application, S102, when implemented specifically, may include the following steps A and B, for example.
Step A: and determining the webpage related to the first meaning item, and extracting the keywords of the webpage.
In the embodiment of the application, a search can be performed by taking "first meaning item + first search term" as the search keyword to obtain the web pages related to the first meaning item. For example, if the first meaning item is "Chinese talk show host" and the first search term is "Venus", the search may be performed with "Venus talk show host" as the search keyword to obtain the web pages related to the first meaning item. For another example, if the first meaning item is "planet in the solar system" and the first search term is "Venus", the search may be performed with "planet Venus in the solar system" as the search keyword to obtain the web pages related to the first meaning item.
After determining the web page to which the first semantic item is related, keywords of the web page may be extracted. In the embodiment of the present application, the keywords of each web page may be extracted respectively, so as to obtain the keywords of the plurality of web pages. Taking the first web page as an example, a specific implementation manner of extracting the keywords of the first web page is described below.
In the embodiment of the application, the text included in the first web page can be segmented into words to obtain a plurality of word segments. The term frequency-inverse document frequency (TF-IDF) value of each word segment is then calculated, and the word segments with larger TF-IDF values, for example the word segments whose TF-IDF value is greater than a first threshold, are selected as the keywords of the first web page. For example, if the web pages corresponding to the first meaning item include 5 web pages and each web page yields 5 keywords, at most 25 keywords are obtained after step A is executed.
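As a concrete illustration of step A, the following is a minimal sketch of threshold-based TF-IDF keyword extraction; the word segmenter, the background document collection used for the inverse document frequency, the IDF smoothing, and the threshold value are not given in the publication and are assumptions made here for illustration.

import math
from collections import Counter

def tfidf_keywords(page_tokens, corpus_token_sets, threshold):
    """Return the word segments of one page whose TF-IDF value exceeds threshold.

    page_tokens:       list of word segments of the page being processed.
    corpus_token_sets: list of sets of word segments, one set per page in a
                       background collection, used to estimate document frequencies.
    """
    counts = Counter(page_tokens)
    total = len(page_tokens)
    n_docs = len(corpus_token_sets)

    keywords = {}
    for token, count in counts.items():
        tf = count / total
        df = sum(1 for doc in corpus_token_sets if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (an assumption)
        score = tf * idf
        if score > threshold:
            keywords[token] = score
    return keywords   # keyword -> TF-IDF value

The TF-IDF values are returned along with the keywords because, as described later for equation (1), the same values are reused to weight the loss function during training.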
Step B: matching the keywords of the web pages with the encyclopedia entry of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
In the embodiment of the present application, in order to further ensure that the keywords related to the first meaning item have a relatively high degree of correlation with the first meaning item, after the keywords of the web pages are obtained in step A, the keywords of the web pages can be matched against the encyclopedia entry of the first meaning item. For example, the keywords of the web pages are matched against the word segments included in one or several areas of the encyclopedia entry of the first meaning item, and the keywords of the web pages that match the encyclopedia entry of the first meaning item are determined as the keywords related to the first meaning item.
A keyword of a web page is considered to match the encyclopedia entry of the first meaning item if the keyword appears in the encyclopedia entry of the first meaning item, or if its meaning is the same as or similar to that of a word segment in the encyclopedia entry of the first meaning item. It can be understood that the keywords of the web pages that match the encyclopedia entry of the first meaning item have a higher degree of correlation with the first meaning item; therefore, determining these matched keywords as the keywords related to the first meaning item can improve the accuracy of the trained embedded vector of the first meaning item.
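Under the first of the two matching criteria just described (the keyword literally appears in the encyclopedia entry), step B could be realized as in the following sketch; the variant based on semantic similarity would need an additional similarity check and is not shown.

def match_keywords_to_entry(page_keywords, entry_text):
    """Keep only the page keywords that literally occur in the encyclopedia entry text.

    page_keywords: dict keyword -> TF-IDF value (output of the step-A sketch above).
    entry_text:    text of the encyclopedia entry of the first meaning item
                   (or of the selected areas of it).
    """
    return {kw: score for kw, score in page_keywords.items() if kw in entry_text}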
S103: and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
In this embodiment of the present application, after determining the keywords related to the first meaning item corresponding to the first search term in S102, an embedded vector of the first search term corresponding to the first meaning item may be obtained by training with the keywords related to the first meaning item. In this embodiment of the present application, a skip-gram model and the keywords related to the first meaning item may be used to train the embedded vector of the first search term corresponding to the first meaning item, which may also simply be referred to as the embedded vector of the first meaning item.
In the embodiment of the application, in order to further improve the accuracy of the trained embedded vector of the first meaning item, when training the embedded vector of the first meaning item with the skip-gram model and the keywords related to the first meaning item, the loss function of the skip-gram model may be modified so that the keywords with a high degree of correlation with the first meaning item play a larger role.
In the embodiment of the present application, the degree of correlation of each keyword among the keywords related to the first meaning item may be determined according to the TF-IDF value of that keyword, where the TF-IDF value of a keyword refers to its TF-IDF value in the web page, related to the first meaning item, from which the keyword was extracted.
The larger the TF-IDF value is, the higher the degree of correlation between the keyword and the first meaning item. Thus, in one example, the loss function of the modified skip-gram model may be derived from the TF-IDF values of the keywords related to the first meaning item. In one example, the loss function of the skip-gram model may be as shown in equation (1) below.
(In the original publication, equation (1) is reproduced as an image: Figure BDA0002834212090000091.)
In equation (1):
Loss is the loss function of the skip-gram model;
N is the set of keywords related to the first meaning item;
j is the j-th keyword in the set of keywords related to the first meaning item;
TFIDF(j) is the TF-IDF value of the j-th keyword;
u is the embedded vector of the j-th keyword corresponding to the first meaning item;
w_j is the word embedding vector of the j-th keyword;
V_j is the word set consisting of the j-th keyword and its negative-sampled keywords;
w_j' is the word embedding vector of the j'-th keyword in V_j.
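The formula of equation (1) appears in the original publication only as an image, so its exact form cannot be recovered here. A TF-IDF-weighted skip-gram loss with negative sampling of the following general form would, however, be consistent with the symbol definitions above (with u and w_j as defined there, and with \sigma denoting the logistic sigmoid); this is a reconstruction under those assumptions, not the published formula:

\mathrm{Loss} = -\sum_{j \in N} \mathrm{TFIDF}(j) \left[ \log \sigma\left(u^{\top} w_j\right) + \sum_{j' \in V_j,\ j' \neq j} \log \sigma\left(-u^{\top} w_{j'}\right) \right]

In words: each keyword's contribution to the standard skip-gram negative-sampling objective is scaled by its TF-IDF value, so that keywords with a higher degree of correlation with the first meaning item contribute more to the training of the embedded vector.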
As can be seen from the above description, with the scheme of the embodiment of the present application, the accuracy of the embedded vector of the first search term corresponding to the first semantic item obtained by training can be improved.
In an implementation manner of the embodiment of the present application, after S101 is executed, the web pages corresponding to the first search term may also be acquired, so that the web pages corresponding to the first search term can be displayed to the user. It can be understood that, since the first search term includes at least two semantic items, the acquired web pages corresponding to the first search term may include web pages corresponding to each of the at least two semantic items. For example: the first search term is "Venus", and "Venus" includes the two meaning items "Chinese talk show host" and "planet in the solar system"; therefore, the acquired web pages corresponding to "Venus" may include web pages corresponding to "Chinese talk show host" as well as web pages related to "planet in the solar system".
In this embodiment of the application, after obtaining the web pages corresponding to the first search term, the web pages corresponding to the first search term may be classified according to the following steps D and E, and the web pages corresponding to each semantic item in the web pages corresponding to the first search term are determined.
Step D: and determining word embedding vectors corresponding to all the webpages corresponding to the first search word respectively.
For convenience of description, any one of the web pages corresponding to the first search term is referred to as the "second web page". In an implementation manner of the embodiment of the present application, the word embedding vector of the second web page may be obtained by training with the text in the second web page. In yet another implementation, the keywords of the second web page may be extracted, and the word embedding vector of the second web page may then be obtained using a bag-of-words model and the keywords of the second web page. For the extraction of the keywords of the second web page, reference may be made to the description of keyword extraction in step A, which is not repeated here; the detailed manner of obtaining the word embedding vector of the second web page from the bag-of-words model and the keywords of the second web page is likewise not described in detail here.
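The publication does not spell out how the bag-of-words model combines the keywords of the second web page into a single word embedding vector. One minimal reading, assumed here purely for illustration, is a TF-IDF-weighted average of the keywords' trained word vectors:

import numpy as np

def page_vector(keyword_scores, word_vectors, dim):
    """Combine a page's keywords into one page vector (assumed TF-IDF-weighted average).

    keyword_scores: dict keyword -> TF-IDF value for this page.
    word_vectors:   dict keyword -> trained word embedding (numpy array of length dim).
    """
    vec = np.zeros(dim)
    total_weight = 0.0
    for kw, score in keyword_scores.items():
        if kw in word_vectors:          # skip keywords without a trained vector
            vec += score * word_vectors[kw]
            total_weight += score
    return vec / total_weight if total_weight > 0 else vec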
Step E: and determining the web pages corresponding to each semantic item in the web pages corresponding to the first search term according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page.
It can be understood that the embedded vector of each semantic item corresponding to the first search term represents the semantics of the first search term under that semantic item, and the word embedding vector of a web page likewise represents the semantics of the content included in the web page. Therefore, for the plurality of semantic items of the first search term and the second web page, the degree of correlation between the second web page and each semantic item of the first search term can be obtained from the word embedding vector of the second web page and the embedded vector of each semantic item corresponding to the first search term. In one example, the cosine similarity between the word embedding vector of the second web page and the embedded vector of each semantic item of the first search term may be calculated to obtain the degree of correlation between the second web page and each semantic item of the first search term. When the second web page is classified, the second web page may be determined as a web page corresponding to the semantic item with the highest degree of correlation. For example: the first search term includes two semantic items, semantic item 1 and semantic item 2; if the degree of correlation between the second web page and semantic item 1 is higher than the degree of correlation between the second web page and semantic item 2, the second web page is determined to be a web page corresponding to semantic item 1.
Steps D and E are now illustrated with an example: the first search term is "Venus", and the web pages corresponding to the first search term include 10 web pages. After steps D and E are performed, the web pages related to the semantic item "Chinese talk show host" and the web pages related to the semantic item "planet in the solar system" can be determined from the 10 web pages. For example, the 1st, 3rd, and 5th web pages are web pages related to "planet in the solar system", and the 2nd, 4th, 6th, 7th, 8th, 9th, and 10th web pages are web pages related to "Chinese talk show host".
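Putting steps D and E together, the assignment of pages to semantic items by cosine similarity might look like the following sketch; the function and variable names are illustrative and not taken from the publication.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors; 0 if either vector is all zeros.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def group_pages_by_sense(page_vectors, sense_vectors):
    """page_vectors:  dict page_id -> word embedding vector of the page (step D).
    sense_vectors: dict semantic item -> embedded vector of that semantic item.
    Returns a dict mapping each semantic item to the pages assigned to it (step E)."""
    groups = {sense: [] for sense in sense_vectors}
    for page_id, pvec in page_vectors.items():
        best_sense = max(sense_vectors, key=lambda s: cosine(pvec, sense_vectors[s]))
        groups[best_sense].append(page_id)
    return groups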
In some embodiments, it is considered that, when the first search term includes multiple semantic items, the user may want to search based on only one of them. Therefore, in the embodiment of the present application, after the user inputs the first search term in the search input area provided by the search engine, the respective semantic items of the first search term may also be displayed, and the user may trigger a search operation for one of the semantic items. For example, the user may trigger a search under one of the semantic items of the first search term by clicking on that semantic item.
In order to provide search results that meet the user's needs, in the embodiment of the present application, after the user triggers a search operation for the first meaning item, the search engine may acquire the web pages corresponding to the first meaning item from among the web pages corresponding to the first search term and display the web pages related to the first meaning item. For example: the first search term is "Venus", and the two semantic items corresponding to the search term, "Chinese talk show host" and "planet in the solar system", are displayed in the search result page. The web pages corresponding to the first search term include 10 web pages, and after the foregoing steps D and E are performed, it may be determined that the 1st, 3rd, and 5th of the 10 web pages are web pages related to "planet in the solar system", while the 2nd, 4th, 6th, 7th, 8th, 9th, and 10th web pages are web pages related to "Chinese talk show host". Therefore, when the user triggers a search operation for the meaning item "Chinese talk show host" of "Venus", the web pages related to "Chinese talk show host" among the 10 web pages can be acquired and displayed to the user.
Exemplary device
Based on the method provided by the above embodiment, the embodiment of the present application further provides an apparatus, which is described below with reference to the accompanying drawings.
Referring to fig. 2, the drawing is a schematic structural diagram of a training apparatus for embedding a semantic item into a vector according to an embodiment of the present application. The apparatus 200 may specifically include, for example: a first acquisition unit 201, a first determination unit 202 and a training unit 203.
A first obtaining unit 201, configured to obtain a first search term, where the first search term includes at least two semantic items, where the at least two semantic items include a first semantic item, and the first semantic item includes an encyclopedia entry;
a first determining unit 202, configured to determine a keyword related to the first semantic item according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item;
the training unit 203 is configured to train to obtain an embedded vector of the first search term corresponding to the first meaning item by using the keyword related to the first meaning item.
In one implementation manner, the first determining unit 202 is configured to:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
In one implementation manner, the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training unit 203 is configured to:
train to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
In one implementation, the apparatus further comprises:
the second acquisition unit is used for acquiring a webpage corresponding to the first search term;
a second determining unit, configured to determine word embedding vectors corresponding to respective web pages in the web pages corresponding to the first search word;
and a third determining unit, configured to determine, according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page, a web page corresponding to each semantic item in the web pages corresponding to the first search term.
In one implementation, the apparatus further comprises:
a response unit, configured to respond to a search operation triggered by a user for the first meaning item;
a third obtaining unit, configured to obtain a web page corresponding to the first meaning item from web pages corresponding to the first search term;
and the display unit is used for displaying the webpage corresponding to the first meaning item.
Since the apparatus 200 is an apparatus corresponding to the method provided in the above method embodiment, and the specific implementation of each unit of the apparatus 200 is the same as that of the above method embodiment, for the specific implementation of each unit of the apparatus 200, reference may be made to the description part of the above method embodiment, and details are not repeated here.
The method provided by the embodiment of the present application may be executed by a client or a server, and the client and the server that execute the method are described below separately.
Fig. 3 shows a block diagram of a client 300. For example, the client 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to fig. 3, client 300 may include one or more of the following components: processing component 302, memory 304, power component 306, multimedia component 308, audio component 310, input/output (I/O) interface 33, sensor component 314, and communication component 316.
The processing component 302 generally controls overall operation of the client 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 302 may include one or more processors 320 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 302 can include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 can include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support operations at the client 300. Examples of such data include instructions for any application or method operating on the client 300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 304 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 306 provides power to the various components of the client 300. The power components 306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the client 300.
The multimedia component 308 comprises a screen providing an output interface between the client 300 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 308 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the client 300 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, the audio component 310 includes a Microphone (MIC) configured to receive external audio signals when the client 300 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 304 or transmitted via the communication component 316. In some embodiments, audio component 310 also includes a speaker for outputting audio signals.
The I/O interface provides an interface between the processing component 302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
Sensor component 314 includes one or more sensors for providing status assessment of various aspects to client 300. For example, sensor component 314 may detect an open/closed state of device 300, the relative positioning of components, such as a display and keypad of client 300, sensor component 314 may also detect a change in the position of client 300 or a component of client 300, the presence or absence of user contact with client 300, client 300 orientation or acceleration/deceleration, and a change in the temperature of client 300. Sensor assembly 314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 316 is configured to facilitate communications between the client 300 and other devices in a wired or wireless manner. The client 300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication section 316 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the client 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the following methods:
acquiring a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
determining keywords related to the first semantic item according to the webpage related to the first semantic item and the encyclopedia entry of the first semantic item;
and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
In one implementation, the determining, according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item, a keyword related to the first semantic item includes:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
In one implementation manner, the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training to obtain the embedded vector of the first meaning item by using the keywords related to the first meaning item comprises:
training to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
In one implementation, the method further comprises:
acquiring a webpage corresponding to the first search term;
determining word embedding vectors corresponding to all the webpages in the webpages corresponding to the first search word;
and determining the web pages corresponding to each semantic item in the web pages corresponding to the first search term according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page.
In one implementation, the method further comprises:
responding to a search operation triggered by a user aiming at the first meaning item;
acquiring a webpage corresponding to the first meaning item in the webpages corresponding to the first search term;
and displaying the webpage corresponding to the first meaning item.
Fig. 4 is a schematic structural diagram of a server in an embodiment of the present application. The server 400 may vary significantly due to configuration or performance, and may include one or more Central Processing Units (CPUs) 422 (e.g., one or more processors) and memory 432, one or more storage media 430 (e.g., one or more mass storage devices) storing applications 442 or data 444. Wherein the memory 432 and storage medium 430 may be transient or persistent storage. The program stored on the storage medium 430 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 422 may be arranged to communicate with the storage medium 430, and execute a series of instruction operations in the storage medium 430 on the server 400.
Still further, the central processor 422 may perform the following method:
acquiring a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
determining keywords related to the first semantic item according to the webpage related to the first semantic item and the encyclopedia entry of the first semantic item;
and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
In one implementation, the determining, according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item, a keyword related to the first semantic item includes:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
In one implementation manner, the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training to obtain the embedded vector of the first meaning item by using the keywords related to the first meaning item comprises:
training to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
In one implementation, the method further comprises:
acquiring a webpage corresponding to the first search term;
determining word embedding vectors corresponding to all the webpages in the webpages corresponding to the first search word;
and determining the web pages corresponding to each semantic item in the web pages corresponding to the first search term according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page.
In one implementation, the method further comprises:
responding to a search operation triggered by a user aiming at the first meaning item;
acquiring a webpage corresponding to the first meaning item in the webpages corresponding to the first search term;
and displaying the webpage corresponding to the first meaning item.
The server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input-output interfaces 456, one or more keyboards 456, and/or one or more operating systems 441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
Embodiments of the present application also provide a computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause an apparatus to perform the method for training a semantic embedded vector provided by the above method embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for training a semantic item embedded vector, the method comprising:
acquiring a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
determining keywords related to the first semantic item according to the webpage related to the first semantic item and the encyclopedia entry of the first semantic item;
and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
2. The method of claim 1, wherein determining keywords related to the first semantic item based on the web page related to the first semantic item and the encyclopedia entry of the first semantic item comprises:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
3. The method according to claim 2, wherein the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training to obtain the embedded vector of the first meaning item by using the keywords related to the first meaning item comprises:
training to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
4. The method of claim 1, further comprising:
acquiring a webpage corresponding to the first search term;
determining word embedding vectors corresponding to all the webpages in the webpages corresponding to the first search word;
and determining the web pages corresponding to each semantic item in the web pages corresponding to the first search term according to the embedded vector of each semantic item corresponding to the first search term and the word embedded vector corresponding to each web page.
5. The method of claim 4, further comprising:
responding to a search operation triggered by a user aiming at the first meaning item;
acquiring a webpage corresponding to the first meaning item in the webpages corresponding to the first search term;
and displaying the webpage corresponding to the first meaning item.
6. An apparatus for training a semantic item embedding vector, the apparatus comprising:
a first obtaining unit, configured to obtain a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
a first determining unit, configured to determine a keyword related to the first semantic item according to the web page related to the first semantic item and the encyclopedia entry of the first semantic item;
and the training unit is used for training to obtain an embedded vector of the first search term corresponding to the first meaning item by utilizing the keywords related to the first meaning item.
7. The apparatus of claim 6, wherein the first determining unit is configured to:
determining a webpage related to the first meaning item, and extracting keywords of the webpage;
and matching the keywords of the webpage with the encyclopedia entries of the first meaning item, and determining the matched keywords as the keywords related to the first meaning item.
8. The apparatus according to claim 7, wherein the keywords of the web page are the word segments, among the word segments included in the web page, whose term frequency-inverse document frequency (TF-IDF) value is greater than a first threshold; and the training unit is configured to:
train to obtain the embedded vector of the first meaning item by using a skip-gram model and the keywords related to the first meaning item;
wherein:
the loss function of the skip-gram model is obtained according to the TF-IDF values of the keywords related to the first meaning item.
9. An apparatus for training a semantic embedded vector, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and wherein the one or more programs configured to be executed by one or more processors comprise instructions for:
acquiring a first search term, wherein the first search term comprises at least two semantic items, the at least two semantic items comprise a first semantic item, and the first semantic item is provided with an encyclopedia entry;
determining keywords related to the first semantic item according to the webpage related to the first semantic item and the encyclopedia entry of the first semantic item;
and training to obtain an embedded vector of the first search term corresponding to the first meaning term by using the keywords related to the first meaning term.
10. A computer-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the method of any one of claims 1 to 5.
CN202011465969.8A 2020-12-14 2020-12-14 Training method and device for embedding sense item into vector Pending CN112507726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011465969.8A CN112507726A (en) 2020-12-14 2020-12-14 Training method and device for embedding sense item into vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011465969.8A CN112507726A (en) 2020-12-14 2020-12-14 Training method and device for embedding sense item into vector

Publications (1)

Publication Number Publication Date
CN112507726A true CN112507726A (en) 2021-03-16

Family

ID=74972742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011465969.8A Pending CN112507726A (en) 2020-12-14 2020-12-14 Training method and device for embedding sense item into vector

Country Status (1)

Country Link
CN (1) CN112507726A (en)

Similar Documents

Publication Publication Date Title
CN107621886B (en) Input recommendation method and device and electronic equipment
CN107291772B (en) Search access method and device and electronic equipment
CN110019675B (en) Keyword extraction method and device
CN110391966B (en) Message processing method and device and message processing device
CN109918565B (en) Processing method and device for search data and electronic equipment
CN109471919B (en) Zero pronoun resolution method and device
CN108345625B (en) Information mining method and device for information mining
CN111382339A (en) Search processing method and device and search processing device
CN112784142A (en) Information recommendation method and device
CN111708943A (en) Search result display method and device and search result display device
CN110110207B (en) Information recommendation method and device and electronic equipment
CN110019885B (en) Expression data recommendation method and device
CN113033163A (en) Data processing method and device and electronic equipment
CN111241844A (en) Information recommendation method and device
CN111177521A (en) Method and device for determining query term classification model
CN107784037B (en) Information processing method and device, and device for information processing
CN109918624B (en) Method and device for calculating similarity of webpage texts
CN112307294A (en) Data processing method and device
CN110020082B (en) Searching method and device
CN109799916B (en) Candidate item association method and device
CN107301188B (en) Method for acquiring user interest and electronic equipment
CN112052395B (en) Data processing method and device
CN108108356B (en) Character translation method, device and equipment
CN108073664B (en) Information processing method, device, equipment and client equipment
CN112507726A (en) Training method and device for embedding sense item into vector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination