CN110852077A - Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary - Google Patents


Info

Publication number
CN110852077A
Authority
CN
China
Prior art keywords: vocabulary, word, index, added, deleted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911106491.7A
Other languages
Chinese (zh)
Other versions
CN110852077B (en)
Inventor
崔勇
杨雪松
杨光
Current Assignee
Taikang Asset Management Co Ltd
Taikang Insurance Group Co Ltd
Original Assignee
Taikang Asset Management Co Ltd
Taikang Insurance Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Taikang Asset Management Co Ltd and Taikang Insurance Group Co Ltd
Priority to CN201911106491.7A
Publication of CN110852077A
Application granted
Publication of CN110852077B
Active legal status
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures

Abstract

Embodiments of the invention provide a method, an apparatus, a medium, and an electronic device for dynamically adjusting a Word2Vec model dictionary. The method comprises: reading the vocabulary list, the word index, and the word vector matrix in a Word2Vec model dictionary; adding each word to be added to the vocabulary list, adding its word vector to the word vector matrix, and adding a corresponding index number to the word index; and deleting each word to be deleted from the vocabulary list, deleting its word vector from the word vector matrix, deleting its index number from the word index, and adjusting the word index so that the remaining index numbers are consecutive. The technical scheme of the embodiments addresses the low efficiency of existing processes for adjusting a Word2Vec model dictionary, and the difficulty of adding or deleting specific words in the dictionary in a targeted manner according to user needs.

Description

Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary
Technical Field
The invention relates to the field of natural language processing, and in particular to a method, a device, a medium, and an electronic device for dynamically adjusting a Word2Vec model dictionary.
Background
In the field of natural language processing, how to represent a word efficiently is the most fundamental problem. In recent years, the "distributed representation" of words, i.e., representing words as vectors in a low-dimensional space by training a model, has become the de facto standard in both academia and industry. Word2Vec, released in 2013, is the earliest, best-known, and currently most widely used model for the distributed representation of words. Word2Vec is a natural language processing model; by training it, the processing of text content can be reduced to vector operations in a K-dimensional vector space, and similarity in that vector space can be used to represent semantic similarity between texts. The word vectors output by a Word2Vec model can therefore be used in many natural language processing tasks, such as clustering, synonym finding, part-of-speech analysis, semantic analysis, sentiment analysis, topic modeling, and so on.
At present, as corpora grow, trained Word2Vec models become larger and larger, so loading the model occupies a large amount of memory when a service is provided, which affects the running speed and performance of the server. In practical application scenarios, there is therefore often a need to adjust the dictionary of a trained Word2Vec model. For example, when a user finds that words missing from the Word2Vec model dictionary should be supplemented to make semantic analysis more accurate, those words need to be added. As another example, when the loaded model occupies so much memory that the server responds slowly and service quality drops, some irrelevant words need to be deleted to simplify the Word2Vec model. As a further example, a user may find erroneous words in the current model dictionary that make semantic analysis inaccurate; these also need to be deleted.
However, to add new words to a Word2Vec model dictionary, the only existing option is to extend the corpus with material containing the new words and then retrain the model. To delete words, the only existing option is to set the min_count parameter: words whose frequency in the corpus is below the min_count value are discarded during training, so setting a larger min_count causes some irrelevant or erroneous words to be dropped. However, this cannot delete specific words according to the demand side's needs, and some useful words may be discarded along with them. In practice, adjusting min_count cannot satisfy a user who needs to add or delete particular words, especially when the set of words to be added or deleted is large. Moreover, after min_count is reset, the Word2Vec model still has to be retrained, just as when words are added.
In practical application scenarios, operators may report inaccurate semantic analysis encountered during operation, or the application scenario may change; a method that can quickly add and remove words in a Word2Vec model dictionary is therefore urgently needed. An operator here means a service provider, for example a search engine provider offering search services.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a medium, and an electronic device for dynamically adjusting a Word2Vec model dictionary, so as to overcome, at least to a certain extent, the low efficiency of current processes for adjusting a Word2Vec model dictionary, and the difficulty current methods have in purposefully adding or deleting specific words in the dictionary according to user needs.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to a first aspect of the embodiments of the present invention, there is provided a method for dynamically adjusting a Word2Vec model dictionary, comprising: acquiring a vocabulary to be added and a Word vector corresponding to the vocabulary to be added, which need to be added into a Word2Vec model dictionary, and acquiring a vocabulary to be deleted, which needs to be deleted from the Word2Vec model dictionary; reading a Word list, a Word index and a Word vector matrix in a Word2Vec model dictionary, wherein the Word index comprises a group of continuous index numbers, the index numbers in the Word index correspond to words in the Word list in a one-to-one manner, and the index numbers in the Word index are also used for indicating the positions of Word vectors of the corresponding words in the Word vector matrix; adding the vocabulary to be added into the vocabulary list, adding the word vector corresponding to the vocabulary to be added into the word vector matrix, and adding the index number corresponding to the vocabulary to be added into the word index; deleting the vocabulary to be deleted from the vocabulary list, deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix, deleting the index number corresponding to the vocabulary to be deleted from the word index, and adjusting the word index to ensure that the index numbers in the adjusted word index are continuous.
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises: sending the vocabulary to be added to a data processing party; and at least obtaining a word vector corresponding to the vocabulary to be added, which is obtained by the data processing party performing data processing on the vocabulary to be added, from the data processing party.
In some embodiments of the present invention, based on the foregoing scheme, adding the vocabulary to be added to the vocabulary list and adding the word vector corresponding to the vocabulary to be added to the word vector matrix includes: adding each vocabulary in the vocabularies to be added into the vocabulary list in sequence; and acquiring the last row in the word vector matrix and sequentially adding the word vectors corresponding to each vocabulary in the vocabularies to be added to the last row in the word vector matrix.
In some embodiments of the present invention, based on the foregoing scheme, adding an index number corresponding to a word to be added to the word index includes: and for each vocabulary added into the vocabulary list, acquiring the maximum index number in the word index, adding one to the maximum index number as the index number corresponding to the vocabulary, and adding the index number into the word index.
In some embodiments of the present invention, based on the foregoing solution, deleting the vocabulary to be deleted from the vocabulary list and deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix includes: acquiring an index number corresponding to a vocabulary to be deleted; deleting the vocabulary to be deleted from the vocabulary list; and deleting the word vectors at the positions indicated by the index numbers corresponding to the words to be deleted from the word vector matrix, and adjusting the positions of the rest word vectors in the word vector matrix to continuously arrange the word vectors in the adjusted word vector matrix.
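The deletion flow above can be sketched with plain Python structures as a stand-in for the actual dictionary (illustrative names, 0-based index numbers; deleting from the highest index number downward keeps the not-yet-deleted row coordinates valid):

```python
def delete_words(vocab, index, vectors, to_delete):
    """Delete words, their vector rows, and their index numbers, then
    renumber the word index so the numbers are consecutive again."""
    # Remove from the highest index number down, so earlier removals do
    # not shift the rows still waiting to be removed.
    for word in sorted(to_delete, key=lambda w: index[w], reverse=True):
        pos = index.pop(word)   # the index number is also the row coordinate
        del vectors[pos]        # remaining rows close up and stay contiguous
        vocab.remove(word)
    for i, word in enumerate(vocab):
        index[word] = i         # restore consecutive index numbers
    return vocab, index, vectors

vocab = ["thickening", "pure cotton", "new money", "home"]
index = {w: i for i, w in enumerate(vocab)}
vectors = [[0.1], [0.2], [0.3], [0.4]]
delete_words(vocab, index, vectors, ["thickening", "new money"])
print(vocab)    # ['pure cotton', 'home']
print(vectors)  # [[0.2], [0.4]]
```

After deletion, the remaining rows are contiguous and the renumbered index still maps each word to its row, which is exactly the invariant this embodiment requires.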
In some embodiments of the present invention, based on the foregoing scheme, the method further comprises: before the vocabulary to be added is added to the vocabulary list, shielding the vocabulary which is common with the vocabulary list in the vocabulary to be added; and before deleting the vocabulary to be deleted from the vocabulary list, shielding the vocabulary which is not shared with the vocabulary list in the vocabulary to be deleted.
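The masking step can be sketched as two filters (illustrative helper, not from the patent): words to be added that already appear in the vocabulary list are screened out, as are words to be deleted that do not appear in it.

```python
def mask_requests(vocab, to_add, to_delete):
    """Screen the requested changes against the current vocabulary list."""
    present = set(vocab)
    addable = [w for w in to_add if w not in present]    # skip duplicates
    deletable = [w for w in to_delete if w in present]   # skip absent words
    return addable, deletable

vocab = ["thickening", "pure cotton", "new money", "home"]
addable, deletable = mask_requests(
    vocab,
    to_add=["korean edition", "home"],        # "home" is already present
    to_delete=["new money", "package mail"],  # "package mail" is absent
)
print(addable)    # ['korean edition']
print(deletable)  # ['new money']
```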
In some embodiments of the invention, based on the foregoing, the method comprises: after the operation of adjusting the Word index is completed, the consistency between the data structures in the Word2Vec model dictionary is analyzed, and in the case that the data structures in the Word2Vec model dictionary are consistent, a request for saving the Word2Vec model dictionary is sent.
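The consistency analysis can be sketched as a few structural checks (a hypothetical helper; the patent does not spell out the exact checks): the three structures must have equal sizes, and the index numbers must be consecutive.

```python
def is_consistent(vocab, index, vectors):
    """Check that the vocabulary list, word index, and word vector matrix agree."""
    if not (len(vocab) == len(index) == len(vectors)):
        return False                       # one entry per word in each structure
    if sorted(index.values()) != list(range(len(vocab))):
        return False                       # index numbers must be consecutive
    return all(w in index for w in vocab)  # one-to-one with the vocabulary list

vocab = ["pure cotton", "home"]
index = {"pure cotton": 0, "home": 1}
vectors = [[0.2], [0.4]]
print(is_consistent(vocab, index, vectors))        # True
print(is_consistent(vocab, {"home": 1}, vectors))  # False: sizes disagree
```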
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for dynamically adjusting a Word2Vec model dictionary, comprising: the data transceiving unit is used for acquiring the vocabulary to be added and the corresponding Word vector which need to be added into the Word2Vec model dictionary and acquiring the vocabulary to be deleted which needs to be deleted from the Word2Vec model dictionary; the reading unit is used for reading a Word list, a Word index and a Word vector matrix in a Word2Vec model dictionary, wherein the Word index comprises a group of continuous index numbers, the index numbers in the Word index correspond to words in the Word list in a one-to-one mode, and the index numbers in the Word index are also used for indicating the positions of Word vectors of the corresponding words in the Word vector matrix; the adding process execution unit is used for adding the vocabulary to be added into the vocabulary list, adding the word vector corresponding to the vocabulary to be added into the word vector matrix, and adding the index number corresponding to the vocabulary to be added into the word index; and the deletion process execution unit is used for deleting the vocabulary to be deleted from the vocabulary list, deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix, deleting the index number corresponding to the vocabulary to be deleted from the word index, and adjusting the word index to ensure that the index numbers in the adjusted word index are continuous.
In some embodiments of the present invention, based on the foregoing solution, the apparatus for dynamically adjusting the Word2Vec model dictionary further comprises: a saving execution unit, configured to save the dynamically adjusted vocabulary list, Word index, and Word vector matrix into the Word2Vec model dictionary. The saving execution unit may be configured to perform this saving operation in response to a saving request.
In some embodiments of the present invention, based on the foregoing solution, the apparatus for dynamically adjusting Word2Vec model dictionary further comprises: and the shielding unit is used for shielding the vocabulary which is shared with the vocabulary list in the vocabulary to be added before the vocabulary to be added is added to the vocabulary list. The masking unit may also be configured to mask words of the words to be deleted that are not common to the list of words before deleting the words to be deleted from the list of words.
In some embodiments of the present invention, based on the foregoing solution, the apparatus for dynamically adjusting the Word2Vec model dictionary further comprises: a consistency analysis unit, configured to analyze the consistency among the data structures in the Word2Vec model dictionary after the operation of adjusting the Word index is completed, and to send out a saving request for saving the Word2Vec model dictionary when the data structures in the Word2Vec model dictionary are consistent.
In some embodiments of the present invention, based on the foregoing solution, the apparatus for dynamically adjusting Word2Vec model dictionary further comprises: the testing unit can be used for acquiring the Word2Vec model dictionary which is stored after dynamic adjustment, testing the running condition of the Word2Vec model dictionary, ending the testing process under the condition that the Word2Vec model dictionary can work normally, and generating error reporting information of the Word2Vec model dictionary under the condition that the Word2Vec model dictionary can not work normally.
According to a third aspect of embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method of dynamically adjusting a Word2Vec model dictionary as described in the first aspect of the embodiments above.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method of dynamically adjusting a Word2Vec model dictionary as described in the first aspect of the embodiments above.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
in the technical solution provided by some embodiments of the present invention, the vocabulary list, word index, and word vector matrix in a Word2Vec model dictionary are read, where the word index includes a set of consecutive index numbers, the index numbers correspond one-to-one to the words in the vocabulary list, and each index number also indicates the position of the corresponding word's vector in the word vector matrix. Words to be added are appended to the vocabulary list, their word vectors are appended to the word vector matrix, and corresponding index numbers are added to the word index. Words to be deleted are removed from the vocabulary list, their word vectors are removed from the word vector matrix, their index numbers are removed from the word index, and the word index is adjusted so that the remaining index numbers are consecutive. The dynamically adjusted vocabulary list, word index, and word vector matrix are then saved into the Word2Vec model dictionary. The method thus makes full use of the existing training results: the dictionary can be adjusted dynamically without retraining the model, which improves the efficiency of adjusting the Word2Vec model dictionary. At the same time, specific words can be added to or deleted from the Word2Vec model dictionary in a targeted manner according to user needs.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a method of dynamically adjusting a Word2Vec model dictionary in accordance with one embodiment of the present invention;
FIG. 2 schematically illustrates a block diagram of an apparatus for dynamically adjusting a Word2Vec model dictionary, in accordance with one embodiment of the present invention;
FIG. 3 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Before describing embodiments of the present invention in detail, some of the terms used in the present invention are explained as follows:
the Word2Vec model is a natural language processing model for vectorizing words; its name is short for "Word to Vector". Its characteristic is that all words can be vectorized, so that relationships between words can be measured quantitatively and mined. A trained Word2Vec model is stored in the form of a Word2Vec model dictionary.
Fig. 1 schematically shows a flowchart of a method for dynamically adjusting a Word2Vec model dictionary according to an embodiment of the present invention. The method may be executed by a server, a terminal device, a combination thereof, or the like. As shown in FIG. 1, the method according to one embodiment of the present invention includes steps S110 to S150, which are described in detail below:
in step S110, the vocabulary to be added and the corresponding Word vectors thereof that need to be added to the Word2Vec model dictionary are obtained, and the vocabulary to be deleted that needs to be deleted from the Word2Vec model dictionary is obtained.
In one embodiment of the invention, the words to be added or deleted may be any words. The words to be added may be, for example, "korean edition" and "package mail", and the words to be deleted may be, for example, "thickening" and "new money". The words need not be Chinese; they can be in a foreign language such as English, German, Korean, Japanese, or French, without any limitation. Chinese words are used here merely as examples for ease of understanding.
In one embodiment of the invention, it should be understood that in the steps of obtaining the words to be added (with their corresponding word vectors) and obtaining the words to be deleted, each of the two is optional rather than mandatory. For example, on a given occasion the electronic device may find that there are no words to be added (and hence no corresponding word vectors), while the result of obtaining the words to be deleted is "new money". That is, although the option is provided, the user has no words to add this time and only deletes the word "new money". In other words, obtaining the words (and corresponding word vectors) to be added to the Word2Vec model dictionary has two possible outcomes: either there are no words to be added, or there are. Likewise, obtaining the words to be deleted from the Word2Vec model dictionary has two possible outcomes: either there are no words to be deleted, or there are.
The technical scheme of the step can at least realize the following beneficial technical effects: the vocabulary to be added and the corresponding Word vector and the vocabulary to be deleted are obtained in advance, so that the vocabulary to be added can be conveniently and rapidly and directly added into the Word2Vec model dictionary or the vocabulary to be deleted can be conveniently deleted from the Word2Vec model dictionary.
In step S120, the vocabulary list, Word index and Word vector matrix in the Word2Vec model dictionary are read. Wherein the word index comprises a set of consecutive index numbers. The index numbers in the word index correspond to the words in the word list one by one. The index number in the word index may be used to indicate the position of the word vector of its corresponding vocabulary in the word vector matrix.
In one embodiment of the present invention, reading the vocabulary list, word index, and word vector matrix in the Word2Vec model dictionary may mean loading the locally stored Word2Vec model dictionary into memory. Specifically, this may be done by calling the load function in the Gensim package to load the local Word2Vec model dictionary into memory and arrange the vocabulary list, word index, and word vector matrix into the Word2Vec model. The role of the word index is to relate words in the vocabulary list to word vectors in the word vector matrix. The word index, as its name suggests, comprises a set of consecutive index numbers (e.g., integers starting with 1), each index number corresponding to a word in the vocabulary list and indicating the row coordinate of that word's vector in the word vector matrix. A specific example is described below.
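In Gensim 4.x, these three structures correspond roughly to `model.wv.index_to_key` (the vocabulary list), `model.wv.key_to_index` (the word index), and `model.wv.vectors` (the word vector matrix), though attribute names vary between Gensim versions. The sketch below mimics them with plain Python objects and 0-based index numbers:

```python
# Plain-Python stand-ins for the three structures in the dictionary.
vocab = ["thickening", "pure cotton", "new money", "home"]  # vocabulary list
index = {w: i for i, w in enumerate(vocab)}                 # word index
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # word vector matrix, k = 2

# The index number of a word gives the row of its vector:
vec = vectors[index["new money"]]
print(vec)  # [0.5, 0.6]
```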
The data structure example in table 1 schematically shows a specific form of the word index, vocabulary list, and word vector matrix, with words selected from the common vocabulary of shopping websites. For simplicity, only the last four entries of the word index, vocabulary list, and word vector matrix are shown; as described in the background, the word vectors in the word vector matrix are k-dimensional (k being a positive integer).
TABLE 1
Index | Word        | Word vector
…     | thickening  | (X11, X12, X13, X14, …, X1k)
…     | pure cotton | (X21, X22, X23, X24, …, X2k)
…     | new money   | (X31, X32, X33, X34, …, X3k)
…     | home        | (X41, X42, X43, X44, …, X4k)
In table 1, the word vector corresponding to the word "thickening" is (X11, X12, X13, X14, …, X1k); the word vector corresponding to the word "pure cotton" is (X21, X22, X23, X24, …, X2k); the word vector corresponding to the word "new money" is (X31, X32, X33, X34, …, X3k); the word vector corresponding to the word "home" is (X41, X42, X43, X44, …, X4k). Generally, the recommended value of k is 50 to 300; typically, k is set to 50 or 100. It should be understood that X11, X12, X13, X14, …, X1k, etc. are merely simplified symbols for the elements of the respective dimensions; the elements of each dimension in a word vector are actually real or floating-point numbers. For example, X11, X12, X13, X14, and X1k may be 0.1265, 0.009635, 0.07018, -0.009701, and -0.2122, respectively.
In this example, assuming that the words to be added are "korean edition" and "package mail" and the words to be deleted are "thickening" and "new money", an operation interface or a visual interface may be provided on the electronic device so as to obtain the words to be added and the words to be deleted as input by the user. The operation interface may load words to be added and words to be deleted recorded in text or table form; for example, the text may be in a format with a suffix such as .txt, .xlsx, .doc, .docx, or .wps, which is not limited here. The user may record the words to be added, i.e. "korean edition" and "package mail", and the words to be deleted, i.e. "thickening" and "new money", in the text; the text is then read through the operation interface to identify the words to be added and deleted. Alternatively, the electronic device may provide a visual interface through which the user inputs the words to be added, "korean edition" and "package mail", and the words to be deleted, "thickening" and "new money".
In an embodiment of the present invention, the word vectors corresponding to the words to be added may be generated in advance by a data processing party and obtained directly from it. Accordingly, in one embodiment, after the words to be added to the Word2Vec model dictionary are obtained, they are sent to the data processing party; the word vectors the data processing party produces for them are then obtained from the data processing party. The data processing party may be, for example, a word vector database trained in advance on a corpus, in which a large number of words and their corresponding word vectors are recorded; alternatively, the data processing party may be a component of the electronic device; or it may be a third-party word vector database. When the data processing party is a third-party word vector database, the electronic device sends the words to be added to the data processing party, which processes them to obtain the corresponding word vectors; this processing may include searching for each word to be added among the words stored in the word vector database and retrieving its word vector. The electronic device then obtains from the data processing party the word vectors thus produced for the words to be added.
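The data processing performed by a word-vector-database data processing party can be sketched as a simple lookup (illustrative only; a real service might instead train vectors for words the database lacks):

```python
def fetch_vectors(word_db, to_add):
    """Look up each word to be added in a pre-trained word-vector database."""
    found = {w: word_db[w] for w in to_add if w in word_db}
    missing = [w for w in to_add if w not in word_db]  # not in the database
    return found, missing

word_db = {"korean edition": [0.5, 0.1], "package mail": [0.6, 0.2]}
found, missing = fetch_vectors(word_db, ["korean edition", "package mail", "rare word"])
print(sorted(found))  # ['korean edition', 'package mail']
print(missing)        # ['rare word']
```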
Based on the foregoing example, the electronic device can send the words to be added to the data processing party. The data processing party may process the words to be added to obtain their corresponding word vectors, which may be as shown in table 2:
TABLE 2
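The lookup described above can be illustrated with a minimal sketch. The function and data names below (`lookup_vectors`, `pretrained`) are illustrative, not taken from the patent; the pretrained store is modeled as a plain dictionary standing in for the word vector database.

```python
def lookup_vectors(pretrained, words_to_add):
    """Search the pretrained word-vector store for each word to be added
    and return {word: vector} for every word that is found."""
    found = {}
    for word in words_to_add:
        if word in pretrained:          # search the word vector database
            found[word] = pretrained[word]
    return found

# Toy pretrained store; a real one would hold corpus-trained vectors.
pretrained = {"korean version": [0.1, 0.2], "package mail": [0.3, 0.4]}
vectors = lookup_vectors(pretrained, ["korean version", "package mail", "lovely"])
```

Words absent from the store (here, "lovely") are simply not returned, mirroring the search step performed by the data processing party.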
In step S130, the vocabulary to be added is added to the vocabulary list and the word vector corresponding to the vocabulary to be added is added to the word vector matrix, and the index number corresponding to the vocabulary to be added is added to the word index.
In one embodiment of the present invention, step S130 includes the following sub-steps:
Step S1301, adding each word in the vocabulary to be added to the vocabulary list in sequence, for example, appending them in sequence to the tail of the vocabulary list; and obtaining the last row of the word vector matrix and sequentially appending the word vector corresponding to each word in the vocabulary to be added after that last row. This technical scheme can at least achieve the following beneficial technical effects: the operation of adding the vocabulary to be added and its corresponding word vectors is realized simply and quickly, so that one or more specific words and their word vectors can be added according to the user's needs.
Step S1302, for each word added to the vocabulary list, obtaining the maximum index number in the word index, adding one to it to produce the index number corresponding to that word, and adding that index number to the word index. In this way, at least the following beneficial technical effects can be achieved: an index number is created directly for each word added to the vocabulary list, so that the word index conforms to the data format requirements of the Word2Vec model dictionary; after saving to the Word2Vec model dictionary, the added words and their corresponding Word vectors can be queried through the Word index without retraining the Word2Vec model.
Based on the foregoing example, the electronic device may add the words to be added, "korean version" and "package mail", in sequence to the end of the current vocabulary list, i.e., after the word "home". The electronic device may append the word vectors corresponding to "korean version" and "package mail" in sequence after the current last row of the word vector matrix, that is, after the row containing the vectors X41, X42, X43, X44 … X4k; the result is shown, for example, in table 3:
TABLE 3
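Steps S1301 and S1302 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the word index is modeled as a word-to-number dictionary, and the index numbers here equal matrix row positions starting from 0 (the patent's examples use numbers such as 1056, but the max-plus-one rule is the same).

```python
def add_words(vocab, matrix, word_index, new_words, new_vectors):
    """S1301: append each word to the tail of the vocabulary list and its
    vector as a new last row of the matrix. S1302: assign each word the
    current maximum index number plus one."""
    for word, vec in zip(new_words, new_vectors):
        vocab.append(word)                    # tail of the vocabulary list
        matrix.append(vec)                    # new last row of the matrix
        next_index = max(word_index.values()) + 1 if word_index else 0
        word_index[word] = next_index         # max index number plus one
    return vocab, matrix, word_index

vocab = ["cotton", "home"]
matrix = [[0.1, 0.2], [0.3, 0.4]]
word_index = {"cotton": 0, "home": 1}
add_words(vocab, matrix, word_index,
          ["korean version", "package mail"],
          [[0.5, 0.6], [0.7, 0.8]])
```

After the call, both new words sit at the tail of the list, their vectors occupy the last two rows, and their index numbers continue the existing sequence.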
In step S140, the vocabulary to be deleted is deleted from the vocabulary list, the word vector corresponding to the vocabulary to be deleted is deleted from the word vector matrix, the index number corresponding to the vocabulary to be deleted is deleted from the word index, and the word index is adjusted such that the index numbers in the adjusted word index are continuous.
In one embodiment of the present invention, step S140 includes the following sub-steps:
and S1401, acquiring an index number corresponding to the vocabulary to be deleted.
Step S1402, deleting the vocabulary to be deleted from the vocabulary list.
Step S1403, deleting the word vectors at the positions indicated by the index numbers corresponding to the words to be deleted from the word vector matrix, and adjusting the positions of the remaining word vectors so that the word vectors in the adjusted matrix are arranged continuously (i.e., there are no blank rows). Specifically, according to the index number corresponding to the word to be deleted, the corresponding row number in the word vector matrix is found and the word vector in that row is deleted; if the deleted word vector is not the last one in the matrix, the word vectors in the rows below it are moved up into the vacated row, and so on, so that the word vectors in the matrix remain continuously arranged. This step can achieve the following beneficial technical effects: after the index number corresponding to the vocabulary to be deleted is obtained, the vocabulary is deleted from the vocabulary list and the word vector at the position indicated by its index number is deleted from the word vector matrix, so that the invention can delete specific words according to the user's needs.
Step S1404, starting from the first index number in the word index, comparing the value of each index number with that of the next one in sequence; where the value of an index number and that of the next index number are not consecutive, the value of the next index number is adjusted to the value of the previous index number plus one. This achieves the following beneficial technical effects: after words and their corresponding vectors are deleted, the word index is adjusted so that its index numbers are consecutive and correctly indicate positions in the adjusted word vector matrix; the data format requirements of the Word2Vec model dictionary are met, and the Word2Vec model does not need to be retrained after the Word2Vec model dictionary is saved. For example, the original index numbers 1055 and 1057 are not consecutive because the word corresponding to the original index number 1056 was deleted. The electronic device can adjust the index number 1057 for "cotton" to 1056, and the index number 1059 corresponding to "home" to 1057. Correspondingly, index number 1060 corresponding to "korean version" is adjusted to 1058, and index number 1061 corresponding to "package mail" is adjusted to 1059.
Based on the foregoing example, the electronic device may first obtain the index numbers corresponding to the words to be deleted, "thickened" and "new style", that is, index numbers 1056 and 1058; delete the words "thickened" and "new style" from the vocabulary list; delete the word vectors at the positions indicated by index numbers 1056 and 1058 from the word vector matrix; and adjust the word index so that its index numbers are consecutive. After adjustment, the word "korean version" has the index number 1058, which still indicates the position of its corresponding word vector (X51, X52, X53, X54 … X5k) in the word vector matrix, as shown in table 4:
TABLE 4
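Steps S1401 through S1404 can be sketched together. As in the earlier sketch, the word index is modeled as a word-to-number dictionary whose index numbers equal matrix row positions; the function name `delete_words` is illustrative, not from the patent.

```python
def delete_words(vocab, matrix, word_index, words_to_delete):
    """S1401: look up the index number. S1402: remove the word from the
    list. S1403: delete its matrix row (later rows move up one). S1404:
    close the gap so the remaining index numbers stay consecutive."""
    for word in words_to_delete:
        row = word_index[word]        # S1401: index number of the word
        vocab.remove(word)            # S1402: drop the word itself
        del matrix[row]               # S1403: rows below shift up by one
        del word_index[word]
        for w in word_index:          # S1404: previous index number plus one
            if word_index[w] > row:
                word_index[w] -= 1
    return vocab, matrix, word_index

vocab = ["thickened", "cotton", "new style", "home"]
matrix = [[1.0], [2.0], [3.0], [4.0]]
word_index = {"thickened": 0, "cotton": 1, "new style": 2, "home": 3}
delete_words(vocab, matrix, word_index, ["thickened", "new style"])
```

After the call, the surviving index numbers are consecutive again and each one still points at the correct (shifted) row of the matrix.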
In one embodiment of the invention, owing to user inattention, some of the acquired vocabulary to be added may already be present in the Word2Vec model dictionary before the addition; alternatively, some of the acquired vocabulary to be deleted may not exist in the current Word2Vec model dictionary. Thus, the method may further comprise: before the vocabulary to be added is added to the vocabulary list, masking those words in the vocabulary to be added that it has in common with the vocabulary list; and/or, before the vocabulary to be deleted is deleted from the vocabulary list, masking those words in the vocabulary to be deleted that are not present in the vocabulary list. Based on the foregoing example, suppose the user wants to add the word "pure cotton", but "pure cotton" already exists in the Word2Vec model dictionary; "pure cotton" is then a word the vocabulary to be added has in common with the vocabulary list and is masked, so the operation of adding it is not performed in the subsequent addition process. As another example, suppose the user wants to delete the word "lovely", but in the original Word2Vec model dictionary none of the words corresponding to index numbers 1 to 1059 is "lovely"; "lovely" is then a word in the vocabulary to be deleted that is not present in the vocabulary list, so the operation of deleting it is not performed in the subsequent deletion process.
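The two masking rules can be sketched in a few lines; the function name `mask_requests` is illustrative, not from the patent.

```python
def mask_requests(vocab, to_add, to_delete):
    """Mask add-requests already present in the vocabulary list and
    delete-requests not present in it, so later steps skip them."""
    vocab_set = set(vocab)
    masked_add = [w for w in to_add if w not in vocab_set]
    masked_delete = [w for w in to_delete if w in vocab_set]
    return masked_add, masked_delete

vocab = ["pure cotton", "thickened", "home"]
to_add, to_delete = mask_requests(vocab,
                                  ["pure cotton", "korean version"],
                                  ["lovely", "thickened"])
```

Here "pure cotton" is masked from the additions because it is already in the list, and "lovely" is masked from the deletions because it is not.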
It should be noted that steps S130 and S140 need not be executed in the order shown in fig. 1; their execution order may be interchanged. In other words, the method of the present invention may first execute step S130 to add the vocabulary to be added and its corresponding word vectors, and then execute step S140 to delete the vocabulary to be deleted and its corresponding word vectors. Alternatively, the method may first execute step S140 to delete the vocabulary to be deleted and its corresponding word vectors, and then execute step S130 to add the vocabulary to be added and its corresponding word vectors.
In step S150, the dynamically adjusted vocabulary list, Word index, and Word vector matrix are saved to the Word2Vec model dictionary.
In an embodiment of the present invention, saving the dynamically adjusted vocabulary list, Word index and Word vector matrix to the Word2Vec model dictionary refers to saving them so as to form a new Word2Vec model dictionary, which may be a process of writing the dynamically adjusted vocabulary list, Word index and Word vector matrix from memory to local storage. For example, the vocabulary list, Word index and Word vector matrix dynamically adjusted in the memory of the electronic device are stored in the local Word2Vec model dictionary. Typically, the file format in which the Word2Vec model dictionary is stored locally may be the .bin or .model format.
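The memory-to-local save step can be sketched with a simple round-trip. This is a stand-in, not the patent's serializer: real .bin/.model files are produced by word-vector tooling's own formats, whereas this sketch just pickles the three structures into one local file.

```python
import os
import pickle
import tempfile

def save_dictionary(path, vocab, word_index, matrix):
    """Write the dynamically adjusted in-memory structures to one local file."""
    with open(path, "wb") as f:
        pickle.dump({"vocab": vocab, "index": word_index, "matrix": matrix}, f)

def load_dictionary(path):
    """Read the saved dictionary back from local storage."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "dictionary.model")
save_dictionary(path, ["cotton"], {"cotton": 0}, [[0.1, 0.2]])
restored = load_dictionary(path)
```

Reloading immediately after saving confirms the three structures survive the trip to disk unchanged.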
In the above embodiment, the word index includes index numbers; in yet another embodiment, the word index may include both index numbers and words. The index numbers in the word index correspond one-to-one with its words, and the words in the word index correspond one-to-one with the words in the vocabulary list. See, for example, table 5:
TABLE 5
In the embodiment where the word index includes both index numbers and words, when a word to be deleted is deleted from the vocabulary list, the corresponding word in the word index is also deleted. For example, if the word "thickening" in the vocabulary list in table 5 is deleted, the word "thickening" in the word index is deleted accordingly. When a word to be added is added to the vocabulary list, the corresponding word is likewise added to the word index. For example, if the word "korean version" is added to table 5, it is also added to the word index. The result after deleting the word "thickening" from table 5 and adding the word "korean version" to table 5 can be seen in table 6:
TABLE 6
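The variant word index that stores (index number, word) pairs can be sketched as follows; the pair values and helper names are illustrative. Renumbering after a deletion would proceed as in step S1404 and is omitted here to keep the sync behavior in focus.

```python
# Word index as (index number, word) pairs, kept in sync with the list.
word_index = [(1056, "thickening"), (1057, "cotton")]

def delete_entry(word_index, word):
    """Remove the pair for a word deleted from the vocabulary list."""
    return [(i, w) for (i, w) in word_index if w != word]

def add_entry(word_index, word):
    """Append a pair using the current maximum index number plus one."""
    next_i = max(i for i, _ in word_index) + 1 if word_index else 0
    return word_index + [(next_i, word)]

word_index = delete_entry(word_index, "thickening")
word_index = add_entry(word_index, "korean version")
```

Each list operation is mirrored by exactly one index operation, preserving the one-to-one correspondence the embodiment requires.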
Since errors may occur in the process of sorting the Word index, resulting in inconsistency between the data structures of the Word2Vec model, in a preferred embodiment the method of dynamically adjusting the Word2Vec model dictionary may further include: after the operation of adjusting the Word index is completed, analyzing the consistency between the data structures in the Word2Vec model dictionary; and, where the data structures in the Word2Vec model dictionary are consistent, issuing a save request to save the Word2Vec model dictionary. For example, whether the data structures in the Word vector matrix are consistent is analyzed: the Word vectors in the Word2Vec model dictionary currently being adjusted may be 50-dimensional, but, due to a fault or a data transmission error, a Word vector acquired from the data processing party may be 100-dimensional, so that the data structures in the Word2Vec model dictionary are inconsistent and the dictionary may be erroneous. Thus, the method for dynamically adjusting the Word2Vec model dictionary may further include: if the data structures in the Word2Vec model dictionary are inconsistent, generating error information for the Word2Vec model dictionary.
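One plausible form of the consistency analysis is sketched below; the specific checks (equal sizes, uniform vector dimensionality, consecutive index numbers) are illustrative assumptions about what "consistency between data structures" covers.

```python
def check_consistency(vocab, word_index, matrix):
    """Return a list of problems found; an empty list means the data
    structures agree and a save request may be issued."""
    problems = []
    if not (len(vocab) == len(word_index) == len(matrix)):
        problems.append("vocabulary, index and matrix sizes differ")
    dims = {len(row) for row in matrix}
    if len(dims) > 1:   # e.g. a 100-dim vector mixed into a 50-dim matrix
        problems.append("word vectors have mixed dimensionality")
    nums = sorted(word_index.values())
    if nums and nums != list(range(nums[0], nums[0] + len(nums))):
        problems.append("index numbers are not consecutive")
    return problems

ok = check_consistency(["a", "b"], {"a": 0, "b": 1}, [[0.1, 0.2], [0.3, 0.4]])
bad = check_consistency(["a", "b"], {"a": 0, "b": 1}, [[0.1, 0.2], [0.3]])
```

An empty result corresponds to issuing the save request; a non-empty result corresponds to generating error information instead.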
In a preferred embodiment of the present invention, after the Word2Vec model dictionary is saved locally, the saved Word2Vec model dictionary can be tested to ensure that the new model dictionary works normally. Thus, the method for dynamically adjusting the Word2Vec model dictionary may further include: acquiring the Word2Vec model dictionary saved after dynamic adjustment and testing its running condition; ending the testing process where the Word2Vec model dictionary works normally, and generating error information for the Word2Vec model dictionary where it does not.
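A minimal smoke test of the saved dictionary might look like the sketch below, where "working normally" is taken, as an illustrative assumption, to mean that a probe word resolves to a vector through the word index.

```python
def smoke_test(word_index, matrix, probe_word):
    """Try to resolve a probe word to its vector; return (ok, message)
    reporting whether the dictionary appears to work normally."""
    try:
        row = word_index[probe_word]          # index number -> matrix row
        vector = matrix[row]
        return True, "ok: %r resolves to a %d-dim vector" % (probe_word, len(vector))
    except (KeyError, IndexError) as exc:     # lookup failed: report an error
        return False, "dictionary error: %r" % exc

word_index = {"cotton": 0}
matrix = [[0.1, 0.2]]
ok, _ = smoke_test(word_index, matrix, "cotton")
fail, message = smoke_test(word_index, matrix, "lovely")
```

A successful probe ends the testing process; a failed one produces the error information described above.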
Embodiments of the apparatus of the present invention are described below with reference to the accompanying drawings.
FIG. 2 schematically shows a block diagram of an apparatus for dynamically adjusting a Word2Vec model dictionary, according to one embodiment of the present invention.
Referring to fig. 2, an apparatus 200 for dynamically adjusting a Word2Vec model dictionary according to an embodiment of the present invention includes one or more of the following units: a data transceiving unit 201, a reading unit 202, an adding process execution unit 203, a deleting process execution unit 204, a save execution unit 205, a masking unit 206, a consistency analysis unit 207, and a testing unit 208.
The data transceiving unit 201 may be configured to obtain the vocabulary to be added and the corresponding Word vector that need to be added to the Word2Vec model dictionary. The data transceiving unit 201 can also be used to obtain the vocabulary to be deleted that needs to be deleted from the Word2Vec model dictionary. The data transceiving unit 201 may also be configured to send the vocabulary to be added to the data processing party; and to obtain at least the word vector corresponding to the vocabulary to be added, produced by the data processing party's data processing of that vocabulary, from the data processing party.
The reading unit 202 may be configured to read a vocabulary list, a Word index, and a Word vector matrix in the Word2Vec model dictionary. Wherein the word index may comprise a set of consecutive index numbers. The index numbers in the word index may correspond one-to-one with the words in the list of words. The index number in the word index may be used to indicate the position of the word vector of its corresponding vocabulary in the word vector matrix.
The adding process executing unit 203 may be configured to add a vocabulary to be added to the vocabulary list, add a word vector corresponding to the vocabulary to be added to the word vector matrix, and add an index number corresponding to the vocabulary to be added to the word index. Wherein, adding the vocabulary to be added into the vocabulary list and adding the word vector corresponding to the vocabulary to be added into the word vector matrix comprises: adding each vocabulary in the vocabularies to be added into a vocabulary list in sequence; and acquiring the last row in the word vector matrix and sequentially adding the word vectors corresponding to each vocabulary in the vocabulary to be added to the last row in the word vector matrix. Adding an index number corresponding to a vocabulary to be added in the word index comprises: and for each vocabulary added into the vocabulary list, acquiring the maximum index number in the word index, adding one to the maximum index number as the index number corresponding to the vocabulary, and adding the index number into the word index.
The deletion process executing unit 204 may be configured to delete the vocabulary to be deleted from the vocabulary list, delete the word vector corresponding to the vocabulary to be deleted from the word vector matrix, delete the index number corresponding to the vocabulary to be deleted from the word index, and adjust the word index so that the index numbers in the adjusted word index are consecutive. Deleting the vocabulary to be deleted from the vocabulary list and deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix, wherein the deleting comprises the following steps: acquiring an index number corresponding to a vocabulary to be deleted; deleting the vocabulary to be deleted from the vocabulary list; and deleting the word vectors at the positions indicated by the index numbers corresponding to the words to be deleted from the word vector matrix, and adjusting the positions of the rest word vectors in the word vector matrix to continuously arrange the word vectors in the adjusted word vector matrix.
The save execution unit 205 may be configured to save the dynamically adjusted vocabulary list, Word index, and Word vector matrix to the Word2Vec model dictionary. The save execution unit 205 may be configured to perform an operation of saving the dynamically adjusted vocabulary list, Word index, and Word vector matrix to the Word2Vec model dictionary in response to the save request.
The masking unit 206 may be configured to mask the vocabulary common to the vocabulary list in the vocabulary to be added before adding the vocabulary to be added to the vocabulary list. The masking unit 206 may also be configured to mask words in the to-be-deleted words that are not common to the list of words before deleting the to-be-deleted words from the list of words.
The consistency analysis unit 207 may be configured to, after the operation of adjusting the Word index is completed, analyze consistency between data structures in the Word2Vec model dictionary, and issue a save request for saving the Word2Vec model dictionary if the data structures in the Word2Vec model dictionary are consistent. For example, a save request to save the Word2Vec model dictionary may be sent to the save execution unit 205.
The testing unit 208 may be configured to obtain the Word2Vec model dictionary saved after the dynamic adjustment, test the running condition of the Word2Vec model dictionary, end the testing process when the Word2Vec model dictionary can normally work, and generate the error reporting information of the Word2Vec model dictionary when the Word2Vec model dictionary cannot normally work.
Since each functional module of the apparatus for dynamically adjusting a Word2Vec model dictionary according to the embodiment of the present invention corresponds to a step of the above-described method embodiment, for details not disclosed in the apparatus embodiment of the present invention, please refer to the embodiment of the method for dynamically adjusting a Word2Vec model dictionary described above.
Referring now to FIG. 3, a block diagram of a computer system 300 suitable for use with the electronic device implementing an embodiment of the invention is shown. The computer system 300 of the electronic device shown in fig. 3 is only an example, and should not bring any limitation to the function and the scope of the use of the embodiments of the present invention.
As shown in fig. 3, the computer system 300 includes a Central Processing Unit (CPU)301 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for system operation are also stored. The CPU301, ROM 302, and RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output section 307 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary, so that a computer program read out therefrom is mounted into the storage section 308 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309, and/or installed from the removable medium 311. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 301.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs, and when the one or more programs are executed by the electronic equipment, the electronic equipment is enabled to realize the method for dynamically adjusting the Word2Vec model dictionary as in the above embodiments.
For example, the electronic device may implement one or more of the following steps as shown in fig. 1: step S110, acquiring the vocabulary to be added and the corresponding Word vector which need to be added into the Word2Vec model dictionary, and acquiring the vocabulary to be deleted which needs to be deleted from the Word2Vec model dictionary; step S120, reading a vocabulary list, a Word index and a Word vector matrix in the Word2Vec model dictionary; step S130, adding the vocabulary to be added into the vocabulary list, adding the word vector corresponding to the vocabulary to be added into the word vector matrix, and adding the index number corresponding to the vocabulary to be added into the word index; step S140, deleting the vocabulary to be deleted from the vocabulary list, deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix, deleting the index number corresponding to the vocabulary to be deleted from the word index, and adjusting the word index to ensure that the index numbers in the adjusted word index are continuous; step S150, storing the vocabulary list, the Word index and the Word vector matrix after dynamic adjustment into a Word2Vec model dictionary.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for dynamically adjusting a Word2Vec model dictionary is characterized by comprising the following steps:
acquiring a vocabulary to be added and a Word vector corresponding to the vocabulary to be added, which need to be added into a Word2Vec model dictionary, and acquiring a vocabulary to be deleted, which needs to be deleted from the Word2Vec model dictionary;
reading a vocabulary list, a Word index and a Word vector matrix in a Word2Vec model dictionary; the word index comprises a group of continuous index numbers, the index numbers in the word index correspond to the words in the word list one by one, and the index numbers in the word index are also used for indicating the positions of the word vectors of the corresponding words in the word vector matrix;
adding the vocabulary to be added into the vocabulary list, adding the word vector corresponding to the vocabulary to be added into the word vector matrix, and adding the index number corresponding to the vocabulary to be added into the word index;
deleting the vocabulary to be deleted from the vocabulary list, deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix, deleting the index number corresponding to the vocabulary to be deleted from the word index, and adjusting the word index to ensure that the index numbers in the adjusted word index are continuous.
2. The method of dynamically adjusting a Word2Vec model dictionary in accordance with claim 1, further comprising:
sending the vocabulary to be added to a data processing party;
and at least obtaining a word vector corresponding to the vocabulary to be added, which is obtained by the data processing party performing data processing on the vocabulary to be added, from the data processing party.
3. The method of claim 1 or 2 for dynamically adjusting a Word2Vec model dictionary, wherein adding a vocabulary to be added to the vocabulary list and adding a Word vector corresponding to the vocabulary to be added to the Word vector matrix comprises:
adding each vocabulary in the vocabularies to be added into the vocabulary list in sequence;
and acquiring the last row in the word vector matrix and sequentially adding the word vectors corresponding to each vocabulary in the vocabularies to be added to the last row in the word vector matrix.
4. The method of dynamically adjusting a Word2Vec model dictionary as claimed in claim 3, wherein adding an index number corresponding to a vocabulary to be added to the Word index comprises:
and for each vocabulary added into the vocabulary list, acquiring the maximum index number in the word index, adding one to the maximum index number as the index number corresponding to the vocabulary, and adding the index number into the word index.
5. The method for dynamically adjusting a Word2Vec model dictionary according to claim 1 or 2, wherein deleting the vocabulary to be deleted from the vocabulary list and deleting the word vector corresponding to the vocabulary to be deleted from the word vector matrix comprises:
acquiring the index number corresponding to the vocabulary to be deleted;
deleting the vocabulary to be deleted from the vocabulary list; and
deleting, from the word vector matrix, the word vector at the position indicated by the index number corresponding to the vocabulary to be deleted, and adjusting the positions of the remaining word vectors in the word vector matrix so that the word vectors in the adjusted word vector matrix are arranged contiguously.
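The delete path of claim 5, together with the contiguous renumbering required by claim 1, might look like the following sketch (names hypothetical):

```python
import numpy as np

def delete_word(vocab, index, vectors, word):
    """Hedged sketch of claim 5 plus the index adjustment of claim 1."""
    row = index.pop(word)                      # claim 5: look up the index number
    vocab.remove(word)                         # claim 5: drop from the vocabulary list
    vectors = np.delete(vectors, row, axis=0)  # claim 5: drop that matrix row
    # claims 1 and 5: shift every index number behind the gap down by one
    # so index numbers (and matrix rows) are contiguous again.
    for w, i in index.items():
        if i > row:
            index[w] = i - 1
    return vocab, index, vectors

vocab = ["cat", "dog", "fox"]
index = {"cat": 0, "dog": 1, "fox": 2}
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
vocab, index, vectors = delete_word(vocab, index, vectors, "dog")
```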
6. The method for dynamically adjusting a Word2Vec model dictionary according to claim 1 or 2, wherein the method further comprises:
before the vocabulary to be added is added to the vocabulary list, masking out any vocabulary in the vocabulary to be added that is already present in the vocabulary list; and
before the vocabulary to be deleted is deleted from the vocabulary list, masking out any vocabulary in the vocabulary to be deleted that is not present in the vocabulary list.
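The two masking steps of claim 6 amount to simple set filters: additions already in the vocabulary list are screened out, and deletions not in the vocabulary list are screened out. A hedged sketch with hypothetical helper names:

```python
def screen_additions(to_add, vocab):
    """Claim 6, first branch: keep only words NOT already in the
    vocabulary list before attempting to add them."""
    existing = set(vocab)
    return [w for w in to_add if w not in existing]

def screen_deletions(to_delete, vocab):
    """Claim 6, second branch: keep only words that ARE in the
    vocabulary list before attempting to delete them."""
    existing = set(vocab)
    return [w for w in to_delete if w in existing]
```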
7. The method for dynamically adjusting a Word2Vec model dictionary according to claim 1 or 2, wherein the method comprises:
after the operation of adjusting the word index is completed, analyzing the consistency among the data structures in the Word2Vec model dictionary, and, in the case that the data structures in the Word2Vec model dictionary are consistent, sending a request to save the Word2Vec model dictionary.
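Claim 7 does not spell out the consistency criteria; one plausible interpretation checks that the three structures agree in size, that the word index covers exactly the vocabulary list, and that the index numbers run continuously from zero (function and parameter names are hypothetical):

```python
def structures_consistent(vocab, index, vector_row_count):
    """Hedged interpretation of the consistency analysis in claim 7:
    the dictionary may be saved only if the vocabulary list, word index
    and word vector matrix still describe the same set of words."""
    n = len(vocab)
    return (
        len(index) == n                            # one index number per word
        and vector_row_count == n                  # one matrix row per word
        and set(index) == set(vocab)               # index covers the vocabulary
        and sorted(index.values()) == list(range(n))  # continuous from 0
    )
```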
8. An apparatus for dynamically adjusting a Word2Vec model dictionary, comprising:
a data transceiving unit, configured to acquire the vocabulary to be added and the corresponding word vector that need to be added to the Word2Vec model dictionary, and to acquire the vocabulary to be deleted that needs to be deleted from the Word2Vec model dictionary;
a reading unit, configured to read a vocabulary list, a word index and a word vector matrix in the Word2Vec model dictionary, wherein the word index comprises a group of continuous index numbers, the index numbers in the word index correspond one to one to the vocabularies in the vocabulary list, and the index numbers in the word index further indicate the positions of the word vectors of the corresponding vocabularies in the word vector matrix;
an adding process execution unit, configured to add the vocabulary to be added to the vocabulary list, add the word vector corresponding to the vocabulary to be added to the word vector matrix, and add the index number corresponding to the vocabulary to be added to the word index; and
a deletion process execution unit, configured to delete the vocabulary to be deleted from the vocabulary list, delete the word vector corresponding to the vocabulary to be deleted from the word vector matrix, delete the index number corresponding to the vocabulary to be deleted from the word index, and adjust the word index so that the index numbers in the adjusted word index remain continuous.
9. A computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the method for dynamically adjusting a Word2Vec model dictionary according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the method for dynamically adjusting a Word2Vec model dictionary according to any one of claims 1 to 7.
CN201911106491.7A 2019-11-13 2019-11-13 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary Active CN110852077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911106491.7A CN110852077B (en) 2019-11-13 2019-11-13 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary

Publications (2)

Publication Number Publication Date
CN110852077A true CN110852077A (en) 2020-02-28
CN110852077B CN110852077B (en) 2023-03-31

Family

ID=69600802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911106491.7A Active CN110852077B (en) 2019-11-13 2019-11-13 Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary

Country Status (1)

Country Link
CN (1) CN110852077B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416213A (en) * 2022-03-29 2022-04-29 北京沃丰时代数据科技有限公司 Word vector file loading method and device and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
US20160179950A1 (en) * 2014-12-22 2016-06-23 Rovi Guides, Inc. Systems and methods for filtering techniques using metadata and usage data analysis
CN106776534A (en) * 2016-11-11 2017-05-31 北京工商大学 The incremental learning method of term vector model
CN106802888A (en) * 2017-01-12 2017-06-06 北京航空航天大学 Term vector training method and device
CN107291696A (en) * 2017-06-28 2017-10-24 达而观信息科技(上海)有限公司 A kind of comment word sentiment analysis method and system based on deep learning
CN109271632A (en) * 2018-09-14 2019-01-25 重庆邂智科技有限公司 A kind of term vector learning method of supervision
CN109977014A (en) * 2019-03-22 2019-07-05 泰康保险集团股份有限公司 Code error recognition methods, device, equipment and storage medium based on block chain
US20190286716A1 (en) * 2018-03-19 2019-09-19 Adobe Inc. Online Dictionary Extension of Word Vectors
CN110377899A (en) * 2019-05-30 2019-10-25 北京达佳互联信息技术有限公司 A kind of method, apparatus and electronic equipment of determining word part of speech


Similar Documents

Publication Publication Date Title
US20230004721A1 (en) Method for training semantic representation model, device and storage medium
TWI643076B (en) Financial analysis system and method for unstructured text data
CN111159220B (en) Method and apparatus for outputting structured query statement
CN111090641B (en) Data processing method and device, electronic equipment and storage medium
JP2022191412A (en) Method for training multi-target image-text matching model and image-text retrieval method and apparatus
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
US20230134615A1 (en) Method of processing task, electronic device, and storage medium
CN110008807B (en) Training method, device and equipment for contract content recognition model
KR102608867B1 (en) Method for industry text increment, apparatus thereof, and computer program stored in medium
CN110852077B (en) Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary
CN112667664A (en) Method and device for sentence conversion between databases, terminal equipment and medium
CN115186738B (en) Model training method, device and storage medium
CN114118049B (en) Information acquisition method, device, electronic equipment and storage medium
US20220382991A1 (en) Training method and apparatus for document processing model, device, storage medium and program
EP4246365A1 (en) Webpage identification method and apparatus, electronic device, and medium
CN116185389A (en) Code generation method and device, electronic equipment and medium
US20210342379A1 (en) Method and device for processing sentence, and storage medium
CN110471708B (en) Method and device for acquiring configuration items based on reusable components
CN114443802A (en) Interface document processing method and device, electronic equipment and storage medium
CN115017256A (en) Power data processing method and device, electronic equipment and storage medium
US10522246B2 (en) Concepts for extracting lab data
US11720531B2 (en) Automatic creation of database objects
US20230316709A1 (en) Data labeling method based on artificial intelligence, apparatus and storage medium
US20230145853A1 (en) Method of generating pre-training model, electronic device, and storage medium
CN115952403A (en) Method and device for evaluating performance of object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant