CN111222327B - Word embedding representation method, device and equipment - Google Patents

Info

Publication number
CN111222327B
Authority
CN
China
Prior art keywords
word
represented
vector
word vector
words
Prior art date
Legal status
Active
Application number
CN201911336859.9A
Other languages
Chinese (zh)
Other versions
CN111222327A (en)
Inventor
张少阳
Current Assignee
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201911336859.9A
Publication of CN111222327A
Application granted
Publication of CN111222327B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a word embedding representation method, device and equipment. The method comprises the following steps: performing word segmentation on a text to be processed to obtain a word segmentation result, the word segmentation result comprising a word to be represented; inputting the word to be represented into a word2vec model and obtaining a first word vector of the word to be represented after processing by the word2vec model, the word2vec model being trained on data samples belonging to the same field as the text to be processed; inputting the word to be represented into a Bert model and obtaining a second word vector of the word to be represented after processing by the Bert model, the Bert model being trained on data samples of unlimited fields; and determining the word vector of the word to be represented by combining the first word vector and the second word vector, so as to realize word embedding representation of the word to be represented. Because the word vector of the word to be represented is determined by combining the first word vector output by the word2vec model and the second word vector output by the Bert model, the word embedding representation effect can be improved to the greatest extent.

Description

Word embedding representation method, device and equipment
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a word embedding representation method, device and equipment.
Background
Word embedding representation refers to the process of obtaining a corresponding word vector after vectorizing the word. Word embedding representation is a crucial step in the application field of Natural Language Processing (NLP), and as the result of word embedding representation has a great influence on the subsequent processing procedure in natural language processing, how to implement word embedding representation is a problem of continuous research in the technical field of natural language processing.
At present, a common word embedding representation method is based on the Bert model. However, because the Bert model has a deep network and requires a large volume of training data, it shows a good word embedding representation effect only in fields with abundant data and is not suitable for scenarios with little data.
In fields such as biomedicine, aviation and information security, factors such as privacy make professional data difficult to acquire, and a data volume sufficient to support the Bert model is hard to obtain; as a result, the accuracy of word embedding representations produced in these fields based on the Bert model alone is insufficient.
Therefore, a word embedding representation method is needed that is suitable for the above fields and can ensure the accuracy of the word embedding representation.
Disclosure of Invention
In view of the above, the present application provides a word embedding representation method, apparatus, and device, which can implement more accurate word embedding representation for a field with a smaller sample data size.
In a first aspect, to achieve the above object, the present application provides a word embedding representation method, including:
word segmentation is carried out on the text to be processed to obtain word segmentation results; the word segmentation result comprises words to be represented;
inputting the word to be represented into a word2vec model, and obtaining a first word vector of the word to be represented after processing the word2vec model; the word2vec model is obtained by training a data sample belonging to the same field as the text to be processed;
inputting the word to be represented into a Bert model, and obtaining a second word vector of the word to be represented after the processing of the Bert model; the Bert model is obtained by training a data sample in an unlimited field;
and determining the word vector of the word to be represented by combining the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
In an alternative embodiment, the determining the word vector of the word to be represented by combining the first word vector and the second word vector to implement word embedding representation of the word to be represented includes:
setting weight values for the first word vector and the second word vector respectively based on the occurrence condition of the words to be represented with preset context in the data samples in the same field and the data samples in the non-limited field; the preset context environment is determined based on the context of the word to be represented in the text to be processed;
and determining the word vector of the word to be represented according to the weight value, the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
In an optional implementation manner, the setting weights for the first word vector and the second word vector based on the occurrence of the word to be represented in the data sample in the same domain and the data sample in the unlimited domain, where the occurrence of the word to be represented has a preset context, includes:
identifying N words positioned before and after the word to be represented in the text to be processed, and recording the corresponding relation between each word and the position information as a preset context; the position information is used for representing the position relation with the words to be represented;
counting the occurrence times corresponding to each word based on the corresponding relation between each word and the position information in the data samples in the same field and the data samples in the unlimited field respectively;
based on the occurrence times corresponding to each word and a preset relation weight, respectively determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field;
and setting weight values for the first word vector and the second word vector respectively based on the context environmental impact score.
In an optional implementation manner, the word segmentation processing is performed on the text to be processed to obtain a word segmentation result, which includes:
and performing word segmentation on the text to be processed based on a pre-constructed professional dictionary to obtain a word segmentation result.
In a second aspect, the present application also provides a word embedding representation apparatus, the apparatus comprising:
the word segmentation module is used for carrying out word segmentation on the text to be processed to obtain word segmentation results; the word segmentation result comprises words to be represented;
the first processing module is used for inputting the word to be represented into a word2vec model, and obtaining a first word vector of the word to be represented after the word2vec model is processed; the word2vec model is obtained by training a data sample belonging to the same field as the text to be processed;
the second processing module is used for inputting the word to be represented into a Bert model, and obtaining a second word vector of the word to be represented after the processing of the Bert model; the Bert model is obtained by training a data sample in an unlimited field;
and the determining module is used for combining the first word vector and the second word vector, and determining the word vector of the word to be represented so as to realize word embedding representation of the word to be represented.
In an alternative embodiment, the determining module includes:
a first setting sub-module, configured to set weight values for the first word vector and the second word vector based on occurrence of the word to be represented in the same-domain data sample and the non-domain data sample having a preset context; the preset context environment is determined based on the context of the word to be represented in the text to be processed;
and the first determining submodule is used for determining the word vector of the word to be represented according to the weight value, the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
In an alternative embodiment, the first setting submodule includes:
the recording sub-module is used for identifying N words positioned before and after the words to be represented in the text to be processed and recording the corresponding relation between each word and the position information as a preset context; the position information is used for representing the position relation with the words to be represented;
the statistics sub-module is used for counting the occurrence times corresponding to each word based on the corresponding relation between each word and the position information in the data samples in the same field and the data samples in the non-limited field respectively;
the second determining submodule is used for respectively determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field based on the occurrence times corresponding to the words and a preset relation weight;
and the second setting submodule is used for setting weight values for the first word vector and the second word vector respectively based on the context environmental impact score.
In an alternative embodiment, the word segmentation module is specifically configured to:
and performing word segmentation on the text to be processed based on a pre-constructed professional dictionary to obtain a word segmentation result.
In a third aspect, the present application provides a computer readable storage medium having instructions stored therein which, when executed on a terminal device, cause the terminal device to implement any one of the methods described above.
In a fourth aspect, the present application provides an apparatus comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the above when executing the computer program.
The word embedding representation method provided by the embodiment of the application can be applied to fields with a small sample data volume. Specifically, the word2vec model is trained with the relatively small data samples of the field, so that the word vector output by the word2vec model embodies the characteristics of the field; meanwhile, the Bert model is trained with a large volume of data samples from unlimited fields, which ensures the training precision of the Bert model and allows the word vector output by the Bert model to embody the contextual influence on the word. In a word, the word vector of the word to be represented is determined by combining the first word vector output by the word2vec model and the second word vector output by the Bert model, so the word embedding representation effect can be improved to the greatest extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flowchart of a word embedding representation method provided in an embodiment of the present application;
fig. 2 is a flowchart of a weight value setting method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a word embedding representation device according to an embodiment of the present application;
fig. 4 is a block diagram of a word embedding representation device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In fields such as biomedicine, aviation and information security, factors such as privacy and data security make professional data difficult to acquire. Consequently, if the training samples of a model are composed only of professional data from such a field, the sample data volume is obviously small.
The word2vec model is a currently popular word embedding representation learning model. It has no particularly high requirement on sample data volume, so it can be used for word embedding representation in fields with a small sample data volume. However, the word2vec model cannot handle polysemy: the word embedding representation it produces for a given word is fixed, regardless of the context in which the word appears.
The Bert model remedies this deficiency of the word2vec model well: for the same word in different contexts, the Bert model produces different word embedding representations. However, because the Bert model has a deep network, training samples with a large data volume are required for it to output a good word embedding representation result. Obviously, the Bert model cannot be used directly to implement word embedding representation in fields with a small sample data volume.
Based on the above problems, the present application provides a word embedding representation method that can be applied to fields with a small sample data volume. Specifically, the word2vec model is trained with the relatively small data samples of the field, so that the word vector output by the word2vec model embodies the characteristics of the field; meanwhile, the Bert model is trained with a large volume of data samples from unlimited fields, which ensures the training precision of the Bert model and allows the word vector output by the Bert model to embody the contextual influence on the word. In a word, the word vector of the word to be represented is determined by combining the first word vector output by the word2vec model and the second word vector output by the Bert model, so the word embedding representation effect can be improved to the greatest extent.
The following application provides a word embedding representation method, and referring to fig. 1, a flowchart of the word embedding representation method provided in an embodiment of the application is provided, where the method includes:
s101: word segmentation is carried out on the text to be processed to obtain word segmentation results; the word segmentation result comprises words to be represented.
In this embodiment of the present application, the text to be processed may be any text requiring natural language processing in any application field. The word embedding representation method provided by the application is particularly suitable for fields with a small sample data volume, such as the biomedical field, the aviation field and the information security field.
Taking the biomedical field as an example, the text to be processed may be clinical record text, such as "headache, shortness of breath, abdominal distension, dry eyes".
In the embodiment of the application, after determining the text to be processed, word segmentation is performed on the text to be processed to obtain a word segmentation result of the text to be processed; the word segmentation result includes each word obtained after the text to be processed is segmented, and any word in the word segmentation result can be used as a word to be represented in the embodiment of the application.
In an alternative implementation manner, word segmentation processing can be performed on the text to be processed based on a pre-constructed professional dictionary, so as to obtain a word segmentation result. Specifically, the professional dictionary stores professional words in the field.
Taking the clinical medicine field as an example, clinical record text depends heavily on the personal recording habits of each doctor, so the current conventional word segmentation methods yield results of relatively low accuracy. If doctors provide professional words that are stored in a professional dictionary used for segmenting the clinical record text they write, a more accurate word segmentation result can obviously be obtained.
In practical application, the text to be processed may first be scanned to identify the words that match professional words in the professional dictionary, and these words are added to the word segmentation result; the remaining text can then be segmented with any of the current word segmentation methods, finally yielding the word segmentation result of the text to be processed. Because the professional words in the professional dictionary are used for word segmentation, a result better suited to the field is obtained; the professional-dictionary-based word segmentation provided in the embodiment of the application therefore improves the accuracy of the word segmentation result of the text to be processed and provides a more accurate word basis for word embedding representation. A minimal sketch of this dictionary-first segmentation is given below.
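The following sketch approximates the dictionary-assisted segmentation described above with the jieba library: professional words are registered with a high frequency so they take precedence over jieba's default splits. The example terms are hypothetical stand-ins for expert-provided vocabulary, not terms taken from the application.

    import jieba

    # Hypothetical professional terms supplied by domain experts.
    PROFESSIONAL_TERMS = ["气短", "腹胀", "眼睛发干"]

    for term in PROFESSIONAL_TERMS:
        # A high frequency makes jieba keep the professional term intact
        # instead of splitting it into shorter default-dictionary words.
        jieba.add_word(term, freq=1_000_000)

    def segment(text: str) -> list[str]:
        """Segment the text to be processed, preferring dictionary matches."""
        return jieba.lcut(text)

    print(segment("头痛时气短，腹胀，眼睛发干"))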
S102: inputting the word to be represented into a word2vec model, and obtaining a first word vector of the word to be represented after processing the word2vec model; the word2vec model is obtained by training a data sample belonging to the same field as the text to be processed.
Because the word2vec model has relatively low data volume requirements on training samples, the word2vec model can be trained by using the data samples belonging to the same field with the text to be processed, so that word vectors obtained by processing words to be represented by using the trained word2vec model can maximally embody the characteristics of the words to be represented in the field.
For example, for clinical medical fields with smaller sample data volumes, the word2vec model may be trained based on clinical recorded text in the historical data to ensure word2vec model word-embedded representation effects on words of the clinical medical field.
In the embodiment of the application, after the word to be expressed is obtained, the word to be expressed is input into a trained word2vec model, and after the word2vec model is used for processing the word, a first word vector of the word to be expressed is output. Since the word2vec model is trained based only on data samples in the art, the first word vector is able to characterize the word to be represented in the art.
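As an illustrative sketch (not part of the application), the in-domain word2vec model of S102 could be trained with the gensim library roughly as follows; the toy corpus and the hyperparameters are assumptions for demonstration only.

    from gensim.models import Word2Vec

    # Each training sample is a segmented sentence from the same field
    # as the text to be processed (hypothetical toy corpus).
    domain_corpus = [
        ["headache", "shortness_of_breath", "abdominal_distension"],
        ["abdominal_distension", "dry_eyes"],
    ]

    # gensim 4.x API: skip-gram, small window, toy settings.
    w2v = Word2Vec(sentences=domain_corpus, vector_size=128, window=3,
                   min_count=1, sg=1)

    # The "first word vector" of the word to be represented.
    first_word_vector = w2v.wv["abdominal_distension"]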
S103: inputting the word to be represented into a Bert model, and obtaining a second word vector of the word to be represented after the processing of the Bert model; the Bert model is obtained through training by using data samples in an unlimited field.
It should be noted that a model trained only on data samples belonging to the same field as the text to be processed may still produce inaccurate word embedding representations. Taking the clinical medical field as an example, consider the clinical record text "headache, shortness of breath, abdominal distension, dry eyes". In the original Chinese text, the context of the character glossed "abdomen" and the context of the character glossed "time" share two characters (those glossed "breath" and "short"), whereas the context of "abdomen" and the context of "eye" share only one character (the one glossed "distension"). If the Bert model were trained only on data samples of this field, the similarity between "time" and "abdomen" that it learns from context would be greater than the similarity between "eye" and "abdomen"; in fact, however, "eye" and "abdomen" both denote human body parts and should be the more similar pair.
Therefore, the embodiment of the application introduces general-field data as data samples to train the Bert model, so that the Bert model can learn embedded representations from richer, more complex contexts; using such a Bert model, the "abdomen" and "eye" in the clinical record text can be learned to have the higher similarity. In addition, training with data samples of unlimited fields ensures that the data volume is large enough, which guarantees the training precision of the Bert model.
In the embodiment of the application, after training the Bert model by using the data samples in the unlimited field, the Bert model is used for processing the words to be represented to obtain the second word vectors of the words to be represented. Because the Bert model is trained based on data samples of unlimited domains, the second word vector can represent different word embedded representations of the word to be represented in different domains with the same context.
In an alternative embodiment, only a weighted combination of the last four layers of the Bert model may be used to process the word to be represented, obtaining the second word vector of the word to be represented. Weighting the last four layers of the Bert model preserves the word embedding representation effect while improving the processing efficiency for the word to be represented, thereby reducing algorithm complexity. A sketch of this last-four-layer weighting follows.
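A minimal sketch of the last-four-layer weighting, using the Hugging Face transformers library; the model name and the equal 0.25 layer weights are assumptions, since the application does not fix them.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModel.from_pretrained("bert-base-chinese",
                                      output_hidden_states=True)

    def second_word_vector(word: str) -> torch.Tensor:
        """Weighted sum of the last four Bert layers, averaged over sub-word tokens."""
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            # hidden_states: embedding-layer output plus one tensor per layer.
            hidden_states = model(**inputs).hidden_states
        last_four = torch.stack(hidden_states[-4:])   # (4, batch, seq_len, hidden)
        weighted = (0.25 * last_four).sum(dim=0)      # assumed equal layer weights
        return weighted[0].mean(dim=0)                # average over the token axis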
It should be noted that the embodiment of the present application does not limit the execution sequence of S102 and S103 described above.
S104: and determining the word vector of the word to be represented by combining the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
In the embodiment of the application, after the first word vector and the second word vector of the word to be represented are obtained, the word vector of the word to be represented is determined by combining the first word vector and the second word vector, so that word embedding representation of the word to be represented is realized.
In an alternative embodiment, weight values are first set for the first word vector and the second word vector of the word to be represented respectively, and the word vector of the word to be represented is then determined based on the set weight values, the first word vector and the second word vector. The weight values may be set for the two word vectors based on the strength of correlation among words.
In practical application, weight values can be set for the first word vector and the second word vector respectively based on the occurrence condition of the word to be represented with a preset context in a data sample belonging to the same field as the word to be represented and a data sample of an unlimited field; the preset context environment is determined based on the context of the word to be represented in the text to be processed. And then, determining the word vector of the word to be represented according to the weight value, the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
Specifically, before the weight values are determined, the text to be processed is first scanned to determine the context of the word to be represented, and this context is used as the preset context. The context of the word to be represented may refer to the N words located before and after it in the text to be processed. For example, in the text to be processed "headache, shortness of breath, abdominal distension, dry eyes", the preset context of the word to be represented "abdominal distension" may include the two words before it ("time" and "shortness of breath") and the two words after it ("eyes" and "dry"). Then, the occurrences of the word to be represented together with its preset context are determined in the data samples belonging to the same field as the word to be represented and in the data samples of unlimited fields, respectively. It can be understood that the more frequently the word to be represented occurs with the preset context in a set of data samples, the better the word embedding representation of a model trained on those samples reflects the real context of the word to be represented, and therefore the higher the weight value that can be set for the corresponding word vector; and vice versa.
The embodiment of the application provides a manner of setting weight values for a first word vector and a second word vector, and referring to fig. 2, a flowchart of a weight value setting method provided in the embodiment of the application is provided, where the method includes:
s201: identifying N words positioned before and after the word to be represented in the text to be processed, and recording the corresponding relation between each word and the position information as a preset context; the position information is used for representing the position relation between the corresponding words and the words to be represented.
Assume the word to be represented is denoted X. The three words before and after X in the text to be processed are identified and denoted X-3, X-2, X-1, X+1, X+2, X+3 respectively. Here, the position information of X-3 indicates that X-3 is the third word before X, and the position information of X+3 indicates that X+3 is the third word after X; the position information of X-2, X-1, X+1 and X+2 is understood in the same way.
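As a small illustrative sketch (one possible realization, assumed rather than specified by the application), the preset context of S201 can be recorded as a mapping from relative position to word:

    def preset_context(sentence: list[str], index: int, n: int = 3) -> dict[int, str]:
        """Record the words at the n positions before and after the word
        to be represented (at `index`), keyed by relative position."""
        ctx = {}
        for offset in range(-n, n + 1):
            j = index + offset
            if offset != 0 and 0 <= j < len(sentence):
                ctx[offset] = sentence[j]
        return ctx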
S202: and counting the occurrence times corresponding to each word based on the corresponding relation between each word and the position information in the data samples in the same field and the data samples in the non-limited field respectively.
After the preset context X-3, X-2, X-1, X, X+1, X+2, X+3 of the word X to be represented is determined, the occurrence count of each of X-3, X-2, X-1, X+1, X+2, X+3 is tallied, based on the correspondence between each word and its position information, in the data samples of the same field as X and in the data samples of unlimited fields, respectively.
Specifically, in the data samples belonging to the same field as X, the number of occurrences N-3 of X-3 as the third word before X is counted, the number of occurrences N-2 of X-2 as the second word before X is counted, and the number of occurrences N-1 of X-1 as the first word before X is counted; likewise, the number of occurrences N+3 of X+3 as the third word after X, the number of occurrences N+2 of X+2 as the second word after X, and the number of occurrences N+1 of X+1 as the first word after X are counted.
In the same way, the occurrence counts corresponding to X-3, X-2, X-1, X+1, X+2, X+3 are tallied in the data samples of unlimited fields, which is not described in detail herein.
S203: and respectively determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field based on the occurrence times corresponding to the words and a preset relation weight.
In the embodiment of the application, since the positional relationship with the word to be expressed can reflect the influence degree of each word on the word to be expressed, the closer the positional relationship with the word to be expressed is, the higher the influence degree on the word to be expressed is. Therefore, a relationship weight can be set for each word in the preset context respectively, so as to reflect the influence degree of each word on the word to be represented.
Taking X-3, X-2, X-1, X, X+1, X+2, X+3 above as an example, the relation weights of X-3 and X+3 may both be set to 0.05, the relation weights of X-2 and X+2 may both be set to 0.15, and the relation weights of X-1 and X+1 may both be set to 0.3, the sum being 1. Then, combining the occurrence count of each word with its relation weight, the context environmental impact score of the word to be represented is determined relative to the data samples of the same field and relative to the data samples of unlimited fields. Each of these two scores is determined according to the following score calculation formula (1):
S = 0.05*N-3 + 0.15*N-2 + 0.3*N-1 + 0.3*N+1 + 0.15*N+2 + 0.05*N+3 (1)
the relationship weight is not limited to the above-mentioned arrangement, and the feature that the closer the positional relationship between the word and the word to be expressed is, the higher the influence degree on the word to be expressed is may be represented.
S204: and setting weight values for the first word vector and the second word vector respectively based on the context environmental impact score.
In the embodiment of the application, after determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field, weight values are respectively set for the first word vector and the second word vector based on the scores.
In an alternative embodiment, after the context environmental impact score S1 of the word to be represented relative to the data samples of the same field and the context environmental impact score S2 of the word to be represented relative to the data samples of unlimited fields are determined, the value S1/(S1+S2) is taken as the weight value of the first word vector, and the value S2/(S1+S2) is taken as the weight value of the second word vector.
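A minimal sketch of S204 combined with S104, under the assumption that "combining" is realized as an element-wise weighted sum of two vectors of equal dimensionality:

    import numpy as np

    def combine(first_vec: np.ndarray, second_vec: np.ndarray,
                s1: float, s2: float) -> np.ndarray:
        """Normalize the two impact scores into weights and blend the vectors."""
        w1 = s1 / (s1 + s2)   # weight of the in-domain word2vec vector
        w2 = s2 / (s1 + s2)   # weight of the open-domain Bert vector
        return w1 * first_vec + w2 * second_vec

With this score-based weighting, whichever corpus better reproduces the word's real context contributes more to the final embedding.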
In the embodiment of the application, based on the occurrence of the word to be represented with the preset context in the data sample belonging to the same field as the word to be represented and the data sample not limited in the field, the weight values are respectively set for the first word vector and the second word vector, so that the influence of the context on word embedding representation can be embodied, and the effect of the final word embedding representation is better.
The word embedding representation method provided by the application can be applied to fields with a small sample data volume. Specifically, the word2vec model and the Bert model are used together to produce word embedding representations for words in such fields: the word2vec model is trained with the relatively small data samples of the field, so that the word vector it outputs embodies the characteristics of the field; meanwhile, the Bert model is trained with a large volume of data samples from unlimited fields, which ensures the training precision of the Bert model and allows the word vector it outputs to embody the contextual influence on the word. In a word, the word vector of the word to be represented is determined by combining the first word vector output by the word2vec model and the second word vector output by the Bert model, so the word embedding representation effect can be improved to the greatest extent.
Corresponding to the above method embodiment, the present application further provides a word embedding representation device, referring to fig. 3, which is a schematic structural diagram of the word embedding representation device provided in the embodiment of the present application, where the device includes:
the word segmentation module 301 is configured to perform word segmentation on a text to be processed to obtain a word segmentation result; the word segmentation result comprises words to be represented;
the first processing module 302 is configured to input the word to be represented into a word2vec model, and obtain a first word vector of the word to be represented after processing by the word2vec model; the word2vec model is obtained by training a data sample belonging to the same field as the text to be processed;
the second processing module 303 is configured to input the word to be represented into a Bert model, and obtain a second word vector of the word to be represented after the processing of the Bert model; the Bert model is obtained by training a data sample in an unlimited field;
the determining module 304 is configured to determine a word vector of the word to be expressed by combining the first word vector and the second word vector, so as to implement word embedding expression of the word to be expressed.
In an alternative embodiment, the determining module includes:
a first setting sub-module, configured to set weight values for the first word vector and the second word vector based on occurrence of the word to be represented in the same-domain data sample and the non-domain data sample having a preset context; the preset context environment is determined based on the context of the word to be represented in the text to be processed;
and the first determining submodule is used for determining the word vector of the word to be represented according to the weight value, the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
In an alternative embodiment, the first setting submodule includes:
the recording sub-module is used for identifying N words positioned before and after the words to be represented in the text to be processed and recording the corresponding relation between each word and the position information as a preset context; the position information is used for representing the position relation with the words to be represented;
the statistics sub-module is used for counting the occurrence times corresponding to each word based on the corresponding relation between each word and the position information in the data samples in the same field and the data samples in the non-limited field respectively;
the second determining submodule is used for respectively determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field based on the occurrence times corresponding to the words and a preset relation weight;
and the second setting submodule is used for setting weight values for the first word vector and the second word vector respectively based on the context environmental impact score.
In an alternative embodiment, the word segmentation module is specifically configured to:
and performing word segmentation on the text to be processed based on a pre-constructed professional dictionary to obtain a word segmentation result.
The word embedding representation device provided by the embodiment of the application can be applied to fields with a small sample data volume. Specifically, the word2vec model is trained with the relatively small data samples of the field, so that the word vector output by the word2vec model embodies the characteristics of the field; meanwhile, the Bert model is trained with a large volume of data samples from unlimited fields, which ensures the training precision of the Bert model and allows the word vector output by the Bert model to embody the contextual influence on the word. In a word, the word vector of the word to be represented is determined by combining the first word vector output by the word2vec model and the second word vector output by the Bert model, so the word embedding representation effect can be improved to the greatest extent.
In addition, the embodiment of the application further provides a word embedding representation device, which is shown in fig. 4, and may include:
a processor 401, a memory 402, an input device 403 and an output device 404. The number of processors 401 in the word embedding representation device may be one or more; one processor is taken as an example in fig. 4. In some embodiments of the invention, the processor 401, memory 402, input device 403 and output device 404 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 4.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing of the word embedding representation apparatus by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area that may store an operating system, application programs required for at least one function, and the like, and a storage data area. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The input means 403 may be used to receive entered numeric or character information and to generate signal inputs related to user settings and function control of the word embedding representation apparatus.
In particular, in this embodiment, the processor 401 loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement the various functions of the word embedding representation device.
In addition, the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions, and when the instructions are run on the terminal device, the instructions enable the terminal device to realize the word embedding representation function.
It is to be understood that for the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing has described in detail a method, apparatus and device for word embedding representation provided by the embodiments of the present application, and specific examples have been applied herein to illustrate the principles and embodiments of the present application, and the description of the foregoing examples is only for aiding in understanding the method and core idea of the present application; meanwhile, as those skilled in the art will have modifications in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (6)

1. A method of word embedding representation, the method comprising:
word segmentation is carried out on the text to be processed to obtain word segmentation results; the word segmentation result comprises words to be represented;
inputting the word to be represented into a word2vec model, and obtaining a first word vector of the word to be represented after processing the word2vec model; the word2vec model is obtained by training a data sample belonging to the same field as the text to be processed;
inputting the word to be represented into a Bert model, and obtaining a second word vector of the word to be represented after the processing of the Bert model; the Bert model is obtained by training a data sample in an unlimited field;
determining a word vector of the word to be represented by combining the first word vector and the second word vector, so as to realize word embedding representation of the word to be represented, wherein the determining a word vector of the word to be represented by combining the first word vector and the second word vector to realize word embedding representation of the word to be represented comprises:
setting weight values for the first word vector and the second word vector based on occurrence of the word to be expressed having a preset context in the data sample of the same domain and the data sample of the non-limited domain, respectively, wherein the setting weight values for the first word vector and the second word vector based on occurrence of the word to be expressed having a preset context in the data sample of the same domain and the data sample of the non-limited domain, respectively, comprises:
identifying N words positioned before and after the word to be represented in the text to be processed, and recording the corresponding relation between each word and the position information as a preset context; the position information is used for representing the position relation with the words to be represented;
counting the occurrence times corresponding to each word based on the corresponding relation between each word and the position information in the data samples in the same field and the data samples in the unlimited field respectively;
based on the occurrence times corresponding to each word and a preset relation weight, respectively determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field;
setting weight values for the first word vector and the second word vector based on the context environmental impact score, respectively;
and determining the word vector of the word to be represented according to the weight value, the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
2. The method according to claim 1, wherein the word segmentation of the text to be processed to obtain a word segmentation result includes:
and performing word segmentation on the text to be processed based on a pre-constructed professional dictionary to obtain a word segmentation result.
3. A word embedding representation apparatus, the apparatus comprising:
the word segmentation module is used for carrying out word segmentation on the text to be processed to obtain word segmentation results; the word segmentation result comprises words to be represented;
the first processing module is used for inputting the word to be represented into a word2vec model, and obtaining a first word vector of the word to be represented after the word2vec model is processed; the word2vec model is obtained by training a data sample belonging to the same field as the text to be processed;
the second processing module is used for inputting the word to be represented into a Bert model, and obtaining a second word vector of the word to be represented after the processing of the Bert model; the Bert model is obtained by training a data sample in an unlimited field;
a determining module, configured to combine the first word vector and the second word vector, determine a word vector of the word to be expressed, so as to implement word embedding expression of the word to be expressed, where the determining module includes:
a first setting submodule, configured to set weight values for the first word vector and the second word vector based on occurrence of the word to be expressed in the data sample in the same domain and the data sample in the non-limited domain, where the occurrence of the word to be expressed has a preset context, and the first setting submodule includes:
the recording sub-module is used for identifying N words positioned before and after the words to be represented in the text to be processed and recording the corresponding relation between each word and the position information as a preset context; the position information is used for representing the position relation with the words to be represented;
the statistics sub-module is used for counting the occurrence times corresponding to each word based on the corresponding relation between each word and the position information in the data samples in the same field and the data samples in the non-limited field respectively;
the second determining submodule is used for respectively determining the context environmental impact scores of the words to be represented relative to the data samples in the same field and the data samples in the unlimited field based on the occurrence times corresponding to the words and a preset relation weight;
a second setting sub-module, configured to set weight values for the first word vector and the second word vector, respectively, based on the context environmental impact score;
and the first determining submodule is used for determining the word vector of the word to be represented according to the weight value, the first word vector and the second word vector so as to realize word embedding representation of the word to be represented.
4. The apparatus of claim 3, wherein the word segmentation module is specifically configured to:
and performing word segmentation on the text to be processed based on a pre-constructed professional dictionary to obtain a word segmentation result.
5. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to implement the method according to any of claims 1-2.
6. An apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of claims 1-2 when the computer program is executed.
CN201911336859.9A 2019-12-23 2019-12-23 Word embedding representation method, device and equipment Active CN111222327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911336859.9A CN111222327B (en) 2019-12-23 2019-12-23 Word embedding representation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911336859.9A CN111222327B (en) 2019-12-23 2019-12-23 Word embedding representation method, device and equipment

Publications (2)

Publication Number Publication Date
CN111222327A CN111222327A (en) 2020-06-02
CN111222327B true CN111222327B (en) 2023-04-28

Family

ID=70825973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911336859.9A Active CN111222327B (en) 2019-12-23 2019-12-23 Word embedding representation method, device and equipment

Country Status (1)

Country Link
CN (1) CN111222327B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507099A (en) * 2020-06-19 2020-08-07 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN113449074A (en) * 2021-06-22 2021-09-28 重庆长安汽车股份有限公司 Sentence vector similarity matching optimization method and device containing proper nouns and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992671A (en) * 2019-04-10 2019-07-09 出门问问信息科技有限公司 Intension recognizing method, device, equipment and storage medium
CN110287479A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Name entity recognition method, electronic device and storage medium
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN

Also Published As

Publication number Publication date
CN111222327A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
CN111382255B (en) Method, apparatus, device and medium for question-answering processing
CN108491486B (en) Method, device, terminal equipment and storage medium for simulating patient inquiry dialogue
CN107194158A (en) A kind of disease aided diagnosis method based on image recognition
KR102265573B1 (en) Method and system for reconstructing mathematics learning curriculum based on artificial intelligence
CN110765882B (en) Video tag determination method, device, server and storage medium
CN110427486B (en) Body condition text classification method, device and equipment
CN111222327B (en) Word embedding representation method, device and equipment
CN106649739B (en) Multi-round interactive information inheritance identification method and device and interactive system
CN111738269B (en) Model training method, image processing device, model training apparatus, and storage medium
CN113298152B (en) Model training method, device, terminal equipment and computer readable storage medium
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN113792871A (en) Neural network training method, target identification method, device and electronic equipment
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN111753076A (en) Dialogue method, dialogue device, electronic equipment and readable storage medium
CN110399488A (en) File classification method and device
CN112069329B (en) Text corpus processing method, device, equipment and storage medium
CN115146068B (en) Method, device, equipment and storage medium for extracting relation triples
CN113221530A (en) Text similarity matching method and device based on circle loss, computer equipment and storage medium
CN114492451B (en) Text matching method, device, electronic equipment and computer readable storage medium
CN115222443A (en) Client group division method, device, equipment and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN111429414B (en) Artificial intelligence-based focus image sample determination method and related device
CN110389999A (en) A kind of method, apparatus of information extraction, storage medium and electronic equipment
CN113435531A (en) Zero sample image classification method and system, electronic equipment and storage medium
CN116704591A (en) Eye axis prediction model training method, eye axis prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant