CN114662488A - Word vector generation method and device, computing device and computer-readable storage medium - Google Patents

Word vector generation method and device, computing device and computer-readable storage medium

Info

Publication number
CN114662488A
CN114662488A
Authority
CN
China
Prior art keywords: word, text unit, vector, corpus data, group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111653193.7A
Other languages
Chinese (zh)
Inventor
张冠华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111653193.7A
Publication of CN114662488A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a word vector generation method and apparatus. The word vector generation method includes: obtaining corpus data comprising at least two text units, each text unit containing at least one word; determining a weight for each text unit in the corpus data according to the distribution, across the text units of the corpus data, of each group word in a group word set and each target word in a target word set; determining sample data for training a word vector model according to the corpus data and the weight of each text unit in the corpus data; and training the word vector model with the sample data, and obtaining, from the trained word vector model, a word vector for at least one word of at least one text unit in the corpus data.

Description

Word vector generation method and device, computing device and computer-readable storage medium
Technical Field
The present disclosure relates to the field of natural language processing, and more particularly, to a method and apparatus for generating word vectors.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. In natural language processing tasks, a computer cannot read natural language directly, so a mapping must be designed to transform natural language into a mathematical form the computer can process; word vectors are generated for this purpose. A word vector, also known as a word embedding, is a technique for mapping words in human natural language into low-dimensional real-valued vectors, thereby characterizing the words themselves and the relationships between them. Word vectors are widely used in deep learning to represent words, often as the first layer of a deep learning model. Generally speaking, the higher the quality of a word vector, the richer and more accurate the semantic information it contains, the more easily a computer can understand the semantics of the natural language, and the better the results of natural language processing tasks can be. In the related art, the word vector model is generally trained directly on the original natural language corpus data as training samples, so as to obtain the word vectors corresponding to the words in the corpus data.
Disclosure of Invention
The inventors have found that raw corpus data collected from human society often carries the cognitive biases present in the human world, such as bias toward a certain population or toward a particular thing. Because human understanding of the world is shaped by many factors (such as culture, environment, region, and living habits), cognitive bias may arise, and that bias is then reflected in the corresponding corpus data. Consequently, when a model is trained directly on corpus data containing cognitive bias, the training result inevitably captures the bias, so the resulting word vectors contain bias factors. When such biased word vectors are applied to downstream natural language processing tasks, the downstream task models exhibit similar cognitive bias. For example, in a machine learning model for screening resumes, biased word vectors carrying gender bias may lead to the unfair conclusion that men are more competent than women for management positions; in a more extreme case, the model may predict a sentence as a "negative sentence" merely because the sentence in the corpus data contains "female" or related words, and then automatically filter out the resumes of female job seekers. Such inaccurate and unfair predictions caused by cognitive bias factors in the word vectors are clearly unacceptable.
Furthermore, since the word vectors carry cognitive bias, they cannot truly and accurately represent the meanings of the corresponding words and the relationships between words; that is, the word vectors are inevitably biased, so their accuracy is difficult to guarantee.
It is an object of the present disclosure to overcome at least one of the above disadvantages of the related art. Specifically, the accuracy of word vectors can be improved by applying weighted bias-correction processing to the text units of the original corpus data.
According to an aspect of the present disclosure, there is provided a word vector generation method including: obtaining corpus data comprising at least two text units, wherein each text unit comprises at least one word; determining the weight of each text unit in the corpus data according to the distribution of each group word in a group word set and each target word in a target word set in the text units of the corpus data; determining sample data for training a word vector model according to the corpus data and the weight of each text unit in the corpus data; and training the word vector model by using the sample data, and obtaining a word vector of at least one word of at least one text unit in the corpus data from the trained word vector model.
In the word vector generation method according to some embodiments of the present disclosure, determining a weight of each text unit in the corpus data according to a distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data includes: determining a group word distribution vector and a target word distribution vector of each text unit in corpus data, wherein the group word distribution vector is used for indicating a first distribution condition of each group word in a group word set in the text unit, each element in the group word distribution vector indicates whether a corresponding group word in the group word set exists in the text unit, the target word distribution vector is used for indicating a second distribution condition of each target word in the target word set in the text unit, and each element in the target word distribution vector indicates whether a corresponding target word in the target word set exists in the text unit; according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, determining a first probability of occurrence of a second distribution condition represented by the target word distribution vector of the text unit under the condition that the first distribution condition represented by the group word distribution vector of the text unit occurs for each text unit in the corpus data; according to the target word distribution vector of each text unit in the corpus data, determining a second probability of occurrence of a second distribution condition represented by the target word distribution vector of the text unit aiming at each text unit in the corpus data; and determining the weight of each text unit in the corpus data according to the first probability and the second probability.
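By way of illustration only, the following minimal sketch shows one way the two probabilities above might be combined into per-text-unit weights. It assumes the weight is the ratio of the second probability to the first probability, in the spirit of the inverse probability weighting mentioned in the term definitions later; the patent only states that the weight is determined from the two probabilities, and the function and argument names are illustrative.

```python
import numpy as np

def text_unit_weights(first_probs, second_probs):
    """Combine per-text-unit probabilities into weights (assumed ratio form).

    first_probs[i]  : P(target-word distribution of unit i | its group-word distribution)
    second_probs[i] : P(target-word distribution of unit i)
    """
    first_probs = np.asarray(first_probs, dtype=float)
    second_probs = np.asarray(second_probs, dtype=float)
    # Units whose target words are unusually likely given their group words
    # (first > second) receive weights below 1; unusually unlikely ones above 1.
    return second_probs / np.clip(first_probs, 1e-12, None)
```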
In a word vector generation method according to some embodiments of the present disclosure, determining, for each text unit in corpus data, a first probability of occurrence of a second distribution condition characterized by a target word distribution vector of the text unit in a case where the first distribution condition characterized by the group word distribution vector of the text unit occurs according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, includes:
dividing corpus data into K groups, wherein each group comprises at least one text unit, and K is an integer greater than or equal to 2; and
for each group of corpus data in the K groups of corpus data, executing the following steps:
training set and testing set determination steps: taking the group word distribution vector and the target word distribution vector of each text unit in the group of corpus data as a test set, taking the group word distribution vector and the target word distribution vector of each text unit in other groups of corpus data except the group of corpus data in the K groups of corpus data as a training set,
training step: training a classifier model by taking the group word distribution vector of each text unit in the training set as input and the target word distribution vector as output, and
a prediction step: and aiming at each text unit in the test set, predicting a first probability of occurrence of a second distribution condition represented by the target word distribution vector under the condition that the first distribution condition represented by the group word distribution vector occurs by using a trained classifier model according to the group word distribution vector and the target word distribution vector of the text unit.
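A minimal sketch of this cross-fitted classifier procedure is given below, using scikit-learn (the library is an assumption, not named in the patent). Because the patent does not specify how the classifier output is turned into a single probability for the whole target word distribution vector, the sketch trains one binary classifier per target word and multiplies the per-word probabilities, which treats the target words as conditionally independent given the group word distribution vector.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputClassifier

def first_probabilities(group_vecs, target_vecs, k=5):
    """Cross-fitted estimate of P(target distribution | group distribution) per text unit."""
    group_vecs, target_vecs = np.asarray(group_vecs), np.asarray(target_vecs)
    probs = np.zeros(len(group_vecs))
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(group_vecs):
        clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100))
        clf.fit(group_vecs[train_idx], target_vecs[train_idx])   # training step
        per_word = clf.predict_proba(group_vecs[test_idx])       # one array per target word
        for row, i in enumerate(test_idx):                       # prediction step
            p = 1.0
            for j, word_probs in enumerate(per_word):
                classes = clf.estimators_[j].classes_
                hit = np.where(classes == target_vecs[i, j])[0]
                # probability the classifier assigns to the observed 0/1 value of word j
                p *= word_probs[row, hit[0]] if hit.size else 1e-6
            probs[i] = p
    return probs
```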
In a word vector generation method according to some embodiments of the present disclosure, determining, for each text unit in the corpus data, a first probability of occurrence of the second distribution condition characterized by the target word distribution vector of the text unit in the case where the first distribution condition characterized by the group word distribution vector of the text unit occurs, according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, includes: determining the probability of each group word in the group word set appearing in the text units of the corpus data according to the group word distribution vector of each text unit in the corpus data; determining a third probability of occurrence of the first distribution condition represented by the group word distribution vector of each text unit in the corpus data according to the probability of each group word in the group word set appearing in the text units of the corpus data; determining a fourth probability that the first distribution condition represented by the group word distribution vector and the second distribution condition represented by the target word distribution vector of each text unit in the corpus data appear simultaneously, according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data; and determining, according to the third probability and the fourth probability, the first probability of occurrence of the second distribution condition represented by the target word distribution vector of each text unit in the case where the first distribution condition represented by the group word distribution vector of that text unit occurs.
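The counting-based variant above can be sketched as follows. The product over individual group word probabilities for the third probability assumes the group words occur independently of one another, and the fourth probability is read here as the empirical frequency of text units whose two distribution vectors both match; both readings are assumptions on top of the patent text.

```python
import numpy as np

def counting_first_probabilities(group_vecs, target_vecs):
    """P(target dist | group dist) = fourth probability / third probability, by counting."""
    group_vecs, target_vecs = np.asarray(group_vecs), np.asarray(target_vecs)
    p_group_word = group_vecs.mean(axis=0)        # occurrence rate of each group word
    first = np.zeros(len(group_vecs))
    for i in range(len(group_vecs)):
        # third probability: P(this group-word distribution), independence assumed
        third = np.prod(np.where(group_vecs[i] == 1, p_group_word, 1 - p_group_word))
        # fourth probability: fraction of units sharing both distribution vectors
        match = np.all(group_vecs == group_vecs[i], axis=1) & \
                np.all(target_vecs == target_vecs[i], axis=1)
        first[i] = match.mean() / max(third, 1e-12)
    return first
```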
In a word vector generation method according to some embodiments of the present disclosure, determining, for each text unit in the corpus data, a second probability of occurrence of the second distribution condition represented by the target word distribution vector of the text unit, according to the target word distribution vector of each text unit in the corpus data, includes: determining the probability of each target word in the target word set appearing in the text units of the corpus data according to the target word distribution vector of each text unit in the corpus data; and determining the second probability of occurrence of the second distribution condition represented by the target word distribution vector of each text unit in the corpus data according to the probability of each target word in the target word set appearing in the text units of the corpus data.
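The second probability can likewise be estimated from per-word occurrence rates; the product form below again assumes, for illustration, that the target words occur independently of one another.

```python
import numpy as np

def second_probabilities(target_vecs):
    """P(target-word distribution) for each text unit, from per-word frequencies."""
    target_vecs = np.asarray(target_vecs)
    p_target_word = target_vecs.mean(axis=0)      # occurrence rate of each target word
    return np.prod(np.where(target_vecs == 1, p_target_word, 1 - p_target_word), axis=1)
```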
In a word vector generation method according to some embodiments of the present disclosure, the classifier model includes at least one of: a random forest classifier model, an XGBoost classifier model, and a LightGBM classifier model.
In a word vector generation method according to some embodiments of the present disclosure, determining sample data for training a word vector model according to corpus data and a weight of each text unit in the corpus data includes: determining a co-occurrence value of every two words in the corpus data in each text unit, the co-occurrence value indicating whether the two words are simultaneously present in the text unit; determining the weighted co-occurrence value of each two words in the text unit according to the co-occurrence value of the two words in each text unit in the corpus data and the weight of the text unit; and constructing a co-occurrence matrix according to the weighted co-occurrence value of every two words in the corpus data in each text unit, wherein the co-occurrence matrix is used as sample data for training a word vector model.
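A sketch of the weighted co-occurrence matrix construction follows; a GloVe-style model can then be trained on such a matrix. The dictionary-of-pairs representation is an implementation choice made here for brevity, not part of the patent.

```python
from collections import defaultdict

def weighted_cooccurrence_matrix(corpus_units, weights):
    """Accumulate, for every word pair, the sum of weighted co-occurrence values."""
    matrix = defaultdict(float)
    for unit, w in zip(corpus_units, weights):
        words = sorted(set(unit))                 # each unit is a list of words
        for a in range(len(words)):
            for b in range(a + 1, len(words)):
                # co-occurrence value is 1 when both words appear in the unit;
                # it is multiplied by the unit's weight before accumulation
                matrix[(words[a], words[b])] += w
    return matrix
```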
In a word vector generation method according to some embodiments of the present disclosure, determining sample data for training a word vector model according to corpus data and a weight of each text unit in the corpus data includes: selecting text units from the corpus data according to the weight of each text unit in the corpus data; and determining sample data for training the word vector model according to the selected text unit.
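Alternatively, the weights can drive a resampling of text units, after which an off-the-shelf word vector model is trained on the resampled corpus. The sketch below uses gensim's Word2Vec as the example model; Word2vec is one of the models listed later, but gensim and the hyper-parameter values are assumptions.

```python
import random
from gensim.models import Word2Vec

def train_on_weighted_sample(corpus_units, weights, sample_size):
    """Resample units in proportion to their weights, then train a word vector model."""
    sampled = random.choices(corpus_units, weights=weights, k=sample_size)
    model = Word2Vec(sentences=sampled, vector_size=100, window=5, min_count=1)
    return model.wv   # keyed vectors; e.g. model.wv["女性"] gives that word's vector
```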
In the word vector generation method according to some embodiments of the present disclosure, before determining a weight of each text unit in the corpus data according to distributions of each group word in the group word set and each target word in the target word set in the text unit of the corpus data, the method further includes: and performing word segmentation processing on each text unit in the text data to obtain each word contained in the text unit.
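For Chinese corpora, this word segmentation step could be performed with an off-the-shelf segmenter; the example below uses the jieba library, which is not named in the patent and is only one possible choice, and the sentence is illustrative.

```python
import jieba  # an open-source Chinese word segmentation library (example choice)

sentence = "女性也可以胜任管理岗位"
words = jieba.lcut(sentence)   # list of words making up this text unit
```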
In a word vector generation method according to some embodiments of the present disclosure, the method further includes: acquiring a text to be processed; and searching a word vector corresponding to the word in the text to be processed from a word vector library formed by the obtained word vectors.
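The lookup step amounts to matching the words of the text to be processed against the generated word vector library; a minimal sketch, assuming the library is a plain word-to-vector mapping, might look like this (handling of out-of-vocabulary words is left as a design choice).

```python
def lookup_word_vectors(words, word_vector_library):
    """Return the vector for each word found in the library (a plain dict here)."""
    return {w: word_vector_library[w] for w in words if w in word_vector_library}
```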
In a word vector generation method according to some embodiments of the present disclosure, the word vector model includes at least one of: GloVe, Word2vec, and fastText.
In a word vector generation method according to some embodiments of the present disclosure, a text unit is a sentence in natural language.
According to another aspect of the present disclosure, there is provided a word vector generating apparatus including: the corpus acquiring module is used for acquiring corpus data comprising at least two text units, and each text unit at least comprises a word; the weight determining module is used for determining the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data; the sample determining module is used for determining sample data used for training the word vector model according to the corpus data and the weight of each text unit in the corpus data; and the word vector obtaining module is used for training a word vector model by using the sample data and obtaining a word vector of at least one word of at least one text unit in the corpus data from the trained word vector model.
According to another aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having instructions stored thereon, the instructions, when executed on the processor, cause the processor to perform a method of word vector generation according to some embodiments of the present disclosure.
According to another aspect of the present disclosure, one or more computer-readable storage media are provided having computer-readable instructions stored thereon that, when executed, implement a word vector generation method according to some embodiments of the present disclosure.
In the word vector generation method according to some embodiments of the present disclosure, by assigning a corresponding weight to each text unit based on the distribution of the group words and the target words in the text units of the corpus data, the degree to which target words and group words are too strongly or too weakly associated in the text units due to group cognitive bias can be adaptively adjusted, so as to weaken or remove the group cognitive bias present in the corpus data. Moreover, through this text unit weighting operation, cognitive bias in the corpus toward specific objects caused by regional differences, cultural differences, and the like can also be corrected, so the accuracy of the word vectors is significantly improved. In services or tasks related to natural language processing, such unbiased (i.e., de-biased) word vectors can more objectively and realistically reflect the meanings of natural language words and the relationships between words in human society and the physical world, and different groups can therefore be treated more fairly. Furthermore, the text unit weighting, sample data improvement, and related operations in the word vector generation method according to embodiments of the present disclosure can be completed automatically by a computing device without any additional manual work (such as corpus labeling). Compared with removing cognitive bias through manual inspection or manual labeling in the related art, the word vector generation method according to embodiments of the present disclosure therefore simplifies the workflow, significantly improves efficiency, and reduces the labor cost of de-biasing. In addition, these operations are all completed before the training of the word vector model starts, so they bring no additional performance loss to the training of the word vector model.
Drawings
Various aspects, features and advantages of the disclosure will become more readily apparent from the following detailed description and the accompanying drawings, in which:
FIG. 1 schematically illustrates an example implementation environment for a word vector generation method according to some embodiments of the present disclosure;
FIG. 2 schematically illustrates an example interaction flow diagram implemented in the example implementation environment of FIG. 1 by a word vector generation method according to some embodiments of the present disclosure;
FIGS. 3A and 3B respectively schematically illustrate a flow diagram of a word vector generation method according to some embodiments of the present disclosure;
FIG. 4 schematically illustrates an example interface of an example application scenario of a word vector generation method according to some embodiments of the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a word vector generation method, according to some embodiments of the present disclosure;
FIGS. 6A and 6B respectively schematically illustrate flow diagrams of word vector generation methods according to some embodiments of the present disclosure;
FIGS. 7A and 7B respectively schematically illustrate flow diagrams of word vector generation methods according to some embodiments of the present disclosure;
FIG. 8 schematically illustrates a flow diagram of a word vector generation method, according to some embodiments of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a word vector generation apparatus, in accordance with some embodiments of the present disclosure;
FIG. 10 schematically illustrates a computing device according to some embodiments of the present disclosure.
It is to be noted that the figures are diagrammatic and explanatory only and are not necessarily drawn to scale.
Detailed Description
Several embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings in order to enable those skilled in the art to practice the disclosure. The present disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. The embodiments do not limit the disclosure.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components and/or sections, these elements, components and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component or section from another element, component or section. Thus, a first element, component, or section discussed below could be termed a second element, component, or section without departing from the teachings of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Before describing in detail embodiments of the present invention, for the sake of clarity, some concepts related thereto are first explained:
1. word vector: word Embedding (Word Embedding) refers to a technology of mapping words in a human natural language into low-dimensional real number vectors so as to represent the words and the relations between the words. Word vectors are widely used in deep learning to characterize words, often as the first layer of a deep learning model. Word vectors may generally include two types: one-hot (one-hot) word vectors and distributed word vectors.
2. Text unit: refers to the language units that constitute a corpus, and generally has a relatively complete meaning, such as sentences, paragraphs, etc. in natural language. A unit of text may include, but is not limited to, one or more sentences, a portion of a sentence, or one or more paragraphs.
3. Cognitive bias: refers to the phenomenon that, in the process of understanding the world, human cognition of certain objects deviates from the real situation due to the influence of various factors (such as culture, environment, region, and living habits). Cognitive bias may include, for example, group cognitive bias toward a particular group (e.g., gender bias, regional bias) and cognitive bias toward a particular thing.
4. Group words and target words: in a group cognitive bias, for example a gender cognitive bias, words denoting the group toward which the bias is directed (for example, females) and its relative group (for example, males), such as "female", "male", "Mr." and "woman", may be group words, while words used to evaluate how well or poorly a person does in a certain respect (for example, ability), such as "title", "excellent", "weak" and "poor", may be target words.
5. Group word distribution vector of a text unit: a vector representing the distribution, or presence, of each group word of the group word set in the text unit. Its dimensionality equals the number of group words in the group word set, and each element indicates whether the corresponding group word is present in the text unit (a concrete example follows this list of terms).
6. Target word distribution vector of a text unit: a vector representing the distribution, or presence, of each target word of the target word set in the text unit. Its dimensionality equals the number of target words in the target word set, and each element indicates whether the corresponding target word is present in the text unit.
7. Word vector model: refers to a natural language processing model or tool that is trained on corpus data to map the words in the corpus data to word vectors; it may include, but is not limited to, models such as GloVe, Word2vec, fastText and LSA.
8. Classifier model: a function or model that maps data to one of a set of given categories, so that it can be applied to data prediction. Classifier model is a general term for sample classification methods in data mining and covers algorithms such as decision trees, logistic regression, naive Bayes and neural networks. Common classifier models may include, but are not limited to, the random forest classifier model, the XGBoost (eXtreme Gradient Boosting) model, the LightGBM (Light Gradient Boosting Machine) model, and the like.
9. Deep learning: a branch of machine learning concerned with methods that perform machine learning using deep artificial neural networks and related techniques.
10. Inverse probability weighting: inverse Probability Weighting (IPW) is a technique for correcting selection bias, and is widely used in the fields of machine learning, causal inference, and the like.
11. Hyper-parameter: in contrast to the learnable parameters of a machine learning model, hyper-parameters are parameters that must be set manually; they often need to be chosen empirically or according to their effect on a validation set.
12. Co-occurrence matrix and co-occurrence value: the co-occurrence value is a value indicating whether two words appear in one text unit of the corpus data at the same time, for example, the co-occurrence value of two words in a certain text unit can be represented by 1 and 0, that is, 1 indicates that the two words appear in the text unit at the same time, otherwise, the co-occurrence value is 0; the element of the co-occurrence matrix is the total number of co-occurrence times of every two words in the corpus data in each text unit, and the total number is equal to the sum of co-occurrence values of the two words in each text unit.
13. Weighted co-occurrence matrix and weighted co-occurrence value: the weighted co-occurrence value is the product of the co-occurrence value of two words in a text unit and the weight of the text unit; the weighted co-occurrence matrix is a matrix taking the sum of weighted co-occurrence values of every two words in the corpus data in each text unit as an element.
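To make terms 5 and 6 concrete, the snippet below builds the two distribution vectors for one segmented sentence; the group word set, target word set and sentence are illustrative examples, not taken from the patent.

```python
group_word_set = ["女性", "男性"]
target_word_set = ["称职", "优秀", "差"]
sentence_words = {"女性", "优秀"}        # words of one text unit after word segmentation

# Element j is 1 if the j-th word of the set occurs in the text unit, else 0.
group_dist_vec = [1 if w in sentence_words else 0 for w in group_word_set]    # -> [1, 0]
target_dist_vec = [1 if w in sentence_words else 0 for w in target_word_set]  # -> [0, 1, 0]
```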
Natural language refers to the languages people use daily; English, Chinese, Russian, French and Spanish, for example, are all natural languages. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It is a science closely related to linguistics that integrates linguistics, computer science and mathematics, and mainly studies theories and methods for effective human-computer interaction through natural language. That is, with natural language processing technology, a computer can directly recognize natural language provided by a user in the form of voice or text and respond accordingly, for example by performing a specific operation or giving a natural-language reply according to the user's intention. The technologies involved in natural language processing include, but are not limited to, semantic understanding, machine translation, machine question answering, knowledge graphs and sentiment analysis, and semantic understanding can also be used to construct search engine entries. Word vectors are a technique that maps words in human natural language to low-dimensional real-valued vectors, reflecting the representation of words and the relationships between words. For example, in a word vector space, the words closest to "man" are all male-related words, and the words closest to "woman" are all female-related words. Pre-trained word vectors are often used as the first layer of a deep neural network; compared with randomly initialized first-layer parameters, pre-trained word vectors have been shown to effectively improve model performance, and they are therefore widely applied across Natural Language Processing (NLP) tasks.
In the related art, word vectors are usually obtained by training on large-scale unlabeled text based on a Language Model task. The goal of the language model task is to predict the next word given a sequence of words. For example, for the sentence or word sequence "I eat __ in a restaurant," the language model's task is to predict the word with the highest probability of occurring at the blank, i.e., "a meal."
However, as mentioned above, in related word vector training or generation methods, because the original corpus data is natural human language, the obtained word vectors inevitably carry whatever cognitive biases toward certain groups, or toward certain things caused by cultural or regional differences, exist in the original corpus. Word vectors generated from such corpora therefore exhibit the corresponding group cognitive bias, or bias toward the corresponding things, when applied to downstream tasks, which affects the accuracy of the word vectors. For example, in pre-trained word vector spaces or word vector dictionaries of the related art, because of ability-level cognitive bias in the original corpus, the word vectors of words such as "title" and "excellent" may be closer to the word vector of "male" than to the word vector of "female". Such predictions, caused by group cognitive bias in the word vectors, are clearly unacceptable.
At present, methods for removing group cognitive bias in the related art mainly rely on manual inspection: for example, before training, sentences or paragraphs with cognitive bias in the original corpus data are manually inspected and labeled, and the sentences or paragraphs labeled as biased are then deleted so that the generated word vectors are free of bias factors. The consequence is that, for large-scale corpus data, manual labeling or screening is inefficient and inevitably introduces human error; moreover, simply deleting sentences that contain cognitive bias may cause information loss, so the word vectors generated in this way cannot truly and accurately reflect the meanings of the corresponding words in the corpus data and the relationships among the words.
In view of the cognitive bias present in related word vector training or generation methods and the resulting inaccuracy of the generated word vectors, the present disclosure provides a word vector generation method based on text unit weights, which can effectively remove group cognitive bias from word vectors and improve their accuracy with almost no loss of performance. The de-biased word vectors obtained with this technique can be applied effectively to downstream tasks. The basic idea of the word vector generation method according to some embodiments is as follows: first, identify the object of the cognitive bias (a certain group or a certain thing) and the information related to that object (such as content associated with the bias) in the text units of the corpus data; then, assign each text unit a corresponding weight according to the overall distribution of the two (i.e., the object and the related information) across the text units; construct training samples from the weighted text units for training a word vector model; and finally, obtain unbiased word vectors, i.e., word vectors free of cognitive bias, from the trained word vector model. Taking the removal of group cognitive bias as an example, the word vector generation method according to some embodiments first identifies the group words and target words in each text unit (e.g., sentence) of the corpus data; assigns each text unit a corresponding weight according to the overall distribution of the group words and target words across the text units (for example, whether group words and/or target words are present in each text unit); constructs training samples from the weighted text units for training a word vector model; and finally obtains unbiased, de-biased word vectors from the trained word vector model.
The distribution of the object of a cognitive bias and of the information related to that bias across text units (i.e., their presence, and in particular their joint presence within a single text unit) actually reflects the degree of association between the two. The root cause of differences in this degree of association is that human society holds a cognitive bias toward a certain group and/or a relatively inflated evaluation of another, relative group, and/or a cognitive bias toward something due to regional or cultural differences. For example, in gender cognitive bias, there is often a biased view of women's ability in some respect and/or a relative overestimation of men's ability in that respect. Therefore, in order to correct such cognitive bias and improve the accuracy of the word vectors, the abnormal degree of association between group words and target words can be adjusted by assigning weights to text units. Specifically, taking the removal of group cognitive bias as an example, sentences that reflect an association (i.e., co-occurrence) between a group word and a target word can be given different weights according to the degree of association.
For example, if the association between "woman" and "poor" is too strong, sentences in which the two co-occur can be given a lower weight to reduce that association and weaken the bias against women; and if the association between "man" and "poor" is too weak, sentences in which the two co-occur can be given a higher weight to increase that association, thereby weakening the inflated evaluation of the group (men) that is relative to the biased-against group (women), correspondingly weakening the bias against women and allowing groups of different genders to be treated fairly. As another example, in corpus data from northern China, "jellied tofu" as the object of a cognitive bias co-occurs frequently with "salty", i.e., the two are strongly associated and a cognitive bias exists; in that case, sentences in which the two co-occur can be given a lower weight to reduce the association and adjust the bias. Similarly, in corpus data from southern China, a higher weight can be given to sentences in which "jellied tofu" and "salty" co-occur in order to adjust the cognitive bias there.
FIG. 1 schematically illustrates an example implementation environment 100 for a word vector generation method according to some embodiments of the present disclosure. As shown in FIG. 1, the implementation environment 100 may include a corpus management server 110, a weight management server 120, a sample management server 130, and a word vector management server 140. As shown in FIG. 1, the implementation environment 100 may also include a network 150 and one or more terminal devices 160.
The corpus management server 110, the weight management server 120, the sample management server 130, and the word vector management server 140 may store and execute instructions that can perform the various methods described herein. Each of them may be a single server, a server cluster, or a cloud server; any two or more of them may be the same server or the same server cluster, or may be cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communications, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. It should be understood that the servers referred to herein are typically server computers having large amounts of memory and processor resources, but other embodiments are possible.
Examples of the network 150 include a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a combination of communication networks such as the Internet. Each of the corpus management server 110, the weight management server 120, the sample management server 130, the word vector management server 140, and the one or more terminal devices 160 may include at least one communication interface (not shown) capable of communicating over the network 150. Such a communication interface may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless (such as IEEE 802.11 Wireless LAN (WLAN)) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, and so on. Further examples of communication interfaces are described elsewhere herein.
As shown in FIG. 1, the terminal device 160 may be any type of mobile computing device, including, for example, mobile computers (e.g., Microsoft Surface® devices, Personal Digital Assistants (PDAs), laptop computers, notebook computers, tablet computers such as the Apple iPad™, netbooks, etc.), mobile phones (e.g., cellular phones, smartphones such as Microsoft Windows® phones, the Apple iPhone, phones running the Google Android™ operating system, Palm® devices, Blackberry® devices, etc.), wearable computing devices (e.g., smart watches, head-mounted devices including smart glasses such as Google Glass™, etc.), or other types of mobile devices. In some embodiments, the terminal device 160 may also be a stationary computing device, such as a desktop computer, a gaming console, or a smart television.
As shown in fig. 1, terminal device 160 may include a display screen and a terminal application that may interact with an end user via the display screen. The terminal device 160 may interact with, e.g., send data to or receive data from, one or more of the corpus management server 110, the weight management server 120, the sample management server 130, and the word vector management server 140, e.g., via the network 150. The terminal application may be a native application, a Web page (Web) application, or an applet (LiteApp, e.g., a cell phone applet, a WeChat applet) that is a lightweight application. In the case where the terminal application is a local application that needs to be installed, the terminal application may be installed in the user terminal 160. In the case where the terminal application is a Web application, the terminal application can be accessed through a browser. In the case that the terminal application is an applet, the terminal application may be directly opened on the user terminal 160 by searching related information of the terminal application (e.g., a name of the terminal application, etc.), scanning a graphic code of the terminal application (e.g., a barcode, a two-dimensional code, etc.), and the like, without installing the terminal application.
FIG. 2 illustrates an example interaction flow diagram of a method of word vector generation implemented in the example implementation environment 100 shown in FIG. 1, according to some embodiments of the present disclosure. The principles of operation of the word vector generation method in the implementation environment 100 according to some embodiments of the present disclosure are briefly described below with reference to an example interaction flow diagram shown in fig. 2.
As shown in fig. 2, the corpus management server 110 may be configured to obtain corpus data including at least two text units, each text unit including at least one word.
As shown in fig. 2, the weight management server 120 may be configured to determine the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data.
As shown in fig. 2, the sample management server 130 may be configured to determine sample data for training the word vector model according to the corpus data and the weight of each text unit in the corpus data.
As shown in fig. 2, the word vector management server 140 may be configured to train a word vector model using the sample data, and obtain a word vector of at least one word of at least one text unit in the corpus data from the trained word vector model.
As shown in FIG. 2, the word vector management server 140 may optionally be further configured to obtain or receive text to be processed from the terminal device 160, and then to look up, in a word vector library composed of the obtained word vectors, the word vectors corresponding to the words in the text to be processed. In general, in the word vector generation method according to some embodiments of the present disclosure, the corpus data is a large-scale unlabeled natural language text, so the word vectors obtained for the words in the corpus data through the text unit weighting described above constitute a ready-made word vector library or word vector dictionary. Therefore, in a specific downstream natural language processing application, the word vectors of the words in the text to be processed can be obtained by direct lookup and matching against the generated word vector library, and the obtained word vectors can then be used for semantic recognition, data classification, data screening, sentiment analysis and other processing of the text to be processed, so as to finally provide decision support or corresponding services for the user, for example data search, data push and intelligent question answering services for the terminal device 160 based on the processing results. Because, in the present disclosure, each text unit is given a corresponding weight according to the specific distribution of the group word information and the target word information within it, abnormal associations between a certain group or groups and certain target words (e.g., "poor", "weak") in the corpus data can be adaptively adjusted, whether the association is too strong (e.g., the group word "female" with "poor") or too weak (e.g., the group word "female" with "strong") due to subjective human cognitive bias. In this way, the word vectors generated from the weighted (i.e., bias-corrected by weighting) text units reduce or eliminate the bias factors that would otherwise be carried over from group cognitive bias, so that when such unbiased word vectors are applied to downstream natural language processing tasks or applications, they can objectively, fairly and accurately characterize the corresponding words in the corpus data and the relationships between them.
The example implementation environment and workflow of FIGS. 1 and 2 are merely illustrative, and the word vector generation method according to the present disclosure is not limited to the illustrated example implementation environment. It should be understood that although the servers 110-140 and the terminal device 160 are shown and described herein as separate structures, they may be different components of the same computing device. For example, the implementation environment of the word vector generation method according to some embodiments of the present disclosure may optionally include only the terminal device without involving any server; that is, when certain conditions are met, the terminal device 160 may also perform the steps described above for the servers 110-140. Optionally, the application scenario or implementation environment of the word vector generation method according to some embodiments of the present disclosure may also include only servers without involving the terminal device; that is, at least one of the servers 110-140 may also perform the steps described above for the terminal device 160. For example, the text to be processed may reside on one of the servers 110-140, so the step of obtaining the text to be processed can be performed directly on the server side, after which the step of looking up the word vectors of the text to be processed is completed automatically.
Fig. 3A schematically illustrates a flow diagram of a word vector generation method, in accordance with some embodiments of the present disclosure.
In some embodiments, the word vector generation method may be performed by a server (e.g., the servers 110-140 shown in FIGS. 1 and 2). In other embodiments, the word vector generation method may also be performed jointly by the servers 110-140 and the terminal device 160 shown in FIGS. 1 and 2. As shown in FIG. 3A, a word vector generation method according to some embodiments of the present disclosure may include steps S310-S340.
At step S310, corpus data including at least two text units is acquired. Wherein each text unit may contain at least one word.
The generation of word vectors requires the support of large-scale corpus data or corpora, because large-scale corpus data covers more words, and word vectors trained or generated from it can more accurately reflect the meanings of words and the relationships between words in the objective world. Therefore, in order to generate more word vectors of higher accuracy for building the word vector library or word vector dictionary required by downstream tasks, larger-scale corpus data (e.g., containing tens of millions of sentences or hundreds of millions of words) needs to be collected.
As for obtaining or collecting corpus data, available corpus data may be gathered from the Internet or other text carriers (e.g., newspapers, books, broadcasts), such as Baidu Baike or Wikipedia corpora, user posts on heavily visited websites or forums, or the corpora of various encyclopedias. Alternatively, the corpus data used to generate word vectors may be obtained by purchasing an off-the-shelf corpus. In short, when corpus data is collected, it should cover as many fields of human society and the physical world as possible, so that the resulting word vector library or word vector dictionary has a wider range of applications and more accurate word vectors. Of course, the corpus data may alternatively be collected according to a specific downstream task or application, so that word vectors for words in the field related to that task can be obtained in a more targeted way and can more accurately reflect the meanings of, and relationships among, the corresponding words in that field. For example, for a sentiment analysis application scenario, corpus data related to the human mental world and emotion can be collected.
Text units refer to units of language that constitute a corpus, such as sentences, paragraphs, and the like. In some embodiments, a unit of text may include, but is not limited to, one or more sentences, a portion of a sentence, or one or more paragraphs. Hereinafter, for the purpose of explanation, text units are generally described by taking sentences as an example, that is, one text unit is a sentence.
The following describes the structure of a text unit, taking the sentence as an example. A sentence is the basic unit of language use; it is composed of one or more words (including words and phrases) and expresses a complete meaning, such as telling someone about a matter, asking a question, making a request or a prohibition, expressing a feeling, or indicating the continuation or omission of a passage. The end of a sentence is marked by a period, question mark, ellipsis, or exclamation point. Therefore, the whole corpus data can be divided into a number of text units, i.e. sentences, according to the punctuation marks in the corpus data; that is, punctuation marks that mark the end of a sentence, such as periods, question marks, exclamation marks or ellipses, can be used as text unit separators.
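A sentence-level split on end-of-sentence punctuation might look like the following sketch; the punctuation set is an assumption covering both Chinese and Western marks.

```python
import re

def split_into_sentences(raw_text):
    """Split raw corpus text into sentence-level text units at sentence-ending punctuation."""
    parts = re.split(r"(?<=[。！？!?…])", raw_text)
    return [p.strip() for p in parts if p.strip()]
```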
Alternatively, the text unit may consist of only a part of a sentence, and thus commas, semicolons, etc. may be used as the segmenter of the text unit at this time. Further alternatively, the text unit may also be constituted by a natural paragraph, in which case its separator may be a line break.
It is noted that the word vector generation method according to some embodiments of the present disclosure may be used to generate word vectors in any language including, but not limited to, chinese, english, german, japanese, korean, french, russian, spanish, etc. Thus, the corpus data may include material in any one or more languages. In some embodiments of the present disclosure, a chinese language is mainly used as an example for explanation, and the word vector generation method of other languages is similar to that of the chinese language and is not described again.
At step S320, a weight of each text unit in the corpus data is determined according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data.
In some embodiments, the group word set and the target word set may be predetermined according to the specific task or purpose, relevant common knowledge, and the like. For example, when the goal is to remove gender cognitive bias, the group word set may be set to include "female", "male", "Mr.", etc., while the target word set may be set to include "title", "excellent", "good", "poor", etc.; when the goal is to remove regional cognitive bias, the group word set may include "city", "countryside", "mountain area", etc., and the target word set may include "good", "bad", "civilized", "wild", etc.
As described above, in some embodiments, the group cognitive bias addressed by embodiments of the present disclosure may involve only a single group. Alternatively, the word vector generation method according to embodiments of the present disclosure may be used to remove multiple group cognitive biases at the same time. In that case, the group word set and the target word set need to include the group words and target words for the different cognitive biases; for example, for gender cognitive bias together with regional cognitive bias, the group word set may include "female", "male", "Mr." as well as "city", "countryside", "mountain area", etc., and the target word set may include "title", "excellent", "good", "poor" as well as "good", "bad", "civilized", "wild", etc.
Based on the concept of the present disclosure, in order to remove or weaken the group cognitive bias factors in the corpus data, it is first necessary to know which text units in the corpus contain such bias, so that those text units can be processed appropriately to eliminate or weaken it. Through extensive research, the inventors found that the specific distribution, or presence, of the group-related information (such as group words) and the bias-related information (such as target words) within the text units can characterize the degree of association between the two, and thus reflects, to some extent, whether group cognitive bias exists.
In some embodiments, the "distribution" in step S320 may be understood as the existence of the group word or the target word in each text unit of the corpus data, for example, the text unit is exemplified by sentences, the distribution of the group word "female" refers to which sentences the "female" exists or appears, and which sentences the "female" does not exist or appears.
On one hand, the degree of association between a specific target word and a certain group word can be described as follows: first, obtain the number of times a that the target word and the group word exist or appear together (i.e., co-occur) in a text unit; second, obtain the number of times b that the target word co-occurs with the opposite group word; finally, obtain the degree of association between the target word and the group word by comparing a and b. For example, suppose a particular target word is "poor", a group word is "female", and the group word opposite to "female" is "male"; if "poor" and "female" co-occur in 5 text units, i.e., 5 times, while "poor" and "male" co-occur in only 1 text unit, i.e., 1 time, it can be concluded that "female" is associated with "poor" to a high degree and "male" is associated with "poor" to a low degree.
On the other hand, the degree of association between the target word and the group word can be described using the number of co-occurrences of the target word and the group word together with the numbers of occurrences of the target word and the group word individually, that is, by comparing a conditional probability with an unconditional probability. Suppose the corpus data contains 100 sentences in total, "poor" and "female" co-occur 5 times, "poor" occurs in 10 sentences, and "female" occurs in 20 sentences. The probability of "poor" occurring is then 10/100 = 0.1, while the probability of "poor" occurring given that "female" occurs is 5/20 = 0.25 > 0.1; that is, the conditional probability of "poor" given "female" is higher than the unconditional probability of "poor", so it can be concluded that the target word "poor" is associated with "female" to an excessively high degree, reflecting a cognitive bias against women. As another example, suppose the number of co-occurrences of "male" and "poor" is 1, "poor" still occurs in 10 sentences (so the probability of "poor" is still 0.1), and "male" occurs in 20 sentences; then the probability of "poor" occurring given that "male" occurs is 1/20 = 0.05 < 0.1, i.e., the conditional probability of "poor" given "male" is lower than the unconditional probability of "poor". It can therefore be determined that the degree of association of the target word "poor" with "male" may be too low, reflecting an excessively favorable evaluation of the male group, which is, from another perspective, a cognitive bias against the female group.
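To make the comparison above concrete, the following minimal Python sketch (with a hypothetical toy corpus; none of the names come from the disclosure) estimates the unconditional probability of a target word and its conditional probability given a group word by simple counting:

# Toy corpus: each text unit is a list of words (hypothetical example data).
corpus = [
    ["female", "poor"], ["female", "poor"], ["female", "good"],
    ["male", "good"], ["male", "excellent"], ["poor"], ["good"],
]

def occurrence_prob(word, corpus):
    # Unconditional probability: fraction of text units containing the word.
    return sum(word in unit for unit in corpus) / len(corpus)

def conditional_prob(target, group, corpus):
    # Conditional probability of the target word given that the group word occurs.
    with_group = [unit for unit in corpus if group in unit]
    return sum(target in unit for unit in with_group) / len(with_group) if with_group else 0.0

p_poor = occurrence_prob("poor", corpus)                      # unconditional probability
p_poor_given_female = conditional_prob("poor", "female", corpus)
p_poor_given_male = conditional_prob("poor", "male", corpus)
# If p_poor_given_female > p_poor, "poor" is over-associated with the group word "female".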
The source of such differences in the degree of association is the cognitive bias against women and/or the relative overestimation of men in human society. Thus, to correct such cognitive bias, the abnormal degree of association between the group word and the target word may be adjusted by giving the text unit a defined weight. The weight of a text unit refers to a quantity set for the text unit to characterize its importance in the corpus data; it may be represented, for example, by a positive real number and may be used to adjust the cognitive bias present in the corpus data. Specifically, sentences reflecting the association (i.e., co-occurrence) between a group word and a target word may be given different weights according to the degree of association. For example, if the degree of association between "female" and "poor" is too high, the sentences in which "female" and "poor" co-occur may be given lower weights to reduce the degree of association and weaken the cognitive bias against women; if the degree of association between "male" and "poor" is too low, the sentences in which "male" and "poor" co-occur may be given higher weights to increase the degree of association, thereby weakening the excessively favorable evaluation of the group "male" opposite to the biased group "female", correspondingly weakening the cognitive bias against women and treating groups of different genders fairly. In this context, the "weight" of a text unit indicates the importance of the corresponding text unit in the entire corpus data; for example, a text unit with a high weight is more important than a text unit with a low weight when forming training samples from the corpus data. The weights of the text units play a key role in constructing training sample data based on the corpus data, and group cognitive bias in the corpus data is removed by taking these weights into account.
In some embodiments, weights may be formed according to an inverse probability weighting method to adjust the excessively high or excessively low degree of association between the group word and the target word in the text unit; for a detailed analysis, see the description of the embodiment shown in fig. 5 below.
At step S330, sample data for training the word vector model is determined according to the corpus data and the weight of each text unit in the corpus data.
In some embodiments, the word vector model refers to a natural language processing model that maps each word in the corpus data to a word vector by training on the corpus data. The word vector model may include, but is not limited to, the GloVe (Global Vectors for Word Representation), Word2vec (Word to Vector), and fastText models. The data used in word vector model training is called sample data, and the sample data depends on the specific word vector model. For example, the sample data of the Word2vec or fastText model may be the text units or sentences in the corpus data (or word sequences in which each word is represented by a one-hot vector); the sample data of the GloVe model may be the co-occurrence matrix of the corpus data, that is, a matrix whose elements are the total numbers of co-occurrences of every two words in the text units. Optionally, the GloVe model may also be trained by taking each text unit of the corpus data as sample data, but the sample data then needs to be processed into a co-occurrence matrix before the training process formally starts.
The inputs and outputs of the word vector model are typically tied to the specific word vector model; for example, the input of the GloVe model is a co-occurrence matrix, while the inputs of Word2vec and fastText are sequences of text units (or sentences) in the corpus, or the randomly initialized word vector representations of the words contained therein.
In some embodiments, for a general word vector model (e.g., Word2vec or fastText) that uses text units as training sample data, the sample data used for training may be a portion of the corpus extracted from the corpus data by random sampling. Each text unit may be given a corresponding sampling probability according to its weight, so that high-weight text units are sampled more often, strengthening the relatively weak degree of association between the corresponding group words and target words in the sample, while lower-weight text units are sampled less often, weakening the relatively strong degree of association between the corresponding group words and target words, thereby weakening or even eliminating the cognitive bias.
In some embodiments, for a word vector model whose processing object is data in a specific form (e.g., GloVe or LSA (Latent Semantic Analysis)), specific sample data may be constructed using the weights of the text units. The processing object of the GloVe and LSA models is usually a co-occurrence matrix (i.e., a matrix whose elements are the total numbers of co-occurrences of every two words of the corpus data in the text units). Therefore, when constructing sample data using the weights, in order to express the weights of the text units (i.e., to remove cognitive bias), the weight factors of the text units can be merged into the elements of the co-occurrence matrix (the co-occurrence counts of every two words in the corpus data) to obtain a weighted co-occurrence matrix, and the weighted co-occurrence matrix is used as the sample data. For a detailed description, refer to the detailed description of the embodiment shown in fig. 5 below.
In step S340, a word vector model is trained using the sample data, and a word vector of at least one word of at least one text unit in the corpus data is obtained from the trained word vector model.
In some embodiments, after the sample data is obtained, a word vector model for mapping words into word vectors may be trained directly on the sample data to obtain a trained word vector model, from which the word vectors corresponding to some or all of the words in the corpus data are obtained; that is, the word vector of any one or more words in the corpus data may be obtained from the trained word vector model. The word vector model may be, for example, a GloVe model, a Word2vec model, or a fastText model.
How to train a common word vector model with sample data to obtain an unbiased word vector in some embodiments of the present disclosure is briefly described below.
The GloVe model is a word representation tool based on global word-frequency statistics, which represents a word as a vector of real numbers; the vector captures semantic characteristics such as similarity and analogy between words. The semantic similarity between two words can be computed by operations on their vectors, such as Euclidean distance or cosine similarity. In the related art, the GloVe model training process repeats the following steps in a loop to obtain a trained model and the word vectors:
firstly, randomly collecting a batch of nonzero word pairs from a co-occurrence matrix to serve as batch training data;
secondly, randomly initializing word vectors of the training data and randomly initializing two biases;
then, the objective function (or loss function) J shown in the following formula is optimized by gradient descent, and the word vectors and the two biases are updated by back propagation:
J = Σ_{i,j=1..V} f(X_ij) * (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2

wherein

f(X_ij) = (X_ij / x_max)^α, if X_ij < x_max; f(X_ij) = 1, otherwise,

X_ij represents an element of the co-occurrence matrix X, i.e., the number of times the i-th word and the j-th word appear together in one text unit over the entire corpus data; V is the total number of words in the corpus data; w_i and w̃_j are the word vectors to be optimized for the two words; b_i and b̃_j are the two biases to be optimized; f is a weighting function and x_max is a hyper-parameter.
In some embodiments of the present disclosure, for the GloVe model, the weighted co-occurrence matrix incorporating the text unit weights obtained in step S330 may be used as the sample data before training, and the GloVe model is then trained using the weighted co-occurrence matrix to obtain the required unbiased word vectors.
Specifically, in some embodiments of the present disclosure, for the GloVe word vector model, training the word vector model with the sample data (such as the weighted co-occurrence matrix) consists of looping through the following steps: first, randomly collect a batch of nonzero word pairs from the weighted co-occurrence matrix as batch training data; then, randomly initialize the word vectors of the training data and randomly initialize the two biases; finally, optimize the objective function (such as the loss function J described above) with an optimization algorithm such as gradient descent, and update the word vectors and the two biases by back propagation. Through this training process, a trained word vector model is obtained, together with an unbiased word vector for each word in the corpus data.
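For illustration only, the following Python sketch (using NumPy, with a tiny hypothetical weighted co-occurrence matrix and commonly used GloVe hyper-parameter values) shows one gradient-descent pass over the nonzero word pairs of a GloVe-style loss; it is a minimal sketch of the loop described above, not the disclosed implementation.

import numpy as np

V, dim = 1000, 50                         # vocabulary size and vector dimension (assumed)
A = np.zeros((V, V)); A[0, 1] = 2.5       # hypothetical weighted co-occurrence matrix
x_max, alpha, lr = 100.0, 0.75, 0.05      # commonly used hyper-parameter values

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, dim))          # word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(V, dim))    # context word vectors
b = np.zeros(V); b_tilde = np.zeros(V)            # the two biases

def f(x):
    # GloVe weighting function with hyper-parameter x_max.
    return min((x / x_max) ** alpha, 1.0)

for i, j in zip(*np.nonzero(A)):                  # one pass over nonzero word pairs
    diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(A[i, j])
    g = f(A[i, j]) * diff                         # gradient factor (constant 2 absorbed in lr)
    grad_wi, grad_wj = g * W_tilde[j], g * W[i]
    W[i] -= lr * grad_wi
    W_tilde[j] -= lr * grad_wj
    b[i] -= lr * g
    b_tilde[j] -= lr * g
# After enough passes, the rows of W serve as the (unbiased) word vectors.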
LSA (Latent Semantic Analysis) is a relatively early count-based word vector representation tool. It is also based on a co-occurrence matrix, but it uses a matrix decomposition technique based on Singular Value Decomposition (SVD) to reduce the dimensionality of the large matrix. Therefore, in some embodiments of the present disclosure, if the LSA model is used, the co-occurrence matrix may be replaced by the weighted co-occurrence matrix, and the same singular value decomposition and related operations are then performed to train the word vector model and obtain unbiased word vectors.
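As a minimal sketch under stated assumptions (a small hypothetical weighted co-occurrence matrix and an arbitrarily chosen target dimension), an LSA-style variant could obtain word vectors by truncated singular value decomposition of the weighted co-occurrence matrix:

import numpy as np

def lsa_word_vectors(weighted_cooc, dim=2):
    # Truncated SVD of the weighted co-occurrence matrix; each row of the
    # returned matrix serves as the word vector of the corresponding word.
    U, S, _ = np.linalg.svd(weighted_cooc, full_matrices=False)
    return U[:, :dim] * S[:dim]

A = np.array([[0.0, 1.8, 0.9],
              [1.8, 0.0, 0.4],
              [0.9, 0.4, 0.0]])            # hypothetical 3-word weighted co-occurrence matrix
vectors = lsa_word_vectors(A, dim=2)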
Word2vec is a word vector mapping tool that maps one-hot word vectors into distributed word vectors using a shallow neural network. The sample data used in the training process of Word2vec may therefore be the text units of the corpus data (with each word replaced by its one-hot word vector). Accordingly, sample data can be constructed following the approach described in step S330 for general word vector models that take text units as samples: text units are sampled according to the sampling probabilities assigned to them by their weights, the words in each sampled text unit are converted into one-hot word vectors, and the word vector model is then trained with these processed (word-vectorized) text units to obtain the distributed word vectors of the words in the corpus data. Because the sample data is generated from text units that have undergone weighting processing, the finally trained word vectors eliminate the group cognitive bias factor in the original corpus data, and thus fairly, objectively, truly and accurately reflect the meanings of the corresponding words and the relationships among them.
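The following sketch assumes a hypothetical weighted corpus and the gensim library (version 4 or later); the disclosure does not prescribe a particular toolkit. It resamples text units in proportion to their weights and trains a Word2vec model on the resampled corpus:

import numpy as np
from gensim.models import Word2Vec

# Hypothetical weighted corpus: (segmented text unit, weight) pairs.
weighted_corpus = [
    (["female", "poor", "performance"], 0.4),
    (["male", "poor", "performance"], 2.0),
    (["female", "excellent", "work"], 1.0),
]

units = [u for u, _ in weighted_corpus]
weights = np.array([w for _, w in weighted_corpus], dtype=float)
probs = weights / weights.sum()                  # sampling probabilities from weights

rng = np.random.default_rng(0)
idx = rng.choice(len(units), size=10000, replace=True, p=probs)
samples = [units[i] for i in idx]                # resampled training sentences

model = Word2Vec(sentences=samples, vector_size=100, window=5, min_count=1, sg=1)
vector = model.wv["female"]                      # word vector with weakened bias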
The FastText model is similar to Word2vec and the training process is not described in detail.
In the word vector generation method according to some embodiments of the present disclosure, each text unit is given a corresponding weight according to the distribution, in the text units, of the group words and target words related to group cognitive bias in the corpus data, so that the (too high or too low) degree of association between group words and target words in the text units can be adaptively adjusted and the degree of association between target words and group words in the whole corpus data no longer shows obvious specificity. Since such specificity of the degree of association reflects the existence of group cognitive bias, the word vector generation method according to some embodiments of the present disclosure can reduce or eliminate group cognitive bias in the corpus data by reducing this specificity. Furthermore, through the weighting of text units, cognitive biases caused by regional differences, cultural differences and the like in the corpus data can also be corrected, so that the accuracy of the word vectors is significantly improved. Therefore, after the word vector model is trained with sample data constructed on the basis of the weighted text units, the group cognitive bias factor of the original corpus data is significantly weakened or even completely eliminated in the obtained word vectors, which removes group cognitive bias problems that might otherwise arise in downstream tasks and avoids the inaccurate and unfair prediction results caused by using word vectors carrying, for example, gender cognitive bias. In services or tasks related to natural language processing, such unbiased word vectors can more objectively and truly reflect the meanings of words in natural language and the relationships among words in human society and the physical world, so that different groups are treated more fairly.
Furthermore, the word vector generation method according to the embodiments of the disclosure only performs simple weighting on each text unit in the corpus data and builds improved sample data based on the weighted corpus data; these operations can be completed automatically by a computing device without any additional manual operation or corpus labeling. Compared with the manual labeling approaches used in the related art to remove group cognitive bias, the method therefore simplifies the workflow, significantly improves work efficiency, and reduces the labor cost of removing group cognitive bias. On the other hand, the weighting processing and the sample data construction are all completed before the training of the word vector model starts, so no additional performance loss is introduced into the training process of the word vector model. In addition, the calculations involved in the weighting processing and the sample data construction (for example, simple elementary algebraic operations) are not complicated, so the computational overhead is small and the overall computing performance and data processing efficiency are high.
Fig. 3B schematically illustrates a flow diagram of a method of word vector generation, according to some embodiments of the present disclosure. As shown in FIG. 3B, steps S310-S340 are identical to those in FIG. 3A, and optional steps S350-S370 are additionally added in FIG. 3B.
As shown in fig. 3B, before step S320, the word vector generation method according to some embodiments of the present disclosure may further include the steps of:
s350, performing word segmentation processing on each text unit in the corpus data to obtain each word in the corpus data.
In some embodiments, after the corpus data is obtained, optionally, the corpus data needs to be subjected to word segmentation, that is, text units in the corpus data are divided into word sequences, so as to facilitate subsequent further processing on words.
Word segmentation is the basis of natural language processing, and its accuracy directly determines the quality of subsequent part-of-speech tagging, syntactic analysis, word vectors, and text analysis. English sentences use spaces to separate words, so word segmentation usually does not need to be considered except for certain specific words. Chinese is completely different: it naturally lacks separators, so an additional segmentation step is required to obtain the individual words when performing natural language processing. Therefore, when processing Chinese natural language, word segmentation must be performed first so that each word can then be vectorized.
As for the specific Chinese word segmentation algorithm, a dictionary-based rule matching method or a statistics-based machine learning method may be employed. More specifically, the word segmentation in step S350 may be implemented, for example, using the jieba word segmentation tool. jieba is an open-source word segmentation engine that combines string-matching-based algorithms with statistics-based algorithms.
Through the word segmentation processing shown in step S350, effective division of the words in, for example, a Chinese corpus sentence can be achieved, laying the foundation for subsequent processing of the words.
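As a brief illustration (the example sentence is hypothetical, and the choice of the jieba tool is only one of the options mentioned above), segmenting a Chinese text unit might look like:

import jieba

sentence = "她是一位优秀的工程师"      # hypothetical Chinese text unit
words = jieba.lcut(sentence)          # e.g. ['她', '是', '一位', '优秀', '的', '工程师']
print(words)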
As shown in fig. 3B, the word vector generation method according to some embodiments of the present disclosure may further include the steps of:
s360, obtaining the text to be processed, and
And S370, searching a word vector corresponding to the word in the text to be processed from a word vector library formed by the obtained word vectors.
As described in the above inventive concept, the word vector generation method according to some embodiments of the present disclosure, trained on large-scale corpus data with text unit weighting, produces for the words in the corpus data a ready-made word vector library or word vector dictionary, which may be referred to as a pre-trained word vector library. It can be used in specific downstream tasks, for example as the initial word vectors for deep neural network model training, which not only effectively simplifies the training process and improves model performance, but also directly weakens or even eliminates group cognitive bias and improves the accuracy of the word vectors.
Specifically, before a specific downstream natural language processing application, the word vectors of the words in the text to be processed can be obtained directly by lookup and matching in the generated pre-trained word vector library, and the obtained word vectors are then used for semantic recognition, data classification, data screening, sentiment analysis or other processing of the text to be processed, finally providing decision bases or corresponding services for users, such as data search, data push, and intelligent question answering. In some embodiments, as shown in steps S350-S370 and fig. 2, after the pre-trained word vector library is generated, the word vector management server 140 may obtain the text to be processed from the terminal device 160, and then look up the word vector of each word in the text to be processed from the pre-trained word vector library by dictionary matching.
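For illustration only (the library contents, the zero-vector fallback for unknown words, and the assumption that the text has already been segmented are hypothetical choices), looking up pre-trained word vectors for a text to be processed could be sketched as:

import numpy as np

# Hypothetical pre-trained word vector library: word -> unbiased word vector.
word_vector_library = {
    "excellent": np.array([0.12, -0.30, 0.50]),
    "engineer":  np.array([0.70, 0.10, -0.20]),
}

def lookup_vectors(words, library, dim=3):
    # Dictionary matching; unknown words fall back to a zero vector here.
    return [library.get(w, np.zeros(dim)) for w in words]

text_to_process = ["excellent", "engineer"]          # already-segmented words
vectors = lookup_vectors(text_to_process, word_vector_library)
# These vectors can then initialize, e.g., the first layer of a downstream classifier.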
FIG. 4 illustrates an example interface diagram of an example application scenario for a word vector generation method in accordance with some embodiments of the present disclosure.
The word vector generation method according to the embodiments of the present disclosure is not limited to a specific product or application scenario; for example, the generated word vectors may be further used for processing such as semantic recognition, data classification, data screening, or sentiment analysis. For illustration purposes, as shown in fig. 4, the application scenario is described in detail by taking "opinion feedback" as an example. Consider the feedback classification or screening task of a mobile phone manager application, i.e., classifying user feedback into useful feedback and useless feedback. For this task, a machine learning model based on a deep neural network is generally used to screen or classify the feedback information, and pre-trained word vectors are often used as the initialization parameters of the first layer of the model. If a related-art word vector generation method is used, the obtained pre-trained word vectors may contain gender cognitive bias factors because of the gender cognitive bias present in the original corpus data. As a result, the deep neural network model may more easily classify feedback text containing female features or related to women as "useless feedback" because of the bias factor carried by the pre-trained word vectors used as initialization parameters. In the long run, this may make the feedback of female users less likely to be seen and addressed than that of male users, reducing the product experience of the female user group.
In the word vector generation method according to the embodiments of the present disclosure, since the word vectors obtained through the text unit weighting process eliminate the cognitive bias factor to some extent, they are very suitable for constructing the pre-trained word vectors used as initialization parameters for the opinion feedback task.
In an "opinion feedback" application scenario, a software (e.g., cell phone housekeeping software) operator first sends a request to collect opinions or suggestions to a terminal device (e.g., cell phone) 150 via, for example, the word vector management server 140, whereupon a "feedback opinion" input interface as shown in fig. 4 may be displayed on the terminal device. As shown in fig. 4, the input interface includes six parts: header section 401, i.e., "submit feedback"; a prompt section 402 for explanation; a feedback information input section 403 for inputting opinions or suggestions; a contact information input section 404; a picture input section 405 for submitting a picture; and a "submit" button 406 for confirming the submission feedback. Subsequently, similar to steps S360-370 shown in fig. 3B, the word vector management server 140 may receive, from the terminal device 160, the feedback information input by the user through the feedback information input part 403 of the input interface shown in fig. 4 as the text to be processed; the word vector management server 140 obtains word vectors corresponding to all words in the text to be processed (i.e., the feedback information) by performing matching search in the word vector library formed by the unbiased word vectors generated in steps S310-S340 shown in fig. 3A. The word vector obtained in this way can be used as an initialization word vector for further screening and classifying the feedback information, for example, in the process of realizing the screening or classification of the feedback information by a machine learning model based on a deep neural network, such a pre-trained word vector can be used as an initialization parameter of the first layer of the model for training, so that the screened useful feedback is real, effective, objective and fair, and has no subjective cognitive deviation, thereby being beneficial to solving the problem of pertinence feedback solving and improving the experience of users (especially female users).
Fig. 5 schematically illustrates an example process of step S320 in the word vector generation method shown in fig. 3A according to some embodiments of the present disclosure.
First, a principle of eliminating the cognitive bias based on the inverse probability weighting method will be described. In general, the task of a language model is to fit the following probabilities:
P(X)=P(Zx)*P(Tx|Zx)*P(X'|Zx,Tx)      (1)
wherein X represents a sentence, Zx represents the group-related information (e.g., group words) contained in the sentence, Tx represents the information related to the content of the cognitive bias (e.g., target words) contained in the text unit, and X' represents the information in the sentence other than Zx and Tx. As described above, group cognitive bias is manifested by the specificity (higher or lower) of the degree of association between Tx and Zx, i.e., P(Tx|Zx) ≠ P(Tx).
On the other hand, suppose there is no group cognitive bias in the corpus data; then the degree of association between Tx and Zx has no specificity, i.e., P(Tx|Zx) = P(Tx), and the probability to fit is:
P(X)=P(Zx)*P(Tx)*P(X'|Zx,Tx)      (2)。
Comparing formulas (1) and (2), it can be seen that to remove the cognitive bias from corpus data carrying group cognitive bias, the specific degree of association between the group words and the target words needs to be adjusted, i.e., P(Tx|Zx) needs to be changed to P(Tx). Thus, a weight w = P(Tx)/P(Tx|Zx) may be set for each sentence X in the corpus data with cognitive bias; the sentence X then becomes a weighted sentence Y, so formula (1) becomes:
P(Y)=w*P(Zx)*P(Tx|Zx)*P(X'|Zx,Tx)
=(P(Tx)/P(Tx|Zx))*P(Zx)*P(Tx|Zx)*P(X'|Zx,Tx)
=P(Zx)*P(Tx)*P(X'|Zx,Tx) 。
the above-mentioned weighting method is called inverse probability weighting, and adjusts the specific association between the group word represented by the conditional probability and the target word by using the product of the inverse of the conditional probability and the unconditional probability as a weight. Therefore, by using the above inverse probability weighting algorithm, the specific association degree (too large or too small) between the population information and the cognitive deviation content information can be fundamentally adapted, so that the population cognitive deviation phenomenon is perfectly eliminated.
As shown in fig. 5, the step S320 shown in fig. 3A-determining the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data may include the following steps S510 to S540.
At step S510, a group word distribution vector and a target word distribution vector for each text unit in the corpus data are determined. The group word distribution vector is used for representing a first distribution condition of each group word in the group word set in the text unit, each element in the group word distribution vector is used for representing whether a corresponding group word in the group word set exists in the text unit, the target word distribution vector is used for representing a second distribution condition of each target word in the target word set in the text unit, and each element in the target word distribution vector is used for representing whether a corresponding target word in the target word set exists in the text unit.
In some embodiments, the distribution of the group words and the target words in the text units of the corpus data may be described using notions from probability theory. In order to consider as a whole the distribution (i.e., presence) of all group words of the group word set in a text unit, a discrete random vector Z = (Z1, Z2, …, Zn) may be constructed, where Zi (i = 1, 2, …, n) is a random variable representing the distribution of the i-th group word of the group word set in the text unit, and n is the number of group words in the group word set. Each random variable Zi corresponds one-to-one to a group word in the group word set, and its possible values are 1 and 0, indicating whether the group word is present in a given text unit: Zi = 1 if the corresponding group word is present, and Zi = 0 if it is not. The range of the random vector Z consists of the combinations of the values of its components (i.e., Z1, …, Zn). For example, if the group word set includes two group words, then Z = (Z1, Z2) and the possible values of Z are (0, 1), (1, 0), (1, 1) and (0, 0); in this case, the distribution of the group words in each text unit of the corpus data can be represented by one of these four vectors.
Similarly, the distribution of all target words of the target word set in a text unit can be described by constructing a random vector T = (T1, T2, …, Tm), where Ti (i = 1, 2, …, m) is the random variable representing the distribution of the i-th target word of the target word set in the text unit, and m is the number of target words in the target word set. Each random variable Ti corresponds one-to-one to a target word in the target word set, and its possible values are 1 and 0, indicating whether the target word is present in a given text unit: Ti = 1 if it is present, and Ti = 0 if it is not. In this way, the distribution of the group words and the target words in each text unit of the corpus data can be represented by the specific values taken by the random vectors Z and T in that text unit, i.e., for a given text unit, the specific distributions of the group words and the target words can be represented as an n-dimensional vector and an m-dimensional vector, respectively.
For convenience of calculation, for each text unit, the specific distribution (i.e., the first distribution condition) of all group words of the group word set in the text unit may be represented by the vector formed by the specific value taken by the random vector Z in the text unit; this vector representing the specific distribution of the group word set in a text unit (i.e., the specific value of the random vector Z in the corresponding text unit) may be defined as the group word distribution vector, denoted Z0. For example, if the random vector Z takes the value Z0 = (1, 0, …, 0, 1) in a text unit, then this vector Z0 = (1, 0, …, 0, 1) is the group word distribution vector of the text unit, and the first distribution condition it represents is that the first group word and the n-th group word are present in the text unit while the other group words are not. Similarly, for each text unit, a target word distribution vector, denoted T0, reflecting the specific distribution (i.e., the second distribution condition) of all target words of the target word set in the text unit may be defined as the vector formed by the specific value taken by the random vector T in the text unit.
How to determine the group word distribution vector and the target word distribution vector for each text unit is described below using the removal of gender cognitive bias as an example. First, suppose the predetermined group word set and target word set are as shown in Table 1: the group word set includes "male" and "female", and the target word set includes "good", "poor", "strong" and "weak". Second, whether each group word and target word of Table 1 is present in each text unit can be found by dictionary matching, so that the group word distribution vector and the target word distribution vector can be determined from the presence information. For example, if a certain text unit contains "female" and "poor" but none of the other group words and target words, then the group word distribution vector of the text unit is Z0 = (0, 1) and the target word distribution vector is T0 = (0, 1, 0, 0). In this way, the group word distribution vector and the target word distribution vector can be obtained for each text unit in the corpus data, thereby obtaining the distribution (presence) of the group word set and the target word set in each text unit.
TABLE 1 - Example of a group word set and a target word set

Group word set:  male, female
Target word set: good, poor, strong, weak
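For illustration (using the toy word sets of Table 1; the dictionary-matching details are an assumption for demonstration), the group word distribution vector and the target word distribution vector of a text unit could be computed as follows:

group_words = ["male", "female"]                    # group word set of Table 1
target_words = ["good", "poor", "strong", "weak"]   # target word set of Table 1

def distribution_vectors(text_unit_words, group_words, target_words):
    # Each element is 1 if the corresponding word is present in the text unit, else 0.
    unit = set(text_unit_words)
    z0 = tuple(1 if w in unit else 0 for w in group_words)
    t0 = tuple(1 if w in unit else 0 for w in target_words)
    return z0, t0

z0, t0 = distribution_vectors(["female", "poor", "performance"], group_words, target_words)
# z0 == (0, 1) and t0 == (0, 1, 0, 0), as in the example above.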
At step S520, a first probability is determined for each text unit according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data. Wherein, for each text unit in the corpus data, the first probability represents the probability of the occurrence of the second distribution condition characterized by the target word distribution vector of the text unit if the first distribution condition characterized by the group word distribution vector of the text unit occurs.
In some embodiments, the weight of a text unit may be calculated according to the inverse probability weighting method described above. Specifically, for each text unit, the weight is calculated from the following two probabilities: P(T=T0) and P(T=T0|Z=Z0), where T and Z are the random vectors describing the distribution of the target word set and the group word set in the text unit (i.e., whether they occur in the text unit), and T0 and Z0 are the target word distribution vector and the group word distribution vector of the corresponding text unit. Thus P(T=T0|Z=Z0) is the conditional probability that T = T0 given Z = Z0, i.e., the conditional probability that the second distribution condition characterized by the target word distribution vector T0 occurs given that the first distribution condition characterized by the group word distribution vector Z0 occurs; for illustrative purposes, this conditional probability is referred to as the first probability. Likewise, P(T=T0) is the (unconditional) probability that T = T0, i.e., the probability that the second distribution condition characterized by the target word distribution vector T0 occurs, which may be referred to as the second probability.
As described in step S520, the first probability may be determined from both the group word distribution vector and the target word distribution vector of each text unit determined at step S510. In some embodiments, for each text unit, based on its group word distribution vector Z0 and target word distribution vector T0, the number of text units in which Z0 and T0 occur simultaneously and the number of text units in which Z0 occurs can be obtained directly by counting; the former is then divided by the latter (i.e., the frequency of simultaneous occurrence of the group word and target word distribution vectors divided by the frequency of the group word distribution vector) to obtain the first probability. That is, according to the definition of probability or of conditional probability in probability theory,
P(T=T0|Z=Z0)=P(T=T0, Z=Z0)/P(Z=Z0)      (3)
It should be noted that when the group word set and the target word set contain more elements, the dimensions of the group word distribution vector and the target word distribution vector are correspondingly higher, which causes the number of possible specific distributions of the group words and the target words in a text unit to grow exponentially. For example, if the group word set includes 100 group words, i.e., the dimension of the group word distribution vector is 100, then the number of possible values of the random vector Z, i.e., the number of possible group word distribution vectors, is 2^100. Therefore, when the group word set and the target word set contain many elements, the frequency-counting method may lead to a significant increase in the amount of calculation.
In some embodiments, to overcome the above problem, a classifier model (e.g., a random forest classifier model) may be trained, based on the group word distribution vector and the target word distribution vector of each text unit in the corpus data, to fit the conditional probability distribution (or conditional distribution law) of the random vector T given Z = Z0, so that P(T=T0|Z=Z0) can be derived from this conditional distribution law. For the specific manner of fitting the classifier model, please refer to the embodiment shown in fig. 7A.
At step S530, a second probability is determined for each text unit according to the target word distribution vector of each text unit in the corpus data. Wherein for each text unit in the corpus data, the second probability represents a probability that a second distribution condition characterized by the target word distribution vector for the text unit occurs.
In some embodiments, the distribution law of the random vector T may be calculated by utilizing the independence between the elements in the random vector T, i.e., the random variables as its components. The detailed process is shown in the flowchart of fig. 8.
Alternatively, in some embodiments, the second probability may be obtained by directly using the target word distribution vector of each text unit and treating the random vector T as a whole (without considering its components), in a manner similar to the determination of the first probability described above. For example, for each text unit, the number Nt of text units in which the second distribution condition characterized by the target word distribution vector T0 of the text unit occurs can be obtained by direct counting, together with the total number Na of text units in the whole corpus data, so that the second probability is obtained according to probability theory as:
P(T=T0)=Nt/Na      (4)
at step S540, a weight of each text unit in the corpus data is determined according to the first probability and the second probability.
In some embodiments, after the first probability P(T=T0|Z=Z0) and the second probability P(T=T0) are obtained for each text unit through the above steps, the weight w of the text unit can be obtained, using the inverse probability weighting method described above, as the ratio of the second probability to the first probability, that is:
w=P(T=T0)/P(T=T0|Z=Z0)      (5)
Through the text unit weights obtained in the inverse probability weighting manner of formula (5), the specific degree of association between target words and group words in each text unit caused by group cognitive bias is corrected to a normal degree of association, so that the group cognitive bias in the corpus data is fundamentally removed, the group cognitive bias factor in the word vectors generated based on these weights and the corpus data is eliminated, and the accuracy of the word vectors is significantly enhanced.
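As a minimal counting-based sketch (toy (Z0, T0) pairs; the classifier-based estimation of the first probability described with fig. 7A below is not shown here), the first probability of formula (3), the second probability of formula (4), and the weight of formula (5) could be estimated as follows:

# Toy corpus represented by one (Z0, T0) pair per text unit (hypothetical data).
pairs = [
    ((0, 1), (0, 1, 0, 0)),   # e.g. a unit containing "female" and "poor"
    ((0, 1), (0, 1, 0, 0)),
    ((0, 1), (1, 0, 0, 0)),
    ((1, 0), (0, 1, 0, 0)),
    ((1, 0), (1, 0, 0, 0)),
]

def unit_weight(z0, t0, pairs):
    n_total = len(pairs)
    n_t = sum(1 for _, t in pairs if t == t0)                 # units where T0 occurs
    n_z = sum(1 for z, _ in pairs if z == z0)                 # units where Z0 occurs
    n_zt = sum(1 for z, t in pairs if z == z0 and t == t0)    # units where both occur
    second = n_t / n_total                                    # P(T=T0), formula (4)
    first = n_zt / n_z                                        # P(T=T0|Z=Z0), formula (3)
    return second / first                                     # w, formula (5)

w = unit_weight((0, 1), (0, 1, 0, 0), pairs)   # weight of a unit with Z0=(0,1), T0=(0,1,0,0)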
Fig. 6A schematically illustrates an example process of step S330 illustrated in fig. 3A.
As described above, in the related art, for some word vector models (such as GloVe and LSA) the sample data used for training is a co-occurrence matrix whose elements are the numbers of co-occurrences in the text units of every two words in the corpus data. In some embodiments, for such word vector models, when sample data is constructed using the weights, in order to remove the cognitive bias and embody the weights of the text units, the text unit weights may be merged into the elements of the co-occurrence matrix (i.e., the co-occurrence counts of every two words in the corpus data) to obtain a weighted co-occurrence matrix, and the weighted co-occurrence matrix is used as the sample data.
Referring first to fig. 6A, the step S330 shown in fig. 3A of determining sample data for training a word vector model according to corpus data and a weight of each text unit in the corpus data may include the following steps S610-630.
First, at step S610, a co-occurrence value of every two words in the corpus data in each text unit is determined, the co-occurrence value indicating whether the two words are simultaneously present in the text unit. The co-occurrence value of two words in a text unit may be represented by 1 or 0, i.e., 1 indicates that the two words occur simultaneously in the text unit, and 0 otherwise. Specifically, the co-occurrence value c_k of the i-th word and the j-th word of the corpus data in the k-th text unit can be expressed by the following formula:

c_k = 1, if the i-th word and the j-th word both appear in the k-th text unit; c_k = 0, otherwise.      (6)
Next, at step S620, a weighted co-occurrence value of every two words in the corpus data in each text unit is determined according to the co-occurrence value of the two words in the text unit and the weight of the text unit. To embody the weight of the text unit, the co-occurrence value of the two words shown in formula (6) above may be weighted to obtain the weighted co-occurrence value. For example, if the co-occurrence value of two words in the k-th text unit of the corpus data is c_k and the weight of the k-th text unit is w_k, then their weighted co-occurrence value is w_k * c_k.
Third, at step S630, a weighted co-occurrence matrix is constructed from the weighted co-occurrence values of every two words in each text unit of the corpus data, as the sample data for training the word vector model. Similar to the construction of an ordinary co-occurrence matrix, the element a_ij of the weighted co-occurrence matrix A = {a_ij} constructed from the weighted text units can be expressed as the weighted sum:

a_ij = Σ_{k=1..N} w_k * c_k      (7)

where c_k is the co-occurrence value of the i-th word and the j-th word in the k-th text unit as shown in formula (6), w_k is the weight of the k-th text unit, and N is the number of text units in the corpus data.
The weighted co-occurrence matrix constructed by formula (7) reflects the (co-occurrence) relationships between words while taking into account the weights of the different text units (used to eliminate the abnormal degrees of association between group words and target words). It can therefore be used as sample data for training the word vector model, satisfying the requirements of the specific word vector model while removing the group cognitive bias factors, so that the word vectors obtained from the trained word vector model objectively, truly and accurately reflect the meanings of the corresponding words and the relationships among words.
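As an illustrative sketch (the vocabulary handling and the binary sentence-level co-occurrence value c_k are simplifications consistent with formula (6)), the weighted co-occurrence matrix of formula (7) could be built as follows:

import itertools
import numpy as np

def weighted_cooccurrence_matrix(text_units, weights, vocab):
    # text_units: list of word lists; weights: weight w_k of each text unit;
    # vocab: ordered word list. Element a_ij accumulates w_k * c_k over all text units.
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for unit, w_k in zip(text_units, weights):
        present = sorted({index[w] for w in unit if w in index})
        for i, j in itertools.combinations(present, 2):   # c_k = 1 for each present pair
            A[i, j] += w_k
            A[j, i] += w_k
    return A

units = [["female", "poor"], ["male", "poor"], ["female", "good"]]
A = weighted_cooccurrence_matrix(units, [0.9, 1.8, 1.0], ["female", "male", "good", "poor"])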
FIG. 6B schematically illustrates a further example process of step S330 shown in FIG. 3A.
As shown in fig. 6B, the step S330 shown in fig. 3A of determining sample data for training the word vector model according to the corpus data and the weight of each text unit in the corpus data may include:
s601, selecting text units from the corpus data according to the weight of each text unit in the corpus data;
s602, according to the selected text unit, determining sample data for training the word vector model.
As described above, for a word vector model that uses text units as sample data (such as Word2vec or fastText), the sample data may be determined by random sampling from the entire corpus data. In order to embody the weights of the text units, each text unit can be given a corresponding sampling probability according to its weight, so that high-weight text units are sampled with higher probability to strengthen the relatively weak degree of association between their group words and target words in the sample data, while lower-weight text units are sampled with lower probability to weaken the relatively strong degree of association between their group words and target words, thereby weakening or even eliminating the cognitive bias.
Alternatively, the process of FIG. 6B may also be applied to other word vector models, since they are likewise based on the corpus data. Taking the GloVe model as an example, before the sample data is constructed, a number of text units may be collected from the corpus data with probabilities determined by the text unit weights, to serve as the basic data for forming the co-occurrence matrix; the co-occurrence matrix is then constructed according to the co-occurrence of words in each collected text unit, forming the sample data. Because each text unit is given a sampling probability according to its weight during the data sampling, the weight factors are fully embodied in the sample data obtained in this way, which likewise achieves the goals of removing or weakening group cognitive bias and improving word vector accuracy.
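A minimal sketch of the weight-proportional sampling of step S601 (the sample size and the toy data are assumptions) might be:

import numpy as np

def sample_text_units(text_units, weights, n_samples, seed=0):
    # Each text unit is drawn with probability proportional to its weight.
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(text_units), size=n_samples, replace=True, p=probs)
    return [text_units[i] for i in idx]

units = [["female", "poor"], ["male", "poor"], ["female", "good"]]
sampled = sample_text_units(units, [0.9, 1.8, 1.0], n_samples=1000)
# 'sampled' then serves as the basis for the sample data of step S602.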
Fig. 7A schematically illustrates an example process of step S520 illustrated in fig. 5.
In some embodiments, the first probability may be calculated using a classifier model, as follows:
first, corpus data is divided into K groups, each group including at least one text unit, where K is an integer greater than or equal to 2.
Then, for each group of corpus data in the K groups of corpus data, the following steps are performed:
training set and test set determination: taking the group word distribution vector and the target word distribution vector of each text unit in the group of corpus data as a test set, and taking the group word distribution vector and the target word distribution vector of each text unit in other groups of corpus data except the group of corpus data in the K groups of corpus data as a training set;
training: training a classifier model by taking the group word distribution vector of each unit in the training set as input and the target word distribution vector as output; and
a prediction step: and aiming at each text unit in the test set, predicting a first probability of occurrence of a second distribution condition represented by the target word distribution vector under the condition that the first distribution condition represented by the group word distribution vector occurs by using a trained classifier model according to the group word distribution vector and the target word distribution vector of the text unit.
More specifically, as shown in fig. 7A, the step shown in fig. 5 of determining, according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, the first probability that the second distribution condition characterized by the target word distribution vector of each text unit occurs given that the first distribution condition characterized by the group word distribution vector of the text unit occurs may include the following steps:
s710-grouping step: dividing the corpus data into K groups, and enabling a counter K =1, wherein the corpus data is divided into 1 st to Kth groups, each group comprises at least one text unit, and K is an integer greater than or equal to 2;
s720-judging the loop end condition: comparing k with K; if k is greater than K, the method ends, otherwise go to S730;
s730, a loop initialization step: taking the k-th group of corpus data as the current group of corpus data;
s740-training set and test set determination step: determining a training set and a test set according to the grouped corpus data, namely taking a group word distribution vector and a target word distribution vector of each text unit in the current group of corpus data as the test set, and taking the group word distribution vector and the target word distribution vector of each text unit in other groups of corpus data except the current group of corpus data in the K groups of corpus data as the training set;
s750-training step: training a classifier model by taking the group word distribution vector of each unit in the training set as input and the target word distribution vector as output; and
s760-prediction step: for each text unit in the test set, a first probability is predicted using the trained classifier model, the counter k is set to k = k + 1, and the method goes to step S720; here the first probability represents the probability that the second distribution condition characterized by the target word distribution vector occurs given that the first distribution condition characterized by the group word distribution vector occurs, and the prediction of the first probability is performed based on the group word distribution vector and the target word distribution vector of the corresponding text unit.
As described above, in some embodiments, in order to cope with the situation where the number of group words or target words in the group word set and/or the target word set is large, a classifier model may be trained to predict the conditional distribution law of the random vector T given Z = Z0, so that the first probability P(T=T0|Z=Z0) is derived from this conditional distribution law.
First, as shown in step S710, the corpus data is divided into K groups for the subsequent division into training and test sets, so as to implement cross prediction, improve accuracy and prevent overfitting. Here K is a hyper-parameter that may be determined in advance according to the specific situation; theoretically, the larger K is, the better (but the time cost also increases as K increases). In S710, the loop counter is also set to k = 1.
Second, after the data grouping, the subsequent steps are executed in a loop for each corpus data group to predict, group by group, the first probability of each text unit in that group. The loop end condition is judged as shown in step S720, i.e., the loop ends when k is greater than K. Then, as shown in step S730, the k-th group of corpus data is taken as the current group. Step S740 determines the training set and the test set: the group word distribution vector and the target word distribution vector of each text unit in the current group of corpus data form the test set, and the group word distribution vectors and target word distribution vectors of the text units in the other K-1 groups form the training set, thereby implementing cross prediction.
Next, the detailed process of steps S750 and S760 will be described by taking a random forest classifier model as an example.
Before describing step S750, the principle of the random forest classifier needs to be understood. The random forest classifier model is an algorithm that integrates multiple decision trees through the idea of ensemble learning; its basic unit is the decision tree, and it essentially belongs to the ensemble learning branch of machine learning. There are two keywords in the name "random forest": "random", which serves to avoid overfitting, and "forest", which serves to improve accuracy.
The random forest classification principle is as follows:
the algorithm flow is as follows:
(1) if the total number of training samples is N, N training samples are randomly drawn, with replacement, from the training set to form the training data of a single decision tree;
(2) if the number of input features of each training sample is M, a constant m far smaller than M is specified; when each node of each decision tree is split, m input features are randomly selected from the M input features, and an optimal feature is then selected from these m features according to a certain rule for splitting. m does not change during the construction of the decision tree (note: the two common measures for selecting split attributes in a decision tree are information gain and the Gini index);
(3) each tree is split until all training examples of the node belong to the same class, and pruning is not needed.
(4) result determination:
(1) if the target feature is numeric, the average of the results of the individual decision trees is taken as the final result;
(2) if the target feature is categorical, the class predicted by the most individual trees is taken as the classification result of the whole random forest, i.e., the minority is subordinate to the majority.
According to the algorithm principle of the random forest classifier model, the specific model training process of the step S750 is as follows:
first, assuming that a text unit is involved in the training set, and thus a mapping of a group a body word distribution vector to a target word distribution vector is included, Z0 and the corresponding target word distribution vector of each text unit are obtained, where, as described above, each dimension of the above vectors is represented by 0 or 1, and the dimension of the group word distribution vector is the number M of the group words in the group word set, so the feature number of the Z0 vector is M. Thus, a mapping of the group a body word distribution vector to the target word distribution vector is obtained for a total of a samples.
Then, n mappings are randomly drawn, with replacement, from the A mappings to form a mapping subset for training one decision tree. A constant m < M is specified, and m features are randomly selected from the M feature dimensions. During training, when a decision tree node is split, the optimal feature is learned from these m features according to a certain rule; each decision tree grows to the maximum extent possible and no pruning is performed.
The above steps are repeated until a predetermined number H of decision trees is obtained. The parameters n and m are adjusted until more than a predetermined proportion of the decision trees achieve correct classification, at which point the trained random forest classifier is obtained.
After the training is finished, as shown in step S760, for each text unit in the test set, the first probability that the second distribution condition characterized by its target word distribution vector occurs given that the first distribution condition characterized by its group word distribution vector occurs is predicted using the trained classifier model based on the group word distribution vector and the target word distribution vector of the text unit; meanwhile, the loop counter is set to k = k + 1 and the method goes back to step S720, so that steps S730-S760 are executed for each group of data in turn until the loop end condition k > K is satisfied. The prediction procedure is as follows.
Based on the above model training process, the trained random forest classifier model includes H decision trees, each of which constitutes a sub-classifier model whose input is the group word distribution vector of a text unit and whose output is the category to which that group word distribution vector belongs, i.e., the target word distribution vector to which it is mapped; H classification results are thus obtained. Therefore, suppose the group word distribution vector and the target word distribution vector of a text unit are Z0 and T0 (specific values of the random vectors Z and T, respectively); then Z0 can be used as the input, the conditional distribution law of the random vector T given Z = Z0 can be predicted from the above classification results, and the first probability is obtained from this conditional distribution law.
For example, assume that H = 1000 and that the decision tree classification results include 100 distinct target word distribution vectors (i.e. specific values of the random vector T) in total, denoted T0, T0', T0'', T0''', and so on. If, for the input Z0, the classification results contain T0 a total of 10 times, T0' a total of 5 times, T0'' a total of 2 times, T0''' a total of 30 times, and so on, then the proportion (i.e. probability) of each classification result, or target word distribution vector, among all the classification results can be obtained by frequency counting, and this proportion can be regarded as the conditional distribution law of the random vector T under the condition Z = Z0. Table 2 shows the conditional distribution law of the random vector T in the case of Z = Z0. From the distribution law shown in Table 2, it can be seen by looking up the table that the conditional probability corresponding to T0 is 0.01, i.e. P(T=T0|Z=Z0)=0.01.
Table 2 - Example conditional distribution law of the random vector T given Z = Z0

Value x of T          T0       T0'      T0''     T0'''
P(T = x | Z = Z0)     0.01     0.005    0.002    0.03
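Continuing the same sketch, the conditional distribution law in Table 2 could be estimated by frequency counting over the votes of the H trained trees, for example as follows. This again is an illustrative sketch that reuses the hypothetical train_random_forest output above.

```python
from collections import Counter

def conditional_distribution(forest, z0):
    """Estimate P(T = x | Z = z0) from the votes of the trained trees.

    forest: list of (tree, cols) pairs from the sketch above.
    z0: one group word distribution vector (length-M 0/1 numpy array).
    """
    votes = Counter(
        tree.predict(z0[cols].reshape(1, -1))[0] for tree, cols in forest
    )
    total = sum(votes.values())     # equals H
    return {label: count / total for label, count in votes.items()}

# With H = 1000 trees, 10 votes for the class corresponding to T0 would give
# P(T = T0 | Z = z0) = 10 / 1000 = 0.01, as in Table 2.
```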
In some embodiments, in addition to the random forest classifier model, the classifier models used in steps S750 and S760 shown in fig. 7A may also be other classifier models, such as, but not limited to, an XGBoost (eXtreme Gradient Boosting) model or a LightGBM (Light Gradient Boosting Machine) model. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that can quickly and accurately solve many data science problems, and can handle multiple tasks such as regression, classification, and ranking. LightGBM is a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.
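For illustration, either library could be substituted for the random forest classifier roughly as follows; the parameter values are arbitrary, and the assumption that the class labels encode which target word distribution vector a text unit maps to carries over from the earlier sketch.

```python
import xgboost as xgb
import lightgbm as lgb

# Drop-in alternatives to the random forest classifier of steps S750/S760.
xgb_clf = xgb.XGBClassifier(n_estimators=200, max_depth=6)
lgb_clf = lgb.LGBMClassifier(n_estimators=200)

# e.g. xgb_clf.fit(Z_train, y_train); probs = xgb_clf.predict_proba(Z_test)
# predict_proba() yields conditional class probabilities, playing the same
# role as the vote frequencies of the random forest in Table 2.
```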
In the embodiment shown in fig. 7A, by grouping the corpus data and cross-predicting the first probability (i.e., the probability that the second distribution condition represented by the target word distribution vector occurs when the first distribution condition represented by the group word distribution vector occurs) with a classifier model such as a random forest classifier, the accuracy of the first probability prediction can be improved and overfitting can be prevented, thereby improving the accuracy of the text unit weight calculation and, in turn, the accuracy of the word vectors.
Fig. 7B schematically illustrates another example process of step S520 illustrated in fig. 5.
As shown in fig. 7B, step S520 shown in fig. 5, i.e. determining, for each text unit in the corpus data, according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, a first probability that the second distribution condition represented by the target word distribution vector of the text unit occurs when the first distribution condition represented by its group word distribution vector occurs, may include:
S701, determining the probability of each group word in the group word set appearing in the text unit of the corpus data according to the group word distribution vector of each text unit in the corpus data;
S702, determining a third probability for each text unit according to the probability of each group word in the group word set appearing in the text unit of the corpus data, wherein for each text unit in the corpus data, the third probability represents the probability of the first distribution condition represented by the group word distribution vector of the text unit;
S703, determining a fourth probability for each text unit according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, wherein for each text unit in the corpus data, the fourth probability represents the probability that a first distribution condition represented by the group word distribution vector and a second distribution condition represented by the target word distribution vector of the text unit occur at the same time;
and S704, determining a first probability for each text unit according to the third probability and the fourth probability, wherein for each text unit in the corpus data, the first probability represents the probability of occurrence of a second distribution condition represented by a target word distribution vector of the text unit under the condition that a first distribution condition represented by a group word distribution vector of the text unit occurs.
In some embodiments, as described above, the first probability may be calculated directly according to equation (3), i.e. P(T=T0|Z=Z0)=P(T=T0, Z=Z0)/P(Z=Z0). Therefore, it is first necessary to calculate a third probability P(Z=Z0) that the first distribution condition characterized by the group word distribution vector of each text unit in the corpus data occurs, and a fourth probability P(T=T0, Z=Z0) that the first distribution condition characterized by the group word distribution vector and the second distribution condition characterized by the target word distribution vector of each text unit in the corpus data occur simultaneously. As shown in steps S701-S702, the third probability P(Z=Z0) can be calculated in the same way as the second probability P(T=T0) of step S530 shown in fig. 5. In other words, the distribution law of the random vector Z can be calculated using the independence between the elements of the random vector Z, i.e., the random variables Z1, …, Zn that are its components. Specifically, as shown in step S701, the probability that each group word in the group word set appears in a text unit of the corpus data, i.e., P(Zi=1) (i=1, …, n), is first determined from the group word distribution vector of each text unit in the corpus data, from which P(Zi=0)=1-P(Zi=1) (i=1, …, n) follows, thus yielding the distribution law of each random variable Z1, …, Zn. Then, as shown in step S702, according to the probability that each group word in the group word set appears in the text units of the corpus data, the third probability can be obtained:
P(Z=Z0)=P(Z1=Z01)*P(Z2=Z02)*…*P(Zn=Z0n)      (8)
wherein Z0=(Z01, Z02, …, Z0n), and each Z0i=1 (the corresponding group word is present in the text unit) or 0 (the corresponding group word is absent).
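A minimal numpy sketch of equation (8), assuming the group word distribution vectors of all text units are stacked into a 0/1 matrix; the function name third_probability is hypothetical.

```python
import numpy as np

def third_probability(Z_all, z0):
    """P(Z = z0) under the independence assumption of equation (8).

    Z_all: (A, n) 0/1 matrix, one group word distribution vector per text unit.
    z0:    length-n 0/1 vector whose probability is wanted.
    """
    p_one = Z_all.mean(axis=0)                      # P(Zi = 1) by frequency counting
    factors = np.where(z0 == 1, p_one, 1.0 - p_one)
    return float(np.prod(factors))
```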
Subsequently, as described in step S703, a fourth probability that the first distribution condition represented by the group word distribution vector and the second distribution condition represented by the target word distribution vector of each text unit in the corpus data occur simultaneously is determined according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data. For example, for each text unit, the number of text units in which its group word distribution vector Z0 and its target word distribution vector T0 occur simultaneously can be counted directly, and this count can then be divided by the total number of text units to obtain the fourth probability that the first and second distribution conditions occur simultaneously.
Finally, as shown in step S704, the first probability may be calculated from the third probability and the fourth probability, for example, by formula (3).
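Putting steps S703 and S704 together, the fourth probability can be obtained by direct counting and combined with the third probability according to formula (3), for example as in the following sketch, which reuses the hypothetical third_probability helper above.

```python
import numpy as np

def first_probability(Z_all, T_all, z0, t0):
    """P(T = t0 | Z = z0) = P(T = t0, Z = z0) / P(Z = z0), as in formula (3)."""
    Na = Z_all.shape[0]
    # fourth probability: fraction of text units where z0 and t0 occur together
    both = np.all(Z_all == z0, axis=1) & np.all(T_all == t0, axis=1)
    p_joint = both.sum() / Na
    p_z = third_probability(Z_all, z0)              # third probability, equation (8)
    return p_joint / p_z if p_z > 0 else 0.0
```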
In the embodiment shown in fig. 7B, the first probability (i.e., the probability that the second distribution condition represented by the target word distribution vector occurs when the first distribution condition represented by the group word distribution vector occurs) is obtained simply with statistical methods, basic probability theory, and a calculation formula. The prediction of the first probability is thereby simplified while the overall distribution of target words and group words across the entire corpus data is still taken into account, which improves computational performance while maintaining the accuracy of the text unit weight calculation.
Fig. 8 schematically illustrates an example process of step S530 illustrated in fig. 5.
As shown in fig. 8, step S530 shown in fig. 5, i.e. determining, for each text unit in the corpus data, a second probability of occurrence of a second distribution condition characterized by the target word distribution vector of the text unit according to the target word distribution vector of each text unit in the corpus data, includes:
S810, determining the probability of each target word in the target word set appearing in the text unit of the corpus data according to the target word distribution vector of each text unit in the corpus data;
S820, according to the probability of each target word in the target word set appearing in the text unit of the corpus data, determining a second probability for each text unit, wherein for each text unit in the corpus data, the second probability represents the probability of the second distribution condition represented by the target word distribution vector of the text unit.
As described above, the distribution law of the random vector T can be calculated using the independence between the elements of the random vector T, i.e., the random variables that are its components. Given the random vector T=(T1, T2, …, Tm) representing the distribution of the target word set in a text unit, and assuming that its components, i.e., the random variables T1, …, Tm, are independent of each other, then for a specific target word distribution vector T0=(T01, T02, …, T0m) of a certain text unit, basic probability theory yields the second probability that the second distribution condition it represents occurs:
P(T=T0)=P(T1=T01)*P(T2=T02)*…*P(Tm=T0m)      (9)
wherein T0i (i=1, …, m) represents the specific value (0 or 1) of the corresponding random variable Ti in the text unit.
In some embodiments, similar to step S520, for each random variable Ti in the random vector T, the number Nti of text units containing the corresponding target word and the total number Na of text units can be obtained by frequency counting, and basic probability theory then gives:
P(Ti=1)=Nti/Na      (10)
P(Ti=0)=(Na-Nti)/Na      (11)
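A small numpy sketch of equations (9)-(11), assuming the target word distribution vectors of all text units are stacked into a 0/1 matrix; the function name second_probability is hypothetical.

```python
import numpy as np

def second_probability(T_all, t0):
    """P(T = t0) under the independence assumption of equations (9)-(11).

    T_all: (Na, m) 0/1 matrix of target word distribution vectors.
    t0:    length-m 0/1 vector of one text unit.
    """
    p_one = T_all.sum(axis=0) / T_all.shape[0]      # P(Ti = 1) = Nti / Na
    factors = np.where(t0 == 1, p_one, 1.0 - p_one)
    return float(np.prod(factors))
```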
For example, as shown in table 1, if the target word set includes "good", "poor", "strong", and "weak", and a text unit contains only "poor", then the target word distribution vector of that text unit is T0=(0, 1, 0, 0). Further assume that the distribution over text units of the target word corresponding to each component of T0 is as shown in table 3, and that the total number of text units is 100. The distribution law of T1 is then:
P(T1=1)=3/100=0.03, P(T1=0)=(100-3)/100=0.97
In a similar manner, the distribution laws of T2, T3, and T4 shown in table 3 can be obtained. Then, according to equation (9) and the distribution law of each component (random variable) shown in table 3, the second probability can be obtained:
P(T=T0)=P(T1=0)*P(T2=1)*P(T3=0)*P(T4=0)
       =0.97*0.02*0.96*0.94=0.01750656.
TABLE 3 - Example distribution of target words in text units

Target word (corresponding random variable)       Good (T1)    Poor (T2)    Strong (T3)    Weak (T4)
Number of text units in which the word appears    3            2            4              6
P(Ti=1) (i=1,…,4)                                 0.03         0.02         0.04           0.06
P(Ti=0) (i=1,…,4)                                 0.97         0.98         0.96           0.94
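The worked example above can be reproduced directly from the frequencies in Table 3, for instance with the following short snippet.

```python
import numpy as np

# Frequencies from Table 3: "good", "poor", "strong", "weak" appear in
# 3, 2, 4 and 6 of the Na = 100 text units, respectively.
Na = 100
n_ti = np.array([3, 2, 4, 6])
p_one = n_ti / Na                      # P(Ti = 1): 0.03, 0.02, 0.04, 0.06
t0 = np.array([0, 1, 0, 0])            # a text unit containing only "poor"
factors = np.where(t0 == 1, p_one, 1.0 - p_one)
print(round(float(np.prod(factors)), 8))   # 0.01750656, matching the text
```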
Fig. 9 schematically illustrates an example block diagram of a word vector generation apparatus 900 in accordance with some embodiments of the present disclosure. The word vector generating apparatus 900 may include a corpus obtaining module 910, a weight determining module 920, a sample determining module 930, and a word vector obtaining module 940.
The corpus acquisition module 910 may be configured to acquire corpus data including at least two units of text, each unit of text including at least one word. The weight determination module 920 may be configured to determine a weight of each text unit in the corpus data according to a distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data. The sample determination module 930 may be configured to determine sample data for training the word vector model based on the corpus data and the weight of each text unit in the corpus data. The word vector obtaining module 940 may be configured to train a word vector model using the sample data, and obtain a word vector of at least one word of at least one text unit in the corpus data from the trained word vector model.
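Purely as a structural illustration (the class and method names below are hypothetical, not part of the disclosed apparatus), the four modules could be wired together as follows:

```python
class WordVectorGenerationApparatus:
    """Sketch of how modules 910-940 of apparatus 900 could cooperate."""

    def __init__(self, corpus_module, weight_module, sample_module, vector_module):
        self.corpus_module = corpus_module    # corpus acquisition module 910
        self.weight_module = weight_module    # weight determination module 920
        self.sample_module = sample_module    # sample determination module 930
        self.vector_module = vector_module    # word vector obtaining module 940

    def generate(self, source, group_words, target_words):
        corpus = self.corpus_module.acquire(source)
        weights = self.weight_module.determine(corpus, group_words, target_words)
        samples = self.sample_module.determine(corpus, weights)
        return self.vector_module.train_and_extract(samples)
```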
The present disclosure provides a word vector generation device that focuses on: the method comprises the steps of giving corresponding weights to all text units based on the distribution conditions of the group words and the target words in the text units of the corpus data, adaptively adjusting the degree of too high or too low association between the target words and the group words in the text units due to group cognition deviation, and accordingly weakening or removing the group cognition deviation existing in the corpus data. Furthermore, through the weighting operation of the text unit, the cognitive deviation caused by regional difference, cultural difference and the like in the corpus data can be corrected, so that the accuracy of the word vector is obviously improved. In services or tasks related to natural language processing, such unbiased (i.e., de-cognitively biased) word vectors can more objectively and realistically reflect the meanings of natural language words and relationships between words in the human social and physical world, and thus can be more fairly treated when facing different groups. Further, the operations of text unit weighting processing, sample data improvement and the like in the word vector generation method according to the embodiment of the disclosure can be automatically completed by the computing device without any additional manual operation (such as corpus labeling), so that compared with the method of removing group cognitive deviation in a manual or manual labeling manner in the related art, the word vector generation device according to the embodiment of the disclosure simplifies the work flow, remarkably improves the work efficiency, and reduces the labor cost for removing group cognitive deviation; on the other hand, the word vector generation device according to the embodiment of the present disclosure performs weighting processing on a text unit and sample data improvement before the training process of the word vector model starts, and does not bring extra performance loss to the training process of the whole word vector model, and the calculations (such as simple elementary algebraic operations) involved in these operations are not complicated, so that the calculation overhead is small, and the overall calculation performance and data processing efficiency are high.
It should be noted that the various modules described above may be implemented in software or hardware or a combination of both. Several different modules may be implemented in the same software or hardware configuration, or one module may be implemented by several different software or hardware configurations.
Fig. 10 schematically illustrates an example block diagram of a computing device 1000 in accordance with some embodiments of the disclosure. Computing device 1000 may represent a device to implement the various means or modules described herein and/or perform the various methods described herein. Computing device 1000 may be, for example, a server, a desktop computer, a laptop computer, a tablet, a smartphone, a smartwatch, a wearable device, or any other suitable computing device or computing system, which may include various levels of devices ranging from full resource devices with substantial storage and processing resources to low-resource devices with limited storage and/or processing resources. In some embodiments, the word vector generation device 900 described above with respect to fig. 9 may be implemented in one or more computing devices 1000, respectively.
As shown in fig. 10, the example computing device 1000 includes a processing system 1001, one or more computer-readable media 1002, and one or more I/O interfaces 1003 communicatively coupled to each other. Although not shown, the computing device 1000 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Alternatively, control and data lines, for example, may be included.
Processing system 1001 represents functionality to perform one or more operations using hardware. Thus, the processing system 1001 is illustrated as including hardware elements 1004 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware elements 1004 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 1002 is illustrated as including a memory/storage 1005. Memory/storage 1005 represents memory/storage associated with one or more computer-readable media. Memory/storage 1005 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). Memory/storage 1005 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). Illustratively, the memory/storage 1005 may be used to store the corpus data, the word vectors, and other data mentioned in the above embodiments. The computer-readable medium 1002 may be configured in various other ways as further described below.
One or more I/O (input/output) interfaces 1003 represent functionality that allows a user to enter commands and information to computing device 1000, and also allows information to be displayed to the user and/or transmitted to other components or devices using a variety of input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that does not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), a network card, a receiver, and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a haptic response device, a network card, a transmitter, and so forth. Illustratively, in the above-described embodiments, the text to be processed may be entered through such an input interface, and results such as the generated word vectors may be provided through such an output interface.
Computing device 1000 also includes word vector generation policy 1006. Word vector generation policy 1006 may be stored in memory/storage 1005 as computer program instructions, or may be hardware or firmware. The word vector generation policy 1006 may implement all functions of the respective modules of the word vector generation apparatus 900 described with respect to fig. 9, along with the processing system 1001 and the like.
Various techniques may be described herein in the general context of software, hardware, elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and the like as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 1000. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage, tangible media, or an article of manufacture suitable for storing the desired information and which may be accessed by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to the hardware of computing device 1000, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. By way of example, and not limitation, signal media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, the hardware elements 1004 and the computer-readable medium 1002 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 1004. Computing device 1000 may be configured to implement particular instructions and/or functionality corresponding to software and/or hardware modules. Thus, a module executable by the computing device 1000 as software may be implemented at least partially in hardware, for example, through use of a computer-readable storage medium of the processing system and/or the hardware elements 1004. The instructions and/or functions may be executable/operable by, for example, one or more computing devices 1000 and/or processing systems 1001 to implement the techniques, modules, and examples described herein.
The techniques described herein may be supported by these various configurations of computing device 1000 and are not limited to specific examples of the techniques described herein.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as a computer program. For example, embodiments of the present disclosure provide a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing at least one step of the method embodiments of the present disclosure.
In some embodiments of the present disclosure, one or more computer-readable storage media are provided having computer-readable instructions stored thereon that, when executed, implement a word vector generation method in accordance with some embodiments of the present disclosure. The steps of the word vector generation method according to some embodiments of the present disclosure may be converted into computer-readable instructions by programming and stored in a computer-readable storage medium. When such a computer-readable storage medium is read or accessed by a computing device or computer, the computer-readable instructions therein are executed by a processor on the computing device or computer to implement a word vector generation method according to some embodiments of the present disclosure.
In the description of the present specification, the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without being mutually inconsistent.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present disclosure in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, any one or a combination of the following techniques, which are well known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having appropriate combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the method of the above embodiments may be performed by hardware associated with program instructions, and that the program may be stored in a computer readable storage medium, which when executed, includes performing one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.

Claims (15)

1. A word vector generation method, comprising:
obtaining corpus data comprising at least two text units, wherein each text unit at least comprises a word;
determining the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text units of the corpus data;
determining sample data for training a word vector model according to the corpus data and the weight of each text unit in the corpus data;
and training a word vector model by using the sample data, and obtaining a word vector of at least one word of at least one text unit in the corpus data from the trained word vector model.
2. The method for generating word vectors according to claim 1, wherein the determining the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data comprises:
determining a group word distribution vector and a target word distribution vector of each text unit in corpus data, wherein the group word distribution vector is used for indicating a first distribution condition of each group word in a group word set in the text unit, each element in the group word distribution vector is used for indicating whether a corresponding group word in the group word set exists in the text unit, the target word distribution vector is used for indicating a second distribution condition of each target word in the target word set in the text unit, and each element in the target word distribution vector is used for indicating whether a corresponding target word in the target word set appears in the text unit;
according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, determining a first probability of occurrence of a second distribution condition represented by the target word distribution vector of the text unit under the condition that the first distribution condition represented by the group word distribution vector of the text unit occurs for each text unit in the corpus data;
according to the target word distribution vector of each text unit in the corpus data, determining a second probability of occurrence of a second distribution condition represented by the target word distribution vector of the text unit aiming at each text unit in the corpus data;
and determining the weight of each text unit in the corpus data according to the first probability and the second probability.
3. The method according to claim 2, wherein said determining, for each text unit in the corpus data, a first probability of occurrence of a second distribution condition characterized by the target word distribution vector of the text unit if a first distribution condition characterized by the group word distribution vector of the text unit occurs according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data, comprises:
dividing corpus data into K groups, wherein each group comprises at least one text unit, and K is an integer greater than or equal to 2; and
for each group of corpus data in the K groups of corpus data, executing the following steps:
training set and testing set determination steps: taking the group word distribution vector and the target word distribution vector of each text unit in the group of corpus data as a test set, taking the group word distribution vector and the target word distribution vector of each text unit in other groups of corpus data except the group of corpus data in the K groups of corpus data as a training set,
training: training a classifier model by taking the group word distribution vector of each text unit in the training set as input and the target word distribution vector as output, and
a prediction step: and aiming at each text unit in the test set, predicting a first probability of occurrence of a second distribution condition represented by the target word distribution vector under the condition that the first distribution condition represented by the group word distribution vector occurs by using a trained classifier model according to the group word distribution vector and the target word distribution vector of the text unit.
4. The method for generating word vectors according to claim 2, wherein the determining, for each text unit in the corpus data, a first probability of occurrence of a second distribution condition characterized by the target word distribution vector of the text unit when the first distribution condition characterized by the group word distribution vector of the text unit occurs according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data includes:
determining the probability of each group word in the group word set appearing in the text unit of the corpus data according to the group word distribution vector of each text unit in the corpus data;
determining a third probability of occurrence of a first distribution condition represented by a group word distribution vector of each text unit in the corpus data according to the probability of occurrence of each group word in the group word set in the text unit of the corpus data;
determining a fourth probability that a first distribution condition represented by the group word distribution vector of each text unit and a second distribution condition represented by the target word distribution vector of each text unit in the corpus data simultaneously appear according to the group word distribution vector and the target word distribution vector of each text unit in the corpus data;
and determining a first probability of the occurrence of a second distribution condition characterized by the target word distribution vector of each text unit under the condition that the first distribution condition characterized by the group word distribution vector of the text unit in the corpus data occurs according to the third probability and the fourth probability.
5. The method for generating word vectors according to claim 2, wherein said determining, for each text unit in the corpus data, a second probability of occurrence of a second distribution condition characterized by the target word distribution vector of the text unit according to the target word distribution vector of each text unit in the corpus data comprises:
determining the probability of each target word in the target word set appearing in the text unit of the corpus data according to the target word distribution vector of each text unit in the corpus data;
and determining a second probability of occurrence of a second distribution condition represented by the target word distribution vector of each text unit in the corpus data according to the probability of occurrence of each target word in the target word set in the text unit of the corpus data.
6. The word vector generation method of claim 3, wherein the classifier model comprises at least one of: a random forest classifier model, an XGBoost classifier model, and a LightGBM classifier model.
7. The method for generating word vectors according to claim 1, wherein said determining sample data for training word vector models according to corpus data and weights of each text unit in corpus data comprises:
determining a co-occurrence value of every two words in the corpus data in each text unit, the co-occurrence value indicating whether the two words are simultaneously present in the text unit;
determining the weighted co-occurrence value of each two words in the text unit according to the co-occurrence value of the two words in each text unit in the corpus data and the weight of the text unit;
and constructing a co-occurrence matrix according to the weighted co-occurrence value of every two words in the corpus data in each text unit, wherein the co-occurrence matrix is used as sample data for training a word vector model.
8. The method for generating word vectors according to claim 1, wherein said determining sample data for training word vector models according to corpus data and weights of each text unit in corpus data comprises:
selecting a text unit from the corpus data according to the weight of each text unit in the corpus data;
and determining sample data for training the word vector model according to the selected text unit.
9. The method for generating word vectors according to claim 1, wherein before determining the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data, the method further comprises:
and performing word segmentation processing on each text unit in the corpus data to obtain each word contained in the text unit.
10. The word vector generation method as claimed in claim 1, further comprising:
acquiring a text to be processed;
and searching a word vector corresponding to the word in the text to be processed from a word vector library formed by the obtained word vectors.
11. The word vector generation method of claim 1, wherein the word vector model comprises at least one of: GloVe, Word2vec, fastText.
12. The word vector generation method according to claim 1, wherein the text unit is a sentence of a natural language.
13. A word vector generation apparatus comprising:
the system comprises a corpus acquisition module, a semantic data acquisition module and a semantic data processing module, wherein the corpus acquisition module is configured to acquire corpus data comprising at least two text units, and each text unit at least comprises a word;
the weight determining module is configured to determine the weight of each text unit in the corpus data according to the distribution of each group word in the group word set and each target word in the target word set in the text unit of the corpus data;
a sample determination module configured to determine sample data for training the word vector model according to the corpus data and the weight of each text unit in the corpus data;
a word vector obtaining module configured to train a word vector model using the sample data, and obtain a word vector of at least one word of at least one text unit in corpus data from the trained word vector model.
14. A computing device, comprising:
a processor; and
a memory having instructions stored thereon that, when executed on the processor, cause the processor to perform the method of any of claims 1-12.
15. One or more computer-readable storage media having computer-readable instructions stored thereon that, when executed, implement the method of any of claims 1-12.
CN202111653193.7A 2020-12-22 2020-12-22 Word vector generation method and device, computing device and computer-readable storage medium Pending CN114662488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111653193.7A CN114662488A (en) 2020-12-22 2020-12-22 Word vector generation method and device, computing device and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011528632.7 2020-12-22
CN202111653193.7A CN114662488A (en) 2020-12-22 2020-12-22 Word vector generation method and device, computing device and computer-readable storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202011528632.7 Division 2020-12-22 2020-12-22

Publications (1)

Publication Number Publication Date
CN114662488A true CN114662488A (en) 2022-06-24

Family

ID=82057351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111653193.7A Pending CN114662488A (en) 2020-12-22 2020-12-22 Word vector generation method and device, computing device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114662488A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220121826A1 (en) * 2021-03-15 2022-04-21 Beijing Baidu Netcom Science Technology Co., Ltd. Method of training model, method of determining word vector, device, medium, and product

Similar Documents

Publication Publication Date Title
US20210232762A1 (en) Architectures for natural language processing
CN107066464B (en) Semantic natural language vector space
Alami et al. Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling
CN107368515B (en) Application page recommendation method and system
CN105183833B (en) Microblog text recommendation method and device based on user model
CN110674317B (en) Entity linking method and device based on graph neural network
US11756094B2 (en) Method and device for evaluating comment quality, and computer readable storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
US20210056127A1 (en) Method for multi-modal retrieval and clustering using deep cca and active pairwise queries
Zhao et al. Simple question answering with subgraph ranking and joint-scoring
Wang et al. Named entity disambiguation for questions in community question answering
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
US10810266B2 (en) Document search using grammatical units
Amir et al. Sentence similarity based on semantic kernels for intelligent text retrieval
Chien et al. Latent Dirichlet mixture model
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN114548321A (en) Self-supervision public opinion comment viewpoint object classification method based on comparative learning
Tondulkar et al. Get me the best: predicting best answerers in community question answering sites
Han et al. Conditional word embedding and hypothesis testing via bayes-by-backprop
CN117573985B (en) Information pushing method and system applied to intelligent online education system
CN113342944B (en) Corpus generalization method, apparatus, device and storage medium
Suresh Kumar et al. Local search five‐element cycle optimized reLU‐BiLSTM for multilingual aspect‐based text classification
AlMahmoud et al. The effect of clustering algorithms on question answering
CN114662488A (en) Word vector generation method and device, computing device and computer-readable storage medium
Chen et al. A hybrid approach for question retrieval in community question answerin

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination