CN113255335A - Word vector obtaining method and system, electronic equipment and storage medium - Google Patents

Word vector obtaining method and system, electronic equipment and storage medium Download PDF

Info

Publication number
CN113255335A
CN113255335A
Authority
CN
China
Prior art keywords
word
words
dictionary
word vector
chi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110552803.8A
Other languages
Chinese (zh)
Inventor
梁吉光
徐凯波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd filed Critical Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202110552803.8A priority Critical patent/CN113255335A/en
Publication of CN113255335A publication Critical patent/CN113255335A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a character-vector acquisition method and system, an electronic device, and a storage medium. The acquisition method comprises: a dictionary building step, in which the words of a pre-trained word-vector model are extracted, split into characters, and deduplicated to form a character dictionary; a chi-square acquisition step, in which character co-occurrence frequencies are counted and the character chi-square is computed from them; and a character-vector calculation step, in which word vectors are weighted by the character chi-square to obtain character vectors. When the method is applied to text analysis, text can be vectorized directly from the character vectors without word segmentation, avoiding the cascading errors that segmentation mistakes would otherwise introduce.

Description

Word vector obtaining method and system, electronic equipment and storage medium
Technical Field
The present application relates to the field of deep learning and, in particular, to a character-vector acquisition method and system, an electronic device, and a storage medium.
Background
In recent years, with the development of deep learning, it has become the dominant research method for most natural language processing tasks, playing a crucial role in text representation, text classification, sentiment classification, automatic summarization, and the like. Text representation, i.e., the vectorization of text, is required by almost every natural language processing application. However, written Chinese and English differ greatly: English text places spaces between words, which naturally mark word boundaries, whereas Chinese text has no explicit word delimiters and is written as an unbroken sequence of characters. Chinese text must therefore be segmented into words before it can be vectorized. Existing Chinese word segmenters follow no unified segmentation standard, so their outputs are inconsistent and their quality is hard to judge, and segmentation quality directly affects the vectorization of Chinese text.
Disclosure of Invention
The embodiments of the present application provide a character-vector acquisition method and system, an electronic device, and a storage medium, and at least solve the following problems: obtaining character vectors by training consumes substantial resources; character vectors cannot be derived directly from existing resources; and text cannot be vectorized directly from character vectors, so word segmentation is required and segmentation mistakes introduce cascading errors.
The invention provides a character-vector acquisition method, comprising the following steps:
a dictionary building step: extracting the words from a pre-trained word-vector model, splitting the words into characters, deduplicating the characters, and forming a character dictionary;
a chi-square acquisition step: counting character co-occurrence frequencies and computing the character chi-square from them;
a character-vector calculation step: weighting word vectors by the character chi-square to obtain character vectors.
In the above character-vector acquisition method, the dictionary building step comprises:
a word-dictionary generating step: reading the pre-trained word-vector model, extracting all of its words, and forming a word dictionary from them;
a character-dictionary generating step: splitting the words in the word dictionary into individual characters, storing and deduplicating those characters, and forming a character dictionary from the result.
In the above character-vector acquisition method, the chi-square acquisition step comprises:
a mapping-table establishing step: building a character-word mapping table from the compositional relation between characters and the words they form;
a co-occurrence statistics step: counting how often pairs of characters do and do not co-occur within the words of the dictionary;
a chi-square calculation step: computing the character chi-square from those co-occurrence and non-co-occurrence counts, and normalizing it.
In the above character-vector acquisition method, the character-vector calculation step comprises obtaining the character vector by weighted calculation from the character-word mapping table and the normalized character chi-square.
The present invention also provides a character-vector acquisition system to which the above acquisition method applies, comprising:
a dictionary building unit: extracts the words from a pre-trained word-vector model, splits them into characters, deduplicates the characters, and forms a character dictionary;
a chi-square acquisition unit: counts character co-occurrence frequencies and computes the character chi-square from them;
a character-vector calculation unit: weights word vectors by the character chi-square to obtain character vectors.
In the above system, the dictionary building unit comprises:
a word-dictionary generation module: reads the pre-trained word-vector model, extracts all of its words, and forms a word dictionary from them;
a character-dictionary generation module: splits the words in the word dictionary into individual characters, stores and deduplicates them, and forms a character dictionary from the result.
In the above system, the chi-square acquisition unit comprises:
a mapping-table building module: builds a character-word mapping table from the compositional relation between characters and the words they form;
a co-occurrence statistics module: counts how often pairs of characters do and do not co-occur within the words of the dictionary;
a chi-square calculation module: computes the character chi-square from those counts and normalizes it.
In the above system, the character-vector calculation unit obtains the character vector by weighted calculation from the character-word mapping table and the character chi-square.
The present invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements any of the character-vector acquisition methods described above.
The present invention also provides a computer-readable storage medium storing computer program instructions which, when executed by a processor, implement any of the character-vector acquisition methods described above.
Compared with the related art, the method computes character vectors directly from existing resources, without resource-intensive training. Moreover, when applied to text analysis, text can be vectorized directly from the character vectors without word segmentation, avoiding the cascading errors caused by segmentation mistakes and improving natural-language-processing capability.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart of the character-vector acquisition method according to an embodiment of the present application;
FIG. 2 is a block diagram of the character-vector acquisition method according to an embodiment of the present application;
FIG. 3 is a framework diagram of the character-vector acquisition device according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the character-vector acquisition system of the present invention;
FIG. 5 is a framework diagram of an electronic device according to an embodiment of the present application.
Wherein the reference numerals are:
dictionary building unit: 51;
chi-square acquisition unit: 52;
character-vector calculation unit: 53;
word-dictionary generation module: 511;
character-dictionary generation module: 512;
mapping-table building module: 521;
co-occurrence statistics module: 522;
chi-square calculation module: 523;
bus: 80;
processor: 81;
memory: 82;
communication interface: 83.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that such a development effort might be complex and tedious, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as a limitation of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical and scientific terms used herein have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. References to "a", "an", "the", and similar words in this application do not limit number and may refer to the singular or the plural. The terms "including", "comprising", "having", and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus comprising a list of steps or modules (elements) is not limited to the listed steps or elements but may include other steps or elements not expressly listed or inherent to it. References to "connected", "coupled", and the like are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" means two or more. "And/or" describes an association between objects and covers three cases: "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the associated objects. The terms "first", "second", "third", and the like merely distinguish similar objects and do not denote a particular ordering.
The invention provides a character-vector acquisition method and system, an electronic device, and a storage medium that derive character vectors from a word-vector model, solving the problems that obtaining character vectors by training consumes substantial resources, that character vectors cannot be derived directly from existing resources, and that text cannot be vectorized directly from character vectors, so that word segmentation is required and segmentation mistakes cause cascading errors.
The present invention will be described with reference to specific examples.
Example one
This embodiment provides a character-vector acquisition method. Referring to fig. 1 and 2, fig. 1 is a flowchart of the character-vector acquisition method according to an embodiment of the present disclosure, and fig. 2 is a block diagram of the method. As shown in the figures, the method comprises the following steps:
dictionary building step S1: extracting the words from a pre-trained word-vector model, splitting them into characters, deduplicating the characters, and forming a character dictionary;
chi-square acquisition step S2: counting character co-occurrence frequencies and computing the character chi-square from them;
character-vector calculation step S3: weighting word vectors by the character chi-square to obtain character vectors.
In an embodiment, the dictionary building step S1 comprises:
word-dictionary generating step S11: reading the pre-trained word-vector model, extracting all of its words, and forming a word dictionary from them;
character-dictionary generating step S12: splitting the words in the word dictionary into individual characters, storing and deduplicating them, and forming a character dictionary from the result.
In a specific implementation, a prepared pre-trained word-vector model is read and stored as <word, vector> pairs, denoted WordEmbedding = <word, vector(word)>, where vector(word) is the vector of word. All words are then taken from WordEmbedding to build a Chinese word dictionary, stored as <word> entries; that is, all words in WordEmbedding are extracted and stored as the dictionary WordDictionary. Each word in WordDictionary is then split character by character into individual characters, which are stored as the character dictionary CharDictionary in a <character> structure; the split-off characters are deduplicated as they are stored. For example, the words 明天 ("tomorrow") and 明媚 ("bright and charming") split into the characters 明, 天, 明, and 媚, but the resulting character dictionary is {明, 天, 媚}.
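The dictionary-building paragraph above can be sketched in a few lines. This is an illustrative sketch, not the patent's code: `build_dictionaries` and the toy two-word embedding are made-up names, and a plain `{word: vector}` dict stands in for a real pre-trained word-vector model.

```python
# Hypothetical sketch of steps S11/S12: extract the words of a pre-trained
# word-vector model (here a plain {word: vector} dict) into WordDictionary,
# then split them into deduplicated characters to form CharDictionary.

def build_dictionaries(embedding):
    word_dictionary = list(embedding)      # WordDictionary: all words of the model
    char_dictionary = []                   # CharDictionary: deduplicated characters
    seen = set()
    for word in word_dictionary:
        for char in word:                  # split each word character by character
            if char not in seen:
                seen.add(char)
                char_dictionary.append(char)
    return word_dictionary, char_dictionary

# The example from the text: 明天 and 明媚 split into 明, 天, 明, 媚,
# and deduplication leaves {明, 天, 媚}.
words, chars = build_dictionaries({"明天": [0.1, 0.2], "明媚": [0.3, 0.4]})
print(chars)  # → ['明', '天', '媚']
```

Deduplicating with a `seen` set while appending to a list keeps the characters in first-occurrence order, matching the {明, 天, 媚} result in the worked example.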
In an embodiment, the chi-square acquisition step S2 comprises:
mapping-table building step S21: building a character-word mapping table from the compositional relation between characters and the words they form;
co-occurrence statistics step S22: counting how often pairs of characters do and do not co-occur within the words of the dictionary;
chi-square calculation step S23: computing the character chi-square from those co-occurrence and non-co-occurrence counts, and normalizing it.
In a specific implementation, the mapping from each character to the words it helps form is stored as <character, set of words the character participates in forming> and denoted CharWordRelationship. For example, from the words 明天 and 明媚 we obtain CharWordRelationship = {<明, {明天, 明媚}>, <天, {明天}>, <媚, {明媚}>}. The co-occurrence relation between characters is then counted by jointly traversing the word dictionary <word> and the character dictionary <character>, tallying, for each pair of characters, the number of times they co-occur within the words of the dictionary and the number of times they do not. For two characters c_i and c_j, the co-occurrence counts over the word dictionary are tabulated as follows:

                       contains c_j    does not contain c_j    total
contains c_i                A                   B               A+B
does not contain c_i        C                   D               C+D
total                      A+C                 B+D             A+B+C+D

For illustration, take WordDictionary = {明天, 明白, 明媚, 明朝, 明明白白, 天气, 科学}, from which CharDictionary = {明, 天, 白, 媚, 朝, 气, 科, 学}. The co-occurrence counts for the characters 明 and 天 are shown in the following table.
                       contains 天    does not contain 天    total
contains 明                1                 5                6
does not contain 明        1                 1                2
total                      2                 6                8
Next the character chi-square (χ²) is calculated. Following the chi-square formula, the chi-square is computed between each character in the dictionary CharDictionary and the other characters of every word that the character participates in forming, per CharWordRelationship. For two characters c_i and c_j, the chi-square is:

χ²(c_i, c_j) = (A·D − B·C)² / ((A+B)·(C+D))

where A, B, C, and D are the counts from the co-occurrence table above.
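The counting and chi-square steps can be sketched as follows. Two assumptions are read off the worked values rather than stated in the text: the chi-square denominator keeps only the row totals (A+B)(C+D), and a character occurring twice in one word (as 明 does in 明明白白) is counted twice toward B or C. Function and variable names are illustrative.

```python
# Sketch of steps S22/S23 (illustrative, not the patent's code). For a
# character pair (ci, cj), tally over the word dictionary: A = words
# containing both, B/C = occurrence counts of ci/cj alone, D = words
# containing neither; then apply the chi-square formula from the text.

def contingency(ci, cj, word_dictionary):
    A = B = C = D = 0
    for word in word_dictionary:
        ni, nj = word.count(ci), word.count(cj)
        if ni and nj:
            A += 1
        elif ni:
            B += ni            # 明明白白 contributes 2 to B for ci = 明
        elif nj:
            C += nj
        else:
            D += 1
    return A, B, C, D

def chi2(ci, cj, word_dictionary):
    A, B, C, D = contingency(ci, cj, word_dictionary)
    return (A * D - B * C) ** 2 / ((A + B) * (C + D))

word_dictionary = ["明天", "明白", "明媚", "明朝", "明明白白", "天气", "科学"]
print(round(chi2("明", "天", word_dictionary), 2))  # → 1.33
print(round(chi2("明", "媚", word_dictionary), 2))  # → 0.33
```

With the example dictionary this reproduces the worked values for χ²(明, 天), χ²(明, 媚), and χ²(明, 朝).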
Continuing the example, CharWordRelationship = {<明, {明天, 明白, 明媚, 明朝, 明明白白}>, <天, {明天, 天气}>, <白, {明白, 明明白白}>, <媚, {明媚}>, <朝, {明朝}>, <气, {天气}>, <科, {科学}>, <学, {科学}>}. The character-pair chi-squares are then obtained as follows:
χ²(明, 天) = (1×1 − 5×1)² / ((1+5)×(1+1)) = 1.33
χ²(明, 白) = (3×2 − 0×3)² / ((3+3)×(0+2)) = 0.75
χ²(明, 媚) = (1×2 − 5×0)² / ((1+5)×(0+2)) = 0.33
χ²(明, 朝) = (1×2 − 5×0)² / ((1+5)×(0+2)) = 0.33
......
Further, the chi-square between a character and each word it participates in forming is obtained as follows:
χ²(明, 明天) = χ²(明, 天) = 1.33
χ²(明, 明白) = χ²(明, 白) = 0.75
χ²(明, 明媚) = χ²(明, 媚) = 0.33
χ²(明, 明朝) = χ²(明, 朝) = 0.33
χ²(明, 明明白白) = (χ²(明, 明) + χ²(明, 白) + χ²(明, 白)) / 3 = (1 + 0.75 + 0.75) / 3 = 0.83
......
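The character-to-word values above follow a simple rule: a character's chi-square against a word is the average of its pairwise chi-squares with the word's remaining characters. A minimal sketch of that rule follows; the value χ²(明, 明) = 1 is taken from the 明明白白 worked example, and all names are illustrative.

```python
# Sketch of the character-to-word chi-square: average the pairwise
# chi-squares between the character and the other characters of the word
# (one occurrence of the character itself is removed first).

def char_word_chi2(ci, word, pair_chi2):
    # pair_chi2: mapping (ci, cj) -> chi-square of the character pair
    others = list(word)
    others.remove(ci)                      # drop one occurrence of ci
    return sum(pair_chi2[(ci, c)] for c in others) / len(others)

pair_chi2 = {("明", "天"): 1.33, ("明", "白"): 0.75, ("明", "明"): 1.0}

print(char_word_chi2("明", "明天", pair_chi2))                # → 1.33
print(round(char_word_chi2("明", "明明白白", pair_chi2), 2))  # → 0.83
```

For 明明白白 the remaining characters after removing one 明 are 明, 白, 白, so the average is (1 + 0.75 + 0.75) / 3 ≈ 0.83, matching the worked value.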
In an embodiment, the character-vector calculation step S3 comprises obtaining the character vector by weighted calculation from the character-word mapping table and the normalized character chi-square.
In a specific implementation, the chi-square between each character and every word in its set of formed words is computed from the character-pair chi-squares, the chi-squares are normalized into weights, and the character's vector is computed as the weighted sum of the vectors of the words it participates in forming:

vector(c_i) = Σ_{c_k ∈ f(c_i)} Value(c_i, c_k) · vector(c_k)

where c_i is a character in the character dictionary, f(c_i) is the set of words that c_i participates in forming, and c_k is one word in the set f(c_i).
Further, the weights of the character 明 with respect to the words it participates in forming are computed by normalizing the inverse chi-squares. The common denominator is
S = 1/χ²(明, 明天) + 1/χ²(明, 明白) + 1/χ²(明, 明媚) + 1/χ²(明, 明朝) + 1/χ²(明, 明明白白) = 1/1.33 + 1/0.75 + 1/0.33 + 1/0.33 + 1/0.83 ≈ 9.35, giving:
Value(明, 明天) = (1/1.33) / S = 0.08
Value(明, 明白) = (1/0.75) / S = 0.14
Value(明, 明媚) = (1/0.33) / S = 0.32
Value(明, 明朝) = (1/0.33) / S = 0.32
Value(明, 明明白白) = (1/0.83) / S = 0.13
......
Further, the character-word chi-squares obtained above are turned into the weighting weights of the character vector of 明. The weight formula is:

Value(c_i, c_k) = (1/χ²(c_i, c_k)) / Σ_{c_m ∈ f(c_i)} (1/χ²(c_i, c_m))

vector(明) = Value(明, 明天)·vector(明天) + Value(明, 明白)·vector(明白) + Value(明, 明媚)·vector(明媚) + Value(明, 明朝)·vector(明朝) + Value(明, 明明白白)·vector(明明白白).
Example two
Referring to fig. 3 and 4, fig. 3 is a device framework diagram of character-vector acquisition according to an embodiment of the present application, and fig. 4 is a schematic structural diagram of the character-vector acquisition system of the present invention. As shown in figs. 3 and 4, the character-vector acquisition system of the present invention, to which the above acquisition method applies, comprises:
the dictionary building unit 51: extracts the words from a pre-trained word-vector model, splits them into characters, deduplicates the characters, and forms a character dictionary;
the chi-square acquisition unit 52: counts character co-occurrence frequencies and computes the character chi-square from them;
the character-vector calculation unit 53: weights word vectors by the character chi-square to obtain character vectors.
In an embodiment, the dictionary building unit 51 comprises:
the word-dictionary generation module 511: reads the pre-trained word-vector model, extracts all of its words, and forms a word dictionary from them;
the character-dictionary generation module 512: splits the words in the word dictionary into individual characters, stores and deduplicates them, and forms a character dictionary from the result.
In an embodiment, the chi-square acquisition unit 52 comprises:
the mapping-table building module 521: builds a character-word mapping table from the compositional relation between characters and the words they form;
the co-occurrence statistics module 522: counts how often pairs of characters do and do not co-occur within the words of the dictionary;
the chi-square calculation module 523: computes the character chi-square from those counts and normalizes it.
In an embodiment, the character-vector calculation unit 53 obtains the character vector by weighted calculation from the character-word mapping table and the character chi-square.
Example three
Referring to fig. 5, this embodiment discloses a specific implementation of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 82 may include mass storage for data or instructions. By way of example and not limitation, the memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 82 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the character-vector acquisition device, where appropriate. In a particular embodiment, the memory 82 is non-volatile memory. In particular embodiments, the memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), flash memory, or a combination of two or more of these. The RAM may be Static RAM (SRAM) or Dynamic RAM (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Output DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the character-vector acquisition methods in the above-described embodiments by reading and executing the computer program instructions stored in the memory 82.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 5, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 implements communication between modules, devices, units, and/or equipment in the embodiments of the present application. It may also carry out data communication with external components such as external devices, image or character-vector acquisition devices, databases, external storage, and image or character-vector acquisition workstations.
The bus 80 comprises hardware, software, or both, and couples the components of the electronic device to one another. The bus 80 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated.
The electronic device may be connected to a character-vector acquisition system to implement the methods described in connection with figs. 1 and 2.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described, but any combination of these features that involves no contradiction should be considered within the scope of this specification.
In summary, character vectors can be computed directly from existing resources, without resource-intensive training. Moreover, when the method is applied to text analysis, text can be vectorized directly from the character vectors without word segmentation, avoiding the cascading errors caused by segmentation mistakes and improving natural-language-processing capability. The invention thus solves the problems that obtaining character vectors by training consumes substantial resources, that character vectors cannot be derived directly from existing resources, and that text cannot be vectorized directly from character vectors, requiring word segmentation whose mistakes cause cascading errors.
The above-mentioned embodiments express only several embodiments of the present application; although they are described in specific detail, this detail shall not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims (10)

1. A method for obtaining a word vector, comprising:
a dictionary building step: extracting words from a pre-trained word vector model, segmenting the words into characters, processing the characters, and forming a dictionary from the processed characters;
a chi-square obtaining step: counting the co-occurrence frequency of the words, and calculating the word chi-square according to the co-occurrence frequency;
a word vector calculation step: and carrying out weighted calculation on the word chi-square to obtain a word vector.
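A minimal end-to-end sketch of the three claimed steps follows. The toy vocabulary, the vector values, and the particular chi-square weighting scheme are all illustrative assumptions; the claim does not prescribe a specific formula.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pre-trained word-vector model: word -> vector.
word_vectors = {
    "北京": [0.1, 0.2],
    "京城": [0.3, 0.1],
    "城市": [0.2, 0.4],
}

# Dictionary building step: segment words into characters, de-duplicate.
char_dict = sorted({c for w in word_vectors for c in w})

# Chi-square obtaining step: count, over the vocabulary, how many words
# contain each character and each character pair.
n_words = len(word_vectors)
occ = defaultdict(int)   # words containing a given character
co = defaultdict(int)    # words containing both characters of a pair
for w in word_vectors:
    cs = set(w)
    for c in cs:
        occ[c] += 1
    for a, b in combinations(sorted(cs), 2):
        co[(a, b)] += 1

def chi2(a, b):
    # Chi-square of the 2x2 contingency table for the character pair.
    n11 = co[tuple(sorted((a, b)))]
    n10 = occ[a] - n11
    n01 = occ[b] - n11
    n00 = n_words - n11 - n10 - n01
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n_words * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

# Word-vector calculation step: weight each containing word's vector by the
# character's chi-square association with the word's other characters.
def char_vector(c):
    weights, vecs = [], []
    for w, v in word_vectors.items():
        if c in w:
            others = [o for o in set(w) if o != c]
            weights.append(sum(chi2(c, o) for o in others) or 1.0)
            vecs.append(v)
    total = sum(weights)
    dim = len(next(iter(word_vectors.values())))
    return [sum(wt * v[i] for wt, v in zip(weights, vecs)) / total
            for i in range(dim)]
```

Because the weights are normalized by their sum, each character vector stays in the convex hull of the vectors of the words containing it.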
2. The method for obtaining word vectors as claimed in claim 1, wherein said dictionary building step comprises:
a word dictionary generating step: after the pre-trained word vector model is read, extracting all words from the pre-trained word vector model, and forming a word dictionary from the words;
a character dictionary generating step: segmenting the words in the word dictionary into individual characters, storing and de-duplicating the segmented characters, and forming the dictionary from the processed characters.
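The two sub-steps of claim 2 — building a word dictionary from the model's vocabulary, then a de-duplicated character dictionary from it — can be sketched as follows; the vocabulary shown is a hypothetical example.

```python
# Hypothetical vocabulary read from a pre-trained word-vector model.
words = ["北京", "京城", "城市", "北京"]

# Word-dictionary step: extract all words, keep order, drop duplicates.
word_dict = list(dict.fromkeys(words))

# Character-dictionary step: segment into characters, store, de-duplicate.
char_dict = list(dict.fromkeys(c for w in word_dict for c in w))
```

`dict.fromkeys` preserves first-seen order while de-duplicating, which keeps the character dictionary stable across runs.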
3. The method for obtaining word vectors according to claim 2, wherein the chi-square obtaining step includes:
a mapping table establishing step: building a word mapping table according to the compositional relation between characters and words;
a co-occurrence frequency statistic step: counting the co-occurrence times and the non-co-occurrence times of the words in the dictionary;
a chi-square calculation step: calculating the word chi-square according to the co-occurrence counts and non-co-occurrence counts of the words in the dictionary, and normalizing the word chi-square.
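One common reading of the chi-square calculation in claim 3 treats each character pair as a 2×2 contingency table over the dictionary's words, then normalizes the resulting scores; the counts below are illustrative assumptions, not values from the patent.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square of a 2x2 contingency table:
    n11 = words where both characters occur (co-occurrence count)
    n10, n01 = words where only one of them occurs (non-co-occurrence)
    n00 = words where neither occurs."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

# Illustrative scores for three character pairs, then min-max normalization.
scores = [chi_square(3, 1, 2, 10), chi_square(1, 4, 5, 6), chi_square(0, 5, 5, 6)]
lo, hi = min(scores), max(scores)
normalized = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]
```

Min-max scaling is only one possible normalization; the claim requires normalization but does not fix the scheme.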
4. The method according to claim 3, wherein the word vector calculation step includes obtaining the word vector by weighted calculation based on the word mapping table and the normalized word chi-square.
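The weighted calculation of claim 4 can be sketched as a weighted average over the words that contain a character, using the mapping table and normalized chi-square scores as weights; every value and name below is a hypothetical stand-in.

```python
# Hypothetical inputs: a character-to-words mapping table, normalized
# chi-square weights per (character, word) pair, and word vectors.
mapping = {"京": ["北京", "京城"]}
weight = {("京", "北京"): 0.8, ("京", "京城"): 0.2}  # normalized chi-square
word_vectors = {"北京": [0.1, 0.2], "京城": [0.3, 0.1]}

def char_vector(c):
    # Weighted average of the vectors of all words containing character c.
    ws = mapping[c]
    total = sum(weight[(c, w)] for w in ws)
    dim = len(word_vectors[ws[0]])
    return [sum(weight[(c, w)] * word_vectors[w][i] for w in ws) / total
            for i in range(dim)]
```

Dividing by the weight total makes the result invariant to the overall scale of the chi-square scores.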
5. A word vector obtaining system, applying the word vector obtaining method according to any one of claims 1 to 4, the word vector obtaining system comprising:
a dictionary building unit: extracting words from a pre-trained word vector model, segmenting the words into characters, processing the characters, and forming a dictionary from the processed characters;
a chi-square acquisition unit: counting the co-occurrence frequency of the words, and calculating the word chi square according to the co-occurrence frequency;
a word vector calculation unit: and carrying out weighted calculation on the word chi-square to obtain a word vector.
6. The system for obtaining word vectors according to claim 5, wherein the dictionary building unit includes:
a word dictionary generation module: after the pre-trained word vector model is read, extracting all words from the pre-trained word vector model, and forming a word dictionary from the words;
a character dictionary generation module: segmenting the words in the word dictionary into individual characters, storing and de-duplicating the segmented characters, and forming the dictionary from the processed characters.
7. The system for obtaining word vectors according to claim 6, wherein the chi-square obtaining unit includes:
a mapping table building module: building a word mapping table according to the compositional relation between characters and words;
a co-occurrence frequency statistic module: counting the co-occurrence times and the non-co-occurrence times of the words in the dictionary;
a chi-square calculation module: calculating the word chi-square according to the co-occurrence counts and non-co-occurrence counts of the words in the dictionary, and normalizing the word chi-square.
8. The system for obtaining word vectors according to claim 7, wherein the word vector calculating unit obtains the word vectors by weighted calculation based on the word mapping table and the word chi-square.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of obtaining a word vector according to any one of claims 1 to 4 when executing the computer program.
10. An electronic device readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of obtaining a word vector according to any one of claims 1 to 4.
CN202110552803.8A 2021-05-20 2021-05-20 Word vector obtaining method and system, electronic equipment and storage medium Withdrawn CN113255335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110552803.8A CN113255335A (en) 2021-05-20 2021-05-20 Word vector obtaining method and system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110552803.8A CN113255335A (en) 2021-05-20 2021-05-20 Word vector obtaining method and system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113255335A true CN113255335A (en) 2021-08-13

Family

ID=77183118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110552803.8A Withdrawn CN113255335A (en) 2021-05-20 2021-05-20 Word vector obtaining method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113255335A (en)

Similar Documents

Publication Publication Date Title
CN105760474B (en) Method and system for extracting feature words of document set based on position information
CN112507711B (en) Text abstract extraction method and system
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN114818891B (en) Small sample multi-label text classification model training method and text classification method
CN110717040A (en) Dictionary expansion method and device, electronic equipment and storage medium
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN112232070A (en) Natural language processing model construction method, system, electronic device and storage medium
WO2015131528A1 (en) Method and apparatus for determining topic distribution of given text
CN114511857A (en) OCR recognition result processing method, device, equipment and storage medium
CN114048288A (en) Fine-grained emotion analysis method and system, computer equipment and storage medium
CN116561320A (en) Method, device, equipment and medium for classifying automobile comments
CN113255334A (en) Method, system, electronic device and storage medium for calculating word vector
CN113255335A (en) Word vector obtaining method and system, electronic equipment and storage medium
CN115630643A (en) Language model training method and device, electronic equipment and storage medium
CN112949446B (en) Object identification method, device, equipment and medium
CN113569703A (en) Method and system for judging true segmentation point, storage medium and electronic equipment
CN113255326A (en) Unknown word vector calculation method, system, electronic device and storage medium
CN113742525A (en) Self-supervision video hash learning method, system, electronic equipment and storage medium
CN112257726A (en) Target detection training method, system, electronic device and computer readable storage medium
CN113742470A (en) Data retrieval method, system, electronic device and medium
CN112364935A (en) Data cleaning method, system, computer equipment and storage medium
CN112749542A (en) Trade name matching method, system, equipment and storage medium
CN113343669B (en) Word vector learning method, system, electronic equipment and storage medium
CN112650837B (en) Text quality control method and system combining classification algorithm and unsupervised algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210813