CN112100332A - Word embedding expression learning method and device and text recall method and device

Info

Publication number
CN112100332A
Authority
CN
China
Prior art keywords
word
text
word embedding
search
node
Prior art date
Legal status
Pending
Application number
CN202010961808.1A
Other languages
Chinese (zh)
Inventor
张雨春
翁泽峰
翟彬旭
张东于
范云霓
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010961808.1A
Publication of CN112100332A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The disclosure provides a word embedding representation learning method and apparatus and a text recall method and apparatus, and relates to the field of artificial intelligence. The word embedding representation learning method comprises the following steps: acquiring a text corpus, performing word segmentation on the text corpus, and constructing a graph structure based on the obtained segmented words and the pronunciation information corresponding to those words; taking each node in the graph structure as an initial node and performing random walks to obtain a node sequence corresponding to each initial node; and training a word embedding representation model on the node sequences to obtain a word embedding lookup table, and determining the word embedding representations corresponding to the text corpus based on the lookup table. By constructing a graph from the segmented words and their pronunciation information and training word embeddings on the graph structure, words with similar morphology end up close to each other in the word embedding space. This avoids recalling wrong texts because of input errors, improves recall efficiency and recall quality, and thereby improves the user experience.

Description

Word embedding expression learning method and device and text recall method and device
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a word embedding representation learning method, a word embedding representation learning apparatus, a text recall method, a text recall apparatus, a computer-readable storage medium, and an electronic device.
Background
Word embedding (also called word vectors, word representation, text representation, etc.) is a collective term for language-modeling and representation-learning techniques in Natural Language Processing (NLP). It refers to embedding a high-dimensional space, whose dimensionality equals the number of all words, into a continuous vector space of much lower dimensionality, with each word or phrase mapped to a vector over the real numbers.
When information is recalled according to a search string, the user may inadvertently include a wrongly written character in the search string. For example, the search string the user intends to input is "new crown pneumonia", but the string actually entered is the identically pronounced "new official pneumonia". If the recall is performed strictly according to the search string containing the wrong character, the recall results will be wrong or incomplete, and the results corresponding to the correct search string will be missing, which degrades the user experience.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiments of the present disclosure provide a word embedding representation learning method, a word embedding representation learning apparatus, a text recall method, a text recall apparatus, a computer-readable storage medium, and an electronic device, so that, at least to some extent, words with similar morphology have similar distances in the vector space, and the precision and completeness of recall are improved.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of an embodiment of the present disclosure, there is provided a word embedding representation learning method including: acquiring a text corpus, performing word segmentation processing on the text corpus, and constructing a graph structure based on the obtained word segmentation and pronunciation information corresponding to the word segmentation; taking each node in the graph structure as an initial node, and randomly walking to obtain a node sequence corresponding to the initial node; and training a word embedding representation model according to the node sequence to obtain a word embedding lookup table, and determining word embedding representation corresponding to the text corpus based on the word embedding lookup table.
According to an aspect of an embodiment of the present disclosure, there is provided a word-embedded representation learning apparatus including: the graph construction module is used for acquiring a text corpus, performing word segmentation processing on the text corpus, and constructing a graph structure based on the obtained segmented words and pronunciation information corresponding to the segmented words; the sampling module is used for taking each node in the graph structure as an initial node and randomly walking to obtain a node sequence corresponding to the initial node; and the word embedding obtaining module is used for training the word embedding representation model according to the node sequence to obtain a word embedding lookup table and determining the word embedding representation corresponding to the text corpus based on the word embedding lookup table.
According to an aspect of an embodiment of the present disclosure, there is provided a text recall method including: acquiring a search character string, and performing word segmentation processing on the search character string to acquire search words; inquiring in a word embedding lookup table according to the search participle to obtain word embedding corresponding to the search participle, wherein the word embedding lookup table is a word embedding lookup table obtained according to the word embedding expression learning method in the embodiment; and acquiring a search vector corresponding to the search character string according to word embedding corresponding to all the search participles, and determining a recall text according to the search vector and a text vector corresponding to the candidate text.
According to an aspect of an embodiment of the present disclosure, there is provided a text recall apparatus including: a word segmentation module, configured to acquire a search character string and perform word segmentation on it to obtain search participles; a word embedding acquisition module, configured to query a word embedding lookup table according to the search participles to obtain the word embeddings corresponding to the search participles, where the word embedding lookup table is obtained according to the word embedding representation learning method in the foregoing embodiment; and a recall module, configured to acquire a search vector corresponding to the search character string according to the word embeddings corresponding to all the search participles, and to determine a recall text according to the search vector and the text vectors corresponding to the candidate texts.
According to an aspect of an embodiment of the present disclosure, there is provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
According to an aspect of an embodiment of the present disclosure, there is provided an electronic device including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the methods provided in the various alternative implementations described above.
In the technical solutions provided in some embodiments of the present disclosure, a text corpus is segmented into words, a graph structure is constructed from the obtained words and their corresponding pronunciation information, a plurality of node sequences are then obtained by random walks over the nodes of the graph structure, and finally a word embedding representation model is trained on these node sequences to obtain a word embedding lookup table, from which the word embedding representations corresponding to the text corpus are determined. With this technical solution, on the one hand, the word embedding representation model can be trained on a graph structure into which pronunciation information has been introduced, which improves the model's performance so that morphologically similar characters have similar vector representations in the word embedding space and the out-of-vocabulary (OOV) problem is alleviated; on the other hand, the word embedding representation corresponding to a text can be obtained accurately, improving the quality of the recalled texts.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 shows a schematic diagram of an exemplary system architecture to which technical aspects of embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow diagram of a word embedding representation learning method according to one embodiment of the present disclosure;
FIG. 3 schematically illustrates a structural schematic of a graph structure according to one embodiment of the present disclosure;
FIG. 4 schematically illustrates a structural schematic of a random walk sampling according to the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a text recall method according to one embodiment of the present disclosure;
FIG. 6 schematically shows a flowchart of obtaining a recall text according to one embodiment of the present disclosure;
FIGS. 7A-7B schematically illustrate interface diagrams for enterprise search, according to one embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a word embedding representation learning apparatus, according to one embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a text recall device according to one embodiment of the present disclosure;
FIG. 10 illustrates a schematic block diagram of a computer system suitable for implementing a word-embedded representation learning device and a text recall device of an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. The terminal device 101 may be a mobile phone, a portable computer, a tablet computer, a desktop computer, or other terminal devices with a display screen; the network 102 is a medium for providing a communication link between the terminal device 101 and the server 103, and the network 102 may include various connection types, such as a wired communication link, a wireless communication link, and the like, and in the embodiment of the present disclosure, the network 102 between the terminal device 101 and the server 103 may be a wireless communication link, and particularly may be a mobile network.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminals, networks, and servers, as desired for an implementation. For example, server 103 may be a server cluster composed of a plurality of servers, and may be used to store information related to search string processing.
In an embodiment of the present disclosure, a user inputs a search string through an input device built in or outside the terminal device 101, the search string input by the user can be sent to the server 103 through the network 102, after receiving the search string, the server 103 may first perform a word segmentation process on the search string to obtain a search word, then perform a search in a word embedding lookup table obtained in advance through a training word embedding representation model according to the search word to obtain a search vector corresponding to the search string, and finally calculate a similarity according to the search vector corresponding to the search string and a text vector of a candidate text, and perform a text recall according to the similarity to obtain a recalled text corresponding to the search string. When a word embedding lookup table is obtained, firstly, a text corpus can be obtained, word segmentation is obtained by performing word segmentation processing on the text corpus, and after word segmentation is obtained, a graph structure can be constructed according to the word segmentation and pronunciation information corresponding to the word segmentation; then, taking each node in the graph structure as an initial node, and acquiring a node sequence corresponding to each initial node in a random walk mode; and finally, training the word embedding representation model according to the node sequence, acquiring an embedding matrix corresponding to a hidden layer in the word embedding representation model after the training is finished as a word embedding lookup table, and determining word embedding representations corresponding to the text corpus and word embedding representations corresponding to the search participles based on the word embedding lookup table. Further, in order to improve the accuracy of word embedding and alleviate the problem of out-of-vocabulary (OOV), a graph structure may be constructed according to a word stock under common characters and service scenes, and a word embedding representation model may be trained according to a node sequence determined based on the graph structure to obtain a word embedding lookup table, and after a search character string is obtained, a word embedding lookup table corresponding to the service scene may be selected according to the service scene corresponding to the search character string, and a word embedding representation corresponding to the search character string may be obtained.
It should be noted that the word embedding representation learning method and the text recall method provided by the embodiments of the present disclosure are generally executed by a server, and accordingly, the word embedding representation learning apparatus and the text recall apparatus are generally disposed in the server. However, in other embodiments of the present disclosure, the word embedding representation learning method and the text recall method provided by the embodiments of the present disclosure may also be executed by a terminal device.
In high-level natural language processing tasks, machine learning methods need to convert words into mathematical representations and then compute with those representations to complete tasks at the semantic level. In statistical learning models, word embedding is used to complete natural language processing tasks and is a key technology for them. In the related art, common word embedding training methods fall into two main categories, static representations and dynamic representations. Static representations include word embeddings obtained from bag-of-words models, topic models, classical language models, and optimized language models; dynamic representations include word embeddings obtained from ELMo (Embeddings from Language Models), GPT, and BERT (Bidirectional Encoder Representations from Transformers). Bag-of-words models mainly cover discrete representations such as one-hot encoding, TF-IDF, and TextRank, but because they ignore the grammar and word order of a document and treat it merely as a set of independent, unordered words, they suffer from the curse of dimensionality, there is no association between word vectors, and a semantic gap exists. Topic models mainly include matrix-decomposition-based models such as LSA, LDA, and GloVe, but their computation cost is high. Classical language models mainly include NPLM, C&W, and similar models, in which word vectors are a by-product, but they are computationally expensive and difficult to implement in engineering. Optimized language models mainly include targeted optimized models such as word2vec and FastText, but they cannot resolve polysemy. ELMo is a language model that performs bidirectional semantic feature extraction based on a two-layer bidirectional LSTM; its LSTM feature extraction capability is limited and the feature fusion obtained by bidirectional concatenation is weak. GPT is a unidirectional language model based on the Transformer decoder, and therefore captures only unidirectional semantics. BERT is a bidirectional language model based on the Transformer encoder structure, but its training cost is high and it requires a large amount of data.
In view of the problems in the related art, the embodiments of the present disclosure provide a word embedding representation learning method and a text recall method, which are implemented based on machine learning, a branch of Artificial Intelligence (AI). AI is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and further processing the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the disclosure relates to an artificial intelligence natural language processing technology, can be applied to the field of information search, and is specifically explained by the following embodiments:
fig. 2 schematically illustrates a flow diagram of a word embedding representation learning method according to one embodiment of the present disclosure, which may be performed by a server, which may be the server 103 shown in fig. 1. Referring to fig. 2, the word embedding expression learning method at least includes steps S210 to S230, which are described in detail as follows:
in step S210, a text corpus is obtained, word segmentation is performed on the text corpus, and a graph structure is constructed based on the obtained word segments and pronunciation information corresponding to the word segments.
In one embodiment of the present disclosure, different countries and regions use different languages, such as Chinese, English, French, and German; although the language types differ, the idea of converting the various languages into word vectors is substantially the same. In the embodiment of the present disclosure, a large number of text corpora may first be obtained. The text corpora may relate to a specific business scene, for example corpora related to the business information of enterprises, specifically the registered business names of enterprises, and so on; they may also cover several related business scenes, for example corpora relating to insurance and physical examination reports; of course, other corpora may also be used and adapted to different requirements.
After the text corpus is obtained, it may be preprocessed; specifically, the text corpus may be segmented into words. The segmentation method differs slightly for text corpora of different languages. For example, when the text corpus is Chinese text, dictionary-based, statistics-based, or deep-learning-based word segmentation may be used. Dictionary-based segmentation methods include the forward maximum matching method, the reverse maximum matching method, and the bidirectional maximum matching method. Statistics-based segmentation learns the rules of segmentation (that is, it is trained) from a large amount of already-segmented text using a statistical machine learning model, and then segments unknown text; the main statistical models include the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), and the Conditional Random Field model (CRF). Statistics-based segmentation methods include the N-shortest-path method, word segmentation based on a word-level N-gram model, Chinese word segmentation based on characters forming words, Chinese word segmentation based on a character-level perceptron algorithm, and Chinese word segmentation combining character-level generative and discriminative models. When the text corpus is non-Chinese text, n-gram splitting and sub-word splitting may be adopted: word stems are obtained through n-gram splitting and sub-words are obtained through sub-word splitting.
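The patent does not prescribe a specific segmentation tool; the following is a minimal sketch assuming the open-source jieba tokenizer for Chinese text and a simple character n-gram split as the non-Chinese fallback.

```python
# A minimal sketch of the segmentation step, assuming the open-source
# jieba tokenizer for Chinese text and a simple character n-gram split
# for non-Chinese tokens; the patent does not prescribe a specific tool.
import jieba


def segment_chinese(text: str) -> list:
    """Dictionary/statistics-based Chinese word segmentation via jieba."""
    return jieba.lcut(text)


def char_ngrams(token: str, n: int = 3) -> list:
    """n-gram splitting for non-Chinese tokens (a sub-word style fallback)."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


if __name__ == "__main__":
    # Output depends on jieba's dictionary; shown only as an illustration.
    print(segment_chinese("腾讯科技有限公司"))
    print(char_ngrams("pneumonia"))
```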
In one embodiment of the present disclosure, taking information search and recall as an example, a user may inadvertently introduce a misspelled or wrongly written word when entering a search string, or a recognition error may occur when performing Optical Character Recognition (OCR) on a text. If the search and recall system performs recall exactly according to the acquired search string, the recall may return nothing, or the results may be wrong or incomplete. Therefore, to improve the quality of information recall and further improve the user experience, words with similar morphology should have similar distances in the vector space, so that when searching a large amount of candidate information according to the search string, not only information containing the search string itself but also information containing words morphologically similar to it can be recalled. Morphology generally refers to the written form of a word, one of the main elements of written language. Morphologically similar words usually share the same or similar pronunciation: taking Chinese characters as an example, several distinct characters share the pronunciation teng and several others share the pronunciation xun. Morphologically similar words also exist in non-Chinese languages, such as angle and angel, affect and effect, quite and quit, and they likewise have similar or identical pronunciations. To associate morphologically similar words and give them similar word embedding representations, a graph can be constructed from such words, and vector conversion can be performed based on the graph to obtain their word embeddings.
In an embodiment of the present disclosure, when constructing the graph structure, word segmentation may be performed on the text corpus to obtain segmented words and the pronunciation information corresponding to those words. Specifically, the segmented words and the pronunciation information are used as nodes, the relationships between the words and the pronunciation information are used as edges, and the graph structure is formed from the nodes and edges. The pronunciation information corresponding to the segmented words depends on the type of text corpus: when the text corpus is Chinese text, the pronunciation information may be pinyin; when the text corpus is non-Chinese text, the pronunciation information may be phonetic symbols.
In order to make the technical scheme of the present disclosure clearer, a specific description is given below by taking a chinese text corpus as an example.
In an embodiment of the disclosure, a graph structure may be constructed from the segmented words obtained by segmenting a Chinese text corpus and the pinyin corresponding to each character in those words. Specifically, the segmented words of the Chinese text and the corresponding pinyin are used as nodes, the relationships between the segmented words, the individual characters in them, and the pinyin corresponding to those characters are used as edges, and a directed acyclic graph is constructed from the nodes and edges. It is worth noting that the pinyin corresponding to a single character includes both its standard pinyin and non-standard pinyin that is similar to the standard pinyin. Non-standard pinyin mainly arises from regional differences and accents, for example the confusion of l and n: the standard pinyin of "milk" is niu nai, but because of non-standard pronunciation some people may spell it liu lai. Introducing nodes and edges for different characters with the same pinyin and for characters with similar pinyin enlarges the amount of information in the graph structure, ensures to the greatest extent that morphologically similar words are associated, and gives such words similar distances in the vector space.
Fig. 3 shows a schematic diagram of a graph structure. As shown in Fig. 3, word segmentation is performed on a text corpus to obtain segmented words such as "tengxing yun", "tengxing", "yitong", "tengxing", and "tengyo"; the individual characters in these words are morphologically similar, and the graph structure shown in Fig. 3 can be formed from the individual characters in the words and their pinyin. Although the connection relationships between characters with similar pinyin are not shown in Fig. 3, it should be understood that the graph structure contains a considerable amount of data, with the number of nodes and edges possibly in the tens of millions or even hundreds of millions, so all connection relationships, such as edges between characters with similar pinyin and between different characters sharing the same pinyin, should be contained in the graph structure.
In an embodiment of the present disclosure, in the process of constructing the graph structure, a weight may be assigned to the edge between each pair of nodes according to a preset rule, which can be set according to specific business requirements. For example, a higher first weight may be set for the edge between a segmented word and a character in that word, a second weight lower than the first weight for the edge between the character and its standard pinyin, and a third weight lower than the second weight for the edge between the character and its non-standard pinyin. Alternatively, the weight may be determined from the edit distance between characters in pinyin or composition, for example taking the reciprocal of the edit distance as the edge weight. Of course, a separate weight model may also be trained to assign weights to the edges.
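A minimal sketch of this construction is given below, using networkx for the directed graph and pypinyin for the standard pinyin of each character; the weight values and the accent-confusion table are illustrative assumptions rather than values from the patent.

```python
# Sketch of the graph construction described above, using networkx for the
# directed graph and pypinyin for standard pinyin. The weight values and the
# accent-confusion table are illustrative assumptions, not values from the patent.
import networkx as nx
from pypinyin import lazy_pinyin

W_WORD_CHAR = 1.0    # first (highest) weight: word -> character edge
W_CHAR_STD = 0.5     # second weight: character -> standard pinyin edge
W_CHAR_FUZZY = 0.2   # third weight: character -> non-standard pinyin edge

FUZZY_INITIALS = {"n": "l", "l": "n"}  # hypothetical l/n confusion table


def fuzzy_pinyin(py: str):
    head = py[0]
    return FUZZY_INITIALS[head] + py[1:] if head in FUZZY_INITIALS else None


def build_graph(words):
    g = nx.DiGraph()
    for word in words:
        for ch in word:
            g.add_edge(word, ch, weight=W_WORD_CHAR)
            std = lazy_pinyin(ch)[0]
            g.add_edge(ch, std, weight=W_CHAR_STD)
            alt = fuzzy_pinyin(std)
            if alt is not None:
                g.add_edge(ch, alt, weight=W_CHAR_FUZZY)
    return g


graph = build_graph(["牛奶", "腾讯"])   # illustrative vocabulary
print(graph.number_of_nodes(), graph.number_of_edges())
```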
Constructing the graph structure from the segmented words and their pronunciation information and learning the word embedding of each node from the graph as the distributed representation of the characters yields high-quality word embeddings: morphologically similar words have high similarity in the word embedding space, and the addition of pinyin nodes greatly alleviates the OOV problem.
In step S220, each node in the graph structure is used as an initial node, and a node sequence corresponding to the initial node is obtained by random walk.
In one embodiment of the present disclosure, after the construction of the graph structure is completed, word embedding of each node in the graph structure can be obtained through machine learning model learning based on the graph structure. When word embedding of each node is obtained through learning, each node in the graph structure can be used as an initial node, a node sequence corresponding to the initial node is obtained through a random walk mode, and then word embedding is trained according to a large number of node sequences.
When a node sequence is acquired by random walk, two parameters may be set first: a first parameter p and a second parameter q, which are used to strike a balance between Breadth-First Search (BFS) and Depth-First Search (DFS) and thus take both local and global information into account. The walk probability of jumping from the current node to an adjacent history node or future node is then determined according to the first parameter p and the second parameter q. Finally, the walk direction is determined according to the walk probability, and the node sequence is determined from the walk direction.
Fig. 4 shows a schematic diagram of random walk sampling. As shown in Fig. 4, the graph structure contains nodes t, v, x1, x2, and x3 and the edges connecting them; the current node is v, reached along the edge (t, v). It can be seen from the figure that at the next sampling step the walk can jump from the current node v to node t, x1, x2, or x3. The walk probability corresponding to each edge is denoted α_pq(t, x); it depends on the node to which the edge connects and on p and q, and its value can be given by formula (1):
α_pq(t, x) = 1/p, if d_tx = 0
α_pq(t, x) = 1,   if d_tx = 1
α_pq(t, x) = 1/q, if d_tx = 2        (1)
where d_tx denotes the shortest path distance from node t to node x: d_tx = 0 means returning to node t; d_tx = 1 means node t is directly connected to node x but node v was selected in the previous step; d_tx = 2 means node t is not directly connected to node x but node v is.
After the parameters p and q are determined, the walk probability corresponding to each edge can be determined. Since sampling should not return to a node that has already been visited, the first parameter p is usually set to a larger value, so that the probability of walking back along the edge (v, t) is small. Furthermore, the values of the first parameter p and the second parameter q can be set according to the sampling requirement: for example, if a search mainly in the breadth direction is desired, q may be set to a value greater than 1 and p to a value greater than q, so that sampling proceeds along the edge (v, x1); if a search in the depth direction is desired, q may be set to a value greater than zero and less than 1 and p to a value greater than 1, so that sampling proceeds along the edges (v, x2) and (v, x3).
In an embodiment of the present disclosure, during sampling, a sampling length L may be set to obtain a plurality of node sequences with each node as an initial node and having a sampling length, where the sampling length L may be set to, for example, 2 ≦ L ≦ 5, and may also be set to other value ranges according to actual needs.
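A sketch of such a biased walk is shown below. It follows the α_pq rule above and reuses the graph built in the earlier sketch; multiplying the bias by the edge weight is one possible design choice, not something mandated by the patent.

```python
# Sketch of the biased random walk with return parameter p, in-out parameter q
# and sampling length L, following the alpha_pq rule above. It reuses the
# graph built in the earlier sketch; multiplying the bias by the edge weight
# is one possible design choice.
import random


def biased_walk(g, start, p, q, length):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(g.successors(cur))
        if not nbrs:
            break
        if len(walk) == 1:
            nxt = random.choice(nbrs)
        else:
            prev = walk[-2]
            weights = []
            for x in nbrs:
                w = g[cur][x].get("weight", 1.0)
                if x == prev:                                     # d_tx = 0
                    weights.append(w / p)
                elif g.has_edge(prev, x) or g.has_edge(x, prev):  # d_tx = 1
                    weights.append(w)
                else:                                             # d_tx = 2
                    weights.append(w / q)
            nxt = random.choices(nbrs, weights=weights, k=1)[0]
        walk.append(nxt)
    return walk


def sample_walks(g, p=4.0, q=2.0, length=5, walks_per_node=10):
    return [biased_walk(g, n, p, q, length)
            for n in g.nodes() for _ in range(walks_per_node)]
```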
In step S230, a word embedding representation model is trained according to the node sequence to obtain a word embedding lookup table, and a word embedding representation corresponding to the text corpus is determined based on the word embedding lookup table.
In an embodiment of the present disclosure, after node sequences with each node of the graph structure as an initial node are obtained, the word embedding representation model may be trained on these node sequences to obtain a stable word embedding representation model and a word embedding lookup table. The word embedding representation model used in the present disclosure may specifically be a Node2vec model or the like. The Node2vec model is a model for generating node vectors in a graph structure: its input is the constructed graph structure, and its output is the vector of each node, that is, the word embedding of the word corresponding to each node. The Node2vec model contains a Skip-gram model, which is one of the word2vec models; after the node sequences are obtained, each node sequence can be processed by the Skip-gram model to obtain a prediction result. When the word embedding representation model is trained on the node sequences, a node sequence is input into the model to obtain the prediction information it outputs, a loss function is then determined from the prediction information and the labeling information corresponding to the node sequence, and finally the parameters of the model are optimized based on the loss function; training can be considered complete when the value of the loss function is minimal or a preset number of training iterations has been performed. The Skip-gram model predicts the context words of an input target word and maximizes the probability of word co-occurrence, that is, node co-occurrence, where the target word is the word corresponding to any node in a node sequence.
The Skip-gram model comprises an input layer, a hidden layer, and an output layer. Each word in a node sequence is fed into the model through the input layer. A weight matrix lies between the input layer and the hidden layer, and the value of the hidden layer is obtained by applying this weight matrix to the input word; likewise there is a weight matrix from the hidden layer to the output layer, and each component of the output-layer vector is the dot product of the hidden-layer vector with the corresponding column of that matrix. Finally, the output-layer vector is normalized to give the prediction probability of each word, that is, the probability that each word in the vocabulary is the context of the target word; the word with the highest probability is the predicted word, i.e., the word with the highest co-occurrence probability with the input target word and the most likely to form a sentence with it.
From the above process, the key to word embedding is the weight matrix from the input layer to the hidden layer; word embeddings are obtained through this weight matrix. The size of the weight matrix obtained after training is N×M, where N is the vocabulary size and M is the word embedding length. Because each word is given a unique number when the vocabulary is built from the word nodes in the graph structure (for example, the words are numbered sequentially from 0 to N), once the weight matrix is obtained, the word embedding corresponding to a word can be obtained by looking up the row of the weight matrix corresponding to the word's number in the vocabulary; that is, the i-th row vector of the weight matrix is the word embedding of the i-th word in the vocabulary. Accordingly, after the weight matrix, i.e., the word embedding lookup table, is obtained, the number of each segmented word can be determined from the vocabulary corresponding to the text corpus, the word embedding corresponding to each segmented word can be retrieved from the lookup table by its number, and the word embedding corresponding to the text corpus can then be obtained from the word embeddings of all its segmented words.
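A minimal sketch of this training and lookup step follows. Using gensim's Word2Vec (sg=1) is an assumption; the patent only requires a Skip-gram-style model inside Node2vec, and the hyperparameter values are illustrative.

```python
# Minimal sketch of training a Skip-gram model on the sampled node sequences
# and reading the input-to-hidden weight matrix as the N x M lookup table.
# Using gensim's Word2Vec (sg=1) here is an assumption.
from gensim.models import Word2Vec

walks = sample_walks(graph)          # node sequences from the earlier sketch

model = Word2Vec(
    sentences=walks,
    vector_size=64,   # M: embedding length (illustrative)
    window=3,
    sg=1,             # Skip-gram
    min_count=0,
    workers=4,
    epochs=20,
)

lookup_table = model.wv.vectors          # N x M matrix, row i = embedding of word i
vocab_index = model.wv.key_to_index      # token -> row number


def embed(token):
    """Look up a token's embedding by its row number in the vocabulary."""
    return lookup_table[vocab_index[token]]
```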
It is to be noted that the graph structure in the embodiment of the present disclosure includes morphologically similar words, so performing word embedding representation learning based on the graph structure gives such words embeddings that are close in the vector space, which makes similarity measurement of homophones possible. Consequently, when information is recalled, not only information containing the search string but also information containing words morphologically similar to the search string can be recalled, avoiding recall errors caused by input errors.
In one embodiment of the present disclosure, since weights were assigned to the edges between connected nodes when the graph structure was constructed in step S210, these weights can act on the loss function during model training to improve model performance. The loss function represents the degree of difference between the prediction information and the label information; the lower the difference, the smaller the loss function and the better the model performance. Introducing the edge weights into the loss computation increases the model's attention to pairs of nodes with a large difference, so that backward parameter adjustment focuses on the differing nodes and the prediction information output by the optimized model becomes similar or identical to the labeling information. The loss function may specifically be a cross-entropy loss function, or another loss function; this disclosure does not specifically limit it.
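One possible way to let the edge weights act on the loss, shown as an illustrative PyTorch sketch and not necessarily the patent's exact formulation, is to scale each centre/context pair's cross-entropy by the weight of the edge linking the two nodes.

```python
# Illustrative formulation of an edge-weighted Skip-gram loss; an assumption,
# not necessarily the exact loss used in the patent.
import torch
import torch.nn.functional as F


def weighted_skipgram_loss(logits, context_ids, edge_weights):
    # logits:       (batch, vocab) scores for predicting the context node
    # context_ids:  (batch,) indices of the true context nodes
    # edge_weights: (batch,) weight of the edge between centre and context node
    per_pair = F.cross_entropy(logits, context_ids, reduction="none")
    return (edge_weights * per_pair).mean()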
In an embodiment of the present disclosure, because the data volume of the text corpus and the graph structure is particularly large, distributed computation with Spark is usually adopted for data processing in engineering during word embedding training, but three problems still arise while the algorithm runs: (1) high graph storage and inter-machine I/O cost; (2) data skew; (3) an overly long dependency chain across multiple rounds of iteration. For problem (1), both nodes and edges must be stored; if the graph structure is stored across multiple machines it must be divided into several subgraphs, and the numbers of nodes and edges must be considered during the division. If a machine stores a large number of nodes whose edges are cut, then when that machine processes those edges it has to pull the edge information from other machines, which makes data processing inefficient. To solve this, a hybrid partitioning method can be used: different partitioning strategies are applied according to the degree of each node, specifically cutting edges for low-degree nodes to preserve locality and cutting vertices for high-degree nodes to reduce node replication, so that the whole graph structure is balanced in parallelism and storage. For problem (2), some words occur with high frequency in the text corpus and some with low frequency, so a machine processing high-frequency words needs a lot of time while a machine processing low-frequency words finishes quickly; since the processing logic must wait for all machines to finish their tasks before the next round, data processing becomes very inefficient. This can be alleviated by multi-stage aggregation and map joins: the task that processes high-frequency words is divided into several subtasks executed simultaneously on multiple machines, and the results from those machines are then merged and handled as one task. For problem (3), the model training process is a multi-round iterative process that optimizes model performance through repeated forward computation and backpropagation of parameters, that is, the same text corpus is used for training many times; as the number of training rounds grows, a machine executing the training algorithm may crash and the training process fail. Reasonably caching intermediate variables or persisting important data structures alleviates this problem and makes the whole run smoother; specifically, the dependency chain of the data can be cut directly and intermediate results cached, so that the next round of data processing starts directly from the intermediate results instead of re-executing the previous flow.
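Two of these mitigations can be sketched in PySpark as follows; the paths, key names, and salt factor are illustrative assumptions, not details from the patent.

```python
# Illustrative PySpark sketches of two of the mitigations described above:
# salting hot keys into sub-tasks before a second aggregation (data skew),
# and persisting/checkpointing an intermediate result to cut the lineage.
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("word-embedding-training").getOrCreate()

# --- data skew: two-stage (salted) aggregation over frequent nodes ---
pairs = spark.createDataFrame([("a", 1), ("a", 1), ("b", 1)], ["node", "cnt"])
salted = pairs.withColumn("salt", (F.rand() * 8).cast("int"))
partial = salted.groupBy("node", "salt").agg(F.sum("cnt").alias("cnt"))
counts = partial.groupBy("node").agg(F.sum("cnt").alias("cnt"))

# --- long dependency chain: persist and checkpoint an intermediate result ---
spark.sparkContext.setCheckpointDir("/tmp/word_embedding_ckpt")
counts = counts.persist(StorageLevel.MEMORY_AND_DISK)
counts = counts.checkpoint()   # truncates the lineage so later rounds restart here
```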
The word embedding representation learning method of the present disclosure solves the word embedding representation problem from the perspective of graph computation. In particular, it gives morphologically similar words embeddings that are close in the vector space, which makes similarity measurement of Chinese homophones possible. Learning large-scale word embeddings on a graph with tens of millions of nodes and hundreds of millions of edges requires only modest computing resources and can be completed within minutes, so the method performs efficiently.
The present disclosure further provides a text recall method based on the word-embedded representation learning method, and fig. 5 shows a flowchart of the text recall method, and as shown in fig. 5, the method at least includes steps S510-S530, specifically:
in step S510, a search string is obtained, and a word segmentation process is performed on the search string to obtain a search word.
In one embodiment of the disclosure, a user inputs a search string in a terminal interface through an input device, wherein the search string can be Chinese, English or other types of strings. In the embodiment of the present disclosure, a chinese search string is still taken as an example for explanation, the search string may be, for example, a name of a person, and information of the name of the person matched with the search string is obtained according to the search string; for example, the name of the business, the information of the corresponding business queried at the business enterprise registration platform according to the search string, and so on.
In an embodiment of the present disclosure, before a search recall is performed according to a search string, the search string needs to be preprocessed, that is, segmented into search participles. For example, if the search string is the enterprise name "XX technology limited liability company", word segmentation yields search participles such as "XX", "technology", and "limited liability company". After the search participles are obtained, the word embeddings corresponding to them may be determined based on the word embedding lookup table.
In step S520, a query is performed in a word embedding lookup table according to the search segmentation to obtain a word embedding representation corresponding to the search segmentation, where the word embedding lookup table is obtained according to the word embedding representation learning method in the above embodiment.
In an embodiment of the present disclosure, when the word embedding lookup table contains embedding vectors for enough words, the word embedding corresponding to each search participle can be obtained from the lookup table according to that participle.
Requiring the word embedding lookup table to contain embedding vectors for enough words means, on the one hand, collecting corpora that cover all business scenes, and on the other hand, building a huge graph structure from those corpora, which is a great challenge for machine storage and processing efficiency. Therefore, to further improve data processing efficiency and avoid the OOV problem, graph structures can be constructed from the text corpora of different business scenes and word embedding training performed separately, yielding a word embedding lookup table for each business scene. After the search string is obtained, the business scene corresponding to the search string can be determined, the target lookup table can be selected among the lookup tables of the different business scenes according to that scene, and the word embeddings corresponding to the search participles can then be queried from it. In this way, the efficiency and quality of model training and the efficiency of converting the search string into a vector are both improved.
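A minimal sketch of this scene-based routing is given below; the scene names, file names, and routing rule are illustrative assumptions.

```python
# Sketch of keeping one word-embedding lookup table per business scene and
# selecting the table matching the scene of the incoming search string.
from gensim.models import KeyedVectors

SCENE_TABLES = {
    "enterprise_search": KeyedVectors.load("enterprise_search.kv"),
    "insurance": KeyedVectors.load("insurance.kv"),
}


def embed_search_tokens(tokens, scene):
    table = SCENE_TABLES[scene]
    # Tokens absent from the table are skipped; the pinyin nodes in the graph
    # keep such OOV cases rare, as discussed above.
    return [table[t] for t in tokens if t in table]
```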
In step S530, a search vector corresponding to the search string is obtained according to word embedding corresponding to all the search participles, and a recall text is determined according to the search vector and a text vector corresponding to a candidate text.
In an embodiment of the present disclosure, after obtaining word embeddings corresponding to search participles in a search character string, the word embeddings corresponding to all the search participles may be sequentially spliced to obtain search vectors corresponding to the search character string, and then matching is performed according to the search vectors and text vectors corresponding to candidate texts to obtain a recall text.
In an embodiment of the present disclosure, there are usually multiple candidate texts when text is recalled. Therefore, when the recall text is determined from the search vector and the text vectors of the candidate texts, a first similarity between the search vector and the text vector of each candidate text can be calculated and the recall text determined from it: when the first similarity is greater than or equal to a preset similarity threshold, the candidate text is recalled as a recall text; when it is less than the threshold, the candidate text is filtered out. The first similarity can be determined by computing the cosine distance, Euclidean distance, or Hamming distance between the search vector and the text vector; the higher the first similarity, the better the corresponding candidate text matches the search string. Because, in the word embedding representation learning process, morphologically similar words receive embeddings that are close in the vector space, determining the recall text from the first similarity recalls not only texts containing the search string but also texts containing words morphologically similar to it, avoiding wrong or missing recall results caused by wrongly written characters in the search string and thereby improving the user experience.
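The recall step can be sketched as follows. The patent describes concatenating the participles' embeddings in order; mean pooling is used here only so that queries and candidates of different lengths stay comparable, and the threshold value is illustrative.

```python
# Sketch of the recall step: pool the search participles' embeddings into a
# fixed-length search vector and keep candidates whose cosine similarity is at
# least a threshold. Mean pooling and the threshold value are assumptions.
import numpy as np


def pool(vectors):
    return np.mean(vectors, axis=0)


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def recall(search_vec, candidates, threshold=0.8):
    # candidates: dict mapping candidate text -> its text vector
    hits = [(text, cosine(search_vec, vec)) for text, vec in candidates.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)
```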
In an embodiment of the present disclosure, when the search string and the candidate texts involve only morphologically similar words, recall can be performed in the manner described in the above embodiments, for example name recall or product recall: the word embeddings of the search name and the candidate names, or of the search product name and the candidate product names, are obtained by the method of the embodiments of the present disclosure, and recall is then performed by calculating the similarity between the embeddings of the search name and the candidate names, or between the embeddings of the search product name and the candidate product names. However, when the search string and the candidate texts contain several fields with different attributes, recall must be performed field by field according to those attributes; for example, if the search string contains both morphologically similar words and semantically similar words, recall cannot be achieved simply by obtaining word embeddings and calculating similarity with the method of the embodiments of the present disclosure. Fig. 6 shows a schematic flowchart of obtaining a recall text. As shown in Fig. 6, in step S601, the candidate texts and their corresponding text vectors are placed in an inverted index, second similarities between the search vector and the text vectors are determined, and an initial recall is performed according to the second similarities; in step S602, third similarities between the search string and the vectors corresponding to fields with the same attribute in the initially recalled candidate texts are obtained, and the initially recalled candidate texts are recalled again according to the third similarities to obtain the recall texts. The second and third similarities may be calculated in the same way as the first similarity or in different ways; this disclosure does not specifically limit it.
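The two-stage flow of Fig. 6 can be sketched as follows, reusing cosine() from the previous sketch; the field names and thresholds are illustrative assumptions.

```python
# Sketch of the two-stage recall of Fig. 6: an initial recall on whole-text
# vectors (step S601), then a per-field similarity check on the surviving
# candidates (step S602). Field names and thresholds are assumptions.
def initial_recall(search_vec, candidates, threshold=0.6):
    # candidates: dict mapping name -> {"full_text": vec, <field>: vec, ...}
    return {name: vecs for name, vecs in candidates.items()
            if cosine(search_vec, vecs["full_text"]) >= threshold}


def field_recall(search_fields, initial_hits, threshold=0.7):
    results = []
    for name, vecs in initial_hits.items():
        # third similarity: compare only fields sharing the same attribute,
        # e.g. "trade_name", "industry", "region", "org_type"
        scores = [cosine(q, vecs[field])
                  for field, q in search_fields.items() if field in vecs]
        if scores and min(scores) >= threshold:
            results.append((name, sum(scores) / len(scores)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```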
Taking enterprise search as an example, figs. 7A-7B are schematic diagrams of an enterprise search interface. As shown in fig. 7A, a user inputs the name of the enterprise to be queried in the display interface of the terminal, for example "Tengxun Technology (Beijing) Co., Ltd.". After the search enterprise name is received, it can be divided into four fields by a sequence labeling model: "Tengxun", "Technology", "Beijing" and "Co., Ltd.", where "Tengxun" is the trade name of the enterprise, "Technology" is the industry attribute, "Beijing" is the geographic location attribute, and "Co., Ltd." is the basic attribute. Therefore, when the enterprise information corresponding to the enterprise name is queried, the query recall needs to be performed over these four attributes, and only the trade name among the four fields involves the problem of similar morphology; for example, the user actually wants to search for "Tengxun Technology (Beijing) Co., Ltd." but mistakenly enters the similar-sounding trade name "Tengxin". The industry attribute and the basic attribute mainly involve semantic problems; for example, "technology" and "science and technology" are semantically similar, and "Co., Ltd." and "limited liability company" are semantically similar. The fields with different attributes in the enterprise name are then encoded with different vector conversion methods. For the word embedding conversion of the trade name, the word embedding representation learning method in the embodiments of the present disclosure can be used to obtain a word embedding lookup table for the enterprise search business scene, and the corresponding word embedding is then determined in the word embedding lookup table according to the code of the trade name in the vocabulary. After the vector corresponding to each field is determined, the search vector corresponding to the search enterprise name can be obtained. When the enterprise information query platform performs the query, matching can be performed between the search vector corresponding to the search enterprise name and the text vectors corresponding to the candidate enterprise names stored in the database, and the matched candidate enterprise names are returned to the terminal so that the user can click to view the enterprise details.
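For illustration only, the per-field encoding described above might look like the following sketch; the field labels and the segment_fn, trade_name_lookup and semantic_encoder helpers are hypothetical stand-ins, not components named in the disclosure.

```python
def encode_enterprise_name(name, segment_fn, trade_name_lookup, semantic_encoder):
    """segment_fn plays the role of the sequence labeling model described above."""
    fields = segment_fn(name)
    # e.g. {"trade_name": "Tengxun", "industry": "Technology",
    #       "location": "Beijing", "org_type": "Co., Ltd."}
    vecs = {}
    # The trade name uses the graph-based word embedding lookup table, so
    # morphologically or phonetically similar trade names stay close in the space.
    vecs["trade_name"] = trade_name_lookup[fields["trade_name"]]
    # Industry, location and basic attributes mainly need semantic similarity,
    # so an ordinary semantic encoder can be used for them instead.
    for attr in ("industry", "location", "org_type"):
        vecs[attr] = semantic_encoder(fields[attr])
    return vecs
```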
The text vector corresponding to each candidate enterprise name is obtained in the same way as the search vector, which is not described here again. When the search vector is matched against the text vectors corresponding to the candidate enterprise names, the matching is first performed over the full space: an inverted index is built over the candidate enterprise names and their corresponding text vectors, the similarity between the search vector corresponding to the search enterprise name and the text vector corresponding to each candidate enterprise name is determined, and the candidate enterprise names whose similarity is greater than a preset threshold are recalled, completing the initial recall. After the initial recall, the similarity between the search enterprise name and the vector of each field with the same attribute in the initially recalled candidate enterprise names may be determined; the similarities corresponding to each attribute are then sorted, the candidate enterprise names satisfying a preset similarity threshold are obtained for each attribute, and the candidate enterprise names common to all attributes are recalled as the search results and fed back to the user, as shown in fig. 7B.
Based on the word embedding representation learning method and the text recall method of the present disclosure, both texts containing the search string and texts containing words morphologically similar to the search string can be recalled, which improves the recall quantity and the recall quality, avoids inaccurate or incomplete recall results caused by input errors or recognition errors in the search string, and further improves the user experience.
The following describes apparatus embodiments of the present disclosure that may be used to perform the word embedding representation learning method and the text recall method in the above-described embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the word embedding representation learning method and the text recall method described above in the present disclosure.
FIG. 8 schematically illustrates a block diagram of a word embedding representation learning apparatus, according to one embodiment of the present disclosure.
Referring to fig. 8, a word-embedded representation learning apparatus 800 according to an embodiment of the present disclosure includes: a graph construction module 801, a sampling module 802, and a word embedding acquisition module 803.
The graph construction module 801 is configured to acquire a text corpus, perform word segmentation processing on the text corpus, and construct a graph structure based on the obtained word segments and pronunciation information corresponding to the word segments; a sampling module 802, configured to take each node in the graph structure as an initial node, and randomly walk to obtain a node sequence corresponding to the initial node; a word embedding obtaining module 803, configured to train a word embedding representation model according to the node sequence to obtain a word embedding lookup table, and determine a word embedding representation corresponding to the text corpus based on the word embedding lookup table.
In an embodiment of the present disclosure, the text corpus is a Chinese text, and the pronunciation information is the pinyin corresponding to each character in each participle obtained by performing word segmentation processing on the Chinese text; the graph construction module 801 is configured to: take the participles and the pinyins corresponding to the Chinese text as nodes, take the relations among the participles, the single characters in the participles, and the pinyins corresponding to the single characters as edges, and construct an undirected acyclic graph from the nodes and the edges.
In one embodiment of the present disclosure, the graph construction module 801 is further configured to: and when the undirected acyclic graph is constructed, setting the weight of each edge according to a preset rule.
In one embodiment of the disclosure, the edges include edges established on node relationships in which the pinyins are identical but the characters are different, and in which the pinyins are similar.
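A minimal sketch of such a graph construction, assuming the networkx library and a caller-supplied character-to-pinyin function, is given below; the constant edge weights and the optional similar-pinyin edge list are simplifying assumptions rather than the preset rule of the disclosure.

```python
import networkx as nx

def build_word_pinyin_graph(segmented_texts, to_pinyin, similar_pinyin_pairs=()):
    """segmented_texts: iterable of participle lists; to_pinyin: char -> pinyin string;
    similar_pinyin_pairs: optional (pinyin, pinyin) pairs judged to sound alike."""
    g = nx.Graph()  # undirected graph
    for text in segmented_texts:
        for participle in text:
            g.add_node(participle, kind="word")
            for char in participle:
                py = to_pinyin(char)
                g.add_node(py, kind="pinyin")
                # relation between a participle and the pinyin of one of its characters;
                # the weight follows a preset rule (a constant is used here)
                g.add_edge(participle, py, weight=1.0)
    # extra edges for node relationships with similar pinyins, with a smaller weight
    for a, b in similar_pinyin_pairs:
        g.add_edge(a, b, weight=0.5)
    return g
```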
In one embodiment of the present disclosure, the sampling module 802 is configured to: acquire a preset first parameter and a preset second parameter, and determine, according to the current node, a historical node and a future node adjacent to the current node, and the first parameter and the second parameter, the transition probabilities of the current node jumping to the historical node and of the current node jumping to the future node; and determine a walk direction according to the transition probabilities, and determine the node sequence based on the walk direction.
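The biased random walk can be sketched in the spirit of node2vec as follows, where the first parameter p and the second parameter q control the tendency to jump back to the historical node or on to a future node; the parameter names and the unnormalized weighting scheme are assumptions, not the exact rule of the disclosure.

```python
import random

def biased_walk(graph, start, walk_length, p=1.0, q=1.0):
    """Sample one node sequence starting from the initial node `start`."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = list(graph.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]                          # the historical node
        weights = []
        for nxt in neighbors:
            if nxt == prev:                      # jump back to the historical node
                weights.append(1.0 / p)
            elif graph.has_edge(nxt, prev):      # stay close to the historical node
                weights.append(1.0)
            else:                                # move on to a future node
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk
```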
In one embodiment of the present disclosure, the word embedding obtaining module 803 includes: a prediction information acquisition unit configured to input the node sequence into the word embedding representation model to acquire prediction information; a loss function determining unit configured to determine a loss function according to the prediction information and the label information corresponding to the node sequence; and a parameter optimization unit configured to optimize the parameters of the word embedding representation model based on the loss function so as to minimize the value of the loss function, and to take the embedding matrix corresponding to a hidden layer in the trained word embedding representation model as the word embedding lookup table.
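One way to realize this training step, assuming the gensim library, is the skip-gram sketch below; treating gensim's learned input vectors as the hidden-layer embedding matrix, i.e. the word embedding lookup table, is an assumption, and the hyperparameters are illustrative.

```python
from gensim.models import Word2Vec

def train_lookup_table(node_sequences, dim=128, window=5, epochs=5):
    """node_sequences: list of node-name lists produced by the random walks."""
    model = Word2Vec(
        sentences=node_sequences,
        vector_size=dim,
        window=window,
        sg=1,          # skip-gram: predict context nodes from the current node
        min_count=1,
        epochs=epochs,
    )
    # model.wv acts as the word embedding lookup table: node -> embedding vector
    return model.wv
```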
In one embodiment of the present disclosure, the word embedding obtaining module 803 is configured to: acquire a word list constructed based on the graph structure, and obtain the codes corresponding to the participles in the text corpus according to the word list; determine the word embedding corresponding to each participle in the word embedding lookup table according to its code; and determine the word embedding representation corresponding to the text corpus according to the word embeddings corresponding to all the participles.
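For illustration, the text-level word embedding representation might be assembled as follows; averaging the per-participle embeddings is an assumption, since the disclosure only requires combining the word embeddings of all participles.

```python
import numpy as np

def text_embedding(participles, vocab, lookup_table):
    """vocab: participle -> code (row index); lookup_table: 2D array, code -> embedding."""
    rows = [lookup_table[vocab[w]] for w in participles if w in vocab]
    if not rows:
        return np.zeros(lookup_table.shape[1])
    return np.mean(rows, axis=0)  # combine per-word embeddings into a text vector
```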
FIG. 9 schematically shows a block diagram of a text recall device according to one embodiment of the present disclosure.
Referring to fig. 9, a text recall apparatus 900 according to an embodiment of the present disclosure includes: a word segmentation module 901, a word embedding acquisition module 902 and a recall module 903.
The word segmentation module 901 is configured to obtain a search character string, and perform word segmentation processing on the search character string to obtain a search word; a word embedding obtaining module 902, configured to perform query in a word embedding lookup table according to the search segmentation to obtain word embedding corresponding to the search segmentation, where the word embedding lookup table is a word embedding lookup table obtained according to the word embedding expression learning method in the foregoing embodiment; and the recall module 903 is configured to obtain a search vector corresponding to the search character string according to word embedding corresponding to all the search segmented words, and determine a recall text according to the search vector and a text vector corresponding to the candidate text.
In one embodiment of the present disclosure, the word embedding obtaining module 902 is configured to: determining a service scene corresponding to the search character string, and determining a target word embedded lookup table according to the service scene; and inquiring in the target word embedding lookup table according to the search participle to obtain word embedding corresponding to the search participle.
In one embodiment of the present disclosure, the number of the candidate texts is plural; the recall module 903 comprises: and the recalling unit is used for acquiring a first similarity between the search vector and a text vector corresponding to each candidate text, and determining the recalling text according to the first similarity.
In one embodiment of the present disclosure, the search string and the candidate text include a plurality of fields of different attributes; the recall unit is configured to: perform inverted indexing according to the candidate texts and the text vectors, determine a second similarity between the search vector and each text vector, and perform an initial recall according to the second similarities; and acquire a third similarity between the search string and the vectors corresponding to fields with the same attribute in the candidate texts obtained by the initial recall, and recall from the result of the initial recall according to the third similarity to acquire the recall text.
FIG. 10 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage portion 1008 into a Random Access Memory (RAM) 1003, thereby implementing the methods described in the above embodiments. The RAM 1003 also stores various programs and data necessary for system operation. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage portion 1008 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, various functions defined in the system of the present disclosure are executed.
It should be noted that the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of these units do not, in some cases, constitute a limitation on the units themselves.
As another aspect, the present disclosure also provides a computer-readable medium that may be contained in the word-embedded representation learning apparatus and the text recall apparatus described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1. A method of word-embedded representation learning, comprising:
acquiring a text corpus, performing word segmentation processing on the text corpus, and constructing a graph structure based on the obtained word segmentation and pronunciation information corresponding to the word segmentation;
taking each node in the graph structure as an initial node, and randomly walking to obtain a node sequence corresponding to the initial node;
and training a word embedding representation model according to the node sequence to obtain a word embedding lookup table, and determining word embedding representation corresponding to the text corpus based on the word embedding lookup table.
2. The method according to claim 1, wherein the text corpus is a Chinese text, and the pronunciation information is a pinyin corresponding to each word in each participle obtained by participle processing of the Chinese text;
the construction of the graph structure based on the obtained segmentation and the pronunciation information corresponding to the segmentation comprises the following steps:
taking the participles and the pinyins corresponding to the Chinese text as nodes, taking the relations among the participles, the single characters in the participles, and the pinyins corresponding to the single characters as edges, and constructing an undirected acyclic graph according to the nodes and the edges.
3. The method according to claim 2, wherein the constructing a graph structure based on the obtained segmented words and pronunciation information corresponding to the segmented words further comprises:
and when the undirected acyclic graph is constructed, setting the weight of each edge according to a preset rule.
4. The method of claim 1 or 2, wherein the edges include edges established on node relationships in which the pinyins are identical but the characters are different, and in which the pinyins are similar.
5. The method according to claim 1, wherein the taking each node in the graph structure as an initial node and randomly walking to acquire a node sequence corresponding to the initial node comprises:
acquiring a preset first parameter and a preset second parameter, and determining, according to the current node, a historical node and a future node adjacent to the current node, and the first parameter and the second parameter, transition probabilities of the current node jumping to the historical node and of the current node jumping to the future node;
and determining a walk direction according to the transition probabilities, and determining the node sequence based on the walk direction.
6. The method of claim 1, wherein the training a word embedding representation model according to the node sequence to obtain a word embedding lookup table comprises:
inputting the node sequence into the word embedding representation model to obtain prediction information;
determining a loss function according to the prediction information and label information corresponding to the node sequence;
and optimizing parameters of the word embedding representation model based on the loss function so as to enable the value of the loss function to be minimum, and taking an embedding matrix corresponding to a hidden layer in the trained word embedding representation model as the word embedding lookup table.
7. The method of claim 6, wherein determining a word embedding representation corresponding to the text corpus based on the word embedding lookup table comprises:
acquiring a word list constructed based on the graph structure, and acquiring codes corresponding to participles in the text corpus according to the word list;
determining word embedding corresponding to the participle in the word embedding lookup table according to the code;
and determining word embedding representation corresponding to the text corpus according to word embedding corresponding to all the participles.
8. A text recall method, comprising:
acquiring a search character string, and performing word segmentation processing on the search character string to acquire search words;
performing a query in a word embedding lookup table according to the search participle to obtain word embedding corresponding to the search participle, the word embedding lookup table being a word embedding lookup table obtained according to the word embedding representation learning method of any one of claims 1-6;
and acquiring a search vector corresponding to the search character string according to word embedding corresponding to all the search participles, and determining a recall text according to the search vector and a text vector corresponding to the candidate text.
9. The method of claim 8, wherein said querying in a word embedding lookup table according to the search participle to obtain a word embedding corresponding to the search participle comprises:
determining a service scene corresponding to the search character string, and determining a target word embedded lookup table according to the service scene;
and inquiring in the target word embedding lookup table according to the search participle to obtain word embedding corresponding to the search participle.
10. The method of claim 8, wherein the number of candidate texts is plural;
determining a recall text according to the search vector and a text vector corresponding to the candidate text, comprising:
and acquiring first similarity between the search vector and a text vector corresponding to each candidate text, and determining the recalled text according to the first similarity.
11. The method of claim 10, wherein the search string and the candidate text comprise a plurality of fields of different attributes;
the obtaining a first similarity between the search vector and a text vector corresponding to each candidate text, and determining the recall text according to the first similarity includes:
performing inverted indexing according to the candidate texts and the text vectors, determining a second similarity between the search vector and each text vector, and performing an initial recall according to the second similarities;
and acquiring a third similarity between the search character string and a vector corresponding to a field with the same attribute in the candidate text obtained by the initial recall, and recalling in the candidate text obtained by the initial recall according to the third similarity to acquire the recall text.
12. A word-embedded representation learning apparatus, comprising:
the graph construction module is used for acquiring a text corpus, performing word segmentation processing on the text corpus, and constructing a graph structure based on the obtained segmented words and pronunciation information corresponding to the segmented words;
the sampling module is used for taking each node in the graph structure as an initial node and randomly walking to obtain a node sequence corresponding to the initial node;
and the word embedding obtaining module is used for training the word embedding representation model according to the node sequence to obtain a word embedding lookup table and determining the word embedding representation corresponding to the text corpus based on the word embedding lookup table.
13. A text recall apparatus, comprising:
the word segmentation module is used for acquiring a search character string and performing word segmentation processing on the search character string to acquire search words;
a word embedding obtaining module, configured to perform query in a word embedding lookup table according to the search participle to obtain word embedding corresponding to the search participle, where the word embedding lookup table is a word embedding lookup table obtained by the word embedding representation learning method according to any one of claims 1 to 6;
and the recall module is used for acquiring a search vector corresponding to the search character string according to the word embeddings corresponding to all the search participles, and determining a recall text according to the search vector and a text vector corresponding to the candidate text.
14. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing a word-embedded representation learning method according to any one of claims 1 to 7 and a text recall method according to any one of claims 8 to 11.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a word-embedded representation learning method as recited in any one of claims 1 to 7 and a text recall method as recited in any one of claims 8 to 11.
CN202010961808.1A 2020-09-14 2020-09-14 Word embedding expression learning method and device and text recall method and device Pending CN112100332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010961808.1A CN112100332A (en) 2020-09-14 2020-09-14 Word embedding expression learning method and device and text recall method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010961808.1A CN112100332A (en) 2020-09-14 2020-09-14 Word embedding expression learning method and device and text recall method and device

Publications (1)

Publication Number Publication Date
CN112100332A true CN112100332A (en) 2020-12-18

Family

ID=73752383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010961808.1A Pending CN112100332A (en) 2020-09-14 2020-09-14 Word embedding expression learning method and device and text recall method and device

Country Status (1)

Country Link
CN (1) CN112100332A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597291A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Intelligent question and answer implementation method, device and equipment
CN112686766A (en) * 2020-12-26 2021-04-20 中山大学 Embedded representation method, device, equipment and storage medium of social network
CN112634900A (en) * 2021-03-10 2021-04-09 北京世纪好未来教育科技有限公司 Method and apparatus for detecting phonetics
CN113343704A (en) * 2021-04-15 2021-09-03 山东师范大学 Text retrieval method and system based on word embedded vector
CN113343719A (en) * 2021-06-21 2021-09-03 哈尔滨工业大学 Unsupervised bilingual translation dictionary acquisition method for collaborative training by using different word embedding models
CN113420642A (en) * 2021-06-21 2021-09-21 西安电子科技大学 Small sample target detection method and system based on category semantic feature reweighting
CN113553431A (en) * 2021-07-27 2021-10-26 深圳平安综合金融服务有限公司 User label extraction method, device, equipment and medium
CN115146596A (en) * 2022-07-26 2022-10-04 平安科技(深圳)有限公司 Method and device for generating recall text, electronic equipment and storage medium
CN115146596B (en) * 2022-07-26 2023-05-02 平安科技(深圳)有限公司 Recall text generation method and device, electronic equipment and storage medium
CN115130472A (en) * 2022-08-31 2022-09-30 北京澜舟科技有限公司 Method, system and readable storage medium for segmenting subwords based on BPE


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination