CN110472241B - Method for generating redundancy-removed information sentence vector and related equipment - Google Patents

Method for generating redundancy-removed information sentence vector and related equipment

Info

Publication number: CN110472241B (granted publication of application CN201910690370.5A; earlier publication CN110472241A)
Authority: CN (China)
Prior art keywords: sentence, vector, vectors, spliced, initial
Original language: Chinese (zh)
Inventors: 郑立颖, 徐亮, 阮晓雯
Applicant and assignee: Ping An Technology Shenzhen Co Ltd
Legal status: Active (granted)


Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and related equipment for generating redundancy-removed information sentence vectors, relating to the field of natural language processing. The method comprises the following steps: acquiring an initial sentence vector of each sentence in a sentence set; splicing the initial sentence vectors, based on a comparison of the vector elements in each vector dimension of the initial sentence vectors, to obtain a spliced sentence vector of each sentence; obtaining a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors; zero-averaging the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set; and determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix. The method improves the efficiency with which a neural network performs natural language processing.

Description

Method for generating redundancy-removed information sentence vector and related equipment
Technical Field
The application relates to the field of natural language processing, in particular to a method and related equipment for generating redundancy-removed information sentence vectors.
Background
In the field of natural language processing, when performing semantic parsing on the sentences of a text, each sentence must first be converted into vector form, that is, into a sentence vector, so that the neural network performing the natural language processing can analyze it. When semantic analysis is performed in sentence units, the generation of these underlying sentence vectors therefore affects the efficiency of natural language processing. In the prior art, sentence vectors are generated by a simple weighted average of the word vectors of the words in each sentence. Sentence vectors generated this way typically contain a large amount of repeated, redundant information, making natural language processing by a neural network based on these sentence vectors inefficient.
Disclosure of Invention
In view of the above, in order to solve the technical problem of how to improve the efficiency of natural language processing by a neural network, the application provides a method for generating redundancy-removed information sentence vectors and related equipment.
In a first aspect, a method for generating a redundancy-removed information sentence vector is provided, including:
acquiring initial sentence vectors of each sentence in a sentence set;
based on the comparison of vector elements in each vector dimension of the initial sentence vector, splicing the initial sentence vector to obtain a spliced sentence vector of each sentence;
based on the spliced sentence vector, acquiring a spliced sentence vector matrix of the sentence set;
zero-averaging the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
and determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
In an exemplary embodiment of the present disclosure, the obtaining an initial sentence vector of each sentence in the sentence set includes:
word segmentation is carried out on each sentence in the sentence set to obtain word segmentation vocabulary;
acquiring word vectors of each word segmentation vocabulary;
based on the word vector of each word segmentation vocabulary, an initial sentence vector of each sentence is obtained.
In an exemplary embodiment of the present disclosure, the obtaining an initial sentence vector of each sentence based on the word vector of each word segmentation vocabulary includes:
based on the term frequency-inverse document frequency (TF-IDF) algorithm, determining the TF-IDF value of each word in the sentence set;
determining the TF-IDF value of the vocabulary as the weight of the word vector of the vocabulary;
and carrying out weighted average on word vectors of all words in each sentence based on the word vector weight to obtain initial sentence vectors of each sentence.
In an exemplary embodiment of the present disclosure, the splicing the initial sentence vector based on the comparison of vector elements in each vector dimension of the initial sentence vector to obtain a spliced sentence vector of each sentence includes:
determining the maximum value and the minimum value of vector elements in all initial sentence vectors in each vector dimension;
and adding the maximum value and the minimum value into the initial sentence vector of each sentence to obtain the spliced sentence vector of each sentence.
In an exemplary embodiment of the present disclosure, the obtaining, based on the spliced sentence vector, a spliced sentence vector matrix of a sentence set includes:
and sequentially taking the spliced sentence vectors of the sentences, in the order of the sentences in the sentence set, as the row vectors of the spliced sentence vector matrix, so as to obtain the spliced sentence vector matrix of the sentence set.
In an exemplary embodiment of the disclosure, the determining, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set includes:
and sequentially determining the row vectors in the target sentence vector matrix as target sentence vectors of all sentences according to the sequence of sentences in the sentence set.
According to a second aspect of the present disclosure, there is provided an apparatus for generating a redundancy-removed information sentence vector, comprising:
the first acquisition module is used for acquiring initial sentence vectors of all sentences in the sentence set;
the splicing module is used for splicing the initial sentence vectors based on the comparison of vector elements in each vector dimension of the initial sentence vectors to obtain spliced sentence vectors of each sentence;
the second acquisition module is used for acquiring a spliced sentence vector matrix of the sentence set based on the spliced sentence vector;
the zero-averaging module is used for zero-averaging the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
and the determining module is used for determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
According to a third aspect of the present disclosure, there is provided an electronic device that generates a redundancy-removed information sentence vector, comprising:
a memory configured to store executable instructions;
and a processor configured to execute the executable instructions stored in the memory to implement the method of generating redundancy-removed information sentence vectors.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer program instructions that, when executed by a computer, cause the computer to perform the method of generating a redundancy-removing information sentence vector.
The embodiment of the disclosure obtains a spliced sentence vector by extracting common features (maximum value and minimum value in each vector dimension) in initial sentence vectors of each sentence in a sentence set and splicing the common features into the initial sentence vector of each sentence. And performing zero-mean processing on a spliced sentence vector matrix formed by spliced sentence vectors to obtain target sentence vectors of each sentence, wherein redundant information is removed and main information is reserved. Therefore, the neural network can perform natural language processing with higher efficiency on the basis of the target sentence vector.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Fig. 1 illustrates a flowchart for generating redundancy-removed information sentence vectors according to an example embodiment of the present disclosure.
FIG. 2 illustrates a flowchart for obtaining initial sentence vectors for each sentence in a sentence set, according to an example embodiment of the present disclosure.
FIG. 3 illustrates a flow chart for obtaining initial sentence vectors for each sentence based on word vectors for each of the segmented words in accordance with an example embodiment of the present disclosure.
FIG. 4 is a flow chart of concatenating the initial sentence vectors based on a comparison of vector elements in each vector dimension of the initial sentence vectors to obtain a concatenated sentence vector for each sentence, according to an example embodiment of the present disclosure.
Fig. 5 shows a block diagram of an apparatus for generating redundancy-removed information sentence vectors according to an example embodiment of the present disclosure.
Fig. 6 illustrates a system architecture diagram for generating redundancy-removed sentence vectors according to an example embodiment of the present disclosure.
Fig. 7 illustrates an electronic device diagram for generating redundancy-removed information sentence vectors in accordance with an example embodiment of the present disclosure.
Fig. 8 illustrates a computer-readable storage medium diagram for generating redundancy-removed information sentence vectors according to an example embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The present disclosure is directed to improving the efficiency of natural language processing by neural networks. A method for generating a redundancy-removed information sentence vector according to one embodiment of the present disclosure includes: acquiring an initial sentence vector of each sentence in a sentence set; splicing the initial sentence vectors, based on a comparison of the vector elements in each vector dimension of the initial sentence vectors, to obtain a spliced sentence vector of each sentence; obtaining a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors; zero-averaging the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set; and determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
The process of generating redundancy-removed information sentence vectors in the embodiments of the present disclosure is described below.
FIG. 1 illustrates a flow chart of generating redundancy-removed information sentence vectors according to an example embodiment of the present disclosure:
step S110: acquiring initial sentence vectors of each sentence in a sentence set;
step S120: based on the comparison of vector elements in each vector dimension of the initial sentence vector, splicing the initial sentence vector to obtain a spliced sentence vector of each sentence;
step S130: based on the spliced sentence vector, acquiring a spliced sentence vector matrix of the sentence set;
step S140: zero-averaging the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
step S150: and determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
Sentence sets refer to sets that contain multiple sentences, such as: text containing a plurality of sentences.
The initial sentence vector refers to a sentence vector that has not undergone redundancy-removal processing, for example, a sentence vector obtained directly by weighted-averaging the word vectors of the words in the sentence.
The spliced sentence vector is a sentence vector obtained by adding vector elements to the initial sentence vector.
The target sentence vector is the finally obtained sentence vector from which redundant information has been removed.
In the embodiment of the disclosure, the common features in the initial sentence vectors are extracted and spliced to obtain spliced sentence vectors, and then zero-averaging is performed on a spliced sentence vector matrix formed by the spliced sentence vectors to remove redundant information in the spliced sentence vector matrix, so that a target sentence vector without redundant information is obtained.
Next, each step of generating the redundancy-elimination information sentence vector described above in the present exemplary embodiment will be explained and described in detail with reference to the accompanying drawings.
In step S110, an initial sentence vector of each sentence in the sentence set is acquired.
In one embodiment, as shown in fig. 2, step S110 includes:
step S1101: word segmentation is carried out on each sentence in the sentence set to obtain word segmentation vocabulary;
step S1102: acquiring word vectors of each word segmentation vocabulary;
step S1103: based on the word vector of each word segmentation vocabulary, an initial sentence vector of each sentence is obtained.
Word segmentation refers to the process of decomposing each sentence in natural language into individual words.
Word vectors refer to vectors used to represent words, i.e., words are represented in the form of vectors to facilitate computer processing analysis of words.
In an embodiment of the present disclosure, a computer receives a sentence set for which the redundancy-removed information sentence vector of each sentence is to be generated. In natural language processing, the word is the basic unit for understanding text semantics; therefore, to generate the redundancy-removed information sentence vector of each sentence, the word vector of each individual word in the sentence set is obtained first, and the redundancy-removed sentence vectors are then generated on that basis.
In one embodiment, the computer segments each sentence in the received sentence set based on a preset word segmentation algorithm, thereby obtaining the segmented words. It then obtains the word vector of each segmented word based on a preset word-vector generation algorithm, and from these obtains the initial sentence vector of each sentence.
In the embodiment of the disclosure, when segmenting each sentence in the sentence set, English can be segmented directly on space and punctuation characters, since English words are separated by spaces or punctuation. Chinese words, however, are mostly written contiguously, so Chinese cannot be segmented simply on space and punctuation characters. The process of segmenting Chinese to obtain the segmented words is therefore described below.
In one embodiment, the computer segments each sentence in the sentence set using a string-matching word segmentation method. In this embodiment, a dictionary containing a sufficiently large number of character strings is preset. During segmentation, each sentence in the sentence set is treated as an object to be decomposed: the object is repeatedly scanned and matched against the dictionary, and whenever a character string in the dictionary is found to be identical to a part of the object, that character string is split off from the object as a single word.
The embodiment has the advantages that word segmentation can be performed quickly by matching with the character strings in the dictionary, and the method is simple and easy to implement.
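As a concrete illustration of the string-matching approach, the following is a minimal sketch of forward maximum matching, one common string-matching segmentation scheme. The tiny dictionary and the helper name `max_match_segment` are illustrative assumptions, not part of the patent.

```python
# Hedged sketch: forward maximum-matching word segmentation, a common
# string-matching scheme. The dictionary below is a toy example.

def max_match_segment(sentence: str, dictionary: set, max_len: int = 4) -> list:
    """Greedily match the longest dictionary string at each position;
    fall back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a match is found.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

dictionary = {"小明", "吃", "苹果"}
print(max_match_segment("小明吃苹果", dictionary))  # ['小明', '吃', '苹果']
```

The single-character fallback keeps out-of-vocabulary text segmentable, at the cost of splitting unknown words character by character.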
In one embodiment, the computer segments each sentence in the sentence set using a statistics-based word segmentation method. In this embodiment, a statistical machine learning model is trained in advance on a large amount of already-segmented text, so that the trained model can perform Chinese word segmentation. Common machine learning models for word segmentation include the n-gram model, the Hidden Markov Model (HMM), and Conditional Random Fields (CRF).
The embodiment has the advantages that the word segmentation is performed through the machine learning model, the word segmentation accuracy is higher, and the word segmentation effect is more excellent.
The process of obtaining the word vector of each word segment vocabulary after obtaining each word segment vocabulary is described below.
In one embodiment, the computer represents each segmented word in a vector form capable of exhibiting semantic similarity between words, based on a machine learning model that embeds high-dimensional word vectors into a low-dimensional space (for example, the continuous bag-of-words CBOW model).
In this embodiment, the computer first represents each segmented word as a discrete (one-hot) word vector according to the order in which the words appear, for example: "Xiaoming" is denoted as [1, 0, 0], "eat" as [0, 1, 0], and "apple" as [0, 0, 1]. Then, based on a pre-trained CBOW model, each discrete word vector is re-expressed as a distributed word vector. For example, after processing with the CBOW model, "Xiaoming" is denoted as [0.9, 0.2, -0.2], "eat" as [0, 1.7, 0.3], and "apple" as [0.1, 0.2, 0.1].
As can be seen, the distances between discrete word vectors carry no semantic information (every pair is equally far apart), so natural language processing cannot be performed on them. Distributed word vectors, in contrast, express the degree of semantic similarity of the corresponding words through the distance between the vectors, which is more conducive to further natural language processing.
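The contrast between discrete and distributed word vectors can be shown with a small sketch; the cosine-similarity helper and the toy vector values (echoing the example above) are assumptions for illustration, not trained CBOW output.

```python
# Hedged illustration: one-hot (discrete) vectors are mutually orthogonal,
# so their similarities are all zero; distributed vectors carry (toy)
# semantic information in their distances.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every distinct pair has similarity 0.
xiaoming_oh, eat_oh, apple_oh = np.eye(3)
print(cosine(xiaoming_oh, eat_oh))  # 0.0

# Distributed vectors (toy values from the example above).
xiaoming = np.array([0.9, 0.2, -0.2])
apple    = np.array([0.1, 0.2,  0.1])
print(round(cosine(xiaoming, apple), 3))  # nonzero, reflecting similarity
```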
The following describes a process of acquiring an initial sentence vector of each sentence after the computer acquires a word vector of each word segmentation vocabulary.
In one embodiment, as shown in fig. 3, step S1103: based on the word vector of each word segmentation vocabulary, obtaining an initial sentence vector of each sentence comprises the following steps:
step S11031: based on the term frequency-inverse document frequency (TF-IDF) algorithm, determining the TF-IDF value of each word in the sentence set;
step S11032: determining the TF-IDF value of the vocabulary as the weight of the word vector of the vocabulary;
step S11033: and carrying out weighted average on word vectors of all words in each sentence based on the word vector weight to obtain initial sentence vectors of each sentence.
Term frequency-inverse document frequency (TF-IDF) has two parts: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a word appears in a document; IDF reflects how rare the word is across documents and hence the weight the word deserves. The TF-IDF value measures the importance of the corresponding word in the document: the larger the TF-IDF value, the more important the word. TF-IDF is used rather than raw frequency because a higher term frequency does not necessarily indicate greater importance. For example, a common function word may appear very frequently in a document yet help little in analyzing the document's content, so the weight it is assigned, i.e., its inverse document frequency IDF value, is small. The importance of a word is therefore measured by its TF-IDF value.
In one embodiment, the computer calculates the TF-IDF value of each word in the sentence set using the TF-IDF algorithm. Since the TF-IDF value of a word reflects its importance in the document, the TF-IDF value of each word is taken as the word vector weight of that word. The word vectors of the words in each sentence are then weighted according to these weights and combined to obtain the initial sentence vector of each sentence.
For example, for the sentence "Xiaoming eats an apple": the word vector of "Xiaoming" has been determined to be [0.9, 0.2, -0.2] with word vector weight 0.01; the word vector of "eat" is [0, 1.7, 0.3] with weight 0.05; and the word vector of "apple" is [0.1, 0.2, 0.1] with weight 0.02. Combining these word vectors by their weights, 0.01 x [0.9, 0.2, -0.2] + 0.05 x [0, 1.7, 0.3] + 0.02 x [0.1, 0.2, 0.1] = [0.011, 0.091, 0.015], gives the initial sentence vector of the sentence as [0.011, 0.091, 0.015].
The embodiment has the advantages that the initial sentence vector obtained on the basis of the TF-IDF value is more accurate, and deviation in the natural language processing process is less likely to be caused.
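The TF-IDF weighting of steps S11031 to S11033 can be sketched as follows, reproducing the worked example above. The dictionaries `word_vectors` and `tfidf_weight` hold assumed toy values, not real TF-IDF output.

```python
# Hedged sketch: TF-IDF values serve as word-vector weights, and each
# sentence's initial sentence vector is the weighted combination of its
# word vectors (matching the patent's worked example).
import numpy as np

word_vectors = {                       # toy distributed word vectors
    "Xiaoming": np.array([0.9, 0.2, -0.2]),
    "eat":      np.array([0.0, 1.7,  0.3]),
    "apple":    np.array([0.1, 0.2,  0.1]),
}
tfidf_weight = {"Xiaoming": 0.01, "eat": 0.05, "apple": 0.02}  # assumed TF-IDF values

def initial_sentence_vector(words):
    """Weight each word vector by its TF-IDF value and sum."""
    return sum(tfidf_weight[w] * word_vectors[w] for w in words)

v = initial_sentence_vector(["Xiaoming", "eat", "apple"])
print(np.round(v, 3))  # [0.011 0.091 0.015]
```

In a full pipeline the TF-IDF values would come from a fitted TF-IDF model over the whole sentence set rather than a hand-written table.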
The following describes a process of acquiring the spliced sentence vector of each sentence after acquiring the initial sentence vector of each sentence.
In step S120, the initial sentence vectors are spliced based on the comparison of vector elements in each vector dimension of the initial sentence vectors, so as to obtain spliced sentence vectors of each sentence.
In one embodiment, as shown in fig. 4, step S120 includes:
step S1201: determining the maximum value and the minimum value of vector elements in all initial sentence vectors in each vector dimension;
step S1202: and adding the maximum value and the minimum value into the initial sentence vector of each sentence to obtain the spliced sentence vector of each sentence.
In one embodiment, in the initial sentence vectors of all sentences in the sentence set, the maximum value and the minimum value of the vector elements in each vector dimension are determined, and the maximum value and the minimum value are added to the initial sentence vector of each sentence, that is, the initial sentence vector of each sentence is spliced, so as to obtain the spliced sentence vector of each sentence.
For example, the sentence set has 3 sentences A, B, C, where the initial sentence vector of sentence A is [3, 1, 0], of sentence B is [1, 5, 2], and of sentence C is [2, 3, 4]. Across the initial sentence vectors of all sentences in the set, the maximum of the vector elements in the first dimension is 3 and the minimum is 1; in the second dimension the maximum is 5 and the minimum is 1; in the third dimension the maximum is 4 and the minimum is 0. Therefore, after splicing, the spliced sentence vector of sentence A is [3, 1, 0, 3, 1, 5, 1, 4, 0], of sentence B is [1, 5, 2, 3, 1, 5, 1, 4, 0], and of sentence C is [2, 3, 4, 3, 1, 5, 1, 4, 0].
The embodiment has the advantage that the maximum value and the minimum value in each vector dimension are added to the initial sentence vector of each sentence, so that the obtained spliced sentence vector of each sentence contains the integral characteristic information common to the sentence set.
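Steps S1201 and S1202 can be sketched as follows, reproducing the A/B/C example above; the function name `splice` and the array layout are assumptions for illustration.

```python
# Hedged sketch: append the per-dimension (max, min) over all initial
# sentence vectors to each sentence's own vector.
import numpy as np

initial = np.array([[3, 1, 0],   # sentence A
                    [1, 5, 2],   # sentence B
                    [2, 3, 4]])  # sentence C

def splice(initial_vectors):
    mx = initial_vectors.max(axis=0)             # per-dimension maxima
    mn = initial_vectors.min(axis=0)             # per-dimension minima
    common = np.stack([mx, mn], axis=1).ravel()  # (max1, min1, max2, min2, ...)
    # Append the shared (max, min) features to every sentence's vector.
    return np.hstack([initial_vectors, np.tile(common, (len(initial_vectors), 1))])

spliced = splice(initial)
print(spliced[0])  # [3 1 0 3 1 5 1 4 0]
```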
The process of obtaining a stitched sentence vector matrix of a sentence set is described below.
In step S130, a stitched sentence vector matrix of the sentence set is obtained based on the stitched sentence vector.
In an embodiment, the obtaining, based on the spliced sentence vector, a spliced sentence vector matrix of the sentence set includes: and sequentially taking the spliced sentence vectors of all sentences as row vectors of the spliced sentence vector matrix according to the sequence of sentences in the sentence set, so as to obtain the spliced sentence vector matrix of the sentence set.
In this embodiment, the spliced sentence vectors of the sentences in the sentence set are used as the row vectors of the matrix, in the order of the corresponding sentences, to obtain the spliced sentence vector matrix of the sentence set. For example, sentence A, sentence B, and sentence C appear in sequence in the sentence set, with spliced sentence vectors [3, 1, 0, 3, 1, 5, 1, 4, 0], [1, 5, 2, 3, 1, 5, 1, 4, 0], and [2, 3, 4, 3, 1, 5, 1, 4, 0] respectively. Then the spliced sentence vector matrix of the sentence set is:

    [ 3 1 0 3 1 5 1 4 0 ]
    [ 1 5 2 3 1 5 1 4 0 ]
    [ 2 3 4 3 1 5 1 4 0 ]
the embodiment achieves the aim of uniformly processing the spliced sentence vectors of all sentences in the sentence set by establishing the spliced sentence vector matrix of the sentence set.
The following describes a process of performing redundancy elimination information processing on the spliced sentence vector matrix of the sentence set.
In step S140, zero-averaging is performed on the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set.
Zero-averaging a matrix is a form of zero-averaging data: the mean of the data is subtracted so as to eliminate errors caused by large differences in dimension and magnitude. After zero-averaging, the data have mean 0 (and, if further divided by the standard deviation, unit standard deviation, i.e., they are standardized). Zero-averaging removes unnecessary redundant information from the data, so that when the data are used to train a neural network, convergence is faster and training is more efficient.
In the embodiment of the disclosure, after the spliced sentence vector matrix of the sentence set is obtained, zero-averaging is performed on the spliced sentence vector matrix to obtain the target sentence vector matrix of the sentence set. Wherein each row vector in the target sentence vector matrix is the target sentence vector of each sentence.
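A minimal sketch of the zero-averaging in step S140, applied to the spliced A/B/C vectors from the example above; the variable names are assumptions.

```python
# Hedged sketch: subtract each column's mean from the spliced sentence
# vector matrix. Each row of the result is a target sentence vector
# (step S150).
import numpy as np

spliced_matrix = np.array([[3, 1, 0, 3, 1, 5, 1, 4, 0],
                           [1, 5, 2, 3, 1, 5, 1, 4, 0],
                           [2, 3, 4, 3, 1, 5, 1, 4, 0]], dtype=float)

target_matrix = spliced_matrix - spliced_matrix.mean(axis=0)
print(target_matrix[0])            # target sentence vector of sentence A
print(target_matrix.mean(axis=0))  # every column mean is now 0
```

Note that the appended common-feature columns, being identical across sentences, become all-zero after zero-averaging: the shared information they carry is exactly the kind of redundancy this step removes.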
In step S150, a target sentence vector for each sentence in the sentence set is determined based on the target sentence vector matrix.
In one embodiment, determining the target sentence vector of each sentence in the sentence set from the target sentence vector matrix includes: and sequentially determining the row vectors in the target sentence vector matrix as target sentence vectors of all sentences according to the sequence of sentences in the sentence set.
In this embodiment, each row vector in the target sentence vector matrix is sequentially determined as the target sentence vector of each sentence. For example: sentence a, sentence B, sentence C appear in sequence in the sentence set. The first row of vectors in the obtained spliced sentence vector matrix are spliced sentence vectors of a sentence A, the second row of vectors are spliced sentence vectors of a sentence B, and the third row of vectors are spliced sentence vectors of a sentence C. And carrying out zero-mean on the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set. And determining a first row vector as a target sentence vector of the sentence A, a second row vector as a target sentence vector of the sentence B and a third row vector as a target sentence vector of the sentence C in the target sentence vector matrix.
The present disclosure also provides an apparatus for generating redundancy-removed information sentence vectors. As shown in fig. 5, the apparatus for generating a redundancy-removing information sentence vector includes:
a first obtaining module 210, configured to obtain an initial sentence vector of each sentence in the sentence set;
the splicing module 220 is configured to splice the initial sentence vectors based on the comparison of vector elements in each vector dimension of the initial sentence vectors, so as to obtain spliced sentence vectors of each sentence;
a second obtaining module 230, configured to obtain a spliced sentence vector matrix of the sentence set based on the spliced sentence vector;
the zero-averaging module 240 is configured to zero-average the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
the determining module 250 is configured to determine a target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
The specific details of each module of the above apparatus for generating redundancy-removed information sentence vectors have already been described in detail in the corresponding method, and are therefore not repeated here.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in the particular order or that all of the illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Fig. 6 illustrates a system architecture diagram for generating redundancy-removed information sentence vectors in accordance with an example embodiment of the present disclosure. The system architecture includes: management end 310, computer 320, and neural network 330.
In one embodiment, the management end 310 sends a sentence set and a sentence-vector generation instruction to the computer 320, so that the computer 320 generates the sentence vectors of all sentences in the sentence set. The computer 320 obtains the initial sentence vector of each sentence in the sentence set and splices the initial sentence vectors to obtain the spliced sentence vector of each sentence. It then zero-averages the spliced sentence vector matrix formed by the spliced sentence vectors to obtain the target sentence vector matrix of the sentence set, thereby obtaining the redundancy-removed information sentence vectors of the sentences in the sentence set (i.e., the target sentence vectors of all sentences in the sentence set). The computer 320 transmits the generated redundancy-removed sentence vectors to the neural network 330, so that the neural network 330 can perform natural language processing more efficiently on the basis of the redundancy-removed sentence vectors.
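The pipeline that the computer 320 runs can be sketched end to end as below. This is a minimal illustrative implementation assuming the splicing step recited in claim 1 (appending the per-dimension maximum and minimum over all initial sentence vectors to each sentence's vector); the function name and matrix shapes are hypothetical:

```python
import numpy as np

def generate_target_vectors(initial_vectors: np.ndarray) -> np.ndarray:
    """Given an (n_sentences, dim) matrix of initial sentence vectors,
    return the (n_sentences, 3 * dim) target sentence vector matrix."""
    # Per-dimension maximum and minimum over all initial sentence vectors.
    dim_max = initial_vectors.max(axis=0)
    dim_min = initial_vectors.min(axis=0)
    n = initial_vectors.shape[0]
    # Splice: append the max and min vectors to each initial sentence vector.
    spliced = np.hstack([initial_vectors,
                         np.tile(dim_max, (n, 1)),
                         np.tile(dim_min, (n, 1))])
    # Zero-average the spliced matrix; each row is a target sentence vector.
    return spliced - spliced.mean(axis=0)
```

Each row of the returned matrix is, in sentence order, the redundancy-removed information sentence vector of the corresponding sentence.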
Note that the computer 320 may be any terminal with sufficient computing power, such as a personal computer or a server. Likewise, the computer 320 may perform the methods described in the embodiments of the present disclosure as part of the neural network 330.
From the above description of the system architecture, those skilled in the art will readily understand that the system architecture described herein can implement the functions of the respective modules in the apparatus for generating redundancy-removed information sentence vectors shown in fig. 5.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the method of generating a redundancy-removed information sentence vector is also provided.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to this embodiment of the application is described below with reference to fig. 7. The electronic device 600 shown in fig. 7 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 7, the electronic device 600 is in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, and a bus 630 that connects the various system components, including the memory unit 620 and the processing unit 610.
Wherein the storage unit stores program code that is executable by the processing unit 610 such that the processing unit 610 performs steps according to various exemplary embodiments of the present application described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 610 may perform step S110 shown in fig. 1: acquiring initial sentence vectors of each sentence in a sentence set; step S120: based on the comparison of vector elements in each vector dimension of the initial sentence vector, splicing the initial sentence vector to obtain a spliced sentence vector of each sentence; step S130: based on the spliced sentence vector, acquiring a spliced sentence vector matrix of the sentence set; step S140: zero-equalizing the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set; step S150: and determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 6201 and/or cache memory unit 6202, and may further include Read Only Memory (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. As shown, network adapter 660 communicates with other modules of electronic device 600 over bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Referring to fig. 8, a program product 800 for implementing a method of generating redundancy-removing sentence vectors, which may employ a portable compact disc read-only memory (CD-ROM) and include program code, and which may be run on a terminal device, such as a personal computer, is described in accordance with an embodiment of the present application. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (7)

1. A method of generating a redundancy-removed information sentence vector, comprising:
acquiring initial sentence vectors of each sentence in a sentence set;
based on the comparison of vector elements in each vector dimension of the initial sentence vector, splicing the initial sentence vector to obtain a spliced sentence vector of each sentence;
based on the spliced sentence vector, acquiring a spliced sentence vector matrix of the sentence set;
zero-equalizing the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
determining target sentence vectors of all sentences in the sentence set based on the target sentence vector matrix;
the step of splicing the initial sentence vectors based on the comparison of vector elements in each vector dimension of the initial sentence vectors to obtain spliced sentence vectors of each sentence comprises the following steps: determining the maximum value and the minimum value of vector elements in all initial sentence vectors in each vector dimension; adding the maximum value and the minimum value into the initial sentence vector of each sentence to obtain the spliced sentence vector of each sentence;
the step of obtaining the spliced sentence vector matrix of the sentence set based on the spliced sentence vector comprises the following steps: and sequentially taking the spliced sentence vectors of all sentences as row vectors of the spliced sentence vector matrix according to the sequence of sentences in the sentence set, so as to obtain the spliced sentence vector matrix of the sentence set.
2. The method of claim 1, wherein the obtaining an initial sentence vector for each sentence in the sentence set comprises:
word segmentation is carried out on each sentence in the sentence set to obtain word segmentation vocabulary;
acquiring word vectors of each word segmentation vocabulary;
based on the word vector of each word segmentation vocabulary, an initial sentence vector of each sentence is obtained.
3. The method of claim 2, wherein the obtaining an initial sentence vector for each sentence based on the word vector for each word segmentation vocabulary comprises:
based on word frequency-inverse document frequency TF-IDF algorithm, determining TF-IDF values of all words in sentence set;
determining the TF-IDF value of the vocabulary as the weight of the word vector of the vocabulary;
and carrying out weighted average on word vectors of all words in each sentence based on the word vector weight to obtain initial sentence vectors of each sentence.
4. The method of claim 1, wherein determining a target sentence vector for each sentence in a sentence set based on the target sentence vector matrix comprises:
and sequentially determining the row vectors in the target sentence vector matrix as target sentence vectors of all sentences according to the sequence of sentences in the sentence set.
5. An apparatus for generating a redundancy-removed information sentence vector, the apparatus being for performing the method of any one of claims 1 to 4, the apparatus comprising:
the first acquisition module is used for acquiring initial sentence vectors of all sentences in the sentence set;
the splicing module is used for splicing the initial sentence vectors based on the comparison of vector elements in each vector dimension of the initial sentence vectors to obtain spliced sentence vectors of each sentence;
the second acquisition module is used for acquiring a spliced sentence vector matrix of the sentence set based on the spliced sentence vector;
the zero-averaging module is used for zero-averaging the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
and the determining module is used for determining the target sentence vector of each sentence in the sentence set based on the target sentence vector matrix.
6. An electronic device for generating redundancy-removed sentence vectors, comprising:
a memory configured to store executable instructions;
a processor configured to execute executable instructions stored in a memory to implement the method according to any one of claims 1-4.
7. A computer readable storage medium, characterized in that it stores computer program instructions, which when executed by a computer, cause the computer to perform the method according to any of claims 1-4.
CN201910690370.5A 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment Active CN110472241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910690370.5A CN110472241B (en) 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910690370.5A CN110472241B (en) 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment

Publications (2)

Publication Number Publication Date
CN110472241A CN110472241A (en) 2019-11-19
CN110472241B true CN110472241B (en) 2023-11-10

Family

ID=68509073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910690370.5A Active CN110472241B (en) 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment

Country Status (1)

Country Link
CN (1) CN110472241B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985209B (en) * 2020-03-31 2024-03-29 北京来也网络科技有限公司 Text sentence recognition method, device and equipment combining RPA and AI and storage medium
CN113722438B (en) * 2021-08-31 2023-06-23 平安科技(深圳)有限公司 Sentence vector generation method and device based on sentence vector model and computer equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3499384A1 (en) * 2017-12-18 2019-06-19 Fortia Financial Solutions Word and sentence embeddings for sentence classification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text

Also Published As

Publication number Publication date
CN110472241A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN110287278B (en) Comment generation method, comment generation device, server and storage medium
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
CN109522552B (en) Normalization method and device of medical information, medium and electronic equipment
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
US20230004721A1 (en) Method for training semantic representation model, device and storage medium
CN109241286B (en) Method and device for generating text
US10943071B2 (en) Statistical preparation of data using semantic clustering
CN109408826A (en) A kind of text information extracting method, device, server and storage medium
CN110334209B (en) Text classification method, device, medium and electronic equipment
JP7335300B2 (en) Knowledge pre-trained model training method, apparatus and electronic equipment
CN114861889B (en) Deep learning model training method, target object detection method and device
US11151202B2 (en) Exploiting answer key modification history for training a question and answering system
CN111488742B (en) Method and device for translation
CN110472241B (en) Method for generating redundancy-removed information sentence vector and related equipment
CN111597800A (en) Method, device, equipment and storage medium for obtaining synonyms
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
WO2020052060A1 (en) Method and apparatus for generating correction statement
CN112711943B (en) Uygur language identification method, device and storage medium
US20200175104A1 (en) Generating rules for automated text annotation
CN112307738B (en) Method and device for processing text
CN110309278B (en) Keyword retrieval method, device, medium and electronic equipment
CN111666405B (en) Method and device for identifying text implication relationship
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN115357710A (en) Training method and device for table description text generation model and electronic equipment
US10726211B1 (en) Automated system for dynamically generating comprehensible linguistic constituents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant