CN110472241A - Method and related device for generating redundancy-removed sentence vectors - Google Patents

Method and related device for generating redundancy-removed sentence vectors

Info

Publication number
CN110472241A
CN110472241A (application CN201910690370.5A)
Authority
CN
China
Prior art keywords
sentence
vector
splicing
sentence vector
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910690370.5A
Other languages
Chinese (zh)
Other versions
CN110472241B (en)
Inventor
郑立颖
徐亮
阮晓雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910690370.5A priority Critical patent/CN110472241B/en
Publication of CN110472241A publication Critical patent/CN110472241A/en
Application granted granted Critical
Publication of CN110472241B publication Critical patent/CN110472241B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a method and related device for generating redundancy-removed sentence vectors, relating to the field of natural language processing. The method comprises: obtaining an initial sentence vector for each sentence in a sentence set; splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence; obtaining a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors; zero-centering the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set; and determining, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set. The method improves the efficiency with which a neural network performs natural language processing.

Description

Method and related device for generating redundancy-removed sentence vectors
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and related device for generating redundancy-removed sentence vectors.
Background art
In the field of natural language processing, semantic parsing of the sentences in a text requires each sentence to be converted into vector form, i.e. each sentence is converted into a single vector, so that the neural network performing the natural language processing can analyze and process it. When semantic parsing is performed sentence by sentence, the way the sentence vectors are generated therefore affects the efficiency of the natural language processing. In the prior art, a sentence vector is generated simply by taking a weighted average of the word vectors of the words in the sentence. Sentence vectors generated in this way usually contain a large amount of repeated, redundant information, so that natural language processing performed by a neural network on the basis of these vectors is inefficient.
Summary of the invention
In view of this, to solve the technical problem faced in the related art of how to improve, at the technical level, the efficiency with which a neural network performs natural language processing, the present invention provides a method and related device for generating redundancy-removed sentence vectors.
In a first aspect, a method for generating redundancy-removed sentence vectors is provided, comprising:
obtaining an initial sentence vector for each sentence in a sentence set;
splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence;
obtaining a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors;
zero-centering the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
determining, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set.
In an exemplary embodiment of the disclosure, obtaining the initial sentence vector of each sentence in the sentence set comprises:
segmenting each sentence in the sentence set to obtain segmented words;
obtaining a word vector for each segmented word;
obtaining the initial sentence vector of each sentence based on the word vectors of the segmented words.
In an exemplary embodiment of the disclosure, obtaining the initial sentence vector of each sentence based on the word vectors of the segmented words comprises:
determining the TF-IDF value of every word in the sentence set based on the term frequency-inverse document frequency (TF-IDF) algorithm;
using the TF-IDF value of each word as the weight of its word vector;
computing a weighted average of the word vectors of all the words in each sentence based on those word vector weights, to obtain the initial sentence vector of each sentence.
In an exemplary embodiment of the disclosure, splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain the spliced sentence vector of each sentence, comprises:
determining, over all the initial sentence vectors, the maximum value and the minimum value of the vector elements on each dimension;
appending the maximum values and the minimum values to the initial sentence vector of each sentence, to obtain the spliced sentence vector of each sentence.
In an exemplary embodiment of the disclosure, obtaining the spliced sentence vector matrix of the sentence set based on the spliced sentence vectors comprises:
taking the spliced sentence vectors of the sentences, in the order in which the sentences appear in the sentence set, as the row vectors of the spliced sentence vector matrix, to obtain the spliced sentence vector matrix of the sentence set.
In an exemplary embodiment of the disclosure, determining, based on the target sentence vector matrix, the target sentence vector of each sentence in the sentence set comprises:
taking the row vectors of the target sentence vector matrix, in the order in which the sentences appear in the sentence set, as the target sentence vectors of the respective sentences.
According to a second aspect of the disclosure, a device for generating redundancy-removed sentence vectors is provided, comprising:
a first obtaining module, configured to obtain an initial sentence vector for each sentence in a sentence set;
a splicing module, configured to splice the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence;
a second obtaining module, configured to obtain a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors;
a zero-centering module, configured to zero-center the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
a determining module, configured to determine, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set.
According to a third aspect of the disclosure, an electronic device for generating redundancy-removed sentence vectors is provided, comprising:
a memory configured to store executable instructions;
a processor configured to execute the executable instructions stored in the memory, so as to implement the method for generating redundancy-removed sentence vectors.
According to a fourth aspect of the disclosure, a computer-readable storage medium is provided, which stores computer program instructions that, when executed by a computer, cause the computer to perform the method for generating redundancy-removed sentence vectors.
In the embodiments of the disclosure, the features common to the sentence set (the maximum and minimum values on each vector dimension) are extracted from the initial sentence vectors of the sentences and spliced into the initial sentence vector of each sentence, yielding the spliced sentence vectors. Zero-centering is then applied to the spliced sentence vector matrix formed from the spliced sentence vectors, yielding for each sentence a target sentence vector from which redundant information has been removed and in which the main information is retained. On the basis of these target sentence vectors, a neural network can perform natural language processing with higher efficiency.
Other features and advantages of the disclosure will become apparent from the following detailed description, or may be learned in part through practice of the disclosure.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the disclosure.
Detailed description of the invention
Fig. 1 shows a flowchart of generating redundancy-removed sentence vectors according to an example embodiment of the disclosure.
Fig. 2 shows a flowchart of obtaining the initial sentence vector of each sentence in the sentence set according to an example embodiment of the disclosure.
Fig. 3 shows a flowchart of obtaining the initial sentence vector of each sentence based on the word vectors of the segmented words, according to an example embodiment of the disclosure.
Fig. 4 shows a flowchart of splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors to obtain the spliced sentence vector of each sentence, according to an example embodiment of the disclosure.
Fig. 5 shows a block diagram of a device for generating redundancy-removed sentence vectors according to an example embodiment of the disclosure.
Fig. 6 shows a system architecture diagram for generating redundancy-removed sentence vectors according to an example embodiment of the disclosure.
Fig. 7 shows an electronic device for generating redundancy-removed sentence vectors according to an example embodiment of the disclosure.
Fig. 8 shows a computer-readable storage medium for generating redundancy-removed sentence vectors according to an example embodiment of the disclosure.
Specific embodiment
Example embodiments will now be described more fully with reference to the accompanying drawings. The example embodiments can, however, be implemented in a variety of forms and should not be understood as being limited to the examples set forth herein; rather, these embodiments are provided so that the disclosure will be more complete and thorough and so that the concepts of the example embodiments are fully conveyed to those skilled in the art. The described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the disclosure. Those skilled in the art will recognize, however, that the technical solution of the disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, devices, steps, and so on. In other cases, well-known solutions are not shown or described in detail, to avoid them overshadowing and obscuring aspects of the disclosure.
In addition, the accompanying drawings are merely schematic illustrations of the disclosure and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, so their repeated description is omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The purpose of the disclosure is to improve the efficiency with which a neural network performs natural language processing. A method for generating redundancy-removed sentence vectors according to an embodiment of the disclosure comprises: obtaining an initial sentence vector for each sentence in a sentence set; splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence; obtaining a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors; zero-centering the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set; and determining, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set.
To make the objects, technical solutions and advantages of the disclosure clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the disclosure, not to limit it.
The process of generating redundancy-removed sentence vectors in the embodiments of the disclosure is described below.
Fig. 1 shows a flowchart of generating redundancy-removed sentence vectors according to an example embodiment of the disclosure:
Step S110: obtain the initial sentence vector of each sentence in the sentence set;
Step S120: splice the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain the spliced sentence vector of each sentence;
Step S130: obtain the spliced sentence vector matrix of the sentence set based on the spliced sentence vectors;
Step S140: zero-center the spliced sentence vector matrix to obtain the target sentence vector matrix of the sentence set;
Step S150: determine, based on the target sentence vector matrix, the target sentence vector of each sentence in the sentence set.
A sentence set is a collection containing multiple sentences, for example a text containing multiple sentences.
An initial sentence vector is a sentence vector that has not yet undergone redundancy removal, for example the sentence vector obtained directly by taking a weighted average of the word vectors of the words in a sentence.
A spliced sentence vector is the sentence vector obtained after additional vector elements have been appended to the initial sentence vector.
A target sentence vector is the redundancy-removed sentence vector that is finally obtained.
In the embodiments of the disclosure, the common features of the initial sentence vectors are extracted and spliced onto each initial sentence vector to obtain the spliced sentence vectors; zero-centering is then applied to the spliced sentence vector matrix formed from the spliced sentence vectors, removing the redundant information in the matrix and yielding the redundancy-removed target sentence vectors.
Each step of generating redundancy-removed sentence vectors in this example embodiment is explained in detail below with reference to the accompanying drawings.
In step S110, the initial sentence vector of each sentence in the sentence set is obtained.
In one embodiment, as shown in Fig. 2, step S110 comprises:
Step S1101: segment each sentence in the sentence set to obtain segmented words;
Step S1102: obtain the word vector of each segmented word;
Step S1103: obtain the initial sentence vector of each sentence based on the word vectors of the segmented words.
Word segmentation is the process of decomposing each sentence of natural language into individual words.
A word vector is a vector that represents a word, i.e. the word is expressed in vector form so that a computer can analyze and process it.
In the embodiments of the disclosure, the computer receives a sentence set for which a redundancy-removed sentence vector is to be generated for each sentence. In natural language processing, words are the basic units for understanding the semantics of a text; therefore, to generate the redundancy-removed sentence vector of each sentence, each individual word in the sentence set is first processed to obtain its word vector, and the redundancy-removed sentence vectors are then generated on that basis.
In one embodiment, the computer segments each sentence in the received sentence set using a preset segmentation algorithm to obtain the segmented words, then obtains the word vector of each segmented word using a preset word vector generation algorithm, and from these obtains the initial sentence vector of each sentence.
In the embodiments of the disclosure, when the sentences in the sentence set are segmented, English words are separated by spaces or punctuation, so English can be segmented directly on space and punctuation characters. Most Chinese words, by contrast, are written one after another without separators, so Chinese cannot be segmented simply on space and punctuation characters. The process of segmenting Chinese to obtain the segmented words is therefore described below, after a brief illustration of the English case.
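For the English case, a minimal Python sketch (not part of the patent; the regular expression and function name are illustrative) of splitting on spaces and punctuation:

```python
import re

def tokenize_english(sentence: str) -> list[str]:
    # Runs of letters/digits are kept as tokens; spaces and punctuation act as separators.
    return re.findall(r"\w+", sentence)

print(tokenize_english("Xiao Ming eats an apple."))  # ['Xiao', 'Ming', 'eats', 'an', 'apple']
```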
In one embodiment, the computer segments each sentence in the sentence set using a string-matching segmentation method. In this embodiment, a dictionary containing a sufficient number of strings is preset. During segmentation, each sentence in the set is treated as an object to be decomposed. The dictionary is scanned and matched against each object to be decomposed; if a string in the dictionary is found to be identical to a part of the object, that string is separated out of the object as an individual word.
The advantage of this embodiment is that segmentation can be performed quickly and simply by matching against the strings in the dictionary.
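A minimal sketch of this dictionary-based approach, using forward maximum matching, a common string-matching variant; the toy dictionary, the maximum word length of 4 and the function name are illustrative assumptions rather than details taken from the patent:

```python
def forward_max_match(sentence: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching: at each position take the longest
    dictionary string that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        match = sentence[i]  # single-character fallback
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy dictionary; a real system presets a dictionary with enough strings.
print(forward_max_match("小明要吃苹果", {"小明", "要吃", "苹果"}))  # ['小明', '要吃', '苹果']
```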
In one embodiment, the computer segments each sentence in the sentence set using a statistical segmentation method. In this embodiment, a statistical machine learning model is trained on a large amount of text that has been segmented in advance, so that the trained model can perform Chinese word segmentation. Machine learning models commonly used for statistical segmentation include the N-gram language model, the hidden Markov model and conditional random fields.
The advantage of this embodiment is that segmentation by a machine learning model is more precise and gives better segmentation results.
The process of obtaining the word vector of each segmented word, once the segmented words have been obtained, is described below.
In one embodiment, the computer uses a word embedding machine learning model that maps high-dimensional word vectors into a lower-dimensional space (for example, the continuous bag-of-words (CBOW) model) to represent each segmented word as a vector that can express the semantic similarity between words.
In this embodiment, the computer first represents each segmented word as a discrete (one-hot) word vector according to the order in which the words appear, for example "Xiao Ming" as [1,0,0], "eats" as [0,1,0] and "apple" as [0,0,1]. Based on a pre-trained CBOW model, each discrete word vector is then converted into a distributed word vector; for example, after processing by the CBOW model, "Xiao Ming" is represented as [0.9,0.2,-0.2], "eats" as [0,1.7,0.3] and "apple" as [0.1,0.2,0.1].
As can be seen, the distances between discrete word vectors carry no usable information, which is why natural language processing cannot be performed directly on discrete word vectors. Between distributed word vectors, by contrast, the distances reflect how semantically close the corresponding words are, which makes them much better suited to further natural language processing.
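A sketch of how such distributed word vectors could be trained; the gensim library, its parameters and the two-sentence toy corpus are illustrative assumptions (the patent only specifies a CBOW-style word embedding model), and vector_size=3 merely mirrors the toy numbers above:

```python
from gensim.models import Word2Vec  # parameter names follow gensim >= 4.0

corpus = [["小明", "要吃", "苹果"],
          ["小明", "要吃", "香蕉"]]           # pre-segmented sentences
model = Word2Vec(sentences=corpus, vector_size=3, window=2,
                 min_count=1, sg=0)            # sg=0 selects the CBOW architecture
print(model.wv["苹果"])                        # a 3-dimensional distributed word vector
```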
The process by which the computer obtains the initial sentence vector of each sentence, after the word vector of each segmented word has been obtained, is described below.
In one embodiment, as shown in Fig. 3, step S1103 (obtaining the initial sentence vector of each sentence based on the word vectors of the segmented words) comprises:
Step S11031: determine the TF-IDF value of every word in the sentence set based on the term frequency-inverse document frequency (TF-IDF) algorithm;
Step S11032: use the TF-IDF value of each word as the weight of its word vector;
Step S11033: compute a weighted average of the word vectors of all the words in each sentence based on those word vector weights, to obtain the initial sentence vector of each sentence.
Term frequency-inverse document frequency (TF-IDF) has two parts: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a word occurs in a document; IDF is a weight that reflects how widely the word is spread across documents. The TF-IDF value measures how important the corresponding word is in the document: the larger the TF-IDF value, the more important the word. TF-IDF is used rather than the raw frequency because a high term frequency does not necessarily mean a word is important. A common function word, for example, may occur very frequently in a document yet contribute almost nothing to analyzing its content, so the weight it is given, i.e. its inverse document frequency (IDF), is very small. The importance of a word is therefore measured by its TF-IDF value.
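The patent does not fix an exact formula, but a common TF-IDF variant can be sketched as follows, treating each sentence as a document; the function name and the log-based IDF are assumptions for illustration:

```python
import math
from collections import Counter

def tf_idf(sentences: list[list[str]]) -> dict[tuple[int, str], float]:
    """TF-IDF per (sentence index, word): TF = count of the word in the sentence,
    IDF = log(N / df) with N sentences and df sentences containing the word."""
    n = len(sentences)
    df = Counter(word for s in sentences for word in set(s))
    scores = {}
    for i, s in enumerate(sentences):
        for word, count in Counter(s).items():
            scores[(i, word)] = count * math.log(n / df[word])
    return scores
```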
In one embodiment, the computer computes the TF-IDF value of each word in the sentence set based on the TF-IDF algorithm. Since the TF-IDF value of a word indicates its importance in the document, it determines the weight of the corresponding word vector. According to these word vector weights, the word vectors of all the words in each sentence are weighted and combined to obtain the initial sentence vector of each sentence.
For example, for the sentence "Xiao Ming eats apple", suppose it has been determined that the word vector of "Xiao Ming" is [0.9,0.2,-0.2] with weight 0.01, the word vector of "eats" is [0,1.7,0.3] with weight 0.05, and the word vector of "apple" is [0.1,0.2,0.1] with weight 0.02. Weighting and combining these word vectors gives 0.01*[0.9,0.2,-0.2] + 0.05*[0,1.7,0.3] + 0.02*[0.1,0.2,0.1] = [0.011,0.091,0.015], so the initial sentence vector of "Xiao Ming eats apple" is [0.011,0.091,0.015].
The advantage of this embodiment is that initial sentence vectors obtained on the basis of TF-IDF values are more accurate and less likely to introduce deviations into the subsequent natural language processing.
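The worked example above can be reproduced with a short numpy sketch; as in the example, the word vectors are combined with the TF-IDF weights directly, and the variable names are illustrative:

```python
import numpy as np

word_vectors = np.array([[0.9, 0.2, -0.2],   # "Xiao Ming"
                         [0.0, 1.7,  0.3],   # "eats"
                         [0.1, 0.2,  0.1]])  # "apple"
weights = np.array([0.01, 0.05, 0.02])       # TF-IDF weights of the three words

initial_sentence_vector = weights @ word_vectors
print(initial_sentence_vector)               # approximately [0.011 0.091 0.015]
```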
After the initial sentence vector of each sentence has been obtained, the process of obtaining the spliced sentence vector of each sentence is described below.
In step S120, the initial sentence vectors are spliced based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain the spliced sentence vector of each sentence.
In one embodiment, as shown in Fig. 4, step S120 comprises:
Step S1201: determine, over all the initial sentence vectors, the maximum value and the minimum value of the vector elements on each dimension;
Step S1202: append the maximum values and the minimum values to the initial sentence vector of each sentence, to obtain the spliced sentence vector of each sentence.
In one embodiment, the maximum value and the minimum value of the vector elements on each dimension are determined over the initial sentence vectors of all the sentences in the set, and these maximum and minimum values are appended to the initial sentence vector of each sentence, i.e. the initial sentence vector of each sentence is spliced, to obtain the spliced sentence vector of each sentence.
For example, suppose the sentence set contains three sentences A, B and C, where the initial sentence vector of sentence A is [3,1,0], that of sentence B is [1,5,2] and that of sentence C is [2,3,4]. Over the initial sentence vectors of all the sentences in the set, the maximum of the vector elements on the first dimension is 3 and the minimum is 1; on the second dimension the maximum is 5 and the minimum is 1; on the third dimension the maximum is 4 and the minimum is 0. After the initial sentence vectors are spliced, the spliced sentence vector of sentence A is [3,1,0,3,1,5,1,4,0], that of sentence B is [1,5,2,3,1,5,1,4,0] and that of sentence C is [2,3,4,3,1,5,1,4,0].
The advantage of this embodiment is that, by appending the maximum and minimum values on each vector dimension to the initial sentence vector of each sentence, the spliced sentence vector of every sentence contains global feature information common to the whole sentence set.
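A minimal numpy sketch of this splicing step, using the three initial sentence vectors from the example; the interleaved max/min layout matches the example above, and the variable names are illustrative:

```python
import numpy as np

initial = np.array([[3, 1, 0],    # sentence A
                    [1, 5, 2],    # sentence B
                    [2, 3, 4]])   # sentence C

dim_max = initial.max(axis=0)     # [3 5 4], per-dimension maxima
dim_min = initial.min(axis=0)     # [1 1 0], per-dimension minima
# Interleave as [max1, min1, max2, min2, max3, min3] and append to every sentence.
common = np.column_stack((dim_max, dim_min)).ravel()          # [3 1 5 1 4 0]
spliced = np.hstack([initial, np.tile(common, (len(initial), 1))])
print(spliced[0])                 # [3 1 0 3 1 5 1 4 0], the spliced vector of sentence A
```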
The process of obtaining the spliced sentence vector matrix of the sentence set is described below.
In step S130, the spliced sentence vector matrix of the sentence set is obtained based on the spliced sentence vectors.
In one embodiment, obtaining the spliced sentence vector matrix of the sentence set based on the spliced sentence vectors comprises: taking the spliced sentence vectors of the sentences, in the order in which the sentences appear in the sentence set, as the row vectors of the spliced sentence vector matrix, to obtain the spliced sentence vector matrix of the sentence set.
In this embodiment, the spliced sentence vectors are used as the row vectors of the matrix in the order in which the corresponding sentences appear in the sentence set, yielding the spliced sentence vector matrix of the set. For example, suppose the sentences appear in the order sentence A, sentence B, sentence C, and that the spliced sentence vector of sentence A is [3,1,0,3,1,5,1,4,0], that of sentence B is [1,5,2,3,1,5,1,4,0] and that of sentence C is [2,3,4,3,1,5,1,4,0]. The spliced sentence vector matrix of the set is then the 3x9 matrix whose first, second and third rows are these three vectors.
By establishing the spliced sentence vector matrix of the sentence set, this embodiment makes it possible to process the spliced sentence vectors of all the sentences in the set in a unified way.
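Continuing the example, the matrix can be assembled by stacking the per-sentence spliced vectors in their order of appearance; this is a sketch, and the dictionary and variable names are illustrative:

```python
import numpy as np

spliced_vectors = {                       # spliced sentence vectors from step S120
    "A": [3, 1, 0, 3, 1, 5, 1, 4, 0],
    "B": [1, 5, 2, 3, 1, 5, 1, 4, 0],
    "C": [2, 3, 4, 3, 1, 5, 1, 4, 0],
}
sentence_order = ["A", "B", "C"]          # order of appearance in the sentence set
splice_matrix = np.vstack([spliced_vectors[s] for s in sentence_order])
print(splice_matrix.shape)                # (3, 9): one row per sentence
```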
The process of removing redundant information from the spliced sentence vector matrix of the sentence set is described below.
In step S140, the spliced sentence vector matrix is zero-centered to obtain the target sentence vector matrix of the sentence set.
Zero-centering a matrix is a case of data zero-centering. Data zero-centering subtracts the mean of the data, which removes errors caused by differing scales and by values of very different magnitude. Zero-centered data has a mean of 0 (and, if it is additionally divided by the standard deviation, unit variance). Zero-centering removes unnecessary redundant information from the data, so that when a neural network is trained on such data its convergence is accelerated and training finishes more efficiently.
In the embodiments of the disclosure, after the spliced sentence vector matrix of the sentence set has been obtained, it is zero-centered to obtain the target sentence vector matrix of the set, in which each row vector is the target sentence vector of the corresponding sentence.
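Zero-centering here amounts to subtracting the column-wise mean of the spliced sentence vector matrix; a sketch continuing the same example (note how the appended max/min columns, identical in every row, are centered to zero, i.e. the repeated information is removed):

```python
import numpy as np

splice_matrix = np.array([[3, 1, 0, 3, 1, 5, 1, 4, 0],
                          [1, 5, 2, 3, 1, 5, 1, 4, 0],
                          [2, 3, 4, 3, 1, 5, 1, 4, 0]], dtype=float)

target_matrix = splice_matrix - splice_matrix.mean(axis=0)  # zero-center each column
print(target_matrix[0])             # target sentence vector of sentence A: [ 1. -2. -2.  0.  0.  0.  0.  0.  0.]
print(target_matrix.mean(axis=0))   # numerically zero in every column after centering
```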
In step S150, the target sentence vector of each sentence in the sentence set is determined based on the target sentence vector matrix.
In one embodiment, determining, based on the target sentence vector matrix, the target sentence vector of each sentence in the sentence set comprises: taking the row vectors of the target sentence vector matrix, in the order in which the sentences appear in the sentence set, as the target sentence vectors of the respective sentences.
In this embodiment, each row vector of the target sentence vector matrix is taken in turn as the target sentence vector of the corresponding sentence. For example, suppose the sentences appear in the order sentence A, sentence B, sentence C. In the spliced sentence vector matrix obtained above, the first row is the spliced sentence vector of sentence A, the second row that of sentence B and the third row that of sentence C. After the spliced sentence vector matrix is zero-centered, the target sentence vector matrix of the set is obtained; the first row of the target sentence vector matrix is then determined as the target sentence vector of sentence A, the second row as that of sentence B, and the third row as that of sentence C.
The disclosure also provides a device for generating redundancy-removed sentence vectors. As shown in Fig. 5, the device comprises:
a first obtaining module 210, configured to obtain an initial sentence vector for each sentence in a sentence set;
a splicing module 220, configured to splice the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence;
a second obtaining module 230, configured to obtain a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors;
a zero-centering module 240, configured to zero-center the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
a determining module 250, configured to determine, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set.
The details of each module of the device for generating redundancy-removed sentence vectors have already been described in detail for the corresponding method and are not repeated here.
It should be noted that, although several modules or units of the device for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the disclosure, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
In addition, although the steps of the methods of the disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the steps shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step, and/or one step may be decomposed into multiple steps.
From the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of the disclosure may therefore be embodied in the form of a software product, which may be stored in a non-volatile storage medium (for example a CD-ROM, a USB flash drive or a removable hard disk) or on a network, and which includes instructions that cause a computing device (for example a personal computer, a server, a mobile terminal or a network device) to perform the method according to the embodiments of the disclosure.
Fig. 6 shows a system architecture for generating redundancy-removed sentence vectors according to an example embodiment of the disclosure. The system architecture includes a management terminal 310, a computer 320 and a neural network 330.
In one embodiment, the management terminal 310 sends a sentence set and a sentence vector generation instruction to the computer 320, so that the computer 320 generates sentence vectors for all the sentences in the set. The computer 320 obtains the initial sentence vector of each sentence in the set and splices each initial sentence vector to obtain the spliced sentence vector of each sentence. It then applies zero-centering to the spliced sentence vector matrix formed from the spliced sentence vectors to obtain the target sentence vector matrix of the set, and thereby the redundancy-removed sentence vector of each sentence in the set (that is, the target sentence vector of each sentence). The computer 320 sends the generated redundancy-removed sentence vectors to the neural network 330, so that the neural network 330 can perform natural language processing on this basis with higher efficiency.
It should be noted that the computer 320 may be any terminal with sufficient computing power; it may be a personal computer or a server. Likewise, the computer 320 may be a part of the neural network 330 that performs the method described in the embodiments of the disclosure.
From the above description of the system architecture, those skilled in the art will readily understand that the system architecture described here can realize the functions of the modules of the device for generating redundancy-removed sentence vectors shown in Fig. 5.
In an exemplary embodiment of the disclosure, an electronic device capable of implementing the method for generating redundancy-removed sentence vectors is also provided.
Those skilled in the art will appreciate that various aspects of the invention may be implemented as a system, a method or a program product. Accordingly, various aspects of the invention may take the form of a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may collectively be referred to here as a "circuit", "module" or "system".
The electronic device 600 according to this embodiment of the invention is described below with reference to Fig. 7. The electronic device 600 shown in Fig. 7 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the invention.
As shown in Fig. 7, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, and a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610).
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps of the various exemplary embodiments of the invention described in the "Exemplary methods" part of this specification. For example, the processing unit 610 may perform step S110 shown in Fig. 1: obtain the initial sentence vector of each sentence in the sentence set; step S120: splice the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain the spliced sentence vector of each sentence; step S130: obtain the spliced sentence vector matrix of the sentence set based on the spliced sentence vectors; step S140: zero-center the spliced sentence vector matrix to obtain the target sentence vector matrix of the sentence set; and step S150: determine, based on the target sentence vector matrix, the target sentence vector of each sentence in the sentence set.
The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) 6201 and/or a cache memory 6202, and may further include a read-only memory (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include, but are not limited to: an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 600 may also communicate with one or more external devices 700 (such as a keyboard, a pointing device or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router or a modem) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650. Moreover, the electronic device 600 may also communicate, through a network adapter 660, with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet). As shown, the network adapter 660 communicates with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
From the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of the disclosure may therefore be embodied in the form of a software product, which may be stored in a non-volatile storage medium (for example a CD-ROM, a USB flash drive or a removable hard disk) or on a network, and which includes instructions that cause a computing device (for example a personal computer, a server, a terminal device or a network device) to perform the method according to the embodiments of the disclosure.
Referring to Fig. 8, a program product 800 for implementing the method for generating redundancy-removed sentence vectors according to an embodiment of the invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the invention is not limited to this; in this document, a readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by, or in combination with, an instruction execution system, apparatus or device.
The program product may use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate or transmit a program for use by, or in combination with, an instruction execution system, apparatus or device.
The program code contained on the readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
Program code for carrying out the operations of the invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above figures are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the invention and are not intended to be limiting. It is easy to understand that the processing shown in the above figures does not indicate or limit the temporal order of these processes. It is also easy to understand that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
Those skilled in the art will readily conceive of other embodiments of the disclosure after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses or adaptations of the disclosure that follow its general principles and include common knowledge or conventional techniques in the art not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.

Claims (9)

1. A method for generating redundancy-removed sentence vectors, characterized by comprising:
obtaining an initial sentence vector for each sentence in a sentence set;
splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence;
obtaining a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors;
zero-centering the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
determining, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set.
2. The method according to claim 1, characterized in that obtaining the initial sentence vector of each sentence in the sentence set comprises:
segmenting each sentence in the sentence set to obtain segmented words;
obtaining a word vector for each segmented word;
obtaining the initial sentence vector of each sentence based on the word vectors of the segmented words.
3. The method according to claim 2, characterized in that obtaining the initial sentence vector of each sentence based on the word vectors of the segmented words comprises:
determining the TF-IDF value of every word in the sentence set based on the term frequency-inverse document frequency (TF-IDF) algorithm;
using the TF-IDF value of each word as the weight of its word vector;
computing a weighted average of the word vectors of all the words in each sentence based on those word vector weights, to obtain the initial sentence vector of each sentence.
4. The method according to claim 1, characterized in that splicing the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain the spliced sentence vector of each sentence, comprises:
determining, over all the initial sentence vectors, the maximum value and the minimum value of the vector elements on each dimension;
appending the maximum values and the minimum values to the initial sentence vector of each sentence, to obtain the spliced sentence vector of each sentence.
5. The method according to claim 1, characterized in that obtaining the spliced sentence vector matrix of the sentence set based on the spliced sentence vectors comprises:
taking the spliced sentence vectors of the sentences, in the order in which the sentences appear in the sentence set, as the row vectors of the spliced sentence vector matrix, to obtain the spliced sentence vector matrix of the sentence set.
6. The method according to claim 1, characterized in that determining, based on the target sentence vector matrix, the target sentence vector of each sentence in the sentence set comprises:
taking the row vectors of the target sentence vector matrix, in the order in which the sentences appear in the sentence set, as the target sentence vectors of the respective sentences.
7. A device for generating redundancy-removed sentence vectors, characterized by comprising:
a first obtaining module, configured to obtain an initial sentence vector for each sentence in a sentence set;
a splicing module, configured to splice the initial sentence vectors based on a comparison of the vector elements on each dimension of the initial sentence vectors, to obtain a spliced sentence vector for each sentence;
a second obtaining module, configured to obtain a spliced sentence vector matrix of the sentence set based on the spliced sentence vectors;
a zero-centering module, configured to zero-center the spliced sentence vector matrix to obtain a target sentence vector matrix of the sentence set;
a determining module, configured to determine, based on the target sentence vector matrix, a target sentence vector for each sentence in the sentence set.
8. An electronic device for generating redundancy-removed sentence vectors, characterized by comprising:
a memory configured to store executable instructions;
a processor configured to execute the executable instructions stored in the memory, so as to implement the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 6.
CN201910690370.5A 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment Active CN110472241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910690370.5A CN110472241B (en) 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910690370.5A CN110472241B (en) 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment

Publications (2)

Publication Number Publication Date
CN110472241A true CN110472241A (en) 2019-11-19
CN110472241B CN110472241B (en) 2023-11-10

Family

ID=68509073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910690370.5A Active CN110472241B (en) 2019-07-29 2019-07-29 Method for generating redundancy-removed information sentence vector and related equipment

Country Status (1)

Country Link
CN (1) CN110472241B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985209A (en) * 2020-03-31 2020-11-24 北京来也网络科技有限公司 Text sentence recognition method, device, equipment and storage medium combining RPA and AI
CN113722438A (en) * 2021-08-31 2021-11-30 平安科技(深圳)有限公司 Sentence vector generation method and device based on sentence vector model and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text
US20190188277A1 (en) * 2017-12-18 2019-06-20 Fortia Financial Solutions Method and device for processing an electronic document

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text
US20190188277A1 (en) * 2017-12-18 2019-06-20 Fortia Financial Solutions Method and device for processing an electronic document

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985209A (en) * 2020-03-31 2020-11-24 北京来也网络科技有限公司 Text sentence recognition method, device, equipment and storage medium combining RPA and AI
CN111985209B (en) * 2020-03-31 2024-03-29 北京来也网络科技有限公司 Text sentence recognition method, device and equipment combining RPA and AI and storage medium
CN113722438A (en) * 2021-08-31 2021-11-30 平安科技(深圳)有限公司 Sentence vector generation method and device based on sentence vector model and computer equipment
WO2023029356A1 (en) * 2021-08-31 2023-03-09 平安科技(深圳)有限公司 Sentence embedding generation method and apparatus based on sentence embedding model, and computer device
CN113722438B (en) * 2021-08-31 2023-06-23 平安科技(深圳)有限公司 Sentence vector generation method and device based on sentence vector model and computer equipment

Also Published As

Publication number Publication date
CN110472241B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN111309915B (en) Method, system, device and storage medium for training natural language of joint learning
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
CN107729300B (en) Text similarity processing method, device and equipment and computer storage medium
US10372821B2 (en) Identification of reading order text segments with a probabilistic language model
US10929383B2 (en) Method and system for improving training data understanding in natural language processing
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN110377714A (en) Text matching technique, device, medium and equipment based on transfer learning
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN105210055B (en) According to the hyphenation device across languages phrase table
CN109992765A (en) Text error correction method and device, storage medium and electronic equipment
CN113590865B (en) Training method of image search model and image search method
CN112528677B (en) Training method and device of semantic vector extraction model and electronic equipment
CN111488742B (en) Method and device for translation
US20220005461A1 (en) Method for recognizing a slot, and electronic device
JP7337979B2 (en) Model training method and apparatus, text prediction method and apparatus, electronic device, computer readable storage medium, and computer program
WO2019118257A1 (en) Assertion-based question answering
CN103050115A (en) Recognizing device, recognizing method, generating device, and generating method
US20230080904A1 (en) Method for generating cross-lingual textual semantic model, and electronic device
EP4170542A2 (en) Method for sample augmentation
CN110472241A (en) Generate the method and relevant device of de-redundancy information sentence vector
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN111666405B (en) Method and device for identifying text implication relationship
CN112307738A (en) Method and device for processing text
WO2023088278A1 (en) Method and apparatus for verifying authenticity of expression, and device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant