CN110413990A - The configuration method of term vector, device, storage medium, electronic device - Google Patents

The configuration method of term vector, device, storage medium, electronic device

Info

Publication number
CN110413990A
Authority
CN
China
Prior art keywords
vocabulary
term vector
sequence
strokes sequence
strokes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910534810.8A
Other languages
Chinese (zh)
Inventor
郑立颖
徐亮
阮晓雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910534810.8A priority Critical patent/CN110413990A/en
Publication of CN110413990A publication Critical patent/CN110413990A/en
Priority to PCT/CN2019/117725 priority patent/WO2020253050A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The present invention provides a term vector configuration method, device, storage medium, and electronic device. The configuration method comprises: determining a first word for which an initial term vector is to be configured; judging whether the first word is in a term vector dictionary, wherein the term vector dictionary stores a one-to-one correspondence between multiple words and multiple term vectors; if the first word is not in the term vector dictionary, performing stroke decomposition on the first word to obtain a stroke sequence; calculating the similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word; and determining the term vector corresponding to the word whose stroke sequence is most similar to that of the first word, and configuring it as the initial term vector of the first word. The invention solves the technical problem in the related art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks.

Description

The configuration method of term vector, device, storage medium, electronic device
Technical field
The present invention relates to the field of neural networks, and in particular to a term vector configuration method, device, storage medium, and electronic device.
Background technique
When processing text data, the most basic steps are usually word segmentation and term vector training (for example, training with the Word2vec method); subsequent tasks such as text comparison and classification are then carried out on the basis of the term vectors. In practice, the text to be processed often contains new words that are not covered by the term vector dictionary (unregistered words). The usual approach is to assign a random term vector to each unregistered word. However, a randomly assigned term vector carries none of the semantic information of the new word, which reduces the accuracy of subsequent tasks.
No effective solution to the above problem in the related art has yet been found.
Summary of the invention
The embodiments of the present invention provide a term vector configuration method, device, storage medium, and electronic device, to at least solve the technical problem in the prior art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks.
According to one embodiment of the present invention, a term vector configuration method is provided, comprising: determining a first word for which an initial term vector is to be configured; judging whether the first word is in a term vector dictionary, wherein the term vector dictionary stores a one-to-one correspondence between multiple words and multiple term vectors; if the first word is not in the term vector dictionary, performing stroke decomposition on the first word to obtain a stroke sequence; calculating the similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word; and determining the term vector corresponding to the word whose stroke sequence is most similar to that of the first word, and configuring it as the initial term vector of the first word.
Further, the term vector dictionary includes a second word, and calculating the similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word comprises: determining the total length of the coincident sequence segments of the stroke sequence of the second word and the stroke sequence of the first word, wherein a coincident sequence segment is a segment in which the strokes are arranged identically in the stroke sequences of the two words; and determining, based on the total length of the coincident sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word.
Further, the similarity between the stroke sequence of the first word and the stroke sequence of the second word is determined from the total length of the coincident sequence segments using the following formula: S = 2*p/(n+m), where S is the similarity between the stroke sequence of the first word and the stroke sequence of the second word, p is the total length of the coincident sequence segments of the first word and the second word, n is the length of the stroke sequence of the first word, and m is the length of the stroke sequence of the second word.
Further, determining the first word for which an initial term vector is to be configured comprises: obtaining a corpus to be segmented; segmenting the corpus to obtain a sequence of multiple segmented words; and determining, among the segmented words, a first segmented word for which no initial term vector has been configured, to obtain the first word.
With the term vector configuration method provided by the present invention, a word not registered in the term vector dictionary is decomposed into strokes, the registered word whose strokes are closest to it is looked up in the term vector dictionary, and the term vector of that word is configured as the initial term vector of the unregistered word. This solves the technical problem in the related art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks. Because the initial term vector of an unregistered word is assigned using the semantic information carried by Chinese strokes, the time consumed by subsequent training tasks can be reduced and their accuracy improved.
According to another embodiment of the present invention, a term vector configuration device is provided, comprising: a first determining module, configured to determine a first word for which an initial term vector is to be configured; a judgment module, configured to judge whether the first word is in a term vector dictionary, wherein the term vector dictionary stores a one-to-one correspondence between multiple words and multiple term vectors; a decomposition module, configured to perform stroke decomposition on the first word to obtain a stroke sequence if the first word is not in the term vector dictionary; a computing module, configured to calculate the similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word; and a second determining module, configured to determine the term vector corresponding to the word whose stroke sequence is most similar to that of the first word, and to configure it as the initial term vector of the first word.
Further, the term vector dictionary includes a second word, and the computing module includes: a first determination unit, configured to determine the total length of the coincident sequence segments of the stroke sequence of the second word and the stroke sequence of the first word, wherein a coincident sequence segment is a segment in which the strokes are arranged identically in the stroke sequences of the two words; and a second determination unit, configured to determine, based on the total length of the coincident sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word.
Further, the second determination unit determines the similarity between the stroke sequence of the first word and the stroke sequence of the second word from the total length of the coincident sequence segments using the following formula: S = 2*p/(n+m), where S is the similarity between the stroke sequence of the first word and the stroke sequence of the second word, p is the total length of the coincident sequence segments of the first word and the second word, n is the length of the stroke sequence of the first word, and m is the length of the stroke sequence of the second word.
Further, the first determining module includes: an acquiring unit, configured to obtain a corpus to be segmented; a segmentation unit, configured to segment the corpus to obtain a sequence of multiple segmented words; and a third determination unit, configured to determine, among the segmented words, a first segmented word for which no initial term vector has been configured, to obtain the first word.
With the term vector configuration device provided by the present invention, a word not registered in the term vector dictionary is decomposed into strokes, the registered word whose strokes are closest to it is looked up in the term vector dictionary, and the term vector of that word is configured as the initial term vector of the unregistered word. This solves the technical problem in the related art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks. Because the initial term vector of an unregistered word is assigned using the semantic information carried by Chinese strokes, the time consumed by subsequent training tasks can be reduced and their accuracy improved.
According to still another embodiment of the present invention, a storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
According to still another embodiment of the present invention, an electronic device is further provided, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
Brief description of the drawings
The drawings described herein are provided for further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is the flow chart of the configuration method of term vector according to an embodiment of the present invention;
Fig. 2 is the schematic diagram of the configuration device of term vector according to an embodiment of the present invention;
Fig. 3 is a kind of hardware block diagram of electronic device of the embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of this application. Where no conflict arises, the embodiments of this application and the features in them may be combined with one another. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of this application without creative effort shall fall within the protection scope of this application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of this application are used to distinguish similar objects, and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
This embodiment provides a term vector configuration method that can be applied in a mobile terminal, a handheld terminal, or a similar computing device. Running on different computing devices changes only the entity that executes the solution; those skilled in the art can foresee that running on different computing devices produces the same technical effect.
With the term vector configuration method provided in this embodiment, a word not registered in the term vector dictionary is decomposed into strokes, the registered word whose strokes are closest to it is looked up in the term vector dictionary, and the term vector of that word is configured as the initial term vector of the unregistered word. This solves the technical problem in the related art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks. Because the initial term vector of an unregistered word is assigned using the semantic information carried by Chinese strokes, the time consumed by subsequent training tasks can be reduced and their accuracy improved.
As shown in Figure 1, the configuration method of term vector provided in this embodiment includes the following steps:
Step 101, determine a first word for which an initial term vector is to be configured.
The first word may be any segmented word in the corpus to be processed. This embodiment is used to configure term vectors for words. During neural network training, a machine cannot readily recognize natural language, so natural language must be converted into a machine-readable form; in this embodiment, different words are identified by their term vectors so that the machine can distinguish them. In this embodiment, neural network training may be implemented with a neural network model, including but not limited to a BP neural network model, a CNN model, and the like.
When a word is determined from a corpus, word segmentation is first performed on the corpus, and the word that currently needs an initial term vector is then determined as the first word. Specifically, determining the first word for which an initial term vector is to be configured includes the following steps:
Step 11, obtain a corpus to be segmented;
Step 12, segment the corpus to obtain a sequence of multiple segmented words;
Step 13, determine, among the segmented words, a first segmented word for which no initial term vector has been configured, to obtain the first word.
Word segmentation may use an existing segmentation algorithm, which is not described further in this embodiment.
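Steps 11 to 13 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: `segment` is a hypothetical stand-in that simply splits on whitespace (a real system would use an existing Chinese word segmentation algorithm), and the contents of `term_vectors` are made up for the example.

```python
from typing import Dict, List, Optional


def segment(corpus: str) -> List[str]:
    """Hypothetical stand-in for a word segmenter: splits on whitespace."""
    return corpus.split()


def first_unconfigured(words: List[str],
                       term_vectors: Dict[str, List[float]]) -> Optional[str]:
    """Return the first segmented word that has no term vector yet, or None."""
    for word in words:
        if word not in term_vectors:
            return word
    return None


# Toy term vector dictionary; the vector values are illustrative only.
term_vectors = {"我": [0.1, 0.2], "喜欢": [0.3, 0.4]}

words = segment("我 喜欢 柏树")                          # steps 11 and 12
first_word = first_unconfigured(words, term_vectors)    # step 13
print(first_word)  # 柏树
```

Returning `None` when every segmented word already has a vector is a design choice of this sketch; the text does not say what happens in that case.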
Step 102, judge whether the first word is in the term vector dictionary, wherein the term vector dictionary stores a one-to-one correspondence between multiple words and multiple term vectors.
The term vector dictionary stores pairs of words and term vectors that have been trained or manually labeled.
After the first word is obtained, it is judged whether the first word is in the term vector dictionary. If the first word is in the term vector dictionary, the term vector corresponding to the first word is looked up in the term vector dictionary.
Step 103, if the first word is not in the term vector dictionary, perform stroke decomposition on the first word to obtain a stroke sequence;
When stroke decomposition is performed on the first word, each character in the word may be decomposed into the smallest units of Chinese writing, namely strokes such as horizontal (横), vertical (竖), left-falling (撇), and right-falling (捺). Alternatively, each character may be decomposed only to a preset degree according to a preset minimum unit, for example by treating common components such as 日 (sun), 月 (moon), 木 (wood), 目 (eye), 口 (mouth), and 田 (field) as minimum units, which shortens the stroke sequence obtained after decomposition. The degree of decomposition can be chosen according to the actual configuration and is not specifically limited in this embodiment.
A stroke sequence is the sequence of all strokes that make up the corresponding word. Different strokes can be identified by combinations of characters such as digits and letters, and the sequence formed by the identifiers of all the strokes is the stroke sequence. For example, if the identifiers of the horizontal, vertical, left-falling, and right-falling strokes are 1, 2, 3, and 4 respectively, the stroke sequence of the character 木 ("wood") is {1, 2, 3, 4}. As another example, if the identifier of the component 木 is 1 and the identifier of the component 目 ("eye") is 2, the stroke sequence of the character 相 is {1, 2}.
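The two levels of decomposition described above can be illustrated with a small Python sketch. All four lookup tables are hypothetical and contain only the entries needed for the two examples in the text; a real system would need a full character-to-stroke table, which this application does not specify.

```python
from typing import Dict, List

# Hypothetical stroke identifiers, following the example in the text.
STROKE_IDS: Dict[str, int] = {"横": 1, "竖": 2, "撇": 3, "捺": 4}
# Hypothetical character-to-stroke table (one illustrative entry).
CHAR_TO_STROKES: Dict[str, List[str]] = {"木": ["横", "竖", "撇", "捺"]}

# Hypothetical component identifiers and character-to-component table.
COMPONENT_IDS: Dict[str, int] = {"木": 1, "目": 2}
CHAR_TO_COMPONENTS: Dict[str, List[str]] = {"相": ["木", "目"]}


def stroke_sequence(word: str) -> List[int]:
    """Decompose every character of a word into basic stroke identifiers."""
    seq: List[int] = []
    for char in word:
        seq.extend(STROKE_IDS[s] for s in CHAR_TO_STROKES[char])
    return seq


def component_sequence(word: str) -> List[int]:
    """Decompose every character of a word into coarser component identifiers."""
    seq: List[int] = []
    for char in word:
        seq.extend(COMPONENT_IDS[c] for c in CHAR_TO_COMPONENTS[char])
    return seq


print(stroke_sequence("木"))     # [1, 2, 3, 4]
print(component_sequence("相"))  # [1, 2]
```

In this sketch a character missing from the tables raises `KeyError`; how unknown characters should be handled is left open here, as it is in the text.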
Step 104, calculate the similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word;
This step may be executed each time the stroke sequence of a word is obtained in step 103; that is, each time a word in the corpus has been decomposed, its similarity to each word in the term vector dictionary is calculated.
Alternatively, it may be executed after all unregistered words (words not in the term vector dictionary) in a given section of the corpus have been decomposed.
In addition, the stroke sequence similarity between the word and each word in the term vector dictionary may be calculated synchronously or asynchronously. In synchronous execution, the work is divided among several processes, each of which calculates the similarity between the word and one word in the term vector dictionary. In asynchronous execution, a single process calculates the similarity between the word and each word in the term vector dictionary in sequence.
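The two execution modes can be contrasted with a short Python sketch. Several assumptions are made here: worker threads from `concurrent.futures` stand in for the separate processes mentioned in the text, and `difflib.SequenceMatcher.ratio()` is used as a placeholder similarity (it also computes 2*M/T, but over order-preserving matching blocks, so it is not necessarily the measure defined in this application). The dictionary entries are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: List[int], b: List[int]) -> float:
    """Placeholder similarity: difflib's ratio, used only to have something to run."""
    return SequenceMatcher(None, a, b).ratio()


def scores_sequential(query: List[int],
                      dictionary: Dict[str, List[int]]) -> Dict[str, float]:
    """One worker compares the query with every dictionary entry in turn."""
    return {word: similarity(query, seq) for word, seq in dictionary.items()}


def scores_parallel(query: List[int],
                    dictionary: Dict[str, List[int]],
                    workers: int = 4) -> Dict[str, float]:
    """Several workers each handle one dictionary entry at a time."""
    words = list(dictionary)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(lambda w: similarity(query, dictionary[w]), words)
        return dict(zip(words, values))


# Invented stroke sequences for three dictionary words.
dictionary = {"A": [1, 2, 3, 4], "B": [1, 2], "C": [5, 6]}
query = [1, 2, 3]

sequential = scores_sequential(query, dictionary)
parallel = scores_parallel(query, dictionary)
print(max(sequential, key=sequential.get))
print(sequential == parallel)
```

Both modes yield the same scores; the choice only affects how the work is scheduled.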
The similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word may be calculated with the following steps:
Step 21, determine the total length of the coincident sequence segments of the stroke sequence of the second word and the stroke sequence of the first word, wherein a coincident sequence segment is a segment in which the strokes are arranged identically in the stroke sequences of the two words.
For example, if the stroke sequence of the first word is {5, 3, 4, 9, 7, 1, 3, 13} and the stroke sequence of the second word is {1, 3, 10, 5, 3, 4, 9, 11}, the first word and the second word have two coincident sequence segments, {1, 3} and {5, 3, 4, 9}, with lengths 2 and 4 respectively, giving a total length of 6.
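The text does not say how the coincident sequence segments are found. One possible reading, sketched below in Python, is a greedy search: repeatedly take the longest contiguous run shared by both sequences, mask it out in both, and continue until nothing is shared. The function names, the greedy strategy, and the `min_len` parameter are assumptions of this sketch. Note that in the example the two segments appear in a different order in the two sequences, so an order-preserving method such as a longest common subsequence would not reproduce it.

```python
from typing import List, Sequence, Tuple


def longest_common_run(a: list, b: list) -> Tuple[int, int, int]:
    """Return (length, start_in_a, start_in_b) of the longest shared contiguous run.

    Entries equal to None are treated as already consumed and never match.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x is not None and x == y:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def coincident_segments(a: Sequence, b: Sequence, min_len: int = 1) -> List[tuple]:
    """Greedily collect shared runs, longest first, masking each run once used."""
    a, b = list(a), list(b)
    segments = []
    while True:
        length, ia, ib = longest_common_run(a, b)
        if length < max(min_len, 1):
            break
        segments.append(tuple(a[ia:ia + length]))
        a[ia:ia + length] = [None] * length
        b[ib:ib + length] = [None] * length
    return segments


first = [5, 3, 4, 9, 7, 1, 3, 13]
second = [1, 3, 10, 5, 3, 4, 9, 11]
segments = coincident_segments(first, second)
print(segments)                       # [(5, 3, 4, 9), (1, 3)]
print(sum(len(s) for s in segments))  # 6
```

Masking with `None` rather than deleting elements keeps the remaining strokes in place, so two runs that were separated in the original sequence cannot be joined into a false match.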
Step 22, determine, based on the total length of the coincident sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word.
Optionally, in step 22, the similarity between the stroke sequence of the first word and the stroke sequence of the second word may be determined from the total length of the coincident sequence segments using the following formula:
S = 2*p/(n+m)
where S is the similarity between the stroke sequence of the first word and the stroke sequence of the second word, p is the total length of the coincident sequence segments of the first word and the second word, n is the length of the stroke sequence of the first word, and m is the length of the stroke sequence of the second word.
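As a quick check of the formula, the following Python sketch evaluates it for the example from step 21, where the coincident segments have lengths 2 and 4 (so p = 6) and both stroke sequences have length 8. The function name and the error raised for two empty sequences are choices of this sketch, not part of this application.

```python
def stroke_similarity(p: int, n: int, m: int) -> float:
    """Evaluate S = 2*p/(n+m) for coincident length p and sequence lengths n, m."""
    if n + m == 0:
        raise ValueError("both stroke sequences are empty")
    return 2 * p / (n + m)


# Example from step 21: p = 2 + 4 = 6, n = m = 8.
print(stroke_similarity(6, 8, 8))  # 0.75
# Two identical sequences of length 4 coincide completely.
print(stroke_similarity(4, 4, 4))  # 1.0
```

Since p cannot exceed either n or m, S lies between 0 (nothing shared) and 1 (identical sequences).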
For example, in one optional usage scenario, the processing from step 103 onward is described in detail as follows:
Step 1: decompose the new word by stroke to form a stroke sequence;
Step 2: decompose every word in the existing term vector model by stroke, so that each word yields a stroke sequence;
Step 3: compare the sequence obtained in step 1 with every sequence obtained in step 2 and find the most similar sequence, where similarity is calculated as 2 * (length of the sequence common to both) / (length of sequence 1 + length of sequence 2).
For example, if sequence 1 is abcd and sequence 2 is bcde, the common sequence is bcd with length 3, and the similarity is 2*3/(4+4) = 0.75.
Step 4: after the most similar sequence is found, assign the term vector of the word corresponding to that sequence to the new word.
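The four steps above can be put together in a short Python sketch. It assumes that stroke decomposition (steps 1 and 2) has already produced the sequences, written here as letter strings in the style of the abcd/bcde example, and it finds the common length with a greedy longest-run search, which is one possible reading of the text rather than a method this application spells out. The word names `w1` and `w2` and all vector values are invented.

```python
from typing import Dict, List, Sequence, Tuple


def coincident_total(a: Sequence, b: Sequence) -> int:
    """Total length of shared contiguous runs, taken greedily, longest first."""
    a, b = list(a), list(b)
    total = 0
    while True:
        best_len, best_i, best_j = 0, 0, 0
        prev = [0] * (len(b) + 1)
        for i, x in enumerate(a, 1):
            cur = [0] * (len(b) + 1)
            for j, y in enumerate(b, 1):
                if x is not None and x == y:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best_len:
                        best_len, best_i, best_j = cur[j], i - cur[j], j - cur[j]
            prev = cur
        if best_len == 0:
            return total
        total += best_len
        # Mask the used run so it cannot be matched again.
        a[best_i:best_i + best_len] = [None] * best_len
        b[best_j:best_j + best_len] = [None] * best_len


def similarity(a: Sequence, b: Sequence) -> float:
    """Step 3: 2 * common length / (length of sequence 1 + length of sequence 2)."""
    return 2 * coincident_total(a, b) / (len(a) + len(b))


def configure_initial_vector(new_seq: Sequence,
                             dict_seqs: Dict[str, Sequence],
                             term_vectors: Dict[str, List[float]]) -> Tuple[str, List[float]]:
    """Steps 3 and 4: pick the most similar dictionary word and copy its vector."""
    best_word = max(dict_seqs, key=lambda w: similarity(new_seq, dict_seqs[w]))
    return best_word, list(term_vectors[best_word])


# Invented data: stroke sequences (step 2) and vectors of two dictionary words.
dict_seqs = {"w1": "bcde", "w2": "xyz"}
term_vectors = {"w1": [0.1, 0.2], "w2": [0.9, 0.8]}

new_seq = "abcd"  # stroke sequence of the new word (step 1)
print(similarity(new_seq, dict_seqs["w1"]))  # 0.75
print(configure_initial_vector(new_seq, dict_seqs, term_vectors))  # ('w1', [0.1, 0.2])
```

The vector is copied rather than shared, so later training of the new word does not alter the vector of the dictionary word; whether the application intends a copy is not stated.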
Step 105, determine the term vector corresponding to the word whose stroke sequence is most similar to that of the first word, and configure it as the initial term vector of the first word.
That is, if the first word is not in the term vector dictionary, the strokes of the first word are compared with the strokes of the words in the term vector dictionary to find the word with the highest stroke sequence similarity, and the term vector of that word is assigned to the first word.
This technical solution makes full use of a rich term vector dictionary and of the stroke structure of Chinese characters to extend semantic features: the term vector of an unknown word is predicted from the term vectors of known words, which remedies the shortcoming of assigning random term vectors to unregistered words, improves the accuracy of subsequent training tasks, and reduces the training gradient.
It should be noted that the steps shown in the flow chart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flow chart, in some cases the steps shown or described may be executed in an order different from the one given here.
From the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the preferred implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Embodiment 2
This embodiment further provides a term vector configuration device for implementing Embodiment 1 above and its preferred implementations. For terms or implementations not described in detail in this embodiment, refer to the corresponding description in Embodiment 1; what has already been described is not repeated.
As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a schematic diagram of the term vector configuration device according to an embodiment of the present invention. As shown in Fig. 2, the device includes a first determining module 10, a judgment module 20, a decomposition module 30, a computing module 40, and a second determining module 50.
The first determining module is configured to determine a first word for which an initial term vector is to be configured. The judgment module is configured to judge whether the first word is in a term vector dictionary, wherein the term vector dictionary stores a one-to-one correspondence between multiple words and multiple term vectors. The decomposition module is configured to perform stroke decomposition on the first word to obtain a stroke sequence if the first word is not in the term vector dictionary. The computing module is configured to calculate the similarity between the stroke sequence of each word in the term vector dictionary and the stroke sequence of the first word. The second determining module is configured to determine the term vector corresponding to the word whose stroke sequence is most similar to that of the first word, and to configure it as the initial term vector of the first word.
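If the device is realized in software, the five modules might map onto the methods of a single class, as in the hedged Python sketch below. The class name, method names, and the injected `decompose` and `similarity` callables are all hypothetical. In the usage example each letter of a string is treated as one "stroke", and `difflib.SequenceMatcher.ratio()` serves only as a placeholder similarity; neither is taken from this application.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


@dataclass
class TermVectorConfigurator:
    term_vectors: Dict[str, Vector]                    # term vector dictionary
    decompose: Callable[[str], Sequence]               # injected stroke decomposition
    similarity: Callable[[Sequence, Sequence], float]  # injected similarity measure
    configured: Dict[str, Vector] = field(default_factory=dict)

    def determine_first_word(self, words: List[str]) -> Optional[str]:
        """First determining module 10: first word with no initial vector yet."""
        return next((w for w in words if w not in self.configured), None)

    def is_registered(self, word: str) -> bool:
        """Judgment module 20: is the word in the term vector dictionary?"""
        return word in self.term_vectors

    def decompose_word(self, word: str) -> Sequence:
        """Decomposition module 30: word to stroke sequence."""
        return self.decompose(word)

    def compute_similarities(self, seq: Sequence) -> Dict[str, float]:
        """Computing module 40: similarity to every dictionary word."""
        return {w: self.similarity(seq, self.decompose(w)) for w in self.term_vectors}

    def select_vector(self, scores: Dict[str, float]) -> Vector:
        """Second determining module 50: vector of the most similar word."""
        best = max(scores, key=scores.get)
        return list(self.term_vectors[best])

    def configure(self, word: str) -> Vector:
        """Run the modules in order and record the initial vector."""
        if self.is_registered(word):
            vector = list(self.term_vectors[word])
        else:
            scores = self.compute_similarities(self.decompose_word(word))
            vector = self.select_vector(scores)
        self.configured[word] = vector
        return vector


# Toy usage: letters act as strokes, difflib's ratio is a placeholder similarity.
configurator = TermVectorConfigurator(
    term_vectors={"bcde": [0.1, 0.2], "xyz": [0.9, 0.8]},
    decompose=list,
    similarity=lambda a, b: SequenceMatcher(None, a, b).ratio(),
)
print(configurator.configure("abcd"))  # [0.1, 0.2]
```

Passing `decompose` and `similarity` in as callables keeps the module boundaries visible and lets either one be replaced without touching the rest of the class.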
Optionally, the term vector dictionary includes a second word, and the computing module includes: a first determination unit, configured to determine the total length of the coincident sequence segments of the stroke sequence of the second word and the stroke sequence of the first word, wherein a coincident sequence segment is a segment in which the strokes are arranged identically in the stroke sequences of the two words; and a second determination unit, configured to determine, based on the total length of the coincident sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word.
Optionally, the second determination unit determines the similarity between the stroke sequence of the first word and the stroke sequence of the second word from the total length of the coincident sequence segments using the following formula: S = 2*p/(n+m), where S is the similarity between the stroke sequence of the first word and the stroke sequence of the second word, p is the total length of the coincident sequence segments of the first word and the second word, n is the length of the stroke sequence of the first word, and m is the length of the stroke sequence of the second word.
Optionally, the first determining module includes: an acquiring unit, configured to obtain a corpus to be segmented; a segmentation unit, configured to segment the corpus to obtain a sequence of multiple segmented words; and a third determination unit, configured to determine, among the segmented words, a first segmented word for which no initial term vector has been configured, to obtain the first word.
In this embodiment, a word not registered in the term vector dictionary is decomposed into strokes, the registered word whose strokes are closest to it is looked up in the term vector dictionary, and the term vector of that word is configured as the initial term vector of the unregistered word. This solves the technical problem in the related art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks. Because the initial term vector of an unregistered word is assigned using the semantic information carried by Chinese strokes, the time consumed by subsequent training tasks can be reduced and their accuracy improved.
It should be noted that each of the above modules may be implemented in software or hardware. In the latter case, this may be done in the following ways, among others: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
Evidently, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be executed in an order different from the one given here. Alternatively, they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
Embodiment 3
An embodiment of the present invention further provides a storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
In this embodiment, a word not registered in the term vector dictionary is decomposed into strokes, the registered word whose strokes are closest to it is looked up in the term vector dictionary, and the term vector of that word is configured as the initial term vector of the unregistered word. This solves the technical problem in the related art that configuring term vectors for unregistered words by random assignment reduces the accuracy of subsequent training tasks. Because the initial term vector of an unregistered word is assigned using the semantic information carried by Chinese strokes, the time consumed by subsequent training tasks can be reduced and their accuracy improved.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Embodiment 4
An embodiment of the present invention also provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any of the method embodiments described above.
In this embodiment, a word that is not registered in the word vector dictionary is disassembled into strokes, the registered word whose strokes are closest to it is searched for in the word vector dictionary, and the word vector of that stroke-similar word is configured as the initial word vector of the unregistered word. This solves the technical problem in the related art that configuring the word vector of an unregistered word by random assignment degrades the accuracy of subsequent training tasks. Because the initial word vector is assigned using the semantic information carried in the strokes of Chinese characters, the time taken by subsequent training tasks can be reduced and their accuracy improved.
Optionally, the electronic device may further include a transmission device and an input/output device, each of which is connected to the processor. Fig. 3 is a hardware block diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 3, the electronic device may include one or more processors 302 (only one is shown in Fig. 3; the processor 302 may include, but is not limited to, a processing unit such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 304 for storing data. Optionally, the electronic device may also include a transmission device 306 for communication functions and an input/output device 308. Those of ordinary skill in the art will appreciate that the structure shown in Fig. 3 is only illustrative and does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components than shown in Fig. 3, or have a configuration different from that shown in Fig. 3.
The memory 304 can be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the word vector configuration method in the embodiments of the present invention. By running the computer program stored in the memory 304, the processor 302 executes various functional applications and data processing, that is, implements the method described above. The memory 304 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 304 may further include memory located remotely from the processor 302, and such remote memory can be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 306 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the electronic device. In one example, the transmission device 306 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station in order to communicate with the Internet. In another example, the transmission device 306 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for configuring a word vector, characterized in that the method comprises:
determining a first word for which an initial word vector is to be configured;
judging whether the first word is in a word vector dictionary, wherein the word vector dictionary is used to store a one-to-one correspondence between multiple words and multiple word vectors;
if it is judged that the first word is not in the word vector dictionary, performing stroke disassembly on the first word to obtain a stroke sequence;
calculating the similarity between the stroke sequence of each word in the word vector dictionary and the stroke sequence of the first word;
determining the word vector corresponding to the word whose stroke sequence has the highest similarity to that of the first word, and configuring it as the initial word vector of the first word.
2. The method according to claim 1, characterized in that the word vector dictionary includes a second word, and calculating the similarity between the stroke sequence of each word in the word vector dictionary and the stroke sequence of the first word comprises:
determining the total length of the overlapping sequence segments of the stroke sequence of the second word and the stroke sequence of the first word, wherein an overlapping sequence segment is a sequence segment in which the strokes are arranged identically in the stroke sequences of the two words;
determining, based on the total length of the overlapping sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word.
3. The method according to claim 2, characterized in that determining, based on the total length of the overlapping sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word uses the following formula:
S = 2p / (n + m)
wherein S is the similarity between the stroke sequence of the first word and the stroke sequence of the second word, p is the total length of the overlapping sequence segments of the first word and the second word, n is the length of the stroke sequence of the first word, and m is the length of the stroke sequence of the second word.
4. The method according to claim 1, characterized in that determining the first word for which an initial word vector is to be configured comprises:
obtaining a corpus to be segmented;
segmenting the corpus to obtain a sequence of multiple segmented words;
determining, among the multiple segmented words, a first segmented word for which no initial word vector has been configured, to obtain the first word.
5. A device for configuring a word vector, characterized by comprising:
a first determining module, configured to determine a first word for which an initial word vector is to be configured;
a judgment module, configured to judge whether the first word is in a word vector dictionary, wherein the word vector dictionary is used to store a one-to-one correspondence between multiple words and multiple word vectors;
a disassembly module, configured to perform stroke disassembly on the first word to obtain a stroke sequence if it is judged that the first word is not in the word vector dictionary;
a computing module, configured to calculate the similarity between the stroke sequence of each word in the word vector dictionary and the stroke sequence of the first word;
a second determining module, configured to determine the word vector corresponding to the word whose stroke sequence has the highest similarity to that of the first word, and to configure it as the initial word vector of the first word.
6. The device according to claim 5, characterized in that the word vector dictionary includes a second word, and the computing module comprises:
a first determination unit, configured to determine the total length of the overlapping sequence segments of the stroke sequence of the second word and the stroke sequence of the first word, wherein an overlapping sequence segment is a sequence segment in which the strokes are arranged identically in the stroke sequences of the two words;
a second determination unit, configured to determine, based on the total length of the overlapping sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word.
7. The device according to claim 6, characterized in that the second determination unit determines, based on the total length of the overlapping sequence segments of the first word and the second word, the similarity between the stroke sequence of the first word and the stroke sequence of the second word using the following formula:
S = 2p / (n + m)
wherein S is the similarity between the stroke sequence of the first word and the stroke sequence of the second word, p is the total length of the overlapping sequence segments of the first word and the second word, n is the length of the stroke sequence of the first word, and m is the length of the stroke sequence of the second word.
8. The device according to claim 5, characterized in that the first determining module comprises:
an acquiring unit, configured to obtain a corpus to be segmented;
a segmentation unit, configured to segment the corpus to obtain a sequence of multiple segmented words;
a third determination unit, configured to determine, among the multiple segmented words, a first segmented word for which no initial word vector has been configured, to obtain the first word.
9. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the method of any one of claims 1 to 4.
10. An electronic device comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method of any one of claims 1 to 4.
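As an illustration of the similarity formula S = 2p / (n + m) recited in claims 3 and 7, the following minimal sketch computes p, n, and m explicitly. It rests on an assumption: the "overlapping sequence segments" are interpreted here as the matching blocks found by Python's `difflib.SequenceMatcher`, which is one plausible reading and is not stated in the claims. The function name and the stroke-code strings are hypothetical.

```python
from difflib import SequenceMatcher


def stroke_similarity(first, second):
    """Compute S = 2p / (n + m) for two stroke sequences.

    p is taken as the total length of the matching blocks reported by
    SequenceMatcher (an assumed interpretation of the overlapping segments),
    n and m are the lengths of the two stroke sequences.
    """
    n, m = len(first), len(second)
    if n + m == 0:
        return 0.0  # two empty sequences: define the similarity as 0 here
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    # The final block returned is a zero-length sentinel, so it does not affect p.
    p = sum(block.size for block in matcher.get_matching_blocks())
    return 2 * p / (n + m)


# Hypothetical stroke codes: an 8-stroke sequence against a 4-stroke sequence
# that coincides with its first four strokes, so p = 4, n = 8, m = 4.
print(stroke_similarity("12341234", "1234"))
```

With p = 4, n = 8, and m = 4, the formula gives S = 8/12, roughly 0.67. Under this interpretation the value coincides with what `SequenceMatcher.ratio()` returns, since that method is documented as 2.0 * M / T for M matches and T total elements.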
CN201910534810.8A 2019-06-20 2019-06-20 The configuration method of term vector, device, storage medium, electronic device Pending CN110413990A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910534810.8A CN110413990A (en) 2019-06-20 2019-06-20 The configuration method of term vector, device, storage medium, electronic device
PCT/CN2019/117725 WO2020253050A1 (en) 2019-06-20 2019-11-13 Word vector configuration method and apparatus, and storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910534810.8A CN110413990A (en) 2019-06-20 2019-06-20 The configuration method of term vector, device, storage medium, electronic device

Publications (1)

Publication Number Publication Date
CN110413990A true CN110413990A (en) 2019-11-05

Family

ID=68359467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910534810.8A Pending CN110413990A (en) 2019-06-20 2019-06-20 The configuration method of term vector, device, storage medium, electronic device

Country Status (2)

Country Link
CN (1) CN110413990A (en)
WO (1) WO2020253050A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253050A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Word vector configuration method and apparatus, and storage medium and electronic device
CN113342934A (en) * 2021-05-31 2021-09-03 北京明略软件系统有限公司 Word vector determination method and device, storage medium and electronic device
CN113342934B (en) * 2021-05-31 2024-04-19 北京明略软件系统有限公司 Word vector determining method and device, storage medium and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150178631A1 (en) * 2013-09-04 2015-06-25 Neural Id Llc Pattern recognition system
CN105608462A (en) * 2015-12-10 2016-05-25 小米科技有限责任公司 Character similarity judgment method and device
CN106095865A (en) * 2016-06-03 2016-11-09 中细软移动互联科技有限公司 A kind of trade mark text similarity reviewing method
CN108959250A (en) * 2018-06-27 2018-12-07 众安信息技术服务有限公司 A kind of error correction method and its system based on language model and word feature
CN109145294A (en) * 2018-08-07 2019-01-04 北京三快在线科技有限公司 Text entities recognition methods and device, electronic equipment, storage medium
CN109408814A (en) * 2018-09-30 2019-03-01 中国地质大学(武汉) Across the language vocabulary representative learning method and system of China and Britain based on paraphrase primitive word
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299269A (en) * 2018-10-23 2019-02-01 阿里巴巴集团控股有限公司 A kind of file classification method and device
CN110413990A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 The configuration method of term vector, device, storage medium, electronic device

Also Published As

Publication number Publication date
WO2020253050A1 (en) 2020-12-24

Similar Documents

Publication Publication Date Title
CN108763325B (en) A kind of network object processing method and processing device
CN109697500B (en) Data processing method and device, electronic equipment and storage medium
CN109344395A (en) A kind of data processing method, device, server and storage medium
US11354549B2 (en) Method and system for region proposal based object recognition for estimating planogram compliance
CN110969172A (en) Text classification method and related equipment
CN110968664A (en) Document retrieval method, device, equipment and medium
CN108197660A (en) Multi-model Feature fusion/system, computer readable storage medium and equipment
CN110413990A (en) The configuration method of term vector, device, storage medium, electronic device
CN108090040B (en) Text information classification method and system
CN106782516B (en) Corpus classification method and apparatus
CN112200862B (en) Training method of target detection model, target detection method and device
CN114090792A (en) Document relation extraction method based on comparison learning and related equipment thereof
CN106339105A (en) Method and device for identifying phonetic information
CN114141236B (en) Language model updating method and device, electronic equipment and storage medium
CN116128044A (en) Model pruning method, image processing method and related devices
CN115375965A (en) Preprocessing method for target scene recognition and target scene recognition method
CN106815592B (en) Text data processing method and device and wrong word recognition methods and device
CN113139932B (en) Deep learning defect image identification method and system based on ensemble learning
CN114464194A (en) Voiceprint clustering method and device, storage medium and electronic device
CN106372071B (en) The information acquisition method and device of data warehouse
CN110197143B (en) Settlement station article identification method and device and electronic equipment
CN111444345A (en) Dish name classification method and device
CN113515591A (en) Text bad information identification method and device, electronic equipment and storage medium
CN111522943A (en) Automatic test method, device, equipment and storage medium for logic node
CN111079468A (en) Method and device for robot to recognize object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191105