Detailed Description
To illustrate the technical solutions of the embodiments of this specification more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Evidently, the drawings described below are only some examples or embodiments of this specification; a person of ordinary skill in the art can, without creative effort, apply this specification to other similar scenarios according to these drawings. Unless apparent from the context or otherwise stated, the same reference numerals in the drawings denote the same structure or operation.
It should be understood that the terms "system", "device", "unit", and/or "module" as used herein are a way of distinguishing different components, elements, parts, portions, or assemblies at different levels. However, these terms may be replaced by other expressions if those expressions achieve the same purpose.
As used in this specification and the claims, unless the context clearly indicates otherwise, words such as "a", "an", "one", and/or "the" do not refer specifically to the singular and may also include the plural. In general, the terms "include" and "comprise" indicate only that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also include other steps or elements.
Flowcharts are used in this specification to illustrate the operations performed by a system according to embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
One or more embodiments of this specification provide a core information extraction method. The basic inventive concept is that the core information of a text is usually the words that most strongly influence the semantics of the text: when the text loses its core information, its semantics are affected the most.
One or more embodiments of this specification use a masking method to determine, one by one, the weights of one or more words in the text information, and judge each word's semantic influence on the text information from its weight, so that the core information of the text can be determined from the weights. Specifically, to determine how much a given word influences a piece of text information, the word can be masked with a special symbol; the semantic gap between the text lacking that word and the original text information then indicates the word's degree of influence on the text information.
One or more embodiments of this specification determine the weights in the text information by calculating the semantic distance between the text processed after word segmentation and the original text. Texts that are semantically similar or close also lie relatively close to one another in the feature vector space. On that principle, a text vector obtained from word vectors can carry accurate semantic information, so converting the words of a text into a one-dimensional vector yields a semantic representation of the text. By encoding both the original text information and the processed text into vector representations, and calculating the distance in vector space between each mask vector representation and the original vector representation, the weight of each word segment in the text information can be determined, and the core information in the text information is finally extracted on that basis.
It should be understood that the application scenarios of the system and method of this specification are only some examples or embodiments of this specification; a person of ordinary skill in the art can, without creative effort, apply this specification to other similar scenarios according to these drawings.
The technical solution for core information extraction in one or more embodiments of this specification can be applied to a variety of business scenarios, including but not limited to artificial intelligence, data mining, big data analysis, data crawling, public opinion monitoring, disaster monitoring, traffic monitoring, hot-topic analysis, information retrieval, online customer service, question-answering robots, semantic analysis, speech recognition, police intelligence, emergency help-seeking, rescue and disaster relief, e-commerce services, literature retrieval, library management, machine translation, and document plagiarism monitoring. Although this specification is described mainly in terms of text information, it should be noted that its principles can also be applied to semantic recognition in other business scenarios such as speech processing and image processing, for example speech semantic recognition, dialogue robots, and risk monitoring of text in images.
Fig. 1 is an exemplary flowchart of a core information extraction method according to some embodiments of this specification. As shown in Fig. 1, the method includes the following steps:
Step 101: obtain text information.
In some embodiments, this step may be performed by a text acquisition module. In some embodiments, the text information may be acquired in at least one of the following ways: voice input, image recognition, gesture input, manual input, client push, server transmission, database import, computer data set import, and automatic acquisition by a computer. For example, in a public opinion monitoring scenario, a user enters text information into a news search engine by voice or manual input, and the text acquisition module then obtains the text information by monitoring the back-end data in real time. In some embodiments, when the input data is audio, the text acquisition module may recognize the audio signal to obtain the text information. In some embodiments, the input data may be an image, and the text acquisition module may obtain the text information in the image based on text detection and recognition techniques (such as OCR or deep learning models). In some embodiments, the input data may also be a posture input such as a gesture, and the text acquisition module may match the posture to determine the corresponding text information. In some embodiments, the text information may include short text information from which core information is to be extracted. In some embodiments, the short text information may take the form of voice input and/or text input in a customer service robot dialogue scenario or an intelligent repository scenario. In some embodiments, the short text information may be a phrase, such as "a cat playing with a ball of yarn"; a simple sentence, such as "I have a cat."; or a compound sentence, such as "The kitten is playing with a ball of yarn and having great fun". In some embodiments, the short text information may also include a combination of the above forms. In some embodiments, the text information may be stored in one or a combination of the following: a database, a memory, or another component with a storage function; the text acquisition module obtains the text information from the storage location over a network.
Step 103: perform word segmentation processing on the text information to obtain one or more word segments corresponding to the text information.
In some embodiments, this step may be performed by a word segmentation module. Word segmentation of the text information can be understood as preprocessing that cuts the text information into a sequence of one or more individual words. Its purpose is to make it easier to judge the degree of influence of each segmented word on the complete text information it belongs to; based on that degree of influence, it can be judged whether the word is a core word.
In some embodiments, word segmentation of the text information may be performed based on a preset algorithm, and in some embodiments the preset algorithm may include a specific word segmentation model. Word segmentation methods include, but are not limited to, methods based on string matching, methods based on understanding, and methods based on statistics. In some embodiments, word segmentation may be performed on the text information by a word segmentation model. Word segmentation models include, but are not limited to, the N-gram model, the hidden Markov model (HMM), the maximum entropy model (ME), the conditional random field model (Conditional Random Fields, CRF), and the JIEBA word segmentation model. The process of segmenting text information is described below using the JIEBA word segmentation model as an example. One or more embodiments of this specification use the specific text information "my dog is cute, he likes playing" as an example to explain the related technical solutions; it should be understood that this example does not limit the protection scope of this specification in any way.
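As a non-limiting illustration of the string-matching family of segmentation methods mentioned above, the following minimal sketch applies greedy longest-match-first splitting to the example text. It is not the JIEBA model; the vocabulary `VOCAB`, the `##` continuation marker handling, and the function names are assumptions introduced only for this illustration.

```python
import re

# Hypothetical toy vocabulary; a real segmenter ships a far larger dictionary.
VOCAB = {"my", "dog", "is", "cute", "he", "likes", "play", "##ing"}


def segment_word(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary entry matches at this position
        pieces.append(piece)
        start = end
    return pieces


def segment(text, vocab=VOCAB):
    """Lower-case the text, drop punctuation, and segment every word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [piece for word in words for piece in segment_word(word, vocab)]


tokens = segment("my dog is cute, he likes playing")
print(tokens)
```

With this toy vocabulary the word "playing" is split into "play" and "##ing", while any word that the vocabulary cannot cover is reported as "[UNK]".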
In some embodiments, the JIEBA word segmentation model is used to segment the text obtained in step 101. For example, segmenting the acquired text information s0: "my dog is cute, he likes playing" with the JIEBA word segmentation model yields multiple word segments: w1: "my", w2: "dog", w3: "is", w4: "cute", w5: "he", w6: "likes", w7: "play", w8: "##ing".
Step 105: determine one or more weights of the one or more word segments in the text information.
In some embodiments, this step may be performed by a weight determination module. The weight of a word segment in its text information reflects the importance or degree of influence of that word segment in the text information. Therefore, the weight of a word segment in the text information can be used to judge whether the word segment is a keyword or core word of that text information. For example, the weight determination module may determine the weights that the word segments of text information s0 have in that text. In the preceding step, segmenting text information s0: "my dog is cute, he likes playing" yielded the following sequence: w1: "my", w2: "dog", w3: "is", w4: "cute", w5: "he", w6: "likes", w7: "play", w8: "##ing". The sequence contains 8 word segments, namely w1, w2, ..., w8. The weight determination module may determine the 8 weights that these 8 word segments respectively have in text information s0.
In some embodiments, the method for determining the weight of a word segment in the text information it belongs to includes, but is not limited to, one or a combination of the following: determining the weight from statistical information about the word segment; determining the weight by calculating the contextual relations of the word segment; processing the text information with a supervised machine learning model to determine the weights of different word segments; and determining the weight from the semantic distance between the mask text corresponding to the word segment and the original text information.
In some embodiments, the method of determining word segment weights from statistical information includes at least one of the following: the TF-IDF method (which counts how often a word segment occurs in the text information and how many texts contain that word segment), the topic model method, and the TextRank method (which updates scores iteratively according to word co-occurrence relations).
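By way of example only, the TF-IDF idea described above can be sketched as follows. The smoothed inverse document frequency used here is one common variant chosen for this illustration, and the small corpus and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {segment: tf * idf} for one segmented text against a corpus."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for token, count in counts.items():
        tf = count / len(doc_tokens)  # frequency inside this text
        df = sum(1 for doc in corpus if token in doc)  # texts containing it
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        scores[token] = tf * idf
    return scores


# Hypothetical corpus of already-segmented texts.
corpus = [
    ["my", "dog", "is", "cute"],
    ["my", "cat", "is", "cute"],
    ["he", "likes", "my", "cat"],
]
scores = tf_idf(corpus[0], corpus)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

In this sketch "dog" occurs in only one text and therefore scores higher than "my", which occurs in every text.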
In some embodiments, the method of determining word weights by calculating contextual relations includes: representing all words, according to their contextual relations, as a graph G = (V, E), where the nodes V represent all the words and the edges E represent the contextual relations between words; for a given node V_i, its score can be calculated from the scores of its adjacent nodes. In some embodiments, the method of determining word segment weights by calculating the semantic distance between the vector representations of word segments includes: encoding each word segment into a vector representation, and calculating the spatial distance between the vector representations to determine the semantic distance between the word segments.
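The graph-based scoring described above may, purely as an illustration, be sketched with a PageRank-style update on an undirected co-occurrence graph. The window size, the damping factor of 0.85, the iteration count, and the function name `text_rank` are assumptions of this sketch rather than values taken from this specification.

```python
def text_rank(tokens, window=2, damping=0.85, iterations=50):
    """Score words on a co-occurrence graph G = (V, E) by iterative updates."""
    # V: one node per distinct word; E: words co-occurring within the window.
    neighbors = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != t:
                neighbors[t].add(u)
                neighbors[u].add(t)
    scores = {t: 1.0 for t in neighbors}
    for _ in range(iterations):
        # Each node's score is computed from the scores of its adjacent nodes.
        scores = {
            t: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[t])
            for t in neighbors
        }
    return scores


scores = text_rank(["my", "dog", "is", "cute", "he", "likes", "play", "##ing"])
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

With a window of 2 the graph is a simple chain, so a word in the interior of the sequence such as "dog" receives a higher score than the end word "my".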
The method of determining word segment weights by calculating the semantic distance between the mask text corresponding to a word segment and the original text information is described in detail, with specific embodiments, elsewhere in this specification; see Fig. 2.
Step 107: determine the core information of the text information based at least on the one or more weights.
In some embodiments, this step may be performed by a core information determination module. The core information determination module may use one or more methods to determine the core information of the text information based on the weights of the different word segments. In some embodiments, the core information determination module may determine the core information of the text information, that is, the core words, from the weighted word segments based on a preset threshold. In some embodiments, the preset threshold may include a weight threshold; specifically, a word segment whose weight is greater than the weight threshold is determined to be core information. In some embodiments, the preset threshold may also include a quantity threshold; for example, the word segments with the three highest weights are determined to be core information.
For example, suppose the weights in text information s0 of its 8 word segments w1: "my", w2: "dog", w3: "is", w4: "cute", w5: "he", w6: "likes", w7: "play", w8: "##ing" have been determined. In some embodiments, a weight threshold of 80% may first be set, and the word segments whose weights are greater than 80% are then selected as core information. In some embodiments, a quantity threshold of 3 may instead be set, and the 3 word segments with the highest weights are then selected as core information. When core information is determined according to the quantity threshold and/or the weight threshold, the weights of the word segments may be sorted first, or the word segments may be screened directly against the preset threshold without sorting.
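The two selection strategies described above can be sketched as follows, as a non-limiting example. The weight values are invented for illustration only, and the function names `select_by_weight` and `select_top_k` are hypothetical.

```python
def select_by_weight(weights, weight_threshold):
    """Keep word segments whose weight exceeds the threshold (no sorting needed)."""
    return [seg for seg, w in weights.items() if w > weight_threshold]


def select_top_k(weights, k):
    """Keep the k word segments with the highest weights (sorted descending)."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [seg for seg, _ in ranked[:k]]


# Hypothetical weights for the 8 word segments of s0 (illustrative values only).
weights = {
    "my": 0.10, "dog": 1.00, "is": 0.05, "cute": 0.85,
    "he": 0.20, "likes": 0.60, "play": 0.90, "##ing": 0.00,
}

print(select_by_weight(weights, 0.80))  # weight threshold of 80%
print(select_top_k(weights, 3))         # quantity threshold of 3
```

The threshold variant scans the weights once in their original order, whereas the quantity variant must rank them first; which is preferable depends on whether a fixed number of core words is required.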
In some embodiments, the preset threshold may be set in advance, in ways including but not limited to manual setting, computer setting, and program setting. The specific percentage or value of the weight threshold or quantity threshold may depend on the weights actually determined in step 105.
It should be noted that, in some embodiments, the method may further include screening the word segments corresponding to the text information based on a screening condition. In some embodiments, the screening step may be performed by a screening module. For example, words expressing negation, transition, and similar meanings can have a large influence on semantics but are usually not the core information of a sentence. Therefore, in some embodiments, the word segments may be screened further: if the word segments corresponding to the text information contain such words, they can be excluded when core words are selected so that they are not chosen. In some embodiments, such words may be excluded by adding stop-word restrictions to the screening process. In some embodiments, words that strongly influence semantics but contain no entity information may be collected in advance into a restricted vocabulary table; when the screening step is performed, each word segment can be compared with the words in the table to determine whether that word segment needs to be excluded. The restricted vocabulary table includes one or more restricted words, which may include, but are not limited to, conjunctions, modal particles, and adverbs expressing negation, transition, and similar meanings. In some embodiments, the restricted words may be established by a restricted-word acquisition module, or a restricted-word module may obtain one or more restricted words from a pre-generated restricted vocabulary. In some embodiments, the screening step may be placed after the weights of the word segments are determined, or before. For example, after step 103 is performed, the word segments may be screened to exclude the restricted words among them. As another example, after step 107 is performed, the obtained core information may be screened to exclude the restricted words in it. In some embodiments, the screening condition may include one or more restricted words, or a restricted vocabulary combination composed of one or more restricted words; in some embodiments, such a combination may also be understood as the restricted vocabulary table.
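As one possible illustration of the screening step described above, the following sketch compares each word segment against a restricted vocabulary table. The table contents `RESTRICTED` and the function name `screen` are assumptions made for this example only.

```python
# Hypothetical restricted vocabulary table: words that strongly affect
# semantics (negation, transition) but carry no entity information.
RESTRICTED = frozenset({"not", "no", "never", "but", "however"})


def screen(segments, restricted=RESTRICTED):
    """Exclude word segments that appear in the restricted vocabulary table."""
    return [seg for seg in segments if seg.lower() not in restricted]


candidates = ["dog", "is", "not", "cute", "But", "friendly"]
print(screen(candidates))
```

The same function can be applied to the word segments right after segmentation or to the core information after it has been selected, matching the two placements of the screening step described above.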
It should be noted that the above description of the core information extraction method 100 is provided for example and illustration only and does not limit the scope of application of one or more embodiments of this specification. A person skilled in the art may make various modifications and changes to the core information extraction method 100 under the guidance of one or more embodiments of this specification. Such modifications and changes nevertheless remain within the scope of one or more embodiments of this specification.
Fig. 2 is a sub-flowchart for determining word segment weights according to some embodiments of this specification.
In some embodiments, the weights of one or more word segments may also be determined by a mask distance method. Specifically, the weight of a word segment can be judged by determining the text distance between the mask text corresponding to that word segment and the original text information. In some embodiments, the semantic distance between two texts may be determined from the distance between their corresponding vector representations. Texts that are semantically similar or close also lie relatively close to one another in the feature vector space. On that principle, a text vector obtained from word vectors can carry accurate semantic information, so the words of a text are converted into a one-dimensional vector that serves as the semantic representation of the text.
Some embodiments of determining the weights of one or more word segments in the text information are described in detail below with reference to Fig. 2. Process 200 includes:
Step 201: determine one or more mask texts based on the one or more word segments and the text information, where each of the one or more mask texts has one word segment masked.
In some embodiments, this step may be performed by the weight determination module. In some embodiments, the mask texts corresponding to one or more word segments may be determined as follows: a special symbol is used to cover, one at a time, each word segment in the text information whose weight or importance is to be judged, thereby determining the mask text corresponding to that word segment. The role of the special symbol is to make the word segment whose weight or importance is to be judged missing from the text information, so that the semantic gap between the mask text lacking that word segment and the text information can be judged, and in turn the importance of the word segment in the text information. If the semantics change greatly relative to the original text information after a word segment is removed, that word segment is very important to the text information. In some embodiments, the special symbol includes, but is not limited to, one or a combination of a character, a character string, a letter, and a number, as long as the symbol can indicate that the corresponding word segment is missing. In some embodiments, [MASK] or [M] may be used as the special symbol to cover, one at a time, the one or more word segments obtained in step 103. How the one or more mask texts are determined is explained below with reference to the specific example illustrated in Fig. 3.
Fig. 3 is a schematic diagram of exemplary mask texts according to some embodiments of this specification. In some embodiments, the symbol [M] may be used as the special symbol to cover, one at a time, the 8 word segments of text information s0: "my dog is cute, he likes play ##ing", namely w1: "my", w2: "dog", w3: "is", w4: "cute", w5: "he", w6: "likes", w7: "play", w8: "##ing", thereby determining the 8 mask texts corresponding to the 8 word segments: s1: "[M] dog is cute he likes play ##ing", s2: "my [M] is cute he likes play ##ing", s3: "my dog [M] cute he likes play ##ing", s4: "my dog is [M] he likes play ##ing", s5: "my dog is cute [M] likes play ##ing", s6: "my dog is cute he [M] play ##ing", s7: "my dog is cute he likes [M] ##ing", s8: "my dog is cute he likes play [M]".
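The construction of the mask texts illustrated in Fig. 3 can be sketched, by way of example only, as follows. The function name `mask_texts` is hypothetical, and joining the word segments with single spaces is an assumption of this sketch.

```python
def mask_texts(segments, mask_symbol="[M]"):
    """Return one mask text per word segment, with that segment covered."""
    return [
        " ".join(segments[:i] + [mask_symbol] + segments[i + 1:])
        for i in range(len(segments))
    ]


segments = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
masked = mask_texts(segments)
for index, text in enumerate(masked, start=1):
    print(f"s{index}: {text}")
```

Each of the 8 resulting texts differs from the original sequence in exactly one position, which is what allows the later distance to be attributed to a single word segment.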
In some embodiments, several of the word segments corresponding to the text information (for example, a combination of word segments) may also be masked at the same time, yielding the mask text corresponding to the word segments that are masked together. In some embodiments, the semantic importance in the original text information of the word segments that are masked together can be obtained from the semantic distance between that mask text and the original text information: the larger the distance, the greater the semantic importance, and the smaller the distance, the smaller the semantic importance.
In some embodiments, the semantic distance between two texts may be determined from the semantic distance between their corresponding vector representations. The following steps therefore determine the vector representations corresponding to the text information and to the mask texts.
Step 203: determine the original vector representation of the text information based on a first preset algorithm and the text information.
In some embodiments, this step may be performed by the weight determination module. In some embodiments, determining the original vector representation of the text information based on the first preset algorithm and the text information may consist of encoding the text information with one or more algorithms to determine the corresponding vector representation. In some embodiments, the first preset algorithm includes, but is not limited to, one or a combination of RNN (Recurrent Neural Network), CNN (Convolutional Neural Networks), Transformer, GPB, and BERT (Bidirectional Encoder Representations from Transformers). The vector representation algorithm is illustrated below using the BERT model as the first preset algorithm.
When the BERT model is used to convert text information into its corresponding vector representation, the input of the BERT model includes the text embedding, segment embedding, and position embedding corresponding to the text information to be converted. The text embedding comprises feature vectors encoded from the text information; words can be divided into a limited set of common sub-word units, which strikes a balance between the validity of words and the flexibility of characters. The segment embedding comprises feature vectors encoded from segment information and can be used to distinguish sentences of different contexts in the text, encoding them into different feature vectors. For example, if the text sequences "[CLS] my dog is cute [SEP]" and "he likes play ##ing [SEP]" belong to different clauses, the corresponding segment information can be expressed as "AAAAAABBBBB". The position embedding comprises feature vectors encoded from text position information, so that the position of each word is encoded into a feature vector. For example, in the text information s0 "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]", the elements of the sequence are numbered 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; "dog" is the third element of the sequence, with number 2, and the position embedding encodes this position into a feature vector. How the original text s0 is converted into the corresponding vector representation v0 by the BERT model is explained below with reference to Fig. 4 and a specific example.
Fig. 4 is a schematic diagram of converting text information into its corresponding vector representation based on the BERT model according to some embodiments of this specification. As shown in Fig. 4, the BERT model is used to convert text information s0: "my dog is cute, he likes play ##ing" into the corresponding original vector representation v0. As shown in the figure, the text embedding corresponding to text information s0 is: [CLS] my dog is cute [SEP] he likes play ##ing [SEP]; the corresponding segment embedding is: AAAAAABBBBB; and the corresponding position embedding is: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The sum of the text embedding, segment embedding, and position embedding corresponding to text information s0 is used as the input of the BERT model; after processing by the BERT model, the original vector representation v0 of text information s0 is obtained.
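Purely as an illustration of the three inputs described above, the following sketch assembles the token sequence, the segment labels, and the position numbers for the two clauses of s0. It does not run a BERT model or compute any embedding vectors; the function name `build_bert_inputs` is hypothetical.

```python
def build_bert_inputs(clause_a, clause_b):
    """Assemble the token sequence, segment labels, and position numbers."""
    tokens = ["[CLS]"] + clause_a + ["[SEP]"] + clause_b + ["[SEP]"]
    # Segment A covers [CLS], the first clause, and its [SEP];
    # segment B covers the second clause and the final [SEP].
    segments = ["A"] * (len(clause_a) + 2) + ["B"] * (len(clause_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions


tokens, segments, positions = build_bert_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
print(" ".join(tokens))
print("".join(segments))
print(positions)
```

In an actual encoder, each of the three sequences would be looked up in its own embedding table and the three resulting vectors would be summed per position before being passed to the model.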
Step 205: determine one or more mask vector representations based on the first preset algorithm and the one or more mask texts.
In some embodiments, this step may be performed by the weight determination module. In some embodiments, the mask vector representation corresponding to each mask text is determined using the same algorithm as in the preceding step. How mask text s1 is converted into the corresponding vector representation v1 by the BERT model is explained below with reference to Fig. 5 and a specific example.
Fig. 5 is a schematic diagram of converting a mask text into its corresponding vector representation based on the BERT model according to some embodiments of this specification. As shown in Fig. 5, the BERT model is used to convert mask text s1: "[M] dog is cute, he likes play ##ing" into the corresponding mask vector representation v1. As shown in the figure, the text embedding corresponding to mask text s1 is: [CLS] [M] dog is cute [SEP] he likes play ##ing [SEP]; the corresponding segment embedding is: AAAAAABBBBB; and the corresponding position embedding is: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The sum of the text embedding, segment embedding, and position embedding corresponding to mask text s1 is used as the input of the BERT model; after processing by the BERT model, the mask vector representation v1 of mask text s1 is obtained.
The other mask texts s2, ..., s8 are converted by the BERT model into the corresponding vector representations v2, ..., v8 in a similar way, which is not repeated here.
Step 207: determine the one or more weights according to the original vector representation and the one or more mask vector representations. In some embodiments, this step may be performed by the weight determination module.
Fig. 6 is a schematic diagram of determining weights according to some embodiments of this specification. As shown in Fig. 6:
The weight determination module determines one or more distances between the one or more mask vector representations and the original vector representation, and determines the one or more weights according to the one or more distances. In some embodiments, a weight is positively correlated with its corresponding distance; specifically, the larger the distance between a mask vector representation and the original vector representation, the greater the semantic influence of the masked word segment on the text information, and the larger the weight of that word segment.
After the vector representations corresponding to the text information and to the mask texts have been determined, the semantic distance between each mask vector representation and the original vector representation can be determined from them, and the importance or degree of influence on the text information of the word segment corresponding to each mask vector representation can then be judged from that semantic distance.
By determining the spatial distance between the one or more mask vector representations and the original vector representation, the weight determination module determines the semantic distance between the text information and each mask text. This gives the degree of influence of the word segment covered by the special symbol in each mask text, from which the weight of the word segment is determined; in this way the weights of the one or more word segments corresponding to the one or more mask vectors are finally determined.
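The overall flow of process 200 (mask each word segment, encode, measure the distance, derive a weight) can be sketched as follows under strong simplifying assumptions. The encoder here is NOT BERT: it is a hypothetical stand-in that averages invented two-dimensional word vectors (`TOY_VECTORS`), chosen only so that the example runs without any model. The use of cosine distance and min-max scaling is likewise one choice among those this specification mentions.

```python
import math

# Hypothetical 2-d word vectors; a real system would use an encoder such as BERT.
TOY_VECTORS = {
    "my": (0.1, 0.0),
    "dog": (1.0, 0.2),
    "is": (0.0, 0.1),
    "cute": (0.3, 0.9),
}
MASK = "[M]"


def encode(segments):
    """Toy encoder: mean of the word vectors; the mask symbol contributes zeros."""
    vectors = [TOY_VECTORS.get(seg, (0.0, 0.0)) for seg in segments]
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def masked_weights(segments):
    """Mask each segment in turn, measure the distance, then min-max scale.

    Assumes the word segments are distinct, since they are used as dict keys.
    """
    original = encode(segments)
    distances = [
        cosine_distance(original, encode(segments[:i] + [MASK] + segments[i + 1:]))
        for i in range(len(segments))
    ]
    lo, hi = min(distances), max(distances)
    span = (hi - lo) or 1.0  # guard against all distances being identical
    return {seg: (d - lo) / span for seg, d in zip(segments, distances)}


weights = masked_weights(["my", "dog", "is", "cute"])
print(sorted(weights.items(), key=lambda item: item[1], reverse=True))
```

With these invented vectors, masking "dog" moves the text vector the furthest, so "dog" receives the largest weight; with a real encoder the ranking would of course depend on the model.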
In some embodiments, the semantic distances between the mask vector representations and the original vector representation may also be normalized, yielding the weight of each mask vector representation relative to the original vector. This step may be performed by the weight determination module.
In some embodiments, methods for calculating the semantic distance between a mask vector and the original vector include, but are not limited to, calculation methods based on word frequency statistics, calculation methods based on ontologies, and calculation methods based on geometric metric spaces. Calculation methods based on word frequency statistics include, but are not limited to, methods based on overlapping words, and TF-IDF (Term Frequency-Inverse Document Frequency) and its various weighted variants (such as LSA, HAL, and Islam). Ontology-based distance calculation methods include, but are not limited to, methods based on edge distance in an ontology library, methods based on ontology library nodes, and hybrid methods. Calculation methods based on geometric metric spaces include, but are not limited to, Euclidean distance, cosine distance, and Manhattan distance.
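The three geometric distances named above can be sketched as follows, as an illustration only. The function names are hypothetical, and cosine distance is taken here in its usual sense of one minus the cosine similarity.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Cosine distance ignores vector length and compares direction only, whereas the Euclidean and Manhattan distances are sensitive to magnitude; which behaviour is preferable depends on the encoder that produces the vectors.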
In some embodiments, when the distance between a mask vector and the original vector is calculated using the cosine distance method, reference may be made to the following formula (1), the calculation formula of the cosine distance method:

$$d_i = 1 - \frac{v_0 \cdot v_i}{\lVert v_0 \rVert \, \lVert v_i \rVert} \tag{1}$$

where $v_0$ denotes the original vector, $v_i$ denotes the mask vector of the i-th mask text, and $d_i$ denotes the semantic distance between the i-th mask vector and the original vector.
In some embodiments, the above semantic distances may be further normalized to obtain corresponding values between 0 and 1, that is, the corresponding weights. Common normalization methods include, but are not limited to, min-max normalization and/or mean normalization.
In some embodiments, when min-max normalization is used to normalize the semantic distances,
reference may be made to formula (2), the min-max normalization formula:
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \quad (2)$$
where $X_{norm}$ denotes the normalized data, $X$ denotes the raw data, and $X_{max}$ and $X_{min}$
denote the maximum and minimum values of the raw data set, respectively.
For example, the weight obtained by applying min-max normalization to the semantic distance $d_i$
between the i-th mask vector and the original vector equals $(d_i - d_{min}) / (d_{max} - d_{min})$,
where $d_{max}$ and $d_{min}$ are the maximum and minimum of the semantic distances. As another
example, suppose the semantic distance between the vector representation $v_1$ of the mask text
formed by masking the 1st word segmentation information in the original text $s_0$ and the original
vector representation $v_0$ is $d_1$. After min-max normalization, the resulting weight 1 is
$(d_1 - d_{min}) / (d_{max} - d_{min})$, which represents the weight of the 1st word segmentation
information in the original text.
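The min-max normalization of semantic distances into weights may be sketched as follows. This is a minimal sketch, not a definitive implementation; the function name `min_max_normalize`, the sample distances, and the handling of the case where all distances are equal (which the specification does not address) are assumptions.

```python
def min_max_normalize(distances):
    """Map semantic distances to weights in [0, 1] via (d - d_min) / (d_max - d_min)."""
    d_min, d_max = min(distances), max(distances)
    if d_max == d_min:
        # Assumption: when all distances are equal the formula is undefined,
        # so every word segment receives the same weight of 0.0.
        return [0.0 for _ in distances]
    return [(d - d_min) / (d_max - d_min) for d in distances]


# Hypothetical semantic distances d_1..d_3 for three mask texts.
weights = min_max_normalize([1.0, 2.0, 4.0])
```

The smallest distance maps to a weight of 0 and the largest to a weight of 1, so the weights are directly comparable across the word segmentation information of one text.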
In other embodiments, the semantic distances may also be normalized using mean normalization. The
mean normalization formula is $z = (x - \mu) / \sigma$, where $z$ denotes the normalized data, $x$
denotes the raw data, $\mu$ denotes the mean of the raw data set, and $\sigma$ denotes the standard
deviation of the raw data set.
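A minimal sketch of this normalization is given below, assuming the standard-score form in which each value is centered on the mean and divided by the population standard deviation. The function name `mean_normalize` and the sample distances are hypothetical.

```python
import statistics


def mean_normalize(values):
    """Standard-score normalization: (x - mean) / population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("normalization is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical semantic distances.
z_values = mean_normalize([1.0, 2.0, 3.0])
```

Unlike min-max normalization, the values produced this way are centered on 0 and are not confined to the interval from 0 to 1, so a further mapping may be needed if weights in that interval are required.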
In other embodiments, besides determining the weights by normalizing the semantic distances, the
weights may also be obtained by scoring the semantic distances. Semantic distance scoring includes
classifying the obtained semantic distance data and presetting a corresponding weight threshold. A
semantic distance whose value exceeds the preset threshold is classified as a key object; the word
segmentation information masked in the corresponding mask vector has a relatively high similarity
with the original text.
Fig. 7 is an exemplary system module diagram for core information extraction according to some
embodiments of this specification.
As shown in Fig. 7, the system includes a text acquisition module 710, a word segmentation module
720, a weight determination module 730, and a core information determination module 740.
The text acquisition module 710 is configured to obtain text information.
The word segmentation module 720 is configured to perform word segmentation on the text information
to obtain one or more pieces of word segmentation information corresponding to the text information.
The weight determination module 730 is configured to determine one or more weights of the one or more
pieces of word segmentation information in the text information; the weights reflect the importance
of the one or more pieces of word segmentation information in the text information. In some
embodiments, the weight determination module 730 is further configured to: determine one or more mask
texts based on the one or more pieces of word segmentation information and the text information,
where each of the one or more mask texts has one piece of word segmentation information masked;
determine an original vector representation of the text information based on a first preset algorithm
and the text information; determine one or more mask vector representations of the one or more mask
texts based on the first preset algorithm and the one or more mask texts; and determine the one or
more weights according to the original vector representation and the one or more mask vector
representations. In some embodiments, the weight determination module is further configured to
determine one or more distances between the one or more mask vector representations and the original
vector representation, and to determine the one or more weights according to the one or more
distances. In some embodiments, the weight determination module is further configured to normalize
the one or more distances to determine the one or more weights.
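The operations of the weight determination module may be sketched as follows. This is a hedged illustration under stated assumptions, not the implementation of the embodiments: the `[MASK]` placeholder, the toy vectors, and the names `build_mask_texts`, `toy_encode`, and `determine_weights` are all hypothetical. In particular, `toy_encode` merely stands in for the first preset algorithm (a real embodiment would use a semantic encoder), and the Euclidean distance is used here only because it keeps the arithmetic easy to follow.

```python
import math

MASK_TOKEN = "[MASK]"  # assumption: placeholder that replaces the masked word segment


def build_mask_texts(tokens, mask_token=MASK_TOKEN):
    """Return one masked copy of `tokens` per position, each hiding exactly one token."""
    return [tokens[:i] + [mask_token] + tokens[i + 1:] for i in range(len(tokens))]


# Hypothetical toy vectors; unknown tokens (including the mask) map to zeros.
TOY_VECTORS = {
    "please": [0.0, 1.0],
    "reset": [3.0, 0.0],
    "password": [0.0, 5.0],
}


def toy_encode(tokens):
    """Toy stand-in for the first preset algorithm: sum the per-token vectors."""
    total = [0.0, 0.0]
    for token in tokens:
        vector = TOY_VECTORS.get(token, [0.0, 0.0])
        total = [a + b for a, b in zip(total, vector)]
    return total


def determine_weights(tokens, encode=toy_encode):
    """Distance between original and each mask vector, min-max normalized to weights."""
    original = encode(tokens)
    distances = [math.dist(original, encode(m)) for m in build_mask_texts(tokens)]
    d_min, d_max = min(distances), max(distances)
    if d_max == d_min:
        return [0.0] * len(tokens)  # assumption: equal distances give equal weights
    return [(d - d_min) / (d_max - d_min) for d in distances]


tokens = ["please", "reset", "password"]
mask_texts = build_mask_texts(tokens)
weights = determine_weights(tokens)
```

With these toy vectors, masking "password" moves the text vector the furthest from the original, so that word segment receives the largest weight, while masking "please" moves it the least.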
In some embodiments, the system further includes: a text embedding acquisition module configured to
obtain the text embedding, which includes a feature vector encoded from the text information; a
segment embedding acquisition module configured to obtain the segment embedding, which includes a
feature vector encoded from segment information; and a position embedding acquisition module
configured to obtain the position embedding, which includes a feature vector encoded from text
position information.
In some embodiments, the text embedding acquisition module is further configured to divide the one or
more pieces of word segmentation information in the text information into a limited set of common
sub-word units and encode them into feature vectors. In some embodiments, the segment embedding
acquisition module is further configured to encode the segment information of the text information
and the one or more pieces of word segmentation information into feature vectors. In some
embodiments, the position embedding acquisition module is further configured to encode the position
information of the one or more pieces of word segmentation information within the segmented sequence
of the text information into feature vectors.
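The three embeddings may be illustrated with the sketch below. It assumes, as in BERT-style encoders, that the text, segment, and position embeddings are looked up from tables and summed element-wise; this specification does not state how they are combined. The vocabulary, the table values, and the names `build_input_ids` and `embed` are hypothetical.

```python
def build_input_ids(tokens, vocab, segment_id=0):
    """Map tokens to vocabulary ids, a constant segment id, and 0-based position ids."""
    unknown_id = vocab["[UNK]"]
    token_ids = [vocab.get(token, unknown_id) for token in tokens]
    segment_ids = [segment_id] * len(tokens)
    position_ids = list(range(len(tokens)))
    return token_ids, segment_ids, position_ids


def embed(token_ids, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Look up the three embeddings for each token and sum them element-wise."""
    return [
        [t + s + p for t, s, p in zip(tok_table[ti], seg_table[si], pos_table[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]


# Hypothetical two-dimensional lookup tables.
vocab = {"[UNK]": 0, "reset": 1, "password": 2}
tok_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
seg_table = [[0.5, 0.5]]
pos_table = [[0.0, 0.0], [0.25, 0.25]]

ids = build_input_ids(["reset", "password"], vocab)
embeddings = embed(*ids, tok_table, seg_table, pos_table)
```

Each token thus receives a single feature vector that carries what the token is, which segment it belongs to, and where it sits in the sequence.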
The core information determination module 740 is configured to determine the core information of the
text information based at least on the one or more weights. In some embodiments, the core information
determination module is further configured to determine the core information of the text information
according to the one or more weights and a preset threshold. In some embodiments, the core
information determination module is further configured to perform word segmentation on the text
information based on a word segmentation model; the word segmentation model may include at least one
of the following: Jieba segmentation, an HMM segmentation model, a CRF segmentation model, or a deep
learning model.
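Determining the core information from the weights and a preset threshold may be sketched as follows. The function name `select_core_information`, the sample values, and the choice of treating a weight equal to the threshold as core information are assumptions made for illustration.

```python
def select_core_information(tokens, weights, threshold):
    """Keep the word segments whose weight reaches the preset threshold."""
    if len(tokens) != len(weights):
        raise ValueError("each word segment needs exactly one weight")
    # Assumption: a weight equal to the threshold counts as core information.
    return [token for token, weight in zip(tokens, weights) if weight >= threshold]


# Hypothetical word segments, weights, and threshold.
core = select_core_information(["please", "reset", "password"], [0.0, 0.5, 1.0], 0.5)
```

Raising the threshold yields fewer, more heavily weighted word segments; lowering it yields a broader set.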
In some embodiments, the system further includes a restricted vocabulary acquisition module
configured to obtain restricted vocabulary information. In some embodiments, the system further
includes a screening module configured to screen the one or more pieces of word segmentation
information based on the restricted vocabulary information; if a piece of word segmentation
information is included in the restricted vocabulary information, that word segmentation information
is excluded from the core information.
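The screening step may be sketched as follows. This is a minimal sketch; the function name `screen_segments` and the sample restricted vocabulary are hypothetical.

```python
def screen_segments(tokens, restricted_words):
    """Drop every word segment that appears in the restricted vocabulary."""
    restricted = set(restricted_words)  # set lookup keeps the membership test cheap
    return [token for token in tokens if token not in restricted]


# Hypothetical word segments and restricted vocabulary.
screened = screen_segments(["please", "reset", "the", "password"], ["please", "the"])
```

The order of the remaining word segments is preserved, so the result can be passed directly to the weight determination or core information determination steps.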
It should be understood that the system shown in Fig. 7 and its modules may be implemented in various
ways. For example, in some embodiments, the system and its modules may be implemented by hardware,
software, or a combination of software and hardware. The hardware part may be implemented using
dedicated logic; the software part may be stored in a memory and executed by an appropriate
instruction execution system, such as a microprocessor or specially designed hardware. Those skilled
in the art will understand that the above methods and systems may be implemented using
computer-executable instructions and/or processor control code, for example code provided on a
carrier medium such as a disk, CD, or DVD-ROM, in a programmable memory such as read-only memory
(firmware), or on a data carrier such as an optical or electronic signal carrier. The system and its
modules in one or more embodiments of this specification may be implemented not only by hardware
circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic
chips and transistors, or programmable hardware devices such as field-programmable gate arrays and
programmable logic devices, but also by software executed by various types of processors, or by a
combination of the above hardware circuits and software (for example, firmware).
It should be noted that the above description of the system and its modules is provided only for
convenience of description and does not limit this specification to the scope of the illustrated
embodiments. It will be understood that those skilled in the art, having understood the principle of
the system, may combine the modules arbitrarily, or form subsystems connected to other modules,
without departing from this principle. For example, in some embodiments, the text acquisition module
710, word segmentation module 720, weight determination module 730, and core information
determination module 740 disclosed in Fig. 7 may be different modules in one system, or a single
module may implement the functions of two or more of these modules. For example, the text acquisition
module 710 and the weight determination module 730 may be two modules, or a single module may provide
both the acquisition and the determination functions. As another example, the modules may share one
acquisition module, or each module may have its own acquisition module. Such variations are all
within the protection scope of this specification.
Based on the above core information extraction method, one or more embodiments of this specification
further provide a core information extraction apparatus. The apparatus includes at least one memory
and at least one processor; the at least one memory is configured to store computer instructions, and
the at least one processor is configured to execute at least some of the computer instructions to
implement the core information extraction method described in any of the above embodiments.
The core information extraction apparatus may be used to process the computer instructions in the
course of performing core information extraction. Specifically, the core information extraction
apparatus may store computer instructions and execute core information extraction operations.
The core information extraction apparatus of the embodiments of this specification may be applied to
a variety of business scenarios, including but not limited to artificial intelligence, data mining,
big data analysis, data crawling, public opinion monitoring, disaster monitoring, traffic monitoring,
hot-topic analysis, online customer service, question-answering robots, semantic analysis, speech
recognition, police intelligence, emergency calls for help, rescue and disaster relief, e-commerce
services, document retrieval, library management, machine translation, and document plagiarism
monitoring.
The embodiments in this specification are described in a progressive manner; for identical or similar
parts, the embodiments may be referred to one another, and each embodiment focuses on its differences
from the others. In particular, because the system and apparatus embodiments are substantially
similar to, or based on, the method embodiments, they are described relatively briefly; for relevant
details, refer to the corresponding description of the method embodiments.
The possible beneficial effects of the embodiments of this specification include, but are not limited
to: (1) this solution uses a masking method to replace original words in the text, so that the weight
of each word in the text can be obtained separately with high accuracy; (2) this solution uses the
BERT semantic encoding model, so that core information can be extracted from a single sentence with
high model accuracy; (3) this solution does not rely on labeled data, can improve core information
extraction for short texts and small data sets in different application scenarios, optimizes the
accuracy of core information extraction, and has strong applicability. It should be noted that
different embodiments may produce different beneficial effects; in different embodiments, the
beneficial effects may be any one or a combination of the above, or any other beneficial effects that
may be obtained.
The basic concepts have been described above. Obviously, to those skilled in the art, the above
detailed disclosure is only an example and does not constitute a limitation on this specification.
Although not explicitly stated here, those skilled in the art may make various modifications,
improvements, and corrections to one or more embodiments of this specification. Such modifications,
improvements, and corrections are suggested by one or more embodiments of this specification, so they
still fall within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, specific words are used in one or more embodiments of this specification to describe the
embodiments. For example, "one embodiment", "an embodiment", and/or "some embodiments" refer to a
certain feature, structure, or characteristic related to at least one embodiment of this
specification. Therefore, it should be emphasized and noted that "an embodiment", "one embodiment",
or "an alternative embodiment" mentioned two or more times in different places in this specification
does not necessarily refer to the same embodiment. In addition, certain features, structures, or
characteristics in one or more embodiments of this specification may be combined as appropriate.
In addition, those skilled in the art will understand that the aspects of this specification may be
illustrated and described through a number of patentable categories or situations, including any new
and useful process, machine, product, or composition of matter, or any new and useful improvement
thereof. Accordingly, the aspects of this specification may be executed entirely by hardware,
entirely by software (including firmware, resident software, microcode, and the like), or by a
combination of hardware and software. The above hardware or software may be referred to as a "data
block", "module", "engine", "unit", "component", or "system". In addition, aspects of this
specification may take the form of a computer product located in one or more computer-readable media,
the product including computer-readable program code.
A computer storage medium may include a propagated data signal containing computer program code, for
example in baseband or as part of a carrier wave. The propagated signal may take many forms,
including electromagnetic form, optical form, or a suitable combination thereof. A computer storage
medium may be any computer-readable medium other than a computer-readable storage medium that, by
being connected to an instruction execution system, apparatus, or device, can communicate, propagate,
or transmit a program for use. Program code located on a computer storage medium may be propagated
through any suitable medium, including radio, cable, fiber-optic cable, RF, or similar media, or any
combination of the above media.
The computer program code required for the operation of each part of this specification may be
written in any one or more programming languages, including object-oriented programming languages
such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional
procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and
ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages.
The program code may run entirely on the user's computer, as a stand-alone software package on the
user's computer, partly on the user's computer and partly on a remote computer, or entirely on a
remote computer or processing device. In the latter case, the remote computer may be connected to the
user's computer through any form of network, such as a local area network (LAN) or wide area network
(WAN), or connected to an external computer (for example, through the Internet), or used in a cloud
computing environment, or used as a service such as software as a service (SaaS).
In addition, unless explicitly stated in the claims, the order of the processing elements and
sequences described in this specification, the use of numbers and letters, and the use of other names
are not intended to limit the order of the processes and methods of this specification. Although the
above disclosure discusses, through various examples, some embodiments of the invention currently
considered useful, it should be understood that such details serve only an illustrative purpose, and
the appended claims are not limited to the disclosed embodiments; on the contrary, the claims are
intended to cover all modifications and equivalent combinations that conform to the spirit and scope
of the embodiments of this specification. For example, although the system components described above
may be implemented by hardware devices, they may also be implemented by software-only solutions, such
as installing the described system on an existing processing device or mobile device.
Similarly, it should be noted that, in order to simplify the presentation of the disclosure of this
specification and thereby aid the understanding of one or more embodiments of the invention, the
foregoing description of the embodiments sometimes groups multiple features into a single embodiment,
drawing, or description thereof. However, this method of disclosure does not mean that the subject
matter of this specification requires more features than are recited in the claims. In fact, the
features of an embodiment may be fewer than all the features of a single embodiment disclosed above.
Some embodiments use numbers describing quantities of components and attributes. It should be
understood that such numbers used in describing embodiments are, in some instances, modified by the
qualifiers "about", "approximately", or "substantially". Unless otherwise stated, "about",
"approximately", or "substantially" indicates that the stated number allows a variation of ±20%.
Accordingly, in some embodiments, the numerical parameters used in the specification and claims are
approximations that may change depending on the characteristics required by individual embodiments.
In some embodiments, the numerical parameters should take into account the specified significant
digits and use a general digit-retention method. Although the numerical ranges and parameters used to
confirm the breadth of the ranges in some embodiments of this specification are approximations, in
specific embodiments such values are set as precisely as is feasible.
Each patent, patent application, patent application publication, and other material cited in one or
more embodiments of this specification, such as articles, books, specifications, publications, and
documents, is hereby incorporated by reference in its entirety into the embodiments of this
specification. Excluded are application history documents that are inconsistent with or conflict with
the content of the embodiments of this specification, as well as documents (currently or later
attached to this specification) that limit the broadest scope of the claims of this specification. It
should be noted that if the description, definition, and/or use of terms in the materials attached to
this specification is inconsistent with or conflicts with the content described in this
specification, the description, definition, and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described in this specification are only
intended to illustrate the principles of the embodiments of this specification. Other variations may
also fall within the scope of this specification. Therefore, by way of example and not limitation,
alternative configurations of the embodiments of this specification may be regarded as consistent
with the teachings of this specification. Accordingly, the embodiments of this specification are not
limited to those explicitly introduced and described herein.