CN110414004A

CN110414004A - Method and system for core information extraction

Info

Publication number: CN110414004A
Application number: CN201910699583.4A
Authority: CN
Inventors: 杨明晖
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2019-11-05
Anticipated expiration: 2039-07-31
Also published as: CN110414004B

Abstract

The embodiments of this specification disclose a method and system for core information extraction. The method includes: obtaining text information; obtaining one or more tokens (segmented words) corresponding to the text information by performing word segmentation on the text information; determining one or more weights of the one or more tokens in the text information, where each weight reflects the importance of the corresponding token in the text information; and determining the core information of the text information based at least on the one or more weights.

Description

Method and system for core information extraction

Technical field

This specification relates to the field of artificial intelligence, and in particular to a method and system for core information extraction.

Background

With the development of the information society, the volume of data in every field is growing rapidly. Using artificial intelligence to automatically and accurately extract core information from large amounts of text is very important for many fields in the Internet era, such as information retrieval, data mining, and data processing. Extracting the core information of a text has therefore become an important technique in natural language processing.

Common core information extraction methods fall into two categories: unsupervised and supervised. Unsupervised, statistics-based methods work well on documents, chapter-length text, and big data, but struggle to accurately compute keywords for texts that contain little data. Supervised algorithms outperform unsupervised ones on short texts with little data. However, with the rapid development of the Internet and the growing complexity of user scenarios, the text scenarios of different enterprise users differ, text lengths vary, and the weight of the same word differs greatly from one scenario to another, so it is difficult to obtain high-quality labeled data for common supervised algorithms.

It is therefore desirable to have a reliable, improved method that does not depend on text length or labeled samples and can be adapted to core information extraction from text in a variety of scenarios.

Summary of the invention

One aspect of this specification provides a core information extraction method. The method includes: obtaining text information; obtaining one or more tokens (segmented words) corresponding to the text information by performing word segmentation on the text information; determining one or more weights of the one or more tokens in the text information, where each weight reflects the importance of the corresponding token in the text information; and determining the core information of the text information based at least on the one or more weights.

In some embodiments, determining the weights of the one or more tokens in the text information includes: determining one or more mask texts based on the one or more tokens and the text information, where at least one token is masked in each of the one or more mask texts; determining an original vector representation of the text information based on a first preset algorithm and the text information; determining one or more mask vector representations based on the first preset algorithm and the one or more mask texts; and determining the one or more weights according to the original vector representation and the one or more mask vector representations.

In some embodiments, determining the one or more weights according to the original vector representation and the one or more mask vector representations includes: determining one or more distances between the one or more mask vector representations and the original vector representation; and determining the one or more weights according to the one or more distances.

In some embodiments, each weight is positively correlated with its corresponding distance.

In some embodiments, the distance includes at least one of the following: cosine distance, Euclidean distance, or Manhattan distance.
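The three distance measures named above can be sketched in plain Python as follows. This is a minimal illustration only: the function names are hypothetical, and the vectors are assumed to be equal-length sequences of floats such as those produced by a text encoder.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the straight-line (L2) distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Under any of these measures, a larger distance between a mask vector representation and the original vector representation corresponds to a larger weight for the masked token.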

In some embodiments, determining the one or more weights according to the one or more distances includes: normalizing the one or more distances to determine the one or more weights.
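The normalization step can be illustrated with a short sketch. The specification does not say which normalization is used, so dividing each distance by the sum of all distances is an assumption made here for illustration; the function name is likewise hypothetical.

```python
from typing import List, Sequence


def normalize_distances(distances: Sequence[float]) -> List[float]:
    """Scale non-negative distances so that the resulting weights sum to 1."""
    if not distances:
        return []
    total = sum(distances)
    if total == 0:
        # Every mask left the representation unchanged: give equal weights.
        return [1.0 / len(distances)] * len(distances)
    return [d / total for d in distances]
```

Because each distance is divided by the same positive total, the ordering of the tokens is preserved and each weight stays positively correlated with its distance.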

In some embodiments, the first preset algorithm includes a BERT model.

In some embodiments, determining the core information of the text information based at least on the one or more weights includes: determining the core information of the text information according to the one or more weights and a preset threshold.

In some embodiments, the text information includes short text information.

In some embodiments, the method further includes: obtaining restricted vocabulary information; and screening the one or more tokens based on the restricted vocabulary information, where any token that is included in the restricted vocabulary information is excluded from the core information.

The embodiments of this specification further relate to a core word extraction system. The system includes: a text acquisition module configured to obtain text information; a word segmentation module configured to obtain one or more tokens corresponding to the text information by performing word segmentation on the text information; a weight determination module configured to determine one or more weights of the one or more tokens in the text information, where each weight reflects the importance of the corresponding token in the text information; and a core information determination module configured to determine the core information of the text information based at least on the one or more weights.

In some embodiments, the weight determination module is further configured to: determine one or more mask texts based on the one or more tokens and the text information, where at least one token is masked in each of the one or more mask texts; determine an original vector representation of the text information based on a first preset algorithm and the text information; determine one or more mask vector representations based on the first preset algorithm and the one or more mask texts; and determine the one or more weights according to the original vector representation and the one or more mask vector representations.

In some embodiments, the weight determination module is further configured to: determine one or more distances between the one or more mask vector representations and the original vector representation; and determine the one or more weights according to the one or more distances.

In some embodiments, the weight determination module is further configured to: normalize the one or more distances to determine the one or more weights.

In some embodiments, the first preset algorithm includes a BERT model.

In some embodiments, the core information determination module is further configured to: determine the core information of the text information according to the one or more weights and a preset threshold.

In some embodiments, the text information includes short text information.

In some embodiments, the system further includes: a restricted vocabulary acquisition module configured to obtain restricted vocabulary information; and a screening module configured to screen the one or more tokens based on the restricted vocabulary information, where any token that is included in the restricted vocabulary information is excluded from the core information.

The embodiments of this specification further relate to a core word extraction device. The device includes a processor and a memory; the memory is configured to store instructions, and the processor is configured to execute the instructions to implement the core word extraction method described above.

The embodiments of this specification further relate to a computer-readable storage medium. The storage medium stores computer instructions which, when executed by at least one processor, implement the core word extraction method described above.

Brief description of the drawings

This specification is further described by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not limiting; in these embodiments, the same reference numerals denote the same structures, where:

Fig. 1 is an exemplary flowchart of a core information extraction method according to some embodiments of this specification;

Fig. 2 is a sub-flowchart for determining token weights according to some embodiments of this specification;

Fig. 3 is a schematic diagram of exemplary mask texts according to some embodiments of this specification;

Fig. 4 is a schematic diagram of the BERT vector representation of the original text according to some embodiments of this specification;

Fig. 5 is a schematic diagram of the BERT vector representation of a mask text according to some embodiments of this specification;

Fig. 6 is a schematic diagram of determining weights according to some embodiments of this specification;

Fig. 7 is an exemplary system module diagram for the core information extraction method according to some embodiments of this specification.

Detailed description

To describe the technical solutions of the embodiments of this specification more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some examples or embodiments of this specification; a person of ordinary skill in the art can, without creative effort, apply this specification to other similar scenarios based on these drawings. Unless it is obvious from the context or otherwise stated, the same reference numerals in the figures denote the same structures or operations.

It should be understood that the terms "system", "device", "unit", and/or "module" used herein are a way of distinguishing different components, elements, parts, portions, or assemblies at different levels. However, these terms may be replaced by other expressions if those expressions achieve the same purpose.

As used in this specification and the claims, unless the context clearly indicates otherwise, words such as "a", "an", and/or "the" do not specifically denote the singular and may also include the plural. In general, the terms "include" and "comprise" only indicate that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also include other steps or elements.

Flowcharts are used in this specification to illustrate the operations performed by a system according to the embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.

One or more embodiments of this specification provide a core information extraction method. The basic inventive concept is that the core information of a text is usually the words that most affect the semantics of the text: when a text loses its core information, its semantics are affected the most.

One or more embodiments of this specification determine the weights of one or more words in the text information one by one using a masking method, and use the weights to judge how much each word affects the semantics of the text information, so that the core information of the text can be determined from the weights. Specifically, to determine how much a particular word in a piece of text information affects that text, the word can be masked with a special symbol; the semantic gap between the text with the word missing and the original text information then indicates the word's degree of influence on the text information.

One or more embodiments of this specification determine the weights by calculating the semantic distance between the masked text obtained after word segmentation and the original text. Based on the principle that texts with the same or similar semantics lie close together in feature vector space, a text vector obtained by converting word vectors can carry accurate semantic information, so converting the words in a text into a one-dimensional vector can serve as the semantic representation of the text. By encoding the original text information and each masked text into vector representations and calculating the distance between each mask vector representation and the original vector representation in vector space, the weight of each token in the text information can be determined, and the core information of the text information can finally be extracted.
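The overall idea (mask each token, encode, measure the distance to the original, normalize) can be sketched end to end as follows. This is only an illustration under stated assumptions: `toy_encode` and `TOY_VECTORS` are hypothetical stand-ins for a real sentence encoder such as BERT, the two-dimensional vectors are invented, and Euclidean distance with sum normalization is one choice among those the specification allows.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical 2-d token vectors, invented purely for illustration.
TOY_VECTORS = {
    "my": (0.1, 0.0),
    "dog": (0.9, 0.8),
    "is": (0.0, 0.1),
    "cute": (0.6, 0.7),
}


def toy_encode(tokens: Sequence[str]) -> List[float]:
    """Stand-in encoder: mean of per-token vectors; unknown tokens map to zero."""
    vectors = [TOY_VECTORS.get(tok, (0.0, 0.0)) for tok in tokens]
    return [sum(vec[dim] for vec in vectors) / len(vectors) for dim in range(2)]


def mask_distance_weights(
    tokens: Sequence[str],
    encode: Callable[[Sequence[str]], Sequence[float]],
    mask: str = "[M]",
) -> List[float]:
    """Return one weight per token, proportional to the shift caused by masking it."""
    if not tokens:
        return []
    original = encode(tokens)
    distances = []
    for i in range(len(tokens)):
        masked = list(tokens[:i]) + [mask] + list(tokens[i + 1:])
        distances.append(math.dist(original, encode(masked)))
    total = sum(distances)
    if total == 0:
        return [0.0] * len(tokens)
    return [d / total for d in distances]
```

With the toy vectors above, `mask_distance_weights(["my", "dog", "is", "cute"], toy_encode)` assigns the largest weight to "dog". A mean-of-vectors encoder ignores context, so a contextual encoder would be needed to reproduce the behavior the specification describes.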

It should be understood that the application scenarios of the system and method of this specification are only some examples or embodiments of this specification; a person of ordinary skill in the art can, without creative effort, apply this specification to other similar scenarios based on these drawings.

The core information extraction solution in one or more embodiments of this specification can be applied to a variety of business scenarios, including but not limited to artificial intelligence, data mining, big data analysis, data crawling, public opinion monitoring, disaster monitoring, traffic monitoring, hot-topic analysis, information recall, online customer service, question-answering robots, semantic analysis, speech recognition, police intelligence, emergency help requests, disaster rescue and relief, e-commerce services, literature search, library management, machine translation, and document plagiarism monitoring. Although this specification is described mainly in terms of text information, it should be noted that its principles can also be applied to semantic recognition in other business scenarios such as speech processing and image processing, for example speech semantic recognition, dialogue robots, and image-text risk monitoring.

Fig. 1 is an exemplary flowchart of a core information extraction method according to some embodiments of this specification. As shown in Fig. 1, the method includes the following steps:

Step 101: obtain text information.

In some embodiments, this step may be performed by the text acquisition module. In some embodiments, the text information may be obtained in at least one of the following ways: voice input, image recognition, posture input, manual input, client push, server transmission, database import, computer dataset import, or automatic acquisition by a computer. For example, in a public opinion monitoring scenario, a user enters text information into a news search engine by voice or manual input, and the text acquisition module then obtains the text information by monitoring back-end data in real time. In some embodiments, when the input data is audio, the text acquisition module may recognize the audio signal to obtain text information. In some embodiments, the input data may be an image, and the text acquisition module may obtain the text information in the image using text detection and recognition techniques (such as OCR or deep learning models). In some embodiments, the input data may also be posture input such as gestures, and the text acquisition module may match the posture to determine its corresponding text information. In some embodiments, the text information may include short text information from which core information is to be extracted. In some embodiments, short text information may take the form of voice input and/or text input in a customer service robot dialogue scenario or an intelligent knowledge base. In some embodiments, short text information may be a phrase, such as "a cat playing with a ball of yarn"; a simple sentence, such as "I have a cat."; or a complex sentence, such as "The kitten is playing with a ball of yarn and having great fun." In some embodiments, short text information may also include a combination of the above forms. In some embodiments, the text information may be stored in one or a combination of the following: a database, a memory, or another location with a storage function, and the text acquisition module obtains the text information from the storage location over a network.

Step 103: perform word segmentation on the text information to obtain one or more tokens corresponding to the text information.

In some embodiments, this step may be performed by the word segmentation module. Word segmentation of the text information can be understood as preprocessing that splits the text information into a sequence of one or more individual words. Its purpose is to make it easier to judge the influence of each split-out word on the complete text information it belongs to; based on that influence, it can be judged whether the word is a core word.

In some embodiments, word segmentation of the text information may be performed based on a preset algorithm, which in some embodiments may include a specific word segmentation model. Word segmentation methods include but are not limited to: string matching-based methods, understanding-based methods, and statistics-based methods. In some embodiments, the text information may be segmented by a word segmentation model. Word segmentation models include but are not limited to: N-gram models, hidden Markov models (HMM), maximum entropy models (ME), conditional random field models (CRF), and the JIEBA word segmentation model. The process of segmenting text information is described below taking the JIEBA word segmentation model as an example. One or more embodiments of this specification use the specific text information "my dog is cute, he likes playing" as an example to explain the technical solution; it should be understood that this example does not limit the protection scope of this specification in any way.

In some embodiments, the JIEBA word segmentation model is used to segment the text obtained in step 101. For example, segmenting the obtained text information s₀: "my dog is cute, he likes playing" with the JIEBA word segmentation model yields the token sequence w₁: "my", w₂: "dog", w₃: "is", w₄: "cute", w₅: "he", w₆: "likes", w₇: "play", w₈: "##ing".
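As an illustration of the string matching-based family of segmentation methods mentioned above, the sketch below uses greedy longest-match-first sub-word splitting. It is not the JIEBA model: the `TOY_VOCAB` vocabulary, the function names, and the "##" continuation prefix handling are assumptions chosen so that the output matches the example token sequence.

```python
import re
from typing import List, Set

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"my", "dog", "is", "cute", "he", "likes", "play", "##ing"}


def split_word(word: str, vocab: Set[str]) -> List[str]:
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Lowercase the text, drop punctuation, and split each word into pieces."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        tokens.extend(split_word(word, vocab))
    return tokens
```

Calling `tokenize("my dog is cute, he likes playing")` with this toy vocabulary returns the eight tokens listed above, with the comma dropped and "playing" split into "play" and "##ing".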

Step 105: determine one or more weights of the one or more tokens in the text information.

In some embodiments, this step may be performed by the weight determination module. The weight of a token in its text information reflects the importance or influence of that token in the text information; therefore, the weight of a token can be used to judge whether the token is a keyword or core word of the text it belongs to. For example, the weight determination module can determine the weight of each token of text information s₀ in that text. In the preceding step, segmenting text information s₀: "my dog is cute, he likes playing" yields the token sequence w₁: "my", w₂: "dog", w₃: "is", w₄: "cute", w₅: "he", w₆: "likes", w₇: "play", w₈: "##ing", which contains 8 tokens, w₁, w₂, ..., w₈. The weight determination module can determine the 8 corresponding weights of these tokens in text information s₀.

In some embodiments, methods for determining the weight of a token in the text information it belongs to include but are not limited to one or a combination of the following: determining token weights from statistical information about the tokens; determining token weights by computing the contextual relationships between tokens; determining the weights of different tokens by processing the text information with a supervised machine learning model; and determining token weights from the semantic distance between each token's mask text and the original text information.

In some embodiments, methods that determine token weights from statistical information include at least one of the following: the TF-IDF method (which computes how often a token occurs in the text information and how many texts contain the token), topic model methods, and the TextRank method (which iteratively updates scores according to word co-occurrence relationships).
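A minimal TF-IDF sketch is shown below for illustration. The function name and the smoothed inverse document frequency formula are assumptions (several TF-IDF variants exist), and the corpus in the usage note is invented.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf(doc_tokens: Sequence[str], corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Score each token of one document against a corpus of tokenized documents."""
    if not doc_tokens:
        return {}
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for token, count in counts.items():
        # Document frequency: number of corpus documents containing the token.
        doc_freq = sum(1 for doc in corpus if token in doc)
        # Smoothed IDF (an assumed variant) keeps the value finite and positive.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[token] = (count / len(doc_tokens)) * idf
    return scores
```

For example, with the corpus `[["my", "dog", "is", "cute"], ["my", "cat", "is", "small"], ["my", "bird", "is", "loud"]]`, scoring the first document gives "dog" a higher score than "my", because "my" appears in every document.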

In some embodiments, determining word weights by computing the contextual relationships between tokens includes: representing all words as a graph G = (V, E) according to their contextual relationships, where the vertices V represent the words and the edges E represent the contextual relationships between words; the score of a given vertex V_i can then be computed from the scores of its adjacent nodes. In some embodiments, determining token weights by computing the semantic distance between the vector representations of tokens includes: encoding each token into a vector representation, and computing the spatial distance between the vector representations to determine the semantic distance between tokens.
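The graph-based scoring described above can be sketched as a small TextRank-style iteration. The co-occurrence window, the damping factor of 0.85, the fixed iteration count, and the function names are assumptions for illustration; the specification does not fix these details.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set


def build_graph(tokens: Sequence[str], window: int = 2) -> Dict[str, Set[str]]:
    """Link each token to the distinct tokens that follow it within the window."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, token in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != token:
                neighbors[token].add(tokens[j])
                neighbors[tokens[j]].add(token)
    return neighbors


def text_rank(tokens: Sequence[str], window: int = 2,
              damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Iteratively update each vertex score from the scores of its neighbors."""
    neighbors = build_graph(tokens, window)
    scores = {vertex: 1.0 for vertex in neighbors}
    for _ in range(iterations):
        updated = {}
        for vertex in neighbors:
            # Each neighbor shares its score equally among its own neighbors.
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[vertex])
            updated[vertex] = (1 - damping) + damping * rank
        scores = updated
    return scores
```

A word that co-occurs with many different words accumulates score from all of them, so it ends up ranked above words that have a single neighbor.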

The method of determining token weights by computing the semantic distance between each token's mask text and the original text information is described in detail, with specific embodiments, elsewhere in this specification; see Fig. 2.

Step 107: determine the core information of the text information based at least on the one or more weights.

In some embodiments, this step may be performed by the core information determination module. The core information determination module may use one or more methods to determine the core information of the text information based on the weights of the different tokens. In some embodiments, the core information determination module may determine the core information of the text information, i.e. the core words, from the weighted tokens based on a preset threshold. In some embodiments, the preset threshold may include a weight threshold: a token whose weight is greater than the weight threshold is determined to be core information. In some embodiments, the preset threshold may also include a quantity threshold: for example, the tokens with the three highest weights are determined to be core information.

For example, the weights in text information s₀ of its 8 tokens w₁: "my", w₂: "dog", w₃: "is", w₄: "cute", w₅: "he", w₆: "likes", w₇: "play", w₈: "##ing" have been determined. In some embodiments, a weight threshold of 80% may be set first, and the tokens whose weight is greater than 80% are then selected as core information. In some embodiments, a quantity threshold of 3 may be set first, and the 3 tokens with the highest weights are then selected as core information. When determining core information according to the quantity threshold and/or weight threshold, the token weights may be sorted first, or the tokens may be screened directly against the preset threshold without sorting.
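The two threshold strategies can be sketched as follows. The function names are hypothetical, and the weights in the usage note are invented for illustration (scaled so that the largest weight is 1.0); they are not results from this specification.

```python
from typing import Dict, List


def select_by_weight(weights: Dict[str, float], threshold: float) -> List[str]:
    """Keep tokens whose weight exceeds the weight threshold; no sorting needed."""
    return [token for token, weight in weights.items() if weight > threshold]


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Keep the k tokens with the highest weights (quantity threshold)."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked[:k]]


# Invented example weights for the eight tokens of s0.
EXAMPLE_WEIGHTS = {
    "my": 0.10, "dog": 1.00, "is": 0.05, "cute": 0.85,
    "he": 0.20, "likes": 0.60, "play": 0.45, "##ing": 0.05,
}
```

With these invented weights, `select_by_weight(EXAMPLE_WEIGHTS, 0.8)` keeps "dog" and "cute", while `select_top_k(EXAMPLE_WEIGHTS, 3)` keeps "dog", "cute", and "likes".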

In some embodiments, the preset threshold may be set in advance, by means including but not limited to manual setting, computer setting, and program setting. The specific percentage or value of the weight threshold or quantity threshold may depend on the token weights determined in step 105.

It should be noted that, in some embodiments, the method may also include screening the tokens corresponding to the text information based on a screening condition. In some embodiments, the screening step may be performed by the screening module. For example, words that express negation or transition can strongly affect semantics but are usually not the core information of a sentence. Therefore, in some embodiments, the tokens may be further screened: if the tokens corresponding to the text information include such negation or transition words, those words can be excluded when screening for core words so that they are not selected. In some embodiments, such words may be excluded by adding a stop-word restriction to the screening process. In some implementations, words that strongly affect semantics but contain no entity information may be collected in advance into a vocabulary restriction table; during the screening step, each token is compared with the words in the table to determine whether the token needs to be excluded. The vocabulary restriction table includes one or more restricted words, which may include but are not limited to conjunctions expressing negation or transition, modal particles, adverbs, and the like. In some embodiments, the restricted vocabulary may be established by the restricted vocabulary acquisition module, or that module may obtain one or more restricted words from a pre-generated restricted vocabulary. In some embodiments, the screening step may be performed after the token weights are determined, or before. For example, after step 103 is performed, the tokens may be screened to exclude the restricted words among them. As another example, after step 107 is performed, the obtained core information may be screened to exclude the restricted words in it. In some embodiments, the screening condition may include one or more restricted words, or a vocabulary restriction combination composed of one or more restricted words; in some embodiments, such a combination may also be understood as the vocabulary restriction table.
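The screening step can be sketched as a simple set lookup. The contents of `RESTRICTED_WORDS` are hypothetical examples of negation, transition, and adverb entries; the specification does not list the actual table, and the function name is likewise an assumption.

```python
from typing import FrozenSet, List, Sequence

# Hypothetical vocabulary restriction table (negation, transition, adverbs).
RESTRICTED_WORDS: FrozenSet[str] = frozenset({"not", "but", "however", "very"})


def filter_restricted(tokens: Sequence[str],
                      restricted: FrozenSet[str] = RESTRICTED_WORDS) -> List[str]:
    """Drop every token that appears in the vocabulary restriction table."""
    return [token for token in tokens if token.lower() not in restricted]
```

The same function can be applied either to the token list produced by step 103 or to the core words produced by step 107, matching the two placements described above.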

It should be noted that the above description of the core information extraction method 100 is only for example and illustration and does not limit the scope of application of one or more embodiments of this specification. A person skilled in the art may make various modifications and changes to the core information extraction method 100 under the guidance of one or more embodiments of this specification. However, such modifications and changes remain within the scope of one or more embodiments of this specification.

Fig. 2 is a sub-flowchart for determining token weights according to some embodiments of this specification.

In some embodiments, the weights of one or more tokens may also be determined by a mask-distance method. Specifically, the weight of a token can be judged by determining the text distance between the token's mask text and the original text information. In some embodiments, the semantic distance between two texts can be determined from the distance between their vector representations. Based on the principle that texts with the same or similar semantics lie close together in feature vector space, a text vector obtained by converting word vectors can carry accurate semantic information, so the words in a text are converted into a one-dimensional vector that serves as the semantic representation of the text.

Some embodiments of determining the weights of one or more tokens in the text information are described in detail below with reference to Fig. 2. Process 200 includes:

Step 201: determine one or more mask texts based on the one or more tokens and the text information, where one token is masked in each of the one or more mask texts.

In some embodiments, this step may be performed by the weight determination module. In some embodiments, the mask texts corresponding to the one or more tokens may be determined as follows: a special symbol is used to cover, one at a time, each token in the text information whose weight or importance is to be judged, producing the mask text corresponding to that token. The special symbol makes the token in question missing from the text information, so that the semantic gap between the mask text lacking the token and the original text information can be judged, and from it the importance of the token in the text information. If the semantics change greatly relative to the original text information when a token is missing, the token is very important to the text information. In some embodiments, the special symbol includes but is not limited to one or a combination of characters, character strings, letters, and digits, as long as the symbol can indicate that the corresponding token is missing. In some embodiments, [MASK] or [M] may be used as the special symbol to cover, one at a time, the one or more tokens obtained in step 103. How to determine the one or more mask texts is explained below with the specific example shown in Fig. 3.

Fig. 3 is a schematic diagram of exemplary mask texts according to some embodiments of this specification. In some embodiments, the symbol [M] may be used as the special symbol to cover, one at a time, the 8 tokens of text information s₀: "my dog is cute, he likes play ##ing", namely w₁: "my", w₂: "dog", w₃: "is", w₄: "cute", w₅: "he", w₆: "likes", w₇: "play", w₈: "##ing", yielding the 8 mask texts corresponding to the 8 tokens: s₁: "[M] dog is cute he likes play ##ing", s₂: "my [M] is cute he likes play ##ing", s₃: "my dog [M] cute he likes play ##ing", s₄: "my dog is [M] he likes play ##ing", s₅: "my dog is cute [M] likes play ##ing", s₆: "my dog is cute he [M] play ##ing", s₇: "my dog is cute he likes [M] ##ing", s₈: "my dog is cute he likes play [M]".
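Generating the mask texts s₁ to s₈ can be sketched as follows. The function name `build_mask_texts` is hypothetical; the "[M]" symbol and the example tokens come from the description above.

```python
from typing import List, Sequence


def build_mask_texts(tokens: Sequence[str], mask: str = "[M]") -> List[List[str]]:
    """Return one copy of the token sequence per position, with that token masked."""
    mask_texts = []
    for index in range(len(tokens)):
        masked = list(tokens)
        masked[index] = mask  # cover exactly one token in this copy
        mask_texts.append(masked)
    return mask_texts


TOKENS = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
MASK_TEXTS = build_mask_texts(TOKENS)
```

Here `" ".join(MASK_TEXTS[0])` reproduces s₁ and `" ".join(MASK_TEXTS[7])` reproduces s₈.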

In some embodiments, multiple word segments of the text information (for example, a combination of word segments) may also be masked simultaneously, yielding a masked text corresponding to the simultaneously masked word segments. In some embodiments, the semantic importance of those masked word segments in the original text information can be obtained from the semantic distance between the masked text and the original text information: the larger the distance, the greater the semantic importance; the smaller the distance, the lower the semantic importance.

In some embodiments, the semantic distance between two texts may be determined from the semantic distance between their corresponding vector representations. The following steps therefore determine the vector representations corresponding to the text information and to the masked texts.

Step 203: determine the original vector representation of the text information based on a first preset algorithm and the text information.

In some embodiments, this step may be performed by the weight determination module. In some embodiments, the original vector representation of the text information may be determined by encoding the text information with one or more algorithms to obtain the corresponding vector representation. In some embodiments, the first preset algorithm includes, but is not limited to, one or a combination of: RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), Transformer, GPB, and BERT (Bidirectional Encoder Representations from Transformers).

The vector representation algorithm is illustrated below using the BERT model as the first preset algorithm.

When the BERT model is used to convert text information into the corresponding vector representation, the input of the BERT model includes the text embedding, the segment embedding, and the position embedding corresponding to the text information to be converted. The text embedding comprises feature vectors encoded from the text information; words can be split into a limited set of common sub-word units, which strikes a balance between the effectiveness of whole words and the flexibility of characters. The segment embedding comprises feature vectors encoded from segment information; it can be used to distinguish sentences of different contexts in the text by encoding them into different feature vectors. For example, if the text sequences "[CLS] my dog is cute [SEP]" and "he likes play ##ing [SEP]" belong to different clauses, the segment information corresponding to the text sequence can be expressed as "AAAAAABBBBB". The position embedding comprises feature vectors encoded from text position information; it encodes the position of each word into a feature vector. For example, for the text information s₀ "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]", the items of the segmented sequence are numbered 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; 'dog' is the 3rd item of the segmented sequence and is numbered 2. How the original text s₀ is converted into the corresponding vector representation v₀ by the BERT model is described below with reference to Fig. 4 and a specific example.

Fig. 4 is a schematic diagram of converting text information into the corresponding vector representation based on the BERT model according to some embodiments of this specification. As shown in Fig. 4, the BERT model is used to convert the text information s₀: "my dog is cute, he likes play##ing" into the corresponding original vector representation v₀. As shown in the figure, the text embedding corresponding to the text information s₀ is: [CLS] my dog is cute [SEP] he likes play ##ing [SEP]; the corresponding segment embedding is: AAAAAABBBBB; and the corresponding position embedding is: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The sum of the text embedding, segment embedding, and position embedding corresponding to the text information s₀ serves as the input of the BERT model; after processing by the BERT model, the original vector representation v₀ of the text information s₀ is obtained.
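To make the three inputs of Fig. 4 concrete, the following sketch builds the token sequence, the segment labels, and the position numbers for the example text. It only constructs the index sequences; in an actual BERT model each of these is looked up in a learned embedding table and the three resulting vectors are summed per position. The function name `build_bert_inputs` is hypothetical.

```python
from typing import List, Tuple


def build_bert_inputs(
    sentence_a: List[str], sentence_b: List[str]
) -> Tuple[List[str], List[str], List[int]]:
    """Build token, segment, and position sequences for a two-clause input."""
    # Token sequence: [CLS] opens the input, [SEP] closes each clause
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment labels: clause A covers [CLS] through the first [SEP]
    segments = ["A"] * (len(sentence_a) + 2) + ["B"] * (len(sentence_b) + 1)
    # Position numbers start at 0
    positions = list(range(len(tokens)))
    return tokens, segments, positions


tokens, segments, positions = build_bert_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
print(" ".join(tokens))
print("".join(segments))
print(positions)
```

For this example the segment string comes out as "AAAAAABBBBB" and 'dog' sits at position 2, matching the numbering in the description.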

Step 205: determine one or more mask vector representations based on the first preset algorithm and the one or more masked texts.

In some embodiments, this step may be performed by the weight determination module. In some embodiments, the corresponding mask vector representation is determined from each masked text using the same algorithm as in the preceding step. How the masked text s₁ is converted into the corresponding vector representation v₁ by the BERT model is described below with reference to Fig. 5 and a specific example.

Fig. 5 is a schematic diagram of converting a masked text into the corresponding vector representation based on the BERT model according to some embodiments of this specification. As shown in Fig. 5, the BERT model is used to convert the masked text s₁: "[M] dog is cute, he likes play##ing" into the corresponding mask vector representation v₁. As shown in the figure, the text embedding corresponding to the masked text s₁ is: [CLS] [M] dog is cute [SEP] he likes play ##ing [SEP]; the corresponding segment embedding is: AAAAAABBBBB; and the corresponding position embedding is: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The sum of the text embedding, segment embedding, and position embedding corresponding to the masked text s₁ serves as the input of the BERT model; after processing by the BERT model, the mask vector representation v₁ of the masked text s₁ is obtained.

The other masked texts s₂, ..., s₈ are converted by the BERT model into the corresponding vector representations v₂, ..., v₈ in a similar manner, which is not repeated here.

Step 207: determine the one or more weights according to the original vector representation and the one or more mask vector representations. In some embodiments, this step may be performed by the weight determination module.

Fig. 6 is a schematic diagram of determining weights according to some embodiments of this specification. As shown in Fig. 6:

The weight determination module determines one or more distances between the one or more mask vector representations and the original vector representation, and determines the one or more weights according to the one or more distances. In some embodiments, each weight is positively correlated with its corresponding distance. Specifically, the larger the distance between a mask vector representation and the original vector representation, the greater the semantic influence of the masked word segment on the text information, and the larger the weight of that word segment.

After the vector representations corresponding to the text information and to the multiple masked texts have been determined, the semantic distance between each mask vector representation and the original vector representation can be determined from those vector representations, and the importance or influence of the word segment corresponding to each mask vector representation on the text information can then be judged from the semantic distance.

By determining the spatial distance between the one or more mask vector representations and the original vector representation, the weight determination module determines the semantic distance between the text information and each of the masked texts. From this it can determine how much influence the word segment covered by the special symbol in each masked text has, then determine the weight of that word segment, and finally determine the weights of the one or more word segments corresponding to the one or more mask vectors.
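Steps 201 to 207 can be tied together in a short sketch. The encoder is passed in as a callable so that any first preset algorithm could be plugged in; the `toy_encode` function and its `TOY_EMBEDDINGS` table below are made-up stand-ins for a real model such as BERT, used only so the example runs on its own. Euclidean distance (`math.dist`) is used here as one of the permitted distance measures.

```python
import math
from typing import Callable, Dict, List

Encoder = Callable[[List[str]], List[float]]


def masked_token_distances(
    tokens: List[str], encode: Encoder, mask: str = "[M]"
) -> List[float]:
    """Distance between the original vector and each single-mask vector."""
    v0 = encode(tokens)                               # step 203: original vector
    distances = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask] + tokens[i + 1:]  # step 201: masked text s_i
        v_i = encode(masked)                           # step 205: mask vector v_i
        distances.append(math.dist(v0, v_i))           # step 207: distance d_i
    return distances


# Hypothetical 2-dimensional embeddings; a real system would use a trained encoder.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "my": [0.1, 0.0], "dog": [1.0, 0.2], "is": [0.0, 0.1], "cute": [0.3, 1.0],
}


def toy_encode(tokens: List[str]) -> List[float]:
    """Sum the toy embeddings; unknown symbols such as [M] contribute nothing."""
    vec = [0.0, 0.0]
    for tok in tokens:
        emb = TOY_EMBEDDINGS.get(tok, [0.0, 0.0])
        vec = [vec[0] + emb[0], vec[1] + emb[1]]
    return vec


distances = masked_token_distances(["my", "dog", "is", "cute"], toy_encode)
print(distances)
```

With these toy embeddings, masking "dog" or "cute" moves the text vector much further than masking "my" or "is", which is the behaviour the method relies on when it treats a larger distance as a larger weight.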

In some embodiments, the semantic distances between the multiple mask vector representations and the original vector representation may also be normalized, yielding the weight of each mask vector representation relative to the original vector. This step may be performed by the weight determination module.

In some embodiments, methods for calculating the semantic distance between a mask vector and the original vector include, but are not limited to, calculation methods based on word frequency statistics, calculation methods based on ontology, and calculation methods based on a geometric metric space. Calculation methods based on word frequency statistics include, but are not limited to, methods based on overlapping words, TF-IDF (Term Frequency-Inverse Document Frequency), and its various weighting algorithms (such as LSA, HAL, and Islam). Ontology-based distance calculation methods include, but are not limited to, calculation methods based on ontology library edge distance, calculation methods based on ontology library nodes, and hybrid calculation methods. Calculation methods based on a geometric metric space include, but are not limited to, Euclidean distance, cosine distance, and Manhattan distance.
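The three geometric distances named above can be written down directly. The sketch below uses their standard textbook definitions; the function names are chosen here for illustration, and the cosine distance is taken as one minus the cosine similarity, which is an assumption about the intended definition.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.dist(a, b)


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Cosine distance ignores vector length and compares direction only, whereas Euclidean and Manhattan distance are sensitive to magnitude; which is preferable depends on how the encoder's output vectors are scaled.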

In some embodiments, when the cosine distance method is used to calculate the distance between a mask vector and the original vector, reference may be made to formula (1), the calculation formula of the cosine distance method, where v₀ denotes the original vector, v_i denotes the mask vector of the i-th masked text, and d_i denotes the semantic distance between the i-th mask vector and the original vector.
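The formula itself is not reproduced in this text. For orientation, the commonly used definition of cosine distance, written with the symbols defined above, is given here; it is offered as an assumption about the intended form, not as a verbatim copy of formula (1).

```latex
d_i = 1 - \frac{v_0 \cdot v_i}{\lVert v_0 \rVert \, \lVert v_i \rVert}
```

Under this definition d_i is 0 when the masked text's vector points in the same direction as the original vector and grows as the two directions diverge, which is consistent with a larger distance indicating a more important word segment.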

In some embodiments, the above semantic distances may be further normalized to obtain corresponding values between 0 and 1, that is, the corresponding weights. Common normalization methods include, but are not limited to, min-max normalization and/or mean normalization.

In some embodiments, when min-max normalization is applied to the semantic distances, reference may be made to formula (2), the min-max normalization formula:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \qquad (2)$$

where X_norm denotes the normalized data, X denotes the original data, and X_max and X_min denote the maximum and minimum values of the original data set, respectively.

For example, applying min-max normalization to the semantic distance d_i between the i-th mask vector and the original vector gives a weight equal to $(d_i - d_{min}) / (d_{max} - d_{min})$, where d_min and d_max denote the smallest and largest of the semantic distances. As another example, if the semantic distance between the vector representation v₁ of the masked text formed by masking the 1st word segment of the original text s₀ and the original vector representation v₀ is d₁, then after min-max normalization weight 1 is $(d_1 - d_{min}) / (d_{max} - d_{min})$, which represents the weight of the 1st word segment in the original text.
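Min-max normalization of a list of semantic distances can be sketched as follows. The function name and the handling of the degenerate case in which all distances are equal are assumptions; the description does not say what should happen then.

```python
from typing import List, Sequence


def min_max_normalize(distances: Sequence[float]) -> List[float]:
    """Map distances linearly onto [0, 1]: smallest -> 0, largest -> 1."""
    d_min, d_max = min(distances), max(distances)
    if d_max == d_min:
        # Assumption: when every distance is identical no segment stands out,
        # so all weights are set to 0 rather than dividing by zero.
        return [0.0 for _ in distances]
    return [(d - d_min) / (d_max - d_min) for d in distances]


weights = min_max_normalize([1.0, 2.0, 5.0])
print(weights)
```

Note that the word segment with the smallest distance always receives weight 0 and the one with the largest distance always receives weight 1, so the resulting weights are relative to the other word segments of the same text.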

In other embodiments, the semantic distances may also be normalized using the mean normalization method. The mean normalization formula is:

$$z = \frac{x - \mu}{\sigma}$$

where z denotes the normalized data, x denotes the original data, μ denotes the mean of the original data set, and σ denotes the standard deviation of the original data set (the square root of its variance).
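A minimal sketch of mean normalization follows, assuming the usual z-score form in which each value is centered on the mean and scaled by the population standard deviation. The function name and the zero-spread fallback are assumptions.

```python
import statistics
from typing import List, Sequence


def mean_normalize(values: Sequence[float]) -> List[float]:
    """Center on the mean and scale by the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Assumption: identical inputs carry no ranking information.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


z = mean_normalize([1.0, 2.0, 3.0])
print(z)
```

Unlike min-max normalization, the values produced this way are not confined to the interval from 0 to 1 and can be negative, so a further mapping may be needed if weights in that interval are required.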

In other embodiments, in addition to determining weights by normalizing the semantic distances, weights may also be obtained by scoring the semantic distances. Scoring the semantic distances includes classifying the acquired semantic distance data and presetting corresponding weight thresholds. A semantic distance whose value exceeds the preset threshold is classified as a key object: the masked word segment corresponding to that mask vector has a relatively high degree of relevance to the original text.
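The threshold-based scoring described above can be illustrated as follows. The distance values and the threshold are made up for the example, and `key_segments` is a hypothetical helper name.

```python
from typing import List, Sequence


def key_segments(
    tokens: Sequence[str], distances: Sequence[float], threshold: float
) -> List[str]:
    """Return the word segments whose semantic distance exceeds the threshold."""
    return [tok for tok, d in zip(tokens, distances) if d > threshold]


tokens = ["my", "dog", "is", "cute"]
distances = [0.01, 0.40, 0.02, 0.35]  # illustrative values only
selected = key_segments(tokens, distances, threshold=0.1)
print(selected)
```

Because the threshold is applied to raw distances, it has to be tuned for the encoder and distance measure in use; the same value would not carry over from cosine distance to Euclidean distance, for instance.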

Fig. 7 is an exemplary module diagram of a core information extraction system according to some embodiments of this specification.

As shown in Fig. 7, the system includes a text acquisition module 710, a word segmentation module 720, a weight determination module 730, and a core information determination module 740.

The text acquisition module 710 is configured to acquire text information.

The word segmentation module 720 is configured to obtain one or more word segments corresponding to the text information by performing word segmentation on the text information.

The weight determination module 730 is configured to determine one or more weights of the one or more word segments in the text information; the weights reflect the importance of the one or more word segments in the text information. In some embodiments, the weight determination module 730 is further configured to: determine one or more masked texts based on the one or more word segments and the text information, where one word segment is covered in each of the one or more masked texts; determine the original vector representation of the text information based on the first preset algorithm and the text information; determine one or more mask vector representations of the one or more masked texts based on the first preset algorithm and the one or more masked texts; and determine the one or more weights according to the original vector representation and the one or more mask vector representations. In some embodiments, the weight determination module is further configured to determine one or more distances between the one or more mask vector representations and the original vector representation, and to determine the one or more weights according to the one or more distances. In some embodiments, the weight determination module is further configured to normalize the one or more distances to determine the one or more weights.

In some embodiments, the system further includes: a text embedding acquisition module, configured to acquire the text embedding, which comprises feature vectors encoded from the text information; a segment embedding acquisition module, configured to acquire the segment embedding, which comprises feature vectors encoded from segment information; and a position embedding acquisition module, configured to acquire the position embedding, which comprises feature vectors encoded from text position information.

In some embodiments, the text embedding acquisition module is further configured to split the one or more word segments in the text information into a limited set of common sub-word units and encode them into feature vectors. In some embodiments, the segment embedding acquisition module is further configured to encode the segment information of the text information and the one or more word segments into feature vectors. In some embodiments, the position embedding acquisition module is further configured to encode the position information of the one or more word segments within the segmented sequence of the text information into feature vectors.

The core information determination module 740 is configured to determine the core information of the text information based at least on the one or more weights. In some embodiments, the core information determination module is further configured to determine the core information of the text information according to the one or more weights and a preset threshold. In some embodiments, the core information determination module is further configured to perform word segmentation on the text information based on a word segmentation model; the word segmentation model may include at least one of the following: Jieba word segmentation, an HMM word segmentation model, a CRF word segmentation model, and a deep learning model.

In some embodiments, the system further includes a restricted vocabulary acquisition module, configured to acquire restricted vocabulary information. In some embodiments, the system further includes a screening module, configured to screen the one or more word segments based on the restricted vocabulary information: if a word segment is contained in the restricted vocabulary information, that word segment is excluded from the core information.
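The screening step and the threshold-based selection of core information can be combined in a short sketch. The weights, the restricted vocabulary, and the threshold are illustrative values; the function name and the use of "greater than or equal to" for the threshold comparison are assumptions, since the description does not fix the comparison.

```python
from typing import Dict, List, Set


def extract_core_information(
    weights: Dict[str, float], restricted: Set[str], threshold: float
) -> List[str]:
    """Drop restricted word segments, then keep those meeting the threshold."""
    # Screening: exclude any word segment found in the restricted vocabulary
    kept = {tok: w for tok, w in weights.items() if tok not in restricted}
    # Selection: order by weight, highest first, and apply the preset threshold
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [tok for tok, w in ranked if w >= threshold]


weights = {"my": 0.05, "dog": 1.0, "is": 0.0, "cute": 0.9, "he": 0.1}
restricted = {"my", "is", "he"}  # hypothetical restricted vocabulary
core = extract_core_information(weights, restricted, threshold=0.5)
print(core)
```

Screening before thresholding means a restricted word can never appear in the core information, regardless of how large its weight is.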

It should be understood that the system shown in Fig. 7 and its modules may be implemented in various ways. For example, in some embodiments, the system and its modules may be implemented by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the above methods and systems may be implemented using computer-executable instructions and/or processor control code, for example code provided on a carrier medium such as a magnetic disk, CD, or DVD-ROM, on a programmable memory such as a read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The system and its modules in one or more embodiments of this specification may be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; they may also be implemented by software executed by various types of processors, or by a combination of the above hardware circuits and software (for example, firmware).

It should be noted that the above description of the system and its modules is provided only for convenience of description and does not limit this specification to the scope of the illustrated embodiments. It will be understood that, after understanding the principle of the system, those skilled in the art may combine the modules arbitrarily, or form a subsystem connected to other modules, without departing from this principle. For example, in some embodiments, the text acquisition module 710, the word segmentation module 720, the weight determination module 730, and the core information determination module 740 disclosed in Fig. 7 may be different modules in one system, or a single module may implement the functions of two or more of these modules. For example, the text acquisition module 710 and the weight determination module 730 may be two modules, or a single module may have both the acquisition function and the determination function. For example, the modules may share one acquisition module, or each module may have its own acquisition module. Such variations are all within the protection scope of this specification.

Based on the above core information extraction method, one or more embodiments of this specification further provide a core information extraction apparatus. The apparatus includes at least one processor and at least one memory; the at least one memory is configured to store computer instructions, and the at least one processor is configured to execute at least some of the computer instructions to implement the core information extraction method described in any of the above embodiments.

The core information extraction apparatus may be used to process the computer instructions involved in carrying out core information extraction. Specifically, the core information extraction apparatus may store computer instructions and perform core information extraction operations.

The core information extraction apparatus of the embodiments of this specification can be applied to a variety of business scenarios, including but not limited to artificial intelligence, data mining, big data analysis, data crawling, public opinion monitoring, disaster monitoring, traffic monitoring, hot topic analysis, online customer service, question-answering robots, semantic analysis, speech recognition, police intelligence, emergency appeals, rescue and disaster relief, e-commerce services, document retrieval, library management, machine translation, and document plagiarism monitoring.

The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system and apparatus embodiments are substantially similar to, or based on, the method embodiments, they are described relatively briefly; for the relevant parts, refer to the corresponding description of the method embodiments.

The possible beneficial effects of the embodiments of this specification include, but are not limited to: (1) this solution replaces the original words of the text by masking, so the weight of each word in the text can be obtained separately, with high accuracy; (2) this solution uses the BERT semantic encoding model, so core information can be extracted from a single sentence, with high model accuracy; (3) this solution does not depend on labeled data, so it can improve core information extraction for short texts and small data sets in different application scenarios, optimizing the accuracy of core information extraction and offering strong applicability. It should be noted that different embodiments may produce different beneficial effects; in different embodiments, the beneficial effects may be any one or a combination of the above, or any other beneficial effect that may be obtained.

The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is merely an example and does not constitute a limitation on this specification. Although not explicitly stated here, those skilled in the art may make various modifications, improvements, and corrections to one or more embodiments of this specification. Such modifications, improvements, and corrections are suggested by one or more embodiments of this specification, and therefore still fall within the spirit and scope of the exemplary embodiments of this specification.

Meanwhile, specific terms are used in one or more embodiments of this specification to describe the embodiments. For example, "one embodiment", "an embodiment", and/or "some embodiments" refer to a certain feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that "an embodiment", "one embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics in one or more embodiments of this specification may be combined as appropriate.

In addition, those skilled in the art will understand that the aspects of this specification may be illustrated and described in terms of several patentable categories or situations, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, the aspects of this specification may be implemented entirely in hardware, entirely in software (including firmware, resident software, microcode, and the like), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system". In addition, aspects of this specification may take the form of a computer product located in one or more computer-readable media, the product including computer-readable program code.

A computer storage medium may contain a propagated data signal carrying computer program code, for example in baseband or as part of a carrier wave. The propagated signal may take many forms, including electromagnetic form, optical form, or any suitable combination thereof. A computer storage medium may be any computer-readable medium other than a computer-readable storage medium that can communicate, propagate, or transmit a program for use by being connected to an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be transmitted via any suitable medium, including radio, cable, fiber-optic cable, RF, or similar media, or any combination of the above.

The computer program code required for the operation of each part of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, run as a stand-alone software package on the user's computer, run partly on the user's computer and partly on a remote computer, or run entirely on a remote computer or processing device. In the latter case, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, through the Internet), or used in a cloud computing environment, or used as a service such as software as a service (SaaS).

In addition, unless explicitly stated in the claims, the order of the processing elements and sequences described in this specification, the use of numbers and letters, and the use of other names are not intended to limit the order of the processes and methods of this specification. Although the above disclosure discusses, through various examples, some embodiments of the invention currently considered useful, it should be understood that such details are for illustrative purposes only; the appended claims are not limited to the disclosed embodiments but, on the contrary, are intended to cover all modifications and equivalent combinations that conform to the spirit and scope of the embodiments of this specification. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing processing device or mobile device.

Similarly, it should be noted that, to simplify the presentation of this specification and thereby aid the understanding of one or more embodiments of the invention, the foregoing description of the embodiments sometimes merges multiple features into a single embodiment, drawing, or description thereof. This manner of disclosure does not, however, mean that the subject matter of this specification requires more features than are recited in the claims. In fact, an embodiment may have fewer features than all the features of a single embodiment disclosed above.

Some embodiments use numbers describing quantities of components and attributes. It should be understood that such numbers used to describe the embodiments are, in some examples, modified by the qualifiers "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the description and claims are approximations, which may change depending on the characteristics required by individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and adopt a general digit-retention method. Although the numerical ranges and parameters used to establish the breadth of the ranges in some embodiments of this specification are approximations, in specific embodiments such values are set as precisely as practicable.

Each patent, patent application, patent application publication, and other material, such as articles, books, specifications, publications, and documents, cited in one or more embodiments of this specification is hereby incorporated into the embodiments of this specification by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of the embodiments of this specification, as well as documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It should be noted that if the description, definition, and/or use of a term in the materials attached to this specification is inconsistent with or conflicts with the content described in this specification, the description, definition, and/or use of the term in this specification shall prevail.

Finally, it should be understood that the embodiments described in this specification are only intended to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, by way of example and not limitation, alternative configurations of the embodiments of this specification may be regarded as consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to those explicitly introduced and described herein.

Claims

1. A core information extraction method, the method being performed by at least one processor, characterized in that the method comprises:

acquiring text information;

obtaining one or more word segments corresponding to the text information based on word segmentation of the text information;

determining one or more masked texts based on the one or more word segments and the text information, wherein at least one word segment is covered in each of the one or more masked texts;

determining an original vector representation of the text information based on a first preset algorithm and the text information;

determining one or more mask vector representations based on the first preset algorithm and the one or more masked texts;

determining one or more weights according to the original vector representation and the one or more mask vector representations, wherein each weight reflects the importance of its corresponding word segment in the text information;

determining core information of the text information based at least on the one or more weights.

2. The method according to claim 1, characterized in that determining the one or more weights according to the original vector representation and the one or more mask vector representations comprises:

determining one or more distances between the one or more mask vector representations and the original vector representation;

determining the one or more weights according to the one or more distances.

3. The method according to claim 2, characterized in that each weight is positively correlated with its corresponding distance.

4. The method according to claim 2, characterized in that the distance comprises at least one of the following: cosine distance, Euclidean distance, or Manhattan distance.

5. The method according to claim 2, characterized in that determining the one or more weights according to the one or more distances further comprises: normalizing the one or more distances to determine the one or more weights.

6. The method according to claim 1, characterized in that the first preset algorithm comprises a BERT model.

7. The method according to claim 1, characterized in that determining the core information of the text information based at least on the one or more weights comprises:

determining the core information of the text information according to the one or more weights and a preset threshold.

8. The method according to claim 1, characterized in that the text information comprises short text information.

9. The method according to claim 1, characterized in that the method further comprises:

acquiring restricted vocabulary information;

screening the one or more word segments based on the restricted vocabulary information, wherein if a word segment is contained in the restricted vocabulary information, that word segment is excluded from the core information.

10. A core word extraction system, characterized in that the system comprises:

A text acquisition module, configured to acquire text information;

A word segmentation module, configured to perform word segmentation on the text information to obtain one or more pieces of word-segment information corresponding to the text information;

A weight determination module, configured to:

and determine the one or more weights according to the original vector representation and the one or more mask vector representations, wherein each weight reflects the importance of its corresponding word-segment information in the text information;

A core information determination module, configured to determine the core information of the text information based at least on the one or more weights.

11. The system according to claim 10, characterized in that the weight determination module is further configured to:

Determine the one or more weights according to the one or more distances.

12. The system according to claim 11, characterized in that each weight is positively correlated with its corresponding distance.

13. The system according to claim 11, characterized in that the distance comprises at least one of the following: cosine distance, Euclidean distance, or Manhattan distance.

14. The system according to claim 11, characterized in that the weight determination module is further configured to: normalize the one or more distances to determine the one or more weights.

15. The system according to claim 10, characterized in that the first preset algorithm comprises a BERT model.

16. The system according to claim 10, characterized in that the core information determination module is further configured to: determine the core information of the text information according to the one or more weights and a preset threshold.

17. The system according to claim 10, characterized in that the text information comprises short text information.

18. The system according to claim 10, characterized in that the system further comprises:

A restricted vocabulary acquisition module, configured to obtain restricted vocabulary information;

A screening module, configured to screen the one or more pieces of word-segment information based on the restricted vocabulary information, and, if a piece of word-segment information is contained in the restricted vocabulary information, to exclude that word-segment information from the core information.

19. A core word extraction apparatus, comprising a processor and a memory, wherein the memory is configured to store instructions, characterized in that the processor is configured to execute the instructions to implement operations corresponding to the core word extraction method according to any one of claims 1 to 9.