CN110414004B

CN110414004B - Method and system for extracting core information

Info

Publication number: CN110414004B
Application number: CN201910699583.4A
Authority: CN
Inventors: 杨明晖
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2022-11-18
Anticipated expiration: 2039-07-31
Also published as: CN110414004A

Abstract

The embodiment of the specification discloses a method and a system for extracting core information. The method for extracting the core information comprises the following steps: acquiring text information; acquiring one or more word segmentation information corresponding to the text information based on word segmentation processing of the text information; determining one or more weights of one or more word segmentation information in the text information; the weight can reflect the importance of one or more word segmentation information in the text information; core information of the text information is determined based at least on the one or more weights.

Description

Method and system for extracting core information

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular, to a method and system for extracting core information.

Background

With the development of the information society, information data in various fields is rapidly increasing. The automatic and accurate extraction of core information from a large number of texts through artificial intelligence is very important in many fields of information retrieval, data mining, data processing and the like in the internet era, and therefore, the extraction of the core information of texts becomes an important technology in the field of natural language processing.

In the core information extraction technology, the commonly used core information extraction method includes two schemes, namely unsupervised scheme and supervised scheme. The unsupervised core information extraction statistics-based method is well applied to documents, chapter texts and large data, but keywords of small-data-volume texts are difficult to accurately calculate. The effect of the supervised algorithm is superior to that of the unsupervised algorithm on the short text small data, but with the rapid development of the internet and the increasing complexity of the use scenes of users, the text scenes of different enterprise users are different, the text space is different, the weight difference of the same word is larger in different scenes, and the common supervised algorithm is difficult to obtain high-quality annotation data.

Therefore, a reliable improved method is desired, which can adapt to the extraction of the text core information under various scenes without depending on text space and labeled samples.

Disclosure of Invention

One aspect of the present specification provides a core information extraction method. The method comprises the following steps: acquiring text information; acquiring one or more word segmentation information corresponding to the text information based on word segmentation processing of the text information; determining one or more weights of the one or more word segmentation information in the text information; the weight can reflect the importance degree of the one or more word segmentation information in the text information; determining core information of the text information based at least on the one or more weights.

In some embodiments, the determining the weight of the one or more word segmentation information in the text information comprises: determining one or more mask texts based on the one or more word segmentation information and the text information; at least one piece of word segmentation information in the one or more mask texts is shielded; determining an original vector representation of the text information based on a first preset algorithm and the text information; determining one or more mask vector representations based on a first preset algorithm and the one or more mask texts; determining the one or more weights from the original vector representation and the one or more mask vector representations.

In some embodiments, said determining said one or more weights from said original vector representation and said one or more mask vector representations comprises: determining one or more distances between the one or more mask vector representations and the original vector representation; determining the one or more weights from the one or more distances.

In some embodiments, the weights are positively correlated with their corresponding distances.

In some embodiments, the distance comprises at least one of: cosine distance, euler distance, or manhattan distance.

In some embodiments, determining the one or more weights from the one or more distances comprises: normalizing the one or more distances to determine the one or more weights.

In some embodiments, the first preset algorithm comprises a BERT model.

In some embodiments, said determining core information of said textual information based at least on said one or more weights comprises: and determining the core information of the text information according to the one or more weights and a preset threshold value.

In some embodiments, the text information comprises short text information.

In some embodiments, the method further comprises: acquiring limited vocabulary information; and screening the one or more participle information based on the restricted vocabulary information, and excluding the participle information from the core information if the one or more participle information is contained in the restricted vocabulary information.

In some embodiments, the text information comprises short text information.

The embodiment of the specification also relates to a core word extraction system. The system comprises: the text acquisition module is used for acquiring text information; the word segmentation module is used for acquiring one or more word segmentation information corresponding to the text information based on word segmentation processing of the text information; the weight determining module is used for determining one or more weights of the one or more word segmentation information in the text information; the weight can reflect the importance degree of the one or more word segmentation information in the text information; a core information determination module to determine core information of the text information based at least on the one or more weights.

In some embodiments, the weight determination module is further to: determining one or more mask texts based on the one or more participle information and the text information; at least one piece of word segmentation information in the one or more mask texts is shielded; determining an original vector representation of the text information based on a first preset algorithm and the text information; determining one or more mask vector representations based on a first preset algorithm and the one or more mask texts; determining the one or more weights from the original vector representation and the one or more mask vector representations.

In some embodiments, the weight determination module is further to: determining one or more distances between the one or more mask vector representations and the original vector representation; determining the one or more weights from the one or more distances.

In some embodiments, the weight determination module is further to: normalizing the one or more distances to determine the one or more weights.

In some embodiments, the first preset algorithm comprises a BERT model.

In some embodiments, the core information determination module is further configured to: and determining the core information of the text information according to the one or more weights and a preset threshold value.

In some embodiments, the text information comprises short text information.

In some embodiments, the system further comprises: the restricted vocabulary acquisition module is used for acquiring restricted vocabulary information; and the screening module is used for screening the one or more word segmentation information based on the restricted vocabulary information, and if the one or more word segmentation information is contained in the restricted vocabulary information, the word segmentation information is excluded from the core information.

The embodiment of the specification also relates to a core word extracting device, which comprises a processor and a memory; the memory is configured to store instructions for execution by the processor to implement the core word extraction method as described above.

Embodiments of the present specification also relate to a computer-readable storage medium storing computer instructions, which when executed by at least one processor, can implement the core word extraction method as described above.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is an exemplary flow diagram of a method of core information extraction, shown in accordance with some embodiments of the present description;

FIG. 2 is a sub-flow diagram illustrating determining weights for participle information according to some embodiments of the present description;

FIG. 3 is a schematic illustration of an exemplary mask text shown in accordance with some embodiments of the present description;

FIG. 4 is a representation of a raw text BERT vector shown in accordance with some embodiments of the present description;

FIG. 5 is a representation of a masking text BERT vector shown in accordance with some embodiments of the present description;

FIG. 6 is a schematic illustration of determining weights according to some embodiments of the present description;

FIG. 7 is an exemplary system block diagram of a core information extraction method, shown in some embodiments herein.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, without inventive effort, the present description can also be applied to other similar contexts on the basis of these drawings. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "device", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies of different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," "the," and/or "the" are not intended to be inclusive in the singular, but rather are intended to be inclusive in the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" are intended to cover only the explicitly identified steps or elements as not constituting an exclusive list and that the method or apparatus may comprise further steps or elements.

Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.

One or more embodiments of the present specification provide a core information extraction method. The basic inventive concept is that the core information of the text can be words which have the greatest influence on the text semantics, and after the text loses the core information, the semantics are influenced to the greatest extent.

One or more embodiments of the present disclosure sequentially determine weights of one or more words in text information by a masking method, and determine semantic influence amounts of the words in the text information by the weights, so that core information of a text can be determined according to the weights. Specifically, in one text message, if the influence of a certain word on the text message is to be determined, the word can be masked by a special symbol, and then the semantic difference between the text semantic lacking the word and the original text message semantic is determined, so as to determine the influence of the word on the text message.

One or more embodiments of the present specification determine the weight of the text information by calculating the semantic distance between the text after the word segmentation processing and the original text. According to the principle that the space distances of texts with similar or similar semantics on the feature vectors are also similar, the text vectors obtained by converting the word vectors can contain accurate semantic information, so that words in the texts are converted into one-dimensional vectors to be used as semantic representation of the texts. The original text information and the text after word segmentation processing are coded into vector representations, and the distance between the mask vector representation and the original vector representation on a vector space is calculated, so that the weight of word segmentation information in the text information can be determined, and the core information in the text information is finally extracted.

It should be understood that the application scenarios of the system and method in the present specification are only examples or embodiments of the present specification, and it is obvious for those skilled in the art that the present specification can also be applied to other similar scenarios according to these drawings without any creative effort.

The technical solution of core information extraction in one or more embodiments of the present specification may be applied to various business scenarios, including but not limited to artificial intelligence, data mining, big data analysis, data capture, public opinion monitoring, disaster monitoring, traffic monitoring, hot spot analysis, information recall, online customer service, question and answer robot, semantic analysis, voice recognition, public security information, emergency help, emergency rescue, e-commerce service, document retrieval, book management, machine translation, document copy monitoring, and the like. Although the present embodiment is mainly described with text information, it should be noted that the principle of the present embodiment can also be applied to semantic recognition in other business scenarios such as voice processing, image processing, etc., such as voice semantic recognition, conversation robot, image text risk monitoring, etc.

Fig. 1 is an exemplary flow diagram of a core information extraction method, shown in some embodiments according to the present description. As shown in fig. 1, the method comprises the steps of:

step 101, acquiring text information.

In some embodiments, this step may be performed by a text acquisition module. In some embodiments, the manner of obtaining the text information may include at least one of: voice input, image recognition, gesture input, manual input, client push, server transmission, database information import, computer data set import, computer automatic acquisition, and the like. For example, in an application scenario of public opinion monitoring, a user inputs text information to a news search engine through a voice input or manual input mode, and then a text acquisition module acquires the text information through real-time monitoring background data. In some embodiments, when the input data is audio, the text acquisition module may identify the audio signal to obtain text information. In some embodiments, the input data may be an image, and the text acquisition module may acquire textual information in the image based on text detection and recognition techniques (e.g., OCR, deep learning models, etc.). In some embodiments, the input data may also be gesture input such as a gesture, and the text acquisition module may match the gesture to determine the corresponding text information. In some embodiments, the text information may include short text information of the core information to be extracted. In some embodiments, the form of short text information may include voice input and/or text input under a customer service robot dialog scenario or intelligent knowledge base. In some embodiments, the short text message may be a phrase, such as "cat playing a hairline ball"; or a single sentence, such as "i have a cat. "; or a repeat, such as "kitten is playing the wool ball and is happy to play". In some embodiments, short text information may also include combinations of the above forms. In some embodiments, the storage location of the text information may include one or a combination of: a database, a memory with a storage function, and the like, and the text information is acquired from the storage position through a network by the text acquisition module.

Step 103, performing word segmentation processing on the text information to obtain one or more word segmentation information corresponding to the text information.

In some embodiments, this step may be performed by a segmentation module. The word segmentation processing on the text information can be understood as preprocessing for segmenting word sequence information in the text information into one or more single words, the purpose is to conveniently judge the influence degree of the segmented word or words in the corresponding complete text information, and whether the word belongs to a core word or not can be judged based on the influence degree.

In some embodiments, the word segmentation processing on the text information may be performed based on a preset algorithm, and in some embodiments, the preset algorithm may also include a specific word segmentation model. Word segmentation processing methods include, but are not limited to: a word segmentation method based on character string matching, a word segmentation method based on understanding, a word segmentation method based on statistics, and the like. In some embodiments, the text information may be word-segmented by a word segmentation model. The word segmentation model includes, but is not limited to: n-gram (N-gram), hidden Markov Model (HMM), maximum entropy Model (ME), conditional Random field Model (CRF), JIEBA participle Model, and the like. The following describes a process of performing word segmentation processing on text information by taking a JIEBA word segmentation model as an example. One or more embodiments of the present disclosure make the description of the related technical solutions through a specific text message "my dog is cut, he likes playing" as an example, and it should be understood that this example does not bring any limitation to the scope of the present disclosure.

In some embodiments, the text obtained in step 101 is participled using the JIEBA participle model. For example: obtaining text information s by using JIEBA word segmentation model ₀ : the word segmentation is carried out on the my dog is cute and he like playing to obtain a plurality of word segmentation sequence information w ₁ ：“my”，w ₂ ：“dog”，w ₃ ：“is”，w ₄ ：“cute”，w ₅ ：“he”，w ₆ ：“likes”，w ₇ ：“play”，w ₈ ：“##ing”。

Step 105, determining one or more weights of the one or more word segmentation information in the text information.

In some embodiments, this step may be determined by a weight determination module. The weight of the participle information in the corresponding text information can reflect the importance or influence degree of the participle information or the participle information in the text information, so that whether the participle information belongs to the keywords or the core words of the text information can be judged by judging the weight of the participle information in the text information. For example, the weight determination module may determine the text information s ₀ The corresponding weights of a plurality of participles in the text. In the above step, the text information s is processed ₀ : the word segmentation processing is carried out on the mydo is cute, he likes playingThe following word segmentation sequences: w is a ₁ ：“my”，w ₂ ：“dog”，w ₃ ：“is”，w ₄ ：“cute”，w ₅ ：“he”，w ₆ ：“likes”，w ₇ “play”，w ₈ : "# # ing", the sequence includes 8 word segmentation information, which are respectively: w is a ₁ ，w ₂ ......w ₈ . The weight determination modules may respectively determine the text information s ₀ The 8 word segmentation information is in text information s ₀ Corresponding to 8 weights in the table.

In some embodiments, the method for determining the weight of the participle information in the text information includes but is not limited to: determining word segmentation weight through statistical information of the word segmentation, determining word segmentation weight through calculating context relation of the word segmentation, determining weights of different words through processing text information by a supervised machine learning model, and determining one or a combination of the word segmentation weights through semantic distance between a mask text corresponding to the word segmentation and original text information.

In some embodiments, the method for determining the word segmentation weight through the statistical information of the word segmentation comprises at least one of the following steps: a TF-IDF method (calculating the occurrence frequency of word segmentation information in Text information and the number of texts containing the word segmentation information), a theme model method and a Text Rank method (iteratively updating according to word co-occurrence relations).

In some embodiments, the method for determining a word weight by calculating a word segmentation context comprises: all terms are represented as graph G = (V, E) in context, point V represents all terms, point E represents context between terms, given point V _ i, its score can be calculated with the scores of its neighbors. In some embodiments, a method of determining a participle weight by calculating a semantic distance between vector representations of the participle comprises: the participles are encoded into vector representations, respectively, and spatial distances between the vector representations are calculated to determine semantic distances between the participles.

The method for determining the weight of a participle by calculating the semantic distance between the mask text corresponding to the participle and the original text information will be described in detail in other parts of this specification with reference to specific embodiments, please refer to fig. 2.

Step 107, determining core information of the text information based on at least the one or more weights.

In some embodiments, this step may be performed by the core information determination module. The core information determining module may determine the core information of the text information based on the weights corresponding to the different word segmentation information by using one or more methods. In some embodiments, the core information determining module may determine the core information of the text information, i.e., the core word, from the weighted segmentation information based on a preset threshold. In some embodiments, the preset threshold may comprise a weight threshold, in particular, the weight of the participle information is determined to be the core information if it is greater than the weight threshold. In some embodiments, the preset threshold may further include a quantity threshold, and specifically, the participle information corresponding to the weight of the top three in the sequence is selected to be determined as the core information.

For example, the text information s ₀ 8 word segmentation information w in ₁ ：“my”，w ₂ ：“dog”，w ₃ ：“is”，w ₄ ：“cute”，w ₅ ：“he”，w ₆ ：“likes”，w ₇ “play”，w ₈ : "# # ing" in the text information s ₀ Has been determined. In some embodiments, a weight threshold may be set to 80%, and the participle information with a weight greater than 80% may be selected as the core information according to the weight threshold. In some embodiments, the number threshold may also be set to 3, and 3 pieces of word segmentation information with higher weight are respectively selected as the core information according to the number threshold. In the process of determining the core information according to the quantity threshold and/or the weight threshold, the weights of the participle information can be sorted, or the weights can be screened directly according to a preset threshold without sorting.

In some embodiments, the preset threshold may be preset, and the setting manner includes, but is not limited to, manual setting, computer setting, program setting, and the like. The specific percentage or value of the weight threshold, number threshold may be based on the specifics of the analysis weights determined in step 105.

It should be noted that, in some embodiments, the method may further include screening, based on a screening condition, each piece of word segmentation information corresponding to the text information. In some embodiments, the screening step may be performed by a screening module. For example, because semantic words such as table negation and turning have a large influence on the semantic meaning, but are not core information of the sentence, in some embodiments, the word segmentation information may be further filtered, and if the word segmentation information corresponding to the text information includes the semantic words such as table negation and turning, the core words may be excluded when the core words are filtered, so as to avoid being selected. In some embodiments, the filtering manner for excluding the above words may avoid selecting the above words by adding a limit of stopping words in the filtering process. In some implementations, words that have a large influence on semantics but do not include entity information may be summarized in advance to form a vocabulary restriction table, and during the screening step, each piece of word segmentation information may be compared with the vocabulary in the vocabulary restriction table, so that whether the piece of word segmentation information needs to be excluded or not may be determined. The vocabulary limitation list comprises one or more limited vocabularies, and the limited vocabularies can comprise but are not limited to relevant words, mood assistant words, adverbs and the like which represent negative semantics, turning semantics and the like. In some embodiments, the restricted vocabulary may be built by the restricted vocabulary acquisition module or one or more restricted vocabularies may be acquired from a pre-generated restricted vocabulary by the restricted vocabulary module. In some embodiments, the filtering step may be performed after determining the weight of each segmented information, or may be performed before determining the weight of each segmented information. For example, after the execution of step 103 is finished, each segmented word may be filtered to exclude the restricted vocabulary therein. For another example, after the execution of step 107 is finished, the obtained core information may be filtered to exclude the restricted vocabulary therein. In some embodiments, the filtering condition may include one or more restricted words and may include a combination of one or more restricted words, and in some embodiments, the combination of one or more restricted words may be understood as a vocabulary restriction list.

It should be noted that the above description regarding the core information extraction method 100 is for illustration and description only and is not intended to limit the applicability of one or more embodiments of the present disclosure. Various modifications and alterations to the core extraction method 100 will be apparent to those skilled in the art in light of one or more embodiments set forth herein. However, such modifications and variations are intended to be within the scope of one or more embodiments of the present disclosure.

FIG. 2 is a sub-flow diagram illustrating determining weights for participle information according to some embodiments of the present description.

In some embodiments, determining the weight of one or more word segmentation information may also be implemented by a masking distance method, and specifically, the corresponding weight of the word segmentation information may be determined by determining a text distance between a masked text corresponding to the word segmentation information and the original text information. In some embodiments, determining the text semantic distance between two texts may be accomplished by the distance of the vector representations to which the two texts correspond. According to the principle that the space distances of texts with similar or similar semantics on the feature vectors are also similar, the text vectors obtained by converting the word vectors can contain accurate semantic information, so that the words in the texts are converted into one-dimensional vectors to be used as the semantic representation of the texts.

Some embodiments for determining the corresponding weight of one or more participle information in the text information are described in detail below in conjunction with fig. 2, where process 200 includes:

step 201, one or more mask texts are determined based on one or more participle information and text information, wherein one participle information in the one or more mask texts is respectively masked.

In some embodiments, this step may be performed by the weight determination module. In some embodiments, determining the mask text corresponding to the one or more participle information may be implemented as follows: and adopting special symbols to sequentially shield word segmentation information needing to judge the weight or the importance in the text information so as to determine a mask text corresponding to the word segmentation information. The role of the special symbol is: the method comprises the steps of enabling word segmentation information needing to judge the weight or the importance degree in text information to be absent, judging the semantic difference between a mask text losing the word segmentation information and the text information, and further judging the importance degree of the word segmentation information in the text information. If the semanteme of a piece of word information is greatly changed relative to the original text information after the piece of word information is lost, the piece of word information is shown to be important for the text information. In some embodiments, the special symbol includes, but is not limited to, one or a combination of a character, a character string, a letter and a number, as long as the symbol can represent the absence of the corresponding semantics of the participle information. In some embodiments, one or more of the participle information in step 103 may be masked in turn using [ MASK ] or [ M ] as a special symbol. How to determine one or more mask texts will be described below in connection with the specific example illustrated in fig. 3.

Fig. 3 is a schematic diagram of an exemplary mask text shown in accordance with some embodiments of the present description. In some embodiments, the symbol [ M ] may be employed]Covering the text information s as special symbols in turn ₀ : 8 participle information in "my dog is cute, he like play # # ing": w is a ₁ ：“my”，w ₂ ：“dog”，w ₃ ：“is”，w ₄ ：“cute”，w ₅ ：“he”，w ₆ ：“likes”，w ₇ ：“play”，w ₈ : "# # ing", and further determining 8 mask texts corresponding to the 8 participle information: s ₁ ：“[M]dog is cute he likes play##ing”，s ₂ ：“my[M]is cute he likes play##ing”，s ₃ ：“my dog[M]cute he likes play##ing”，s ₄ ：“my dog is[M]he likes play##ing”，s ₅ ：“my dog is cute[M]likes play##ing”，s ₆ ：“my dog is cute he[M]play##ing”，s ₇ ：“my dog is cute he likes[M]##ing”，s ₈ ：“my dog is cute he likes play[M]”。

In some embodiments, a plurality of participle information (e.g., a combination of participles) in the participle information corresponding to the text information may be masked at the same time, so as to obtain a masked text corresponding to the simultaneously masked plurality of participle information. In some embodiments, the semantic importance of the masked participle information in the original text information can be obtained according to the semantic distance between the mask text and the original text information, that is, the larger the distance is, the larger the semantic importance is, the smaller the distance is, and the smaller the semantic importance is.

In some embodiments, determining the text semantic distance between two texts may be achieved by determining the semantic distance between vector representations to which the texts correspond, so that there is the step of determining the vector representation to which the text information corresponds to the mask text.

Step 203, determining an original vector representation of the text information based on a first preset algorithm and the text information.

In some embodiments, this step may be performed by the weight determination module. In some embodiments, determining an original vector representation of the text information based on a first preset algorithm and the text information may determine a corresponding vector representation by encoding the text information by one or more algorithms. In some embodiments, the first preset algorithm includes, but is not limited to: RNN (Recurrentneural Networks), CNN (conditional Neural Networks), transformer, GPB, BERT (Bidirective Encode responses from Transformers).

The vector representation algorithm will be described below by taking the BERT model as the first preset algorithm as an example.

In the process of converting text information into corresponding vector representation by using the BERT model, the input of the BERT model comprises text embedding, segmentation embedding and position embedding corresponding to the text information to be converted. The text embedding comprises a feature vector coded based on text information, and a word can be divided into a limited group of common word units, so that the balance between the effectiveness of the word and the flexibility of characters is achieved. The segmentation embedding comprises feature vectors coded based on segmentation information, can be used for distinguishing sentences of different contexts in the text, and codes the sentences of different contexts into different feature vectors. For example, text sequence information "[ CLS]my dog is cute[SEP]"and" he likes playing [ SEP ]]"belong to different clauses, the segmentation information corresponding to the text sequence can be expressed as"AAAABBBBBBB". The position embedding includes a feature vector encoded based on text position information, and position information of a word can be encoded into the feature vector. For example, text information s ₀ “[CLS]my dog is cute[SEP]，he likes playing[SEP]", wherein the numbers of the word segmentation sequence information are respectively: 0,1,2,3,4,5,6,7,8,9, 10, 'dog' belongs to the 3 rd word of information in the participle sequence, numbered 2, and the position embedding information can encode the position information of the word into a feature vector. How to use the BERT model to convert the original text s will be described with reference to FIG. 4 and the specific examples ₀ Conversion into a corresponding vector representation v ₀ 。

Fig. 4 is a schematic diagram of converting textual information into corresponding vector representations based on a BERT model, according to some embodiments described herein. As shown in FIG. 4, text information s is transformed using the BERT model ₀ : "my dog is cut, he like play # # ing" is converted into corresponding original vector representation v ₀ . As shown, the text information s ₀ The corresponding text embedding is: [ CLS]my dog is cute[SEP]he likes playing[SEP](ii) a The corresponding split embedding is: AAAABBBBB; the corresponding position embedding is: 0,1,2,3,4,5,6,7,8,9, 10. Text information s ₀ The sum of corresponding text embedding, segmentation embedding and position embedding is used as the input of a BERT model, and text information s can be obtained through the processing of the BERT model ₀ Is represented by the original vector of ₀ 。

Step 205, one or more mask vector representations are determined based on a first predetermined algorithm and the one or more mask texts.

In some embodiments, this step may be performed by the weight determination module. In some embodiments, the corresponding mask vector representation is determined based on the same algorithmic mask text as the above steps. How to mask the text s by the BERT model will be described with reference to fig. 5 and the specific examples ₁ Conversion into a corresponding vector representation v ₁ 。

Fig. 5 is a schematic diagram of converting mask text to corresponding vector representations based on a BERT model, according to some embodiments of the present description. As shown in FIG. 5Showing, using a BERT model, a mask text s ₁ ：“[M]Conversion of dog is cute, he like play # # ing "into corresponding original vector representation v ₁ . As shown, the masking text s ₁ The corresponding text embedding is: [ CLS][M]dog is cute[SEP]he likes playing[SEP](ii) a The corresponding split embedding is: AAAABBBBB; the corresponding position embedding is: 0,1,2,3,4,5,6,7,8,9, 10. Will shade the text s ₁ The sum of the corresponding text embedding, segmentation embedding and position embedding is used as the input of a BERT model, and the mask text s can be obtained through the processing of the BERT model ₁ Is represented by a mask vector v ₁ 。

Other masked text s ₂ ......s ₈ Similar method is adopted to convert the corresponding vector representation v into by the BERT model ₂ ......v ₈ And will not be described herein.

Step 207, determining the one or more weights from the original vector representation and the one or more mask vector representations. In some embodiments, this step may be performed by the weight determination module.

As depicted in fig. 6, fig. 6 is a schematic diagram illustrating determining weights according to some embodiments of the present description:

a weight determination module determines one or more distances between the one or more mask vector representations and the original vector representation; determining the one or more weights from the one or more distances. In some embodiments, the weight is positively correlated with the distance corresponding to the weight, and in particular, when the distance between the mask vector representation and the original vector representation is larger, it is indicated that the semantic influence of the masked participle information on the text information is larger, and the weight of the participle information is larger.

After the text information and the vector representations corresponding to the mask texts are determined, the semantic distance between the mask vector representation and the original vector representation can be determined based on the corresponding vector representations, and then the importance or influence of the word segmentation information corresponding to the mask vector representation on the text information can be judged according to the semantic distance.

The weight determining module determines the semantic distance between the text information and a plurality of mask texts by determining the spatial distance between the one or more mask vector representations and the original vector representation, can determine the influence quantity of participle information masked by special symbols in the mask texts, further determines the weight of the participle information, and finally determines the weight of one or more participle information corresponding to one or more mask vectors.

In some embodiments, the semantic distance between the plurality of mask vector representations and the original vector representation may be normalized to obtain weights of different mask vector representations relative to the original vector. This step may be performed by the weight determination module.

In some embodiments, the semantic distance calculation method between the mask vector and the original vector includes, but is not limited to, a word frequency statistics-based calculation method, an ontology-based calculation method, a geometric metric space-based calculation method, and the like. The calculation method based on word Frequency statistics includes, but is not limited to, overlapped word based method, TF-IDF (Term Frequency-Inverse Document Frequency), various weighting algorithms thereof (e.g., LSA, HAL, islam), and the like. The distance calculation method based on the ontology includes, but is not limited to, an ontology library edge distance calculation method, an ontology library node calculation method, a hybrid calculation method, and the like. The geometric-metric-based spatial calculation method includes, but is not limited to, euclidean Distance (Euclidean Distance), cosine Distance (Cosine Distance), manhattan Distance (Manhattan Distance), and the like.

In some embodiments, when the cosine distance method is used to calculate the distance between the mask vector and the original vector, the following formula (1) may be referred to, and the formula for calculating the cosine distance method corresponds to, where v ₀ Representing the original vector, v _i Mask vector representing ith mask text, d _i Representing the semantic distance between the ith mask vector and the original vector.

In some embodiments, the semantic distance may be further normalized to obtain a corresponding value from 0 to 1, that is, to obtain a corresponding weight. Common normalization processing methods include, but are not limited to: maximum-minimum normalization and/or mean normalization methods.

In some embodiments, when the semantic distance is normalized by the max/min normalization method, the following formula (2), the max/min normalization formula, may be referred to. Wherein X _norm Representing normalized data, X representing raw data, X _max And X _min Representing the maximum and minimum values of the original data set, respectively.

For example, for the semantic distance d between the ith mask vector and the original vector _i The weight obtained after maximum and minimum normalization is equal to

As another example, the original text s ₀ Vector representation v corresponding to mask text formed by the 1 st participle information ₁ With the original vector representation v ₀ A semantic distance between d ₁ After the minimum normalization processing is performed on the data, the obtained weight 1 is as follows:

i.e. representing the weight of the above-mentioned 1 st participle information in the original text.

In other embodiments, the semantic distance may be normalized by using a mean normalization method. The mean normalization formula is:

where z represents the normalized data, μ represents the mean of the original data set, and σ represents the variance of the original data set.

In some other embodiments, in addition to determining the weight by normalizing the semantic distance, the weight may be obtained by scoring the semantic distance. And the semantic distance scoring comprises classifying the acquired semantic distance data and presetting a corresponding weight threshold. And (3) setting the semantic distance with the value exceeding a preset threshold as a key object, wherein the corresponding masked word segmentation information in the mask vector has larger similarity with the original text.

FIG. 7 is an exemplary system block diagram of core information extraction, shown in accordance with some embodiments of the present description.

As shown in fig. 7, the system includes: a text acquisition module 710, a segmentation module 720, a weight determination module 730, and a core information determination module 740.

The text acquiring module 710 is used for acquiring text information.

The word segmentation module 720 is configured to perform word segmentation processing on the text information to obtain one or more word segmentation information corresponding to the text information.

The weight determining module 730 is configured to determine one or more weights of the one or more word segmentation information in the text information; the weight can reflect an importance of the one or more word segmentation information in the text information. In some embodiments, the weight determination module 730 is further configured to determine one or more mask texts based on the one or more participle information and the text information; one piece of word segmentation information in the one or more mask texts is/are respectively masked; determining an original vector representation of the text information based on a first preset algorithm and the text information; determining one or more mask vector representations of the one or more mask texts based on a first preset algorithm and the one or more mask texts; determining the one or more weights from the original vector representation and the one or more mask vector representations. In some embodiments, the weight determination module is further to determine one or more distances between the one or more mask vector representations and the original vector representation; determining the one or more weights from the one or more distances. In some embodiments, the weight determination module is further to normalize the one or more distances to determine the one or more weights.

In some embodiments, the system further comprises a text embedding acquisition module for acquiring the text embedding comprises a feature vector encoded based on text information; a segmentation embedding acquisition module for acquiring the segmentation embedding including a feature vector encoded based on segmentation information; and the position embedding acquisition module is used for acquiring the position embedding including the feature vector coded based on the text position information.

In some embodiments, the text embedding obtaining module is further configured to divide one or more participle information in the text information into limited common word units and encode the word units into feature vectors. In some embodiments, the segmentation-embedding acquisition module is further to: and encoding the segmentation information and the one or more word segmentation information in the text information into a feature vector. In some embodiments, the location embedding acquisition module is further to: and encoding the position information of the word segmentation sequence in the word segmentation information or the word segmentation information in the text information into a feature vector.

The core information determination module 740 is configured to determine the core information of the text information based on at least the one or more weights. In some embodiments, the core information determination module is further configured to: and determining the core information of the text information according to the one or more weights and a preset threshold value. In some embodiments, the core information determination module is further configured to: performing word segmentation processing on the text information based on a word segmentation model; the word segmentation model may include at least one of: JIEBA participle, HMM participle model, CRF participle model, deep learning model.

In some embodiments, the system further comprises a restricted vocabulary acquisition module for acquiring restricted vocabulary information; in some embodiments, the system further comprises a filtering module configured to filter the one or more pieces of segmentation information based on the restricted vocabulary information, and to exclude the one or more pieces of segmentation information from the core information if the one or more pieces of segmentation information are included in the restricted vocabulary information.

It should be appreciated that the system and its modules illustrated in FIG. 7 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in one or more embodiments of the present specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of hardware circuits and software (e.g., firmware).

It should be noted that the above descriptions of the candidate item display and determination system and the modules thereof are only for convenience of description, and the description is not limited to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the teachings of the present system, any combination of modules or sub-system configurations may be used to connect to other modules without departing from such teachings. For example, in some embodiments, for example, the text obtaining module 710, the word segmentation module 720, the weight determination module 730, and the core information determination module 740 disclosed in fig. 7 may be different modules in one system, or may be a module that implements the functions of two or more modules described above. For example, the text acquiring module 710 and the weight determining module 730 may be two modules, or one module may have both acquiring and determining functions. For example, each module may share one obtaining module, and each module may have its own obtaining module. Such variations are within the scope of the present description.

Based on the above core information extraction method, one or more embodiments of the present specification further provide a core information extraction apparatus including at least one processor and at least one memory; the at least one memory is for storing computer instructions; the at least one processor is configured to execute at least a part of the computer instructions to implement the core information extraction method according to any of the above embodiments.

The core information extraction means may be arranged to process computer instructions during execution of the core information extraction. Specifically, the core information extraction means may store computer instructions and perform core information extraction operations.

The core information extraction device in the embodiment of the present specification can be applied to a variety of service scenarios, including but not limited to artificial intelligence, data mining, big data analysis, data capture, public opinion monitoring, disaster monitoring, traffic monitoring, hotspot analysis, online customer service, question and answer robot, semantic analysis, voice recognition, public security information, emergency help, emergency rescue and relief, e-commerce service, document retrieval, book management, machine translation, document copy monitoring, and the like.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system and apparatus embodiments, the description is relatively simple as it is substantially similar to or based on method embodiments, and reference may be made to some descriptions of method embodiments for related points.

The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) The method replaces the original words of the text by using a mask method, can respectively obtain the weights of the words in the text, and has high accuracy; (2) According to the scheme, a BERT semantic coding model is used, core information can be extracted from a single statement, and the model accuracy is high; (3) The method does not depend on the labeled data, can perfect the core information extraction function of the short text small data in different application scenes, optimizes the accuracy of core information extraction, and has strong applicability. It is to be noted that different embodiments may produce different advantages, and in different embodiments, the advantages that may be produced may be any one or combination of the above, or any other advantages that may be obtained.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, adaptations, and alternatives to one or more embodiments described herein may occur to one skilled in the art without departing from the scope of the invention as defined by the claims. Such alterations, modifications, and improvements are intended to be suggested in one or more embodiments of this disclosure, and are intended to be within the spirit and scope of the exemplary embodiments of this disclosure.

Also, the use of specific language in one or more embodiments of the specification has been used to describe embodiments of the specification. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable categories or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful modification thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.

The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.

Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object oriented programming language such as Java, scala, smalltalk, eiffel, JADE, emerald, C + +, C #, VB.NET, python, and the like, a conventional programming language such as C, visualBasic, fortran2003, perl, COBOL2002, PHP, ABAP, a dynamic programming language such as Python, ruby, and Groovy, or other programming languages, and the like. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or processing device. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service using, for example, software as a service (SaaS).

Additionally, the order in which elements and sequences are described in this specification, the use of numerical letters, or other designations are not intended to limit the order of the processes and methods described in this specification, unless explicitly stated in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing processing device or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features are required than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single disclosed embodiment.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range in some embodiments of the specification are approximations, in specific embodiments, such numerical values are set forth as precisely as possible within the practical range.

Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, and the like, cited in connection with one or more embodiments of the specification is hereby incorporated by reference in its entirety into the examples of the specification. Except where the application history data does not conform to or conflict with the contents of the examples herein, except where a claim to this specification is limited in its broadest scope (whether present or later appended to this specification). It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of the present specification shall control if they are inconsistent or inconsistent with the statements and/or uses of the present specification.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method for core information extraction, the method being performed by at least one processor, the method comprising:

acquiring text information;

acquiring one or more word segmentation information corresponding to the text information based on word segmentation processing of the text information;

determining one or more mask texts based on the one or more participle information and the text information; at least one piece of word segmentation information in the one or more mask texts is shielded;

determining an original vector representation of the text information based on a first preset algorithm and the text information;

determining one or more mask vector representations based on a first preset algorithm and the one or more mask texts;

determining the one or more weights from the original vector representation and the one or more mask vector representations; the weight can reflect the importance of the corresponding word segmentation information in the text information;

determining core information of the text information based at least on the one or more weights.

2. The method of claim 1, wherein said determining the one or more weights from the original vector representation and the one or more mask vector representations comprises:

determining one or more distances between the one or more mask vector representations and the original vector representation;

determining the one or more weights from the one or more distances.

3. The method of claim 2, wherein the weight is positively correlated with its corresponding distance.

4. The method of claim 2, wherein the distance comprises at least one of: cosine distance, euler distance, or manhattan distance.

5. The method of claim 2, wherein determining the one or more weights from the one or more distances further comprises: normalizing the one or more distances to determine the one or more weights.

6. The method of claim 1, wherein the first pre-set algorithm comprises a BERT model.

7. The method of claim 1, wherein the determining core information of the textual information based at least on the one or more weights comprises:

and determining the core information of the text information according to the one or more weights and a preset threshold value.

8. The method of claim 1, wherein the text information comprises short text information.

9. The method of claim 1, further comprising:

acquiring limited vocabulary information;

and screening the one or more pieces of segmentation information based on the restricted vocabulary information, and excluding the one or more pieces of segmentation information from the core information if the one or more pieces of segmentation information are included in the restricted vocabulary information.

10. A core word extraction system, the system comprising:

the text acquisition module is used for acquiring text information;

the word segmentation module is used for acquiring one or more word segmentation information corresponding to the text information based on word segmentation processing of the text information;

a weight determination module to:

and determining the one or more weights from the original vector representation and the one or more mask vector representations; the weight can reflect the importance of the corresponding word segmentation information in the text information;

a core information determination module to determine core information of the text information based at least on the one or more weights.

11. The system of claim 10, wherein the weight determination module is further configured to:

determining the one or more weights from the one or more distances.

12. The system of claim 11, wherein the weights are positively correlated with their corresponding distances.

13. The system of claim 11, wherein the distance comprises at least one of: cosine distance, euler distance, or manhattan distance.

14. The system of claim 11, wherein the weight determination module is further configured to: normalizing the one or more distances to determine the one or more weights.

15. The system of claim 10, wherein the first pre-set algorithm comprises a BERT model.

16. The system of claim 10, wherein the core information determination module is further configured to: and determining the core information of the text information according to the one or more weights and a preset threshold value.

17. The system of claim 10, wherein the textual information comprises short textual information.

18. The system of claim 10, further comprising:

the restricted vocabulary acquisition module is used for acquiring restricted vocabulary information;

and the screening module is used for screening the one or more word segmentation information based on the restricted vocabulary information, and if the one or more word segmentation information is contained in the restricted vocabulary information, the word segmentation information is excluded from the core information.

19. An apparatus for extracting core words, the apparatus comprising a processor and a memory; the memory is configured to store instructions, and the processor is configured to execute the instructions to implement operations corresponding to the core word extraction method according to any one of claims 1 to 9.