CN113377911B - Text information extraction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113377911B
CN113377911B (application number CN202110643959.7A)
Authority
CN
China
Prior art keywords
text
preprocessed
pareto
vocabulary
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110643959.7A
Other languages
Chinese (zh)
Other versions
CN113377911A (en)
Inventor
高明
华煌圣
彭政
张文斐
张栩华
王德辉
刘己未
宋强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202110643959.7A
Publication of CN113377911A
Application granted
Publication of CN113377911B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval using metadata automatically derived from the content
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text information extraction method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: performing text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts; calculating a first feature item of each preprocessed text paragraph; calculating a second feature item of each preprocessed question text; calculating the degree of association between the preprocessed text paragraphs and the preprocessed question texts using the first feature items and the second feature items; dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association; sorting the Pareto sets to obtain a Pareto sequence; and extracting, from the Pareto sequence, the text information corresponding to the plurality of question texts. The method improves both the efficiency and the accuracy of extracting text information associated with multiple questions.

Description

Text information extraction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information extraction technologies, and in particular, to a text information extraction method and apparatus, an electronic device, and a storage medium.
Background
With the advent of the big data era, the amount of data on the internet keeps growing, bringing convenience to people's daily life, study, and work. Accordingly, there is an urgent need to extract the information a user requires from large volumes of text data.
At present, when using an internet text retrieval tool, users often need to retrieve results for several input questions at once and extract the text information associated with each question. Existing methods handle each question separately, so text information extraction is inefficient and its accuracy is low.
Disclosure of Invention
The invention provides a text information extraction method and apparatus, an electronic device, and a storage medium, which are used to solve the technical problem that existing text information extraction methods, when retrieving a plurality of input questions and extracting the text information closely associated with each question, suffer from low extraction efficiency and low extraction accuracy.
The invention provides a text information extraction method, which comprises the following steps:
performing text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts;
calculating a first feature item of the preprocessed text paragraph;
calculating a second feature item of the preprocessed question text;
calculating the degree of association between the preprocessed text paragraph and the preprocessed question text using the first feature item and the second feature item;
dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association;
sorting the Pareto sets to obtain a Pareto sequence;
and extracting the text information corresponding to the plurality of question texts from the Pareto sequence.
Optionally, the step of calculating a first feature item of the preprocessed text paragraph comprises:
acquiring a first word segmentation vocabulary of the preprocessed text paragraph, the first word segmentation vocabulary comprising a plurality of first words;
calculating a first word frequency and a first inverse document frequency of each first word;
calculating a first absolute feature value of the corresponding first word using the first word frequency and the first inverse document frequency;
calculating a first relative feature value of the corresponding first word according to a preset text paragraph feature item threshold and the first absolute feature value;
and calculating the first feature item of the preprocessed text paragraph using the first relative feature value of each first word in the first word segmentation vocabulary.
Optionally, the step of calculating a second feature item of the preprocessed question text comprises:
acquiring a second word segmentation vocabulary of the preprocessed question text, the second word segmentation vocabulary comprising a plurality of second words;
calculating a second word frequency and a second inverse document frequency of each second word;
calculating a second absolute feature value of the corresponding second word using the second word frequency and the second inverse document frequency;
calculating a second relative feature value of the corresponding second word according to a preset question text feature item threshold and the second absolute feature value;
and calculating the second feature item of the preprocessed question text using the second relative feature value of each second word in the second word segmentation vocabulary.
Optionally, the step of dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association comprises:
determining the unclassified text paragraphs among the preprocessed text paragraphs, and constructing a text set from the unclassified text paragraphs;
sequentially traversing the unclassified text paragraphs in the text set, and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
adding all the non-dominated individuals obtained by the traversal to the same Pareto set;
judging whether any unclassified text paragraph remains in the text set;
if yes, returning to the step of determining the unclassified text paragraphs among the preprocessed text paragraphs and constructing a text set from the unclassified text paragraphs;
if not, outputting all the Pareto sets.
Optionally, the step of extracting the text information corresponding to the plurality of question texts from the Pareto sequence comprises:
determining a target Pareto set and a critical Pareto set according to a preset text information number and the Pareto sequence;
calculating the crowding degree of each preprocessed text paragraph in the critical Pareto set;
sorting the crowding degrees to obtain a crowding degree sequence;
determining target text paragraphs from the critical Pareto set according to the target Pareto set, the text information number, and the crowding degree sequence;
and extracting the preprocessed text paragraphs in the target Pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
The invention further provides a text information extraction apparatus, comprising:
a preprocessing module, configured to perform text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts;
a first feature item calculation module, configured to calculate a first feature item of the preprocessed text paragraph;
a second feature item calculation module, configured to calculate a second feature item of the preprocessed question text;
a degree-of-association calculation module, configured to calculate the degree of association between the preprocessed text paragraph and the preprocessed question text using the first feature item and the second feature item;
a Pareto set division module, configured to divide the preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association;
a Pareto sequence generation module, configured to sort the Pareto sets to obtain a Pareto sequence;
and a text information extraction module, configured to extract the text information corresponding to the plurality of question texts from the Pareto sequence.
Optionally, the first feature item calculation module comprises:
a first word segmentation vocabulary acquisition submodule, configured to acquire a first word segmentation vocabulary of the preprocessed text paragraph, the first word segmentation vocabulary comprising a plurality of first words;
a first word frequency and first inverse document frequency calculation submodule, configured to calculate the first word frequency and the first inverse document frequency of each first word;
a first absolute feature value calculation submodule, configured to calculate a first absolute feature value of the corresponding first word using the first word frequency and the first inverse document frequency;
a first relative feature value calculation submodule, configured to calculate a first relative feature value of the corresponding first word according to a preset text paragraph feature item threshold and the first absolute feature value;
and a first feature item calculation submodule, configured to calculate the first feature item of the preprocessed text paragraph using the first relative feature value of each first word in the first word segmentation vocabulary.
Optionally, the second feature item calculation module comprises:
a second word segmentation vocabulary acquisition submodule, configured to acquire a second word segmentation vocabulary of the preprocessed question text, the second word segmentation vocabulary comprising a plurality of second words;
a second word frequency and second inverse document frequency calculation submodule, configured to calculate the second word frequency and the second inverse document frequency of each second word;
a second absolute feature value calculation submodule, configured to calculate a second absolute feature value of the corresponding second word using the second word frequency and the second inverse document frequency;
a second relative feature value calculation submodule, configured to calculate a second relative feature value of the corresponding second word according to a preset question text feature item threshold and the second absolute feature value;
and a second feature item calculation submodule, configured to calculate the second feature item of the preprocessed question text using the second relative feature value of each second word in the second word segmentation vocabulary.
The invention further provides an electronic device, comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
and the processor is configured to execute, according to instructions in the program code, the text information extraction method described in any one of the above.
The invention further provides a computer-readable storage medium, configured to store program code for executing the text information extraction method described in any one of the above.
As can be seen from the above technical solutions, the invention has the following advantages: text preprocessing is performed on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts; a first feature item of the preprocessed text paragraph and a second feature item of the preprocessed question text are calculated; the degree of association between the preprocessed text paragraphs and the preprocessed question texts is calculated using the first and second feature items; the preprocessed text paragraphs are divided into a plurality of Pareto sets according to the degree of association; the Pareto sets are sorted to obtain a Pareto sequence; and the text information corresponding to the plurality of question texts is extracted from the Pareto sequence. In this way, the efficiency and accuracy of extracting text information associated with multiple questions are improved.
Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating steps of a text information extraction method according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating the process of dividing preprocessed text paragraphs into Pareto sets according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a text information extraction method according to an embodiment of the present invention;
fig. 4 is a block diagram of a text information extraction apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention provide a text information extraction method and apparatus, an electronic device, and a storage medium, which are used to solve the technical problem that existing text information extraction methods, when retrieving a plurality of input questions and extracting the text information closely associated with each question, suffer from low extraction efficiency and low extraction accuracy.
To make the objects, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the embodiments described below are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a text information extraction method according to an embodiment of the present invention.
The text information extraction method provided by the invention specifically comprises the following steps:
Step 101, performing text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts.
In embodiments of the present invention, the text preprocessing may include, but is not limited to, conventional text processing steps such as filtering special or useless symbols, word segmentation (separating the text into individual words), stop-word filtering, and word frequency statistics.
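The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tokenizer is a simple regex and the stop-word list is an invented English placeholder (the patent fixes neither).

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a full list
# for the target language.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def preprocess(text):
    """Filter useless symbols, segment into words, drop stop words,
    and return the tokens together with their frequency counts."""
    # Keeping only word characters stands in for symbol filtering.
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens, Counter(tokens)

tokens, counts = preprocess("The quick brown fox jumps over the lazy dog.")
print(tokens)  # stop words removed, text lowercased
```

Each text paragraph and question text would be passed through the same pipeline before feature items are computed.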
Step 102, calculating a first feature item of the preprocessed text paragraph.
After the text paragraphs and question texts have been preprocessed, the first feature item of each preprocessed text paragraph can be extracted.
In one example, calculating the first feature item of the preprocessed text paragraph may be implemented by the following substeps:
S21, acquiring a first word segmentation vocabulary of the preprocessed text paragraph, the first word segmentation vocabulary comprising a plurality of first words;
S22, calculating a first word frequency and a first inverse document frequency of each first word;
S23, calculating a first absolute feature value of the corresponding first word using the first word frequency and the first inverse document frequency;
S24, calculating a first relative feature value of the corresponding first word according to a preset text paragraph feature item threshold and the first absolute feature value;
S25, calculating the first feature item of the preprocessed text paragraph using the first relative feature value of each first word in the first word segmentation vocabulary.
In a specific implementation, the result of word segmentation during text preprocessing may be used as the first word segmentation vocabulary; for each first word in the vocabulary, the first word frequency and first inverse document frequency of each preprocessed text paragraph are computed, and from these the first absolute feature value of each preprocessed text paragraph is calculated.
The first word frequency can be calculated by the following formula:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word $w_i$ in the preprocessed text paragraph $d_j$, and $\sum_{k} n_{k,j}$ is the total number of word occurrences in that paragraph.
The first absolute feature value of each first word can be calculated by the following formula:

$$E_{i,j} = \gamma\!\left(\log\left(1 + tf_{i,j}\right)\right) \cdot \log\frac{|D|}{1 + |\{\, d : w_i \in d \,\}|}$$

where $E_{i,j}$ is the first absolute feature value of the first word $w_i$ in paragraph $d_j$; $\gamma$ is a normalization function, which avoids the inaccurate word frequencies that documents of very different lengths would otherwise cause; taking the logarithm of the word frequency $tf_{i,j}$ effectively avoids its linear growth; the inverse document frequency $idf_i$ is likewise computed with a logarithm to avoid inaccuracy caused by linear growth, and the $+1$ in its denominator avoids a zero denominator; $|D|$ is the total number of preprocessed text paragraphs.
Then, with a preset text paragraph feature item threshold $T_d$, the first relative feature value of a first word with respect to the preprocessed text paragraph is calculated by the following formula:

$$e_{i,j} = \begin{cases} E_{i,j}, & E_{i,j} \ge T_d \\ 0, & E_{i,j} < T_d \end{cases}$$

where $T_d$ is the text paragraph feature item threshold and $e_{i,j}$ is the first relative feature value. The first words whose first relative feature value is not 0 constitute the first feature item of the preprocessed text paragraph.
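The word-frequency, inverse-document-frequency, and thresholding steps can be sketched together as below. This is a hedged reconstruction: the normalization function γ is not fixed by the text, so the identity is used as a stand-in, and `feature_items` and its `threshold` parameter are illustrative names.

```python
import math
from collections import Counter

def feature_items(paragraphs, threshold):
    """Compute thresholded TF-IDF feature values for tokenized paragraphs,
    mirroring the absolute/relative feature values described above.
    `paragraphs` is a list of token lists; gamma is taken as identity."""
    n_docs = len(paragraphs)
    df = Counter()                       # document frequency of each word
    for tokens in paragraphs:
        df.update(set(tokens))
    features = []
    for tokens in paragraphs:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for word, n in counts.items():
            tf = n / total                             # first word frequency
            idf = math.log(n_docs / (1 + df[word]))    # +1 avoids a zero denominator
            absolute = math.log(1 + tf) * idf          # first absolute feature value
            # first relative feature value: zero out values below the threshold
            vec[word] = absolute if absolute >= threshold else 0.0
        features.append(vec)
    return features

docs = [["power", "grid", "fault"], ["power", "outage"], ["text", "extraction"]]
feats = feature_items(docs, threshold=0.0)
```

Words whose value survives the threshold form that paragraph's feature item; the same routine applies to the question texts with their own threshold.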
Step 103, calculating a second feature item of the preprocessed question text.
Similarly, the embodiment of the invention extracts a second feature item of the preprocessed question text, through the following specific steps:
S31, acquiring a second word segmentation vocabulary of the preprocessed question text, the second word segmentation vocabulary comprising a plurality of second words;
S32, calculating a second word frequency and a second inverse document frequency of each second word;
S33, calculating a second absolute feature value of the corresponding second word using the second word frequency and the second inverse document frequency;
S34, calculating a second relative feature value of the corresponding second word according to a preset question text feature item threshold and the second absolute feature value;
S35, calculating the second feature item of the preprocessed question text using the second relative feature value of each second word in the second word segmentation vocabulary.
In the embodiment of the present invention, the process of calculating the second feature item of the preprocessed question text is similar to that of step 102; reference may be made to the description of step 102, which is not repeated here.
It should be noted that, after the first feature items of the preprocessed text paragraphs and the second feature items of the preprocessed question texts are obtained, the words in the vocabulary may be used as columns and the preprocessed question texts and preprocessed text paragraphs as rows, so as to obtain the feature vectors of the preprocessed question texts and the preprocessed text paragraphs.
Step 104, calculating the degree of association between the preprocessed text paragraphs and the preprocessed question texts using the first feature items and the second feature items.
In the embodiment of the present invention, after the first feature items and the second feature items are obtained, the degree of association between each preprocessed text paragraph and each preprocessed question text can be calculated.
In one example, a vector space model may be used together with the feature vectors to obtain the degree of association between a preprocessed text paragraph and a question text.
The degree of association between a preprocessed question text and a preprocessed text paragraph may be calculated by, but is not limited to, similarity measures such as cosine similarity, Manhattan distance, and Euclidean distance.
In one example, the cosine similarity can be calculated by the following formula:
$$\mathrm{CosineSim}(q, d) = \frac{\sum_{i=1}^{z} w_{i,q}\, w_{i,d}}{\sqrt{\sum_{i=1}^{z} w_{i,q}^{2}}\,\sqrt{\sum_{i=1}^{z} w_{i,d}^{2}}}$$

where $d$ is the feature vector of the preprocessed text paragraph; $q$ is the feature vector of the preprocessed question text; $\mathrm{CosineSim}(q, d)$ is the similarity between the preprocessed text paragraph $d$ and the preprocessed question text $q$; $z$ is the total number of first and second feature items; $w_{i,d}$ is the value of the $i$-th dimension of the paragraph feature vector $d$; and $w_{i,q}$ is the value of the $i$-th dimension of the question feature vector $q$.
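The cosine similarity formula above translates directly into code; this is a minimal sketch over plain lists of feature values:

```python
import math

def cosine_sim(q, d):
    """Cosine similarity between two equal-length feature vectors,
    as in the degree-of-association formula above."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # no shared feature items at all
    return dot / (norm_q * norm_d)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```

Manhattan or Euclidean distance could be substituted here without changing the rest of the pipeline, as the text notes.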
Step 105, dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association.
In the embodiment of the invention, for two preprocessed text paragraphs $d_i$ and $d_j$, if the similarity of $d_i$ to every preprocessed question text is higher than that of $d_j$, then $d_i$ is said to "dominate" $d_j$. A Pareto set is a non-dominated set: the paragraph texts within the same Pareto set do not dominate one another.
In one example, as shown in Fig. 2, step 105 may include the following substeps:
S51, determining the unclassified text paragraphs among the preprocessed text paragraphs, and constructing a text set from the unclassified text paragraphs;
S52, sequentially traversing the unclassified text paragraphs in the text set, and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
S53, adding all the non-dominated individuals obtained by the traversal to the same Pareto set;
S54, judging whether any unclassified text paragraph remains in the text set;
S55, if yes, returning to the step of determining the unclassified text paragraphs among the preprocessed text paragraphs and constructing a text set from them;
S56, if not, outputting all the Pareto sets.
In a specific implementation, a Pareto set $F_k$, $k = 1$, may first be initialized, and a text set containing the unclassified text paragraphs created;
traverse all unclassified text paragraphs $d_i$, starting from $i = 1$;
for every other unclassified text paragraph $d_j$, $j \ne i$, compare the dominance relationship between $d_i$ and $d_j$;
if no $d_j$ dominates $d_i$, mark $d_i$ as a non-dominated individual and add it to the set $F_k$;
let $i = i + 1$ and continue traversing the unclassified text paragraphs, adding non-dominated individuals to $F_k$, until $i = s$;
remove the selected non-dominated individuals, let $k = k + 1$, and repeat the above steps until every preprocessed text paragraph has been assigned to some Pareto set,
where $s$ is the total number of text paragraphs.
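The iterative front-peeling described above can be sketched as follows. Note that a standard Pareto dominance test (at least as good on every question, strictly better on at least one) is used here; the patent's own comparison, which requires strictly higher similarity on every question, is analogous but stricter.

```python
def dominates(a, b):
    """a dominates b: at least as similar on every question,
    strictly more similar on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_sets(similarities):
    """Split paragraphs (rows of per-question similarities) into
    successive Pareto (non-dominated) sets F_1, F_2, ..."""
    remaining = list(range(len(similarities)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(similarities[j], similarities[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# Three paragraphs scored against two questions.
sims = [[0.9, 0.8], [0.5, 0.4], [0.3, 0.9]]
print(pareto_sets(sims))  # [[0, 2], [1]]
```

Paragraph 0 dominates paragraph 1 on both questions, while paragraphs 0 and 2 trade off against each other, so they share the first Pareto set.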
Step 106, sorting the Pareto sets to obtain a Pareto sequence.
The order of the Pareto sets obtained by the non-dominated sorting of the preprocessed text paragraphs is the Pareto sequence, in which the preprocessed text paragraphs of an earlier Pareto set dominate those of a later Pareto set.
Step 107, extracting the text information corresponding to the plurality of question texts from the Pareto sequence.
In the embodiment of the present invention, after the Pareto sequence is obtained, the text information corresponding to the plurality of question texts can be extracted from it.
In one example, step 107 may include the following substeps:
S71, determining a target Pareto set and a critical Pareto set according to a preset text information number and the Pareto sequence;
S72, calculating the crowding degree of each preprocessed text paragraph in the critical Pareto set;
S73, sorting the crowding degrees to obtain a crowding degree sequence;
S74, determining target text paragraphs from the critical Pareto set according to the target Pareto set, the text information number, and the crowding degree sequence;
S75, determining the preprocessed text paragraphs in the target Pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
In practical applications, each preprocessed text paragraph in the same Pareto set can be regarded as a node, and the sparsity of the nodes around it as its "crowding degree". Paragraphs whose nodes are crowded together are likely to contain repeated content, so sparser preprocessed text paragraphs are preferred during selection. The crowding degree can be calculated by the following formula:
$$Cwd(d_i) = \sum_{j=1}^{z} \left|\, w_{j,\,d_{i+1}} - w_{j,\,d_{i-1}} \,\right|$$

where $d_i$ is a preprocessed text paragraph and $|w_{j,\,d_{i+1}} - w_{j,\,d_{i-1}}|$ is the distance between the left and right neighboring nodes of $d_i$ in the $j$-th feature-vector dimension. The larger this distance, the sparser the preprocessed text paragraph nodes near $d_i$, and the less crowded it is. Finally, the preprocessed text paragraph nodes in the same Pareto set are sorted in descending order of their crowding degree $Cwd(d_i)$.
It should be noted that, in practical applications, crowding-degree sorting need not be performed on all the Pareto sets, for the following reason:
since k ordered pareto sets F are obtained if necessary i ,i∈[1,k]And the number of the preprocessed text paragraphs in each pareto set is S i ,i∈[1,k]. Then there must be an integer t, t<k and are such that
Figure BDA0003108280810000103
Wherein n is the number of text messages, i.e. the number of pre-extracted text messages. That is, there is an integer t, so that n pieces of text information cannot be obtained enough after obtaining the first t preprocessed text paragraphs in the pareto sets, and the number of n text information exceeds the number of text paragraphs in the t +1 st pareto set. Therefore, the preprocessed text paragraphs in the t pareto sets are text information that is inevitably to be selected, and therefore the preprocessed text paragraphs in the t pareto sets do not need to be subjected to congestion degree sorting, and only the text paragraphs in the t +1 th pareto set need to be subjected to congestion degree sorting, that is, only the text paragraphs in the F +1 th pareto set need to be subjected to congestion degree sorting t+1 Pareto sets sort the crowdedness. Wherein, F is t+1 Each pareto set is a critical pareto set.
Finally, besides selecting all the preprocessed text paragraphs in the first t pareto sets, the top-ranked

n − (S_1 + S_2 + … + S_t)

preprocessed text paragraphs in the crowding-degree ordering of the (t + 1)-th pareto set also need to be selected, so that the n preprocessed text paragraphs finally obtained constitute the final text information extraction result.
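The selection procedure just described (take the first t pareto sets whole, then fill up from the crowding-degree ordering of the critical set) can be sketched as follows; the data layout and names are illustrative assumptions:

```python
def select_paragraphs(pareto_sets, crowding, n):
    """Pick n paragraphs: whole sets F_1..F_t, then the top of F_{t+1}.

    `pareto_sets` is the ordered list of pareto sets (lists of paragraph
    ids); `crowding` maps a paragraph id to its crowding degree.  Sets
    are taken whole while they still fit; the first set that no longer
    fits is the critical set, which is sorted in descending crowding
    order so only the remainder n - (S_1 + ... + S_t) is taken from it.
    """
    selected = []
    for fs in pareto_sets:
        if len(selected) + len(fs) <= n:
            selected.extend(fs)                 # F_1 .. F_t taken whole
            if len(selected) == n:
                break
        else:
            remainder = n - len(selected)       # n - (S_1 + ... + S_t)
            ranked = sorted(fs, key=lambda p: crowding[p], reverse=True)
            selected.extend(ranked[:remainder])
            break
    return selected

sets = [["p1", "p2"], ["p3", "p4", "p5"], ["p6"]]
cwd = {"p3": 0.9, "p4": 0.1, "p5": 0.5}
result = select_paragraphs(sets, cwd, 4)   # -> ['p1', 'p2', 'p3', 'p5']
```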
The method performs text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts respectively to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts; calculates a first feature item of each preprocessed text paragraph; calculates a second feature item of each preprocessed question text; calculates the association degree between the preprocessed text paragraphs and the preprocessed question texts by using the first feature items and the second feature items; divides the preprocessed text paragraphs into a plurality of pareto sets according to the association degrees; and sorts the pareto sets to obtain a pareto sequence, so as to extract the text information corresponding to the plurality of question texts from the pareto sequence. Therefore, the extraction efficiency and extraction accuracy of text information associated with multiple questions are improved.
For ease of understanding, the following is illustrated by specific examples:
referring to fig. 3, fig. 3 is a schematic flow chart of a text information extraction method according to an embodiment of the present invention.
As shown in fig. 3, the invention first performs preprocessing and feature item calculation on the text paragraphs and question texts, and then builds a feature vector table for the preprocessed question texts and preprocessed text paragraphs, with the vocabulary words of all text paragraphs and all question texts as columns and each preprocessed text paragraph and preprocessed question text as a row; the values of each row are assembled to obtain the feature vector of each preprocessed question text and preprocessed text paragraph. The association degree between each preprocessed text paragraph and each preprocessed question text is then calculated based on their feature vectors, and the preprocessed text paragraphs undergo non-dominated sorting based on the association degrees to obtain a plurality of pareto sets containing different preprocessed text paragraphs. The preprocessed text paragraphs within the same pareto set are sorted by crowding degree, and the preset n preprocessed text paragraphs are then obtained as the text information extraction result according to the ordering of the pareto sets and the crowding-degree ordering within each pareto set.
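A minimal sketch of this feature-vector table, assuming a TF-IDF-style weight stands in for the patent's feature items (the exact feature-item formulas are defined in the patent's earlier sections and are not reproduced here):

```python
import math
from collections import Counter

def feature_vectors(docs):
    """Build a vocabulary-column feature-vector table.

    `docs` is a list of already-tokenised texts (paragraphs or question
    texts).  Columns are the union vocabulary of all texts, rows are the
    individual texts, and each cell is a TF-IDF-style weight, used here
    as an illustrative stand-in for the patent's feature items.
    """
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # document frequency of each vocabulary word
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    table = []
    for d in docs:
        counts = Counter(d)
        row = []
        for w in vocab:
            tf = counts[w] / len(d)            # term (word) frequency
            idf = math.log(n_docs / df[w])     # inverse document frequency
            row.append(tf * idf)
        table.append(row)
    return vocab, table

vocab, table = feature_vectors([["power", "grid"], ["power", "fault"]])
```

Each row of `table` is then the feature vector used to compute association degrees.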
Referring to fig. 4, fig. 4 is a block diagram of a text information extraction device according to an embodiment of the present invention.
The embodiment of the invention provides a text information extraction device, which comprises:
the preprocessing module 401 is configured to perform text preprocessing on a plurality of preset text paragraphs and a plurality of preset problem texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed problem texts;
a first feature item calculating module 402, configured to calculate a first feature item of the preprocessed text paragraph;
a second feature item calculating module 403, configured to calculate a second feature item of the pre-processing problem text;
the association degree calculating module 404 is configured to calculate an association degree between the preprocessed text paragraph and the preprocessed question text by using the first feature item and the second feature item;
a pareto set dividing module 405, configured to divide the plurality of preprocessed text paragraphs into a plurality of pareto sets according to the association degree;
a pareto sequence generation module 406, configured to sort the pareto sets to obtain a pareto sequence;
the text information extracting module 407 is configured to extract text information corresponding to a plurality of question texts from the pareto sequence.
In this embodiment of the present invention, the first feature item calculating module 402 includes:
the first word segmentation vocabulary acquisition submodule is used for acquiring a first word segmentation vocabulary of the preprocessed text paragraph; the first word segmentation vocabulary comprises a plurality of first words;
the first word frequency and first reverse file frequency calculation submodule is used for calculating the first word frequency and the first reverse file frequency of each first vocabulary;
the first absolute characteristic value calculation submodule is used for calculating a first absolute characteristic value of a corresponding first vocabulary by adopting the first word frequency and the first reverse file frequency;
the first relative characteristic value calculation submodule is used for calculating a first relative characteristic value of a corresponding first vocabulary according to a preset text paragraph characteristic item threshold value and a first absolute characteristic value;
and the first characteristic item calculation submodule is used for calculating a first characteristic item of the preprocessed text paragraph by adopting the first relative characteristic value of each first vocabulary in the first word segmentation vocabulary.
In this embodiment of the present invention, the second feature item calculating module 403 includes:
the second participle vocabulary acquisition sub-module is used for acquiring a second participle vocabulary of the preprocessed problem text; the second word segmentation vocabulary comprises a plurality of second words;
the second word frequency and second reverse file frequency calculation submodule is used for calculating the second word frequency and second reverse file frequency of each second vocabulary;
the second absolute characteristic value calculation submodule is used for calculating a second absolute characteristic value of a corresponding second vocabulary by adopting a second word frequency and a second reverse file frequency;
the second relative characteristic value calculation submodule is used for calculating a second relative characteristic value of a corresponding second vocabulary according to a preset problem text characteristic item threshold value and a second absolute characteristic value;
and the second characteristic item calculation submodule is used for calculating a second characteristic item of the preprocessed problem text by adopting the second relative characteristic value of each second vocabulary in the second word segmentation vocabulary.
In this embodiment of the present invention, the pareto set dividing module 405 includes:
the text set constructing sub-module is used for determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs;
the non-dominated individual determination submodule is used for sequentially traversing the unclassified text paragraphs in the text set and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
the adding submodule is used for adding all the non-dominated individuals obtained by traversal into the same pareto set;
the judgment submodule is used for judging whether unclassified text paragraphs still exist in the text set;
the return submodule is used for returning to the step of determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs if unclassified text paragraphs still exist;
and the output submodule is used for outputting all the pareto sets if no unclassified text paragraphs remain.
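The pareto-set dividing loop above amounts to non-dominated sorting over the paragraphs' association-degree vectors. A minimal sketch, assuming one association degree per question text and the usual Pareto dominance relation (both are assumptions, since the patent defines the dominance relation elsewhere):

```python
def dominates(a, b):
    """a dominates b if a is at least as relevant for every question
    text and strictly more relevant for at least one (an assumed reading
    of the dominance relation over association-degree vectors)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_sets(assoc):
    """Split paragraphs into ordered pareto sets by non-dominated sorting.

    `assoc` maps a paragraph id to its vector of association degrees
    (one per question text).  Each pass collects the paragraphs that no
    remaining paragraph dominates, outputs them as one pareto set, and
    repeats on the unclassified rest until none remain.
    """
    unclassified = set(assoc)
    sets = []
    while unclassified:
        front = [p for p in unclassified
                 if not any(dominates(assoc[q], assoc[p])
                            for q in unclassified if q != p)]
        sets.append(sorted(front))
        unclassified -= set(front)
    return sets

fronts = pareto_sets({"p1": [0.9, 0.8], "p2": [0.5, 0.4], "p3": [0.4, 0.9]})
# p1 dominates p2; p1 and p3 are mutually non-dominated
```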
In this embodiment of the present invention, the text information extracting module 407 includes:
the target pareto set and critical pareto set determining submodule is used for determining a target pareto set and a critical pareto set according to the preset number of pieces of text information and a pareto sequence;
the crowding degree calculation submodule is used for calculating the crowding degree of each preprocessed text paragraph in the critical pareto set;
the crowding degree sequence obtaining submodule is used for sorting the crowding degrees to obtain a crowding degree sequence;
the target text paragraph determining submodule is used for determining a target text paragraph from the critical pareto set according to the target pareto set, the number of pieces of text information and the crowding degree sequence;
and the text information extraction submodule is used for extracting the preprocessed text paragraphs in the target pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
An embodiment of the present invention further provides an electronic device, where the device includes a processor and a memory:
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the text information extraction method of any embodiment of the invention according to instructions in the program code.
The embodiment of the invention also provides a computer-readable storage medium, which is used for storing a program code, and the program code is used for executing the text information extraction method of any embodiment of the invention.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A text information extraction method is characterized by comprising the following steps:
respectively carrying out text preprocessing on a plurality of preset text paragraphs and a plurality of preset problem texts to obtain corresponding preprocessed text paragraphs and corresponding preprocessed problem texts;
calculating a first feature item of the preprocessed text paragraph;
calculating a second characteristic item of the preprocessing question text;
calculating the association degree between the pre-processing text paragraph and the pre-processing question text by adopting the first characteristic item and the second characteristic item;
dividing the preprocessed text paragraphs into a plurality of pareto sets according to the relevance;
sequencing the pareto sets to obtain a pareto sequence;
extracting text information corresponding to a plurality of question texts from the pareto sequence;
wherein the step of calculating the first feature item of the preprocessed text paragraph comprises:
acquiring a first word segmentation vocabulary of the preprocessed text paragraphs; the first word segmentation vocabulary comprises a plurality of first words;
calculating a first word frequency and a first reverse file frequency of each first word;
calculating a first absolute characteristic value of a corresponding first vocabulary by adopting the first word frequency and the first reverse file frequency;
calculating a first relative characteristic value of a corresponding first vocabulary according to a preset text paragraph characteristic item threshold value and the first absolute characteristic value;
calculating a first characteristic item of the preprocessed text paragraph by using the first relative characteristic value of each first vocabulary in the first word segmentation vocabulary;
wherein the step of dividing the plurality of preprocessed text paragraphs into a plurality of pareto sets according to the relevance comprises:
determining an unclassified text paragraph in the preprocessed text paragraphs, and constructing a text set by adopting the unclassified text paragraph;
sequentially traversing the unclassified text paragraphs in the text set, and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
adding all the non-dominated individuals obtained by traversal into the same pareto set;
judging whether unclassified text paragraphs still exist in the text set;
if so, returning to the step of determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs;
if not, outputting all the pareto sets.
2. The method of claim 1, wherein the step of calculating the second feature term of the pre-processing question text comprises:
acquiring a second word segmentation vocabulary of the preprocessed problem text; the second participle vocabulary comprises a plurality of second vocabularies;
calculating a second word frequency and a second reverse file frequency of each second word;
calculating a second absolute characteristic value of a corresponding second vocabulary by adopting the second word frequency and the second reverse file frequency;
calculating a second relative characteristic value of a corresponding second vocabulary according to a preset problem text characteristic item threshold value and the second absolute characteristic value;
and calculating a second characteristic item of the preprocessed problem text by using the second relative characteristic value of each second vocabulary in the second word segmentation vocabulary.
3. The method of claim 1, wherein the step of extracting text information corresponding to a plurality of question texts from the pareto sequence comprises:
determining a target pareto set and a critical pareto set according to a preset text information number and the pareto sequence;
calculating the crowdedness of each preprocessed text paragraph in the critical pareto set;
sequencing the crowdedness to obtain a crowdedness sequence;
determining a target text paragraph from the critical pareto set according to the target pareto set, the text information number and the crowdedness sequence;
and extracting the preprocessed text paragraphs in the target pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
4. A text information extraction device characterized by comprising:
the preprocessing module is used for respectively carrying out text preprocessing on the plurality of preset text paragraphs and the plurality of preset problem texts to obtain corresponding preprocessed text paragraphs and corresponding preprocessed problem texts;
a first feature item calculation module, configured to calculate a first feature item of the preprocessed text paragraph;
the second characteristic item calculation module is used for calculating a second characteristic item of the preprocessed problem text;
the relevancy calculation module is used for calculating the relevancy between the preprocessed text paragraph and the preprocessed problem text by adopting the first characteristic item and the second characteristic item;
the pareto set dividing module is used for dividing the preprocessed text paragraphs into a plurality of pareto sets according to the association degree;
the pareto sequence generation module is used for sequencing the pareto sets to obtain a pareto sequence;
the text information extraction module is used for extracting text information corresponding to a plurality of question texts from the pareto sequence;
wherein the first feature item calculation module comprises:
the first word segmentation vocabulary acquisition sub-module is used for acquiring a first word segmentation vocabulary of the preprocessed text paragraph; the first word segmentation vocabulary comprises a plurality of first words;
the first word frequency and first reverse file frequency calculation submodule is used for calculating the first word frequency and the first reverse file frequency of each first vocabulary;
the first absolute characteristic value calculation submodule is used for calculating a first absolute characteristic value of a corresponding first vocabulary by adopting the first word frequency and the first reverse file frequency;
the first relative characteristic value calculation submodule is used for calculating a first relative characteristic value of a corresponding first vocabulary according to a preset text paragraph characteristic item threshold value and the first absolute characteristic value;
a first feature item calculation sub-module, configured to calculate a first feature item of the preprocessed text paragraph by using the first relative feature value of each first vocabulary in the first word segmentation vocabulary;
wherein, the pareto set division module includes:
the text set constructing sub-module is used for determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs;
the non-dominated individual determination submodule is used for sequentially traversing the unclassified text paragraphs in the text set and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
the adding submodule is used for adding all the non-dominated individuals obtained by traversal into the same pareto set;
the judging submodule is used for judging whether unclassified text paragraphs still exist in the text set;
the return submodule is used for returning to the step of determining the unclassified text paragraphs in the preprocessed text paragraphs and constructing the text set by adopting the unclassified text paragraphs if unclassified text paragraphs still exist;
and the output submodule is used for outputting all the pareto sets if no unclassified text paragraphs remain.
5. The apparatus of claim 4, wherein the second feature term computation module comprises:
the second sub-word vocabulary acquisition sub-module is used for acquiring a second sub-word vocabulary of the preprocessed problem text; the second participle vocabulary comprises a plurality of second vocabularies;
the second word frequency and second reverse file frequency calculation submodule is used for calculating the second word frequency and second reverse file frequency of each second vocabulary;
the second absolute characteristic value calculation submodule is used for calculating a second absolute characteristic value of a corresponding second vocabulary by adopting the second word frequency and the second reverse file frequency;
the second relative feature value calculation sub-module is used for calculating a second relative feature value of a corresponding second vocabulary according to a preset problem text feature item threshold and the second absolute feature value;
and the second characteristic item calculation sub-module is used for calculating a second characteristic item of the preprocessed problem text by adopting the second relative characteristic value of each second vocabulary in the second word segmentation vocabulary.
6. An electronic device, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the text information extraction method according to any one of claims 1 to 3 according to an instruction in the program code.
7. A computer-readable storage medium for storing a program code for executing the text information extraction method according to any one of claims 1 to 3.
CN202110643959.7A 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium Active CN113377911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110643959.7A CN113377911B (en) 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110643959.7A CN113377911B (en) 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113377911A CN113377911A (en) 2021-09-10
CN113377911B true CN113377911B (en) 2022-10-14

Family

ID=77573262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110643959.7A Active CN113377911B (en) 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113377911B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516210A (en) * 2019-08-22 2019-11-29 北京影谱科技股份有限公司 The calculation method and device of text similarity
WO2020177230A1 (en) * 2019-03-07 2020-09-10 平安科技(深圳)有限公司 Medical data classification method and apparatus based on machine learning, and computer device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5124230B2 (en) * 2007-10-18 2013-01-23 ヤマハ発動機株式会社 Parametric multiobjective optimization apparatus, parametric multiobjective optimization method, and parametric multiobjective optimization program
CN110008343A (en) * 2019-04-12 2019-07-12 深圳前海微众银行股份有限公司 File classification method, device, equipment and computer readable storage medium
CN110580252B (en) * 2019-07-30 2021-12-28 中国人民解放军国防科技大学 Space object indexing and query method under multi-objective optimization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020177230A1 (en) * 2019-03-07 2020-09-10 平安科技(深圳)有限公司 Medical data classification method and apparatus based on machine learning, and computer device and storage medium
CN110516210A (en) * 2019-08-22 2019-11-29 北京影谱科技股份有限公司 The calculation method and device of text similarity

Also Published As

Publication number Publication date
CN113377911A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108959431B (en) Automatic label generation method, system, computer readable storage medium and equipment
US20210191509A1 (en) Information recommendation method, device and storage medium
CN106649490B (en) Image retrieval method and device based on depth features
CN109766423A (en) Answering method and device neural network based, storage medium, terminal
CN108228541B (en) Method and device for generating document abstract
CN108804421B (en) Text similarity analysis method and device, electronic equipment and computer storage medium
CN107463548B (en) Phrase mining method and device
CN105975459B (en) A kind of the weight mask method and device of lexical item
US20220414131A1 (en) Text search method, device, server, and storage medium
US20130339373A1 (en) Method and system of filtering and recommending documents
CN110287409B (en) Webpage type identification method and device
CN106844482B (en) Search engine-based retrieval information matching method and device
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
CN110019669B (en) Text retrieval method and device
CN111708909B (en) Video tag adding method and device, electronic equipment and computer readable storage medium
CN111859004A (en) Retrieval image acquisition method, device, equipment and readable storage medium
WO2016095645A1 (en) Stroke input method, device and system
CN109783547B (en) Similarity connection query method and device
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN112818121A (en) Text classification method and device, computer equipment and storage medium
CN112131341A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN107766419B (en) Threshold denoising-based TextRank document summarization method and device
CN111553156B (en) Keyword extraction method, device and equipment
CN113377911B (en) Text information extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant