CN113377911B - Text information extraction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113377911B
CN113377911B (application number CN202110643959.7A)
Authority
CN
China
Prior art keywords
text
preprocessed
pareto
vocabulary
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110643959.7A
Other languages
Chinese (zh)
Other versions
CN113377911A (en)
Inventor
高明
华煌圣
彭政
张文斐
张栩华
王德辉
刘己未
宋强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202110643959.7A
Publication of CN113377911A
Application granted
Publication of CN113377911B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval using metadata automatically derived from the content
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text information extraction method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: performing text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts; calculating a first feature item of each preprocessed text paragraph; calculating a second feature item of each preprocessed question text; calculating the degree of association between the preprocessed text paragraphs and the preprocessed question texts using the first feature items and the second feature items; dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association; sorting the Pareto sets to obtain a Pareto sequence; and extracting, from the Pareto sequence, the text information corresponding to the plurality of question texts. The method improves both the efficiency and the accuracy of extracting text information associated with multiple questions.

Description

Text information extraction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information extraction technologies, and in particular, to a text information extraction method and apparatus, an electronic device, and a storage medium.
Background
With the advent of the big data era, the amount of data on the internet keeps growing, bringing convenience to people's daily life, study, and work. Accordingly, there is an urgent need to extract the information a user requires from large volumes of text data.
At present, when using an internet text retrieval tool, users often need to retrieve results for several input questions at once and extract the text information associated with each question. Existing methods handle each question separately, so text information extraction is inefficient and its accuracy is low.
Disclosure of Invention
The invention provides a text information extraction method and apparatus, an electronic device, and a storage medium, which are used to solve the technical problem that existing text information extraction methods, when retrieving a plurality of input questions and extracting the text information closely associated with each question, suffer from low extraction efficiency and low extraction accuracy.
The invention provides a text information extraction method, which comprises the following steps:
performing text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts;
calculating a first feature item of the preprocessed text paragraph;
calculating a second feature item of the preprocessed question text;
calculating the degree of association between the preprocessed text paragraph and the preprocessed question text using the first feature item and the second feature item;
dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association;
sorting the Pareto sets to obtain a Pareto sequence;
and extracting the text information corresponding to the plurality of question texts from the Pareto sequence.
Optionally, the step of calculating a first feature item of the preprocessed text paragraph comprises:
acquiring a first word segmentation vocabulary of the preprocessed text paragraph, the first word segmentation vocabulary comprising a plurality of first words;
calculating a first word frequency and a first inverse document frequency of each first word;
calculating a first absolute feature value of the corresponding first word using the first word frequency and the first inverse document frequency;
calculating a first relative feature value of the corresponding first word according to a preset text paragraph feature item threshold and the first absolute feature value;
and calculating the first feature item of the preprocessed text paragraph using the first relative feature value of each first word in the first word segmentation vocabulary.
Optionally, the step of calculating a second feature item of the preprocessed question text comprises:
acquiring a second word segmentation vocabulary of the preprocessed question text, the second word segmentation vocabulary comprising a plurality of second words;
calculating a second word frequency and a second inverse document frequency of each second word;
calculating a second absolute feature value of the corresponding second word using the second word frequency and the second inverse document frequency;
calculating a second relative feature value of the corresponding second word according to a preset question text feature item threshold and the second absolute feature value;
and calculating the second feature item of the preprocessed question text using the second relative feature value of each second word in the second word segmentation vocabulary.
Optionally, the step of dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association comprises:
determining the unclassified text paragraphs among the preprocessed text paragraphs, and constructing a text set from the unclassified text paragraphs;
sequentially traversing the unclassified text paragraphs in the text set, and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
adding all the non-dominated individuals obtained by the traversal to the same Pareto set;
judging whether any unclassified text paragraph remains in the text set;
if yes, returning to the step of determining the unclassified text paragraphs among the preprocessed text paragraphs and constructing a text set from the unclassified text paragraphs;
if not, outputting all the Pareto sets.
Optionally, the step of extracting the text information corresponding to the plurality of question texts from the Pareto sequence comprises:
determining a target Pareto set and a critical Pareto set according to a preset text information number and the Pareto sequence;
calculating the crowding degree of each preprocessed text paragraph in the critical Pareto set;
sorting the crowding degrees to obtain a crowding degree sequence;
determining target text paragraphs from the critical Pareto set according to the target Pareto set, the text information number, and the crowding degree sequence;
and extracting the preprocessed text paragraphs in the target Pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
The invention further provides a text information extraction apparatus, comprising:
a preprocessing module, configured to perform text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts;
a first feature item calculation module, configured to calculate a first feature item of the preprocessed text paragraph;
a second feature item calculation module, configured to calculate a second feature item of the preprocessed question text;
a degree-of-association calculation module, configured to calculate the degree of association between the preprocessed text paragraph and the preprocessed question text using the first feature item and the second feature item;
a Pareto set division module, configured to divide the preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association;
a Pareto sequence generation module, configured to sort the Pareto sets to obtain a Pareto sequence;
and a text information extraction module, configured to extract the text information corresponding to the plurality of question texts from the Pareto sequence.
Optionally, the first feature item calculation module comprises:
a first word segmentation vocabulary acquisition submodule, configured to acquire a first word segmentation vocabulary of the preprocessed text paragraph, the first word segmentation vocabulary comprising a plurality of first words;
a first word frequency and first inverse document frequency calculation submodule, configured to calculate the first word frequency and the first inverse document frequency of each first word;
a first absolute feature value calculation submodule, configured to calculate a first absolute feature value of the corresponding first word using the first word frequency and the first inverse document frequency;
a first relative feature value calculation submodule, configured to calculate a first relative feature value of the corresponding first word according to a preset text paragraph feature item threshold and the first absolute feature value;
and a first feature item calculation submodule, configured to calculate the first feature item of the preprocessed text paragraph using the first relative feature value of each first word in the first word segmentation vocabulary.
Optionally, the second feature item calculation module comprises:
a second word segmentation vocabulary acquisition submodule, configured to acquire a second word segmentation vocabulary of the preprocessed question text, the second word segmentation vocabulary comprising a plurality of second words;
a second word frequency and second inverse document frequency calculation submodule, configured to calculate the second word frequency and the second inverse document frequency of each second word;
a second absolute feature value calculation submodule, configured to calculate a second absolute feature value of the corresponding second word using the second word frequency and the second inverse document frequency;
a second relative feature value calculation submodule, configured to calculate a second relative feature value of the corresponding second word according to a preset question text feature item threshold and the second absolute feature value;
and a second feature item calculation submodule, configured to calculate the second feature item of the preprocessed question text using the second relative feature value of each second word in the second word segmentation vocabulary.
The invention further provides an electronic device, comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
and the processor is configured to execute, according to instructions in the program code, the text information extraction method described in any one of the above.
The invention further provides a computer-readable storage medium, configured to store program code for executing the text information extraction method described in any one of the above.
As can be seen from the above technical solutions, the invention has the following advantages: text preprocessing is performed on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts; a first feature item of the preprocessed text paragraph and a second feature item of the preprocessed question text are calculated; the degree of association between the preprocessed text paragraphs and the preprocessed question texts is calculated using the first and second feature items; the preprocessed text paragraphs are divided into a plurality of Pareto sets according to the degree of association; the Pareto sets are sorted to obtain a Pareto sequence; and the text information corresponding to the plurality of question texts is extracted from the Pareto sequence. In this way, the efficiency and accuracy of extracting text information associated with multiple questions are improved.
Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating steps of a text information extraction method according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating the process of dividing preprocessed text paragraphs into Pareto sets according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a text information extraction method according to an embodiment of the present invention;
fig. 4 is a block diagram of a text information extraction apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention provide a text information extraction method and apparatus, an electronic device, and a storage medium, which are used to solve the technical problem that existing text information extraction methods, when retrieving a plurality of input questions and extracting the text information closely associated with each question, suffer from low extraction efficiency and low extraction accuracy.
To make the objects, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the embodiments described below are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a text information extraction method according to an embodiment of the present invention.
The text information extraction method provided by the invention specifically comprises the following steps:
Step 101, performing text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts.
In embodiments of the present invention, the text preprocessing may include, but is not limited to, conventional text processing steps such as filtering special or useless symbols, word segmentation (separating the text into individual words), stop-word filtering, and word frequency statistics.
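The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tokenizer is a simple regex and the stop-word list is an invented English placeholder (the patent fixes neither).

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a full list
# for the target language.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def preprocess(text):
    """Filter useless symbols, segment into words, drop stop words,
    and return the tokens together with their frequency counts."""
    # Keeping only word characters stands in for symbol filtering.
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens, Counter(tokens)

tokens, counts = preprocess("The quick brown fox jumps over the lazy dog.")
print(tokens)  # stop words removed, text lowercased
```

Each text paragraph and question text would be passed through the same pipeline before feature items are computed.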
Step 102, calculating a first feature item of the preprocessed text paragraph.
After the text paragraphs and question texts have been preprocessed, the first feature item of each preprocessed text paragraph can be extracted.
In one example, calculating the first feature item of the preprocessed text paragraph may be implemented by the following substeps:
S21, acquiring a first word segmentation vocabulary of the preprocessed text paragraph, the first word segmentation vocabulary comprising a plurality of first words;
S22, calculating a first word frequency and a first inverse document frequency of each first word;
S23, calculating a first absolute feature value of the corresponding first word using the first word frequency and the first inverse document frequency;
S24, calculating a first relative feature value of the corresponding first word according to a preset text paragraph feature item threshold and the first absolute feature value;
S25, calculating the first feature item of the preprocessed text paragraph using the first relative feature value of each first word in the first word segmentation vocabulary.
In a specific implementation, the result of word segmentation during text preprocessing may be used as the first word segmentation vocabulary; for each first word in the vocabulary, the first word frequency and first inverse document frequency of each preprocessed text paragraph are computed, and from these the first absolute feature value of each preprocessed text paragraph is calculated.
The first word frequency can be calculated by the following formula:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word $w_i$ in the preprocessed text paragraph $d_j$, and $\sum_{k} n_{k,j}$ is the total number of word occurrences in that paragraph.
The first absolute feature value of each first word can be calculated by the following formula:

$$E_{i,j} = \gamma\!\left(\log\left(1 + tf_{i,j}\right)\right) \cdot \log\frac{|D|}{1 + |\{\, d : w_i \in d \,\}|}$$

where $E_{i,j}$ is the first absolute feature value of the first word $w_i$ in paragraph $d_j$; $\gamma$ is a normalization function, which avoids the inaccurate word frequencies that documents of very different lengths would otherwise cause; taking the logarithm of the word frequency $tf_{i,j}$ effectively avoids its linear growth; the inverse document frequency $idf_i$ is likewise computed with a logarithm to avoid inaccuracy caused by linear growth, and the $+1$ in its denominator avoids a zero denominator; $|D|$ is the total number of preprocessed text paragraphs.
Then, with a preset text paragraph feature item threshold $T_d$, the first relative feature value of a first word with respect to the preprocessed text paragraph is calculated by the following formula:

$$e_{i,j} = \begin{cases} E_{i,j}, & E_{i,j} \ge T_d \\ 0, & E_{i,j} < T_d \end{cases}$$

where $T_d$ is the text paragraph feature item threshold and $e_{i,j}$ is the first relative feature value. The first words whose first relative feature value is not 0 constitute the first feature item of the preprocessed text paragraph.
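The word-frequency, inverse-document-frequency, and thresholding steps can be sketched together as below. This is a hedged reconstruction: the normalization function γ is not fixed by the text, so the identity is used as a stand-in, and `feature_items` and its `threshold` parameter are illustrative names.

```python
import math
from collections import Counter

def feature_items(paragraphs, threshold):
    """Compute thresholded TF-IDF feature values for tokenized paragraphs,
    mirroring the absolute/relative feature values described above.
    `paragraphs` is a list of token lists; gamma is taken as identity."""
    n_docs = len(paragraphs)
    df = Counter()                       # document frequency of each word
    for tokens in paragraphs:
        df.update(set(tokens))
    features = []
    for tokens in paragraphs:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for word, n in counts.items():
            tf = n / total                             # first word frequency
            idf = math.log(n_docs / (1 + df[word]))    # +1 avoids a zero denominator
            absolute = math.log(1 + tf) * idf          # first absolute feature value
            # first relative feature value: zero out values below the threshold
            vec[word] = absolute if absolute >= threshold else 0.0
        features.append(vec)
    return features

docs = [["power", "grid", "fault"], ["power", "outage"], ["text", "extraction"]]
feats = feature_items(docs, threshold=0.0)
```

Words whose value survives the threshold form that paragraph's feature item; the same routine applies to the question texts with their own threshold.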
Step 103, calculating a second feature item of the preprocessed question text.
Similarly, the embodiment of the invention extracts a second feature item of the preprocessed question text, through the following specific steps:
S31, acquiring a second word segmentation vocabulary of the preprocessed question text, the second word segmentation vocabulary comprising a plurality of second words;
S32, calculating a second word frequency and a second inverse document frequency of each second word;
S33, calculating a second absolute feature value of the corresponding second word using the second word frequency and the second inverse document frequency;
S34, calculating a second relative feature value of the corresponding second word according to a preset question text feature item threshold and the second absolute feature value;
S35, calculating the second feature item of the preprocessed question text using the second relative feature value of each second word in the second word segmentation vocabulary.
In the embodiment of the present invention, the process of calculating the second feature item of the preprocessed question text is similar to that of step 102; reference may be made to the description of step 102, which is not repeated here.
It should be noted that, after the first feature items of the preprocessed text paragraphs and the second feature items of the preprocessed question texts are obtained, the words in the vocabulary may be used as columns and the preprocessed question texts and preprocessed text paragraphs as rows, so as to obtain the feature vectors of the preprocessed question texts and the preprocessed text paragraphs.
Step 104, calculating the degree of association between the preprocessed text paragraphs and the preprocessed question texts using the first feature items and the second feature items.
In the embodiment of the present invention, after the first feature items and the second feature items are obtained, the degree of association between each preprocessed text paragraph and each preprocessed question text can be calculated.
In one example, a vector space model may be used together with the feature vectors to obtain the degree of association between a preprocessed text paragraph and a question text.
The degree of association between a preprocessed question text and a preprocessed text paragraph may be calculated by, but is not limited to, similarity measures such as cosine similarity, Manhattan distance, and Euclidean distance.
In one example, the cosine similarity can be calculated by the following formula:
$$\mathrm{CosineSim}(q, d) = \frac{\sum_{i=1}^{z} w_{i,q}\, w_{i,d}}{\sqrt{\sum_{i=1}^{z} w_{i,q}^{2}}\,\sqrt{\sum_{i=1}^{z} w_{i,d}^{2}}}$$

where $d$ is the feature vector of the preprocessed text paragraph; $q$ is the feature vector of the preprocessed question text; $\mathrm{CosineSim}(q, d)$ is the similarity between the preprocessed text paragraph $d$ and the preprocessed question text $q$; $z$ is the total number of first and second feature items; $w_{i,d}$ is the value of the $i$-th dimension of the paragraph feature vector $d$; and $w_{i,q}$ is the value of the $i$-th dimension of the question feature vector $q$.
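The cosine similarity formula above translates directly into code; this is a minimal sketch over plain lists of feature values:

```python
import math

def cosine_sim(q, d):
    """Cosine similarity between two equal-length feature vectors,
    as in the degree-of-association formula above."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # no shared feature items at all
    return dot / (norm_q * norm_d)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```

Manhattan or Euclidean distance could be substituted here without changing the rest of the pipeline, as the text notes.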
Step 105, dividing the plurality of preprocessed text paragraphs into a plurality of Pareto sets according to the degree of association.
In the embodiment of the invention, for two preprocessed text paragraphs $d_i$ and $d_j$, if the similarity of $d_i$ to every preprocessed question text is higher than that of $d_j$, then $d_i$ is said to "dominate" $d_j$. A Pareto set is a non-dominated set: the paragraph texts within the same Pareto set do not dominate one another.
In one example, as shown in Fig. 2, step 105 may include the following substeps:
S51, determining the unclassified text paragraphs among the preprocessed text paragraphs, and constructing a text set from the unclassified text paragraphs;
S52, sequentially traversing the unclassified text paragraphs in the text set, and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
S53, adding all the non-dominated individuals obtained by the traversal to the same Pareto set;
S54, judging whether any unclassified text paragraph remains in the text set;
S55, if yes, returning to the step of determining the unclassified text paragraphs among the preprocessed text paragraphs and constructing a text set from them;
S56, if not, outputting all the Pareto sets.
In a specific implementation, a Pareto set $F_k$, $k = 1$, may first be initialized, and a text set containing the unclassified text paragraphs created;
traverse all unclassified text paragraphs $d_i$, starting from $i = 1$;
for every other unclassified text paragraph $d_j$, $j \ne i$, compare the dominance relationship between $d_i$ and $d_j$;
if no $d_j$ dominates $d_i$, mark $d_i$ as a non-dominated individual and add it to the set $F_k$;
let $i = i + 1$ and continue traversing the unclassified text paragraphs, adding non-dominated individuals to $F_k$, until $i = s$;
remove the selected non-dominated individuals, let $k = k + 1$, and repeat the above steps until every preprocessed text paragraph has been assigned to some Pareto set,
where $s$ is the total number of text paragraphs.
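The iterative front-peeling described above can be sketched as follows. Note that a standard Pareto dominance test (at least as good on every question, strictly better on at least one) is used here; the patent's own comparison, which requires strictly higher similarity on every question, is analogous but stricter.

```python
def dominates(a, b):
    """a dominates b: at least as similar on every question,
    strictly more similar on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_sets(similarities):
    """Split paragraphs (rows of per-question similarities) into
    successive Pareto (non-dominated) sets F_1, F_2, ..."""
    remaining = list(range(len(similarities)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(similarities[j], similarities[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# Three paragraphs scored against two questions.
sims = [[0.9, 0.8], [0.5, 0.4], [0.3, 0.9]]
print(pareto_sets(sims))  # [[0, 2], [1]]
```

Paragraph 0 dominates paragraph 1 on both questions, while paragraphs 0 and 2 trade off against each other, so they share the first Pareto set.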
Step 106, sorting the Pareto sets to obtain a Pareto sequence.
The order of the Pareto sets obtained by the non-dominated sorting of the preprocessed text paragraphs is the Pareto sequence, in which the preprocessed text paragraphs of an earlier Pareto set dominate those of a later Pareto set.
Step 107, extracting the text information corresponding to the plurality of question texts from the Pareto sequence.
In the embodiment of the present invention, after the Pareto sequence is obtained, the text information corresponding to the plurality of question texts can be extracted from it.
In one example, step 107 may include the following substeps:
S71, determining a target Pareto set and a critical Pareto set according to a preset text information number and the Pareto sequence;
S72, calculating the crowding degree of each preprocessed text paragraph in the critical Pareto set;
S73, sorting the crowding degrees to obtain a crowding degree sequence;
S74, determining target text paragraphs from the critical Pareto set according to the target Pareto set, the text information number, and the crowding degree sequence;
S75, determining the preprocessed text paragraphs in the target Pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
In practical applications, each preprocessed text paragraph in the same Pareto set can be regarded as a node, and the sparsity of the nodes around it as its "crowding degree". Paragraphs whose nodes are crowded together are likely to contain repeated content, so sparser preprocessed text paragraphs are preferred during selection. The crowding degree can be calculated by the following formula:
$$Cwd(d_i) = \sum_{j=1}^{z} \left|\, w_{j,\,d_{i+1}} - w_{j,\,d_{i-1}} \,\right|$$

where $d_i$ is a preprocessed text paragraph and $|w_{j,\,d_{i+1}} - w_{j,\,d_{i-1}}|$ is the distance between the left and right neighboring nodes of $d_i$ in the $j$-th feature-vector dimension. The larger this distance, the sparser the preprocessed text paragraph nodes near $d_i$, and the less crowded it is. Finally, the preprocessed text paragraph nodes in the same Pareto set are sorted in descending order of their crowding degree $Cwd(d_i)$.
It should be noted that, in practical applications, crowding-degree sorting need not be performed on all the Pareto sets, for the following reason:
since k ordered pareto sets F are obtained if necessary i ,i∈[1,k]And the number of the preprocessed text paragraphs in each pareto set is S i ,i∈[1,k]. Then there must be an integer t, t<k and are such that
Figure BDA0003108280810000103
Wherein n is the number of text messages, i.e. the number of pre-extracted text messages. That is, there is an integer t, so that n pieces of text information cannot be obtained enough after obtaining the first t preprocessed text paragraphs in the pareto sets, and the number of n text information exceeds the number of text paragraphs in the t +1 st pareto set. Therefore, the preprocessed text paragraphs in the t pareto sets are text information that is inevitably to be selected, and therefore the preprocessed text paragraphs in the t pareto sets do not need to be subjected to congestion degree sorting, and only the text paragraphs in the t +1 th pareto set need to be subjected to congestion degree sorting, that is, only the text paragraphs in the F +1 th pareto set need to be subjected to congestion degree sorting t+1 Pareto sets sort the crowdedness. Wherein, F is t+1 Each pareto set is a critical pareto set.
Finally, besides selecting all the preprocessed text paragraphs in the first t pareto sets, the top-ranked

n − (S_1 + S_2 + … + S_t)

preprocessed text paragraphs in the crowding-degree ordering of the (t + 1)-th pareto set also need to be selected, so that the n preprocessed text paragraphs finally obtained constitute the final text information extraction result.
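The selection procedure just described (take the first t pareto sets whole, then fill up from the crowding-degree ordering of the critical set) can be sketched as follows; the data layout and names are illustrative assumptions:

```python
def select_paragraphs(pareto_sets, crowding, n):
    """Pick n paragraphs: whole sets F_1..F_t, then the top of F_{t+1}.

    `pareto_sets` is the ordered list of pareto sets (lists of paragraph
    ids); `crowding` maps a paragraph id to its crowding degree.  Sets
    are taken whole while they still fit; the first set that no longer
    fits is the critical set, which is sorted in descending crowding
    order so only the remainder n - (S_1 + ... + S_t) is taken from it.
    """
    selected = []
    for fs in pareto_sets:
        if len(selected) + len(fs) <= n:
            selected.extend(fs)                 # F_1 .. F_t taken whole
            if len(selected) == n:
                break
        else:
            remainder = n - len(selected)       # n - (S_1 + ... + S_t)
            ranked = sorted(fs, key=lambda p: crowding[p], reverse=True)
            selected.extend(ranked[:remainder])
            break
    return selected

sets = [["p1", "p2"], ["p3", "p4", "p5"], ["p6"]]
cwd = {"p3": 0.9, "p4": 0.1, "p5": 0.5}
result = select_paragraphs(sets, cwd, 4)   # -> ['p1', 'p2', 'p3', 'p5']
```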
The method performs text preprocessing on a plurality of preset text paragraphs and a plurality of preset question texts respectively to obtain corresponding preprocessed text paragraphs and corresponding preprocessed question texts; calculates a first feature item of each preprocessed text paragraph; calculates a second feature item of each preprocessed question text; calculates the association degree between the preprocessed text paragraphs and the preprocessed question texts by using the first feature items and the second feature items; divides the preprocessed text paragraphs into a plurality of pareto sets according to the association degrees; and sorts the pareto sets to obtain a pareto sequence, so as to extract the text information corresponding to the plurality of question texts from the pareto sequence. Therefore, the extraction efficiency and extraction accuracy of text information associated with multiple questions are improved.
For ease of understanding, the following is illustrated by specific examples:
referring to fig. 3, fig. 3 is a schematic flow chart of a text information extraction method according to an embodiment of the present invention.
As shown in fig. 3, the invention first performs preprocessing and feature item calculation on the text paragraphs and question texts, and then builds a feature vector table for the preprocessed question texts and preprocessed text paragraphs, with the vocabulary words of all text paragraphs and all question texts as columns and each preprocessed text paragraph and preprocessed question text as a row; the values of each row are assembled to obtain the feature vector of each preprocessed question text and preprocessed text paragraph. The association degree between each preprocessed text paragraph and each preprocessed question text is then calculated based on their feature vectors, and the preprocessed text paragraphs undergo non-dominated sorting based on the association degrees to obtain a plurality of pareto sets containing different preprocessed text paragraphs. The preprocessed text paragraphs within the same pareto set are sorted by crowding degree, and the preset n preprocessed text paragraphs are then obtained as the text information extraction result according to the ordering of the pareto sets and the crowding-degree ordering within each pareto set.
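A minimal sketch of this feature-vector table, assuming a TF-IDF-style weight stands in for the patent's feature items (the exact feature-item formulas are defined in the patent's earlier sections and are not reproduced here):

```python
import math
from collections import Counter

def feature_vectors(docs):
    """Build a vocabulary-column feature-vector table.

    `docs` is a list of already-tokenised texts (paragraphs or question
    texts).  Columns are the union vocabulary of all texts, rows are the
    individual texts, and each cell is a TF-IDF-style weight, used here
    as an illustrative stand-in for the patent's feature items.
    """
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # document frequency of each vocabulary word
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    table = []
    for d in docs:
        counts = Counter(d)
        row = []
        for w in vocab:
            tf = counts[w] / len(d)            # term (word) frequency
            idf = math.log(n_docs / df[w])     # inverse document frequency
            row.append(tf * idf)
        table.append(row)
    return vocab, table

vocab, table = feature_vectors([["power", "grid"], ["power", "fault"]])
```

Each row of `table` is then the feature vector used to compute association degrees.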
Referring to fig. 4, fig. 4 is a block diagram of a text information extraction device according to an embodiment of the present invention.
The embodiment of the invention provides a text information extraction device, which comprises:
the preprocessing module 401 is configured to perform text preprocessing on a plurality of preset text paragraphs and a plurality of preset problem texts, respectively, to obtain corresponding preprocessed text paragraphs and corresponding preprocessed problem texts;
a first feature item calculating module 402, configured to calculate a first feature item of the preprocessed text paragraph;
a second feature item calculating module 403, configured to calculate a second feature item of the pre-processing problem text;
the association degree calculating module 404 is configured to calculate an association degree between the preprocessed text paragraph and the preprocessed question text by using the first feature item and the second feature item;
a pareto set dividing module 405, configured to divide the plurality of preprocessed text paragraphs into a plurality of pareto sets according to the association degree;
a pareto sequence generation module 406, configured to sort the pareto sets to obtain a pareto sequence;
the text information extracting module 407 is configured to extract text information corresponding to a plurality of question texts from the pareto sequence.
In this embodiment of the present invention, the first feature item calculating module 402 includes:
the first word segmentation vocabulary acquisition submodule is used for acquiring a first word segmentation vocabulary of the preprocessed text paragraph; the first word segmentation vocabulary comprises a plurality of first words;
the first word frequency and first reverse file frequency calculation submodule is used for calculating the first word frequency and the first reverse file frequency of each first vocabulary;
the first absolute characteristic value calculation submodule is used for calculating a first absolute characteristic value of a corresponding first vocabulary by adopting the first word frequency and the first reverse file frequency;
the first relative characteristic value calculation submodule is used for calculating a first relative characteristic value of a corresponding first vocabulary according to a preset text paragraph characteristic item threshold value and a first absolute characteristic value;
and the first characteristic item calculation submodule is used for calculating a first characteristic item of the preprocessed text paragraph by adopting the first relative characteristic value of each first vocabulary in the first word segmentation vocabulary.
In this embodiment of the present invention, the second feature item calculating module 403 includes:
the second participle vocabulary acquisition sub-module is used for acquiring a second participle vocabulary of the preprocessed problem text; the second word segmentation vocabulary comprises a plurality of second words;
the second word frequency and second reverse file frequency calculation submodule is used for calculating the second word frequency and second reverse file frequency of each second vocabulary;
the second absolute characteristic value calculation submodule is used for calculating a second absolute characteristic value of a corresponding second vocabulary by adopting a second word frequency and a second reverse file frequency;
the second relative characteristic value calculation submodule is used for calculating a second relative characteristic value of a corresponding second vocabulary according to a preset problem text characteristic item threshold value and a second absolute characteristic value;
and the second characteristic item calculation submodule is used for calculating a second characteristic item of the preprocessed problem text by adopting the second relative characteristic value of each second vocabulary in the second word segmentation vocabulary.
In this embodiment of the present invention, the pareto set dividing module 405 includes:
the text set constructing sub-module is used for determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs;
the non-dominated individual determination submodule is used for sequentially traversing the unclassified text paragraphs in the text set and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
the adding submodule is used for adding all the non-dominated individuals obtained by traversal into the same pareto set;
the judgment submodule is used for judging whether unclassified text paragraphs still exist in the text set;
the return submodule is used for returning to the step of determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs if unclassified text paragraphs still exist;
and the output submodule is used for outputting all the pareto sets if no unclassified text paragraphs remain.
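The pareto-set dividing loop above amounts to non-dominated sorting over the paragraphs' association-degree vectors. A minimal sketch, assuming one association degree per question text and the usual Pareto dominance relation (both are assumptions, since the patent defines the dominance relation elsewhere):

```python
def dominates(a, b):
    """a dominates b if a is at least as relevant for every question
    text and strictly more relevant for at least one (an assumed reading
    of the dominance relation over association-degree vectors)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_sets(assoc):
    """Split paragraphs into ordered pareto sets by non-dominated sorting.

    `assoc` maps a paragraph id to its vector of association degrees
    (one per question text).  Each pass collects the paragraphs that no
    remaining paragraph dominates, outputs them as one pareto set, and
    repeats on the unclassified rest until none remain.
    """
    unclassified = set(assoc)
    sets = []
    while unclassified:
        front = [p for p in unclassified
                 if not any(dominates(assoc[q], assoc[p])
                            for q in unclassified if q != p)]
        sets.append(sorted(front))
        unclassified -= set(front)
    return sets

fronts = pareto_sets({"p1": [0.9, 0.8], "p2": [0.5, 0.4], "p3": [0.4, 0.9]})
# p1 dominates p2; p1 and p3 are mutually non-dominated
```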
In this embodiment of the present invention, the text information extracting module 407 includes:
the target pareto set and critical pareto set determining submodule is used for determining a target pareto set and a critical pareto set according to the preset number of pieces of text information and a pareto sequence;
the crowding degree calculation submodule is used for calculating the crowding degree of each preprocessed text paragraph in the critical pareto set;
the crowding degree sequence obtaining submodule is used for sorting the crowding degrees to obtain a crowding degree sequence;
the target text paragraph determining submodule is used for determining a target text paragraph from the critical pareto set according to the target pareto set, the number of pieces of text information and the crowding degree sequence;
and the text information extraction submodule is used for extracting the preprocessed text paragraphs in the target pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
An embodiment of the present invention further provides an electronic device, where the device includes a processor and a memory:
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the text information extraction method of any embodiment of the invention according to instructions in the program code.
The embodiment of the invention also provides a computer-readable storage medium, which is used for storing a program code, and the program code is used for executing the text information extraction method of any embodiment of the invention.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A text information extraction method is characterized by comprising the following steps:
respectively carrying out text preprocessing on a plurality of preset text paragraphs and a plurality of preset problem texts to obtain corresponding preprocessed text paragraphs and corresponding preprocessed problem texts;
calculating a first feature item of the preprocessed text paragraph;
calculating a second characteristic item of the preprocessing question text;
calculating the association degree between the pre-processing text paragraph and the pre-processing question text by adopting the first characteristic item and the second characteristic item;
dividing the preprocessed text paragraphs into a plurality of pareto sets according to the relevance;
sequencing the pareto sets to obtain a pareto sequence;
extracting text information corresponding to a plurality of question texts from the pareto sequence;
wherein the step of calculating the first feature item of the preprocessed text paragraph comprises:
acquiring a first word segmentation vocabulary of the preprocessed text paragraphs; the first word segmentation vocabulary comprises a plurality of first words;
calculating a first word frequency and a first reverse file frequency of each first word;
calculating a first absolute characteristic value of a corresponding first vocabulary by adopting the first word frequency and the first reverse file frequency;
calculating a first relative characteristic value of a corresponding first vocabulary according to a preset text paragraph characteristic item threshold value and the first absolute characteristic value;
calculating a first characteristic item of the preprocessed text paragraph by using the first relative characteristic value of each first vocabulary in the first word segmentation vocabulary;
wherein the step of dividing the plurality of preprocessed text paragraphs into a plurality of pareto sets according to the relevance comprises:
determining an unclassified text paragraph in the preprocessed text paragraphs, and constructing a text set by adopting the unclassified text paragraph;
sequentially traversing the unclassified text paragraphs in the text set, and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
adding all the non-dominated individuals obtained by traversal into the same pareto set;
judging whether unclassified text paragraphs still exist in the text set;
if so, returning to the step of determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs;
if not, outputting all the pareto sets.
2. The method of claim 1, wherein the step of calculating the second feature term of the pre-processing question text comprises:
acquiring a second word segmentation vocabulary of the preprocessed problem text; the second participle vocabulary comprises a plurality of second vocabularies;
calculating a second word frequency and a second reverse file frequency of each second word;
calculating a second absolute characteristic value of a corresponding second vocabulary by adopting the second word frequency and the second reverse file frequency;
calculating a second relative characteristic value of a corresponding second vocabulary according to a preset problem text characteristic item threshold value and the second absolute characteristic value;
and calculating a second characteristic item of the preprocessed problem text by using the second relative characteristic value of each second vocabulary in the second word segmentation vocabulary.
3. The method of claim 1, wherein the step of extracting text information corresponding to a plurality of question texts from the pareto sequence comprises:
determining a target pareto set and a critical pareto set according to a preset text information number and the pareto sequence;
calculating the crowdedness of each preprocessed text paragraph in the critical pareto set;
sequencing the crowdedness to obtain a crowdedness sequence;
determining a target text paragraph from the critical pareto set according to the target pareto set, the text information number and the crowdedness sequence;
and extracting the preprocessed text paragraphs in the target pareto set and the target text paragraphs as the text information to be extracted for the plurality of question texts.
4. A text information extraction device characterized by comprising:
the preprocessing module is used for respectively carrying out text preprocessing on the plurality of preset text paragraphs and the plurality of preset problem texts to obtain corresponding preprocessed text paragraphs and corresponding preprocessed problem texts;
a first feature item calculation module, configured to calculate a first feature item of the preprocessed text paragraph;
the second characteristic item calculation module is used for calculating a second characteristic item of the preprocessed problem text;
the relevancy calculation module is used for calculating the relevancy between the preprocessed text paragraph and the preprocessed problem text by adopting the first characteristic item and the second characteristic item;
the pareto set dividing module is used for dividing the preprocessed text paragraphs into a plurality of pareto sets according to the association degree;
the pareto sequence generation module is used for sequencing the pareto sets to obtain a pareto sequence;
the text information extraction module is used for extracting text information corresponding to a plurality of question texts from the pareto sequence;
wherein the first feature item calculation module comprises:
the first word segmentation vocabulary acquisition sub-module is used for acquiring a first word segmentation vocabulary of the preprocessed text paragraph; the first word segmentation vocabulary comprises a plurality of first words;
the first word frequency and first reverse file frequency calculation submodule is used for calculating the first word frequency and the first reverse file frequency of each first vocabulary;
the first absolute characteristic value calculation submodule is used for calculating a first absolute characteristic value of a corresponding first vocabulary by adopting the first word frequency and the first reverse file frequency;
the first relative characteristic value calculation submodule is used for calculating a first relative characteristic value of a corresponding first vocabulary according to a preset text paragraph characteristic item threshold value and the first absolute characteristic value;
a first feature item calculation sub-module, configured to calculate a first feature item of the preprocessed text paragraph by using the first relative feature value of each first vocabulary in the first word segmentation vocabulary;
wherein, the pareto set division module includes:
the text set constructing sub-module is used for determining unclassified text paragraphs in the preprocessed text paragraphs and constructing a text set by adopting the unclassified text paragraphs;
the non-dominated individual determination submodule is used for sequentially traversing the unclassified text paragraphs in the text set and determining the unclassified text paragraphs that are not dominated as non-dominated individuals;
the adding submodule is used for adding all the non-dominated individuals obtained by traversal into the same pareto set;
the judging submodule is used for judging whether unclassified text paragraphs still exist in the text set;
the return submodule is used for returning to the step of determining the unclassified text paragraphs in the preprocessed text paragraphs and constructing the text set by adopting the unclassified text paragraphs if unclassified text paragraphs still exist;
and the output submodule is used for outputting all the pareto sets if no unclassified text paragraphs remain.
5. The apparatus of claim 4, wherein the second feature term computation module comprises:
the second sub-word vocabulary acquisition sub-module is used for acquiring a second sub-word vocabulary of the preprocessed problem text; the second participle vocabulary comprises a plurality of second vocabularies;
the second word frequency and second reverse file frequency calculation submodule is used for calculating the second word frequency and second reverse file frequency of each second vocabulary;
the second absolute characteristic value calculation submodule is used for calculating a second absolute characteristic value of a corresponding second vocabulary by adopting the second word frequency and the second reverse file frequency;
the second relative feature value calculation sub-module is used for calculating a second relative feature value of a corresponding second vocabulary according to a preset problem text feature item threshold and the second absolute feature value;
and the second characteristic item calculation sub-module is used for calculating a second characteristic item of the preprocessed problem text by adopting the second relative characteristic value of each second vocabulary in the second word segmentation vocabulary.
6. An electronic device, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the text information extraction method according to any one of claims 1 to 3 according to an instruction in the program code.
7. A computer-readable storage medium for storing a program code for executing the text information extraction method according to any one of claims 1 to 3.
CN202110643959.7A 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium Active CN113377911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110643959.7A CN113377911B (en) 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110643959.7A CN113377911B (en) 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113377911A CN113377911A (en) 2021-09-10
CN113377911B true CN113377911B (en) 2022-10-14

Family

ID=77573262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110643959.7A Active CN113377911B (en) 2021-06-09 2021-06-09 Text information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113377911B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516210A (en) * 2019-08-22 2019-11-29 北京影谱科技股份有限公司 The calculation method and device of text similarity
WO2020177230A1 (en) * 2019-03-07 2020-09-10 平安科技(深圳)有限公司 Medical data classification method and apparatus based on machine learning, and computer device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5124230B2 (en) * 2007-10-18 2013-01-23 ヤマハ発動機株式会社 Parametric multiobjective optimization apparatus, parametric multiobjective optimization method, and parametric multiobjective optimization program
CN110008343A (en) * 2019-04-12 2019-07-12 深圳前海微众银行股份有限公司 File classification method, device, equipment and computer readable storage medium
CN110580252B (en) * 2019-07-30 2021-12-28 中国人民解放军国防科技大学 Space object indexing and query method under multi-objective optimization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020177230A1 (en) * 2019-03-07 2020-09-10 平安科技(深圳)有限公司 Medical data classification method and apparatus based on machine learning, and computer device and storage medium
CN110516210A (en) * 2019-08-22 2019-11-29 北京影谱科技股份有限公司 The calculation method and device of text similarity

Also Published As

Publication number Publication date
CN113377911A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108959431B (en) Automatic label generation method, system, computer readable storage medium and equipment
US20210191509A1 (en) Information recommendation method, device and storage medium
CN106649490B (en) Image retrieval method and device based on depth features
CN109766423A (en) Answering method and device neural network based, storage medium, terminal
CN108228541B (en) Method and device for generating document abstract
CN108804421B (en) Text similarity analysis method and device, electronic equipment and computer storage medium
CN107463548B (en) Phrase mining method and device
CN105975459B (en) A kind of the weight mask method and device of lexical item
US20220414131A1 (en) Text search method, device, server, and storage medium
US20130339373A1 (en) Method and system of filtering and recommending documents
CN110287409B (en) Webpage type identification method and device
CN106844482B (en) Search engine-based retrieval information matching method and device
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
CN110019669B (en) Text retrieval method and device
CN111708909B (en) Video tag adding method and device, electronic equipment and computer readable storage medium
CN111859004A (en) Retrieval image acquisition method, device, equipment and readable storage medium
WO2016095645A1 (en) Stroke input method, device and system
CN109783547B (en) Similarity connection query method and device
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN112818121A (en) Text classification method and device, computer equipment and storage medium
CN112131341A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN107766419B (en) Threshold denoising-based TextRank document summarization method and device
CN111553156B (en) Keyword extraction method, device and equipment
CN113377911B (en) Text information extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant