CN115238670A - Information text extraction method, device, equipment and storage medium
- Publication number
- CN115238670A (CN202210951569.0A)
- Authority
- CN
- China
- Prior art keywords
- vector
- text
- statement
- information
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an intelligent decision technology, and discloses an information text extraction method, which comprises the following steps: preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors; calculating the vector length of the statement vector, extracting the characteristic vector of the statement vector by combining the vector length, and extracting a statement entity in the text statement; performing bidirectional coding on the statement entity to obtain a coded vector, performing vector fusion on the coded vector and the statement vector to obtain a target vector, and extracting the information characteristics of the target vector; constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entities; and extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result. The invention improves the extraction efficiency of the information text.
Description
Technical Field
The invention relates to the technical field of intelligent decision, in particular to an information text extraction method, an information text extraction device, information text extraction equipment and a computer readable storage medium.
Background
The Internet today holds a massive volume of text. To acquire a given piece of knowledge, it has to be learned from this mass of text. If structured knowledge could be extracted automatically from open-domain text, a structured knowledge retrieval system could be built and presented in a concise and clear form, so that users can conveniently and quickly build and query key, understandable and useful knowledge.
Such texts contain information in the form of triples, and information extraction conventionally completes the text extraction by using the structured data of those triples. Extracting the triples, however, complicates the information retrieval steps and in turn reduces the efficiency of text retrieval, so a method capable of improving the efficiency of information text extraction is needed.
Disclosure of Invention
The invention provides an information text extraction method, an information text extraction device, information text extraction equipment and a storage medium, and mainly aims to improve the information text extraction efficiency.
In order to achieve the above object, the present invention provides an information text extraction method, which includes:
acquiring an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors;
calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting the statement entity in the text statement according to the feature vector;
performing bidirectional coding on the statement entity to obtain a coding vector, performing vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting information characteristics of the target vector;
constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entities;
and extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
Optionally, the preprocessing the information text to obtain a target text includes:
performing content conversion on the information text to obtain a conversion text;
filtering stop words of the converted text according to a preset stop word list to obtain a filtered text;
and performing word segmentation processing on the filtered text to obtain a target text.
Optionally, the vector conversion of the text statement to obtain a statement vector includes:
carrying out one-hot coding on the text sentence to obtain a coded sentence;
vector conversion is carried out on the coding statement to obtain a coding statement vector;
and performing dimension reduction processing on the coding statement vector to obtain a statement vector.
Optionally, the extracting, in combination with the vector length, a feature vector of the statement vector includes:
constructing a matrix of the statement vector according to the vector length;
calculating a vector dimension of the matrix;
and combining the matrix and the vector dimension to extract the features of the statement vector to obtain a feature vector.
Optionally, vector fusion is performed on the coding vector and the statement vector to obtain a target vector, including:
respectively carrying out position coding on the coding vector and the statement vector to obtain a first position vector and a second position vector;
calculating the correlation degree of the first position vector and the second position vector by using a preset correlation function;
combining the first position vector and the second position vector with the association degree larger than a preset value to obtain a combined sub-vector;
and summarizing the merged sub-vectors to obtain a target vector.
Optionally, the preset relevance function includes:
wherein R represents the degree of association, M represents the total number of first position vectors and second position vectors, i represents the start vector, N represents the final vector, k represents the index of a position vector, P represents the first position, Q represents the second position, P_k represents the k-th position vector in the first position, Q_k represents the k-th position vector in the second position, ln represents the logarithmic function, and ρ represents the correlation coefficient.
Optionally, the constructing a feature matrix corresponding to the information feature includes:
and constructing a feature matrix corresponding to the information features by using the following formula:
wherein J represents a characteristic matrix corresponding to the information characteristic, a represents the number of characteristic data, n represents a matrix coefficient, d represents an independent variable of the information characteristic, and F represents a matrix spectrum radius corresponding to the information characteristic.
In order to solve the above problem, the present invention further provides an information text extracting apparatus, including:
the text processing module is used for acquiring an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors;
the entity extraction module is used for calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting the statement entity in the text statement according to the feature vector;
the feature extraction module is used for carrying out bidirectional coding on the statement entity to obtain a coding vector, carrying out vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting the information feature of the target vector;
the matrix construction module is used for constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entity;
and the text extraction module is used for extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the information text extraction method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the information text extraction method described above.
The method comprises the steps of obtaining an information text to be extracted, preprocessing the information text to obtain a target text, and removing invalid information in the information text so as to improve the efficiency of subsequent processing of the information text; in addition, the invention constructs the characteristic matrix corresponding to the information characteristic, and the characteristic matrix can show the multidimensional data of the information characteristic. Therefore, the method, the device, the equipment and the storage medium for extracting the information text provided by the embodiment of the invention can improve the efficiency of extracting the information text.
Drawings
Fig. 1 is a schematic flow chart of an information text extraction method according to an embodiment of the present invention;
Fig. 2 is a functional block diagram of an information text extraction apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device for implementing the information text extraction method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The embodiment of the application provides an information text extraction method. In the embodiment of the present application, the execution subject of the information text extraction method includes, but is not limited to, at least one electronic device, such as a server or a terminal, that can be configured to execute the method provided in the embodiment of the present application. In other words, the information text extraction method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Fig. 1 is a schematic flow chart of an information text extraction method according to an embodiment of the present invention. In this embodiment, the method for extracting information text includes steps S1 to S5:
S1, obtaining an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors.
According to the invention, the information text to be extracted is obtained, the information text is preprocessed to obtain the target text, and invalid information in the information text can be removed, so that the efficiency of subsequent processing of the information text is improved.
The target text is obtained by filtering invalid information in the information text, and further, the information text to be extracted can be obtained through internet downloading.
As an embodiment of the present invention, the preprocessing the information text to obtain a target text includes: and performing content conversion on the information text to obtain a converted text, performing stop word filtering on the converted text according to a preset stop word list to obtain a filtered text, and performing word segmentation on the filtered text to obtain a target text.
The converted text is obtained by converting non-text content in the information text into corresponding text content, where the non-text content includes pictures, English words and the like; the preset stop word list is a list containing all stop words; and the filtered text is obtained by filtering the stop words out of the converted text.
Furthermore, the content conversion of the information text can be realized by OCR character recognition and a Chinese-English translator written in a scripting language; the stop word filtering of the converted text can be realized by a set aggregation method; and the filtered text can be segmented by the IK word segmenter.
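The stop word filtering and word segmentation steps can be sketched as follows. This is only a minimal illustration under stated assumptions: a regular-expression tokenizer stands in for the IK word segmenter, the small `STOP_WORDS` set is a hypothetical stand-in for the preset stop word list, and the content conversion step (OCR, translation) is omitted. For simplicity the sketch tokenizes first and filters afterwards, whereas the embodiment filters before segmenting.

```python
import re

# Hypothetical stop word list; the embodiment only states that a preset list is used.
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and"}

def segment(text):
    """Split text into lowercase word tokens (stand-in for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize the text and drop every token found in the stop word list."""
    return [tok for tok in segment(text) if tok not in stop_words]

print(preprocess("A fire engine is parked on the road."))
```

The set lookup keeps the filtering step linear in the number of tokens, which is one plausible reading of the set-based filtering mentioned above.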
According to the invention, the text sentence in the target text is extracted, and the extraction difficulty of the information text can be reduced by extracting the text sentence, wherein the text sentence is a logical sentence in the target text, and further, the text sentence in the target text can be extracted through a right function.
According to the method, the sentence vectors are obtained by carrying out vector conversion on the text sentences, the text sentences are converted into the corresponding space vectors, and the guarantee is provided for the subsequent extraction of the features of the sentence vectors, wherein the sentence vectors are in a vector expression form corresponding to the text sentences.
As an embodiment of the present invention, the performing vector conversion on the text statement to obtain a statement vector includes: and carrying out one-hot coding on the text statement to obtain a coded statement, carrying out vector conversion on the coded statement to obtain a coded statement vector, and carrying out dimension reduction processing on the coded statement vector to obtain a statement vector.
The coded statement is the binary representation corresponding to the text statement, and the coded statement vector is the vector form of the coded statement. Furthermore, the one-hot coding of the text statement can be realized by a one-hot encoding method, and the dimension reduction of the coded statement vector can be realized by a pooling function, such as a maximum function or a minimum function.
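As an illustration of the encoding and dimension reduction described above, the following sketch one-hot encodes the tokens of a sentence against a fixed vocabulary and then collapses the resulting rows with a column-wise maximum. The vocabulary, the function names and the choice of max pooling over the token axis are assumptions made for the example, not details given in the source.

```python
def one_hot_encode(tokens, vocab):
    """Return one one-hot row per token; unknown tokens yield an all-zero row."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for tok in tokens:
        row = [0] * len(vocab)
        if tok in index:
            row[index[tok]] = 1
        rows.append(row)
    return rows

def max_pool(rows):
    """Reduce a list of token rows to one sentence vector by column-wise maximum."""
    return [max(column) for column in zip(*rows)]

vocab = ["engine", "fire", "road", "truck"]  # hypothetical vocabulary
rows = one_hot_encode(["fire", "engine", "road"], vocab)
sentence_vector = max_pool(rows)
print(sentence_vector)
```

Pooling over the token axis turns a variable-length sentence into a fixed-length vector whose size equals the vocabulary size.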
S2, calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting the statement entity in the text statement according to the feature vector.
By calculating the vector length of the statement vector, the invention provides a premise for the subsequent extraction of the features of the statement vector, wherein the vector length is the length of the statement vector in the vector space; furthermore, the vector length of the statement vector can be calculated by a vector formula written in Java.
The invention extracts the feature vector of the statement vector in combination with the vector length. The feature vector reveals the characteristic attributes of the statement vector, so that the entities of the text statement can subsequently be extracted accurately, wherein the feature vector is the attribute that uniquely represents the statement vector.
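The source does not reproduce the vector formula itself. Assuming that the vector length means the Euclidean norm, a minimal sketch (in Python rather than the Java mentioned above) would be:

```python
import math

def vector_length(vec):
    """Euclidean norm: square root of the sum of squared components (assumed definition)."""
    return math.sqrt(sum(x * x for x in vec))

print(vector_length([3.0, 4.0]))
```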
As an embodiment of the present invention, the extracting, in combination with the vector length, a feature vector of the statement vector includes: and constructing a matrix of the statement vector according to the vector length, calculating the vector dimension of the matrix, and performing feature extraction on the statement vector by combining the matrix and the vector dimension to obtain a feature vector.
A matrix is a set of complex or real numbers arranged in a rectangular array, a notion that originated from the square array formed by the coefficients and constants of a system of equations; here it represents the set form corresponding to the statement vector. The vector dimension is the number of components the statement vector has in the space. Furthermore, the matrix of the statement vector is constructed by a matrix function, such as a sine function; the vector dimension of the matrix can be calculated by a dimension function written in a scripting language; and the feature extraction of the statement vector can be realized by a self-attention function.
According to the invention, the sentence entity in the text sentence is extracted according to the feature vector, and the accuracy of extracting the sentence entity can be improved through the feature vector, wherein the sentence entity is a word in the text sentence that denotes a person or a specific thing. For example, in the sentence "a fire engine is parked on the road", both the road and the fire engine are entities.
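The self-attention step can be sketched as plain scaled dot-product self-attention. The sketch below is a simplification under several assumptions: each row of `matrix` is taken to be one vector of the statement, the queries, keys and values are all the input rows themselves (no learned projection weights), and the vector dimension is read off the first row. None of these choices is specified in the source.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(matrix):
    """Scaled dot-product self-attention with Q = K = V = matrix (no learned weights)."""
    dim = len(matrix[0])  # vector dimension of the matrix
    output = []
    for query in matrix:
        # Similarity of this row to every row, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in matrix]
        weights = softmax(scores)
        # Each output row is a weighted average of all input rows.
        output.append([sum(w * row[j] for w, row in zip(weights, matrix)) for j in range(dim)])
    return output

features = self_attention([[1.0, 0.0], [0.0, 1.0]])
print(features)
```

Each output row mixes information from all rows of the input, weighted by similarity, which is why the vector dimension enters as the scaling factor.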
As an embodiment of the present invention, the extracting, according to the feature vector, a sentence entity in the text sentence includes: identifying words corresponding to the feature vectors in the text sentences, performing semantic analysis on the words to obtain word semantics, and extracting sentence entities in the text sentences according to the word semantics.
The word semantics is the meaning corresponding to the word. Furthermore, recognizing the words corresponding to the feature vector in the text statement may be implemented by OCR character recognition technology; the semantic analysis of the words may be implemented by a semantic analysis algorithm written in a scripting language; and extracting the statement entity in the text statement may be implemented by a character matching extraction algorithm.
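One simple reading of a character matching extraction algorithm is a lookup of each word against a lexicon of known entity words. The sketch below illustrates that reading only; the lexicon and the function name `extract_entities` are hypothetical, and the semantic analysis step is not modelled.

```python
def extract_entities(tokens, entity_lexicon):
    """Return tokens that appear in the lexicon, in order, without repeats."""
    seen = set()
    entities = []
    for tok in tokens:
        if tok in entity_lexicon and tok not in seen:
            seen.add(tok)
            entities.append(tok)
    return entities

# Hypothetical lexicon of entity words.
lexicon = {"road", "engine"}
print(extract_entities(["fire", "engine", "parked", "road", "road"], lexicon))
```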
And S3, performing bidirectional coding on the statement entity to obtain a coding vector, performing vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting the information characteristics of the target vector.
The statement entity is bidirectionally encoded, so that an encoding vector can be obtained, and subsequent vector fusion is guaranteed, wherein the encoding vector is a vector corresponding to the statement entity, and further, the bidirectional encoding of the statement entity is realized by an encoder.
According to the invention, the coding vector and the statement vector are subjected to vector fusion to obtain the target vector, so that the characteristics of the target vector can be accurately extracted subsequently, wherein the target vector is obtained after the coding vector and the statement vector are fused.
As an embodiment of the present invention, the vector fusion between the coding vector and the statement vector to obtain a target vector includes: respectively carrying out position coding on the coding vector and the statement vector to obtain a first position vector and a second position vector, calculating the association degree of the first position vector and the second position vector by utilizing a preset association degree function, merging the first position vector and the second position vector with the association degree larger than a preset value to obtain a merged sub-vector, and summarizing the merged sub-vector to obtain a target vector.
The first position vector and the second position vector are obtained after the coding vector and the sentence vector are respectively subjected to position coding, the association degree is the association degree of the sub-vectors in the first position vector and the second position vector, and the combined sub-vector is the vector combined by the sub-vectors in the first position vector and the second position vector.
Further, as an optional embodiment of the present invention, the position coding of the coding vector and the statement vector may be implemented by an encoder, the merging of the sub-vectors in the first position vector and the second position vector may be implemented by a vector algorithm, and the summarizing of the merged sub-vectors may be implemented by scalar multiplication of vectors.
Further, as an optional embodiment of the present invention, the preset relevance function includes:
wherein R represents the degree of association, M represents the total number of first position vectors and second position vectors, i represents the start vector, N represents the final vector, k represents the index of a position vector, P represents the first position, Q represents the second position, P_k represents the k-th position vector in the first position, Q_k represents the k-th position vector in the second position, ln represents the logarithmic function, and ρ represents the correlation coefficient.
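The fusion procedure (position coding, association degree, thresholded merging, summarizing) can be sketched as below. Because the preset association degree function is not reproduced in the source, cosine similarity is swapped in as a stand-in; sinusoidal position coding, all-pairs comparison, merging by concatenation and the default threshold of 0.5 are likewise assumptions made for the example.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def add_positions(vectors):
    """Add a positional encoding to each vector according to its index."""
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, len(vec)))]
        for pos, vec in enumerate(vectors)
    ]

def cosine(u, v):
    """Cosine similarity, used here in place of the patent's association function."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def fuse(coding_vectors, statement_vectors, threshold=0.5):
    """Merge every pair of position vectors whose association exceeds the threshold."""
    first = add_positions(coding_vectors)
    second = add_positions(statement_vectors)
    merged = []
    for p in first:
        for q in second:
            if cosine(p, q) > threshold:
                merged.append(p + q)  # concatenation as the merge step
    return merged  # the collected sub-vectors form the target vector

print(fuse([[1.0, 0.0]], [[1.0, 0.0], [-1.0, -2.0]]))
```

Only the pair whose association exceeds the threshold contributes a merged sub-vector, so weakly related pairs do not enter the target vector.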
By extracting the information features of the target vector, the invention makes the characteristic attributes in the target vector known and provides a premise for the subsequent matrix construction, wherein the information features are the attributes of the target vector that have measurement value; furthermore, the extraction of the information features can be realized by the HOG extraction algorithm.
S4, constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entities.
According to the invention, a feature matrix corresponding to the information features is constructed, and the feature matrix can show multi-dimensional data of the information features, wherein the feature matrix is a matrix list corresponding to the information features.
Further, the constructing a feature matrix corresponding to the information feature includes:
and constructing a characteristic matrix corresponding to the information characteristic by using the following formula:
wherein J represents a feature matrix corresponding to the information features, a represents the number of feature data, n represents a matrix coefficient, d represents an independent variable of the information features, and F represents a matrix spectrum radius corresponding to the information features.
The invention is convenient to know the distance between the adjacent features in the feature matrix by calculating the feature distance between the adjacent features in the feature matrix.
Further, the feature distance between adjacent features in the feature matrix may be calculated by the following formula:
where D(a, b) represents the feature distance between adjacent features in the feature matrix, m represents the starting feature in the matrix, a_m represents the initial coordinate point of the first of the adjacent features, b_m represents the initial coordinate point of the second of the adjacent features, x_m represents the termination coordinate point of the feature adjacent to the termination feature, and y_m represents the termination coordinate point of the termination feature.
According to the invention, the characteristic distance is screened to obtain the target distance, the characteristic corresponding to the target distance can be used as the description characteristic of the statement entity, and then the relation description corresponding to the statement entity is accurately obtained, wherein the target distance is the distance meeting the requirement, the description characteristic is the description relation corresponding to the statement entity, and further, the screening of the characteristic distance can be realized through a vlookup function.
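The distance calculation and screening can be illustrated as follows. The distance formula is not reproduced in the source, so the Euclidean distance between consecutive rows of the feature matrix is used as a stand-in, and "meeting the requirement" is assumed to mean not exceeding a hypothetical `max_distance` threshold.

```python
import math

def adjacent_distances(feature_matrix):
    """Euclidean distance between each pair of consecutive rows (assumed metric)."""
    return [math.dist(a, b) for a, b in zip(feature_matrix, feature_matrix[1:])]

def screen(feature_matrix, max_distance):
    """Return (index, next index, distance) for adjacent rows within max_distance."""
    return [
        (i, i + 1, d)
        for i, d in enumerate(adjacent_distances(feature_matrix))
        if d <= max_distance
    ]

matrix = [[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]]
print(adjacent_distances(matrix))
print(screen(matrix, max_distance=2.0))
```

The rows named by the surviving index pairs would then serve as the description features of the statement entity.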
And S5, extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
According to the invention, the information text is extracted according to the statement entity and the description characteristics, so that an extraction result can be obtained, the information text extraction efficiency is further improved, and further, the information text extraction can be realized through an InputBox function.
The method comprises the steps of obtaining an information text to be extracted, preprocessing the information text to obtain a target text, and removing invalid information in the information text so as to improve the efficiency of subsequent processing of the information text; in addition, the invention constructs the characteristic matrix corresponding to the information characteristic, and the characteristic matrix can show the multidimensional data of the information characteristic. Therefore, the information text extraction method provided by the embodiment of the invention can improve the efficiency of information text extraction.
Fig. 2 is a functional block diagram of an information text extraction apparatus according to an embodiment of the present invention.
The information text extraction apparatus 100 according to the present invention may be installed in an electronic device. According to the realized functions, the information text extraction device 100 can comprise a text processing module 101, an entity extraction module 102, a feature extraction module 103, a matrix construction module 104 and a text extraction module 105. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and can perform a fixed function, and is stored in a memory of the electronic device.
In the present embodiment, the functions of the respective modules/units are as follows:
the text processing module 101 is configured to obtain an information text to be extracted, preprocess the information text to obtain a target text, extract a text sentence in the target text, and perform vector conversion on the text sentence to obtain a sentence vector;
the entity extraction module 102 is configured to calculate a vector length of the statement vector, extract a feature vector of the statement vector in combination with the vector length, and extract a statement entity in the text statement according to the feature vector;
the feature extraction module 103 is configured to perform bidirectional encoding on the statement entity to obtain an encoding vector, perform vector fusion on the encoding vector and the statement vector to obtain a target vector, and extract an information feature of the target vector;
the matrix construction module 104 is configured to construct a feature matrix corresponding to the information features, calculate a feature distance between adjacent features in the feature matrix, filter the feature distance to obtain a target distance, and use a feature corresponding to the target distance as a description feature of the sentence entity;
the text extraction module 105 is configured to extract the information text according to the sentence entity and the description feature to obtain an extraction result.
In detail, when in use, the modules of the information text extraction apparatus 100 in the embodiment of the present application adopt the same technical means as the information text extraction method described above with reference to Fig. 1 and can produce the same technical effect; details are not repeated here.
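As a minimal sketch of how modules 101 to 105 hand data to one another, the five modules may be modelled as interchangeable callables wired together in the order described above. The class name `InformationTextExtractor` and the callable signatures are hypothetical and chosen only for illustration; the embodiment does not prescribe any particular programming interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class InformationTextExtractor:
    """Chains five stages in the order in which modules 101-105 are described."""

    text_processing: Callable[[str], Any]          # 101: text -> sentence vectors
    entity_extraction: Callable[[Any], Any]        # 102: sentence vectors -> statement entities
    feature_extraction: Callable[[Any, Any], Any]  # 103: (entities, vectors) -> information features
    matrix_construction: Callable[[Any], Any]      # 104: features -> description features
    text_extraction: Callable[[Any, Any], Any]     # 105: (entities, description features) -> result

    def run(self, info_text: str) -> Any:
        vectors = self.text_processing(info_text)
        entities = self.entity_extraction(vectors)
        features = self.feature_extraction(entities, vectors)
        description = self.matrix_construction(features)
        return self.text_extraction(entities, description)
```

Because each stage is passed in as a callable, a stage can be replaced (for example, with a different entity extractor) without changing the others, which mirrors the statement that the module division is only one logical functional division.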
Fig. 3 is a schematic structural diagram of an electronic device 1 for implementing an information text extraction method according to an embodiment of the present invention.
The electronic device 1 may include a processor 10, a memory 11, a communication bus 12, and a communication interface 13, and may further include a computer program, such as an information text extraction method program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is the control unit of the electronic device 1; it connects the various components of the whole electronic device by using various interfaces and lines, and executes the various functions of the electronic device and processes its data by running or executing programs or modules stored in the memory 11 (for example, an information text extraction method program) and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of an information text extraction method program, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device 1 and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display or an input unit such as a keyboard, and optionally a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 3 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The information text extraction method program stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions, and when running in the processor 10, can realize:
acquiring an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors;
calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting a statement entity in the text statement according to the feature vector;
performing bidirectional coding on the statement entity to obtain a coding vector, performing vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting the information characteristic of the target vector;
constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entity;
and extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
Specifically, the processor 10 may refer to the description of the relevant steps in the corresponding embodiments of the figures, which is not repeated herein.
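For illustration only, the vector conversion of a text sentence (one-hot coding, conversion into a coded sentence vector, and dimension reduction) might be sketched as follows. Summing the one-hot codes into a single sentence vector and using a seeded random projection for dimension reduction are assumptions of this sketch; the embodiment does not name a specific conversion or reduction technique.

```python
import random
from typing import Dict, List, Sequence


def build_vocab(sentences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token: str, vocab: Dict[str, int]) -> List[float]:
    """One-hot code of a single token."""
    vec = [0.0] * len(vocab)
    vec[vocab[token]] = 1.0
    return vec


def sentence_vector(tokens: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Coded sentence vector: element-wise sum of the one-hot codes (assumed)."""
    vec = [0.0] * len(vocab)
    for token in tokens:
        for i, value in enumerate(one_hot(token, vocab)):
            vec[i] += value
    return vec


def reduce_dim(vec: Sequence[float], out_dim: int, seed: int = 0) -> List[float]:
    """Dimension reduction by a seeded Gaussian random projection (assumed)."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in vec] for _ in range(out_dim)]
    return [sum(w * x for w, x in zip(row, vec)) for row in projection]
```

A fixed seed keeps the projection identical across sentences, so that reduced sentence vectors remain comparable with one another.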
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors;
calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting the statement entity in the text statement according to the feature vector;
performing bidirectional coding on the statement entity to obtain a coding vector, performing vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting information characteristics of the target vector;
constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entities;
and extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
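The screening of feature distances may be illustrated by the following minimal sketch. It assumes that each row of the feature matrix is one feature, that the feature distance is the Euclidean distance between adjacent rows, and that screening keeps distances not exceeding a threshold; the parameter `max_distance` and the rule that both rows of a retained pair count as description features are hypothetical, since the embodiment does not fix the screening criterion.

```python
import math
from typing import List, Sequence


def adjacent_distances(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance between each pair of adjacent rows."""
    return [math.dist(a, b) for a, b in zip(matrix, matrix[1:])]


def description_features(
    matrix: Sequence[Sequence[float]], max_distance: float
) -> List[Sequence[float]]:
    """Keep the rows belonging to adjacent pairs whose distance passes screening."""
    kept: List[int] = []
    for i, distance in enumerate(adjacent_distances(matrix)):
        if distance <= max_distance:  # assumed screening rule
            for index in (i, i + 1):
                if index not in kept:
                    kept.append(index)
    return [matrix[i] for i in kept]
```

With the rows `[0, 0]`, `[3, 4]`, `[3, 5]` and `[10, 10]` and a threshold of 1.5, only the middle pair is retained, because its distance of 1.0 is the only one within the threshold.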
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and there may be other divisions in actual implementation.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. An information text extraction method, characterized in that the method comprises:
acquiring an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors;
calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting a statement entity in the text statement according to the feature vector;
performing bidirectional coding on the statement entity to obtain a coding vector, performing vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting information characteristics of the target vector;
constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entities;
and extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
2. The method for extracting information text according to claim 1, wherein the preprocessing the information text to obtain a target text comprises:
performing content conversion on the information text to obtain a conversion text;
filtering stop words of the converted text according to a preset stop word list to obtain a filtered text;
and performing word segmentation processing on the filtered text to obtain a target text.
3. The method for extracting information text according to claim 2, wherein the performing vector conversion on the text sentence to obtain a sentence vector comprises:
carrying out one-hot coding on the text sentence to obtain a coded sentence;
vector conversion is carried out on the coding statement to obtain a coding statement vector;
and performing dimension reduction processing on the coding statement vector to obtain a statement vector.
4. The method for extracting information text according to claim 1, wherein the extracting the feature vector of the sentence vector in combination with the vector length comprises:
constructing a matrix of the statement vector according to the vector length;
calculating a vector dimension of the matrix;
and combining the matrix and the vector dimension to extract the features of the statement vector to obtain a feature vector.
5. The method for extracting information text according to claim 1, wherein the vector fusion of the encoding vector and the sentence vector to obtain the target vector comprises:
respectively carrying out position coding on the coding vector and the statement vector to obtain a first position vector and a second position vector;
calculating the association degree of the first position vector and the second position vector by using a preset association degree function;
combining the first position vector and the second position vector with the association degree larger than a preset value to obtain a combined sub-vector;
and summarizing the merged sub-vectors to obtain a target vector.
6. The method for extracting information text according to claim 5, wherein the preset relevance function comprises:
wherein R represents the degree of association, M represents the total number of the first position vectors and the second position vectors, i represents the start vector, N represents the final vector, k represents a position vector, P represents the first position, Q represents the second position, k_P represents the k-th position vector in the first position, k_Q represents the k-th position vector in the second position, ln represents the logarithmic function, and p represents the correlation coefficient.
7. The method for extracting information text according to claim 1, wherein the constructing the feature matrix corresponding to the information features comprises:
and constructing a characteristic matrix corresponding to the information characteristic by using the following formula:
wherein J represents a feature matrix corresponding to the information features, a represents the number of feature data, n represents a matrix coefficient, d represents an argument of the information features, and F represents a matrix spectrum radius corresponding to the information features.
8. An information text extraction apparatus, characterized in that the apparatus comprises:
the text processing module is used for acquiring an information text to be extracted, preprocessing the information text to obtain a target text, extracting text sentences in the target text, and performing vector conversion on the text sentences to obtain sentence vectors;
the entity extraction module is used for calculating the vector length of the statement vector, extracting the feature vector of the statement vector by combining the vector length, and extracting the statement entity in the text statement according to the feature vector;
the feature extraction module is used for carrying out bidirectional coding on the statement entity to obtain a coding vector, carrying out vector fusion on the coding vector and the statement vector to obtain a target vector, and extracting the information feature of the target vector;
the matrix construction module is used for constructing a feature matrix corresponding to the information features, calculating feature distances between adjacent features in the feature matrix, screening the feature distances to obtain target distances, and taking the features corresponding to the target distances as description features of the statement entity;
and the text extraction module is used for extracting the information text according to the sentence entity and the description characteristics to obtain an extraction result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of extracting an information text as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the information text extraction method according to any one of claims 1 to 7.
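For illustration only, the vector fusion recited in claim 5 (position coding, association degree, merging, and summarizing) might be sketched as follows. Because the preset association degree function is not reproduced in the text, cosine similarity is substituted here as a stand-in; the sinusoidal position coding, the pairing of vectors by index, concatenation as the merge, and the element-wise mean as the summarizing step are likewise assumptions of this sketch and not the claimed formula.

```python
import math
from typing import List, Sequence


def position_encode(vec: Sequence[float], position: int) -> List[float]:
    """Add a sinusoidal position signal to a vector (assumed coding scheme)."""
    dim = len(vec)
    encoded = []
    for i, value in enumerate(vec):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoded.append(value + (math.sin(angle) if i % 2 == 0 else math.cos(angle)))
    return encoded


def association(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used in place of the preset association function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(
    coding_vectors: Sequence[Sequence[float]],
    sentence_vectors: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[float]:
    """Merge position-coded pairs above the threshold, then average them."""
    first = [position_encode(v, k) for k, v in enumerate(coding_vectors)]
    second = [position_encode(v, k) for k, v in enumerate(sentence_vectors)]
    # Merge (concatenate) each k-th pair whose association exceeds the preset value.
    merged = [p + q for p, q in zip(first, second) if association(p, q) > threshold]
    if not merged:
        return []
    # Summarize the merged sub-vectors by their element-wise mean.
    return [sum(column) / len(merged) for column in zip(*merged)]
```

The sketch requires the coding vectors and sentence vectors to share one dimension so that the association degree is defined for each pair.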
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210951569.0A CN115238670B (en) | 2022-08-09 | 2022-08-09 | Information text extraction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115238670A (en) | 2022-10-25 |
CN115238670B (en) | 2023-07-04 |
Family
ID=83678733
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210951569.0A Active CN115238670B (en) | 2022-08-09 | 2022-08-09 | Information text extraction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115238670B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011128924A (en) * | 2009-12-18 | 2011-06-30 | Kddi Corp | Comic image analysis apparatus, program, and search apparatus and method for extracting text from comic image |
US9495357B1 (en) * | 2013-05-02 | 2016-11-15 | Athena Ann Smyros | Text extraction |
US20190278843A1 (en) * | 2017-02-27 | 2019-09-12 | Tencent Technology (Shenzhen) Company Ltd | Text entity extraction method, apparatus, and device, and storage medium |
JP6337183B1 (en) * | 2017-06-22 | 2018-06-06 | 株式会社ドワンゴ | Text extraction device, comment posting device, comment posting support device, playback terminal, and context vector calculation device |
CN111967242A (en) * | 2020-08-17 | 2020-11-20 | 支付宝(杭州)信息技术有限公司 | Text information extraction method, device and equipment |
CN114840662A (en) * | 2021-02-02 | 2022-08-02 | 京东科技控股股份有限公司 | Event information extraction method and device and electronic equipment |
CN112860905A (en) * | 2021-04-08 | 2021-05-28 | 深圳壹账通智能科技有限公司 | Text information extraction method, device and equipment and readable storage medium |
CN114398855A (en) * | 2022-01-13 | 2022-04-26 | 北京快确信息科技有限公司 | Text extraction method, system and medium based on fusion pre-training |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115409041A (en) * | 2022-10-29 | 2022-11-29 | 深圳迅策科技有限公司 | Unstructured data extraction method, device, equipment and storage medium |
CN115409041B (en) * | 2022-10-29 | 2023-01-17 | 深圳迅策科技有限公司 | Unstructured data extraction method, device, equipment and storage medium |
CN115784535A (en) * | 2023-01-09 | 2023-03-14 | 深圳瑞赛环保科技有限公司 | Computer technology-based waste liquid aluminum ion filtering method and system |
CN115784535B (en) * | 2023-01-09 | 2023-05-05 | 深圳瑞赛环保科技有限公司 | Method and system for filtering aluminum ions in waste liquid based on computer technology |
Also Published As
Publication number | Publication date |
---|---|
CN115238670B (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||