CN114091456A - Intelligent positioning method and system for quotation contents

Info

Publication number: CN114091456A
Application number: CN202210063117.9A
Authority: CN
Inventors: 蓝建敏; 苗苏望; 李锦洲; 池穆霖; 李观春
Original assignee: Excellence Information Technology Co ltd
Current assignee: Excellence Information Technology Co ltd
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-02-25
Anticipated expiration: 2042-01-20
Also published as: CN114091456B

Abstract

The invention provides a method and a system for intelligently positioning quotation contents. A character string is read from a text file as the character string to be detected; a word segmentation algorithm divides the character string to be detected into a plurality of character string arrays; key characters are located in each character string array; a plurality of different character string data are taken as a quotation content set; and the key characters are compared with each character string data in the quotation content set to match a plurality of character string data. This reduces the time cost of matching, so that quotation contents in a character string can be identified quickly and matched to the contents they quote.

Description

Intelligent positioning method and system for quotation contents

Technical Field

The invention belongs to the technical field of unstructured data processing and distributed software, and in particular relates to a method and a system for intelligently positioning quotation contents.

Background

Quotation content positioning is a technique that uses character matching or regular-expression matching to extract key characters from a character string taken from a paper or document, and then associates and matches those key characters with the data of each quotation in a data set. In industrial applications of citation content positioning, a pre-trained text vector is usually used to embed the character string information, and the embedded vectors are then used to compute the similarity between the characters and the data set so as to screen out the most similar data. Patent publication No. CN109947915A discloses an artificial intelligence expert system based on a knowledge management system and a construction method thereof, which can obtain first associated information from the questions and matching answers in question-answer information and update it into a knowledge graph. However, extracting information in this way requires the citation content to be structured, the time cost and computational complexity are very large, and it is difficult to perform matching calculation on unstructured text data.

Disclosure of Invention

The present invention is directed to a method and system for intelligently locating citation content, which solves one or more of the problems of the prior art and provides at least one useful choice or creation condition.

The invention provides a method and a system for intelligently positioning quotation contents, which are characterized in that character strings in a text file are read as character strings to be detected, a word segmentation algorithm is used for segmenting the character strings to be detected into a plurality of character string arrays, key characters are respectively positioned for each character string array, a plurality of different character string data are used as quotation content sets, the key characters are compared with each character string data in the quotation content sets, and a plurality of character string data are matched.

In order to achieve the above object, according to an aspect of the present invention, there is provided a method for intelligently locating cited contents, the method including the steps of:

S100, inputting a text file, and reading a character string in the text file as the character string to be detected;

S200, performing word segmentation on the character string to be detected by using a word segmentation algorithm, and dividing the character string to be detected into a plurality of character string arrays;

S300, locating the key characters in each character string array;

S400, taking a plurality of different character string data as a quotation content set;

S500, comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.

All calculations involved in the steps of the method are performed on the corresponding extracted numerical values, which have been non-dimensionalized.

Further, in S200, the method for segmenting the character string to be detected by using the word segmentation algorithm and dividing it into a plurality of character string arrays is as follows: the character string to be detected is divided into a plurality of sentences, taking the full stops in the character string as dividing points; each sentence is segmented into words by a Chinese word segmentation algorithm and its punctuation marks are removed, so that each sentence is divided into a plurality of character strings; the character strings of each sentence form one character string array, whereby a plurality of character string arrays are obtained.
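The sentence splitting and segmentation of S200 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `split_into_arrays`, the punctuation set, and the default whitespace segmenter are all hypothetical. For Chinese text a real word segmenter (for example `jieba.lcut`) would be passed in as `segment`.

```python
import re
from typing import Callable, List

# Punctuation removed from each sentence (an assumed, non-exhaustive set).
PUNCT = re.compile(r"[，、；：？！,;:?!“”‘’（）()《》]")

def split_into_arrays(
    text: str,
    segment: Callable[[str], List[str]] = str.split,  # stand-in segmenter
) -> List[List[str]]:
    """Split text at full stops, then segment each sentence into words."""
    # Assumption: both the Chinese full stop and the ASCII period end a sentence.
    sentences = [s for s in re.split(r"[。.]", text) if s.strip()]
    arrays: List[List[str]] = []
    for sentence in sentences:
        cleaned = PUNCT.sub(" ", sentence)          # drop punctuation marks
        words = [w for w in segment(cleaned) if w.strip()]
        if words:                                   # skip empty sentences
            arrays.append(words)                    # one array per sentence
    return arrays

arrays = split_into_arrays("alpha beta, gamma. delta epsilon.")
```

Each inner list corresponds to one character string array Arr(i); the outer list plays the role of Arrset.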

Further, in S300, each character string array is respectively located, and the method for locating the key character includes:

the set formed by the character string arrays is recorded as Arrset, the number of elements in Arrset as n, and the serial number of an element in Arrset as i, i ∈ [1, n]; the element with serial number i in Arrset is recorded as Arr(i), the number of elements in the character string array Arr(i) as n(i), and the serial number of an element in Arr(i) as i(i), i(i) ∈ [1, n(i)]; the element with serial number i(i) in Arr(i) is recorded as the character string Arr[i(i)];

all character strings in the character string array Arr(i) are input into an ELMo Chinese pre-training model at the same time, the model outputs the embedded vector (word vector) of each character string in Arr(i), and the embedded vector of the character string Arr[i(i)] is recorded as emb[i(i)];

the number of dimensions of an embedded vector is recorded as k, and the serial number of a dimension as v, v ∈ [1, k];

the value of the dimension with serial number v in emb[i(i)] is recorded as emb[i(i)]v;

in each character string array Arr(i), the first character string Arr[1] is connected with the last character string Arr[n(i)], that is, the element following Arr[n(i)] in Arr(i) is Arr[1], so that the character strings in Arr(i) are joined end to end to form a closed ring;

in each character string array Arr(i), the matrix with n(i) columns and k rows formed by the embedded vectors emb[i(i)] of the character strings Arr[i(i)] in Arr(i) is recorded as Mat(i); the column with serial number i(i) in Mat(i) is emb[i(i)], the element with column serial number i(i) and row serial number v in Mat(i) has a value equal to emb[i(i)]v, and that element is recorded as Mat(i)[v, i(i)];

a positioning array is defined as an array used for locating the key characters in a character string array, the number of dimensions of the positioning array being the same as the number of elements in the corresponding character string array; the values of the positioning array provide a fast multi-dimensional vectorized encoding of the key characters in the character string array, which helps to locate the positions of the key characters quickly while reducing the time complexity of the calculation;

the positioning array of the character string array Arr(i) is recorded as Pis(i), the number of dimensions of Pis(i) is n(i), the serial number of a dimension of Pis(i) is i(i), and the value of the dimension with serial number i(i) in Pis(i) is calculated by the following formula:

[formula not reproduced in this text]

wherein the function sig is an exponential function whose base is one half of the circle constant π (that is, π/2), thereby obtaining the value of each dimension of Pis(i);

the element with the minimum value in the positioning array Pis(i) is selected, and its serial number in the positioning array is recorded as i(i'), i(i') ∈ [1, n(i)];

r = int(n(i)/2) is calculated, where the function int is a rounding function;

in the ring formed by joining the character strings of the array Arr(i) end to end, the r-th element after the element with serial number i(i') is obtained as Arr[i(i') + r], and Arr[i(i') + r] is the key character, whereby the key character is located.
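The matrix layout and the ring-based selection described above can be illustrated with a short sketch. It assumes the positioning array values have already been computed (they are simply passed in), that int truncates toward zero, and that ties for the minimum are resolved by taking the first occurrence; the function names `build_matrix` and `locate_key_string` are hypothetical.

```python
from typing import List, Sequence

def build_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Arrange n embedded vectors of length k as a matrix with k rows and n columns."""
    # Row v, column i of the result holds dimension v of the i-th embedded vector.
    return [list(row) for row in zip(*embeddings)]

def locate_key_string(arr: Sequence[str], pis: Sequence[float]) -> str:
    """Return the string r positions after the minimum of pis, walking around the ring."""
    if not arr or len(arr) != len(pis):
        raise ValueError("arr and pis must be non-empty and of equal length")
    n = len(arr)
    # 0-based position of the smallest positioning value (first one on ties).
    i_min = min(range(n), key=lambda idx: pis[idx])
    r = int(n / 2)  # assumption: int() truncation stands in for the rounding function
    # The modulo makes the array behave as a closed ring: after the last
    # element comes the first one again.
    return arr[(i_min + r) % n]

key = locate_key_string(["a", "b", "c", "d", "e"], [0.9, 0.2, 0.7, 0.4, 0.8])
```

In the example the minimum sits at the second element and r is 2, so the selected string is the fourth element; when the minimum is near the end of the array, the modulo wraps the index back to the start.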

Further, in S400, a method of using a plurality of different character string data as a set of the cited content is:

character string data of the texts of a plurality of different papers or web pages are acquired by a web crawler; word segmentation and keyword extraction are performed on each character string data using a word segmentation algorithm and a keyword extraction algorithm, so that a plurality of segmented words are obtained from each character string data and a plurality of keywords are extracted from those words; and the serial number of each extracted keyword within the corresponding segmented words is recorded;

the set formed by the character string data of the texts of the plurality of different papers or web pages obtained by the web crawler is recorded as Refset; the number of elements in Refset is m, the serial number of an element in Refset is j, j ∈ [1, m], and the element with serial number j is Refset(j); the set of segmented words obtained from Refset(j) by the word segmentation algorithm is Refcont(j), the number of segmented words in Refcont(j) is n(j), and the serial number of a segmented word in Refcont(j) is i(j), i(j) ∈ [1, n(j)]; a plurality of different keywords are obtained from each Refcont(j) by the keyword extraction algorithm, and the set of these keywords and the set of their serial numbers in Refcont(j) are recorded; the set Refset is taken as the quotation content set.
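One possible in-memory shape for the quotation content set is sketched below. The text does not name a particular keyword extraction algorithm, so a simple frequency count is used here purely as a stand-in; the class `RefEntry`, the functions `top_frequency_keywords` and `build_refset`, and the whitespace segmenter are hypothetical, and the crawling step is omitted (texts are passed in directly).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class RefEntry:
    text: str                     # Refset(j): raw character string data
    words: List[str]              # Refcont(j): segmented words
    keywords: List[str]           # keywords extracted from the words
    keyword_positions: List[int]  # 1-based serial numbers of keywords in words

def top_frequency_keywords(words: Sequence[str], limit: int = 2) -> List[str]:
    """Stand-in extractor: the most frequent words longer than one character."""
    counts = Counter(w for w in words if len(w) > 1)
    return [w for w, _ in counts.most_common(limit)]

def build_refset(
    texts: Sequence[str],
    segment: Callable[[str], List[str]] = str.split,
    extract: Callable[[Sequence[str]], List[str]] = top_frequency_keywords,
) -> List[RefEntry]:
    refset = []
    for text in texts:
        words = segment(text)
        keywords = extract(words)
        wanted = set(keywords)
        # Record where every keyword occurrence sits among the segmented words.
        positions = [idx for idx, w in enumerate(words, start=1) if w in wanted]
        refset.append(RefEntry(text, words, keywords, positions))
    return refset

refset = build_refset(["citation graph citation index graph citation"])
```

Keeping the recorded positions alongside the words is what later allows the keyword positions in each element to be masked for prediction.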

Further, in S500, the method of comparing the key character with each character string data in the citation content set to match out a plurality of character string data includes:

a pre-trained language model ERNIE is used and fine-tuned to serve as a prediction model; using its masked language modeling (Masked Language Model) mechanism, the positions of the located keywords in each element of the set Refset are masked, and N-gram prediction is performed on the masked positions with the prediction model (see the algorithm described in Section 3.3, Comprehensive N-gram Prediction, of Xiao D, Li Y K, Zhang H, et al. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding [J]. 2020), the predicted probability (likelihood) that a masked position is the key character being taken as the prediction probability value;

the key character is compared with each character string data in the quotation content set as follows: the key character is recorded as keyw; the function Prd() denotes predicting a masked position with the prediction model to obtain the prediction probability value of the key character, so that Prd(keyw, Refset(j)) is the prediction probability value of keyw obtained by predicting the masked positions in Refset(j); the set formed by the values Prd(keyw, Refset(j)) over the elements Refset(j) of Refset is recorded as Prdset; the number of elements in Prdset is m, the same as in Refset, the serial numbers of the elements in Prdset are the same j as in Refset, and Prd(keyw, Refset(j)) is the element with serial number j in Prdset;

a plurality of character string data are then matched: the function min returns the element with the smallest value in a set and the function max returns the element with the largest value, so that min(Prdset) and max(Prdset) are the smallest and largest values in Prdset; the intersection value is defined as a value used to compare the elements Refset(j) of Refset according to their prediction probability values Prd(keyw, Refset(j)); the intersection value corresponding to Refset(j) is p(j), calculated as

p(j)=sin(π* Prd(keyw, Refset(j)) /(max(Prdset)- min(Prdset) ) ) ,

where sin is the sine function and π is the circle constant; the intersection value allows the elements of the quotation content set to be cross-compared quickly in batch, so that the quotation content with the greatest likelihood in terms of prediction probability value is screened out;

whether the intersection value of each element Refset(j) of Refset satisfies the constraint p(j) > 1 is judged; each character string Refset(j) whose intersection value satisfies p(j) > 1 is matched with the key character keyw, whereby a plurality of Refset(j) are matched; if no element Refset(j) satisfies p(j) > 1, keyw has no matching element in the set Refset; the result of the judgment for each element Refset(j) is recorded and output.
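The intersection value and the screening step can be sketched numerically as follows. The prediction probabilities are assumed to be available already (the ERNIE prediction itself is not reproduced here), the function names are hypothetical, and the handling of the degenerate case max(Prdset) = min(Prdset) is an assumption, since the text does not address it.

```python
import math
from typing import List, Sequence

def intersection_values(prdset: Sequence[float]) -> List[float]:
    """p(j) = sin(pi * Prd_j / (max(Prdset) - min(Prdset))) for each element."""
    spread = max(prdset) - min(prdset)
    if spread == 0:
        # Assumption: treat a zero spread as an error; the formula is undefined here.
        raise ValueError("max(Prdset) equals min(Prdset); p(j) is undefined")
    return [math.sin(math.pi * prd / spread) for prd in prdset]

def match_indices(prdset: Sequence[float], threshold: float) -> List[int]:
    """1-based serial numbers j whose intersection value exceeds the threshold."""
    values = intersection_values(prdset)
    return [j for j, p in enumerate(values, start=1) if p > threshold]

prdset = [0.0, 0.25, 0.5, 1.0]           # illustrative probabilities only
values = intersection_values(prdset)
matches = match_indices(prdset, threshold=0.5)
```

The threshold is left as a parameter in this sketch. Because the sine function never exceeds 1, a strict comparison against 1 as the constraint is written would select no element, so a reader experimenting with the formula may want to try other threshold values.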

The invention also provides an intelligent positioning system for quotation contents, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the steps of the above method for intelligently positioning quotation contents. The system can run on computing devices such as desktop computers, notebook computers, palmtop computers and cloud data centers; a system on which it runs may include, but is not limited to, a processor, a memory and a server cluster. When executing the computer program, the processor operates in the following units of the system:

the text input unit is used for inputting a text file and reading a character string in the text file as a character string to be detected;

the word segmentation detection unit is used for segmenting the character strings to be detected by using a word segmentation algorithm and dividing the character strings to be detected into a plurality of character string arrays;

the key character positioning unit is used for respectively positioning each character string array and positioning key characters;

the quotation content collecting unit is used for taking a plurality of different character string data as the quotation content set;

and the character matching unit is used for comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.
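How the five units fit together can be sketched as a small composition of callables. This is only an illustration of the data flow between units; the class name `CitationLocator`, its field names, and the type signatures are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class CitationLocator:
    read_text: Callable[[str], str]                       # text input unit
    segment: Callable[[str], List[List[str]]]             # word segmentation detection unit
    locate_key: Callable[[List[str]], str]                # key character positioning unit
    collect_refs: Callable[[], Sequence[str]]             # quotation content collecting unit
    match_refs: Callable[[str, Sequence[str]], List[int]] # character matching unit

    def run(self, source: str) -> List[Tuple[str, List[int]]]:
        """Return each located key string with the serial numbers of matched references."""
        text = self.read_text(source)          # S100
        arrays = self.segment(text)            # S200
        refs = self.collect_refs()             # S400
        results = []
        for arr in arrays:
            key = self.locate_key(arr)         # S300
            results.append((key, self.match_refs(key, refs)))  # S500
        return results
```

Each field can be filled with any implementation of the corresponding step, which keeps the units independently replaceable.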

The beneficial effects of the invention are as follows. The invention provides a method and a system for intelligently positioning quotation contents: a character string is read from a text file as the character string to be detected; a word segmentation algorithm divides the character string to be detected into a plurality of character string arrays; key characters are located in each character string array; a plurality of different character string data are taken as a quotation content set; and the key characters are compared with each character string data in the quotation content set to match a plurality of character string data. This reduces the time cost of matching, so that quotation contents in a character string can be identified quickly and matched to the contents they quote.

Drawings

The above and other features of the present invention will become more apparent by describing in detail embodiments thereof with reference to the attached drawings in which like reference numerals designate the same or similar elements, it being apparent that the drawings in the following description are merely exemplary of the present invention and other drawings can be obtained by those skilled in the art without inventive effort, wherein:

FIG. 1 is a flow chart of a method for intelligent positioning of citation content;

FIG. 2 is a system structure diagram of a citation content intelligent positioning system.

Detailed Description

The conception, the specific structure and the technical effects of the present invention will be clearly and completely described in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the schemes and the effects of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, and "above", "below", "within" and the like are understood as including the stated number. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the technical features indicated, or the order of the technical features indicated.

Fig. 1 is a flow chart of an intelligent positioning method of cited content according to the present invention, and a method and a system for intelligent positioning of cited content according to an embodiment of the present invention are described below with reference to fig. 1.

The invention provides an intelligent positioning method for citation content, which specifically comprises the following steps:

all character strings in the character string array Arr(i) are input into an ELMo Chinese pre-training model together, the model outputs the embedded vector of each character string in Arr(i), and the embedded vector of the character string Arr[i(i)] is recorded as emb[i(i)];

in each character string array Arr(i), the matrix with n(i) columns and k rows formed by the embedded vectors emb[i(i)] of the character strings Arr[i(i)] in Arr(i) is recorded as Mat(i); the column with serial number i(i) in Mat(i) is emb[i(i)], the element with column serial number i(i) and row serial number v in Mat(i) has a value equal to emb[i(i)]v, and that element is recorded as Mat(i)[v, i(i)];

a positioning array is defined as an array used for locating the key characters in a character string array, the number of dimensions of the positioning array being the same as the number of elements in the corresponding character string array;

acquiring character string data of texts of a plurality of different papers by a web crawler technology, performing word segmentation and keyword extraction on each character string data by using a word segmentation algorithm and a keyword extraction algorithm, obtaining a plurality of words by the word segmentation algorithm in each character string data, extracting a plurality of keywords from the obtained plurality of words, and recording the sequence numbers of each extracted keyword in the corresponding plurality of words;

the set formed by the character string data of the texts of the plurality of different papers obtained by the web crawler is recorded as Refset; the number of elements in Refset is m, the serial number of an element in Refset is j, j ∈ [1, m], and the element with serial number j is Refset(j); the set of segmented words obtained from Refset(j) by the word segmentation algorithm is Refcont(j), the number of segmented words in Refcont(j) is n(j), and the serial number of a segmented word in Refcont(j) is i(j), i(j) ∈ [1, n(j)]; a plurality of different keywords are obtained from Refcont(j) by the keyword extraction algorithm, and the set of these keywords and the set of their serial numbers in Refcont(j) are recorded.

a pre-trained language model ERNIE is used (for the detailed structure of ERNIE, see Sun Y, Wang S, Li Y, et al. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding [J]. 2019) and fine-tuned to serve as a prediction model (for the fine-tuned prediction model, see Xiao D, Li Y K, Zhang H, et al. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding [J]. 2020, and its open-source implementation code); using its masked language modeling (Masked Language Model) mechanism, the positions of the located keywords in each element of the set Refset are masked, and N-gram prediction is performed on the masked positions with the prediction model (see the algorithm described in Section 3.3, Comprehensive N-gram Prediction, of the ERNIE-Gram paper cited above), the predicted probability (likelihood) that a masked position is the key character being taken as the prediction probability value;

p(j)=sin(π* Prd(keyw, Refset(j)) /(max(Prdset)- min(Prdset) ) ) ,

where sin is the sine function.

The citation content intelligent positioning system comprises a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the steps of the above method for intelligent positioning of cited content. The system may run on a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud data center, and a system on which it runs may include, but is not limited to, a processor, a memory and a server cluster.

As shown in fig. 2, an intelligent positioning system for cited content according to an embodiment of the present invention includes: a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the steps in one of the above cited intelligent positioning method embodiments when executing the computer program, the processor executing the computer program to run in the units of the following system:

The citation content intelligent positioning system can run on computing devices such as desktop computers, notebook computers, palmtop computers and cloud data centers. The system includes, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above is only an example of the intelligent positioning method and system for cited content and does not limit them; the system may include more or fewer components than described, combine certain components, or use different components. For example, the system may further include input and output devices, network access devices, buses and the like.

The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the citation content intelligent positioning system and uses various interfaces and lines to connect the parts of the whole system.

The memory may be used to store the computer program and/or modules; the processor implements the various functions of the citation content intelligent positioning method and system by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The invention provides a method and a system for intelligently positioning quotation contents, which are used for respectively positioning key characters for each character string array, taking a plurality of different character string data as a quotation content set, comparing the key characters with each character string data in the quotation content set, and matching a plurality of character string data, thereby reducing the matching time cost and realizing the beneficial effects of quickly identifying the quotation contents in the character strings and matching the quotation contents pointing to the quoted contents.

Although the present invention has been described in considerable detail and with reference to certain illustrated embodiments, it is not intended to be limited to any such details or embodiments or any particular embodiment, so as to effectively encompass the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the invention, not presently foreseen, may nonetheless represent equivalent modifications thereto.

Claims

1. An intelligent positioning method for citation content, characterized by comprising the following steps:

2. The intelligent positioning method for the quotation contents according to claim 1, characterized in that in S200, the method for segmenting the character string to be detected by using the word segmentation algorithm and dividing it into a plurality of character string arrays is as follows: the character string to be detected is divided into a plurality of sentences, taking the full stops in the character string as dividing points; each sentence is segmented into words by a Chinese word segmentation algorithm and its punctuation marks are removed, so that each sentence is divided into a plurality of character strings; the character strings of each sentence form one character string array, whereby a plurality of character string arrays are obtained.

3. The intelligent positioning method for the quotation contents according to claim 1, characterized in that in S300, the character string arrays are respectively positioned, and the method for positioning the key characters comprises:

4. The intelligent positioning method for the quotation contents according to claim 1, wherein in S400, the method for using a plurality of different character string data as the quotation content set comprises:

acquiring character string data of texts of a plurality of different webpages through a web crawler technology, performing word segmentation and keyword extraction on each character string data respectively by using a word segmentation algorithm and a keyword extraction algorithm, obtaining a plurality of words by the word segmentation algorithm in each character string data, extracting a plurality of keywords from the obtained plurality of words, and recording serial numbers of the extracted keywords in the corresponding plurality of words;

the set formed by the character string data of the texts of the plurality of different web pages obtained by the web crawler is recorded as Refset; the number of elements in Refset is m, the serial number of an element in Refset is j, j ∈ [1, m], and the element with serial number j is Refset(j); the set of segmented words obtained from Refset(j) by the word segmentation algorithm is Refcont(j), the number of segmented words in Refcont(j) is n(j), and the serial number of a segmented word in Refcont(j) is i(j), i(j) ∈ [1, n(j)]; a plurality of different keywords are obtained from Refcont(j) by the keyword extraction algorithm, and the set of these keywords and the set of their serial numbers in Refcont(j) are recorded; the set Refset is taken as the quotation content set.

5. The intelligent positioning method for the quotation contents according to claim 4, wherein in S500, the method for comparing the key characters with each character string data in the quotation content set to obtain a plurality of character string data comprises:

using a pre-training language model ERNIE, finely adjusting the ERNIE to be used as a prediction model, masking the positions of positioned keywords in each element in a set Refset by using a mask language modeling mechanism, performing N-gram prediction on the masked positions by using the prediction model, and predicting the probability that the masked positions are key characters to be used as a prediction probability value;

p(j)=sin(π* Prd(keyw, Refset(j)) /(max(Prdset)- min(Prdset) ) ) ,

6. An intelligent positioning system for cited content, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the intelligent positioning method for citation content according to any one of claims 1-5; the intelligent positioning system for cited content runs on a computing device that is a desktop computer, a notebook computer, a palmtop computer or a cloud data center, and the system on which it runs comprises a processor, a memory and a server cluster.