CN114091456A - Intelligent positioning method and system for quotation contents - Google Patents

Intelligent positioning method and system for quotation contents

Info

Publication number
CN114091456A
Authority
CN
China
Prior art keywords
character string
refset
arr
array
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210063117.9A
Other languages
Chinese (zh)
Other versions
CN114091456B (en
Inventor
蓝建敏
苗苏望
李锦洲
池穆霖
李观春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd filed Critical Excellence Information Technology Co ltd
Priority to CN202210063117.9A priority Critical patent/CN114091456B/en
Publication of CN114091456A publication Critical patent/CN114091456A/en
Application granted granted Critical
Publication of CN114091456B publication Critical patent/CN114091456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for intelligently locating citation content. Character strings in a text file are read as the character string to be detected; a word segmentation algorithm divides the character string to be detected into a plurality of character string arrays; key characters are located in each character string array; a plurality of different character string data are taken as a citation content set; and the key characters are compared with each character string data in the citation content set to match a plurality of character string data. This reduces the time cost of matching and allows citation content in a character string to be identified quickly and matched to the content it cites.

Description

Intelligent positioning method and system for quotation contents
Technical Field
The invention belongs to the technical field of unstructured data processing and distributed software, and particularly relates to a method and a system for intelligently locating citation content.
Background
Citation content positioning is a technique that uses character matching or regular-expression matching to extract key characters from character strings taken from a paper or document, and then associates and matches those key characters with the data of each cited work in a data set. In industrial applications, character string information is usually embedded with pre-trained text vectors, and the embedded vectors are used to compute the similarity between the characters and a data set so as to screen out the most similar data. Patent document CN109947915A discloses an artificial intelligence expert system based on a knowledge management system and a construction method thereof, which obtains first associated information from the questions and matching answers in question-answer information and updates it into a knowledge graph. However, extracting information in this way requires structured citation content to match against, its time cost and computational complexity are very high, and it is difficult to apply to matching calculation on unstructured text data.
Disclosure of Invention
The present invention aims to provide a method and a system for intelligently locating citation content, so as to solve one or more technical problems in the prior art and to provide at least a beneficial alternative or enabling conditions.
The invention provides a method and a system for intelligently positioning quotation contents, which are characterized in that character strings in a text file are read as character strings to be detected, a word segmentation algorithm is used for segmenting the character strings to be detected into a plurality of character string arrays, key characters are respectively positioned for each character string array, a plurality of different character string data are used as quotation content sets, the key characters are compared with each character string data in the quotation content sets, and a plurality of character string data are matched.
In order to achieve the above object, according to an aspect of the present invention, there is provided a method for intelligently locating cited contents, the method including the steps of:
S100, inputting a text file, and reading a character string in the text file to be used as a character string to be detected;
S200, performing word segmentation on the character string to be detected by using a word segmentation algorithm, and dividing the character string to be detected into a plurality of character string arrays;
S300, respectively positioning each character string array to position key characters;
S400, taking a plurality of different character string data as a quotation content set;
S500, comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.
The calculations in the steps of the method operate on the extracted numerical values, which are treated as dimensionless quantities.
Further, in S200, the method for segmenting the character string to be detected with a word segmentation algorithm and dividing it into a plurality of character string arrays is as follows: the character string to be detected is divided into a plurality of sentences, using the periods (full stops) in the character string as dividing points; each sentence is segmented into words with a Chinese word segmentation algorithm and its punctuation marks are removed, so that each sentence is divided into a plurality of character strings; the character strings of each sentence form one character string array, thereby yielding a plurality of character string arrays.
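The splitting described in S200 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the default segmenter simply splits on whitespace as a stand-in for a Chinese word segmentation algorithm, and punctuation is removed before segmentation for simplicity.

```python
import re
from typing import Callable, List

# Sentence-ending periods: the Chinese full stop and the ASCII period (assumption).
SENTENCE_END = re.compile(r"[。.]")
# Anything that is neither a word character nor whitespace is treated as punctuation.
PUNCTUATION = re.compile(r"[^\w\s]")


def whitespace_segmenter(sentence: str) -> List[str]:
    """Stand-in segmenter; a real system would call a Chinese word segmenter here."""
    return sentence.split()


def split_into_arrays(
    text: str, segment: Callable[[str], List[str]] = whitespace_segmenter
) -> List[List[str]]:
    """Split text into sentences at periods, then turn each sentence into a string array."""
    arrays = []
    for sentence in SENTENCE_END.split(text):
        cleaned = PUNCTUATION.sub(" ", sentence)  # drop the remaining punctuation marks
        words = segment(cleaned)
        if words:  # skip empty sentences, e.g. after a trailing period
            arrays.append(words)
    return arrays


print(split_into_arrays("the method reads a file, then segments it. key words are located."))
```

Passing a different `segment` callable is how a production segmenter would be plugged in; the rest of the sketch does not depend on the language of the text.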
Further, in S300, each character string array is respectively located, and the method for locating the key character includes:
denoting the set formed by the character string arrays as Arrset, the number of elements in Arrset as n, and the serial number of an element in Arrset as i, i ∈ [1, n]; denoting the element with serial number i in Arrset as Arr(i), the number of elements in the character string array Arr(i) as n(i), and the serial number of an element in Arr(i) as i(i), i(i) ∈ [1, n(i)]; and denoting the element with serial number i(i) in Arr(i) as the character string Arr[i(i)];
inputting all character strings in Arr(i) into an ELMo Chinese pre-trained model at the same time, the model outputting the embedding vector (word vector) of every character string in Arr(i), and denoting the embedding vector of the character string Arr[i(i)] as emb[i(i)];
denoting the number of dimensions of an embedding vector as k and the serial number of a dimension as v, v ∈ [1, k];
denoting the value of the dimension with serial number v in emb[i(i)] as emb[i(i)]v;
in each character string array Arr(i), connecting the first character string Arr[1] with the last character string Arr[n(i)], that is, taking Arr[1] as the element that follows Arr[n(i)], so that Arr(i) is joined end to end into a closed ring;
in each character string array Arr(i), denoting as Mat(i) the matrix with n(i) columns and k rows formed by the embedding vectors emb[i(i)] of the character strings Arr[i(i)]; the column with serial number i(i) in Mat(i) is emb[i(i)], and the element with column serial number i(i) and row serial number v, denoted Mat(i)[v, i(i)], has a value equal to emb[i(i)]v;
setting a positioning array, that is, an array used to locate the key character in a character string array, whose number of dimensions equals the number of elements in the corresponding character string array; the positioning values provide a fast multi-dimensional vectorized encoding of the key character in the character string array, which helps to locate the key character quickly while reducing the time complexity of the calculation;
denoting the positioning array of the character string array Arr(i) as Pis(i), the number of dimensions in Pis(i) being n(i) and the serial number of a dimension in Pis(i) being i(i); the value of the dimension with serial number i(i) in Pis(i) is calculated by the following formula:
[Formula for the positioning values Pis(i): image not available in the text]
where the function sig is an exponential function whose base is one half of π, that is, sig(x) = (π/2)^x; this yields the value of each dimension in Pis(i);
selecting the element with the smallest value in the positioning array Pis(i) and denoting its serial number as i(i'), i(i') ∈ [1, n(i)];
calculating r = int(n(i)/2), where int is the rounding function;
in the ring formed by joining the character string array Arr(i) end to end, taking the r-th element after the element with serial number i(i'), denoted Arr[i(i') + r] with the index wrapping around the ring, as the key character, whereby the key character is located.
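The final selection step on the closed ring can be sketched as follows. Because the formula for the positioning values is not reproduced in the text, this sketch takes the positioning values as a given input; the function name is hypothetical, indices are 0-based rather than 1-based, and int is assumed to truncate n(i)/2.

```python
from typing import Sequence


def locate_key_string(arr: Sequence[str], pis: Sequence[float]) -> str:
    """Return the string r steps after the minimum positioning value on the ring."""
    if not arr or len(arr) != len(pis):
        raise ValueError("arr and pis must be non-empty and of equal length")
    n = len(arr)
    # 0-based index of the smallest positioning value (first one wins on ties).
    i_min = min(range(n), key=lambda idx: pis[idx])
    r = n // 2  # assumption: int() truncates n / 2
    # The modulo implements the wrap-around from the last string back to the first.
    return arr[(i_min + r) % n]


words = ["method", "locates", "citation", "content", "quickly"]
pis_values = [0.9, 0.4, 0.7, 0.1, 0.8]  # hypothetical positioning values
print(locate_key_string(words, pis_values))  # i_min = 3, r = 2, (3 + 2) % 5 = 0 -> "method"
```

The modulo operation is what makes the array behave as a ring: stepping past the last element continues at the first one.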
Further, in S400, a method of using a plurality of different character string data as a set of the cited content is:
acquiring, by web crawler technology, the character string data of the texts of a plurality of different papers or web pages; applying a word segmentation algorithm and a keyword extraction algorithm to each character string data, the word segmentation algorithm yielding a plurality of segmented words from each character string data and the keyword extraction algorithm extracting a plurality of keywords from those words; and recording the serial number of each extracted keyword within the corresponding segmented words;
denoting as Refset the set formed by the character string data of the texts of the different papers or web pages obtained by web crawler technology, the number of elements in Refset as m, and the serial number of an element in Refset as j, j ∈ [1, m]; the element with serial number j is Refset(j), and the set of segmented words obtained from Refset(j) by the word segmentation algorithm is Refcont(j), which contains n(j) segmented words with serial numbers i(j), i(j) ∈ [1, n(j)]; the keyword extraction algorithm obtains a plurality of different keywords from each Refcont(j), and both the set of keywords obtained from Refcont(j) and the set of their serial numbers within Refcont(j) are recorded; the set Refset is taken as the citation content set.
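One possible in-memory layout for an element of Refset is sketched below. All names are hypothetical, whitespace splitting stands in for the word segmentation algorithm, and a frequency count stands in for the keyword extraction algorithm, which the text does not specify. Recording the first occurrence of each keyword is likewise an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ReferenceEntry:
    """One element Refset(j) with its segmented words and located keywords."""
    text: str
    words: List[str]            # Refcont(j): the segmented words
    keywords: List[str]         # keywords extracted from the words
    keyword_indices: List[int]  # 1-based serial numbers of the keywords in `words`


def build_entry(text: str, top_k: int = 2) -> ReferenceEntry:
    words = text.split()  # stand-in for a Chinese word segmentation algorithm
    counts = Counter(words)
    # Stand-in keyword extraction: the top_k most frequent words.
    keywords = [word for word, _ in counts.most_common(top_k)]
    # Record the first occurrence of each keyword, counted from 1.
    indices = [words.index(word) + 1 for word in keywords]
    return ReferenceEntry(text, words, keywords, indices)


refset = [build_entry(t) for t in ["citation graph citation index graph citation"]]
print(refset[0].keywords, refset[0].keyword_indices)
```

Keeping the serial numbers next to the words is what later allows the keyword positions to be masked without searching the text again.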
Further, in S500, the method of comparing the key character with each character string data in the citation content set to match out a plurality of character string data includes:
taking the pre-trained language model ERNIE and fine-tuning it to serve as the prediction model; using its masked language modeling (Masked Language Model) mechanism, masking the positions of the located keywords in each element of the set Refset; and performing N-gram prediction on the masked positions with the prediction model (see the algorithm described in Section 3.3, "Comprehensive N-gram Prediction", of Xiao D, Li Y K, Zhang H, et al. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding [J]. 2020), the predicted probability (likelihood) that the masked position is the key character being taken as the prediction probability value;
comparing the key character with each character string data in the citation content set: denoting the key character as keyw; letting the function Prd() denote prediction of the masked position by the prediction model to obtain the prediction probability value of the key character, so that Prd(keyw, Refset(j)) is the prediction probability value of keyw obtained by predicting the masked position in Refset(j); taking the set formed by the prediction probability values Prd(keyw, Refset(j)) over the elements Refset(j) of Refset as Prdset, in which the number of elements is m, the same as in Refset, and the serial number of an element is j, the same as in Refset, Prd(keyw, Refset(j)) being the element with serial number j in Prdset;
then matching a plurality of character string data: the function min returns the smallest value in a set and the function max returns the largest value, so that min(Prdset) is the smallest value in Prdset and max(Prdset) is the largest value in Prdset; the intersection value is defined as the value used to compare each element Refset(j) of Refset according to its prediction probability value Prd(keyw, Refset(j)); the intersection value corresponding to Refset(j) is p(j), calculated as
p(j)=sin(π* Prd(keyw, Refset(j)) /(max(Prdset)- min(Prdset) ) ) ,
where sin is the sine function and π is the circle constant; the intersection value allows the elements of the citation content set to be cross-compared quickly in batch, so as to screen out the citation content with the greatest likelihood according to the prediction probability values;
judging whether the intersection value of each element Refset(j) of Refset satisfies the constraint p(j) > 1: each character string Refset(j) whose intersection value satisfies p(j) > 1 is matched with the key character keyw, so that a plurality of Refset(j) may be matched; if no element Refset(j) satisfies p(j) > 1, keyw has no matching element in the set Refset; and recording and outputting, for each element Refset(j) of Refset, whether its intersection value satisfies the constraint p(j) > 1.
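The intersection value and the threshold test can be sketched as follows. The function names and sample probabilities are hypothetical. Note that the sine function never exceeds 1, so a strict test against 1 as written selects nothing; for that reason the sketch exposes the threshold as a parameter rather than fixing it, and the intended value should be checked against the granted claims.

```python
import math
from typing import List, Sequence


def intersection_values(prdset: Sequence[float]) -> List[float]:
    """p(j) = sin(pi * Prd_j / (max(Prdset) - min(Prdset))) for every element."""
    spread = max(prdset) - min(prdset)
    if spread == 0:
        raise ValueError("max(Prdset) equals min(Prdset); p(j) is undefined")
    return [math.sin(math.pi * prd / spread) for prd in prdset]


def match_indices(prdset: Sequence[float], threshold: float) -> List[int]:
    """1-based serial numbers j whose intersection value exceeds the threshold."""
    values = intersection_values(prdset)
    return [j for j, p in enumerate(values, start=1) if p > threshold]


prdset = [0.2, 0.5, 0.8]  # hypothetical prediction probability values
print(intersection_values(prdset))
print(match_indices(prdset, threshold=1.0))  # empty: sin(x) <= 1 for all x
print(match_indices(prdset, threshold=0.6))
```

The guard on a zero spread covers the case where every reference receives the same prediction probability, for which the formula divides by zero.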
The invention also provides a citation content intelligent positioning system, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the steps of the above method for intelligently locating citation content. The system can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers, and the runnable system may include, but is not limited to, a processor, a memory, and a server cluster. When executing the computer program, the processor runs the following units of the system:
the text input unit is used for inputting a text file and reading a character string in the text file as a character string to be detected;
the word segmentation detection unit is used for segmenting the character strings to be detected by using a word segmentation algorithm and dividing the character strings to be detected into a plurality of character string arrays;
the key character positioning unit is used for respectively positioning each character string array and positioning key characters;
a quotation content collecting unit for collecting a plurality of different character string data as quotation content;
and the character matching unit is used for comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.
The invention has the beneficial effects that: the invention provides a method and a system for intelligently positioning quotation contents, wherein character strings in a text file are read as character strings to be detected, a word segmentation algorithm is used for segmenting the character strings to be detected into a plurality of character string arrays, key characters are respectively positioned for each character string array, a plurality of different character string data are used as quotation content sets, the key characters are compared with each character string data in the quotation content sets, a plurality of character string data are matched, the time cost of matching is reduced, and the beneficial effects of quickly identifying the quotation contents in the character strings and matching and pointing to the quoted contents are realized.
Drawings
The above and other features of the present invention will become more apparent by describing in detail embodiments thereof with reference to the attached drawings in which like reference numerals designate the same or similar elements, it being apparent that the drawings in the following description are merely exemplary of the present invention and other drawings can be obtained by those skilled in the art without inventive effort, wherein:
FIG. 1 is a flow chart of a method for intelligent positioning of citation content;
FIG. 2 is a system configuration diagram of a citation content intelligent positioning system.
Detailed Description
The conception, the specific structure and the technical effects of the present invention will be clearly and completely described in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the schemes and the effects of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the stated number, while "above", "below", "within", and the like are understood as including the stated number. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the technical features indicated, or their order of precedence.
Fig. 1 is a flow chart of an intelligent positioning method of cited content according to the present invention, and a method and a system for intelligent positioning of cited content according to an embodiment of the present invention are described below with reference to fig. 1.
The invention provides an intelligent positioning method for citation content, which specifically comprises the following steps:
S100, inputting a text file, and reading a character string in the text file to be used as a character string to be detected;
S200, performing word segmentation on the character string to be detected by using a word segmentation algorithm, and dividing the character string to be detected into a plurality of character string arrays;
S300, respectively positioning each character string array to position key characters;
S400, taking a plurality of different character string data as a quotation content set;
S500, comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.
Further, in S200, the method for segmenting the character string to be detected with a word segmentation algorithm and dividing it into a plurality of character string arrays is as follows: the character string to be detected is divided into a plurality of sentences, using the periods (full stops) in the character string as dividing points; each sentence is segmented into words with a Chinese word segmentation algorithm and its punctuation marks are removed, so that each sentence is divided into a plurality of character strings; the character strings of each sentence form one character string array, thereby yielding a plurality of character string arrays.
Further, in S300, each character string array is respectively located, and the method for locating the key character includes:
denoting the set formed by the character string arrays as Arrset, the number of elements in Arrset as n, and the serial number of an element in Arrset as i, i ∈ [1, n]; denoting the element with serial number i in Arrset as Arr(i), the number of elements in the character string array Arr(i) as n(i), and the serial number of an element in Arr(i) as i(i), i(i) ∈ [1, n(i)]; and denoting the element with serial number i(i) in Arr(i) as the character string Arr[i(i)];
inputting all character strings in Arr(i) together into an ELMo Chinese pre-trained model, the model outputting the embedding vector of every character string in Arr(i), and denoting the embedding vector of the character string Arr[i(i)] as emb[i(i)];
denoting the number of dimensions of an embedding vector as k and the serial number of a dimension as v, v ∈ [1, k];
denoting the value of the dimension with serial number v in emb[i(i)] as emb[i(i)]v;
in each character string array Arr(i), connecting the first character string Arr[1] with the last character string Arr[n(i)], that is, taking Arr[1] as the element that follows Arr[n(i)], so that Arr(i) is joined end to end into a closed ring;
in each character string array Arr(i), denoting as Mat(i) the matrix with n(i) columns and k rows formed by the embedding vectors emb[i(i)] of the character strings Arr[i(i)]; the column with serial number i(i) in Mat(i) is emb[i(i)], and the element with column serial number i(i) and row serial number v, denoted Mat(i)[v, i(i)], has a value equal to emb[i(i)]v;
setting a positioning array, that is, an array used to locate the key character in a character string array, whose number of dimensions equals the number of elements in the corresponding character string array;
denoting the positioning array of the character string array Arr(i) as Pis(i), the number of dimensions in Pis(i) being n(i) and the serial number of a dimension in Pis(i) being i(i); the value of the dimension with serial number i(i) in Pis(i) is calculated by the following formula:
[Formula for the positioning values Pis(i): image not available in the text]
where the function sig is an exponential function whose base is one half of π, that is, sig(x) = (π/2)^x; this yields the value of each dimension in Pis(i);
selecting the element with the smallest value in the positioning array Pis(i) and denoting its serial number as i(i'), i(i') ∈ [1, n(i)];
calculating r = int(n(i)/2), where int is the rounding function;
in the ring formed by joining the character string array Arr(i) end to end, taking the r-th element after the element with serial number i(i'), denoted Arr[i(i') + r] with the index wrapping around the ring, as the key character, whereby the key character is located.
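The layout of the matrix Mat(i) used in S300 can be illustrated with a short sketch. The function name and the sample vectors are hypothetical, and plain Python lists are used in place of a numerical library; the embedding vectors would in practice come from the ELMo model.

```python
from typing import List, Sequence


def build_mat(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Arrange n embedding vectors of dimension k as a matrix with k rows and n columns."""
    if not embeddings:
        raise ValueError("at least one embedding vector is required")
    k = len(embeddings[0])
    if any(len(emb) != k for emb in embeddings):
        raise ValueError("all embedding vectors must share the same dimension k")
    # Row v collects dimension v of every vector, so column c equals embeddings[c].
    return [[emb[v] for emb in embeddings] for v in range(k)]


emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # hypothetical vectors: n = 2 strings, k = 3
mat = build_mat(emb)
print(mat)  # 3 rows, 2 columns
```

With this layout, `mat[v][c]` in 0-based indexing corresponds to Mat(i)[v, i(i)] in the 1-based notation of the text.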
Further, in S400, a method of using a plurality of different character string data as a set of the cited content is:
acquiring, by web crawler technology, the character string data of the texts of a plurality of different papers; applying a word segmentation algorithm and a keyword extraction algorithm to each character string data, the word segmentation algorithm yielding a plurality of segmented words from each character string data and the keyword extraction algorithm extracting a plurality of keywords from those words; and recording the serial number of each extracted keyword within the corresponding segmented words;
denoting as Refset the set formed by the character string data of the texts of the different papers obtained by web crawler technology, the number of elements in Refset as m, and the serial number of an element in Refset as j, j ∈ [1, m]; the element with serial number j is Refset(j), and the set of segmented words obtained from Refset(j) by the word segmentation algorithm is Refcont(j), which contains n(j) segmented words with serial numbers i(j), i(j) ∈ [1, n(j)]; the keyword extraction algorithm obtains a plurality of different keywords from Refcont(j), and both the set of keywords obtained from Refcont(j) and the set of their serial numbers within Refcont(j) are recorded.
Further, in S500, the method of comparing the key character with each character string data in the citation content set to match out a plurality of character string data includes:
taking the pre-trained language model ERNIE (for its detailed construction see Sun Y, Wang S, Li Y, et al. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding [J]. 2019) and fine-tuning it to serve as the prediction model (for the fine-tuned prediction model see Xiao D, Li Y K, Zhang H, et al. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding [J]. 2020, and its open-source implementation code); using its masked language modeling (Masked Language Model) mechanism, masking the positions of the located keywords in each element of the set Refset; and performing N-gram prediction on the masked positions with the prediction model (see the algorithm described in Section 3.3, "Comprehensive N-gram Prediction", of the ERNIE-Gram paper cited above), the predicted probability (likelihood) that the masked position is the key character being taken as the prediction probability value;
comparing the key character with each character string data in the citation content set: denoting the key character as keyw; letting the function Prd() denote prediction of the masked position by the prediction model to obtain the prediction probability value of the key character, so that Prd(keyw, Refset(j)) is the prediction probability value of keyw obtained by predicting the masked position in Refset(j); taking the set formed by the prediction probability values Prd(keyw, Refset(j)) over the elements Refset(j) of Refset as Prdset, in which the number of elements is m, the same as in Refset, and the serial number of an element is j, the same as in Refset, Prd(keyw, Refset(j)) being the element with serial number j in Prdset;
then matching a plurality of character string data: the function min returns the smallest value in a set and the function max returns the largest value, so that min(Prdset) is the smallest value in Prdset and max(Prdset) is the largest value in Prdset; the intersection value is defined as the value used to compare each element Refset(j) of Refset according to its prediction probability value Prd(keyw, Refset(j)); the intersection value corresponding to Refset(j) is p(j), calculated as
p(j)=sin(π* Prd(keyw, Refset(j)) /(max(Prdset)- min(Prdset) ) ) ,
where sin is the sine function and π is the circle constant;
judging whether the intersection value of each element Refset(j) of Refset satisfies the constraint p(j) > 1: each character string Refset(j) whose intersection value satisfies p(j) > 1 is matched with the key character keyw, so that a plurality of Refset(j) may be matched; if no element Refset(j) satisfies p(j) > 1, keyw has no matching element in the set Refset; and recording and outputting, for each element Refset(j) of Refset, whether its intersection value satisfies the constraint p(j) > 1.
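The masking step that precedes prediction in S500 can be sketched as follows. This only illustrates how the recorded keyword positions might be replaced by a mask symbol; the prediction with the fine-tuned ERNIE model is not shown. The function name is hypothetical, and the "[MASK]" symbol is an assumption borrowed from common BERT-style tokenizers.

```python
from typing import List, Sequence

MASK_TOKEN = "[MASK]"  # assumption: the mask symbol used by BERT-style models


def mask_keywords(words: Sequence[str], keyword_indices: Sequence[int]) -> List[str]:
    """Replace the words at the given 1-based serial numbers with the mask token."""
    positions = set(keyword_indices)
    for pos in positions:
        if not 1 <= pos <= len(words):
            raise IndexError(f"serial number {pos} is outside 1..{len(words)}")
    return [
        MASK_TOKEN if idx in positions else word
        for idx, word in enumerate(words, start=1)
    ]


print(mask_keywords(["citation", "content", "is", "located"], [1, 4]))
```

The masked word list is what a prediction function such as Prd(keyw, Refset(j)) would receive, together with the key character whose probability is to be estimated at the masked positions.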
The cited content intelligent positioning system comprises: the processor executes the computer program to implement the steps in the above-mentioned method for intelligent positioning of cited content, the system for intelligent positioning of cited content may be operated in a computing device such as a desktop computer, a notebook computer, a palm computer, and a cloud data center, and the operable systems may include, but are not limited to, a processor, a memory, and a server cluster.
As shown in fig. 2, an intelligent positioning system for cited content according to an embodiment of the present invention includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of an embodiment of the above intelligent positioning method for cited content when executing the computer program, and the computer program runs in the following units of the system:
the text input unit is used for inputting a text file and reading a character string in the text file as a character string to be detected;
the word segmentation detection unit is used for segmenting the character strings to be detected by using a word segmentation algorithm and dividing the character strings to be detected into a plurality of character string arrays;
the key character positioning unit is used for respectively positioning each character string array and positioning key characters;
a quotation content collecting unit for collecting a plurality of different character string data as quotation content;
and the character matching unit is used for comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.
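The way these units cooperate can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the class name `CitationLocator` and its field names are hypothetical, each unit is modeled as a pluggable callable, and the text input and quotation content collecting units are represented simply by the `text` and `refset` arguments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CitationLocator:
    # Word segmentation detection unit: text -> list of character string arrays.
    segment: Callable[[str], List[List[str]]]
    # Key character positioning unit: one character string array -> key character.
    locate_key: Callable[[List[str]], str]
    # Character matching unit: key character + reference set -> matched serial numbers.
    match: Callable[[str, List[str]], List[int]]

    def run(self, text: str, refset: List[str]) -> List[Tuple[str, List[int]]]:
        """Locate one key character per array and match it against refset."""
        results = []
        for arr in self.segment(text):
            keyw = self.locate_key(arr)
            results.append((keyw, self.match(keyw, refset)))
        return results

# Toy stand-ins for the real units (assumptions for demonstration only).
locator = CitationLocator(
    segment=lambda text: [s.split() for s in text.split(".") if s.strip()],
    locate_key=lambda arr: arr[len(arr) // 2],
    match=lambda keyw, refset: [j for j, ref in enumerate(refset, start=1)
                                if keyw in ref],
)
pipeline_output = locator.run("alpha beta gamma. delta epsilon",
                              ["beta test", "epsilon beta"])
```

Keeping each unit behind a callable lets the segmentation algorithm or the prediction model be replaced without changing the surrounding flow.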
The intelligent positioning system for cited content can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud data centers. The system may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above is merely an example of the intelligent positioning method and system for cited content and does not limit them; the system may include more or fewer components than described, combine certain components, or use different components. For example, the system may further include input and output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the intelligent positioning system for cited content and connects the parts of the whole system through various interfaces and lines.
The memory may be used to store the computer program and/or modules; the processor implements the various functions of the intelligent positioning method and system for cited content by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
The invention provides an intelligent positioning method and system for quotation contents: key characters are located in each character string array, a plurality of different character string data are taken as a quotation content set, and the key characters are compared with each character string data in the quotation content set to match a plurality of character string data. This reduces the time cost of matching and achieves the beneficial effect of quickly identifying quotation contents in a character string and matching them to the content they cite.
Although the present invention has been described in considerable detail and with reference to certain illustrated embodiments, it is not intended to be limited to any such details or embodiments or any particular embodiment, so as to effectively encompass the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the invention, not presently foreseen, may nonetheless represent equivalent modifications thereto.

Claims (6)

1. An intelligent positioning method for citation content, characterized by comprising the following steps:
S100, inputting a text file, and reading a character string in the text file as the character string to be detected;
S200, performing word segmentation on the character string to be detected by using a word segmentation algorithm, and dividing the character string to be detected into a plurality of character string arrays;
S300, positioning each character string array respectively to locate key characters;
S400, taking a plurality of different character string data as a quotation content set;
and S500, comparing the key characters with each character string data in the quotation content set to match a plurality of character string data.
2. The intelligent positioning method for quotation contents according to claim 1, characterized in that in S200, the method for segmenting the character string to be detected by using a word segmentation algorithm and dividing it into a plurality of character string arrays comprises: dividing the character string to be detected into a plurality of sentences by taking the periods in the character string as dividing points; segmenting each sentence into words by using a Chinese word segmentation algorithm and removing the punctuation marks from each sentence, so that each sentence is divided into a plurality of character strings; and forming the character strings of each sentence into a character string array, thereby obtaining a plurality of character string arrays.
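The segmentation step of claim 2 can be sketched as below. This is a hedged illustration: the function name `split_into_arrays` is hypothetical, and the default whitespace tokenizer is only a stand-in for a Chinese word segmentation algorithm, which would be passed in through the `tokenize` parameter.

```python
import re

def split_into_arrays(text, tokenize=str.split):
    """Split text at periods, segment each sentence, and drop punctuation.

    tokenize is an assumed stand-in for a Chinese word segmentation algorithm;
    it must map a sentence to a list of word strings.
    """
    arrays = []
    # Both the Chinese full stop and the ASCII period act as dividing points.
    for sentence in re.split(r"[。.]", text):
        words = []
        for token in tokenize(sentence):
            # Remove any punctuation attached to (or forming) the token.
            cleaned = re.sub(r"[^\w]", "", token)
            if cleaned:
                words.append(cleaned)
        if words:
            arrays.append(words)
    return arrays

example_arrays = split_into_arrays("The cat sat, quietly. Dogs bark!")
```

Each inner list corresponds to one character string array, so the example yields one array per sentence.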
3. The intelligent positioning method for the quotation contents according to claim 1, characterized in that in S300, the character string arrays are respectively positioned, and the method for positioning the key characters comprises:
recording the set formed by the character string arrays as Arrset, the number of elements in Arrset as n, and the serial number of the elements in Arrset as i, i ∈ [1, n]; recording the element with serial number i in Arrset as Arr(i), the number of elements in the character string array Arr(i) as n(i), and the serial number of the elements in the character string array Arr(i) as i(i), i(i) ∈ [1, n(i)]; and recording the element with serial number i(i) in the character string array Arr(i) as the character string Arr[i(i)];
inputting all character strings in the character string array Arr(i) together into a pre-trained Chinese ELMo model, wherein the model outputs the embedded vectors of all character strings in Arr(i), and the embedded vector of the character string Arr[i(i)] in Arr(i) is recorded as emb[i(i)];
recording the number of dimensions in the embedded vector as k, and the serial number of the dimensions in the embedded vector as v, v ∈ [1, k];
recording the value of the dimension with serial number v in emb[i(i)] as emb[i(i)]v;
in each character string array Arr(i), connecting the first character string Arr[1] with the last character string Arr[n(i)], that is, the next element after Arr[n(i)] in Arr(i) is Arr[1], so that the first and last character strings in the character string array Arr(i) are connected end to end to form a closed ring;
in each character string array Arr(i), recording the matrix with n(i) columns and k rows formed by the embedded vectors emb[i(i)] of the character strings Arr[i(i)] in Arr(i) as Mat(i), wherein the column with serial number i(i) in Mat(i) is emb[i(i)], the element with column number i(i) and row number v in Mat(i) is assigned the value emb[i(i)]v, and that element is recorded as Mat(i)[v, i(i)];
setting a positioning array as an array for positioning key characters in a character string array, wherein the number of dimensions in the positioning array is the same as the number of elements in the character string array corresponding to the positioning array;
the positioning array of the character string array Arr (i) is denoted as Pis (i), the number of dimensions in Pis (i) is denoted as n (i), the serial number of the dimensions in Pis (i) is denoted as i (i), the numerical value of the dimension with the serial number of i (i) in Pis (i) is Pis (i), and the calculation formula of Pis (i) is as follows:
[Formula image: Figure DEST_PATH_IMAGE002A]
wherein the function sig is an exponential function whose base is one half of π (the ratio of a circle's circumference to its diameter), thereby obtaining the value of each dimension in Pis(i);
selecting the element with the minimum value in the positioning array Pis(i), and recording the serial number of that element as i(i'), wherein i(i') ∈ [1, n(i)];
calculating int(n(i)/2), wherein the function int is a rounding function, and letting r = int(n(i)/2);
in the ring formed by connecting the first and last character strings of the character string array Arr(i) end to end, obtaining the r-th element after the element with serial number i(i'), denoted Arr[i(i') + r]; Arr[i(i') + r] is the key character, whereby the key character is located.
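The final selection step of claim 3 amounts to modular indexing on a ring and can be sketched as follows. The sketch assumes the positioning array has already been computed (its formula is given only as an image and is not reproduced here); the function name `locate_key_character` is hypothetical, truncation is assumed for the int function, and the first minimum is assumed to win when several values tie.

```python
def locate_key_character(arr, pis):
    """Pick the key character from arr given its positioning array pis.

    arr : list of character strings, treated as a closed ring
    pis : list of numbers, one per element of arr (assumed precomputed)
    """
    if not arr or len(arr) != len(pis):
        raise ValueError("arr and pis must be non-empty and of equal length")
    n = len(arr)
    # 0-based position of the minimum value; ties resolve to the first one.
    i_min = min(range(n), key=lambda idx: pis[idx])
    # Assumption: int() truncates, so r = floor(n / 2) for positive n.
    r = int(n / 2)
    # Step r elements forward on the ring, wrapping past the last element.
    return arr[(i_min + r) % n]

key_a = locate_key_character(["a", "b", "c", "d", "e"], [0.9, 0.2, 0.7, 0.4, 0.8])
key_b = locate_key_character(["x", "y", "z", "w"], [3, 2, 1, 0])
```

In the second call the minimum sits on the last element, so stepping two positions forward wraps around the ring to the second element.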
4. The intelligent positioning method for the quotation contents according to claim 1, wherein in S400, the method for using a plurality of different character string data as the quotation content set comprises:
acquiring character string data of texts of a plurality of different webpages through a web crawler technology, performing word segmentation and keyword extraction on each character string data respectively by using a word segmentation algorithm and a keyword extraction algorithm, obtaining a plurality of words by the word segmentation algorithm in each character string data, extracting a plurality of keywords from the obtained plurality of words, and recording serial numbers of the extracted keywords in the corresponding plurality of words;
recording the set formed by the character string data of the texts of the plurality of different web pages obtained through the web crawler technology as Refset, wherein the number of elements in the set Refset is m, the serial number of the elements in the set Refset is j, j ∈ [1, m], and the element with serial number j in the set Refset is Refset(j); the set of the plurality of words obtained from Refset(j) by the word segmentation algorithm is Refcont(j), the number of words in Refcont(j) is n(j), and the serial number of the words in Refcont(j) is i(j), i(j) ∈ [1, n(j)]; a plurality of different keywords are obtained from Refcont(j) by the keyword extraction algorithm, the set of keywords obtained from Refcont(j) is refarg(j), and the set of the serial numbers of those keywords in Refcont(j) is refxet(j); and the quotation content set is Refset.
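The per-element bookkeeping of claim 4 (segmented words, extracted keywords, and the serial numbers of those keywords) can be sketched as a small data structure. This is an illustration under stated assumptions: page texts are taken as already fetched, the names `RefEntry` and `build_entry` are hypothetical, whitespace splitting stands in for the word segmentation algorithm, and picking the most frequent words is a swapped-in substitute for the unspecified keyword extraction algorithm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class RefEntry:
    text: str            # Refset(j): character string data of one page
    refcont: List[str]   # Refcont(j): words produced by segmentation
    keywords: List[str]  # keywords extracted from Refcont(j)
    indices: List[int]   # 1-based serial numbers of the keywords in Refcont(j)

def build_entry(text, tokenize=str.split, top_k=2):
    """Build one quotation-content entry from already-fetched page text."""
    refcont = tokenize(text)
    # Stand-in keyword extraction: keep the top_k most frequent words.
    keywords = [word for word, _ in Counter(refcont).most_common(top_k)]
    indices = [pos for pos, word in enumerate(refcont, start=1)
               if word in keywords]
    return RefEntry(text=text, refcont=refcont, keywords=keywords, indices=indices)

example_entry = build_entry("citation index builds citation graph from index data")
```

Recording the serial numbers alongside the keywords is what later allows those positions to be masked for prediction.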
5. The intelligent positioning method for the quotation contents according to claim 4, wherein in S500, the method for comparing the key characters with each character string data in the quotation content set to obtain a plurality of character string data comprises:
fine-tuning a pre-trained language model ERNIE to serve as the prediction model, masking the positions of the located keywords in each element of the set Refset by using a masked language modeling mechanism, performing N-gram prediction on the masked positions with the prediction model, and taking the predicted probability that the masked position is the key character as the prediction probability value;
comparing the key character with each character string data in the quotation content set: the key character is denoted keyw; the function Prd() denotes predicting the masked position with the prediction model to obtain the prediction probability value of the key character; Prd(keyw, Refset(j)) denotes the prediction probability value of keyw obtained by predicting the masked position in Refset(j) with the prediction model; the set formed by the prediction probability values Prd(keyw, Refset(j)) of keyw over the elements Refset(j) of Refset is denoted Prdset; the number of elements in Prdset is m, the same as the number of elements in Refset; the serial number of the elements in Prdset is j, the same as the serial number of the elements in Refset; and Prd(keyw, Refset(j)) is the element with serial number j in Prdset;
and then, matching a plurality of character string data: the function min returns the value of the smallest element in a set and the function max returns the value of the largest element in a set, so that min(Prdset) is the value of the smallest element in Prdset and max(Prdset) is the value of the largest element in Prdset; the intersection value is defined as the value used to compare each element Refset(j) of Refset according to its corresponding prediction probability value Prd(keyw, Refset(j)); the intersection value corresponding to Refset(j) is denoted p(j), and p(j) is calculated as
p(j) = sin(π * Prd(keyw, Refset(j)) / (max(Prdset) - min(Prdset))),
judging whether the intersection value of each element Refset(j) of Refset satisfies the constraint condition p(j) > 1: if an element Refset(j) satisfies p(j) > 1, the character string Refset(j) satisfying the constraint condition is matched with the key character keyw, and a plurality of elements Refset(j) may be matched in this way; if no element Refset(j) satisfies p(j) > 1, keyw has no matching element in the set Refset; and recording and outputting the result of judging whether the intersection value of each element Refset(j) of Refset satisfies the constraint condition p(j) > 1.
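The masking and Prdset-building steps of claim 5 can be sketched as follows. This is a hedged illustration only: the function names `mask_positions` and `build_prdset` are hypothetical, the mask token string is an assumption, and the prediction model is represented by a caller-supplied callable `prd`, since the fine-tuned ERNIE model itself is outside the scope of the sketch.

```python
def mask_positions(tokens, positions, mask_token="[MASK]"):
    """Replace the tokens at the given 1-based serial numbers with a mask token."""
    masked = set(positions)
    return [mask_token if pos in masked else token
            for pos, token in enumerate(tokens, start=1)]

def build_prdset(keyw, refset, prd):
    """Build Prdset: one prediction probability value per element of refset.

    refset : list of (tokens, keyword_positions) pairs, one per Refset(j)
    prd    : callable (keyw, masked_tokens) -> probability; a stand-in for
             the prediction model
    """
    return [prd(keyw, mask_positions(tokens, positions))
            for tokens, positions in refset]

# Toy predictor (assumption): the share of masked tokens acts as a fake probability.
def toy_prd(keyw, masked_tokens):
    return masked_tokens.count("[MASK]") / len(masked_tokens)

example_masked = mask_positions(["a", "b", "c"], [2])
example_prdset = build_prdset("b", [(["a", "b", "c"], [2]), (["b", "b"], [1, 2])], toy_prd)
```

The resulting list keeps the same order and length as the reference set, which preserves the shared serial number j between Prdset and Refset.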
6. An intelligent positioning system for cited content, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the intelligent positioning method for quotation contents according to any one of claims 1-5; the intelligent positioning system for cited content runs on a computing device among a desktop computer, a notebook computer, a palmtop computer, and a cloud data center, and the running system comprises a processor, a memory, and a server cluster.
CN202210063117.9A 2022-01-20 2022-01-20 Intelligent positioning method and system for quotation contents Active CN114091456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210063117.9A CN114091456B (en) 2022-01-20 2022-01-20 Intelligent positioning method and system for quotation contents

Publications (2)

Publication Number Publication Date
CN114091456A (en) 2022-02-25
CN114091456B (en) 2022-04-15

Family

ID=80308928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210063117.9A Active CN114091456B (en) 2022-01-20 2022-01-20 Intelligent positioning method and system for quotation contents

Country Status (1)

Country Link
CN (1) CN114091456B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539904A (en) * 2009-04-21 2009-09-23 武汉大学 Automatic indexing method of quotations
CN107729480A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device of limited area
CN107861949A (en) * 2017-11-22 2018-03-30 珠海市君天电子科技有限公司 Extracting method, device and the electronic equipment of text key word
WO2019222787A1 (en) * 2018-05-21 2019-11-28 Citehero Pty Ltd A computer implemented method and a computer system for determining a set of citations related to an electronic document edited by a user on a computing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MANABU OHTA ET AL.: "Empirical Evaluation of CRF-Based Bibliography Extraction from Reference Strings", 《2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS》 *
ZHANG Yanan et al.: "Research on Static Code Defect Localization Techniques", 《信息与电脑(理论版)》 (Information and Computer, Theory Edition) *

Also Published As

Publication number Publication date
CN114091456B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
Zhang et al. Adversarial attacks on deep-learning models in natural language processing: A survey
CN110347835B (en) Text clustering method, electronic device and storage medium
Cerda et al. Encoding high-cardinality string categorical variables
US11157693B2 (en) Stylistic text rewriting for a target author
US20150095017A1 (en) System and method for learning word embeddings using neural language models
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
WO2021051574A1 (en) English text sequence labelling method and system, and computer device
US11003950B2 (en) System and method to identify entity of data
CN109791570B (en) Efficient and accurate named entity recognition method and device
CN112686049A (en) Text auditing method, device, equipment and storage medium
US20230237395A1 (en) Apparatus and methods for matching video records with postings using audiovisual data processing
CN110866098A (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN111767714B (en) Text smoothness determination method, device, equipment and medium
CN112101031A (en) Entity identification method, terminal equipment and storage medium
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
Mankolli et al. Machine learning and natural language processing: Review of models and optimization problems
CN113836295A (en) Text abstract extraction method, system, terminal and storage medium
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN117152770A (en) Handwriting input-oriented writing capability intelligent evaluation method and system
CN114091456B (en) Intelligent positioning method and system for quotation contents
CN113627157B (en) Probability threshold value adjusting method and system based on multi-head attention mechanism
CN114492437A (en) Keyword recognition method and device, electronic equipment and storage medium
CN111199170B (en) Formula file identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant