WO2021159760A1 - 文章截断点的设定方法、装置以及计算机设备 (Method, device and computer equipment for setting article truncation points) - Google Patents

文章截断点的设定方法、装置以及计算机设备 (Method, device and computer equipment for setting article truncation points)

Info

Publication number
WO2021159760A1
WO2021159760A1 · PCT/CN2020/125150 · CN2020125150W
Authority
WO
WIPO (PCT)
Prior art keywords
vector
sentence
article
target
initial
Prior art date
Application number
PCT/CN2020/125150
Other languages
English (en)
French (fr)
Inventor
吴汇哲
顾大中
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021159760A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/131Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, device and computer equipment for setting the truncation point of an article.
  • The main purpose of this application is to provide a method, device and computer equipment for setting the truncation point of an article, aiming to solve the prior-art problem that only the similarity of the information contained in two adjacent sentences is calculated, while the information of the remaining sentences is ignored.
  • This application provides a method for setting the truncation point of an article, including:
  • each sentence of the article is input into the bert model to obtain multiple word vectors corresponding to each sentence, and these are input in the form of a word-vector sequence into a bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector for each sentence, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
  • the end of each sentence's first sentence vector is spliced to the head of its second sentence vector to obtain the sentence's target vector;
  • a target sentence is selected from the article, the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence are weighted and summed to obtain a first vector, and the target vectors corresponding to every sentence from the end of the target sentence to the end of the article are weighted and summed to obtain a second vector, where the dimension of the first vector equals the dimension of the second vector;
  • the similarity of the first vector and the second vector corresponding to the target sentence is calculated, the calculated first similarity value is mapped by a sigmoid nonlinear mapping into the (0,1) interval, and the linear distance from 1 is found;
  • the linear distance is compared with a set threshold, and when the linear distance is higher than the set threshold, the end position of the target sentence is used as the initial truncation point.
  • Further, the step of finding the linear distance from 1 includes: calculating the first similarity value by formula; calculating the mapping value of the nonlinear mapping into the (0,1) interval by the sigmoid formula; and obtaining the linear distance from 1 according to the mapping value.
  • Further, after the step of setting the initial truncation point, the method also includes: obtaining a first text distance from each initial truncation point to the beginning of the article and a second text distance to the end of the article; calculating a position score for each initial truncation point; and, according to the first similarity value and the position score corresponding to each initial truncation point, selecting a preset number of target truncation points from the initial truncation points to truncate the article.
  • Further, the step of selecting the target truncation points includes: recording the set of all initial truncation points as a first set; recording each set of the preset number of initial truncation points selected from the first set as a second set; calculating a score value for each second set by a calculation formula; and selecting the second set with the highest score value, using the initial truncation points in that set as the target truncation points.
  • Further, before the step of selecting a preset number of target truncation points from the initial truncation points to truncate the article, the method also includes: splicing the first sentence vector of every sentence in the article to obtain the article vector; and looking up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, where the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
  • Further, the step of inputting each sentence of the article into the bert model to obtain multiple word vectors and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain the first sentence vector and the second sentence vector includes:
  • the sentence is preprocessed, and a TOKEN list is established according to the position of the sentence in the article to record the sentence's position, where the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
  • the text data of the data set is read through the bert model, and the word vectors are constructed by fine-tuning the bert model, where the bert model is trained on a word database;
  • the word vectors are arranged into the word-vector sequence according to their order in the sentence; the first sentence vector is spliced in the order of the word-vector sequence, and the second sentence vector is spliced in the reverse order.
  • Further, after the step of setting the initial truncation point, the method includes: calculating a second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point; extracting the initial truncation points whose second similarity value is less than a preset similarity value as first truncation points; and screening target truncation points out of the first truncation points by a preset rule, truncating the article at the target truncation points.
  • This application also provides a device for setting the truncation point of an article, including:
  • a vectorization module, used to input each sentence of the article into the bert model to obtain multiple word vectors for each sentence, and to input them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
  • a vector splicing module, used to splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the sentence's target vector;
  • a weighted sum calculation module, used to select a target sentence from the article, weight and sum the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors of every sentence from the end of the target sentence to the end of the article to obtain a second vector, the dimension of the first vector being equal to the dimension of the second vector;
  • a first similarity value calculation module, used to calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
  • an initial truncation point setting module, used to compare the linear distance with a set threshold and, when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
  • This application also provides a computer device, including a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the method for setting the article truncation point:
  • each sentence of the article is input into the bert model to obtain multiple word vectors, which are input as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first and second sentence vectors, and the end of the first sentence vector is spliced to the head of the second sentence vector to obtain each sentence's target vector;
  • a target sentence is selected from the article, the target vectors of every sentence from the beginning of the article to the end of the target sentence are weighted and summed to obtain a first vector, and the target vectors of every sentence from the end of the target sentence to the end of the article are weighted and summed to obtain a second vector, the dimensions of the two vectors being equal;
  • the similarity of the first vector and the second vector corresponding to the target sentence is calculated, the calculated first similarity value is mapped by a sigmoid nonlinear mapping into the (0,1) interval, and the linear distance from 1 is found;
  • the linear distance is compared with a set threshold, and when the linear distance is higher than the set threshold, the end position of the target sentence is used as the initial truncation point.
  • This application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the same steps of the method for setting the article truncation point are realized.
  • By weighting and summing the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain the first vector, weighting and summing the target vectors of every sentence from the end of the target sentence to the end of the article to obtain the second vector, and calculating their similarity, the information of all sentences is fully considered and the truncation points of the article can be better selected.
  • FIG. 1 is a schematic flowchart of a method for setting an article truncation point according to an embodiment of the present application;
  • FIG. 2 is a schematic structural block diagram of a device for setting an article truncation point according to an embodiment of the present application;
  • FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the application.
  • Referring to FIG. 1, this application proposes a method for setting the truncation point of an article, which includes:
  • S1: Input each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and input them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
  • S2: Splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the sentence's target vector;
  • S3: Select a target sentence from the article, weight and sum the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors of every sentence from the end of the target sentence to the end of the article to obtain a second vector; the dimension of the first vector equals the dimension of the second vector;
  • S4: Calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
  • S5: Compare the linear distance with a set threshold, and when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
  • As described in step S1, each sentence of the article is input into the bert model to obtain the multiple word vectors corresponding to each sentence. Sentences are delimited by sentence-ending punctuation: the content from the beginning of the article to the first delimiter is one sentence, and the content between delimiters is one sentence. The delimiters may be Chinese or English punctuation, such as a period, exclamation mark or question mark.
  • The bert model can be trained on corpus databases of different categories, yielding different bert models; the model matching the article's category is then selected for input. Since that model is trained on the corpus database of the corresponding category, the word vectors it generates will be better.
  • As described in step S2, so that the information contained in each sentence can be better used in computation, the first sentence vector, spliced in the order of the word-vector sequence, and the second sentence vector, spliced in the reverse order, are concatenated to form the target vector. The target vector reduces the loss of subsequent calculations, so the later similarity calculation gives better results.
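  • As an illustration of steps S1-S2, a minimal sketch follows, assuming the Hugging Face transformers library, a Chinese BERT checkpoint and a 256-unit BiLSTM; these choices are ours for illustration and are not specified by the application:

```python
# Sketch of steps S1-S2: per-sentence word vectors from BERT fed through a
# BiLSTM; the forward (first) and backward (second) sentence vectors are then
# concatenated into the sentence's target vector. Model name and sizes are
# illustrative assumptions, not taken from the application.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bilstm = torch.nn.LSTM(input_size=768, hidden_size=256,
                       bidirectional=True, batch_first=True)

def target_vector(sentence: str) -> torch.Tensor:
    """Return the concatenated forward+backward sentence vector."""
    tokens = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        word_vecs = bert(**tokens).last_hidden_state   # (1, seq_len, 768)
        _, (h_n, _) = bilstm(word_vecs)                # h_n: (2, 1, 256)
    first = h_n[0, 0]    # forward pass: word vectors in sentence order
    second = h_n[1, 0]   # backward pass: word vectors in reverse order
    # End of the first sentence vector joined to the head of the second.
    return torch.cat([first, second])                  # (512,)
```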
  • As described in step S3, a target sentence is selected; each sentence of the article may be selected in turn as the target sentence. The target vectors of every sentence from the beginning of the article to the end of the target sentence are then weighted and summed to obtain the first vector, and the target vectors of every sentence from the end of the target sentence to the end of the article are weighted and summed to obtain the second vector. The weighted sum may include raising or reducing the dimension of the first vector and/or the second vector, so that the dimensions of the two vectors stay consistent and the subsequent similarity calculation is straightforward.
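  • A minimal sketch of step S3 follows, under the simple assumption that the weighted sum is a uniform average; the application leaves the weights, and any dimension-raising or dimension-reducing step, open:

```python
# Sketch of step S3: split the article's target vectors at a candidate
# sentence and reduce each side to a single vector. A uniform average is
# assumed here; the application only requires that both results end up
# with equal dimension. k must leave at least one sentence on each side.
import torch

def first_and_second_vectors(target_vecs: list[torch.Tensor], k: int):
    """target_vecs: one target vector per sentence; k: target sentence index."""
    head = torch.stack(target_vecs[: k + 1])   # article start .. target sentence
    tail = torch.stack(target_vecs[k + 1 :])   # after target sentence .. article end
    first_vec = head.mean(dim=0)
    second_vec = tail.mean(dim=0)
    assert first_vec.shape == second_vec.shape  # dimensions must match
    return first_vec, second_vec
```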
  • As described in step S4, the similarity of the first vector and the second vector is calculated. The similarity calculation may use the WMD algorithm (word mover's distance), the simhash algorithm, a cosine-similarity-based algorithm, an SVM (Support Vector Machine) vector model, and so on; any method that can compute the similarity between the first vector and the second vector is sufficient. The calculated first similarity value is then mapped into the (0,1) interval, so that the similarity is expressed as a linear distance from 1 and can readily be compared with the threshold.
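  • The application does not fix one similarity measure; the sketch below uses cosine similarity (our choice, one of the measures the text allows), followed by the sigmoid mapping and the distance from 1:

```python
# Sketch of step S4's scoring: cosine similarity, sigmoid mapping into the
# (0,1) interval, then the linear distance from 1 that is later compared
# against the set threshold.
import torch

def linear_distance(first_vec: torch.Tensor, second_vec: torch.Tensor) -> float:
    h = torch.nn.functional.cosine_similarity(first_vec, second_vec, dim=0)
    mapped = torch.sigmoid(h)        # nonlinear map into the (0,1) interval
    return float(1.0 - mapped)       # linear distance from 1
```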
  • As described in step S5, comparing the similarity value with the set threshold determines whether the end of each sentence meets the initial condition for segmentation. When the initial condition is met, the end position of the corresponding target sentence can be used as an initial truncation point, and the initial truncation point may be used directly as the final truncation point to truncate the article. When multiple truncation points exist, one or more initial truncation points may be selected to truncate the article. The selection rule is not limited: for example, initial truncation points may be chosen so that the word counts of the resulting paragraphs differ as little as possible, or the initial truncation point with the smallest similarity may be selected.
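  • A sketch of the thresholding and of one of the example selection rules (taking the points with the smallest similarity, i.e. the largest distance); the threshold value is illustrative:

```python
# Collect initial truncation points whose linear distance exceeds the set
# threshold, then apply one example rule from the text: keep the n points
# with the largest distance (smallest similarity). 0.5 is an assumed value.
def initial_truncation_points(distances: list[float], threshold: float = 0.5):
    return [i for i, d in enumerate(distances) if d > threshold]

def pick_by_smallest_similarity(points: list[int], distances: list[float], n: int):
    return sorted(points, key=lambda i: distances[i], reverse=True)[:n]
```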
  • In one embodiment, the step S4 of finding the linear distance from 1 includes:
  • S401: Calculate the first similarity value by formula, the operands being the first similarity value, the first vector, the second vector, the i-th dimension of the first vector and the i-th dimension of the second vector;
  • S402: Calculate the mapping value of the nonlinear mapping into the (0,1) interval by the sigmoid formula;
  • S403: Obtain the linear distance from 1 according to the mapping value.
  • As described in steps S401-S403, since the first vector and the second vector have the same number of dimensions, each dimension can be calculated separately and then combined to obtain the first similarity value, so that the similarity calculation uses as many of the input values as possible and the computational loss of the function is reduced, giving a better result. The sigmoid function then computes the mapping value of each first similarity value in the (0,1) interval, and finally the linear distance from 1 is obtained by subtracting the mapping value from 1.
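  • As a worked illustration, assuming the standard sigmoid 1/(1+e^(-h)), which matches the (0,1) mapping described: a first similarity value of h = 2 maps to about 0.881, giving a linear distance of 1 - 0.881 = 0.119, while h = -1 maps to about 0.269, giving a distance of 0.731; the lower-similarity position is thus the more likely initial truncation point.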
  • In one embodiment, after the step S5 of setting the initial truncation point, the method further includes:
  • S601: Obtain a first text distance from each initial truncation point to the beginning of the article, and a second text distance to the end of the article;
  • S602: Calculate the position score of each initial truncation point by formula, where K is the position score, X is the first text distance and Y is the second text distance;
  • S603: According to the first similarity value and the position score corresponding to each initial truncation point, select a preset number of target truncation points from the initial truncation points to truncate the article.
  • As described in steps S601-S603, when there are multiple initial truncation points, the position of each truncation point in the article (that is, the first text distance and the second text distance) can be considered, truncation near the center of the article being preferred. The position of each initial truncation point is therefore scored (the position score), and a comprehensive calculation over the position score and the first similarity value selects the preset number of initial truncation points as target truncation points.
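  • The exact formula for K is given as an image in the source and is not reproduced in this text; the sketch below uses a stand-in with the described behaviour (favouring truncation near the article's center), purely as an assumption:

```python
# Sketch of the position score K. The application's own formula is not
# reproducible from the text; as a stand-in with the described behaviour we
# score by the ratio of the shorter to the longer text distance.
def position_score(x: int, y: int) -> float:
    """x: text distance to article start; y: to article end (both > 0)."""
    return min(x, y) / max(x, y)   # 1.0 at the exact center, toward 0 at the edges
```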
  • In one embodiment, the step S603 of selecting a preset number of target truncation points from the initial truncation points to truncate the article includes:
  • S6031: Record the set of all initial truncation points as the first set;
  • S6032: Record each set of the preset number of initial truncation points selected from the first set as a second set;
  • S6033: Calculate the score value of each second set by a calculation formula, where w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n is the number of elements in the second set; and F(n) is the score value;
  • Select the second set with the highest score value, and use the initial truncation points in that set as the target truncation points.
  • As described in steps S6031-S6033, the set of initial truncation points is recorded as the first set. When the article is long, there will be more initial truncation points and correspondingly more target truncation points are needed. According to the required number of truncation points (the preset number), different combinations are selected from the first set as second sets, and the score value of each second set is calculated by the formula. Different weight coefficients w and m are assigned to the position score and the first similarity value: when the position score is the stronger factor, the weight coefficient w can be increased, and when the first similarity is the stronger factor, the weight coefficient m can be increased. The score value of each candidate set is then calculated, and the target truncation points are selected according to the score.
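  • The exact calculation formula F(n) is likewise an image in the source; from the quantities it is said to combine (weights w and m, the similarity values h_1, ..., h_n of the chosen points, and the pairwise differences ΔR_i), one plausible stand-in is sketched below; it is an assumption, not the application's formula:

```python
# Sketch of scoring a candidate second set of truncation points. The terms
# mirror the described ingredients: a position term weighted by w, a
# similarity term weighted by m, and a penalty on the spread of the chosen
# points' similarity values (the dR_i pairwise differences).
from itertools import combinations

def score_second_set(point_set, h, k, w=0.5, m=0.5):
    """point_set: candidate points; h: first similarity value per point;
    k: position score per point; w, m: preset weight parameters."""
    n = len(point_set)
    pos_term = sum(k[i] for i in point_set) / n
    sim_term = sum(h[i] for i in point_set) / n
    spread = sum(abs(h[i] - h[j]) for i, j in combinations(point_set, 2))
    return w * pos_term + m * sim_term - spread / n

def best_second_set(first_set, h, k, preset_n):
    # Enumerate every combination of the preset size and keep the best one.
    return max(combinations(first_set, preset_n),
               key=lambda s: score_second_set(s, h, k))
```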
  • In one embodiment, before the step S603 of selecting a preset number of target truncation points from the initial truncation points to truncate the article, the method further includes:
  • S6021: Splice the first sentence vector of each sentence in the article to obtain the article vector of the article;
  • S6022: Look up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points; the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
  • As described in steps S6021-S6022, the first sentence vectors of all sentences in the article are spliced to obtain the article vector. The length of the article vector can then be used to query the preset number of target truncation points in the preset list, the preset list being the pre-established correspondence between the preset number of target truncation points and the length of the article vector.
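  • A minimal sketch of the preset list; the dimension break-points and counts below are invented for illustration:

```python
# Sketch of the preset list: a mapping from article-vector dimension (a proxy
# for article length, since the vector is one first-sentence-vector per
# sentence) to how many target truncation points to choose.
PRESET_LIST = {2048: 1, 4096: 2, 8192: 3}   # dim upper bound -> preset number

def preset_number(article_vec_dim: int) -> int:
    for bound in sorted(PRESET_LIST):
        if article_vec_dim <= bound:
            return PRESET_LIST[bound]
    return max(PRESET_LIST.values())        # very long articles: the largest count
```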
  • In one embodiment, the step S1 of inputting each sentence of the article into the bert model to obtain multiple word vectors and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first and second sentence vectors includes:
  • S101: Preprocess the sentence, and establish a TOKEN list according to the position of the sentence in the article to record the sentence's position, where the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
  • S102: Read the text data of the data set through the bert model, and construct the word vectors by fine-tuning the bert model, the bert model being trained on a word database;
  • S103: Arrange the word vectors into the word-vector sequence according to their order in the sentence, splice them in that order to form the first sentence vector, and splice them in reverse order to form the second sentence vector.
  • As described in steps S101-S103, to simplify the generated sentence vectors and discard irrelevant influences, the sentences can be preprocessed: punctuation marks and irrelevant words and phrases are deleted, and the language is unified. The TOKEN list is then established; its purpose is to mark each sentence so that the subsequent calculations over the sentences do not confuse their positions. The word vectors are then constructed through the bert model, and the first and second sentence vectors are formed by splicing the word-vector sequence in order and in reverse order.
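  • A minimal sketch of the preprocessing and TOKEN list of step S101; the punctuation set and the irrelevant-word list are illustrative assumptions:

```python
# Sketch of step S101: strip punctuation, drop assumed irrelevant words, and
# record each sentence's position in a TOKEN list so later per-sentence
# calculations cannot confuse positions.
import re

IRRELEVANT = {"你好", "hello", "please"}          # assumed example entries

def preprocess(sentences: list[str]):
    token_list, cleaned = [], []
    for pos, s in enumerate(sentences):
        s = re.sub(r"[，。！？、；：,.!?;:]", "", s)   # remove punctuation marks
        words = [w for w in s.split() if w not in IRRELEVANT]
        token_list.append(pos)                     # record the sentence's position
        cleaned.append(" ".join(words))
    return token_list, cleaned
```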
  • In another embodiment, after the step S5 of setting the initial truncation point, the method includes:
  • S601: Calculate the second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
  • S602: Extract the initial truncation points whose second similarity value is less than a preset similarity value as first truncation points;
  • S603: Screen the target truncation points out of the first truncation points according to a preset rule, and truncate the article at the target truncation points.
  • As described in these steps, the second similarity value of the target sentence vectors of two adjacent sentences can also be calculated for a further judgment: for each initial truncation point whose linear distance exceeds the set threshold, the second similarity value of its two adjacent sentence vectors is calculated; the initial truncation points whose second similarity value is less than the preset similarity value are extracted as first truncation points; and a preset rule, for example selecting the first truncation point with the smallest second similarity as the target truncation point, truncates the article, completing the segmentation.
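  • A sketch of this further filter, again assuming cosine similarity for the second similarity value and an illustrative preset value:

```python
# Sketch: at each initial truncation point, compare the target vectors of the
# two adjacent sentences; keep points whose second similarity falls below the
# preset value, then cut at the point with the smallest similarity (the rule
# the text gives as an example).
import torch

def filter_and_pick(points, target_vecs, preset_sim: float = 0.5):
    firsts = []
    for i in points:                      # point i sits after sentence i
        sim = float(torch.nn.functional.cosine_similarity(
            target_vecs[i], target_vecs[i + 1], dim=0))
        if sim < preset_sim:
            firsts.append((i, sim))       # first truncation point candidates
    return min(firsts, key=lambda t: t[1])[0] if firsts else None
```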
  • Referring to FIG. 2, the present application also provides a device for setting the truncation point of an article, including:
  • a vectorization module 10, used to input each sentence of the article into the bert model to obtain multiple word vectors for each sentence, and to input them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
  • a vector splicing module 20, used to splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the sentence's target vector;
  • a weighted sum calculation module 30, used to select a target sentence from the article, weight and sum the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors of every sentence from the end of the target sentence to the end of the article to obtain a second vector, the dimension of the first vector being equal to the dimension of the second vector;
  • a first similarity value calculation module 40, used to calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
  • an initial truncation point setting module 50, used to compare the linear distance with a set threshold and, when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
  • Each sentence of the article is input into the bert model to obtain the multiple word vectors corresponding to each sentence. Sentences are delimited by sentence-ending punctuation: the content from the beginning of the article to the first delimiter is one sentence, and the content between delimiters is one sentence. The delimiters may be Chinese or English punctuation, such as a period, exclamation mark or question mark.
  • The bert model can be trained on corpus databases of different categories, yielding different bert models; the model matching the article's category is selected for input, and since it is trained on the corpus database of the corresponding category, the word vectors it generates will be better.
  • So that the information contained in each sentence can be better used in computation, the first sentence vector, spliced in the order of the word-vector sequence, and the second sentence vector, spliced in the reverse order, are concatenated to form the target vector, which reduces the loss of subsequent calculations and improves the later similarity calculation.
  • The target sentence may be selected by taking each sentence of the article in turn; the target vectors of every sentence from the beginning of the article to the end of the target sentence are weighted and summed to obtain the first vector, and the target vectors of every sentence from the end of the target sentence to the end of the article are weighted and summed to obtain the second vector. The weighted sum may include raising or reducing the dimension of the first vector and/or the second vector so that the two dimensions stay consistent for the subsequent similarity calculation.
  • The similarity calculation may use the WMD algorithm (word mover's distance), the simhash algorithm, a cosine-similarity-based algorithm, an SVM (Support Vector Machine) vector model, and so on; any method that computes the similarity of the first and second vectors is sufficient. The calculated first similarity value is then mapped into the (0,1) interval, so the similarity is expressed as a linear distance from 1 for comparison with the threshold.
  • Comparing the similarity value with the set threshold determines whether the end of each sentence meets the initial condition for segmentation. When it does, the end position of the corresponding target sentence can be used as an initial truncation point, which may directly serve as the final truncation point. When multiple truncation points exist, one or more initial truncation points may be selected; the selection rule is not limited, for example choosing points so that the word counts of the resulting paragraphs differ as little as possible, or choosing the initial truncation point with the smallest similarity.
  • In one embodiment, the first similarity value calculation module 40 includes:
  • a first similarity value calculation sub-module, used to calculate the first similarity value by formula, the operands being the first similarity value, the first vector, the second vector, the i-th dimension of the first vector and the i-th dimension of the second vector;
  • a mapping value calculation sub-module, used to calculate by the sigmoid formula the mapping value of the nonlinear mapping into the (0,1) interval;
  • a linear distance calculation sub-module, used to obtain the linear distance from 1 according to the mapping value.
  • Since the first vector and the second vector have the same number of dimensions, each dimension can be calculated separately and then combined to obtain the first similarity value, so that the similarity calculation uses as many of the input values as possible and the computational loss of the function is reduced. The sigmoid function then computes each first similarity value's mapping value in the (0,1) interval, and the linear distance from 1 is obtained by subtracting the mapping value from 1.
  • In one embodiment, the device for setting the article truncation point further includes:
  • a text distance obtaining module, used to obtain the first text distance from each initial truncation point to the beginning of the article and the second text distance to the end of the article;
  • a position score calculation module, used to calculate by formula the position score of each initial truncation point, where K is the position score, X is the first text distance and Y is the second text distance;
  • a target truncation point selection module, used to select, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
  • When there are multiple initial truncation points, the position of each truncation point in the article (the first and second text distances) can be considered, truncation near the center of the article being preferred; the position of each initial truncation point is therefore scored, and a comprehensive calculation over the position score and the first similarity value selects the preset number of initial truncation points as target truncation points.
  • In one embodiment, the target truncation point selection module includes:
  • a first set forming sub-module, used to record the set of all initial truncation points as the first set;
  • a second set forming sub-module, used to record each set of the preset number of initial truncation points selected from the first set as a second set;
  • a score value calculation sub-module, used to calculate the score value of each second set by a calculation formula, where w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n is the number of elements in the second set; and F(n) is the score value;
  • a selection sub-module, used to select the second set with the highest score value and use the initial truncation points in that set as the target truncation points.
  • The set of initial truncation points is recorded as the first set; when the article is long, there are more initial truncation points and correspondingly more target truncation points are needed. According to the preset number, different combinations are selected from the first set as second sets and scored by the formula, with different weight coefficients w and m assigned to the position score and the first similarity value: when the position score is the stronger factor, w can be increased, and when the first similarity is the stronger factor, m can be increased. The score of each candidate set is then calculated, and the target truncation points are selected according to the score.
  • In one embodiment, the first similarity value calculation module 40 further includes:
  • an article vector splicing sub-module, used to splice the first sentence vector of each sentence in the article to obtain the article vector of the article;
  • a target truncation point search sub-module, used to look up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, the preset list containing the correspondence between the dimension of the article vector and the preset number of target truncation points.
  • The first sentence vectors of all sentences in the article are spliced to obtain the article vector, and the length of the article vector is used to query the preset number of target truncation points in the pre-established list.
  • In one embodiment, the vectorization module 10 includes:
  • a preprocessing sub-module, used to preprocess the sentence and establish a TOKEN list according to the sentence's position in the article to record that position, the preprocessing including removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
  • a word vector reading sub-module, used to read the text data of the data set through the bert model and construct the word vectors by fine-tuning the bert model, the bert model being trained on a word database;
  • a word-vector sequence forming module, used to arrange the word vectors into the word-vector sequence according to their order in the sentence, splice them in that order to form the first sentence vector, and splice them in reverse order to form the second sentence vector.
  • To simplify the generated sentence vectors and discard irrelevant influences, the sentences can be preprocessed: punctuation marks and irrelevant words and phrases are deleted, and the language is unified. The TOKEN list is then established to mark each sentence so that subsequent calculations do not confuse the sentences' positions. The word vectors are then constructed through the bert model, and the first and second sentence vectors are formed by splicing the word-vector sequence in order and in reverse order.
  • In another embodiment, the device for setting the article truncation point includes:
  • a second similarity value calculation module, used to calculate the second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
  • a second similarity value judgment module, used to extract the initial truncation points whose second similarity value is less than the preset similarity value as first truncation points;
  • a target truncation point screening module, used to screen the target truncation points out of the first truncation points according to a preset rule and truncate the article at the target truncation points.
  • The second similarity value of the target sentence vectors of two adjacent sentences can also be calculated for a further judgment: for each initial truncation point whose linear distance exceeds the set threshold, the second similarity value of its two adjacent sentence vectors is calculated, the points whose second similarity value is below the preset similarity value are extracted as first truncation points, and a preset rule, for example selecting the first truncation point with the smallest second similarity, fixes the target truncation point and truncates the article, completing the segmentation.
  • Beneficial effects of this application: by weighting and summing the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain the first vector, weighting and summing the target vectors of every sentence from the end of the target sentence to the end of the article to obtain the second vector, and calculating their similarity, the information of all sentences is fully considered and the truncation points of the article can be better selected.
  • Referring to FIG. 3, an embodiment of the present application also provides a computer device, which may be a server; its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores various word vectors and the like. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program can implement the method for setting the article truncation point described in any of the above embodiments.
  • Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied.
  • The embodiments of the present application also provide a computer-readable storage medium, which may be non-volatile or volatile, on which a computer program is stored; when the computer program is executed by a processor, the method for setting the article truncation point described in any of the above embodiments can be implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, a device and computer equipment for setting the truncation points of an article, relating to the field of artificial intelligence. The method includes: selecting a target sentence from the article, and weighting and summing the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector; calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1; and comparing the linear distance with a set threshold, and when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point. The method fully considers the information of all sentences and makes a better selection of the article's truncation points.

Description

Method, device and computer equipment for setting article truncation points (文章截断点的设定方法、装置以及计算机设备)
This application claims priority to the Chinese patent application filed with the China Patent Office on September 9, 2020, with application number 202010941600.3 and invention title "文章截断点的设定方法、装置以及计算机设备" (Method, device and computer equipment for setting article truncation points), the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a method, device and computer equipment for setting the truncation point of an article.
Background
For an unsegmented article, how to split it into paragraphs is a thorny problem. In many cases an incorrect split causes sentences that do not belong to the same paragraph to be grouped into one, making it difficult to generate or analyze paragraphs correctly. The inventors realized that existing segmentation of articles mainly calculates the similarity of the information contained in two adjacent sentences while ignoring the information of the remaining sentences; this has limitations and cannot make a better truncation choice for the article. A method for setting article truncation points is therefore urgently needed.
Technical Problem
The main purpose of this application is to provide a method, device and computer equipment for setting the truncation point of an article, aiming to solve the prior-art problem that only the similarity of the information contained in two adjacent sentences is calculated while the information of the remaining sentences is ignored.
Technical Solution
This application provides a method for setting the truncation point of an article, including:
inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order of the word-vector sequence;
splicing the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
selecting a target sentence from the article, weighting and summing the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weighting and summing the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, where the dimension of the first vector equals the dimension of the second vector;
calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1;
comparing the linear distance with a set threshold, and when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point.
Further, the step of calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1 includes:
calculating the first similarity value by the formula of Figure PCTCN2020125150-appb-000001, where Figure PCTCN2020125150-appb-000002 is the first similarity value, Figure PCTCN2020125150-appb-000003 denotes the first vector, Figure PCTCN2020125150-appb-000004 denotes the second vector, Figure PCTCN2020125150-appb-000005 denotes the i-th dimension of the first vector, and Figure PCTCN2020125150-appb-000006 denotes the i-th dimension of the second vector;
calculating the mapping value of the nonlinear mapping into the (0,1) interval by the formula of Figure PCTCN2020125150-appb-000007;
obtaining the linear distance from 1 according to the mapping value.
Further, after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method further includes:
obtaining a first text distance from each initial truncation point to the beginning of the article, and a second text distance to the end of the article;
calculating the position score of each initial truncation point according to the formula of Figure PCTCN2020125150-appb-000008, where K is the position score, X is the first text distance and Y is the second text distance;
selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
Further, the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article includes:
recording the set of all the initial truncation points as a first set;
recording each set of the preset number of initial truncation points selected from the first set as a second set;
calculating the score value of each second set by the calculation formula of Figures PCTCN2020125150-appb-000009 and PCTCN2020125150-appb-000010, where w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n denotes the number of elements in the second set; and F(n) denotes the score value;
selecting the second set with the highest score value, and using the initial truncation points in that set as the target truncation points.
Further, before the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article, the method further includes:
splicing the first sentence vector of each sentence in the article to obtain the article vector of the article;
looking up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, where the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
Further, the step of inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector includes:
preprocessing the sentence, and establishing a TOKEN list according to the sentence's position in the article to record that position, where the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
reading the text data of the data set through the bert model, and constructing the word vectors by fine-tuning the bert model, where the bert model is trained on a word database;
arranging the word vectors into the word-vector sequence according to their order in the sentence, splicing them in that order to form the first sentence vector, and splicing them in reverse order to form the second sentence vector.
Further, after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method includes:
calculating a second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
extracting the initial truncation points whose second similarity value is less than a preset similarity value as first truncation points;
screening target truncation points out of the first truncation points according to a preset rule, and truncating the article at the target truncation points.
This application also provides a device for setting the truncation point of an article, including:
a vectorization module, used to input each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and to input them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
a vector splicing module, used to splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
a weighted sum calculation module, used to select a target sentence from the article, weight and sum the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, where the dimension of the first vector equals the dimension of the second vector;
a first similarity value calculation module, used to calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
an initial truncation point setting module, used to compare the linear distance with a set threshold and, when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
This application also provides a computer device, including a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the method for setting the truncation point of an article:
inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
splicing the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
selecting a target sentence from the article, weighting and summing the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weighting and summing the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, where the dimension of the first vector equals the dimension of the second vector;
calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1;
comparing the linear distance with a set threshold, and when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point.
This application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the same steps of the method for setting the truncation point of an article recited above are implemented, from inputting each sentence of the article into the bert model through to comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point.
Beneficial Effects
By weighting and summing the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain the first vector, weighting and summing the target vectors of every sentence from the end of the target sentence to the end of the article to obtain the second vector, and calculating their similarity, the information of all sentences is fully considered and the truncation points of the article can be better selected.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for setting an article truncation point according to an embodiment of the present application;
FIG. 2 is a schematic structural block diagram of a device for setting an article truncation point according to an embodiment of the present application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, this application proposes a method for setting the truncation point of an article, including:
S1: Input each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and input them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
S2: Splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
S3: Select a target sentence from the article, weight and sum the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, where the dimension of the first vector equals the dimension of the second vector;
S4: Calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
S5: Compare the linear distance with a set threshold, and when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
As described in step S1 above, each sentence of the article is input into the bert model to obtain the multiple word vectors corresponding to each sentence. Sentences are delimited by sentence-ending punctuation: the content from the beginning of the article to the first delimiter is one sentence, and the content between delimiters is one sentence; the delimiters may be Chinese or English punctuation, such as a period, exclamation mark or question mark. The bert model can be trained on corpus databases of different categories, yielding different bert models, and the model corresponding to the article's category is selected for input; since that model is trained on the corpus database of the corresponding category, the word vectors it generates will be better.
As described in step S2 above, so that the information contained in each sentence can be better used in computation, the first sentence vector, spliced in the order of the word-vector sequence, and the second sentence vector, spliced in the reverse order, can be concatenated to form the target vector; the target vector reduces the loss of subsequent calculations, giving a better result for the later similarity calculation.
As described in step S3 above, a target sentence is selected, possibly by taking each sentence of the article in turn. The target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence are weighted and summed to obtain the first vector, and the target vectors corresponding to every sentence from the end of the target sentence to the end of the article are weighted and summed to obtain the second vector; the weighted sum may include raising or reducing the dimension of the first vector and/or the second vector so that their dimensions stay consistent for the subsequent similarity calculation.
As described in step S4 above, the similarity of the first vector and the second vector is calculated; the similarity calculation may use the WMD algorithm (word mover's distance), the simhash algorithm, a cosine-similarity-based algorithm, an SVM (Support Vector Machine) vector model and so on, any method that computes the similarity of the first and second vectors being sufficient. The calculated first similarity value is then mapped into the (0,1) interval so that the similarity is expressed as a linear distance from 1, which facilitates the subsequent comparison with the threshold.
As described in step S5 above, comparing the similarity value with the set threshold determines whether the end of each sentence meets the initial condition for segmentation. When the condition is met, the end position of the corresponding target sentence can be used as an initial truncation point, which may then serve directly as the final truncation point. When there are multiple truncation points, one or more initial truncation points can be selected to truncate the article; the selection rule is not limited, for example choosing initial truncation points so that the word counts of the resulting paragraphs differ as little as possible, or selecting the initial truncation point with the smallest similarity.
In one embodiment, the step S4 of calculating the similarity of the first vector and the second vector corresponding to each truncation point, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1 includes:
S401: Calculate the first similarity value by the formula of Figure PCTCN2020125150-appb-000011, where Figure PCTCN2020125150-appb-000012 is the first similarity value, Figure PCTCN2020125150-appb-000013 denotes the first vector, Figure PCTCN2020125150-appb-000014 denotes the second vector, Figure PCTCN2020125150-appb-000015 denotes the i-th dimension of the first vector, and Figure PCTCN2020125150-appb-000016 denotes the i-th dimension of the second vector;
S402: Calculate the mapping value of the nonlinear mapping into the (0,1) interval by the formula of Figure PCTCN2020125150-appb-000017;
S403: Obtain the linear distance from 1 according to the mapping value.
As described in steps S401-S403 above, since the first vector and the second vector have the same number of dimensions, each dimension can be calculated separately and then combined to obtain the first similarity value, so that the similarity calculation uses as many of the input values as possible and the computational loss of the function is reduced, giving a better result; the sigmoid function then computes the mapping value of each first similarity value in the (0,1) interval, and the linear distance from 1 is finally obtained by subtracting the mapping value from 1.
In one embodiment, after the step S5 of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method further includes:
S601: Obtain a first text distance from each initial truncation point to the beginning of the article, and a second text distance to the end of the article;
S602: Calculate the position score of each initial truncation point according to the formula of Figure PCTCN2020125150-appb-000018, where K is the position score, X is the first text distance and Y is the second text distance;
S603: Select, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
As described in steps S601-S603 above, when there are multiple initial truncation points, the position of each truncation point in the article (the first text distance and the second text distance) can be considered, with truncation near the center of the article preferred; the position of each initial truncation point is therefore scored according to the formula of Figure PCTCN2020125150-appb-000019, and a comprehensive calculation over the position score and the first similarity value selects the preset number of initial truncation points as target truncation points.
In one embodiment, the step S603 of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article includes:
S6031: Record the set of all the initial truncation points as the first set;
S6032: Record each set of the preset number of initial truncation points selected from the first set as a second set;
S6033: Calculate the score value of each second set by the calculation formula of Figure PCTCN2020125150-appb-000020, where w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n denotes the number of elements in the second set; and F(n) denotes the score value;
selecting the second set with the highest score value, and using the initial truncation points in that set as the target truncation points.
As described in steps S6031-S6033 above, the set of initial truncation points is recorded as the first set; when the article is long there are more initial truncation points and correspondingly more target truncation points are needed. According to the required number of truncation points (the preset number), different combinations are screened out of the first set as second sets and scored by the formula, with different weight coefficients w and m assigned to the first similarity value and the position score; it should be understood that when the position score is the stronger factor the weight coefficient w can be increased, and when the first similarity is the stronger factor the weight coefficient m can be increased. The score value of each candidate is then calculated and the target truncation points are selected according to the score.
In one embodiment, before the step S603 of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article, the method further includes:
S6021: Splice the first sentence vector of each sentence in the article to obtain the article vector of the article;
S6022: Look up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, where the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
As described in steps S6021-S6022 above, the first sentence vectors of all sentences in the article are spliced to obtain the article vector; the length of the article vector can then be used to query the preset number of target truncation points in the preset list, which is the pre-established correspondence between the preset number of target truncation points and the length of the article vector.
In one embodiment, the step S1 of inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector includes:
S101: Preprocess the sentence, and establish a TOKEN list according to the sentence's position in the article to record that position, where the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
S102: Read the text data of the data set through the bert model, and construct the word vectors by fine-tuning the bert model, where the bert model is trained on a word database;
S103: Arrange the word vectors into the word-vector sequence according to their order in the sentence, splice them in that order to form the first sentence vector, and splice them in reverse order to form the second sentence vector.
As described in steps S101-S103 above, to simplify the generated sentence vectors and discard irrelevant influences, the sentences can be preprocessed (punctuation marks and irrelevant words and phrases deleted, the language unified, and so on); the TOKEN list is then established to mark each sentence so that subsequent calculations over the sentences do not confuse their positions; the word vectors are then constructed through the bert model, and the first and second sentence vectors are formed by splicing the word-vector sequence in order and in reverse order.
In another embodiment, after the step S5 of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method includes:
S601: Calculate the second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
S602: Extract the initial truncation points whose second similarity value is less than the preset similarity value as first truncation points;
S603: Screen the target truncation points out of the first truncation points according to a preset rule, and truncate the article at the target truncation points.
As described in steps S601-S603 above, the second similarity value of the target sentence vectors of two adjacent sentences can also be calculated for a further judgment: for each initial truncation point whose linear distance exceeds the set threshold, the second similarity value of its two adjacent sentence vectors is calculated, the points whose second similarity value is below the preset similarity value are extracted as first truncation points, and a preset rule, for example selecting the first truncation point with the smallest second similarity as the target truncation point, truncates the article and completes the segmentation.
Referring to FIG. 2, this application also provides a device for setting the truncation point of an article, including:
a vectorization module 10, used to input each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and to input them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector, where the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
a vector splicing module 20, used to splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
a weighted sum calculation module 30, used to select a target sentence from the article, weight and sum the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, where the dimension of the first vector equals the dimension of the second vector;
a first similarity value calculation module 40, used to calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
an initial truncation point setting module 50, used to compare the linear distance with a set threshold and, when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
Each sentence of the article is input into the bert model to obtain the multiple word vectors corresponding to each sentence. Sentences are delimited by sentence-ending punctuation: the content from the beginning of the article to the first delimiter is one sentence and the content between delimiters is one sentence; the delimiters may be Chinese or English punctuation, such as a period, exclamation mark or question mark. The bert model can be trained on corpus databases of different categories, yielding different bert models, and the model matching the article's category is selected for input; since it is trained on the corpus database of the corresponding category, the word vectors it generates will be better.
So that the information contained in each sentence can be better used in computation, the first sentence vector, spliced in the order of the word-vector sequence, and the second sentence vector, spliced in the reverse order, can be concatenated to form the target vector, which reduces the loss of subsequent calculations and improves the later similarity calculation.
A target sentence is selected, possibly by taking each sentence of the article in turn; the target vectors of every sentence from the beginning of the article to the end of the target sentence are weighted and summed to obtain the first vector, and the target vectors of every sentence from the end of the target sentence to the end of the article are weighted and summed to obtain the second vector; the weighted sum may include raising or reducing the dimension of the first vector and/or the second vector so that their dimensions stay consistent for the subsequent similarity calculation.
The similarity of the first vector and the second vector is calculated; the similarity calculation may use the WMD algorithm (word mover's distance), the simhash algorithm, a cosine-similarity-based algorithm, an SVM (Support Vector Machine) vector model and so on, any method that computes the similarity of the two vectors being sufficient. The calculated first similarity value is then mapped into the (0,1) interval so that the similarity is expressed as a linear distance from 1 for comparison with the threshold.
Comparing the similarity value with the set threshold determines whether the end of each sentence meets the initial condition for segmentation; when it does, the end position of the corresponding target sentence can be used as an initial truncation point, which may directly serve as the final truncation point. When there are multiple truncation points, one or more initial truncation points can be selected; the selection rule is not limited, for example making the word counts of the resulting paragraphs differ as little as possible, or selecting the initial truncation point with the smallest similarity.
In one embodiment, the first similarity value calculation module 40 includes:
a first similarity value calculation sub-module, used to calculate the first similarity value by the formula of Figure PCTCN2020125150-appb-000021, where Figure PCTCN2020125150-appb-000022 is the first similarity value, Figure PCTCN2020125150-appb-000023 denotes the first vector, Figure PCTCN2020125150-appb-000024 denotes the second vector, Figure PCTCN2020125150-appb-000025 denotes the i-th dimension of the first vector, and Figure PCTCN2020125150-appb-000026 denotes the i-th dimension of the second vector;
a mapping value calculation sub-module, used to calculate by the formula of Figure PCTCN2020125150-appb-000027 the mapping value of the nonlinear mapping into the (0,1) interval;
a linear distance calculation sub-module, used to obtain the linear distance from 1 according to the mapping value.
Since the first vector and the second vector have the same number of dimensions, each dimension can be calculated separately and then combined to obtain the first similarity value, so that the similarity calculation uses as many of the input values as possible and the computational loss of the function is reduced; the sigmoid function then computes each first similarity value's mapping value in the (0,1) interval, and the linear distance from 1 is finally obtained by subtracting the mapping value from 1.
In one embodiment, the device for setting the article truncation point further includes:
a text distance obtaining module, used to obtain the first text distance from each initial truncation point to the beginning of the article and the second text distance to the end of the article;
a position score calculation module, used to calculate the position score of each initial truncation point according to the formula of Figure PCTCN2020125150-appb-000028, where K is the position score, X is the first text distance and Y is the second text distance;
a target truncation point selection module, used to select, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
When there are multiple initial truncation points, the position of each truncation point in the article (the first and second text distances) can be considered, truncation near the center of the article being preferred; the position of each initial truncation point is therefore scored, the position score of each initial truncation point is calculated according to the formula of Figure PCTCN2020125150-appb-000029, and a comprehensive calculation over the position score and the first similarity value selects the preset number of initial truncation points as target truncation points.
In one embodiment, the target truncation point selection module includes:
a first set forming sub-module, used to record the set of all the initial truncation points as the first set;
a second set forming sub-module, used to record each set of the preset number of initial truncation points selected from the first set as a second set;
a score value calculation sub-module, used to calculate the score value of each second set by the calculation formula of Figure PCTCN2020125150-appb-000030, where w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n denotes the number of elements in the second set; and F(n) denotes the score value;
a selection sub-module, used to select the second set with the highest score value and use the initial truncation points in that set as the target truncation points.
The set of initial truncation points is recorded as the first set; when the article is long there are more initial truncation points and correspondingly more target truncation points are needed. According to the required number of truncation points (the preset number), different combinations are screened out of the first set as second sets and scored by the formula, with different weight coefficients w and m assigned to the first similarity value and the position score; it should be understood that when the position score is the stronger factor the weight coefficient w can be increased and when the first similarity is the stronger factor the weight coefficient m can be increased; the score value of each candidate is then calculated and the target truncation points are selected according to the score.
In one embodiment, the first similarity value calculation module 40 further includes:
an article vector splicing sub-module, used to splice the first sentence vector of each sentence in the article to obtain the article vector of the article;
a target truncation point search sub-module, used to look up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, where the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
The first sentence vectors of all sentences in the article are spliced to obtain the article vector; the length of the article vector can then be used to query the preset number of target truncation points in the preset list, which is the pre-established correspondence between the preset number of target truncation points and the length of the article vector.
In one embodiment, the vectorization module 10 includes:
a preprocessing sub-module, used to preprocess the sentence and establish a TOKEN list according to the sentence's position in the article to record that position, where the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
a word vector reading sub-module, used to read the text data of the data set through the bert model and construct the word vectors by fine-tuning the bert model, where the bert model is trained on a word database;
a word-vector sequence forming module, used to arrange the word vectors into the word-vector sequence according to their order in the sentence, splice them in that order to form the first sentence vector, and splice them in reverse order to form the second sentence vector.
To simplify the generated sentence vectors and discard irrelevant influences, the sentences can be preprocessed (punctuation marks and irrelevant words and phrases deleted, the language unified, and so on); the TOKEN list is then established to mark each sentence so that subsequent calculations do not confuse the sentences' positions; the word vectors are then constructed through the bert model, and the first and second sentence vectors are formed by splicing the word-vector sequence in order and in reverse order.
In another embodiment, the device for setting the article truncation point includes:
a second similarity value calculation module, used to calculate the second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
a second similarity value judgment module, used to extract the initial truncation points whose second similarity value is less than the preset similarity value as first truncation points;
a target truncation point screening module, used to screen the target truncation points out of the first truncation points according to a preset rule and truncate the article at the target truncation points.
The second similarity value of the target sentence vectors of two adjacent sentences can also be calculated for a further judgment: for each initial truncation point whose linear distance exceeds the set threshold, the second similarity value of its two adjacent sentence vectors is calculated, the points whose second similarity value is below the preset similarity value are extracted as first truncation points, and a preset rule, for example selecting the first truncation point with the smallest second similarity as the target truncation point, truncates the article and completes the segmentation.
Beneficial effects of this application: by weighting and summing the target vectors of every sentence from the beginning of the article to the end of the target sentence to obtain the first vector, weighting and summing the target vectors of every sentence from the end of the target sentence to the end of the article to obtain the second vector, and calculating their similarity, the information of all sentences is fully considered and the truncation points of the article can be better selected.
Referring to FIG. 3, an embodiment of this application also provides a computer device, which may be a server; its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected through a system bus, where the processor provides computing and control capabilities; the memory includes a non-volatile storage medium and an internal memory, the non-volatile storage medium storing an operating system, a computer program and a database, and the internal memory providing an environment for running the operating system and the computer program in the non-volatile storage medium; the database stores various word vectors and the like; and the network interface communicates with external terminals through a network connection. When executed by the processor, the computer program can implement the method for setting the article truncation point described in any of the above embodiments.
Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied.
An embodiment of this application also provides a computer-readable storage medium, which may be non-volatile or volatile, storing a computer program; when executed by a processor, the computer program can implement the method for setting the article truncation point described in any of the above embodiments.

Claims (20)

  1. A method for setting the truncation point of an article, comprising:
    inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order of the word-vector sequence;
    splicing the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
    selecting a target sentence from the article, weighting and summing the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weighting and summing the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, wherein the dimension of the first vector equals the dimension of the second vector;
    calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1;
    comparing the linear distance with a set threshold, and when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point.
  2. The method for setting the truncation point of an article according to claim 1, wherein the step of calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1 comprises:
    calculating the first similarity value by the formula of Figure PCTCN2020125150-appb-100001, wherein Figure PCTCN2020125150-appb-100002 is the first similarity value, Figure PCTCN2020125150-appb-100003 denotes the first vector, Figure PCTCN2020125150-appb-100004 denotes the second vector, Figure PCTCN2020125150-appb-100005 denotes the i-th dimension of the first vector, and Figure PCTCN2020125150-appb-100006 denotes the i-th dimension of the second vector;
    calculating the mapping value of the nonlinear mapping into the (0,1) interval by the formula of Figure PCTCN2020125150-appb-100007;
    finding the linear distance from 1 according to the mapping value.
  3. The method for setting the truncation point of an article according to claim 1, wherein after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method further comprises:
    obtaining a first text distance from each initial truncation point to the beginning of the article, and a second text distance to the end of the article;
    calculating the position score of each initial truncation point according to the formula of Figure PCTCN2020125150-appb-100008, wherein K is the position score, X is the first text distance and Y is the second text distance;
    selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
  4. The method for setting the truncation point of an article according to claim 3, wherein the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article comprises:
    recording the set of all the initial truncation points as a first set;
    recording each set of the preset number of initial truncation points selected from the first set as a second set;
    calculating the score value of each second set by the calculation formula of Figures PCTCN2020125150-appb-100009 and PCTCN2020125150-appb-100010, wherein w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n denotes the number of elements in the second set; and F(n) denotes the score value;
    selecting the second set with the highest score value, and using the initial truncation points in that set as the target truncation points.
  5. The method for setting the truncation point of an article according to claim 3, wherein before the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article, the method further comprises:
    splicing the first sentence vector of each sentence in the article to obtain the article vector of the article;
    looking up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, wherein the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
  6. The method for setting the truncation point of an article according to claim 1, wherein the step of inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector comprises:
    preprocessing the sentence, and establishing a TOKEN list according to the sentence's position in the article to record that position, wherein the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
    reading the text data of the data set through the bert model, and constructing the word vectors by fine-tuning the bert model, wherein the bert model is trained on a word database;
    arranging the word vectors into the word-vector sequence according to their order in the sentence, splicing them in that order to form the first sentence vector, and splicing them in reverse order to form the second sentence vector.
  7. The method for setting the truncation point of an article according to claim 1, wherein after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method comprises:
    calculating a second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
    extracting the initial truncation points whose second similarity value is less than a preset similarity value as first truncation points;
    screening target truncation points out of the first truncation points according to a preset rule, and truncating the article at the target truncation points.
  8. A device for setting the truncation point of an article, comprising:
    a vectorization module, used to input each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and to input them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
    a vector splicing module, used to splice the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
    a weighted sum calculation module, used to select a target sentence from the article, weight and sum the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weight and sum the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, wherein the dimension of the first vector equals the dimension of the second vector;
    a first similarity value calculation module, used to calculate the similarity of the first vector and the second vector corresponding to the target sentence, map the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and find the linear distance from 1;
    an initial truncation point setting module, used to compare the linear distance with a set threshold and, when the linear distance is higher than the set threshold, use the end position of the target sentence as the initial truncation point.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein when executing the computer program the processor implements the steps of a method for setting the truncation point of an article:
    inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
    splicing the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
    selecting a target sentence from the article, weighting and summing the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weighting and summing the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, wherein the dimension of the first vector equals the dimension of the second vector;
    calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1;
    comparing the linear distance with a set threshold, and when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point.
  10. The computer device according to claim 9, wherein the step of calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1 comprises:
    calculating the first similarity value by the formula of Figure PCTCN2020125150-appb-100011, wherein Figure PCTCN2020125150-appb-100012 is the first similarity value, Figure PCTCN2020125150-appb-100013 denotes the first vector, Figure PCTCN2020125150-appb-100014 denotes the second vector, Figure PCTCN2020125150-appb-100015 denotes the i-th dimension of the first vector, and Figure PCTCN2020125150-appb-100016 denotes the i-th dimension of the second vector;
    calculating the mapping value of the nonlinear mapping into the (0,1) interval by the formula of Figure PCTCN2020125150-appb-100017;
    finding the linear distance from 1 according to the mapping value.
  11. The computer device according to claim 9, wherein after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method further comprises:
    obtaining a first text distance from each initial truncation point to the beginning of the article, and a second text distance to the end of the article;
    calculating the position score of each initial truncation point according to the formula of Figure PCTCN2020125150-appb-100018, wherein K is the position score, X is the first text distance and Y is the second text distance;
    selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
  12. The computer device according to claim 11, wherein the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article comprises:
    recording the set of all the initial truncation points as a first set;
    recording each set of the preset number of initial truncation points selected from the first set as a second set;
    calculating the score value of each second set by the calculation formula of Figures PCTCN2020125150-appb-100019 and PCTCN2020125150-appb-100020, wherein w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n denotes the number of elements in the second set; and F(n) denotes the score value;
    selecting the second set with the highest score value, and using the initial truncation points in that set as the target truncation points.
  13. The computer device according to claim 11, wherein before the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article, the method further comprises:
    splicing the first sentence vector of each sentence in the article to obtain the article vector of the article;
    looking up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, wherein the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
  14. The computer device according to claim 9, wherein the step of inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain each sentence's first sentence vector and second sentence vector comprises:
    preprocessing the sentence, and establishing a TOKEN list according to the sentence's position in the article to record that position, wherein the preprocessing includes removing punctuation marks, unifying the language and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives and profanity;
    reading the text data of the data set through the bert model, and constructing the word vectors by fine-tuning the bert model, wherein the bert model is trained on a word database;
    arranging the word vectors into the word-vector sequence according to their order in the sentence, splicing them in that order to form the first sentence vector, and splicing them in reverse order to form the second sentence vector.
  15. The computer device according to claim 9, wherein after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method comprises:
    calculating a second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
    extracting the initial truncation points whose second similarity value is less than a preset similarity value as first truncation points;
    screening target truncation points out of the first truncation points according to a preset rule, and truncating the article at the target truncation points.
  16. A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of a method for setting the truncation point of an article are implemented:
    inputting each sentence of the article into the bert model to obtain multiple word vectors corresponding to each sentence, and inputting them as a word-vector sequence into the bidirectional long short-term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is spliced in the order of the word-vector sequence and the second sentence vector is spliced in the reverse order;
    splicing the end of each sentence's first sentence vector to the head of its second sentence vector to obtain the target vector of each sentence;
    selecting a target sentence from the article, weighting and summing the target vectors corresponding to every sentence from the beginning of the article to the end of the target sentence to obtain a first vector, and weighting and summing the target vectors corresponding to every sentence from the end of the target sentence to the end of the article to obtain a second vector, wherein the dimension of the first vector equals the dimension of the second vector;
    calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1;
    comparing the linear distance with a set threshold, and when the linear distance is higher than the set threshold, using the end position of the target sentence as an initial truncation point.
  17. The computer-readable storage medium according to claim 16, wherein the step of calculating the similarity of the first vector and the second vector corresponding to the target sentence, mapping the calculated first similarity value by a sigmoid nonlinear mapping into the (0,1) interval, and finding the linear distance from 1 comprises:
    calculating the first similarity value by the formula of Figure PCTCN2020125150-appb-100021, wherein Figure PCTCN2020125150-appb-100022 is the first similarity value, Figure PCTCN2020125150-appb-100023 denotes the first vector, Figure PCTCN2020125150-appb-100024 denotes the second vector, Figure PCTCN2020125150-appb-100025 denotes the i-th dimension of the first vector, and Figure PCTCN2020125150-appb-100026 denotes the i-th dimension of the second vector;
    calculating the mapping value of the nonlinear mapping into the (0,1) interval by the formula of Figure PCTCN2020125150-appb-100027;
    finding the linear distance from 1 according to the mapping value.
  18. The computer-readable storage medium according to claim 16, wherein after the step of comparing the linear distance with the set threshold and, when the linear distance is higher than the set threshold, using the end position of the target sentence as the initial truncation point, the method further comprises:
    obtaining a first text distance from each initial truncation point to the beginning of the article, and a second text distance to the end of the article;
    calculating the position score of each initial truncation point according to the formula of Figure PCTCN2020125150-appb-100028, wherein K is the position score, X is the first text distance and Y is the second text distance;
    selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article.
  19. The computer-readable storage medium according to claim 18, wherein the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article comprises:
    recording the set of all the initial truncation points as a first set;
    recording each set of the preset number of initial truncation points selected from the first set as a second set;
    calculating the score value of each second set by the calculation formula of Figures PCTCN2020125150-appb-100029 and PCTCN2020125150-appb-100030, wherein w and m are preset weight parameters; h_1, h_2, ..., h_n are the first similarity values corresponding to the elements of the second set; ΔR_i is the difference between the first similarity values of the two elements of the i-th pair picked from the second set; n denotes the number of elements in the second set; and F(n) denotes the score value;
    selecting the second set with the highest score value, and using the initial truncation points in that set as the target truncation points.
  20. The computer-readable storage medium according to claim 18, wherein before the step of selecting, according to the first similarity value and the position score corresponding to each initial truncation point, a preset number of target truncation points from the initial truncation points to truncate the article, the method further comprises:
    splicing the first sentence vector of each sentence in the article to obtain the article vector of the article;
    looking up, in a preset list according to the dimension of the article vector, the preset number corresponding to the target truncation points, wherein the preset list contains the correspondence between the dimension of the article vector and the preset number of target truncation points.
PCT/CN2020/125150 2020-09-09 2020-10-30 文章截断点的设定方法、装置以及计算机设备 (Method, device and computer equipment for setting article truncation points) WO2021159760A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010941600.3A CN112016292B (zh) 2020-09-09 2020-09-09 文章截断点的设定方法、装置以及计算机设备 (Method, device and computer equipment for setting article truncation points)
CN202010941600.3 2020-09-09

Publications (1)

Publication Number Publication Date
WO2021159760A1 true WO2021159760A1 (zh) 2021-08-19

Family

ID=73523074

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125150 WO2021159760A1 (zh) 2020-09-09 2020-10-30 文章截断点的设定方法、装置以及计算机设备

Country Status (2)

Country Link
CN (1) CN112016292B (zh)
WO (1) WO2021159760A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361260A (zh) * 2021-06-10 2021-09-07 北京字节跳动网络技术有限公司 一种文本处理方法、装置、设备以及存储介质 (Text processing method, apparatus, device and storage medium)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239285A (zh) * 2013-06-06 2014-12-24 腾讯科技(深圳)有限公司 文章新章节的检测方法及装置 (Method and device for detecting new chapters of an article)
CN106874469A (zh) * 2017-02-16 2017-06-20 北京大学 一种新闻综述生成方法与系统 (News summary generation method and system)
CN107133213A (zh) * 2017-05-06 2017-09-05 广东药科大学 一种基于算法的文本摘要自动提取方法与系统 (Algorithm-based automatic text summary extraction method and system)
US20200125673A1 (en) * 2018-10-23 2020-04-23 International Business Machines Corporation Learning thematic similarity metric from article text units

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7308138B2 (en) * 2000-12-12 2007-12-11 Hewlett-Packard Development Company, L.P. Document segmentation method
US8081823B2 (en) * 2007-11-20 2011-12-20 Seiko Epson Corporation Segmenting a string using similarity values
CN102004724B (zh) * 2010-12-23 2012-06-20 哈尔滨工业大学 文档段落分割方法 (Document paragraph segmentation method)
CN109241526B (zh) * 2018-08-22 2022-11-15 北京慕华信息科技有限公司 一种段落分割方法和装置 (Paragraph segmentation method and device)
CN110222654A (zh) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 文本分割方法、装置、设备及存储介质 (Text segmentation method, apparatus, device and storage medium)


Also Published As

Publication number Publication date
CN112016292B (zh) 2022-10-11
CN112016292A (zh) 2020-12-01

Similar Documents

Publication Publication Date Title
US11080910B2 (en) Method and device for displaying explanation of reference numeral in patent drawing image using artificial intelligence technology based machine learning
  • KR101999152B1 (ko) 컨벌루션 신경망 기반 영문 텍스트 정형화 방법 (Method for standardizing English text based on a convolutional neural network)
  • CN109670163B (zh) 信息识别方法、信息推荐方法、模板构建方法及计算设备 (Information identification method, information recommendation method, template construction method and computing device)
  • CN110929038B (zh) 基于知识图谱的实体链接方法、装置、设备和存储介质 (Knowledge-graph-based entity linking method, apparatus, device and storage medium)
  • WO2021114810A1 (zh) 基于图结构的公文推荐方法、装置、计算机设备及介质 (Graph-structure-based official document recommendation method and apparatus, computer device, and medium)
  • CN111427995A (zh) 基于内部对抗机制的语义匹配方法、装置及存储介质 (Semantic matching method and apparatus based on an internal adversarial mechanism, and storage medium)
  • CN112052326A (zh) 一种基于长短文本匹配的智能问答方法及系统 (Intelligent question answering method and system based on long and short text matching)
  • CN110765277B (zh) 一种基于知识图谱的移动端的在线设备故障诊断方法 (Knowledge-graph-based online equipment fault diagnosis method for mobile terminals)
  • CN101354705B (zh) 文档图像处理装置和文档图像处理方法 (Document image processing apparatus and document image processing method)
  • WO2022088671A1 (zh) 自动问答方法、装置、设备及存储介质 (Automatic question answering method and apparatus, device, and storage medium)
  • CN113360699B (zh) 模型训练方法和装置、图像问答方法和装置 (Model training method and apparatus, and image question answering method and apparatus)
  • CN112818093A (zh) 基于语义匹配的证据文档检索方法、系统及存储介质 (Evidence document retrieval method and system based on semantic matching, and storage medium)
  • WO2021190662A1 (zh) 医学文献排序方法、装置、电子设备及存储介质 (Medical literature ranking method and apparatus, electronic device, and storage medium)
  • CN114444507A (zh) 基于水环境知识图谱增强关系的上下文参数中文实体预测方法 (Context-parameter Chinese entity prediction method based on relation enhancement in a water-environment knowledge graph)
  • CN112434533A (zh) 实体消歧方法、装置、电子设备及计算机可读存储介质 (Entity disambiguation method and apparatus, electronic device, and computer-readable storage medium)
  • WO2021159760A1 (zh) 文章截断点的设定方法、装置以及计算机设备 (Method, device and computer equipment for setting article truncation points)
  • CN111680264A (zh) 一种多文档阅读理解方法 (Multi-document reading comprehension method)
  • CN114528413A (zh) 众包标注支持的知识图谱更新方法、系统和可读存储介质 (Knowledge graph updating method and system supported by crowdsourced annotation, and readable storage medium)
  • CN111259223B (zh) 基于情感分析模型的新闻推荐和文本分类方法 (News recommendation and text classification method based on a sentiment analysis model)
  • CN112765976A (zh) 文本相似度计算方法、装置、设备及存储介质 (Text similarity calculation method, apparatus, device, and storage medium)
  • CN113297485B (zh) 一种生成跨模态的表示向量的方法以及跨模态推荐方法 (Method for generating cross-modal representation vectors and cross-modal recommendation method)
  • CN115410185A (zh) 一种多模态数据中特定人名及单位名属性的提取方法 (Method for extracting specific person-name and organization-name attributes from multimodal data)
  • CN115098619A (zh) 资讯去重方法、装置、电子设备及计算机可读取存储介质 (Information deduplication method and apparatus, electronic device, and computer-readable storage medium)
  • CN114969439A (zh) 一种模型训练、信息检索方法及装置 (Model training and information retrieval method and apparatus)
  • CN114328894A (zh) 文档处理方法、装置、电子设备及介质 (Document processing method and apparatus, electronic device, and medium)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918184

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20918184

Country of ref document: EP

Kind code of ref document: A1