CN111708878B - Method, device, storage medium and equipment for extracting sports text abstract - Google Patents


Publication number
CN111708878B
CN111708878B
Authority
CN
China
Prior art keywords
sentence
detail
sentences
text
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010844192.XA
Other languages
Chinese (zh)
Other versions
CN111708878A
Inventor
王佳安
李直旭
陈志刚
何莹
郑新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Suzhou Technology Co Ltd
Original Assignee
Iflytek Suzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Suzhou Technology Co Ltd filed Critical Iflytek Suzhou Technology Co Ltd
Priority to CN202010844192.XA
Publication of CN111708878A
Application granted
Publication of CN111708878B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Abstract

The application discloses a method, an apparatus, a storage medium and a device for extracting a sports text abstract. The method comprises: for each non-detail sentence in an obtained target text, first determining the probability that the non-detail sentence is an abstract sentence according to the word features extracted from it; then, according to the probabilities of all non-detail sentences, selecting the target non-detail sentences that meet a preset initial selection condition to form an initially selected text abstract; and finally determining the text abstract of the target text from the initially selected text abstract. Because the detail sentences of the target text are removed first, the remaining non-detail sentences reflect more of the text's key information, and the probability that each non-detail sentence is an abstract sentence can be determined more accurately from the word features of all the non-detail sentences. Using these probabilities as the basis for forming the text abstract improves the accuracy of the sports text abstract extraction result.

Description

Method, device, storage medium and equipment for extracting sports text abstract
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a storage medium, and a device for extracting a sports text abstract.
Background
With the advent of the information age, the amount of information that needs to be processed has grown geometrically. In the field of natural language processing, people can now quickly acquire a large amount of information about sports events, but this information is often redundant, and eliminating the useless parts takes considerable time. Under these circumstances, how to obtain summary information from massive amounts of sports text more quickly and accurately has become an important research topic.
At present, there are two common methods for obtaining a sports text abstract. The first is extractive: the abstract is assembled from sentences of the original text. However, most abstract sentences obtained this way are detail sentences with high similarity to other sentences in the text, rather than sentences containing the genuinely important information, such as the game name, the game result and the two participating parties; the accuracy of the extracted abstract is therefore not high enough, and it cannot accurately represent the key content of the text. The second common method is generative, but its generation process is uncontrollable, that is, the model's generation of the abstract cannot be manually intervened in. Moreover, because training data is insufficient, the model does not learn the structure of human language but only surface-level associations, so the generated abstract tends to be obscure in meaning, grammatically inaccurate and poorly readable.
Therefore, both commonly used acquisition methods leave the precision and accuracy of the obtained sports text abstract in need of improvement.
Disclosure of Invention
The embodiment of the application mainly aims to provide a method, a device, a storage medium and equipment for extracting a sports text abstract, which can improve the accuracy of a sports text abstract extraction result.
The embodiment of the application provides a method for extracting a sports text abstract, which comprises the following steps:
acquiring a non-detail sentence in a target text; the non-detail sentences are sentences of which the similarity with other sentences in the target text is lower than a preset threshold;
for each non-detail sentence, determining the probability that the non-detail sentence is a summary sentence according to the word characteristics extracted from the non-detail sentence;
according to the probability that each non-detail sentence is an abstract sentence, selecting target non-detail sentences meeting a preset initial selection condition from all the non-detail sentences to form an initially selected text abstract of the target text;
and determining the text abstract of the target text according to the initially selected text abstract.
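The four steps above can be sketched as follows. This is a hedged illustration, not the patent's implementation: `similarity`, `score`, the similarity threshold and the top-k rule are placeholders for the components the claims leave open.

```python
# Hedged sketch of the claimed steps; the similarity measure, scoring
# model, threshold and initial-selection rule are placeholder components.
from typing import Callable, List

def extract_summary(sentences: List[str],
                    similarity: Callable[[str, str], float],
                    score: Callable[[str], float],
                    sim_threshold: float = 0.5,
                    top_k: int = 3) -> List[str]:
    # Step 1: keep only non-detail sentences, i.e. sentences whose
    # similarity with every other sentence is below the preset threshold.
    non_detail = [s for s in sentences
                  if all(similarity(s, t) < sim_threshold
                         for t in sentences if t is not s)]
    # Step 2: score each non-detail sentence as a candidate abstract sentence.
    ranked = sorted(non_detail, key=score, reverse=True)
    # Steps 3-4: the initially selected abstract is the top-k candidates,
    # restored to document order.
    chosen = set(ranked[:top_k])
    return [s for s in sentences if s in chosen]
```

For instance, with a word-overlap similarity, highly overlapping sentences are dropped as detail sentences before any scoring takes place.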
In a possible implementation manner, the determining, for each non-detail sentence, a probability that the non-detail sentence is a summary sentence according to a word feature extracted from the non-detail sentence includes:
for each non-detail sentence, generating a first sentence expression result of the current non-detail sentence according to the word characteristics extracted from the current non-detail sentence and the dependency relationship among words in the current non-detail sentence;
generating a second sentence expression result of the current non-detail sentence according to the first sentence expression result of the current non-detail sentence and the respective first sentence expression results of the other non-detail sentences in the target text;
and obtaining the probability that the current non-detail sentence is the abstract sentence according to the second sentence expression result of each non-detail sentence.
In a possible implementation manner, the generating a first sentence expression result of the current non-detail sentence according to the word features extracted from the current non-detail sentence and the dependency relationships between words in the current non-detail sentence includes:
respectively taking each word of the current non-detail sentence as a target word, and extracting the word characteristics of each target word;
for each target word, generating a first semantic expression result of the current target word according to the word characteristics of the current target word;
generating a second semantic expression result of the current target word according to the first semantic expression result of the current target word and the respective first semantic expression results of other target words except the current target word in the non-detail sentence;
and generating a first sentence expression result of the current non-detail sentence according to the second semantic expression result of each target word.
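As an illustrative sketch (not the patent's exact model) of how each word's second semantic expression result could aggregate context from the other words of the sentence, a single self-attention pass followed by mean pooling into one sentence expression vector:

```python
# Illustrative only: self-attention turns per-word first semantic
# expressions into context-aware second semantic expressions, which are
# then pooled into a single sentence expression vector.
import numpy as np

def sentence_expression(word_feats: np.ndarray) -> np.ndarray:
    d = word_feats.shape[1]
    # attention of every word over all words of the same sentence
    scores = word_feats @ word_feats.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    second = weights @ word_feats      # second semantic expression per word
    return second.mean(axis=0)         # pooled sentence expression
```

The same cross-attention idea can then be repeated one level up, attending one sentence's expression over the other sentences' expressions, as the claim describes.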
In a possible implementation manner, the determining, for each non-detail sentence, a probability that the non-detail sentence is a summary sentence according to a word feature extracted from the non-detail sentence includes:
for each non-detail sentence, inputting word characteristics extracted from the non-detail sentence into a pre-constructed sports text abstract sentence prediction model, and predicting the probability that the non-detail sentence is an abstract sentence;
the sports text abstract sentence prediction model is constructed in the following mode:
acquiring sample sentences in the sports text;
training a pre-constructed initial sports text abstract sentence prediction model by using the sample sentences to obtain the sports text abstract sentence prediction model;
the training corpus of the sports text abstract sentence prediction model comprises a plurality of sample sentences in sports texts, important entity words of the sample sentences are labeled in advance, and the sports texts are stored in a pre-constructed sports text corpus; the initial sports text abstract sentence prediction model is used for predicting the probability that the covered important entity words in the sentences are all words in a word list according to word features in input sentences, and the word list is constructed according to the sports text corpus.
In a possible implementation manner, the training, by using the sample sentence, a pre-constructed initial sports text abstract sentence prediction model to obtain the sports text abstract sentence prediction model includes:
performing word segmentation processing on the sample sentence, and identifying important entity words in the sample sentence;
covering a first percentage of important entity words in the sample sentence, keeping a second percentage of the important entity words unchanged, and replacing a third percentage of the important entity words with other important entity words of the same class, wherein the sum of the first percentage, the second percentage and the third percentage is 1;
inputting the word characteristics of each participle extracted from the sample sentence into a pre-constructed initial sports text abstract sentence prediction model for training, and predicting the probability that the covered important entity words in the sample sentence are each word in the word list;
and when the preset stopping condition is not met, re-acquiring the sample sentences in the sports text, repeatedly performing word segmentation processing on the sample sentences, identifying important entity words in the sample sentences and subsequent steps until the preset stopping condition is reached, and taking the model when the preset stopping condition is reached as the sports text abstract sentence prediction model.
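The corruption step above can be sketched as follows. The concrete percentages (BERT-style 80/10/10 here), the `[MASK]` token and the per-class vocabulary are illustrative assumptions; the claim only fixes that the three fractions sum to 1.

```python
# Sketch of masking/keeping/replacing important entity words; the 0.8/0.1
# split and the "[MASK]" token are assumptions, not stated in the claim.
import random

def corrupt_entities(tokens, entity_idx, entity_class, vocab_by_class,
                     p_mask=0.8, p_keep=0.1, rng=None):
    rng = rng or random.Random(0)
    out = list(tokens)
    for i in entity_idx:
        r = rng.random()
        if r < p_mask:
            out[i] = "[MASK]"              # first percentage: covered
        elif r < p_mask + p_keep:
            pass                           # second percentage: unchanged
        else:                              # third percentage: same-class swap
            out[i] = rng.choice(vocab_by_class[entity_class[tokens[i]]])
    return out
```

The model is then trained to predict, for each covered position, the probability of every word in the vocabulary, as the claim describes.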
In one possible implementation, the important entity words are extracted from the corresponding text sentences by using a pre-constructed important entity recognition model.
The embodiment of the present application further provides a device for extracting a sports text abstract, including:
the first acquisition unit is used for acquiring a non-detail sentence in a target text; the non-detail sentences are sentences of which the similarity with other sentences in the target text is lower than a preset threshold;
the first determining unit is used for determining the probability that each non-detail sentence is a summary sentence according to the word characteristics extracted from the non-detail sentences;
the composition unit is used for selecting a target non-detail sentence meeting preset initial selection conditions from all non-detail sentences according to the probability that all non-detail sentences are abstract sentences to form an initial selection text abstract of the target text;
and the second determining unit is used for determining the text abstract of the target text according to the initially selected text abstract.
In a possible implementation manner, the first determining unit includes:
the first generating subunit is used for generating, for each non-detail sentence, a first sentence expression result of the current non-detail sentence according to the word features extracted from the current non-detail sentence and the dependency relationships between words in the current non-detail sentence;
a second generating subunit, configured to generate a second sentence expression result of the current non-detail sentence according to the first sentence expression result of the current non-detail sentence and the first sentence expression results of the other non-detail sentences in the target text;
and the first obtaining subunit is used for obtaining the probability that the current non-detail sentence is the abstract sentence according to the second sentence expression result of each non-detail sentence.
In one possible implementation manner, the first generating subunit includes:
the extraction subunit is used for respectively taking each word of the current non-detail sentence as a target word and extracting the word characteristics of each target word;
the third generation subunit is used for generating a first semantic expression result of the current target word according to the word feature of the current target word for each target word;
a fourth generating subunit, configured to generate a second semantic expression result of the current target word according to the first semantic expression result of the current target word and the respective first semantic expression results of other target words except the current target word in the non-detail sentence;
and the fifth generating subunit is used for generating the first sentence expression result of the current non-detail sentence according to the second semantic expression result of each target word.
In a possible implementation manner, the first determining unit is specifically configured to:
for each non-detail sentence, inputting word characteristics extracted from the non-detail sentence into a pre-constructed sports text abstract sentence prediction model, and predicting the probability that the non-detail sentence is an abstract sentence;
the device further comprises:
the second acquisition unit is used for acquiring sample sentences in the sports text;
the training unit is used for training a pre-constructed initial sports text abstract sentence prediction model by using the sample sentences to obtain the sports text abstract sentence prediction model;
the training corpus of the sports text abstract sentence prediction model comprises a plurality of sample sentences in sports texts, important entity words of the sample sentences are labeled in advance, and the sports texts are stored in a pre-constructed sports text corpus; the initial sports text abstract sentence prediction model is used for predicting the probability that the covered important entity words in the sentences are all words in a word list according to word features in input sentences, and the word list is constructed according to the sports text corpus.
In one possible implementation, the training unit includes:
the identification subunit is used for carrying out word segmentation processing on the sample sentence and identifying important entity words in the sample sentence;
a processing subunit, configured to cover a first percentage of the important entity words in the sample sentence, a second percentage of the important entity words remain unchanged, and a third percentage of the important entity words are replaced with other important entity words of the same category, where a sum of the first percentage, the second percentage, and the third percentage is 1;
the prediction subunit is used for inputting the word features of each word segment extracted from the sample sentence into a pre-constructed initial sports text abstract sentence prediction model for training, and predicting the probability that the covered important entity words in the sample sentence are each word in the word list;
and the second obtaining subunit is configured to, when a preset stop condition is not met, obtain a sample sentence in the sports text again, repeatedly perform word segmentation processing on the sample sentence, identify an important entity word in the sample sentence and subsequent steps until the preset stop condition is reached, and use a model when the preset stop condition is reached as the sports text abstract sentence prediction model.
In one possible implementation, the important entity words are extracted from the corresponding text sentences by using a pre-constructed important entity recognition model.
The embodiment of the present application further provides a device for extracting a sports text abstract, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any one implementation of the above-mentioned sports text summarization method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is enabled to execute any implementation manner of the above-mentioned method for extracting a sports text abstract.
The embodiment of the application also provides a computer program product, and when the computer program product runs on the terminal device, the terminal device executes any implementation mode of the sports text abstract extraction method.
According to the method, apparatus, storage medium and device for extracting a sports text abstract provided by the embodiments of the application, after the non-detail sentences of a target text are obtained, the probability that each non-detail sentence is an abstract sentence is determined according to the word features extracted from it; target non-detail sentences meeting a preset initial selection condition are then selected, according to these probabilities, to form the initially selected text abstract of the target text, from which the text abstract of the target text is determined. In this way, the embodiments of the application remove the detail sentences of the target text first, so that the remaining non-detail sentences reflect more of the key content of the sports game, such as the game name and the game result; the probability that each non-detail sentence is an abstract sentence is then determined more accurately from the word features of all the non-detail sentences and the relations among them, and used as the basis for forming the text abstract, thereby improving the accuracy of the sports text abstract extraction result.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for extracting a sports text abstract according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a sports text abstract sentence prediction model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of determining a probability that a non-detail sentence is a summary sentence according to a word feature extracted from the non-detail sentence according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a first sentence expression result for generating a non-detail sentence according to an embodiment of the present application;
fig. 5 is a schematic composition diagram of a device for extracting a sports text abstract according to an embodiment of the present application.
Detailed Description
In some methods for extracting a sports text abstract, in order to improve the readability of the abstract content, an extractive approach is generally adopted. Specifically, the whole sports text is modeled as a graph, in which nodes represent the sentences of the text and edges between nodes represent the text similarity between sentences, and the sentences forming the abstract are extracted by loop iteration. However, most abstract sentences extracted this way are detail sentences, because detail sentences often have similar structures (they contain many repeated words, so the similarity between them is high), occupy higher weights during the iterative extraction, and are therefore extracted as abstract sentences with higher probability. Yet detail sentences usually do not contain the genuinely important information, such as the game name, the game result and the participating teams, so the extracted text abstract is not accurate enough and cannot represent the key content of the sports text.
For example, in a news report on a women's volleyball match, many detail sentences describing the course of the match appear, such as: the coach makes a double substitution, replacing player C and player D with player A and player B; player E scores quickly and player B lands a hit, so the team closes the score to 10-19; player F's attack is blocked; player A scores with a drop shot; the opponents' back-row attack behind player B is blocked; player B scores on a counter-attack while the team is rather passive at the net; the back-row attack of player F is blocked again and the team reaches match point at 24-10; although player F takes the next point, a further attack by player B goes out of bounds, and the team takes the set 25-11; and so on. These detail sentences have high text similarity to one another and are therefore likely to be extracted to form the abstract of the game news text, but most of them do not mention the genuinely important game information, such as the specific game name, venue and time, and cannot accurately represent the key content of the sports news.
In order to overcome the above drawbacks, an embodiment of the present application provides a method for extracting a sports text abstract. The sentences of a sports text are first classified: detail sentences with higher text similarity are removed, while non-detail sentences, whose text similarity with the other sentences of the text is lower than a preset threshold and which better reflect the key information of the text, are retained. The probability that each non-detail sentence is an abstract sentence is then determined according to the word features extracted from it, and, according to these probabilities, target non-detail sentences meeting a preset initial selection condition are selected from all non-detail sentences to form an initially selected text abstract. Finally, the target non-detail sentences containing more entities are selected from the initially selected text abstract to form the sports text abstract, so that it represents the genuinely important game information such as the game name, venue and time. In this way, the accuracy of the sports text abstract extraction result can be improved.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a flow chart of a method for extracting a sports text abstract provided in this embodiment is schematically illustrated, where the method includes the following steps:
s101: acquiring a non-detail sentence in a target text; and the non-detail sentences are sentences of which the similarity with other sentences in the target text is lower than a preset threshold value.
In this embodiment, any sports text to be subjected to abstract extraction is defined as a target text. It should be noted that this embodiment does not limit the language of the target text; for example, it may be a Chinese text or an English text. Nor is the length limited: the target text may be a paragraph or a whole chapter. The type of the target text is likewise not limited; it may be a piece of sports news, a professional sports paper, or part of the text of a sports lecture or magazine article.
In order to improve the accuracy of the abstract extraction result, the extracted abstract should represent more of the key information of the target text (such as the game name, game time, venue and participating teams). To this end, this embodiment first classifies all sentences of the target text, using an existing or future text classification method, to distinguish "detail sentences" from "non-detail sentences".
A "detail sentence" is a sentence whose similarity (e.g. text similarity) with other sentences in the target text is higher than a preset threshold; it usually describes the course of the sports game. A "non-detail sentence" is a sentence whose similarity with other sentences is lower than the preset threshold; it usually describes key game information such as the game name, game time, venue and participating teams. The preset threshold is the critical value for distinguishing the two, and its specific value may be set according to the actual situation and empirical values, which is not limited in the embodiment of the present application.
On this basis, the detail sentences whose similarity with other sentences is higher than the preset threshold may be removed, and only the non-detail sentences, which better embody the key game information of the target text, are retained for the subsequent step S102.
Specifically, when classifying the target text to obtain its "non-detail sentences", the target text may first be split into clause texts according to the punctuation marks at the end of each sentence (such as the period "。", the exclamation mark "！" and the question mark "？"). A "[CLS]" symbol may then be prepended to each clause text before encoding it, so as to obtain a semantic representation of the clause. It will be appreciated that this "[CLS]" symbol, carrying no explicit semantic information of its own, fuses the semantic information of each word in the clause text more "fairly" than the words already present in the clause text.
Then, the semantic representation of each clause text may be input into a classification model (composed of, for example, a convolutional layer, a pooling layer and a fully connected layer) to further extract a feature vector for each clause text, after which the feature vector is processed by the fully connected layer and a sigmoid activation function to predict the probability that the clause text is a "detail sentence". This probability is a value in the interval [0,1]: when it exceeds a preset probability threshold, the corresponding clause text is a "detail sentence"; otherwise it is a "non-detail sentence". The preset probability threshold is the critical value for distinguishing the two, and its specific value may be set according to the actual situation. For example, with a threshold of 0.5, the "non-detail sentences" of the target text are those whose predicted probability of being a "detail sentence" does not exceed 0.5, and these are acquired for the subsequent step S102.
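A minimal sketch of applying the preset probability threshold to split clause texts; the threshold value itself is configurable, as noted above:

```python
# Split clause texts by their predicted detail-sentence probability:
# probabilities above the preset threshold mark detail sentences, the
# rest are non-detail sentences kept for summarization.
def split_by_threshold(clause_texts, detail_probs, threshold=0.5):
    detail, non_detail = [], []
    for text, p in zip(clause_texts, detail_probs):
        (detail if p > threshold else non_detail).append(text)
    return detail, non_detail
```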
It should be noted that, to obtain a more accurate text classification result, the system model composed of the BERT model, the convolutional layer, the pooling layer, the fully connected layer and the sigmoid activation function needs to be trained in advance. During training, clause texts are extracted in turn from the training data, and multiple rounds of model training are performed with them until a training end condition is met, at which point training of the system model is complete. Specifically, in each round, the prediction output by the sigmoid activation function is compared with the ground truth, and the system model parameters are updated according to the difference between them; the training objective may be a cross-entropy loss function and the parameter update algorithm may be gradient descent, so that after each round the model parameters are updated according to the change in the cross-entropy loss value.
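As a toy stand-in for that system model, the snippet below trains a bare logistic classifier with the same objective and update rule named in the text (binary cross-entropy, gradient descent); the real model would first encode each clause with BERT plus convolutional, pooling and fully connected layers before the sigmoid.

```python
# Toy stand-in only: logistic regression trained by gradient descent on
# the binary cross-entropy loss, as named in the training description.
import numpy as np

def train_detail_classifier(X, y, lr=0.5, epochs=500):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid activation
        grad = p - y                             # d(cross-entropy)/d(logit)
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b
```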
S102: and for each non-detail sentence, determining the probability that the non-detail sentence is the abstract sentence according to the word characteristics extracted from the non-detail sentence.
In this embodiment, after the non-detailed sentences in the target text are obtained in step S101, in order to extract a more accurate text abstract, the existing or future word segmentation method may be further used to perform word segmentation on each non-detailed sentence to obtain each word in each non-detailed sentence, and then, the word features corresponding to each word may be extracted. For each word, its word characteristics may include semantic information for that word. And then the probability that each non-detail sentence is the abstract sentence can be accurately determined according to the characteristics of each word in each non-detail sentence.
S103: and according to the probability that all the non-detail sentences are abstract sentences, selecting target non-detail sentences meeting preset initial selection conditions from all the non-detail sentences to form an initial selection text abstract of the target text.
In this embodiment, after the probability that each non-detail sentence is a summary sentence is determined in step S102, target non-detail sentences meeting a preset initial selection condition can be selected from all the non-detail sentences according to these probabilities, so as to form the initially selected text abstract of the target text. The preset initial selection condition is a preset condition that a non-detail sentence must meet to be included in the initially selected text abstract. For example, the 10 non-detail sentences with the highest probability of being summary sentences can be selected in order from all the non-detail sentences to form the initially selected text abstract; or, the non-detail sentences whose probability of being a summary sentence exceeds 0.7 can be selected in order from all the non-detail sentences to form the initially selected text abstract. The specific preset initial selection condition can be set according to the actual situation and is not limited in the embodiment of the present application.
It should be noted that, in order to save the reader's reading time, the total length of the initially selected text abstract of the target text needs to be limited: when the total number of words of the selected target non-detail sentences reaches a preset length, the selection is stopped. The preset length may be set according to actual conditions and empirical values. For example, since experience suggests that readers generally accept an abstract length of about 60 words, the total length of the initially selected text abstract may be limited to 100 words (somewhat more than 1.5 times 60 words) for executing the subsequent step S104.
For example: assume that the total length of the initially selected text abstract of the target text is limited to 100 words in advance, that the preset initial selection condition is to select, in order, the non-detail sentences whose probability of being a summary sentence exceeds 0.7, and that step S102 determines 4 non-detail sentences with probability exceeding 0.7, with probabilities and lengths as follows: non-detail sentence 1 has probability 0.9 of being a summary sentence and contains 48 words; non-detail sentence 2 has probability 0.85 and contains 35 words; non-detail sentence 3 has probability 0.8 and contains 20 words; non-detail sentence 4 has probability 0.78 and contains 25 words. Non-detail sentences 1, 2, 3 and 4 can then be selected in turn in descending order of probability (i.e., 0.9 -> 0.85 -> 0.8 -> 0.78). Since the word counts of non-detail sentences 1, 2 and 3 already sum to 103 (48 + 35 + 20 = 103), reaching the 100-word limit, only non-detail sentences 1, 2 and 3 are selected to form the initially selected text abstract, and non-detail sentence 4 is not selected.
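The worked example above can be sketched as follows; the sentence names and the helper function are hypothetical, but the probabilities, word counts, 0.7 threshold, and 100-word cap reproduce the example:

```python
# Sketch of the initial selection rule: pick sentences whose summary
# probability exceeds 0.7, highest first, and stop adding once the
# accumulated word count reaches the 100-word cap.

def primary_select(candidates, prob_threshold=0.7, max_words=100):
    """candidates: (name, summary_probability, word_count) tuples."""
    picked, total = [], 0
    for name, prob, words in sorted(candidates, key=lambda c: -c[1]):
        if prob <= prob_threshold or total >= max_words:
            continue
        picked.append(name)
        total += words
    return picked, total

cands = [("sent1", 0.90, 48), ("sent2", 0.85, 35),
         ("sent3", 0.80, 20), ("sent4", 0.78, 25)]
print(primary_select(cands))  # sent1-sent3 total 103 words, so sent4 is dropped
```

Note that the sentence that crosses the 100-word cap (sent3) is still included, matching the example; only the sentences after it are dropped.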
S104: and determining the text abstract of the target text according to the initially selected text abstract.
In this embodiment, after the initially selected text abstract of the target text is determined in step S103, in order to further save the reader's time and reduce redundancy so that the text abstract of the target text is concise and complete, the number of entities contained in each target non-detail sentence in the initially selected text abstract can be determined, where each entity corresponds to one entity word, and entity words can denote independent things such as people, places and organizations. It can be understood that the entity words in a sports text, such as those representing the competing teams, time and venue, better represent the key information of the sports text than other words in the text. Therefore, the larger the number of entities contained in a target non-detail sentence in the initially selected text abstract, the richer the semantic information it contains, and the more likely it is to be a sentence that finally constitutes the text abstract of the target text.
Specifically, when determining the target non-detail sentences composing the initially selected text abstract of the target text, the above steps only consider the information contained in each non-detail sentence (used to determine the probability of the corresponding non-detail sentence being a summary sentence), but do not consider the redundancy among the non-detail sentences. For example, non-detail sentences that are highly similar to each other have very close probabilities of being summary sentences, so the resulting initially selected text abstract suffers from high redundancy.
Based on this, in order to reduce the redundancy of the text abstract of the target text and obtain a concise and complete text abstract, the number of entities contained in each target non-detail sentence in the initially selected text abstract can be determined, for example, by using a Bi-directional Long Short-Term Memory (BiLSTM) network or a Conditional Random Field (CRF); the specific implementation process is consistent with conventional methods and is not described again here. Then, all the target non-detail sentences are sorted according to the number of entities each contains: the more entities a target non-detail sentence contains, the higher it ranks, because the more entities it contains, the richer its semantic information, and the more likely it is to be a final sentence forming the text abstract of the target text. The higher-ranked target non-detail sentences are then selected in order, from front to back, to form the final text abstract of the target text.
It should be noted that, in order to save the reading time of the reader and improve the reading experience of the reader, the total length of the final text abstract of the target text needs to be limited, and when the total number of words of the selected target non-detailed sentence reaches the preset total length, the selection is stopped, where the preset total length may be set according to an actual situation and an empirical value, which is not limited in the embodiment of the present application, for example, the preset total length may be set to 60 words.
It should be noted that, when it is determined that the number of entities included in each of the two or more target non-detail sentences is the same, the target non-detail sentence with the least number of words is preferably selected to constitute the final text abstract of the target text, so that the refining degree of the abstract sentence can be further increased.
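The re-ranking described in the steps above (more entities ranks higher, ties prefer the sentence with fewer words, and selection stops at the preset total length) can be sketched as follows; the sentence names, entity counts, and the 60-word cap are assumed values, and the entity counts are not produced by an actual BiLSTM/CRF:

```python
# Sketch of the final selection: sort target non-detail sentences by entity
# count (descending), break ties with the smaller word count, and stop once
# the accumulated word count reaches the preset total length.

def final_select(sentences, max_words=60):
    """sentences: (text, entity_count, word_count) tuples."""
    ranked = sorted(sentences, key=lambda s: (-s[1], s[2]))
    picked, total = [], 0
    for text, _, words in ranked:
        if total >= max_words:
            break
        picked.append(text)
        total += words
    return picked

sents = [("sent1", 3, 48), ("sent2", 5, 35), ("sent3", 5, 20), ("sent4", 2, 30)]
print(final_select(sents))  # sent3 outranks sent2 on the tie; sent4 is dropped
```

Here sent3 and sent2 both contain 5 entities, so the shorter sent3 is preferred, consistent with the tie-breaking rule above.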
In a possible implementation manner of this embodiment, the step S102 may specifically include: and for each non-detail sentence, inputting the word characteristics extracted from the non-detail sentence into a pre-constructed sports text abstract sentence prediction model, and predicting the probability that the non-detail sentence is the abstract sentence.
In this implementation manner, the probability that each non-detailed sentence is a abstract sentence can be predicted by using a pre-constructed sports text abstract sentence prediction model, and for a specific implementation process, refer to the second embodiment.
Next, this embodiment will describe a process for constructing a sports text abstract sentence prediction model, which may specifically include the following steps a-B:
step A: sample sentences in the sports text are obtained.
In this embodiment, in order to construct a sports text abstract sentence prediction model, a large amount of preparation work needs to be performed in advance, and in order to improve the accuracy of a prediction result of the sports text abstract sentence prediction model, the training corpora adopted in the embodiment of the present application are all sentences included in a sports text, so as to perform model training in the sports field, rather than performing training in the general field by using general corpora.
Therefore, firstly, a large amount of sentence texts in the sports text need to be collected, for example, 100 sentence texts in a sports news text can be collected in advance, each sentence text is respectively used as a sample sentence, and important entity words in the sample sentence are labeled in advance, that is, important entity words in a training corpus are labeled in advance to train a sports text abstract sentence prediction model. The important entity words in the training corpus refer to words in the sports text which represent important sports match information such as match time, match place, match team, match field, match result and the like.
It should be noted that, in order to provide sufficient training corpus, in the embodiment of the present application a sports text corpus is constructed in advance: a large number of sports texts containing no illegal characters are obtained from sports portal websites by means of web crawlers and the like, and are stored in the sports text corpus. Meanwhile, the important entity words contained in the sports texts in the corpus are labeled in advance, so as to provide sufficient training corpus for training the sports text abstract sentence prediction model.
And B: and training a pre-constructed initial sports text abstract sentence prediction model by using the sample sentences to obtain the sports text abstract sentence prediction model.
In this embodiment, an initial sports text abstract sentence prediction model may be constructed in advance and its model parameters initialized, such as the sports text abstract sentence prediction model shown in fig. 2. Structurally, the sports text abstract sentence prediction model comprises two layers of Transformer encoders connected to one layer of Transformer decoder. The initial sports text abstract sentence prediction model is used for predicting, according to the word features in an input sentence, the probability that each covered important entity word in the sentence is each word in a word list, where the word list is constructed from the sports text corpus; for example, all the participles in the entire sports text corpus can be collected to form the word list. Then, the initial sports text abstract sentence prediction model may be trained using the sample sentences obtained in step A to obtain the sports text abstract sentence prediction model, and the specific implementation process may include the following steps B1-B4:
step B1: and performing word segmentation processing on the sample sentence, and identifying the important entity words in the sample sentence.
In this embodiment, after the sample sentence obtained in step a is used, word segmentation processing may be performed on the sample sentence to obtain each word included in the sample sentence, and then, according to the labeling performed on the important entity words in the sports text in the corpus in advance, the important entity words in each word included in the sample sentence may be identified.
In an implementation manner of this embodiment, the important entity words are extracted from the corresponding text sentences by using a pre-constructed important entity recognition model.
Specifically, the existing open-source model TENER (Transformer Encoder for NER) may be trained in advance using the labels of the important entity words in the corpus. For example, all the labeled important entity words in the corpus can form an important entity lexicon, and TENER is then trained using this lexicon; during training, the important entity lexicon is continuously improved (for example, by adding important entity words representing new athletes, new competition venues, and the like), and TENER is trained again using the new lexicon until a training stop condition is satisfied, thereby completing training and obtaining the important entity recognition model TENER for recognizing important entities in sports texts. Further, all the important entity words can be extracted from the sample sentence by using the important entity recognition model TENER for performing the subsequent step B2.
It should be noted that the training process of the existing open-source model TENER is consistent with existing methods and is not described again here.
Step B2: covering a first percentage of important entity words in the sample sentence, keeping a second percentage of the important entity words unchanged, and replacing a third percentage of the important entity words with other important entity words of the same class, wherein the sum of the first percentage, the second percentage and the third percentage is 1.
It should be noted that, because no information is covered when probability prediction is performed with the trained sports text abstract sentence prediction model, in order to bridge the gap between the model training process and the final model usage process, the embodiment of the present application covers only a first percentage of the important entity words in a sample sentence during model training, rather than covering all of them; meanwhile, a second percentage of the important entity words are kept unchanged, and a third percentage are replaced with other important entity words of the same class. The specific values of the first, second, and third percentages may be set according to the actual situation and are not limited in the embodiment of the present application, but their sum must equal 1.
For example: taking the first percentage, the second percentage, and the third percentage as 80%, 10%, and 10% respectively, 80% of the important entity words in the sample sentence may be covered, 10% may be kept unchanged, and the remaining 10% may be replaced with other important entity words of the same class (e.g., the M venue name replaced with the N venue name).
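The 80%/10%/10% treatment of important entity words can be sketched as follows; the entity words, the same-class replacement pool, and the fixed random seed are assumptions for the example:

```python
import random

# Sketch of the 80%/10%/10% treatment of important entity words: cover with a
# mask token, keep unchanged, or swap for another entity of the same class.
# The entity words and the same-class replacement pool are hypothetical.

def corrupt_entities(entities, same_class_pool, p_mask=0.8, p_keep=0.1, seed=0):
    rng = random.Random(seed)
    out = []
    for word in entities:
        r = rng.random()
        if r < p_mask:
            out.append("[MASK]")                           # first percentage: covered
        elif r < p_mask + p_keep:
            out.append(word)                               # second percentage: unchanged
        else:
            out.append(rng.choice(same_class_pool[word]))  # third percentage: same-class swap
    return out

pool = {"M venue": ["N venue"], "Team A": ["Team B"]}
print(corrupt_entities(["M venue", "Team A"], pool))
```

The remaining 10% implicitly falls to the same-class replacement branch, so the three percentages sum to 1 as required.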
Step B3: and inputting the word characteristics of each participle extracted from the sample sentence into a pre-constructed initial sports text abstract sentence prediction model for training, and predicting the probability that the covered important entity words in the sample sentence are each word in the word list.
In this embodiment, after the important entity words in the sample sentence are processed in step B2, the vector expression result of each participle in the sample sentence (including all the important entity words processed in step B2) can be further extracted as the word feature corresponding to that participle and input into the pre-constructed initial sports text abstract sentence prediction model shown in fig. 2 for model training. In the above manner, multiple groups of context clauses in the sports text to which the sample sentence belongs can be extracted in sequence, their corresponding word features generated and input into the model for training, so that the context-related sentence-level representations obtained by the two Transformer encoder layers in the model (as shown in fig. 2) can be used to predict the probability that each covered important entity word (or each replaced or unchanged important entity word) in the sample sentence is each word in the word list, according to the existing word features in the sample sentence.
For example: assuming that the word list contains 100 words, the predicted probability that a covered important entity word in the sample sentence is each word in the word list is a vector p = (p1, p2, ..., p100). This vector has 100 dimension values, where each dimension value pj lies in the interval [0, 1] and represents the probability that the covered important entity word is the corresponding preset word in the word list (such as "Beijing", "Bird's Nest", "Women's World Cup"), and the sum of the 100 dimension values is 1. If the value of the third dimension of the vector, 0.98, is the highest, then the preset word corresponding to that dimension is taken as the covered important entity word in the sample sentence.
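The vocabulary-probability vector above can be illustrated numerically; the 100-word list and the probability values are assumed, with the third dimension set to 0.98 as in the example:

```python
# Numeric illustration of the predicted vocabulary distribution: 100 dimension
# values summing to 1, with the covered entity word recovered as the dimension
# holding the highest probability. The word list and values are assumed.

vocab = ["Beijing", "Bird's Nest", "Women's World Cup"] + [f"word{i}" for i in range(97)]
probs = [0.01, 0.01, 0.98] + [0.0] * 97  # assumed model output

assert abs(sum(probs) - 1.0) < 1e-9       # the dimension values sum to 1
best = max(range(len(probs)), key=probs.__getitem__)
print(vocab[best])  # the third preset word, matching the 0.98 dimension
```

Taking the argmax over the vector recovers the preset word corresponding to the highest dimension value, as described above.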
For another example: taking the third clause of a certain sports text as the sample sentence for training, as shown in fig. 2, the important entity word w2 (the second word in the clause) is covered. Through the above steps B2-B3, the context-related sentence-level representation of the clause (as shown in fig. 2) is obtained, and the probability that the covered important entity word w2 is each word in the word list is predicted according to the existing word features in the clause. In fig. 2, the "BOS" token represents the start of the sample sentence to be predicted (i.e., the third clause), and the "EOS" token represents the end of the sample sentence to be predicted (i.e., the third clause).
Step B4: and when the preset stopping condition is not met, re-acquiring the sample sentences in the sports text, repeatedly performing word segmentation processing on the sample sentences, identifying important entity words in the sample sentences and subsequent steps until the preset stopping condition is reached, and taking the model when the preset stopping condition is reached as a sports text abstract sentence prediction model.
In this embodiment, when performing multiple rounds of model training on the initial sports text abstract sentence prediction model using the training data in the above steps B1-B3, after each round of training it is further necessary to determine whether a preset stop condition is satisfied, for example, whether the prediction accuracy for the covered important entity words reaches a preset threshold. When the preset stop condition is not satisfied, the sample sentences in the sports text are re-obtained according to the result of the current round of model training, and the word segmentation processing of the sample sentences, the identification of the important entity words in the sample sentences, and the subsequent steps are repeated, so as to retrain the model through the above steps B2-B3, until the preset stop condition is reached; the model at that point is taken as the sports text abstract sentence prediction model.
In this way, according to the migration learning characteristics of the sports text abstract sentence prediction model, the probability that each non-detailed sentence is used as an abstract sentence can be predicted more accurately by using the sports text abstract sentence prediction model, and the specific implementation process is shown in the second embodiment.
In summary, according to the method for extracting the sports text abstract provided by this embodiment, after the non-detail sentences in the target text are obtained, for each non-detail sentence, the probability that the non-detail sentence is the abstract sentence is determined according to the word features extracted from the non-detail sentence, and then, according to the probability that all the non-detail sentences are abstract sentences, the target non-detail sentences meeting the preset initial selection condition are selected from all the non-detail sentences to form the initial selection text abstract of the target text, and further, the text abstract of the target text can be determined according to the initial selection text abstract. Therefore, the embodiment of the application removes the detailed sentences in the target text first, so that the remaining non-detailed sentences can reflect more key content information such as game names, game results and the like of sports games, and then the probability that the non-detailed sentences are abstract sentences is determined more accurately based on word characteristics extracted from all the obtained non-detailed sentences and the incidence relations among all the non-detailed sentences, and the probability is used as a basis for forming text abstract, so that the accuracy of the abstract extraction result of the sports text can be improved.
Second embodiment
The present embodiment will describe a specific implementation manner of predicting the probability that each non-detail sentence is a abstract sentence by using a pre-constructed sports text abstract sentence prediction model in the first embodiment.
Referring to fig. 3, a schematic diagram of a process for determining a probability that a non-detail sentence is a summary sentence according to a word feature extracted from the non-detail sentence according to the present embodiment is shown, where the process includes the following steps:
s301: and for each non-detail sentence, generating a first sentence expression result of the non-detail sentence according to the word characteristics extracted from the non-detail sentence and the dependency relationship among the words in the non-detail sentence.
In this embodiment, after the non-detail sentences in the target text are acquired, for each non-detail sentence, the non-detail sentence can be predicted according to the following steps S301 to S303, so as to predict the probability that each non-detail sentence is a summary sentence. It should be noted that, in the following content, how to predict the probability that a non-detailed sentence is an abstract sentence will be described with reference to a certain non-detailed sentence in the target text in the present embodiment, and the prediction modes of other non-detailed sentences are similar to the above, and are not described again.
In step S301, word segmentation processing may be performed on the non-detail sentence to obtain each word in the non-detail sentence, and then word features corresponding to each word may be extracted. And inputting the extracted word characteristics corresponding to each word into a pre-constructed sports text abstract sentence prediction model shown in fig. 2, encoding the words through a first layer of transformer encoder of the model, analyzing the dependency relationship among the words in the non-detailed sentence, and outputting a first sentence expression result of the non-detailed sentence through the first layer of transformer encoder. The dependency relationship between the words in the non-detail sentence refers to a logical association relationship between the words in the non-detail sentence, such as an upper and lower meaning relationship, a general division relationship, a class meaning relationship, and the like.
Before encoding by using the first layer transform encoder of the model, it is necessary to add an end-of-sentence tagging word to the end of the non-detailed sentence, for example, "EOS" token or another tagging word may be added, and an output vector corresponding to the end-of-sentence tagging word (for example, "EOS" token) after encoding processing by the transform encoder is used as the first sentence expression result of the non-detailed sentence.
In a possible implementation manner of this embodiment, a specific implementation process of "generating a first sentence expression result of the current non-detail sentence according to the word feature extracted from the current non-detail sentence and the dependency relationship between words in the current non-detail sentence" in this step S301 may include the following steps S3011 to S3014:
s3011: and respectively taking each word of the current non-detail sentence as a target word, and extracting the word characteristics of each target word.
In this implementation, the present non-detailed sentence may be segmented to obtain each word, and here, each word is defined as a target word. Then, the word feature corresponding to each target word can be extracted. For each target word, the word feature of the target word may be a vector expression result of the target word, that is, the word vector of the target word may be used to characterize its corresponding semantic information, where the word vector of the target word may be generated by a vector generation method, for example, the word vector of the target word may be generated by a word2vec method.
For example, the following steps are carried out: as shown in fig. 2, taking the current non-detail sentence as sent1 in fig. 2 as an example, the current non-detail sentence includes two target words, and then the word vectors corresponding to the two target words generated by the word2vec method are w1 and w2 respectively displayed at the bottom of fig. 2.
S3012: and for each target word, generating a first semantic expression result of the target word according to the word characteristics of the current target word.
In this implementation manner, after the word feature (i.e., the word vector) of each target word is extracted in step S3011, each target word may be processed according to subsequent steps S3012 to S3013, so as to generate the first sentence expression result of the current non-detail sentence according to the processing result. It should be noted that, in the following content, this embodiment will use a certain target word in the current non-detailed sentence as a reference, and introduce how to process the target word, and the processing manners of other target words are similar to the above, and are not described again.
In step S3012, after the word feature corresponding to each target word is extracted, the word feature may be used as a first semantic expression result of the corresponding target word, and input into the pre-constructed sports text abstract sentence prediction model shown in fig. 2, so as to perform encoding processing on the sports text abstract sentence prediction model through the first layer transform encoder of the model.
It should be noted that, in order to improve the model prediction accuracy, the word feature corresponding to the target word may include not only the word vector representing the semantic information corresponding to the target word, but also the vector expression result representing the position information and the syntax information of the target word, and the word feature is used as the first semantic expression result of the target word for subsequent processing, which is not described in detail herein.
S3013: and generating a second semantic expression result of the current target word according to the first semantic expression result of the current target word and the respective first semantic expression results of other target words except the current target word in the current non-detail sentence.
In this implementation, after the first semantic expression result of each target word is determined in step S3012, it may be input to the sports text abstract sentence prediction model shown in fig. 2 as input data, and after it is input to the first layer transform encoder of the sports text abstract sentence prediction model, as shown in fig. 2, before encoding by using the first layer transform encoder of the model, it is necessary to add an end-of-sentence labeled word, such as "EOS" token, to the end of the sentence, and then encode these input vectors by using the first layer transform encoder in the model, so as to generate the second semantic expression result of each target word.
For example, the following steps are carried out: based on the above example, as shown in fig. 2, after the word vectors w1 and w2 corresponding to the two target words in the sent1 are input to the first layer transform encoder of the sports text abstract sentence prediction model shown in fig. 2, an "EOS" token is added at the end of the sentence, and then the input vectors corresponding to w1 and w2 and the "EOS" token are encoded by the first layer transform encoder in the model to generate the second semantic expression results of the two target participles and the "EOS" token, such as three gray bars shown in the bottom layer in fig. 2 (each gray bar includes a black circle in the middle).
S3014: and generating a first statement expression result of the current non-detail statement according to the second semantic expression result of each target word.
In this implementation, after the second semantic expression result of each target word is generated in step S3013, the output vector corresponding to the end-of-sentence mark word (e.g., the "EOS" token) output by the first-layer Transformer encoder can be input, as the first sentence expression result of the current non-detail sentence, into the second-layer Transformer encoder of the model, as shown by the gray bar to which the arrow above the third gray bar in the lowest layer of fig. 2 points (i.e., the gray bar placed above "EOS"). The first sentence expression result of the current non-detail sentence is defined as h1, as shown in fig. 2.
S302: and generating a second statement expression result of the current non-detail sentence according to the first statement expression result of the current non-detail sentence and the respective first statement expression results of other non-detail sentences except the current non-detail sentence in the target text.
In this embodiment, after the first sentence expression result corresponding to each non-detail sentence is generated in step S301 (for example, the four first sentence expression results h1, h2, h3 and h4 shown by the four gray bars in the third level of fig. 2, each containing a black circle in the middle), a second sentence expression result corresponding to each non-detail sentence can be further generated from these first sentence expression results through the second-layer transformer encoder of the model. The second sentence expression result characterizes the association relationship between the corresponding non-detail sentence and the context-related non-detail sentences in the target text, as shown by the four light gray bars (each containing a black circle in the middle) of the second layer in fig. 2. The second sentence expression results corresponding to the four non-detail sentences are defined as h1', h2', h3' and h4', as shown in fig. 2.
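The two-layer structure of steps S301 and S302 can be sketched as follows. This is a minimal illustration only: a single-head dot-product self-attention layer stands in for the second-layer transformer encoder (the actual model uses full transformer blocks with learned projections), and the sentence vectors h1–h4 are toy values, not outputs of a real first-layer encoder.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Single-head dot-product self-attention over sentence vectors.

    Each output vector is a weighted mixture of all input vectors,
    which is how the second-layer encoder lets each non-detail
    sentence's representation absorb its context (step S302).
    """
    outputs = []
    for q in vectors:
        # attention score of the query sentence against every sentence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        # weighted sum of all sentence vectors
        out = [sum(w * v[d] for w, v in zip(weights, vectors))
               for d in range(len(q))]
        outputs.append(out)
    return outputs

# toy first sentence expression results h1..h4 (stand-ins for step S301 output)
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
h2 = self_attention(h)  # contextualized second sentence expression results
```

Each row of `h2` corresponds to one non-detail sentence and now depends on every other sentence in the text, mirroring the role of the second sentence expression results h1'–h4'.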
S303: and obtaining the probability that the current non-detail sentence is the abstract sentence according to the second sentence expression result of each non-detail sentence.
In this embodiment, after the second sentence expression result of each non-detail sentence is generated in step S302 by the two-layer transformer encoder included in the sports text abstract sentence prediction model, the second sentence expression result of each non-detail sentence may be further processed by a fully connected layer and a sigmoid activation function, so as to predict the probability that each non-detail sentence is an abstract sentence. The specific calculation formula is as follows:

P(s_i | D) = sigmoid(W_o · h_i' + b_o)

wherein D represents the entire target text; s_i represents the ith non-detail sentence in the target text; P(s_i | D) represents the probability that the ith non-detail sentence in the target text is an abstract sentence, and takes a value in the interval [0, 1]: the higher the probability value, the more likely the corresponding non-detail sentence is to be selected as an abstract sentence, and conversely, the lower the probability value, the less likely it is to be selected; W_o and b_o represent the parameters of the fully connected layer, whose specific values can be determined through the model training process; and h_i' represents the second sentence expression result of the ith non-detail sentence in the target text.
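The fully connected layer followed by a sigmoid can be written out concretely. In the sketch below, the weight vector, bias, and input vector are illustrative placeholders, not values from the patent; in the real model they are learned during training.

```python
import math

def summary_probability(h_i, weights, bias):
    """Probability that a non-detail sentence is an abstract sentence.

    h_i     : second sentence expression result (a vector)
    weights : fully connected layer weights (learned during training)
    bias    : fully connected layer bias (learned during training)
    """
    # fully connected layer: a single linear combination of the features
    z = sum(w * x for w, x in zip(weights, h_i)) + bias
    # sigmoid activation squashes the score into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# illustrative values only
h_i = [0.2, -0.5, 1.1]
W = [0.4, 0.3, -0.2]
b = 0.1
p = summary_probability(h_i, W, b)
```

A sentence with a larger linear score z receives a probability closer to 1 and is more likely to be picked for the initially selected text abstract.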
In summary, in this embodiment, the pre-trained sports text abstract sentence prediction model predicts more accurately the probability that each non-detail sentence is an abstract sentence, according to the word features contained in each non-detail sentence and the association relationships among all the non-detail sentences. The target non-detail sentences forming the text abstract are then determined from all the non-detail sentences according to these probabilities, which can improve the accuracy of the sports text abstract extraction result.
Third embodiment
In this embodiment, a device for extracting a sports text abstract is described, and please refer to the above method embodiment for related contents.
Referring to fig. 5, a schematic composition diagram of a device for extracting a sports text abstract according to this embodiment is provided, where the device 500 includes:
a first obtaining unit 501, configured to obtain a non-detail sentence in a target text; the non-detail sentences are sentences of which the similarity with other sentences in the target text is lower than a preset threshold;
a first determining unit 502, configured to determine, for each non-detail sentence, a probability that the non-detail sentence is a summary sentence according to a word feature extracted from the non-detail sentence;
a composition unit 503, configured to select, according to the probability that each non-detail sentence is an abstract sentence, target non-detail sentences that meet a preset initial selection condition from all the non-detail sentences, and compose an initially selected text abstract of the target text;
a second determining unit 504, configured to determine a text abstract of the target text according to the initially selected text abstract.
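The filtering performed by the first obtaining unit 501 can be sketched as follows. The patent does not specify the similarity measure or the threshold value, so word-overlap (Jaccard) similarity and a toy threshold are assumed here purely for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token lists."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def non_detail_sentences(sentences, threshold):
    """Keep sentences whose similarity to every other sentence in the
    text is below the preset threshold (the definition of a non-detail
    sentence used throughout this document)."""
    tokenized = [s.split() for s in sentences]
    kept = []
    for i, toks in enumerate(tokenized):
        sims = [jaccard(toks, other)
                for j, other in enumerate(tokenized) if j != i]
        if all(sim < threshold for sim in sims):
            kept.append(sentences[i])
    return kept

# toy target text: the first two sentences are near-duplicate detail sentences
text = [
    "the team won the match",
    "the team won the game",
    "fans celebrated downtown after the final whistle",
]
result = non_detail_sentences(text, threshold=0.5)
```

Only the third sentence survives the filter here, because the first two overlap heavily with each other and so exceed the similarity threshold.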
In an implementation manner of this embodiment, the first determining unit 502 includes:
the first generation subunit is used for generating a first statement expression result of the current non-detail sentence according to the word characteristics extracted from the current non-detail sentence and the dependency relationship among words in the current non-detail sentence for each non-detail sentence;
a second generating subunit, configured to generate a second statement expression result of the current non-detailed sentence according to the first statement expression result of the current non-detailed sentence and the first statement expression results of the other non-detailed sentences except the current non-detailed sentence in the target text;
and the first obtaining subunit is used for obtaining the probability that the current non-detail sentence is the abstract sentence according to the second sentence expression result of each non-detail sentence.
In an implementation manner of this embodiment, the first generating subunit includes:
the extraction subunit is used for respectively taking each word of the current non-detail sentence as a target word and extracting the word characteristics of each target word;
the third generation subunit is used for generating a first semantic expression result of the current target word according to the word feature of the current target word for each target word;
a fourth generating subunit, configured to generate a second semantic expression result of the current target word according to the first semantic expression result of the current target word and the respective first semantic expression results of other target words except the current target word in the non-detail sentence;
and the fifth generating subunit is used for generating the first statement expression result of the current non-detail statement according to the second semantic expression result of each target word.
In an implementation manner of this embodiment, the first determining unit 502 is specifically configured to:
for each non-detail sentence, inputting word characteristics extracted from the non-detail sentence into a pre-constructed sports text abstract sentence prediction model, and predicting the probability that the non-detail sentence is an abstract sentence;
the device further comprises:
the second acquisition unit is used for acquiring sample sentences in the sports text;
the training unit is used for training a pre-constructed initial sports text abstract sentence prediction model by using the sample sentences to obtain the sports text abstract sentence prediction model;
the training corpus of the sports text abstract sentence prediction model comprises a plurality of sample sentences in sports texts, the important entity words of which are labeled in advance, and the sports texts are stored in a pre-constructed sports text corpus; the initial sports text abstract sentence prediction model is used for predicting, according to the word features in an input sentence, the probability that each covered important entity word in the sentence is each word in a word list, and the word list is constructed according to the sports text corpus.
In an implementation manner of this embodiment, the training unit includes:
the identification subunit is used for carrying out word segmentation processing on the sample sentence and identifying important entity words in the sample sentence;
a processing subunit, configured to cover a first percentage of the important entity words in the sample sentence, keep a second percentage of the important entity words unchanged, and replace a third percentage of the important entity words with other important entity words of the same category, where the sum of the first percentage, the second percentage, and the third percentage is 1;
the prediction subunit is used for inputting the word characteristics of each participle extracted from the sample sentence into a pre-constructed initial sports text abstract sentence prediction model for training, and predicting the probability that each covered important entity word in the sample sentence is each word in the word list;
and the second obtaining subunit is configured to, when a preset stop condition is not met, obtain a sample sentence in the sports text again, repeatedly perform word segmentation processing on the sample sentence, identify an important entity word in the sample sentence and subsequent steps until the preset stop condition is reached, and use a model when the preset stop condition is reached as the sports text abstract sentence prediction model.
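The covering scheme of the processing subunit resembles BERT-style masked-word pretraining. The sketch below partitions the important entity words by the three percentages; the concrete values 0.8/0.1/0.1, the `[MASK]` placeholder, and the same-category pool are illustrative assumptions, not fractions or symbols specified by the patent.

```python
import random

def mask_important_entities(tokens, entity_positions, same_category,
                            p_mask=0.8, p_keep=0.1, seed=0):
    """Return a copy of `tokens` in which a first percentage of the
    important entity words is covered with a [MASK] placeholder, a
    second percentage is kept unchanged, and the remaining third
    percentage is replaced with another important entity word of the
    same category."""
    rng = random.Random(seed)
    positions = list(entity_positions)
    rng.shuffle(positions)  # random partition of the entity positions
    n = len(positions)
    n_mask = round(n * p_mask)
    n_keep = round(n * p_keep)
    out = list(tokens)
    for k, pos in enumerate(positions):
        if k < n_mask:
            out[pos] = "[MASK]"           # first percentage: covered
        elif k < n_mask + n_keep:
            pass                          # second percentage: unchanged
        else:
            # third percentage: swap in a same-category entity word
            out[pos] = rng.choice(same_category[tokens[pos]])
    return out

# toy sample sentence with three labeled important entity words
tokens = "Messi scored twice as Barcelona beat Madrid".split()
entities = [0, 4, 6]  # positions of the important entity words
pool = {"Messi": ["Ronaldo"], "Barcelona": ["Liverpool"], "Madrid": ["Chelsea"]}
masked = mask_important_entities(tokens, entities, pool)
```

With three entity words and the 0.8/0.1/0.1 split, two positions are covered and one is replaced by a same-category entity, giving the model both cloze-style and corruption-detection training signals.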
In one implementation manner of this embodiment, the important entity words are extracted from the corresponding text sentences by using a pre-constructed important entity recognition model.
Further, an embodiment of the present application further provides a sports text abstract extracting device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which when executed by the processor cause the processor to execute any implementation method of the sports text abstract extraction method.
Further, an embodiment of the present application also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation method of the above-mentioned sports text abstract extraction method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above-mentioned sports text abstract extraction method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A method for extracting a sports text abstract is characterized by comprising the following steps:
acquiring a non-detail sentence in a target text; the non-detail sentences are sentences of which the similarity with other sentences in the target text is lower than a preset threshold;
for each non-detail sentence, determining the probability that the non-detail sentence is a summary sentence according to the word characteristics extracted from the non-detail sentence;
selecting, according to the probability that each non-detail sentence is an abstract sentence, target non-detail sentences meeting a preset initial selection condition from all the non-detail sentences to form an initially selected text abstract of the target text;
determining a text abstract of the target text according to the initially selected text abstract;
for each non-detail sentence, determining the probability that the non-detail sentence is a summary sentence according to the word features extracted from the non-detail sentence comprises:
for each non-detail sentence, respectively taking each word of the current non-detail sentence as a target word, and extracting the word characteristic of each target word;
for each target word, generating a first semantic expression result of the current target word according to the word characteristics of the current target word;
generating a second semantic expression result of the current target word according to the first semantic expression result of the current target word and the respective first semantic expression results of other target words except the current target word in the current non-detail sentence;
generating a first statement expression result of the current non-detail statement according to a second semantic expression result of each target word;
generating a second statement expression result of the current non-detail sentence according to the first statement expression result of the current non-detail sentence and the respective first statement expression results of other non-detail sentences except the current non-detail sentence in the target text;
and obtaining the probability that the current non-detail sentence is the abstract sentence according to the second sentence expression result of each non-detail sentence.
2. The method of claim 1, wherein the determining, for each non-detail sentence, the probability that the non-detail sentence is a summary sentence based on word features extracted from the non-detail sentence comprises:
for each non-detail sentence, inputting word characteristics extracted from the non-detail sentence into a pre-constructed sports text abstract sentence prediction model, and predicting the probability that the non-detail sentence is an abstract sentence;
the sports text abstract sentence prediction model is constructed in the following mode:
acquiring sample sentences in the sports text;
training a pre-constructed initial sports text abstract sentence prediction model by using the sample sentences to obtain the sports text abstract sentence prediction model;
the training corpus of the sports text abstract sentence prediction model comprises a plurality of sample sentences in sports texts, the important entity words of which are labeled in advance, and the sports texts are stored in a pre-constructed sports text corpus; the initial sports text abstract sentence prediction model is used for predicting, according to the word features in an input sentence, the probability that each covered important entity word in the sentence is each word in a word list, and the word list is constructed according to the sports text corpus.
3. The method of claim 2, wherein the training of the pre-constructed initial sports text abstract sentence prediction model using the sample sentences to obtain the sports text abstract sentence prediction model comprises:
performing word segmentation processing on the sample sentence, and identifying important entity words in the sample sentence;
covering a first percentage of important entity words in the sample sentence, keeping a second percentage of the important entity words unchanged, and replacing a third percentage of the important entity words with other important entity words of the same class, wherein the sum of the first percentage, the second percentage and the third percentage is 1;
inputting the word characteristics of each participle extracted from the sample sentence into a pre-constructed initial sports text abstract sentence prediction model for training, and predicting the probability that the covered important entity words in the sample sentence are each word in the word list;
and when the preset stopping condition is not met, re-acquiring the sample sentences in the sports text, repeatedly performing word segmentation processing on the sample sentences, identifying important entity words in the sample sentences and subsequent steps until the preset stopping condition is reached, and taking the model when the preset stopping condition is reached as the sports text abstract sentence prediction model.
4. The method of claim 3, wherein the important entity words are extracted from the corresponding text sentences using a pre-constructed important entity recognition model.
5. A sports text summarization extraction apparatus, comprising:
the first acquisition unit is used for acquiring a non-detail sentence in a target text; the non-detail sentences are sentences of which the similarity with other sentences in the target text is lower than a preset threshold;
the first determining unit is used for determining the probability that each non-detail sentence is a summary sentence according to the word characteristics extracted from the non-detail sentences;
the composition unit is used for selecting a target non-detail sentence meeting preset initial selection conditions from all non-detail sentences according to the probability that all non-detail sentences are abstract sentences to form an initial selection text abstract of the target text;
the second determining unit is used for determining the text abstract of the target text according to the initially selected text abstract;
the first determination unit includes:
the first generation subunit is used for generating a first statement expression result of the current non-detail sentence according to the word characteristics extracted from the current non-detail sentence and the dependency relationship among words in the current non-detail sentence for each non-detail sentence;
a second generating subunit, configured to generate a second statement expression result of the current non-detailed sentence according to the first statement expression result of the current non-detailed sentence and the first statement expression results of the other non-detailed sentences except the current non-detailed sentence in the target text;
the first obtaining subunit is configured to obtain, according to a second statement expression result of each non-detail statement, a probability that the current non-detail statement is a summary statement;
the first generation subunit includes:
the extraction subunit is used for respectively taking each word of the current non-detail sentence as a target word and extracting the word characteristics of each target word;
the third generation subunit is used for generating a first semantic expression result of the current target word according to the word feature of the current target word for each target word;
a fourth generating subunit, configured to generate a second semantic expression result of the current target word according to the first semantic expression result of the current target word and the respective first semantic expression results of other target words except the current target word in the non-detail sentence;
and the fifth generating subunit is used for generating the first statement expression result of the current non-detail statement according to the second semantic expression result of each target word.
6. A sports text summarization device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-4.
7. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-4.
CN202010844192.XA 2020-08-20 2020-08-20 Method, device, storage medium and equipment for extracting sports text abstract Active CN111708878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010844192.XA CN111708878B (en) 2020-08-20 2020-08-20 Method, device, storage medium and equipment for extracting sports text abstract


Publications (2)

Publication Number Publication Date
CN111708878A CN111708878A (en) 2020-09-25
CN111708878B true CN111708878B (en) 2020-11-24

Family

ID=72547357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010844192.XA Active CN111708878B (en) 2020-08-20 2020-08-20 Method, device, storage medium and equipment for extracting sports text abstract

Country Status (1)

Country Link
CN (1) CN111708878B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347240A (en) * 2020-10-16 2021-02-09 小牛思拓(北京)科技有限公司 Text abstract extraction method and device, readable storage medium and electronic equipment
CN112784601B (en) * 2021-02-03 2023-06-27 中山大学孙逸仙纪念医院 Key information extraction method, device, electronic equipment and storage medium
CN113535942B (en) * 2021-07-21 2022-08-19 北京海泰方圆科技股份有限公司 Text abstract generating method, device, equipment and medium
CN113434642B (en) * 2021-08-27 2022-01-11 广州云趣信息科技有限公司 Text abstract generation method and device and electronic equipment
CN115984842A (en) * 2023-02-13 2023-04-18 广州数说故事信息科技有限公司 Multi-mode-based video open tag extraction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051446B1 (en) * 1999-12-06 2011-11-01 Sharp Laboratories Of America, Inc. Method of creating a semantic video summary using information from secondary sources
CN107977420A (en) * 2017-11-23 2018-05-01 广东工业大学 The abstract extraction method, apparatus and readable storage medium storing program for executing of a kind of evolved document
CN110321426A (en) * 2019-07-02 2019-10-11 腾讯科技(深圳)有限公司 Abstract abstracting method, device and computer equipment
CN110717333A (en) * 2019-09-02 2020-01-21 平安科技(深圳)有限公司 Method and device for automatically generating article abstract and computer readable storage medium
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment


Also Published As

Publication number Publication date
CN111708878A (en) 2020-09-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant