CN116151233A

CN116151233A - Data labeling and generating method, model training method, device and medium

Info

Publication number: CN116151233A
Application number: CN202211096247.9A
Authority: CN
Inventors: 丁隆耀; 蒋宁; 肖冰; 李宽; 吕乐宾
Original assignee: Mashang Xiaofei Finance Co Ltd
Current assignee: Mashang Xiaofei Finance Co Ltd
Priority date: 2022-09-08
Filing date: 2022-09-08
Publication date: 2023-05-23

Abstract

The embodiment of the specification provides a data labeling method, a data generating method, a model training method, model training equipment and a medium. The method may include: obtaining a sample to be annotated, wherein the sample to be annotated comprises at least one text, the text comprises at least one event, and each text comprises at least one text sentence; identifying text sentences in each text to obtain an identification result, and marking the corresponding text sentences according to the information to be marked if the identification result comprises the information to be marked, wherein the information to be marked comprises an information type and an argument, the information type is a viewpoint type of the text sentences expressed for the event, and the argument comprises target events corresponding to the viewpoints and/or entity information of the texts related to the viewpoints. The comprehensiveness of content extraction for the event-related text is improved.

Description

Data labeling and generating method, model training method, device and medium

Technical Field

Embodiments in the present disclosure relate to the field of natural language processing, and in particular, to a data labeling method, a data generating method, a model training device, and a medium.

Background

Currently, with the development of internet technology, the public has grown accustomed to browsing web pages through computer devices. The web page may provide the user with a content resource such as pictures, videos, or text.

A large amount of information is propagated over the network every day. To quickly understand the information related to a certain topic, the articles on the network need to be collected and sorted, and information needs to be extracted from them. The processed information data obtained in this way helps the user quickly learn the information related to the topic. In the prior art, information extraction is mainly performed based on emotion analysis of an entity, so as to obtain a representation in triplet form ("entity", "polarity", "emotion word"). For example, from the sentence "the hotel is very clean but too expensive", the following triplets can be extracted: ("hotel", "positive", "very clean"), ("hotel", "negative", "too expensive").

However, in some cases, if only positive or negative emotion analysis of some related entities is extracted for some events, it is difficult for the extracted information data to comprehensively express one event.

Disclosure of Invention

Various embodiments in the present specification provide a data labeling method, a data generating method, a model training device and a medium. The comprehensiveness of content extraction for the event-related text is improved.

One embodiment of the present disclosure provides a data labeling method, applied to a data labeling system, including: obtaining a sample to be annotated, wherein the sample to be annotated comprises at least one text, the text comprises at least one event, and each text comprises at least one text sentence; identifying text sentences in each text to obtain an identification result, and if the identification result comprises information to be marked, marking the corresponding text sentences according to the information to be marked; the information to be marked comprises an information type and an argument, wherein the information type is a viewpoint type expressed by a text sentence aiming at an event, and the argument comprises entity information of a target event corresponding to the viewpoint and/or a text related to the viewpoint.

One embodiment of the present specification provides a data generation method, which may include: acquiring event related text and event description data; wherein the event description data is used for describing an event; the event related text is related to the event; the event related text comprises a plurality of text sentences; identifying viewpoint sentences expressing viewpoints in the text sentences according to the event description data; determining an argument corresponding to the viewpoint statement; and binding the viewpoint sentences and the corresponding argument to generate information data.

One embodiment of the present specification provides a training method of a text processing model including a viewpoint extraction model and an argument recognition model, the method including: receiving a plurality of first sample data and a plurality of second sample data, wherein the first sample data comprises event description data and text sentences, the second sample data comprises viewpoint sentences, text paragraphs corresponding to the viewpoint sentences and argument corresponding to the viewpoint sentences, and the first sample data and the second sample data comprise the same viewpoint sentences; wherein at least part of the text sentences are viewpoint sentences expressing viewpoints, and the event description data are used for describing the occurred events; training the perspective extraction model based on the first sample data; the viewpoint extraction model is used for extracting viewpoint sentences in text sentences according to the event description data; training the argument recognition model based on the second sample data; the argument identification model is used for identifying an argument corresponding to the viewpoint statement according to the text paragraph.

One embodiment of the present specification provides an information data generation apparatus including: the acquisition unit is used for acquiring the event related text and the event description data; wherein the event description data is used for describing an event; the event related text is related to the event; the event related text comprises a plurality of text sentences; a viewpoint identifying unit configured to identify a viewpoint sentence expressing a viewpoint in the text sentence, based on the event description data; an argument determining unit, configured to determine an argument corresponding to the viewpoint statement; and the binding unit is used for binding the viewpoint statement and the corresponding argument to generate information data.

One embodiment of the present description provides a sample data labeling apparatus, including: a sample acquisition unit, configured to acquire a sample to be annotated, wherein the sample to be annotated comprises at least one text, the text comprises at least one event, and each text comprises at least one text sentence; and a labeling unit, configured to identify the text sentences in each text to obtain an identification result and, if the identification result comprises information to be marked, mark the corresponding text sentences according to the information to be marked, wherein the information to be marked comprises an information type and an argument, the information type is the viewpoint type of the text sentences expressed for the event, and the argument comprises target events corresponding to the viewpoints and/or entity information of the texts related to the viewpoints.

One embodiment of the present specification provides a training apparatus for a text processing model including a viewpoint extraction model and an argument recognition model, the training apparatus comprising: a receiving unit configured to receive a plurality of first sample data and a plurality of second sample data, the first sample data including event description data, a text sentence, and an argument corresponding to the text sentence, the second sample data including a perspective sentence, a text paragraph corresponding to the perspective sentence, and an argument corresponding to the perspective sentence, the first sample data and the second sample data including the same perspective sentence; wherein at least part of the text sentences are viewpoint sentences expressing viewpoints; the event description data is used for describing an event; a viewpoint model training unit for training the viewpoint extraction model based on the first sample data; the viewpoint extraction model is used for identifying viewpoint sentences in the text sentences according to the event description data; an argument model training unit for training the argument recognition model based on the second sample data; wherein the argument identification model is used to determine an argument corresponding to the perspective statement.

An embodiment of the present specification provides a computer device comprising a memory storing a computer program and a processor implementing a method according to any one of the preceding claims when the computer program is executed by the processor.

An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as claimed in any preceding claim.

The present specification provides various embodiments that extract perspective sentences in event-related articles based on event description data for an event, and extract arguments in the perspective sentences. Therefore, the extraction of the information data based on the event is realized, and the emotion analysis of the entity is not concerned any more, but the views and the argument of the views which are put forward in the event-related articles related to the event are concerned, so that the event can be expressed more comprehensively. The user can quickly know the event condition and possible influence.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

Fig. 1 is a schematic diagram of a data processing system provided in one embodiment of the present description.

Fig. 2 is a flow chart of a method for generating event-based information data according to an embodiment of the present disclosure.

Fig. 3 is a flow chart of a sample data labeling method according to an embodiment of the present disclosure.

Fig. 4 is a flowchart of a training method of a data generation model according to an embodiment of the present disclosure.

Fig. 5 is a schematic block diagram of an information data generating apparatus according to an embodiment of the present disclosure.

Fig. 6 is a schematic block diagram of a sample data labeling apparatus according to an embodiment of the present disclosure.

Fig. 7 is a schematic block diagram of a training device for an information data generation model according to an embodiment of the present disclosure.

Fig. 8 is a schematic architecture diagram of a computer device according to an embodiment of the present disclosure.

Detailed Description

Information extraction refers to extracting specific information from natural language text so that the information can be classified and organized, which facilitates its further use.

In the related art, information is often extracted by analyzing the emotion expressed toward an entity, for example in the form of triplets ("entity", "polarity", "emotion word"). In some cases, however, a text may contain neutral expressions, for example, "I feel that this is the turning point in the relationship between the two countries". Such neutral sentences may express a personal judgment or standpoint on an event. Analyzing and extracting information data based on the emotion of the entity, as in the related art, ignores these neutral expressions, so that it is difficult for the extracted information data to comprehensively express an event.

Therefore, it is necessary to provide a method of extracting a perspective sentence in an event-related article based on event description data for an event and extracting an argument in the perspective sentence. The extraction of the information data based on the event is realized, and the emotion analysis of the entity is not concerned any more, but rather the viewpoints and the arguments of the viewpoints which are proposed in the event-related articles related to the event are concerned, so that the event can be expressed more comprehensively.

Please refer to Fig. 1. The present description provides a data processing system. The data processing system may include clients and servers. The user may operate the client to issue instructions to the server to control the operation of the data processing system. The client may be an electronic device with network access capabilities. Specifically, for example, the client may be a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device, a shopping guide terminal, a television, a smart speaker, a microphone, and the like. Smart wearable devices include, but are not limited to, smart bracelets, smart watches, smart glasses, smart helmets, smart necklaces, etc. Alternatively, the client may be software capable of running in the electronic device. The server may be an electronic device with a certain arithmetic processing capability, which may have a network communication module, a processor, memory, and the like. Of course, the server may also refer to software running in the electronic device. The server may also be a distributed server, that is, a system having a plurality of processors, memories, network communication modules, etc. operating in concert. Alternatively, the server may be a server cluster formed of several servers. Or, with the development of science and technology, the server may also be a new technical means capable of realizing the corresponding functions of the embodiments of the specification, for example, a new form of "server" implemented based on quantum computing.

Please refer to fig. 2. One embodiment of the present specification provides an event-based information data generation method. The information data generation method may be applied to a server. The information data generation method may include the following steps.

Step S101: acquiring event related text and event description data; wherein the event description data is used for describing an event; the event related text is related to the event; the event related text includes a plurality of text sentences.

In some cases, a user may operate a client to specify events of interest. Specifically, for example, the user may specify some keywords that characterize an event, or specify text data in a page, or the like. In this way, the server may pull event-related articles from the network based on the keywords, or pull similar event-related articles based on the text data.

An event may be something that happens in the real world. After an event occurs, it may have some impact on society. Generating information data related to the event therefore allows the event to be understood more quickly, clearly and comprehensively. Alternatively, the impact of the event can be comprehensively evaluated through the information data.

The event description data may be used to represent events. In particular, for example, the event description data may include at least one event description text. Using a plurality of event description texts may improve the accuracy and comprehensiveness of the event description. The event description text may be text data extracted from the provided event related text by an event extraction model. Specifically, the event extraction model may be implemented based on a machine learning method. For example, event extraction may be implemented using a trained neural network. Further, for example, the event description data may be expressed as e, with e = {w1, w2, …, wn}, where n is a positive integer and w represents an event description text.

Event related text may refer to articles related to an event. Specifically, the articles may be articles published on a network, such as web pages of a subject website, personal blog articles, microblog posts, and the like. The content expressed by the event related text may describe the event, express the influence caused by the event, express the author's attitude toward the event, or describe other events caused by the event, and the like. The event related text may include a plurality of text sentences. A text sentence may be a combination of words that expresses a relatively complete meaning. In some cases, text sentences may be divided according to punctuation. Specifically, for example, different text sentences may be divided at the full stop "。". Of course, text sentences may also be divided according to other punctuation marks, such as the English period "." or the comma ",". In some embodiments, whether a segment expresses a complete meaning may also serve as the criterion for dividing text sentences. Specifically, each event related text may be denoted d, with d = {s1, s2, …, sn}, where s represents a text sentence and n is a positive integer.
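The punctuation-based division described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function name `split_sentences` and the default delimiter set are assumptions, and a plain period rule will also split decimal numbers such as "3.5".

```python
import re
from typing import List


def split_sentences(text: str, delimiters: str = "。.!?！？") -> List[str]:
    """Split event-related text d into text sentences s1..sn at punctuation.

    `delimiters` is an assumed default; pass e.g. "。.，," to also split at commas.
    """
    cls = re.escape(delimiters)
    # A sentence is a run of non-delimiter characters followed by any
    # delimiter characters that terminate it.
    pattern = re.compile(f"[^{cls}]+[{cls}]*")
    return [m.strip() for m in pattern.findall(text) if m.strip()]


if __name__ == "__main__":
    for s in split_sentences("战争爆发了。股票会下跌。"):
        print(s)
```

Criteria based on semantic completeness, which the text also mentions, would need a model rather than a regular expression and are not covered by this sketch.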

In some implementations, the text sentences may include the body and the comments of the event-related text. After the event related text is published, its body can be displayed on the corresponding page. Thus, viewpoint sentences related to the event-related text can be obtained by analyzing the text sentences of the body. In some cases, many people may leave messages in the comment section of the event-related text, and these messages may also contain valuable viewpoints. Specifically, for example, from the hot comments below the event-related text, the top-1 or top-3 comments may be taken, or high-value replies may be obtained by screening with a like-count threshold, and so on. Taking the comments of the event-related text as part of the text sentences therefore allows the information data to be extracted more comprehensively.
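As a rough sketch of the comment screening described above, the following selects the most-liked comments and optionally applies a like-count threshold. The `Comment` type, its fields, and the function name are hypothetical; the text does not specify how comments or like counts are represented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Comment:
    text: str
    likes: int  # assumed popularity signal ("praise" count)


def select_hot_comments(comments: List[Comment], top_k: int = 3,
                        min_likes: Optional[int] = None) -> List[Comment]:
    """Return up to top_k comments, most liked first, optionally thresholded."""
    ranked = sorted(comments, key=lambda c: c.likes, reverse=True)
    if min_likes is not None:
        ranked = [c for c in ranked if c.likes >= min_likes]
    return ranked[:top_k]


if __name__ == "__main__":
    sample = [Comment("a", 5), Comment("b", 50), Comment("c", 20)]
    print([c.text for c in select_hot_comments(sample, top_k=1)])
```

The selected comment texts could then be split into text sentences and processed together with the body.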

In some implementations, financial event related text relating to a financial event may be authorized to be obtained from a network; and extracting event description data from the financial event related text.

In some cases, the financial event may be a real-world occurrence of a financial related event. Specifically, some events are other fields on the surface, but after occurrence, they also have a certain influence on the financial field, and can also be regarded as financial events in this embodiment. Financial events often involve more personal attitudes, judgment or anticipation, etc. Thus, the financial event itself is difficult to fully represent by extracting information only from the emotion analysis perspective of the entity. Therefore, information data may be extracted for financial events based on perspective and argument angle. Thus, the obtained information data can more comprehensively and accurately express the financial event.

Step S103: and identifying the viewpoint statement expressing the viewpoint in the text statement according to the event description data.

In the present embodiment, whether a text sentence expresses a viewpoint on the event may be analyzed with the event description data as a reference. A viewpoint may represent the view held on an event from a certain standpoint or angle. A viewpoint may have an emotional tendency, e.g., positive or negative. Of course, a viewpoint may also have no emotional tendency and merely express an attitude, which may be neutral or objective.

In some embodiments, the types of viewpoints expressed by the viewpoint sentences include at least one of: an event personal judgment, representing a prediction or interpretation of future matters; an event attitude, representing a standpoint on a problem; an emotional viewpoint, expressing a feeling; or a suggested viewpoint, expressing a suggestion to a person reading the event-related text.

Thus, the opinion sentence may not be limited to emotion expression. So that the viewpoint sentences can express the events more comprehensively. Further, by dividing the types of views from a plurality of angles, it is possible to embody at least part of the aforementioned views in the plurality of recognized view sentences. In the present embodiment, the viewpoint expressed by the viewpoint sentence may include at least two of the viewpoint types described above.

In this embodiment, an event personal judgment representing a prediction or interpretation of future matters is, for example, "I consider that the plan of XX will fail". An event attitude representing a standpoint on a problem is, for example, "XX company says that this decision is not useful at all". An emotional viewpoint expressing a feeling is, for example, "I am very happy to see this". A suggested viewpoint expressing a suggestion to a person reading the event-related text is, for example, "we suggest that mobile phone users turn off their devices during the flight".

In some embodiments, the identifying, according to the event description data, a perspective sentence expressing a perspective in the text sentence may include: inputting the event description data and the text sentence into a viewpoint extraction model to obtain a viewpoint sentence in the text sentence; the viewpoint extraction model is used for outputting a first label corresponding to the text sentence; wherein the first tag includes a first value indicating that the corresponding text sentence is a perspective sentence or a second value indicating that the corresponding text sentence is not a perspective sentence.

In this embodiment, the viewpoint extraction model may be a natural language data processing model that has been trained in advance. After the event description data and the text sentence are input into the viewpoint extraction model, the model processes the data and outputs the viewpoint sentences. In some embodiments, a plurality of viewpoint extraction models may be provided, one for each information type of viewpoint, with each viewpoint extraction model extracting the viewpoint sentences of its information category from the event-related text.

In some embodiments, the viewpoint extraction model may be a BERT model. After the event description data and the text sentence are input to the viewpoint extraction model, the model classifies the text sentence. Specifically, the viewpoint extraction model may have two classes: one in which the text sentence is considered a viewpoint sentence, and one in which it is not. The model may implement the classification by attaching a first tag to the input text sentence, where the first tag takes a first value indicating that the text sentence is a viewpoint sentence or a second value indicating that it is not. For example, the first value may be 1 and the second value may be 0. In some embodiments, the event description data may be represented as e, and a text sentence in the event-related text as s. The input data X = {[CLS], e, [SEP], s} can then be constructed, where [CLS] and [SEP] are, respectively, the sequence-start identifier of BERT-class models and the separator between two different sentences. A BERT model is then used for text matching: for each e, every s is traversed to form an input X, and after X is encoded by the BERT model, binary classification is performed by a classifier. When s is considered to include a viewpoint, the first tag corresponding to s is "1"; otherwise, the first tag is "0".
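The traversal and input construction described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, `classify` is a stand-in for a fine-tuned BERT encoder plus binary classifier (which is not implemented here), and keeping a sentence when it is labeled 1 for at least one event description is an assumed aggregation rule that the text does not specify.

```python
from typing import Callable, List, Sequence, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_viewpoint_inputs(event_descriptions: Sequence[str],
                           sentences: Sequence[str]) -> List[Tuple[str, str, str]]:
    """For each e, traverse every s and form X = [CLS] e [SEP] s."""
    return [(e, s, f"{CLS} {e} {SEP} {s}")
            for e in event_descriptions for s in sentences]


def extract_viewpoint_sentences(event_descriptions: Sequence[str],
                                sentences: Sequence[str],
                                classify: Callable[[str], int]) -> List[str]:
    """Return sentences whose first tag is 1 for at least one e (assumed rule).

    `classify` maps an input X to the first tag: 1 (viewpoint) or 0 (not).
    """
    tags = {s: 0 for s in sentences}
    for _e, s, x in build_viewpoint_inputs(event_descriptions, sentences):
        if classify(x) == 1:
            tags[s] = 1
    return [s for s in sentences if tags[s] == 1]


if __name__ == "__main__":
    # Keyword rule used only as a placeholder for the trained model.
    def toy_classify(x: str) -> int:
        return 1 if "I think" in x.split(SEP)[1] else 0

    sents = ["War broke out.", "I think the plan of XX will fail."]
    print(extract_viewpoint_sentences(["war"], sents, toy_classify))
```

In practice a BERT tokenizer usually inserts the special tokens itself when given a sentence pair, so the explicit string form here only mirrors the notation of the text.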

Of course, in some embodiments, the perspective extraction model may also be implemented using other algorithmic models. Specifically, for example, a neural network algorithm model or the like is employed.

Step S105: and determining the argument corresponding to the viewpoint statement.

In some cases, an argument may refer to a noun phrase bearing a thematic role. The noun phrase may represent a target event, or entity information related to the viewpoint expressed by a viewpoint sentence. A thematic role is a semantic role that a predicate assigns to the noun phrases associated with it, based on the semantic relationship between the predicate and those noun phrases. Thematic roles may include an agent, a patient, a beneficiary, a recipient, a user, etc.

In this embodiment, the argument may include at least one of: an event expression corresponding to an event for which the viewpoint expressed by the viewpoint statement is directed; or, in the case that the viewpoint expressed by the viewpoint sentence is a related event for an event, the argument is a related event expression of the related event; or, the perspective statement expresses entity information corresponding to the entity related to the perspective.

In this embodiment, the event expression may be used to characterize an event. Specifically, the event expression may be the name of the event. For example, a viewpoint sentence may be "war will cause a stock to drop", in which the event expression may be "war". A viewpoint sentence may also express a viewpoint on a related event of an event; that is, after an event occurs, another sub-event may follow. For example, the viewpoint sentence may be "the 123 stock is greatly affected by the war and falls more sharply", where the event may be "war" and the related event expression of the related event may be "the 123 stock falls more sharply". The entity to which the viewpoint expressed by the viewpoint sentence relates may be a specific thing in the real world, and the entity information may be the name of that thing. For example, in "'ABC' is a well-performing stock", 'ABC' may be the entity information of an entity.

In some embodiments, determining the argument corresponding to the viewpoint sentence may include: acquiring the text paragraph in which the viewpoint sentence is located in the event related text; extracting arguments from the text paragraph; and inputting the text paragraph, the viewpoint sentence and the extracted arguments into an argument recognition model to identify the argument corresponding to the viewpoint sentence.

There may be at least one paragraph in the event related text, typically a number of paragraphs. Each paragraph may include at least one text statement. Thus, the perspective sentence corresponds to a text paragraph including the perspective sentence in the event related text. The text paragraph may document the context of the opinion statement. Acquiring the text paragraph can more accurately express the viewpoint statement. In some cases, text paragraphs in the event-related text that include the point-of-view sentences may be determined by way of text matching. Or, when determining the viewpoint statement, recording the text paragraph in which the viewpoint statement is located. Or, when determining the viewpoint statement, recording the position of the text paragraph in which the viewpoint statement is located.
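Locating the text paragraph by text matching can be sketched as below. Exact substring matching and the function name `find_paragraph` are assumptions; the text only says that text matching, or recording the paragraph or its position at extraction time, may be used.

```python
from typing import List, Optional, Tuple


def find_paragraph(paragraphs: List[str],
                   viewpoint_sentence: str) -> Optional[Tuple[int, str]]:
    """Return (position, paragraph) of the first paragraph containing the sentence."""
    for index, paragraph in enumerate(paragraphs):
        if viewpoint_sentence in paragraph:
            return index, paragraph
    return None  # sentence not found, e.g. it came from a comment


if __name__ == "__main__":
    doc = ["The war started in March.",
           "Analysts reacted quickly. War will cause stocks to drop."]
    print(find_paragraph(doc, "War will cause stocks to drop."))
```

Returning the position as well as the paragraph covers the variant in which only the location of the paragraph is recorded.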

The argument may be extracted from the text paragraph, and further, the argument corresponding to the perspective sentence may be determined from the extracted argument. In some implementations, the argument may be extracted from the text paragraph using a trained data processing model based on a machine learning approach of feature engineering. Specifically, for example, a deep learning algorithm may be employed. Of course, those skilled in the art, with the benefit of this disclosure, may extract the argument in other ways, and this will not be repeated. In some embodiments, the data processing model may also be generated directly based on a deep learning algorithm, the argument is extracted from the viewpoint sentence, and the extracted argument is used as the argument corresponding to the viewpoint sentence.

The argument recognition model may be used to determine, from the text paragraph and the viewpoint sentence, which of the extracted arguments corresponds to the viewpoint sentence. The argument recognition model may be a natural language processing model, generated based on natural language processing algorithms and trained using training samples. In some implementations, the argument recognition model can be built based on the BERT model. Specifically, for example, the text paragraph in which a viewpoint sentence is located may be denoted span, the viewpoint sentence o, and an extracted argument a, and the input data X = {[CLS], span, [SEP], o, [SEP], a} is constructed. Text matching can then be performed using the BERT model to determine the argument a corresponding to the viewpoint sentence o.
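The input construction for argument recognition can be sketched in the same way. The function names are hypothetical, and `match` is a placeholder for the trained BERT-based matching model, which is not implemented here; it is assumed to return 1 when the candidate argument a corresponds to the viewpoint sentence o.

```python
from typing import Callable, List, Sequence

CLS, SEP = "[CLS]", "[SEP]"


def build_argument_input(span: str, o: str, a: str) -> str:
    """Form X = [CLS] span [SEP] o [SEP] a for one candidate argument a."""
    return f"{CLS} {span} {SEP} {o} {SEP} {a}"


def recognize_arguments(span: str, o: str, candidates: Sequence[str],
                        match: Callable[[str], int]) -> List[str]:
    """Keep the candidate arguments that `match` accepts for viewpoint sentence o."""
    return [a for a in candidates
            if match(build_argument_input(span, o, a)) == 1]


if __name__ == "__main__":
    # Placeholder rule: accept a candidate that occurs in the viewpoint sentence.
    def toy_match(x: str) -> int:
        _span, o, a = x.split(f" {SEP} ")
        return 1 if a.lower() in o.lower() else 0

    paragraph = "ABC is a bank. War will cause stocks to drop."
    print(recognize_arguments(paragraph, "War will cause stocks to drop.",
                              ["war", "ABC"], toy_match))
```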

Of course, in some embodiments, the argument recognition model may also be implemented using other algorithmic models. Specifically, for example, a neural network algorithm model or the like is employed.

Step S107: and binding the viewpoint sentences and the corresponding argument to generate information data.

The resulting viewpoint sentence may be stored in correspondence with its argument, so that the viewpoint sentence and the argument form a data combination. Specifically, for example, the viewpoint sentence may be represented as Ok, the argument as Ak, and the generated information data as T, where T = {…, (Ok, Ak), … | e, d}.
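One possible in-memory form of the information data T is sketched below. The dictionary layout, the key names, and the function name `bind_information_data` are assumptions made for illustration; the text only requires that each viewpoint sentence be stored in correspondence with its arguments for a given e and d.

```python
from typing import Dict, List, Sequence, Tuple


def bind_information_data(e: Sequence[str], d: str,
                          arguments_by_viewpoint: Dict[str, List[str]]) -> dict:
    """Bind each viewpoint sentence O_k to its arguments A_k, conditioned on e and d."""
    pairs: List[Tuple[str, Tuple[str, ...]]] = [
        (o, tuple(args)) for o, args in arguments_by_viewpoint.items()
    ]
    return {"e": list(e), "d": d, "pairs": pairs}


if __name__ == "__main__":
    t = bind_information_data(
        e=["war"],
        d="doc-001",  # hypothetical identifier of the event-related text
        arguments_by_viewpoint={"War will cause stocks to drop.": ["war"]},
    )
    print(t["pairs"])
```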

Various embodiments provided herein extract perspective sentences in event-related text based on event description data for an event, and extract arguments in the perspective sentences. Therefore, the extraction of the information data based on the event is realized, and the emotion analysis of the entity is not concerned any more, but the viewpoints and the argument of the viewpoints which are put forward in the event related text related to the event are concerned, so that the event can be expressed more comprehensively. The user can quickly know the event condition and possible influence.

Please refer to fig. 3. The embodiment of the specification also provides a data labeling method. The data labeling method can be applied to a data labeling system. The data annotation system can include an electronic device for providing data annotation functionality, and the electronic device can store annotated data. In some cases, the data annotation system may include a client for providing user interaction, annotation input of data, and a server that may interact with the client and store the annotated data. The data labeling method may include the following steps.

Step S110: obtaining a sample to be annotated, wherein the sample to be annotated comprises at least one text, the text comprises at least one event, and each text comprises at least one text sentence.

In this embodiment, the sample to be annotated may be text data related to the event. In particular, for example, a sample to be annotated may include a plurality of text sentences. The event itself, or the influence caused by the expression event, can be described through a plurality of text sentences, or the attitude of an author aiming at the event, or other events caused by the description event, and the like. In some implementations, the sample to be annotated can be an event-related article. Specifically, the articles may be articles published in a network, such as web pages of a subject website, personal blog articles, articles in a microblog, and the like.

Step S112: identifying text sentences in each text to obtain an identification result, and marking the corresponding text sentences according to the information to be marked if the identification result comprises the information to be marked, wherein the information to be marked comprises an information type and an argument, the information type is a viewpoint type of the text sentences expressed for the event, and the argument comprises target events corresponding to the viewpoints and/or entity information of the texts related to the viewpoints.

In this embodiment, the text may include a plurality of text sentences. If some of these text sentences express viewpoints, the viewpoint sentences can be identified among the text sentences and annotated, so that they can serve as sample data for subsequent model training. Specifically, a plurality of viewpoint types may be predefined. For example, the viewpoint types include at least one of: a personal judgment on the event, representing a prediction or interpretation of future matters; an attitude toward the event, representing a standpoint on an issue; an emotional viewpoint, expressing feelings; or a suggestion viewpoint, expressing advice to a person reading the event-related text.

A corresponding annotation label can be set for each viewpoint information type. For example, label A may be set for the information category "personal judgment on the event, representing a prediction or interpretation of future matters", label B for the information category "attitude toward the event, representing a standpoint on an issue", label C for the information category "emotional viewpoint, expressing feelings", and label D for the information category "suggestion viewpoint, expressing advice to a person reading the event-related text". A label can then be added at the start and at the end of the corresponding text sentence, indicating both that the text sentence is a viewpoint sentence and which information category it belongs to. Furthermore, the annotated sample data can be divided by information category, and model training can then be carried out on the viewpoint sentences and corresponding arguments in each category.
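The labeling scheme above can be illustrated with a minimal Python sketch. The label codes A to D come from the text; the `AnnotatedSentence` class, the `annotate` helper, and the `<A>...</A>` marker syntax are assumptions made only for illustration, since the specification does not fix a concrete storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Label codes A-D follow the text; the short descriptions are paraphrases.
VIEWPOINT_LABELS = {
    "A": "personal judgment on the event",
    "B": "attitude toward the event",
    "C": "emotional viewpoint",
    "D": "suggestion viewpoint",
}


@dataclass
class AnnotatedSentence:
    """One text sentence plus its (hypothetical) annotation record."""
    text: str
    label: Optional[str] = None  # None means "not a viewpoint sentence"
    arguments: List[str] = field(default_factory=list)

    def tagged(self) -> str:
        """Wrap the sentence in start/end markers when it carries a label."""
        if self.label is None:
            return self.text
        return f"<{self.label}>{self.text}</{self.label}>"


def annotate(text: str, label: Optional[str] = None,
             arguments: Optional[List[str]] = None) -> AnnotatedSentence:
    """Create an annotation, rejecting label codes outside A-D."""
    if label is not None and label not in VIEWPOINT_LABELS:
        raise ValueError(f"unknown viewpoint label: {label!r}")
    return AnnotatedSentence(text, label, list(arguments or []))


if __name__ == "__main__":
    sentence = annotate("Rates will likely rise next quarter.", "A", ["rate rise"])
    print(sentence.tagged())
```

Sentences that express no viewpoint are kept with `label=None`, which mirrors the statement that the annotated sample may also contain text that does not express a viewpoint.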

In this embodiment, the recognition result may include information to be annotated for the text sentence. It can be appreciated that when the recognition result includes information to be annotated, the text sentence can be annotated according to the information to be annotated. Specifically, the information to be annotated can include information types and arguments, so that the text sentence can be annotated based on the information to be annotated. In some cases, the information to be annotated may represent annotation information that the text sentence is not a perspective sentence, and thus, the text sentence may be annotated based on the annotation information. Further, the finally annotated sample data may include text sentences expressing views, and text data not expressing views.

In this embodiment, an annotation data pair is generated from the sample to be annotated, which makes the text sentences convenient to annotate; the viewpoint sentence and its argument are assigned to an information category at the same time as the viewpoint sentence is identified, which improves the efficiency of sample annotation. Furthermore, once viewpoint sentences and arguments have been annotated, an information data generation model can be trained on the samples, and the trained model can then comprehensively extract text content related to an event.

In some embodiments, the data labeling method may further include: acquiring event description data of the event; wherein the event description data is used for describing an event; and combining the text statement expressing the view of the event, the event description data and the argument corresponding to the view into sample data.

In this embodiment, the event description data and the text sentence for the viewpoint of the event expression, and the argument corresponding to the viewpoint may be bound and shaped into one sample data. Further, the text processing model may be trained from the obtained plurality of sample data.
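As a rough illustration of binding these three pieces into one sample, the following Python sketch builds a plain dictionary. The function name `build_sample` and the field names are hypothetical; the specification only states which pieces of data are combined, not how they are stored.

```python
from typing import Dict, List


def build_sample(event_descriptions: List[str],
                 viewpoint_sentence: str,
                 arguments: List[str],
                 info_type: str) -> Dict[str, object]:
    """Bind event description data, a viewpoint sentence, and its arguments.

    All field names are assumptions made for this sketch.
    """
    if not event_descriptions:
        raise ValueError("at least one event description text is required")
    return {
        "event_descriptions": list(event_descriptions),
        "viewpoint_sentence": viewpoint_sentence,
        "arguments": list(arguments),
        "info_type": info_type,
    }


if __name__ == "__main__":
    sample = build_sample(
        ["The central bank cut its benchmark rate."],
        "Analysts expect bond prices to rise.",
        ["bond prices"],
        "A",
    )
    print(sample)
```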

In this embodiment, the event description data may include at least one event description text. Using a plurality of event description texts may improve the accuracy and comprehensiveness of the event description. An event description text may be text data extracted from the provided event-related text by an event extraction model.

In some embodiments, the data labeling method may further include: dividing the sample data into a plurality of sample data sets based on the information type of the point of view; wherein the plurality of sample data sets are used to respectively train the data generation model.

In this embodiment, the sample data is grouped by information type, forming one sample data set per information type. A viewpoint extraction model and an argument recognition model can then be trained on each sample data set. The resulting viewpoint extraction models and their corresponding argument recognition models can each be used to extract viewpoints, and the corresponding arguments, of the matching information type. For example, the information type may include at least one of: a personal judgment on the event, representing a prediction or interpretation of future matters; an attitude toward the event, representing a standpoint on an issue; an emotional viewpoint, expressing feelings; or a suggestion viewpoint, expressing advice to a person reading the event-related text.
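The division of sample data by information type can be sketched in a few lines of Python. The `info_type` key and the function name `split_by_info_type` are assumptions for this illustration; each resulting group would then feed the training of its own pair of models.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_info_type(samples: List[dict]) -> Dict[str, List[dict]]:
    """Group samples into one data set per information type.

    Each sample is assumed to carry an "info_type" key (hypothetical name).
    """
    groups: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        groups[sample["info_type"]].append(sample)
    return dict(groups)


if __name__ == "__main__":
    data = [
        {"info_type": "A", "viewpoint_sentence": "Prices will rise."},
        {"info_type": "D", "viewpoint_sentence": "Investors should wait."},
        {"info_type": "A", "viewpoint_sentence": "Demand may fall."},
    ]
    for info_type, group in split_by_info_type(data).items():
        print(info_type, len(group))
```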

Please refer to fig. 4. The embodiment of the specification also provides a training method of the text processing model. The text processing model comprises a viewpoint extraction model and an argument identification model. The training method may be applied to a server. The training method may include the following steps.

Step S120: receiving a plurality of first sample data and a plurality of second sample data, wherein the first sample data comprises event description data and text sentences, the second sample data comprises viewpoint sentences, text paragraphs corresponding to the viewpoint sentences and argument corresponding to the viewpoint sentences, and the first sample data and the second sample data comprise the same viewpoint sentences; wherein at least part of the text sentences are viewpoint sentences expressing viewpoints; the event description data is used for describing an event.

The sample data may have corresponding annotation data. That is, the sample data has been annotated in advance and can therefore be used to train the models. At least some of the text sentences are viewpoint sentences expressing a viewpoint. In some implementations, all of the text sentences may be viewpoint sentences, so that each text sentence corresponds to an argument. Alternatively, some of the text sentences are viewpoint sentences while others do not express a viewpoint. In that case, only the viewpoint sentences correspond to arguments.

In this embodiment, the sample data may be divided into first sample data and second sample data, which may be used to train different text processing models. In particular, the first sample data may be used to train the viewpoint extraction model, and the second sample data may be used to train the argument recognition model. Because the first sample data and the second sample data include the same viewpoint sentences, the argument recognition model trained on the second sample data can be used together with the viewpoint extraction model trained on the first sample data; that is, the argument recognition model can identify the arguments of the viewpoint sentences output by the viewpoint extraction model.

The text sentences may come from event-related text about the event. In general, event-related text can be divided into text paragraphs, and a text paragraph consists of text sentences, so each text sentence corresponds to the text paragraph that contains it. It follows that when a text sentence is a viewpoint sentence expressing a viewpoint, the viewpoint sentence likewise corresponds to the text paragraph that contains it.

In some implementations, text related to a financial event may be obtained from a network with authorization, and event description data may be extracted from that text. Further, the text sentences included in the financial-event-related text can be annotated to obtain the viewpoint sentences among them and the arguments corresponding to those viewpoint sentences.

In some embodiments, the data may be annotated manually to obtain the annotation data. For example, the workers taking part in the annotation job may first receive training, such as basic business training and training by algorithm engineers, so that they understand the business background and business preferences as well as the characteristics and difficulties of the annotation task. Further, cross-annotation by two workers can be adopted; that is, two workers independently annotate whether the same text sentence is a viewpoint sentence. When the two workers give different results for the same text sentence, a third worker can be brought in to vote and determine the final annotation result. In some embodiments, a machine learning model may also be used to annotate the data automatically. For example, a machine learning model dedicated to generating sample data may be trained using a machine learning algorithm.
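The cross-annotation procedure described above can be sketched as a small adjudication function. The name `resolve_label` and the use of booleans for "is a viewpoint sentence" are assumptions; the text only states that a third worker votes when the first two disagree.

```python
from collections import Counter
from typing import Optional


def resolve_label(first: bool, second: bool,
                  third: Optional[bool] = None) -> bool:
    """Return the final "is a viewpoint sentence" decision.

    If the two annotators agree, their shared answer is final. Otherwise a
    third annotation is required and the majority of the three votes wins.
    """
    if first == second:
        return first
    if third is None:
        raise ValueError("annotators disagree; a third annotation is required")
    votes = Counter([first, second, third])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(resolve_label(True, True))          # agreement, no third vote needed
    print(resolve_label(True, False, False))  # third annotator breaks the tie
```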

Step S122: training the perspective extraction model based on the first sample data; the viewpoint extraction model is used for extracting viewpoint sentences in text sentences according to the event description data.

In this embodiment, the viewpoint extraction model may be used to identify viewpoint sentences among the text sentences according to the event description data. After the first sample data is obtained, the event description data and a text sentence can be input into the viewpoint extraction model, and the model's output for that text sentence is compared with the annotation data. The parameter values of the viewpoint extraction model can then be adjusted by back-propagation according to the comparison result, improving the accuracy of the model. Specifically, for example, the viewpoint extraction model may be built on the BERT model. With the event description data denoted e and the text sentence denoted s, the input x = {[CLS], e, [SEP], s} can be constructed, where [CLS] is the sentence-start identifier of BERT-style models and [SEP] is the separator between two different sentences. The BERT model is then used for text matching: for each e, every s is traversed to form an input x, and after x is encoded by the BERT model, a classifier performs binary classification. When s is judged to contain a viewpoint, the first label corresponding to s is "1"; otherwise, the first label is "0". The value of the first label is then compared with the annotation data, and when the two differ, the back-propagation algorithm is executed on the BERT model to modify the parameters of the viewpoint extraction model.
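The construction of the inputs x = {[CLS], e, [SEP], s} and their 0/1 labels can be sketched without any model code. In the sketch below, the special tokens are written out as plain strings purely for illustration (a real BERT tokenizer would normally insert them itself), and the function name and the representation of annotations as a set of viewpoint sentences are assumptions.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_viewpoint_examples(event_descriptions: Iterable[str],
                             sentences: Iterable[str],
                             viewpoint_sentences: Set[str]
                             ) -> List[Tuple[str, int]]:
    """Pair every event description e with every sentence s.

    Returns (x, label) tuples, where x follows the layout [CLS] e [SEP] s and
    label is 1 if s is annotated as a viewpoint sentence, otherwise 0.
    """
    examples = []
    for e, s in product(list(event_descriptions), list(sentences)):
        x = f"{CLS} {e} {SEP} {s}"
        label = 1 if s in viewpoint_sentences else 0
        examples.append((x, label))
    return examples


if __name__ == "__main__":
    examples = build_viewpoint_examples(
        ["Bank X cut rates."],
        ["Rates fell on Monday.", "I expect prices to rise."],
        {"I expect prices to rise."},
    )
    for x, label in examples:
        print(label, x)
```

Each (x, label) pair would then be encoded and passed to a binary classifier; that training loop is outside the scope of this sketch.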

Step S124: training the argument recognition model based on the second sample data; wherein the argument identification model is used for identifying an argument corresponding to the viewpoint statement.

In this embodiment, the viewpoint sentence, the text paragraph corresponding to the viewpoint sentence, and the argument included in the second sample data may be input together into the argument recognition model, and the argument recognition model may determine from the text paragraph whether the input argument corresponds to the viewpoint sentence. The value output by the argument recognition model indicates whether a correspondence exists between the input viewpoint sentence and the argument. For example, the output of the argument recognition model may be a third value or a fourth value. When the third value is output, the input viewpoint sentence and the argument are considered to correspond. When the fourth value is output, the input viewpoint sentence and the argument are considered not to correspond. In some embodiments, the third value may be 1 and the fourth value may be 0. The output value of the argument recognition model is then compared with the annotation data, and when the two differ, a back-propagation algorithm can be executed to modify the parameters of the argument recognition model.

In this embodiment, a plurality of arguments may be input for one viewpoint sentence, and the arguments corresponding to the viewpoint sentence may be included among them. In some embodiments, several arguments included in the annotation data may be randomly selected and paired with the same viewpoint sentence, so as to train the argument recognition model and improve its accuracy.

In some implementations, the sample data can also include position data that records where the viewpoint sentence is located in the event-related text. In particular, the position data may indicate the position of a viewpoint sentence within the text paragraph that contains it. For example, the position data may include a start character number and an end character number, so that the position of the viewpoint sentence is recorded explicitly. Arguments may be extracted in advance from the text data identified by the position data, and the extracted arguments are also used as sample data for training the argument recognition model. Arguments that correspond to the viewpoint sentence and arguments that do not are each input into the argument recognition model, so that the model learns to determine accurately which arguments correspond to the viewpoint sentence.
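Recording the position of a viewpoint sentence by start and end character numbers can be sketched as follows. The function name `locate_sentence`, the zero-based offsets, and the exclusive end offset are assumptions; the specification does not state an indexing convention.

```python
from typing import Tuple


def locate_sentence(paragraph: str, sentence: str) -> Tuple[int, int]:
    """Return (start, end) character offsets of sentence inside paragraph.

    Offsets are zero-based and the end offset is exclusive, so that
    paragraph[start:end] reproduces the sentence. Only the first occurrence
    is reported.
    """
    start = paragraph.find(sentence)
    if start == -1:
        raise ValueError("sentence does not occur in the paragraph")
    return start, start + len(sentence)


if __name__ == "__main__":
    paragraph = "The bank cut rates. Analysts expect bond prices to rise."
    start, end = locate_sentence(paragraph, "Analysts expect bond prices to rise.")
    print(start, end, paragraph[start:end])
```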

In some embodiments, a task resembling machine reading comprehension may be constructed. That is, a plurality of arguments may be provided as options for one viewpoint sentence, and, according to the annotation data, the arguments that correspond to the viewpoint sentence are included among those options. The argument recognition model can thus be trained to pick out accurately, from the candidate arguments, the ones that correspond to the viewpoint sentence. In some implementations, the argument recognition model can be built on the BERT model. For example, with the text paragraph containing the viewpoint sentence denoted span, the viewpoint sentence denoted o, and an argument denoted a, the input x = {[CLS], span, [SEP], o, [SEP], a} is constructed. The BERT model can then be used for text matching and classification, outputting 1 when a and o correspond and 0 when they do not.
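A minimal sketch of assembling such training examples is given below. It lays out x = {[CLS], span, [SEP], o, [SEP], a} as a plain string and labels annotated arguments 1 and randomly drawn non-matching candidates 0. The function name, the `num_negatives` and `seed` parameters, and the string rendering of the special tokens are assumptions for illustration only.

```python
import random
from typing import List, Tuple


def build_argument_examples(span: str, viewpoint: str,
                            gold_arguments: List[str],
                            candidate_pool: List[str],
                            num_negatives: int = 2,
                            seed: int = 0) -> List[Tuple[str, int]]:
    """Build (x, label) pairs for one viewpoint sentence.

    Annotated (gold) arguments get label 1. Up to num_negatives other
    candidates are sampled at random and get label 0.
    """
    rng = random.Random(seed)
    gold = list(dict.fromkeys(gold_arguments))  # de-duplicate, keep order
    others = [a for a in dict.fromkeys(candidate_pool) if a not in gold]
    negatives = rng.sample(others, min(num_negatives, len(others)))

    def layout(argument: str) -> str:
        return f"[CLS] {span} [SEP] {viewpoint} [SEP] {argument}"

    examples = [(layout(a), 1) for a in gold]
    examples += [(layout(a), 0) for a in negatives]
    return examples


if __name__ == "__main__":
    for x, label in build_argument_examples(
        "The bank cut rates. Analysts expect bond prices to rise.",
        "Analysts expect bond prices to rise.",
        ["bond prices"],
        ["bond prices", "the bank", "rates", "analysts"],
    ):
        print(label, x)
```

Fixing the seed keeps the sampled negatives reproducible between runs, which makes annotation and training issues easier to trace.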

Please refer to fig. 5. The embodiment of the present specification also provides an information data generating apparatus, including: the acquisition unit is used for acquiring the event related text and the event description data; wherein the event description data is used for describing an event; the event related text is related to the event; the event related text comprises a plurality of text sentences; a viewpoint identifying unit configured to identify a viewpoint sentence expressing a viewpoint in the text sentence, based on the event description data; an argument determining unit, configured to determine an argument corresponding to the viewpoint statement; and the binding unit is used for binding the viewpoint statement and the corresponding argument to generate information data.

The functions and effects achieved by the information data generating device may be explained in comparison with the other embodiments described above, and will not be repeated.

Please refer to fig. 6. The embodiment of the specification also provides a data labeling device. The data labeling device may include: a sample acquisition unit, configured to acquire a sample to be annotated, wherein the sample to be annotated comprises at least one text, the text comprises at least one event, and each text comprises at least one text sentence; and a labeling unit, configured to identify the text sentences in each text to obtain an identification result and, if the identification result comprises information to be annotated, to annotate the corresponding text sentences according to the information to be annotated, wherein the information to be annotated comprises an information type and an argument, the information type is the viewpoint type that the text sentence expresses for the event, and the argument comprises a target event corresponding to the viewpoint and/or entity information of the text related to the viewpoint.

The functions and effects achieved by the sample data labeling device can be explained in comparison with the other embodiments, and are not repeated.

Please refer to fig. 7. The embodiment of the specification provides a training device for a text processing model. The text processing model comprises a viewpoint extraction model and an argument identification model. The training device comprises: a receiving unit, configured to receive a plurality of first sample data and a plurality of second sample data, where the first sample data includes event description data and text sentences, and the second sample data includes perspective sentences, text paragraphs corresponding to the perspective sentences, and argument corresponding to the perspective sentences; the first sample data and the second sample data comprise the same perspective statement; wherein at least part of the text sentences are viewpoint sentences expressing viewpoints; the event description data is used for describing an event; a viewpoint model training unit for training the viewpoint extraction model based on the first sample data; the viewpoint extraction model is used for extracting viewpoint sentences in text sentences according to the event description data; an argument model training unit for training the argument recognition model based on the second sample data; the argument identification model is used for identifying an argument corresponding to the viewpoint statement according to the text paragraph.

The functions and effects achieved by the training device may be explained in comparison with the other embodiments described above, and will not be described again.

In some embodiments, the data labeling apparatus may further include: a description data acquisition unit configured to acquire event description data of the event; wherein the event description data is used for describing an event; and the combining unit is used for combining the text statement expressing the view of the event, the event description data and the argument corresponding to the view into sample data.

In some embodiments, the data labeling apparatus may further include: a data dividing unit for dividing the sample data into a plurality of sample data sets based on the information type of the viewpoint; wherein the plurality of sample data sets are used to respectively train the data generation model.

In some embodiments, the point of view identification unit may include: the view identification module is used for inputting the event description data and the text sentences into a view extraction model to obtain view sentences in the text sentences; the viewpoint extraction model is used for outputting a first label corresponding to the text sentence; wherein the first tag includes a first value indicating that the corresponding text sentence is a perspective sentence or a second value indicating that the corresponding text sentence is not a perspective sentence.

In some embodiments, the argument determination unit may include: a paragraph obtaining module, configured to obtain a text paragraph in which the viewpoint sentence is located in the event-related text; an argument extraction module for extracting an argument in the text paragraph; and the extraction module is used for inputting the text paragraph, the viewpoint statement and the argument into an argument identification model and identifying the argument corresponding to the viewpoint statement.
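The flow through the viewpoint identification unit and the argument determination unit can be sketched end to end as follows. The three callables stand in for the viewpoint extraction model, the argument extraction step, and the argument recognition model; in this sketch they are trivial keyword rules supplied by the caller, which is an assumption for illustration and not the trained models described in the specification. All function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List


def generate_information_data(
    event_description: str,
    paragraphs: List[List[str]],
    is_viewpoint: Callable[[str, str], int],
    extract_candidates: Callable[[str], List[str]],
    matches: Callable[[str, str, str], int],
) -> List[Dict[str, object]]:
    """Bind each viewpoint sentence to its arguments.

    paragraphs is a list of text paragraphs, each given as a list of
    sentences. is_viewpoint and matches return 1 or 0, mirroring the
    first-label and third/fourth-value outputs described in the text.
    """
    results = []
    for sentences in paragraphs:
        paragraph_text = " ".join(sentences)
        candidates = extract_candidates(paragraph_text)
        for sentence in sentences:
            if is_viewpoint(event_description, sentence) != 1:
                continue
            arguments = [a for a in candidates
                         if matches(paragraph_text, sentence, a) == 1]
            results.append({"viewpoint": sentence, "arguments": arguments})
    return results


if __name__ == "__main__":
    # Stand-in rules, not trained models.
    data = generate_information_data(
        "The central bank cut its benchmark rate.",
        [["The central bank announced a rate cut.",
          "Analysts think bond prices will rise."]],
        is_viewpoint=lambda e, s: 1 if " will " in s else 0,
        extract_candidates=lambda p: [c for c in ("rate cut", "bond prices") if c in p],
        matches=lambda p, s, a: 1 if a in s else 0,
    )
    print(data)
```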

Please refer to fig. 8. The present description also provides a computer device comprising a memory storing a computer program and a processor implementing the method according to any of the above embodiments when the processor executes the computer program.

The present description also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a computer, causes the computer to perform the method of any of the above embodiments.

The present description also provides a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of any of the above embodiments.

It will be appreciated that the specific examples herein are intended only to assist those skilled in the art in better understanding the embodiments of the present disclosure and are not intended to limit the scope of the present invention.

It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present disclosure.

It will be appreciated that the various embodiments described in this specification may be implemented either alone or in combination, and are not limited in this regard.

Unless defined otherwise, all technical and scientific terms used in the embodiments of this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this specification belongs. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to limit the scope of the description. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It will be appreciated that the processor of the embodiments of the present description may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in software form. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present specification. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present specification may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

It will be appreciated that the memory in the embodiments of this specification may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), or a flash memory, among others. The volatile memory may be Random Access Memory (RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present specification.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and unit may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, each functional unit in each embodiment of the present specification may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present specification, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present specification. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or another medium capable of storing program code.

The foregoing is merely specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed herein shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data labeling method applied to a data labeling system, comprising the following steps:

obtaining a sample to be annotated, wherein the sample to be annotated comprises at least one text, the text comprises at least one event, and each text comprises at least one text sentence;

identifying text sentences in each text to obtain an identification result, and marking the corresponding text sentences according to the information to be marked if the identification result comprises the information to be marked, wherein the information to be marked comprises an information type and an argument, the information type is a viewpoint type of the text sentences expressed for the event, and the argument comprises target events corresponding to the viewpoints and/or entity information of the texts related to the viewpoints.

2. The method according to claim 1, wherein the method further comprises:

acquiring event description data of the event; wherein the event description data is used for describing an event;

and combining the text statement expressing the view of the event, the event description data and the argument corresponding to the view into sample data.

3. The method according to claim 2, wherein the method further comprises:

dividing the sample data into a plurality of sample data sets based on the information type of the point of view; wherein the plurality of sample data sets are used to respectively train the data generation model.

4. A method of generating information data, comprising:

acquiring event related text and event description data; wherein the event description data is used for describing an event; the event related text is related to the event; the event related text comprises a plurality of text sentences;

identifying viewpoint sentences expressing viewpoints in the text sentences according to the event description data;

determining an argument corresponding to the viewpoint statement;

and binding the viewpoint sentences and the corresponding argument to generate information data.

5. The method of claim 4, wherein the argument comprises at least one of: an event expression corresponding to the event at which the viewpoint expressed by the viewpoint sentence is directed; or,

in the case that the viewpoint expressed by the viewpoint sentence is directed at a related event of the event, a related event expression of the related event; or,

entity information corresponding to an entity related to the viewpoint expressed by the viewpoint sentence.

6. The method of claim 4, wherein the identifying viewpoint sentences expressing viewpoints in the text sentences according to the event description data comprises:

inputting the event description data and the text sentences into a viewpoint extraction model to obtain the viewpoint sentences among the text sentences; the viewpoint extraction model is used for outputting a first label corresponding to each text sentence; wherein the first label takes a first value indicating that the corresponding text sentence is a viewpoint sentence or a second value indicating that the corresponding text sentence is not a viewpoint sentence.

7. The method of claim 4, wherein the determining an argument corresponding to the viewpoint sentence comprises:

acquiring the text paragraph in which the viewpoint sentence is located in the event related text;

extracting arguments from the text paragraph;

inputting the text paragraph, the viewpoint sentence and the arguments into an argument recognition model, and identifying the argument corresponding to the viewpoint sentence.
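
The flow of claims 4 to 7 can be sketched as a two-stage pipeline. The sketch below is illustrative only: `toy_viewpoint_model` and `toy_argument_model` are hypothetical rule-based stand-ins for the trained viewpoint extraction model and argument recognition model, the 1/0 encoding of the first label is assumed, and the arguments of each paragraph are assumed to have been extracted beforehand.

```python
IS_VIEWPOINT, NOT_VIEWPOINT = 1, 0  # assumed encodings of the first label's two values


def toy_viewpoint_model(event_description, sentence):
    """Stand-in for the viewpoint extraction model: a cue-word rule, not a trained model."""
    cues = ("think", "believe", "criticized", "welcomed")
    lowered = sentence.lower()
    return IS_VIEWPOINT if any(cue in lowered for cue in cues) else NOT_VIEWPOINT


def toy_argument_model(paragraph, viewpoint_sentence, arguments):
    """Stand-in for the argument recognition model: keep arguments mentioned in the sentence."""
    lowered = viewpoint_sentence.lower()
    return [arg for arg in arguments if arg.lower() in lowered]


def generate_information_data(event_description, paragraphs, paragraph_arguments,
                              viewpoint_model=toy_viewpoint_model,
                              argument_model=toy_argument_model):
    """Bind each viewpoint sentence to its arguments.

    paragraphs: list of paragraphs, each a list of text sentences.
    paragraph_arguments: arguments already extracted from each paragraph.
    """
    information_data = []
    for sentences, arguments in zip(paragraphs, paragraph_arguments):
        paragraph = " ".join(sentences)
        for sentence in sentences:
            # Claim 6: the model labels each sentence given the event description.
            if viewpoint_model(event_description, sentence) != IS_VIEWPOINT:
                continue
            # Claim 7: the paragraph, the sentence and the paragraph's arguments go in together.
            matched = argument_model(paragraph, sentence, arguments)
            information_data.append({"viewpoint_sentence": sentence, "arguments": matched})
    return information_data


paragraphs = [["The central bank cut rates on Monday.", "Analysts welcomed the rate cut."]]
paragraph_arguments = [["rate cut", "central bank"]]
result = generate_information_data("Central bank cuts rates", paragraphs, paragraph_arguments)
```

Passing the models as parameters keeps the binding logic independent of how the two models are actually implemented.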

8. A method of training a text processing model, the text processing model comprising a viewpoint extraction model and an argument recognition model, the method comprising:

receiving a plurality of first sample data and a plurality of second sample data, wherein the first sample data comprises event description data and text sentences, the second sample data comprises viewpoint sentences, text paragraphs corresponding to the viewpoint sentences and arguments corresponding to the viewpoint sentences, and the first sample data and the second sample data comprise the same viewpoint sentences; wherein at least part of the text sentences are viewpoint sentences expressing viewpoints, and the event description data is used for describing an event that has occurred;

training the viewpoint extraction model based on the first sample data; the viewpoint extraction model is used for extracting viewpoint sentences from text sentences according to the event description data;

training the argument recognition model based on the second sample data; the argument recognition model is used for identifying the argument corresponding to a viewpoint sentence according to the text paragraph.
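
One way to prepare the two kinds of sample data in claim 8 is to derive both from the same annotated records, so that they share the same viewpoint sentences. The sketch below covers data preparation only and performs no training; the `AnnotatedSentence` record and the helper `build_training_sets` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnnotatedSentence:
    """One labeled text sentence (a hypothetical record layout)."""
    event_description: str
    sentence: str
    paragraph: str
    is_viewpoint: bool
    arguments: Tuple[str, ...] = ()


def build_training_sets(records):
    """Derive first and second sample data from the same annotated records."""
    # First sample data: (event description, text sentence, viewpoint label).
    first = [(r.event_description, r.sentence, int(r.is_viewpoint)) for r in records]
    # Second sample data: (viewpoint sentence, its paragraph, its arguments).
    second = [(r.sentence, r.paragraph, list(r.arguments)) for r in records if r.is_viewpoint]
    return first, second


paragraph = "The central bank cut rates on Monday. Analysts welcomed the rate cut."
records = [
    AnnotatedSentence("Central bank cuts rates", "The central bank cut rates on Monday.",
                      paragraph, False),
    AnnotatedSentence("Central bank cuts rates", "Analysts welcomed the rate cut.",
                      paragraph, True, ("rate cut",)),
]
first_samples, second_samples = build_training_sets(records)
```

Here `first_samples` would feed the viewpoint extraction model and `second_samples` the argument recognition model.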

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 8 when executing the computer program.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.