CN110866396B

CN110866396B - Method and device for determining main body of text specified information and computer storage medium

Info

Publication number: CN110866396B
Application number: CN201911069210.5A
Authority: CN
Inventors: 付骁弈; 张�杰
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2023-05-09
Anticipated expiration: 2039-11-05
Also published as: CN110866396A

Abstract

A method for determining the subject of specified information in a text includes: performing word segmentation on a target text; performing part-of-speech tagging on each segmented word to obtain its part-of-speech tagging result; determining at least one candidate subject according to the part-of-speech tagging results; dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; acquiring a vector V of each sample and inputting it into a pre-trained first neural network to determine whether any sample carries the specified information; and, when such a sample is determined to exist, taking the candidate subject corresponding to that sample as the subject of the specified information. The method and the device reduce manual labeling and therefore cost.

Description

Method and device for determining main body of text specified information and computer storage medium

Technical Field

The present invention relates to computer technology, and in particular, to a method and apparatus for determining a body of text specifying information, and a storage medium.

Background

The negative information body judging task is a common application in the network public opinion monitoring work. The aim is to determine whether or not negative information is contained in a text to be analyzed, and if so, to give the name (or the position in the original text) of the subject to which the negative information relates.

Existing statistical learning methods incur significant cost in constructing hand-crafted features, which is time-consuming and laborious, and the resulting models can lack the ability to generalize to new patterns beyond the features that have been encoded.

Existing statistical learning methods that use deep neural networks avoid the tedious construction of hand-crafted features by jointly learning subject identification and negativity judgment. However, they need a large number of accurately labeled sequence samples. With sequence labeling, for example, every character of the text to be analyzed must be labeled manually at the labeling stage. In fig. 2, the text in which "Guangzhou development area ABXY group Co., ltd." devises a very Guangzhou-featured name is labeled "B I I I I I I I I I I I I I I O O O O O O O O O O O O O O O", and this label sequence has the same total length as the character string of the input text.

Disclosure of Invention

The application provides a main body determining method, device and storage medium of text specifying information, which can achieve the aims of reducing manual labeling and reducing cost.

The application provides a method for determining the subject of specified information in a text, which comprises the following steps: performing word segmentation on the target text; performing part-of-speech tagging on each segmented word to obtain its part-of-speech tagging result; determining at least one candidate subject according to the part-of-speech tagging results; dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; acquiring a vector V of each sample and inputting it into a pre-trained first neural network to determine whether any sample carries the specified information; and, when such a sample is determined to exist, taking the candidate subject corresponding to that sample as the subject of the specified information.

In an exemplary embodiment, acquiring the vector V of each sample includes performing the following operations on each obtained sample: splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, wherein the first clause A extends from the beginning of the sample to the beginning of the candidate subject, and the second clause B extends from the position of the candidate subject to the end of the sample; vectorizing each word of the first clause A and the second clause B of the target text to obtain a real value matrix M_A of the first clause A and a real value matrix M_B of the second clause B, respectively; and inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a second neural network to encode the first clause A and the second clause B, obtaining the vector V of the sample.

In an exemplary embodiment, inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into the second neural network to encode the first clause A and the second clause B, and obtaining the vector V of the sample, includes: inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a pre-trained second neural network, and encoding the first clause A and the second clause B to obtain a coding vector V_A of the first clause A and a coding vector V_B of the second clause B; and splicing the obtained vectors V_A and V_B to obtain the vector V of the sample.

In an exemplary embodiment, encoding the first clause A and the second clause B includes: encoding the first clause A from front to back and the second clause B from back to front.

In an exemplary embodiment, the above method further comprises: collecting the subjects corresponding to the samples that carry the specified information, and merging and outputting them.

In an exemplary embodiment, determining at least one candidate subject according to the part-of-speech tagging result of each word includes: when the part-of-speech tagging result of a segmented word is a proper noun, or a phrase formed from proper nouns, determining that segmented word as a candidate subject.

The present application also provides an apparatus for determining the subject of specified information in a text, including: a part-of-speech tagging module, configured to perform word segmentation on the target text and to perform part-of-speech tagging on each segmented word to obtain its part-of-speech tagging result; a determining module, configured to determine at least one candidate subject according to the part-of-speech tagging results; a sample dividing module, configured to divide the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; and a vector acquisition and analysis module, configured to acquire a vector V of each sample and input it into a pre-trained first neural network to determine whether any sample carries the specified information, wherein, when such a sample is determined to exist, the candidate subject corresponding to that sample is the subject of the specified information.

In an exemplary embodiment, the vector acquisition and analysis module being configured to acquire the vector V of each sample means that the module performs the following operations on each obtained sample: splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, wherein the first clause A extends from the beginning of the sample to the beginning of the candidate subject, and the second clause B extends from the position of the candidate subject to the end of the sample; vectorizing each word of the first clause A and the second clause B of the target text to obtain a real value matrix M_A of the first clause A and a real value matrix M_B of the second clause B, respectively; and inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a second neural network to encode the first clause A and the second clause B, obtaining the vector V of the sample.

The application also provides a main body determining device of the text specifying information, which comprises a processor and a memory, wherein the memory stores a program for directing and delivering contents; the processor is configured to read the program for directing delivery of content, and execute the method of any one of the above.

The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above.

Compared with the related art, the candidate subjects are obtained by applying word segmentation and part-of-speech tagging to the target text, so no manually encoded features are needed and labor cost is saved. At the same time, the method has better generalization ability, provided that a sufficient amount of data is used for training.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. Other advantages of the present application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide an understanding of the technical solutions of the present application and constitute a part of this specification. Together with the embodiments of the present application, they illustrate the technical solutions of the present application and do not constitute a limitation of them.

FIG. 1 is a flowchart of a method for determining a subject of text specifying information according to an embodiment of the present application;

FIG. 2 is a diagram showing a subject matter determination example of text specifying information according to an embodiment of the present application;

FIG. 3 is a block diagram of a main body determination module of text specification information according to an embodiment of the present application.

Detailed Description

The present application describes a number of embodiments, but the description is illustrative and not limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or in place of any other feature or element of any other embodiment unless specifically limited.

The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements of the present disclosure may also be combined with any conventional features or elements to form a unique inventive arrangement as defined in the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive arrangements to form another unique inventive arrangement as defined in the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Further, various modifications and changes may be made within the scope of the appended claims.

Furthermore, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other sequences of steps are possible as will be appreciated by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.

The technical scheme of the present application will be described in more detail with reference to the accompanying drawings and examples.

As shown in fig. 1, an embodiment of the present invention provides a method for determining a body of text specifying information, including the steps of:

S1, performing word segmentation on a target text, and performing part-of-speech tagging on each segmented word to obtain its part-of-speech tagging result;

S2, determining at least one candidate subject according to the part-of-speech tagging result of each word;

S3, dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject;

S4, obtaining a vector V of each sample and inputting it into a pre-trained first neural network to determine whether any sample carries the specified information; when such a sample is determined to exist, the candidate subject corresponding to that sample is the subject of the specified information.

In one exemplary embodiment, the first neural network is a feed-forward neural network.

Word segmentation refers to the process of recombining a sequence of consecutive characters into a sequence of words according to a certain specification. Part-of-speech (POS) tagging, also known as grammatical tagging or word-category disambiguation, is a text data processing technique in corpus linguistics that tags the part of speech of each word in a corpus according to its meaning and context.

As shown in fig. 2, the sentence "Guangzhou development area ABXY group Co., ltd. has devised a very Guangzhou-featured name for it: EF city securities." is used as the target text. Word segmentation and part-of-speech tagging are performed on it, and the result is shown in fig. 2, where NR denotes proper nouns, NN other nouns, JJ adjectives or ordinal words, PN pronouns, and VV verbs. These are common abbreviations in computational part-of-speech notation and are not described in detail here. In this embodiment, the part-of-speech tagging scheme of Stanford CoreNLP is adopted.

In an exemplary embodiment, in step S2, determining at least one candidate subject according to the part-of-speech tagging result of each word includes: when the part-of-speech tagging result of a segmented word is a proper noun, or a phrase formed from proper nouns, determining that segmented word as a candidate subject.

Illustratively, based on the part-of-speech tagging results, proper nouns (NR) and phrases formed by adjacent proper nouns (NR) and other nouns (NN) are taken as candidate subjects of negative information; the combination patterns include, but are not limited to, NR NN and NN NR. For example, as shown in fig. 2, according to the tagging results, "EF city securities", "Guangzhou features", and "Guangzhou development area ABXY group" are taken as candidate subjects, so three samples are obtained. In this embodiment, the text of every sample is the same as the target text: "Guangzhou development area ABXY group Co., ltd. has devised a very Guangzhou-featured name for it: EF city securities."
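The candidate selection described above can be sketched as follows. This is only an illustration under stated assumptions: the rule "a maximal run of adjacent NR/NN words that contains at least one NR" is one possible reading of the combination patterns, the function names `find_candidate_subjects` and `build_samples` are hypothetical, and the toy tokens are not the actual tagging shown in fig. 2.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (word, part-of-speech tag)
Span = Tuple[int, int]    # (start, end) token indices, end exclusive


def find_candidate_subjects(tokens: List[Token]) -> List[Span]:
    """Return spans of maximal NR/NN runs that contain at least one NR.

    Assumption: any tag other than NR or NN ends a run.
    """
    spans: List[Span] = []
    start = None
    # A sentinel token with a hypothetical "EOS" tag closes a trailing run.
    for i, (_, tag) in enumerate(tokens + [("", "EOS")]):
        if tag in ("NR", "NN"):
            if start is None:
                start = i
        elif start is not None:
            if any(t == "NR" for _, t in tokens[start:i]):
                spans.append((start, i))
            start = None
    return spans


def build_samples(tokens: List[Token]) -> List[Tuple[List[Token], Span]]:
    """Pair the unchanged target text with each candidate span (one sample each)."""
    return [(tokens, span) for span in find_candidate_subjects(tokens)]


toy_tokens = [
    ("ABXY", "NR"), ("Group", "NN"), ("designed", "VV"), ("new", "JJ"),
    ("name", "NN"), (":", "PU"), ("EF", "NR"), ("Securities", "NN"),
]
candidate_spans = find_candidate_subjects(toy_tokens)
```

In this toy input the lone common noun "name" is discarded because its run contains no proper noun, while "ABXY Group" and "EF Securities" each become a candidate and therefore a sample.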

In an exemplary embodiment, in step S4, the acquiring the vector V of each sample includes: the following operations are performed on each obtained sample:

S41, splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, wherein the first clause A extends from the beginning of the sample to the beginning of the candidate subject, and the second clause B extends from the position of the candidate subject to the end of the sample;

S42, vectorizing each word of the first clause A and the second clause B of the target text to obtain a real value matrix M_A of the first clause A and a real value matrix M_B of the second clause B, respectively;

S43, inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a second neural network to encode the first clause A and the second clause B, obtaining the vector V of the sample.
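The split in step S41 can be illustrated with a short sketch. It assumes the sample is already a list of segmented words and that the candidate subject is identified by the index of its first word; the function name `split_sample` is hypothetical.

```python
from typing import List, Tuple


def split_sample(words: List[str], candidate_start: int) -> Tuple[List[str], List[str]]:
    """Split a sample into clause A (before the candidate) and clause B (candidate onward)."""
    clause_a = words[:candidate_start]   # sample start up to the candidate's first word
    clause_b = words[candidate_start:]   # candidate's first word up to the sample end
    return clause_a, clause_b


toy_words = ["ABXY", "Group", "designed", "name", ":", "EF", "Securities"]
clause_a, clause_b = split_sample(toy_words, 5)
```

Note that under this reading the candidate itself belongs to clause B, and clause A is empty when the candidate opens the sentence.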

In an exemplary embodiment, in step S43, the encoding the first clause a and the second clause B includes:

S431, encoding the first clause A and the second clause B to obtain a code vector V_A of the first clause A and a code vector V_B of the second clause B;

S432, splicing the obtained vectors V_A and V_B to obtain the vector V of the sample.

In an exemplary embodiment, in step S431, inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into the pre-trained second neural network and encoding the first clause A and the second clause B further includes:

inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into the pre-trained second neural network, and encoding the first clause A from front to back and the second clause B from back to front.

Illustratively, the second neural network is a recurrent neural network, including but not limited to RNN, GRU, LSTM, and the like.

As shown in fig. 2, take the target text "Guangzhou development area ABXY group Co., ltd. has devised a very Guangzhou-featured name for it: EF city securities." with "Guangzhou features" as the candidate. The corresponding clause A is the text before the candidate, beginning with "Guangzhou development area ABXY group Co., ltd."; clause B is the text from the candidate onward, "Guangzhou-featured name: EF city securities.". Further, the pre-trained word vector of each word of the target text is looked up, and the real-valued vector is used as the representation of the word or phrase. This gives the real value matrix representations of clauses A and B in each sample, denoted M_A and M_B. Since a candidate subject may be formed by combining several nouns, when no word vector can be found for it, the average of the word vectors of the words contained in the phrase should be used instead.
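The lookup with the averaging fallback can be sketched as below. The embedding table is a plain dictionary with made-up two-dimensional vectors, the function name `phrase_vector` is hypothetical, and joining the words without a separator to form the phrase key is an assumption that suits unsegmented Chinese text.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def phrase_vector(phrase_words: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Return the phrase's own vector, or the mean of its words' vectors as a fallback."""
    phrase = "".join(phrase_words)
    if phrase in embeddings:
        return embeddings[phrase]
    found = [embeddings[w] for w in phrase_words if w in embeddings]
    if not found:
        raise KeyError(f"no vector for phrase or any of its words: {phrase!r}")
    dim = len(found[0])
    # Element-wise average over the vectors of the component words.
    return [sum(vec[k] for vec in found) / len(found) for k in range(dim)]


toy_embeddings = {"EF": [1.0, 3.0], "Securities": [3.0, 5.0]}
fallback = phrase_vector(["EF", "Securities"], toy_embeddings)
```

Stacking one such vector per word (or merged phrase) of a clause, in order, gives the rows of M_A or M_B.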

The obtained M_A and M_B are input into the recurrent neural network, which encodes the first clause A from front to back and the second clause B from back to front. The resulting encodings are then combined and transformed into a new semantic space through an attention mechanism, so as to capture long-distance dependencies in the sentence. This yields the coding vectors V_A and V_B of the first clause A and the second clause B.
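The two encoding directions can be illustrated with a plain tanh recurrent cell written in pure Python. This is a sketch under several assumptions: the final hidden state is used as the clause encoding, both clauses share the same weights, the attention step is omitted, and the names `rnn_encode` and `encode_clauses` are hypothetical. A trained GRU or LSTM would take the place of this cell in practice.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def rnn_encode(rows: Matrix, w_in: Matrix, w_rec: Matrix, reverse: bool = False) -> List[float]:
    """Run a tanh RNN over the rows of a clause matrix and return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in (reversed(rows) if reverse else rows):
        # The comprehension reads the previous hidden state before it is rebound.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hj for w, hj in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden


def encode_clauses(m_a: Matrix, m_b: Matrix, w_in: Matrix, w_rec: Matrix) -> Tuple[List[float], List[float]]:
    """Encode clause A front to back and clause B back to front."""
    v_a = rnn_encode(m_a, w_in, w_rec)                 # reads towards the candidate
    v_b = rnn_encode(m_b, w_in, w_rec, reverse=True)   # also ends at the candidate
    return v_a, v_b
```

One way to read the design: since clause A ends where the candidate begins and clause B starts with the candidate, both passes finish next to the candidate, so each final state summarizes its side of the context as seen from the candidate's position.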

The obtained feature vectors V_A and V_B are spliced to obtain the vector representation V of the whole sample. V is input into the feed-forward neural network, with Softmax as the activation function of the output layer. The output layer outputs three real values, which correspond respectively to: tag 1 (negative information related to the entity exists); tag -1 (no negative information exists, or the negative information does not involve the entity); and tag 0 (the candidate phrase does not constitute an entity). The three real values are compared, and the tag corresponding to the largest value is selected as the final judgment result.
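The classification step can be sketched as follows. The single linear layer stands in for the feed-forward network, whose depth the text does not specify; the order of the output units in `LABELS`, the weights, and the function names are all assumptions made for illustration.

```python
import math
from typing import List

LABELS = [1, -1, 0]  # assumed order of the three output units


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(v_a: List[float], v_b: List[float],
             weights: List[List[float]], bias: List[float]) -> int:
    """Splice V_A and V_B, apply one linear layer plus softmax, and pick the largest output."""
    v = v_a + v_b  # splicing: list concatenation gives the sample vector V
    scores = [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))]
```

Because softmax preserves the ordering of the scores, taking the maximum of the three outputs selects the same tag as taking the maximum of the raw scores; the softmax mainly matters during training.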

For example, for the sentence "Guangzhou development area ABXY group Co., ltd. has devised a very Guangzhou-featured name for it: EF city securities.", this step should output tag 0 when the candidate is "Guangzhou features", and tag -1 when the candidate is "EF city securities".

In an exemplary embodiment, the above method for determining the subject of specified information in a text further includes the step of: S5, collecting the subjects corresponding to the samples that carry the specified information, and merging and outputting them.

The results for the samples split from each text to be analyzed are summarized; if the target text contains two or more subjects with negative information, they are output together as the result.

For example, for the sentence "Guangzhou development area ABXY group Co., ltd. has devised a very Guangzhou-featured name for it: EF city securities.", the result output in this step is: {text: "Guangzhou development area ABXY group Co., ltd. has devised a very Guangzhou-featured name for it: EF city securities.", label: -1, entity: "Guangzhou development area ABXY group Co., ltd.|EF city securities", negative_entity: ""}.
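The merging in step S5 can be sketched as below, using the field names from the example output (text, label, entity, negative_entity). The aggregation rules are inferred from that single example and are assumptions: candidates tagged 0 are dropped, `entity` lists every candidate tagged 1 or -1, `negative_entity` lists those tagged 1, and the text-level label is 1 only if at least one negative entity exists. The function name `merge_results` is hypothetical.

```python
from typing import Dict, List, Tuple


def merge_results(text: str, predictions: List[Tuple[str, int]]) -> Dict[str, object]:
    """Merge per-sample tags (candidate name, tag) into one record for the text."""
    entities = [name for name, tag in predictions if tag in (1, -1)]
    negative = [name for name, tag in predictions if tag == 1]
    return {
        "text": text,
        "label": 1 if negative else -1,
        "entity": "|".join(entities),
        "negative_entity": "|".join(negative),
    }


toy_result = merge_results(
    "ABXY Group designed a new name: EF Securities.",
    [("ABXY Group", -1), ("new name", 0), ("EF Securities", -1)],
)
```

For this toy input the record carries label -1, both real entities joined by "|", and an empty negative_entity field, which matches the shape of the example above.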

As shown in fig. 3, the embodiment of the present invention further provides a main body determining device for text specifying information, including the following modules:

a part-of-speech tagging module 10, configured to perform word segmentation on the target text and to perform part-of-speech tagging on each segmented word to obtain its part-of-speech tagging result;

a determining module 20, configured to determine at least one candidate subject according to the part-of-speech tagging result of each word;

a sample dividing module 30, configured to divide the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject;

a vector acquisition and analysis module 40, configured to acquire a vector V of each sample and input it into a pre-trained first neural network to determine whether any sample carries the specified information; when such a sample is determined to exist, the candidate subject corresponding to that sample is the subject of the specified information.

The vector acquisition and analysis module 40 being configured to acquire the vector V of each sample means that the module performs the following operations on each obtained sample:

the vector acquisition and analysis module 40 splits the sample at the position of its candidate subject to obtain a first clause A and a second clause B, wherein the first clause A extends from the beginning of the sample to the beginning of the candidate subject, and the second clause B extends from the position of the candidate subject to the end of the sample;

the vector acquisition and analysis module 40 vectorizes each word of the first clause A and the second clause B of the target text to obtain a real value matrix M_A of the first clause A and a real value matrix M_B of the second clause B, respectively;

the vector acquisition and analysis module 40 inputs the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a second neural network to encode the first clause A and the second clause B, obtaining the vector V of the sample.

The invention also provides a main body determining device of the text specifying information, which comprises a processor and a memory, wherein the memory stores a program for directing and delivering contents; the processor is configured to read the program for directing delivery of content, and execute the method of any one of the above.

The invention also provides a computer storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of the preceding claims.

Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Claims

1. A body determining method of text specifying information, comprising:

word segmentation is carried out on the target text; performing part-of-speech tagging on each word segment to obtain a part-of-speech tagging result of each word segment;

determining at least one candidate main body according to the part-of-speech tagging result of each word;

dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject;

acquiring a vector V of each sample, and inputting a pre-trained first neural network to determine whether samples with the specified information exist;

when determining that the sample with the specified information exists, the candidate subject corresponding to the sample is the subject with the specified information;

the obtaining the vector V of each sample includes: the following operations are performed on each obtained sample:

splitting according to the position of the candidate main body of the sample to obtain a first clause A and a second clause B; wherein the length of the first clause a is from the beginning of the sample to the beginning of the candidate subject; the length of the second clause B is from the position of the candidate body to the position of the sample end;

vectorizing each word of the first clause A and the second clause B corresponding to the target text to obtain a real value matrix M_A of the first clause A and a real value matrix M_B of the second clause B, respectively;

inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a second neural network to encode the first clause A and the second clause B, and obtaining a vector V of the sample;

wherein said inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a second neural network to encode the first clause A and the second clause B, and obtaining a vector V of the sample, comprises:

inputting the real value matrix M_A of the first clause A and the real value matrix M_B of the second clause B into a pre-trained second neural network, and encoding the first clause A and the second clause B to obtain a code vector V_A of the first clause A and a code vector V_B of the second clause B;

splicing the obtained vectors V_A and V_B to obtain the vector V of the sample.

2. The method of claim 1, wherein said encoding said first clause A and second clause B comprises:

the first clause A is encoded from front to back and the second clause B is encoded from back to front.

3. The method of claim 1, wherein the method further comprises: and counting the main bodies corresponding to the samples with the specified information and merging and outputting the main bodies.

4. The method of claim 1, wherein said determining at least one candidate subject based on the part-of-speech tagging of each word comprises:

and when the part of speech tagging result of the word segmentation is proper nouns or phrases formed by proper nouns, determining the word segmentation as a candidate main body.

5. A body determining apparatus of text specifying information, characterized by comprising:

the part-of-speech tagging module is used for word segmentation of the target text; performing part-of-speech tagging on each word segment to obtain a part-of-speech tagging result of each word segment;

the determining module is used for determining at least one candidate main body according to the part-of-speech tagging result of each word;

the sample dividing module is used for dividing the target text according to each determined candidate main body to obtain a sample corresponding to each candidate main body;

the vector acquisition and analysis module is used for acquiring a vector V of each sample and inputting a pre-trained first neural network to determine whether the sample with the specified information exists or not; when the sample with the specified information is determined to exist, the candidate body corresponding to the sample is the candidate body with the specified information;

the vector acquiring and analyzing module is configured to acquire a vector V of each sample: the vector acquisition module is used for respectively carrying out the following operations on each obtained sample:

splicing the obtained vectors V_A and V_B to obtain a vector V of the sample.

6. A body determining device of text specification information, comprising a processor and a memory, characterized in that the memory stores a program for directing delivery of content; the processor is configured to read the program for directing delivery of content and perform the method of any of claims 1-4.

7. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1-4.