CN111428473A - Information processing method and device, computer storage medium and terminal

Info

Publication number: CN111428473A
Application number: CN202010181441.1A
Authority: CN
Inventors: 陈栋; 付骁弈
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2020-07-17

Abstract

An information processing method, an information processing device, a computer storage medium and a terminal are provided. The method comprises: counting the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information; calculating, from the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result; and determining the quality of the extraction result according to the calculated information coverage of the words of all parts of speech. The embodiment of the invention evaluates the quality of the extraction result through an automatic process, which improves the efficiency of quality evaluation.

Description

Information processing method and device, computer storage medium and terminal

Technical Field

This document relates to, but is not limited to, knowledge graph technology, and more particularly, to a method, an apparatus, a computer storage medium, and a terminal for information processing.

Background

At present, massive unstructured data (text) is mostly non-standardized and open-ended, so supervised information extraction methods that depend on a training corpus are no longer applicable.

Open-domain text information extraction is a text processing technology that extracts information of a specified type from natural language text in an unsupervised manner. The types of the extracted semantic units are no longer restricted in advance; instead, the types of the semantic units, such as entity types and relation types, are mined automatically from the network, and structured data is formed. The structured data can later be used for tasks such as knowledge graph construction and data analysis. After the structured data is obtained, the quality of the extraction result of the open-domain text information extraction generally needs to be evaluated.

In the related art, the quality of an extraction result is generally evaluated against a manually labeled test sample set. The process includes: for each sample in the test sample set, labeling its possible triples (entity-relation-entity), doublets (entity-attribute), and so on; and evaluating quality by comparing the labeling result with the extraction result. The evaluation indexes include precision, recall, F1 (the harmonic mean of precision and recall), and the like. The problems with this method are that manual labeling is time-consuming and labor-intensive, and that labeling results may differ from one annotator to another. How to evaluate the quality of the extraction result of open-domain text information extraction has therefore become a problem to be solved.
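For reference, the label-based evaluation used in the related art can be sketched as follows. This is a minimal illustration only: the function name `precision_recall_f1` and the sample tuples are hypothetical, and exact tuple matching is an assumption, since the text does not say how a labeled item is matched against an extracted one.

```python
def precision_recall_f1(labeled, extracted):
    """Compare manually labeled tuples with extracted tuples (exact match assumed)."""
    labeled, extracted = set(labeled), set(extracted)
    true_positives = len(labeled & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labeled and extracted tuples, for illustration only.
labeled = {("A", "rel", "B"), ("A", "attr1"), ("B", "attr2")}
extracted = {("A", "rel", "B"), ("A", "attr1"), ("B", "attr3")}
p, r, f1 = precision_recall_f1(labeled, extracted)
```

Every entry of `labeled` has to be produced by hand, which is the cost the method described below is intended to avoid.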

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

The embodiment of the invention provides an information processing method, an information processing device, a computer storage medium and a terminal, which can evaluate the quality of an extraction result.

The embodiment of the invention provides an information processing method, which comprises the following steps:

counting the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information;

calculating, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result;

and determining the quality of the extraction result according to the calculated information coverage of the words of all parts of speech.

In an exemplary embodiment, the calculating the information coverage of each part-of-speech word included in the extraction result includes:

according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:

information coverage = number of words of the current part of speech in the extraction result / number of words of the current part of speech in the original text.
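By way of illustration, the per-part-of-speech ratio can be sketched as below. The function name `coverage_by_pos`, the dictionary layout and the counts are hypothetical. Skipping a part of speech that does not occur in the original text is an assumption, since the text does not define the ratio when the denominator is zero.

```python
def coverage_by_pos(original_counts, extracted_counts):
    """Return {part_of_speech: extracted count / original count}."""
    coverage = {}
    for pos, original_count in original_counts.items():
        if original_count == 0:
            # Assumption: the ratio is undefined here, so the part of speech is skipped.
            continue
        coverage[pos] = extracted_counts.get(pos, 0) / original_count
    return coverage


# Hypothetical counts, for illustration only.
original_counts = {"noun": 10, "verb": 4, "preposition": 2, "adverb": 0}
extracted_counts = {"noun": 7, "verb": 1}
coverage = coverage_by_pos(original_counts, extracted_counts)
```

A part of speech missing from `extracted_counts` is treated as having a count of 0, so its coverage is 0.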

In an exemplary embodiment, the determining the quality of the extraction result includes:

multiplying the calculated information coverage of the words of each part of speech by a preset weighting parameter, and accumulating the products to obtain the weighted information coverage;

wherein the weighted information coverage is used to quantify the quality of the extraction result.
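A minimal sketch of the weighted accumulation follows. The function name `weighted_coverage` and the numbers are hypothetical. Treating a part of speech that has a weight but no coverage value as contributing 0 is an assumption, not something the text specifies.

```python
def weighted_coverage(coverage, weights):
    """Sum coverage[pos] * weights[pos] over every weighted part of speech."""
    # Assumption: a part of speech without a coverage value contributes 0.
    return sum(weights[pos] * coverage.get(pos, 0.0) for pos in weights)


# Hypothetical values, for illustration only.
coverage = {"noun": 0.5, "verb": 1.0}
weights = {"noun": 0.5, "verb": 0.25, "adverb": 0.25}
score = weighted_coverage(coverage, weights)
```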

In an exemplary embodiment, the extraction result includes words of one or any combination of the following parts of speech:

nouns, verbs, prepositions, and adverbs.

On the other hand, the embodiment of the present invention further provides a computer storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the above information processing method.

In another aspect, an embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having a computer program stored therein; wherein,

the processor is configured to execute the computer program in the memory;

the computer program, when executed by the processor, implements the method of information processing as described above.

In another aspect, an embodiment of the present invention further provides an information processing apparatus, including: a statistical unit, a calculation unit and a determination unit; wherein,

the statistical unit is configured to: count the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information;

the calculation unit is configured to: calculate, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result;

the determination unit is configured to: determine the quality of the extraction result according to the calculated information coverage of the words of all parts of speech.

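In an exemplary embodiment, the calculation unit is specifically configured to:

calculate, according to the part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result through the following formula:

information coverage = number of words of the current part of speech in the extraction result / number of words of the current part of speech in the original text.

In an exemplary embodiment, the determination unit is specifically configured to:

multiply the calculated information coverage of the words of each part of speech by a preset weighting parameter, and accumulate the products to obtain the weighted information coverage;

wherein the weighted information coverage is used to quantify the quality of the extraction result.

In an exemplary embodiment, the extraction result includes words of one or any combination of the following parts of speech:

nouns, verbs, prepositions, and adverbs.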

The application includes: counting the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information; calculating, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result; and determining the quality of the extraction result according to the calculated information coverage of the words of all parts of speech. The embodiment of the invention evaluates the quality of the extraction result through an automatic process, which improves the efficiency of quality evaluation.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flow chart of a method of information processing according to an embodiment of the present invention;

FIG. 2 is a block diagram of an information processing apparatus according to an embodiment of the present invention;

FIG. 3 is a diagram of an exemplary graph structure for use in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown here.

FIG. 1 is a flowchart of an information processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

Step 101: counting the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information;

In an exemplary embodiment, the extraction result includes words of one or any combination of the following parts of speech:

nouns, verbs, prepositions, and adverbs.

It should be noted that, the parts-of-speech categories of the words included in the extraction result in the embodiment of the present invention may be added or deleted by those skilled in the art according to the data types included in the knowledge graph; for example, adjectives may be added.

Step 102: calculating, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result;

In an exemplary embodiment, calculating the information coverage of the words of each part of speech contained in the extraction result comprises:

calculating, according to the part-of-speech statistical information, the information coverage through the following formula:

information coverage = number of words of the current part of speech in the extraction result / number of words of the current part of speech in the original text.

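Step 103: determining the quality of the extraction result according to the calculated information coverage of the words of all parts of speech.

In an exemplary embodiment, determining the quality of the extraction result comprises:

multiplying the calculated information coverage of the words of each part of speech by a preset weighting parameter, and accumulating the products to obtain the weighted information coverage;

wherein the weighted information coverage is used to quantify the quality of the extraction result.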

It should be noted that, in the embodiment of the present invention, the weighting parameter of each part of speech may be set by those skilled in the art according to the data composition of the knowledge graph, combined with experience; for example, the weight of nouns may be set to (30 ± 5)%, the weight of verbs to (30 ± 5)%, the weight of adverbs to (30 ± 5)%, and the weight of prepositions to (10 ± 5)%. In general, the weighting parameters may sum to 1.
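As a hedged illustration, a configuration of weighting parameters could be checked against the example ranges above before it is used. The names `WEIGHT_RANGES` and `check_weights` are hypothetical. Treating the example ranges and the sum of 1 as hard constraints is an assumption made only for this sketch, since the text presents both as examples.

```python
import math

# Example ranges from the text: nouns, verbs and adverbs (30 ± 5)%, prepositions (10 ± 5)%.
WEIGHT_RANGES = {
    "noun": (0.25, 0.35),
    "verb": (0.25, 0.35),
    "adverb": (0.25, 0.35),
    "preposition": (0.05, 0.15),
}


def check_weights(weights, ranges=WEIGHT_RANGES):
    """Return a list of problems; an empty list means the weights pass."""
    problems = []
    for pos, (low, high) in ranges.items():
        weight = weights.get(pos)
        if weight is None:
            problems.append(f"missing weight for {pos}")
        elif not (low <= weight <= high):
            problems.append(f"{pos} weight {weight} outside [{low}, {high}]")
    # Assumption: the weights are expected to sum to 1 (compared with a tolerance).
    if not math.isclose(sum(weights.values()), 1.0):
        problems.append("weights do not sum to 1")
    return problems


ok = check_weights({"noun": 0.3, "verb": 0.3, "adverb": 0.3, "preposition": 0.1})
bad = check_weights({"noun": 0.4, "verb": 0.3, "adverb": 0.3, "preposition": 0.1})
```

The sum is compared with `math.isclose` rather than `==` because decimal fractions such as 0.3 are not represented exactly in binary floating point.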

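FIG. 2 is a block diagram of an information processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the information processing apparatus includes: a statistical unit, a calculation unit and a determination unit; wherein,

the statistical unit is configured to: count the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information;

the calculation unit is configured to: calculate, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result;

the determination unit is configured to: determine the quality of the extraction result according to the calculated information coverage of the words of all parts of speech.

In an exemplary embodiment, the calculation unit is specifically configured to:

calculate, according to the part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result through the following formula:

information coverage = number of words of the current part of speech in the extraction result / number of words of the current part of speech in the original text.

In an exemplary embodiment, the determination unit is specifically configured to:

multiply the calculated information coverage of the words of each part of speech by a preset weighting parameter, and accumulate the products to obtain the weighted information coverage;

wherein the weighted information coverage is used to quantify the quality of the extraction result.

In an exemplary embodiment, the extraction result includes words of one or any combination of the following parts of speech:

nouns, verbs, prepositions, and adverbs.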

The embodiment of the invention also provides a computer storage medium, wherein a computer program is stored in the computer storage medium, and the computer program is executed by a processor to realize the information processing method.

An embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having a computer program stored therein; wherein,

the processor is configured to execute the computer program in the memory;

the computer program, when executed by a processor, implements the method of information processing as described above.

The method of the embodiment of the present invention is briefly described below by using application examples, which are only used for illustrating the present invention and are not used for limiting the protection scope of the present invention.

The extraction results of open-domain information extraction are mostly triples and doublets. Triples may be used to express entity-relation-entity information, and doublets may be used to express entity-attribute information. Assume the original text is: "Scientist Zhang San severely criticized researcher Li Si, who works at home."

The extraction result then comprises: the triple { "Zhang San", "criticize", "Li Si" }, and the doublets { "Zhang San", "scientist" }, { "Li Si", "researcher" } and { "criticize", "severely" }. FIG. 3 is a diagram of an exemplary graph structure used in the present invention. As shown in FIG. 3, the entity words include "Zhang San" and "Li Si", represented by circles; the relation words include "criticize", represented by a diamond; and the attribute words include "scientist", "severely" and "researcher", represented by squares.
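For illustration, the triple and doublets of this example, with the two entities written as Zhang San and Li Si, and the node shapes of FIG. 3 could be represented as follows. The class names `Triple` and `Doublet`, their field names and the function `node_shapes` are hypothetical and are not part of the described method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str       # entity word
    relation: str   # relation word
    tail: str       # entity word


@dataclass(frozen=True)
class Doublet:
    subject: str    # entity or relation word that the attribute describes
    attribute: str  # attribute word


def node_shapes(triples, doublets):
    """Map each word to the shape used for it in FIG. 3."""
    shapes = {}
    for triple in triples:
        shapes[triple.head] = "circle"
        shapes[triple.tail] = "circle"
        shapes[triple.relation] = "diamond"
    for doublet in doublets:
        shapes.setdefault(doublet.attribute, "square")
    return shapes


triples = [Triple("Zhang San", "criticize", "Li Si")]
doublets = [
    Doublet("Zhang San", "scientist"),
    Doublet("Li Si", "researcher"),
    Doublet("criticize", "severely"),
]
shapes = node_shapes(triples, doublets)
```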

The application example sets the weighting parameter of each part of speech as follows:

weight of nouns: 30%, weight of verbs: 30%, weight of adverbs: 30%, weight of prepositions: 10%.

The application example counts the words of each part of speech contained in the original text as follows:

preprocessing the original text to obtain a word set, where the preprocessing may include processing methods known in the related art, such as word segmentation, part-of-speech tagging, entity recognition and semantic chunk recognition; the word set includes: { "scientist", "Zhang San", "severely", "criticize", "having", "at", "home", "work", "of", "researcher", "Li Si", "." };

removing useless information (punctuation, stop words and the like) from the word set to obtain the word set to be counted: { "scientist", "Zhang San", "severely", "criticize", "at", "home", "work", "researcher", "Li Si" };

counting the words of each part of speech in the original text to obtain:

noun list 1: { "scientist", "Zhang San", "home", "researcher", "Li Si" };

verb list 1: { "criticize", "work" };

preposition list 1: { "at" };

adverb list 1: { "severely" };

and counting the words of each part of speech in the extraction result to obtain:

noun list 2: { "scientist", "Zhang San", "researcher", "Li Si" };

verb list 2: { "criticize" };

preposition list 2: { };

adverb list 2: { "severely" };
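The filtering and counting steps for the original text can be sketched as below, with the two entities written as Zhang San and Li Si and the adverb as "severely". The part-of-speech tags are supplied by hand and the tag names, including "particle" and "punctuation", are hypothetical. In practice they would come from whatever word segmentation and part-of-speech tagging tools are used, which the text does not name.

```python
from collections import defaultdict

# Hypothetical tags whose words are treated as useless information.
STOP_TAGS = {"punctuation", "particle"}


def group_by_pos(tagged_words):
    """Drop stop words and punctuation, then list the remaining words per part of speech."""
    lists = defaultdict(list)
    for word, pos in tagged_words:
        if pos in STOP_TAGS:
            continue
        lists[pos].append(word)
    return dict(lists)


# Word set of the original text, tagged by hand for illustration.
tagged_original = [
    ("scientist", "noun"), ("Zhang San", "noun"), ("severely", "adverb"),
    ("criticize", "verb"), ("having", "particle"), ("at", "preposition"),
    ("home", "noun"), ("work", "verb"), ("of", "particle"),
    ("researcher", "noun"), ("Li Si", "noun"), (".", "punctuation"),
]
lists_1 = group_by_pos(tagged_original)
```

The resulting `lists_1` holds five nouns, two verbs, one preposition and one adverb, the same sizes as the four lists counted from the original text.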

calculating the information coverage of the words of each part of speech:

noun information coverage = length of noun list 2 / length of noun list 1 = 4/5 = 0.8;

verb information coverage = length of verb list 2 / length of verb list 1 = 1/2 = 0.5;

preposition information coverage = length of preposition list 2 / length of preposition list 1 = 0/1 = 0;

adverb information coverage = length of adverb list 2 / length of adverb list 1 = 1/1 = 1;

according to the above, the weighted information coverage is calculated as:

weighted information coverage = (noun information coverage × noun weight) + (verb information coverage × verb weight) + (preposition information coverage × preposition weight) + (adverb information coverage × adverb weight) = (0.8 × 0.3) + (0.5 × 0.3) + (0 × 0.1) + (1 × 0.3) = 0.69;
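The arithmetic of this application example can be reproduced with the short sketch below, in which the two entities are written as Zhang San and Li Si. The variable names are hypothetical. Exact fractions are used instead of floating-point numbers, which is a choice made for this sketch so that the result is exactly 69/100.

```python
from fractions import Fraction

list_1 = {  # words counted in the original text
    "noun": ["scientist", "Zhang San", "home", "researcher", "Li Si"],
    "verb": ["criticize", "work"],
    "preposition": ["at"],
    "adverb": ["severely"],
}
list_2 = {  # words counted in the extraction result
    "noun": ["scientist", "Zhang San", "researcher", "Li Si"],
    "verb": ["criticize"],
    "preposition": [],
    "adverb": ["severely"],
}
weights = {
    "noun": Fraction(30, 100),
    "verb": Fraction(30, 100),
    "adverb": Fraction(30, 100),
    "preposition": Fraction(10, 100),
}

# Information coverage per part of speech: list 2 length / list 1 length.
coverage = {pos: Fraction(len(list_2[pos]), len(list_1[pos])) for pos in list_1}

# Weighted information coverage: sum of coverage × weight.
weighted = sum(coverage[pos] * weights[pos] for pos in weights)
print(float(weighted))  # 0.69
```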

It should be noted that, in the embodiment of the present invention, a threshold may be set by a person skilled in the art according to experience and used to judge the quality of the extraction result from the weighted information coverage. For example, the threshold is set to 0.65; when the weighted information coverage is greater than 0.65, the quality of the extraction result is considered to basically meet the extraction requirement. The threshold may be adjusted by those skilled in the art based on analysis of the data content of the knowledge graph and on experience. In an exemplary embodiment, the weighted information coverage may also be used to compare the quality of extraction results obtained in different manners: the larger the weighted information coverage, the higher the quality of the extraction result.

The application example requires no manual labeling; the quality of the extraction result is evaluated automatically and in real time, which improves the efficiency of quality evaluation.
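A minimal sketch of the threshold judgment and of the comparison between extraction results follows. The function names `meets_requirement` and `best_extraction`, the method names and the second score are hypothetical; 0.65 is the example threshold given above.

```python
def meets_requirement(score, threshold=0.65):
    """True when the weighted information coverage is greater than the threshold."""
    return score > threshold


def best_extraction(scores):
    """Return the name of the extraction manner with the largest weighted coverage."""
    return max(scores, key=scores.get)


# Hypothetical weighted information coverage of two extraction manners.
scores = {"method_a": 0.69, "method_b": 0.58}
passed = {name: meets_requirement(score) for name, score in scores.items()}
best = best_extraction(scores)
```

The comparison is strict, following the wording "greater than 0.65", so a score of exactly 0.65 does not pass in this sketch.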

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, and the functional modules/units of the systems and devices disclosed above, may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media, as is known to those skilled in the art.

Claims

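1. A method of information processing, comprising:

counting the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information;

calculating, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result;

and determining the quality of the extraction result according to the calculated information coverage of the words of all parts of speech.

2. The method according to claim 1, wherein the calculating the information coverage of the words of each part of speech contained in the extraction result comprises:

calculating, according to the part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result through the following formula:

information coverage = number of words of the current part of speech in the extraction result / number of words of the current part of speech in the original text.

3. The method of claim 1 or 2, wherein said determining the quality of the extraction result comprises:

multiplying the calculated information coverage of the words of each part of speech by a preset weighting parameter, and accumulating the products to obtain the weighted information coverage;

wherein the weighted information coverage is used to quantify the quality of the extraction result.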

4. The method according to claim 1 or 2, wherein the extraction result includes words of one or any combination of the following parts of speech:

nouns, verbs, prepositions, and adverbs.

5. A computer storage medium having stored therein a computer program which, when executed by a processor, implements a method of information processing according to any one of claims 1 to 4.

6. A terminal, comprising: a memory and a processor, the memory having a computer program stored therein; wherein,

the processor is configured to execute the computer program in the memory;

the computer program, when executed by the processor, implements a method of performing information processing as recited in any of claims 1-4.

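7. An apparatus for information processing, comprising: a statistical unit, a calculation unit and a determination unit; wherein,

the statistical unit is configured to: count the number of words of each part of speech contained in the original text and in the extraction result to obtain part-of-speech statistical information;

the calculation unit is configured to: calculate, according to the obtained part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result;

the determination unit is configured to: determine the quality of the extraction result according to the calculated information coverage of the words of all parts of speech.

8. The apparatus according to claim 7, wherein the calculation unit is specifically configured to:

calculate, according to the part-of-speech statistical information, the information coverage of the words of each part of speech contained in the extraction result through the following formula:

information coverage = number of words of the current part of speech in the extraction result / number of words of the current part of speech in the original text.

9. The apparatus according to claim 7 or 8, wherein the determination unit is specifically configured to:

multiply the calculated information coverage of the words of each part of speech by a preset weighting parameter, and accumulate the products to obtain the weighted information coverage;

wherein the weighted information coverage is used to quantify the quality of the extraction result.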

10. The apparatus according to claim 7 or 8, wherein the extraction result includes words of one or any combination of the following parts of speech:

nouns, verbs, prepositions, and adverbs.