CN111428473A - Information processing method and device, computer storage medium and terminal - Google Patents

Publication number
CN111428473A
Authority
CN
China
Prior art keywords
speech
information
extraction result
word
coverage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010181441.1A
Other languages
Chinese (zh)
Inventor
陈栋
付骁弈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An information processing method, an information processing device, a computer storage medium and a terminal are provided, the method comprising: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information; calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information; and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words. The embodiment of the invention realizes the quality evaluation of the extracted result through an automatic process, and improves the analysis efficiency of the quality evaluation.

Description

Information processing method and device, computer storage medium and terminal
Technical Field
This document relates to, but is not limited to, knowledge graph technology, and more particularly, to a method, an apparatus, a computer storage medium, and a terminal for information processing.
Background
At present, massive unstructured data (texts) mostly have the characteristics of non-normativity, openness and the like, so that a supervised information extraction method depending on the training corpus is not applicable any more.
The extraction of the text information of the open domain refers to a text processing technology for extracting information of a specified type from a natural language text in an unsupervised mode, wherein the extracted semantic units do not limit the type any more, but automatically mine the type of the semantic units from a network, such as entity type, relation type and the like, and form structured data; the formed structured data can be used for tasks such as knowledge graph construction, data analysis and the like at the later stage. After obtaining the structured data, generally, the quality evaluation needs to be performed on the extraction result of the extraction of the open-domain text information.
In the related art, the quality of the extraction result is generally evaluated against a manually labeled test sample set, and the process includes: for each sample in the test sample set, labeling its possible triples (entity-relationship-entity), doublets (entity-attribute), etc.; and performing quality evaluation by comparing the labeling result with the extraction result; the evaluation indices include precision, recall, F1 (the harmonic mean of precision and recall), and the like. This method has the problems that manual labeling is time-consuming and labor-intensive, and that labeling results may differ from one annotator to another. How to evaluate the quality of the extraction result of open-domain text information extraction has therefore become a problem to be solved.
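The related-art evaluation described above can be sketched as follows. This is a minimal illustration that assumes the standard set-based definitions of precision, recall and F1; the function name and the sample tuples are hypothetical and do not come from the source.

```python
def precision_recall_f1(labeled, extracted):
    """Compare manually labeled tuples with extracted tuples (set-based)."""
    labeled, extracted = set(labeled), set(extracted)
    true_positives = len(labeled & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical labels and extractor output for one test sample.
labeled = {("A", "relation", "B"), ("A", "attribute_1")}
extracted = {("A", "relation", "B"), ("B", "attribute_1")}

p, r, f1 = precision_recall_f1(labeled, extracted)
print(p, r, f1)
```

Every tuple in `labeled` has to be produced by hand, which is the cost the coverage-based method in this application is meant to avoid.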
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides an information processing method, an information processing device, a computer storage medium and a terminal, which can evaluate the quality of an extraction result.
The embodiment of the invention provides an information processing method, which comprises the following steps:
counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information;
calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information;
and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words.
In an exemplary embodiment, the calculating the information coverage of each part-of-speech word included in the extraction result includes:
according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:
information coverage = (the word number of the current part of speech word in the extraction result) / (the word number of the current part of speech word in the original text).
In an exemplary embodiment, the determining the quality of the extraction result includes:
respectively multiplying the information coverage of each part of speech word obtained by calculation by a preset weighting parameter and then accumulating to obtain weighted information coverage;
wherein the weighted information coverage is used to quantify the quality of the extraction result.
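In symbols, the per-part-of-speech coverage and its weighted accumulation may be written as below. The notation is introduced here for illustration only and is not part of the source: P is the set of parts of speech considered, n_p^ext and n_p^orig are the word numbers of part of speech p in the extraction result and in the original text, and w_p is the preset weighting parameter of p.

```latex
c_p = \frac{n_p^{\mathrm{ext}}}{n_p^{\mathrm{orig}}}, \qquad
C_{\mathrm{weighted}} = \sum_{p \in P} w_p \, c_p
```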
In an exemplary embodiment, the extraction result includes words of one or more of the following parts of speech:
nouns, verbs, prepositions, and adverbs.
In another aspect, an embodiment of the present invention further provides a computer storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the above information processing method.
In another aspect, an embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having a computer program stored therein; wherein
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements the method of information processing as described above.
In another aspect, an embodiment of the present invention further provides an information processing apparatus, including: a statistical unit, a calculation unit and a determination unit; wherein
the statistic unit is used for: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information;
the computing unit is to: calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information;
the determination unit is used for: and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words.
In an exemplary embodiment, the computing unit is specifically configured to:
according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:
information coverage = (the word number of the current part of speech word in the extraction result) / (the word number of the current part of speech word in the original text).
In an exemplary embodiment, the determining unit is specifically configured to:
respectively multiplying the information coverage of each part of speech word obtained by calculation by a preset weighting parameter and then accumulating to obtain weighted information coverage;
wherein the weighted information coverage is used to quantify the quality of the extraction result.
In an exemplary embodiment, the extraction result includes words of one or more of the following parts of speech:
nouns, verbs, prepositions, and adverbs.
The application includes: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information; calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information; and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words. The embodiment of the invention realizes the quality evaluation of the extracted result through an automatic process, and improves the analysis efficiency of the quality evaluation.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a flow chart of a method of information processing according to an embodiment of the present invention;
FIG. 2 is a block diagram of an information processing apparatus according to an embodiment of the present invention;
FIG. 3 is a diagram of an exemplary graph structure for use in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
Fig. 1 is a flowchart of an information processing method according to an embodiment of the present invention, as shown in fig. 1, including:
step 101, counting the word number of each part of speech word contained in an original text and an extraction result to obtain part of speech statistical information;
in an exemplary embodiment, the extraction result includes words of one or more of the following parts of speech:
nouns, verbs, prepositions, and adverbs.
It should be noted that, the parts-of-speech categories of the words included in the extraction result in the embodiment of the present invention may be added or deleted by those skilled in the art according to the data types included in the knowledge graph; for example, adjectives may be added.
step 102, calculating the information coverage of each part-of-speech word contained in the extraction result according to the obtained part-of-speech statistical information;
in an exemplary embodiment, calculating the information coverage of each part-of-speech word contained in the extraction result comprises:
according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:
information coverage = (the word number of the current part of speech word in the extraction result) / (the word number of the current part of speech word in the original text).
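Steps 101 and 102 can be sketched as follows, assuming that a part-of-speech tag is already available for every word. The function name, the tag strings and the decision to skip parts of speech that do not occur in the original text (to avoid dividing by zero) are assumptions made for this sketch, not details given in the source.

```python
from collections import Counter


def pos_coverage(original_tags, extracted_tags):
    """Per part of speech: word number in extraction / word number in original."""
    original_counts = Counter(original_tags)    # step 101, original text
    extracted_counts = Counter(extracted_tags)  # step 101, extraction result
    # Step 102: only parts of speech present in the original text are scored,
    # so the denominator is never zero (an assumption of this sketch).
    return {
        pos: extracted_counts[pos] / count
        for pos, count in original_counts.items()
    }


# Hypothetical tags, one per word.
original_tags = ["noun", "noun", "verb", "verb", "adverb"]
extracted_tags = ["noun", "verb", "adverb"]

coverage = pos_coverage(original_tags, extracted_tags)
print(coverage)
```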
step 103, determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words.
In one exemplary embodiment, determining the quality of the extraction comprises:
respectively multiplying the information coverage of each part of speech word obtained by calculation by a preset weighting parameter and then accumulating to obtain weighted information coverage;
wherein the weighted information coverage is used to quantify the quality of the extraction result.
It should be noted that, in the embodiment of the present invention, the weighting parameters of each part-of-speech word may be set by those skilled in the art according to the data composition of the knowledge graph and by combining experience; for example, set the weight of nouns: (30 ± 5)%, weight of verb: (30 ± 5)%, weight of adverb: (30 ± 5)%, weight of preposition: (10 ± 5)%; in general, the cumulative sum of the weighting parameters may be 1.
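The weighted accumulation of step 103 can be sketched as follows. The weights are taken from the middle of the ranges given above; the coverage values are hypothetical. Rejecting weights that do not sum to 1 is a choice made for this sketch, since the source only says that the sum may be 1 in general.

```python
import math


def weighted_coverage(coverage, weights):
    """Multiply each part-of-speech coverage by its weight and accumulate."""
    # Assumption of this sketch: weights must sum to 1.
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weighting parameters are expected to sum to 1")
    # A part of speech without a coverage value contributes 0.
    return sum(w * coverage.get(pos, 0.0) for pos, w in weights.items())


weights = {"noun": 0.30, "verb": 0.30, "adverb": 0.30, "preposition": 0.10}
coverage = {"noun": 1.0, "verb": 0.5, "adverb": 0.0, "preposition": 1.0}  # hypothetical

score = weighted_coverage(coverage, weights)
print(round(score, 2))
```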
The application includes: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information; calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information; and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words. The embodiment of the invention realizes the quality evaluation of the extracted result through an automatic process, and improves the analysis efficiency of the quality evaluation.
Fig. 2 is a block diagram of an information processing apparatus according to an embodiment of the present invention, and as shown in fig. 2, the information processing apparatus includes: a statistical unit, a calculation unit and a determination unit; wherein
the statistic unit is used for: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information;
the computing unit is to: calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information;
the determination unit is used for: and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words.
In an exemplary embodiment, the computing unit is specifically configured to:
according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:
information coverage = (the word number of the current part of speech word in the extraction result) / (the word number of the current part of speech word in the original text).
In an exemplary embodiment, the determining unit is specifically configured to:
respectively multiplying the information coverage of each part of speech word obtained by calculation by a preset weighting parameter and then accumulating to obtain weighted information coverage;
wherein the weighted information coverage is used to quantify the quality of the extraction result.
In an exemplary embodiment, the extraction result includes words of one or more of the following parts of speech:
nouns, verbs, prepositions, and adverbs.
The application includes: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information; calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information; and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words. The embodiment of the invention realizes the quality evaluation of the extracted result through an automatic process, and improves the analysis efficiency of the quality evaluation.
The embodiment of the invention also provides a computer storage medium, wherein a computer program is stored in the computer storage medium, and the computer program is executed by a processor to realize the information processing method.
An embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having a computer program stored therein; wherein
the processor is configured to execute the computer program in the memory;
the computer program, when executed by a processor, implements the method of information processing as described above.
The method of the embodiment of the present invention is briefly described below by using application examples, which are only used for illustrating the present invention and are not used for limiting the protection scope of the present invention.
The extraction result of open-domain information extraction mostly takes the form of triples and doublets. Triples may be used to express entity-relationship-entity information, and doublets may be used to express entity-attribute information. Assume the original text is: Scientist Zhang San severely criticized researcher Li Si, who works at home;
the extraction result then comprises: the triple { "Zhang San", "criticize", "Li Si" }, and the doublets { "Zhang San", "scientist" }, { "Li Si", "researcher" } and { "criticize", "severely" }; FIG. 3 is a diagram illustrating an exemplary graph structure applied in the present invention; as shown in FIG. 3, the entity words include Zhang San and Li Si, which are represented by circles; the relation words include criticize, represented by a diamond; the attribute words include scientist, severely and researcher, represented by squares.
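One possible in-memory representation of this extraction result is sketched below, with the two names romanized as Zhang San and Li Si. The class and field names are hypothetical, and collecting each distinct word once is an assumption that matches the word lists counted later in this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str       # entity
    relation: str   # relation word
    tail: str       # entity


class Doublet(NamedTuple):
    subject: str    # entity or relation word being described
    attribute: str  # attribute word


triples = [Triple("Zhang San", "criticize", "Li Si")]
doublets = [
    Doublet("Zhang San", "scientist"),
    Doublet("Li Si", "researcher"),
    Doublet("criticize", "severely"),
]


def extracted_words(triples, doublets):
    """Collect the distinct words of the extraction result in first-seen order."""
    seen = {}
    for item in [*triples, *doublets]:
        for word in item:
            seen.setdefault(word, None)
    return list(seen)


words = extracted_words(triples, doublets)
print(words)
```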
The application example sets weighting parameters of each part of speech as follows:
weight of noun: 30%, weight of verb: 30%, weight of adverb: 30%, weight of preposition: 10%.
The application example performs a statistical process on the word number of each part-of-speech word contained in the original text, and comprises the following steps:
preprocessing the original text to obtain a word set; the preprocessing may include processing methods known in the related art, such as word segmentation, part-of-speech tagging, entity recognition and semantic chunk recognition; the word set includes: { "scientist", "Zhang San", "severely", "criticize", "having", "at", "home", "work", "of", "researcher", "Li Si", "." };
eliminating useless information (punctuation, stop words and the like) from the word set to obtain the word set to be counted: { "scientist", "Zhang San", "severely", "criticize", "at", "home", "work", "researcher", "Li Si" };
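The elimination of useless information can be sketched as a simple filter. The stop-word set below is hypothetical and contains only the two function words dropped in this example; a real system would use a full stop-word list for the language being processed, and the word segmentation itself is assumed to have been done already.

```python
import string

# Hypothetical stop-word set covering only this example.
STOP_WORDS = {"having", "of"}


def remove_useless(words, stop_words=STOP_WORDS):
    """Drop stop words and tokens that consist only of ASCII punctuation."""
    return [
        w for w in words
        if w not in stop_words
        and not all(ch in string.punctuation for ch in w)
    ]


word_set = ["scientist", "Zhang San", "severely", "criticize", "having",
            "at", "home", "work", "of", "researcher", "Li Si", "."]

to_count = remove_useless(word_set)
print(to_count)
```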
counting each part of speech word in the original text to obtain:
noun list 1: { "scientist", "Zhang San", "home", "researcher", "Li Si" };
verb list 1: { "criticize", "work" };
preposition list 1: { "at" };
adverb list 1: { "severely" };
and counting each part of speech word of the extraction result to obtain:
noun list 2: { "scientist", "Zhang San", "researcher", "Li Si" };
verb list 2: { "criticize" };
preposition list 2: { };
adverb list 2: { "severely" };
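Building the lists above from tagged words can be sketched as follows. The part-of-speech tags are written by hand here as an assumption; in practice they would come from the part-of-speech tagging step, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_pos(tagged_words):
    """Group (word, part_of_speech) pairs into one word list per part of speech."""
    lists = defaultdict(list)
    for word, pos in tagged_words:
        lists[pos].append(word)
    return dict(lists)


# Hand-written tags for the original text and for the extraction result.
original_tagged = [
    ("scientist", "noun"), ("Zhang San", "noun"), ("severely", "adverb"),
    ("criticize", "verb"), ("at", "preposition"), ("home", "noun"),
    ("work", "verb"), ("researcher", "noun"), ("Li Si", "noun"),
]
extracted_tagged = [
    ("Zhang San", "noun"), ("criticize", "verb"), ("Li Si", "noun"),
    ("scientist", "noun"), ("researcher", "noun"), ("severely", "adverb"),
]

lists_1 = group_by_pos(original_tagged)   # lists for the original text
lists_2 = group_by_pos(extracted_tagged)  # lists for the extraction result

for pos in lists_1:
    # A part of speech missing from the extraction result is an empty list.
    print(pos, len(lists_2.get(pos, [])), "/", len(lists_1[pos]))
```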
calculating the information coverage of each part of speech word:
noun information coverage = length of noun list 2 / length of noun list 1 = 4/5 = 0.8;
verb information coverage = length of verb list 2 / length of verb list 1 = 1/2 = 0.5;
preposition information coverage = length of preposition list 2 / length of preposition list 1 = 0/1 = 0;
adverb information coverage = length of adverb list 2 / length of adverb list 1 = 1/1 = 1;
according to the above, the weighted information coverage is calculated as:
weighted information coverage = (noun information coverage × noun weight) + (verb information coverage × verb weight) + (preposition information coverage × preposition weight) + (adverb information coverage × adverb weight) = (0.8 × 0.3) + (0.5 × 0.3) + (0 × 0.1) + (1 × 0.3) = 0.69;
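The arithmetic of this application example can be checked with a few lines of code. The list lengths and weights are those of the example; the variable names are chosen here for illustration.

```python
# (length of list 2, length of list 1) for each part of speech in the example.
list_lengths = {
    "noun": (4, 5),
    "verb": (1, 2),
    "preposition": (0, 1),
    "adverb": (1, 1),
}
weights = {"noun": 0.3, "verb": 0.3, "preposition": 0.1, "adverb": 0.3}

# Information coverage per part of speech.
coverage = {pos: ext / orig for pos, (ext, orig) in list_lengths.items()}

# Weighted information coverage.
weighted = sum(weights[pos] * coverage[pos] for pos in coverage)

print(coverage)
print(round(weighted, 2))  # 0.69
```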
it should be noted that, in the embodiment of the present invention, a threshold may be set by a person skilled in the art according to experience and used to judge the quality of the extraction result from the weighted information coverage; for example, the threshold is set to 0.65, and when the weighted information coverage is greater than 0.65, the quality of the extraction result is considered to basically meet the extraction requirement; the threshold may be adjusted by a person skilled in the art based on the data content of the knowledge graph and on experience. In an exemplary embodiment, the weighted information coverage may be used to compare the quality of extraction results obtained in different manners: the larger the value of the weighted information coverage, the higher the quality of the extraction result.
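The threshold check and the comparison between extraction results can be sketched as follows. The threshold 0.65 and the score 0.69 come from this example; the extractor names and the second score are hypothetical.

```python
THRESHOLD = 0.65  # example threshold from the text; adjustable in practice


def meets_requirement(weighted_coverage, threshold=THRESHOLD):
    """True when the weighted information coverage is greater than the threshold."""
    return weighted_coverage > threshold


def best_extraction(scores):
    """Return the name of the extraction result with the largest weighted coverage."""
    return max(scores, key=scores.get)


# 0.69 is the value computed in the example; 0.54 is a hypothetical competitor.
scores = {"extractor_a": 0.69, "extractor_b": 0.54}

for name, score in scores.items():
    print(name, score, meets_requirement(score))
print(best_extraction(scores))
```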
The application example requires no manual labeling; the quality of the extraction result is evaluated automatically and in real time, which improves the efficiency of the quality evaluation.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. A method of information processing, comprising:
counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information;
calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information;
and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words.
2. The method according to claim 1, wherein the calculating the information coverage of each part-of-speech word included in the extraction result comprises:
according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:
information coverage = (the word number of the current part of speech word in the extraction result) / (the word number of the current part of speech word in the original text).
3. The method of claim 1 or 2, wherein said determining the quality of the extraction comprises:
respectively multiplying the information coverage of each part of speech word obtained by calculation by a preset weighting parameter and then accumulating to obtain weighted information coverage;
wherein the weighted information coverage is used to quantify the quality of the extraction result.
4. The method according to claim 1 or 2, wherein the extraction result includes words of one or more of the following parts of speech:
nouns, verbs, prepositions, and adverbs.
5. A computer storage medium having stored therein a computer program which, when executed by a processor, implements a method of information processing according to any one of claims 1 to 4.
6. A terminal, comprising: a memory and a processor, the memory having a computer program stored therein; wherein
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements a method of performing information processing as recited in any of claims 1-4.
7. An apparatus for information processing, comprising: a statistical unit, a calculation unit and a determination unit; wherein
the statistic unit is used for: counting the word number of each part of speech word contained in the original text and the extraction result to obtain part of speech statistical information;
the computing unit is to: calculating the information coverage of each part of speech word contained in the extraction result according to the obtained part of speech statistical information;
the determination unit is used for: and determining the quality of the extraction result according to the calculated information coverage of all the part-of-speech words.
8. The apparatus according to claim 7, wherein the computing unit is specifically configured to:
according to the part-of-speech statistical information, calculating the information coverage of each part-of-speech word contained in the extraction result through the following formula:
information coverage = (the word number of the current part of speech word in the extraction result) / (the word number of the current part of speech word in the original text).
9. The apparatus according to claim 7 or 8, wherein the determining unit is specifically configured to:
respectively multiplying the information coverage of each part of speech word obtained by calculation by a preset weighting parameter and then accumulating to obtain weighted information coverage;
wherein the weighted information coverage is used to quantify the quality of the extraction result.
10. The apparatus according to claim 7 or 8, wherein the extraction result includes words of one or more of the following parts of speech:
nouns, verbs, prepositions, and adverbs.
CN202010181441.1A 2020-03-16 2020-03-16 Information processing method and device, computer storage medium and terminal Pending CN111428473A (en)

Publications (1)

Publication Number Publication Date
CN111428473A true CN111428473A (en) 2020-07-17

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937471A (en) * 2010-09-21 2011-01-05 上海大学 Multidimensional space evaluation method of keyword extraction algorithm
KR101541170B1 (en) * 2014-10-21 2015-08-03 (주)센솔로지 Apparatus and method for summarizing text
CN106202044A (en) * 2016-07-07 2016-12-07 武汉理工大学 A kind of entity relation extraction method based on deep neural network
CN107092674A (en) * 2017-04-14 2017-08-25 福建工程学院 The automatic abstracting method and system of a kind of Chinese medicine acupuncture field event trigger word

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination