CN113609853A - Enterprise subject attribute identification method, device and equipment - Google Patents

Enterprise subject attribute identification method, device and equipment

Info

Publication number
CN113609853A
Authority
CN
China
Prior art keywords
text
information
identifiers
enterprise
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110871670.0A
Other languages
Chinese (zh)
Inventor
罗晓天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110871670.0A
Publication of CN113609853A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the specification disclose a method, an apparatus, and a device for identifying enterprise subject attributes. The method comprises the following steps: acquiring a text to be identified that contains the name of at least one enterprise to be identified; compressing the text to be identified according to a preset compression rule to obtain a compressed text; locating position information of the name of the enterprise to be identified in the compressed text; selecting, based on the position information, context information from the compressed text according to a preset word number range; determining subject attribute key information in the context information; and determining subject attribute information of the enterprise to be identified according to the subject attribute key information.

Description

Enterprise subject attribute identification method, device and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for identifying an enterprise subject attribute.
Background
With the rapid development of the economy, the number of enterprises has grown enormously. As enterprise types diversify and the state places increasing emphasis on building a social credit system, more and more enterprises and financial institutions pay attention to collecting enterprise information, which is used to investigate matters such as an enterprise's operating condition and credit standing. For example, many products currently on the market query, evaluate, predict, and monitor enterprise risk based on enterprise information.
In a product that evaluates enterprises based on enterprise data, an enterprise profile needs to be established for each enterprise, company, or organization, and the operating condition of the enterprise is determined from that profile. Building the profile requires recording the events related to the enterprise and the role the enterprise plays in each event; this role can be understood as the subject attribute of the enterprise in that event. For example, if an enterprise is involved in a bidding event, it is necessary to know whether the enterprise is the buyer, the winning bidder, or a candidate. Likewise, if an enterprise is associated with a penalty event, it is necessary to determine whether the enterprise is the penalized party or an unrelated party. Identifying the role of an enterprise in each event is therefore very important.
Therefore, there is a need to provide a more reliable enterprise subject attribute identification scheme.
Disclosure of Invention
The embodiment of the specification provides an enterprise subject attribute identification method, an enterprise subject attribute identification device and enterprise subject attribute identification equipment, and aims to solve the problems of low identification efficiency and low identification accuracy of the existing enterprise subject attribute identification method.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
An embodiment of the present specification provides an enterprise subject attribute identification method, including:
acquiring a text to be identified, where the text to be identified includes the name of at least one enterprise to be identified;
compressing the text to be identified according to a preset compression rule to obtain a compressed text;
locating position information of the name of the enterprise to be identified in the compressed text;
selecting, based on the position information, context information from the compressed text according to a preset word number range;
determining subject attribute key information in the context information; and
determining subject attribute information of the enterprise to be identified according to the subject attribute key information.
An enterprise subject attribute identification apparatus provided in an embodiment of the present specification includes:
a text acquisition module, configured to acquire a text to be identified, where the text to be identified includes the name of at least one enterprise to be identified;
a text compression module, configured to compress the text to be identified according to a preset compression rule to obtain a compressed text;
an enterprise name locating module, configured to locate position information of the name of the enterprise to be identified in the compressed text;
a context information selecting module, configured to select, based on the position information, context information from the compressed text according to a preset word number range;
a subject attribute key information determining module, configured to determine subject attribute key information in the context information; and
a subject attribute information identification module, configured to determine subject attribute information of the enterprise to be identified according to the subject attribute key information.
An enterprise subject attribute identification device provided by an embodiment of the present specification includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquire a text to be identified, where the text to be identified includes the name of at least one enterprise to be identified;
compress the text to be identified according to a preset compression rule to obtain a compressed text;
locate position information of the name of the enterprise to be identified in the compressed text;
select, based on the position information, context information from the compressed text according to a preset word number range;
determine subject attribute key information in the context information; and
determine subject attribute information of the enterprise to be identified according to the subject attribute key information.
Embodiments of the present description provide a computer-readable medium having stored thereon computer-readable instructions executable by a processor to implement an enterprise subject attribute identification method.
At least one embodiment of the present specification can achieve the following beneficial effects. The method acquires a text to be identified that contains the name of at least one enterprise to be identified, compresses it according to a preset compression rule, locates the enterprise name in the compressed text, selects context information around that position according to a preset word number range, determines subject attribute key information in the context, and determines the subject attribute information of the enterprise from that key information. Because the text is compressed in advance and the context is selected from the compressed text rather than from the original text, the amount of content in the context is reduced while the context is still ensured to contain the key information needed to identify the subject attribute of the enterprise. In other words, the selected context spans only a small number of words yet contains the key information for determining the subject attribute. This avoids the heavy memory consumption of using the full text and improves both the efficiency and the accuracy of enterprise subject attribute identification.
Drawings
To illustrate the embodiments of the present specification or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of the present application; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an enterprise subject attribute identification method provided in an embodiment of the present specification;
Fig. 2 is a flowchart of a context selection method provided in an embodiment of the present specification;
Fig. 3 is a schematic structural diagram of an enterprise subject attribute identification apparatus provided in an embodiment of the present specification;
Fig. 4 is a schematic structural diagram of an enterprise subject attribute identification device provided in an embodiment of the present specification.
Detailed Description
To make the objects, technical solutions and advantages of one or more embodiments of the present disclosure more apparent, the technical solutions of one or more embodiments of the present disclosure will be described in detail and completely with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present specification, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort fall within the protection scope of one or more embodiments of the present disclosure.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Products used for enterprise identification, for example applications or applets that identify enterprise risk, credit, operating condition, and similar information, need to identify the subject attribute of each enterprise in each event. Such events may include bidding events, penalty events, risk events, public opinion events, and the like. The subject attribute is the role that the enterprise plays in the event.
In the prior art, a commonly used subject attribute identification scheme has two steps. First, named entity recognition is performed: all enterprises in an event article are identified by rules or by an algorithm. Second, the enterprise role is classified: the enterprise is located in the event article, a context is selected, and the subject attribute of the enterprise is determined from that context by rules or by an algorithm.
When the context is selected, the full text is commonly used as the enterprise's context for classification. A context selected this way will generally contain the key information for determining the enterprise's role, but using the full text raises problems. The first is efficiency: if the full text is too long, the determination slows down whether rules or an algorithm are used. In particular, with deep learning now widely used, predicting directly on a long text can cause insufficient memory and slow prediction.
A second, more serious problem with using the full text is that it cannot handle the case where the roles of several enterprises must be determined within a single text. Neither a rule nor an algorithm can produce two different results from the same input; that is, the same context cannot yield different roles. For example, when an event document contains several enterprises with different roles, the subject attribute of each enterprise cannot be determined accurately.
Another approach splits the full text with a delimiter character and uses the text segment containing the enterprise as its context for classifying the enterprise's role. This approach also has limitations. The segment obtained may not contain the key information needed to determine the role, in which case the role cannot be determined. In addition, the delimiter characters are hard to generalize. For example, a bidding text can be split on line breaks, but if the text has no line breaks it is difficult to find a general delimiter. Even if the text is split on punctuation marks or other characters, it is hard to guarantee that the resulting context contains the key information needed to determine the role.
To overcome the above defects, the present solution provides the following embodiments.
An enterprise subject attribute identification method provided in an embodiment of the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an enterprise subject attribute identification method provided in an embodiment of the present specification. From a program perspective, the flow may be executed by a program installed on an application server or an application client. In this embodiment, the flow may be executed by a server dedicated to identifying enterprise subject attributes in event texts, which may be independent of the server used to identify enterprise risk information or operating condition. Alternatively, it may be executed by a server for identifying enterprise information, for example the server behind an application that identifies enterprise risk, in which case that server first identifies the subject attributes of the enterprise.
As shown in fig. 1, the process may include the following steps:
Step 210: acquiring a text to be identified, where the text to be identified includes the name of at least one enterprise to be identified.
The text to be recognized may be the "event" text associated with the business as described above. For example: bidding event text, penalty event text, risk event text, public opinion event text, and the like. The text to be recognized comprises the names of one or more enterprises to be recognized.
Step 220: compressing the text to be identified according to a preset compression rule to obtain a compressed text.
The preset compression rule may be a model, an algorithmic rule, or the like for compressing text, trained on historical text data and historical enterprise identification data. Compressing the text to be identified can be understood simply as deleting or replacing parts of its content so as to reduce the amount of content in the text.
Step 230: locating position information of the name of the enterprise to be identified in the compressed text.
To locate the name in the text, the name of the enterprise to be identified can be recognized using Named Entity Recognition (NER), and its position in the compressed text can then be determined.
Of course, in an actual application scenario, the name of the enterprise to be identified may also be recognized in the compressed text by other techniques, and its position information determined accordingly. This is not specifically limited in the embodiments of the present specification.
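As a minimal sketch of step 230, assuming the enterprise name is already known as a string (for example, as the output of an NER model that is not shown here), the position can be found by plain substring search. The function name `locate_name` and the sample text are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple


def locate_name(compressed_text: str, enterprise_name: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every occurrence of the name."""
    spans = []
    start = compressed_text.find(enterprise_name)
    while start != -1:
        end = start + len(enterprise_name)
        spans.append((start, end))
        # Continue searching after the occurrence that was just found.
        start = compressed_text.find(enterprise_name, end)
    return spans


text = "Winning bidder → Acme Co → candidate → Beta Ltd"
spans = locate_name(text, "Acme Co")
found = [text[s:e] for s, e in spans]
```

Each returned span can be passed on to the context selection of step 240.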
Step 240: selecting, based on the position information, context information from the compressed text according to a preset word number range.
The preset word number range may be set in advance by the user. For example, the average context length may be determined from historical data and the range set accordingly, or the range may be set according to the performance of the specific product that identifies enterprise information. The preset word number range may be, for example, 50 words. Note that the range does not refer to the total length of the selected context; it is the amount of text selected on each side of the position of the name of the enterprise to be identified. For example, when the preset word number range is 50 words, 50 words are selected before and 50 words after the position of the enterprise name, and the selected text together with the name of the enterprise to be identified forms the context information.
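The per-side window described above can be sketched as a slice around the name. This sketch assumes that a "word" is counted as one character, as is natural for Chinese text; the function name `select_context` and the sample string are hypothetical.

```python
def select_context(text: str, start: int, end: int, window: int = 50) -> str:
    """Return the name plus up to `window` characters on each side of it."""
    left = max(0, start - window)          # clamp at the beginning of the text
    right = min(len(text), end + window)   # clamp at the end of the text
    return text[left:right]


# 80 characters before the name, 30 after it.
compressed = "x" * 80 + "Acme Co" + "y" * 30
context = select_context(compressed, 80, 87, window=50)
```

Here the context keeps 50 characters before the name but only the 30 that exist after it, so the window is a maximum per side rather than a fixed total length.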
Step 250: determining subject attribute key information in the context information.
The subject attribute key information may be the key words or key sentences used to determine the subject attribute of the enterprise, for example "bid winner name" and "candidate unit" in a bidding text.
Step 260: determining subject attribute information of the enterprise to be identified according to the subject attribute key information.
The available subject attributes differ from text to text. For example, when the text is a bidding text, the subject attribute information of the enterprise may include "bidder", "successful bidder", "candidate unit", and so on. When the text is a penalty event text, the subject attribute information may include "penalized party". In a risk text, the subject attribute information may include "risk enterprise", "high risk enterprise", "credit enterprise", and so on.
The subject attribute of the enterprise to be identified is determined from the key words or key sentences that identify it.
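The specification leaves open whether steps 250 and 260 are carried out by rules or by a trained model. The following is a minimal rule-based sketch under two assumptions that are not from the source: key phrases are kept in a small hypothetical dictionary (`KEY_PHRASES`), and the key phrase that most closely precedes the enterprise name in the context decides its attribute.

```python
from typing import Dict, Optional

# Hypothetical key phrases for a bidding text, mapped to subject attributes.
KEY_PHRASES: Dict[str, str] = {
    "bid winner": "successful bidder",
    "candidate unit": "candidate unit",
}


def classify(context: str, name: str,
             key_phrases: Dict[str, str] = KEY_PHRASES) -> Optional[str]:
    """Return the attribute of the key phrase that most closely precedes `name`."""
    name_pos = context.find(name)
    if name_pos == -1:
        return None
    prefix = context[:name_pos]
    best_pos, best_attribute = -1, None
    for phrase, attribute in key_phrases.items():
        pos = prefix.rfind(phrase)   # last occurrence before the name
        if pos > best_pos:
            best_pos, best_attribute = pos, attribute
    return best_attribute


role_a = classify("bid winner → Acme Co →", "Acme Co")
role_b = classify("candidate unit → Beta Ltd", "Beta Ltd")
```

Because each enterprise is classified from its own context, two enterprises named in the same document can receive different roles.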
It should be understood that the order of some steps in the method described in one or more embodiments of the present disclosure may be interchanged according to actual needs, or some steps may be omitted or deleted.
The method in Fig. 1 acquires a text to be identified that contains the name of at least one enterprise to be identified, compresses it according to a preset compression rule, locates the enterprise name in the compressed text, selects context information around that position according to a preset word number range, determines subject attribute key information in the context, and determines the subject attribute information of the enterprise from that key information. Because the text is compressed in advance and the context is selected from the compressed text, the amount of content in the context is reduced while the context is still ensured to contain the key information needed to identify the subject attribute of the enterprise. In other words, the selected context spans only a small number of words yet contains the key information for determining the subject attribute. This avoids the heavy memory consumption of using the full text and improves both the efficiency and the accuracy of enterprise subject attribute identification.
Based on the process of fig. 1, some specific embodiments of the process are also provided in the examples of this specification, which are described below.
Optionally, the compression of the text to be identified in step 220 may be carried out by any of the following methods.
Method one: removing the numbers and punctuation marks in the text to be identified to obtain the compressed text.
Compressing the text to be identified according to a preset compression rule to obtain a compressed text may specifically include:
replacing the numbers and punctuation marks in the text to be identified with first identifiers;
determining whether consecutive first identifiers exist;
when consecutive first identifiers exist, replacing them with a single first identifier to obtain the compressed text;
when no consecutive first identifiers exist, determining whether the content between any two first identifiers is invalid information, where the invalid information includes measure words, auxiliary words, or conjunctions;
when the content between any two first identifiers is invalid information, removing the invalid information to obtain consecutive first identifiers; and
replacing the consecutive first identifiers with a single first identifier to obtain the compressed text.
In this embodiment, a number may be a single digit or a numeral composed of several digits; for example, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 45, and so on are all objects to be replaced in the present solution. The numbers to be replaced are numbers written in Arabic numerals. Numbers written out as words, such as "one", "two", and "three", are outside the scope of replacement.
Numerals in a text typically express things such as dates, scores, rankings, or specific data, all of which may be replaced. For example, in the date August 15, 2020, "2020", "8", and "15" may all be replaced. Likewise, values such as "81.87" and "88.51" may be replaced.
Auxiliary words, also called particles, are a grammatical category: function words with little independence and no substantive meaning of their own. They include structural, tense, and modal particles, for example Chinese particles such as 的 (de), 吧 (ba), and 呢 (ne). Measure words include, for example, yuan, tai, jie, and the like.
Conjunctions are function words that cannot serve as sentence components on their own and only link words with words, phrases with phrases, and sentences with sentences. They can be divided mainly into coordinating, adversative, alternative, and causal conjunctions, or alternatively into coordinating and subordinating conjunctions. Examples include: and, with, but also, then, thus, but, otherwise, because, not only, and the like.
Of course, in addition to the measure words, numerals, conjunctions, and auxiliary words listed above, the invalid information may also include prepositions, adverbs, and the like, which are not enumerated one by one in this embodiment.
The first identifier may be set according to the actual application scenario and may be any identifying symbol other than a number or a punctuation mark, for example an English letter, a space, or an arrow.
After the numbers and punctuation marks in the text are replaced with first identifiers, two or more consecutive first identifiers can be merged into a single first identifier. Further, if only invalid information with no practical meaning lies between two first identifiers, that invalid information can be removed and the resulting consecutive first identifiers merged in turn to obtain the compressed text.
With method one, the numbers and punctuation marks in the text are replaced, the invalid information is removed, and consecutive first identifiers are merged, which reduces the invalid information in the text to be identified and thereby compresses the text.
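A minimal sketch of method one follows. It rests on several assumptions that are not from the specification: the text is whitespace-delimited (Chinese text would first need a word segmenter), an arrow serves as the first identifier, only ASCII punctuation is handled, and `INVALID_WORDS` is a small hypothetical dictionary of invalid information.

```python
import re
import string

FIRST_ID = "→"  # hypothetical first identifier: neither a digit nor punctuation
INVALID_WORDS = {"and", "or", "of", "the"}  # hypothetical invalid-information dictionary

# One Arabic numeral (optionally with a decimal part) or one ASCII punctuation mark.
_NUM_OR_PUNCT = re.compile(r"\d+(?:\.\d+)?|[" + re.escape(string.punctuation) + r"]")


def compress_method_one(text: str) -> str:
    # Replace every number and punctuation mark with the first identifier.
    tokens = _NUM_OR_PUNCT.sub(f" {FIRST_ID} ", text).split()
    out = []
    for tok in tokens:
        if tok == FIRST_ID:
            # Look back over any invalid words directly before this identifier.
            i = len(out)
            while i > 0 and out[i - 1].lower() in INVALID_WORDS:
                i -= 1
            if i > 0 and out[i - 1] == FIRST_ID:
                # Only invalid words (or nothing) separate two identifiers:
                # drop the invalid words and merge the identifiers into one.
                del out[i:]
                continue
        out.append(tok)
    return " ".join(out)


compressed = compress_method_one("Score: 81.87, and 88.51; winner Acme Co")
```

In this example the colon, both scores, the comma, the semicolon, and the conjunction between them collapse into a single identifier, so the key word "winner" ends up much closer to the front of the text.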
Method two: replacing the enterprise names in the text to be identified other than the name of the enterprise to be identified.
Compressing the text to be identified according to a preset compression rule to obtain a compressed text may specifically include:
determining all enterprise names in the text to be identified;
replacing the enterprise names other than the name of the enterprise to be identified with second identifiers;
determining whether consecutive second identifiers exist;
when consecutive second identifiers exist, replacing them with a single second identifier to obtain the compressed text;
when no consecutive second identifiers exist, determining whether the content between any two second identifiers is invalid information, where the invalid information includes measure words, auxiliary words, or conjunctions;
when the content between any two second identifiers is invalid information, removing the invalid information to obtain consecutive second identifiers; and
replacing the consecutive second identifiers with a single second identifier to obtain the compressed text.
In practice, a text related to an enterprise may contain one or more enterprise names. To identify the subject attribute information of the enterprise to be identified accurately, the remaining enterprise names can be replaced and merged, reducing the content of the text to be identified.
The second identifier may be any identifier other than a number or a punctuation mark. When it is used together with the first identifier, the two generally need to differ so that the type of the replaced content can be determined. In special cases, however, where the only goal is to strip irrelevant information and compress the text as far as possible, the first identifier and the second identifier may be identical.
Invalid information is removed and identifiers are merged in the same way as in method one; refer to the explanation of method one, which is not repeated here.
With method two, the enterprise names other than that of the enterprise to be identified are removed from the text, which reduces interference with enterprise name recognition, improves the efficiency of recognizing and locating the enterprise name, and in turn improves the efficiency of enterprise subject attribute identification.
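Method two can be sketched as follows, assuming the list of all enterprise names in the text has already been produced by an upstream recognizer that is not shown. The identifier symbol, the default invalid-word list, and the function name are hypothetical.

```python
import re
from typing import Iterable

SECOND_ID = "⇒"  # hypothetical second identifier


def compress_method_two(text: str, all_names: Iterable[str], target_name: str,
                        invalid_words: Iterable[str] = ("and", "or")) -> str:
    # Match longer names first; the target name is matched too, but kept as is,
    # so a shorter name contained in it cannot damage it.
    names = sorted(set(all_names) | {target_name}, key=len, reverse=True)
    name_pattern = re.compile("|".join(re.escape(n) for n in names))
    replaced = name_pattern.sub(
        lambda m: m.group(0) if m.group(0) == target_name else f" {SECOND_ID} ",
        text,
    )
    replaced = " ".join(replaced.split())  # normalise whitespace

    # Two identifiers separated only by invalid words collapse into one.
    filler = "|".join(re.escape(w) for w in invalid_words)
    run = re.compile(rf"{SECOND_ID}(?:\s+(?:{filler}))*\s+{SECOND_ID}", re.IGNORECASE)
    previous = None
    while previous != replaced:  # repeat until nothing more can be merged
        previous = replaced
        replaced = run.sub(SECOND_ID, replaced)
    return replaced


compressed = compress_method_two(
    "Candidates: Beta Ltd and Gamma Inc; winner: Acme Co",
    ["Beta Ltd", "Gamma Inc", "Acme Co"],
    "Acme Co",
)
```

Punctuation is left untouched here because method two on its own replaces only enterprise names; a comma between two names would therefore block the merge.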
Method three: replacing the numbers and punctuation marks with first identifiers, replacing the other enterprise names with second identifiers, and merging each kind of identifier separately.
Compressing the text to be identified according to a preset compression rule to obtain a compressed text may specifically include:
replacing the numbers and punctuation marks in the text to be identified with first identifiers to obtain a first compressed text;
determining all enterprise names in the first compressed text;
replacing the enterprise names in the first compressed text other than the name of the enterprise to be identified with second identifiers to obtain a second compressed text;
removing invalid information in the second compressed text to obtain a third compressed text, where the invalid information includes measure words, auxiliary words, or conjunctions;
replacing consecutive first identifiers in the third compressed text with a single first identifier to obtain a fourth compressed text; and
replacing consecutive second identifiers in the fourth compressed text with a single second identifier to obtain the compressed text.
That is, the numbers and punctuation marks in the text to be identified are replaced with first identifiers, invalid information between first identifiers is removed, and consecutive first identifiers are merged into one; the enterprises other than the target enterprise are replaced with second identifiers, invalid information between second identifiers is removed, and consecutive second identifiers are merged. Invalid information between identifiers is removed as follows. If only spaces or a conjunction such as "and" lie between the identifiers, they can be merged directly. If, however, a character "x" lies between identifiers in the text data, the consecutive padding symbols cannot be merged directly. Where "x" is a measure word, an off-the-shelf or self-trained part-of-speech tagging tool, or a maintained dictionary of measure words, auxiliary words, and conjunctions, can be used to remove such meaningless words between the padding symbols.
The invalid information may be stored on the server in advance, for example as the entries covered by a pre-trained part-of-speech tagging tool or by a dictionary. Removing invalid information by part-of-speech tagging compresses the text. The means of doing so is not limited: an existing or self-trained part-of-speech tagging tool, a maintained measure-word dictionary, and so on may be used.
Through the third method, the numbers, punctuation marks and other enterprise names are replaced and combined, invalid information is removed, text information can be fully compressed, and the defects that the judgment efficiency is low and the resource memory occupies a large amount due to the fact that full text is used as context are overcome. The function that each enterprise in a single file can effectively judge the role is realized.
In addition, the defect of insufficient universality caused by using the segmentation character to segment the text as the context is overcome, the defect that the existing scheme is difficult to contain key information is effectively overcome, and the effect of identifying the attributes of the enterprise main body is improved.
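As a minimal sketch of the third method, the six steps can be written as one function. It assumes space-separated text (the original examples are Chinese and would more likely be processed per character), takes the list of other enterprise names as an input instead of detecting them, uses a space as the first identifier as in Fig. 2, and uses "§" as a hypothetical second identifier. The names `compress_text`, `INVALID_WORDS` and `SECOND_ID` are illustrative and do not come from the specification.

```python
import re

SECOND_ID = "§"  # hypothetical second identifier (the worked example uses "c")
# Hypothetical dictionary of invalid words: numerals, auxiliary words, conjunctions.
INVALID_WORDS = {"one", "two", "three", "and"}


def compress_text(text, other_names):
    """Compress `text`; `other_names` are the enterprise names other than the target."""
    # Step 1: replace digits and punctuation with the first identifier (a space).
    text = re.sub(r"[^\w\s]|\d", " ", text)
    # Steps 2-3: replace every other enterprise name with the second identifier.
    # Longer names are handled first so that a short name cannot split a long one.
    for name in sorted(other_names, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", f" {SECOND_ID} ", text)
    # Step 4: drop invalid words. Splitting on whitespace also performs
    # step 5, because runs of spaces (first identifiers) collapse to one.
    tokens = [t for t in text.split() if t.lower() not in INVALID_WORDS]
    # Step 6: merge consecutive second identifiers into one.
    merged = []
    for tok in tokens:
        if tok == SECOND_ID and merged and merged[-1] == SECOND_ID:
            continue
        merged.append(tok)
    return " ".join(merged)
```

In this sketch invalid words are removed everywhere in the text, which is the simplest reading of the fourth step; a real implementation might restrict removal to the gaps between identifiers or use a part-of-speech tagger instead of a fixed dictionary.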
Optionally, the selecting context information from the compressed text according to a preset word number range based on the position information may specifically include:
in the compressed text, taking the position of the enterprise name to be identified as the reference, continuously selecting a preset number of words forward and a preset number of words backward to obtain context information; the context information comprises the name of the enterprise to be identified.
The selected context information may include the name of the enterprise to be identified and the key information of the subject attribute of the enterprise to be identified. The subject attribute information of the enterprise to be identified can be identified more accurately and quickly based on the selected context.
In this method, a limited number of words before and after the target is used as the context, and information compression removes invalid information as far as possible so that the key information is contained in the context. This effectively mitigates the difficulty existing schemes have in capturing key information and improves role identification for the target enterprise.
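The window selection described above can be sketched as follows. This is an illustration under stated assumptions: the "preset word number" is counted in characters (a natural unit for Chinese text), the same limit is used on both sides, and the function name `select_context` is hypothetical.

```python
def select_context(compressed, target, window=50):
    """Return `window` characters on each side of `target`, plus the target itself."""
    pos = compressed.find(target)  # position of the enterprise name to be identified
    if pos < 0:
        return None  # the target does not occur in the compressed text
    start = max(0, pos - window)
    end = min(len(compressed), pos + len(target) + window)
    return compressed[start:end]
```

Because the slice always spans the target, the returned context necessarily contains the name of the enterprise to be identified.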
Optionally, determining the subject attribute key information in the context information may specifically include:
determining text type information of the text to be recognized;
determining, based on the text type information, a pre-stored subject attribute key information set corresponding to the text type information;
and traversing the context information to determine the subject attribute key information that matches the subject attribute key information set.
In practical applications, the keywords that indicate an enterprise's subject attribute differ between text types. As mentioned above, when the text is a bidding document, the subject attribute information of an enterprise may include "bidder", "winning bidder", "candidate unit", and so on. When the text is a penalty event text, it may include "person to be penalized". In a risk text, it may include "risk enterprise", "high risk enterprise", "credit enterprise", and so on.
In actual operation, the subject attribute key information corresponding to each text type may be stored in advance, specifically as a mapping relationship, for example a mapping of text type to subject attribute key information.
Therefore, when determining the subject attribute key information, the text type of the text to be recognized is determined first, and the subject attribute key information corresponding to that text type is then determined.
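The text type to key information mapping and the traversal step can be sketched as a dictionary lookup followed by a substring match. The type labels `"bidding"`, `"penalty"` and `"risk"` and the function name `match_key_info` are assumptions made for illustration; the key phrases are the examples given in the text above.

```python
# Pre-stored mapping: text type -> subject attribute key information set.
KEY_INFO_BY_TEXT_TYPE = {
    "bidding": ["bidder", "winning bidder", "candidate unit"],
    "penalty": ["person to be penalized"],
    "risk": ["risk enterprise", "high risk enterprise", "credit enterprise"],
}


def match_key_info(context, text_type):
    """Return the key phrases for `text_type` that occur in `context`."""
    key_set = KEY_INFO_BY_TEXT_TYPE.get(text_type, [])
    return [key for key in key_set if key in context]
```

Plain substring matching returns both "bidder" and "winning bidder" when the longer phrase occurs, so a real implementation would probably prefer the longest match.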
The method in the above embodiment can be explained with reference to Fig. 2:
Fig. 2 is a flowchart of a context selection method according to an embodiment of this specification. In Fig. 2, the original full text may be understood as the text to be recognized, the first identifier may be a space, and the second identifier may be the specific symbol shown in Fig. 2. The implementation process includes the following steps:
All numbers and punctuation marks are removed from the original text and their positions are filled with spaces. When determining the subject attribute of an enterprise in an event, numbers and punctuation marks are usually irrelevant information, so these characters can be removed. However, numbers and punctuation marks may carry paragraph separation information, such as "1, first portion … 2, second portion …". To keep this separation information, the positions of the numbers and punctuation marks are marked with spaces. Consecutive spaces are then merged into a single space; the space character carries no meaning, so merging consecutive spaces further shortens the full text and yields the cleaned full text.
Enterprises other than the target enterprise are replaced with a specific symbol. For example, if the target enterprise in a bidding text is "A Construction Engineering Co., Ltd.", all other enterprises are replaced with the symbol "c". Because the names of other enterprises are not key information for judging the role of the target enterprise, this replacement greatly reduces the text length.
Consecutive filler symbols are merged. Text may remain between consecutive filler symbols after the preceding steps. If only spaces or a conjunction such as "and" lie between the filler symbols, they can be merged directly. If, however, a character such as "x" lies between filler symbols in the data, the consecutive filler symbols cannot be merged directly. In that case an off-the-shelf or self-trained part-of-speech tagging tool can be used, or a dictionary of numerals, auxiliary words and conjunctions can be maintained, to remove the meaningless words between filler symbols.
Finally, the target enterprise is located and a word-count range is set to select the context of the target enterprise.
The following description is made with reference to practical examples:
Suppose the text to be recognized is the following excerpt from a bidding event file:
Five, procurement method: competitive consultation
Six, transaction status
Goods name: middle school and school department maintenance project
Winning bidder name: A Construction Engineering Co., Ltd.
Address: x city x district initial investment base x number building x unit xxx room
Transaction amount:
Seven, announcement period: from August 15, 2020 to August 17, 2020
Eight, scoring of candidate unit review results
(one) B Construction Engineering Co., Ltd. (88.01, 88.51, 88.51)
(two) C Engineering Co., Ltd. (81.72, 81.72, 82.72)
(three) D Building Engineering Co., Ltd. (81.19, 81.69, 81.69)
(four) E Building Engineering Co., Ltd. (80.91)
(five) F Construction Engineering Co., Ltd. (76.41, 76.91, 76.91)
(six) G Construction Group Co., Ltd. (67.41, 69.41)
(seven) H Construction Group Co., Ltd. (82.95, 83.95, 83.95)
(eight) I Group Co., Ltd. (80.03, 80.03, 81.03)
(nine) J Building Engineering Co., Ltd. (87.5, 88.0)
(ten) K Construction Engineering Co., Ltd. (87.42, 87.42, 87.92)
(eleven) L Engineering Co., Ltd. (81.88, 82.88)
(twelve) M Construction Engineering Co., Ltd. (81.4, 81.9)
(thirteen) N Construction Engineering Co., Ltd. (81.38)
(fourteen) O Construction Group Co., Ltd. (67.87, 69.87)
(fifteen) P Construction Group Co., Ltd. (83.5, 84.5)
Nine, attachments: none
The winning bidder and the candidates in the event can be extracted from this bidding event file. The 16 enterprises in the file can first be obtained by rules or an algorithm; the second step is to judge the role of each enterprise. Here the first enterprise, "A Construction Engineering Co., Ltd.", is the enterprise to be identified. When its subject attribute is identified using the steps above, text compression proceeds as follows:
(1) Punctuation marks and numbers are replaced with spaces, and consecutive spaces are merged, giving:
Five procurement method competitive consultation
Six transaction status
Goods name middle school and school department maintenance project
Winning bidder name A Construction Engineering Co., Ltd.
Address x city x district initial investment base x number building x unit xxx room
Transaction amount
Seven announcement period from year month day to year month day
Eight scoring of candidate unit review results
one B Construction Engineering Co., Ltd.
two C Engineering Co., Ltd.
three D Building Engineering Co., Ltd.
four E Building Engineering Co., Ltd.
five F Construction Engineering Co., Ltd.
six G Construction Group Co., Ltd.
seven H Construction Group Co., Ltd.
eight I Group Co., Ltd.
nine J Building Engineering Co., Ltd.
ten K Construction Engineering Co., Ltd.
eleven L Engineering Co., Ltd.
twelve M Construction Engineering Co., Ltd.
thirteen N Construction Engineering Co., Ltd.
fourteen O Construction Group Co., Ltd.
fifteen P Construction Group Co., Ltd.
Nine attachments none
(2) Enterprises other than the target enterprise are replaced with a specific symbol. The target enterprise is "A Construction Engineering Co., Ltd.", and the other enterprises are replaced with the symbol "c". Because the names of other enterprises are not key information for judging the role of the target enterprise, this replacement greatly reduces the text length. The text after replacement is:
Five procurement method competitive consultation
Six transaction status
Goods name middle school and school department maintenance project
Winning bidder name A Construction Engineering Co., Ltd.
Address x city x district initial investment base x number building x unit xxx room
Transaction amount
Seven announcement period from year month day to year month day
Eight scoring of candidate unit review results
one c two c three c four c five c six c seven c eight c nine c ten c eleven c twelve c thirteen c fourteen c fifteen c
Nine attachments none
(3) Text may remain between consecutive filler symbols after the preceding steps, as in the example above:
Eight scoring of candidate unit review results
one c two c three c four c five c six c seven c eight c nine c ten c eleven c twelve c thirteen c fourteen c fifteen c
If only spaces or a conjunction such as "and" lie between the filler symbols, they can be merged directly. In the example data, however, characters of the "x" kind lie between the filler symbols, so the consecutive filler symbols cannot be merged directly. In this case, an off-the-shelf or self-trained part-of-speech tagging tool can be used, or a dictionary of numerals, auxiliary words and conjunctions can be maintained, to remove the meaningless words between filler symbols.
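The merge rule in step (3) can be sketched on a token list: two filler symbols are merged only when every token between them is in the invalid-word dictionary. This is a dictionary-based illustration, not the part-of-speech-tagger variant; the function name `merge_fillers` and the sample dictionary are hypothetical.

```python
def merge_fillers(tokens, filler, invalid):
    """Merge filler symbols that are separated only by invalid words."""
    out = []
    last_filler = None  # index in `out` of the most recent filler symbol
    for tok in tokens:
        if tok == filler and last_filler is not None:
            gap = out[last_filler + 1:]
            if all(t in invalid for t in gap):
                # Everything since the previous filler is meaningless:
                # drop the gap and reuse the earlier filler symbol.
                del out[last_filler + 1:]
                continue
        if tok == filler:
            last_filler = len(out)
        out.append(tok)
    return out
```

Tokens that precede the first filler symbol are left alone in this sketch, since the text only speaks of words between filler symbols; whether a leading numeral should also be dropped is not specified.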
After the above compression steps, the compressed text for the target enterprise "A Construction Engineering Co., Ltd." is:
Five procurement method competitive consultation
Six transaction status
Goods name middle school and school department maintenance project
Winning bidder name A Construction Engineering Co., Ltd.
Address x city x district initial investment base x number building x unit xxx room
Transaction amount
Seven announcement period from year month day to year month day
Eight scoring of candidate unit review results
A 1
Nine attachments none
After the compressed text is obtained, the context information can be selected by taking the position of the enterprise to be identified as a reference.
The method in the embodiment can achieve the following technical effects:
1) The selected context contains the key information needed to identify the enterprise to be identified, which greatly speeds up prediction and reduces memory usage. In addition, because irrelevant information is removed, as much of the key information for role judgment as possible fits within the limited number of words, which improves the performance of enterprise role judgment.
2) Useless information is removed by part-of-speech tagging to compress the text. The approach is not restricted to a particular tool: an off-the-shelf or self-trained part-of-speech tagger, a maintained word dictionary, and the like can all be used to further remove useless information from the text, which further improves the efficiency of enterprise subject attribute identification.
3) The low judgment efficiency and high memory usage caused by using the full text as the context are avoided, and the role of every enterprise in a single document can be judged effectively. The lack of generality that results from splitting the text with delimiters to form the context is also overcome.
4) The method in the embodiments of this specification uses a limited number of words before and after the target as the context and applies information compression to remove invalid information as far as possible, so that the key information is contained in the context. This effectively mitigates the difficulty the prior art has in capturing key information and improves role identification for the target enterprise.
In testing, the same rules are applied to classify each context and determine the role of the target enterprise. The performance of several context selection schemes is shown in Table 1; the test set used for the comparison contains 2147 data items in total.
Table 1: Performance comparison of context selection schemes
[Table 1 appears as image BDA0003189399710000141 in the original publication.]
Segmenting the text on line breaks yields contexts with an average length of about 100 words. Therefore, when the context is selected by limiting the word count, the limit is set to 50 words; that is, the 50 words before and the 50 words after the target enterprise are used as its context. Controlling the text length in this way makes the schemes easier to compare. As shown in the table, contexts produced from the full text or by splitting on line breaks give poor recall and accuracy for recognizing the candidate role. In addition, because the full text is long, prediction with it is slow. Compared with the scheme that only limits the number of context words, this scheme additionally compresses the information. Its first advantage is that the average context length is reduced, which further shortens prediction time. In addition, recall and accuracy improve substantially, by 3-5%. The method in the embodiments of this specification therefore selects a context with a shorter text length, so that prediction is faster and uses fewer resources; it also removes invalid information from the context as far as possible, so that the subject attribute key information is contained in the context, which improves both the accuracy and the efficiency of enterprise subject attribute identification. This in turn improves the data quality of enterprise-service-related products.
Based on the same idea, the embodiments of this specification further provide an apparatus corresponding to the above method. Fig. 3 is a schematic structural diagram of an enterprise subject attribute identification apparatus according to an embodiment of this specification. As shown in Fig. 3, the apparatus may include:
a text to be recognized obtaining module 310, configured to obtain a text to be recognized; the text to be identified comprises at least one name of an enterprise to be identified;
the text compression module 320 is configured to compress the text to be recognized according to a preset compression rule to obtain a compressed text;
a to-be-identified enterprise name positioning module 330, configured to position location information of the to-be-identified enterprise name in the compressed text;
a context information selecting module 340, configured to select context information from the compressed text according to a preset word number range based on the location information;
a main body attribute key information determining module 350, configured to determine main body attribute key information in the context information;
and a main body attribute information identification module 360, configured to determine the main body attribute information of the enterprise to be identified according to the main body attribute key information.
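How the six modules could chain together is sketched below, with one function per module. Everything here is a simplified assumption: the compression step omits invalid-word removal, "§" is a hypothetical second identifier, the window is counted in characters, and the final step uses a "nearest key phrase before the target" heuristic that the specification does not prescribe. All function names are illustrative.

```python
import re

SECOND_ID = "§"  # hypothetical second identifier


def obtain_text(raw):  # module 310
    return raw.strip()


def compress(text, others):  # module 320 (simplified: no invalid-word removal)
    text = re.sub(r"[^\w\s]|\d", " ", text)
    for name in others:
        text = text.replace(name, f" {SECOND_ID} ")
    tokens = text.split()
    kept = [t for i, t in enumerate(tokens)
            if not (t == SECOND_ID and i > 0 and tokens[i - 1] == SECOND_ID)]
    return " ".join(kept)


def locate(compressed, target):  # module 330
    return compressed.find(target)


def select_context(compressed, pos, target, window):  # module 340
    return compressed[max(0, pos - window): pos + len(target) + window]


def key_info(context, keys):  # module 350
    return [k for k in keys if k in context]


def identify(context, target, found):  # module 360 (heuristic, an assumption)
    t = context.find(target)
    # Keep key phrases that occur before the target and pick the closest one.
    before = [(context.rfind(k, 0, t), k) for k in found]
    before = [(p, k) for p, k in before if p >= 0]
    return max(before)[1] if before else None


def run(raw, target, others, keys, window=50):
    compressed = compress(obtain_text(raw), others)
    pos = locate(compressed, target)
    if pos < 0:
        return None
    context = select_context(compressed, pos, target, window)
    return identify(context, target, key_info(context, keys))
```

The heuristic in `identify` suits label-before-value layouts such as the bidding excerpt above; other document types would likely need different classification rules.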
Based on the apparatus of Fig. 3, this specification also provides some specific embodiments of the apparatus, which are described below.
Optionally, the text compression module 320 may specifically include:
the first identifier replacing unit is used for replacing the numbers and punctuation marks in the text to be recognized with first identifiers;
a first judgment unit configured to judge whether or not there are consecutive first identifiers;
and the first identifier first combining unit is used for replacing the continuous first identifiers with one first identifier to obtain compressed texts when the continuous first identifiers exist.
Optionally, the text compression module 320 may further include:
an invalid information first detecting unit configured to determine whether content between any two of the first identifiers is invalid information when the consecutive first identifiers do not exist; the invalid information comprises numerals, auxiliary words or conjunctions;
a first invalid information removing unit, configured to remove, when content between any two of the first identifiers is invalid information, the invalid information to obtain the consecutive first identifiers;
and the first identifier second combining unit is used for replacing the continuous first identifiers with one first identifier to obtain compressed texts.
Optionally, the text compression module 320 may specifically include:
the first enterprise name determining unit is used for determining all enterprise names in the text to be recognized;
the second identifier replacing unit is used for replacing the other enterprise names except the enterprise name to be identified with second identifiers;
a second judgment unit operable to judge whether or not there are consecutive second identifiers;
and the second identifier first combining unit is used for replacing the continuous second identifiers with one second identifier to obtain compressed texts when the continuous second identifiers exist.
Optionally, the text compression module 320 may further include:
an invalid information second detecting unit configured to determine whether content between any two of the second identifiers is invalid information when the consecutive second identifiers do not exist; the invalid information comprises numerals, auxiliary words or conjunctions;
a second invalid information removing unit configured to remove invalid information to obtain the consecutive second identifiers when content between any two of the second identifiers is invalid information;
and the second identifier second merging unit is used for replacing the continuous second identifiers with one second identifier to obtain the compressed text.
Optionally, the text compression module 320 may specifically include:
the first compressed text determining unit is used for replacing the numbers and punctuations in the text to be recognized with first identifiers to obtain a first compressed text;
a second enterprise name determining unit, configured to determine all enterprise names in the first compressed text;
the second compressed text determining unit is used for replacing other enterprise names except the enterprise name to be identified in the first compressed text with second identifiers to obtain a second compressed text;
a third compressed text determining unit, configured to remove invalid information from the second compressed text to obtain a third compressed text; the invalid information comprises numerals, auxiliary words or conjunctions;
a fourth compressed text determining unit, configured to replace consecutive first identifiers in the third compressed text with one first identifier, so as to obtain a fourth compressed text;
and the identifier merging unit is used for replacing continuous second identifiers in the fourth compressed text with one second identifier to obtain a compressed text.
Optionally, the context information selecting module 340 may specifically include:
the context information selecting unit is used for continuously selecting, in the compressed text, a preset number of words forward and a preset number of words backward from the position of the enterprise name to be identified, to obtain context information; the context information comprises the name of the enterprise to be identified.
Optionally, the main attribute key information determining module 350 may specifically include:
the text type information determining unit is used for determining the text type information of the text to be recognized;
the main body attribute key information set determining unit is used for determining a main body attribute key information set corresponding to the pre-stored text type information based on the text type information;
and the main body attribute key information matching unit is used for traversing the context information and determining the main body attribute key information matched with the main body attribute key information set.
Optionally, the invalid information may be stored in the server in advance, and the invalid information may be information contained in a pre-trained part-of-speech tagging tool or a dictionary.
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method.
Fig. 4 is a schematic structural diagram of an enterprise subject attribute identification device according to an embodiment of this specification. As shown in Fig. 4, the device 400 may include:
at least one processor 410; and,
a memory 430 communicatively coupled to the at least one processor; wherein,
the memory 430 stores instructions 420 executable by the at least one processor 410 to enable the at least one processor 410 to:
acquiring a text to be identified; the text to be identified comprises at least one name of an enterprise to be identified;
compressing the text to be recognized according to a preset compression rule to obtain a compressed text;
positioning the position information of the enterprise name to be identified in the compressed text;
based on the position information, selecting context information in the compressed text according to a preset word number range;
determining subject attribute key information in the context information;
and determining the main attribute information of the enterprise to be identified according to the main attribute key information.
Based on the same idea, the embodiments of this specification further provide a computer-readable medium corresponding to the above method. The computer-readable medium has computer-readable instructions stored thereon that are executable by a processor to implement the following method:
acquiring a text to be identified; the text to be identified comprises at least one name of an enterprise to be identified;
compressing the text to be recognized according to a preset compression rule to obtain a compressed text;
positioning the position information of the enterprise name to be identified in the compressed text;
based on the position information, selecting context information in the compressed text according to a preset word number range;
determining subject attribute key information in the context information;
and determining the main attribute information of the enterprise to be identified according to the main attribute key information.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be realized with hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, "integrating" it without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained simply by logically programming the method flow in one of the hardware description languages above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included in it for performing the various functions may also be considered structures within the hardware component. The means for performing the functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and the units are described separately. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (19)

1. An enterprise subject attribute identification method, comprising the following steps:
acquiring a text to be recognized; the text to be recognized comprises at least one name of an enterprise to be identified;
compressing the text to be recognized according to a preset compression rule to obtain a compressed text;
locating the position information of the enterprise name to be identified in the compressed text;
based on the position information, selecting context information in the compressed text according to a preset word number range;
determining subject attribute key information in the context information;
and determining the subject attribute information of the enterprise to be identified according to the subject attribute key information.
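The steps of claim 1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the placeholder `<N>`, the function name, the whitespace tokenization (the "preset word number" is counted in tokens here), and the keyword-to-attribute mapping are all hypothetical, and the enterprise name is assumed to contain no digits or punctuation.

```python
import re

PLACEHOLDER = "<N>"  # hypothetical identifier standing in for digits and punctuation


def identify_subject_attribute(text, enterprise_name, window, keyword_to_attribute):
    """Return the attribute implied by the first keyword found near the name, or None."""
    # Compress: every run of digits/punctuation becomes one placeholder token.
    tokens = re.sub(r"(?:[^\w\s]|\d)+", f" {PLACEHOLDER} ", text).split()
    compressed = [t for i, t in enumerate(tokens)
                  if not (t == PLACEHOLDER and i > 0 and tokens[i - 1] == PLACEHOLDER)]

    # Locate the enterprise name in the compressed token sequence.
    name = enterprise_name.split()
    starts = [i for i in range(len(compressed) - len(name) + 1)
              if compressed[i:i + len(name)] == name]
    if not starts:
        return None
    start = starts[0]

    # Select `window` tokens before and after the name as context.
    lo = max(0, start - window)
    hi = min(len(compressed), start + len(name) + window)
    context = compressed[lo:hi]

    # Match key information in the context and map it to an attribute.
    for token in context:
        attribute = keyword_to_attribute.get(token.lower())
        if attribute is not None:
            return attribute
    return None
```

For example, with the hypothetical mapping `{"defendant": "role:defendant"}`, the text "On 12 March 2021, the defendant Acme Ltd was ordered to pay Beta Inc." and a window of 2, the context around "Acme Ltd" contains "defendant", so the sketch returns `"role:defendant"`.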
2. The method according to claim 1, wherein the compressing the text to be recognized according to a preset compression rule to obtain a compressed text specifically comprises:
replacing the numbers and punctuation marks in the text to be recognized with first identifiers;
determining whether there are consecutive first identifiers;
and when the consecutive first identifiers exist, replacing the consecutive first identifiers with one first identifier to obtain the compressed text.
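The compression rule of claim 2 can be illustrated with two regular-expression passes. This is a sketch only: the claim does not specify what the first identifier looks like, so the string `<N>` and the function name are assumptions.

```python
import re

FIRST_ID = "<N>"  # hypothetical form of the "first identifier"


def compress_digits_punct(text: str, first_id: str = FIRST_ID) -> str:
    """Replace digits and punctuation with an identifier, then merge adjacent copies."""
    # Pass 1: every single digit or punctuation character becomes one identifier.
    replaced = re.sub(r"[^\w\s]|\d", first_id, text)
    # Pass 2: a run of directly adjacent identifiers collapses into a single one.
    return re.sub(f"(?:{re.escape(first_id)})+", first_id, replaced)


print(compress_digits_punct("Acme Ltd. (2021), fined 3,000 yuan"))
# Acme Ltd<N> <N> fined <N> yuan
```

Identifiers separated by whitespace are not treated as consecutive in this sketch; whether they should be is a design choice the claim leaves open.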
3. The method of claim 2, after determining whether there are consecutive first identifiers, further comprising:
when the consecutive first identifiers do not exist, judging whether the content between any two first identifiers is invalid information; the invalid information comprises words, auxiliary words or conjunctions;
when the content between any two first identifiers is invalid information, removing the invalid information to obtain the consecutive first identifiers;
and replacing the consecutive first identifiers with one first identifier to obtain the compressed text.
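A token-level sketch of claim 3 follows. It assumes the text has already been split into tokens and that invalid information is given as a small word set; the identifier `<N>`, the example word set, and the function name are hypothetical.

```python
FIRST_ID = "<N>"                            # hypothetical first identifier
INVALID_WORDS = {"and", "or", "the", "of"}  # hypothetical invalid-information set


def drop_invalid_between_ids(tokens, first_id=FIRST_ID, invalid=INVALID_WORDS):
    """Remove invalid words that sit between two identifiers, then merge the identifiers."""
    kept = []
    i, n = 0, len(tokens)
    while i < n:
        tok = tokens[i]
        kept.append(tok)
        i += 1
        if tok != first_id:
            continue
        # Look ahead over a run of invalid words that follows this identifier.
        j = i
        while j < n and tokens[j].lower() in invalid:
            j += 1
        # Skip the run only if it is closed by another identifier.
        if i < j < n and tokens[j] == first_id:
            i = j
    # Merge identifiers that are now adjacent.
    merged = []
    for tok in kept:
        if tok == first_id and merged and merged[-1] == first_id:
            continue
        merged.append(tok)
    return merged
```

Applied to `["Acme", "<N>", "and", "<N>", "paid", "<N>", "the", "fine"]`, the "and" enclosed by two identifiers is dropped and the identifiers merge, while "the" is kept because no identifier follows it.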
4. The method according to claim 1, wherein the compressing the text to be recognized according to a preset compression rule to obtain a compressed text specifically comprises:
determining all enterprise names in the text to be identified;
replacing the other enterprise names except the enterprise name to be identified with second identifiers;
determining whether there are consecutive second identifiers;
and when the consecutive second identifiers exist, replacing the consecutive second identifiers with one second identifier to obtain the compressed text.
5. The method of claim 4, after determining whether there are consecutive second identifiers, further comprising:
when the consecutive second identifiers do not exist, judging whether the content between any two second identifiers is invalid information; the invalid information comprises words, auxiliary words or conjunctions;
when the content between any two second identifiers is invalid information, removing the invalid information to obtain the consecutive second identifiers;
and replacing the consecutive second identifiers with one second identifier to obtain the compressed text.
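Claims 4 and 5 can be sketched together as follows. The claims do not say how enterprise names are detected, so this sketch assumes the full list of names is already known; the identifier `<ORG>`, the example invalid words, and the function name are hypothetical. It also assumes no enterprise name is a substring of the enterprise name to be identified.

```python
import re

SECOND_ID = "<ORG>"  # hypothetical form of the "second identifier"


def mask_other_enterprises(text, all_names, target_name,
                           invalid_words=("and", "or"), second_id=SECOND_ID):
    """Mask every enterprise name except the target, then merge neighbouring masks."""
    # Replace longer names first so a name containing a shorter one is masked whole.
    for name in sorted(all_names, key=len, reverse=True):
        if name != target_name:
            text = text.replace(name, second_id)
    sid = re.escape(second_id)
    filler = "|".join(re.escape(w) for w in invalid_words)
    # One identifier followed by one or more "(optional invalid words) identifier" groups.
    pattern = rf"{sid}(?:\s*(?:(?:{filler})\s*)*{sid})+"
    return re.sub(pattern, second_id, text)


names = ["Acme Ltd", "Beta Inc", "Gamma LLC", "Delta Co"]
print(mask_other_enterprises(
    "Acme Ltd, Beta Inc and Gamma LLC signed with Delta Co", names, "Acme Ltd"))
# Acme Ltd, <ORG> signed with <ORG>
```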
6. The method according to claim 1, wherein the compressing the text to be recognized according to a preset compression rule to obtain a compressed text specifically comprises:
replacing the numbers and punctuation marks in the text to be recognized with first identifiers to obtain a first compressed text;
determining all enterprise names in the first compressed text;
replacing other enterprise names except the enterprise name to be identified in the first compressed text with second identifiers to obtain a second compressed text;
removing invalid information in the second compressed text to obtain a third compressed text; the invalid information comprises words, auxiliary words or conjunctions;
replacing consecutive first identifiers in the third compressed text with one first identifier to obtain a fourth compressed text;
and replacing consecutive second identifiers in the fourth compressed text with one second identifier to obtain the compressed text.
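The ordered steps of claim 6 can be sketched as one function. The identifiers `<N>` and `<ORG>`, the known-name list, the invalid-word set, and whitespace tokenization are all assumptions made for illustration; the sketch also assumes the enterprise names contain no digits, punctuation, or invalid words.

```python
import re

FIRST_ID = "<N>"     # hypothetical first identifier (digits and punctuation)
SECOND_ID = "<ORG>"  # hypothetical second identifier (other enterprise names)


def collapse(tokens, identifier):
    """Keep only the first token of each run of the given identifier."""
    out = []
    for tok in tokens:
        if tok == identifier and out and out[-1] == identifier:
            continue
        out.append(tok)
    return out


def compress(text, all_names, target_name, invalid_words):
    # First compressed text: digits and punctuation become first identifiers.
    first = re.sub(r"[^\w\s]|\d", f" {FIRST_ID} ", text)
    # Second compressed text: other enterprise names become second identifiers.
    second = first
    for name in sorted(all_names, key=len, reverse=True):
        if name != target_name:
            second = second.replace(name, f" {SECOND_ID} ")
    # Third compressed text: invalid words are removed token by token.
    third = [t for t in second.split() if t.lower() not in invalid_words]
    # Fourth compressed text: runs of the first identifier are merged.
    fourth = collapse(third, FIRST_ID)
    # Final compressed text: runs of the second identifier are merged.
    return " ".join(collapse(fourth, SECOND_ID))


print(compress("Acme Ltd sued Beta Inc and Gamma LLC in 2021, for 3,000 yuan.",
               ["Acme Ltd", "Beta Inc", "Gamma LLC"], "Acme Ltd", {"and", "in", "for"}))
# Acme Ltd sued <ORG> <N> yuan <N>
```

Removing invalid words before merging is what lets "Beta Inc and Gamma LLC" shrink to a single `<ORG>`, which brings distant context closer to the enterprise name to be identified.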
7. The method according to claim 1, wherein selecting context information from the compressed text according to a preset word number range based on the position information specifically comprises:
in the compressed text, continuously selecting a preset number of words forward and a preset number of words backward based on the position of the enterprise name to be identified, to obtain the context information; the context information comprises the name of the enterprise to be identified.
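The window selection of claim 7 amounts to slicing around the located name. In this sketch the compressed text is assumed to be a list of whitespace tokens and the preset word number is counted in tokens; for Chinese text it may instead be a character count. Both function names are hypothetical.

```python
def find_span(tokens, name_tokens):
    """Return (start, end) of the first occurrence of name_tokens, or None."""
    n = len(name_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            return i, i + n
    return None


def select_context(tokens, span, window):
    """Return the name plus up to `window` tokens on each side, clipped to the text."""
    start, end = span
    lo = max(0, start - window)
    hi = min(len(tokens), end + window)
    return tokens[lo:hi]


tokens = "<N> Acme Ltd sued <ORG> <N> yuan <N>".split()
span = find_span(tokens, ["Acme", "Ltd"])
print(select_context(tokens, span, 2))
# ['<N>', 'Acme', 'Ltd', 'sued', '<ORG>']
```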
8. The method according to claim 1, wherein the determining of the subject attribute key information in the context information specifically includes:
determining text type information of the text to be recognized;
determining, based on the text type information, a pre-stored subject attribute key information set corresponding to the text type information;
and traversing the context information, and determining the subject attribute key information matched with the subject attribute key information set.
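The lookup in claim 8 can be sketched as a table of keyword sets indexed by text type. The text types and keywords below are invented for illustration and do not come from the patent; only single-token keywords are handled.

```python
# Hypothetical pre-stored key information sets, one per text type.
KEY_INFO_BY_TEXT_TYPE = {
    "judgment": {"plaintiff", "defendant"},
    "tender_notice": {"bidder", "purchaser"},
}


def match_key_info(context_tokens, text_type, table=KEY_INFO_BY_TEXT_TYPE):
    """Traverse the context and return the tokens found in the set for this text type."""
    key_set = table.get(text_type, set())
    return [tok for tok in context_tokens if tok.lower() in key_set]


print(match_key_info(["the", "defendant", "Acme", "Ltd"], "judgment"))
# ['defendant']
```

Restricting the keyword set by text type keeps a word that is meaningful in one kind of document from producing a false match in another.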
9. The method according to any one of claims 3 and 5 to 6, wherein the invalid information is pre-stored in a server, and the invalid information is information contained in a pre-trained part-of-speech tagging tool or dictionary.
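One way to derive the invalid information of claim 9 from a part-of-speech dictionary is sketched below. The dictionary format (word mapped to a tag string) and the tag names are assumptions; a real tagging tool would supply its own tag set, and the first category in the translated claim ("words") is ambiguous, so only auxiliary words and conjunctions are modelled here.

```python
# Hypothetical part-of-speech tags treated as invalid information.
INVALID_POS_TAGS = {"auxiliary", "conjunction"}


def build_invalid_set(pos_dictionary, invalid_tags=INVALID_POS_TAGS):
    """Collect every dictionary word whose part-of-speech tag marks it as invalid."""
    return {word for word, tag in pos_dictionary.items() if tag in invalid_tags}


# Hypothetical dictionary entries: word -> part-of-speech tag.
sample = {"and": "conjunction", "Acme": "noun", "的": "auxiliary", "paid": "verb"}
print(sorted(build_invalid_set(sample)))
# ['and', '的']
```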
10. An enterprise subject attribute identification device, comprising:
the text to be recognized acquisition module is used for acquiring a text to be recognized; the text to be recognized comprises at least one name of an enterprise to be identified;
the text compression module is used for compressing the text to be recognized according to a preset compression rule to obtain a compressed text;
the enterprise name to be identified positioning module is used for locating the position information of the enterprise name to be identified in the compressed text;
the context information selecting module is used for selecting context information from the compressed text according to a preset word number range on the basis of the position information;
the subject attribute key information determining module is used for determining subject attribute key information in the context information;
and the subject attribute information identification module is used for determining the subject attribute information of the enterprise to be identified according to the subject attribute key information.
11. The apparatus according to claim 10, wherein the text compression module specifically includes:
the first identifier replacing unit is used for replacing the numbers and punctuation marks in the text to be recognized with first identifiers;
a first judgment unit configured to judge whether or not there are consecutive first identifiers;
and the first identifier first combining unit is used for replacing the consecutive first identifiers with one first identifier to obtain the compressed text when the consecutive first identifiers exist.
12. The apparatus of claim 11, wherein the text compression module further comprises:
an invalid information first detecting unit configured to determine whether content between any two of the first identifiers is invalid information when the consecutive first identifiers do not exist; the invalid information comprises words, auxiliary words or conjunctions;
a first invalid information removing unit, configured to remove, when content between any two of the first identifiers is invalid information, the invalid information to obtain the consecutive first identifiers;
and the first identifier second combining unit is used for replacing the consecutive first identifiers with one first identifier to obtain the compressed text.
13. The apparatus according to claim 10, wherein the text compression module specifically includes:
the first enterprise name determining unit is used for determining all enterprise names in the text to be recognized;
the second identifier replacing unit is used for replacing the other enterprise names except the enterprise name to be identified with second identifiers;
a second judgment unit configured to judge whether or not there are consecutive second identifiers;
and the second identifier first combining unit is used for replacing the consecutive second identifiers with one second identifier to obtain the compressed text when the consecutive second identifiers exist.
14. The apparatus of claim 13, wherein the text compression module further comprises:
an invalid information second detecting unit configured to determine whether content between any two of the second identifiers is invalid information when the consecutive second identifiers do not exist; the invalid information comprises words, auxiliary words or conjunctions;
a second invalid information removing unit configured to remove invalid information to obtain the consecutive second identifiers when content between any two of the second identifiers is invalid information;
and the second identifier second merging unit is used for replacing the continuous second identifiers with one second identifier to obtain the compressed text.
15. The apparatus according to claim 10, wherein the text compression module specifically includes:
the first compressed text determining unit is used for replacing the numbers and punctuation marks in the text to be recognized with first identifiers to obtain a first compressed text;
a second enterprise name determining unit, configured to determine all enterprise names in the first compressed text;
the second compressed text determining unit is used for replacing other enterprise names except the enterprise name to be identified in the first compressed text with second identifiers to obtain a second compressed text;
a third compressed text determining unit, configured to remove invalid information from the second compressed text to obtain a third compressed text; the invalid information comprises words, auxiliary words or conjunctions;
a fourth compressed text determining unit, configured to replace consecutive first identifiers in the third compressed text with one first identifier, so as to obtain a fourth compressed text;
and the identifier merging unit is used for replacing consecutive second identifiers in the fourth compressed text with one second identifier to obtain the compressed text.
16. The apparatus according to claim 10, wherein the context information selecting module specifically includes:
the context information selecting unit is used for continuously selecting, in the compressed text, a preset number of words forward and a preset number of words backward based on the position of the enterprise name to be identified, to obtain the context information; the context information comprises the name of the enterprise to be identified.
17. The apparatus according to claim 10, wherein the subject attribute key information determining module specifically includes:
the text type information determining unit is used for determining the text type information of the text to be recognized;
the subject attribute key information set determining unit is used for determining, based on the text type information, a pre-stored subject attribute key information set corresponding to the text type information;
and the subject attribute key information matching unit is used for traversing the context information and determining the subject attribute key information matched with the subject attribute key information set.
18. An enterprise subject attribute identification device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquire a text to be recognized; the text to be recognized comprises at least one name of an enterprise to be identified;
compress the text to be recognized according to a preset compression rule to obtain a compressed text;
locate the position information of the enterprise name to be identified in the compressed text;
based on the position information, select context information in the compressed text according to a preset word number range;
determine subject attribute key information in the context information;
and determine the subject attribute information of the enterprise to be identified according to the subject attribute key information.
19. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the enterprise subject attribute identification method of any one of claims 1 to 9.
CN202110871670.0A 2021-07-30 2021-07-30 Enterprise subject attribute identification method, device and equipment Pending CN113609853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110871670.0A CN113609853A (en) 2021-07-30 2021-07-30 Enterprise subject attribute identification method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110871670.0A CN113609853A (en) 2021-07-30 2021-07-30 Enterprise subject attribute identification method, device and equipment

Publications (1)

Publication Number Publication Date
CN113609853A true CN113609853A (en) 2021-11-05

Family

ID=78306218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110871670.0A Pending CN113609853A (en) 2021-07-30 2021-07-30 Enterprise subject attribute identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN113609853A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028164A1 (en) * 2016-08-11 2018-02-15 中兴通讯股份有限公司 Text information extracting method, device and mobile terminal
CN110196848A (en) * 2019-04-09 2019-09-03 广联达科技股份有限公司 A kind of cleaning De-weight method and its system towards public resource transaction data
CN111475603A (en) * 2019-01-23 2020-07-31 百度在线网络技术(北京)有限公司 Enterprise identifier identification method and device, computer equipment and storage medium
CN111651987A (en) * 2020-05-18 2020-09-11 北京金堤科技有限公司 Identity distinguishing method and device, computer readable storage medium and electronic equipment
CN112100288A (en) * 2020-09-15 2020-12-18 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for outputting information

Similar Documents

Publication Publication Date Title
US10546005B2 (en) Perspective data analysis and management
CN110888968A (en) Customer service dialogue intention classification method and device, electronic equipment and medium
RU2613846C2 (en) Method and system for extracting data from images of semistructured documents
CN110427487B (en) Data labeling method and device and storage medium
WO2019041520A1 (en) Social data-based method of recommending financial product, electronic device and medium
CN113495900A (en) Method and device for acquiring structured query language sentences based on natural language
US9772991B2 (en) Text extraction
CN109117470B (en) Evaluation relation extraction method and device for evaluating text information
JPWO2012147428A1 (en) Text clustering apparatus, text clustering method, and program
CN110555203A (en) Text replying method, device, server and storage medium
TW201915777A (en) Financial analysis system and method for unstructured text data
Tumitan et al. Tracking Sentiment Evolution on User-Generated Content: A Case Study on the Brazilian Political Scene.
CN108875743B (en) Text recognition method and device
JP2020113129A (en) Document evaluation device, document evaluation method, and program
CN115392235A (en) Character matching method and device, electronic equipment and readable storage medium
US10042913B2 (en) Perspective data analysis and management
CN113553491A (en) Industrial big data search optimization method based on inverted index
CN113609853A (en) Enterprise subject attribute identification method, device and equipment
Tyers et al. What shall we do with an hour of data? Speech recognition for the un-and under-served languages of Common Voice
CN115129864A (en) Text classification method and device, computer equipment and storage medium
CN110532391B (en) Text part-of-speech tagging method and device
CN110019665A (en) Text searching method and device
CN114115878A (en) Workflow node recommendation method and device
CN112905752A (en) Intelligent interaction method, device, equipment and storage medium
CN110781365B (en) Commodity searching method, device and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination