CN110020424A - Contract information extraction method and device, and text information extraction method - Google Patents

Contract information extraction method and device, and text information extraction method

Info

Publication number
CN110020424A
Authority
CN
China
Prior art keywords
character
contract
mark
extracted
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910006732.4A
Other languages
Chinese (zh)
Other versions
CN110020424B (en
Inventor
余红
张林江
游紫微
梁山雪
胡伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910006732.4A (CN110020424B)
Publication of CN110020424A
Application granted
Publication of CN110020424B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This specification provides a contract information extraction method and device, and a text information extraction method. The contract information extraction method includes: obtaining the text data of a target contract and a type parameter; determining, by means of a preset contract text processing model, the mark characters of the character units in the contract's text data; extracting from that text data the character units whose mark characters match the type parameter; and combining the matched character units according to their mark characters to obtain the contract information to be extracted. In the embodiments of this specification, the preset contract text processing model is trained on sample data annotated according to a preset annotation scheme. The model determines the mark characters of the character units in the contract text data, the matching character units are extracted according to the type parameter, and these units are combined into the contract information to be extracted. This addresses the low accuracy of contract information extraction in existing methods.

Description

Contract information extraction method and device, and text information extraction method
Technical field
This specification relates to the field of Internet technology, and in particular to a contract information extraction method and device, and a text information extraction method.
Background
A contract is an important legal document that contains legal information of particular concern to an enterprise.
With the spread of electronic office work, most enterprises first convert paper documents such as contracts into electronic text data and store it on storage media such as computers, so that it can later be retrieved and managed conveniently.
A single contract usually contains a large and varied amount of information. For example, one contract may run to several hundred pages, and the text data of the contract is correspondingly large. In many cases a user only wants to look up and extract a few pieces of text information of particular interest from the contract's text data. Browsing the text data of the entire contract to find them would clearly cost a great deal of time and effort.
Most existing contract information extraction methods first require binary feature functions to be defined manually. A recognition model is then built from these functions by statistical machine learning, and the text information the user needs is located in and extracted from the contract's text data according to the user's requirements. In practice, however, defining the binary feature functions by hand places high demands on the expertise of the technical staff. Because different technicians differ in knowledge and experience, the accuracy of the functions they define also varies. The quality of the binary feature functions in turn affects the accuracy of the recognition model built from them, so the accuracy of the text information extracted with that model is hard to guarantee. A more accurate contract information extraction method is therefore needed.
Summary of the invention
The purpose of this specification is to provide a contract information extraction method and device, and a text information extraction method, in order to solve the technical problem that existing methods extract contract information with low accuracy.
The contract information extraction method and device and the text information extraction method provided by this specification are implemented as follows:
A contract information extraction method, comprising: obtaining the text data of a target contract and a type parameter of the contract information to be extracted; determining, by means of a preset contract text processing model, the mark characters of the character units in the text data of the target contract, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation scheme; extracting, from the text data of the target contract, the character units whose mark characters match the type parameter of the contract information to be extracted; and combining the matched character units according to their mark characters to obtain the contract information to be extracted.
A text information extraction method, comprising: obtaining target text data and a type parameter of the text information to be extracted; determining, by means of a preset text processing model, the mark characters of the character units in the target text data, wherein the preset text processing model is obtained by training on annotated sample data, and the annotated sample data is text data annotated according to a preset annotation scheme; extracting, from the target text data, the character units whose mark characters match the type parameter of the text information to be extracted; and combining the matched character units according to their mark characters to obtain the text information to be extracted.
A method for building a preset contract text processing model, comprising: obtaining annotated sample data; splitting each sentence in the annotated sample data into multiple character units, wherein each character unit carries corresponding annotation information determined according to a preset annotation scheme; vectorizing each of the character units in the sentence to obtain multiple word vectors for the sentence; obtaining, from the word vectors of the sentence, a first coded sequence and a second coded sequence for the sentence, wherein the first coded sequence is obtained by a first encoding of the word vectors in forward order and the second coded sequence is obtained by a second encoding of the word vectors in reverse order; and splicing the first coded sequence and the second coded sequence to obtain an output result sequence, wherein the output result sequence contains the correspondence between mark characters and character units.
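The forward-order and reverse-order encoding and the splicing step in the model-building method can be sketched as follows. This is only an illustration under stated assumptions, not the model of this specification: the passage does not name the encoder, so a simple running mean stands in for a trained recurrent encoder, and the names `encode` and `bidirectional_encode` and the toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def encode(vectors: List[Vector]) -> List[Vector]:
    """Toy sequential encoder (an assumption, not the patent's encoder).

    The state at step t is the running mean of vectors[0..t], so each
    output summarizes everything read so far in the given order.
    """
    if not vectors:
        return []
    state = [0.0] * len(vectors[0])
    outputs = []
    for t, vec in enumerate(vectors, start=1):
        state = [s + (x - s) / t for s, x in zip(state, vec)]
        outputs.append(state)
    return outputs


def bidirectional_encode(vectors: List[Vector]) -> List[Vector]:
    """Encode in forward and reverse order, then splice per position."""
    first = encode(vectors)                 # first coded sequence (forward order)
    second = encode(vectors[::-1])[::-1]    # second coded sequence (reverse order),
                                            # re-aligned to the original positions
    return [f + b for f, b in zip(first, second)]


# Two one-dimensional "word vectors" for a two-unit sentence.
spliced = bidirectional_encode([[1.0], [3.0]])
print(spliced)
```

In a real model, each spliced vector would be passed to a trained classification layer that assigns a mark character to the corresponding character unit; that layer is omitted here.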
A contract information extraction device, comprising: an obtaining module, configured to obtain the text data of a target contract and a type parameter of the contract information to be extracted; a determining module, configured to determine, by means of a preset contract text processing model, the mark characters of the character units in the text data of the target contract, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation scheme; an extraction module, configured to extract, from the text data of the target contract, the character units whose mark characters match the type parameter of the contract information to be extracted; and a combining module, configured to combine the matched character units according to their mark characters to obtain the contract information to be extracted.
A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements: obtaining the text data of a target contract and a type parameter of the contract information to be extracted; determining, by means of a preset contract text processing model, the mark characters of the character units in the text data of the target contract, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation scheme; extracting, from the text data of the target contract, the character units whose mark characters match the type parameter of the contract information to be extracted; and combining the matched character units according to their mark characters to obtain the contract information to be extracted.
A computer-readable storage medium storing computer instructions that, when executed, implement: obtaining the text data of a target contract and a type parameter of the contract information to be extracted; determining, by means of a preset contract text processing model, the mark characters of the character units in the text data of the target contract, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation scheme; extracting, from the text data of the target contract, the character units whose mark characters match the type parameter of the contract information to be extracted; and combining the matched character units according to their mark characters to obtain the contract information to be extracted.
With the contract information extraction method and device and the text information extraction method provided by this specification, sample data is first annotated in a preset manner. The preset contract text processing model, trained on the annotated sample data, then determines the mark characters of the character units in the contract text data. The character units whose mark characters match the type parameter of the information to be extracted are selected and combined according to their mark characters to obtain the contract information required by the user. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting, from the text data of a contract, the contract information that meets the user's requirements.
Brief description of the drawings
To explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an embodiment of the contract information extraction method provided by the embodiments of this specification, applied in an example scenario;
Fig. 2 is a schematic diagram of an embodiment of the contract information extraction method provided by the embodiments of this specification, applied in an example scenario;
Fig. 3 is a schematic diagram of an embodiment of the contract information extraction method provided by the embodiments of this specification, applied in an example scenario;
Fig. 4 is a schematic diagram of an embodiment of the contract information extraction method provided by the embodiments of this specification, applied in an example scenario;
Fig. 5 is a schematic flow diagram of an embodiment of the contract information extraction method provided by the embodiments of this specification;
Fig. 6 is a schematic flow diagram of an embodiment of the method for building the preset contract text processing model provided by the embodiments of this specification;
Fig. 7 is a schematic flow diagram of an embodiment of the text information extraction method provided by the embodiments of this specification;
Fig. 8 is a schematic structural diagram of an embodiment of the server provided by the embodiments of this specification;
Fig. 9 is a schematic structural diagram of an embodiment of the contract information extraction device provided by the embodiments of this specification.
Detailed description
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this specification, without creative effort, shall fall within the scope of protection of this specification.
Most existing recognition models for extracting text information require technicians to manually define suitable binary feature functions for feature extraction. Contract text differs from ordinary text files in that it has a certain professional specificity, so the demands on technicians are high: they need programming knowledge, legal knowledge, and experience in handling legal documents all at once. Because technicians differ in knowledge and background, errors easily arise when they define binary feature functions. The resulting functions are often not accurate enough, or are ill-suited to legal documents such as contracts, so the recognition model built from them has poor precision and often fails to extract accurately the contract information the user requires.
To address the root cause of this problem, this specification proposes first annotating, according to a preset annotation scheme, each character unit of the key phrases in the sample data (for example, sample contracts) to obtain annotated sample data, and then training on the annotated sample data to obtain a preset contract text processing model with high precision for contract text data. The preset contract text processing model can then determine the mark characters of the character units in the text data of the target contract. Matching character units are selected according to the type parameter of the contract information the user wants, and are combined according to their mark characters to obtain that contract information. Binary feature functions therefore no longer need to be defined manually. This solves the technical problem of low extraction accuracy in existing methods and achieves the technical effect of accurately and efficiently extracting, from the text data of a contract, the contract information that meets the user's requirements.
The embodiments of this specification provide a contract information extraction method. The method can be applied to a server on a legal affairs platform that responds to user requests by locating and extracting the corresponding contract information from contract text data.
Specifically, the server can receive and respond to a user's extraction request as follows: obtain the text data of the target contract and the type parameter of the contract information to be extracted; determine, by means of the preset contract text processing model, the mark characters of the character units in the text data of the target contract, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation scheme; extract, from the text data of the target contract, the character units whose mark characters match the type parameter of the contract information to be extracted; and combine the matched character units according to their mark characters to obtain the contract information to be extracted.
In this embodiment, the server can be a back-end service server of the legal affairs platform that implements functions such as data transmission and data processing. Specifically, the server can be an electronic device with data computation, storage, and network interaction functions, or a software program that runs on such a device and supports data processing, storage, and network interaction. This embodiment does not limit the number of servers: there may be one server, several servers, or a server cluster formed by several servers.
In one example scenario, as shown in Fig. 1, a user can quickly obtain the required contract information through a server that applies the contract information extraction method provided by the embodiments of this specification.
Specifically, the user can convert the contract documents involving the user's company into electronic contract text data in advance, and upload and store it in the database of the legal affairs platform so that the contract text data can later be called up and managed.
Suppose the user now wants to obtain, from the text data of contract B, the term of validity of the service agreed with Dali Company (大力公司), in order to keep track of the term of the service that Dali Company provides. The user can send an extraction request from a client (for example, the user's laptop) to the server on the legal affairs platform that is responsible for extracting contract information. The extraction request can carry the contract title or contract number of the target contract, and the type parameter of the contract information to be extracted.
Here, the types of contract information can be understood as categories designed from the data characteristics of legal data such as contract text, combined with people's habits when searching such data for relevant information. The types of contract information may include one or more of the following: Party A, Party B, company name, amount, contract award date, term of validity, contract number, contract execution date, contract expiration date, and so on. These types are listed only to better illustrate the embodiments of this specification. In a specific implementation, other information types of greater concern, such as address, telephone number, or penalty, may also be introduced as types of contract information, as the case requires.
The type parameter of the contract information to be extracted can be understood as a parameter that indicates the type of that information. It can be the type name of the contract information to be extracted, or a characteristic string corresponding to that type name that the server can recognize. This specification does not limit the specific form or content of the type parameter.
For example, in this scenario the user can set "term of validity of service" directly in the extraction request as the type parameter of the contract information to be extracted. Alternatively, the user can set the string "FWDYXQX", the pinyin initials of the type name, as the type parameter. The user can also set the contract number or contract title in the extraction request to indicate the target contract.
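A server could normalize either form of type parameter to a single internal type code before matching. The sketch below is a hypothetical illustration: the lookup table `TYPE_CODES` and the function `resolve_type_parameter` are assumptions, and only the strings "term of validity of service" and "FWDYXQX" and the codes VAL and COM (used in this document's later tagging examples) come from the text.

```python
# Hypothetical lookup table: keys are lower-cased type names or pinyin
# initials, values are the type codes used in mark characters.
TYPE_CODES = {
    "term of validity of service": "VAL",
    "fwdyxqx": "VAL",          # pinyin initials of the same type name
    "company name": "COM",
}


def resolve_type_parameter(param: str) -> str:
    """Map a user-supplied type parameter to an internal type code."""
    key = param.strip().lower()
    if key not in TYPE_CODES:
        raise ValueError(f"unknown contract information type: {param!r}")
    return TYPE_CODES[key]


print(resolve_type_parameter("term of validity of service"))
print(resolve_type_parameter("FWDYXQX"))
```

Both calls resolve to the same code, so the later matching step does not need to know which form the user supplied.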
After receiving the user's extraction request, the server can first parse it to obtain the contract title or number it carries and the type parameter set by the user. The server can then use the contract title or number to retrieve the text data of the corresponding contract from the legal affairs platform database as the text data of the target contract, and call the pre-trained preset contract text processing model to process that text data and determine the mark characters of its character units. Next, the server selects and extracts the character units that match the type parameter of the contract information to be extracted, and combines them according to their mark characters, thereby restoring the contract information the user wants.
A character unit can be understood as a unit of data that represents text information in the contract text data. It can be a single Chinese character (for example, the character translated as "first"), a word such as "Date", or a number such as "2018". These character units are listed only to better illustrate the embodiments of this specification. This specification does not limit the specific form or content of character units.
A mark character can be understood as a tag that indicates both the information type of the phrase containing the corresponding character unit and the position of that character unit within the phrase. A mark character may therefore consist of two parts: one part indicates the information type of the phrase in which the character unit is located, and the other indicates the position of the character unit within that phrase.
For example, the phrase "大力公司" (Dali Company) contains four character units: "大", "力", "公", and "司". The information type of the phrase is company name, which under the preset rules can be denoted COM. The position of each of the four character units in the phrase is then determined. The character unit "大" is at the start of the phrase, that is, it is the starting character, and under the preset rules it is denoted "B". The character units "力" and "公" are in the middle of the phrase, that is, they are intermediate characters, and are denoted "M1" and "M2" respectively. The character unit "司" is at the end of the phrase, that is, it is the last character, and is denoted "E". Combining the information type of the phrase with the position of each character unit in it, the mark characters of the four character units are "B-COM", "M1-COM", "M2-COM", and "E-COM" respectively. These mark characters are given only to better illustrate the embodiments of this specification. In a specific implementation, other types or forms of tags may also be used as mark characters, as the case requires. This specification does not limit this.
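The tagging rule in this example can be sketched in a few lines. The function name `tag_phrase` is hypothetical, and the four-unit company-name phrase is written here with its Chinese characters 大, 力, 公, 司. The passage does not say how a one-unit phrase is tagged; tagging it with "B" only is an assumption of this sketch.

```python
from typing import List, Tuple


def tag_phrase(units: List[str], type_code: str) -> List[Tuple[str, str]]:
    """Pair each character unit of one key phrase with its mark character.

    The first unit gets "B", the last gets "E", and the units in between
    get "M1", "M2", ... in order. A one-unit phrase gets "B" only
    (assumption: this case is not covered by the text).
    """
    last = len(units) - 1
    tagged = []
    for i, unit in enumerate(units):
        if i == 0:
            position = "B"
        elif i == last:
            position = "E"
        else:
            position = f"M{i}"
        tagged.append((unit, f"{position}-{type_code}"))
    return tagged


print(tag_phrase(["大", "力", "公", "司"], "COM"))
```

The same function applied to a six-unit date phrase with type code VAL yields B-VAL, M1-VAL through M4-VAL, and E-VAL.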
The preset contract text processing model can be understood as a neural network model that the server obtains in advance by forward training and backward training on a large amount of sample contract text data annotated according to the preset annotation scheme, and that can identify the mark characters of the character units in contract text data.
In this example scenario, the server can feed the text data of the target contract into the preset contract text processing model as model input and obtain as output the mark characters of the character units in that text data. The server can then search the text data of the target contract, according to the type parameter of the contract information to be extracted, for the character units whose mark characters match that type parameter, and take them as the matched character units.
For example, according to the type parameter "term of validity of service", the server can search the mark characters carried by the character units in the text data of the target contract and determine that the mark characters matching "term of validity of service" are "B-VAL", "M1-VAL", "M2-VAL", "M3-VAL", "M4-VAL", and "E-VAL" (VAL is the tag that, under the preset rules, denotes the term of validity of service). The six character units carrying these mark characters, "2018", "年", "12", "月", "30", and "日", are then determined to be the character units that match "term of validity of service".
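The matching step can be sketched as a filter over (character unit, mark character) pairs. The function `select_matching` and the sample list are hypothetical, and the date units are written with the characters 年, 月, 日. The passage does not say how units outside any key phrase are tagged; the "O" tag on the filler unit "至" is an assumption.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (character unit, mark character)


def select_matching(tagged_units: List[Tagged], type_code: str) -> List[Tagged]:
    """Keep the units whose mark character carries the requested type code."""
    matched = []
    for unit, tag in tagged_units:
        # A mark character has the form "<position>-<type>", e.g. "M1-VAL".
        _, sep, code = tag.partition("-")
        if sep and code == type_code:
            matched.append((unit, tag))
    return matched


# Hypothetical model output for a fragment of contract text.
tagged = [
    ("至", "O"),  # filler unit outside any key phrase (assumed tag)
    ("2018", "B-VAL"), ("年", "M1-VAL"), ("12", "M2-VAL"),
    ("月", "M3-VAL"), ("30", "M4-VAL"), ("日", "E-VAL"),
    ("大", "B-COM"),
]
print(select_matching(tagged, "VAL"))
```

Document order is preserved by the filter, which the later combination step can rely on.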
After determining the matched character units, the server can splice them together in positional order, according to the position information in the mark characters they carry, to obtain a complete phrase with a specific meaning as the contract information to be extracted.
For example, the character units "2018", "年", "12", "月", "30", and "日" carry the mark characters "B-VAL", "M1-VAL", "M2-VAL", "M3-VAL", "M4-VAL", and "E-VAL". From the position part of these mark characters, namely "B", "M1", "M2", "M3", "M4", and "E", the server can determine that "2018" is the starting character of the phrase and place it at the start. The character unit "年" is an intermediate character in the first intermediate position, so it is connected after the starting character "2018". In the same way, "12" is connected after "年", "月" after "12", "30" after "月", and "日" after "30". After connecting "日", the server recognizes that the position part of its mark character is "E" and so determines that it is the last character of the phrase. The connection is therefore complete, and the resulting combination of character units, "2018年12月30日" (December 30, 2018), is taken as the contract information of the type the user requested, that is, the contract information to be extracted.
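The combination step can be sketched as a single pass over the matched units in document order. The function `combine` is hypothetical, the date units are written with the characters 年, 月, 日, and two choices are assumptions of this sketch rather than statements of the source: units are joined without a separator (suitable for Chinese text), and a list is returned so that more than one phrase of the same type can be recovered.

```python
from typing import List, Tuple


def combine(matched: List[Tuple[str, str]]) -> List[str]:
    """Rebuild phrases from matched (unit, mark character) pairs.

    A "B" position opens a phrase, "M<k>" positions extend it, and an
    "E" position closes it.
    """
    phrases = []
    current = None
    for unit, tag in matched:
        position = tag.split("-", 1)[0]
        if position == "B":
            current = [unit]          # start a new phrase
        elif current is not None:
            current.append(unit)      # intermediate or last character
        if position == "E" and current is not None:
            phrases.append("".join(current))
            current = None            # phrase complete
    return phrases


matched_units = [
    ("2018", "B-VAL"), ("年", "M1-VAL"), ("12", "M2-VAL"),
    ("月", "M3-VAL"), ("30", "M4-VAL"), ("日", "E-VAL"),
]
print(combine(matched_units))
```

A stray unit that appears before any "B" position is ignored, which keeps a partial model output from producing a malformed phrase.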
The server can then respond to the user's extraction request and present the determined contract information to the user, so that the user can easily obtain the desired contract information from the text data of a contract with a large data volume.
As can be seen from the above example scenario, the contract information extraction method provided in this specification first annotates sample data in a preset manner, then uses a preset contract text processing model trained on the annotated sample data to determine the identification characters corresponding to the character units in the contract text data, selects the character units whose identification characters match the type parameter of the information to be extracted, and combines those character units according to their identification characters to obtain the contract information required by the user. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting contract information that meets the user's requirements from the text data of a contract.
In another example scenario, the server can, in advance, perform learning and training on sample data annotated according to a preset annotation method, thereby establishing a preset contract text processing model for identifying and determining the identification characters corresponding to the character units in contract text data.
In a specific implementation, the text data of a certain number of contracts can first be obtained as sample data and annotated according to the preset annotation method, yielding the annotated sample data.
During annotation, taking into account the content features of contract texts as documents used in legal settings, as well as people's habits in extracting and using the information contained in contract texts, the text data of the contracts can first be searched to determine the contract information that is used relatively frequently in contract texts as key phrases for subsequent annotation.
A key phrase can be understood as a phrase in the contract text data that corresponds to contract information people pay more attention to, whose frequency of being searched for and used is greater than a preset frequency threshold. Specifically, a key phrase can be a company name appearing in the contract text, such as "Dali Company"; a specific amount appearing in the contract text, such as "5000 yuan"; or a contract signing date appearing in the contract text, such as "January 1, 2017". Of course, the key phrases listed above serve only to better illustrate the embodiments of this specification. In a specific implementation, phrases representing other contract information in the contract text may also be introduced as key phrases according to the specific scenario and user requirements. This specification imposes no limitation in this regard.
After each key phrase in the contract text data has been determined, the characters in each key phrase can be further segmented to determine the character units each key phrase contains. Then, taking the character unit as the unit of annotation, the annotation information corresponding to each character unit in each key phrase is determined and marked.
Taking the annotation of any character unit in any key phrase as an example: during annotation, the type of contract information corresponding to the key phrase containing the character unit, and the position of the character unit within that key phrase, can first be determined. Based on these two pieces of information, the annotation information for the character unit is determined and attached to the character unit, for example as a label, so that the character unit carries the corresponding annotation information. In a specific implementation, the identification characters described above, which indicate both the information type of the phrase containing the character unit and the position feature of the character unit within that phrase, can be used as the annotation information. Other forms of characters capable of indicating the information type of the containing phrase and the position feature may also be used to annotate character units. This specification imposes no limitation in this regard.
For example, referring to FIG. 2, when annotating the key phrase "500 yuan", which represents amount information in the contract text data, the key phrase can first be split into two character units, "500" and "yuan". It is then determined that the information type corresponding to the phrase containing these two character units is amount, which can be denoted "AMT" (the information type), and that the two character units occupy the start position and the end position of the phrase, which can be denoted "B" and "E" respectively (the position features). Based on the information type and position feature of each character unit, the annotation information corresponding to "500" and "yuan" is determined to be "B-AMT" and "E-AMT" respectively, and this annotation information is attached to the corresponding character units as labels, completing the annotation of each character unit in the key phrase.
During annotation, some key phrases contain relatively many character units, and the corresponding position features may be relatively complex. For example, the key phrase "Dali Company", which represents a company name, contains four character units (its four Chinese characters, glossed one by one as "big", "power", "public affairs" and "department"). The two character units in the middle positions, "power" and "public affairs", can both be denoted "M" according to the preset rules. To distinguish the order of these two intermediate character units more finely, they can further be denoted "M1" and "M2" respectively, where the trailing digit indicates that the character unit "power" is in a middle position and precedes the character unit "public affairs", which is also in a middle position. In addition, for a key phrase containing only a single character unit, such as the key phrase 14545129045 representing a telephone number, the character "S", which denotes a single character unit, can be used as the position feature to indicate that the key phrase contains only one character unit. Correspondingly, the annotation information of this character unit can be denoted "S-TEL" (where TEL denotes the telephone information type).
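The labeling rules for a single key phrase can be sketched as below. This is an illustrative sketch only: the function name `tag_phrase` is hypothetical, and it assumes that middle positions are always numbered ("M1", "M2", ...), which is one of the options the text allows.

```python
def tag_phrase(units, info_type):
    """Attach a position-type label to each character unit of one key phrase."""
    n = len(units)
    if n == 0:
        return []
    if n == 1:
        # A phrase made of a single character unit gets the "S" marker.
        return [(units[0], f"S-{info_type}")]
    tags = [f"B-{info_type}"]                               # start position
    tags += [f"M{i}-{info_type}" for i in range(1, n - 1)]  # numbered middle positions
    tags.append(f"E-{info_type}")                           # end position
    return list(zip(units, tags))

print(tag_phrase(["500", "yuan"], "AMT"))
print(tag_phrase(["big", "power", "public affairs", "department"], "COM"))
print(tag_phrase(["14545129045"], "TEL"))
```

The three calls reproduce the amount, company name and telephone examples from the text.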
While annotating the character units in key phrases, the character units in invalid phrases in the contract text data can also be annotated, to avoid interference from unrelated phrases or characters when establishing the preset contract text processing model.
An invalid phrase can be understood as a phrase in the contract text data that corresponds to contract information people pay less attention to and search for or use less frequently, or a phrase that does not represent contract information at all (for example, a phrase that only serves a grammatical connecting function). Specifically, invalid phrases may include at least one of the following: personal names, place names, adjectives, auxiliary words, and so on. Of course, the invalid phrases listed above serve only to better illustrate the embodiments of this specification. In a specific implementation, other types of phrases may also be introduced as invalid phrases, depending on the circumstances.
When annotating the character units in invalid phrases, in order to distinguish them from the character units of key phrases, the character units in invalid phrases can all be given the character "O" as their annotation information. Of course, in a specific implementation, other annotation information that differs from that of the character units of key phrases may also be used to annotate the character units in invalid phrases. This specification imposes no limitation in this regard.
Before the text data of the contracts is searched to determine key phrases, the invalid phrases in the sample data can first be filtered out, and the key phrases then searched for and determined from the filtered sample data, in order to improve processing efficiency and reduce the interference and influence of non-key phrases.
In the manner described above, each character unit in each phrase (including key phrases and invalid phrases) in the text data of the contract can be annotated, completing the annotation of the contract text data and yielding the annotated sample data.
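Annotating a whole run of text, with "O" for everything outside a key phrase, can be sketched as follows. The function `annotate_sentence`, the span format `(start, end_exclusive, info_type)` and the sample sentence are assumptions made for illustration; the source does not prescribe how key phrase positions are represented.

```python
def annotate_sentence(units, key_spans):
    """Label every character unit; units outside all key phrases get "O".

    units: list of character units.
    key_spans: list of (start, end_exclusive, info_type) for each key phrase.
    """
    tags = ["O"] * len(units)          # default: part of an invalid phrase
    for start, end, info_type in key_spans:
        if end - start == 1:
            tags[start] = f"S-{info_type}"
            continue
        tags[start] = f"B-{info_type}"
        tags[end - 1] = f"E-{info_type}"
        for k, i in enumerate(range(start + 1, end - 1), start=1):
            tags[i] = f"M{k}-{info_type}"
    return list(zip(units, tags))

units = ["the", "fee", "is", "500", "yuan"]
print(annotate_sentence(units, [(3, 5, "AMT")]))
```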
After the annotated sample data is obtained, the server can split it by sentence to obtain multiple sentences, and then feed the character units carrying annotation information into learning and training one sentence at a time. By using the overall context expressed by the sentence and the combinations of phrases within it, the machine can better understand the semantics represented by each character unit and the association between character units of different phrases, so that the model established through learning and training has higher precision and determines the identification characters of the character units in a phrase more accurately.
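Splitting the annotated sample data into sentences can be sketched as below. The source does not say how sentence boundaries are detected; this sketch assumes they are marked by a small, hypothetical set of sentence-ending punctuation units (`SENTENCE_END`).

```python
# Assumed sentence delimiters; the source does not specify them.
SENTENCE_END = {"。", "！", "？", ".", "!", "?"}

def split_sentences(tagged_units):
    """Group (unit, tag) pairs into sentences, keeping each unit's tag."""
    sentences, current = [], []
    for unit, tag in tagged_units:
        current.append((unit, tag))
        if unit in SENTENCE_END:       # close the sentence at its end mark
            sentences.append(current)
            current = []
    if current:                        # trailing text without an end mark
        sentences.append(current)
    return sentences

sample = [("fee", "O"), ("500", "B-AMT"), ("yuan", "E-AMT"), (".", "O"), ("end", "O")]
for sentence in split_sentences(sample):
    print(sentence)
```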
Further, it is also considered that training and learning in the above manner usually lets the machine learn the association between character units of different phrases in a sentence only from the perspective of forward word order (that is, from left to right, or from front to back). For example, after a phrase of type A is determined in a sentence, the model can only predict, from the forward perspective, that the phrase of type A is likely to be followed by the starting character unit of a phrase of type B; it cannot, after determining a phrase of type B, predict from the reverse perspective that the phrase of type B is likely to be preceded by the ending character unit of a phrase of type A. From the perspective of human cognition, however, the type of phrase likely to be connected before a given phrase can indeed be judged by reading the sentence in reverse. Therefore, to further improve the precision of the model, forward training and reverse training can be combined, sentence by sentence, so that the model is trained from both the forward and the reverse word order. When the model is subsequently used, it can combine the results of analysis from both perspectives to determine the identification characters of the character units in each phrase more accurately.
In this example scenario, to establish the higher-precision preset contract text processing model, referring to FIG. 3, the server can first split each sentence of the annotated sample data into multiple character units in the manner described above, sentence by sentence, where each character unit carries its corresponding annotation information. The server can then vectorize the character units of each sentence, sentence by sentence, to obtain the word vectors contained in each sentence, where each word vector carries the annotation information of the corresponding character unit. Next, for each sentence, the word vectors are arranged in forward order to obtain the forward word vector sequence of the sentence and in reverse order to obtain its reverse word vector sequence. The forward and reverse word vector sequences are then fed, as the input of one time step, to a feature layer for training, yielding a first coded sequence and a second coded sequence. Finally, the first and second coded sequences are spliced to obtain the final output result sequence, which contains the correspondence between identification characters and character units. In this way, a higher-precision preset contract text processing model for contract text data is established.
In a specific implementation, the character units in a sentence can each be input to an Embedding layer; through training, an output result is obtained and used as the word vector corresponding to the character unit.
When obtaining the first coded sequence and the second coded sequence, so that the trained model can analyze and judge from both the forward and the reverse word order and thereby identify and determine the identification characters of character units more precisely, referring to FIG. 3, the word vectors in the sentence can first be arranged in forward order, i.e. the word vectors corresponding to the character units are arranged in their original left-to-right order in the sentence, to obtain the forward word vector sequence of the sentence. Similarly, the word vectors in the sentence can be arranged in reverse order, i.e. the word vectors corresponding to the character units are arranged from right to left, to obtain the reverse word vector sequence of the sentence.
After the forward word vector sequence and the reverse word vector sequence of the sentence are obtained, the two sequences of the same sentence can be fed together, as the input of one time step, to a feature layer capable of forward and reverse training, such as a Bi-LSTM layer. Feature extraction is performed through training to obtain the first coded sequence and the second coded sequence.
The first coded sequence (which may also be called the forward LSTM) can be understood as the coded sequence obtained by arranging the first codes corresponding to the word vectors in the sentence in forward order (for example, from left to right). The second coded sequence (which may also be called the reverse LSTM) can be understood as the coded sequence obtained by arranging the second codes corresponding to the word vectors in the sentence in reverse order (for example, from right to left). Here, a first code can be understood as one kind of coded data representing the features of a word vector, and a second code as another kind of coded data representing the features of the same word vector.
After the first coded sequence and the second coded sequence are obtained, they can be spliced in the feature layer. Specifically, referring to FIG. 3, the first code and the second code that correspond to the same character unit in the two sequences are joined together; once splicing is complete, a full hidden state sequence is obtained and used as the output result sequence, thereby establishing the preset contract text processing model.
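The forward pass, the reverse pass and the per-unit splicing can be sketched in plain Python as below. This is a toy illustration under stated assumptions: a simple tanh recurrent cell with random, untrained weights stands in for the LSTM cell of a Bi-LSTM layer, and the names `make_cell`, `run_cell` and `bidirectional_encode` are hypothetical. Only the data flow (two directions, then concatenation per character unit) mirrors the text.

```python
import math
import random

def make_cell(input_dim, hidden_dim, seed):
    """Random weights for a plain tanh recurrent cell (a stand-in for an LSTM cell)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    return w_in, w_h

def run_cell(cell, vectors):
    """Return one hidden state per input vector, processed in the given order."""
    w_in, w_h = cell
    h = [0.0] * len(w_h)
    states = []
    for x in vectors:
        h = [math.tanh(sum(wi * xi for wi, xi in zip(w_in[j], x)) +
                       sum(wh * hi for wh, hi in zip(w_h[j], h)))
             for j in range(len(w_h))]
        states.append(h)
    return states

def bidirectional_encode(vectors, fwd_cell, bwd_cell):
    first = run_cell(fwd_cell, vectors)               # first coded sequence (left to right)
    second = run_cell(bwd_cell, vectors[::-1])[::-1]  # second coded sequence, re-aligned to units
    return [f + b for f, b in zip(first, second)]     # splice the two codes of each unit

word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # three character units, dimension 2
fwd, bwd = make_cell(2, 3, seed=1), make_cell(2, 3, seed=2)
hidden = bidirectional_encode(word_vectors, fwd, bwd)
print(len(hidden), len(hidden[0]))  # 3 6
```

Each spliced state therefore carries information from the units before it (forward half) and after it (reverse half), which is what lets the model use context on both sides.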
The output result sequence contains the correspondence between identification characters and character units, so the model can subsequently determine the identification character corresponding to a character unit based on this correspondence.
It should be noted that, because the preset contract text processing model is obtained through both forward training and reverse training, it can subsequently analyze and judge the character units in a sentence from both the forward and the reverse word order at the same time, and then combine the results from the two perspectives to determine the identification characters of the character units in the sentence more accurately.
In addition, to further improve the precision of the model, after the first coded sequence and the second coded sequence are spliced to obtain the output result sequence, the output result sequence can, referring to FIG. 4, be input to a labeling layer for training, to establish the association relationships between identification characters and obtain a constrained output result sequence.
The constrained output result sequence may additionally contain the association relationships between identification characters. These association relationships are used to further constrain and correct the identification characters determined for the character units in a sentence.
Specifically, the association relationships between identification characters can be used as constraints specifying that an identification character of one type cannot be followed by an identification character of certain other types. For example, a character unit carrying the identification character "O" will not be followed by a character unit carrying an identification character whose position part is "E". The identification characters of character units that appear in such a connection can therefore be corrected, or even determined anew, to ensure that the identification characters determined are accurate and conform to the preset annotation method and the related preset rules.
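A rule-based check of such constraints can be sketched as follows. Only the "O may not be followed by E" rule comes from the text; the other rules (an open phrase must continue with the same type, a closed phrase must not be followed by a middle or end marker) are assumptions added to make the example complete, and the function names are hypothetical.

```python
def split_tag(tag):
    """Return (position, info_type); "O" has no info type."""
    if tag == "O":
        return "O", None
    pos, info_type = tag.split("-", 1)
    return pos, info_type

def transition_allowed(prev_tag, next_tag):
    prev_pos, prev_type = split_tag(prev_tag)
    next_pos, next_type = split_tag(next_tag)
    prev_open = prev_pos == "B" or prev_pos.startswith("M")       # phrase not yet closed
    next_continues = next_pos == "E" or next_pos.startswith("M")  # needs an open phrase
    if prev_open:
        # Assumed rule: an unfinished phrase continues with the same information type.
        return next_continues and prev_type == next_type
    # After "O", "E" or "S" no phrase is open, so a middle or end marker is invalid.
    return not next_continues

def find_violations(tags):
    """Indices i where the step from tags[i-1] to tags[i] breaks a constraint."""
    return [i for i in range(1, len(tags)) if not transition_allowed(tags[i - 1], tags[i])]

print(transition_allowed("O", "E-AMT"))               # False
print(find_violations(["O", "B-AMT", "E-AMT", "O"]))  # []
```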
In a specific implementation, the output result sequence can be input to a CRF layer (one kind of labeling layer) for training, to obtain the constrained output result sequence and thereby establish a preset contract text processing model with relatively higher precision.
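How transition scores steer the final choice of identification characters can be sketched with Viterbi decoding, the usual way a CRF layer picks its best label sequence. This sketch does not train anything: the emission and transition scores are hand-set, hypothetical numbers, and the function name `viterbi` is illustrative.

```python
def viterbi(emissions, transitions, tags):
    """Pick the highest-scoring tag sequence.

    emissions: one {tag: score} dict per character unit.
    transitions: {(prev_tag, next_tag): score}; missing pairs score 0.0.
    """
    best = [{t: (emissions[0][t], None) for t in tags}]
    for scores in emissions[1:]:
        column = {}
        for t in tags:
            prev, total = max(
                ((p, best[-1][p][0] + transitions.get((p, t), 0.0)) for p in tags),
                key=lambda item: item[1])
            column[t] = (total + scores[t], prev)
        best.append(column)
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for column in reversed(best[1:]):      # follow back-pointers to the first unit
        path.append(column[path[-1]][1])
    return path[::-1]

tags = ["O", "B-AMT", "E-AMT"]
emissions = [{"O": 0.1, "B-AMT": 0.6, "E-AMT": 0.3},   # scores for "500"
             {"O": 0.5, "B-AMT": 0.1, "E-AMT": 0.4}]   # scores for "yuan"
# Strong penalties encode forbidden connections such as "O" followed by "E".
transitions = {("O", "E-AMT"): -10.0, ("B-AMT", "O"): -10.0, ("B-AMT", "B-AMT"): -10.0}
print(viterbi(emissions, transitions, tags))  # ['B-AMT', 'E-AMT']
```

Taking the top-scoring label at each position independently would give "B-AMT" followed by "O"; the transition penalties instead yield the consistent pair "B-AMT", "E-AMT".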
As can be seen from the above example scenario, the contract information extraction method provided in this specification first annotates sample data in a preset manner, then uses a preset contract text processing model trained on the annotated sample data to determine the identification characters corresponding to the character units in the contract text data, selects the character units whose identification characters match the type parameter of the information to be extracted, and combines those character units according to their identification characters to obtain the contract information required by the user. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting contract information that meets the user's requirements from the text data of a contract. In addition, when the preset contract text processing model is trained, the character units of the key phrases in the sample data are annotated according to the preset annotation method to obtain the annotated sample data; the first coded sequence and the second coded sequence of each sentence are then obtained from the annotated sample data and spliced together. The resulting model therefore has higher processing precision and can extract contract information using contextual information, further improving the accuracy of the extracted contract information.
Referring to FIG. 5, an embodiment of this specification provides a method for extracting contract information, which is applied on the server side. In a specific implementation, the method may include the following:
S51: Obtain the text data of the contract to be extracted and the type parameter of the contract information to be extracted.
In this embodiment, the type of contract information can be understood as a classification designed according to the data characteristics of contract text data as legal data, combined with people's habits in searching for and using the relevant information in this kind of legal data. Specifically, the type of contract information may include one or more of the following: Party A, Party B, company name, amount, contract signing date, term of validity, contract number, contract execution date, contract expiration date, and so on. Of course, the contract information types listed above serve only to better illustrate the embodiments of this specification. In a specific implementation, other information types that people pay more attention to, such as address, telephone number and liquidated damages, may also be introduced as types of contract information, depending on the circumstances.
In this embodiment, the type parameter of the contract information to be extracted can be understood as an identifying parameter that indicates the type of the contract information to be extracted.
In this embodiment, the server can directly obtain the text data of the contract to be extracted and the type parameter of the contract information to be extracted, both input by the user at the client. Alternatively, it can obtain the contract title or contract number of the contract to be extracted, input by the user at the client, together with the type parameter of the contract information to be extracted set by the user, and then search the database storing the text data of electronic contracts according to the contract title or contract number to obtain the text data of the contract to be extracted.
S53: Determine, by means of a preset contract text processing model, the identification characters of the character units in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is the text data of contracts annotated according to a preset annotation method.
In this embodiment, the preset contract text processing model can be understood as a neural network model that the server obtains in advance by forward and reverse training on the text data of a large number of sample contracts annotated according to the preset annotation method, and that can be used to identify and determine the identification characters corresponding to the character units in contract text data.
In this embodiment, a character unit can be understood as a data unit that represents text information in contract text data. Specifically, a character unit can be a Chinese character, a word, a number, and so on. Of course, the character units listed above serve only to better illustrate the embodiments of this specification. Depending on the circumstances, characters of other forms or content may also be introduced as character units. For example, certain punctuation marks with specific meanings in the contract text may also be treated as a kind of character unit. This specification imposes no limitation in this regard.
In this embodiment, an identification character can be understood as a combination of characters, suited to text data of the contract type, that indicates both the information type of the phrase containing the character unit and the position feature of the character unit within that phrase. Specifically, an identification character may comprise two parts: one part indicating the information type of the phrase containing the character unit, and the other indicating the position feature of the character unit within that phrase.
In this embodiment, the annotated sample data can be obtained by annotation according to the following preset annotation method: search for the key phrases in the sample data; then, for each key phrase in the sample data, mark for each character unit it contains the phrase type of the key phrase (i.e. the information type that the key phrase represents) and the position feature of the character unit within the key phrase, as the annotation information of that character unit. Specifically, the annotation information can be attached to the corresponding character unit as a label, so that the character unit carries the corresponding annotation information.
In this embodiment, in a specific implementation, the text data of the contract to be extracted can be input to the preset contract text processing model as the model input, so as to determine the identification characters corresponding to the character units in the text data of the contract to be extracted.
S55: Extract, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted.
In this embodiment, an identification character matching the type parameter of the contract information to be extracted can be understood as meaning that the type indicated by the part of the identification character that indicates the phrase type is identical to the type indicated by the type parameter, or differs from it by less than a preset difference threshold.
For example, suppose the identification characters of three character units are identification character 1 "B-COM", identification character 2 "E-COM" and identification character 3 "M-AMT". Decomposing these identification characters shows that in identification character 1 the part indicating the phrase type is "COM", which indicates the type company name. Similarly, the part indicating the phrase type in identification character 2 also indicates company name. In identification character 3, the part indicating the phrase type is "AMT", which indicates the type amount. The information type indicated by the type parameter of the contract information to be extracted is company name, which is identical to the type indicated by the phrase type part of identification characters 1 and 2. Therefore, identification character 1 and identification character 2 can both be determined as matching the type parameter of the contract information to be extracted.
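The exact-match case of this comparison can be sketched as below. The names `tag_type` and `matching_units`, the placeholder unit names and the `TYPE_CODES` mapping from a user-facing type parameter to a type code are all hypothetical; the threshold-based variant mentioned in the text is not covered.

```python
# Assumed mapping from type parameter to the type code used in identification characters.
TYPE_CODES = {"company name": "COM", "amount": "AMT"}

def tag_type(tag):
    """Return the information-type part of a tag, or None for "O"."""
    return None if tag == "O" else tag.split("-", 1)[1]

def matching_units(tagged_units, type_parameter):
    """Keep the (unit, tag) pairs whose type part equals the requested type."""
    code = TYPE_CODES[type_parameter]
    return [(unit, tag) for unit, tag in tagged_units if tag_type(tag) == code]

tagged = [("unit1", "B-COM"), ("unit2", "E-COM"), ("unit3", "M-AMT"), ("unit4", "O")]
print(matching_units(tagged, "company name"))
```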
In this embodiment, a character unit matching the type parameter of the contract information to be extracted can be understood as a character unit that subsequently needs to be extracted.
In this embodiment, extracting, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted may, in a specific implementation, include the following: search the identification characters of the character units in the text data of the contract to be extracted, find the identification characters that match the type parameter of the contract information to be extracted, and determine the character units carrying those identification characters as the matched character units.
S57: Combine, according to the identification characters, the character units whose identification characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
In this embodiment, combining the matched character units according to the identification characters to obtain the contract information to be extracted may, in a specific implementation, include the following: search the identification characters of the matched character units and place the character units whose identification characters have the same phrase type part into one group; then, for the character units in the same group, splice and combine them in the corresponding positional order according to the part of their identification characters that indicates the position feature, to obtain a phrase that carries meaning, as one piece of contract information to be extracted.
For example, suppose the following matched character units are obtained: "big" (identification character B-COM), "public affairs" (M2-COM), "500" (B-AMT), "power" (M1-COM), "yuan" (E-AMT) and "department" (E-COM). These matched character units can first be divided into groups according to the part of their identification characters that represents the phrase type. Specifically, the character units "big", "public affairs", "power" and "department", whose phrase type part is "COM", form the first group; the character units "500" and "yuan", whose phrase type part is "AMT", form the second group. Then, for each group, the character units are spliced and combined according to the part of their identification characters that represents the position feature, yielding the corresponding phrase. For the first group, the position parts of the identification characters of "big", "public affairs", "power" and "department" are "B", "M2", "M1" and "E" respectively, so the position features of these four character units are: start position, second middle position, first middle position, and end position. Accordingly, the character unit "big" at the start position is placed first; the character unit "power" at the first middle position is connected immediately after "big"; the character unit "public affairs" at the second middle position is connected immediately after "power"; and the character unit "department" at the end position is connected immediately after "public affairs". Because the position part of the identification character of "department" is "E", which corresponds to the end position, no further character unit of the same phrase follows it. Therefore, once "department" has been connected, the connection of the phrase corresponding to the first group can be judged complete, and the combination of connected character units, "big-power-public affairs-department", can be determined as the phrase and taken as the corresponding contract information to be extracted.
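The grouping and splicing in this example can be sketched as follows. The function names are hypothetical, and the sketch assumes at most one phrase per information type among the matched units; the source does not describe how several phrases of the same type would be separated.

```python
from collections import defaultdict

def position_key(pos):
    """Sortable key for a position marker: B/S first, M, M1, M2, ... next, E last."""
    if pos in ("B", "S"):
        return (0, 0)
    if pos == "E":
        return (2, 0)
    if not pos.startswith("M"):
        raise ValueError(f"unknown position marker: {pos!r}")
    return (1, int(pos[1:] or 0))

def group_and_assemble(tagged_units, sep="-"):
    """Group matched units by information type, then join each group in position order."""
    groups = defaultdict(list)
    for unit, tag in tagged_units:
        pos, info_type = tag.split("-", 1)
        groups[info_type].append((position_key(pos), unit))
    return {info_type: sep.join(unit for _, unit in sorted(members, key=lambda m: m[0]))
            for info_type, members in groups.items()}

matched = [("big", "B-COM"), ("public affairs", "M2-COM"), ("500", "B-AMT"),
           ("power", "M1-COM"), ("yuan", "E-AMT"), ("department", "E-COM")]
print(group_and_assemble(matched))
# {'COM': 'big-power-public affairs-department', 'AMT': '500-yuan'}
```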
In the present embodiment, after the character cells whose mark characters match the type parameter of the contract information to be extracted have been combined according to the mark characters to obtain the contract information to be extracted, the method may, in specific implementation, further include: displaying the contract information to be extracted.
In the present embodiment, specifically, the server may display the determined contract information to be extracted, which meets the user's requirement, directly to the user; alternatively, it may first send the contract information to be extracted to the client, and the client then displays the requested contract information to the user.
In the present embodiment, sample data is first annotated in a preset manner; a preset contract text processing model obtained by training on the annotated sample data is then used to determine the mark characters corresponding to the character cells in the contract text data; the character cells whose mark characters match the type parameter of the information to be extracted are selected and combined according to their mark characters to obtain the contract information the user requires. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting, from the text data of a contract, contract information that meets the user's requirement.
In one embodiment, the preset contract text processing model may be obtained by training as follows: obtaining annotated sample data; splitting each sentence in the annotated sample data into a plurality of character cells, where each character cell carries corresponding annotation information determined according to a preset annotation method; vectorizing the character cells in the sentence to obtain a plurality of word vectors of the sentence; obtaining, from the word vectors of the sentence, a first coded sequence and a second coded sequence for the sentence, where the first coded sequence is the sequence of first codes corresponding to the word vectors arranged in forward order, and the second coded sequence is the sequence of second codes corresponding to the word vectors arranged in reverse order; and splicing the first coded sequence and the second coded sequence to obtain an output result sequence, where the output result sequence includes the correspondence between mark characters and character cells.
In the present embodiment, in specific implementation, the text data of a certain number of contracts may be obtained as sample data, and the character cells in the sample data may be annotated according to the preset annotation method to obtain the annotated sample data.
In one embodiment, in specific implementation, the sample data may be annotated as follows (i.e., the preset annotation method): retrieving the key phrases in the sample data; and, for each key phrase in the sample data, annotating each character cell contained in the key phrase with the phrase type of that key phrase and the position of the character cell within the key phrase, as the annotation information of the character cell.
In the present embodiment, a key phrase may be understood as a phrase in the contract text data that corresponds to contract information people pay more attention to, i.e., contract information that is searched for or used with a frequency greater than a preset frequency threshold. Specifically, a key phrase may be a company name appearing in the contract text, such as "Dali Company"; a specific amount appearing in the contract text, such as "5000 yuan"; or a contract award date appearing in the contract text, such as "January 1, 2017"; and so on.
In the present embodiment, it should be added that the phrase type of a key phrase corresponds to the information type of the contract information that the key phrase characterizes.
In the present embodiment, the phrase type of a key phrase may include at least one of: company name, amount, contract award date, term of validity, contract number, contract execution date, contract expiration date, and so on. Of course, the phrase types listed above are given only to better illustrate the embodiments of this specification. In specific implementation, other phrase types related to contract text data may also be introduced as phrase types of key phrases, as the case may be. This specification does not limit this.
In the present embodiment, after the annotation information of each character cell in a key phrase has been determined, the annotation information may be attached to the corresponding character cell as a tag. This completes the annotation of the character cell, so that the character cell carries its annotation information, and yields the annotated sample data.
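One way to picture the annotation step is the sketch below, which attaches a position-plus-type mark to each character cell of a key phrase. It is an illustration under assumptions, not the specification's procedure: the function name `annotate`, the span format `(start, end, phrase_type)`, the example sentence, and the choice to mark a one-cell phrase with "B" are all hypothetical. The "O" mark is the one the specification uses for cells of invalid phrases; applying it to every cell outside a key phrase is a simplification made here.

```python
def annotate(cells, key_phrases):
    """Attach a mark to every character cell of one sentence.

    cells: list of character cells (strings).
    key_phrases: list of (start, end, phrase_type) spans, end exclusive.
    """
    marks = ["O"] * len(cells)  # assumption: "O" for cells outside any key phrase
    for start, end, phrase_type in key_phrases:
        length = end - start
        for offset in range(length):
            if offset == 0:
                position = "B"            # start position
            elif offset == length - 1:
                position = "E"            # end position
            else:
                position = f"M{offset}"   # 1st, 2nd, ... middle position
            marks[start + offset] = f"{position}-{phrase_type}"
    return list(zip(cells, marks))

# Hypothetical sentence, written with English glosses for the character cells.
sentence = ["party", "big", "power", "public affairs", "department", "pays", "500", "yuan"]
annotated = annotate(sentence, [(1, 5, "COM"), (6, 8, "AMT")])
for cell, mark in annotated:
    print(cell, mark)
```

The result is a list of `(cell, mark)` pairs, which is one possible in-memory form of "sample data in which each character cell carries its annotation information".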
In the present embodiment, splitting the sentences in the annotated sample data into character cells may, in specific implementation, include processing the annotated sample data sentence by sentence. Specifically, the annotated sample data may first be split into a plurality of sentences; each sentence is then further split into a plurality of character cells.
In the present embodiment, in specific implementation, preset punctuation marks may be retrieved in the annotated sample data, and the annotated sample data may be split into sentences at those punctuation marks. The preset punctuation marks may include at least one of: full stop, exclamation mark, question mark, semicolon, and so on. Of course, the preset punctuation marks listed above are merely illustrative. This specification does not limit the types of preset punctuation marks.
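The two-stage split (text into sentences, sentences into character cells) might look like the following sketch. Several points are assumptions rather than statements of the specification: only the full-width Chinese forms of the four punctuation marks are used, a run of digits is kept as a single character cell (suggested by the cell "500" in the earlier example), the helper names are hypothetical, and the sample text is invented for illustration.

```python
import re

# Assumption: full-width full stop, exclamation mark, question mark, semicolon.
PRESET_PUNCTUATION = "。！？；"

def split_sentences(text):
    """Split text at the preset punctuation marks and drop empty pieces."""
    pieces = re.split("[" + re.escape(PRESET_PUNCTUATION) + "]", text)
    return [piece.strip() for piece in pieces if piece.strip()]

def to_cells(sentence):
    """Split a sentence into character cells: digit runs stay together,
    every other non-space character becomes its own cell."""
    return re.findall(r"\d+|\S", sentence)

# Hypothetical contract text, invented for this example.
sample = "甲方为某某公司。合同金额为500元；签订日期为2017年1月1日。"
for sentence in split_sentences(sample):
    print(to_cells(sentence))
```

Restricting the split to full-width marks avoids cutting a sentence at the decimal point of an amount such as "500.00"; a production splitter would need a more careful rule if ASCII punctuation also had to be handled.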
In one embodiment, vectorizing the character cells in the sentence to obtain the word vectors of the sentence may, in specific implementation, include: feeding the character cells of the sentence as input to an Embedding layer for vectorization to obtain the corresponding word vectors.
In one embodiment, it is further considered that the annotated sample data may contain character cells representing Chinese characters, which a computer usually cannot vectorize directly. Therefore, before the character cells are vectorized, the method may, in specific implementation, further include: extracting the character cells representing Chinese characters in the sentence and first one-hot encoding them to obtain computer-processable character codes for those character cells; then feeding the other character cells in the sentence, together with the character codes of the Chinese-character cells, as input to the Embedding layer to obtain the corresponding word vectors.
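The relationship between one-hot encoding and an Embedding layer can be shown with plain Python lists. This is a toy sketch under assumptions: the vocabulary is built from a single sentence, the embedding matrix is random rather than trained, every cell (not only the Chinese-character cells) is one-hot encoded for simplicity, and the names `build_vocab`, `one_hot` and `embed` are hypothetical.

```python
import random

def build_vocab(cells):
    """Map each distinct character cell to an integer index, in first-seen order."""
    vocab = {}
    for cell in cells:
        vocab.setdefault(cell, len(vocab))
    return vocab

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(one_hot_vector, embedding_matrix):
    """Multiply a one-hot row vector by the embedding matrix.

    Because only one entry is 1.0, the product is simply one row of the matrix."""
    rows, dim = len(embedding_matrix), len(embedding_matrix[0])
    return [
        sum(one_hot_vector[r] * embedding_matrix[r][c] for r in range(rows))
        for c in range(dim)
    ]

cells = ["合", "同", "金", "额", "为", "500", "元"]
vocab = build_vocab(cells)
rng = random.Random(0)  # fixed seed so the toy matrix is reproducible
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(4)] for _ in vocab]

code = one_hot(vocab["金"], len(vocab))
vector = embed(code, embedding_matrix)
print(code)
print(vector)
```

In practice an embedding layer performs this row lookup directly from the index and learns the matrix during training; the explicit multiplication is shown only to make the link to the one-hot code visible.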
In one embodiment, obtaining the first coded sequence and the second coded sequence for the sentence from its word vectors may, in specific implementation, include: arranging the word vectors of the sentence in forward order (e.g., left to right) to obtain the forward word vector sequence of the sentence, and arranging them in reverse order (e.g., right to left) to obtain the reverse word vector sequence of the sentence; then feeding the forward word vector sequence and the reverse word vector sequence of the sentence, as the input of a time step, to a feature layer for training to obtain the first coded sequence and the second coded sequence.
In the present embodiment, the first coded sequence (which may also be called the forward LSTM) may be understood as the coded sequence obtained by arranging the first codes corresponding to the word vectors of the sentence in forward order (e.g., left to right); the second coded sequence (which may also be called the reverse LSTM) may be understood as the coded sequence obtained by arranging the second codes corresponding to the word vectors of the sentence in reverse order (e.g., right to left). Here, a first code may be understood as coded data characterizing the features of the corresponding word vector, and a second code as another kind of coded data characterizing the features of the corresponding word vector.
In the present embodiment, after the forward and reverse word vector sequences of the sentence have been obtained, the two sequences, which describe the sentence from opposite word-order directions, may be fed as the input of a time step to a feature layer such as a Bi-LSTM layer for feature extraction, to obtain the corresponding first coded sequence and second coded sequence.
In the present embodiment, splicing the first coded sequence and the second coded sequence to obtain the output result sequence may, in specific implementation, include: concatenating the first code and the second code that correspond to the same character cell in the first and second coded sequences, thereby completing the splicing and obtaining a complete hidden state sequence as the output result sequence, which establishes the preset contract text processing model. The output result sequence includes the correspondence between mark characters and character cells, so the model can subsequently determine the mark character corresponding to a character cell on the basis of that correspondence.
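The idea of encoding a sentence in both directions and then splicing the two codes of each character cell can be illustrated without a deep-learning framework. The sketch below deliberately swaps the LSTM for a much simpler element-wise tanh recurrence, so it is a stand-in and not a Bi-LSTM: a real LSTM adds gates, a cell state and trained weight matrices. The weights, the function names and the input vectors are all hypothetical.

```python
import math

def run_recurrence(vectors, weight_in, weight_hidden):
    """Toy recurrence h_t = tanh(w_in * x_t + w_h * h_(t-1)), element-wise.

    Stand-in for one LSTM direction; returns one hidden state per input vector."""
    hidden = [0.0] * len(vectors[0])
    states = []
    for x in vectors:
        hidden = [math.tanh(weight_in * xi + weight_hidden * hi)
                  for xi, hi in zip(x, hidden)]
        states.append(hidden)
    return states

def bidirectional_encode(vectors):
    """Encode in both directions and splice the codes position by position."""
    forward = run_recurrence(vectors, 0.5, 0.3)               # first coded sequence
    backward = run_recurrence(vectors[::-1], 0.4, 0.2)[::-1]  # second, re-aligned
    # Concatenate the two codes that belong to the same character cell.
    return [f + b for f, b in zip(forward, backward)]

word_vectors = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]  # hypothetical 2-dim vectors
hidden_states = bidirectional_encode(word_vectors)
for state in hidden_states:
    print([round(value, 4) for value in state])
```

Each spliced state has twice the dimension of a single direction: the first half summarizes the cells to the left of the position and the second half the cells to the right, which is what lets the model use context from both sides.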
In one embodiment, considering the usage scenarios and habits of the users of contract text data, many phrases carry information that is rarely used and that people pay little attention to; for example, enterprises seldom care about the information characterized by phrases such as personal names and adjectives in a contract. In addition, some phrases that serve only as grammatical connectors in the contract text data, such as auxiliary words and conjunctions, usually carry no semantics of interest to the user. To prevent such invalid phrases from interfering with modeling, and to avoid wasting resources and time processing them, the method may, in specific implementation, further include the following before the key phrases in the sample data are retrieved: filtering out the invalid phrases in the sample data, so that only the character cells in the key phrases of the filtered sample data are annotated. Of course, in specific implementation, the filtering may instead consist of marking the character cells in invalid phrases, for example with the mark character "O". During subsequent model training, when the computer recognizes the mark character "O", it can identify the corresponding character cell as belonging to an invalid phrase and treat it differently, reducing or avoiding the resources and processing time spent on such character cells and thereby improving processing efficiency.
In the present embodiment, the invalid phrases may include at least one of: personal names, place names, adjectives, auxiliary words, and so on. Of course, the invalid phrases listed above are merely illustrative. In specific implementation, other kinds of phrases may also be introduced as invalid phrases according to the specific application scenario and processing requirements. This specification does not limit this.
In one embodiment, in order to further improve the precision of the established preset contract text processing model, so that the mark characters of character cells can be determined more accurately when the model is subsequently used, the method may, in specific implementation, further include the following after the first coded sequence and the second coded sequence have been spliced to obtain the output result sequence: feeding the output result sequence as input to an annotation layer for training, so as to establish constraint relationships between mark characters and obtain a constrained output result sequence, where the constrained output result sequence also includes the association relationships between mark characters.
In the present embodiment, the annotation layer may specifically be a CRF annotation layer. Of course, the CRF annotation layer mentioned above is merely illustrative. In specific implementation, other suitable layers may also be used as the annotation layer. This specification does not limit this.
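To see why constraints between mark characters help, consider decoding with and without them. The sketch below is a simplified illustration and not a trained CRF: a real CRF layer learns transition scores from data, whereas here the transitions are hard-coded as allowed (score 0) or forbidden (negative infinity). The tag set, the `ALLOWED` table, the rule that a sentence may not end inside a phrase, and the emission scores are all assumptions made for the example.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-COM", "M1-COM", "E-COM"]

# Assumption: hand-written constraints standing in for learned transition scores.
ALLOWED = {
    "START": {"O", "B-COM"},
    "O": {"O", "B-COM"},
    "B-COM": {"M1-COM", "E-COM"},
    "M1-COM": {"E-COM"},
    "E-COM": {"O", "B-COM"},
}

def transition(prev_tag, tag):
    return 0.0 if tag in ALLOWED[prev_tag] else NEG_INF

def viterbi(emissions):
    """emissions: one {tag: score} dict per character cell.
    Returns the highest-scoring tag path that respects the constraints."""
    scores = {t: transition("START", t) + emissions[0][t] for t in TAGS}
    back_pointers = []
    for emission in emissions[1:]:
        new_scores, pointers = {}, {}
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda p: scores[p] + transition(p, tag))
            new_scores[tag] = scores[best_prev] + transition(best_prev, tag) + emission[tag]
            pointers[tag] = best_prev
        scores = new_scores
        back_pointers.append(pointers)
    # Assumption: the sentence must end outside a phrase or at its end position.
    last = max(("O", "E-COM"), key=lambda t: scores[t])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical per-cell scores, e.g. as produced from the spliced hidden states.
emissions = [
    {"O": 0.1, "B-COM": 2.0, "M1-COM": 0.0, "E-COM": 0.0},
    {"O": 1.0, "B-COM": 0.0, "M1-COM": 0.8, "E-COM": 0.1},
    {"O": 0.2, "B-COM": 0.0, "M1-COM": 0.0, "E-COM": 1.5},
]
greedy = [max(e, key=e.get) for e in emissions]
constrained = viterbi(emissions)
print(greedy)       # ['B-COM', 'O', 'E-COM']
print(constrained)  # ['B-COM', 'M1-COM', 'E-COM']
```

Picking the best mark for each cell independently yields a sequence in which a phrase starts, is interrupted by "O", and then ends, which cannot be spliced into a phrase. Decoding over whole sequences under the constraints returns a well-formed one instead.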
As can be seen from the above, in the method for extracting contract information provided by the embodiments of this specification, sample data is first annotated in a preset manner; a preset contract text processing model obtained by training on the annotated sample data is then used to determine the mark characters corresponding to the character cells in the contract text data; the character cells whose mark characters match the type parameter of the information to be extracted are selected and combined according to their mark characters to obtain the contract information the user requires. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting, from the text data of a contract, contract information that meets the user's requirement. In addition, when the preset contract text processing model is established by training, the character cells of the key phrases in the sample data are annotated according to the preset annotation method to obtain the annotated sample data; the first coded sequence and the second coded sequence for each sentence are then obtained from the annotated sample data and spliced, so that the established model has higher processing precision and can extract contract information in combination with contextual information, further improving the accuracy of the extracted contract information.
Referring to FIG. 6, an embodiment of this specification provides a method for establishing a preset contract text processing model, where the method is applied to a server side. In specific implementation, the method may include the following:
S61: obtaining annotated sample data;
S63: splitting each sentence in the annotated sample data into a plurality of character cells, where each character cell carries corresponding annotation information determined according to a preset annotation method;
S65: vectorizing the character cells in the sentence to obtain a plurality of word vectors of the sentence;
S67: obtaining, from the word vectors of the sentence, a first coded sequence and a second coded sequence for the sentence, where the first coded sequence is the sequence of first codes corresponding to the word vectors of the sentence arranged in forward order, and the second coded sequence is the sequence of second codes corresponding to the word vectors of the sentence arranged in reverse order;
S69: splicing the first coded sequence and the second coded sequence to obtain an output result sequence, where the output result sequence includes the correspondence between mark characters and character cells.
In the present embodiment, in specific implementation, the annotated sample data may be obtained by annotating according to the preset annotation method as follows: retrieving the key phrases in the sample data; and, for each key phrase in the sample data, annotating each character cell contained in the key phrase with the phrase type of that key phrase and the position of the character cell within the key phrase, as the annotation information of the character cell.
As can be seen from the above, in the method for establishing a preset contract text processing model provided by this application, the sample data is annotated according to a preset annotation method determined on the basis of the data characteristics and usage habits of contract text data, to obtain the annotated sample data; then, with the sentence as the processing unit, a higher-precision contract text processing model for extracting information from contract text data is established through forward training and reverse training, improving the accuracy of the model.
Referring to FIG. 7, an embodiment of this specification further provides a method for extracting text information. In specific implementation, the method may include the following:
S71: obtaining text data to be extracted and the type parameter of the text information to be extracted;
S73: determining, by means of a preset text processing model, the mark characters of the character cells in the text data to be extracted, where the preset text processing model is obtained by training on annotated sample data, and the annotated sample data is text data annotated according to a preset annotation method;
S75: extracting, from the text data to be extracted, the character cells whose mark characters match the type parameter of the text information to be extracted;
S77: combining, according to the mark characters, the character cells whose mark characters match the type parameter of the text information to be extracted, to obtain the text information to be extracted.
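Steps S71 to S77 can be strung together in a short sketch. It is illustrative only: the trained model is replaced by a hypothetical lookup table (`lookup_model`), the character cells are English glosses, the function name `extract_text_information` is invented, and the splicing assumes the cells arrive in text order with well-formed marks, so that each phrase runs from a "B" mark to the next "E" mark.

```python
def extract_text_information(text_cells, type_parameter, model):
    """Tag the cells, keep those matching the type parameter, splice them."""
    marks = model(text_cells)                                   # S73
    matched = [(cell, mark) for cell, mark in zip(text_cells, marks)
               if mark.endswith("-" + type_parameter)]          # S75
    phrases, current = [], []
    for cell, mark in matched:                                  # S77
        position = mark.split("-", 1)[0]
        if position == "B":
            current = [cell]      # a start position opens a new phrase
        else:
            current.append(cell)  # middle or end position extends it
        if position == "E":
            phrases.append(current)
            current = []
    return phrases

def lookup_model(cells):
    """Hypothetical stand-in for the trained model: a fixed mark per cell."""
    table = {"big": "B-COM", "power": "M1-COM", "public affairs": "M2-COM",
             "department": "E-COM", "500": "B-AMT", "yuan": "E-AMT"}
    return [table.get(cell, "O") for cell in cells]

# S71: the text (already split into cells) and the requested type parameter.
cells = ["big", "power", "public affairs", "department", "pays", "500", "yuan"]
print(extract_text_information(cells, "AMT", lookup_model))  # [['500', 'yuan']]
print(extract_text_information(cells, "COM", lookup_model))
```

Because phrases are closed at each "E" mark, this variant can return several phrases of the same type from one text, for example two company names in the same contract.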
In the present embodiment, the text data to be extracted may include the text data of a contract to be extracted, the text data of a paper to be extracted, the text data of rules and regulations to be extracted, the text data of a notice letter to be extracted, and so on. Of course, the text data listed above is merely illustrative. In specific implementation, other types of text data may also be introduced as the text data to be extracted, according to the specific situation and processing requirements, and the above method for extracting text information may be applied to extract the text information the user requires. This specification does not limit this.
As can be seen from the above, in the method for extracting text information provided by the embodiments of this specification, sample data is first annotated in a preset manner; a preset text processing model obtained by training on the annotated sample data is then used to determine the mark characters corresponding to the character cells in the text data; the character cells whose mark characters match the type parameter of the information to be extracted are selected and combined according to their mark characters to obtain the text information the user requires. This solves the technical problem that existing methods extract text information with low accuracy, and achieves the technical effect of accurately and efficiently extracting, from text data, text information that meets the user's requirement.
An embodiment of this specification further provides a server, including a processor and a memory for storing processor-executable instructions. In specific implementation, the processor may perform the following steps according to the instructions: obtaining the text data of a contract to be extracted and the type parameter of the contract information to be extracted; determining, by means of a preset contract text processing model, the mark characters of the character cells in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation method; extracting, from the text data of the contract to be extracted, the character cells whose mark characters match the type parameter of the contract information to be extracted; and combining, according to the mark characters, the character cells whose mark characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
In order to carry out the above instructions more accurately, and referring to FIG. 8, this specification further provides another specific server, where the server includes a network communications port 801, a processor 802 and a memory 803. These structures are connected by internal cables so that they can exchange data with one another.
The network communications port 801 may be used to obtain the text data of the contract to be extracted and the type parameter of the contract information to be extracted.
The processor 802 may be used to determine, by means of a preset contract text processing model, the mark characters of the character cells in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation method; to extract, from the text data of the contract to be extracted, the character cells whose mark characters match the type parameter of the contract information to be extracted; and to combine, according to the mark characters, the character cells whose mark characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
The memory 803 may be used to store the text data of the contract to be extracted and the type parameter obtained through the network communications port 801, as well as the instruction program on which the processor 802 operates.
In the present embodiment, the network communications port 801 may be a virtual port that is bound to different communication protocols so that it can send or receive different data. For example, the network communications port may be port 80, responsible for web data communication; port 21, responsible for FTP data communication; or port 25, responsible for mail data communication. The network communications port may also be a physical communication interface or communication chip. For example, it may be a wireless mobile network communication chip, such as GSM or CDMA; a Wi-Fi chip; or a Bluetooth chip.
In the present embodiment, the processor 802 may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, an embedded microcontroller, and so on. This specification does not limit this.
In the present embodiment, the memory 803 may exist at several levels. In a digital system, anything that can store binary data may be a memory; in an integrated circuit, a circuit that has a storage function but no physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device that has a physical form is also called a memory, such as a memory module or a TF card.
An embodiment of this specification further provides a computer storage medium based on the above method for extracting contract information. The computer storage medium stores computer program instructions which, when executed, implement: obtaining the text data of a contract to be extracted and the type parameter of the contract information to be extracted; determining, by means of a preset contract text processing model, the mark characters of the character cells in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation method; extracting, from the text data of the contract to be extracted, the character cells whose mark characters match the type parameter of the contract information to be extracted; and combining, according to the mark characters, the character cells whose mark characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
In the present embodiment, the storage medium includes but is not limited to a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a cache (Cache), a hard disk drive (Hard Disk Drive, HDD) or a memory card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface that is configured according to a standard specified by a communication protocol and used for network connection communication.
In the present embodiment, the functions and effects achieved by the program instructions stored in the computer storage medium can be understood by reference to the other embodiments and are not repeated here.
Referring to FIG. 9, at the software level, an embodiment of this specification further provides an apparatus for extracting contract information, which may include the following structural modules:
An obtaining module 91, which may be used to obtain the text data of a contract to be extracted and the type parameter of the contract information to be extracted;
A determining module 92, which may be used to determine, by means of a preset contract text processing model, the mark characters of the character cells in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation method;
An extraction module 94, which may be used to extract, from the text data of the contract to be extracted, the character cells whose mark characters match the type parameter of the contract information to be extracted;
A combining module 95, which may be used to combine, according to the mark characters, the character cells whose mark characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
In one embodiment, the apparatus may further include a model building module for establishing the preset contract text processing model. The model building module may include the following structural units:
An obtaining unit, which may be used to obtain annotated sample data;
A splitting unit, which may be used to split each sentence in the annotated sample data into a plurality of character cells, where each character cell carries corresponding annotation information determined according to a preset annotation method;
A vectorization unit, which may be used to vectorize the character cells in the sentence to obtain a plurality of word vectors of the sentence;
A determining unit, which may be used to obtain, from the word vectors of the sentence, a first coded sequence and a second coded sequence for the sentence, where the first coded sequence is the sequence of first codes corresponding to the word vectors of the sentence arranged in forward order, and the second coded sequence is the sequence of second codes corresponding to the word vectors of the sentence arranged in reverse order;
A splicing unit, which may be used to splice the first coded sequence and the second coded sequence to obtain an output result sequence, where the output result sequence includes the correspondence between mark characters and character cells.
In one embodiment, the model building module may further include an annotation unit, which may be used to retrieve the key phrases in the sample data and, for each key phrase in the sample data, annotate each character cell contained in the key phrase with the phrase type of that key phrase and the position of the character cell within the key phrase, as the annotation information of the character cell.
In one embodiment, the phrase type of a key phrase may include at least one of: company name, amount, contract award date, term of validity, contract number, contract execution date, contract expiration date, and so on. Of course, the phrase types listed above are merely illustrative. This specification does not limit this.
In one embodiment, the model building module may further include a filtering unit, which may be used to filter out the invalid phrases in the sample data, where the invalid phrases may include at least one of: personal names, place names, adjectives, auxiliary words, and so on.
In one embodiment, the determining unit may include the following structural sub-units:
A sorting sub-unit, which may be used to arrange the word vectors of the sentence in forward order to obtain the forward word vector sequence of the sentence, and to arrange the word vectors of the sentence in reverse order to obtain the reverse word vector sequence of the sentence;
A training sub-unit, which may be used to feed the forward word vector sequence and the reverse word vector sequence of the sentence, as the input of a time step, to a feature layer for training, to obtain the first coded sequence and the second coded sequence.
In one embodiment, the model building module may further include a constraint unit, which may specifically be used to input the output result sequence to a labeling layer for training to obtain a constrained output result sequence, wherein the constrained output result sequence also includes the association relationships between identification characters.
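One way to picture how association relationships between identification characters constrain the output is a transition-score table combined with best-path decoding. The sketch below uses Viterbi decoding as a stand-in technique; the function name `viterbi`, the label names, and all scores are hypothetical, and nothing here is presented as the labeling layer the specification actually trains.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions: one dict per position mapping label -> score.
    transitions: dict mapping (previous_label, current_label) -> score;
                 missing pairs score 0.0.
    """
    best = {lab: emissions[0][lab] for lab in labels}
    back = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            # Pick the predecessor that maximizes the running score.
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Walk the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a strong penalty on the transition from `O` directly to an inside-of-phrase label, the decoder prefers a path that opens the phrase with a begin label, even when the per-position scores alone would favor the invalid sequence.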
In one embodiment, the device may further include a display module, which may specifically be used to display the contract information to be extracted.
It should be noted that the units, devices, or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. For convenience of description, the above device is described as being divided into various modules by function. Of course, when implementing this specification, the functions of the modules may be implemented in one or more pieces of software and/or hardware, and a module implementing a given function may also be implemented by a combination of multiple submodules or subunits. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
Thus, in the contract information extraction device provided by the embodiments of this specification, the sample data is first annotated in a preset manner. The determining module then calls the preset contract text processing model, obtained by training on the annotated sample data, to determine the identification characters corresponding to the character units in the contract text data. The extraction module and the combination module select the character units whose identification characters match the type parameter of the information to be extracted, and combine those character units according to their identification characters to obtain the contract information required by the user. This solves the technical problem of low accuracy in extracting contract information in existing methods, and achieves the technical effect of accurately and efficiently extracting, from the text data of a contract, the contract information that meets the user's requirements. In addition, when the model building module trains and builds the preset contract text processing model, the character units of the key phrases in the sample data are annotated according to the preset annotation method to obtain the annotated sample data; the first coded sequence and the second coded sequence for each sentence are then obtained from the annotated sample data and spliced together. As a result, the preset contract text processing model has higher processing accuracy and can extract contract information by taking context into account, which further improves the accuracy of the extracted contract information.
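The selection and combination performed by the extraction module and the combination module can be sketched as below. The label format (`B-TYPE` for the first character unit of a phrase, `I-TYPE` for the following ones, `O` otherwise), the function name `extract`, and the type names are assumptions for illustration; the specification does not fix a particular encoding of identification characters.

```python
def extract(tagged_chars, type_parameter):
    """Collect the phrases whose labels match `type_parameter`.

    tagged_chars: list of (character, label) pairs, where label is
    'O' or '<position>-<type>' (hypothetical format).
    """
    results, current = [], []
    for char, label in tagged_chars:
        position, _, phrase_type = label.partition("-")
        if phrase_type != type_parameter:
            # A non-matching character unit closes any open phrase.
            if current:
                results.append("".join(current))
                current = []
            continue
        if position == "B" and current:
            # A new begin label starts a separate phrase.
            results.append("".join(current))
            current = []
        current.append(char)
    if current:
        results.append("".join(current))
    return results
```

Called with a type parameter such as `"AMOUNT"`, the function returns only the character runs labeled with that type, joined in their original order.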
Although this specification provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive means. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual device or client product executes, the steps may be executed sequentially or in parallel according to the embodiments or the methods shown in the drawings (for example, in a parallel-processor or multi-threaded environment, or even a distributed data processing environment). The terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, the presence of other identical or equivalent elements in the process, method, product, or device that includes the element is not excluded. Words such as first and second are used to denote names and do not denote any particular order.
Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices it contains for implementing various functions can also be regarded as structures within the hardware component. The devices for implementing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform specific tasks or implement specific abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of this specification, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the methods described in the embodiments of this specification or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. This specification can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
Although this specification has been described through embodiments, those of ordinary skill in the art will appreciate that it admits many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this specification.

Claims (21)

1. A method for extracting contract information, comprising:
Obtaining text data of a contract to be extracted and a type parameter of the contract information to be extracted;
Determining, by a preset contract text processing model, identification characters of the character units in the text data of the contract to be extracted, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is text data of contracts annotated according to a preset annotation method;
Extracting, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted;
Combining, according to the identification characters, the character units whose identification characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
2. The method according to claim 1, wherein the preset contract text processing model is obtained by training in the following manner:
Obtaining the annotated sample data;
Splitting each sentence in the annotated sample data into multiple character units, wherein the multiple character units each carry corresponding annotation information, and the annotation information corresponding to a character unit is determined according to the preset annotation method;
Vectorizing the multiple character units in the sentence respectively to obtain multiple word vectors in the sentence;
Obtaining, according to the multiple word vectors in the sentence, a first coded sequence and a second coded sequence for the sentence, wherein the first coded sequence is the coded sequence obtained by arranging the first encodings corresponding to the multiple word vectors in the sentence in forward order, and the second coded sequence is the coded sequence obtained by arranging the second encodings corresponding to the multiple word vectors in the sentence in reverse order;
Splicing the first coded sequence and the second coded sequence to obtain an output result sequence, wherein the output result sequence includes the correspondence between identification characters and character units.
3. The method according to claim 2, wherein the sample data is annotated in the following manner:
Retrieving the key phrases in the sample data;
For each key phrase in the sample data, annotating each character unit contained in the key phrase with the phrase type of the key phrase and the position of the character unit within the key phrase, as the annotation information of the character unit.
4. The method according to claim 3, wherein the phrase type of the key phrase includes at least one of: company name, amount, contract award date, term of validity, contract number, contract execution date, contract expiration date.
5. The method according to claim 3, wherein before retrieving the key phrases in the sample data, the method further comprises:
Filtering out invalid phrases in the sample data, wherein the invalid phrases include at least one of: names, adjectives, auxiliary words.
6. The method according to claim 2, wherein obtaining, according to the multiple word vectors in the sentence, the first coded sequence and the second coded sequence for the sentence comprises:
Arranging the multiple word vectors in the sentence in forward order to obtain the forward word vector sequence of the sentence, and arranging the multiple word vectors in the sentence in reverse order to obtain the reverse word vector sequence of the sentence;
Taking the forward word vector sequence of the sentence and the reverse word vector sequence of the sentence as the input of one time step, inputting them to a feature layer for training, and obtaining the first coded sequence and the second coded sequence.
7. The method according to claim 2, wherein after splicing the first coded sequence and the second coded sequence to obtain the output result sequence, the method further comprises:
Inputting the output result sequence to a labeling layer for training to obtain a constrained output result sequence, wherein the constrained output result sequence also includes the association relationships between identification characters.
8. The method according to claim 1, wherein after combining, according to the identification characters, the character units whose identification characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted, the method further comprises:
Displaying the contract information to be extracted.
9. A method for extracting text information, comprising:
Obtaining text data to be extracted and a type parameter of the text information to be extracted;
Determining, by a preset text processing model, identification characters of the character units in the text data to be extracted, wherein the preset text processing model is obtained by training on annotated sample data, and the annotated sample data is text data annotated according to a preset annotation method;
Extracting, from the text data to be extracted, the character units whose identification characters match the type parameter of the text information to be extracted;
Combining, according to the identification characters, the character units whose identification characters match the type parameter of the text information to be extracted, to obtain the text information to be extracted.
10. A method for building a preset contract text processing model, comprising:
Obtaining annotated sample data;
Splitting each sentence in the annotated sample data into multiple character units, wherein the multiple character units each carry corresponding annotation information, and the annotation information corresponding to a character unit is determined according to a preset annotation method;
Vectorizing the multiple character units in the sentence respectively to obtain multiple word vectors in the sentence;
Obtaining, according to the multiple word vectors in the sentence, a first coded sequence and a second coded sequence for the sentence, wherein the first coded sequence is the coded sequence obtained by arranging the first encodings corresponding to the multiple word vectors in the sentence in forward order, and the second coded sequence is the coded sequence obtained by arranging the second encodings corresponding to the multiple word vectors in the sentence in reverse order;
Splicing the first coded sequence and the second coded sequence to obtain an output result sequence, wherein the output result sequence includes the correspondence between identification characters and character units.
11. The method according to claim 10, wherein the sample data is annotated in the following manner:
Retrieving the key phrases in the sample data;
For each key phrase in the sample data, annotating each character unit contained in the key phrase with the phrase type of the key phrase and the position of the character unit within the key phrase, as the annotation information of the character unit.
12. A device for extracting contract information, comprising:
An obtaining module, configured to obtain text data of a contract to be extracted and a type parameter of the contract information to be extracted;
A determining module, configured to determine, by a preset contract text processing model, identification characters of the character units in the text data of the contract to be extracted, wherein the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is text data of contracts annotated according to a preset annotation method;
An extraction module, configured to extract, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted;
A combination module, configured to combine, according to the identification characters, the character units whose identification characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
13. The device according to claim 12, wherein the device further comprises a model building module, and the model building module comprises:
An acquiring unit, configured to obtain the annotated sample data;
A splitting unit, configured to split each sentence in the annotated sample data into multiple character units, wherein the multiple character units each carry corresponding annotation information, and the annotation information corresponding to a character unit is determined according to the preset annotation method;
A vectorization processing unit, configured to vectorize the multiple character units in the sentence respectively to obtain multiple word vectors in the sentence;
A determination unit, configured to obtain, according to the multiple word vectors in the sentence, a first coded sequence and a second coded sequence for the sentence, wherein the first coded sequence is the coded sequence obtained by arranging the first encodings corresponding to the multiple word vectors in the sentence in forward order, and the second coded sequence is the coded sequence obtained by arranging the second encodings corresponding to the multiple word vectors in the sentence in reverse order;
A concatenation unit, configured to splice the first coded sequence and the second coded sequence to obtain an output result sequence, wherein the output result sequence includes the correspondence between identification characters and character units.
14. The device according to claim 13, wherein the model building module further comprises an annotation unit, configured to retrieve the key phrases in the sample data and, for each key phrase in the sample data, annotate each character unit contained in the key phrase with the phrase type of the key phrase and the position of the character unit within the key phrase, as the annotation information of the character unit.
15. The device according to claim 14, wherein the phrase type of the key phrase includes at least one of: company name, amount, contract award date, term of validity, contract number, contract execution date, contract expiration date.
16. The device according to claim 14, wherein the model building module further comprises a filter unit, configured to filter out invalid phrases in the sample data, wherein the invalid phrases include at least one of: personal names, place names, adjectives, auxiliary words.
17. The device according to claim 13, wherein the determination unit comprises:
A sorting subunit, configured to arrange the multiple word vectors in the sentence in forward order to obtain the forward word vector sequence of the sentence, and to arrange the multiple word vectors in the sentence in reverse order to obtain the reverse word vector sequence of the sentence;
A training subunit, configured to take the forward word vector sequence of the sentence and the reverse word vector sequence of the sentence as the input of one time step, input them to a feature layer for training, and obtain the first coded sequence and the second coded sequence.
18. The device according to claim 13, wherein the model building module further comprises a constraint unit, configured to input the output result sequence to a labeling layer for training to obtain a constrained output result sequence, wherein the constrained output result sequence also includes the association relationships between identification characters.
19. The device according to claim 12, wherein the device further comprises a display module, configured to display the contract information to be extracted.
20. A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the instructions.
21. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 8.
CN201910006732.4A 2019-01-04 2019-01-04 Contract information extraction method and device and text information extraction method Active CN110020424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910006732.4A CN110020424B (en) 2019-01-04 2019-01-04 Contract information extraction method and device and text information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910006732.4A CN110020424B (en) 2019-01-04 2019-01-04 Contract information extraction method and device and text information extraction method

Publications (2)

Publication Number Publication Date
CN110020424A true CN110020424A (en) 2019-07-16
CN110020424B CN110020424B (en) 2023-10-31

Family

ID=67188726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910006732.4A Active CN110020424B (en) 2019-01-04 2019-01-04 Contract information extraction method and device and text information extraction method

Country Status (1)

Country Link
CN (1) CN110020424B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182736A1 (en) * 2004-02-18 2005-08-18 Castellanos Maria G. Method and apparatus for determining contract attributes based on language patterns
CN106776538A (en) * 2016-11-23 2017-05-31 国网福建省电力有限公司 The information extracting method of enterprise's noncanonical format document
CN107357789A (en) * 2017-07-14 2017-11-17 哈尔滨工业大学 Merge the neural machine translation method of multi-lingual coding information
CN107680580A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 Text transformation model training method and device, text conversion method and device
CN107783960A (en) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus and equipment for Extracting Information
CN108197099A (en) * 2017-12-01 2018-06-22 厦门快商通信息技术有限公司 A kind of text message extracting method and computer readable storage medium
CN108399482A (en) * 2018-01-17 2018-08-14 阿里巴巴集团控股有限公司 Appraisal procedure, device and the electronic equipment of contract

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610348A (en) * 2019-08-14 2019-12-24 深圳壹账通智能科技有限公司 Contract expiration reminding method and device, computer equipment and storage medium
CN110674254B (en) * 2019-09-24 2023-03-10 中电鸿信信息科技有限公司 Intelligent contract information extraction method based on deep learning and statistical extraction model
CN110674254A (en) * 2019-09-24 2020-01-10 江苏鸿信系统集成有限公司 Intelligent contract information extraction method based on deep learning and statistical extraction model
CN110688411A (en) * 2019-09-25 2020-01-14 北京地平线机器人技术研发有限公司 Text recognition method and device
CN110955796A (en) * 2019-11-26 2020-04-03 北京明略软件系统有限公司 Case characteristic information extraction method and device based on record information
CN110955796B (en) * 2019-11-26 2023-05-02 北京明略软件系统有限公司 Case feature information extraction method and device based on stroke information
CN111310473A (en) * 2020-02-04 2020-06-19 四川无声信息技术有限公司 Text error correction method and model training method and device thereof
CN112380869A (en) * 2020-11-12 2021-02-19 平安科技(深圳)有限公司 Crystal information retrieval method, crystal information retrieval device, electronic device and storage medium
CN112489652A (en) * 2020-12-10 2021-03-12 北京有竹居网络技术有限公司 Text acquisition method and device for voice information and storage medium
CN112507118A (en) * 2020-12-22 2021-03-16 北京百度网讯科技有限公司 Information classification and extraction method and device and electronic equipment
CN112597772A (en) * 2020-12-31 2021-04-02 讯飞智元信息科技有限公司 Hotspot information determination method, computer equipment and device
CN112883687A (en) * 2021-02-05 2021-06-01 北京科技大学 Law contract interactive labeling method based on contract text markup language
CN113065343A (en) * 2021-03-25 2021-07-02 天津大学 Enterprise research and development resource information modeling method based on semantics
CN113177401A (en) * 2021-04-25 2021-07-27 鼎富智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114840634A (en) * 2022-07-04 2022-08-02 中关村科学城城市大脑股份有限公司 Information storage method and device, electronic equipment and computer readable medium
CN114840634B (en) * 2022-07-04 2022-09-20 中关村科学城城市大脑股份有限公司 Information storage method and device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
CN110020424B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN110020424A (en) Extracting method, the extracting method of device and text information of contract information
CN107766371B (en) Text information classification method and device
CN106407178A (en) Session abstract generation method and device
CN108446286A (en) A kind of generation method, device and the server of the answer of natural language question sentence
CN107704453A (en) A kind of word semantic analysis, word semantic analysis terminal and storage medium
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN110462613B (en) Automatically generating documents
CN106682387A (en) Method and device used for outputting information
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN109783624A (en) Answer generation method, device and the intelligent conversational system in knowledge based library
CN105740227A (en) Genetic simulated annealing method for solving new words in Chinese segmentation
CN109815486A (en) Spatial term method, apparatus, equipment and readable storage medium storing program for executing
CN108846138A (en) A kind of the problem of fusion answer information disaggregated model construction method, device and medium
CN111798118B (en) Enterprise operation risk monitoring method and device
CN104731874A (en) Evaluation information generation method and device
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN110032736A (en) A kind of text analyzing method, apparatus and storage medium
CN109063772A (en) A kind of image individuation semantic analysis, device and equipment based on deep learning
CN108055192A (en) Group's generation method, apparatus and system
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN113609833B (en) Dynamic file generation method and device, computer equipment and storage medium
CN110209821A (en) Text categories determine method and apparatus
Staab Human language technologies for knowledge management
CN115496830A (en) Method and device for generating product demand flow chart

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201014

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201014

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: Greater Cayman, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant