Detailed Description
To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the accompanying drawings of those embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this specification, without creative effort, shall fall within the scope of protection of this specification.
Most existing recognition models for extracting text information require technical staff to manually define suitable binary feature functions in order to extract features. Contract text, however, differs from ordinary text files in that it has a certain professional specificity, which places high demands on technical staff: they need programming knowledge, legal knowledge, and experience in handling legal documents at the same time. Because different technical staff differ in their knowledge and background experience, errors easily arise when they define binary feature functions, so the defined functions are often not accurate enough or are unsuitable for legal documents such as contracts. As a result, the established recognition model has poor precision and often cannot accurately extract the contract information required by the user.
To address the root cause of the above problem, this specification proposes first annotating, according to a preset annotation method, each character unit of the key phrases in sample data (such as sample contracts) to obtain annotated sample data, and then training on the annotated sample data to obtain a preset contract text processing model that has high precision for contract text data. The preset contract text processing model can then be used to determine the identification characters of the character units in the text data of a contract to be extracted. According to the type parameter of the contract information the user wants to extract, the matching character units are selected and combined according to their identification characters to obtain the contract information required by the user. Binary feature functions therefore no longer need to be defined manually, which solves the technical problem in existing methods that contract information is extracted with low accuracy, and achieves the technical effect of accurately and efficiently extracting contract information that meets the user's requirements from the text data of a contract.
The embodiments of this specification provide a method for extracting contract information. The method may be applied to a server in a legal work platform that is responsible for responding to user requests by finding and extracting the corresponding contract information from contract text data for the user.
Specifically, the server may be used to: receive and respond to a user's extraction request, and obtain the text data of the contract to be extracted and the type parameter of the contract information to be extracted; determine, by means of a preset contract text processing model, the identification characters of the character units in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation method; extract, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted; and combine those matching character units according to their identification characters to obtain the contract information to be extracted.
In this embodiment, the server may be a back-end service server on the background processing system side of the legal work platform that implements functions such as data transmission and data processing. Specifically, the server may be an electronic device with data computation, storage, and network interaction functions, or a software program that runs on such an electronic device and supports data processing, storage, and network interaction. This embodiment does not limit the number of servers: the server may be a single server, several servers, or a server cluster formed by several servers.
In one example scenario, referring to Fig. 1, a user can quickly obtain the required contract information through a server that applies the contract information extraction method provided by the embodiments of this specification.
Specifically, the user may convert the contract documents relating to the user's company into electronic contract text data in advance, and upload and store that data in the database of the legal work platform, so that the contract text data can be conveniently retrieved and managed later.
Suppose the user now wants to obtain, from the text data of contract B, the term of validity of the service agreed with Dali Company (大力公司), in order to manage the time limit of the service provided by Dali Company. The user can then send an extraction request through a client (for example, the laptop the user is using) to the server in the legal work platform that is responsible for extracting contract information. The extraction request may carry the contract name or contract number of the contract to be extracted, together with the type parameter of the contract information to be extracted.
Here, the type of contract information may be understood as a classification type designed around the data characteristics of contract text data as a kind of legal data, combined with how people habitually search for and use the relevant information in such data. Specifically, the type of contract information may include one or more of the following: Party A, Party B, company name, amount, contract signing date, term of validity, contract number, contract execution date, contract expiration date, and so on. It should be noted that the types listed above are given only to better illustrate the embodiments of this specification. In a specific implementation, other information types that people pay close attention to, such as address, telephone number, or liquidated damages, may also be introduced as types of contract information, depending on the circumstances.
The type parameter of the contract information to be extracted may be understood as an identification parameter that indicates the type of the contract information to be extracted. Specifically, it may be the type name of the contract information to be extracted, or a characteristic character string that corresponds to that type name and that the server can recognize and interpret. This specification does not limit the specific form or content of the type parameter.
For example, " term of validity of service " conduct can be directly arranged in user in extracting request in this Scene case
" FWDYXQX " character can also be arranged in the type parameter of above-mentioned contract information to be extracted in extracting request, i.e., to be extracted
The first letter of pinyin of the title of the type of contract information, the type parameter etc. as contract information to be extracted.In addition, user is also
The contract number or contract title of the contract can be set in extracting request, to indicate contract to be extracted.
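As a rough illustration of the request described above, the following Python sketch models an extraction request and resolves its type parameter to an internal type code. The `ExtractionRequest` class, the `TYPE_CODES` table, and the `resolve_type_code` function are hypothetical names introduced only for this example; the specification does not prescribe any particular data structure. The code "VAL" for the term of validity of service follows the example used later in this description.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table (an assumption, not part of the specification):
# a type name and its pinyin initials resolve to the same internal type code.
TYPE_CODES = {
    "term of validity of service": "VAL",
    "FWDYXQX": "VAL",
    "company name": "COM",
}


@dataclass
class ExtractionRequest:
    """Illustrative shape of an extraction request sent by the client."""
    type_parameter: str
    contract_name: Optional[str] = None
    contract_number: Optional[str] = None


def resolve_type_code(request: ExtractionRequest) -> str:
    """Map the request's type parameter to a type code, or fail loudly."""
    key = request.type_parameter.strip()
    # Try the key as given first (pinyin initials), then lower-cased (type names).
    code = TYPE_CODES.get(key) or TYPE_CODES.get(key.lower())
    if code is None:
        raise ValueError(f"unrecognized type parameter: {key!r}")
    return code


by_name = ExtractionRequest("Term of validity of service", contract_name="B")
by_initials = ExtractionRequest("FWDYXQX", contract_name="B")
print(resolve_type_code(by_name), resolve_type_code(by_initials))
```

Both forms of the type parameter resolve to the same code here, which is one simple way a server could accept either a type name or a characteristic character string.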
After receiving the user's extraction request, the server may first parse the request to obtain the contract name or number carried in it and the type parameter of the contract information to be extracted that the user has set. The server may then use the contract name or number to retrieve the text data of the corresponding contract stored in the database of the legal work platform as the text data of the contract to be extracted. Next, it calls the pre-trained preset contract text processing model to process that text data and determine the identification characters of the character units in it. It then selects and extracts the matching character units according to the type parameter of the contract information to be extracted, and combines those character units according to their identification characters, thereby restoring the contract information the user wants to extract.
Here, a character unit may be understood as a data unit that represents text information in contract text data. Specifically, a character unit may be a single Chinese character, such as "甲"; a word, such as "Date"; or a number, such as "2018". It should be noted that the character units listed above are given only to better illustrate the embodiments of this specification. This specification does not limit the specific form or content of a character unit.
An identification character may be understood as a combination of characters that indicates both the information type, within contract text data, of the phrase in which the corresponding character unit is located, and the position of that character unit within the phrase. Specifically, an identification character may include two parts: one part indicates the information type of the phrase in which the character unit is located, and the other part indicates the position of the character unit within that phrase.
For example, the phrase "Dali Company" (大力公司) contains four character units: "大", "力", "公" and "司". The information type corresponding to this phrase is company name, which may be denoted COM according to a preset rule. The position of each of the four character units in the phrase is then determined. The character unit "大" is at the start position of the phrase, that is, it is the start character, and may be denoted "B" according to the preset rule. The character units "力" and "公" are at middle positions of the phrase, that is, they are middle characters, and may be denoted "M1" and "M2" respectively. The character unit "司" is at the end position of the phrase, that is, it is the end character, and may be denoted "E". Combining the information type of the phrase in which each character unit is located with the position of the character unit in that phrase, the identification characters of the four character units are "B-COM", "M1-COM", "M2-COM" and "E-COM" respectively. It should be noted that these identification characters are given only to better illustrate the embodiments of this specification. In a specific implementation, characters of other types or forms may be used as identification characters, depending on the circumstances. This specification does not limit this.
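The naming rule in the example above can be sketched in a few lines of Python. The function name `label_phrase` is hypothetical, and the characters 大, 力, 公, 司 are used here as the four character units of the company-name example. The source only illustrates phrases with a start, middle, and end unit, so this sketch assumes a phrase has at least two units and rejects anything shorter.

```python
def label_phrase(units, type_code):
    """Pair each character unit of one phrase with a '<position>-<type>' tag.

    Assumption for this sketch: a phrase has at least two units (a start and
    an end); how a single-unit phrase would be tagged is not specified.
    """
    if len(units) < 2:
        raise ValueError("this sketch assumes a phrase of at least two units")
    tags = [f"B-{type_code}"]                      # start character
    for i in range(1, len(units) - 1):
        tags.append(f"M{i}-{type_code}")           # middle characters M1, M2, ...
    tags.append(f"E-{type_code}")                  # end character
    return list(zip(units, tags))


for unit, tag in label_phrase(["大", "力", "公", "司"], "COM"):
    print(unit, tag)
```

The same function applied to a six-unit phrase yields middle tags M1 through M4, matching the date example used later in this description.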
The preset contract text processing model may be understood as a neural network model that the server obtains in advance by performing forward training and reverse training on a large amount of sample contract text data annotated according to the preset annotation method, and that can be used to identify and determine the identification characters corresponding to the character units in contract text data.
In this example scenario, the server may feed the text data of the contract to be extracted into the preset contract text processing model as the model input, and obtain as the output the identification characters of the character units in that text data. The server may then, according to the type parameter of the contract information to be extracted, search the text data of the contract to be extracted for the character units whose identification characters match that type parameter, and take them as the matching character units.
For example, according to the type parameter "term of validity of service", the server may search the identification characters carried by the character units in the text data of the contract to be extracted, and determine that the identification characters matching "term of validity of service" are "B-VAL", "M1-VAL", "M2-VAL", "M3-VAL", "M4-VAL" and "E-VAL" (VAL being the characters used, according to the preset rule, to represent the term of validity of service). The six character units that carry these identification characters, namely "2018", "年" (year), "12", "月" (month), "30" and "日" (day), can then be determined as the character units matching "term of validity of service".
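The selection step in this example can be sketched as a simple filter over tagged character units. The names `split_tag` and `select_matching` are hypothetical, and the input list stands in for the model output, which in practice would come from the preset contract text processing model. This sketch matches on exact equality of the type code only.

```python
def split_tag(tag):
    """Split a tag such as 'M1-VAL' into its position part and type part."""
    position, _, type_code = tag.partition("-")
    return position, type_code


def select_matching(tagged_units, type_code):
    """Keep only the character units whose tag carries the requested type."""
    return [(unit, tag) for unit, tag in tagged_units
            if split_tag(tag)[1] == type_code]


# Stand-in for model output: (character unit, identification character) pairs.
tagged = [
    ("大", "B-COM"), ("力", "M1-COM"), ("公", "M2-COM"), ("司", "E-COM"),
    ("2018", "B-VAL"), ("年", "M1-VAL"), ("12", "M2-VAL"),
    ("月", "M3-VAL"), ("30", "M4-VAL"), ("日", "E-VAL"),
]
print([unit for unit, _ in select_matching(tagged, "VAL")])
```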
After determining the matching character units, the server may splice and combine them in the corresponding positional order, according to the position features represented by the identification characters they carry, to obtain a complete phrase that conveys a specific meaning, and take that phrase as the contract information to be extracted.
For example, consider the identification characters "B-VAL", "M1-VAL", "M2-VAL", "M3-VAL", "M4-VAL" and "E-VAL" carried by the character units "2018", "年", "12", "月", "30" and "日". From the parts of these identification characters that represent position, namely "B", "M1", "M2", "M3", "M4" and "E", the server can determine that the character unit "2018" is the start character of the phrase and place it at the beginning of the phrase. The character unit "年" is a middle character of the phrase, located at the first middle position, so it can be connected after the start character "2018". In the same way, the character unit "12" is connected after "年", "月" after "12", "30" after "月", and "日" after "30". After connecting the character unit "日", the server recognizes that the position part of the identification character carried by "日" is "E" and can determine that this character unit is the end character of the phrase. It can therefore determine that the connection is complete once this character unit has been connected, and take the resulting combination, the phrase "2018年12月30日" (December 30, 2018), as the contract information of the type the user asked to extract, that is, as the contract information to be extracted.
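The step-by-step connection just described can be sketched as follows. The function name `assemble_phrase` is hypothetical, and the sketch assumes that the matched character units belong to a single phrase, so each position tag appears at most once.

```python
def assemble_phrase(tagged_units):
    """Connect units in the order B, M1, M2, ..., E and return the phrase.

    Assumption for this sketch: the input holds exactly one phrase, so every
    position part ('B', 'M1', ..., 'E') occurs at most once.
    """
    by_position = {tag.split("-", 1)[0]: unit for unit, tag in tagged_units}
    if "B" not in by_position or "E" not in by_position:
        raise ValueError("a phrase needs both a start (B) and an end (E) unit")
    parts = [by_position["B"]]              # start character goes first
    index = 1
    while f"M{index}" in by_position:       # follow M1, M2, ... in order
        parts.append(by_position[f"M{index}"])
        index += 1
    parts.append(by_position["E"])          # the end character closes the phrase
    return "".join(parts)


matched = [
    ("2018", "B-VAL"), ("年", "M1-VAL"), ("12", "M2-VAL"),
    ("月", "M3-VAL"), ("30", "M4-VAL"), ("日", "E-VAL"),
]
print(assemble_phrase(matched))
```

Because the units are looked up by position rather than by their order in the input, the same result is obtained even if the matched units arrive in a different order.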
The server can then respond to the user's extraction request and present the determined contract information to the user. In this way, the user can easily obtain the desired contract information from the text data of a contract containing a large volume of data.
As the above example scenario shows, the contract information extraction method provided in this specification first annotates sample data in a preset manner, and then uses a preset contract text processing model, trained on the annotated sample data, to determine the identification characters corresponding to the character units in contract text data. It selects the character units whose identification characters match the type parameter of the information to be extracted and combines those character units according to their identification characters to obtain the contract information required by the user. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting contract information that meets the user's requirements from the text data of a contract.
In another example scenario, the server may perform learning and training in advance on sample data annotated according to the preset annotation method, so as to establish the preset contract text processing model used to identify and determine the identification characters corresponding to the character units in contract text data.
In a specific implementation, the above output result sequence may be fed as input into a CRF (conditional random field) layer, which serves as a labeling layer, for training, so as to obtain a constrained output result sequence and thereby establish a preset contract text processing model with relatively higher precision.
As the above example scenario shows, the contract information extraction method provided in this specification first annotates sample data in a preset manner, and then uses a preset contract text processing model, trained on the annotated sample data, to determine the identification characters corresponding to the character units in contract text data. It selects the character units whose identification characters match the type parameter of the information to be extracted and combines those character units according to their identification characters to obtain the contract information required by the user. This solves the technical problem that existing methods extract contract information with low accuracy, and achieves the technical effect of accurately and efficiently extracting contract information that meets the user's requirements from the text data of a contract. In addition, when the preset contract text processing model is trained, the character units of the key phrases in the sample data are annotated according to the preset annotation method to obtain annotated sample data; a first coded sequence and a second coded sequence for each sentence are then obtained from the annotated sample data and spliced together. The resulting model therefore has higher processing precision and can extract contract information in combination with context information, which further improves the accuracy of the extracted contract information.
Referring to Fig. 5, the embodiments of this specification provide a method for extracting contract information, where the method is applied on the server side. In a specific implementation, the method may include the following steps:
S51: Obtain the text data of the contract to be extracted and the type parameter of the contract information to be extracted.
In this embodiment, the type of contract information may be understood as a classification type designed around the data characteristics of contract text data as a kind of legal data, combined with how people habitually search for and use the relevant information in such data. Specifically, the type of contract information may include one or more of the following: Party A, Party B, company name, amount, contract signing date, term of validity, contract number, contract execution date, contract expiration date, and so on. It should be noted that the types listed above are given only to better illustrate the embodiments of this specification. In a specific implementation, other information types that people pay close attention to, such as address, telephone number, or liquidated damages, may also be introduced as types of contract information, depending on the circumstances.
In this embodiment, the type parameter of the contract information to be extracted may be understood as an identification parameter that indicates the type of the contract information to be extracted.
In this embodiment, the server may directly obtain the text data of the contract to be extracted and the type parameter of the contract information to be extracted that the user inputs at the client. Alternatively, the server may obtain the contract name or contract number of the contract to be extracted that the user inputs at the client, together with the type parameter set by the user, and then search the database that stores the text data of electronic contracts according to the contract name or contract number, which likewise yields the text data of the contract to be extracted.
S53: Determine, by means of a preset contract text processing model, the identification characters of the character units in the text data of the contract to be extracted, where the preset contract text processing model is obtained by training on annotated sample data, and the annotated sample data is contract text data annotated according to a preset annotation method.
In this embodiment, the preset contract text processing model may be understood as a neural network model that the server obtains in advance by performing forward training and reverse training on a large amount of sample contract text data annotated according to the preset annotation method, and that can be used to identify and determine the identification characters corresponding to the character units in contract text data.
In this embodiment, a character unit may be understood as a data unit that represents text information in contract text data. Specifically, a character unit may be a single Chinese character, a word, a number, and so on. It should be noted that the character units listed above are given only to better illustrate the embodiments of this specification. Depending on the circumstances, characters of other forms or content may also be introduced as character units. For example, certain punctuation marks that carry a specific meaning in contract text may also be treated as character units. This specification does not limit this.
In this embodiment, an identification character may be understood as a combination of characters that indicates both the information type, within contract text data, of the phrase in which the corresponding character unit is located, and the position of that character unit within the phrase. Specifically, an identification character may include two parts: one part indicates the information type of the phrase in which the character unit is located, and the other part indicates the position of the character unit within that phrase.
In this embodiment, the annotated sample data may be obtained by annotating according to the following preset annotation method: retrieve the key phrases in the sample data; then, for each character unit contained in a key phrase, annotate the phrase type of the key phrase in which it is located (that is, the information type represented by that key phrase) and the position of the character unit within the key phrase, as the annotation information of that character unit. Specifically, the annotation information may be attached to the corresponding character unit in the form of a tag, so that the character unit carries its annotation information.
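A minimal sketch of this annotation method is given below. It assumes the sentence has already been split into character units and that the key phrases and their type codes are supplied by hand; the function name `annotate`, the sample sentence, and the choice of `None` for units outside any key phrase are all assumptions made for illustration, since the specification does not say how such units are marked.

```python
def annotate(units, key_phrases):
    """Tag every unit that belongs to a key phrase; leave other units as None.

    units:       list of character units for one sentence
    key_phrases: list of (phrase_units, type_code) pairs to look for
    """
    tags = [None] * len(units)
    for phrase_units, type_code in key_phrases:
        n = len(phrase_units)
        if n < 2:
            continue  # single-unit phrases are outside the scope of this sketch
        for start in range(len(units) - n + 1):
            if units[start:start + n] == phrase_units:
                tags[start] = f"B-{type_code}"
                for k in range(1, n - 1):
                    tags[start + k] = f"M{k}-{type_code}"
                tags[start + n - 1] = f"E-{type_code}"
    return list(zip(units, tags))


# Hypothetical sample sentence, already split into character units.
sentence = ["与", "大", "力", "公", "司", "签", "约"]
annotated = annotate(sentence, [(["大", "力", "公", "司"], "COM")])
print(annotated)
```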
In this embodiment, in a specific implementation, the text data of the contract to be extracted may be fed into the preset contract text processing model as the model input, so as to determine the identification characters corresponding to the character units in the text data of the contract to be extracted.
S55: Extract, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted.
In this embodiment, an identification character matching the type parameter of the contract information to be extracted may be understood to mean that the type indicated by the part of the identification character that indicates the phrase type is identical to the type indicated by the type parameter, or that the difference between the two is less than a preset difference threshold.
For example, suppose the identification characters of three character units are as follows: identification character 1 is "B-COM", identification character 2 is "E-COM", and identification character 3 is "M-AMT". Decomposing these identification characters shows that the part of identification character 1 that indicates the phrase type is "COM", which indicates the type company name. Similarly, the part of identification character 2 that indicates the phrase type also indicates company name. The part of identification character 3 that indicates the phrase type is "AMT", which indicates the type amount. The information type indicated by the obtained type parameter of the contract information to be extracted is company name, which is identical to the type indicated by the phrase-type part of identification character 1 and of identification character 2. Therefore, identification character 1 and identification character 2 can both be determined as matching the type parameter of the contract information to be extracted.
In this embodiment, a character unit that matches the type parameter of the contract information to be extracted may be understood as a matching character unit, that is, a character unit that needs to be extracted subsequently.
In this embodiment, extracting, from the text data of the contract to be extracted, the character units whose identification characters match the type parameter of the contract information to be extracted may, in a specific implementation, include the following: retrieve the identification characters of the character units in the text data of the contract to be extracted, find the identification characters that match the type parameter of the contract information to be extracted, and determine the character units that carry those identification characters as the matching character units.
S57: Combine, according to the identification characters, the character units whose identification characters match the type parameter of the contract information to be extracted, to obtain the contract information to be extracted.
In this embodiment, combining the matching character units according to their identification characters to obtain the contract information to be extracted may, in a specific implementation, include the following: retrieve the identification characters of the matching character units, and divide the character units whose identification characters indicate the same phrase type into one group; then, for the character units in the same group, splice and combine them according to the position features indicated by the position parts of their identification characters, to obtain a phrase that conveys a meaning, and take that phrase as one item of contract information to be extracted.
For example, suppose the following matching character units have been obtained: "大" (identification character B-COM), "公" (identification character M2-COM), "500" (identification character B-AMT), "力" (identification character M1-COM), "元" (yuan; identification character E-AMT) and "司" (identification character E-COM). These matching character units may first be divided into groups according to the part of each identification character that represents the phrase type. Specifically, the character units "大", "公", "力" and "司", whose phrase-type part is "COM", are placed in the first group, and the character units "500" and "元", whose phrase-type part is "AMT", are placed in the second group. Then, for each group, the character units in the group are spliced and combined according to the position part of their identification characters to obtain the corresponding phrase. For the first group, the position parts of the identification characters of "大", "公", "力" and "司" are "B", "M2", "M1" and "E" respectively, so the position features of these four character units are, respectively, the start position, the second middle position, the first middle position, and the end position. According to these position features, the character unit "大" at the start position is placed at the beginning; the character unit "力" at the first middle position is connected immediately after "大"; the character unit "公" at the second middle position is connected immediately after "力"; and the character unit "司" at the end position is connected immediately after "公". Because the position part of the identification character of "司" is "E", which corresponds to the end position, no other character unit of the same phrase follows it. Therefore, after the character unit "司" has been connected, the connection of the phrase corresponding to the first group can be judged complete, and the combination of connected character units "大-力-公-司", that is, "大力公司" (Dali Company), can be determined as a phrase and taken as the corresponding contract information to be extracted.
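The grouping and ordering in this example can be sketched as follows. The names `position_rank` and `combine_by_type` are hypothetical, and the sketch assumes at most one phrase per type among the matched character units, as in the example above.

```python
from collections import defaultdict


def position_rank(position):
    """Turn a position part into a sortable rank: B first, M1, M2, ..., E last."""
    if position == "B":
        return 0
    if position == "E":
        return float("inf")
    return int(position[1:])  # "M1" -> 1, "M2" -> 2, ...


def combine_by_type(tagged_units):
    """Group units by phrase type, then order each group by position and join."""
    groups = defaultdict(list)
    for unit, tag in tagged_units:
        position, type_code = tag.split("-", 1)
        groups[type_code].append((position_rank(position), unit))
    return {
        type_code: "".join(unit for _, unit in sorted(items, key=lambda item: item[0]))
        for type_code, items in groups.items()
    }


# The matched units arrive in no particular order, as in the example above.
shuffled = [
    ("大", "B-COM"), ("公", "M2-COM"), ("500", "B-AMT"),
    ("力", "M1-COM"), ("元", "E-AMT"), ("司", "E-COM"),
]
print(combine_by_type(shuffled))
```

Sorting by rank is a shortcut for the unit-by-unit connection described in the text; it gives the same phrase whenever each group contains exactly one start unit, one end unit, and consecutively numbered middle units.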
In this embodiment, after the matching character units have been combined according to their identification characters to obtain the contract information to be extracted, the method may, in a specific implementation, further include: displaying the contract information to be extracted.
In this embodiment, the server may directly display to the user the determined contract information that meets the user's requirements. Alternatively, it may first send the contract information to the client, and the client then displays the requested contract information to the user.
In the present embodiment, the sample data is first marked according to a preset method; the preset contract text
processing model, obtained by training on the marked sample data, is then used to determine the mark character
corresponding to each character cell in the contract text data; the character cells whose mark characters match
the type parameter of the information to be extracted are selected and combined according to their mark
characters, so as to obtain the contract information required by the user. This solves the technical problem that
existing methods extract contract information with low accuracy, and achieves the technical effect of accurately
and efficiently extracting, from the text data of a contract, the contract information that meets the user's
requirement.
In one embodiment, the preset contract text processing model may be obtained by training in the following manner:
obtaining the marked sample data; splitting each sentence in the marked sample data into multiple character cells,
wherein each character cell carries corresponding markup information determined according to the preset notation
method; vectorizing the character cells in the sentence to obtain multiple word vectors for the sentence;
obtaining, from the word vectors in the sentence, a first coded sequence and a second coded sequence for the
sentence, wherein the first coded sequence is obtained by arranging the first codes corresponding to the word
vectors in forward order, and the second coded sequence is obtained by arranging the second codes corresponding to
the word vectors in reverse order; and splicing the first coded sequence and the second coded sequence to obtain
an output result sequence, wherein the output result sequence contains the correspondence between mark characters
and character cells.
In the present embodiment, in a specific implementation, the text data of a certain number of contracts may be
obtained as sample data, and the character cells in the sample data may be marked according to the preset notation
method to obtain the marked sample data.
In one embodiment, in a specific implementation, the sample data may be marked in the following manner (i.e. the
preset notation method): retrieving the crucial phrases in the sample data; and, for each crucial phrase in the
sample data, marking each character cell contained in the crucial phrase with the type of word-combination of the
crucial phrase it belongs to and with its position in that crucial phrase, as the markup information of the
character cell.
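The notation method above can be sketched as follows. This is a minimal illustration only: the function name `annotate`, the "<type>-<position>" mark format, the type codes "AMT" and "COM", and the use of the mark "O" for characters outside every crucial phrase are all assumptions made for the example.

```python
def annotate(sentence, key_phrases):
    """Return one (character_cell, mark) pair per character of `sentence`.

    `key_phrases` maps a crucial phrase to a type code, e.g. {"ACME": "COM"}.
    Marks use the hypothetical format "<type>-<position>", where position is
    "B" (start), "M<k>" (k-th middle position) or "E" (end).
    """
    marks = ["O"] * len(sentence)          # default: not part of a crucial phrase
    for phrase, phrase_type in key_phrases.items():
        if not phrase:
            continue                        # an empty phrase cannot be marked
        start = sentence.find(phrase)
        while start != -1:
            end = start + len(phrase) - 1
            for offset in range(len(phrase)):
                if offset == 0:
                    position = "B"
                elif start + offset == end:
                    position = "E"
                else:
                    position = f"M{offset}"
                marks[start + offset] = f"{phrase_type}-{position}"
            start = sentence.find(phrase, end + 1)   # look for further occurrences
    return list(zip(sentence, marks))


tagged = annotate("Pay 5000 to ACME", {"5000": "AMT", "ACME": "COM"})
print([mark for _, mark in tagged[4:8]])
# ['AMT-B', 'AMT-M1', 'AMT-M2', 'AMT-E']
```

In this sketch a one-character crucial phrase receives only the start mark; a real notation scheme would need to decide how such phrases are marked.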
In the present embodiment, a crucial phrase can be understood as a phrase in the contract text data that
corresponds to contract information to which people pay more attention, i.e. information that is searched for or
used with a frequency greater than a preset frequency threshold. Specifically, a crucial phrase may be a company
name appearing in the contract text, such as "company energetically"; a specific amount appearing in the contract
text, such as "5000 yuan"; or a contract award date appearing in the contract text, such as "January 1, 2017";
and so on.
In the present embodiment, it should be added that the type of word-combination of a crucial phrase corresponds to
the information type of the contract information that the crucial phrase characterizes.
In the present embodiment, the type of word-combination of the crucial phrase may include at least one of the
following: company name, amount, contract award date, term of validity, contract number, contract execution date,
contract expiration date, etc. Of course, it should be noted that the types of word-combination listed above are
given only to better illustrate the embodiments of this specification. In a specific implementation, other types
of word-combination related to contract text data may also be introduced as the type of word-combination of the
crucial phrase, as the case may be. This specification places no limitation on this.
In the present embodiment, after the markup information of each character cell in a crucial phrase has been
determined, the markup information may further be attached to the corresponding character cell as a tag, thereby
completing the marking of the character cell, so that the character cell carries its corresponding markup
information and the marked sample data is obtained.
In the present embodiment, splitting each sentence in the marked sample data into multiple character cells may, in
a specific implementation, include: processing the marked sample data sentence by sentence. Specifically, the
marked sample data may first be split into multiple sentences; then, taking the sentence as the unit, each
sentence is further split into multiple character cells.
In the present embodiment, in a specific implementation, the preset punctuation marks in the marked sample data
may be retrieved, and the marked sample data may be split into multiple sentences according to those preset
punctuation marks. The preset punctuation marks may include at least one of the following: full stop, exclamation
mark, question mark, semicolon, etc. Of course, it should be noted that the preset punctuation marks listed above
are only an illustrative example. This specification places no limitation on the types of preset punctuation
marks.
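The two-stage split described above (first into sentences at preset punctuation marks, then into character cells) can be sketched as follows. The constant `PRESET_PUNCTUATION`, the function names, and the decision to drop whitespace are assumptions made for this illustration only.

```python
import re

# Hypothetical preset punctuation: full-width and ASCII full stop,
# exclamation mark, question mark and semicolon.
PRESET_PUNCTUATION = "。！？；.!?;"


def split_sentences(text, punctuation=PRESET_PUNCTUATION):
    """Split text into sentences at any of the preset punctuation marks."""
    pattern = "[" + re.escape(punctuation) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def split_characters(sentence):
    """Split one sentence into character cells, skipping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def split_into_cells(text):
    """Sentence by sentence, return the character cells of the text."""
    return [split_characters(sentence) for sentence in split_sentences(text)]


print(split_sentences("Party A pays 5000 yuan; Party B delivers! Done?"))
# ['Party A pays 5000 yuan', 'Party B delivers', 'Done']
```

Treating the ASCII full stop as a sentence boundary would also split decimal amounts such as "5000.00", so a practical configuration might leave it out of the preset set.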
In one embodiment, vectorizing the character cells in the sentence to obtain multiple word vectors for the
sentence may, in a specific implementation, include: feeding the character cells in the sentence as input to an
Embedding layer, which performs the vectorization and outputs the corresponding word vectors.
In one embodiment, it is further considered that the marked sample data may contain character cells that
represent Chinese characters, and that a computer often cannot vectorize such character cells directly. Therefore,
before the character cells are vectorized, the method may, in a specific implementation, further include:
extracting the character cells in the sentence that represent Chinese characters; applying one-hot encoding to
these character cells to obtain, for each of them, a character code that the computer can process; and then
feeding the other character cells in the sentence, together with the character codes of the character cells that
represent Chinese characters, as input to the Embedding layer to obtain the corresponding word vectors.
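The one-hot encoding and embedding lookup described above can be sketched in plain Python as follows. The function names, the vector dimension and the random initialisation are assumptions for illustration; in an actual model the embedding table would be learned during training rather than drawn at random.

```python
import random


def build_vocabulary(sentences):
    """Assign each distinct character cell an integer index."""
    vocab = {}
    for sentence in sentences:
        for cell in sentence:
            vocab.setdefault(cell, len(vocab))
    return vocab


def one_hot(index, size):
    """Return a one-hot code: all zeros except a one at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def make_embedding_table(vocab_size, dim, seed=0):
    """Create a randomly initialised embedding table (one row per cell)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(cell, vocab, table):
    """Map a character cell to its word vector via its one-hot code."""
    code = one_hot(vocab[cell], len(vocab))
    dim = len(table[0])
    # Multiplying the one-hot code by the table selects exactly one row.
    return [sum(code[i] * table[i][j] for i in range(len(code))) for j in range(dim)]


vocab = build_vocabulary(["合同金额", "合同编号"])
table = make_embedding_table(len(vocab), dim=4)
vector = embed("金", vocab, table)
print(len(vocab), len(vector))  # 6 4
```

Because the one-hot code has a single non-zero entry, the multiplication is equivalent to looking up one row of the table, which is how embedding layers are usually implemented in practice.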
In one embodiment, obtaining the first coded sequence and the second coded sequence for the sentence from the word
vectors in the sentence may, in a specific implementation, include: arranging the word vectors in the sentence in
forward order (for example, from left to right) to obtain the forward word vector sequence of the sentence, and
arranging the word vectors in the sentence in reverse order (for example, from right to left) to obtain the
reverse word vector sequence of the sentence; and feeding the forward word vector sequence and the reverse word
vector sequence of the sentence, as the input of each time step, to a feature layer for training, to obtain the
first coded sequence and the second coded sequence.
In the present embodiment, the first coded sequence (also called the forward LSTM) can be understood as the coded
sequence obtained by arranging the first codes corresponding to the word vectors in the sentence in forward order
(for example, from left to right), and the second coded sequence (also called the reverse LSTM) can be understood
as the coded sequence obtained by arranging the second codes corresponding to the word vectors in the sentence in
reverse order (for example, from right to left). Here, a first code can be understood as one kind of coded data
characterizing the features of the corresponding word vector, and a second code as another kind of coded data
characterizing the features of the corresponding word vector.
In the present embodiment, after the forward word vector sequence and the reverse word vector sequence of the
sentence have been obtained, the word vector sequences of these two directions, which describe the sentence from
different word-order perspectives, may be fed as the input of each time step to a feature layer such as a Bi-LSTM
layer for feature extraction, to obtain the corresponding first coded sequence and second coded sequence.
In the present embodiment, splicing the first coded sequence and the second coded sequence to obtain the output
result sequence may, in a specific implementation, include: joining together the first code and the second code
that correspond to the same character cell in the first coded sequence and the second coded sequence, thereby
completing the splicing and obtaining a complete hidden state sequence as the output result sequence, so that the
preset contract text processing model is established. The output result sequence contains the correspondence
between mark characters and character cells, so the model can subsequently determine the mark character
corresponding to a character cell based on this correspondence.
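The forward pass, reverse pass and per-character splicing described above can be sketched as follows. Note the simplification: this sketch uses a plain tanh recurrent cell as a stand-in for an LSTM cell, with fixed hand-written weights instead of trained ones, so it illustrates only the data flow of a bidirectional feature layer. The function names and the weight layout are assumptions.

```python
import math


def run_direction(vectors, w_in, w_rec):
    """Encode a sequence with a simple tanh recurrent cell (LSTM stand-in).

    w_in has shape [hidden][input]; w_rec has shape [hidden][hidden].
    Returns one hidden state per input vector, in the order given.
    """
    hidden = [0.0] * len(w_rec)
    states = []
    for x in vectors:
        hidden = [
            math.tanh(
                sum(w_in[k][i] * x[i] for i in range(len(x)))
                + sum(w_rec[k][j] * hidden[j] for j in range(len(hidden)))
            )
            for k in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def encode_bidirectional(vectors, fwd_params, bwd_params):
    """Return the spliced forward and reverse codes for every character cell."""
    first = run_direction(vectors, *fwd_params)                # forward order
    second = run_direction(vectors[::-1], *bwd_params)[::-1]   # reverse order, realigned
    # Splice the two codes that belong to the same character cell.
    return [f + b for f, b in zip(first, second)]


word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
params = ([[0.5, -0.5]], [[0.1]])    # hidden size 1, hypothetical fixed weights
output = encode_bidirectional(word_vectors, params, params)
print(len(output), len(output[0]))   # 3 2
```

Each spliced vector has twice the hidden size, and its second half depends on the character cells to the right of the current one, which is what lets the model use context from both directions.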
In one embodiment, considering the usage scenarios and habits of the users for whom contract text data is
intended, many phrases represent information that is used with relatively low probability and to which people pay
little attention. For example, enterprises seldom pay attention to the information characterized by phrases such
as personal names and adjectives in a contract. In addition, some phrases in contract text data serve only as
grammatical connections, for example auxiliary words and conjunctions, and usually carry no semantics of interest
to the user. To avoid interference from such invalid phrases during modeling, and to avoid wasting resources and
time on processing them, before the crucial phrases in the sample data are retrieved, the method may, in a
specific implementation, further include: filtering out the invalid phrases in the sample data, so that only the
character cells in the crucial phrases of the filtered sample data are marked. Of course, in a specific
implementation, the filtering may instead consist of marking the character cells in the invalid phrases, for
example marking the character cells of the invalid phrases in the sample data with the mark character "O". When
the computer later recognizes the mark character "O" during model training, it can tell that the corresponding
character cell belongs to an invalid phrase and handle it differently, reducing or avoiding excessive resources
and processing time spent on such character cells and thereby improving processing efficiency.
In the present embodiment, the invalid phrases may include at least one of the following: personal names, place
names, adjectives, auxiliary words, etc. Of course, the invalid phrases listed above are only an illustrative
example. In a specific implementation, other kinds of phrases may also be introduced as invalid phrases according
to the specific application scenario and processing requirements. This specification places no limitation on this.
In one embodiment, in order to further improve the precision of the established preset contract text processing
model, so that the mark characters of character cells can be determined more accurately when the model is used
later, after the first coded sequence and the second coded sequence have been spliced to obtain the output result
sequence, the method may, in a specific implementation, further include: feeding the output result sequence as
input to a marking layer for training, so as to establish constraint relationships between mark characters and
obtain a constrained output result sequence, wherein the constrained output result sequence also contains the
association relationships between mark characters.
In the present embodiment, the marking layer may specifically be a CRF marking layer. Of course, the CRF marking
layer mentioned above is only an illustrative example. In a specific implementation, other suitable layers may
also be used as the marking layer. This specification places no limitation on this.
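The effect of constraints between mark characters can be illustrated with a small Viterbi decoder. This is a sketch under strong assumptions: a real CRF layer learns transition scores during training, whereas here the transitions are hard-coded rules; the mark set "O", "B", "M", "E", the scores and all names are hypothetical.

```python
def viterbi(scores, marks, allowed):
    """Pick the best-scoring mark sequence that respects allowed transitions.

    scores[t][m] is the score given to mark m at position t;
    allowed(prev, cur) tells whether `cur` may follow `prev`
    (prev is None at the first position).
    """
    neg_inf = float("-inf")
    best = [{m: (scores[0][m] if allowed(None, m) else neg_inf) for m in marks}]
    back = []
    for t in range(1, len(scores)):
        best.append({})
        back.append({})
        for cur in marks:
            candidates = [(best[t - 1][prev], prev) for prev in marks if allowed(prev, cur)]
            prev_score, prev_mark = max(candidates, default=(neg_inf, None))
            best[t][cur] = prev_score + scores[t][cur]
            back[t - 1][cur] = prev_mark      # remember the best predecessor
    path = [max(marks, key=lambda m: best[-1][m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def allowed(prev, cur):
    """Hard-coded constraints: a phrase opens with B and closes with E."""
    if prev in (None, "O", "E"):
        return cur in ("O", "B")
    return cur in ("M", "E")                  # prev is "B" or "M": phrase still open


MARKS = ["O", "B", "M", "E"]
scores = [
    {"O": 0.1, "B": 0.6, "M": 0.2, "E": 0.1},
    {"O": 0.5, "B": 0.1, "M": 0.3, "E": 0.1},   # best mark in isolation is "O"
    {"O": 0.1, "B": 0.1, "M": 0.1, "E": 0.7},
]
print(viterbi(scores, MARKS, allowed))  # ['B', 'M', 'E']
```

Choosing the highest score at each position independently would give "B", "O", "E", which is not a well-formed phrase; taking the relationships between neighbouring mark characters into account yields "B", "M", "E" instead.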
As can be seen from the above, in the method for extracting contract information provided by the embodiments of
this specification, the sample data is first marked according to a preset method; the preset contract text
processing model, obtained by training on the marked sample data, is then used to determine the mark character
corresponding to each character cell in the contract text data; the character cells whose mark characters match
the type parameter of the information to be extracted are selected and combined according to their mark
characters, so as to obtain the contract information required by the user. This solves the technical problem that
existing methods extract contract information with low accuracy, and achieves the technical effect of accurately
and efficiently extracting, from the text data of a contract, the contract information that meets the user's
requirement. In addition, when the preset contract text processing model is established by training, the character
cells of the crucial phrases in the sample data are marked according to the preset notation method to obtain the
marked sample data; the first coded sequence and the second coded sequence for each sentence are then obtained
from the marked sample data and spliced together. As a result, the established preset contract text processing
model has higher processing accuracy and can extract contract information using contextual information, which
further improves the accuracy of the extracted contract information.
As shown in Fig. 6, an embodiment of this specification provides a method for establishing a preset contract text
processing model, wherein the method is applied on the server side. In a specific implementation, the method may
include the following:
S61: obtaining the marked sample data;
S63: splitting each sentence in the marked sample data into multiple character cells, wherein each character cell
carries corresponding markup information determined according to the preset notation method;
S65: vectorizing the character cells in the sentence to obtain multiple word vectors for the sentence;
S67: obtaining, from the word vectors in the sentence, a first coded sequence and a second coded sequence for the
sentence, wherein the first coded sequence is obtained by arranging the first codes corresponding to the word
vectors in the sentence in forward order, and the second coded sequence is obtained by arranging the second codes
corresponding to the word vectors in the sentence in reverse order;
S69: splicing the first coded sequence and the second coded sequence to obtain an output result sequence, wherein
the output result sequence contains the correspondence between mark characters and character cells.
In the present embodiment, in a specific implementation, the marked sample data may be marked according to the
preset notation method in the following manner: retrieving the crucial phrases in the sample data; and, for each
crucial phrase in the sample data, marking each character cell contained in the crucial phrase with the type of
word-combination of the crucial phrase it belongs to and with its position in that crucial phrase, as the markup
information of the character cell.
As can be seen from the above, in the method for establishing a preset contract text processing model provided by
the present application, the sample data is marked according to a preset notation method determined from the data
characteristics and usage habits of contract text data, yielding the marked sample data; then, taking the sentence
as the processing unit, forward training and reverse training are used to establish a higher-precision contract
text processing model for extracting information from contract text data, which improves the accuracy of the
model.
As shown in Fig. 7, an embodiment of this specification further provides a method for extracting text information.
In a specific implementation, the method may include the following:
S71: obtaining the text data to be extracted and the type parameter of the text information to be extracted;
S73: determining, by a preset text processing model, the mark characters of the character cells in the text data
to be extracted, wherein the preset text processing model is obtained by training on marked sample data, and the
marked sample data is text data marked according to a preset notation method;
S75: extracting, from the text data to be extracted, the character cells whose mark characters match the type
parameter of the text information to be extracted;
S77: combining, according to the mark characters, the character cells whose mark characters match the type
parameter of the text information to be extracted, to obtain the text information to be extracted.
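Steps S75 and S77 can be sketched as follows, assuming step S73 has already produced one mark character per character cell. The "<type>-<position>" mark format, the type codes "AMT" and "COM" and the function name `extract_information` are assumptions made for this illustration; the sketch also assumes the character cells arrive in text order.

```python
def extract_information(tagged_cells, type_parameter):
    """Collect every phrase whose mark characters carry `type_parameter`.

    `tagged_cells` is the model output as (character_cell, mark) pairs in
    text order; a mark is "O" or "<type>-<position>" with position
    "B", "M<k>" or "E" (a hypothetical format chosen for this sketch).
    """
    results, current = [], []
    for char, mark in tagged_cells:
        phrase_type, _, position = mark.partition("-")
        if phrase_type != type_parameter:
            current = []                  # S75: ignore cells of other types
            continue
        if position == "B":
            current = [char]              # a start cell opens a new phrase
        elif current:
            current.append(char)
            if position == "E":           # S77: an end cell completes the phrase
                results.append("".join(current))
                current = []
    return results


model_output = [
    ("P", "O"), ("a", "O"), ("y", "O"),
    ("5", "AMT-B"), ("0", "AMT-M1"), ("0", "AMT-M2"), ("0", "AMT-E"),
    ("A", "COM-B"), ("C", "COM-M1"), ("M", "COM-M2"), ("E", "COM-E"),
]
print(extract_information(model_output, "AMT"))  # ['5000']
print(extract_information(model_output, "COM"))  # ['ACME']
```

A phrase is emitted only when a start cell has been followed by an end cell of the same type, so an incomplete run of marks is silently dropped in this sketch.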
In the present embodiment, the text data to be extracted may include the text data of a contract to be extracted,
the text data of a paper to be extracted, the text data of rules and regulations to be extracted, the text data of
a notice letter to be extracted, etc. Of course, the text data listed above is only an illustrative example. In a
specific implementation, other types of text data may also be introduced as the text data to be extracted, as the
case may be and according to processing requirements, and the above method for extracting text information may be
applied to extract the text information required by the user. This specification places no limitation on this.
As can be seen from the above, in the method for extracting text information provided by the embodiments of this
specification, the sample data is first marked according to a preset method; the preset text processing model,
obtained by training on the marked sample data, is then used to determine the mark character corresponding to each
character cell in the text data; the character cells whose mark characters match the type parameter of the
information to be extracted are selected and combined according to their mark characters, so as to obtain the text
information required by the user. This solves the technical problem that existing methods extract text
information with low accuracy, and achieves the technical effect of accurately and efficiently extracting, from
text data, the text information that meets the user's requirement.
An embodiment of this specification further provides a server, including a processor and a memory for storing
processor-executable instructions. In a specific implementation, the processor may perform the following steps
according to the instructions: obtaining the text data of a contract to be extracted and the type parameter of the
contract information to be extracted; determining, by a preset contract text processing model, the mark characters
of the character cells in the text data of the contract to be extracted, wherein the preset contract text
processing model is obtained by training on marked sample data, and the marked sample data is contract text data
marked according to a preset notation method; extracting, from the text data of the contract to be extracted, the
character cells whose mark characters match the type parameter of the contract information to be extracted; and
combining, according to the mark characters, the character cells whose mark characters match the type parameter of
the contract information to be extracted, to obtain the contract information to be extracted.
In order to execute the above instructions more accurately, as shown in Fig. 8, this specification further
provides another specific server, wherein the server includes a network communication port 801, a processor 802
and a memory 803, which are connected by internal cables so that each component can exchange data with the others.
The network communication port 801 may be used to obtain the text data of the contract to be extracted and the
type parameter of the contract information to be extracted.
The processor 802 may be used to determine, by a preset contract text processing model, the mark characters of the
character cells in the text data of the contract to be extracted, wherein the preset contract text processing
model is obtained by training on marked sample data, and the marked sample data is contract text data marked
according to a preset notation method; to extract, from the text data of the contract to be extracted, the
character cells whose mark characters match the type parameter of the contract information to be extracted; and to
combine, according to the mark characters, the character cells whose mark characters match the type parameter of
the contract information to be extracted, to obtain the contract information to be extracted.
The memory 803 may be used to store the text data of the contract to be extracted and the type parameter obtained
through the network communication port 801, as well as the instruction program on which the processor 802
operates.
In the present embodiment, the network communication port 801 may be a virtual port that is bound to a particular
communication protocol so that it can send or receive the corresponding kind of data. For example, the network
communication port may be port 80, which is responsible for web data communication; port 21, which is responsible
for FTP data communication; or port 25, which is responsible for mail data communication. In addition, the network
communication port may also be a physical communication interface or communication chip. For example, it may be a
wireless mobile network communication chip, such as GSM or CDMA; it may be a Wi-Fi chip; or it may be a Bluetooth
chip.
In the present embodiment, the processor 802 may be implemented in any suitable manner. For example, the processor
may take the form of a microprocessor or processor together with a computer-readable medium storing
computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates,
switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a
programmable logic controller, an embedded microcontroller, and so on. This specification places no limitation on
this.
In the present embodiment, the memory 803 may exist at several levels. In a digital system, anything that can
store binary data can be a memory; in an integrated circuit, a circuit that has a storage function but no physical
form of its own is also called a memory, such as RAM or FIFO; in a system, a storage device with a physical form
is also called a memory, such as a memory module or a TF card.
An embodiment of this specification further provides a computer storage medium based on the above method for
extracting contract information. The computer storage medium stores computer program instructions which, when
executed, implement the following: obtaining the text data of a contract to be extracted and the type parameter of
the contract information to be extracted; determining, by a preset contract text processing model, the mark
characters of the character cells in the text data of the contract to be extracted, wherein the preset contract
text processing model is obtained by training on marked sample data, and the marked sample data is contract text
data marked according to a preset notation method; extracting, from the text data of the contract to be extracted,
the character cells whose mark characters match the type parameter of the contract information to be extracted;
and combining, according to the mark characters, the character cells whose mark characters match the type
parameter of the contract information to be extracted, to obtain the contract information to be extracted.
In the present embodiment, the storage medium includes but is not limited to random access memory (Random Access
Memory, RAM), read-only memory (Read-Only Memory, ROM), cache (Cache), hard disk (Hard Disk Drive, HDD) or memory
card (Memory Card). The memory may be used to store computer program instructions. The network communication unit
may be an interface that is set up according to a standard specified by a communication protocol and is used for
network connection communication.
In the present embodiment, the functions and effects achieved by the program instructions stored in the computer
storage medium can be understood by reference to the other embodiments and are not described again here.
As shown in Fig. 9, at the software level, an embodiment of this specification further provides an apparatus for
extracting contract information, which may include the following structural modules:
an obtaining module 91, which may be used to obtain the text data of a contract to be extracted and the type
parameter of the contract information to be extracted;
a determining module 92, which may be used to determine, by a preset contract text processing model, the mark
characters of the character cells in the text data of the contract to be extracted, wherein the preset contract
text processing model is obtained by training on marked sample data, and the marked sample data is contract text
data marked according to a preset notation method;
an extraction module 94, which may be used to extract, from the text data of the contract to be extracted, the
character cells whose mark characters match the type parameter of the contract information to be extracted;
a combination module 95, which may be used to combine, according to the mark characters, the character cells whose
mark characters match the type parameter of the contract information to be extracted, to obtain the contract
information to be extracted.
In one embodiment, the apparatus may further include a model building module for establishing the preset contract
text processing model. The model building module may include the following structural units:
an obtaining unit, which may be used to obtain the marked sample data;
a splitting unit, which may be used to split each sentence in the marked sample data into multiple character
cells, wherein each character cell carries corresponding markup information determined according to the preset
notation method;
a vectorization unit, which may be used to vectorize the character cells in the sentence to obtain multiple word
vectors for the sentence;
a determination unit, which may be used to obtain, from the word vectors in the sentence, a first coded sequence
and a second coded sequence for the sentence, wherein the first coded sequence is obtained by arranging the first
codes corresponding to the word vectors in the sentence in forward order, and the second coded sequence is
obtained by arranging the second codes corresponding to the word vectors in the sentence in reverse order;
a splicing unit, which may be used to splice the first coded sequence and the second coded sequence to obtain an
output result sequence, wherein the output result sequence contains the correspondence between mark characters
and character cells.
In one embodiment, the model building module can also include specifically mark unit, specifically can be used for examining
Crucial phrase in rope sample data;In the crucial phrase in the sample data, the crucial phrase institute is marked out respectively
The type of word-combination of crucial phrase where the character cell for including and the position in crucial phrase, as the character cell
Markup information.
In one embodiment, the type of word-combination of the crucial phrase can specifically include at least one of: company name
Title, the amount of money, contract award date, term of validity, contract number, execution of contract date, expiration of contract date etc..Certainly, above-mentioned
Cited type of word-combination is that one kind schematically illustrates.In this regard, this specification is not construed as limiting.
In one embodiment, the model building module can also include specifically filter element, specifically can be used for
Filter the invalid phrase in the sample data, wherein the invalid phrase can specifically include at least one of: name,
Name, adjective, auxiliary word etc..
In one embodiment, the determination unit may specifically include the following structural sub-units:
A sorting subunit, which may specifically be used to arrange the multiple word vectors in a sentence in
forward order to obtain the forward word vector sequence of the sentence, and to arrange the multiple word
vectors in the sentence in reverse order to obtain the reverse word vector sequence of the sentence;
A training subunit, which may specifically be used to take the forward word vector sequence and the
reverse word vector sequence of the sentence as the input of one time step, and to input them into the
feature layer for training to obtain the first coded sequence and the second coded sequence.
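The two sub-units above can be sketched as follows. The feature layer of this specification is a trained layer; in this sketch it is replaced by a running element-wise sum, which is only a stand-in chosen to show that the first coded sequence accumulates context from the left and the second from the right. The function names running_sum and encode_both are hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple

Vector = List[float]


def running_sum(seq: List[Vector]) -> List[Vector]:
    """Stand-in for a trained recurrent feature layer: cumulative element-wise sum."""
    return list(accumulate(seq, lambda a, b: [x + y for x, y in zip(a, b)]))


def encode_both(word_vectors: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Build the forward and reverse sequences and encode each of them."""
    forward = list(word_vectors)   # forward word vector sequence
    backward = forward[::-1]       # reverse word vector sequence
    first = running_sum(forward)
    # Flip the result back so that position i refers to the same word in both outputs.
    second = running_sum(backward)[::-1]
    return first, second


if __name__ == "__main__":
    first, second = encode_both([[1.0], [2.0], [3.0]])
    print(first)
    print(second)
```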
In one embodiment, the model building module may further include a constraint unit, which may
specifically be used to input the output result sequence into the mark layer for training to obtain a
constrained output result sequence, wherein the constrained output result sequence also includes the
association relationships between identification characters.
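How association relationships between identification characters can constrain an output sequence is sketched below. The passage above does not name the algorithm used by the mark layer; this sketch assumes a layer that scores transitions between adjacent marks and shows only the decoding step (a Viterbi search), not the training. The function name viterbi, the marks O, B and I, and all scores are hypothetical.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    marks: List[str],
) -> List[str]:
    """Return the highest-scoring mark sequence.

    emissions[i][m] is the score of mark m at position i;
    transitions[(p, m)] is the score of mark m following mark p (default 0).
    """
    if not emissions:
        return []
    # best[m] = (score, path) of the best sequence ending in mark m
    best = {m: (emissions[0][m], [m]) for m in marks}
    for scores in emissions[1:]:
        nxt = {}
        for cur in marks:
            prev = max(marks, key=lambda p: best[p][0] + transitions.get((p, cur), 0.0))
            total = best[prev][0] + transitions.get((prev, cur), 0.0) + scores[cur]
            nxt[cur] = (total, best[prev][1] + [cur])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    marks = ["O", "B", "I"]
    emissions = [{"O": 1.0, "B": 0.0, "I": 0.0}, {"O": 0.0, "B": 0.4, "I": 0.5}]
    # An inner mark directly after an outside mark is heavily penalised.
    transitions = {("O", "I"): -100.0}
    print(viterbi(emissions, transitions, marks))
```

Without the transition penalty, the second position would receive the mark I because its own score is highest; with the penalty, the sequence that respects the relationship between adjacent marks is chosen instead.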
In one embodiment, the device may further include a display module, which may specifically be used to
display the contract information to be extracted.
It should be noted that the units, devices, modules, and the like described in the above embodiments
may specifically be implemented by a computer chip or an entity, or by a product having a certain function.
For convenience of description, the above device is described by dividing it into various modules according
to function. Of course, when implementing this specification, the functions of the modules may be
implemented in one or more pieces of software and/or hardware, and a module implementing a given function
may also be implemented by a combination of multiple sub-modules or sub-units. The device embodiments
described above are merely illustrative. For example, the division of the units is only a division by
logical function, and other divisions are possible in actual implementation: multiple units or components
may be combined or integrated into another system, or some features may be omitted or not executed. In
addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be
implemented through certain interfaces, and the indirect coupling or communication connection between
devices or units may be electrical, mechanical, or in other forms.
It can be seen from the above that, in the contract information extraction device provided by the
embodiments of this specification, the sample data is first labeled in a preset manner; the determining
module then calls a preset contract text processing model, obtained by training on the labeled sample data,
to determine the character marks corresponding to the character cells in the contract text data; and the
extraction module and the composite module select the character cells whose character marks match the type
parameter of the information to be extracted and combine those character cells according to their character
marks, so as to obtain the contract information to be extracted that the user requires. This solves the
technical problem of existing methods that the accuracy of the extracted contract information is not high,
and achieves the technical effect of accurately and efficiently extracting, from the text data of a
contract, the contract information that meets the user's requirements. Furthermore, when the model building
module trains and establishes the preset contract text processing model, the character cells of the crucial
phrases in the sample data are marked according to the preset notation method to obtain the labeled sample
data; the first coded sequence and the second coded sequence of each sentence are then obtained from the
labeled sample data and spliced together. As a result, the established preset contract text processing
model has higher processing accuracy and can extract contract information using contextual information,
further improving the accuracy of the extracted contract information.
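The selection and combination of character cells summarized above can be sketched as follows. The sketch assumes that each character mark has the form "position-type", with position codes B (first cell), I (inner cell), E (last cell) and S (single-cell phrase), and O for cells outside any crucial phrase. This mark format, the type names AMOUNT and PARTY, and the function name extract are assumptions made for illustration only.

```python
from typing import List, Tuple


def extract(tagged: List[Tuple[str, str]], wanted_type: str) -> List[str]:
    """Select cells whose mark matches `wanted_type` and join them into phrases."""
    results: List[str] = []
    current: List[str] = []
    for cell, mark in tagged:
        pos, _, kind = mark.partition("-")
        if kind != wanted_type:
            current = []          # a non-matching cell breaks any open phrase
            continue
        if pos == "S":
            results.append(cell)
            current = []
        elif pos == "B":
            current = [cell]      # start a new phrase
        elif pos == "I" and current:
            current.append(cell)
        elif pos == "E" and current:
            current.append(cell)
            results.append("".join(current))
            current = []
        else:
            current = []          # an I or E mark without a preceding B is discarded
    return results


if __name__ == "__main__":
    tagged = [("p", "O"), ("5", "B-AMOUNT"), ("0", "I-AMOUNT"),
              ("0", "E-AMOUNT"), ("A", "S-PARTY")]
    print(extract(tagged, "AMOUNT"))
```

Here the type parameter plays the role of the user's request: passing a different type to the same marked text returns a different piece of contract information.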
Although this specification provides the method operation steps described in the embodiments or
flowcharts, more or fewer operation steps may be included based on conventional or non-inventive means.
The order of steps listed in the embodiments is only one of many possible execution orders and does not
represent the only execution order. When an actual device or client product executes, the steps may be
executed sequentially or in parallel according to the methods shown in the embodiments or the drawings
(for example, in a parallel-processor or multi-threaded environment, or even a distributed data processing
environment). The terms "include", "comprise", or any other variant thereof are intended to cover
non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements
includes not only those elements but also other elements not explicitly listed, or elements inherent to
such a process, method, product, or device. Without further limitation, the presence of other identical or
equivalent elements in the process, method, product, or device that includes the element is not excluded.
Words such as "first" and "second" are used to denote names and do not denote any particular order.
Those skilled in the art also know that, in addition to implementing a controller purely as
computer-readable program code, it is entirely possible to logically program the method steps so that the
controller achieves the same functions in the form of logic gates, switches, application-specific
integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a
controller may therefore be regarded as a hardware component, and the devices included in it for
implementing various functions may also be regarded as structures within the hardware component. Or even,
the devices for implementing various functions may be regarded both as software modules implementing the
method and as structures within the hardware component.
This specification may be described in the general context of computer-executable instructions executed
by a computer, such as program modules. Generally, program modules include routines, programs, objects,
components, data structures, classes, and the like that perform particular tasks or implement particular
abstract data types. This specification may also be practiced in distributed computing environments, in
which tasks are performed by remote processing devices connected through a communication network. In a
distributed computing environment, program modules may be located in both local and remote computer storage
media, including storage devices.
From the above description of the embodiments, those skilled in the art can clearly understand that this
specification may be implemented by means of software plus a necessary general-purpose hardware platform.
Based on this understanding, the technical solution of this specification, in essence, or the part of it
that contributes to the prior art, may be embodied in the form of a software product. The computer software
product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and
includes several instructions that cause a computer device (which may be a personal computer, a mobile
terminal, a server, a network device, or the like) to execute the methods described in the embodiments of
this specification or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For the same or similar
parts among the embodiments, reference may be made to one another, and each embodiment focuses on its
differences from the other embodiments. This specification may be used in numerous general-purpose or
special-purpose computing system environments or configurations, for example: personal computers, server
computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based
systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers,
distributed computing environments including any of the above systems or devices, and the like.
Although this specification has been described by way of embodiments, those of ordinary skill in the art
will appreciate that many variations and changes may be made to this specification without departing from
its spirit, and it is intended that the appended claims cover such variations and changes without departing
from the spirit of this specification.