CN110442691A

CN110442691A - Method, apparatus and computer device for machine reading comprehension of Chinese

Info

Publication number: CN110442691A
Application number: CN201910597621.5A
Authority: CN
Inventors: 苏智辉; 钱柏丞
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-07-04
Filing date: 2019-07-04
Publication date: 2019-11-12
Also published as: WO2021000675A1

Abstract

This application discloses a method, an apparatus and a computer device for machine reading comprehension of Chinese. BERT is used to vectorize a first Chinese question text and a first Chinese text to be understood, and a preset first Chinese machine reading comprehension model then performs the calculation to obtain a first answer text corresponding to the question text. Because neither the first Chinese question text nor the first Chinese text undergoes word segmentation, improper segmentation cannot occur, so the first answer text finally obtained is more accurate.

Description

Method, apparatus and computer device for machine reading comprehension of Chinese

Technical field

This application relates to the field of machine reading, and in particular to a method, an apparatus and a computer device for machine reading comprehension of Chinese.

Background

Compared with machine reading comprehension of English, machine reading comprehension of Chinese text achieves lower accuracy. In English, for example, words are separated by spaces, which act as natural word delimiters, so segmentation during machine reading comprehension is accurate and the output answers are relatively accurate. Chinese is different: word segmentation is more complex, and different segmentations can produce different answers. For example, the sentence "I will go to school" can be segmented as "I want / go / up / study", as "I want / go / go to school", or as "I / want to go / study", and so on. Each of these segmentations changes the corresponding semantics and therefore yields a different understanding. A method for machine reading comprehension of Chinese that improves accuracy is therefore needed.

Summary of the invention

The main purpose of this application is to provide a method, an apparatus and a computer device for machine reading comprehension of Chinese, aiming to solve the problem in the prior art that machine reading comprehension of Chinese has low accuracy.

To achieve the above purpose, this application proposes a method for machine reading comprehension of Chinese, comprising:

Obtaining a first Chinese question text and a first Chinese text to be understood;

Inputting the first Chinese question text and the first Chinese text respectively into a preset language model for vectorization, to obtain a question vector of the first Chinese question text and a to-be-understood vector of the first Chinese text, wherein the language model is BERT;

Inputting the question vector and the to-be-understood vector into a preset first Chinese machine reading comprehension model for calculation, to obtain a first answer text corresponding to the first Chinese question text.

Further, the step of inputting the first Chinese question text and the first Chinese text respectively into the preset language model for vectorization, to obtain the question vector of the first Chinese question text and the to-be-understood vector of the first Chinese text, comprises:

Vectorizing each character in the first Chinese question text and in the first Chinese text respectively to obtain character vectors, and marking each character with a position vector to obtain character position vectors;

Merging the character vector and the character position vector corresponding to each character, to obtain the question vector corresponding to the first Chinese question text and the to-be-understood vector corresponding to the first Chinese text.

Further, after the step of inputting the question vector and the to-be-understood vector into the preset first Chinese machine reading comprehension model for calculation, to obtain the first answer text corresponding to the question text, the method comprises:

Judging whether the first answer text contains non-Chinese words; if so, converting the non-Chinese words into the corresponding Chinese and substituting it into the first answer text, to obtain a pure-Chinese first answer text;

Vectorizing each Chinese character in the pure-Chinese first answer text to obtain a plurality of first vectors arranged in a first order, namely the order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string;

Grouping the plurality of first vectors in the first order, every specified number of first vectors forming one group, to obtain a plurality of first vector groups;

Searching a preset template vector database, for each first vector group, for the second vector group that has the highest similarity to it and whose similarity reaches a preset first threshold;

If the second vector group is found, replacing the corresponding first vector group in the first vector string with the second vector group, to obtain a second vector string;

Converting the second vector string into Chinese, to obtain a second answer text.

Vectorizing each Chinese character in the pure-Chinese first answer text to obtain a plurality of first vectors arranged in a first order, namely the order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string (x1, x2, x3, ..., xn), where x is a first vector and n is an integer greater than 1;

In the first order, first taking the first of the first vectors, x1, as the start vector and combining it with the second of the first vectors, x2, to obtain a first vector combination (x1, x2), and searching a preset template vector database for the preset first combined vector (y1, y2) that has the greatest similarity to the first vector combination (x1, x2) and whose similarity is greater than a preset second threshold, where y is a preset vector;

If the preset first combined vector (y1, y2) is found, combining the first three first vectors to obtain a second vector combination (x1, x2, x3), and searching the template vector database for the preset second combined vector (y1, y2, y3) that has the greatest similarity to the second vector combination and whose similarity is greater than the second threshold; and so on. When no combined vector (y1, y2, y3, ..., yn) is found, performing a first fixing process of "replacing, in the first vector string, the corresponding vector combination (x1, x2, x3, ..., xn-1) with the combined vector (y1, y2, y3, ..., yn-1), and fixing the combined vector (y1, y2, y3, ..., yn-1) in the first vector string";

If the preset first combined vector (y1, y2) is not found, performing a second fixing process of "fixing the first vector combination (x1, x2) in the first vector string"; then taking the third of the first vectors as the start vector and repeating the first fixing process and/or the second fixing process until a third vector string consisting entirely of fixed vectors is obtained;

Converting the third vector string into Chinese, to obtain a third answer text.

Further, the first Chinese text is a text formed by an answerer answering the question in the first Chinese question text; after the step of inputting the question vector and the to-be-understood vector into the preset first Chinese machine reading comprehension model for calculation, to obtain the first answer text corresponding to the question text, the method comprises:

Obtaining a fourth answer text corresponding to the correct answer to the question in the question text;

Calculating the answer similarity between the fourth answer text and the first answer text;

Looking up the score corresponding to the answer similarity in a preset score-similarity list;

Outputting the score as the score for the degree of understanding shown by the first Chinese text.

Further, before the step of obtaining the first Chinese question text and the first Chinese text to be understood, the method comprises:

Obtaining a preset Chinese reading comprehension data set, wherein the data set includes a plurality of pieces of training data, each consisting of a second Chinese question text, a second Chinese text to be understood and a fourth answer text in one-to-one correspondence;

Vectorizing each piece of training data using the language model, to obtain training vector data consisting of a second Chinese question text vector, a second Chinese text vector and a fourth answer text vector in one-to-one correspondence;

Inputting the training vector data into a preset second Chinese machine reading comprehension model for training, to obtain the first Chinese machine reading comprehension model.

Further, before the step of inputting the first Chinese question text and the first Chinese text respectively into the preset language model for vectorization, to obtain the question vector of the first Chinese question text and the to-be-understood vector of the first Chinese text, the method comprises:

Searching the first Chinese question text and the first Chinese text respectively for foreign-language words;

If any are found, translating the foreign-language words found into Chinese, and replacing the corresponding foreign-language words with the Chinese obtained by translation.

An embodiment of this application also provides an apparatus for machine reading comprehension of Chinese, comprising:

An acquiring unit, for obtaining a first Chinese question text and a first Chinese text to be understood;

A vectorization unit, for inputting the first Chinese question text and the first Chinese text respectively into a preset language model for vectorization, to obtain a question vector of the first Chinese question text and a to-be-understood vector of the first Chinese text, wherein the language model is BERT;

A computing unit, for inputting the question vector and the to-be-understood vector into a preset first Chinese machine reading comprehension model for calculation, to obtain a first answer text corresponding to the question text.

An embodiment of this application also provides a computer device, including a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of any of the methods described above when executing the computer program.

An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the steps of any of the methods described above when executed by a processor.

With the method, apparatus and computer device for machine reading comprehension of Chinese according to the embodiments of this application, BERT is used to vectorize the first Chinese question text and the first Chinese text, and the preset first Chinese machine reading comprehension model then performs the calculation to obtain the first answer text corresponding to the question text. Because neither the first Chinese question text nor the first Chinese text undergoes word segmentation, improper segmentation cannot occur, so the first answer text finally obtained is more accurate.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for machine reading comprehension of Chinese according to an embodiment of this application;

Fig. 2 is a structural block diagram of the apparatus for machine reading comprehension of Chinese according to an embodiment of this application;

Fig. 3 is a schematic structural block diagram of the computer device according to an embodiment of this application.

The realization of the purpose, the functional characteristics and the advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it.

Referring to Fig. 1, an embodiment of this application provides a method for machine reading comprehension of Chinese, comprising the steps of:

S1, obtaining a first Chinese question text and a first Chinese text to be understood;

S2, inputting the first Chinese question text and the first Chinese text respectively into a preset language model for vectorization, to obtain a question vector of the first Chinese question text and a to-be-understood vector of the first Chinese text, wherein the language model is BERT;

S3, inputting the question vector and the to-be-understood vector into a preset first Chinese machine reading comprehension model for calculation, to obtain a first answer text corresponding to the first Chinese question text.

As described in step S1, the first Chinese question text is a question written in Chinese, for example "What is the start time of the Sino-Japanese War of 1894-1895?"; the first Chinese text is an article or the like written in Chinese, whose content generally contains the answer to the question in the first Chinese question text.

As described in step S2, the language model is BERT, whose full name is Bidirectional Encoder Representations from Transformers. The way BERT is trained means that it can truly achieve contextual representations and, compared with other word-vector representation methods, it is according to this application currently the only pre-trained language model that is truly deeply bidirectional. BERT is used to vectorize the first Chinese question text and the first Chinese text to be understood: neither text undergoes word segmentation, and both are input directly into BERT for vectorization. The problem of different segmentations leading to different understandings therefore does not arise, which can improve the accuracy of the subsequent answer.
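To illustrate why skipping word segmentation removes the ambiguity, the minimal sketch below splits a sentence into single characters, which is the unit passed to the language model here. The function name `to_characters` and the sample sentence (an assumed Chinese rendering of "I will go to school") are illustrative only; a real BERT tokenizer also performs vocabulary lookups and adds special tokens, which are omitted.

```python
def to_characters(text: str) -> list[str]:
    """Split text into single characters, skipping whitespace (no word segmentation)."""
    return [ch for ch in text if not ch.isspace()]


# The sentence always yields the same five units, whereas word segmenters
# may disagree on where the word boundaries fall.
print(to_characters("我要去上学"))
```

Because the split is purely positional, every run over the same text produces the same input sequence.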

As described in step S3, the first Chinese machine reading comprehension model can be any Chinese machine reading comprehension model in the prior art. Its basic principle is to match, within the to-be-understood vector, the answer vector with the highest similarity to the question vector, and then to convert that answer vector into the Chinese first answer text; the details are not repeated here.
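The matching principle can be sketched as a nearest-neighbour search. This is a toy illustration under stated assumptions, not the model used in the application: cosine similarity is an assumed choice of similarity measure, the vectors are made-up three-dimensional values, and `best_answer_index` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_answer_index(question_vec, candidate_vecs):
    """Return the index of the candidate vector most similar to the question vector."""
    return max(range(len(candidate_vecs)),
               key=lambda i: cosine(question_vec, candidate_vecs[i]))


question = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 1.1], [1.0, 1.0, 0.0]]
print(best_answer_index(question, candidates))
```

In a real system the candidates would be vectors of spans taken from the text to be understood, and the selected span would then be converted back into Chinese.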

In one embodiment, step S2 of inputting the first Chinese question text and the first Chinese text respectively into the preset language model for vectorization, to obtain the question vector of the first Chinese question text and the to-be-understood vector of the first Chinese text, comprises:

S21, vectorizing each character in the first Chinese question text and in the first Chinese text respectively to obtain character vectors, and marking each character with a position vector to obtain character position vectors;

S22, merging the character vector and the character position vector corresponding to each character, to obtain the question vector corresponding to the first Chinese question text and the to-be-understood vector corresponding to the first Chinese text.

In this embodiment, the difficulty of machine reading comprehension lies in predicting the boundaries of the answer. There are many boundary prediction methods, such as the pointer network, which can use the capability of an RNN (recurrent neural network) to abstract data over a time sequence. This application instead uses BERT, which abandons the RNN entirely: the processed sentences are passed into a large Transformer model for processing. The position of each character therefore has to be marked so that the context information can be truly understood. In this embodiment, each character is given a position vector by positional encoding, that is, the position-vector method; a position vector is a vector obtained by vector training on the position at which a character appears.
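A minimal sketch of steps S21 and S22 follows. It assumes that "merging" means element-wise addition, which is how BERT combines its input embeddings; the application itself does not name the operation. The tables are filled with random numbers as stand-ins for trained embeddings, and `make_table` and `embed` are hypothetical helper names.

```python
import random

DIM = 4                      # illustrative embedding size
rng = random.Random(0)       # fixed seed so the sketch is repeatable


def make_table(keys):
    """Build a hypothetical embedding table: one DIM-sized vector per key."""
    return {k: [rng.uniform(-1, 1) for _ in range(DIM)] for k in keys}


def embed(text, char_table, pos_table):
    """Merge each character vector with the vector of its position (element-wise sum)."""
    return [
        [c + p for c, p in zip(char_table[ch], pos_table[i])]
        for i, ch in enumerate(text)
    ]


text = "我要去上学"
char_table = make_table(sorted(set(text)))     # character vectors (S21)
pos_table = make_table(range(len(text)))       # position vectors (S21)
vectors = embed(text, char_table, pos_table)   # merged vectors (S22)
print(len(vectors), len(vectors[0]))
```

With the position vector added, the same character at two different positions receives two different merged vectors, which is what lets a model without recurrence tell the positions apart.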

In one embodiment, after step S3 of inputting the question vector and the to-be-understood vector into the preset first Chinese machine reading comprehension model for calculation, to obtain the first answer text corresponding to the first Chinese question text, the method comprises:

S301, judging whether the first answer text contains non-Chinese words; if so, converting the non-Chinese words into the corresponding Chinese and substituting it into the first answer text, to obtain a pure-Chinese first answer text;

S302, vectorizing each Chinese character in the pure-Chinese first answer text to obtain a plurality of first vectors arranged in a first order, namely the order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string;

S303, grouping the plurality of first vectors in the first order, every specified number of first vectors forming one group, to obtain a plurality of first vector groups;

S304, searching a preset template vector database, for each first vector group, for the second vector group that has the highest similarity to it and whose similarity reaches a preset first threshold;

S305, if the second vector group is found, replacing the corresponding first vector group in the first vector string with the second vector group, to obtain a second vector string;

S306, converting the second vector string into Chinese, to obtain a second answer text.
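Steps S303 to S305 can be sketched as follows. To stay short, the sketch works directly on characters instead of vectors and uses the share of matching positions as a toy stand-in for vector similarity; the sample strings, the templates, the group size of 5 and the threshold of 0.5 are all hypothetical values chosen for illustration.

```python
def similarity(a: str, b: str) -> float:
    """Toy stand-in for vector similarity: share of positions holding the same character."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def correct_by_groups(text, templates, group_size=5, threshold=0.5):
    """Split text into fixed-size groups and swap each for its best template, if any."""
    groups = [text[i:i + group_size] for i in range(0, len(text), group_size)]
    result = []
    for group in groups:
        best = max(templates, key=lambda t: similarity(group, t), default=None)
        if best is not None and similarity(group, best) >= threshold:
            result.append(best)      # replace with the most similar template group
        else:
            result.append(group)     # nothing similar enough: keep the group as is
    return "".join(result)


# Hypothetical template database of common phrases and a scrambled input.
templates = ["马上去吃饭", "你什么时候"]
print(correct_by_groups("马上去饭吃你什时么候去", templates))
```

The final group holds whatever characters remain after the full-size groups have been formed, so it can be shorter than the others and is left unchanged when no template matches it.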

In this embodiment, because the first answer text is extracted by the machine from the first Chinese text, it may contain grammatical errors and the like. For example, the first answer text may be an ungrammatical form of "go to have a meal at once", whereas the correct expression is "go to have a meal at once", so the first answer text needs to be corrected, and the correction method is the method above. The template vector database stores common vectors of a plurality of preset common Chinese phrases, together with the common phrases corresponding to those vectors. For example, it stores the second vector group of "hello", formed from the vector corresponding to "you" and the vector corresponding to "good". The specified quantity is at least 2, and the first vector string is grouped by the specified quantity to obtain a plurality of first vector groups. For example, if the first vector string corresponds to an ungrammatical form of "go to have a meal at once, when do you go", three first vector groups are formed: the ungrammatical form of "go to have a meal at once", the ungrammatical form of "when do you", and "go". The last first vector group consists of the vectors of the text remaining after the other groups have been formed. A second vector group that can replace each first vector group is then sought in the template vector database. The similarity between a first vector group and a second vector group can be calculated with any algorithm known in the prior art and is not described here. In this example, the Chinese corresponding to the second vector group found for the first group is the correct "go to have a meal at once", the Chinese corresponding to the second vector group found for the second group is the correct "when do you", and the second vector group corresponding to "go" is also "go", so the second answer text finally obtained is "go to have a meal at once, when do you go". In other embodiments, if no second vector group is found whose similarity to the first vector group reaches the first threshold, the first vector group does not need to be replaced; likewise, if a second vector group with 100% similarity to the first vector group is found, the first vector group does not need to be replaced.

In another embodiment, after step S3 of inputting the question vector and the to-be-understood vector into the preset first Chinese machine reading comprehension model for calculation, to obtain the first answer text corresponding to the first Chinese question text, the method comprises:

S311, judging whether the first answer text contains non-Chinese words; if so, converting the non-Chinese words into the corresponding Chinese and substituting it into the first answer text, to obtain a pure-Chinese first answer text;

S312, vectorizing each Chinese character in the pure-Chinese first answer text to obtain a plurality of first vectors arranged in a first order, namely the order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string (x1, x2, x3, ..., xn), where x is a first vector and n is an integer greater than 1;

S313, in the first order, first taking the first of the first vectors, x1, as the start vector and combining it with the second of the first vectors, x2, to obtain a first vector combination (x1, x2), and searching a preset template vector database for the preset first combined vector (y1, y2) that has the greatest similarity to the first vector combination (x1, x2) and whose similarity is greater than a preset second threshold, where y is a preset vector;

S314, if the preset first combined vector (y1, y2) is found, combining the first three first vectors to obtain a second vector combination (x1, x2, x3), and searching the template vector database for the preset second combined vector (y1, y2, y3) that has the greatest similarity to the second vector combination and whose similarity is greater than the second threshold; and so on. When no combined vector (y1, y2, y3, ..., yn) is found, performing a first fixing process of "replacing, in the first vector string, the corresponding vector combination (x1, x2, x3, ..., xn-1) with the combined vector (y1, y2, y3, ..., yn-1), and fixing the combined vector (y1, y2, y3, ..., yn-1) in the first vector string";

S315, if the preset first combined vector (y1, y2) is not found, performing a second fixing process of "fixing the first vector combination (x1, x2) in the first vector string"; then taking the third of the first vectors as the start vector and repeating the first fixing process and/or the second fixing process until a third vector string consisting entirely of fixed vectors is obtained;

S316, converting the third vector string into Chinese, to obtain a third answer text.

In this embodiment, suppose for example that the first answer text consists of the five characters of "I will go to have a meal". Vectorizing it in order yields five first vectors for "I", "want", "go", "eat" and "meal", which form the first vector string (x1, x2, x3, x4, x5). The vectors of "I" and "want" are first combined to obtain (x1, x2), and the template vector database is searched for the preset first combined vector (y1, y2) that has the greatest similarity to the first vector combination (x1, x2) and whose similarity is greater than the preset second threshold. If the first combined vector (y1, y2) exists, the three first vectors of "I", "want" and "go" are combined to obtain the second vector combination (x1, x2, x3), and the template vector database is searched for the preset second combined vector (y1, y2, y3) that has the greatest similarity to the second vector combination (x1, x2, x3) and whose similarity is greater than the preset second threshold. If the second combined vector (y1, y2, y3) exists, the four first vectors of "I", "want", "go" and "eat" are combined to obtain a third vector combination (x1, x2, x3, x4), and the same step is repeated to search for a third combined vector (y1, y2, y3, y4). If no third combined vector exists, the second combined vector (y1, y2, y3) replaces the second vector combination (x1, x2, x3) and is fixed; combination then starts again from the vectors of "eat" and "meal", and the steps are repeated until the whole first answer text has been processed. That is, starting from the first and second of the first vectors in the first vector string, the first fixing process and/or the second fixing process is repeated until a third vector string consisting entirely of fixed vectors is obtained (every first vector in the first vector string is either replaced and then fixed, or fixed as it is). By repeatedly replacing the vector combinations of the character combinations in the first answer text with preset combined vectors, a third answer text with smoother, more coherent sentences can be obtained.
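The growing-combination search of steps S313 to S315 can be sketched as a greedy loop. As a simplification the sketch works on characters instead of vectors, and the share of matching positions is a toy stand-in for vector similarity; the templates, the threshold of 0.4 and the mistyped input are hypothetical, and handling of a lone trailing character is an added assumption.

```python
def similarity(a: str, b: str) -> float:
    """Toy stand-in for vector similarity: share of positions holding the same character."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_template(combo, templates, threshold):
    """Return the most similar template whose similarity exceeds the threshold, else None."""
    best = max(templates, key=lambda t: similarity(combo, t), default=None)
    if best is not None and similarity(combo, best) > threshold:
        return best
    return None


def greedy_fix(chars: str, templates, threshold=0.4) -> str:
    """Grow a combination from each start position and fix the longest template match."""
    fixed = []                               # the fully fixed result being built
    start = 0
    while start < len(chars):
        if start == len(chars) - 1:          # assumption: a lone last character is kept
            fixed.append(chars[start])
            break
        end, last_match = start + 2, None
        while end <= len(chars):             # try combinations of length 2, 3, 4, ...
            match = best_template(chars[start:end], templates, threshold)
            if match is None:
                break
            last_match, end = match, end + 1
        if last_match is None:               # second fixing process: keep (x_i, x_i+1)
            fixed.append(chars[start:start + 2])
            start += 2
        else:                                # first fixing process: replace with the match
            fixed.append(last_match)
            start += len(last_match)
    return "".join(fixed)


templates = ["我要", "我要去", "吃饭"]
print(greedy_fix("我要去吃反", templates))
```

In this run the first three characters grow to the longest match and are fixed, after which the search restarts from the fourth character, mirroring the example in the text.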

In one embodiment, the first Chinese text is a text formed by an answerer answering the question in the first Chinese question text; after step S3 of inputting the question vector and the to-be-understood vector into the preset first Chinese machine reading comprehension model for calculation, to obtain the first answer text corresponding to the first Chinese question text, the method comprises:

S331, obtaining a fourth answer text corresponding to the correct answer to the question in the question text;

S332, calculating the answer similarity between the fourth answer text and the first answer text;

S333, looking up the score corresponding to the answer similarity in a preset score-similarity list;

S334, outputting the score as the score for the degree of understanding shown by the first Chinese text.

In this embodiment, the above process is a machine scoring process. The fourth answer text is a preset Chinese text whose content is the correct answer to the question in the first Chinese question text. There are many ways to compute the answer similarity between the fourth answer text and the first answer text: for example, both texts can be vectorized with the same vectorization rule and the similarity of the two vectors calculated, or the similarity of the two character sequences can be calculated. The score-similarity list is a list that maps similarities to scores; for example, similarities within a given range correspond to a fixed score, and in general the higher the similarity, the higher the corresponding score. Machine scoring can thus be completed quickly, which improves the efficiency of marking papers and reduces the consumption of human resources.
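The lookup in step S333 can be sketched as a banded table. The band boundaries and scores below are hypothetical values chosen for illustration; the application only states that a range of similarities maps to a fixed score and that higher similarity generally means a higher score.

```python
import bisect

# Hypothetical score-similarity list: lower bound of each similarity band and its score.
BAND_LOWER_BOUNDS = [0.0, 0.4, 0.6, 0.8]
BAND_SCORES = [0, 4, 7, 10]


def score_for(similarity: float) -> int:
    """Look up the score of the band that the similarity falls into."""
    index = bisect.bisect_right(BAND_LOWER_BOUNDS, similarity) - 1
    return BAND_SCORES[max(index, 0)]


print(score_for(0.85))
print(score_for(0.5))
```

Keeping the bounds sorted lets `bisect` find the band in logarithmic time, and changing the marking scheme only requires editing the two lists.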

In one embodiment, before step S1 of obtaining the first Chinese question text and the first Chinese text to be understood, the method comprises:

S101, obtaining a preset Chinese reading comprehension data set, wherein the data set includes a plurality of pieces of training data, each consisting of a second Chinese question text, a second Chinese text to be understood and a fourth answer text in one-to-one correspondence;

S102, vectorizing each piece of training data using the language model, to obtain training vector data consisting of a second Chinese question text vector, a second Chinese text vector and a fourth answer text vector in one-to-one correspondence;

S103, inputting the training vector data into a preset second Chinese machine reading comprehension model for training, to obtain the first Chinese machine reading comprehension model.

In this embodiment, the Chinese reading comprehension data set is the CMRC (Chinese Machine Reading Comprehension) 2018 data set provided by the Joint Laboratory of HIT and iFLYTEK, which is, according to this application, currently the only high-quality public Chinese reading comprehension data set in the People's Republic of China. This embodiment is the process of training to obtain the first Chinese machine reading comprehension model: training can be considered finished when the similarity between the answer text output by the second Chinese machine reading comprehension model and the corresponding answer text in the data set reaches a specified value. The second Chinese machine reading comprehension model is a neural network model, for example a supervised-learning neural network model such as a long short-term memory (LSTM) model. The specific training process is the standard training procedure for neural networks and is not described here.

In one embodiment, before step S2 of inputting the first Chinese question text and the first Chinese text respectively into the preset language model for vectorization, to obtain the question vector of the first Chinese question text and the to-be-understood vector of the first Chinese text, the method comprises:

S201, searching the first Chinese question text and the first Chinese text respectively for foreign-language words;

S202, if any are found, translating the foreign-language words found into Chinese, and replacing the corresponding foreign-language words with the Chinese obtained by translation.

In this embodiment, the first Chinese question text and the first Chinese text are intended to be used as pure-Chinese texts, but it cannot be ruled out that they contain foreign-language words. Foreign language here means any written language other than Chinese, such as English, Japanese or Korean. Each foreign-language word found is translated by a preset translation engine into the corresponding Chinese, which then replaces the foreign-language word, yielding a pure-Chinese first Chinese question text and first Chinese text and improving the accuracy of Chinese machine reading comprehension. In this embodiment, the first Chinese question text and the first Chinese text are each traversed, the characters that are not Chinese characters are extracted, the language of the extracted characters is identified, the translation engine for that language is called to translate them, and finally the Chinese obtained by translation replaces the corresponding foreign-language words. In this embodiment, Chinese can be distinguished from other scripts by the number of bytes per character; for example, a Chinese character occupies two bytes while an English character occupies one byte.
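The find-and-replace of steps S201 and S202 can be sketched as below. Several simplifications are assumed: only runs of Latin letters are detected (the text also mentions Japanese and Korean), a tiny dictionary named `TOY_DICTIONARY` stands in for the translation engine, and the two-byte versus one-byte distinction is shown under the GBK encoding, where it holds; under UTF-8 the byte counts differ.

```python
import re

# Hypothetical stand-in for the translation engine: a tiny English-to-Chinese dictionary.
TOY_DICTIONARY = {"email": "电子邮件", "ok": "好的"}


def toy_translate(word: str) -> str:
    """Translate a word with the toy dictionary; unknown words are returned unchanged."""
    return TOY_DICTIONARY.get(word.lower(), word)


def replace_foreign_words(text: str, translate) -> str:
    """Replace each run of Latin letters with whatever the translate callable returns."""
    return re.sub(r"[A-Za-z]+", lambda m: translate(m.group(0)), text)


print(replace_foreign_words("请发email给我", toy_translate))

# Under GBK a Chinese character takes two bytes and an ASCII letter takes one.
print(len("中".encode("gbk")), len("a".encode("gbk")))
```

Passing the translator in as a callable keeps the detection logic separate from the engine, so a different engine can be selected per detected language.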

With the method for machine reading comprehension of Chinese according to the embodiments of this application, BERT is used to vectorize the first Chinese question text and the first Chinese text, and the preset first Chinese machine reading comprehension model then performs the calculation to obtain the first answer text corresponding to the question text. Because neither the first Chinese question text nor the first Chinese text undergoes word segmentation, improper segmentation cannot occur, so the first answer text finally obtained is more accurate.

Referring to Fig. 2, the present application also provides a device for machine reading comprehension of Chinese, comprising:

An acquiring unit 10, configured to obtain a first problem Chinese text and a first Chinese text to be understood;

A vectorization unit 20, configured to input the first problem Chinese text and the first Chinese text respectively into a preset language model for vectorization, to obtain a problem vector of the first problem Chinese text and a vector to be understood of the first Chinese text, wherein the language model is BERT;

A computing unit 30, configured to input the problem vector and the vector to be understood into a preset first Chinese machine reading comprehension model for calculation, to obtain a first answer text corresponding to the first problem Chinese text.

In one embodiment, the vectorization unit 20 comprises:

A character vectorization module, configured to vectorize each character in the first problem Chinese text and in the first Chinese text respectively to obtain character vectors, and to mark each character with a position vector to obtain character position vectors;

A character merging module, configured to merge the character vector and the character position vector corresponding to each character, to obtain the problem vector corresponding to the first problem Chinese text and the vector to be understood corresponding to the first Chinese text.
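
The merging of character vectors and character position vectors can be illustrated with a small sketch. It assumes that "merging" means element-wise addition, which is how BERT combines its input embeddings; the text above does not state the operation. The lookup tables `char_table` and `pos_table` and the function `embed_text` are hypothetical toy stand-ins for learned embeddings.

```python
from typing import Dict, List

Vector = List[float]


def embed_text(
    text: str,
    char_table: Dict[str, Vector],
    pos_table: List[Vector],
) -> List[Vector]:
    """Return one merged vector per character, with no word segmentation."""
    merged: List[Vector] = []
    for position, ch in enumerate(text):
        char_vec = char_table[ch]      # character vector
        pos_vec = pos_table[position]  # character position vector
        # Assumed merge operation: element-wise sum of the two vectors.
        merged.append([c + p for c, p in zip(char_vec, pos_vec)])
    return merged


# Toy two-dimensional tables, chosen so the sums are exact in floating point.
char_table = {"我": [1.0, 0.0], "要": [0.0, 1.0]}
pos_table = [[0.5, 0.5], [0.25, 0.25]]

problem_vector = embed_text("我要", char_table, pos_table)
```

Because each character is looked up individually, the same two characters in a different order yield different merged vectors, which is the information the position vector contributes.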

In one embodiment, the device for machine reading comprehension of Chinese further comprises:

A first judging and replacing unit, configured to judge whether a non-Chinese word exists in the first answer text and, if so, to convert the non-Chinese word into the corresponding Chinese and substitute it into the first answer text, to obtain a first answer text in pure Chinese;

A first vectorization unit, configured to vectorize each Chinese character in the pure-Chinese first answer text, to obtain a plurality of first vectors arranged in the first order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string;

A first combining unit, configured to group the plurality of first vectors, in the first order, into groups of a specified number of first vectors each, to obtain a plurality of first vector groups;

A first searching unit, configured to search a preset template vector database, for each first vector group, for the second vector group that has the highest similarity to it and whose similarity reaches a preset first threshold;

A replacing unit, configured to, if the second vector group is found, replace the corresponding first vector group in the first vector string with the second vector group, to obtain a second vector string;

A first converting unit, configured to convert the second vector string into Chinese, to obtain a second answer text.
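
The grouping, lookup and replacement performed by these units can be sketched as below. The similarity measure is not specified in this passage, so cosine similarity over the concatenated group is an assumption, as are the names `best_template` and `replace_groups` and the representation of the template vector database as a plain list.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]
Group = List[Vector]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flatten(group: Group) -> List[float]:
    return [value for vec in group for value in vec]


def best_template(group: Group, templates: List[Group], threshold: float) -> Optional[Group]:
    """Most similar template of the same length whose similarity reaches the threshold."""
    best, best_sim = None, threshold
    for template in templates:
        if len(template) != len(group):
            continue
        sim = cosine(flatten(group), flatten(template))
        if sim >= best_sim:
            best, best_sim = template, sim
    return best


def replace_groups(vectors: Group, templates: List[Group], group_size: int, threshold: float) -> Group:
    """Split the first vector string into fixed-size groups and swap in matching templates."""
    second_string: Group = []
    for start in range(0, len(vectors), group_size):
        group = list(vectors[start:start + group_size])
        match = best_template(group, templates, threshold)
        second_string.extend(match if match is not None else group)
    return second_string


first_string = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
template_db = [[[0.9, 0.1], [0.1, 0.9]]]
second_string = replace_groups(first_string, template_db, group_size=2, threshold=0.95)
```

In this toy run the first group is close enough to the template to be replaced, while the second group falls below the threshold and is kept unchanged.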

In another embodiment, the device for machine reading comprehension of Chinese further comprises:

A second judging unit, configured to judge whether a non-Chinese word exists in the first answer text and, if so, to convert the non-Chinese word into the corresponding Chinese and substitute it into the first answer text, to obtain a first answer text in pure Chinese;

A second vectorization unit, configured to vectorize each Chinese character in the pure-Chinese first answer text, to obtain a plurality of first vectors arranged in the first order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string (x1, x2, x3, ..., xn), wherein x is a first vector and n is an integer greater than 1;

A second searching unit, configured to, in the first order, first take the first of the first vectors, x1, as the start vector and combine it with the second of the first vectors, x2, to obtain a first combination vector (x1, x2), and to search the preset template vector database for the preset first combination vector (y1, y2) that has the greatest similarity to the first combination vector (x1, x2) and whose similarity is greater than a preset second threshold, wherein y is a preset vector;

A first solidifying unit, configured to, if the preset first combination vector (y1, y2) is found, combine the first three first vectors to obtain a second combination vector (x1, x2, x3), and to search the template vector database for the preset second combination vector (y1, y2, y3) that has the greatest similarity to the second combination vector and whose similarity is greater than the second threshold; and so on, until a combination vector (y1, y2, y3, ..., yn) cannot be obtained, whereupon a first solidification process is carried out: "in the first vector string, the combination vector (y1, y2, y3, ..., yn-1) replaces the corresponding vector combination (x1, x2, x3, ..., xn-1), and the combination vector (y1, y2, y3, ..., yn-1) is solidified in the first vector string";

A second solidifying unit, configured to, if the preset first combination vector (y1, y2) is not found, carry out a second solidification process: "the first combination vector (x1, x2) is solidified in the first vector string"; and at the same time to take the third of the first vectors as the start vector and repeat the first solidification process and/or the second solidification process, until a third vector string consisting entirely of solidified vectors is obtained;

A second converting unit, configured to convert the third vector string into Chinese, to obtain a third answer text.
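
The growing-combination search and the two solidification processes can be read as a greedy longest-match loop, sketched below. Several points are assumptions made for the illustration: cosine similarity as the similarity measure, a plain list as the template vector database, restarting after a first solidification at the vector that follows the replaced span, and keeping a single trailing vector unchanged. The names `best_template` and `solidify` are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]
Group = List[Vector]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_template(group: Group, templates: List[Group], threshold: float) -> Optional[Group]:
    """Most similar same-length template with similarity strictly above the threshold."""
    flat = [v for vec in group for v in vec]
    best, best_sim = None, threshold
    for template in templates:
        if len(template) != len(group):
            continue
        sim = cosine(flat, [v for vec in template for v in vec])
        if sim > best_sim:
            best, best_sim = template, sim
    return best


def solidify(vectors: Group, templates: List[Group], threshold: float) -> Group:
    """Build the third vector string from the first vector string."""
    result: Group = []
    i, n = 0, len(vectors)
    while i < n:
        if i + 1 >= n:  # assumption: a lone trailing vector is kept as it is
            result.append(vectors[i])
            break
        match = best_template(list(vectors[i:i + 2]), templates, threshold)
        if match is None:
            # Second solidification: keep (x_i, x_i+1) and restart two vectors later.
            result.extend(vectors[i:i + 2])
            i += 2
            continue
        length = 2
        while i + length < n:  # try to extend the combination by one more vector
            longer = best_template(list(vectors[i:i + length + 1]), templates, threshold)
            if longer is None:
                break
            match, length = longer, length + 1
        # First solidification: the longest matching template replaces the span.
        result.extend(match)
        i += length
    return result


first_string = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
template_db = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]],
]
third_string = solidify(first_string, template_db, threshold=0.95)
```

Here the first three vectors are replaced by the three-vector template, because the pair matches and the extension to three still matches, while the last two vectors find no template and are solidified unchanged.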

In one embodiment, the first Chinese text is the text formed by a respondent answering the question in the first problem Chinese text; after the step of inputting the problem vector and the vector to be understood into the preset first Chinese machine reading comprehension model for calculation to obtain the first answer text corresponding to the first problem Chinese text, comprising:

A data set acquiring unit, configured to obtain a preset Chinese reading comprehension data set, wherein the Chinese reading comprehension data set comprises a plurality of pieces of training data, each consisting of a second problem Chinese text, a second Chinese text to be understood and a fourth answer text in one-to-one correspondence;

A training vectorization unit, configured to vectorize each piece of training data using the language model, to obtain training vector data consisting of a second problem Chinese text vector, a second Chinese text vector and a fourth answer text vector in one-to-one correspondence;

A training unit, configured to input the training vector data into a preset second Chinese machine reading comprehension model for training, to obtain the first Chinese machine reading comprehension model.
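
The one-to-one correspondence between question, passage and answer that these units must preserve can be sketched with a small data structure. The class `TrainingExample`, the function `vectorize_dataset` and the encoder `toy_encode` are hypothetical; `toy_encode` merely maps each character to its code point and stands in for the language model, and the training step itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
Encoder = Callable[[str], List[Vector]]
VectorTriple = Tuple[List[Vector], List[Vector], List[Vector]]


@dataclass(frozen=True)
class TrainingExample:
    question: str  # second problem Chinese text
    passage: str   # second Chinese text to be understood
    answer: str    # fourth answer text


def vectorize_dataset(examples: List[TrainingExample], encode: Encoder) -> List[VectorTriple]:
    """Vectorize every field while keeping each triple together."""
    return [(encode(e.question), encode(e.passage), encode(e.answer)) for e in examples]


def toy_encode(text: str) -> List[Vector]:
    # Hypothetical stand-in for the language model: one 1-dimensional vector per character.
    return [[float(ord(ch))] for ch in text]


dataset = [TrainingExample(question="谁去上学", passage="我要去上学", answer="我")]
training_vectors = vectorize_dataset(dataset, toy_encode)
```

Keeping the three vectorized fields in one tuple per example means the correspondence cannot be lost when the data is shuffled or batched for training.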

Each of the above units and modules is the corresponding device for executing the above method embodiments, and they are not described one by one here.

Referring to Fig. 3, computer equipment is also provided in the embodiments of the present application. The computer equipment may be a server, and its internal structure may be as shown in Fig. 3. The computer equipment comprises a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer equipment is configured to provide computing and control capabilities. The memory of the computer equipment comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer equipment is used to store data such as the language model and various Chinese texts. The network interface of the computer equipment is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements the method for machine reading comprehension of Chinese described in any of the above embodiments.

A computer-readable storage medium is also provided in the embodiments of the present application, on which a computer program is stored. When executed by a processor, the computer program implements the method for machine reading comprehension of Chinese described in any of the above embodiments.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the present application and its embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes the element.

The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

1. A method for machine reading comprehension of Chinese, characterized by comprising:

Inputting the first problem Chinese text and the first Chinese text respectively into a preset language model for vectorization, to obtain a problem vector of the first problem Chinese text and a vector to be understood of the first Chinese text, wherein the language model is BERT;

Inputting the problem vector and the vector to be understood into a preset first Chinese machine reading comprehension model for calculation, to obtain a first answer text corresponding to the first problem Chinese text.

2. The method for machine reading comprehension of Chinese according to claim 1, characterized in that the step of inputting the first problem Chinese text and the first Chinese text respectively into the preset language model for vectorization, to obtain the problem vector of the first problem Chinese text and the vector to be understood of the first Chinese text, comprises:

Merging the character vector and the character position vector corresponding to each character, to obtain the problem vector corresponding to the first problem Chinese text and the vector to be understood corresponding to the first Chinese text.

3. The method for machine reading comprehension of Chinese according to claim 1, characterized in that, after the step of inputting the problem vector and the vector to be understood into the preset first Chinese machine reading comprehension model for calculation to obtain the first answer text corresponding to the first problem Chinese text, the method comprises:

Judging whether a non-Chinese word exists in the first answer text and, if so, converting the non-Chinese word into the corresponding Chinese and substituting it into the first answer text, to obtain a first answer text in pure Chinese;

Vectorizing each Chinese character in the pure-Chinese first answer text, to obtain a plurality of first vectors arranged in the first order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string;

Grouping the plurality of first vectors, in the first order, into groups of a specified number of first vectors each, to obtain a plurality of first vector groups;

Searching a preset template vector database, for each first vector group, for the second vector group that has the highest similarity to it and whose similarity reaches a preset first threshold;

If the second vector group is found, replacing the corresponding first vector group in the first vector string with the second vector group, to obtain a second vector string;

4. The method for machine reading comprehension of Chinese according to claim 1, characterized in that, after the step of inputting the problem vector and the vector to be understood into the preset first Chinese machine reading comprehension model for calculation to obtain the first answer text corresponding to the first problem Chinese text, the method comprises:

Vectorizing each Chinese character in the pure-Chinese first answer text, to obtain a plurality of first vectors arranged in the first order of the characters in the pure-Chinese first answer text, the plurality of first vectors forming a first vector string (x1, x2, x3, ..., xn), wherein x is a first vector and n is an integer greater than 1;

In the first order, first taking the first of the first vectors, x1, as the start vector and combining it with the second of the first vectors, x2, to obtain a first combination vector (x1, x2), and searching a preset template vector database for the preset first combination vector (y1, y2) that has the greatest similarity to the first combination vector (x1, x2) and whose similarity is greater than a preset second threshold, wherein y is a preset vector;

If the preset first combination vector (y1, y2) is found, combining the first three first vectors to obtain a second combination vector (x1, x2, x3), and searching the template vector database for the preset second combination vector (y1, y2, y3) that has the greatest similarity to the second combination vector and whose similarity is greater than the second threshold; and so on, until a combination vector (y1, y2, y3, ..., yn) cannot be obtained, whereupon a first solidification process is carried out: "in the first vector string, the combination vector (y1, y2, y3, ..., yn-1) replaces the corresponding vector combination (x1, x2, x3, ..., xn-1), and the combination vector (y1, y2, y3, ..., yn-1) is solidified in the first vector string";

If the preset first combination vector (y1, y2) is not found, carrying out a second solidification process: "the first combination vector (x1, x2) is solidified in the first vector string"; and at the same time taking the third of the first vectors as the start vector and repeating the first solidification process and/or the second solidification process, until a third vector string consisting entirely of solidified vectors is obtained;

Converting the third vector string into Chinese, to obtain a third answer text.

5. The method for machine reading comprehension of Chinese according to claim 1, characterized in that the first Chinese text is the text formed by a respondent answering the question in the first problem Chinese text; after the step of inputting the problem vector and the vector to be understood into the preset first Chinese machine reading comprehension model for calculation to obtain the first answer text corresponding to the first problem Chinese text, the method comprises:

6. The method for machine reading comprehension of Chinese according to claim 1, characterized in that, before the step of obtaining the first problem text and the first Chinese text to be understood, the method comprises:

Obtaining a preset Chinese reading comprehension data set, wherein the Chinese reading comprehension data set comprises a plurality of pieces of training data, each consisting of a second problem Chinese text, a second Chinese text to be understood and a fourth answer text in one-to-one correspondence;

Vectorizing each piece of training data using the language model, to obtain training data consisting of a second problem Chinese text vector, a second Chinese text vector and a fourth answer text vector in one-to-one correspondence;

7. The method for machine reading comprehension of Chinese according to claim 1, characterized in that, before the step of inputting the first problem Chinese text and the first Chinese text respectively into the preset language model for vectorization, to obtain the problem vector of the first problem Chinese text and the vector to be understood of the first Chinese text, the method comprises:

If a foreign word is found, translating the found foreign word into Chinese, and replacing the corresponding foreign word with the Chinese obtained by translation.

8. A device for machine reading comprehension of Chinese, characterized by comprising:

A vectorization unit, configured to input the first problem Chinese text and the first Chinese text respectively into a preset language model for vectorization, to obtain a problem vector of the first problem Chinese text and a vector to be understood of the first Chinese text, wherein the language model is BERT;

A computing unit, configured to input the problem vector and the vector to be understood into a preset first Chinese machine reading comprehension model for calculation, to obtain a first answer text corresponding to the first problem Chinese text.

9. Computer equipment, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program implements the steps of the method according to any one of claims 1 to 7 when executed by a processor.