CN110414395B - Content identification method, device, server and storage medium - Google Patents


Info

Publication number
CN110414395B
Authority
CN
China
Prior art keywords
target
block
text
layout
line
Prior art date
Legal status
Active
Application number
CN201910651578.6A
Other languages
Chinese (zh)
Other versions
CN110414395A (en)
Inventor
罗强 (Luo Qiang)
Current Assignee
Douyin Vision Co Ltd
Douyin Vision Beijing Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN201910651578.6A
Publication of CN110414395A
Application granted
Publication of CN110414395B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure provide a content identification method and device, a server, and a storage medium. The method includes: dividing a target text into at least one block based on block types; acquiring the correspondence between each line of text in the target text and the block to which it belongs; determining, based on the correspondence, the layout features corresponding to the target words in each line of text; and performing named entity recognition on the target text based on the layout features corresponding to the target words.

Description

Content identification method, content identification device, server and storage medium
Technical Field
The present disclosure relates to content identification technologies, and in particular, to a content identification method, a content identification device, a server, and a storage medium.
Background
To better manage and use text data, named entity recognition is often used to parse the text data into structured data.
In the related art, a named entity recognition method generally takes a sentence or a line of text as its unit: entities in the sentence or line are recognized by inference from the context information within that sentence or line. However, for texts that present information tersely, such as resumes and medical records, a single sentence or line does not carry enough information for entity inference, so accurate named entity recognition cannot be achieved.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a content identification method, device, server and storage medium.
The embodiment of the disclosure provides a content identification method, which includes:
dividing the target text into at least one block based on the block type;
acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs;
determining, based on the correspondence, the layout features corresponding to the target words in each line of text;
and performing named entity recognition on the target text based on the layout features corresponding to the target words.
In the above scheme, the dividing the target text into at least one block based on the block type includes:
matching the text content of the target text with the block type to obtain keywords matched with the block type in the text content;
and dividing the target text into blocks based on the positions in the target text where the keywords are located.
In the foregoing solution, the determining, based on the correspondence, layout features corresponding to target words in each line of text includes:
determining, based on the correspondence, the block to which the target word in each line of text belongs and the number of lines between that line and the starting line of the block;
acquiring context information of each divided block;
determining the blocks adjacent to the block to which the target word belongs based on the context information;
and determining the layout features corresponding to the target words in each line of text based on the block to which each target word belongs, the number of lines to the starting line of that block, and the blocks adjacent to it.
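The layout-feature derivation described above can be sketched as follows. This is a minimal illustration only, not part of the disclosure: the `LayoutFeature` structure and all function and parameter names are assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class LayoutFeature:
    block_type: str            # block to which the target word belongs
    line_offset: int           # lines between the word's line and the block's starting line
    prev_block: Optional[str]  # adjacent preceding block, from the block context
    next_block: Optional[str]  # adjacent following block, from the block context

def layout_features(line_to_block: Dict[int, str],
                    block_order: List[str],
                    block_start_line: Dict[str, int],
                    line_no: int) -> LayoutFeature:
    """Derive the layout feature for a word on the given line.

    line_to_block:    line number -> block type (the line/block correspondence)
    block_order:      block types in document order (context information)
    block_start_line: block type -> line number where that block starts
    """
    block = line_to_block[line_no]
    idx = block_order.index(block)
    return LayoutFeature(
        block_type=block,
        line_offset=line_no - block_start_line[block],
        prev_block=block_order[idx - 1] if idx > 0 else None,
        next_block=block_order[idx + 1] if idx + 1 < len(block_order) else None,
    )
```

With a resume whose "basic" block spans lines 1-3 and whose "work" block starts at line 4, a word on line 5 would receive the block type "work", an offset of 1 from the block's starting line, and "basic" as the adjacent preceding block.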
In the foregoing solution, the performing named entity recognition on the target text based on the layout features corresponding to the target words includes:
acquiring a feature vector of the layout feature corresponding to the target word;
obtaining a word vector corresponding to the target word;
concatenating the feature vector and the word vector to obtain a target feature vector corresponding to the target word;
and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
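The concatenation ("splicing") step above amounts to joining the two vectors end to end; the sketch below is illustrative only, and the function name is an assumption.

```python
from typing import List, Sequence

def splice(feature_vec: Sequence[float], word_vec: Sequence[float]) -> List[float]:
    """Concatenate the layout feature vector and the word vector into the
    target feature vector for one target word."""
    return list(feature_vec) + list(word_vec)
```

The resulting target feature vector has dimensionality equal to the sum of the two input dimensionalities, so the downstream model sees both the word's semantics and its position in the document layout.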
In the foregoing scheme, the obtaining the feature vector of the layout feature corresponding to the target word includes:
constructing a layout feature vector matrix, where the layout feature vector matrix includes a vector corresponding to each block;
constructing a layout offset feature vector matrix, where the offset feature vector matrix includes vectors corresponding to the number of lines between the target word and the starting line of the block to which it belongs;
and acquiring the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and layout offset feature vector matrix.
In the foregoing solution, the performing named entity recognition on the target text based on the layout features corresponding to the target words includes:
inputting the layout features corresponding to the target words into a neural network model and outputting the entity labels corresponding to the target words, where an entity label represents the entity category corresponding to the target word.
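The patent specifies only the interface of this step: feature vectors go into a neural network model, and one entity label per target word comes out. The sketch below shows that data flow with an untrained stub standing in for the model; a real system would substitute a trained sequence-labeling network (for example, a BiLSTM-CRF), and all names here are assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: per-word feature vectors in, entity labels out.
Tagger = Callable[[Sequence[Sequence[float]]], List[str]]

def recognize_entities(feature_vectors: Sequence[Sequence[float]],
                       model: Tagger) -> List[str]:
    """Run the model over the per-word target feature vectors."""
    tags = model(feature_vectors)
    assert len(tags) == len(feature_vectors), "one entity label per target word"
    return tags

def untrained_stub(vectors):
    # Stand-in for a trained network: labels every word "O" (outside any entity).
    return ["O" for _ in vectors]
```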
An embodiment of the present disclosure provides a content identification apparatus, including:
the block dividing unit is used for dividing the target text into at least one block based on the block type;
the acquisition unit is used for acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs;
a determining unit, configured to determine layout features corresponding to target words in each line of text respectively based on the correspondence;
and the identification unit is used for performing named entity recognition on the target text based on the layout features corresponding to the target words.
In the above scheme, the block dividing unit is further configured to match text content of the target text with a block type to obtain a keyword in the text content, where the keyword is matched with the block type;
and divide the target text into blocks based on the positions in the target text where the keywords are located.
In the above scheme, the determining unit is further configured to determine, based on the correspondence, the block to which the target word in each line of text belongs and the number of lines between that line and the starting line of the block;
acquire context information of each divided block;
determine the blocks adjacent to the block to which the target word belongs based on the context information;
and determine the layout features corresponding to the target words in each line of text based on the block to which each target word belongs, the number of lines to the starting line of that block, and the blocks adjacent to it.
In the above scheme, the identification unit is further configured to obtain a feature vector of a layout feature corresponding to the target word;
obtaining a word vector corresponding to the target word;
concatenating the feature vector and the word vector to obtain a target feature vector corresponding to the target word;
and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
In the above scheme, the identifying unit is further configured to construct a layout feature vector matrix, where the layout feature vector matrix includes a vector corresponding to each block;
construct a layout offset feature vector matrix, where the offset feature vector matrix includes vectors corresponding to the number of lines between the target word and the starting line of the block to which it belongs;
and acquire the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and layout offset feature vector matrix.
In the above scheme, the identification unit is further configured to input the layout features corresponding to the target word into a neural network model and output an entity label corresponding to the target word, where the entity label represents the entity category corresponding to the target word.
An embodiment of the present disclosure provides a server, including:
a memory for storing executable instructions;
and the processor is used for realizing the content identification method provided by the embodiment of the disclosure when the executable instruction is executed.
An embodiment of the present disclosure provides a storage medium storing executable instructions that, when executed, implement the content identification method provided by the embodiments of the present disclosure.
The embodiment of the disclosure has the following beneficial effects:
By applying the embodiments of the present disclosure, the target text is divided into at least one block based on block types, the correspondence between each line of text in the target text and the block to which it belongs is obtained, and the layout features corresponding to the target words in each line of text are determined based on that correspondence, so that named entity recognition is performed on the target text according to the layout features.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is an architecture diagram of a content recognition system provided by an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a content recognition apparatus provided in an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a component structure of a content identification device according to an embodiment of the disclosure;
fig. 4 is a schematic flow chart of a content identification method provided by an embodiment of the present disclosure;
Fig. 5 is a schematic illustration of a resume provided by an embodiment of the present disclosure;
fig. 6 is a schematic flow chart of a content identification method according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and its variants, as used herein, are inclusive, i.e., "including but not limited to"; the term "based on" is "based at least in part on"; the term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments", and relevant definitions of other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will appreciate that references to "one or more" are intended to be exemplary and not limiting unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Before the embodiments of the present disclosure are explained in detail, the terms involved in the embodiments of the present disclosure are explained; these explanations apply throughout the following description.
1) Named entity recognition: also called "proper name recognition", refers to recognizing entities with specific meanings in text, mainly including names of people, places, and organizations, proper nouns, and the like;
2) Context information: some or all of the information that can affect an object in a scene or image.
First, a content identification system according to an embodiment of the present disclosure is described. Fig. 1 is an architecture diagram of a content identification system provided by an embodiment of the present disclosure. Referring to fig. 1, to support an exemplary application, the content identification system 100 includes terminals (a terminal 400-1 and a terminal 400-2) and a server 200. The terminals are connected to the server 200 through a network 300, which may be a wide area network, a local area network, or a combination of the two, and data transmission may be implemented over wireless links.
A terminal (terminal 400-1 and/or terminal 400-2) for sending the target text to the server 200;
a server 200 for dividing the target text into at least one block based on the block type; acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs; respectively determining layout characteristics corresponding to the target words in each line of text based on the corresponding relationship; and carrying out named entity recognition on the target text based on the layout characteristics corresponding to the target words.
Next, the content recognition apparatus provided in the embodiments of the present disclosure is described. The content recognition apparatus of the embodiments of the present disclosure may be implemented in various forms: by a terminal such as a smart phone, a tablet computer, or a desktop computer; by a server; or by the terminal and the server in cooperation. The content identification device provided by the embodiments of the present disclosure may be implemented in hardware or in a combination of hardware and software; various exemplary implementations are described below.
The hardware structure of the content identification device according to the embodiment of the present disclosure is described in detail below. Fig. 2 is a schematic diagram of the component structure of the content identification device according to the embodiment of the present disclosure; the components shown in fig. 2 are only an example and should not limit the function or application range of the embodiments of the present disclosure.
As shown in fig. 2, the content recognition device may include a processing device (e.g., a central processing unit or a graphics processor) 210, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 220 or a program loaded from a storage device 280 into a Random Access Memory (RAM) 230. The RAM 230 also stores various programs and data necessary for the operation of the terminal. The processing device 210, the ROM 220, and the RAM 230 are connected to each other through a bus 240. An Input/Output (I/O) interface 250 is also connected to the bus 240.
Generally, the following devices may be connected to I/O interface 250: input devices 260 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 270 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, or the like; storage devices 280 including, for example, magnetic tape, hard disk, etc.; and a communication device 290. The communication means 290 may allow the terminal to perform wireless or wired communication with other devices to exchange data. While fig. 2 illustrates a terminal having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, the processes described by the provided flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network through communication device 290, or installed from storage device 280, or installed from ROM 220. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 210.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the terminal; or may be separate and not assembled into the terminal.
The computer readable medium carries one or more programs which, when executed by the terminal, cause the terminal to perform the content identification method provided by the embodiments of the present disclosure.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams provided by the embodiments of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units and/or modules described in the embodiments of the present disclosure may be implemented by software or hardware.
The functions described in the embodiments of the present disclosure may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field-Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of embodiments of the present disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The following describes a software implementation of the content recognition apparatus provided in the embodiments of the present disclosure. Fig. 3 is a schematic diagram of a composition structure of a content identification device according to an embodiment of the present disclosure, and referring to fig. 3, a content identification device 30 according to an embodiment of the present disclosure includes:
a block dividing unit 31 for dividing the target text into at least one block based on the block type;
an obtaining unit 32, configured to obtain a corresponding relationship between each line of text in the target text and a block to which each line of text belongs;
a determining unit 33, configured to determine, based on the correspondence, layout features corresponding to target words in each line of text respectively;
and the identifying unit 34 is configured to perform named entity identification on the target text based on the layout features corresponding to the target words.
In some embodiments, the block dividing unit 31 is further configured to match the text content of the target text with the block type to obtain a keyword in the text content, where the keyword is matched with the block type;
and divide the target text into blocks based on the positions in the target text where the keywords are located.
In some embodiments, the determining unit 33 is further configured to determine, based on the correspondence, the block to which the target word in each line of text belongs and the number of lines between that line and the starting line of the block;
acquire context information of each divided block;
determine the blocks adjacent to the block to which the target word belongs based on the context information;
and determine the layout features corresponding to the target words in each line of text based on the block to which each target word belongs, the number of lines to the starting line of that block, and the blocks adjacent to it.
In some embodiments, the identifying unit 34 is further configured to obtain a feature vector of the layout feature corresponding to the target word;
obtaining a word vector corresponding to a target word;
concatenating the feature vectors and the word vectors to obtain the target feature vectors corresponding to the target words;
and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
In some embodiments, the identifying unit 34 is further configured to construct a layout feature vector matrix, where the layout feature vector matrix includes a vector corresponding to each block;
construct a layout offset feature vector matrix, where the offset feature vector matrix includes vectors corresponding to the number of lines between the target word and the starting line of the block to which it belongs;
and acquire the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and layout offset feature vector matrix.
In some embodiments, the identifying unit 34 is further configured to input the layout features corresponding to the target word into the neural network model, and output an entity label corresponding to the target word, where the entity label characterizes an entity category corresponding to the target word.
It should be noted that the above division of units does not limit the electronic device itself; for example, some units may be split into two or more sub-units, or some units may be combined into a new unit.
It should be further noted that the names of the units do not, in some cases, limit the units themselves; for example, the block dividing unit 31 may also be described as "a unit that divides the target text into at least one block based on the block type".
For the same reason, units and/or modules of the electronic device that are not described in detail are not thereby absent; all operations performed by the electronic device may be implemented by corresponding units and/or modules within it.
Continuing with fig. 4, fig. 4 is a schematic flow chart of the content identification method provided in the embodiment of the present disclosure. Taking execution by a server as an example and referring to fig. 4, the content identification method in the embodiment of the present disclosure includes:
step 401: the server divides the target text into at least one block based on the block type.
It should be noted that the block type is defined according to the type of the target text. In some embodiments, the target text may be a resume, and the tile type may be defined as "basic information," "educational experience," "work experience," "project experience," or the like; in other embodiments, the target text may be a medical record, and the block type may be defined as "basic information", "chief complaints", "current medical history", "past history", and the like.
In actual implementation, the target text may be sent to the server by the user through the client, for example, the user may send the resume to the server through the resume delivery client on the terminal, and the server receives the resumes sent from each terminal and performs corresponding processing on the received resumes; the target text may also be pre-stored by the server.
In some embodiments, the server may divide the target text into at least one block by: matching the text content of the target text with the block types to obtain the keywords in the text content that match the block types; and dividing the target text into blocks based on the positions of the keywords in the target text.
Here, all block types that may appear in the target text are predefined, the target text is matched against each block type to obtain the keywords corresponding to each block type, and the target text is divided into at least one block according to the line in which each keyword appears.
Taking a resume as the target text, fig. 5 is a schematic diagram of the resume provided by the embodiment of the present disclosure, where the block types are defined as "basic information", "educational experience", "work experience", and "project experience". Matching the text content of the resume shown in fig. 5 against the predefined block types yields the keywords "personal information", "educational experience", "work situation", and "project experience" corresponding to the block types. Here, "personal information" is located in the first line of the target text and "work situation" in the fourth line, so the first through third lines may be divided into the block corresponding to "basic information".
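For illustration only, the keyword-matching division described above might be sketched as follows; the function name, keyword table, and sample lines are assumptions for the example, not part of the disclosure:

```python
# Illustrative sketch of block division by keyword matching; the keyword
# table and all names are assumptions, not taken from the disclosure.
BLOCK_KEYWORDS = {
    "basic information": ["personal information", "basic information"],
    "educational experience": ["educational experience"],
    "work experience": ["work situation", "work experience"],
    "project experience": ["project experience"],
}

def divide_into_blocks(lines):
    """Return (block_type, start_line, end_line) tuples with 0-based lines."""
    boundaries = []  # (line index, block type) for each line matching a keyword
    for i, line in enumerate(lines):
        for block_type, keywords in BLOCK_KEYWORDS.items():
            if any(kw in line.lower() for kw in keywords):
                boundaries.append((i, block_type))
                break
    blocks = []
    for j, (start, block_type) in enumerate(boundaries):
        end = boundaries[j + 1][0] - 1 if j + 1 < len(boundaries) else len(lines) - 1
        blocks.append((block_type, start, end))
    return blocks
```

A line that matches a block-type keyword opens a block that runs until the next matching line, mirroring the first-to-third-line example above.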
In some embodiments, the server may also divide the target text into at least one block based on a text classification algorithm: for example, classify each line of text so that each line corresponds to one block type, and divide the target text into blocks according to the correspondence between each line of text and its block type.
Step 402: and acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs.
Here, the server may determine, directly according to each block obtained by dividing, a correspondence between each line of text in the target text and the block to which the line of text belongs.
Step 403: and respectively determining layout characteristics corresponding to the target words in each line of text based on the corresponding relation.
Here, the layout features corresponding to a target word in each line of text are the same as the layout features of the line in which the target word is located. In actual implementation, the server may determine the layout features of each line according to the correspondence between each line of text and the block to which it belongs, and then determine the layout features corresponding to the target word from the layout features of the line in which it is located.
In some embodiments, the server may perform word segmentation on each line of text to obtain each word in each line of text, use all the obtained words as target words, and determine layout characteristics of each word according to the obtained corresponding relationship; in other embodiments, after segmenting the text of each line, the server may pre-process the words obtained by segmenting the text of each line to screen out the target words, and then determine the layout characteristics corresponding to the target words according to the corresponding relationship.
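As a minimal sketch of the segmentation step, using whitespace tokenization as a stand-in for a real word segmenter (for Chinese text this would be a tool such as jieba; all names here are illustrative):

```python
def target_words(lines, line_to_block):
    """Yield (word, block_id) pairs; line_to_block maps a 0-based line index
    to the identifier of the block that line belongs to."""
    for i, line in enumerate(lines):
        for word in line.split():  # placeholder for a proper word segmenter
            yield word, line_to_block[i]
```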
In some embodiments, the server may determine the layout features by: determining, based on the correspondence, the block to which the target word in each line of text belongs and the number of lines from the start line of that block; acquiring context information of each divided block; determining the blocks adjacent to the block to which the target word belongs based on the context information; and determining the layout features corresponding to the target words in each line of text based on the block to which each target word belongs, the number of lines from the start line of that block, and the blocks adjacent to it.
In practical implementation, each block is given a unique identifier. The identifier of the block to which the target word belongs is used as the current layout feature, the number of lines from the start line of that block as the layout offset feature, the identifier of the previous block as the predecessor layout feature, and the identifier of the next block as the successor layout feature. The current layout feature, layout offset feature, predecessor layout feature, and successor layout feature are integrated to obtain the layout features.
For example, if the current layout feature is curr_block_id, the layout offset feature is line_offset, the predecessor layout feature is pre_block_id, and the successor layout feature is next_block_id, then the layout features may be represented as [curr_block_id, line_offset, pre_block_id, next_block_id].
It should be noted that when the block to which the target word belongs is the first block of the target text, it has no previous block; when it is the last block of the target text, it has no next block. Therefore, the space outside the target text is abstracted as an assumed empty block with its own identifier, e.g. the identifier of the empty block is set to "0".
For example, as shown in fig. 5, each block is assigned a block_id as its identifier, numbered from 1: the block_id corresponding to "basic information" is 1, to "educational experience" is 2, to "work experience" is 3, and to "project experience" is 4. For the "Li" in the second line, the current layout feature curr_block_id is 1; the successor layout feature next_block_id is 3; since the block to which it belongs is the first block of the resume, the predecessor layout feature pre_block_id is set to 0; and since "Li" is located in the second line of its block, the layout offset feature is 2. The layout features corresponding to the "Li" in the second line can therefore be represented as [1, 2, 0, 3].
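The quadruple [curr_block_id, line_offset, pre_block_id, next_block_id] can be computed from the block boundaries. The sketch below assumes blocks are adjacent in reading order (the figure's actual layout may yield different neighbors, as in the [1, 2, 0, 3] example above); all names are illustrative:

```python
def layout_features(line_no, blocks):
    """blocks: ordered (block_id, start_line, end_line) tuples, 1-based lines.
    Returns [curr_block_id, line_offset, pre_block_id, next_block_id];
    the assumed empty block outside the text has identifier 0."""
    for idx, (block_id, start, end) in enumerate(blocks):
        if start <= line_no <= end:
            return [
                block_id,
                line_no - start + 1,                                # offset: 1 = start line
                blocks[idx - 1][0] if idx > 0 else 0,               # predecessor block
                blocks[idx + 1][0] if idx + 1 < len(blocks) else 0, # successor block
            ]
    raise ValueError("line does not belong to any block")
```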
Step 404: and carrying out named entity identification on the target text based on the layout characteristics corresponding to the target words.
Here, the server can perform named entity recognition on the target text directly based on the layout features corresponding to the target words, without combining context information within a sentence or line. For example, a "Li" in the block corresponding to basic information refers to a person's name, while a "Li" in the block corresponding to work experience refers to a company.
In some embodiments, the server may perform named entity recognition on the target text by: acquiring a feature vector of layout features corresponding to a target word; acquiring a word vector corresponding to a target word; splicing the feature vectors and the word vectors to obtain target feature vectors corresponding to the target words; and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
Here, the word vector corresponding to the target word may be a word embedding vector, i.e. the result of embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, mapping each word or phrase to a vector over the real numbers. In practical implementation, the word vectors can be trained with tools such as word2vec.
In some embodiments, the server may obtain the feature vector of the layout features corresponding to the target word by: constructing a layout feature vector matrix comprising the vectors corresponding to the blocks; constructing a layout offset feature vector matrix comprising the vectors corresponding to the number of lines from the target word to the start line of its block; and obtaining the feature vector of the layout features corresponding to the target word based on the constructed layout feature vector matrix and layout offset feature vector matrix.
In practical implementation, the layout feature vector matrix may be constructed as follows: assuming the target text is divided into m blocks in total, and the space outside the target text (before its start position and after its end position) is abstracted as an assumed empty block identified as "0", there are m+1 blocks in total. Define the vectorization matrix of the layout features as A, an (m+1) × k matrix, where k is the dimension of each block vector; each row of the matrix A corresponds to one block. A is a learnable model parameter: its value is randomly initialized before training and continually adjusted during training.
In practical implementation, the layout offset feature vector matrix can be constructed as follows: take the number of lines from the target word to the start line of its block as the layout offset feature, and assume the maximum such number of lines is n, so that the maximum layout offset is n. Define the layout offset feature vector matrix as B, an n × d matrix, where d is the dimension of the layout offset feature vector. B is likewise a learnable model parameter, randomly initialized before training and continually adjusted during training.
Here, the server may extract from the matrix A the vector corresponding to the block to which the target word belongs and the vectors corresponding to the blocks adjacent to it, extract from the matrix B the vector corresponding to the number of lines from the target word to the start line of its block, and splice the four extracted vectors to obtain the feature vector of the layout features corresponding to the target word. Since the dimension of each block vector is k and the dimension of the layout offset feature vector is d, the dimension of this feature vector is 3k + d. Assuming the dimension of the word vector corresponding to the target word is c, splicing the feature vector of the layout features with the word vector yields a target feature vector of dimension 3k + d + c.
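The lookup-and-splice step can be sketched with NumPy; the sizes m, k, n, d, c and all names are illustrative assumptions:

```python
import numpy as np

m, k, n, d, c = 4, 8, 10, 4, 16    # illustrative sizes, not from the disclosure
rng = np.random.default_rng(0)
A = rng.normal(size=(m + 1, k))    # layout feature matrix; row 0 = empty block
B = rng.normal(size=(n, d))        # layout offset feature matrix

def layout_feature_vector(curr_id, line_offset, pre_id, next_id):
    # Extract three block vectors from A and one offset vector from B,
    # then splice them: dimension 3k + d.
    return np.concatenate([A[curr_id], A[pre_id], A[next_id], B[line_offset - 1]])

def target_feature_vector(word_vec, feats):
    # Splice the layout feature vector with the word vector: 3k + d + c dims.
    return np.concatenate([layout_feature_vector(*feats), word_vec])
```

In training, A and B would be learnable embedding tables rather than fixed random matrices.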
In some embodiments, the server may perform named entity recognition on the target term through a neural network model: and the server inputs the layout characteristics corresponding to the target words into the neural network model and outputs entity labels corresponding to the target words, wherein the entity labels represent entity categories corresponding to the target words.
Here, the entity category corresponding to each target word can be predicted using a Bi-LSTM + CRF network structure. The entity labels may adopt the BIO label set: B-PER and I-PER mark the beginning and non-beginning tokens of a person name, B-LOC and I-LOC the beginning and non-beginning tokens of a place name, B-ORG and I-ORG the beginning and non-beginning tokens of an organization name, and O marks a word that is not part of any named entity. A BIOES label set may also be adopted: for example, B-Person marks a word at the beginning of a person name, I-Person a word in the middle, E-Person a word at the end, and S-Person a person name consisting of a single word.
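For illustration, the model's BIO-tagged output sequence can be decoded back into entity spans as follows (the tagging model itself is not sketched here; function and tag names are illustrative):

```python
def decode_bio(words, tags):
    """Collect (entity_text, category) spans from a BIO tag sequence."""
    entities, current, category = [], [], None

    def close():
        if current:
            entities.append((" ".join(current), category))

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            close()
            current, category = [word], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == category:
            current.append(word)
        else:  # "O" or an inconsistent I- tag closes the open span
            close()
            current, category = [], None
    close()
    return entities
```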
In some embodiments, the server may also perform named entity recognition on the target word directly using a conventional sequence tagging model, such as a Conditional Random Field (CRF) algorithm.
By applying the embodiment of the present disclosure, the target text is divided into at least one block based on the block type, the correspondence between each line of text in the target text and the block to which it belongs is obtained, and the layout features corresponding to the target words in each line of text are determined based on the correspondence, so that named entity recognition is performed on the target text according to the layout features.
Next, fig. 6 is a schematic flow chart of a content identification method provided in the embodiment of the present disclosure, and referring to fig. 6, the content identification method in the embodiment of the present disclosure includes:
step 601: all tile types that may appear in the target text are predefined.
Step 602: and dividing the target text into at least one block according to the block type, and assigning an identifier to each block.
Step 603: and acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs.
Step 604: and constructing the current layout characteristics and layout offset characteristics of each line of text according to the acquired corresponding relation.
Here, the identifier of the block to which each line of text belongs is defined as the current layout feature, and the number of lines from each line of text to the start line of its block is defined as the layout offset feature.
Step 605: and acquiring the context information of each divided block.
Step 606: and constructing the predecessor layout features and the successor layout features of each line of text according to the context information of each divided block.
Here, the identifier of the previous block of the block to which each line of text belongs is defined as the predecessor layout feature; if that block is the first block, the predecessor layout feature is set to 0. The identifier of the next block is defined as the successor layout feature; if that block is the last block, the successor layout feature is set to 0.
Step 607: and integrating the current layout features, layout offset features, predecessor layout features, and successor layout features corresponding to each line of text to obtain the layout features corresponding to each line of text.
Step 608: and constructing a layout feature vector matrix which comprises vectors corresponding to all the blocks.
Here, assuming there are m predefined blocks, the space outside the target text is abstracted as an assumed empty block identified as "0", so there are m+1 blocks in total. Define the layout feature vector matrix as A, an (m+1) × k matrix, where k is the dimension of each block vector; each row of the matrix A corresponds to one block. A is a learnable model parameter: its value is randomly initialized before training and continually adjusted during training.
Step 609: and constructing a layout offset feature vector matrix, wherein the offset feature vector matrix comprises the vectors corresponding to each layout offset feature.
Here, assuming that the maximum layout offset is n, a layout offset feature vector matrix is defined as B, where B is an n × d matrix, where d is a dimension of the layout offset feature vector. B is a parameter which can be learned by the model, the value of B is initialized randomly before training, and the B can be adjusted continuously in the training process.
Step 610: and extracting the corresponding feature vectors from the layout feature vector matrix and the layout offset feature vector matrix according to the layout features of the line of text to which the target word belongs, to obtain the feature vector of the layout features corresponding to the target word.
Step 611: and splicing the feature vectors and the word vectors to obtain target feature vectors corresponding to the target words.
Step 612: and inputting the target feature vector into the neural network model, and outputting an entity label corresponding to the target word, wherein the entity label represents the entity category corresponding to the target word.
According to one or more embodiments of the present disclosure, there is provided a content recognition method including:
dividing the target text into at least one block based on the block type;
acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs;
respectively determining layout characteristics corresponding to the target words in each line of text based on the corresponding relationship;
and carrying out named entity recognition on the target text based on the layout characteristics corresponding to the target words.
According to one or more embodiments of the present disclosure, there is provided the above content identification method, where the dividing of the target text into at least one block based on the block type includes:
matching the text content of the target text with the block type to obtain keywords matched with the block type in the text content;
and carrying out block division on the target text based on the position of the target text where the keyword is positioned.
According to one or more embodiments of the present disclosure, the above content identification method is provided, where the determining layout features corresponding to target words in each line of text respectively based on the correspondence includes:
respectively determining a block to which a target word belongs and the number of lines from a starting line of the block to which the target word belongs in each line of text based on the corresponding relation;
acquiring context information of each divided block;
determining a block adjacent to a block to which a target word belongs based on the context information;
and respectively determining layout features corresponding to the target words in each line of text based on the block to which the target words in each line of text belong, the number of lines from the start line of that block, and the blocks adjacent to the block to which the target words belong.
According to one or more embodiments of the present disclosure, there is provided the above content identification method, where performing named entity identification on the target text based on the layout features corresponding to the target words includes:
acquiring a feature vector of the layout feature corresponding to the target word;
obtaining a word vector corresponding to the target word;
splicing the feature vector and the word vector to obtain a target feature vector corresponding to the target word;
and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
According to one or more embodiments of the present disclosure, the above content identification method is provided, where the obtaining of the feature vector of the layout feature corresponding to the target word includes:
constructing a layout feature vector matrix, wherein the layout feature vector matrix comprises vectors corresponding to all blocks;
constructing a layout offset feature vector matrix, wherein the offset feature vector matrix comprises the vectors corresponding to the number of lines from the target word to the start line of the block to which it belongs;
and acquiring the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and the layout offset feature vector matrix.
According to one or more embodiments of the present disclosure, there is provided the above content identification method, where performing named entity identification on the target text based on the layout features corresponding to the target words includes:
inputting the layout characteristics corresponding to the target words into a neural network model, and outputting entity labels corresponding to the target words, wherein the entity labels represent entity categories corresponding to the target words.
According to one or more embodiments of the present disclosure, there is provided a content recognition apparatus including:
the block dividing unit is used for dividing the target text into at least one block based on the block type;
the acquisition unit is used for acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs;
a determining unit, configured to determine layout features corresponding to target words in each line of text respectively based on the correspondence;
and the identification unit is used for carrying out named entity identification on the target text based on the layout characteristics corresponding to the target words.
According to one or more embodiments of the present disclosure, there is provided the above content identification apparatus, where the block dividing unit is further configured to match text content of the target text with a block type, so as to obtain a keyword in the text content, where the keyword is matched with the block type;
and carrying out block division on the target text based on the position of the target text where the keyword is located.
According to one or more embodiments of the present disclosure, there is provided the above content identification apparatus, where the determining unit is further configured to determine, based on the correspondence, the block to which the target word belongs in each line of text and the number of lines from the start line of that block;
acquiring context information of each divided block;
determining a block adjacent to a block to which a target word belongs based on the context information;
and respectively determining layout features corresponding to the target words in each line of text based on the block to which the target words in each line of text belong, the number of lines from the start line of that block, and the blocks adjacent to the block to which the target words belong.
According to one or more embodiments of the present disclosure, the above content identification apparatus is provided, where the identification unit is further configured to obtain a feature vector of a layout feature corresponding to the target word;
obtaining a word vector corresponding to the target word;
splicing the feature vector and the word vector to obtain a target feature vector corresponding to the target word;
and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
According to one or more embodiments of the present disclosure, the content identification apparatus is provided, where the identification unit is further configured to construct a layout feature vector matrix, where the layout feature vector matrix includes vectors corresponding to each block;
constructing a layout offset feature vector matrix, wherein the offset feature vector matrix comprises the vectors corresponding to the number of lines from the target word to the start line of the block to which it belongs;
and acquiring the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and the layout offset feature vector matrix.
According to one or more embodiments of the present disclosure, the content identification apparatus is provided, and the identification unit is further configured to input the layout characteristics corresponding to the target word into a neural network model, and output an entity label corresponding to the target word, where the entity label represents an entity category corresponding to the target word.
According to one or more embodiments of the present disclosure, there is provided a server including:
a memory for storing executable instructions;
and the processor is used for realizing the content identification method provided by the embodiment of the disclosure when the executable instruction is executed.
According to one or more embodiments of the present disclosure, a storage medium is provided, which stores executable instructions for implementing a content identification method provided by an embodiment of the present disclosure when executed.
The above description is only an example of the present disclosure and an illustration of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (12)

1. A method for identifying content, the method comprising:
dividing the target text into at least one block based on the block type;
acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs;
respectively determining, based on the corresponding relation, the block to which the target word belongs in each line of text and the number of lines from the start line of that block;
acquiring context information of each divided block;
determining a block adjacent to a block to which a target word belongs based on the context information;
respectively determining layout features corresponding to the target words in each line of text based on the block to which the target words in each line of text belong, the number of lines from the start line of that block, and the block adjacent to the block to which the target words belong;
and carrying out named entity recognition on the target text based on the layout characteristics corresponding to the target words.
2. The method of claim 1, wherein the dividing the target text into at least one block based on block type comprises:
matching the text content of the target text with the block type to obtain keywords matched with the block type in the text content;
and carrying out block division on the target text based on the position of the target text where the keyword is located.
3. The method of claim 1, wherein the performing named entity recognition on the target text based on the layout features corresponding to the target words comprises:
acquiring a feature vector of the layout feature corresponding to the target word;
obtaining a word vector corresponding to the target word;
splicing the feature vector and the word vector to obtain a target feature vector corresponding to the target word;
and carrying out named entity recognition on the target text based on the target feature vector corresponding to the target word.
4. The method of claim 3, wherein the obtaining the feature vector of the layout feature corresponding to the target word comprises:
constructing a layout feature vector matrix, wherein the layout feature vector matrix comprises vectors corresponding to all blocks;
constructing a layout offset feature vector matrix, wherein the offset feature vector matrix comprises vectors corresponding to the number of lines from the target word to the start line of the block to which it belongs;
and acquiring the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and the layout offset feature vector matrix.
5. The method of claim 1, wherein the performing named entity recognition on the target text based on the layout features corresponding to the target words comprises:
inputting the layout characteristics corresponding to the target words into a neural network model, and outputting entity labels corresponding to the target words, wherein the entity labels represent entity categories corresponding to the target words.
6. An apparatus for identifying content, the apparatus comprising:
the block dividing unit is used for dividing the target text into at least one block based on the block type;
the acquisition unit is used for acquiring the corresponding relation between each line of text in the target text and the block to which each line of text belongs;
the determining unit is used for respectively determining, based on the corresponding relation, the block to which the target word belongs in each line of text and the number of lines from the start line of that block; acquiring context information of each divided block; determining a block adjacent to the block to which the target word belongs based on the context information; and respectively determining layout features corresponding to the target words in each line of text based on the block to which the target words belong, the number of lines from the start line of that block, and the block adjacent to the block to which the target words belong;
and the identification unit is used for carrying out named entity identification on the target text based on the layout characteristics corresponding to the target words.
7. The apparatus of claim 6,
the block dividing unit is further configured to match the text content of the target text against the block types to obtain keywords in the text content that match the block types;
and divide the target text into blocks based on the positions of the keywords in the target text.
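The keyword-matching block division of claim 7 can be sketched as follows, assuming the target text is a résumé and the block-type keywords below are illustrative examples (the patent does not enumerate them):

```python
# Hypothetical block-type keywords; each matching line starts a new block.
BLOCK_KEYWORDS = {"basic information", "education", "work experience"}

def divide_into_blocks(lines):
    """Split lines into blocks wherever a line matches a block-type keyword."""
    blocks, current = [], []
    for line in lines:
        if line.strip().lower() in BLOCK_KEYWORDS and current:
            blocks.append(current)  # close the previous block
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    return blocks

resume = ["Basic Information", "Name: Alice", "Education",
          "2015-2019 Example University", "Work Experience", "2019- Example Corp"]
blocks = divide_into_blocks(resume)
# Three blocks, each beginning at the line where a keyword was matched.
```

This uses exact line matches for brevity; a production matcher would more plausibly use fuzzy or substring matching against the keyword list.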
8. The apparatus of claim 6,
the identification unit is further configured to acquire the feature vector of the layout feature corresponding to the target word;
acquire a word vector corresponding to the target word;
splice the feature vector and the word vector to obtain a target feature vector corresponding to the target word;
and perform named entity recognition on the target text based on the target feature vector corresponding to the target word.
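The splicing step in claim 8 is a plain vector concatenation. A minimal sketch, with purely illustrative dimensions and values (the patent fixes neither):

```python
import numpy as np

layout_vec = np.array([0.1, 0.2, 0.3])        # layout-feature vector for the word
word_vec   = np.array([0.5, 0.6, 0.7, 0.8])   # word embedding for the same word

# Claim 8: splice the two vectors into the target feature vector that the
# recognition model consumes.
target_vec = np.concatenate([layout_vec, word_vec])
# target_vec now has dimension 3 + 4 = 7
```

The resulting vector simply carries both the layout signal and the lexical signal side by side, so the downstream tagger can weight them jointly.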
9. The apparatus of claim 8,
the identification unit is further configured to construct a layout feature vector matrix, where the layout feature vector matrix includes vectors corresponding to each block;
constructing a layout offset feature vector matrix, wherein the offset feature vector matrix comprises vectors corresponding to the number of lines by which the target word is offset from the starting line of the block to which it belongs;
and acquiring the feature vector of the layout feature corresponding to the target word based on the constructed layout feature vector matrix and the layout offset feature vector matrix.
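The two matrices of claim 9 act as lookup tables (embeddings): one row per block type and one row per line offset. A minimal sketch with illustrative sizes and random values standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
n_block_types, n_offsets, dim = 5, 50, 4   # illustrative sizes

# Claim 9: one vector per block type, one vector per line offset within a block.
block_matrix  = rng.normal(size=(n_block_types, dim))  # layout feature vector matrix
offset_matrix = rng.normal(size=(n_offsets, dim))      # layout offset feature vector matrix

def layout_feature_vector(block_id: int, line_offset: int) -> np.ndarray:
    """Look up both rows and combine them into the word's layout feature vector."""
    return np.concatenate([block_matrix[block_id], offset_matrix[line_offset]])

vec = layout_feature_vector(block_id=2, line_offset=7)
# vec has dimension 4 + 4 = 8
```

Concatenation is one plausible way to combine the two lookups; summation would work equally well under the claim's wording.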
10. The apparatus of claim 6,
the identification unit is further configured to input the layout features corresponding to the target word into a neural network model and output an entity tag corresponding to the target word, where the entity tag represents the entity category corresponding to the target word.
11. A server, comprising:
a memory for storing executable instructions;
a processor, configured to execute the executable instructions to implement the content recognition method according to any one of claims 1 to 5.
12. A storage medium storing executable instructions for implementing a content recognition method as claimed in any one of claims 1 to 5 when executed.
CN201910651578.6A 2019-07-18 2019-07-18 Content identification method, device, server and storage medium Active CN110414395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910651578.6A CN110414395B (en) 2019-07-18 2019-07-18 Content identification method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN110414395A CN110414395A (en) 2019-11-05
CN110414395B true CN110414395B (en) 2022-08-02

Family

ID=68361979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910651578.6A Active CN110414395B (en) 2019-07-18 2019-07-18 Content identification method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN110414395B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818083A (en) * 2017-09-29 2018-03-20 华南师范大学 Disease data name entity recognition method and system based on three layers of condition random field
CN108717410A (en) * 2018-05-17 2018-10-30 达而观信息科技(上海)有限公司 Name entity recognition method and system
CN109145303A (en) * 2018-09-06 2019-01-04 腾讯科技(深圳)有限公司 Name entity recognition method, device, medium and equipment
CN109166608A (en) * 2018-09-17 2019-01-08 新华三大数据技术有限公司 Electronic health record information extracting method, device and equipment
CN109753909A (en) * 2018-12-27 2019-05-14 广东人啊人网络技术开发有限公司 A kind of resume analytic method based on content piecemeal and BiLSTM model
CN109766438A (en) * 2018-12-12 2019-05-17 平安科技(深圳)有限公司 Biographic information extracting method, device, computer equipment and storage medium
CN109871545A (en) * 2019-04-22 2019-06-11 京东方科技集团股份有限公司 Name entity recognition method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190171714A1 (en) * 2008-03-21 2019-06-06 Safermed, LLC d/b/a SaferMD, LLC Artificial Intelligence Quality Measures Data Extractor
US9552352B2 (en) * 2011-11-10 2017-01-24 Microsoft Technology Licensing, Llc Enrichment of named entities in documents via contextual attribute ranking

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Chinese Named Entity Recognition with Inducted Context Patterns; Wenbo Pang et al.; 2009 Third International Symposium on Intelligent Information Technology Application; 20091231; pp. 608-611 *
Entity based co-reference resolution with name entity recognition using hierarchical classification; Sheetal S. Sonawane et al.; 2015 Annual IEEE India Conference (INDICON); 20160331; pp. 1-6 *
Research on Named Entity Recognition Based on Deep Learning; Huo Zhenlang; China Masters' Theses Full-text Database, Information Science and Technology; 20190115; Vol. 2019, No. 1; pp. I138-4450 *
Named Entity Recognition for Chinese Electronic Medical Records Based on Neural Networks; Shen Zhan; China Masters' Theses Full-text Database, Information Science and Technology; 20181015; Vol. 2018, No. 10; pp. I138-960 *
A Survey of Named Entity Recognition and Entity Relation Extraction for Electronic Medical Records; Yang Jinfeng et al.; Acta Automatica Sinica; 20140831; Vol. 40, No. 8; pp. 1537-1561 *

Also Published As

Publication number Publication date
CN110414395A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN111274815B (en) Method and device for mining entity focus point in text
CN109359194B (en) Method and apparatus for predicting information categories
CN106919711B (en) Method and device for labeling information based on artificial intelligence
CN112188311B (en) Method and apparatus for determining video material of news
CN111061881A (en) Text classification method, equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN113821622B (en) Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
CN107741972A (en) A kind of searching method of picture, terminal device and storage medium
CN113360660A (en) Text type identification method and device, electronic equipment and storage medium
CN110489563B (en) Method, device, equipment and computer readable storage medium for representing graph structure
CN109840072B (en) Information processing method and device
CN115062119B (en) Government affair event handling recommendation method and device
CN110414395B (en) Content identification method, device, server and storage medium
CN111488450A (en) Method and device for generating keyword library and electronic equipment
CN115620726A (en) Voice text generation method, and training method and device of voice text generation model
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN113128225B (en) Named entity identification method and device, electronic equipment and computer storage medium
CN114579725A (en) Question and answer pair generation method and device, electronic equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN114297380A (en) Data processing method, device, equipment and storage medium
CN107071553A (en) A kind of method, device and computer-readable recording medium for changing video speech
CN109857838B (en) Method and apparatus for generating information
CN111753548A (en) Information acquisition method and device, computer storage medium and electronic equipment
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: Tiktok vision (Beijing) Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.