CN110825913A - Professional word extraction and part-of-speech tagging method - Google Patents

Professional word extraction and part-of-speech tagging method

Info

Publication number
CN110825913A
Authority
CN
China
Prior art keywords
words
industry
extracting
characteristic words
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910841201.7A
Other languages
Chinese (zh)
Inventor
高巍 (Gao Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Engineering And Mechanics Engineering Technology Co Ltd
Original Assignee
Shanghai Engineering And Mechanics Engineering Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Engineering And Mechanics Engineering Technology Co Ltd filed Critical Shanghai Engineering And Mechanics Engineering Technology Co Ltd
Priority to CN201910841201.7A priority Critical patent/CN110825913A/en
Publication of CN110825913A publication Critical patent/CN110825913A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting professional words and tagging parts of speech, which comprises the following steps. S1: establish a keyword tag database storing the correspondence between keywords and industry feature words. S2: extract keywords from the query instruction. S3: match the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database built in step S1. S4: screen video files based on the industry feature words obtained in step S3. S5: mark the industry feature words in the screened video files. The invention can extract industry feature words from video files and obtain the text information associated with the general industry feature words corresponding to the keywords, which greatly improves retrieval accuracy, reduces workload, and improves work efficiency.

Description

Professional word extraction and part-of-speech tagging method
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a method for extracting professional words from video files and tagging their parts of speech.
Background
AR (augmented reality) technology is a technology that, building on computer science and related fields, simulates physical information that is difficult to experience directly in real space, and achieves an enhanced sensory experience by superimposing virtual information content on the real world so that the process can be perceived by the human senses. In recent years, AR technology has been widely used in industry, video, medicine, education, and other fields. When extracting professional words, conventional approaches in the AR field mainly obtain domain terms by calculating the coupling degree between adjacent words, but this method must compute the coupling degree for all words in a corpus and has low accuracy; manually annotating all professional terms, on the other hand, involves a heavy workload and low efficiency. Therefore, how to develop a new method for extracting professional words and tagging parts of speech in the AR field that overcomes these problems, improving extraction accuracy and work efficiency while reducing workload, is a direction that those skilled in the art need to research.
Disclosure of Invention
The invention aims to provide a method for extracting professional words and tagging parts of speech that improves the accuracy of extracting video files related to keywords, reduces the extraction workload, and improves extraction efficiency.
The technical scheme adopted is as follows:
A method for extracting professional words and tagging parts of speech comprises the following steps. S1: establish a keyword tag database storing the correspondence between keywords and industry feature words. S2: extract keywords from the query instruction. S3: match the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database built in step S1. S4: screen video files based on the industry feature words obtained in step S3. S5: mark the industry feature words in the screened video files.
By adopting this technical scheme: the keywords in the query instruction are matched to their corresponding industry feature words through the preset keyword tag database, each video file is automatically screened against the industry feature words of the current query, and the industry feature words in the screening results are annotated accordingly, which reduces the extraction workload and improves extraction efficiency.
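The matching and screening flow of steps S1 to S5 can be sketched as follows. This is an illustrative sketch only: the database contents, video transcripts, and all function names are invented for the example and are not part of the patent.

```python
# Hypothetical sketch of steps S1-S5; all names and data are illustrative.

# S1: keyword tag database mapping query keywords to industry feature words
keyword_tag_db = {
    "valve": ["relief valve", "gate valve"],
    "boiler": ["boiler drum", "superheater"],
}

def extract_keywords(query):
    # S2: naive keyword extraction - keep query words present in the database
    return [w for w in query.lower().split() if w in keyword_tag_db]

def match_feature_words(keywords):
    # S3: look up the industry feature words for each keyword
    matched = []
    for kw in keywords:
        matched.extend(keyword_tag_db[kw])
    return matched

def screen_videos(videos, feature_words):
    # S4: keep videos whose transcript contains any matched feature word;
    # `videos` maps a filename to its (pre-extracted) transcript text
    return {name: text for name, text in videos.items()
            if any(fw in text for fw in feature_words)}

videos = {
    "a.mp4": "open the relief valve slowly before start-up",
    "b.mp4": "weekly inspection of the conveyor belt",
}
hits = screen_videos(videos, match_feature_words(extract_keywords("check the valve")))
# hits contains only "a.mp4"; step S5 would then highlight the matched words
```

Step S5 (annotation and display) is omitted here since it is a presentation concern; the screening result `hits` is what the annotation step would operate on.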
Preferably, the method for extracting professional words and tagging parts of speech further includes step S6, which comprises: adding retrieval tags to the video files obtained in step S4 and sorting the video files in chronological order.
By adopting this technical scheme: sorting the video files by the time at which the industry feature words appear improves the accuracy of video screening and ensures that the user can preferentially retrieve the most relevant video files in chronological order.
More preferably, in the method for extracting professional words and tagging parts of speech, step S1 includes the following steps. S11: collect industry feature words from industry safety regulations and operation manuals. S12: enter the industry feature words into the keyword tag database and format them uniformly. S13: mark each industry feature word with its corresponding keyword label. S14: mark each industry feature word with a weight label based on its frequency of occurrence, and sort the industry feature words by weight label.
By adopting this technical scheme: the vocabulary of general industry feature words is built from the safety regulations of enterprises across the industry, the operation manuals of equipment manufacturers, and similar sources, so feature extraction is flexible and accurate. Assigning weight values to the words and sorting them based on frequency statistics of the industry feature words further improves retrieval accuracy.
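A minimal sketch of the frequency-based weighting of steps S11 to S14 might look like the following; the corpus texts and feature words are invented examples, and the weight label is taken to be the raw occurrence count, which is an assumption since the patent does not specify the weighting formula.

```python
from collections import Counter

def build_feature_word_index(corpus_texts, feature_words):
    """Count how often each industry feature word appears across the
    safety-regulation / operation-manual corpus (S11-S12), attach a
    weight label equal to that frequency, and sort by it (S14)."""
    counts = Counter()
    for text in corpus_texts:
        for fw in feature_words:
            counts[fw] += text.count(fw)
    # highest-frequency (highest-weight) feature words first
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

corpus = [
    "the relief valve must be tested monthly; log each relief valve test",
    "gate valve maintenance follows the operation manual",
]
index = build_feature_word_index(corpus, ["relief valve", "gate valve"])
# index: [("relief valve", 2), ("gate valve", 1)]
```

The keyword labels of step S13 would be attached alongside each entry; they are omitted here to keep the sketch focused on the weighting and sorting.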
Further preferably, in the method for extracting professional words and tagging parts of speech, step S4 includes the following steps. S41: extract the audio track from the video file. S42: convert the audio track obtained in step S41 into a text description file. S43: perform word segmentation on the text description file, splitting it into individual words. S44: screen out the video files whose text description files contain industry feature words among the segmented words.
Still more preferably, in the method for extracting professional words and tagging parts of speech, the word segmentation in step S43 is implemented on a distributed crawler platform.
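The audio-to-text screening of steps S41 to S44 can be sketched as below. Real audio-track extraction and speech-to-text (for example, ffmpeg plus an ASR engine) are stubbed out with a stand-in `transcribe` function, and a whitespace tokenizer stands in for a proper segmenter (a Chinese system would typically use a word segmenter such as jieba); all transcripts are invented.

```python
def transcribe(video_path):
    # Stand-in for S41-S42: pretend each video already has a transcript
    fake_transcripts = {
        "a.mp4": "close the gate valve before inspection",
        "b.mp4": "sweep the workshop floor daily",
    }
    return fake_transcripts[video_path]

def segment(text):
    # Stand-in for S43: whitespace tokenization instead of real
    # (possibly distributed) word segmentation
    return text.split()

def screen(video_paths, feature_words):
    # S44: keep videos whose token list contains a feature word
    kept = []
    for path in video_paths:
        tokens = segment(transcribe(path))
        if any(fw in tokens for fw in feature_words):
            kept.append(path)
    return kept

print(screen(["a.mp4", "b.mp4"], ["valve"]))  # ['a.mp4']
```

Matching against the token list rather than the raw text means multi-word feature terms would need their own handling; that detail is left out of this sketch.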
Still more preferably, in the method for extracting professional words and tagging parts of speech, in step S5 the industry feature words in the screened video files are displayed with color modulation and/or brightness modulation.
Compared with the prior art, the method can extract industry feature words from video files and obtain the text information associated with the general industry feature words corresponding to the keywords, which greatly improves retrieval accuracy, reduces workload, and improves work efficiency.
Drawings
The invention is described in further detail in the following description of embodiments with reference to the accompanying drawings:
fig. 1 is a schematic flow chart of embodiment 1 of the present invention.
Detailed Description
To illustrate the technical solution of the present invention more clearly, it is further described below with reference to embodiments.
FIG. 1 shows embodiment 1 of the present invention:
A method for extracting professional words and tagging parts of speech comprises the following steps:
S11: collecting industry feature words from industry safety regulations and operation manuals;
S12: entering the industry feature words into the keyword tag database and formatting them uniformly;
S13: marking each industry feature word with its corresponding keyword label;
S14: marking each industry feature word with a weight label based on its frequency of occurrence, and sorting the industry feature words by weight label;
S2: extracting keywords from the query instruction;
S3: matching the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database built in step S1;
S41: extracting the audio track from the video file;
S42: converting the audio track obtained in step S41 into a text description file;
S43: performing word segmentation on the text description file, splitting it into individual words;
S44: screening out the video files whose text description files contain industry feature words among the segmented words;
S5: displaying the industry feature words in the screened video files with color modulation and brightness modulation;
S6: adding retrieval tags to the video files obtained in step S4 and sorting the video files in chronological order.
In the above embodiment, the word segmentation in step S43 is implemented on a distributed crawler platform.
In the above technical scheme: the similarity between the text description of each screened video clip and the textual expression of a scene described in natural language can be compared, and a key-frame set whose content matches the natural-language input is output; objects in the key-frame set are identified and extracted to generate an object set; and a key frame is finally generated from the scene graph and the object set to produce the video. Compared with keyword retrieval, video retrieval based on natural-language processing greatly reduces the ambiguity of the description, so the system can filter out and find matching videos more efficiently. Meanwhile, dozens of important fields such as titles and body text are extracted from the centralized control regulations, maintenance regulations, user operation manuals, and product specifications used by companies in the industry, using a customized extraction service for these special page types. The extraction backend normalizes and structures the page content, so a user can efficiently obtain rich structured information from a specified page simply by calling the extraction API. The method offers flexible feature extraction and high accuracy; no manual verification of accuracy is required, and information extraction is accelerated.
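The similarity comparison described above could, for instance, be realized as a simple bag-of-words cosine similarity between the natural-language query and each clip's text description. This is a hedged sketch with invented clip descriptions, not the patent's actual implementation, which does not specify a similarity measure.

```python
import math
from collections import Counter

def cosine(a, b):
    # Bag-of-words cosine similarity between two whitespace-tokenized texts
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

descriptions = {
    "clip1": "worker opens the relief valve in the boiler room",
    "clip2": "forklift moves pallets in the warehouse",
}
query = "how to open the relief valve"
best = max(descriptions, key=lambda k: cosine(query, descriptions[k]))
# best == "clip1"
```

A production system would use stemmed or segmented tokens and a weighting such as TF-IDF rather than raw counts, but the ranking principle is the same.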
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. The protection scope of the present invention is subject to the protection scope of the claims.

Claims (6)

1. A method for extracting professional words and labeling parts of speech, characterized by comprising the following steps:
S1: establishing a keyword tag database, and storing the correspondence between keywords and industry feature words;
S2: extracting keywords from the query instruction;
S3: matching the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database obtained in step S1;
S4: screening video files based on the industry feature words obtained in step S3;
S5: marking and displaying the industry feature words in the video files obtained by screening.
2. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, further comprising a step S6, wherein the step S6 comprises: adding retrieval tags to the video files obtained in step S4 and sorting the video files in chronological order.
3. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, wherein the step S1 includes the following steps:
S11: collecting industry feature words from industry safety regulations and operation manuals;
S12: entering the industry feature words into the keyword tag database and formatting them uniformly;
S13: marking each industry feature word with its corresponding keyword label;
S14: marking each industry feature word with a weight label based on its frequency of occurrence, and sorting the industry feature words by weight label.
4. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, wherein said step S4 comprises the steps of:
S41: extracting the audio track from the video file;
S42: converting the audio track obtained in step S41 into a text description file;
S43: performing word segmentation on the text description file, splitting it into individual words;
S44: screening out the video files whose text description files contain industry feature words among the segmented words.
5. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, wherein the word segmentation in step S43 is implemented on a distributed crawler platform.
6. The method for extracting specialized words and labeling parts of speech according to claim 1, wherein the label display in step S5 is a color-modulation display and/or a brightness-modulation display.
CN201910841201.7A 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method Pending CN110825913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910841201.7A CN110825913A (en) 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910841201.7A CN110825913A (en) 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method

Publications (1)

Publication Number Publication Date
CN110825913A true CN110825913A (en) 2020-02-21

Family

ID=69547927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910841201.7A Pending CN110825913A (en) 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method

Country Status (1)

Country Link
CN (1) CN110825913A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826084A (en) * 2009-03-05 2010-09-08 深圳市万泉河科技有限公司 Fast searching method for files, fast searching method for mass talent hiring on Internet and system
CN102043812A (en) * 2009-10-13 2011-05-04 北京大学 Method and system for retrieving medical information
CN103425742A (en) * 2013-07-16 2013-12-04 北京中科汇联信息技术有限公司 Method and device for searching website
CN103678694A (en) * 2013-12-26 2014-03-26 乐视网信息技术(北京)股份有限公司 Method and system for establishing reverse index file of video resources
CN106874443A (en) * 2017-02-09 2017-06-20 北京百家互联科技有限公司 Based on information query method and device that video text message is extracted
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN108241856A (en) * 2018-01-12 2018-07-03 新华智云科技有限公司 Information generation method and equipment
CN108388583A (en) * 2018-01-26 2018-08-10 北京览科技有限公司 A kind of video searching method and video searching apparatus based on video content
CN109101558A (en) * 2018-07-12 2018-12-28 北京猫眼文化传媒有限公司 A kind of video retrieval method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination