CN110825913A - Professional word extraction and part-of-speech tagging method - Google Patents

Professional word extraction and part-of-speech tagging method

Info

Publication number
CN110825913A
Authority
CN
China
Prior art keywords
words
industry
extracting
characteristic words
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910841201.7A
Other languages
Chinese (zh)
Inventor
高巍 (Gao Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Engineering And Mechanics Engineering Technology Co Ltd
Original Assignee
Shanghai Engineering And Mechanics Engineering Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Engineering And Mechanics Engineering Technology Co Ltd filed Critical Shanghai Engineering And Mechanics Engineering Technology Co Ltd
Priority to CN201910841201.7A priority Critical patent/CN110825913A/en
Publication of CN110825913A publication Critical patent/CN110825913A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting professional words and tagging parts of speech, which comprises the following steps. S1: establish a keyword tag database storing the correspondence between keywords and industry feature words. S2: extract keywords from the query instruction. S3: match the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database built in step S1. S4: screen video files based on the industry feature words obtained in step S3. S5: mark the industry feature words in the screened video files. The invention can extract industry feature words from video files and obtain the text information associated with the general industry feature words corresponding to the keywords, which greatly improves retrieval accuracy, reduces workload, and improves work efficiency.

Description

Professional word extraction and part-of-speech tagging method
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a method for extracting professional words from video files and tagging their parts of speech.
Background
AR (augmented reality) technology is a technology that, building on computer science and related fields, simulates physical information that is difficult to experience directly in real space, and achieves an enhanced sensory experience by superimposing virtual information content on the real world so that the process can be perceived by the human senses. In recent years, AR technology has been widely used in industry, video, medicine, education, and other fields. When extracting professional words, conventional approaches in the AR field mainly obtain domain terms by calculating the coupling degree between adjacent words, but this method must compute the coupling degree for all words in a corpus and has low accuracy; manually annotating all professional terms, on the other hand, involves a heavy workload and low efficiency. Therefore, how to develop a new method for extracting professional words and tagging parts of speech in the AR field that overcomes these problems, improving extraction accuracy and work efficiency while reducing workload, is a direction that those skilled in the art need to research.
Disclosure of Invention
The invention aims to provide a method for extracting professional words and tagging parts of speech that improves the accuracy of extracting video files related to keywords, reduces the extraction workload, and improves extraction efficiency.
The technical scheme adopted is as follows:
A method for extracting professional words and tagging parts of speech comprises the following steps. S1: establish a keyword tag database storing the correspondence between keywords and industry feature words. S2: extract keywords from the query instruction. S3: match the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database built in step S1. S4: screen video files based on the industry feature words obtained in step S3. S5: mark the industry feature words in the screened video files.
By adopting this technical scheme: the keywords in the query instruction are matched to their corresponding industry feature words through the preset keyword tag database, each video file is automatically screened against the industry feature words of the current query, and the industry feature words in the screening results are annotated accordingly, which reduces the extraction workload and improves extraction efficiency.
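The matching and screening flow of steps S1 to S5 can be sketched as follows. This is an illustrative sketch only: the database contents, video transcripts, and all function names are invented for the example and are not part of the patent.

```python
# Hypothetical sketch of steps S1-S5; all names and data are illustrative.

# S1: keyword tag database mapping query keywords to industry feature words
keyword_tag_db = {
    "valve": ["relief valve", "gate valve"],
    "boiler": ["boiler drum", "superheater"],
}

def extract_keywords(query):
    # S2: naive keyword extraction - keep query words present in the database
    return [w for w in query.lower().split() if w in keyword_tag_db]

def match_feature_words(keywords):
    # S3: look up the industry feature words for each keyword
    matched = []
    for kw in keywords:
        matched.extend(keyword_tag_db[kw])
    return matched

def screen_videos(videos, feature_words):
    # S4: keep videos whose transcript contains any matched feature word;
    # `videos` maps a filename to its (pre-extracted) transcript text
    return {name: text for name, text in videos.items()
            if any(fw in text for fw in feature_words)}

videos = {
    "a.mp4": "open the relief valve slowly before start-up",
    "b.mp4": "weekly inspection of the conveyor belt",
}
hits = screen_videos(videos, match_feature_words(extract_keywords("check the valve")))
# hits contains only "a.mp4"; step S5 would then highlight the matched words
```

Step S5 (annotation and display) is omitted here since it is a presentation concern; the screening result `hits` is what the annotation step would operate on.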
Preferably, the method for extracting professional words and tagging parts of speech further includes step S6, which comprises: adding retrieval tags to the video files obtained in step S4 and sorting the video files in chronological order.
By adopting this technical scheme: sorting the video files by the time at which the industry feature words appear improves the accuracy of video screening and ensures that the user can preferentially retrieve the most relevant video files in chronological order.
More preferably, in the method for extracting professional words and tagging parts of speech, step S1 includes the following steps. S11: collect industry feature words from industry safety regulations and operation manuals. S12: enter the industry feature words into the keyword tag database and format them uniformly. S13: mark each industry feature word with its corresponding keyword label. S14: mark each industry feature word with a weight label based on its frequency of occurrence, and sort the industry feature words by weight label.
By adopting this technical scheme: the vocabulary of general industry feature words is built from the safety regulations of enterprises across the industry, the operation manuals of equipment manufacturers, and similar sources, so feature extraction is flexible and accurate. Assigning weight values to the words and sorting them based on frequency statistics of the industry feature words further improves retrieval accuracy.
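A minimal sketch of the frequency-based weighting of steps S11 to S14 might look like the following; the corpus texts and feature words are invented examples, and the weight label is taken to be the raw occurrence count, which is an assumption since the patent does not specify the weighting formula.

```python
from collections import Counter

def build_feature_word_index(corpus_texts, feature_words):
    """Count how often each industry feature word appears across the
    safety-regulation / operation-manual corpus (S11-S12), attach a
    weight label equal to that frequency, and sort by it (S14)."""
    counts = Counter()
    for text in corpus_texts:
        for fw in feature_words:
            counts[fw] += text.count(fw)
    # highest-frequency (highest-weight) feature words first
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

corpus = [
    "the relief valve must be tested monthly; log each relief valve test",
    "gate valve maintenance follows the operation manual",
]
index = build_feature_word_index(corpus, ["relief valve", "gate valve"])
# index: [("relief valve", 2), ("gate valve", 1)]
```

The keyword labels of step S13 would be attached alongside each entry; they are omitted here to keep the sketch focused on the weighting and sorting.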
Further preferably, in the method for extracting professional words and tagging parts of speech, step S4 includes the following steps. S41: extract the audio track from the video file. S42: convert the audio track obtained in step S41 into a text description file. S43: perform word segmentation on the text description file, splitting it into individual words. S44: screen out the video files whose text description files contain industry feature words among the segmented words.
Still more preferably, in the method for extracting professional words and tagging parts of speech, the word segmentation in step S43 is implemented on a distributed crawler platform.
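The audio-to-text screening of steps S41 to S44 can be sketched as below. Real audio-track extraction and speech-to-text (for example, ffmpeg plus an ASR engine) are stubbed out with a stand-in `transcribe` function, and a whitespace tokenizer stands in for a proper segmenter (a Chinese system would typically use a word segmenter such as jieba); all transcripts are invented.

```python
def transcribe(video_path):
    # Stand-in for S41-S42: pretend each video already has a transcript
    fake_transcripts = {
        "a.mp4": "close the gate valve before inspection",
        "b.mp4": "sweep the workshop floor daily",
    }
    return fake_transcripts[video_path]

def segment(text):
    # Stand-in for S43: whitespace tokenization instead of real
    # (possibly distributed) word segmentation
    return text.split()

def screen(video_paths, feature_words):
    # S44: keep videos whose token list contains a feature word
    kept = []
    for path in video_paths:
        tokens = segment(transcribe(path))
        if any(fw in tokens for fw in feature_words):
            kept.append(path)
    return kept

print(screen(["a.mp4", "b.mp4"], ["valve"]))  # ['a.mp4']
```

Matching against the token list rather than the raw text means multi-word feature terms would need their own handling; that detail is left out of this sketch.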
Still more preferably, in the method for extracting professional words and tagging parts of speech, in step S5 the industry feature words in the screened video files are displayed with color modulation and/or brightness modulation.
Compared with the prior art, the method can extract industry feature words from video files and obtain the text information associated with the general industry feature words corresponding to the keywords, which greatly improves retrieval accuracy, reduces workload, and improves work efficiency.
Drawings
The invention is described in further detail in the following description of embodiments with reference to the accompanying drawings:
fig. 1 is a schematic flow chart of embodiment 1 of the present invention.
Detailed Description
To illustrate the technical solution of the present invention more clearly, it is further described below with reference to embodiments.
FIG. 1 shows embodiment 1 of the present invention:
A method for extracting professional words and tagging parts of speech comprises the following steps:
S11: collecting industry feature words from industry safety regulations and operation manuals;
S12: entering the industry feature words into the keyword tag database and formatting them uniformly;
S13: marking each industry feature word with its corresponding keyword label;
S14: marking each industry feature word with a weight label based on its frequency of occurrence, and sorting the industry feature words by weight label;
S2: extracting keywords from the query instruction;
S3: matching the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database built in step S1;
S41: extracting the audio track from the video file;
S42: converting the audio track obtained in step S41 into a text description file;
S43: performing word segmentation on the text description file, splitting it into individual words;
S44: screening out the video files whose text description files contain industry feature words among the segmented words;
S5: displaying the industry feature words in the screened video files with color modulation and brightness modulation;
S6: adding retrieval tags to the video files obtained in step S4 and sorting the video files in chronological order.
In the above embodiment, the word segmentation in step S43 is implemented on a distributed crawler platform.
In the above technical scheme: the similarity between the text description of each screened video clip and the textual expression of a scene described in natural language can be compared, and a key-frame set whose content matches the natural-language input is output; objects in the key-frame set are identified and extracted to generate an object set; and a key frame is finally generated from the scene graph and the object set to produce the video. Compared with keyword retrieval, video retrieval based on natural-language processing greatly reduces the ambiguity of the description, so the system can filter out and find matching videos more efficiently. Meanwhile, dozens of important fields such as titles and body text are extracted from the centralized control regulations, maintenance regulations, user operation manuals, and product specifications used by companies in the industry, using a customized extraction service for these special page types. The extraction backend normalizes and structures the page content, so a user can efficiently obtain rich structured information from a specified page simply by calling the extraction API. The method offers flexible feature extraction and high accuracy; no manual verification of accuracy is required, and information extraction is accelerated.
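The similarity comparison described above could, for instance, be realized as a simple bag-of-words cosine similarity between the natural-language query and each clip's text description. This is a hedged sketch with invented clip descriptions, not the patent's actual implementation, which does not specify a similarity measure.

```python
import math
from collections import Counter

def cosine(a, b):
    # Bag-of-words cosine similarity between two whitespace-tokenized texts
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

descriptions = {
    "clip1": "worker opens the relief valve in the boiler room",
    "clip2": "forklift moves pallets in the warehouse",
}
query = "how to open the relief valve"
best = max(descriptions, key=lambda k: cosine(query, descriptions[k]))
# best == "clip1"
```

A production system would use stemmed or segmented tokens and a weighting such as TF-IDF rather than raw counts, but the ranking principle is the same.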
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. The protection scope of the present invention is subject to the protection scope of the claims.

Claims (6)

1. A method for extracting professional words and labeling parts of speech, characterized by comprising the following steps:
S1: establishing a keyword tag database, and storing the correspondence between keywords and industry feature words;
S2: extracting keywords from the query instruction;
S3: matching the industry feature words corresponding to the keywords obtained in step S2 against the keyword tag database obtained in step S1;
S4: screening video files based on the industry feature words obtained in step S3;
S5: marking and displaying the industry feature words in the video files obtained by screening.
2. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, further comprising a step S6, wherein the step S6 comprises: adding retrieval tags to the video files obtained in step S4 and sorting the video files in chronological order.
3. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, wherein the step S1 includes the following steps:
S11: collecting industry feature words from industry safety regulations and operation manuals;
S12: entering the industry feature words into the keyword tag database and formatting them uniformly;
S13: marking each industry feature word with its corresponding keyword label;
S14: marking each industry feature word with a weight label based on its frequency of occurrence, and sorting the industry feature words by weight label.
4. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, wherein said step S4 comprises the steps of:
S41: extracting the audio track from the video file;
S42: converting the audio track obtained in step S41 into a text description file;
S43: performing word segmentation on the text description file, splitting it into individual words;
S44: screening out the video files whose text description files contain industry feature words among the segmented words.
5. The method for extracting specialized words and labeling parts of speech as claimed in claim 1, wherein the word segmentation in step S43 is implemented on a distributed crawler platform.
6. The method for extracting specialized words and labeling parts of speech according to claim 1, wherein the label display in step S5 is a color-modulation display and/or a brightness-modulation display.
CN201910841201.7A 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method Pending CN110825913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910841201.7A CN110825913A (en) 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910841201.7A CN110825913A (en) 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method

Publications (1)

Publication Number Publication Date
CN110825913A true CN110825913A (en) 2020-02-21

Family

ID=69547927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910841201.7A Pending CN110825913A (en) 2019-09-03 2019-09-03 Professional word extraction and part-of-speech tagging method

Country Status (1)

Country Link
CN (1) CN110825913A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826084A (en) * 2009-03-05 2010-09-08 深圳市万泉河科技有限公司 Fast searching method for files, fast searching method for mass talent hiring on Internet and system
CN102043812A (en) * 2009-10-13 2011-05-04 北京大学 Method and system for retrieving medical information
CN103425742A (en) * 2013-07-16 2013-12-04 北京中科汇联信息技术有限公司 Method and device for searching website
CN103678694A (en) * 2013-12-26 2014-03-26 乐视网信息技术(北京)股份有限公司 Method and system for establishing reverse index file of video resources
CN106874443A (en) * 2017-02-09 2017-06-20 北京百家互联科技有限公司 Based on information query method and device that video text message is extracted
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN108241856A (en) * 2018-01-12 2018-07-03 新华智云科技有限公司 Information generation method and equipment
CN108388583A (en) * 2018-01-26 2018-08-10 北京览科技有限公司 A kind of video searching method and video searching apparatus based on video content
CN109101558A (en) * 2018-07-12 2018-12-28 北京猫眼文化传媒有限公司 A kind of video retrieval method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination