CN114998785B - Intelligent Mongolian video analysis method - Google Patents

Intelligent Mongolian video analysis method

Info

Publication number: CN114998785B
Authority: CN (China)
Prior art keywords: mongolian, model, image, information, data
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202210575336.5A
Other languages: Chinese (zh)
Other versions: CN114998785A (en)
Inventors: 周巴特尔, 蒋晓栋, 杨莉莉, 张宇, 冯祥, 董德武, 王梦忠
Current Assignee: Inner Mongolia Autonomous Region Public Security Bureau; Iflytek Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Inner Mongolia Autonomous Region Public Security Bureau; Iflytek Information Technology Co Ltd
Application filed by Inner Mongolia Autonomous Region Public Security Bureau and Iflytek Information Technology Co Ltd
Priority to CN202210575336.5A (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Publication of CN114998785A; application granted; publication of CN114998785B
Legal status: Active; anticipated expiration pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Character Discrimination (AREA)

Abstract

The application relates to an intelligent Mongolian video analysis method comprising the following steps: data processing, voice information processing, image information processing, Mongolian text translation, element extraction, intention recognition, content analysis and early warning, general-model research-and-judgment recognition, and self-built analysis and judgment models. The method uses current artificial intelligence, big data and Mongolian information processing technology to identify, translate, retrieve, monitor and manage Mongolian video data, which greatly improves the automatic analysis capability for Mongolian video, reduces manual analysis cost, and improves analysis efficiency and real-time performance.

Description

Intelligent Mongolian video analysis method
Technical Field
The application relates to the field of Mongolian video analysis, in particular to an intelligent Mongolian video analysis method.
Background
With the rapid development of informatization and internet technology, the number of Mongolian websites and public accounts has grown, internet data now contains a large amount of Mongolian video resources, and the platforms and technologies through which netizens communicate by video have developed and matured rapidly. While netizens communicate conveniently in Mongolian through these tools, this creates great difficulty for the normal management work of public safety authorities: the demand on police manpower is huge and timeliness is difficult to ensure.
Therefore, an automated way to retrieve and monitor abnormal Mongolian video information on the network is needed to complete the analysis of internet video information.
Disclosure of Invention
To solve the problems that manual analysis of Mongolian video information is difficult to carry out, inefficient, and unable to guarantee timeliness, the application provides an intelligent Mongolian video analysis method.
The application provides a Mongolian video intelligent analysis method, which comprises the following steps:
step S1, data processing, namely judging whether the received data are video data, if yes, turning to step S2, if not, carrying out format conversion to extract audio data and Mongolian text information, extracting Mongolian text information from the extracted audio data through audio transcription, then translating the Mongolian text information into corresponding Chinese text information, and turning to step S5;
s2, voice information processing, namely extracting voice information in video data for processing;
s3, image information processing, namely extracting image information in video data for processing;
s4, performing Mongolian text translation, namely extracting Mongolian text information from image information through OCR (optical character recognition) of the image, extracting Mongolian text information from voice information through audio transcription, and translating the Mongolian text information into corresponding Chinese text information;
step S5, element extraction, namely identifying the element information contained in the Chinese text information, the element information comprising person names, place names, transactions and organization information, and obtaining corresponding element organization information from the extracted person names, place names and transactions;
step S6, intention recognition, namely classifying the Chinese text information into one or more categories according to the theme, the content and the attribute of the Chinese text information, and recognizing the intention expressed in the text information;
step S7, content analysis and early warning, in which the system, combining base libraries of original audio, MD5 hashes, voiceprints and keywords with tactical models built from human experience, scores videos for early warning and ranks them by score;
step S8, general-model research and judgment, namely performing research-and-judgment recognition on the video data using general-purpose models;
and step S9, self-built analysis and judgment models, namely forming targeted event analysis and judgment models by analyzing and studying, from big data, how events occur and develop, so as to perform further judgment.
By the method, the video data of the Mongolian are identified, translated, retrieved, monitored and managed, so that the automatic analysis capability of the Mongolian video is greatly improved, the manual analysis cost is reduced, and the analysis efficiency and the real-time rate are improved.
Further, the step S2 includes:
step S21, voice information preprocessing, namely, scene segmentation is carried out on audio in video, and music, noise, voice and the like are judged, so that subsequent voice recognition processing is facilitated;
step S22, voice information language identification, namely performing acoustic model training and language model training on Mongolian languages, performing language identification comparison on video data to be processed, automatically identifying and judging the languages to which the video data belong, and confirming video data fragments of Mongolian languages in the video data, so that the language identification efficiency of Mongolian can be improved;
step S23, voice information transcription, namely performing endpoint detection and noise reduction on the music, noise and voice obtained from voice information preprocessing, and extracting acoustic features; the extracted acoustic features of the Mongolian-language video data identified during voice information language identification are then decoded with a decoder using the trained acoustic model and language model, and audio-to-text conversion is performed, so that Mongolian text information can be obtained rapidly.
Further, in the step S21, the voice information preprocessing further includes:
S211, an energy four-threshold algorithm sets four states, namely silence, voice start, voice stable and voice attenuation, and sets the four energy thresholds required for transitions between the states; transitions among the four states are realized according to the per-frame energy of the audio in the video data, finally detecting the higher-energy voice segments in the audio;
S212, rule-based noise judgment, using the band energy of the audio to make an initial judgment of music and noise scenes on the signal segments that pass the energy four-threshold algorithm; rule-based preliminary detection handles most scenes, but since each deployed system runs in its own application environment, scene conditions differ from one environment to another.
S213, model classifier judgment, namely training models matched to the various scenes appearing in the system's actual application environment; discriminative training with a minimum classification error criterion is introduced during training, improving the matching precision for the various scenes.
Further, the step S3 includes:
step S31, image information preprocessing, namely screening images via validity detection, sharpness detection and MD5 de-duplication; checking image enhancement, image binarization, image perspective transformation, image boundary detection, image tilt detection, image external-block detection and image content-area detection; and performing binarization, noise removal and tilt correction on the image information that passes detection;
the binarization is used for enabling the image information to only contain black foreground information and white background information, so that the efficiency and the accuracy of image information preprocessing are improved; the noise removal carries out denoising treatment on the image information to be identified according to the characteristics of the noise, so that the accuracy of image information preprocessing is improved; the tilt correction is used for correcting the image direction;
step S32, image information recognition, combined with an OCR recognition service, analyzes character morphological characteristics with various pattern recognition algorithms and determines the standard codes of the Mongolian characters, so that Mongolian text information contained in a picture can be rapidly extracted and stored in a text document in a general character format, providing more data support for subsequent business analysis.
Further, the step S4 includes:
step S41, word segmentation, namely segmenting the Chinese character sequence into a word sequence; in Chinese, words are the most basic units carrying semantics, and word segmentation is the basis of many Chinese natural language processing tasks such as information retrieval, text classification and sentiment analysis;
step S42, part-of-speech tagging, namely assigning each word in the sentence a part-of-speech category, including number and name categories, to prevent numbers and names from misleading the translation;
step S43, decoder decoding, comprising a hierarchical phrase-based decoder PSMT and a neural-network-based decoder NMT; the hierarchical phrase-based decoder PSMT comprises a translation model, a language model, a reordering model, a search space and log-linear model scoring, and segments sentences into phrases, translates each phrase, and then reorders them; the search space contains all segmented phrases and yields all translation hypotheses, the log-linear model scores the hypotheses, and the highest-scoring hypothesis is selected as the translation result, improving translation accuracy.
Further, in step S5, element extraction uses industry domain data from big data together with manual domain-expert labeling, which covers morphology, syntax, and semantics; statistical models for lexical, syntactic and semantic analysis are trained on the labeled data; lexical analysis uses a conditional random field model combined with rule grammars; for syntax, a statistical parsing model is built on a probabilistic context-free grammar, a parsing algorithm is designed based on dynamic programming ideas, and the pruning strategy of the parsing algorithm is optimized for efficiency; semantics are extracted over the syntactic structure tree, a semantic disambiguation model is trained on semantically labeled data, and semantic understanding of elements is achieved in combination with semantic-analysis rule grammars. With this scheme, the effective elements in the video data can be extracted to the greatest extent.
In step S6, intent recognition supports a multi-category combined keyword matching (KWS) strategy: based on an experimental prototype, positive and negative combined keywords can be customized per category, realizing rule-based keyword matching; intent recognition supports five strategies: KWS, KWP, NB, LDA+SVM and NN; for multi-strategy classification, intent recognition supports configuring and using all five strategies together, improving recognition accuracy; intent recognition completes multi-strategy score fusion under multi-category text classification, where each strategy supports multi-category discrimination, a comprehensive multi-strategy score fusion is finally carried out, each strategy's weight is configured, and scores are fused by weight, again improving accuracy; intent recognition supports a unified multi-strategy input-output format, with unified input and output definitions across strategies; intent recognition, based on the NN strategy, loads several NN models simultaneously, supports configuring each NN model's weight and threshold, and the NN strategy computes each model's score and outputs a fused result; the NN models used by the NN strategy can be switched dynamically, which is flexible and convenient.
Further, the step S7 includes:
step S71, human voice separation, namely performing active voice detection on the speech content in video and audio data, identifying the noise in each part of the video and audio segments, the noise including silence, white noise and ring-back tones, suppressing the noise according to the detection results, enhancing the effective speech, and then clustering by the voiceprint characteristics of different speakers to finally achieve speaker separation; voiceprint early warning, where, after speaker separation, voiceprints are extracted from the video and audio data and registered into a voiceprint library, and, combined with the library's early-warning discovery module, specific persons can be discovered across applications, improving the accuracy and efficiency of voiceprint early warning;
step S72, image early warning: based on scene recognition and image language recognition, the acquired image data is labeled and scene-tagged, and gun-related, Mongolian-related and specific-portrait data are pushed with early warnings; image early warning comprises portrait early warning, Mongolian picture recognition early warning, and gun-related picture discovery early warning; portrait early warning builds a knowledge base of key figures and warns on key figures through the knowledge base and the current face-similarity recognition engine; Mongolian picture recognition early warning calls the image OCR and language recognition engines; gun-related picture discovery early warning calls the image-class object detection engine; the early-warning content of image early warning includes pictures of figures, crowds, parades, firearms, flags, pornography, gore, self-immolation and burns;
step S73, text content early warning, namely establishing a keyword knowledge base, using OCR recognition of images and transcription of audio data to find harmful text information; when video data is ingested, the Chinese and Mongolian content in the images is recognized, extracted and compared against the keyword knowledge base; text content early warning also adopts parallel processing, which improves its throughput.
Further, the step S8 includes:
step S81, pornographic content and scene recognition, which classifies content into pornographic, sexy and normal types; several network models are trained, and multi-model cascade judgment is adopted for specific users; for video porn screening, frames are first cut and screened, and a video-segment algorithm and an optical flow algorithm are adopted for suspected pictures, improving the efficiency of pornographic content and scene recognition.
Step S82, intelligent identification of violent and terrorist content, namely classifying pictures and videos, relying on violent/terrorist picture and video data sources and a distributed deep learning platform, and identifying violent-terrorist scenes and objects, where scene identification covers parades, flags and station logos, and object identification covers guns, masks and bearded faces;
step S83, intelligent identification of politically sensitive figures, namely automatically identifying the political figures appearing in a video by comparing the facial features of political figures to determine whether a political figure exists in the video image, and if so, identifying the figure; the intelligent recognition model for politically sensitive figures establishes a knowledge base of such figures, and early warning on political figures is achieved through the knowledge base and the current face-similarity recognition engine.
Further, in step S9, self-built analysis and judgment models are created per police unit and associated with each unit, with corresponding service attributes marked and the applicable case directions explained, which facilitates accurate use of the models; the self-built models are based on different data sources, including group data and internet data, each analyzed with different technical and tactical methods, and when presented, the models are classified and displayed by data type; self-built models include public and private models, and any self-built model can, according to its current effect, be published as a public model or shared with designated other police units, satisfying each unit's independent needs while allowing models to be shared; the final application of a self-built model is to proactively alert and remind the user, by setting the start time, the applicable data range, the comparison task and the early-warning analysis hook.
In summary, the present application includes the following beneficial technical effects:
1. Using current artificial intelligence, big data and Mongolian information processing technology, Mongolian video information acquired through various channels is identified, translated, retrieved, monitored and managed, which greatly improves the automatic analysis capability for Mongolian video, reduces manual analysis cost, and improves analysis efficiency and real-time performance;
2. video data information is governed, and processing, organization and governance services are implemented for all kinds of Mongolian video information in public safety business, helping to create a clean cyberspace and maintain social stability.
Drawings
FIG. 1 is a general step diagram of a Mongolian video intelligent analysis method of the present application;
FIG. 2 is a detailed step diagram of speech information processing of a Mongolian video intelligent analysis method of the present application;
FIG. 3 is a detailed step diagram of image information processing of a Mongolian video intelligent analysis method of the present application;
FIG. 4 is a detailed step diagram of Mongolian text translation for a Mongolian video intelligent analysis method of the present application;
FIG. 5 is a detailed step diagram of a content analysis pre-warning of a Mongolian video intelligent analysis method of the present application;
FIG. 6 is a detailed step diagram of the general-model research and judgment recognition of the Mongolian video intelligent analysis method of the present application.
Detailed Description
Referring to FIGS. 1-6, the following description of the embodiments sets out in further detail the specific embodiments of the present application, such as the shapes and structures of the parts, their mutual positions and connection relationships, the roles and working principles of the parts, the manufacturing processes, and the methods of operation and use, to help those skilled in the art understand the inventive concept and technical scheme more fully. For convenience of description, reference is made to the directions shown in the drawings.
A Mongolian video intelligent analysis method comprises the following steps:
step S1, data processing, namely judging whether the received data are video data, if yes, turning to step S2, if not, carrying out format conversion to extract audio data and Mongolian text information, extracting Mongolian text information from the extracted audio data through audio transcription, then translating the Mongolian text information into corresponding Chinese text information, and turning to step S5;
s2, voice information processing, namely extracting voice information in video data for processing;
s3, image information processing, namely extracting image information in video data for processing;
s4, performing Mongolian text translation, namely extracting Mongolian text information from image information through OCR (optical character recognition) of the image, extracting Mongolian text information from voice information through audio transcription, and translating the Mongolian text information into corresponding Chinese text information;
step S5, element extraction, namely identifying element information contained in the Chinese text information, including person names, place names, transactions and organization information, and performing triplet combination according to the extracted person names, place names and transactions to form corresponding element organization information;
step S6, intention recognition, namely classifying the Chinese text information into one or more categories according to the theme, the content and the attribute of the Chinese text information, and recognizing the intention expressed in the text information;
step S7, content analysis and early warning, in which the system, combining base libraries of original audio, MD5 hashes, voiceprints and keywords with tactical models built from human experience, scores videos for early warning and ranks them by score;
step S8, general-model research and judgment, namely performing research-and-judgment recognition on the video data using general-purpose models;
and step S9, self-built analysis and judgment models, namely forming targeted event analysis and judgment models by analyzing and studying, from big data, how events occur and develop, so as to perform further judgment.
By the method, the video data of the Mongolian are identified, translated, retrieved, monitored and managed, so that the automatic analysis capability of the Mongolian video is greatly improved, the manual analysis cost is reduced, and the analysis efficiency and the real-time rate are improved.
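As a concrete orientation, the sketch below traces this S1-S9 flow in Python. Every helper is a hypothetical stub standing in for the engines the method describes (transcription, OCR, translation, element extraction, intent recognition, content scoring), not an API named by the patent.

```python
# Hypothetical sketch of the step S1-S9 dispatch: video data goes through the
# audio/image pipelines; non-video data is format-converted and transcribed,
# then both branches join at translation and element extraction.

def transcribe_mongolian(audio):      return "mongolian text (stub)"   # audio transcription
def ocr_mongolian(images):            return "mongolian text (stub)"   # image OCR
def translate_to_chinese(mongolian):  return "chinese text (stub)"     # Mongolian-Chinese MT
def extract_elements(chinese):        return ["name", "place"]         # step S5
def recognize_intent(chinese):        return ["category_1"]            # step S6
def content_score(chinese):           return 0.42                      # step S7

def analyze(data, is_video):
    if is_video:
        audio, images = data["audio"], data["images"]       # steps S2 and S3
        mongolian = transcribe_mongolian(audio) + " " + ocr_mongolian(images)  # step S4
    else:
        audio = data["converted_audio"]                     # step S1 format conversion
        mongolian = transcribe_mongolian(audio)
    chinese = translate_to_chinese(mongolian)
    return {
        "elements": extract_elements(chinese),
        "intents": recognize_intent(chinese),
        "warning_score": content_score(chinese),            # feeds steps S8/S9
    }

print(analyze({"audio": b"", "images": []}, is_video=True))
```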
The step S2 includes:
step S21, voice information preprocessing, namely, scene segmentation is carried out on audio in video, music, noise, voice and the like are judged, and the voice in the audio is extracted, so that the subsequent voice recognition processing is facilitated;
the voice information preprocessing further comprises:
S211, an energy four-threshold algorithm sets four states, namely silence, voice start, voice stable and voice attenuation, whose ranges can be user-defined as required, and sets the four energy thresholds needed for transitions between the states; transitions among the four states are driven by the per-frame energy of the audio in the video data, finally detecting the higher-energy voice segments in the audio;
S212, rule-based noise judgment, using the band energy of the audio to make an initial judgment of music and noise scenes on the signal segments that pass the energy four-threshold algorithm; rule-based preliminary detection handles most scenes, but since each deployed system runs in its own application environment, scene conditions differ from one environment to another, so model classifier judgment is also needed.
S213, model classifier judgment, namely training models matched to the various scenes appearing in the system's actual application environment; discriminative training with a minimum classification error criterion is introduced during training, improving scene discrimination and the matching precision for the various scenes, and yielding the final effective voice.
The complexity of these three steps increases in turn; they detect different scene types respectively, and finally produce the scene segmentation and the voice segments within each scene.
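The energy four-threshold algorithm of step S211 can be pictured as a small state machine over per-frame energies. The sketch below is a minimal Python rendering under assumed threshold values t1..t4; the patent fixes neither the numbers nor the exact transition rules.

```python
# Four states from S211: silence, voice start, voice stable, voice attenuation.
# t1: onset threshold, t2: strong-voice confirmation, t3: decay onset,
# t4: back-to-silence threshold. All values are illustrative assumptions.
SILENCE, VOICE_START, VOICE_STABLE, VOICE_DECAY = range(4)

def detect_voice_segments(frame_energies, t1=2.0, t2=5.0, t3=4.0, t4=1.0):
    """Return (start, end) frame-index pairs of high-energy voice segments."""
    state, segments, start = SILENCE, [], None
    for i, e in enumerate(frame_energies):
        if state == SILENCE and e > t1:          # energy rises above onset threshold
            state, start = VOICE_START, i
        elif state == VOICE_START:
            if e > t2:                           # confirmed strong voice energy
                state = VOICE_STABLE
            elif e < t4:                         # false start: back to silence
                state, start = SILENCE, None
        elif state == VOICE_STABLE and e < t3:   # energy begins to fall
            state = VOICE_DECAY
        elif state == VOICE_DECAY:
            if e < t4:                           # fully decayed: close the segment
                segments.append((start, i))
                state, start = SILENCE, None
            elif e > t3:                         # energy recovered: still voice
                state = VOICE_STABLE
    if start is not None:                        # segment still open at end of audio
        segments.append((start, len(frame_energies)))
    return segments

# A synthetic energy contour with one clear voice burst -> [(2, 7)]
print(detect_voice_segments([0.5, 0.5, 3.0, 6.0, 6.5, 6.0, 3.5, 0.5, 0.3]))
```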
Step S22, voice information language identification, namely performing acoustic model training and language model training on Mongolian languages, performing language identification comparison on video data to be processed, automatically identifying and judging the languages to which the video data belong, and confirming video data fragments of Mongolian languages in the video data, so that the language identification efficiency of Mongolian can be improved;
the acoustic model training is used for establishing a database of Mongolian language voice information, and comprises data screening, data labeling, quality review and sampling review. The acoustic model training has an effective data volume of 3000 hours. The screening voice of the data screening is 12000 hours, and the breakage rate is calculated according to 75%. The voice of the data label is 3890 hours, and the breakage rate is calculated according to 23%. The quality control data is subjected to 100% full detection after 3890 hours of data marking. Sampling and rechecking the data after quality rechecking, extracting 20% to recheck, wherein the checked data are 600 hours, and finally 3000 hours of effective data are formed.
Language model training is used to build a database of Mongolian textual information, similar to acoustic model training, and will not be described in detail herein.
Step S23, voice information transcription, namely performing endpoint detection and noise reduction on the music, noise and voice obtained from voice information preprocessing, and extracting acoustic features; the extracted acoustic features of the Mongolian-language video data identified during voice information language identification are then decoded with a decoder using the trained acoustic model and language model, and audio-to-text conversion is performed, so that Mongolian text information can be obtained rapidly.
The step S3 includes:
Step S31, image information preprocessing, namely screening images via validity detection, sharpness detection and MD5 de-duplication; checking image enhancement, image binarization, image perspective transformation, image boundary detection, image tilt detection, image external-block detection and image content-area detection; and performing binarization, noise removal and tilt correction on the image information that passes detection. The image boundary detection is black-edge detection, used to detect whether a black region exists at the image boundary; an image external block is an image block not belonging to the page itself, used to detect whether the images belong to the same page.
The binarization is used for enabling the image information to only contain black foreground information and white background information, so that the efficiency and the accuracy of image information preprocessing are improved; the noise removal carries out denoising treatment on the image information to be identified according to the characteristics of the noise, so that the accuracy of image information preprocessing is improved; the tilt correction is used for correcting the image direction;
Because a color image carries too much information, the image is binarized before the printed characters in it are recognized, so that the image contains only black foreground information and white background information, improving the efficiency and accuracy of recognition. Because the quality of the image to be recognized is limited by the input device, the environment and the printing quality of the document, the image is denoised according to the characteristics of the noise before recognition, improving accuracy. Because scanning and shooting involve manual operation, the image input into the computer is more or less tilted, so the image orientation is detected and corrected before recognition. The throughput of image information preprocessing is about one million pictures per hour.
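A minimal OpenCV sketch of this binarize / denoise / deskew chain is given below. The parameter values and the deskew-by-minimum-area-rectangle recipe are illustrative assumptions, not the patent's stated implementation, and minAreaRect angle conventions vary across OpenCV versions.

```python
# Sketch of the image preprocessing described above, under assumed parameters.
import cv2
import numpy as np

def preprocess_for_ocr(path: str) -> np.ndarray:
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    # Noise removal: non-local-means denoising (h, template, search are assumptions).
    gray = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)
    # Binarization: black foreground on white background via Otsu's threshold.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Tilt correction: estimate skew from the minimum-area rectangle around ink.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:            # normalize to (-45, 45]; conventions differ by version
        angle -= 90
    h, w = binary.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, m, (w, h), flags=cv2.INTER_NEAREST)
```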
Step S32, image information recognition, combines an optical character recognition (OCR) service that examines dark and light patterns to determine the shapes in the image and then translates those shapes into computer characters; various pattern recognition algorithms analyze character morphological characteristics to determine the standard codes of the Mongolian characters, so that Mongolian text information contained in a picture can be rapidly extracted and stored in a text document in a general character format, providing more data support for subsequent business analysis.
The step S4 includes:
step S41, word segmentation is carried out, and the Chinese character sequence is segmented into word sequences.
Step S42, part-of-speech tagging, assigns each word in the sentence a part-of-speech category, including number and name categories, to prevent numbers and names from misleading the translation. Part of speech, as a generalization over words, plays an important role in tasks such as language identification, syntactic analysis and information extraction. Part-of-speech information is also used during translation: numbers, person names and the like are difficult to translate correctly by the decoder alone, but if they are identified during preprocessing, a placeholder is used during translation, for example numbers are replaced by $number and person names by $human_name, and the original words are restored in post-processing, which handles the translation of numbers, names and the like much better.
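The placeholder mechanism can be sketched as a pre/post-processing pair around the decoder. The regexes and the person-name lexicon below are illustrative; a real system would take names from the lexical-analysis stage. The sketch assumes the decoder preserves the order of same-tag placeholders.

```python
# Protect numbers and names with $number / $human_name, restore after decoding.
import re

NAME_LEXICON = ["Batar", "Zhang Yu"]   # hypothetical person-name lexicon
NAME_RE = re.compile("|".join(map(re.escape, NAME_LEXICON)))
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def protect(sentence):
    """Swap names/numbers for placeholders; remember the originals per tag."""
    slots = {"$human_name": [], "$number": []}
    def stash(tag):
        def _sub(match):
            slots[tag].append(match.group(0))
            return tag
        return _sub
    sentence = NAME_RE.sub(stash("$human_name"), sentence)
    sentence = NUM_RE.sub(stash("$number"), sentence)
    return sentence, slots

def restore(translated, slots):
    """Put the original words back after decoding (assumes order is kept)."""
    for tag, queue in slots.items():
        for original in queue:
            translated = translated.replace(tag, original, 1)
    return translated

masked, slots = protect("Batar paid 300 yuan on March 5")
print(masked)                  # $human_name paid $number yuan on March $number
print(restore(masked, slots))  # round-trips when the decoder keeps placeholders
```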
In step S43, the decoder decodes, comprising the conventional hierarchical phrase-based decoder PSMT and the neural-network-based decoder NMT. The hierarchical phrase-based decoder PSMT comprises a translation model, a language model, a distortion model, a reordering model, a search space and log-linear model scoring; it segments sentences into phrases, translates each phrase, and then reorders them. The search space contains all segmented phrases and yields all translation hypotheses; the log-linear model scores the hypotheses, and the highest-scoring hypothesis is selected as the translation result, improving translation accuracy.
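Hypothesis selection in such a decoder reduces to a weighted sum of feature scores and an argmax. A minimal sketch, with illustrative feature names and untuned weights:

```python
# Linear-model scoring over translation hypotheses; all values are assumptions.

def score(hypothesis, weights):
    return sum(weights[name] * value for name, value in hypothesis["features"].items())

weights = {"translation_model": 0.9, "language_model": 0.6, "reordering": 0.3}

hypotheses = [
    {"text": "candidate A",
     "features": {"translation_model": -2.1, "language_model": -3.0, "reordering": -0.5}},
    {"text": "candidate B",
     "features": {"translation_model": -1.8, "language_model": -3.4, "reordering": -0.2}},
]

best = max(hypotheses, key=lambda h: score(h, weights))
print(best["text"])   # the highest-scoring hypothesis is chosen as the translation
```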
In step S5, element extraction uses industry domain data from big data together with manual domain-expert labeling, which covers morphology, syntax, and semantics; statistical models for lexical, syntactic and semantic analysis are trained on the labeled data. Lexical analysis uses a conditional random field model combined with rule grammars, which achieves a better word segmentation effect. For syntax, a statistical parsing model is built on a probabilistic context-free grammar, a parsing algorithm is designed based on dynamic programming ideas, and the parsing algorithm's pruning strategy is optimized for efficiency, as sketched below. Semantics are extracted over the syntactic structure tree, a semantic disambiguation model is trained on semantically labeled data, and semantic understanding of elements is achieved in combination with semantic-analysis rule grammars. With this scheme, the effective elements in the video data can be extracted to the greatest extent.
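For the syntax step, a probabilistic context-free grammar parsed with CKY dynamic programming can be sketched as follows; the toy grammar (in Chomsky normal form) and its probabilities are illustrative assumptions.

```python
# Minimal CKY parser over a toy PCFG, keeping the best-probability derivation.

binary_rules = {            # (left child, right child) -> [(parent, probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
}
lexical_rules = {           # word -> [(parent, probability)]
    "the": [("Det", 1.0)], "dog": [("N", 0.5)], "ball": [("N", 0.5)],
    "chased": [("V", 1.0)], "Rex": [("NP", 0.4)],
}

def cky(words):
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n)]   # chart[i][j]: span i..j
    for i, w in enumerate(words):
        for sym, p in lexical_rules.get(w, []):
            chart[i][i + 1][sym] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                        # split point
                for b, pb in chart[i][k].items():
                    for c, pc in chart[k][j].items():
                        for a, pr in binary_rules.get((b, c), []):
                            p = pr * pb * pc
                            if p > chart[i][j].get(a, 0.0):  # keep best derivation
                                chart[i][j][a] = p
    return chart[0][n].get("S", 0.0)

print(cky("Rex chased the ball".split()))   # probability of the best S parse: 0.12
```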
In step S6, intent recognition supports a multi-category combined keyword matching (KWS) strategy: based on an experimental prototype, positive and negative combined keywords can be customized per category, realizing rule-based keyword matching. Intent recognition supports five strategies: KWS, KWP, NB, LDA+SVM and NN; for multi-strategy classification, all five can be configured and used together, and each strategy supports multi-category discrimination, improving recognition accuracy. The configuration method first performs multi-strategy score fusion under multi-category text classification, then a comprehensive multi-strategy fusion, and finally configures each strategy's weight so that scores are fused by weight, again improving accuracy (a sketch follows below). Intent recognition supports a unified multi-strategy input-output format, with unified input and output definitions across strategies, in particular a unified JSON output format. Based on the NN strategy, several NN models are loaded simultaneously; each NN model's weight and threshold can be configured, and the NN strategy computes each model's score and outputs a fused result. The NN models used by the NN strategy can be switched dynamically, which is flexible and convenient.
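A minimal sketch of the weighted multi-strategy score fusion and a unified JSON output; the weights, the NN threshold and the JSON shape are illustrative assumptions.

```python
# Weighted fusion of per-strategy, per-category scores with a unified output.
import json

STRATEGY_WEIGHTS = {"KWS": 0.3, "KWP": 0.1, "NB": 0.2, "LDA+SVM": 0.2, "NN": 0.2}
NN_THRESHOLD = 0.5          # NN scores below this threshold are ignored

def fuse(per_strategy_scores):
    """per_strategy_scores: strategy -> {category: score in [0, 1]}."""
    fused = {}
    for strategy, scores in per_strategy_scores.items():
        w = STRATEGY_WEIGHTS[strategy]
        for category, s in scores.items():
            if strategy == "NN" and s < NN_THRESHOLD:
                continue
            fused[category] = fused.get(category, 0.0) + w * s
    return fused

scores = {
    "KWS": {"incitement": 1.0},
    "NB": {"incitement": 0.7, "normal": 0.3},
    "NN": {"incitement": 0.8, "normal": 0.4},   # 0.4 falls below the threshold
}
fused = fuse(scores)
# Unified output format (the JSON shape is an assumption, not the patent's spec).
print(json.dumps({"categories": fused, "top": max(fused, key=fused.get)}, indent=2))
```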
The step S7 includes:
step S71, human voice separation, namely, performing active voice monitoring on voice contents in video data and audio data, identifying noise of each part in video data and audio data fragments, wherein the noise comprises silence, white noise and color ring, suppressing the noise according to monitoring results, enhancing effective voice, clustering according to voiceprint characteristics of different speakers, finally realizing human voice separation of the speakers, realizing blind separation aiming at the characteristic of uncertain number of Internet audio/video speakers, realizing uploading voice only, and automatically identifying speaker fragment information; voiceprint early warning, voice separation of people is followed by voiceprint extraction and registration of video data and audio data, then voiceprint information in the video data and the audio data is registered in a voiceprint library, and by combining an early warning discovery module of the voiceprint library, specific personnel crossing applications are discovered, so that the accuracy and efficiency of the voiceprint early warning can be improved; because the Internet voice fragments are generally shorter, the time length requirement of voiceprint extraction cannot be met, and at the moment, a plurality of short audios are spliced, and the voiceprint extraction can be carried out after the short audios are spliced.
Step S72, image early warning, namely, based on scene recognition and image language recognition, labeling and scenerising the acquired image data, and pushing and early warning the data of gun-related, mongolian-related and specific portrait-related; the image early warning comprises a human image early warning, a Mongolian picture identification early warning and a gun-related picture discovery early warning; the portrait early warning can establish a knowledge base of key portraits, and the key portraits are early warned through the knowledge base and the current face similarity recognition engine; the recognition early warning of the Mongolian picture calls an image OCR and a language recognition engine; the discovery early warning of the gun-related pictures calls an object monitoring engine of the image type; the early warning content of the image early warning comprises: pictures of figures, crowd, parades, firearms, flags, pornography, blood fishy smell, self-burning and burn;
step S73, text content early warning: a keyword knowledge base is established, and OCR recognition of images together with transcription of audio data is used to find harmful text information; when video data is ingested, the Chinese and Mongolian content in the images is recognized, extracted and compared against the keyword knowledge base; text content early warning also uses parallel processing to improve its throughput, as sketched below.
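A minimal sketch of keyword-knowledge-base matching with parallel processing; the keyword entries and the choice of a thread pool are illustrative assumptions.

```python
# Parallel scan of extracted Chinese/Mongolian texts against a keyword base.
from concurrent.futures import ThreadPoolExecutor

KEYWORD_BASE = {"hypothetical_keyword_1", "hypothetical_keyword_2"}

def scan(text):
    """Return the knowledge-base keywords found in one extracted text."""
    hits = {kw for kw in KEYWORD_BASE if kw in text}
    return {"text": text[:40], "hits": sorted(hits), "alert": bool(hits)}

def scan_all(texts, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan, texts))

texts = ["... hypothetical_keyword_1 appears here ...", "harmless caption"]
for result in scan_all(texts):
    print(result)
```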
The step S8 includes:
Step S81, pornographic content and scene recognition, which classifies content into pornographic, sexy and normal types; several network models are trained, and multi-model cascade judgment is adopted for specific users; for video porn screening, frames are first cut and screened, and a video-segment algorithm and an optical flow algorithm are adopted for suspected pictures to confirm whether the content belongs to pornographic content and scenes, improving the efficiency of pornographic content and scene recognition.
Step S82, intelligent identification of violent and terrorist content, namely classifying pictures and videos, relying on violent/terrorist picture and video data sources and a distributed deep learning platform, and identifying violent-terrorist scenes and objects, where scene identification covers parades, flags and station logos, and object identification covers guns, masks and bearded faces; frames are also cut in advance for the processing of video data.
Step S83, intelligent identification of politically sensitive figures, namely automatically identifying the political figures appearing in a video by comparing the facial features of political figures to determine whether a political figure exists in the video image, and if so, identifying the figure; the intelligent recognition model for politically sensitive figures establishes a knowledge base of such figures, and early warning on political figures is achieved through the knowledge base and the current face-similarity recognition engine, as sketched below. The knowledge base can pre-store materials on various key figures, making comparison and identification convenient and rapid.
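Early warning against the knowledge base reduces to nearest-neighbour search over face embeddings. A minimal cosine-similarity sketch, where the embeddings and the 0.7 alert threshold are illustrative assumptions; the patent does not specify the similarity engine's internals.

```python
# Match a face embedding against a knowledge base of key figures.
import numpy as np

KNOWLEDGE_BASE = {                       # person -> stored face embedding (assumed)
    "figure_A": np.array([0.9, 0.1, 0.4]),
    "figure_B": np.array([0.1, 0.8, 0.6]),
}
ALERT_THRESHOLD = 0.7                    # assumed similarity threshold

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_face(embedding):
    """Return (person, similarity) of the best match above threshold, else None."""
    name, stored = max(KNOWLEDGE_BASE.items(), key=lambda kv: cosine(embedding, kv[1]))
    sim = cosine(embedding, stored)
    return (name, sim) if sim >= ALERT_THRESHOLD else None

print(match_face(np.array([0.88, 0.15, 0.35])))   # close to figure_A -> alert
```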
In step S9, self-built analysis and judgment models are created per police unit: because each unit uses different techniques and tactics in analysis and mines in different directions, each self-built model is associated with its unit, marked with the corresponding service attributes, and annotated with the applicable case directions, which facilitates accurate use. The self-built models are based on different data sources, including group data and internet data, each analyzed with different technical and tactical methods; when presented, the models are classified and displayed by data type. Self-built models include public and private models, and any self-built model can, according to its current effect, be published as a public model or shared with designated other police units, satisfying each unit's independent needs while allowing models to be shared, which facilitates data mining and analysis. The final application of a self-built model is to proactively alert and remind the user, by setting the start time, the applicable data range, the comparison task and the early-warning analysis hook.
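The configuration surface described here (start time, data range, comparison task, early-warning hook, visibility) can be sketched as a plain record; all field names below are illustrative assumptions.

```python
# Hypothetical configuration record for a self-built analysis and judgment model.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SelfBuiltModel:
    name: str
    police_unit: str                           # unit the model is associated with
    service_attributes: list = field(default_factory=list)
    case_direction: str = ""                   # applicable case direction
    data_source: str = "internet"              # e.g. "internet" or "group"
    visibility: str = "private"                # "private", "public", or "shared"
    shared_with: list = field(default_factory=list)
    start_time: datetime | None = None
    data_range_days: int = 30                  # applicable data range
    comparison_task: str = ""
    early_warning_hook: str = ""               # hook invoked to alert the user

model = SelfBuiltModel(
    name="hypothetical_event_model",
    police_unit="unit_X",
    service_attributes=["video", "mongolian_text"],
    case_direction="public-opinion events",
    start_time=datetime(2022, 6, 1),
    comparison_task="voiceprint_compare",
    early_warning_hook="notify_duty_officer",
)
print(model.visibility, model.data_range_days)
```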
The invention and its embodiments have been described above schematically, and the description is not limiting; the drawings show only one embodiment of the invention, and the actual structure is not limited to it. Therefore, if a person of ordinary skill in the art, informed by this disclosure, devises structural arrangements and embodiments similar to this technical scheme without departing from the gist of the invention and without creative effort, they fall within the protection scope of the invention.

Claims (6)

1. The intelligent Mongolian video analysis method is characterized by comprising the following steps of:
step S1, data processing, namely judging whether the received data are video data, if yes, turning to step S2, if not, carrying out format conversion to extract audio data and Mongolian text information, extracting Mongolian text information from the extracted audio data through audio transcription, then translating the Mongolian text information into corresponding Chinese text information, and turning to step S5;
s2, voice information processing, namely extracting voice information in video data for processing;
s3, image information processing, namely extracting image information in video data for processing;
s4, performing Mongolian text translation, namely extracting Mongolian text information from image information through OCR (optical character recognition) of the image, extracting Mongolian text information from voice information through audio transcription, and translating the Mongolian text information into corresponding Chinese text information;
step S5, element extraction, namely identifying element information contained in the Chinese text information, wherein the element information contains name, place name, transaction and organization information, and obtaining corresponding element organization information according to the extracted name, place name and transaction;
step S6, intention recognition, namely classifying the Chinese text information into one or more categories according to the theme, the content and the attribute of the Chinese text information, and recognizing the intention expressed in the text information;
S7, content analysis and early warning, wherein the system, combining base libraries of original audio, MD5 hashes, voiceprints and keywords with tactical models built from human experience, scores videos for early warning and ranks them by score;
S8, general-model research and judgment, wherein research-and-judgment recognition is performed on the video data using general-purpose models;
step S9, self-built analysis and judgment models, wherein targeted event analysis and judgment models are formed by analyzing and studying, from big data, how events occur and develop, so as to conduct further judgment;
the step S4 includes:
step S41, word segmentation, namely segmenting a Chinese character sequence into word sequences;
step S42, part-of-speech tagging, namely giving each word in the sentence a part-of-speech category comprising numbers and names;
step S43, decoder decoding, comprising a hierarchical phrase-based decoder PSMT and a neural-network-based decoder NMT; the hierarchical phrase-based decoder PSMT comprises a translation model, a language model, a reordering model, a search space and log-linear model scoring, and is used for segmenting sentences into phrases, translating each phrase, and then reordering; the search space contains all segmented phrases and yields all translation hypotheses, the log-linear model scores the translation hypotheses, and the highest-scoring hypothesis is selected as the translation result;
in the step S5, element extraction uses industry domain data from big data together with manual domain-expert labeling, the manual domain-expert labeling comprising: morphology, syntax, and semantics; statistical models for lexical, syntactic and semantic analysis are trained on the labeled data; the lexical analysis adopts a conditional random field model combined with rule grammars; the syntax adopts a probabilistic context-free grammar to establish a statistical parsing model, a parsing algorithm is designed based on dynamic programming ideas, and the pruning strategy of the parsing algorithm is optimized for efficiency; the semantics are extracted based on the syntactic structure tree, a semantic disambiguation model is trained on semantically labeled data, and semantic understanding of elements is realized in combination with semantic-analysis rule grammars;
in the step S6, intent recognition supports a multi-category combined keyword matching KWS strategy, wherein, based on an experimental prototype, positive and negative combined keywords of different categories can be customized, realizing rule-based keyword matching; intent recognition supports the five strategies KWS, KWP, NB, LDA+SVM and NN; intent recognition completes multi-strategy score fusion under multi-category text classification, each strategy supports multi-category discrimination, comprehensive multi-strategy score fusion is finally carried out, the weight of each strategy is configured, and score fusion is carried out according to the weights; intent recognition supports a unified multi-strategy input-output format, with unified input-output definitions completed across strategies; intent recognition, based on the NN strategy, loads several NN models simultaneously, supports configuring the weight and threshold of each NN model, and the NN strategy computes each NN model's score and outputs a fused result; intent recognition enables dynamic switching of the NN models used by the NN strategy;
the step S7 includes:
step S71, human voice separation, namely performing active voice detection on the speech content in video data and audio data, identifying the noise in each part of the video and audio segments, the noise comprising silence, white noise and ring-back tones, suppressing the noise according to the detection results, enhancing the effective speech, and then clustering according to the voiceprint characteristics of different speakers to finally achieve speaker separation; voiceprint early warning, wherein voiceprint extraction and registration of the video data and audio data are carried out after speaker separation, the voiceprint information in the video data and audio data is registered in a voiceprint library, and, combined with the early-warning discovery module of the voiceprint library, specific persons are discovered across applications;
step S72, image early warning, namely labeling and scene-tagging the acquired image data based on scene recognition and image language recognition, and pushing early warnings for gun-related, Mongolian-related and specific-portrait data; the image early warning comprises portrait early warning, Mongolian picture recognition early warning and gun-related picture discovery early warning; the portrait early warning establishes a knowledge base of key figures, and warns on key figures through the knowledge base and the current face-similarity recognition engine; the Mongolian picture recognition early warning calls the image OCR and language recognition engines; the gun-related picture discovery early warning calls the image-class object detection engine; the early-warning content of the image early warning comprises: pictures of figures, crowds, parades, firearms, flags, pornography, gore, self-immolation and burns;
step S73, text content early warning, wherein a keyword knowledge base is established, OCR recognition of images and transcription of audio data are used to find harmful text information, Chinese and Mongolian content in the images is recognized and extracted when video data is accessed and compared against the keyword knowledge base, and parallel processing is adopted.
2. The intelligent analysis method for Mongolian videos according to claim 1, wherein the intelligent analysis method comprises the following steps:
the step S2 includes:
step S21, preprocessing voice information, and dividing the audio in the video data into music, noise and voice;
step S22, voice information language identification, namely performing language identification comparison on video data to be processed by performing acoustic model training and language model training on Mongolian languages, automatically identifying and judging the languages to which the video data belong, and confirming video data fragments of Mongolian languages in the video data;
step S23, voice information transcription, wherein endpoint detection and noise reduction are performed on the music, noise and voice obtained from voice information preprocessing, and acoustic features are extracted; the extracted acoustic features of the Mongolian-language video data identified during voice information language identification are decoded with a decoder using the trained acoustic model and language model, and audio-to-text conversion is performed to obtain Mongolian text information.
3. The intelligent analysis method for Mongolian videos according to claim 2, wherein the intelligent analysis method is characterized by comprising the following steps of:
in the step S21, the voice information preprocessing further includes:
S211, an energy four-threshold algorithm sets four states, namely silence, voice start, voice stable and voice attenuation, and sets the four energy thresholds required for transitions between the states; transitions among the four states are realized according to the per-frame energy of the audio in the video data, finally detecting the higher-energy voice segments in the audio;
s212, based on a noise judgment algorithm of a rule, utilizing the frequency band energy of the audio to perform initial judgment of music and noise scenes on the signal fragments passing through an energy four-threshold algorithm;
s213, judging by a model classifier, and training a model matched with various scenes appearing in an actual application scene according to the application environment of the actual system; during the training process, differential training is introduced and a minimum classification error criterion is used.
4. The intelligent analysis method for Mongolian videos according to claim 1, wherein the intelligent analysis method comprises the following steps:
the step S3 includes:
step S31, image information preprocessing: images are screened for validity detection, sharpness detection, and MD5 de-duplication; they are further screened for image enhancement, image binarization, image perspective transformation, image boundary detection, image tilt detection, image outer-block detection, and image content-area detection; binarization, noise removal, and tilt correction are then performed on the image information that passes detection (see the sketch after this claim);
the binarization leaves the image information with only black foreground information and white background information; the noise removal denoises the image information to be recognized according to the characteristics of the noise; the tilt correction corrects the image orientation;
step S32, image information recognition: combined with the OCR recognition service, various pattern recognition algorithms are used to analyze character morphological features, the standard character codes of the Mongolian script are determined, and the Mongolian text information contained in the pictures is extracted and stored in a text document in a universal character format.
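For illustration, the S31 chain of binarization, noise removal, and tilt correction could look like the following OpenCV sketch; Otsu thresholding, median-blur denoising, and minAreaRect-based deskew are common stand-ins, since the claim does not name specific algorithms:

```python
# Sketch of the S31 preprocessing chain (assumed algorithms, see above).
import cv2
import numpy as np

def preprocess(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Binarization: black foreground on a white background, per the claim.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Noise removal: a small median filter suppresses salt-and-pepper noise.
    denoised = cv2.medianBlur(binary, 3)
    # Tilt correction: estimate skew from the minimum-area rectangle around
    # the black foreground pixels, then rotate to correct the orientation.
    coords = np.column_stack(np.where(denoised == 0)).astype(np.float32)
    if len(coords) == 0:
        return denoised                      # blank image: nothing to deskew
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:        # OpenCV's angle convention varies by version;
        angle -= 90       # fold into [-45, 45] so small skews rotate right
    h, w = denoised.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(denoised, m, (w, h), borderValue=255)
```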
5. The intelligent analysis method for Mongolian videos according to claim 1, wherein the intelligent analysis method comprises the following steps:
the step S8 includes:
step S81, pornographic content and scene recognition: content is divided into pornographic, sexy, and normal; multiple network models are trained, and multi-model cascade judgment is adopted for specific users; video pornography screening adopts frame extraction with per-frame screening, and suspected frames are handled with a video segment algorithm and an optical flow algorithm;
step S82, intelligent recognition of violent and terrorist content: drawing on violent and terrorist picture and video data sources and relying on a distributed deep learning platform, pictures and videos are classified and violent and terrorist scenes and objects are recognized, wherein the scene recognition covers demonstrations, flags, and station logos, and the object recognition covers guns, masks, and bearded faces;
step S83, intelligent recognition of politically sensitive figures: political figures appearing in the video are recognized automatically and intelligently; by comparing the facial features of political figures, whether a political figure is present in the video image is determined, and if so, the user is alerted; the intelligent recognition model for politically sensitive figures builds a knowledge base of politically sensitive figures, and early warning of political figures is achieved through the knowledge base and the current face similarity recognition engine.
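As a non-limiting illustration of the face comparison in step S83 (and the portrait early warning of step S72), the following sketch matches a face embedding against a knowledge base of stored embeddings; the embedding extractor is out of scope here, and the 0.8 threshold is an assumption:

```python
# Sketch of the knowledge-base face-similarity comparison (S83 / S72).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_against_base(face_embedding, knowledge_base, threshold=0.8):
    """Return (name, score) pairs for every knowledge-base entry whose
    stored embedding is similar enough to the face found in the frame."""
    hits = [(name, cosine_similarity(face_embedding, ref))
            for name, ref in knowledge_base.items()]
    return sorted([h for h in hits if h[1] >= threshold],
                  key=lambda h: h[1], reverse=True)
```

Any returned hit would trigger the user alert described in the claim.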
6. The intelligent analysis method for Mongolian videos according to claim 1, wherein the intelligent analysis method comprises the following steps:
in the step S9, the self-built analysis and judgment models are created per police unit and associated with each police unit, and corresponding business attributes are marked to indicate the applicable case direction; the self-built analysis and judgment models are based on different data sources, comprising two groups of data, internal data and Internet data, which are analyzed with different technical and tactical methods respectively, and when presented they are classified and displayed by data type; the self-built analysis and judgment models comprise public models and private models, and any self-built model can, according to its current effectiveness, be published as a public model or designated for sharing with other police units; the final application of the self-built analysis and judgment model actively alarms and reminds the user by setting the start time and applicable data range and by linking the comparison task with the early-warning analysis.
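Purely as an illustration of the attributes claim 6 enumerates, a self-built analysis and judgment model might be represented by a record such as the one below; every field name is an assumption inferred from the claim text, not from the patent's implementation:

```python
# Hypothetical record for a self-built analysis and judgment model (S9).
from dataclasses import dataclass, field

@dataclass
class AnalysisModel:
    owner_unit: str                  # police unit the model is bound to
    business_attribute: str          # case direction the model applies to
    data_source: str                 # internal data or Internet data
    visibility: str = "private"      # "private" or "public"
    shared_with: list[str] = field(default_factory=list)  # designated units
    start_time: str = ""             # when the comparison task starts
    data_scope: str = ""             # range of data the task applies to

    def publish(self) -> None:
        """Promote a well-performing private model to a public one."""
        self.visibility = "public"
```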
CN202210575336.5A 2022-05-24 2022-05-24 Intelligent Mongolian video analysis method Active CN114998785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210575336.5A CN114998785B (en) 2022-05-24 2022-05-24 Intelligent Mongolian video analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210575336.5A CN114998785B (en) 2022-05-24 2022-05-24 Intelligent Mongolian video analysis method

Publications (2)

Publication Number Publication Date
CN114998785A CN114998785A (en) 2022-09-02
CN114998785B true CN114998785B (en) 2023-06-02

Family

ID=83029926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210575336.5A Active CN114998785B (en) 2022-05-24 2022-05-24 Intelligent Mongolian video analysis method

Country Status (1)

Country Link
CN (1) CN114998785B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738359A (en) * 2023-05-23 2023-09-12 内蒙古工业大学 Mongolian multi-mode emotion analysis method based on pre-training model and high-resolution network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933072A (en) * 2014-03-19 2015-09-23 北京航天长峰科技工业集团有限公司 Multi-language internet information analysis method
US20190043500A1 (en) * 2017-08-03 2019-02-07 Nowsportz Llc Voice based realtime event logging
CN108256513A (en) * 2018-03-23 2018-07-06 中国科学院长春光学精密机械与物理研究所 A kind of intelligent video analysis method and intelligent video record system
CN108806668A (en) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 A kind of audio and video various dimensions mark and model optimization method
CN110688862A (en) * 2019-08-29 2020-01-14 内蒙古工业大学 Mongolian-Chinese inter-translation method based on transfer learning

Also Published As

Publication number Publication date
CN114998785A (en) 2022-09-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Hu Rong, Dong Dewu, Chen Zhiwen, Wang Mengzhong, Zhang Li, Sun Yicheng, Wang Tao, Zhou Bateer, Ren Fuqiang, Jiang Xiaodong, Naren Grzyle, Hou Jian, Yang Lili, Zhang Yu, Feng Xiang
Inventor before: Zhou Bateer, Jiang Xiaodong, Yang Lili, Zhang Yu, Feng Xiang, Dong Dewu, Wang Mengzhong