CN111488725A

CN111488725A - Machine intelligent auxiliary root-pricking theoretical coding optimization method

Info

Publication number: CN111488725A
Application number: CN202010178957.0A
Authority: CN
Inventors: 卢暾; 蒋特; 顾宁
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-03-15
Filing date: 2020-03-15
Publication date: 2020-08-04
Anticipated expiration: 2040-03-15
Also published as: CN111488725B

Abstract

The invention belongs to the technical field of qualitative research, and particularly relates to a machine intelligent auxiliary root-tying theoretical coding optimization method. The core of the parent optimization method is embodied in two links: feature extraction and automatic coding classification: the feature extraction is to extract the text information features under the same classification according to the design that the text has higher feature consistency on information in the same coding classification and is used as the classification basis of the subsequent automatic classification link; the automatic coding is to calculate the similarity between the new text and each classified corpus according to the classification characteristics extracted in the characteristic extraction link and classify the new text into the classification with the highest similarity; in the whole encoding process, the processes of manual adjustment, feature re-extraction and the like are combined to obtain a more accurate encoding result. The invention integrates machine intelligent technology in the classic root theory coding process to optimize the coding process and improve the efficiency of researchers on data processing and coding.

Description

Machine intelligent auxiliary root-pricking theoretical coding optimization method

Technical Field

The invention belongs to the technical field of qualitative research, and particularly relates to a root-tying theoretical coding optimization method.

Background

In qualitative research, the root theory is a widely adopted qualitative research method. The root theory is a particular methodology proposed by glazer and schtelus in 1967 to establish theory from data. Researchers can further supplement related materials from materials such as biographies, diaries, recordings, manuscripts and reports, or supplement interviews and field observation records, and further deeply analyze the essence of a certain phenomenon or problem on the basis of the materials.

The method for obtaining information materials by supplementing interview is a common method for researchers to develop and research social phenomena at the present stage. The method emphasizes that, starting from no theoretical hypothesis and starting from actual observation, researchers recruit respondents who accord with the characteristics of research phenomena and have related experiences. In the communication with the information materials, the information materials of one hand are obtained, deep reasons behind the mining phenomenon are analyzed in a deep interview mode, an experience mode is concluded, and then the theory of a certain height is developed.

Interview mode, the collection of raw material, necessarily involves interviews to the audience, which in turn generates large amounts of interview data. Researchers need to organize these large amounts of interview data to form a coding framework. The arrangement work usually consumes a great deal of effort of researchers, and in the actual coding process, a certain amount of repetitive work exists, a certain rule can be followed, and part of work can be replaced by a machine.

Disclosure of Invention

In order to better assist qualitative researchers to carry out the work of organizing and analyzing interview data, the invention designs a machine-assisted rooting theory coding optimization method.

Typically, in most scenarios, the collection of raw material will involve interviews to the audience, which in turn will generate large amounts of interview data. Researchers need to organize into coding frameworks in these large volumes of interview data. The sorting work usually consumes a great deal of effort of researchers, the actual coding process is carried out according to certain logic steps, certain rules can be followed, and the machine can replace part of sorting and classifying work. Therefore, the invention provides an optimization method for the encoding process. The encoding process is shown in fig. 1. The specific steps of the method of the present invention are described below.

(1) Data pre-processing

After obtaining the interview recording data, researchers can use transcription software or a platform to transcribe the recording data and obtain corresponding text materials in a manual carding mode.

Subsequently, the interview record is cut into sentence blocks through a sentence segmentation tool; and the sentence segmentation result is properly adjusted in a manual checking and checking mode to obtain a corpus which is used as an original material of the code.

(2) Artificial precoding

And (3) carrying out manual pre-coding on the corpus set obtained in the step (1) to form a preliminary coding scheme. In the pre-coding algorithm, the selected original material is coded in a concept level and a theme level in a mode of cyclic coding and random data selection, and a coding frame is continuously adjusted until preliminary information saturation is achieved or the data of the current data set is completely coded; in addition, the algorithm also supports continuous coding of new data on the basis of the original coding, and has higher flexibility. Thus, when information saturation is not reached, or when the user considers that encoding is not complete, new data can continue to be encoded. The precoding algorithm is shown in appendix 1, and the flow is as follows:

each encoding process can be continued on the result set of the last encoding or can be performed on the empty encoding result. Each time the encoding process is in a new data set, uncoded data PD is randomly selected (algorithm lines 4-11). Generating a concept CN corresponding to the data for the data in a manual coding mode (line 12 of the algorithm); then, in the current encoding result CT, searching whether corresponding concepts exist in the subject set TS one by one (13 th-22 th lines of the algorithm); if the concept already exists, the concept is added to the corresponding topic, the corresponding set of concepts (lines 23-29 of the algorithm).

The process of loop coding continues until information saturation is reached or the current new data set has been completely coded (row 3 of the algorithm). The determination of information saturation depends on the comparison between the ISV _ cnt value and the ISV value. The meaning of the ISV value is that after ISV strip data are continuously coded for a plurality of times defined by a user, the total concept number in the coding result is still unchanged, and then the information saturation is considered to be achieved. The ISV _ cnt value is the number of times that the total number of concepts in the continuous encoding results does not change in the recording encoding process. As long as a reasonable ISV _ cnt value is set, the judgment accuracy of information saturation can be basically guaranteed. If new encoded data still needs to be supplemented, the encoding can be supplemented by only operating the algorithm 1 again with the previous encoding result and the new data.

Judging whether the encoding of the data set is finished or not, namely judging according to the sizes of the selected data set and the full data set, and if the sizes of the selected data set and the full data set are equal, representing that the encoding of the data set is finished; if the size of the selected data set is less than the full data set size, then the representation is not yet encoded.

(3) Coding feature extraction

The extraction of the coding features can be performed by adopting an applicable feature extraction method according to the corpus, including but not limited to methods such as TF-IDF, &lTtTtranslation = L "&gTtL &/T &gTtDA, neural network, and the like.

The TF-IDF method is a common statistical method for evaluating the importance degree of a word or a word for a certain file in a specific corpus set. When the TF-IDF value is large, it is generally considered that the word can better characterize the importance of the document.

Wherein, TF represents Term Frequency. As shown in formula 1, the number of times word _ cnt the word appears in the corpus may be divided by the total number of words total _ cnt in the corpus to calculate. (for continuous corpus of Chinese, etc., the sentence can be cut by means of word segmentation tool to obtain corresponding word)

The IDF is an Inverse document frequency, which is used to measure the general importance of a word. The calculation method is as shown in formula 2, the total file number total _ file may be divided by the file number file _ cnt including the term, and the obtained quotient is taken as a logarithm with the base 10, so as to obtain the IDF value.

Finally, the TF is multiplied by the IDF to obtain the TF-IDF, as shown in equation 3:

TF-IDF ═ TF × IDF (formula 3)

For the application scenario herein. The extraction of the characteristics can be divided into two levels. The method comprises the steps of corpus feature extraction based on a theme level and corpus feature coding based on a concept level.

And extracting features based on the theme hierarchy, wherein all the linguistic data under the theme are used as one material, and the linguistic data under all the themes are used as a corpus set. And calculating the words with the highest TF-IDF value of the specified num _ topic number as the characteristics of the corpus. The concept level feature extraction is to use all corpora under a certain concept as a material and all corpora under the subject to which the concept belongs as a corpus set. And extracting num _ concept words with the highest number of TF-IDF as the characteristics of the corpus.

On the basis of automatic extraction, the feature words of corresponding topics and concepts can be further manually adjusted and supplemented, so that the classification accuracy is higher.

(4) Automatic coding

On the basis of the feature extraction in the step 3, the new corpus can be subjected to coding classification, and the corpus in the coding frame is supplemented. Here, the features extracted in step 3 are continued, and the new corpus is automatically encoded and classified by the TF-IDF method.

For a Chinese corpus, a feature extraction method of TF-IDF is adopted, and word segmentation needs to be carried out on a text material. And after removing the common words, taking the residual words as the characteristic words of the corpus. Then, the matching degree of the text with the corresponding concept and topic classification is calculated through the words, and the text is classified under the topic classification and the coding classification with the highest matching degree. The similarity between a new corpus t and a corpus s is calculated as shown in formula 4.

Wherein m and n are the number of the feature words of the new corpus t and the corpus s respectively. score (t)_i,s_j) The similarity score between the ith vocabulary in the corpus t and the jth vocabulary in the corpus s is represented by the following formula 5

In score (x, y) similarity gain calculation for two words, dis (x, y) represents the spatial distance of vocabulary x and vocabulary y in the word vector dataset. And threshold represents the maximum spatial distance at which two words are evaluated to still belong to a synonym.

After the similarity of all subject corpus of the new corpus is calculated, the new corpus can be distributed under the subject with the highest similarity. Then, the similarity between the corpus and all concept corpus sets under the theme is calculated, and the theme is divided into corresponding concept corpus sets. In addition, a corresponding threshold value can be set, and when the similarity matching is low, the judgment and the adjustment are needed manually. The threshold needs to be set specifically according to the specific corpus and word vector data set. Usually, several pairs of unmatched vocabulary pairs can be taken, the average similarity of the unmatched vocabulary pairs is calculated, and a larger numerical value, such as 0.2, is added to the value as the threshold value for judgment according to the actual situation.

(5) Feature set expansion

After the new data coding is finished every time, the classification item with low matching degree can be artificially checked and adjusted. If a new concept or new theme occurs outside the existing coding framework, the coding framework is adjusted. Subsequently, the feature set of the new corpus is extracted again in the manner of step 3, and is retained in the corpus. And continuously repeating the encoding process until all data are encoded.

The invention aims at the problem that a large amount of energy is needed to carry out corpus coding based on the coding stage in the root-tying theory. Based on the classic encoding process and operation method in the root theory research, the encoding process and the encoding method which are integrated with machine intelligence are designed. By the aid of the machine intelligence fused classified coding method, coding efficiency in qualitative research by root theory can be effectively improved.

In the traditional qualitative research coding method, researchers usually need to go through the following stages for coding analysis:

(1) transcribing the recording material into character contents;

(2) extracting related sentence content, and marking corresponding concepts and topics for the sentences;

(3) on the basis of sufficient sentence content, a preliminary coding framework is formed through discussion;

(4) distributing the sentences in the subsequent transcription draft to a coding frame, and dynamically adjusting the coding frame;

(5) and carrying out related research analysis according to the coding frame and the related linguistic data.

After the coding optimization method provided by the invention is adopted, machine intelligence can be used for assistance in multiple stages, so that complicated manual operation is saved, and the method is mainly divided into the following aspects:

(1) the transcription of the recording material can realize direct transcription by means of a recording character-to-character tool;

(2) in the stage of extracting related statement contents, the original corpus can be segmented by means of a sentence segmentation tool, and then related predictions are selected by researchers;

(3) after a preliminary coding frame is formed, the residual corpora can be directly and automatically coded and classified, and a researcher only needs to pay attention to the corpora with low matching degree and make certain adjustment.

By adopting the qualitative research method provided by the invention, machine intelligence can be utilized in multiple stages, the coding process can be obviously assisted and optimized, and the coding efficiency of researchers is improved.

Drawings

FIG. 1 is a core encoding flow chart of the method of the present invention.

Detailed Description

The pseudo code for realizing the intelligent coding classification of the fusion machine is shown in appendix 1.

The inventive content herein is a machine-assisted encoding classification flow method. The example introduction is performed in the introduction of the coding feature extraction and automatic coding links by taking the TF-IDF method as the classification standard, and the brief introduction is described below.

(1) In the data preprocessing link, a voice-to-text tool can be used for transcribing the voice material to obtain a text material; meanwhile, the text can be processed by means of the clauses to obtain a corpus.

(2) In the manual pre-coding link, a coding frame can be determined in a mode of independent coding and negotiation by a plurality of researchers. For example, 3 parts of the audio recording material are selected as the pre-encoded corpus.

(3) In the feature extraction step, feature extraction can be performed on the coded data, and features of each topic and each concept are extracted for subsequent automatic coding.

(4) In the automatic coding link, the method adopted here takes TF-IDF as an example, can calculate the similarity of two linguistic data, and distributes the similarity to the theme and the concept with the highest similarity for automatic coding. For the corpus with low similarity, the corpus can be recoded, and if the existing coding frame does not contain a new coding result, the coding frame can be adjusted. For the remaining 27 corpora, the encoding can be performed by automatic encoding.

(5) On the final encoding scheme, researchers can develop further research according to the corresponding encoding result.

The coding optimization method of the invention fully utilizes machine intelligent technology to assist in developing the coding process of qualitative research in each link of qualitative research coding.

The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and scope of the present invention is also included in the present invention.

Appendix 1

Claims

1. A machine-intelligent-assisted root-tying theoretical coding optimization method is characterized by comprising the following specific steps:

(1) data pre-processing

After interview recording data are obtained, the recording data are transcribed by transcription software or a platform, and corresponding text materials are obtained in a manual carding mode;

then, cutting the interview record into sentence blocks through a sentence segmentation tool; the sentence segmentation result is properly adjusted in a manual checking and checking mode to obtain a corpus which is used as an original material of the code;

(2) artificial precoding

Performing manual pre-coding on the corpus obtained in the step (1) to form a preliminary coding scheme; in the pre-coding algorithm, the selected original material is coded in a concept level and a theme level in a mode of cyclic coding and random data selection, and a coding frame is continuously adjusted until preliminary information saturation is achieved or the data of the current data set is completely coded; in addition, new data can be continuously coded on the basis of the original coding, and the method has higher flexibility; therefore, when the information saturation is not reached or when the user considers that the encoding is not completed, new data can be continuously encoded;

(3) coding feature extraction

On the basis of a pre-coding scheme, extracting coding characteristics to realize automatic classification coding of subsequent data; extracting coding characteristics by adopting a TF-IDF method; wherein, TF represents Term Frequency; dividing the number word _ cnt of the word appearing in the corpus by the total word number _ cnt in the corpus to calculate, as shown in formula 1:

equation 1

The IDF refers to Inverse document frequency, and is used for measuring the universal importance of a word; dividing the total file number total _ file by the file number file _ cnt containing the term, and then taking the logarithm with the base 10 as the obtained quotient to obtain the value of the IDF, wherein the calculation formula is shown as a formula 2:

equation 2

Finally, multiplying the TF and the IDF to obtain the TF-IDF value, as shown in formula 3:

TF-IDF=TF*IDFequation 3

(4) Automatic coding

On the basis of the feature extraction in the step 3, carrying out coding classification on the new corpus and supplementing the corpus in the coding frame; here, the feature method extracted in step 3 is continued, and the new corpus is automatically encoded and classified by the TF-IDF method;

for a Chinese corpus, firstly, word segmentation is carried out on a text material; after removing the common words, taking the residual words as the characteristic words of the corpus; then, calculating the matching degree of the text with the corresponding concept and topic classification through the words, and classifying the text into the topic classification and the coding classification with the highest matching degree;

specifically, the similarity between a new corpus t and a corpus s is calculated as shown in formula 4:

equation 4

Here, m and n are the number of the feature words of the new corpus t and the corpus s respectively; score (t)_i,s_j) The score of the similarity between the ith vocabulary in the corpus t and the jth vocabulary in the corpus s is represented by the following specific calculation method shown in formula 5:

equation 5

Where dis (x, y) represents the spatial distance of vocabulary x and vocabulary y in the word vector dataset, and threshold represents: evaluating the maximum spatial distance of two words still belonging to a synonym;

after calculating the similarity of all subject corpus sets of the new corpus, distributing the new corpus to a subject with the highest similarity; then, calculating the similarity between the corpus and all concept corpus sets under the theme, and dividing the theme into corresponding concept corpus sets;

(5) feature set expansion

After the new data coding is finished every time, manually checking the classification item with low matching degree and adjusting; if a new concept or a new theme outside the existing coding frame occurs, the coding frame is adjusted; then, extracting a feature set of the new corpus in the mode in the step (3) again, and reserving the feature set in the corpus; and continuously repeating the encoding process until all data are encoded.

2. The machine-intelligent-assisted rooting theory coding optimization method according to claim 1, wherein the precoding algorithm in the step (2) is as follows:

each coding process is carried out continuously on a result set of the last coding or on an empty coding result; in each coding process, uncoded data PD is randomly selected in a new data set; generating a corresponding concept CN for the data in a manual coding mode; then, in the current encoding result CT, searching whether corresponding concepts exist in the subject set TS one by one; if the concept exists, adding the concept into the corresponding theme and the corresponding concept set;

the process of circular coding is continued until information saturation is achieved, or the current new data set is completely coded; wherein, the judgment of information saturation depends on the comparison between the ISV _ cnt value and the ISV value; the meaning of the ISV value is that after ISV strip data are continuously coded for many times defined by a user, the total concept number in a coding result is still unchanged, and information saturation is considered to be achieved; the ISV _ cnt value is the number of times that the total concept number in the continuous coding results does not change in the coding process;

if the new encoded data still needs to be supplemented, the encoding result of the previous encoding and the new data need to be operated again to supplement the encoding.

3. The machine-intelligent-assisted rooting theory coding optimization method according to claim 1, wherein in the step (3), for an application scenario of a text, the feature extraction is divided into two levels, namely corpus feature extraction based on a theme level and corpus feature coding based on a concept level;

the method comprises the steps of extracting features based on a theme hierarchy, taking all linguistic data under the theme as a material, taking the linguistic data under the theme as a linguistic data set, and calculating the words with the highest TF-IDF value in the number of specified num _ topic as the features of the linguistic data set; extracting the characteristic of concept level, namely taking all linguistic data under the concept as a material, taking all linguistic data under the theme to which the concept belongs as a linguistic data set, extracting num _ concept number of words with the highest TF-IDF number from the linguistic data set as the characteristic of the linguistic data set, wherein a coding frame can be divided into two levels, the first level is the theme, and a plurality of themes are arranged; the second level is concepts, and a plurality of concepts are arranged under each level topic.

4. The machine-intelligent-assisted rooting theory coding optimization method according to claim 1, wherein in the step (4), corresponding threshold values are set, and when the similarity matching is low, judgment and adjustment are required manually.