What is data coding in Research Process


Content Outline

  1. Introduction
  2. Data coding
  3. Dealing with missing data
  4. Analysis of missing data


When a researcher has completed collecting information or data, this information is ready to be processed and analyzed. Quantitative data is information that is measurable and focuses on numerical values, unlike qualitative data which is more descriptive. During the data processing step, the collected data is transformed into a form that is appropriate to manipulate and analyze.

The process in which raw data is transformed into a standardized form suitable for machine processing and analysis is called coding. In other words, coding is the act of assigning numerical values to a set of data in order to make the analysis simpler. Coding can be used to quantify both manifest content, i.e., the tangible or concrete surface content of the data, and latent content, i.e., the underlying meaning behind this information. The difference between manifest content and latent content is very important in survey research.

It is advisable to conduct a pilot or pretest of the data collection instrument, as this helps uncover potential problems with the study so that the tool can be changed accordingly. It also gives the researcher an idea of how the data will look. On this basis, the researcher can work out the layout of the codebook, keeping in mind the responses collected for each variable and providing enough variables to capture all the richness, complexity, and variety of the data that has been collected. Depending on the shape in which the data arrives, the researcher will have to decide whether to code each piece of information with one, two, or multiple variables.

Data Coding

Though preparation of a codebook begins before actual data collection, once the instrument has been designed and pretested, data coding as a step in the research process takes place after data collection is complete, at the same time as data entry. It is important to keep the following points in mind when coding data:
  • Identification variables
    Unique identification of the respondent, questionnaire, or response sheet is extremely important, as it helps the researcher verify and check data. The identification variable is a unique number corresponding to each respondent, and it should be placed in a dedicated field at the beginning of each record. For example, 001, 002, etc. may be used as values of the identification variable.
  • Code categories
    Code categories should be mutually exclusive, exhaustive, and precisely defined. Each interview response should fit into one and only one category. Ambiguity will cause difficulties both in coding and in interpreting the data. For example, when recording literacy levels of youth in a slum community, the coding should include not just those who have gone through the formal system of education, but also those who have participated in non-formal education programs and are therefore literate.
  • Preserving original information
    Once coded, data is retained in that form and becomes final; it is therefore important to preserve as much detail as possible by recording the original data rather than collapsing or bracketing the information. With original or detailed data, the research analyst can explore meaningful relationships between variables beyond those originally selected, instead of being restricted by decisions made during data entry or coding. For example, when recording the occupations of women in a slum community, each kind of home-based small enterprise (making snacks, knick-knacks, etc.) could be given a separate code.
  • Closed-ended questions
    Responses to survey questions that are precoded in the questionnaire should retain this coding scheme in the machine-readable data to avoid errors and confusion. For example, if the women respondents in the study mentioned above are already grouped by age and marital status in the questionnaire, this precoding should be retained.
  • Open-ended questions
    For open-ended items, investigators can either use a predetermined coding scheme or review the initial survey responses to construct a coding scheme based on major categories that emerge. Any coding scheme and its derivation should be reported in study documentation. Increasingly, investigators submit the full verbatim text of responses to open-ended questions to archives so that users can code these responses themselves. However, such responses may contain sensitive information and may involve the risk of identification; they must therefore be reviewed prior to disclosure.
  • Check-coding
    Check-coding provides an important means of quality control in the coding process. Some cases are coded a second time by an independent coder in order to verify the codes assigned and to resolve any discrepancies or ambiguities.
  • Series of responses
    If a series of responses requires more than one field, organizing the responses into meaningful major classifications becomes helpful. Responses within each major category are assigned the same first digit. Secondary digits can distinguish specific responses within the major categories. Such a coding scheme permits analysis of the data using broad groupings or more detailed categories. 
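
The "series of responses" scheme above can be sketched in a few lines of Python. This is only an illustration: the two-digit occupation codes, the category labels, and the helper name major_category are hypothetical and not taken from any particular codebook. The first digit identifies the major category and the second digit the specific response, so the same data can be tallied at either level.

```python
from collections import Counter

# Hypothetical major categories (first digit of each code).
MAJOR_CATEGORIES = {
    1: "Home-based enterprise",
    2: "Wage employment",
}

# Hypothetical detailed codes: the first digit repeats the major category,
# the second digit distinguishes the specific response.
OCCUPATION_CODES = {
    11: "Making snacks",
    12: "Making knick-knacks",
    21: "Domestic work",
    22: "Construction labor",
}


def major_category(code: int) -> int:
    """Return the first digit of a two-digit code (its major category)."""
    return code // 10


# Coded responses from six hypothetical respondents.
responses = [11, 12, 21, 11, 22, 11]

# Detailed tally: one count per specific response.
detailed = Counter(responses)

# Broad tally: the same responses grouped by major category.
broad = Counter(major_category(code) for code in responses)

for category, count in sorted(broad.items()):
    print(MAJOR_CATEGORIES[category], count)
```

Because the detailed codes are kept in the data, the broad grouping can always be derived later, whereas the reverse is not possible.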

Dealing With Missing Data

There are various situations wherein the data may be missing and each of these would need to be coded differently. Some of these situations are listed below:
  • Refusal to answer or No response
    In such a scenario the respondent explicitly refuses to answer a question or does not answer it when he or she should have. This may be with regard to questions that may be deemed personal and sensitive or too private by the respondent or even in a case where there was lack of clarity with regard to a particular question. 
  • Don’t know responses
    In this case the respondent was unable to answer a question, either because he or she had no opinion or because the required information was not available (e.g., a respondent could not provide family income for the previous year).
  • Processing error
    In this case, no answer is recorded for the question even though the respondent provided one. This indicates an error on the part of the interviewer, incorrect coding, or a similar problem in handling the data.
  • Not applicable
    The subject was never asked the question for a particular reason. This may be the result of skip patterns following filter questions; for example, subjects who are not married are not asked questions pertaining to their children. Other examples of inapplicability are sets of items asked only of random subsamples and those asked of one member of a household but not another.
  • No match
    This situation arises when data are drawn from different sources (for example, a survey questionnaire and an administrative database), and information from one source cannot be located.
  • No data available 
    The question should have been asked of the respondent, but for a reason other than those listed above, no answer was given or recorded. This may be due to an error on the part of the interviewer.

Analysis of Missing Data

In order to assign and analyze missing data effectively, it must first be identified accurately, and then recorded and interpreted correctly. Missing data codes should match the content of the field: if the field is numeric or alphanumeric, the codes should be of the same type. Most researchers use codes for missing data that are above the maximum valid value for the variable (e.g., 97, 98, 99). This occasionally presents problems, most typically when the valid values are single-digit values but two digits are required to accommodate all necessary missing data codes. Missing data codes should be standardized so that the same code is used for each type of missing data for all variables in a data file, or across the entire collection if the study consists of multiple data files.
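
A minimal Python sketch of standardized missing data codes follows. The specific assignments (97, 98, 99), the variable household_size, and the helper valid_values are assumptions chosen for illustration; the point is only that codes placed above the maximum valid value must be excluded before any statistic is computed.

```python
from collections import Counter
from statistics import mean

# Hypothetical standardized missing data codes, chosen above the maximum
# valid value and reused for every variable in the data file.
MISSING_CODES = {
    97: "Refused / no response",
    98: "Don't know",
    99: "Not applicable",
}


def valid_values(values):
    """Return only the values that are not missing data codes."""
    return [v for v in values if v not in MISSING_CODES]


# Hypothetical coded variable: household size for seven respondents.
household_size = [4, 6, 98, 3, 97, 5, 99]

valid = valid_values(household_size)

# Treating the missing codes as real numbers badly inflates the mean.
naive_mean = mean(household_size)
clean_mean = mean(valid)

# Count how often each kind of missing data occurs.
missing_counts = Counter(
    MISSING_CODES[v] for v in household_size if v in MISSING_CODES
)

print(clean_mean, naive_mean, dict(missing_counts))
```

Keeping a separate code for each type of missing data, rather than a single catch-all value, is what allows the tally by reason in the last step.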

In general, blanks should not be used as missing data codes. Blanks are acceptable when a case is missing a large number of variables, or when an entire sequence of variables is missing due to inapplicability, such as data on non-existent children. In such instances, an indicator variable should allow analysts to determine unambiguously when cases should have blanks in particular areas of the data record.

‘Not Applicable’ codes should be distinct from other missing data codes. Data should clearly show for every item exactly who was or was not asked the question. At the data-cleaning stage, all ‘filter items’ should be checked against the items that follow them to make sure that the coded answers do not contradict one another, and that unanswered items have the correct missing data codes.
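
The filter-item check described above might look like the following Python sketch. The field names (married, num_children), the coding of the filter item (1 = yes, 2 = no), the value 99 for ‘Not Applicable’, and the function check_filter are all hypothetical; the sketch reuses the earlier example in which respondents who are not married are not asked about children.

```python
NOT_APPLICABLE = 99  # hypothetical 'Not Applicable' code

# Hypothetical records. "married" is the filter item (1 = yes, 2 = no);
# "num_children" is the follow-up item asked only of married respondents.
records = [
    {"id": "001", "married": 1, "num_children": 2},
    {"id": "002", "married": 2, "num_children": NOT_APPLICABLE},
    {"id": "003", "married": 2, "num_children": 1},
    {"id": "004", "married": 1, "num_children": NOT_APPLICABLE},
]


def check_filter(record):
    """Return a description of the inconsistency, or None if consistent."""
    asked = record["married"] == 1
    coded_na = record["num_children"] == NOT_APPLICABLE
    if not asked and not coded_na:
        return "follow-up answered although the filter item skips it"
    if asked and coded_na:
        return "follow-up coded 'Not Applicable' although it should be asked"
    return None


# Collect every record whose filter item and follow-up contradict each other.
problems = {}
for record in records:
    issue = check_filter(record)
    if issue is not None:
        problems[record["id"]] = issue

for respondent_id, issue in problems.items():
    print(respondent_id, issue)
```

Flagged records are listed by their identification variable, so they can be traced back to the original questionnaire for verification.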

