Item Deletions Based on Difficulty Values and Discriminating Values

ABSTRACT


INTRODUCTION
Types of tools used in educational research include Multiple Choice Question (MCQ) tests, scored as "1" for a correct answer to an item and "0" otherwise, and Likert-type scales used to monitor student learning for feedback and to assess important outcomes at the end of instruction. Each such tool uses a summative score obtained as the sum of the item scores.
It may be necessary to delete some of the items for various reasons. Improving a test requires deleting ineffective items or items with only a few correct answers, i.e. extremely difficult items. The existing test may be lengthy, or deletion of items may increase the reliability of the test. Similarly, deleting items from a questionnaire is important to reduce response error, increase respondent engagement, remove multicollinear items and improve test characteristics.
The traditional approach is to consider item-analysis results and delete or modify items based on item difficulty value and item discriminating value. The difficulty value of an item is defined as the proportion of correct responses to the item, and the discriminating value of an item indicates the ability of the item to distinguish examinees with a high ability level from those with a low ability level (Ferrando, 2012). The discriminating value of a binary item is traditionally computed from the top 27% and bottom 27% of the data, which amounts to rejection of 46% of the data and is hence not desirable. For the i-th item, the relationship between the difficulty value based on the entire data and the discriminating value based on 54% of the data is difficult to interpret and may give rise to contrasting results. For example, Rao et al. (2016) found a correlation of 0.56, which contradicts the usual idea of a poor discriminating value for a very easy item (high difficulty value) answered correctly by most of the subjects taking the test. Sim and Rasiah (2006) found positive correlations for difficulty values ranging between 0.80 and 1.00, negative correlations otherwise, and a dome-shaped relationship when all the items were considered. Further study to investigate the correlation between difficulty and discriminating values was proposed (Chauhan et al., 2013).

Literature survey:
Deletion of items is usually done by following one or more of the approaches given below:
1. Low value of the discriminating index, computed as the difference in proportion of correct answers between the top one-third of respondents and the bottom one-third of respondents.
2. Low correlations between an item and the total score.
3. Items whose deletion improves Cronbach's alpha, i.e. "alpha if item deleted".
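The first two approaches above can be sketched as follows. This is an illustrative implementation, not the paper's own code; the function and variable names are our own, and the top/bottom one-third split follows Approach 1 as stated (the literature also uses 27% groups).

```python
import numpy as np

def item_statistics(scores):
    """Item-analysis statistics for a 0/1 score matrix.
    Rows are persons, columns are items. Illustrative sketch only."""
    scores = np.asarray(scores, dtype=float)
    n, m = scores.shape
    total = scores.sum(axis=1)
    order = np.argsort(total)           # persons sorted by total score
    g = n // 3                          # size of top/bottom thirds (Approach 1)
    low, high = order[:g], order[-g:]
    difficulty = scores.mean(axis=0)    # proportion answering each item correctly
    discrimination = scores[high].mean(axis=0) - scores[low].mean(axis=0)
    # Approach 2: correlation of each item with the total of the remaining items
    item_total_r = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1] for j in range(m)
    ])
    return difficulty, discrimination, item_total_r
```

An item would then be flagged for deletion when its discrimination index or item-total correlation falls below the chosen cut-off.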

Problem areas and issues:
Approach 1: This approach suffers from the disadvantages of not considering the entire data and giving rise to contrasting results.
Approach 2: Researchers differ in deciding the cut-off value of this correlation. While Avanoor and Mahendran (2018) suggested deleting an item if the correlation is less than 0.3, Kehoe (1995) and Popham (2011) favoured deletion of an item if the discriminating index is below 0.15 and the item-total correlation is less than 0.19.
Approach 3: Find "alpha if the item is deleted" and delete items accordingly, so that the test excluding the deleted items has a higher value of alpha. In other words, delete the j-th item if alpha of the test including the j-th item is less than alpha of the test without the j-th item. If deletion of an item increases alpha for the test, the item needs to be deleted (Raykov, 2008). However, such a modified test may lower criterion validity (Raykov, 2007). In addition, a set of items showing a high value of alpha may not always be homogeneous or unidimensional (Green et al., 1977).
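Approach 3 can be sketched directly from the standard formula for Cronbach's alpha; the code below is an illustrative implementation, not the paper's, and the names are ours.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for a score matrix (rows: persons, cols: items)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (m / (m - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_deleted(X):
    """Alpha of the test with each item removed in turn (Approach 3)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    return np.array([cronbach_alpha(np.delete(X, j, axis=1)) for j in range(m)])
```

Item j is then a candidate for deletion when `alpha_if_deleted(X)[j] > cronbach_alpha(X)`, subject to the validity caveats noted above.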
Test reliability does not indicate the degree of discrimination offered by an instrument (Hankins, 2007). If items with negative discriminating values are included, measurement disturbance by the test may occur. Thus, difficulty and discriminating values are closely related to the quality of the score as a measure of the trait (McDonald, 1999). The range of the item discrimination index is from -1.0 to 1.0 (Shakil, 2008; Denga, 1987), and the index is not defined if all subjects taking the test obtained the same score on the item. Erhart et al. (2010) investigated item deletion to maximize alpha and item fit of the partial credit model (Masters, 1982) and opined that item-deletion approaches need to consider additional analyses, since the quality of a test is more than test reliability.
Major issues with test reliability in terms of Cronbach's alpha, and with validity as the correlation between test scores and scores of a chosen criterion scale, are as follows:
1) Alpha as a measure of internal consistency is concerned with the homogeneity of the items within a test and does not work well for a multi-dimensional test.
2) Alpha assumes uncorrelated errors and tau-equivalent items, which implies that all the factor loadings are the same (Ogasawara, 2006). However, equality of factor loadings is rather rare for tests used in educational research (Pronk et al., 2022).
3) If items are not essentially tau-equivalent and the test measures different constructs, i.e. a multidimensional test, alpha may get distorted. However, many scales report alpha despite several factors being found by PCA or FA.
4) Huang et al. (2021) found that the construct with the highest eigenvalue had the maximum alpha.
Using results of PCA, Ten Berge and Hofstee (1999) argued that an item with a low (or negative) loading does not belong to the same family as the other items or does not sample the same domains measured by the remaining items. Thus, a low item-total correlation could imply either that the i-th item is noisy, lacking discriminating power (Ferrando, 2012), or that the item is redundant and does not share the construct being measured by the other items of the test. A cut-off value for discarding items may be relevant for noisy items but not for redundant items. The presence of both noisy and redundant items creates problems for EFA in item analysis.
In addition, noisy items with poor loadings on any factor of a multi-dimensional test make it impossible to tell whether those items measure different factors or are pure noise. Moreover, deletion of an item will change the mean and SD of the test/scale and also the correlation of each retained item with the total test/scale score.

DIFFICULTY AND DISCRIMINATING VALUES

MCQ tests:
Suppose the difficulty value of an item is p = k/n, where k denotes the number of persons (out of the n taking the test) answering the item correctly.

Deletion of items
If k = 0 for an item, the item is extremely difficult and every subject fails the item; the discriminating value is then not defined for the item. Clearly, such items with zero mean (and hence infinite CV) are to be rejected forthwith. If two items have the same difficulty value, the item with the higher SD is preferred for retention.
Equation (9) depicts a non-linear relationship between item difficulty value and item discriminating value: the lower the difficulty value (i.e. the lower k), the higher the CV; similarly, the higher the difficulty value, the lower the CV.
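For a binary item with difficulty value p = k/n, the mean is p and the population SD is sqrt(p(1-p)), so CV = sqrt((1-p)/p), a strictly decreasing function of p. A minimal numerical check (our own sketch, not the paper's equation numbering):

```python
import math

def item_cv(k, n):
    """CV of a binary item answered correctly by k of n persons:
    mean = p = k/n, population SD = sqrt(p(1-p)), so CV = sqrt((1-p)/p).
    Undefined when k = 0 (zero mean)."""
    p = k / n
    return math.sqrt((1 - p) / p)

# CV falls monotonically as the difficulty value p rises (easier item):
cvs = [item_cv(k, 100) for k in range(10, 100, 10)]
assert all(a > b for a, b in zip(cvs, cvs[1:]))
assert abs(item_cv(50, 100) - 1.0) < 1e-12   # p = 0.5 gives CV = 1
```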
Thus, the correlation between difficulty value and CV will be negative. In other words, as k increases, the difficulty-value curve (or its percentage curve) will be positively sloped and the CV curve (or its percentage curve) will be negatively sloped, and the two curves will intersect at the point where the difficulty value equals the CV. The value of k at the intersection can be obtained using equations (6) and (8) and solving the resulting equation, taking the solution to the nearest integer.
Items may be retained by choosing the acceptance region as [mean - SD, mean + SD], where mean and SD are those of the difficulty values or the CVs of the items. Choosing a wider acceptance region such as [mean - 2SD, mean + 2SD] may result in discarding too few items. In addition, considering the skew of the distribution of difficulty values (or CVs), a few more items with high concentration at the tail may be discarded. It may be noted that deletion of one or more items will change the values of the remaining item statistics. Other considerations for item deletion are a low value of the point-biserial correlation and "alpha if item deleted".

However, the choice of acceptance region (or deletion region) may depend on the original number of items in the test, the type of test, whether it measures a single dimension or multiple dimensions, and also the relationship between test discrimination and test reliability (equation 10). Discarding a few easy items (with high values of k) and a few extremely difficult items (with very low values of k) will reduce m, and in turn may increase the SD per item. The effect of item deletions needs to be checked against increase in test reliability and/or factorial validity.
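The acceptance-region rule can be sketched as below. This is an illustrative implementation under the assumption that the region is mean ± width·SD of the item statistics; the function name and the `width` parameter are ours, not the paper's.

```python
import numpy as np

def retain_by_acceptance_region(values, width=1.0):
    """Indices of items whose statistic (difficulty value or CV) lies
    within mean +/- width*SD of the item statistics. Sketch only;
    'width' = 1.0 corresponds to the [mean - SD, mean + SD] region."""
    values = np.asarray(values, dtype=float)
    lo = values.mean() - width * values.std(ddof=1)
    hi = values.mean() + width * values.std(ddof=1)
    return np.flatnonzero((values >= lo) & (values <= hi))
```

With `width=2.0` the region widens to mean ± 2SD, which, as noted above, may discard too few items.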

Likert scales:
The concept of discriminating values of items and tests in terms of the coefficient of variation (CV) can be extended to Likert scales, where the difficulty value is not relevant. The mean of a polytomous item is simply its average score. Chakrabartty (2020) compared seven dissimilarity measures that can be computed from a single administration of a questionnaire, using the proportion for each cell of the item-response-category matrix, and found that CV has the most advantages for finding discriminating values of Likert items and of the Likert questionnaire as a whole. Here, CV = SD/mean, and a lower value of CV is desirable. It is possible to estimate the population CV and to test statistical hypotheses on the equality of CVs.
For a scale with m items, the relationship between Cronbach's alpha and the test CV was derived as equations (13) and (14). Each of (13) and (14) indicates a negative relationship between test reliability and test CV, i.e. the higher the CV, the lower the reliability, and vice versa.
Deletion of items of a Likert-type test may be done by removing items with high values of CV (i.e. high values of SD relative to the mean). The reliability of the scale (or test) containing the retained items is likely to improve because of the negative relationship between reliability and discriminating values. For the same reason, items with low CV may be retained.
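The CV-based deletion rule for Likert items can be sketched as follows; this is our illustration of the rule just described, and the function names are hypothetical.

```python
import numpy as np

def likert_item_cvs(X):
    """CV (SD/mean) of each Likert item; rows: respondents, cols: items."""
    X = np.asarray(X, dtype=float)
    return X.std(axis=0, ddof=1) / X.mean(axis=0)

def drop_high_cv_items(X, n_drop=1):
    """Delete the n_drop items with the highest CV, i.e. the items
    with the poorest discriminating quality under the CV criterion."""
    X = np.asarray(X, dtype=float)
    worst = np.argsort(likert_item_cvs(X))[-n_drop:]
    return np.delete(X, worst, axis=1), sorted(worst.tolist())
```

The reliability of the reduced matrix can then be compared with that of the original to verify the expected improvement.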

Distribution and statistical tests of CV
Scores on an item of an MCQ test can be taken to follow a Binomial distribution with parameters n and p_i, where n denotes the number of individuals taking the test and p_i is the probability of success in a single trial, i.e. the difficulty value of the i-th item. Convolution of the distributions of the item scores gives the distribution of the sum of all item scores, which will also be Binomial. The CV of the i-th item is sqrt((1 - p_i)/p_i), where p_i equals the item mean. CV can be used to compare the discriminating values of two items even if they differ significantly with respect to mean. Similarly, the discriminating values of two tests or scales can be compared on the basis of CV. An unbiased estimate of the population CV for normally distributed data is (1 + 1/(4n))·CV (Sokal and Rohlf, 1995). An asymptotic test for equality of CVs was proposed by Feltz and Miller (1996).

Items with high values of CV may be deleted. In addition, a few more items with high concentration at the tail may be discarded. It may be noted that deletion of one or more items will change the values of the remaining item statistics. The effect of deletion of items needs to be investigated using the derived relationships among the proposed measures, with emphasis on test reliability as per its theoretical definition, which is negatively related to CV in a non-linear fashion. The effect of deletion of items on validity or factorial validity may be investigated by undertaking PCA with the retained items. Methods of estimating the population CV and statistically testing hypotheses on the equality of two or more CVs are also suggested.
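The Feltz and Miller (1996) test can be sketched as below. This is our illustrative implementation of the published D'AD statistic (degrees-of-freedom-weighted pooled CV, chi-square reference distribution on k - 1 degrees of freedom); the function name is ours.

```python
import math

def feltz_miller_stat(samples):
    """Feltz & Miller (1996) asymptotic statistic for testing equality
    of CVs across k groups; compare with the chi-square critical value
    on k - 1 degrees of freedom (e.g. 3.84 for k = 2 at the 5% level)."""
    cvs, dfs = [], []
    for x in samples:
        n = len(x)
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / (n - 1)
        cvs.append(math.sqrt(var) / mean)
        dfs.append(n - 1)
    pooled = sum(d * c for d, c in zip(dfs, cvs)) / sum(dfs)  # df-weighted CV
    return sum(d * (c - pooled) ** 2 for d, c in zip(dfs, cvs)) / (
        pooled ** 2 * (0.5 + pooled ** 2))
```

Two groups with identical scores give a statistic of zero; groups with markedly different CVs give a large statistic, leading to rejection of equality.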

CONCLUSION
The proposed method of item deletion based on difficulty values and discriminating values offers significant benefits and is recommended. However, the approach may be compared empirically with deletion of items by "alpha if the item is deleted" with respect to an optimal range of CV, and the effect of deletion of items on point-biserial correlations, test reliability and factorial validity may be studied.
Suppose an MCQ test with m items has been administered to n subjects. The scores of the subjects can be presented as an n-dimensional vector X, where the i-th component denotes the test score of the i-th subject. Consider another n-dimensional vector I, representing the maximum possible score, where each component is equal to 1, and let the angle between the vectors X and I be θ.

The asymptotic test of Feltz and Miller (1996) follows a chi-square distribution and is widely used. However, the modified signed-likelihood ratio test (SLRT) for equality of CVs for different sample sizes (Krishnamoorthy and Lee, 2013) has more advantages. A software package for testing equality of CVs from multiple groups is given by Marwick and Krishnamoorthy (2019).

The difficulty value of an MCQ test is in line with the usual notion of difficulty value, which actually measures the degree of easiness of a test; its range is from 0 to 1. As the number of correct answers to the items increases, the percentage difficulty curve is positively sloped. Assessment of tests in terms of SD per mean (i.e. CV) has the desired properties.
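The vector formulation above links the angle θ directly to CV: since cos θ equals the mean divided by the root mean square of the scores, tan θ equals the population SD per mean, i.e. the CV. A small numerical check (the example scores are illustrative, not from the paper):

```python
import math

def angle_with_ones(x):
    """Angle theta between a score vector x and the all-ones vector."""
    n = len(x)
    dot = sum(x)                                   # x . (1, 1, ..., 1)
    norm_x = math.sqrt(sum(v * v for v in x))
    return math.acos(dot / (norm_x * math.sqrt(n)))

x = [2, 4, 4, 4, 5, 5, 7, 9]
theta = angle_with_ones(x)
mean = sum(x) / len(x)
sd_pop = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))  # population SD
assert abs(math.tan(theta) - sd_pop / mean) < 1e-12           # tan(theta) = CV
```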
The proposed measures and their relationships were derived, including the relationship with test reliability as per its theoretical definition. All the measures and relationships can be computed from a single administration of the test or scale.

Open Access: https://ejournal.papanda.org/index.php/edukasiana/