Score Variation in Multiple-Choice Tests of Grammar: On the Effect of Gender and Stem Type

Document Type: Research Paper


English Department, Alzahra University, Tehran, Iran


This paper examines the effect of gender and stem type on Iranian test takers’ performance on the grammar section of a nationwide language proficiency test. To this end, the scores of 2931 examinees (1107 female and 1824 male) who sat the Tarbiat Modares English proficiency test were obtained. The examinees’ scores on three types of multiple-choice (MC) grammar items with different kinds of stems (i.e., blank filling, error recognition, and cloze) were compared to determine whether stem type affects performance. Grammar scores of males and females were also compared to see whether gender affects examinees’ performance on the grammar test as a whole and on its three item types in particular. The results indicated that test takers performed better on cloze items than on the other two types. Females also outperformed males both on the whole test and on items with blank-filling and cloze stems. Given the particularity of the context and the small effect sizes found, the study calls for further research on this topic.
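The abstract's conclusion rests on comparing group means and judging the resulting effect sizes as small. As a minimal sketch of that kind of comparison, the snippet below computes Cohen's d (the conventional standardized mean difference) for two independent groups; the score lists are hypothetical placeholders, since the study's actual data are not reproduced here, and the paper's own analysis procedure may differ.

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = (((na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2)
                 / (na + nb - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical grammar-section scores for illustration only.
females = [14, 15, 13, 16, 15, 14]
males = [13, 14, 12, 15, 13, 13]
d = cohens_d(females, males)
```

Under the usual convention, |d| near 0.2 is read as a small effect, near 0.5 as medium, and near 0.8 as large, which is the sense in which the abstract's "small effect sizes" qualifies the observed group differences.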

