Notes on: Gomm, R. (2021) SATs, sets and allegations of bias: The allocation of 11 year old students to mathematics sets in some English schools in 2015. A response to Connolly et al., 2019. British Educational Research Journal 48: 704–729. DOI: 10.1002/berj.3790

Dave Harris

Gomm's question is directed at the idea of 'correct' set placements, and he suggests that the results in Connolly et al. are 'largely an artefact of their evaluation approach'. There may be teacher bias involved, but this approach by Connolly 'could not evidence it' (704). The way in which data are generated and aggregated from diverse clusters needs to be investigated.

Setting, grouping or tracking takes place in most secondary schools, and national curriculum assessments, SATs or KS2 tests, are also mandatory. The idea is to generate performance-indicator data, but the results also inform some decision-making about individual students, including set placement. KS2 maths papers are 'marked externally by disinterested markers', but setting is done by secondary school staff who 'might have known the students for up to 9 weeks and if so, their class, gender or ethnic prejudices might bias setting decisions' (705). Connolly et al. claim to provide evidence of such bias.

Their research was part of a whole suite of research projects, which also included an RCT investigating the effects of allocating first-year secondary school students to sets according to their KS2 results alone and then measuring results two years on, compared with control schools that allocated students to sets in any way they chose. NFER research found 'no differences in academic or self-confidence outcomes' between the two groups of schools, or between high- and low-scoring students, or between FSM-eligible students and others [research cited on 706].

Subsets from this sample were also studied, although it is not clear how they were actually drawn in some of the other specific studies, including Connolly's. In Connolly's sample the size of some ethnic subgroups was 'too small for sensible analysis', which led to consolidation of Asian and all Black groups 'despite these being heterogeneous in set occupancy and allocative treatments'. In the full RCT study, there were no consequences for the students arising from the ways they had been allocated, but these results were not mentioned in Connolly.

Connolly claims to have discovered teacher bias against Black and Asian students and against girls, but not against students on the basis of SES, and these claims have been much publicised. But there are methodological shortcomings:
contradictions between claims and data,
problems with the demography of maths sets and how weakly this is connected to discriminatory behaviour,
the researchers' strategy for detecting bias and the procedures used to identify correct and incorrect allocations to sets, especially the zero-tolerance approach, which does not take account of random error based on test score confidence intervals — if these are factored in, 'for many students their observed SAT score ceases to be an adequate basis for setting decisions without further information which was not available to the researchers' (707),
differences in misallocations between subgroups that are too small, given the amount of missing data, to support conclusions about ethnic discrimination,
there may be a '"within school" ecological fallacy', where the risk of being misallocated according to the researchers varies according to 'student position in the rank order of KS2 scores within a school',
there may be a 'between school' ecological fallacy explaining the disproportionality of misallocations between subgroups, because there is 'the asymmetrical distribution of ethnic and socio-economic subgroups between schools with different set architectures and hence with different risks of receiving a misallocation verdict' (708).

There are two ways in which disproportionate distributions of misallocations 'can be generated without any teacher bias'. This does not exclude the possibility of bias, but it does 'confound attempts to demonstrate such an effect'. There is no outcome data to support the view that misallocation has important consequences for academic futures. The national counterparts of nearly all the groups allegedly disfavoured 'made more progress nationally in the following five years than did the national counterparts of those allegedly favoured by setting decisions' (708).

Contradictions between claims and data are shown in Table 1 (708). For example, Connolly claims that the odds of Black students being misallocated to lower maths sets were 2.4 times higher than for White students, but the actual data show that 9% of known Black students were in bottom sets compared with 13% of known White students and probably 15% of the unknown ones; 82% of Black students were in bottom or middle sets compared with 70% of known White ones. Despite the claim that the odds of Black and Asian students being misallocated to higher sets were 0.48–0.58 times those for White students, the raw data show that 93% of known Asian students were in top or middle sets compared with 87% of students known to be White. Despite the calculation of better odds for female students, girls 'have substantially lower KS2 mathematics scores than boys'. FSM eligibility was relevant because those with FSM were '4.6% less likely to be in top or middle sets and those not eligible are 9.5% more likely to be middle or bottom sets' [so the shift towards odds ratios is highly misleading]. Gomm gets the descriptive statistics from the larger sample, not the subsample, apparently based on the student questionnaire in 2015 and on the National Pupil Database. Another sample also shows that 20% of students change sets, while Connolly assumes that set placements are 'enduring'. There is missing data — 5% of those who might be eligible for FSM, and 38% of students without an ethnic designation.
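The contrast between odds-ratio claims and raw percentages can be made concrete with a small worked example. The rates below are made up for illustration (they are not the counts from Connolly et al. or from Gomm's Table 1); they simply show how an odds ratio of about 2.4 can sit on top of a modest absolute gap, which is one reason the notes treat the shift towards odds ratios as misleading.

```python
# Illustrative only: both rates are invented, chosen so that the odds ratio
# comes out near the 2.4 figure quoted from Connolly et al.

def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1 - p)

rate_group_a = 0.080   # assumed downward-misallocation rate for one subgroup
rate_group_b = 0.035   # assumed rate for the comparison subgroup

odds_ratio = odds(rate_group_a) / odds(rate_group_b)
abs_diff = rate_group_a - rate_group_b

print(f"odds ratio         : {odds_ratio:.1f}")   # about 2.4
print(f"absolute difference: {abs_diff:.1%}")     # about 4.5 percentage points
# The same data can be reported as 'odds 2.4 times higher' or as a gap of
# roughly 4.5 percentage points; quoting descriptive percentages, as Gomm does,
# makes the second framing visible.
```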

There seems to be a close relationship between the rank order for set occupancy and the rank order for test results, using the threefold classification, except for Bangladeshi students; however, four more top-set occupancies would bring them level with White students. There is also an unusual group of Black students other than Black Caribbeans — 'removing these two exceptions brings the coefficients up to "strong" correlation levels between ranking for mean test score and ranking for set occupancy' (709).

Girls have a much lower mean KS2 test score than boys and have 2.5% more bottom-set places and 5% more middle-set places, which 'seems consistent with gender differences being in line with differences in prior achievement'. FSM-eligible students have the lowest mean scores at KS2 and 'correspondingly are very poorly placed in sets', yet their low positioning is 'said by Connolly et al. not to have involved teacher bias'. However, 'all told, the data point to occupancies that follow group means for prior achievement fairly closely: the opposite of what is claimed in Connolly et al.' (710), although there are three caveats:
(1) reducing the sample initially to 9300 'makes it highly likely that some subgroup differences in occupancies (either way) are due to attrition bias, a possibility not investigated by the authors'.
(2) there are 3376 students 'for whom there is no ethnic designation — 36% of all and possibly the majority of all BAME students in the database', which must raise doubts about the pattern of ethnic differences within the sample, let alone national trends.
(3) inequalities within the sets, especially the middle ones, are not addressed.

The claim in Connolly is that some subgroups get more, and some less, than their fair share of allocations compared to others with the same test scores — but they mean the same score in the same school, whereas the same score predicts a different 'correct' placement in different schools (713). It also means 'precisely the same observed score, since these authors ignore confidence intervals'. They propose that the absence of teacher bias would be shown by the absence of any systematic relationship between subgroup membership and misallocations, but this is only a test of their hypothesis if:
(a) the researchers' prescriptions for correct setting are accepted as more valid or in some way better — but they provide no outcome data to justify their prescriptions.
(b) seeking, as a null condition, random relations between misallocations and subgroup memberships would be appropriate if all subgroups had similar profiles of KS2 scores in the first place, but if they have scores already placing them close to set boundaries, they are obviously at greater risk of receiving a misallocation verdict from the researchers — 'the boundary proximity risk'.
(c) each of the 46 schools had similar rates of misallocation and subgroups were evenly distributed between them, but minority ethnic groups are strongly concentrated in a minority of schools, and there is also variation in socio-economic profile and in set architecture. Even if students are placed fairly within each school, they might have been dealt with unfairly by going to different schools — 'the school difference risk'.

Connolly initially identified misallocation using '"fine point" scoring', which had been used in earlier studies but was later abandoned. The originating marks are more meaningful. In 2015, up to a hundred marks on the scale could be derived from the three mandatory KS2 maths papers (the levels 3–5 tests), and students could gain an additional 50 marks by taking an optional level 6 paper. All those classified as [having the highest 'fine points'] must have taken this additional paper, and 'nationally, Chinese and Indian students were much more likely to take and pass this examination and boys slightly more likely than girls'. The researchers then allocate those with the top scores to the top set and compare them with those actually allocated to sets by teachers — differences are called misalignments or misallocations. They also identify borderlines, which the researchers deal with either by 'fair dealing', roughly splitting them into upward and downward allocations, or by acting 'by fiat' (714). They might, however, have used another 'concurrent assessment' — some additional evidence.

This would be more in line with 'the theory on which national curriculum tests are based… wherein a test result comprises a true score plus an error score. The latter is a random factor, likely to vary on a second equivalent testing occasion, although within calculable probability limits'. The margins of tolerance are usually expressed as confidence intervals, and other evidence is considered; Ofqual strongly recommends this procedure. In 2018 the reliability of the scores was 0.96; applied to the researchers' data, this implies a 68% confidence interval of plus or minus four or five test marks. Connolly takes a zero-tolerance approach, however, which enables them to treat one- or two-mark differences as misallocations. Since questions on the test papers carry one or two marks, even one wrong answer can change a student's eligibility for a set; in one of their case schools, 20 of the 32 misallocations involved such discrepancies and none involved more than six marks, and this is a representative school, suggesting an overall 'small magnitude of misallocations'.
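The tolerance argument can be made concrete with a classical test theory calculation. The sketch below is my illustration, not a calculation from the article: the reliability of 0.96 and the rough plus-or-minus four-to-five mark band come from the notes above, while the standard deviation of about 22 raw marks is an assumed value chosen so that the figures line up.

```python
# Illustrative sketch only: reliability (0.96) is cited in the notes; the raw-mark
# standard deviation (~22) is an assumption, not a figure from Gomm or Ofqual.
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd_raw_marks = 22.0    # assumed spread of raw KS2 maths marks
reliability = 0.96     # reliability figure cited in the notes
sem = standard_error_of_measurement(sd_raw_marks, reliability)
print(f"SEM ~= {sem:.1f} marks")               # roughly 4-5 marks

observed = 68          # a hypothetical student's raw mark
lo, hi = observed - sem, observed + sem        # 68% confidence interval (+/- 1 SEM)
print(f"68% CI for a score of {observed}: {lo:.0f} to {hi:.0f}")
# Under zero tolerance, a set boundary at 69 would make this student a
# 'misallocation' if placed in the higher set; once the confidence interval is
# acknowledged, the score is consistent with either side of the boundary.
```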

There might also be information available to teachers but not the researchers that justified different allocations. Connolly has a rather abstract notion of equity — students in the same school in November with the same rank order as each other for maths in May should be placed in sets equally — but there might well be exceptions, such as students who were ill in May but recovered by November, or students intensively coached for the KS2 tests and underachieving later [as teachers suspect]. The body responsible for the tests advises against overconfidence in SAT scores, and research confirms this [listed on 715] — not surprisingly, because the tests were 'primarily designed to assess schools not students'. Most secondary schools test their intakes to 'evade the effects of "teaching to the test"' or use commercial tests [Connolly is sceptical about these], some of which are licensed by the Qualifications and Curriculum Development Agency. These might be expected to rank students quite differently from KS2 results. Whether teachers did use such evidence cannot be evaluated adequately by Connolly. Teachers might also have had other pedagogical intentions, such as creating single-sex sets in STEM subjects or avoiding mono-ethnic sets. They might have had 'alternative notions of equity' ignored by Connolly.

The smaller sample used in the multilevel regression analysis has problems — 'it is almost certainly the case that this smaller sample is statistically unrepresentative of the 9301, particularly for the smaller subgroups, or of the original 12,500' (716). The composition 'is wrongly given on page 885… and subgroup numbers are not given on their tables of results'. Those without an ethnic designation are treated as if they were not White, but 'it is likely that 70% or so of the unknown are actually White'. 'The researchers admit that their data did not meet the requirements for the analysis that they conducted. For these reasons the results of the multilevel analysis are best ignored'.

The claim that nearly 1/3 of students are misallocated applies to differences between individuals, and this is not the same as inequity between subgroups: subgroup differences 'reveal a much smaller scale than those between individuals'. There will always be some mismatch between scores and set capacities, so the real question is whether subgroups have a fair share of correct verdicts and of misallocations. Gomm's own analysis of the data suggests 'an untidy pattern of rather small differences', not unlike the summary by Connolly themselves. Put into absolute numbers, 'there are between two and three fewer downwardly misallocated boys per school than expected as a proportional share, but around seven boys missing per school, plus another 63 gendered students were not accounted for' (719). BAME students received '76 more downward misallocation verdicts than their fair share compared with students known to be White who had 59 less', but there are 106 known BAME students for whom there is no allocation data, and 309 White students, and 3376 of unknown ethnicity. 'These data are simply not adequate for an ethnic equity audit'. There are further anomalies: Black Caribbean students have the highest rate of upward misallocations but also the highest occupancy of bottom sets, and 'teacher bias seems an unlikely explanation for such occurrences, which may better be explained as artefacts of the way Connolly et al. pursued their evaluation, if not also of attrition bias'.

There is also the boundary proximity risk: scores close to a set boundary are more likely to be misclassified, according to Ofqual. Those with high scores have a low risk of misallocation, occupying a 'zone of immunity', indicating, apparently, 'a common research artefact: floor and ceiling effects'. Of the 158 in Connolly, 131 are accounted for by students with such immunity — 69%. Chinese and Indian students are the least likely to be upwardly misallocated and are also the clusters with the highest mean KS2 test scores. There is also the issue of clusters around boundaries: in the sample school, two marks or less around the researchers' borderline are crucial in allocating 47 students. A random allocation, the recommendation made by the researchers, might still look inequitable [if there were simply more boys, say, than girls in the first place]. A random distribution model is correct only if subjects share the same risk, but this is not always the case [I'm not sure I understand this — I think it is because the same SAT scores have different consequences according to the secondary school you go to]. There is a 'risk landscape' providing a range of expected figures, although small numbers make this difficult. The main point is that Connolly's methods themselves 'generate unequal (and spurious) claims of "misallocation"' (721).
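The boundary proximity risk can be illustrated with a toy simulation. This is my construction, not a procedure from Gomm or Connolly et al.: the boundary mark of 70, the error SD of 4.5 marks (roughly one SEM) and the number of trials are all assumptions chosen for illustration.

```python
# Toy simulation of boundary proximity risk: under a zero-tolerance reading of
# scores, students whose 'true' ability sits near a set boundary are flagged as
# 'misallocated' far more often than students in the middle of a band, purely
# because of random test error. All numbers below are assumed for illustration.
import random

BOUNDARY = 70     # hypothetical cut-off between middle and top set
ERROR_SD = 4.5    # roughly one SEM, as in the notes
TRIALS = 10_000

def misallocation_rate(true_score: float) -> float:
    """Share of simulated sittings where the observed score falls on the
    'wrong' side of the boundary relative to the true score."""
    wrong = 0
    for _ in range(TRIALS):
        observed = true_score + random.gauss(0, ERROR_SD)
        if (observed >= BOUNDARY) != (true_score >= BOUNDARY):
            wrong += 1
    return wrong / TRIALS

for true in (55, 65, 69, 71, 75, 85):
    print(f"true score {true}: flagged 'misallocated' in "
          f"{misallocation_rate(true):.0%} of sittings")
# Scores well away from the boundary (55, 85) sit in the 'zone of immunity';
# scores one or two marks either side of 70 are flagged around 40% of the time.
```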

Schools have different rates of misallocation, and this can affect aggregate data. Errors of inference from aggregates to their components are called '"ecological fallacies"', and they are 'a long-standing problem in American studies of ethnic and gender differences in education and recruitment' (722) [lots of references, including Hammersley and Gomm, 2021]. Major differences in gender composition between schools are unlikely, but they exist for ethnicity — '13 schools have more than 10 and 33 fewer than 10 Black students', and similarly for Asian students — and there were differences in socio-economic profiles. Other differences included: (a) missing data — eight schools are missing from the statistical analysis and there is also unevenness in subgroup data; (b) the number of tiers of sets varies from 2 to 8, and the more boundaries there are, the more misallocations there will be; (c) the size of sets differed, which affects the entitlement to be allocated according to KS2 score; (d) the composition of the scores of all students in the intake will determine how many students congregate in and between sets and how many are in immune positions; (e) the number of students in the school, which will clearly affect the number of misallocations and the misallocations of subgroups.

Connolly admits that their model does not take account of school size, nor that larger schools tend to be in urban areas. There is also more diversity in set architectures, according to recent surveys, and this affects risk landscapes. Schools with more sets, and hence a higher risk of misallocation, will seem to treat students unfavourably even if 'students with the same KS2 test scores actually have the same chances of set allocation' (722) [this is demonstrated with an abstract example taking particular percentages of misallocation verdicts but varying the number of tiers that schools have, showing that schools with more sets must have higher misallocation rates, if only because they will have higher random classification errors and a smaller percentage of immune students]. There is an assumption that bias is indicated by deviation from a random pattern of allocation verdicts between subgroups, but subgroups are not distributed at random between schools, and schools have different levels of allocation risk.
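Gomm's abstract example about set architectures can be mimicked with another toy simulation — again my construction under assumed numbers (tier counts of 2, 4 and 8; an error SD of 4.5 marks; uniformly spread 'true' scores), not the figures used in the article. It shows the 'school difference risk': more set boundaries generate more zero-tolerance verdicts from the same random error.

```python
# Toy illustration of the school difference risk. Assume each school places
# students using error-free information (their 'true' score), while the
# researchers judge placements against a noisy observed KS2 mark and flag any
# discrepancy as a 'misallocation'. No school is biased, yet schools with more
# tiers collect more flags; a subgroup concentrated in many-tier schools will
# therefore look disproportionately misallocated in the aggregate.
import random

ERROR_SD = 4.5
random.seed(1)

def school_misallocation_rate(n_tiers: int, n_students: int = 5_000) -> float:
    """Flag a student when observed and 'true' scores fall in different tiers."""
    boundaries = [100 * k / n_tiers for k in range(1, n_tiers)]

    def tier(score: float) -> int:
        return sum(score >= b for b in boundaries)

    flagged = 0
    for _ in range(n_students):
        true = random.uniform(0, 100)
        observed = min(100.0, max(0.0, true + random.gauss(0, ERROR_SD)))
        if tier(true) != tier(observed):
            flagged += 1
    return flagged / n_students

for tiers in (2, 4, 8):
    print(f"{tiers}-tier school: ~{school_misallocation_rate(tiers):.0%} of students flagged")
# More tiers -> more boundaries -> a higher flag rate, with no bias anywhere.
# If one subgroup attends mostly 8-tier schools and another mostly 2-tier
# schools, the aggregate comparison will 'find' the first misallocated more often.
```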

The researchers have not investigated the consequences of misallocation, but the larger RCT study suggested there is no effect if allocation is carried out solely according to KS2 scores. The study looked at the national cohort at 16+, which took no 16+ examinations because of COVID, although its setting experience and outcomes in maths were similar to those of the immediately preceding cohort. Here, 'all ethnic groups allegedly disadvantaged by misallocation made more progress nationally than White students in maths over five years and matched or outperformed White students in attainment', except for Black Caribbeans (724). Girls made more progress nationally than boys and were less likely to attain very low scores. FSM students 'made less progress [and] achieved less than those not eligible'. It seems that misallocation 'is inconsequential for student performance five years later', unless the students were not representative.

Connolly's recommendations are based on a 'zero tolerance use of a six-month-old measure'. Even their data suggest that occupancies for subgroups 'are very close to what would be expected from prior achievement', although there are deficiencies in the data. Overall, known White students are not particularly well placed in sets, and 'Black students (apart from Black Caribbeans), supposedly biased against in the allocations, seem underrepresented in bottom sets, despite their low KS2 mean score' [that is quite an exception though]. As an indication of teacher bias, misallocation 'lacks secure construct validity'. It overestimates the predictive capacity of SAT scores for future individual performance. Clumps of misallocation indicate processes other than teacher bias — differential risk for subgroups arising from the distribution of their scores relative to set boundaries, and the disproportionate distribution of subgroups between schools. There are also the ways in which the evidence from KS2 tests is combined with 'richer, more sensitive and more timely information… which may or may not be used judiciously or equitably. Sometimes teacher judgements will be haphazard and biased… not only with regard to gender, social class and ethnicity, but implicating any dispositions of teachers [who] could look favourably or unfavourably on particular students for a wide range of reasons or create unrealistic assumptions about student types' (725) [the ellipses indicate references]. However, Connolly's study is not well designed to investigate these.