Notes on: Gomm, R. (2021) 'SATs, sets and allegations of bias: The allocation of 11-year-old students to mathematics sets in some English schools in 2015. A response to Connolly et al., 2019'. British Educational Research Journal 48: 704–729. DOI: 10.1002/berj.3790
Dave Harris
Gomm questions the idea of correct set placements and suggests that the results in Connolly et al. are 'largely an artefact of their evaluation approach'. There may be teacher bias involved, but Connolly et al.'s approach 'could not evidence it' (704). The way in which data are generated and aggregated from diverse clusters needs to be investigated.
Setting (or grouping or tracking) takes place in most secondary schools, and national curriculum assessments (SATs or KS2 tests) are also mandatory. The idea is to generate performance indicator data, but the tests also inform some decision-making about individual students, including set placement. KS2 maths papers are 'marked externally by disinterested markers', but setting is done by secondary school staff who 'might have known the students for up to 9 weeks and if so, their class, gender or ethnic prejudices might bias setting decisions' (705). Connolly et al. claim to provide evidence of such bias.
Their research was part of a whole suite of research projects, which also included an RCT investigating the effects of allocating first-year secondary school students to sets according to their KS2 results alone and then measuring results two years on, compared with control schools that allocated students to sets in any way they chose. NFER research found 'no differences in academic or self-confidence outcomes' between these two groups of schools, between high- and low-scoring students, or between FSM students and others [research cited on 706].
Subsets from this sample were also studied, although it is not clear how they were actually drawn in some of the other specific studies, including Connolly's. In Connolly's sample, the 'size of some ethnic subgroups' was 'too small for sensible analysis', which led to the consolidation of Asian and all Black groups 'despite these being heterogeneous in set occupancy and allocative treatments'. In the full RCT study there were no consequences for the students arising from the way they had been allocated, but these results were not mentioned in Connolly.
Connolly et al. claim to have discovered teacher bias against Black and Asian students and against girls, but not against students on the basis of SES, and these claims have been much publicised. But there are methodological shortcomings:
(1) contradictions between claims and data;
(2) problems with the demography of maths sets and how weakly this is connected to discriminatory behaviour;
(3) the researchers' strategy for detecting bias and the procedures used to identify correct and incorrect allocations to sets, especially the zero-tolerance approach, which does not take account of random error based on test score confidence intervals — if these are factored in, 'for many students their observed SAT score ceases to be an adequate basis for setting decision without further information which was not available to the researchers' (707);
(4) misallocations between subgroups are too small, given the amount of missing data, to support conclusions about ethnic discrimination;
(5) there may be a '"within school" ecological fallacy', where the risk of being misallocated according to the researchers varies according to 'student position in the rank order of KS2 scores within a school';
(6) there may be a between-school ecological fallacy explaining the disproportionality of misallocations between subgroups, because there is 'the asymmetrical distribution of ethnic and socio-economic subgroups between schools with different set architectures and hence with different risks of receiving a misallocation verdict' (708).
There are two ways in which disproportionate
distributions of misallocations 'can be generated
without any teacher bias'. This does not exclude
the possibility of bias but it does 'confound
attempts to demonstrate such an effect'. There is
no outcome data to support the view that
misallocation has important consequences for
academic futures. The national counterparts of
nearly all the groups allegedly disfavoured 'made
more progress nationally in the following five
years than did the national counterparts of those
allegedly favoured by setting decisions' (708).
Contradictions between claims and data are shown in Table 1 (708). For example, Connolly et al. claim that the odds of Black students being misallocated to lower maths sets were 2.4 times higher than for White students, but the actual data show that 9% of known Black students were in bottom sets compared with 13% of known White students and probably 15% of the unknown ones. 82% of Black students were in bottom or middle sets compared with 70% of known White ones. Despite the claim that the odds of Black and Asian students being misallocated to higher sets were 0.48–0.58 times lower than for
White students, the raw data show that 93% of known Asian students were in top or middle sets compared with 87% of students known to be White. Despite the calculation of better odds for female students, girls 'have substantially lower KS2 mathematics scores than boys.' FSM eligibility was relevant because those with FSM were '4.6% less likely to be in top or middle sets and those not eligible are 9.5% more likely to be middle or bottom sets' [so the shift towards odds ratios is highly misleading]. Gomm gets the descriptive statistics from the larger sample, not the subsample, apparently based on the student questionnaire in 2015 and on the National Pupil Database. Another sample also shows that 20% of students change sets, while Connolly et al. assume that set placements are 'enduring'. There is missing data: 5% of those who might be eligible for FSM, and 38% of students without an ethnic designation.
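The contrast between headline odds ratios and raw occupancy percentages can be seen with a toy calculation. The rates below are invented, not Connolly et al.'s or Gomm's figures; the point is only that an odds ratio of around 2 can sit alongside a modest absolute gap, which is why Gomm treats the shift to odds ratios as potentially misleading.

```python
# Illustrative only: hypothetical misallocation rates, not Connolly et al.'s data.
# Shows how a modest absolute difference can yield a headline-sized odds ratio.

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

p_group_a = 0.11   # hypothetical: 11% of group A receive a 'downward misallocation' verdict
p_group_b = 0.05   # hypothetical: 5% of group B receive one

odds_ratio = odds(p_group_a) / odds(p_group_b)
abs_difference = (p_group_a - p_group_b) * 100

print(f"odds ratio:           {odds_ratio:.2f}")        # ~2.35
print(f"percentage-point gap: {abs_difference:.1f} pp")  # 6.0 pp
```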
There seems to be a close relationship between
rank order for occupancy of set and for the test
results, using the threefold classification,
except for Bangladeshi students; however, four more
top set occupancies would bring them level with
White students. There is also an unusual group of
Black students other than Black Caribbeans —
'removing these two exceptions brings the
coefficients up to "strong" correlation levels
between ranking for mean test score and ranking
for set occupancy' (709).
Girls have a much lower mean KS2 test score than boys and have 2.5% more bottom set places and 5% more middle set places, which 'seems consistent with gender differences being in line with differences in prior achievement'. FSM students have the lowest mean scores at KS2 and 'correspondingly are
very poorly placed in sets', yet their low
positioning is 'said by Connolly et al. not to
have involved teacher bias'. However, 'all told,
the data point to occupancies that follow group
means for prior achievement fairly closely: the
opposite of what is claimed in Connolly et al.'
(710), although there are three caveats:
(1) reducing the sample initially to 9300 'makes
it highly likely that some subgroup differences
in occupancies (either way) are due to attrition
bias, a possibility not investigated by the
authors'.
(2) there are 3376 students 'for whom there is no
ethnic designation — 36% of all and possibly the
majority of all BAME students in the
database', which must raise doubts about the
pattern of ethnic differences within the sample,
let alone national trends.
(3) inequalities within the sets, especially the
middle ones, are not addressed.
The claim in Connolly is that some subgroups get
more and some less than their fair share of
allocations compared to others with the same test
scores — but they mean the same score in the same
school. However, the same score predicts a
different correct placement in different schools.
(713). It also means 'precisely the same observed
score, since these authors ignore confidence
intervals'. They propose that the absence of
teacher bias would be shown by the absence of any
systematic relationship between subgroup membership and misallocations, but this is only a
test of their hypothesis if:
(a) the researchers' prescriptions for correct
setting are accepted as more valid or in some way
better — but they provide no outcome data to
justify their prescriptions.
(b) seeking, as a null condition, random relations between misallocations and subgroup memberships would be appropriate if all subgroups had similar profiles of KS2 scores in the first place, but if they have scores already placing them close to set boundaries, they are obviously at greater risk of receiving a misallocation verdict from the researchers —
'the boundary proximity risk'.
(c) if each of the 46 schools had similar rates of
misallocation and if subgroups were evenly
distributed between them, but minority ethnic
groups are strongly concentrated in a minority of
schools and there is also variation in
socio-economic profile. There is also variation in
set architecture. Even if students are placed
fairly within each school, they might have been
dealt with unfairly by going to different schools
— 'the school difference risk'.
Connolly initially identified misallocation using
'"fine point" scoring' which was used in earlier
studies but later abandoned. The originating marks
are more meaningful. In 2015 up to a hundred marks
on the scale could be derived from the three
mandatory level V KS2 maths papers, and students
could get an additional 50 marks by taking an
optional level VI paper. All those classified as
[having the highest 'fine points'] must have taken
this additional paper, and 'nationally, Chinese
and Indian students were much more likely to take
and pass this examination and boys slightly more
likely than girls'. The researchers then allocate
those with the top scores to the top set and
compare them to those actually allocated to sets
by teachers — differences are called misalignments
or misallocations. They also identify borderlines, and the researchers deal with those by 'fair dealing', roughly splitting them into upward and downward allocations, or by acting 'by fiat' (714). They might instead have used another 'concurrent assessment', that is, some additional evidence.
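As described here, the verdict procedure amounts to ranking students on their scores, filling the sets from the top according to each set's size, and flagging any disagreement with the teachers' actual placements. A minimal sketch of that logic, with invented students, marks and set sizes (not Connolly et al.'s data or code):

```python
# Minimal sketch of the zero-tolerance verdict procedure as described in the notes.
# Student names, scores and set sizes are invented for illustration.

def score_based_sets(scores, set_sizes):
    """Rank students by score and fill sets from the top, ignoring measurement error."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    allocation, start = {}, 0
    for set_name, size in set_sizes:          # e.g. [("top", 2), ("middle", 2), ("bottom", 2)]
        for student in ranked[start:start + size]:
            allocation[student] = set_name
        start += size
    return allocation

scores = {"A": 98, "B": 97, "C": 96, "D": 95, "E": 60, "F": 58}
teacher_sets = {"A": "top", "B": "middle", "C": "top", "D": "middle", "E": "bottom", "F": "bottom"}

expected = score_based_sets(scores, [("top", 2), ("middle", 2), ("bottom", 2)])
misallocated = {s for s in scores if expected[s] != teacher_sets[s]}
print(misallocated)   # B and C are flagged: a one-mark gap generates two 'misallocations'
```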
This would be more in line with 'the theory on
which national curriculum tests are based… wherein a test result comprises a true score plus an error score. The latter is a random factor, likely to vary on a second equivalent testing occasion, although within calculable probability limits'. The margins of tolerance are usually expressed as confidence intervals, and other evidence is considered. Ofqual strongly recommends this procedure. In 2018 the reliability of the scores was 0.96; applied to the researchers' data, this implies a 68% confidence interval of plus or minus four or five test marks around an observed score. Connolly et al., however, take a zero-tolerance approach, which enables them to count one- or two-mark differences as misallocations.
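The 'plus or minus four or five marks' follows from the usual standard error of measurement formula, SEM = SD × sqrt(1 − reliability), which gives roughly a 68% confidence interval around an observed score. The standard deviation below is an assumed figure chosen only to reproduce a margin of that order; the notes do not give Gomm's exact inputs.

```python
import math

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
# A 68% confidence interval is roughly the observed score +/- 1 SEM.
reliability = 0.96          # reported reliability of the KS2 maths scores
assumed_sd = 22.0           # assumed spread of marks, chosen only for illustration

sem = assumed_sd * math.sqrt(1 - reliability)
print(f"SEM ~ {sem:.1f} marks")                 # ~4.4 marks
print(f"68% CI: observed score +/- {sem:.0f}")  # consistent with 'plus or minus four or five marks'
```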
Since questions on the test papers carry one or two marks, even one wrong answer can change a student's eligibility, and in one of their cases, 20 of the 32 misallocations involve such discrepancies: none involved more than six marks, and this is a representative school, suggesting overall a 'small magnitude of misallocations'.
There might also be information available to
teachers but not the researchers that justified
different allocations. Connolly has a rather
abstract notion of equity — students in the same
school in November with the same rank order as
each other for maths in May being put in sets
equally, but there might well be exceptions, such
as students who were ill in May but recovered in
November, or students intensively coached for the
KS2, and underachieving later [as teachers
suspect]. The body responsible for tests advises
against overconfidence in SAT scores and research
confirms this [listed on 715], not surprisingly
because these were 'primarily designed to assess
schools not students'. Most secondary schools test
their intakes to 'evade the effects of "teaching
to the test"' or use commercial tests [Connolly is
sceptical about these], although some are licensed
by the Qualifications and Curriculum Development
Agency. These might be expected to rank students
quite differently from KS2 results. Whether
teachers did use such evidence cannot be evaluated
adequately by Connolly. Teachers might have had
other pedagogical intentions such as creating
single sex sets in STEM subjects, or avoiding
mono-ethnic sets. They might have had 'alternative
notions of equity' ignored by Connolly.
The smaller sample used in the multilevel
regression analysis has problems — 'it is almost
certainly not the case that this smaller sample is statistically representative of the 9301,
particularly for the smaller subgroups or the
original 12,500' (716). The composition 'is wrongly given on page 885… and subgroup numbers are not given on their tables of results'. Those without an ethnic designation are treated as if they were not White, but 'it is likely that
70% or so of the unknown are actually White'. 'The
researchers admit that their data did not meet the
requirements for the analysis that they conducted.
For these reasons the results of the multilevel
analysis are best ignored'.
The claim that nearly 1/3 of students are
misallocated applies to differences between
individuals, and this is not the same as inequity
between subgroups: these 'reveal a much smaller
scale than those between individuals'. There will
always be some mismatch between scores and set capacities, so the real question is whether subgroups have a fair share of correct verdicts and misallocations. Gomm's own analysis of the data
suggests 'an untidy pattern of rather small
differences', not unlike the summary by Connolly
themselves. Put into absolute numbers 'there are
between two and three fewer downwardly
misallocated boys per school than expected as a
proportional share, but around seven boys missing
per school, plus another 63 gendered
students were not accounted for' (719). BAME
students received '76 more downward misallocation
verdicts than their fair share compared with students known to be White, who had 59 less', but there are 106 known BAME students for whom there is no allocation data, 309 such White students, and 3376 of unknown ethnicity. 'These data are simply
not adequate for an ethnic equity audit'. There
are further anomalies so that Black Caribbean
students have the highest rate of upward
misallocations but also the highest occupancy of
bottom sets, and 'teacher bias seems an unlikely explanation for such occurrences, which may better be explained as artefacts of the way Connolly et al. pursued their evaluation, if not also of attrition bias'.
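The 'fair share' comparisons are essentially expected counts under proportionality: each subgroup would be expected to receive misallocation verdicts in proportion to its size, and the gaps are then expressed per school. A tiny sketch with invented numbers:

```python
# Illustrative fair-share check: expected misallocation verdicts if they fell on
# each subgroup in proportion to its size. All numbers are invented.
group_sizes = {"boys": 480, "girls": 520}
total_downward_verdicts = 150
total_students = sum(group_sizes.values())

for group, n in group_sizes.items():
    expected = total_downward_verdicts * n / total_students
    print(f"{group}: expected {expected:.0f} downward verdicts as a proportional share")
# Observed counts are then compared with these expectations, and the shortfall or
# excess divided by the number of schools to give the 'per school' figures quoted above.
```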
There is also boundary proximity risk where scores
close to the boundary mark are more likely to be
misclassified, according to Ofqual. Those with high
scores have a low risk of misallocation, occupying
a 'zone of immunity', indicating apparently, 'a
common research artefact: floor and ceiling
effects'. Of the 158 in Connolly, 131 are
accounted for by students with such immunity —
69%. Chinese and Indian students are the least
likely to be upwardly misallocated and are also
the clusters with the highest mean KS2 test
scores. There is also the issue of clusters around
boundaries. In the sample school, two marks or
less around the researchers' borderline are
crucial in allocating 47 students. A random
allocation, the recommendation made by the
researchers, might still look inequitable [if
there were simply more boys, say, than girls in
the first place]. A random distribution model is
correct only if subjects share the same risk, but
this is not always the case [I'm not sure I
understand this — I think it is because the same
SAT scores have different consequences according
to the secondary school you go to]. There is a
'risk landscape', providing a range of expected
figures, although small numbers make this
difficult. The main point is that Connolly's
methods themselves 'generate unequal (and
spurious) claims of "misallocation"' (721).
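The boundary proximity point can be illustrated with a simple simulation: add random measurement error to a 'true' score, apply a fixed set boundary with zero tolerance, and see how often students at different distances from the boundary would end up on the other side. The boundary, error size and scores below are invented; this is not Gomm's or Connolly et al.'s procedure.

```python
import random

random.seed(0)

BOUNDARY = 80          # invented set boundary on a mark scale
SEM = 4.5              # measurement error of roughly +/- 4-5 marks
TRIALS = 10_000

def flip_rate(true_score):
    """How often a retest with random error would put this student on the other side of the boundary."""
    original_side = true_score >= BOUNDARY
    flips = sum((true_score + random.gauss(0, SEM) >= BOUNDARY) != original_side
                for _ in range(TRIALS))
    return flips / TRIALS

for true_score in (79, 82, 90, 100):
    print(f"true score {true_score}: flips on {flip_rate(true_score):.0%} of retests")
# Scores a mark or two from the boundary flip close to half the time; scores far above
# it almost never do: the 'zone of immunity' and the ceiling effect in miniature.
```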
Schools have different rates of misallocation, and
this can affect aggregate data. These errors of
inference from aggregate to components are called '"ecological fallacies"', and they are 'a long-standing problem in American studies of ethnic and gender differences in education and recruitment' (722) [lots of references including Hammersley and Gomm, 2021]. Major differences in
gender between schools are unlikely, but they
exist for ethnicity — '13 schools have more than
10 and 33 fewer than 10 Black students' and
similar for Asian students, and there were
differences in socio-economic profiles. Other
differences included:
(a) missing data — eight schools are missing from the statistical analysis and there is also unevenness in subgroup data;
(b) the number of tiers of sets varies from 2 to 8, and the more boundaries there are, the more misalignments there will be;
(c) the size of sets differed, which affects the entitlement to be allocated according to KS2 score;
(d) the composition of the scores of all students in the intake will determine how many students congregate in and between sets and how many are in immune positions;
(e) the number of students in the school, which will clearly affect the number of misallocations and the misallocations of subgroups.
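The between-school point can be illustrated with two hypothetical schools that each treat every subgroup identically but have different misallocation rates (say, because of different set architectures) and different subgroup mixes; pooling them produces an apparent subgroup disparity with no within-school bias at all. All figures are invented:

```python
# Two invented schools. Within each, every subgroup faces exactly the same
# misallocation rate, so there is no within-school bias by construction.
schools = [
    # (misallocation rate, number of group X students, number of group Y students)
    (0.20, 80, 20),   # School 1: many set boundaries, higher rate; mostly group X
    (0.05, 20, 80),   # School 2: few boundaries, lower rate; mostly group Y
]

mis_x = sum(rate * x for rate, x, _ in schools)
mis_y = sum(rate * y for rate, _, y in schools)
n_x = sum(x for _, x, _ in schools)
n_y = sum(y for _, _, y in schools)

print(f"group X pooled misallocation rate: {mis_x / n_x:.0%}")   # 17%
print(f"group Y pooled misallocation rate: {mis_y / n_y:.0%}")   # 8%
# The pooled rates differ even though no school treated the groups differently:
# the aggregate disparity is an artefact of where the groups are concentrated.
```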
Connolly et al. admit that their model does not take
account of school size, nor that larger schools
tend to be in urban areas. There is also more
diversity in set architectures according to recent
surveys, and this affects risk landscapes. Schools with more sets, and hence a higher risk of misallocation, will seem to be unfavourably treated even if 'students with the same KS2 test scores actually have the same chances of set allocation' (722) [this
is demonstrated with an abstract example taking
particular percentages of misallocation verdicts
but varying the number of tiers that schools have,
showing that schools with more sets must have
higher misallocation rates if only because they will have more random classification errors and smaller percentages of immune students]. There is an assumption that bias is indicated by deviation from a random pattern of allocation verdicts between subgroups, but subgroups are not distributed at random between schools, and schools have different levels of allocation risk.
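The abstract example about set architectures can be reproduced with a small simulation: hold the score distribution and measurement error fixed, vary only the number of equally sized tiers, and the share of students whose noisy score falls in a different tier from their true score rises with the number of tiers. All parameters are invented:

```python
import random

random.seed(1)

N_STUDENTS, SEM, TRUE_MAX = 200, 4.5, 110

def boundary_crossing_rate(n_tiers):
    """Share of students whose noisy score falls in a different tier than their true score."""
    tier_width = TRUE_MAX / n_tiers
    crossings = 0
    for _ in range(N_STUDENTS):
        true = random.uniform(0, TRUE_MAX)
        observed = true + random.gauss(0, SEM)
        if int(true // tier_width) != int(min(max(observed, 0), TRUE_MAX - 1e-9) // tier_width):
            crossings += 1
    return crossings / N_STUDENTS

for tiers in (2, 4, 8):
    print(f"{tiers} tiers: ~{boundary_crossing_rate(tiers):.0%} of students cross a boundary")
# More tiers mean more boundaries, more students within error range of one, and fewer
# in 'immune' positions: higher apparent misallocation with no bias at all.
```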
The researchers have not investigated the
consequences of misallocation, but the larger RCT
study suggested there is no effect if allocation
was carried out solely according to KS2 scores.
The study looked at the national cohort at 16+,
which took no 16+ examinations because of COVID,
although their setting experience and outcomes in
maths were similar to those of the immediately
preceding cohort. In the study, 'all ethnic groups
allegedly disadvantaged by misallocation made more
progress nationally than White students in maths
over five years and matched or outperformed White
students in attainment', except for Black Caribbeans (724). Girls made more progress nationally than boys and were less likely to attain very low scores. FSM students 'made less progress and achieved less than those not eligible'. It seems
that misallocation 'is inconsequential for student
performance five years later' unless the students
were not representative.
Connolly's recommendations are based on 'zero
tolerance use of a six-month-old measure'. Even
their data suggest that occupancies for subgroups
'are very close to what would be expected from
prior achievement' although there are deficiencies
in the data. Overall, known White students are not particularly well placed in sets, and Black students (apart from Black Caribbeans), supposedly biased against in the allocations, 'seem underrepresented in bottom sets, despite their low KS2 mean score' [that is quite an exception
though]. As an indication of teacher bias,
misallocation 'lacks secure construct validity'.
It overestimates the predictive capacity of SAT
scores for future individual performance. Clumps
of misallocation indicate processes other than
teacher bias — differential risk for subgroups
arising from the distribution of their scores and
set boundaries, and the distribution of subgroups
disproportionately between schools. There are also the ways in which the evidence from KS2 scores is combined with 'richer, more sensitive and more timely information… which may or may not be used judiciously or equitably. Sometimes teacher judgements will be haphazard and biased… not only with regard to gender, social class and ethnicity, but implicating any dispositions of teachers [who] could look favourably or unfavourably on particular students for a wide range of reasons or create unrealistic assumptions about student types' (725) [the ellipses indicate references].
However, Connolly's study is not well designed to
investigate these.