A highly disaggregated ethnicity classification is available after 2002 (see Annex E, p38 of the 2011 user guide). Using this, it would be possible to go behind some of the broader categories (e.g. White British, African, Caribbean). This would be interesting and would help to avoid, or at least explore, general inferences about pupils whose ethnicity is defined in terms of a whole continent. In 2002, only more aggregated ethnic categories are available.

From 2003 onwards there are two main variables: a minor grouping variable with 18 categories (ethnicgroupminor_xx) and a major grouping variable with 6 categories (ethnicgroupmajor_xx).

There is a certain amount of missing data, most of which is coded as 'refused' or 'not yet obtained'. This missing data can be reduced (see the suggested SPSS coding below).

Ethnic classification fluctuates over time for some pupils. This may reflect a pupil redefining their perceived ethnicity (which we can do little to explore without qualitative detail). Alternatively, it might relate to where the data comes from (e.g. the parent, the pupil themselves or the school; see below), and this is something that could be drawn on to help 'correct' the final ethnicity classification (see coding below).

Data collection

For each year there is a source code for the data (ethnicitysource_xx). This takes one of 5 letter codes, defined as follows:

C = Provided by the child (i.e. pupil)
P = Provided by the parent
S = Ascribed by the current school
T = Ascribed by a previous school
O = Other

This data is provided by parents around 85-90% of the time.

This detail is drawn on to try to improve the validity / integrity of the final ethnicity classification.
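One way the source code could be used to settle on a single classification per pupil is sketched below in Python. This is only an illustration, not the SPSS syntax itself: the function name preferred_ethnicity, the tuple layout and the example codes are hypothetical, and the ranking of sources other than the parent (C, then S, then T, then O) is an assumption. Breaking ties by taking the most recent year is also an assumption.

```python
# Hypothetical priority order. The parent comes first, as described in the
# text; the relative order of the other sources is an assumption.
SOURCE_PRIORITY = {"P": 0, "C": 1, "S": 2, "T": 3, "O": 4}


def preferred_ethnicity(records):
    """Pick one ethnicity code for a pupil from their yearly records.

    records: list of (year, ethnicity_code, source_code) tuples.
    Returns the code from the record with the best-ranked source,
    breaking ties by the most recent year, or None if nothing is usable.
    """
    # Keep only records with a non-empty code and a recognised source letter.
    usable = [r for r in records if r[1] and r[2] in SOURCE_PRIORITY]
    if not usable:
        return None
    # Lowest priority number wins; among equals, the latest year wins.
    best = min(usable, key=lambda r: (SOURCE_PRIORITY[r[2]], -r[0]))
    return best[1]


# Illustrative records for one pupil (the codes are examples only).
pupil = [(2006, "WBRI", "S"), (2007, "BCRB", "P"), (2008, "WBRI", "S")]
chosen = preferred_ethnicity(pupil)  # the parent-provided 2007 record
```

Here the 2007 record is chosen because it is the only one provided by the parent, even though the school ascribed a different category in both 2006 and 2008.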

Validity of measure

As mentioned above, there are issues with missing responses and refusals, which can be limited in later data files by drawing on earlier classifications (when available).

Another issue relates to how ethnicity classifications fluctuate over time. Using the main, 18-category scheme, around 93% of pupils were classified the same in 2007 and 2008. The comparable proportion classified the same in 2004 and 2008 was around 89%.
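A stability figure of this kind can be computed as the share of pupils whose category is identical in the two years. The Python sketch below is a minimal illustration, not the calculation actually used for the figures above. It assumes that pupils with a missing code in either year are excluded from the denominator (the text does not say how missing values were treated), and the missing codes "REFU" and "NOBT" and the function name agreement_rate are assumptions.

```python
# Assumed codes for 'refused' and 'not yet obtained', plus empty strings.
MISSING_CODES = frozenset({"", "REFU", "NOBT"})


def agreement_rate(year_a, year_b, missing=MISSING_CODES):
    """Share of pupils classified identically in two years.

    year_a, year_b: dicts mapping pupil id -> ethnicity code.
    Only pupils present with a valid (non-missing) code in both years
    are counted. Returns None if there are no such pupils.
    """
    both = [
        pid
        for pid in year_a.keys() & year_b.keys()
        if year_a[pid] not in missing and year_b[pid] not in missing
    ]
    if not both:
        return None
    same = sum(1 for pid in both if year_a[pid] == year_b[pid])
    return same / len(both)


# Illustrative data: pupil 3 refused in 2007, pupil 5 is new in 2008.
eth_2007 = {1: "WBRI", 2: "BCRB", 3: "REFU", 4: "AIND"}
eth_2008 = {1: "WBRI", 2: "MWBC", 3: "WBRI", 4: "AIND", 5: "WBRI"}
rate = agreement_rate(eth_2007, eth_2008)  # pupils 1, 2 and 4 are compared
```

In this example three pupils have valid codes in both years and two of them are unchanged, so the rate is two thirds.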

Cleaning the variable

Stata code

This do file creates consistent ethnic groups between 2002 and 2012, noting the changes in ethnic classifications over time.

Unlike the SPSS code below, this do file does not draw on the source of the information or reduce missing values. More information about changes in the ethnic group classifications can be found in this file, which was provided by the Department for Education.

SPSS code

This file converts the original string variables to numeric and re-orders the classification. Following this, the data source details are drawn on so that the parental source is given priority over child, school and other. Finally, the missing values in the 2008 variable are reduced by drawing on the 2007, 2006, 2005 and 2004 variables to fill in the details.
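The final back-filling step can be sketched as follows. This is a Python illustration of the idea rather than a translation of the SPSS syntax: the function name backfill, the dictionary layout and the missing codes "REFU" and "NOBT" are assumptions, and the earlier years are simply tried in order from most recent to oldest.

```python
# Assumed codes for 'refused' and 'not yet obtained', plus empty strings.
MISSING = {"", "REFU", "NOBT"}


def is_valid(code):
    """True if the code is present and not one of the missing codes."""
    return code is not None and code not in MISSING


def backfill(codes_by_year, target=2008, fallbacks=(2007, 2006, 2005, 2004)):
    """Return the target-year code, filled from earlier years if missing.

    codes_by_year: dict mapping year -> ethnicity code for one pupil.
    The target year is used when valid; otherwise the first valid code
    found among the fallback years (in the order given) is returned.
    Returns None if no year has a valid code.
    """
    for year in (target, *fallbacks):
        code = codes_by_year.get(year)
        if is_valid(code):
            return code
    return None


# Illustrative pupil: refused in 2008, not yet obtained in 2007.
filled = backfill({2008: "REFU", 2007: "NOBT", 2006: "WBRI"})
```

A valid 2008 code is never overwritten; earlier years are consulted only when the 2008 value is missing.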

The SPSS syntax can be found in this file.

This Excel file compares the original (numeric and re-ordered) 2008 ethnicity variable with the version that includes the coding 'tweaks'.

Description of values across cohorts

By age groups - to do

Over time
This is shown in this Excel file.

Stability within pupil