Preparing a synthetic population requires two inputs: samples, which are the individual household records to be used, and targets, which are the control totals for geographies in the model system, such as zones. The software identifies a list of units whose aggregate attribute values match a pre-specified set of corresponding target values. This list forms a synthetic population of such units consistent with the target values. Each unit in the list is drawn from a sample of such units, and any particular unit in the sample may appear in the list zero, one or more times as appropriate.

The software proceeds iteratively, at each step considering one of three operations: adding a unit from the sample to the list, subtracting a unit from the list, or a ‘swap’, where a unit in the list is swapped out and a unit from the sample is swapped in. The match of the list to the target values is scored using a goodness-of-fit function.

The list is divided into subgroups, commonly used to represent geographic areas known as “zones”, each of which is to contain a portion of the population. The process works through the list subgroup by subgroup. For each subgroup, one of the three operations (add, subtract or swap) is first selected, each with probability 1/3. In the case of subtract or swap, a unit in the subgroup is randomly selected. In the case of add or swap, a unit in the sample is randomly selected. The operation is then performed, and the change in the goodness-of-fit score is calculated. If the goodness of fit improves, the operation is kept. If the goodness of fit gets worse, the operation is kept with a probability less than 1.0; otherwise it is undone.
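The per-subgroup step described above can be sketched in Python as follows. This is a minimal illustration, not the program's actual code: the function name `try_operation`, the representation of a subgroup as a plain list, and the `gof` and `accept_prob` arguments are all assumptions made for the example. Skipping subtract and swap on an empty subgroup, and keeping an operation that leaves the score unchanged, are also assumptions, since the text does not cover those cases.

```python
import random

def try_operation(subgroup, sample, gof, accept_prob, rng):
    """Apply one add/subtract/swap to `subgroup` in place; return True if kept.

    `gof` is a hypothetical scoring function (lower is better) and
    `accept_prob` is the probability of keeping a worsening operation.
    """
    op = rng.choice(("add", "subtract", "swap"))  # each with probability 1/3
    if not subgroup and op != "add":
        return False  # assumption: nothing can be removed from an empty subgroup

    before = gof(subgroup)
    removed_index = None
    removed_unit = None
    added = False

    if op in ("subtract", "swap"):
        removed_index = rng.randrange(len(subgroup))
        removed_unit = subgroup.pop(removed_index)
    if op in ("add", "swap"):
        subgroup.append(rng.choice(sample))
        added = True

    after = gof(subgroup)
    # Assumption: an unchanged score counts as "not worse" and is kept.
    if after <= before or rng.random() < accept_prob:
        return True

    # Undo the operation, restoring the subgroup exactly as it was.
    if added:
        subgroup.pop()
    if removed_index is not None:
        subgroup.insert(removed_index, removed_unit)
    return False
```

With `accept_prob` set to 0.0 the sketch reduces to a pure hill climb; values above zero allow the occasional worsening move that the annealing schedule relies on.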

It is possible to have the program start with a previously generated list. If no such list is available, an initial list is generated by selecting units at random until the target value for just one of the attributes is reached or exceeded in each subgroup.
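A rough sketch of how such an initial list might be seeded for a single subgroup is shown below. The dictionary-based unit records, the attribute name `persons`, and the function name `initial_subgroup` are illustrative assumptions, not the program's own data structures.

```python
import random

def initial_subgroup(sample, attribute, target, rng):
    """Draw units at random until the chosen attribute's total reaches the target."""
    if target > 0 and not any(unit[attribute] > 0 for unit in sample):
        raise ValueError("no unit in the sample contributes to this attribute")
    units = []
    total = 0
    while total < target:
        unit = rng.choice(sample)  # sampling with replacement
        units.append(unit)
        total += unit[attribute]
    return units

# Hypothetical sample of three household records and a target of 25 persons.
sample = [{"persons": 1}, {"persons": 2}, {"persons": 4}]
start = initial_subgroup(sample, "persons", 25, random.Random(1))
total = sum(unit["persons"] for unit in start)
```

The resulting list meets or slightly overshoots the one seeded attribute and is otherwise unfitted; the iterative operations then bring the remaining attributes toward their targets.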

The decision to keep an operation at any point includes a probabilistic component. Operations that lead to a worse goodness of fit will be more likely to be accepted early in the process than later in the process. This is what is termed a ‘simulated annealing’ algorithm – based on the idea that the program should be getting closer to the best possible match as the number of iterations increases and thus a non-improving swap is more likely to be detrimental rather than advantageous in the search for this best possible match.

The general formula used in the measurement of goodness-of-fit is:

$$\mathrm{gof} = \sqrt{\sum_a \mathrm{weight}_a^2 \cdot (\mathrm{list}_a - \mathrm{target}_a)^2}$$

where:

$a$ = index of attributes whose aggregate values for the synthetic population list are to match a pre-specified set of corresponding target values

$\mathrm{gof}$ = goodness-of-fit, with values closer to 0 indicating a better fit (such that it might be appropriate to consider it a ‘lack-of-fit’ measure).

$\mathrm{weight}_a$ = weight associated with attribute $a$

$\mathrm{list}_a$ = aggregate value for attribute $a$ for the list

$\mathrm{target}_a$ = target value for attribute $a$

The weight associated with an attribute indicates the relative importance to be placed on achieving a match with regard to that attribute.
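As a small worked example of the goodness-of-fit formula, the sketch below computes the score for two attributes. The attribute names, the numbers, and the use of dictionaries keyed by attribute are assumptions chosen for illustration.

```python
import math

def goodness_of_fit(list_totals, targets, weights):
    """Weighted root-sum-of-squares gap between list totals and targets."""
    return math.sqrt(
        sum((weights[a] * (list_totals[a] - targets[a])) ** 2 for a in targets)
    )

# Hypothetical zone: 3 households over target and 4 persons under target.
targets = {"households": 100, "persons": 250}
list_totals = {"households": 103, "persons": 246}
weights = {"households": 1.0, "persons": 1.0}

gof = goodness_of_fit(list_totals, targets, weights)  # sqrt(9 + 16) = 5.0
```

Doubling the weight on an attribute doubles its contribution to the gap before squaring, so the search is pushed harder toward matching that attribute.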

The formula used to assign the probability of accepting an operation that leads to a worse goodness of fit is:

$$P = \left[\exp\left(-\frac{\mathrm{iteration}}{\alpha}\right)\right]^{\Delta\mathrm{gof}^{\,g}}$$

where:

P = probability of accepting the operation

Δgof = change in goodness-of-fit associated with the operation

g = parameter controlling influence of the size of the change in goodness-of-fit on the probability of accepting the operation, specified as gofDifExponent in the properties file for the program

iteration = number of operations that have been evaluated so far in the process

α = parameter controlling influence of the number of iterations on the probability of accepting the operation, specified as coolingParameter in the properties file for the program

It is common to set g to 0.0 when first setting up the synthesizer, so that the probability of accepting an operation that leads to a worse goodness of fit is simply

$$P = \exp\left(-\frac{\mathrm{iteration}}{\alpha}\right)$$
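The acceptance probability can be evaluated directly from the formula. The following sketch assumes Δgof is passed as a positive number (the amount by which the fit worsened); the function and argument names are illustrative, mirroring the coolingParameter and gofDifExponent properties rather than reproducing the program's code.

```python
import math

def acceptance_probability(iteration, delta_gof, cooling_parameter,
                           gof_dif_exponent=0.0):
    """Probability of keeping an operation that worsens the fit by delta_gof."""
    base = math.exp(-iteration / cooling_parameter)
    return base ** (delta_gof ** gof_dif_exponent)

# With g = 0 the size of the change is ignored: P = exp(-iteration / alpha).
p_simple = acceptance_probability(1000, 2.0, cooling_parameter=1000.0)

# With g = 1 a worsening of 2.0 squares the base: P = exp(-2).
p_scaled = acceptance_probability(1000, 2.0, cooling_parameter=1000.0,
                                  gof_dif_exponent=1.0)
```

In both cases the probability falls toward zero as the iteration count grows relative to the cooling parameter, which is what makes worsening operations less likely to be accepted later in the run.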

The number of iterations to be performed by the program is specified by the user as part of the inputs.

The target values for the attributes can be specified for individual subgroups of the population or for combinations of the subgroups. The program proceeds subgroup by subgroup, with the number of iterations for each subgroup in each pass through the entire list of subgroups determined by the goodness-of-fit for that subgroup.

The formula used to establish the number of iterations for a given subgroup is:

$$n_z = t \cdot \mathrm{gof}_z + 1$$

where:

$z$ = index of subgroup

$\mathrm{gof}_z$ = goodness-of-fit for subgroup $z$

$t$ = parameter controlling influence of the goodness-of-fit value for a subgroup on the number of iterations for the subgroup as part of the current pass through the subgroups, called iterationsPerZonalLackOfFit in the properties file.
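A quick numeric illustration of this allocation is given below. Truncating the product to a whole number is an assumption of the sketch, since the text does not say how fractional values are handled, and the function name is hypothetical.

```python
def iterations_for_subgroup(gof_z, iterations_per_zonal_lack_of_fit):
    """Iterations for one subgroup in the current pass: t * gof_z + 1."""
    # Assumption: the product is truncated to an integer count.
    return int(iterations_per_zonal_lack_of_fit * gof_z) + 1

well_fitted = iterations_for_subgroup(0.0, 4)     # 1
poorly_fitted = iterations_for_subgroup(12.5, 4)  # 51
```

The added 1 guarantees that even a perfectly fitted subgroup is revisited once per pass, while poorly fitted subgroups receive proportionally more attention.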

The program performs iterations on each subgroup, checking the total number of iterations every time a subgroup is processed, and terminating if the total number of iterations meets or exceeds the specified number of iterations. Upon termination the program reports the overall goodness-of-fit, the goodness of fit for each subgroup and for each target in each subgroup, and the resulting list of units comprising the synthetic population. For more information refer to the Population Synthesizer manual here.
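The overall control flow, with repeated passes over the subgroups and a check of the running total after each subgroup, might look roughly like the sketch below. The callbacks `gof_for` and `run_iterations` are hypothetical stand-ins for the scoring and per-subgroup processing, and the integer truncation of the iteration count is an assumption.

```python
def run_passes(subgroup_ids, gof_for, run_iterations, t, max_iterations):
    """Cycle through subgroups until the total iteration budget is used up.

    gof_for(z) returns the current lack-of-fit for subgroup z, and
    run_iterations(z, n) performs n add/subtract/swap attempts on it.
    """
    if not subgroup_ids:
        return 0
    total = 0
    while True:
        for z in subgroup_ids:
            n_z = int(t * gof_for(z)) + 1  # at least one iteration each
            run_iterations(z, n_z)
            total += n_z
            # Checked every time a subgroup is processed, as described above.
            if total >= max_iterations:
                return total

# Stub example: zone "A" fits poorly, zone "B" fits perfectly.
fit = {"A": 2.0, "B": 0.0}
calls = []
used = run_passes(["A", "B"], fit.get,
                  lambda z, n: calls.append((z, n)),
                  t=1, max_iterations=7)
```

Because the budget is only checked after a subgroup finishes, the total can exceed the requested number of iterations by up to one subgroup's allocation.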


### Wiki

We have set up a password-protected wiki for users to discuss their experiences with the population synthesizer. Contact us if you would like a password for the wiki.