Maximising Data Value

Programme

The book icons link to extended abstracts of the papers, and the slide icons to the presentations (some of these are rather large).

Session 1, organised by the Office for National Statistics
Process Integration

Karen Dunnell
National Statistician, ONS

(528Kb)

Keynote: Towards a single continuous population survey for the UK

The paper will discuss ONS plans to redesign its existing continuous household surveys (GHS, EFS, LFS, Omnibus) into a single module-based survey. It will cover the rationale, methodology, efficiency and statistical benefits.

Allyson Seyb
Statistics New Zealand

(355Kb)

Statistics New Zealand's Longitudinal Business Frame

This paper introduces Statistics NZ's new longitudinal research database, the Longitudinal Business Frame (LBF), and describes the use of probabilistic matching in the LBF. The LBF, with its economy-wide coverage and basic data items (attributes) such as employment, location, industrial activity and ownership relationships, is a rich source of longitudinally linked business data. The LBF has information on business activity at both the plant ('establishment') level and at the enterprise level.
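
As an illustration of what probabilistic matching involves, the sketch below scores candidate record pairs with Fellegi-Sunter style agreement weights. It is a minimal example under stated assumptions, not the method used in the LBF: the field names, the m- and u-probabilities and the sample records are all hypothetical.

```python
import math

# Hypothetical m- and u-probabilities for three comparison fields.
# m = P(field agrees | the two records are the same business)
# u = P(field agrees | the two records are different businesses)
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "industry": (0.85, 0.20),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum the log2 likelihood ratios over all comparison fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


# Hypothetical business records from two time periods.
a = {"name": "acme ltd", "postcode": "6011", "industry": "C25"}
b = {"name": "acme ltd", "postcode": "6011", "industry": "C24"}
c = {"name": "zenith co", "postcode": "9016", "industry": "C25"}

weight_ab = match_weight(a, b)  # agrees on name and postcode
weight_ac = match_weight(a, c)  # agrees on industry only
```

Pairs with a high total weight would be treated as links and pairs with a low weight as non-links; the thresholds between the two are a design choice that this sketch leaves out.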

Philip Cookson & Jason Sobell
Philology Pty Ltd., Australia

(535Kb)

The Architectural Design of a Survey Questionnaire and Respondent Data Repository:
Practical Considerations

This paper will examine the technical requirements for the design of a survey questionnaire and respondent data repository capable of efficiently storing, retrieving, and analyzing survey questionnaire and respondent data, and explain the application of the system for facilitating cross-wave and cross-study data analysis of market research survey results.

Kevin Wavell
Technical Director, TGI Surveys, BMRB Ltd

(1.2Mb)

The Role of Software as a value added tool in Survey Research

TGI is a large, continuous single-source survey with a history going back more than 35 years. It collects data on all aspects of purchasing, behaviour, attitude and media consumption and the delivery of the survey database has taken full advantage of technical and software advancement over the period, particularly with regard to maximising the use of the published database.

TGI has had access to a wide range of software and has used this to assist users to find ways of understanding the data, and it continues to strive to find innovative solutions to some of the problems arising from the successful expansion of the product.

This paper aims to cover aspects of the software involved in using and re-using TGI data, and will examine more closely some examples of the techniques used.

Session 2, organised by the Royal Statistical Society
Methodology and Software for Complex Models

Nicky Best
Imperial College, London
Keynote: Modelling complexity in health and social sciences:
Bayesian graphical models as a tool for combining multiple sources of information

Researchers in substantive fields such as social, behavioural and health sciences face some common problems when attempting to construct and estimate realistic models for phenomena of interest.

The available data tend to be observational rather than collected via carefully controlled experimentation, and are typically fraught with missing values, unmeasured confounders, selection biases and so on. These features often render the use of standard analyses misleading; instead a comprehensive set of inter-dependent sub-models are needed to model the data complexities and core processes that researchers want to understand. It is also invariably the case that a single dataset fails to provide all the necessary information, and many complex research questions require the combination of datasets from multiple sources. Bayesian graphical models provide a natural framework for combining a series of local sub-models, informed by different data sources, into a coherent global analysis.

This talk will introduce the key ideas behind Bayesian inference and graphical models in this context and show how they can be used to easily construct models of almost arbitrary complexity. The ideas will be illustrated by applications involving the integration of survey data, census data and routinely collected health data. The use of the WinBUGS software for Bayesian modelling will be illustrated.
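
The idea of combining information from several sources in one Bayesian analysis can be illustrated with a deliberately tiny example. The sketch below assumes a single proportion observed in two independent data sources and a conjugate Beta prior; the counts are hypothetical, and the models in the talk (fitted in WinBUGS) are far richer than this.

```python
def update_beta(alpha, beta, successes, trials):
    """Conjugate update of a Beta(alpha, beta) prior with binomial data."""
    return alpha + successes, beta + (trials - successes)


prior = (1.0, 1.0)      # flat prior on the unknown proportion
survey = (30, 200)      # hypothetical survey: 30 cases out of 200
register = (120, 1000)  # hypothetical routine data: 120 out of 1000

# Update with the survey first, then with the routine data.
after_survey = update_beta(*prior, *survey)
posterior = update_beta(*after_survey, *register)

# Pooling both sources in a single step gives the same posterior.
pooled = update_beta(
    *prior, survey[0] + register[0], survey[1] + register[1]
)

posterior_mean = posterior[0] / (posterior[0] + posterior[1])
```

The posterior from one source acts as the prior for the next, which is the simplest form of the coherent combination of sub-models that the abstract describes.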

Bill Browne
University of Nottingham

(217Kb)

MCMC Estimation for random effect modelling - The MLwiN experience

Multilevel models, and their extensions to other random effect models that account for the underlying dependence structure of the data, have become very popular in many application areas.

I first came across random effect models in my PhD studies when I compared Bayesian (MCMC) methods to standard likelihood-based methods for fitting multilevel models. As a by-product of my PhD I added some basic MCMC functionality to the multilevel modelling package MLwiN and my research since has often focussed on building on this work.

In this talk I will contrast MCMC and likelihood-based methods for complex models and focus on the ease of model extension offered by MCMC methods. I will discuss the incremental approach that has been used in MLwiN development and concentrate in particular on two extensions to the multilevel model: cross classifications and multiple membership models. I will end my talk by discussing further work that is currently starting in various research projects associated with the MLwiN development team, including multilevel factor modelling, models with responses at various levels, missing data through multiple imputation and sample size calculations for complex random effect models.

Danny Pfeffermann
Hebrew University, Israel, and University of Southampton, U.K.
Bénédicte Terryn
UNESCO Institute for Statistics, Montreal, Canada
Fernando Moura
Federal University of Rio de Janeiro, Brazil

(161Kb)

Small Area Estimation under a Two Part Random Effects Model with Application to Estimation of Literacy in Developing Countries

The UNESCO Institute for Statistics has initiated a programme to collect data on the level of literacy of adults in developing countries. This will involve conducting small-scale surveys in a few countries that will consist of giving interviewees aged 15+ a test to measure their literacy score. One of the main objectives of these surveys is to obtain summary measures of literacy levels in small geographical areas for which only very small samples would be available, thus requiring the use of model based small area estimation methods.

Existing methods are not suitable for this kind of data, however, because of the mixed distribution of literacy scores in developing countries. This distribution has a large peak at zero, reflecting the large proportion of adults who are illiterate, and next to this peak is an approximately bell-shaped distribution of the non-zero scores measured for the rest of the sample.

In this presentation we will develop a two-part three-level model that is suitable for this kind of data and show how to obtain the small area measures and their variances, or compute confidence intervals, based on this model. The proposed method will be illustrated using simulated data and data obtained from a similar literacy survey conducted in Cambodia.
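
The two-part structure described above (a peak at zero plus a roughly bell-shaped distribution of non-zero scores) rests on the decomposition E[y] = P(y > 0) x E[y | y > 0]. The sketch below shows only that decomposition for a single area, using hypothetical scores; it has no random effects or levels and is not the authors' estimator.

```python
from statistics import mean


def two_part_estimate(scores):
    """Return (share_nonzero, mean_of_nonzero, overall_mean) for one area."""
    nonzero = [s for s in scores if s > 0]
    p = len(nonzero) / len(scores)          # part 1: P(score > 0)
    mu = mean(nonzero) if nonzero else 0.0  # part 2: mean of non-zero scores
    return p, mu, p * mu


# Hypothetical literacy scores for ten respondents in one small area.
area_scores = [0, 0, 0, 40, 55, 0, 62, 0, 48, 0]
p, mu, est = two_part_estimate(area_scores)
```

With raw data from one area the product simply reproduces the sample mean. The benefit of modelling the two parts separately comes when each part borrows strength across areas, as in the model the presentation develops.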

David Curtis
Office for National Statistics
Ayoub Saei
Southampton Statistical Sciences Research Institute, University of Southampton, UK

(2.9Mb)

EBLUP-type Estimation of Local Authority Unemployment

As in many other countries, the Labour Force Survey (LFS) serves as the key source of national information about the UK labour market, and in particular about numbers of unemployed and associated unemployment rates. However, the small sample size of the LFS in many local authority districts (LADs) limits the use of LFS estimates of unemployment at LAD level. Application of standard methods for small area estimation based on linear models also fails in this situation because the response variable of interest (unemployed/not unemployed) is dichotomous.

An empirical best linear unbiased predictor-type (EBLUP-type) method based on a logistic model for unemployment can be used to estimate unemployment for local areas. This model is an extension of the usual linear logistic model, and includes an LAD-specific random effect in the linear predictor. Estimates of the parameters of the model, including those associated with the random effect, are obtained using maximum likelihood and restricted/residual maximum likelihood methods.

In this paper we describe how the Office for National Statistics has implemented this methodology in SAS. We also provide results from a realistic simulation study carried out by the ONS that examines the performance of these EBLUP-type estimators as well as associated estimates of their variability.
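
The logistic model with an LAD-specific random effect described above can be written, in one common notation, as follows. The symbols are assumptions introduced here for illustration: p_{ij} is the probability that individual j in LAD i is unemployed, x_{ij} is a vector of covariates, and the normal distribution for the random effect u_i is the usual choice rather than something the abstract states.

```latex
\operatorname{logit}(p_{ij})
  = \log\frac{p_{ij}}{1 - p_{ij}}
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i,
\qquad u_i \sim N(0, \sigma_u^2)
```

Setting u_i = 0 for every LAD recovers the usual linear logistic model that this model extends.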

Session 3, organised by the Association for Survey Computing
Models for Data, Metadata and Knowledge

Andrew Westlake
Survey & Statistical Computing

(302Kb)

Keynote: Combining Data and Knowledge in Models:
Promises and Problems

We collect data in order to increase our knowledge, but we always have some knowledge before we start. Our existing knowledge raises the questions for which we need more information, and it also guides us in deciding what further data to collect and how to collect it.

Models allow us to generalise from specific observed data to a wider situation. When we analyse data we (usually) update our knowledge. If we can find a formal representation for our knowledge, then a standard statistical technique provides a way to formalise the process of updating our knowledge. This can be the basis for the integration of multiple data sets that relate to different aspects of the same system.

While of general importance, this approach is the only way of developing an integrated understanding of complex systems which are too extensive to observe with a single data set. But complex methodology is difficult to understand, so we must also address the issues of convincing users from the application domain that our models are appropriate and valid, and of making the results obtained from the methodology accessible.

The talk will address these issues and illustrate them with experiences from the Opus project (www.opus-project.org) which, amongst other things, is looking at the problems of simultaneously modelling all forms of passenger movement in London. 

Ken Miller, Ekkehard Mochmann & Jostein Ryssevik
UK Data Archive

(8.7Mb)

European Unification through Initiative

Comparative social science research in Europe is hampered by the fragmentation of the scientific information space. Data, information and knowledge are scattered in space and divided by language and institutional barriers. As a consequence too much research is based on data from a single nation, carried out by a single-nation team of researchers and communicated to a single-nation audience. In order to advance interoperability, databases must be improved by metadata standards and appropriate documentation of measurement instruments. This paper will present recent developments in the field of social research, in particular the Madiera and Metadater projects, which are laying the ground for the social science GRID and have used the Data Documentation Initiative (DDI) as a building block.

Phil Edwards
School of Law, University of Manchester

(1.9Mb)

Bridging the gap – Metadata in e-social science

One of the problems for social science researchers trying to use multiple datasets is that concepts and classifications across these datasets differ.  This is not just an accident that could have been prevented with more careful planning; it is in the nature of social science concepts, which are often fuzzy and overlapping. Definitions are constructed for a purpose, and are bound up in the social practices and contexts in which they arise: we need only consider the social 'facts' represented by records of the incidence of 'drug abuse' or 'anti-social behaviour'.

The challenge is to record both these ‘facts’ and the circumstances of their production.  Social science researchers need to take a three-dimensional view of data: in terms of its underlying topic area; the claims and interactions which produced the data; and the meanings and associations which were effectively written into it.

VS Chalasani & KW Axhausen
IVT, ETH Zürich

(191Kb)

Conceptual data model for integrated transport and land-use data

Everyone involved in transport and land-use planning works with data at some stage: those who have not produced data have probably analysed it. Each transport survey is conducted for its own set of objectives. Data obtained from these surveys do not follow any specific pattern and are therefore difficult to understand. At the same time, a research organization conducts a wide variety of surveys, ranging from simple road-side interviews to complex travel diaries. These surveys can be either longitudinal or cross-sectional. Differences in methodology, design, and protocols often obscure basic differences in data among surveys. Above all, it is almost impossible to collect complete information about the existing transportation system in a single survey. Most transport surveys collect partial but highly relevant information and depend on other sources for additional information. To resolve the differences among the datasets obtained from different surveys, an attempt is made to develop a conceptual data model for integrated transport and land-use data.

Session 4, organised by the Market Research Society
Multi-Mode and Multi-Source Surveys

George Terhanian
President, HI Europe

(910Kb)

Keynote: The Design and Analysis of Research that Exploits Multiple Interviewing Modes and Multiple Data Sources:
Theoretical and Practical Advice

Reginald Baker
Market Strategies, Inc.

(1.8Mb)

Adding Value to Data Through Improved Access:
The Case for Portals

With the increasingly widespread use of the Internet have come new opportunities to leverage the value of survey data. By creating broader access and disseminating easy-to-use analytical tools we can not only make more data available to a broader set of users but also minimize the barriers that might exist among the often discrete activities of data collection, data analysis, and data reporting. One prime example in the market research world is the recent deployment of Web portals designed to deliver data more quickly, to increase the value of those data by making them easier to analyze, to deliver data and results deeper into the client’s organization, and to retain historical data for future analyses in an online archive. In the best of these systems, intuitive interfaces help users build self-documenting composite variables, case filters, weights, and report formats, as well as share the results of their analysis with other users. In some, users can design questionnaires, draw samples, and launch surveys.

Margaret Ward
Nesstar Ltd
presented by
Jostein Ryssevik

Making existing data re-usable - the requirements of a web-enabled tool

In the days when information was mostly derived from printed material, we would find the information we required at the local library. The library provided the means for information to be shared, discovered, analysed, researched and perhaps published in a different form. Today, the principles of the library still apply when looking at the re-use of data in the web-enabled world.

Mike Trotman
DataLucid, Ltd.

(1.7Mb)

Managing Complex Raw Data Linkage with XML

This paper will discuss a general approach to the process of managing and combining multiple questionnaires with data from different sources. The approach focuses primarily on the underlying task of transforming and combining disparate raw data sets into a common questionnaire and/or data format.


Programme Committee

Market Research Society: Richard Cornelius
Office for National Statistics: Tony Manners
Royal Statistical Society: Suzanne Evans, Antony Fielding, Paul Hewson
Association for Survey Computing: Randy Banks, Raz Khan (Chair), Tim Macer, Andrew Westlake
Social Research Association: Wendy Sykes


Page last updated on 13 November, 2005