Open-ended surveys allow respondents to answer in their own words, without limiting the linguistic form or semantic content of their answers. They are, however, costly and slow to process, since coding them requires trained professionals to manually identify and tag meaningful text segments. To overcome these problems, a few automatic approaches have been proposed in the past, some based on matching the answer with textual descriptions of the possible codes, others based on manually building categorization rules that check the answer for the presence or absence of code-revealing words. The former approach is scarcely effective; the major drawback of the latter is that the categorization rules must be developed manually, before any text data has actually been observed. Manual engineering of categorization rules is expensive, and so is their maintenance, since e.g. adding a new code, deleting an existing one, or catering for the changed meaning of yet another one may require a manual revision of the entire set of rules.
We propose a new approach, inspired by text categorization work in information retrieval, that overcomes these drawbacks. In this approach survey coding is viewed as a task of multiclass text categorization (MTC), and is tackled through techniques originally developed in the field of supervised machine learning [4]. In MTC, each text in a set has to be classified into exactly one of a set of predefined categories [5]. In the supervised machine learning approach to MTC, a set of categorization rules is built automatically by learning, for each category in the set, the characteristics that a text should have in order to be classified under it. Such characteristics are automatically learnt from a set of training examples, i.e. a set of texts whose membership or non-membership in each category is known. For survey coding, we equate the set of codes (predefined in order to code answers to a given question) with categories, and all the collected answers to a given question with texts. We have carried out automatic coding experiments with two different supervised learning techniques, one based on naive Bayesian categorization [3] and the other based on multiclass support vector machines [2]. Our experiments have been run on a corpus of social surveys carried out by the National Opinion Research Center, University of Chicago (NORC) [1]. These experiments show that our methods based on automatic rule construction outperform, in terms of accuracy, previous methods based on manual rule construction that had been tested on the same corpus.
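To make the idea of survey coding as supervised multiclass text categorization concrete, the following is a minimal sketch of a multinomial naive Bayes coder with add-one smoothing. It is an illustration under stated assumptions, not the implementation used in the experiments: the class name `NaiveBayesCoder`, the `tokenize` helper, the codes `ECONOMY` and `HEALTH`, and the toy answers are all hypothetical, and the support vector machine variant is not shown.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


class NaiveBayesCoder:
    """Hypothetical multinomial naive Bayes coder (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant (assumed value)

    def fit(self, answers, codes):
        # Training examples: answers whose correct code is already known.
        self.code_counts = Counter(codes)
        self.word_counts = defaultdict(Counter)
        for answer, code in zip(answers, codes):
            self.word_counts[code].update(tokenize(answer))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, answer):
        # log P(code) + sum of log P(word | code); unseen words are ignored.
        total = sum(self.code_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = [t for t in tokenize(answer) if t in self.vocabulary]
        scores = {}
        for code, n in self.code_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            scores[code] = score
        return scores

    def predict(self, answer):
        # Multiclass setting: exactly one code is assigned to each answer.
        scores = self.log_scores(answer)
        return max(scores, key=scores.get)


# Toy training data (invented for illustration, not taken from the NORC corpus).
train_answers = [
    "I worry about losing my job",
    "wages are too low and work is scarce",
    "unemployment keeps rising",
    "hospitals are overcrowded",
    "I cannot afford my medical bills",
    "doctors and medicine cost too much",
]
train_codes = ["ECONOMY", "ECONOMY", "ECONOMY", "HEALTH", "HEALTH", "HEALTH"]

coder = NaiveBayesCoder().fit(train_answers, train_codes)
print(coder.predict("medical bills are too high"))
print(coder.predict("afraid of losing my job"))
```

In this sketch, adding or removing a code only requires changing the training examples and refitting, which is the maintenance advantage over hand-written rules that the abstract argues for.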