Three Methods for Occupation Coding Based on Statistical Learning

Abstract: Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes; a hybrid method that combines a duplicate-based approach with a statistical learning algorithm; and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
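
The abstract does not spell out how duplicates are defined, so the following is only a rough illustration of the contrast it draws between exact string matches and n-gram variables. It assumes that two answers count as duplicates when they contain the same set of word n-grams (unigrams by default), and that a duplicate is coded with the most frequent code among its matches. The function names, the toy answers, the placeholder codes, and the optional `fallback` (standing in for a statistical learning model, as in a hybrid approach) are all hypothetical and not taken from the paper.

```python
import re
from collections import Counter, defaultdict

def ngram_key(text, n=1):
    """Return the set of word n-grams in a lowercased answer (assumed definition)."""
    tokens = re.findall(r"\w+", text.lower())
    return frozenset(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def exact_key(text):
    """Key for exact string matching: the raw answer, unchanged."""
    return text

def build_index(training, key_fn):
    """Map each key to a Counter of the codes seen for it in the training data."""
    index = defaultdict(Counter)
    for answer, code in training:
        index[key_fn(answer)][code] += 1
    return index

def code_answer(answer, index, key_fn, fallback=None):
    """Use the most frequent code among duplicates; otherwise call the fallback."""
    votes = index.get(key_fn(answer))
    if votes:
        return votes.most_common(1)[0][0]
    return fallback(answer) if fallback else None

# Toy, made-up training data; the codes are placeholders, not real occupation codes.
training = [("Software developer", "A1"), ("nurse", "B2"), ("Nurse", "B2")]

exact_index = build_index(training, exact_key)
ngram_index = build_index(training, ngram_key)

query = "developer, software"
print(code_answer(query, exact_index, exact_key))  # no exact duplicate exists
print(code_answer(query, ngram_index, ngram_key))  # same unigram set as a training answer
```

In this sketch the n-gram key ignores case, punctuation, and word order, so it finds duplicates that exact string matching misses. Whether that matches the variable definition used in the paper is an assumption.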

Location
Deutsche Nationalbibliothek Frankfurt am Main
Extent
Online resource
Language
English
Notes
Published version
peer reviewed
In: Journal of Official Statistics ; 33 (2017) 1 ; 101-122

Classification
Computer science

Event
Publication
(where)
Mannheim
(when)
2017
Creator
Gweon, Hyukjun
Schonlau, Matthias
Kaczmirek, Lars
Blohm, Michael
Steiner, Stefan

DOI
10.1515/JOS-2017-0006
URN
urn:nbn:de:101:1-2019052715483512319010
Rights information
Open Access; access to the object is unrestricted.
Last updated
14.08.2025, 10:48 CEST

Data partner

This object is provided by:
Deutsche Nationalbibliothek. For questions about the object, please contact the data partner.
