Classifying businesses by economic activity

Exploring the suitability of web-based text mining to classify businesses by economic activity.
Determining the economic activity of businesses is a tedious task for National Statistical Institutes. The present study evaluates the suitability of text mining to classify businesses by economic activity based on their websites. We focus on a case study in which a population of businesses has been labelled according to a classification with 9 economic top-sectors and 29 sub-sectors.

We evaluated a number of methodological aspects of the machine learning techniques: different types of feature selection, different word-weighting methods and different classifiers. Further, we varied the conditions under which we applied the text mining; for instance, we compared the performance for one-man businesses with that for larger businesses.
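To make the role of the word-weighting choice concrete, the following is a minimal sketch, not the method used in the study. It assumes a nearest-centroid classifier with cosine similarity and lets the weighting be switched between raw term counts and TF-IDF. All function names, the toy website texts and the labels "retail" and "ict" are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a deliberately crude choice).
    return [w for w in text.lower().split() if w.isalpha()]

def fit_idf(token_lists):
    # Smoothed inverse document frequency per word seen in training.
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def weight(tokens, idf=None):
    # idf=None gives raw counts; otherwise TF-IDF, dropping unseen words.
    counts = Counter(tokens)
    if idf is None:
        return {w: float(c) for w, c in counts.items()}
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_model(texts, labels, use_idf=True):
    # Sum the weighted vectors per label; cosine similarity ignores scale,
    # so the sum serves as the class centroid.
    token_lists = [tokenize(t) for t in texts]
    idf = fit_idf(token_lists) if use_idf else None
    centroids = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(token_lists, labels):
        for w, v in weight(tokens, idf).items():
            centroids[label][w] += v
    return idf, dict(centroids)

def classify(text, model):
    idf, centroids = model
    vec = weight(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical toy data standing in for scraped website text.
TEXTS = [
    "fresh bread bakery cakes pastry",
    "bakery shop bread delivery",
    "software consulting cloud development",
    "web development software agency",
]
LABELS = ["retail", "retail", "ict", "ict"]

if __name__ == "__main__":
    for use_idf in (False, True):
        model = build_model(TEXTS, LABELS, use_idf=use_idf)
        print(use_idf, classify("bread and cakes shop", model))
```

Swapping `use_idf` changes only the weighting step, which is the kind of controlled comparison the paragraph above describes; a different classifier or a feature-selection step could be substituted in the same way.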

We obtained an accuracy of 51% for our best-performing method at top-sector level, while the accuracy at sub-sector level was much lower. In the discussion, we present several ideas to improve the performance.