Innovation in small businesses

How can we get a good idea of all the innovative businesses in the Netherlands? Statistics Netherlands (CBS) has researched this question at the Center for Big Data Statistics. CBS currently surveys only companies that have more than ten employees about their level of innovation, an approach which by definition excludes a great many Dutch companies. To make it possible to collect information about the other businesses too, a big data method has been developed that analyses the text on a company’s website. This ‘web scraping’ method is mainly useful for identifying small innovative businesses, such as start-ups.


The text on the website’s homepage is used to decide whether or not a business is innovative. Punctuation marks and common general terms are removed from the text on each site, and the remaining words form the initial dataset for the development of an algorithm that can distinguish between innovative and non-innovative businesses. Because we know which of the businesses in the CBS innovation survey are innovative and which are not, we use the websites of these larger companies to train the algorithm. Ultimately, this produces a list of words that are important when classifying innovation, such as ‘technology’, ‘new product’, ‘innovation’ and ‘software’. The language in which the website is presented is another important indicator. A company whose website is in English is statistically more likely to be innovative than a business with a Dutch website. Some words actually indicate that a business is not innovative; these include ‘shop’, ‘transport’, ‘restaurant’ and ‘service’. Of course, this does not necessarily mean that a shop can never be innovative; the combination of the other words on the website is also relevant. The latest version of the algorithm has been shown to be able to identify the innovativeness of large companies’ websites with 93% accuracy.
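
The steps described above can be illustrated with a minimal sketch. It is not the CBS algorithm, whose details are not given here: the stop-word list, the toy training texts and the choice of a naive Bayes classifier are all assumptions made for illustration only.

```python
import math
import string
from collections import Counter

# Hypothetical stop-word list; the actual list of "common general terms" is not given.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "we", "our", "for", "is"}

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and drop common general terms."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def train(labelled_pages):
    """Count word occurrences per class from (text, is_innovative) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labelled_pages:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def log_odds(text, word_counts, doc_counts):
    """Naive Bayes log-odds (innovative vs. not) with add-one smoothing."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    score = math.log(doc_counts[True] / doc_counts[False])
    for w in tokenize(text):
        p_true = (word_counts[True][w] + 1) / (totals[True] + len(vocab))
        p_false = (word_counts[False][w] + 1) / (totals[False] + len(vocab))
        score += math.log(p_true / p_false)
    return score

def is_innovative(text, word_counts, doc_counts):
    """Classify a homepage text as innovative when the log-odds are positive."""
    return log_odds(text, word_counts, doc_counts) > 0

# Toy stand-in for homepages of surveyed companies with a known innovation status.
training = [
    ("We build new software and technology.", True),
    ("Innovation in sensor technology: a new product.", True),
    ("Visit our shop for fresh bread.", False),
    ("Restaurant and transport service in town.", False),
]
model = train(training)
print(is_innovative("New software technology", *model))
print(is_innovative("Shop and transport service", *model))
```

In this sketch, words such as ‘technology’ and ‘software’ push the score towards innovative, while ‘shop’ and ‘service’ push it the other way. The decision depends on the combination of all the words in the text, not on any single one.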


The next step was to select half a million companies with fewer than ten employees from CBS’ business register. The text of these companies’ websites was then collected and classified using the algorithm. We did not know in advance whether these businesses were innovative or not, but a prediction was made based on the algorithm’s results. A manual check of a large section of the results confirmed that the algorithm also works well on small companies’ websites. Its functionality was also checked against the Innovation Top 100 for small and medium-sized enterprises (SMEs) and using the websites of start-ups. In both cases, the algorithm proved to be able to accurately classify a very large number of businesses as innovative. The approach that had been developed worked especially well in relation to companies with a high level of technological innovation. Our initial findings indicate that more than a third of the 500,000 websites can be classified as innovative.
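
The two quantities mentioned above, the share of websites classified as innovative and the agreement with a manual check, could be computed along the following lines. This is only a sketch: the site names, predictions and reviewer labels are invented for illustration and are not CBS data.

```python
def share_innovative(predictions):
    """Fraction of websites that the classifier predicts to be innovative."""
    return sum(predictions.values()) / len(predictions)

def agreement(predictions, manual_labels):
    """Fraction of manually checked sites where prediction and reviewer agree."""
    checked = [site for site in manual_labels if site in predictions]
    hits = sum(predictions[site] == manual_labels[site] for site in checked)
    return hits / len(checked)

# Hypothetical classifier output for five small-business websites.
predictions = {
    "a.example": True,
    "b.example": False,
    "c.example": True,
    "d.example": False,
    "e.example": False,
}
# Hypothetical manual check of a subset of those sites.
manual = {"a.example": True, "b.example": False, "c.example": False}

print(share_innovative(predictions))
print(agreement(predictions, manual))
```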

The maps show information about more than half a million small businesses, displayed at the level of the province and the municipality. The businesses’ post codes were used to achieve this. The provinces with the most innovative businesses, both small and large, are Noord-Holland, Zuid-Holland and Noord-Brabant.

Innovative business provinces

However, especially in comparison with the larger companies, slightly more small innovative businesses were found in the other provinces. Good-quality data was previously lacking about this group. The new method that has been developed makes it possible to draw up more detailed maps, for instance at a municipal level, revealing the areas of the Netherlands with a relatively large number of small innovative businesses.

Innovative businesses

These areas are mainly to be found in the large cities, particularly Amsterdam and Rotterdam, and in municipalities with universities and universities of applied sciences, especially technical universities. Please note that the maps show the absolute number of companies, and therefore do not indicate how many people are employed at these companies. In other words, a tiny innovative start-up with a single worker counts for the same as a company with nine employees. Nor does this study look at the amount of investment made in innovation.


The link between businesses and websites presents a challenge. To make a good analysis it is important to link the right company to the right website, which is not always easy, especially with small businesses. That is why the links are also checked using a method developed jointly by several statistical agencies in a large European Big Data project. In addition, not every website was shown to be active. These factors make it difficult to make an accurate estimate at this time of the absolute number of small innovative businesses in the Netherlands, and the amount of investment they attract. Further research will be done into these aspects in a follow-up study.


CBS innovation survey data was used to develop the method, together with texts on websites and the associated post codes. The information used was then aggregated at a municipal and provincial level to prevent individual companies from being identified.
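
As an illustration of this aggregation step, the sketch below counts innovative businesses per municipality and per province on the basis of their post codes. The lookup table and the sample records are hypothetical. A real implementation would need a complete mapping from post codes to municipalities, and possibly additional disclosure-control rules that are not described here.

```python
from collections import Counter

# Hypothetical lookup from four-digit post code to (municipality, province).
POSTCODE_AREAS = {
    "1012": ("Amsterdam", "Noord-Holland"),
    "3011": ("Rotterdam", "Zuid-Holland"),
    "5611": ("Eindhoven", "Noord-Brabant"),
}

def aggregate(businesses):
    """Count innovative businesses per municipality and per province.

    `businesses` is an iterable of (post code, is_innovative) pairs.
    Only the counts are kept, so no individual company appears in the output.
    """
    by_municipality, by_province = Counter(), Counter()
    for postcode, innovative in businesses:
        area = POSTCODE_AREAS.get(postcode[:4])
        if area is None or not innovative:
            continue  # skip unknown post codes and non-innovative businesses
        municipality, province = area
        by_municipality[municipality] += 1
        by_province[province] += 1
    return by_municipality, by_province

# Invented sample records: (post code, classified as innovative).
sample = [
    ("1012AB", True),
    ("1012CD", True),
    ("3011XY", False),
    ("5611ZZ", True),
    ("9999AA", True),
]
by_municipality, by_province = aggregate(sample)
print(by_municipality)
print(by_province)
```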


The classification of companies based on the text on their websites has been shown to work well for innovation. This makes it possible to draw up extremely detailed maps of areas that are home to small innovative businesses, which is of particular interest to municipalities and provinces. For large cities, it is even possible to make maps of a given post code area. The development of this approach means that it is now possible to find small innovative businesses, such as start-ups, and to follow them over time. This will help to establish the impact of incentive policies for innovation. The method is also likely to be used in other fields, such as identifying sustainable companies and family-run businesses.


We look forward to hearing your views about this innovation and about its potential applications, and we are always open to ideas to help us further refine this web scraping method. Please send us your feedback using the form below.