Assigning industry labels to companies is a manual, costly, slow and error-prone process carried out by data providers. To solve this problem, we developed a fully automated, machine-learning-based industry classifier that assigns WZ-2008 labels to companies. Our classifier relies solely on company purposes, which we extracted from the German company register. It consists of a natural language processing step followed by a bag-of-words-based elastic-net classifier. This classifier lets us predict the industry of a company in real time, raises our data coverage by 30 %, and lowers the cost of data acquisition to a fraction of that of conventional methods.
The state of the art for enriching and enhancing data is to hire click-workers who extract information from unstructured data sources. This manual labeling process is slow, costly and error-prone because click-workers often lack experience and skill. At Implisense we developed a machine learning model driven by natural language processing to overcome these shortcomings and improve the quality of our data points. This automated system lets us classify newly founded companies in no time and at zero marginal cost.
We have been working on a machine-learning-based industry classifier specifically designed for small companies and startups. Our classifier’s main ingredient is the company purpose, which every company is legally obliged to submit to the German company register (see the example below, in the German original and in English translation).
|(a) die Herstellung und der Vertrieb von Verpackungen, Verpackungsmaterialien und Behältern aller Art, insbesondere Intermediate Bulk Containern (IBC), und sonstiger Erzeugnisse aus Kunststoff und anderen synthetischen Materialien, aus Metall, Papier und Verbundstoffen insbesondere für die Industrie, (b) die Entwicklung aller hiermit zusammenhängenden Verfahren, insbesondere unter der eingetragenen Marke XXX, (c) die Beratung in Bezug auf die aufgeführten Tätigkeiten sowie (d) die Erbringung von Dienstleistungen und die Vergabe von Lizenzen oder sonstigen Rechten an Beteiligungsgesellschaften und ähnliche Unternehmen sowie für Auftraggeber aller Art einschließlich aller sonstigen in diesem Zusammenhang anfallenden Geschäfte.
||(a) the manufacture and distribution of packaging, packaging materials and containers of all kinds, in particular Intermediate Bulk Containers (IBCs), and other articles made of plastics and other synthetic materials, metal, paper and composites, in particular for industry, (b) the development of all related procedures, in particular under the registered trade mark XXX, (c) advice in relation to the listed activities and (d) the provision of services and the granting of licences or other rights to affiliated companies and similar companies as well as for clients of all kinds, including all other transactions arising in this context.
As our classification system we use the WZ-2008 schema, which incorporates the European NACE standard. We trained a machine learning model on a dataset of roughly 650,000 company purposes and their associated industry labels (the target variable). From a machine learning perspective, this is a hierarchical, multi-label classification problem. There are 88 fundamental categories, which split in a tree-like manner into more specific economic branches. Each company can carry up to three industry labels.
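To make the multi-label setup concrete, here is a minimal sketch of how hierarchical codes could be reduced to their two-digit category and turned into one binary target per category. The dotted code format, the example codes and the function names are assumptions for illustration, not our production data model.

```python
def division(code):
    """Return the two-digit top-level category of a dotted hierarchical code.

    Assumes codes look like "62.01": everything before the first dot is the
    two-digit category.
    """
    return code.split(".")[0]


def binary_targets(company_codes):
    """Build one 0/1 target vector per two-digit category (multi-label)."""
    divisions = [{division(code) for code in codes} for codes in company_codes]
    all_labels = sorted(set().union(*divisions)) if divisions else []
    return {
        label: [int(label in company) for company in divisions]
        for label in all_labels
    }


# Hypothetical example: three companies, each with up to three labels.
companies = [["22.22", "70.22"], ["62.01"], ["62.02", "62.01", "70.22"]]
targets = binary_targets(companies)
```

Each entry of `targets` can then serve as the target variable of one binary classifier, which is how a multi-label problem is commonly broken down.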
Natural Language Processing Approach
We experimented with different types of classifiers and concluded that a bag-of-words-based logistic regression classifier best meets the requirements of the problem. Random forest classifiers are not well suited because of the high dimensionality and sparsity of the one-hot encoded text data. Nonlinear classifiers, such as neural networks, do not give much of an improvement due to the high dimensionality of the data. A classifier based on “bag-of-words” focuses on word occurrences but ignores the order in which the words appear in the text. Features are extracted from single words and then “thrown into a bag”, that is, an unordered collection of features. The following processing steps make up the feature-extraction pipeline, which transforms the textual data into a form that the classifier can process.
- tokenization (splitting up the text into single words, i.e. tokens)
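As a rough illustration of tokenization and the resulting bag-of-words, consider the following sketch. It uses only the Python standard library. The regular expression and the function names are assumptions chosen for this example, and lowercasing is folded into the bag-building step for brevity.

```python
import re
from collections import Counter

# Assumption: a token is any run of letters or digits. Note that this
# discards hyphens, so a step that reassembles hyphenated enumerations
# would have to run before it.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Split a text into single words (tokens), keeping their order."""
    return TOKEN_PATTERN.findall(text)


def bag_of_words(text):
    """Count lowercased tokens; the order of the words is thrown away."""
    return Counter(token.lower() for token in tokenize(text))


bag = bag_of_words("Die Herstellung und der Vertrieb von Verpackungen und Behältern")
```

Two texts that contain the same words in a different order yield the same bag, which is exactly the information this representation gives up.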
The table shows examples for some of our feature-extraction steps. To improve the classifier’s ability to generalize, we use a “compound word splitter” as a special treatment for German compound words. Compound words that were torn apart by enumerations are reassembled by our “compound word composer”.
|compound word composing
||Regie- sowie Film-, Fernseh- und Multimedia- und Videoproduktionen -> Multimediaproduktionen Fernsehproduktionen Filmproduktionen Regieproduktionen
Beton- sowie Stahlbetonbauertätigkeit -> Betonbauertätigkeit
|compound word splitting
||Süßwarenerzeugnissen -> [Süßwaren, erzeugnissen]
Stahlbetonbauertätigkeit -> [Stahl, beton, bauer, tätigkeit]
Fernsehproduktion -> [Fernseh, produktion]
Multimediaproduktion -> [Multimedia, produktion]
|stemming & lowercasing
||Demontagearbeiten -> demontagearbeit
Gebäude -> gebaud
Erbringung -> erbring
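The idea behind the compound word composer and the compound word splitter can be sketched in a few lines. This is a toy version under strong assumptions: `KNOWN_HEADS` is a hypothetical mini-lexicon, and the function names are made up for this example. A real implementation needs a much larger lexical resource and more careful handling of edge cases.

```python
import re

# Hypothetical mini-lexicon of compound heads, longest entries first.
KNOWN_HEADS = ("produktionen", "produktion")

# A dangling prefix is a word ending in a hyphen that is followed by
# whitespace or a comma, as in "Film-, Fernseh- und ...".
DANGLING_PREFIX = re.compile(r"(\w+)-(?=\s|,)")


def find_head(word):
    """Return the known head that the word ends with, or None."""
    lower = word.lower()
    for head in KNOWN_HEADS:
        if lower.endswith(head) and len(lower) > len(head):
            return head
    return None


def split_compound(word):
    """Split a compound into [modifier, head] if the head is known."""
    head = find_head(word)
    if head is None:
        return [word]
    return [word[:-len(head)], word[-len(head):]]


def compose(enumeration):
    """Reassemble compounds that were torn apart by an enumeration."""
    words = re.findall(r"\w+", enumeration)
    if not words:
        return []
    last = words[-1]
    head = find_head(last)
    prefixes = DANGLING_PREFIX.findall(enumeration)
    if head is None:
        return [last]
    return [prefix + head for prefix in prefixes] + [last]


composed = compose("Film-, Fernseh- und Videoproduktionen")
parts = split_compound("Fernsehproduktion")
```

The composer borrows the head of the last, complete compound and attaches it to every dangling prefix, so that each enumerated item becomes a feature of its own.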
The following visualization shows a raw text (on the left) and its corresponding bag-of-words representation (on the right):
|Die Verwertung von Urheberrechten, das Ausarbeiten von Ideen für Drehbücher, die Übernahme von Regie- sowie Film-, Fernseh- und Video- und Multimediaproduktionen und alle damit verwandten Geschäfte.
||verwert urheberrecht , ausarbeit ide fur drehbuch , ubernahm regi – sowi film – , fernseh – video – multimediaproduktion all damit verwandt geschaft . videoproduktion . fernsehproduktion . filmproduktion . regieproduktion multimedia produktion
The classifier is trained to assign a weight to each feature and thereby learns to categorize text. In the following visualization the learned feature weights are encoded as font size. Shown are the features relevant to the WZ-2008 label J62 (information technology services).
Machine Learning Classifier
Without further precautions, machine learning models tend to “over-fit” the training set. This effect often occurs when the dimensionality of the data is not significantly smaller than the size of the training set. From a mathematical perspective, training a machine learning model means minimizing some kind of error function (for example, the number of misclassifications). Adding a regularization term, for example the sum of the absolute values of the model’s learnable parameters (l1-norm), drives many weights to exactly zero and thus leads to feature-selecting behavior in a linear classifier.
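Written out, an l1-regularized logistic regression objective could look as follows. The notation is an assumption made for this illustration: $x_i$ is the bag-of-words vector of company $i$, $y_i \in \{-1, +1\}$ states whether the company carries the label in question, $w$ and $b$ are the learnable parameters, and $\lambda > 0$ controls the strength of the regularization.

$$
\min_{w,\,b} \; \sum_{i=1}^{N} \log\!\left(1 + \exp\!\left(-y_i \,(w^\top x_i + b)\right)\right) + \lambda \lVert w \rVert_1,
\qquad
\lVert w \rVert_1 = \sum_{j=1}^{d} \lvert w_j \rvert
$$

An elastic-net penalty additionally includes a squared l2 term, for example $\lambda \left( \alpha \lVert w \rVert_1 + \tfrac{1-\alpha}{2} \lVert w \rVert_2^2 \right)$ with a mixing parameter $\alpha \in [0, 1]$, so that $\alpha = 1$ recovers the pure l1 case.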
In contrast to the naive Bayes classifier, the l1-regularized logistic regression classifier inherently performs feature selection and can thus handle features that carry similar information about the label of interest. The naive assumption of conditionally independent features does not hold for text classification problems, because synonymous words are interchangeable and inherently related terms tend to co-occur.
Results and Conclusions
We trained our machine learning models as one-vs-all classifiers with respect to a manually labeled target variable. We observed strong variation in classification performance across industry labels. The histogram below depicts the F-scores over the 88 two-digit industry labels.
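For readers who want to reproduce this kind of per-label evaluation, here is a minimal sketch of how precision, recall and F1 can be computed for a single label in a multi-label setting. The label sets below are invented for illustration and are not taken from our data.

```python
def precision_recall_f1(true_sets, predicted_sets, label):
    """Evaluate one label as a binary (one-vs-all) problem."""
    tp = fp = fn = 0
    for true, predicted in zip(true_sets, predicted_sets):
        in_true = label in true
        in_predicted = label in predicted
        if in_true and in_predicted:
            tp += 1
        elif in_predicted:
            fp += 1
        elif in_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: four companies with true and predicted label sets.
truth = [{"J62"}, {"J62", "M69"}, {"M69"}, {"S96"}]
predictions = [{"J62"}, {"M69"}, {"M69", "J62"}, set()]

scores = {
    label: precision_recall_f1(truth, predictions, label)
    for label in ("J62", "M69", "S96")
}
```

Computing the score separately for each label is what makes differences between easy and hard industries visible; a single averaged number would hide them.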
The classifier’s ability to predict meaningful industry classes depends strongly on the heterogeneity of the data it was trained on. Depending on the vocabulary and wording used to describe a company’s purpose, some industries are harder to categorize than others. Another source of uncertainty is disagreement between human raters, which we believe varies strongly between classes. For industries like M69 (legal and accounting activities) or K64 (financial services) we reach a comparatively good F1-score of 82%, while a sector like S96 (provision of miscellaneous predominantly personal services) only reaches an F1-score of 27%. The table below compares the classification performance of the best and the most challenging industry sectors:
In our production setting we tuned the model’s performance toward high precision (at the cost of lower recall). This way we don’t have to completely abandon predictions for challenging sectors like S96. Future work lies in extending our NLP preprocessing pipeline with dependency-parse tags and/or word embeddings obtained from pretrained models, as well as in applying methods from the field of deep learning.
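One common way to trade recall for precision is to raise the decision threshold on the classifier’s score. The sketch below shows that idea on invented scores. The function name and the selection rule are assumptions for illustration and do not describe our exact tuning procedure.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Return the lowest threshold whose precision meets the target.

    A company is predicted positive if its score is >= the threshold.
    Returns None if no threshold reaches the target precision.
    """
    best = None
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        if tp + fp and tp / (tp + fp) >= target_precision:
            best = threshold  # keep going: lower thresholds give more recall
    return best


# Invented classifier scores and true labels for one industry.
example_scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]

strict = threshold_for_precision(example_scores, example_labels, 0.9)
relaxed = threshold_for_precision(example_scores, example_labels, 0.7)
```

Choosing the lowest threshold that still meets the precision target keeps as much recall as possible, which is why a hard sector can still contribute some confident predictions.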
Written by Tilo Himmelsbach.