Skills in demand for ICT and statistical occupations: evidence from web-based job vacancies. (English) Zbl 07260732

Summary: Online job portals that collect web vacancies have become an important medium for matching job demand and supply. They also represent a growing research area in which analytical methods are applied to innovative data sources to study the labour market. This paper analyses Italian web job vacancies scraped from several types of Italian web job portals between June and September 2015. After describing how the occupation associated with each web vacancy was identified (classification up to level 4) and how the related skills were retrieved from the texts using a mix of supervised and unsupervised text mining approaches, we focus on job vacancies for ICT and statistical positions.
The principal aim of this paper is to describe these jobs in terms of the required skills, as they emerge from the demand side of the labour market, and to identify the skills that best distinguish statisticians from other ICT occupations. To this end, several machine learning techniques were used to assess which skills best distinguish each occupation code from the other job groups.
After quality control and removal of duplicates, the scraping yielded more than 110,000 job advertisements, of which nearly 6,200 were classified as ICT or statistical positions (largely dominated by software developers). The data indicate that high-level statisticians have superior and heterogeneous professional backgrounds linked to theoretical statistics, in which analytic skills are more relevant than computing skills. Many soft and management-oriented skills are also called for; these are missing among lower-level statisticians, who are restricted to more technical jobs oriented towards general computing and informatics.
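
To illustrate what "skills that best distinguish" one occupation from the others can mean in practice, the following minimal sketch ranks skills by a smoothed log-odds ratio between a target occupation and all remaining vacancies. It is not the method used in the paper, which applies several machine learning techniques; the function name `rank_distinguishing_skills`, the `smoothing` parameter, and the toy vacancies and skill labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def rank_distinguishing_skills(vacancies, target, smoothing=0.5):
    """Rank skills by how strongly they separate `target` from other occupations.

    `vacancies` is a list of (occupation_label, set_of_skills) pairs.
    The score is a smoothed log-odds ratio: positive values mean the skill
    is more typical of `target`, negative values mean it is more typical
    of the remaining occupations. This is an illustrative stand-in, not the
    paper's procedure.
    """
    n_target = sum(1 for occ, _ in vacancies if occ == target)
    n_other = len(vacancies) - n_target

    # Count in how many vacancies of each group a skill appears.
    in_target, in_other = Counter(), Counter()
    for occ, skills in vacancies:
        (in_target if occ == target else in_other).update(set(skills))

    scores = {}
    for skill in set(in_target) | set(in_other):
        a = in_target[skill] + smoothing             # target, skill present
        b = n_target - in_target[skill] + smoothing  # target, skill absent
        c = in_other[skill] + smoothing              # others, skill present
        d = n_other - in_other[skill] + smoothing    # others, skill absent
        scores[skill] = math.log((a * d) / (b * c))

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical, not taken from the paper's vacancies).
vacancies = [
    ("statistician", {"statistical modelling", "R", "communication"}),
    ("statistician", {"statistical modelling", "SQL"}),
    ("statistician", {"statistical modelling", "communication", "python"}),
    ("software developer", {"java", "SQL"}),
    ("software developer", {"java", "python"}),
    ("software developer", {"java", "SQL", "communication"}),
]

ranked = rank_distinguishing_skills(vacancies, target="statistician")
for skill, score in ranked:
    print(f"{skill:25s} {score:+.2f}")
```

A univariate score such as this ignores interactions between skills; penalised regression or tree ensembles, as cited in the paper's reference list, would assess skills jointly.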


62-XX Statistics
68-XX Computer science



