Ben Burtenshaw

Ben Burtenshaw · 2025-09-17T14:16:09.791Z

IBM just released a useful open source model for processing unstructured documents. It's based on their latest Granite models and integrates with docling (the library for document processing).

- it's 285M parameters, so it runs locally.
- optimised for equations and scientific graphics.
- flexible inference modes: either full page or constrained areas.
- document QA, like "what's the fastest selling product".
- multilingual support.

Take it for a spin on this space: https://lnkd.in/er-M6v6a

Or you can just do this: `docling --to html --to md --pipeline vlm --vlm-model granite_docling "https://lnkd.in/eZ7GAaz4"`
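For anyone who would rather script the command from the post than type it, here is a minimal Python sketch. It assumes the `docling` CLI is installed and on your PATH. The helper names `build_docling_command` and `run_docling` are hypothetical, and the flags are copied from the command in the post rather than checked against a particular docling version.

```python
import shutil
import subprocess


def build_docling_command(source, formats=("html", "md"), vlm_model="granite_docling"):
    """Build the argv list for the docling CLI call quoted in the post."""
    command = ["docling"]
    for fmt in formats:
        # One --to flag per requested output format, as in the post.
        command += ["--to", fmt]
    # Flags below are taken verbatim from the post (assumed, not verified).
    command += ["--pipeline", "vlm", "--vlm-model", vlm_model, source]
    return command


def run_docling(source):
    """Run the conversion, but only if the docling executable is available."""
    if shutil.which("docling") is None:
        raise RuntimeError("docling CLI not found on PATH")
    return subprocess.run(build_docling_command(source), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works offline.
    print(" ".join(build_docling_command("https://lnkd.in/eZ7GAaz4")))
```

Building the argument list separately from running it keeps the command easy to inspect and avoids shell quoting issues with the URL.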

Antwerp, Flemish Region, Belgium
34K followers 500+ connections


Hugging Face

University of Antwerp

About

Right now, I'm working on educational and learning material at Hugging Face which teaches…

Activity

Congrats to Dr. Sasha Luccioni, Margaret Mitchell, Clem Delangue 🤗 & Julien Chaumond for making the Observer's Top 100 AI Power Index. They and the…

Liked by Ben Burtenshaw
Open and free are two distinct concepts. If you want a free and open source experiment tracker with reliable tech, trackio is here to serve you…

Liked by Ben Burtenshaw
I've been using trackio a few months now and I've never felt better. I have more energy. My skin is clearer. My eye sight has…

Posted by Ben Burtenshaw

Experience

Hugging Face

Paris, Île-de-France, France

Paris, Île-de-France, France

Antwerp, Flemish Region, Belgium

Groningen, Netherlands

Ghent, Flemish Region, Belgium

Brussels Area, Belgium

Antwerp Area, Belgium

London, England, United Kingdom

Education

University of Antwerp

2017 - 2021

2015 - 2017

2013 - 2015

2009 - 2012

Licenses & Certifications

Databases and SQL for Data Science
Coursera
Issued Apr 2020

Complete SQL
Udemy
Issued Jan 2020
Credential ID UC-49SZLG2B

Machine Learning with Python
Coursera
Issued May 2018

Data Analysis with Python
Coursera
Issued Apr 2018

Natural Language Processing with Machine Learning
Educative, Inc.
Issued Jan 2018

Practical Deep Learning with Pytorch
Udemy
Issued Jan 2018
Credential ID UC-7e81ae49-ef64-40ab-88e7-dcb321b68a79

Neural Networks and Deep Learning
Coursera
Issued Dec 2017
Credential ID 9c972b8d675ee63eee469bf715e36c9d

Data Visualization
Coursera
Issued Sep 2017
Credential ID 4363c096384ddfb86da6a6e5ddf6fb62

Data Science Methodology
Coursera
Issued Apr 2017

Python for Data Science and AI
Coursera
Issued Apr 2017

Publications

The future of open human feedback

Nature Machine Intelligence June 20, 2025

Human feedback on conversations with language models is central to how these systems learn about the world, improve their capabilities and are steered towards desirable and safe behaviours. However, this feedback is mostly collected by frontier artificial intelligence labs and kept behind closed doors. Here we bring together interdisciplinary experts to assess the opportunities and challenges to realizing an open ecosystem of human feedback for artificial intelligence. We first look for successful practices in the peer-production, open-source and citizen-science communities. We then characterize the main challenges for open human feedback. For each, we survey current approaches and offer recommendations. We end by envisioning the components needed to underpin a sustainable and open human feedback ecosystem. In the centre of this ecosystem are mutually beneficial feedback loops, between users and specialized models, incentivizing a diverse stakeholder community of model trainers and feedback providers to support a general open feedback pool.

Classifying toxicity in adolescent conversations : applications in paediatrics

University of Antwerp March 31, 2022

This PhD thesis investigates and analyses the effectiveness of text classification models for detecting toxic language in paediatric settings. The literature highlights toxic language, and its effects like bullying and mental health problems, as fundamental societal challenges. Moreover, the World Health Organisation asserts that tackling bullying for adolescents should not be limited to educational settings and that it is the responsibility of healthcare institutions to address these issues. Social media platforms have implemented text classification systems that protect against toxic language within their products, and paediatric wards should have comparable safeguards when using language-based technology. The thesis aims to expose methods from Natural Language Processing that are suitable for application in paediatrics, and highlight aspects of state-of-the-art methodology that demand consideration and attention. The thesis is structured in three parts; an introduction, a series of case studies, and a strategic analysis. The introduction is targeted at non-expert readers and intends to support the technical case studies chapters by clarifying systems and practices from the field of Natural Language Processing. The case studies part contains a series of contained experiments in text classification and toxic language detection. The final part returns to the systems from the case studies and analyses them against the context of paediatric application.

A Dutch Dataset for Cross-lingual Multilabel Toxicity Detection

Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021) September 1, 2021

Multi-label toxicity detection is highly prominent, with many research groups, companies, and individuals engaging with it through shared tasks and dedicated venues. This paper describes a cross-lingual approach to annotating multi-label text classification on a newly developed Dutch language dataset, using a model trained on English data. We present an ensemble model of one Transformer model and an LSTM using Multilingual embeddings. The combination of multilingual embeddings and the Transformer model improves performance in a cross-lingual setting.

Spans are Spans, stacking a binary word level approach to toxic span detection

Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) August 1, 2021

This paper describes the system developed by the Antwerp Centre for Digital humanities and literary Criticism [UAntwerp] for toxic span detection. We used a stacked generalisation ensemble of five component models, with two distinct interpretations of the task. Two models attempted to predict binary word toxicity based on ngram sequences, whilst 3 categorical span based models were trained to predict toxic token labels based on complete sequence tokens. The five models’ predictions were ensembled within an LSTM model. As well as describing the system, we perform error analysis to explore model performance in relation to textual features.

Offence in dialogues: A corpus-based study

Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019) August 30, 2019

In recent years an increasing number of analyses of offensive language have been published, dealing mainly, however, with the automatic detection and classification of isolated instances. In this paper we aim to understand the impact of offensive messages in online conversations diachronically, and in particular the change in offensiveness of dialogue turns. In turn, we aim to measure the progression of offence level as well as its direction, for example whether a conversation is escalating or declining in offence. We present our method of extracting linear dialogues from tree-structured conversations in social media data and make our code publicly available. Furthermore, we discuss methods to analyse this dataset through changes in discourse offensiveness. Our paper includes two main contributions: first, using a neural network to measure the level of offensiveness in conversations; and second, the analysis of conversations around offensive comments using decoupling functions.

Sarcasm detection using an ensemble approach

Proceedings of the Second Workshop on Figurative Language Processing August 30, 2019

We present an ensemble approach for the detection of sarcasm in Reddit and Twitter responses in the context of The Second Workshop on Figurative Language Processing held in conjunction with ACL 2020. The ensemble is trained on the predicted sarcasm probabilities of four component models and on additional features, such as the sentiment of the comment, its length, and source (Reddit or Twitter) in order to learn which of the component models is the most reliable for which input. The component models consist of an LSTM with hashtag and emoji representations; a CNN-LSTM with casing, stop word, punctuation, and sentiment representations; an MLP based on InferSent embeddings; and an SVM trained on stylometric and emotion-based features. All component models use the two conversational turns preceding the response as context, except for the SVM, which only uses features extracted from the response. The ensemble itself consists of an AdaBoost classifier with the decision tree algorithm as base estimator and yields F1-scores of 67% and 74% on the Reddit and Twitter test data, respectively.

Synthetic literature: Writing science fiction in a co-creative process

Proceedings of the Workshop on Computational Creativity in Natural Language Generation August 31, 2017

This paper describes a co-creative text generation system applied within a science fiction setting to be used by an established novelist. The project was initiated as part of The Dutch Book Week, and the generated text will be published within a volume of science fiction stories. We explore the ramifications of applying Natural Language Generation within a co-creative process, and examine where the co-creative setting challenges both writer and machine. We employ a character-level language model to generate text based on a large corpus of Dutch novels that exposes a number of tunable parameters to the user. The system is used through a custom graphical user interface that helps the writer to elicit, modify and incorporate suggestions by the text generation system. Besides a literary work, the output of the present project also includes user-generated meta-data that is expected to contribute to the quantitative evaluation of the text-generation system and the co-creative process involved.

Languages

Dutch

Full professional proficiency

French

Limited working proficiency

English

Native or bilingual proficiency

Recommendations received

Ekaterina L.

“I worked with Ben on a research and development part of an NLP consulting project. It was a great pleasure to collaborate with him thanks to his astute thinking, optimistic attitude, and energetic approach. He is a broad-minded person who is always looking for ways to further improve the pipeline, combining it with the ability to concisely describe his conclusions to non-technical stakeholders.”

1 person has recommended Ben

More activity by Ben

Thinking back to the origins, BERT, ALBERT, DistilBERT all seemed so far apart. Crazy to see them so tightly coupled in this timeline; we have truly…

Liked by Ben Burtenshaw
Anthropic: A postmortem of three recent issues Anthropic had a very bad month in terms of model reliability: Between August and early September…

Liked by Ben Burtenshaw
Recap of the Python user group & Pydata Belgium meetup for those who missed it👇 1st talk A nice technical walkthrough of Langgraph functionality…

Liked by Ben Burtenshaw
Thinking about learning the keys to post-training LLMs? 🧐 the fastest track to mastering fine-tuning large language models. Free, hands-on…

Liked by Ben Burtenshaw
Georgia needed a couple of people to help and a day later thousands joined 😅😅😅 There's a growing excitement about "AI for science", can't wait to…

Liked by Ben Burtenshaw
Today we are celebrating ⭐️⭐️150 000 stars ⭐️⭐️ on the transformers GitHub repo! Thanks everyone!!

Liked by Ben Burtenshaw
You can now use any open LLM as your coding assistant in VS Code with the Hugging Face Provider for GitHub Copilot Chat. Just pick your fav open…

Liked by Ben Burtenshaw
IBM just released a useful open source model for processing unstructured documents. It's based on their latest Granite models and integrates with…

Shared by Ben Burtenshaw
🔥 500,000 public datasets on Hugging Face! Perfect time to ask the community 👇 What are we missing to help you build and share more and…

Liked by Ben Burtenshaw
I'm glad that I’m now a Hugging Face Fellow!🤗❤️ With the same passion, trust, and momentum to contribute to the community, and leverage to build…

Liked by Ben Burtenshaw
🎉 We just crossed 500,000 public datasets on HF 🎉 - there is a new dataset shared every 60 seconds - most datasets are text, images & audio but…

Liked by Ben Burtenshaw
🤖 As AI-generated content is shared in movies/TV/across the web, there's one simple low-hanging fruit 🍇 to help know what's real: Visible…

Liked by Ben Burtenshaw
Training long-context LLMs is getting easier! TRL now supports Context Parallelism (CP), letting you scale sequences across multiple GPUs, even…

Liked by Ben Burtenshaw

