TissueNet: detect lesions in uterine cervix specimens - Open data set TissueNet - Open Data

Description

1. Purpose of the database:

This database was assembled for the TissueNet Data Challenge. It consists of high-resolution images of microscopic slides prepared from cervical biopsies and surgical specimens. Competitors also received slide metadata, along with training-set annotations outlining some (but not necessarily all) of the lesions present on a slide.

The database shared on data.gouv.fr is a subset of the database used for the data challenge. It contains the slides from the pathology centers that agreed to share their data openly.

This database includes 1272 microscopic slides of uterine cervical tissue from medical centers across France.

The slides are distributed in the following datasets:

diagnosed_biopsies = 443 slides
diagnosed_conizations = 21 slides
annotated_biopsies = 295 slides
undiagnosed_biopsies = 217 slides
undiagnosed_conizations = 296 slides

  • Diagnosed_biopsies & diagnosed_conizations:

Pathologists have labeled each slide according to four classes of lesion severity as classified by the World Health Organization (5th edition):
0: benign (normal or subnormal)
1: low malignant potential (low grade squamous intraepithelial lesion)
2: high malignant potential (high grade squamous intraepithelial lesion)
3: invasive cancer (invasive squamous carcinoma)

→ The label refers to the class of the most severe lesion on the slide (it is assigned at the slide level, not at the annotation level).

  • Fully_annotated_biopsies:

Pathologists have labeled and annotated these images to point out regions that represent lesions.
When working with the annotations, it's important to keep in mind the following points:
-- The annotated regions do not necessarily include all lesioned tissue in the slide. An unannotated region is not necessarily normal tissue.
-- The whole image class label and the annotation class label do not necessarily match. The annotated regions may be the image's labeled class or below. For instance, an image labeled as a class 2 lesion could have annotations representing class 0, 1, or 2. At least some of the annotated regions will represent the most severe/labeled class. All annotations on a slide with label 0 will be normal tissue.
-- The lesion may fall entirely within the annotation square, or may extend beyond the annotation boundaries.
-- All annotations are a fixed size of 300x300 micrometers. As images have different resolutions in pixels/micrometer, annotations will have different dimensions in terms of pixels.
-- When using the geometries, it is important to know the origin of the coordinate system. Image processing software may assume the image origin is either the bottom left or the top left. The WKT shapes that we provide as annotations (geometry column in train_annotations.csv) are relative to the bottom left being the origin (0, 0).
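The last two points can be sketched in a few lines of Python. This is a minimal illustration, not the challenge's own tooling: the function names, the example WKT string, the slide height and the resolution value are all hypothetical, and the parser handles only single-ring POLYGON shapes. Whether the flip should be `height - y` or `height - 1 - y` depends on the pixel convention of the software you use, so treat the formula as an assumption to verify against your viewer.

```python
import re
from typing import List, Tuple

ANNOTATION_SIDE_UM = 300.0  # fixed annotation side length, in micrometers

Point = Tuple[float, float]


def annotation_side_px(pixels_per_um: float) -> float:
    """Side length in pixels of one 300 x 300 micrometer annotation."""
    return ANNOTATION_SIDE_UM * pixels_per_um


def parse_simple_polygon(wkt: str) -> List[Point]:
    """Parse a single-ring WKT POLYGON into (x, y) vertices.

    Deliberately minimal: holes and multi-part geometries are rejected.
    """
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a single-ring POLYGON: {wkt!r}")
    points = []
    for pair in match.group(1).split(","):
        x_str, y_str = pair.split()
        points.append((float(x_str), float(y_str)))
    return points


def to_top_left_origin(points: List[Point], image_height_px: float) -> List[Point]:
    """Flip y so that (0, 0) is the top-left corner instead of the bottom-left."""
    return [(x, image_height_px - y) for x, y in points]


# Hypothetical annotation on a slide that is 1000 pixels tall.
wkt = "POLYGON ((100 200, 400 200, 400 500, 100 500, 100 200))"
flipped = to_top_left_origin(parse_simple_polygon(wkt), image_height_px=1000)
print(flipped)

# Hypothetical resolution of 2 pixels per micrometer.
print(annotation_side_px(pixels_per_um=2.0))
```

For real data, a dedicated geometry library is likely a better choice than the regular expression used here; the sketch only makes the coordinate convention explicit.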

  • Undiagnosed_biopsies & undiagnosed_conizations:

There are no labels or corresponding annotations for these images.

All images are standardized in pyramidal TIF format. These images are compressed using JPEG Q=75. The pyramidal TIF format maintains a sufficient level of detail for pathologists to perform diagnoses while enabling smaller file sizes and easier loading with actively developed Python libraries such as PyVips.

2. Context of creation of the database:

This database was created as part of the TissueNet Data Challenge. This challenge began in 2019 when the French Society of Pathology (SFP) and the Health Data Hub (HDH) decided to build a challenge using a data bank of whole slide images (WSIs). Nineteen public and private pathology departments across France contributed more than 5,000 WSIs as data for the challenge. These slides are often difficult for pathologists themselves to diagnose, and expert eyes may be required. All labeled images included in the challenge were reviewed twice by expert pathologists.


3. Target:

Data challenges are global competitions aimed at solving specific problems within a given time frame using highly anonymized data. They are therefore intended for data scientists (researchers, industry practitioners, students, etc.) from all around the world.

The objective of the challenge was to classify each image according to the most severe category of epithelial lesion present in the sample. The classes are defined as follows:
0: benign (normal or subnormal)
1: low malignant potential (low grade squamous intraepithelial lesion)
2: high malignant potential (high grade squamous intraepithelial lesion)
3: invasive cancer (invasive squamous carcinoma)

4. Results obtained from the database:

In the TissueNet competition, participants were tasked with building machine learning models that could predict the most severe lesions in each digital biopsy slide. What's more, participants needed to submit code for executing their solution on test data in the cloud, ensuring that the model could run fast enough on this large scale data to be useful in practice. This setup rewards models that perform well on unseen images and brings these innovations one step closer to impact.

Global performance of each algorithm was evaluated according to a custom metric devised by a panel of expert pathologists. The score for each prediction equals 1 minus the error, where the error weighting for misclassification has been set by an expert consensus within the scientific council as defined in the table below. The total error is the average error across all predictions. Note that the metric is symmetric, e.g., predicting class 3 when it is actually class 0 produces the same error as predicting class 0 when it is actually class 3.
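The scoring rule described above (average the per-prediction error, then subtract it from 1) can be sketched as follows. The weights in `PLACEHOLDER_ERROR_TABLE` are placeholder values chosen only for illustration; the official weights should be taken from the challenge's error table. The function and variable names are likewise hypothetical.

```python
from typing import List, Sequence

N_CLASSES = 4

# PLACEHOLDER weights, for illustration only: the error grows linearly with
# the class distance |true - predicted| and the worst case costs 1.0.
# Like the metric described in the text, this table is symmetric.
PLACEHOLDER_ERROR_TABLE: List[List[float]] = [
    [abs(t - p) / (N_CLASSES - 1) for p in range(N_CLASSES)]
    for t in range(N_CLASSES)
]


def challenge_score(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    error_table: Sequence[Sequence[float]],
) -> float:
    """Return 1 minus the average misclassification error."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    total_error = sum(error_table[t][p] for t, p in zip(y_true, y_pred))
    return 1.0 - total_error / len(y_true)


# Four slides, one of them a class 3 slide predicted as class 0.
score = challenge_score([0, 1, 2, 3], [0, 1, 2, 0], PLACEHOLDER_ERROR_TABLE)
print(score)
```

Passing the table as an argument keeps the scoring logic independent of the weights, so the official values can be substituted without touching the function.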

Error table of misclassification:

The winning solutions used clever approaches to prioritize the parts of each slide to analyze further, and built computer vision pipelines to determine the most appropriate diagnosis for the selected tissue. Models were scored not just on their accuracy, but also on the impact of their errors (providing a large penalty for mistakes that have worse consequences in practice).

The top-performing model achieved over 76% accuracy in predicting the exact severity label of each slide across 4 ranked classes, including 95% accuracy for the most severe class of cancerous tissue. In addition, the top 3 solutions achieved >98% on-or-adjacent accuracy, meaning they reduced the more costly misclassifications that erred by more than one class to less than 2% of the 1,500+ slide test set!
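To make the two figures concrete, here is a minimal sketch of how exact accuracy and on-or-adjacent accuracy could be computed for the four ranked classes. The function names and the example labels are hypothetical and are not taken from the winning solutions.

```python
from typing import Sequence


def exact_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of slides whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def on_or_adjacent_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of slides predicted at most one class away from the truth."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for five slides (classes 0 to 3).
y_true = [0, 1, 2, 3, 3]
y_pred = [0, 2, 2, 1, 3]

print(exact_accuracy(y_true, y_pred))           # 3 of 5 slides match exactly
print(on_or_adjacent_accuracy(y_true, y_pred))  # 4 of 5 slides are within one class
```

The fourth slide (true class 3, predicted class 1) is the only one that is off by more than one class, which is the kind of costly error the on-or-adjacent figure tracks.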

All prize-winning solutions are available under an open source license for ongoing use and learning.
For more details: winning models on GitHub

5. Other information:

Here are some resources you can use to work with the data:

-OpenSlide supports many native whole slide image formats, including:
.mrxs (MIRAX)
.svs (Aperio)
.ndpi (Hamamatsu)

-PyVips is a Python binding for libvips, a low-level library for working with large images. PyVips can be used to read and manipulate the pyramidal TIF formats.

-Cytomine allows you to display and explore native whole slide images and pyramidal TIF formats in a web browser. It also supports adding annotations and executing scripts from inside Cytomine or from any computing server using the dedicated Cytomine Python client. Cytomine can be installed locally or on any Linux server. The Cytomine GitHub repository includes examples of Python scripts demonstrating how to interact with your Cytomine instance, as well as examples of ready-to-use machine learning scripts (all S_ prefixed repos, such as S_CellDetect_Stardist_HE_ROI).


6. Licences:

Creative Commons Attribution (CC BY 3.0)
Licence Ouverte/Open Licence 2.0 (Etalab 2.0)

7. User form:

USER FORM

The purpose of the user form is to track who (in terms of individuals and institutions) is using the data and potentially for what purposes. This form is not restrictive in the sense that access requests will never be denied.

8. Cite:

For any reuse of this database, use the DOI provided below:
https://doi.org/10.60597/eaqa-k904

Last updated

21 January 2025

Licence

Licence Ouverte / Open Licence version 2.0


Information

Temporality

Created

14 January 2025

Frequency

One-off

Temporal coverage

2020

Spatial coverage

Territorial coverage

France

Granularity of territorial coverage

Country
