TissueNet: detect lesions in uterine cervix specimens
Description
1. Purpose of the database:
This database was collected to organize the TissueNet Data Challenge. It consists of high-resolution images of microscopic slides prepared from cervical biopsies and surgical specimens. Competitors were also given slide metadata and, for the training set, annotations outlining some (but not necessarily all) of the lesions present on a slide.
The database shared on data.gouv.fr is a portion of the database used for the data challenge. It contains only the slides from the pathology centers that agreed to share their data openly.
This database includes 1272 microscopic slides of uterine cervical tissue from medical centers across France.
The slides are distributed in the following datasets:
diagnosed_biopsies = 443 slides
diagnosed_conizations = 21 slides
annotated_biopsies = 295 slides
undiagnosed_biopsies = 217 slides
undiagnosed_conizations = 296 slides
- Diagnosed_biopsies & diagnosed_conizations:
Pathologists have labeled each slide according to four classes of lesion severity as classified by the World Health Organization (5th edition):
0: benign (normal or subnormal)
1: low malignant potential (low grade squamous intraepithelial lesion)
2: high malignant potential (high grade squamous intraepithelial lesion)
3: invasive cancer (invasive squamous carcinoma)
→ The label is the class of the most severe lesion on the slide (it applies at the slide level, not the annotation level).
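The labeling rule above (a slide takes the class of its most severe lesion) can be sketched in Python as follows. The `Severity` enum and the `slide_label` helper are illustrative names, not part of the dataset, and treating a slide with no recorded lesion as benign is an assumption made for this sketch.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four ordered severity classes used for slide labels."""
    BENIGN = 0      # normal or subnormal
    LOW_GRADE = 1   # low grade squamous intraepithelial lesion
    HIGH_GRADE = 2  # high grade squamous intraepithelial lesion
    INVASIVE = 3    # invasive squamous carcinoma


def slide_label(lesion_classes):
    """Return the slide-level label: the most severe class present.

    Assumption: an empty list of lesions is treated as benign.
    """
    return Severity(max(lesion_classes, default=Severity.BENIGN))
```

For example, a slide whose lesions are graded 0, 2 and 1 would be labeled `Severity.HIGH_GRADE` under this rule.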
- Annotated_biopsies:
Pathologists have labeled and annotated these images to point out regions that represent lesions.
When working with the annotations, it's important to keep in mind the following points:
-- The annotated regions do not necessarily include all lesioned tissue in the slide. An unannotated region is not necessarily normal tissue.
-- The whole image class label and the annotation class label do not necessarily match. The annotated regions may be the image's labeled class or below. For instance, an image labeled as a class 2 lesion could have annotations representing class 0, 1, or 2. At least some of the annotated regions will represent the most severe/labeled class. All annotations on a slide with label 0 will be normal tissue.
-- A lesion may fall entirely within its annotation square, or may extend beyond the square's boundaries.
-- All annotations are a fixed size of 300x300 micrometers. As images have different resolutions in pixels/micrometer, annotations will have different dimensions in terms of pixels.
-- When using the geometries, it is important to know the origin of the coordinate system. Image processing software may assume the image origin is either the bottom left or the top left. The WKT shapes that we provide as annotations (geometry column in train_annotations.csv) are relative to the bottom left being the origin (0, 0).
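The last two points can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: each annotation is assumed to be a simple WKT `POLYGON` with a single ring, the image height and the resolution (in pixels per micrometer) are assumed to be read from the slide metadata by the caller, and all function names are hypothetical.

```python
import re


def parse_wkt_polygon(wkt):
    """Parse the single ring of a simple WKT POLYGON into (x, y) float pairs."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a simple single-ring WKT polygon: {wkt!r}")
    return [tuple(float(v) for v in pair.split()) for pair in match.group(1).split(",")]


def to_top_left_origin(points, image_height):
    """Convert bottom-left-origin coordinates to top-left-origin coordinates.

    Only y changes: a point at distance y from the bottom edge lies at
    distance image_height - y from the top edge.
    """
    return [(x, image_height - y) for x, y in points]


def annotation_side_px(pixels_per_micrometer, side_um=300.0):
    """Side length in pixels of a 300 x 300 micrometer annotation."""
    return side_um * pixels_per_micrometer
```

With a resolution of 2 pixels per micrometer, for instance, `annotation_side_px(2.0)` gives a 600-pixel side. Libraries that index pixels from the top left (most image-processing tools) need the flipped coordinates.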
- Undiagnosed_biopsies & undiagnosed_conizations:
There are no labels or annotations for these images.
All images are standardized in pyramidal TIF format. These images are compressed using JPEG Q=75. The pyramidal TIF format maintains a sufficient level of detail for pathologists to perform diagnoses while enabling smaller file sizes and easier loading with actively developed Python libraries such as PyVips.
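As a rough illustration of loading part of a slide with PyVips, the sketch below crops a square tile from one page of a TIF file. It assumes that `pyvips` is installed and that the pyramid levels are stored as TIFF pages with page 0 at full resolution, which is not stated above. `clamp_box` and `read_tile` are hypothetical helper names. Note that PyVips measures coordinates from the top left.

```python
def clamp_box(left, top, size, width, height):
    """Clamp a square box so that it lies inside a width x height image.

    Returns (left, top, box_width, box_height).
    """
    left = max(0, min(left, width - 1))
    top = max(0, min(top, height - 1))
    return left, top, min(size, width - left), min(size, height - top)


def read_tile(path, left, top, size, page=0):
    """Load one page of a TIF and return the requested tile as a pyvips image.

    Assumption: `page` selects a pyramid level, with page 0 the full resolution.
    """
    import pyvips  # third-party; imported here so clamp_box works without it

    image = pyvips.Image.new_from_file(path, page=page)
    box = clamp_box(left, top, size, image.width, image.height)
    return image.crop(*box)
```

Clamping the box first avoids asking for pixels outside the image, which `crop` rejects.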
2. Context of creation of the database:
This database was created as part of the TissueNet Data Challenge. This challenge began in 2019 when the French Society of Pathology (SFP) and the Health Data Hub (HDH) decided to build a challenge using a data bank of whole slide images (WSIs). Nineteen public and private pathology departments across France contributed more than 5,000 WSIs as data for the challenge. These slides are often difficult for pathologists themselves to diagnose, and expert eyes may be required. All labeled images included in the challenge were reviewed twice by expert pathologists.
3. Target:
Data challenges are global competitions aimed at solving specific problems within a given time frame using highly anonymized data. They are therefore intended for data scientists (researchers, industry practitioners, students, etc.) from all around the world.
The objective of the challenge was to classify each image according to the most severe category of epithelial lesion present in the sample. The classes are defined as follows:
0: benign (normal or subnormal)
1: low malignant potential (low grade squamous intraepithelial lesion)
2: high malignant potential (high grade squamous intraepithelial lesion)
3: invasive cancer (invasive squamous carcinoma)
4. Results obtained from the database:
In the TissueNet competition, participants were tasked with building machine learning models that could predict the most severe lesions in each digital biopsy slide. What's more, participants needed to submit code for executing their solution on test data in the cloud, ensuring that the model could run fast enough on this large scale data to be useful in practice. This setup rewards models that perform well on unseen images and brings these innovations one step closer to impact.
Global performance of each algorithm was evaluated according to a custom metric devised by a panel of expert pathologists. The score for each prediction equals 1 minus the error, where the error weighting for misclassification has been set by an expert consensus within the scientific council as defined in the table below. The total error is the average error across all predictions. Note that the metric is symmetric, e.g., predicting class 3 when it is actually class 0 produces the same error as predicting class 0 when it is actually class 3.
Error table of misclassification (table not reproduced here).
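The scoring rule described above (score = 1 minus the average error over all predictions) can be sketched as follows. The official error weights are not given in this text, so `PLACEHOLDER_WEIGHTS` below is a made-up symmetric matrix used only to make the example runnable. It must not be read as the challenge's actual table, and `mean_score` is a hypothetical name.

```python
# Placeholder weights, indexed as [true class][predicted class].
# These values are invented for illustration and are NOT the official ones.
PLACEHOLDER_WEIGHTS = [
    [0.0, 0.25, 0.5, 1.0],
    [0.25, 0.0, 0.25, 0.5],
    [0.5, 0.25, 0.0, 0.25],
    [1.0, 0.5, 0.25, 0.0],
]


def mean_score(y_true, y_pred, error_weights):
    """Return 1 minus the average misclassification error over all slides."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    total_error = sum(error_weights[t][p] for t, p in zip(y_true, y_pred))
    return 1.0 - total_error / len(y_true)
```

Because the matrix is symmetric, swapping the true and predicted labels leaves the score unchanged, which matches the symmetry property noted above.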
The winning solutions used clever approaches to prioritize the parts of each slide to analyze further, and built computer vision pipelines to determine the most appropriate diagnosis for the selected tissue. Models were scored not just on their accuracy, but also on the impact of their errors (providing a large penalty for mistakes that have worse consequences in practice).
The top-performing model achieved over 76% accuracy in predicting the exact severity label of each slide across 4 ranked classes, including 95% accuracy for the most severe class of cancerous tissue. In addition, the top 3 solutions achieved >98% on-or-adjacent accuracy, meaning they reduced the more costly misclassifications that erred by more than one class to less than 2% of the 1,500+ slide test set!
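The two accuracy figures quoted above can be computed as in this short sketch, where "on-or-adjacent" is taken to mean that the predicted class differs from the true class by at most one. The function names are illustrative.

```python
def exact_accuracy(y_true, y_pred):
    """Fraction of slides whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def on_or_adjacent_accuracy(y_true, y_pred):
    """Fraction of slides predicted as the true class or a neighboring class."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Predictions that miss by two or more classes are the costly errors; their share is one minus the on-or-adjacent accuracy.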
All prize-winning solutions are available under an open source license for ongoing use and learning.
For more details: winning models on GitHub
5. Other information:
Here are some resources you can use to work with the data:
- OpenSlide supports all native whole slide image formats, including:
.mrxs (MIRAX)
.svs (Aperio)
.ndpi (Hamamatsu)
- PyVips is a Python binding for libvips, a low-level library for working with large images. PyVips can be used to read and manipulate the pyramidal TIF formats.
- Cytomine allows you to display and explore native whole slide images and pyramidal TIF formats in a web browser. It also supports adding annotations and executing scripts from inside Cytomine or from any computing server using the dedicated Cytomine Python client. Cytomine can be installed locally or on any Linux server. The Cytomine GitHub repository includes examples of Python scripts demonstrating how to interact with your Cytomine instance, as well as examples of ready-to-use machine learning scripts (all S_ prefixed repos, such as S_CellDetect_Stardist_HE_ROI).
Here are a few papers and tutorials that talk about machine learning with WSI that you may find helpful:
- Can AI predict epithelial lesion categories via automated analysis of cervical biopsies: The TissueNet challenge?
- Le premier data challenge organisé par la Société Française de Pathologie : une compétition internationale en 2020, un outil de recherche en intelligence artificielle pour l’avenir ? (The first data challenge of the French Society of Pathology: an international competition in 2020, a research tool in A.I. for the future?)
- Whole slide image preprocessing in Python
- Using deep convolutional neural networks to identify and classify tumor-associated stroma in diagnostic breast biopsies
- Assessment of Machine Learning of Breast Pathology Structures for Automated Differentiation of Breast Cancer and High-Risk Proliferative Lesions
- Histologic tissue components provide major cues for machine learning-based prostate cancer detection and grading on prostatectomy specimens
- Assessment of Machine Learning Detection of Environmental Enteropathy and Celiac Disease in Children
6. Licences:
Creative Commons Attribution (CC BY 3.0)
Licence Ouverte/Open Licence 2.0 (Etalab 2.0)
7. User form:
The purpose of the user form is to track who (in terms of individuals and institutions) is using the data and potentially for what purposes. This form is not restrictive in the sense that access requests will never be denied.
8. Cite:
For any reuse of this database, use the DOI provided below:
https://doi.org/10.60597/eaqa-k904
Last updated: 21 January 2025
Licence: Licence Ouverte / Open Licence version 2.0
Information
ID: 67864c1c2999ce903509795e
Created: 14 January 2025
Update frequency: one-off
Temporal coverage: 2020
Spatial coverage: France
Spatial granularity: country