Human Liver Biopsy Microscopy Images and Ground Truth Scores
A total of 467 clinically indicated human liver biopsies from 3 sources were used in this retrospective study. Table 2 provides an overview of sources, digital slide scanners, and stains.
338 of the scanned biopsies were obtained from Duke University (Duke Department of Medicine, Durham, USA). For the Duke slides, patient age was also available. The experimental protocols were approved by an institutional review board (Duke Health, DUHS IRB Office, Hock Plaza, Suite 405, 2424 Erwin Road, Durham, USA).
72 digitized biopsies were obtained from the Medizinische Hochschule Hannover (Institut für Pathologie, Hannover, Germany). The experimental protocols were approved by the ethics committee of the Medizinische Hochschule Hannover (ethics vote no. 8667_BO_K_2019, Ethik-Kommission der MHH, OE 9515, Carl-Neuberg-Str. 1, 30625 Hannover, Germany).
57 digitized biopsies were obtained from the internal repositories of Boehringer Ingelheim’s Biberach (Germany) and Ridgefield (USA) sites. Boehringer Ingelheim’s sample supplier was Discovery Life Sciences (Huntsville, AL, USA). The experimental protocols were approved by the Business Practice “Acquisition and Use of Human Biospecimens”, Discovery Research Coordination (Boehringer Ingelheim, Birkendorfer Str. 65, 88400 Biberach, Germany).
Written informed consent was obtained from the subjects and/or their legal guardian(s). The study complied with the ethical guidelines of the 1975 Declaration of Helsinki.
The liver samples consisted of two types of liver biopsies: wedge biopsies and fine needle biopsies. The wedge biopsies contained a generous amount of material (typical dimension ~0.8 cm edge length) and numerous portal areas. Fine needle biopsies were typically 1–2 cm long and contained between 6 and 10 representative portal triads28.
Slides were either stained with Masson Goldner stain or Masson Trichrome stain according to established protocols. Whole slide microscopy images were acquired with a Leica Aperio scanner (Leica Biosystems, Wetzlar, Germany) or a Zeiss Axioscan Z1 scanner (Carl Zeiss, Jena, Germany) using a 20× objective under brightfield illumination.
The 467 biopsies were randomly divided into two sets. Set 1 contained 296 biopsies. The CNN training tiles from set 1 were further divided into training and validation sets (see below). The fitting of the artificial neural network (see below) was performed on the entire set 1 using fourfold cross-validation. Set 2 (test) contained 171 biopsies and was only used to assess the performance of the trained CNN.
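The split described above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the random seed, the function names `split_biopsies` and `kfold_indices`, and the use of integer biopsy IDs are assumptions made for the example.

```python
import numpy as np

def split_biopsies(biopsy_ids, n_set1, seed=0):
    """Randomly assign biopsies to set 1 (training/fitting) and set 2 (test)."""
    rng = np.random.default_rng(seed)          # seed is an illustrative assumption
    shuffled = rng.permutation(biopsy_ids)
    return sorted(shuffled[:n_set1].tolist()), sorted(shuffled[n_set1:].tolist())

def kfold_indices(n_items, k=4, seed=0):
    """Yield (fit_idx, holdout_idx) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_items), k)
    for i in range(k):
        holdout = folds[i]
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, holdout

# 467 biopsies: 296 go to set 1, the remaining 171 to set 2 (test).
set1, set2 = split_biopsies(list(range(467)), n_set1=296)
# Fourfold cross-validation over set 1, as used for fitting the scoring ANN.
folds = list(kfold_indices(len(set1), k=4))
```

Each biopsy in set 1 appears in exactly one hold-out fold, while set 2 is never touched during fitting.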
Each biopsy was scored for the features of the Kleiner NAS score10 (ballooning and inflammation) and, separately, for the degree of fibrosis10. Scoring was performed at the respective source site by a clinical pathologist specializing in NASH (338 samples from Duke University and 72 samples from Medizinische Hochschule Hannover) or, for the 57 samples from Boehringer Ingelheim, by a trained biomedical expert with 10 years of experience in NASH histopathology and in scoring according to Kleiner and Brunt. The slide scoring procedure followed the established procedure described in detail by Kleiner10. Briefly, the four characteristics of NASH were defined and scored as follows:
Steatosis: This parameter refers to the amount of surface area involved with steatosis. Emphasis was placed on vacuolar changes, where medium and large lipid droplets (macrosteatosis) displace the nucleus and cell organelles to the cell periphery. Relative vacuole area coverage was determined automatically, as human expert scores were found to have substantial systematic bias (see Supplementary Fig. S1). A CNN-based segmentation approach to recognize liver tissue and vacuoles was trained using Halo-AI (Indica Labs, Albuquerque, NM, USA), and steatosis scores were obtained using the thresholds of Ref. 10: 0 (<5%), 1 (5–33%), 2 (>33–66%), 3 (>66% area of steatosis). We noticed, however, that the Kleiner cut-off values combined with precise area determination meant that the highest score (steatosis score 3: >66% steatosis coverage) was met by none of our 467 biopsies, so that the steatosis score effectively ranged only from zero to two. This human bias may necessitate a revision of the steatosis scoring scheme and area thresholds of Kleiner et al.10. At present, it could undermine comparability and compatibility between pathologist-based and AI-based Kleiner scoring approaches.
Lobular inflammation: mononuclear cells as well as neutrophils were evaluated, and the overall assessment of all inflammatory foci was scored as 0 (no foci), 1 (<2 foci per 200× field), 2 (2–4 foci per 200× field), or 3 (>4 foci per 200× field), as further defined in Ref. 10.
Hepatocyte ballooning: hepatocytes showing the morphological characteristics of hydropic degeneration were scored as 0 (none), 1 (few), or 2 (many), as defined in Ref. 10.
Fibrosis: the scoring was simplified according to the work of Younossi29, with scores 0 (none), 1 (centrilobular/perisinusoidal fibrosis), 2 (centrilobular and periportal fibrosis), 3 (bridging fibrosis), 4 (cirrhosis). The substages 1a, 1b and 1c used in Ref. 10 were not taken into account, because they were not suitable for our goal of obtaining a single numerical score per feature.
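The mapping from an automatically measured steatosis area to a score can be illustrated with a short sketch. It assumes the commonly cited Kleiner cut-offs (<5%, 5–33%, >33–66%, >66%); the function name `steatosis_score` and the exact handling of values that fall on a boundary are assumptions made for this example.

```python
def steatosis_score(area_fraction):
    """Map fractional vacuole area coverage (0..1) to a steatosis score 0-3.

    Cut-offs follow the commonly cited Kleiner scheme; boundary handling
    (e.g. exactly 33%) is an assumption of this sketch.
    """
    if not 0.0 <= area_fraction <= 1.0:
        raise ValueError("area_fraction must be between 0 and 1")
    pct = 100.0 * area_fraction
    if pct < 5.0:
        return 0
    if pct <= 33.0:
        return 1
    if pct <= 66.0:
        return 2
    return 3
```

With precisely measured areas, a score of 3 requires more than two thirds of the tissue area to be covered by vacuoles, which is consistent with the observation above that no biopsy in this cohort reached it.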
Not all scores were available for all biopsies. A few slides with severe artifacts (e.g., very dark staining) were excluded from analysis after manual inspection by BS. Accordingly, of the 467 slides, 453 had a steatosis score, 388 a ballooning score, 249 an inflammation score and 384 a fibrosis score. Supplementary Table S1 provides an overview of the distribution of pathologist scores.
Whole slide image pre-processing and tile generation
Whole slide images (WSI) were converted from the microscope vendor’s native format (e.g., czi or ndpi) to TIFF or BigTIFF (for files larger than 4 GB) and downscaled to a resolution of 0.44 µm/px.
Then, the TIFFs were converted to non-overlapping adjacent tiles using Halcon image processing software version 18.11 (https://www.mvtec.com/products/halcon, MVTec Software GmbH, Munich, Germany). Tiles of 299 px × 299 px were created at two resolutions: “high resolution tiles” at 0.44 µm/px (ballooning, inflammation, steatosis) and “low resolution tiles” after 1:3 downscaling to 1.32 µm/px (fibrosis).
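The tiling step can be sketched in NumPy as below. The study used Halcon for this; the code is only a stand-in under stated assumptions: block averaging is used for the 1:3 downscaling, incomplete tiles at the image border are dropped, and the function names and the synthetic input array are hypothetical.

```python
import numpy as np

TILE = 299  # tile edge length in pixels

def downscale(image, factor):
    """Downscale an H x W x C image by averaging factor x factor pixel blocks."""
    h = (image.shape[0] // factor) * factor
    w = (image.shape[1] // factor) * factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3)).astype(image.dtype)

def tile_image(image, tile=TILE):
    """Cut an image into non-overlapping adjacent tiles (partial border tiles dropped)."""
    rows, cols = image.shape[0] // tile, image.shape[1] // tile
    return [image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            for r in range(rows) for c in range(cols)]

# Synthetic stand-in for a slide region scanned at 0.44 µm/px.
wsi = np.zeros((1800, 2700, 3), dtype=np.uint8)
high_res_tiles = tile_image(wsi)               # 0.44 µm/px: ballooning, inflammation, steatosis
low_res_tiles = tile_image(downscale(wsi, 3))  # 1.32 µm/px: fibrosis
```

Because both tile types are 299 px wide, a low resolution tile covers nine times the tissue area of a high resolution tile, which suits the larger-scale architecture relevant for fibrosis.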
Annotation of tiles for CNN training
For ballooning, inflammation, steatosis, and fibrosis, annotated tile datasets were created from the slides in set 1. For each model, classes were defined that correspond to the relevant histopathological structures visible at tile level, for example, the presence or absence of a ballooned cell. Figure 2 provides an overview of models and classes. For ballooning, inflammation and fibrosis, a biomedical expert with 10 years of experience in NASH pathology (BS) annotated the tiles. For steatosis, an automatic annotation based on the U-Net architecture30, followed by sorting into bins of fractional vacuole area coverage, was chosen. The annotated tiles of the four models were randomly divided into 95% for CNN training and 5% for CNN validation. For more details on the morphological definition of classes for ballooning, inflammation and fibrosis, see Supplementary Material (section: tile annotation and CNN models).
CNN Training and Classification
CNNs for ballooning, inflammation, steatosis and fibrosis were trained and validated on the annotated tiles. The Inception-V3 architecture31 in its TensorFlow/Keras implementation was used as the CNN backbone, with some modifications in the last layers. During application, each trained CNN was applied to the tiles of its respective model (“high resolution tiles” for ballooning, inflammation and steatosis; “low resolution tiles” for fibrosis). For details, see Supplementary Material.
Scoring Artificial Neural Network (ANN)
For each model (ballooning, inflammation, steatosis, and fibrosis), results from all tiles belonging to a slide were aggregated by an ANN to obtain a single score per slide.
Numerical features describing the distribution of tile classification results on a given slide were generated and used as input to the ANNs. The ANNs were feed-forward multilayer perceptrons with two hidden layers, operated in regression mode to predict the ground truth (pathologist’s scores). The mean squared error (MSE) loss between the pathologist’s score and the continuous output of the ANNs was minimized. A custom activation function ensured that the continuous ANN outputs remained within the range of the pathologist’s score, for example, 0 to 4 for fibrosis. For details, see Supplementary Material.
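A forward pass of such a scoring network might look like the sketch below. The source only states that a custom activation bounded the output; the scaled sigmoid used here, the ReLU hidden layers, the layer sizes, and the random per-slide features are all assumptions chosen for illustration.

```python
import numpy as np

def bounded_output(z, max_score):
    """Scaled sigmoid (assumed form): maps any real value into the range 0..max_score."""
    return max_score / (1.0 + np.exp(-z))

def mlp_forward(features, params, max_score):
    """Two hidden ReLU layers followed by a single bounded regression output."""
    h1 = np.maximum(0.0, features @ params["W1"] + params["b1"])
    h2 = np.maximum(0.0, h1 @ params["W2"] + params["b2"])
    z = h2 @ params["W3"] + params["b3"]
    return bounded_output(z, max_score).ravel()

def mse(pred, target):
    """Mean squared error between continuous predictions and pathologist scores."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
n_features, n_hidden = 8, 16                      # illustrative sizes
params = {
    "W1": rng.normal(size=(n_features, n_hidden)), "b1": np.zeros(n_hidden),
    "W2": rng.normal(size=(n_hidden, n_hidden)),   "b2": np.zeros(n_hidden),
    "W3": rng.normal(size=(n_hidden, 1)),          "b3": np.zeros(1),
}
slide_features = rng.random((5, n_features))      # hypothetical per-slide tile statistics
pathologist_scores = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pred = mlp_forward(slide_features, params, max_score=4.0)   # fibrosis range 0 to 4
loss = mse(pred, pathologist_scores)
```

Bounding the output inside the network, rather than clipping afterwards, keeps the loss differentiable everywhere, which is one plausible reason for using a custom activation.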
Image analysis of collagen content by color deconvolution and thresholding
Duke University biopsy images were also analyzed for collagen content by “classic image analysis”, i.e. color deconvolution and subsequent thresholding to extract the color component of the trichrome stain corresponding to collagen. For analysis, HALO 3.2 digital pathology software was used (Indica Labs; Albuquerque, NM, USA). For details, see Supplementary Material.
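The principle of color deconvolution followed by thresholding can be sketched generically as below. This is not the HALO implementation: the stain vectors in `STAINS`, the threshold, and the function names are hypothetical values chosen for illustration, and real trichrome stain vectors would have to be calibrated on the slides.

```python
import numpy as np

def color_deconvolution(rgb, stain_matrix):
    """Unmix a uint8 RGB image into per-stain concentrations (Beer-Lambert model)."""
    od = -np.log10((rgb.astype(float) + 1.0) / 256.0)   # optical density per channel
    unit = stain_matrix / np.linalg.norm(stain_matrix, axis=1, keepdims=True)
    conc = od.reshape(-1, 3) @ np.linalg.inv(unit)      # solve OD = conc @ unit
    return conc.reshape(rgb.shape)

def collagen_area_fraction(rgb, stain_matrix, collagen_index, threshold):
    """Fraction of pixels whose collagen-stain concentration exceeds the threshold."""
    conc = color_deconvolution(rgb, stain_matrix)
    return float((conc[..., collagen_index] > threshold).mean())

# Hypothetical stain optical-density vectors (rows: collagen dye, counterstain, residual).
STAINS = np.array([[0.85, 0.50, 0.15],
                   [0.10, 0.80, 0.60],
                   [0.30, 0.20, 0.90]])

# Synthetic test image: left half pure "collagen" color, right half white background.
unit = STAINS / np.linalg.norm(STAINS, axis=1, keepdims=True)
collagen_rgb = np.round(256.0 * 10.0 ** (-unit[0]) - 1.0).astype(np.uint8)
image = np.full((10, 10, 3), 255, dtype=np.uint8)
image[:, :5] = collagen_rgb
fraction = collagen_area_fraction(image, STAINS, collagen_index=0, threshold=0.5)
```

On the synthetic image, half of the pixels carry the collagen color, so the thresholded area fraction comes out at one half.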