Synthesizing realistic high-resolution retina image by style-based generative adversarial network and its utilization


The retinal images used in this study were obtained from the Health Examination Center of Asan Medical Center in Seoul, South Korea. Two datasets were prepared for GAN training. First, data from a total of 98,561 patients with normal retinas (hereinafter referred to as “normal”) and from 20% (26,437 patients) of the patients with other retinas (hereinafter referred to as “uncertain”) were retrieved. Here, we defined a subject as normal if the medical chart contained normal keywords for both eyes, and defined all other subjects as uncertain. Retinal images obtained at the second visit of each patient were not included. Of all images from uncertain patients, only 3.29% had an actual abnormal keyword in the medical chart; the other images from uncertain patients were regarded as normal or benign normal. The abnormalities and their numbers and percentages are shown in Appendix Table 1.

The dataset is in DICOM (Digital Imaging and Communications in Medicine) format and contains 24-bit RGB retinal images. The original image size is 1536 × 2048 pixels. Header information, except age and sex, was anonymized by the data center. We excluded zoomed retinal images in which the peripheries of the major vascular arcades were not shown, and saturated or dark retinal images in which the optic disc was not correctly detected. Optic disc detection was performed using a publicly available deep-learning-based method [22], and images with incorrect detections were confirmed by a retina specialist (Y. J. Kim). The remaining data from 98,446 normal and 26,113 uncertain patients were used to train the GAN. Because most patient data included left and right eye images, and some included more than two images obtained from follow-up observation, a total of 276,113 images were used for GAN training. In the input data, the average age was 50.3 ± 11.3 years, and 53.5% of the patients were male.

To evaluate the efficacy of transfer learning, the epiretinal membrane (ERM) cases shown in Appendix Table 1 were prepared as a second dataset, comprising a total of 2671 retinal images from 1975 patients. The average age was 59.4 ± 8.4 years, and 58.1% of the patients were male. Image preprocessing was the same as described above.

This retrospective study was conducted according to the principles of the Declaration of Helsinki and in accordance with current scientific guidelines. The study protocol was approved by the Institutional Review Board (IRB) of Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea (IRB No. 2019-1373). Because of the retrospective design of the study and the use of de-identified patient data, the IRB waived the need for written informed consent.

GAN training

For retinal image synthesis, we used StyleGAN [23], which has been proven successful in image synthesis by various studies [24]. Previously released GAN algorithms trained a mapping function directly between the input latent vector and the target image. Although such models exhibit noteworthy generation power, direct mapping from a latent vector is limited in its ability to change individual visual attributes. StyleGAN, in contrast, places an intermediate latent space between the latent vector and the target image. For the StyleGAN training performed in this study, input images were processed as follows. First, we shifted each retinal image so that its center was located at the center of the 2-dimensional image plane. Square cropping was then performed on the outer region of the retinal image. Finally, the image was resized to 1024 × 1024 pixels. Two Titan RTX 24-GB graphics processing units (GPUs) were used, and the learning rate was set to 0.001. Other training parameters were left at their default values. The source code was the official StyleGAN TensorFlow implementation. During training, the resolution of the synthesized images grew progressively until a 1024 × 1024-pixel high-resolution image was synthesized. Examples of the synthesized images are shown in Fig. 1. Appendix Fig. 1 shows the training curve with the Fréchet inception distance (FID) score [25], a metric that evaluates the statistical similarity between real and synthesized images. The best model weights were chosen by considering the FID score as well as the quality of the synthesized images as examined by a retinal specialist (Y. J. Kim).
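
The three preprocessing steps above (centering, square cropping, resizing) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name, the `center_xy` and `radius` inputs (assumed to come from a separate retina-localization step), and the nearest-neighbour resize are all assumptions made for the example.

```python
import numpy as np

def center_crop_resize(image, center_xy, radius, out_size=1024):
    """Center the retina, crop a square around it, and resize.

    image: H x W x 3 uint8 array.
    center_xy, radius: hypothetical outputs of a retina-localization step
    (integer pixel coordinates and radius); not specified in the source.
    """
    cx, cy = center_xy
    side = 2 * radius
    # Pad with black so the square crop never leaves the image bounds.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)))
    # In padded coordinates the retina center moved by +radius on both axes,
    # so the top-left corner of the crop is simply (cy, cx).
    crop = padded[cy:cy + side, cx:cx + side]
    # Nearest-neighbour resize by index lookup (a simplification; the
    # interpolation actually used is not stated in the source).
    idx = np.arange(out_size) * side // out_size
    return crop[idx][:, idx]
```

Calling `center_crop_resize(img, (1024, 768), 700)` on a 1536 × 2048 array returns a 1024 × 1024 × 3 array with the chosen center mapped to the middle of the output.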

Figure 1

Examples of randomly synthesized images.

Image Turing test

To validate whether the synthesized images were realistic, we performed image Turing tests with 40 ophthalmologists, including residents. For the Turing test, 50 real images were randomly chosen from the whole dataset, and 50 synthesized images were generated by the StyleGAN generator using random latent vectors as seeds. These 100 images were then uploaded to a dedicated webpage for the image Turing test. A screenshot of the webpage is shown in Appendix Fig. 2. The webpage displayed the images one by one with no other information. Forty ophthalmologists, comprising 12 residents, 14 non-retina specialists, and 14 retina specialists, then independently accessed the website to perform the image Turing test. Information on the reader population is outlined in Appendix Table 2. To reduce environmental variability during the Turing test, the images were displayed in the same order for all ophthalmologists, modification of answers was prohibited, and the ratio of real to synthesized images was not disclosed to the readers. In addition, prior to this test, none of the readers had experience with synthesized retinal images. Under these conditions, all readers completed the image Turing test.

To evaluate the results of the image Turing test, statistical analysis was performed using IBM SPSS software version 23.0. Because the result for each image was binary and the data were correlated within each reader and each image, logistic generalized estimating equation (GEE) models were used. The ophthalmologists were grouped in two ways: by specialty, i.e., resident, non-retina specialist, and retina specialist; and by specialty combined with years of work experience, using a 5-year criterion. Using the logistic GEE models, each class in a grouping was compared with the resident class on the following metrics: accuracy, sensitivity, and specificity, where sensitivity refers to the proportion of real images that were correctly identified as real. Furthermore, reader examination time per image was recorded to assess performance according to work experience. Because examination time is a continuous value, GEE analysis with an identity link was used to compare classes for both ways of grouping the ophthalmologists.
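
As a small illustration of the three comparison metrics, the sketch below computes them for a single reader, treating "real" as the positive class as defined above. The function name and the input layout are assumptions for this example; the GEE modeling itself, which the authors ran in SPSS, is not reproduced here.

```python
def turing_test_metrics(responses):
    """Accuracy, sensitivity, and specificity for one reader.

    responses: iterable of (is_real, answered_real) boolean pairs,
    one pair per displayed image (hypothetical input layout).
    """
    responses = list(responses)
    # True positives: real images that the reader called real.
    tp = sum(1 for real, ans in responses if real and ans)
    # True negatives: synthesized images that the reader called synthesized.
    tn = sum(1 for real, ans in responses if not real and not ans)
    n_real = sum(1 for real, _ in responses if real)
    n_synth = len(responses) - n_real
    return {
        "accuracy": (tp + tn) / len(responses),
        "sensitivity": tp / n_real,
        "specificity": tn / n_synth,
    }
```

With 50 real and 50 synthesized images, a reader who labels 40 real images as real and 30 synthesized images as synthesized would score a sensitivity of 0.8, a specificity of 0.6, and an accuracy of 0.7.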

Morphological characteristics of synthesized images

The morphological characteristics of the synthesized images were examined in comparison with real retinal images by two retina specialists (Y. N. Kim and Y. J. Kim). Of the 50 synthesized images used for the image Turing test, the 3 images most frequently selected as real (i.e., incorrectly answered) and the 3 images most frequently selected as synthesized (i.e., correctly answered) were analyzed. These images are shown in Fig. 2. Furthermore, to examine the differences between real and synthetic retinal images in greater detail, the anatomical features of three landmarks (optic disc, macula, and vascular structures) were manually traced for each of the 50 synthesized images (Appendix Fig. 3).
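
The selection of the most and least convincing synthesized images amounts to ranking images by the number of readers who answered "real". A minimal sketch, assuming a hypothetical dictionary of per-image vote counts:

```python
def rank_by_real_votes(votes, k=3):
    """Return (most often judged real, most often judged synthesized).

    votes: dict mapping a synthesized image id to the number of readers
    who answered "real" for it (hypothetical data layout).
    """
    ordered = sorted(votes, key=votes.get, reverse=True)
    most_real = ordered[:k]
    # Fewest "real" votes means most often identified as synthesized.
    most_synth = ordered[-k:][::-1]
    return most_real, most_synth
```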

Figure 2

High-resolution synthesized retinal photographs. (a) Synthesized retinal images that most ophthalmologists selected as “real” images. (b) Synthesized retinal images that most ophthalmologists considered to be “synthesized” images. Numbers in parentheses indicate the number of examiners out of 40 ophthalmologists who chose the image as “real” or “synthesized.”

Metric evaluation of vessels

Metric evaluations were performed to compare quantitative indices between the real and synthesized images, specifically skeletonized vessel amounts and signal-to-noise ratios (SNR), where vessels and their nearby background regions were defined as signal and noise, respectively. Because both metrics require vessel-segmented images, we first derived vessel segmentation maps using the feature pyramid network (FPN), a deep-learning-based segmentation network [26]. We used the high-resolution fundus (HRF) public database and our own data to train the segmentation model. The segmentation performance (area under the curve: AUC 0.80) was comparable to that of other publicly available segmentation tools. Based on the segmented vessel maps, metric evaluation was performed on 1,000 randomly selected real images and 1,000 randomly synthesized images. For the comparison of vessel amounts, we counted all pixels comprising vessels in each vessel-segmented image.

For SNR measurement, we measured the mean signal from vessels in zone B. Zone B is defined as the region between two and three optic-disc diameters away from the optic disc center [27]. The signal and noise of the region were estimated using the following method.
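
Using the definition of zone B given above, the region can be expressed as an annular mask around the optic disc center. This is an illustrative sketch; the function name and the assumption that the disc center and diameter are already known in pixel units are not from the source.

```python
import numpy as np

def zone_b_mask(shape, disc_center, disc_diameter):
    """Boolean mask of pixels two to three disc diameters from the disc center.

    shape: (height, width) of the image.
    disc_center: (row, col) of the optic disc center, in pixels.
    disc_diameter: optic disc diameter, in pixels.
    """
    h, w = shape
    cy, cx = disc_center
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - cy, xx - cx)
    return (dist >= 2 * disc_diameter) & (dist <= 3 * disc_diameter)
```

Multiplying this mask with the vessel segmentation map restricts the subsequent measurements to vessels inside zone B.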

First, we identified all vessels in zone B using the vessel segmentation map. Five evenly spaced points were then selected on each vessel. At each point, the pixel intensities along the line perpendicular to the vessel direction were averaged to define the signal, and the standard deviation of the ± 5 neighboring pixels along the same perpendicular line, but outside the vessel wall, was defined as the noise. Appendix Fig. 4 presents the SNR calculation process in detail. R software (R Foundation for Statistical Computing, Vienna, Austria), version 3.5.3, was used for the statistical analysis, with a significance level of p < 0.05.
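
One possible reading of the per-point measurement is sketched below. It assumes that a 1-D intensity profile has already been sampled perpendicular to the vessel, that the vessel occupies a known index range within that profile, and that SNR is taken as the ratio of the mean vessel intensity to the background standard deviation. These details, and all names, are assumptions; Appendix Fig. 4 of the paper remains the authoritative description.

```python
import numpy as np

def point_snr(profile, vessel_start, vessel_end, n_background=5):
    """SNR at one vessel point from a perpendicular intensity profile.

    profile: 1-D intensities sampled across the vessel.
    vessel_start, vessel_end: half-open index range covering the vessel
    (hypothetical; derived from the segmentation map in practice).
    n_background: background pixels taken on each side of the vessel wall.
    """
    profile = np.asarray(profile, dtype=float)
    # Signal: mean intensity across the vessel width.
    signal = profile[vessel_start:vessel_end].mean()
    # Noise: spread of the pixels just outside each vessel wall.
    left = profile[max(vessel_start - n_background, 0):vessel_start]
    right = profile[vessel_end:vessel_end + n_background]
    noise = np.concatenate([left, right]).std()
    return signal / noise
```

A perfectly flat background yields zero noise and an undefined ratio, so a real implementation would need to guard against that case.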

Efficacy of the StyleGAN model weights for synthesizing retinal images with a specific disease via transfer learning

Unlike the normal and uncertain cases, retinal images with a specific disease cannot be synthesized directly because the original dataset contains only a small number of such images. This limitation can be overcome by transfer learning from the StyleGAN model weights trained on normal and uncertain images, even when only a small number of images with the specific disease is available. In this study, we verified the efficacy of transfer learning and the usefulness of the synthesized retinal images, using ERM as an example. The detailed procedure is described below and shown in Appendix Fig. 5.

  1. From the entire dataset collected from the Health Examination Center, retinal images were retrieved by searching for ERM keywords in the medical chart. A total of 7476 images were collected.

  2. Because keyword searching of the medical chart does not guarantee that all 7476 images show ERM, a retinal specialist (Y. J. Kim) examined the images until 600 ERM images and 600 non-ERM images had been collected. The remaining 6276 images were classified using the deep learning model developed in Step 3.

  3. To classify the remaining 6276 images as ERM or non-ERM, a deep-learning-based binary classification model using ResNet152 [28] was developed by training on the 600 ERM and 600 non-ERM retinal images labeled in Step 2. The dataset was divided into training, validation, and test sets at a ratio of 7:1:2. The input was a 512 × 512-pixel RGB image, with geometric augmentation consisting of horizontal flip, vertical flip, shift (6.25% of image size), zoom (10% of image size), and rotation (± 5 degrees). Augmentation was applied randomly to the images in each mini-batch. The learning rate and mini-batch size were 0.0001 and 10, respectively, with the Adam optimizer [29]. The developed model achieved an AUC, accuracy, sensitivity, and specificity of 0.986, 0.958, 0.970, and 0.947, respectively.

  4. Using the classification model, the 6276 images were classified into 3362 ERM images and 2914 non-ERM images. The 3362 ERM images were then divided at a ratio of approximately 8:2: 2671 images were used to develop the ERM synthesis model by transfer learning of StyleGAN, and 691 images were used to develop the classification model in Step 6.

  5. Using the 2671 ERM images, transfer learning of StyleGAN was performed starting from the weights of the trained model used for the image Turing test (see section “GAN training”). The specifications of the input images and the training parameters were the same as in section “GAN training”, except for the learning rate, which was reduced to 0.0001. The reduced learning rate was applied to fine-tune the model weights to synthesize ERM features on retinal images. The training curve with the FID score [25] is shown in Appendix Fig. 1. The best model weights were chosen by considering the FID score as well as the quality of the synthesized images as examined by a retinal specialist (Y. J. Kim). Accordingly, the retinal specialist (Y. J. Kim) examined 100 randomly synthesized ERM images and confirmed that characteristic ERM features, e.g., cellophane-like membrane formation at the macula and perifoveal vascular tortuosity, were well synthesized. Examples of synthesized ERM images are shown in the upper row of Fig. 3. The bottom row of Fig. 3 shows heatmaps, i.e., attention maps highlighting the regions that the deep learning model relied on when learning ERM features. The heatmaps were derived using gradient-weighted class activation mapping (Grad-CAM) [30] with the well-trained deep-learning-based classification model from Step 6 that was trained on real normal versus real ERM images at an equal ratio. The retinal specialist (Y. J. Kim) examined the heatmaps and concluded that the highlighted regions correspond to the regions of ERM features.

  6. In this step, we evaluated the efficacy of synthesized ERM images for developing an ERM versus non-ERM classification model under an imbalanced dataset. Binary classification models were trained with various ratios of real normal to real ERM images (i.e., 1:1, 1:0.5, 1:0.4, 1:0.3, 1:0.2, 1:0.1). We then compared the classification performance obtained after restoring a balanced ratio by adding synthesized ERM images to each imbalanced training set. The validation and test sets were common to all classification experiments. For training, the ERM images were drawn from the 691 real images remaining from Step 4 and the synthesized images generated in Step 5. The non-ERM images were drawn from the dataset used for StyleGAN training in section “GAN training”. The training parameters, including the preprocessing of input images, were the same as for the classification model developed in Step 3. The detailed training combinations and the characteristics of the dataset are tabulated in Table 1. The training curve for each combination is shown in Appendix Fig. 6.
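
The image counts in Steps 1 to 4 and the balancing idea in Step 6 can be checked with a few lines of arithmetic. The counts below are taken from the text; the helper `synthesized_needed` and the example value of 600 normal training images are hypothetical, since the actual per-experiment numbers are given only in Table 1.

```python
# Counts reported in Steps 1-4.
keyword_hits = 7476
manually_labelled = 600 + 600            # 600 ERM + 600 non-ERM (Step 2)
remaining = keyword_hits - manually_labelled
predicted_erm, predicted_non_erm = 3362, 2914
gan_training, classifier_pool = 2671, 691

def synthesized_needed(n_normal, real_erm_fraction):
    """Synthesized ERM images required to restore a 1:1 class ratio.

    n_normal: number of real normal training images (hypothetical value).
    real_erm_fraction: real ERM share relative to normal, e.g. 0.3 for 1:0.3.
    """
    n_real_erm = round(n_normal * real_erm_fraction)
    return n_normal - n_real_erm

# Illustrative only: 600 normal images is an assumed figure, not from the source.
plan = {f: synthesized_needed(600, f) for f in (1.0, 0.5, 0.4, 0.3, 0.2, 0.1)}
```

Under this assumed setting, the more imbalanced the real data, the more synthesized ERM images are added to reach the balanced ratio.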

Figure 3

Representative randomly synthesized images demonstrating ERM. Cellophane-like membrane formation at the macula and perifoveal vascular tortuosity are shown in the synthetic fundus images. Heatmaps derived using Grad-CAM correspond to these characteristic ERM features.

Table 1 Baseline characteristics of study for evaluating classification performance.
