Abstract
Recently, many attempts have been made to construct transformer-based U-shaped architectures, and new methods have been proposed that outperform
CNN-based rivals. However, serious problems such as blockiness and
cropped edges in predicted masks remain because of transformers’ patch
partitioning operations. In this work, we propose a new U-shaped architecture
for medical image segmentation with the help of the newly introduced focal
modulation mechanism. The proposed architecture has asymmetric depths for
the encoder and decoder. Because the focal module can aggregate
local and global features, our model benefits simultaneously from the wide
receptive field of transformers and the local view of CNNs. This helps the
proposed method balance local and global feature usage and outperform
Swin-UNet, one of the most powerful transformer-based U-shaped models. We achieved a 1.68% higher DICE score and a 0.89 improvement in the HD metric
on the Synapse dataset. With extremely limited data, we also achieved a 4.25%
higher DICE score on the NeoPolyp dataset.
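
For readers unfamiliar with the DICE score reported above, the following is a minimal sketch of how it is commonly computed for a pair of binary masks, 2|A ∩ B| / (|A| + |B|). The function name `dice_score`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; this is not the evaluation code used in this work.

```python
from typing import Sequence


def dice_score(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Dice coefficient between two binary masks of equal shape.

    Masks are nested sequences of 0/1 values (an illustrative choice;
    real pipelines typically use array libraries).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0  # pixels that are foreground in both masks
    pred_sum = 0      # foreground pixels in the prediction
    target_sum = 0    # foreground pixels in the ground truth

    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("masks must have the same row lengths")
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0

    denominator = pred_sum + target_sum
    if denominator == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denominator


# Toy example: 2 overlapping pixels, 3 foreground pixels in each mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
target_mask = [[1, 0, 0],
               [0, 1, 1]]
score = dice_score(pred_mask, target_mask)  # 2 * 2 / (3 + 3)
```

A score of 1.0 indicates perfect overlap and 0.0 indicates none, so a higher DICE score corresponds to a predicted mask that agrees more closely with the ground truth.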