
Ablation experiment on MSGP
The MSGP structure mainly comprises DSC and MSA. DSC aggregates multi-scale local features, and MSA further constructs the global context. Thus, in this section, we separately remove DSC and MSA to assess their impacts on prediction accuracy and feature variability. Other components of MSGP were also evaluated in ablation experiments, including the embedding feature dimension and the number of network layers. Table 1 presents different encoder structures at small (S), medium (M), and large (L) scales. In this ablation experiment, the decoder removes SCRD and instead generates the final predicted maps from the fused features of stages 3–5 using 1 × 1 and 3 × 3 convolutions.
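The paper does not give the exact DSC implementation; a minimal depthwise-separable convolution block in PyTorch might look like the sketch below (class and argument names are hypothetical, and the BatchNorm/ReLU placement is an assumption):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a
    pointwise 1x1 conv (channel mixing). Varying kernel_size or
    dilation across parallel branches yields multi-scale local features."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

block = DepthwiseSeparableConv(32, 64)
y = block(torch.randn(1, 32, 64, 64))
print(y.shape)  # torch.Size([1, 64, 64, 64])
```

Compared with a standard 3 × 3 convolution, the depthwise + pointwise factorization reduces the parameter count from roughly `9·C_in·C_out` to `9·C_in + C_in·C_out`, which is why DSC-based encoders tend to be lighter.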
This ablation experiment first compares different parameter levels and selects the optimal structure for the component ablation. Two baseline encoders are used to evaluate the effectiveness of each module: instead of MSGP, CNN-DSC and CNN-MSA are used to extract features. CNN-DSC removes MSA from MSGP, using only DSC and the residual structure as the encoder. In contrast, CNN-MSA is constructed by removing DSC from MSGP; at each stage, it uses pooling operators to downsample feature maps. All decoders share the same network structure. Table 2 reports the prediction performance on the two datasets. Among the three encoder scales, MSGP-S achieved the best performance, with an average 92.33% F1 score and 96.75% OA. However, MSGP-L and MSGP-M did not significantly improve accuracy on the WHU dataset, which suggests that once the feature embedding dimension reaches a certain threshold, further enlargement yields limited representational gains. Additionally, MSGP-M achieves a 1.54% higher F1 score than MSGP-L on the Massub dataset. Hence, compared with the WHU dataset, the smaller number of training samples in the Massub dataset likely makes MSGP-L harder to optimize. MSGP-S improved the average prediction accuracy by about 2% in F1 score and 3% in OA compared with CNN-MSA-S. In addition, although CNN-DSC-S has an OA similar to the proposed module on WHU, its F1 score is 3.26% lower on the WHU dataset and 3.3% lower on the Massub dataset than that of MSGP-S. This confirms that introducing global features can significantly improve building extraction ability.
In terms of model efficiency, CNN-DSC-S, which contains no MSA, has an advantage over the other encoders, and MSGP-S ranks second. CNN-MSA-S constructs global attention through high-dimensional tensor computation and therefore has the highest algorithmic complexity among the compared encoders. In contrast, MSGP generates MSA from multi-scale DSC tokens with spatial downsampling, significantly reducing the multidimensional tensor computation. Overall, MSGP-S outperforms the other encoders in both accuracy and efficiency. Visually, this study uses the Grad-CAM [30] algorithm to generate feature heat maps. In Fig. 7, the feature maps from Stage 3 are visualized to show the feature response before and after MSGP (satellite images were processed by Python 3.11 [43]). Partial details are enlarged, as shown in the green rectangles. Most foreground features are enhanced across multi-scale regions, and some background information is filtered out. Additionally, interior features in large-scale areas show a stronger response, which indicates that MSGP can enhance the overall representation of building areas, as marked in Area 2. Before applying MSGP, feature responses were discontinuous within building areas; after applying MSGP, they are significantly enhanced. This indicates that multi-scale information synthesis can improve feature extraction capability.
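The efficiency mechanism described above, attention over spatially downsampled key/value tokens, can be sketched as follows. This is an assumed form (similar in spirit to spatial-reduction attention), not the authors' exact MSGP code; the class name and the average-pooling choice are hypothetical:

```python
import torch
import torch.nn as nn

class DownsampledAttention(nn.Module):
    """Queries come from the full-resolution token grid, while keys and
    values come from spatially pooled tokens. This cuts attention cost
    from O(N^2) to O(N * N / r^2) for reduction factor r."""
    def __init__(self, dim, num_heads=4, reduction=4):
        super().__init__()
        self.pool = nn.AvgPool2d(reduction)  # spatial downsampling of K/V tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)           # (B, H*W, C) full-res queries
        kv = self.pool(x).flatten(2).transpose(1, 2)  # (B, H*W/r^2, C) pooled K/V
        out, _ = self.attn(q, kv, kv)              # cross-attend to pooled tokens
        return out.transpose(1, 2).reshape(B, C, H, W)

m = DownsampledAttention(dim=32, num_heads=4, reduction=4)
y = m(torch.randn(2, 32, 16, 16))
print(y.shape)  # torch.Size([2, 32, 16, 16])
```

With a 16 × 16 map and r = 4, each query attends to 16 pooled tokens instead of 256, a 16× reduction in the attention matrix, which matches the paper's claim of reduced multidimensional tensor computation.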
Heatmap and prediction using pre- and post-MSGP modules. Legend: satellite images were generated by Python software [version number 3.11, URL: https://www.python.org/downloads/].
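The Grad-CAM heat maps shown in Fig. 7 weight a layer's activations by the spatially averaged gradients of the target score. A minimal sketch of the standard algorithm (the paper's exact visualization code is not given; the function name and hook style here are illustrative):

```python
import torch
import torch.nn as nn

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM: channel-weight the target layer's activations
    by the mean gradient of the class score, then ReLU and normalise."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = model(image)[0, class_idx].sum()  # sum over pixels for segmentation
    score.backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)   # per-channel weights
    cam = torch.relu((w * acts["a"]).sum(dim=1))    # (B, H, W)
    return cam / (cam.max() + 1e-8)                 # scale to [0, 1]

# Toy segmentation head, purely for demonstration.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 2, 1))
cam = grad_cam(model, model[0], torch.randn(1, 3, 16, 16), class_idx=1)
print(cam.shape)  # torch.Size([1, 16, 16])
```

The ReLU keeps only regions that positively support the building class, which is why the heat maps highlight foreground areas while background responses are suppressed.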
Ablation experiment on SCRD
Based on the evaluation in the "Ablation experiment on MSGP" section, MSGP-S is used as the encoder. In this experiment, SCRD is removed and replaced with a feature pyramid decoder for ablation testing. The semantic token dimension T is treated as a hyperparameter to determine the best-performing SCRD configuration; T is expressed as a percentage of the number of input feature channels. Figure 8 shows the effect of T on extraction accuracy (the sector area represents accuracy changes, and the colors and heights of the sectors represent T increasing clockwise). The F1 score and OA increase on both test datasets as T grows from 0 to 30%. However, when T exceeds 40%, the accuracy metrics show a downward trend. Intuitively, accuracy saturates when T falls between 30% and 40%. Increasing T introduces more parameters, raising computational complexity and making the model harder to train; in addition, the matrix operations between the feature maps and the T tokens consume more memory. As a result, this experiment sets T to 40% of the input feature channels to balance accuracy and efficiency.
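The token dimension T and the feature-token matrix operations described above can be illustrated with a minimal sketch. This is an assumed formulation (learned spatial attention condensing the map into T tokens); the class name, the 1 × 1-conv attention, and the softmax normalization are hypothetical, not the authors' SCRD code:

```python
import torch
import torch.nn as nn

class SemanticTokens(nn.Module):
    """Condense a (B, C, H, W) feature map into T semantic tokens via a
    learned spatial attention map. T is a fraction of the channel count,
    40% here, following the ablation result in the text."""
    def __init__(self, channels, token_ratio=0.4):
        super().__init__()
        self.T = max(1, int(channels * token_ratio))
        self.token_attn = nn.Conv2d(channels, self.T, 1)  # one spatial map per token

    def forward(self, feat):                              # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        attn = self.token_attn(feat).flatten(2).softmax(-1)    # (B, T, H*W)
        tokens = attn @ feat.flatten(2).transpose(1, 2)        # (B, T, C)
        return tokens

m = SemanticTokens(channels=32)            # T = int(32 * 0.4) = 12 tokens
tokens = m(torch.randn(2, 32, 16, 16))
print(tokens.shape)  # torch.Size([2, 12, 32])
```

The `attn @ feat` product is the feature-map/token matrix operation whose memory cost grows with T, consistent with the saturation behavior reported in Fig. 8.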
Precision statistical results in SCRD ablation testing using multi-radius sector graphs.
Overall, SCRD improved the prediction performance by about 2% in F1 score and 1% in OA on the two datasets. Visually, Fig. 9 compares the shallow features before and after SCRD (satellite images were processed by Python 3.11 [43]). The edge and corner information of buildings shows a stronger feature response after applying SCRD, and irrelevant background features are filtered out. Hence, SCRD can refine shallow features by interacting with the high-level semantic context.
Heatmap and prediction using pre- and post-SCRD modules. Legend: satellite images were generated by Python software [version number 3.11, URL: https://www.python.org/downloads/].
Comparison with attention mechanisms
This experiment compared the proposed model with several representative self-attention mechanisms, including CNN-based methods such as CBAM and DANet and Transformer-based methods such as Swin Transformer (SwinT) [17] and SegFormer (SegF) [20]. Figure 10 shows partial building extraction results using these state-of-the-art spatial attention mechanisms (satellite images were processed by Python 3.11 [43]). Areas 1–2 come from the Massub dataset, and Areas 3–4 come from the WHU dataset. Different colors are overlaid on the original RGB images for visual assessment: true positives, false positives, and false negatives are marked in red, green, and yellow, respectively.
Comparison results using attention mechanisms on the test datasets. Legend: satellite images were generated by Python software [version number 3.11, URL: https://www.python.org/downloads/].
Visually, the proposed network model outperforms the other attention modules, implying that MSGP and SCRD together improve feature extraction. SwinT performs relatively well on large-scale buildings but extracts building boundaries poorly and produces many FNs in some interior regions. In contrast, SegF extracts most complete building objects but remains weak on detail features such as building edges and eave corners, as shown in Areas 3–4. DANet outperforms SegF and the proposed model in some detail predictions, but it still struggles with multi-scale buildings; for example, in the large-scale regions shown in Areas 1–2, there are many misclassifications. This implies that the dual-attention mechanism is sensitive to scale changes and spectral heterogeneity. CBAM achieved the worst performance among the compared attention models, with many roads misclassified, as shown in Areas 3–4. Lacking a context optimization strategy, the CBAM network only enhances global features and cannot obtain multi-scale local responses, which leads to confusion with other objects.
The quantitative statistics are presented in Table 3. The proposed model achieves the best building extraction performance on the two datasets in terms of both efficiency and accuracy. Although DANet has a higher OA than the other methods, its F1 score is lower than that of the proposed model. Moreover, DANet is overall superior to SegF and SwinT on the two datasets, which implies that a dual attention mechanism can obtain better extraction accuracy than a single spatial attention model on small datasets. Although the proposed method ranks second in inference speed, it has fewer parameters and lower complexity than the other methods.
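The F1 score and OA reported throughout the tables follow their standard pixel-wise definitions. A small reference implementation (the paper's exact evaluation code is not given, but these formulas are standard):

```python
import numpy as np

def f1_and_oa(pred, gt):
    """Binary segmentation metrics: F1 is the harmonic mean of precision
    and recall on the building class; OA is the fraction of correctly
    classified pixels over all pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)            # building predicted as building
    fp = np.sum(pred & ~gt)           # background predicted as building
    fn = np.sum(~pred & gt)           # building predicted as background
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    oa = np.mean(pred == gt)
    return f1, oa

f1, oa = f1_and_oa(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0]))
print(round(f1, 4), oa)  # 0.6667 0.75
```

Note that OA is dominated by the abundant background class, which is why a model (e.g. DANet or DeeplabV3+ here) can post a higher OA while still trailing in F1 on the building class.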
SwinT and SegF focus only on global attention and neglect the semantic correlation in multi-scale representation. In contrast, the proposed model not only establishes a multi-scale global context but also refines spatial information. In addition, although the above CNN-based hybrid attention models leverage spatial and channel contexts, their structures do not model semantic correlations across different stages; the resulting semantic discrepancy can increase model optimization difficulty. The proposed SCRD utilizes high-level semantics to guide shallow feature refinement. Therefore, this feature optimization strategy helps improve the model's prediction accuracy.
Comparison with different methods
This experiment employed several representative methods using a multi-scale extraction paradigm, including CNN-based methods: DeeplabV3+ [31], UNet++ [32], and PSPNet [33] (with ResNet50 as the backbone), and Transformer-based methods: Swin-Unet [34] and Pyramid Vision Transformer (PVT) [35]. In addition, several similar multi-scale ViT methods were selected for comparison, namely SSA [40], SMT [41], and Dilateformer [42]. In the experiment, the DeeplabV3+ model uses three dilated convolutions with rates of 6, 12, and 18, plus global average pooling. Figure 11 presents building extraction results using the different models on the WHU dataset (satellite images were processed by Python 3.11 [43]). In the marked rectangles, the relevant extraction details are enlarged for inspection. Visually, the test data cover an uneven distribution of buildings, with density varying between suburban and urban areas. Compared with the other methods, the proposed model achieves better extraction results, especially for large-scale buildings and small-scale densely built areas.
Among CNN-based models, UNet++ performs unsatisfactorily on multi-scale buildings. Although DeeplabV3+ performs relatively well on small- to medium-scale buildings, it produces numerous false negatives on some large-scale buildings, as depicted in Areas 2–3. This shows that ASPP, which relies on local dilated convolutions, cannot effectively capture context over a large receptive field. Moreover, multi-scale features in DeeplabV3+ are processed only at the deepest network layer without refining semantic information, which prevents it from adaptively selecting context. Compared with ASPP, the proposed MSGP enhances multi-scale context information and global interdependency. Although building interiors show high intra-class spectral variability, as in Area 2, MSGP can effectively alleviate this problem and improve segmentation accuracy. PSPNet performs well at small scales but makes many false predictions on roads and building boundaries, indicating that SPP aggregates some features well but easily loses details without feature optimization.
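The ASPP configuration used in the comparison (dilation rates 6/12/18 plus image-level pooling) can be sketched as below. This is a simplified illustration of the standard ASPP design, not the exact DeeplabV3+ code used in the experiment (BatchNorm/activation layers are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: a 1x1 branch, three 3x3 dilated
    branches (rates 6, 12, 18), and a global-average-pooling branch,
    concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
             for r in rates])
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.gap(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))

aspp = ASPP(64, 32)
y = aspp(torch.randn(1, 64, 16, 16))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```

Each dilated branch still samples only a fixed sparse grid of locations, which illustrates the limitation noted above: unlike global attention, ASPP's receptive field is enlarged but remains local and non-adaptive.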
Visual comparison with different methods on the WHU dataset. Legend: satellite images were generated by Python software [version number 3.11, URL: https://www.python.org/downloads/].
Although PVT obtains good predictions for medium-scale buildings, it produces many incomplete extractions at other scales, as well as misclassifications where roads and ground surfaces resemble buildings in texture and color. This indicates that PVT enhances global context representation through its pyramid MSA but is weak against spectral intra-class variation and inter-class similarity. PVT employs simple spatial downsampling to mitigate computational complexity but fails to adequately restore spatial details. Compared with the proposed method, Swin-Unet does not extract small-scale buildings well, with many undetected areas in Area 1. Swin-Unet utilizes a local window-based attention mechanism and builds a hierarchical feature fusion structure with skip connections; however, multi-scale context and spatial information are simply fused without semantic correlation. Although SSA and SMT introduce multi-scale local features for global representation, they remain sensitive to spatial detail loss and spectral heterogeneity in Areas 1 and 2. Dilateformer predicts large-scale buildings better than small-scale ones, but some roads and shaded areas are misclassified.
The Massub test results using the different network models are visualized in Fig. 12 (satellite images were processed by Python 3.11 [43]). The proposed method and Swin-Unet discriminate large-scale building areas better than the other models; however, Swin-Unet performs poorly on small-scale buildings and edges. In contrast, PSPNet and DeepLabv3+ perform poorly on large-scale buildings and misidentify background as buildings, and many FNs occur with PSPNet and UNet++. CNN-based methods are easily confused by roads and ground surfaces due to the lack of global feature optimization, resulting in ambiguous classification. Although PVT correctly identifies most buildings, it performs poorly at boundaries and in highly heterogeneous areas, especially areas covered by shadow and vegetation.
In contrast, although the proposed method produces some false positives (FPs) and false negatives (FNs) on certain roads and boundaries, it still yields better results for multi-scale building prediction and restores fine-grained features more effectively than the other methods. As shown in Areas 1 and 3, SSA, Dilateformer, and SMT perform poorly on densely distributed buildings, especially at boundaries, which are easily confused with the ground. This indicates that these methods lack spatial feature refinement and are susceptible to similar background features. The proposed method, using SCRD, alleviates this problem by calibrating shallow fused features with deep semantics.
Visual comparison with different methods on the Massub dataset. Legend: satellite images were generated by Python software [version number 3.11, URL: https://www.python.org/downloads/].
The quantitative statistics are reported in Table 4, where bold values denote the best results and underlined values the second-best. The extraction results on the two datasets confirm that the proposed network model is superior to the other methods in OA and F1 score. DeeplabV3+ achieves a high OA of 97.62%, outperforming the other methods on the WHU dataset; however, the proposed method's F1 score is 2.37% higher. Due to the limited training samples, all models have lower IOU and F1 scores on the Massub dataset than on the WHU dataset. In addition, PSPNet shows poor extraction performance compared with the other methods, with an average of about 86% F1 score and 92% OA.
On the Massub test dataset, although the proposed method did not achieve the best OA, it achieved the best F1 score of 96.36%. Swin-Unet obtained the second-best accuracy with almost the same OA, but its F1 score is 1.32% lower than that of the proposed method. DeepLabv3+ has relatively high OA but a lower F1 score than the proposed method. Similarly, PVT achieved high OA, but its IOU is 3.46% lower than that of the proposed method. Swin-Unet and PVT apply attention mechanisms with high algorithmic complexity, yet their prediction accuracy and efficiency are not optimal. Dilateformer performs better on the WHU dataset than on the Massub dataset. SSA and SMT have higher OA than the other methods but lower F1 scores and higher algorithmic complexity. UNet++ has the highest algorithmic complexity due to the lack of channel dimension optimization and its use of transposed convolutions in the decoder. In contrast, the proposed model has fewer parameters and lower complexity than the others, demonstrating that the proposed method, combined with a multi-scale global optimization strategy, achieves advanced performance for building extraction.