J/A+A/701/A223 ML-aided selected Lyalpha candidates in COSMOS2020 (Vale+, 2025)
A gradient boosting and broadband approach to finding Lyman-alpha emitting
galaxies beyond narrow-band surveys.
Vale A., Paulino-Afonso A., Humphrey A., Cunha P.A.C., Ribeiro B.,
Cerqueira B., Carvajal R., Fonseca J.
<Astron. Astrophys. 701, A223 (2025)>
=2025A&A...701A.223V        (SIMBAD/NED BibCode)
ADC_Keywords: Galaxies ; Photometry ; Galaxy catalogs ; Models
Keywords: methods: data analysis - methods: statistical - surveys -
galaxies: high redshift - galaxies: photometry
Abstract:
The identification of Lyman-alpha emitting galaxies (LAEs) has
traditionally relied on dedicated surveys using custom narrow-band
filters, which constrain observations to specific narrow redshift
intervals, or on blind spectroscopy, which - although unbiased -
typically requires extensive telescope time, making it challenging to
assemble large, statistically robust galaxy samples. With the advent
of wide-area astronomical surveys producing datasets significantly
larger than traditional surveys, the need for new techniques arises.
We test whether gradient boosting algorithms, trained on broadband
photometric data from traditional LAE surveys, can efficiently and
accurately distinguish LAE candidates from typical star-forming
galaxies at similar redshifts and brightness levels.
Using galaxy samples at z ∈ [2, 6] derived from the COSMOS2020 and
SC4K catalogs, we trained gradient-boosting machine learning
algorithms (LightGBM, XGBoost, and CatBoost), using optical and
near-infrared broadband photometry. To ensure balanced performance,
the models were trained on carefully selected datasets, with similar
redshift and i-band magnitude distributions. Additionally, the models
were tested for robustness by perturbing the photometric data using
the associated observational uncertainties.
Our classification models achieved F1-scores of ∼87%, successfully
identifying around 7000 objects with unanimous agreement across all
models. This more than doubles the number of LAEs identified in the
COSMOS field compared with the SC4K dataset. We spectroscopically
confirmed 60 of these LAE candidates using the publicly available
catalogs in the COSMOS field.
These results highlight the potential of machine learning in
efficiently identifying LAE candidates, laying foundations for
application to larger photometric surveys, such as Euclid and LSST. By
complementing traditional approaches and providing robust
pre-selection capabilities, our models facilitate the analysis of
these objects, crucial to increase our knowledge of the overall LAE
population.
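The robustness test described in this abstract (perturbing the
photometry within its observational uncertainties) can be illustrated
with a minimal sketch. It assumes independent Gaussian noise with a
standard deviation equal to each quoted flux error; the abstract does
not state the noise model, and the function names below are
hypothetical rather than taken from the authors' code.

```python
import random


def perturb_fluxes(fluxes, flux_errors, rng):
    """Return one noisy realisation of a set of broadband fluxes.

    Assumption: each flux is redrawn from a Gaussian centred on the
    measured value with sigma equal to its quoted uncertainty.
    """
    if len(fluxes) != len(flux_errors):
        raise ValueError("fluxes and flux_errors must have the same length")
    return [rng.gauss(flux, err) for flux, err in zip(fluxes, flux_errors)]


def perturbed_realisations(fluxes, flux_errors, n_draws, seed=0):
    """Generate n_draws independent perturbed copies, reproducible via seed."""
    rng = random.Random(seed)
    return [perturb_fluxes(fluxes, flux_errors, rng) for _ in range(n_draws)]
```

Each perturbed copy would then be passed to an already trained
classifier, and the scatter of the resulting predictions gives a
measure of how sensitive the classification is to photometric noise.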
Description:
We applied machine-learning techniques, specifically the
gradient-boosting algorithms LightGBM, XGBoost, and CatBoost, to
identify LAE candidates using broadband photometric data (fluxes,
magnitudes, and colors) in the optical and NIR. Using SC4K and
COSMOS2020, we extracted five samples with similar redshift and i-band
magnitude distributions to ensure comparable LAE and non-LAE (nLAE)
populations. We then trained, tested, and analyzed the three
algorithms on each of the five samples, resulting in 15 models.
ML-aided selected Lyalpha candidates in COSMOS2020 using
gradient-boosting algorithms.
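As an illustration of how 15 models (three algorithms on five samples)
can be combined into a vote count and a mean probability per object,
the following minimal sketch may help. The 0.5 decision threshold, the
choice to average over all 15 models, and every name in the code are
assumptions made for this example; the exact definitions used by the
authors are not given in this file.

```python
ALGORITHMS = ("LightGBM", "XGBoost", "CatBoost")
N_SAMPLES = 5

# One (algorithm, sample) pair per trained model: 3 x 5 = 15 models.
MODEL_KEYS = [
    (algorithm, sample)
    for sample in range(1, N_SAMPLES + 1)
    for algorithm in ALGORITHMS
]


def summarize_predictions(probabilities, threshold=0.5):
    """Return (times_predicted, average_probability) for one object.

    probabilities: one LAE probability per model.
    threshold: assumed decision cut (hypothetical value).
    """
    if not probabilities:
        raise ValueError("need at least one model probability")
    times_predicted = sum(1 for p in probabilities if p >= threshold)
    # Assumption: the mean is taken over all models, not only the positive ones.
    average_probability = sum(probabilities) / len(probabilities)
    return times_predicted, average_probability


def is_unanimous(probabilities, threshold=0.5):
    """True when every model classifies the object as an LAE candidate."""
    times_predicted, _ = summarize_predictions(probabilities, threshold)
    return times_predicted == len(probabilities)
```

Under these assumptions, an object with probabilities above the
threshold in 12 of the 15 models would get a vote count of 12, and
only objects with a count of 15 would qualify as unanimous.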
File Summary:
--------------------------------------------------------------------------------
 FileName    Lrecl  Records  Explanations
--------------------------------------------------------------------------------
ReadMe          80        .  This file
seldata.dat     74     7073  Identification of the selected candidates
--------------------------------------------------------------------------------
See also:
J/MNRAS/476/4725 : SC4K catalogue of candidate LAEs (Sobral+, 2018)
J/ApJS/258/11 : The COSMOS2020 catalog (Weaver+, 2022)
Byte-by-byte Description of file: seldata.dat
--------------------------------------------------------------------------------
   Bytes Format Units   Label          Explanations
--------------------------------------------------------------------------------
   1-  7 I7     ---     ID             COSMOS2020 ID number
   9- 26 F18.14 deg     RAdeg          COSMOS2020 Right Ascension (J2000.0)
  28- 45 F18.16 deg     DEdeg          COSMOS2020 Declination (J2000.0)
  47- 52 F6.4   ---     zph            COSMOS2020 LePhare redshift (lp_zBEST)
  54- 55 I2     ---     TimesPred      Number of predictions (out of 15)
  57- 74 F18.16 ---     AvgPredProba   Average prediction probability of
                                         LAE candidate
--------------------------------------------------------------------------------
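The byte ranges above are enough to read seldata.dat with plain string
slicing. The sketch below is one possible reader, not an official
tool: the function names are hypothetical, the file is assumed to be a
local ASCII copy, and the example record is made up to match the
layout rather than taken from the catalogue.

```python
# Column layout copied from the byte-by-byte description
# (byte positions are 1-based and inclusive).
COLUMNS = [
    ("ID", 1, 7, int),
    ("RAdeg", 9, 26, float),
    ("DEdeg", 28, 45, float),
    ("zph", 47, 52, float),
    ("TimesPred", 54, 55, int),
    ("AvgPredProba", 57, 74, float),
]


def parse_record(line):
    """Slice one fixed-width record into a dict of typed values."""
    record = {}
    for label, start, end, cast in COLUMNS:
        field = line[start - 1:end].strip()
        # Blank fields are kept as None rather than guessed.
        record[label] = cast(field) if field else None
    return record


def read_seldata(path):
    """Read every non-empty record from a local copy of seldata.dat."""
    with open(path, encoding="ascii") as handle:
        return [parse_record(line.rstrip("\n")) for line in handle if line.strip()]


def make_example_line():
    """Build a made-up 74-byte record in the same layout (illustration only)."""
    return (
        f"{1234567:7d} {150.12345678901234:18.14f} {2.1234567890123456:18.16f} "
        f"{3.1234:6.4f} {15:2d} {0.9876543210123456:18.16f}"
    )
```

With the data file downloaded, read_seldata("seldata.dat") returns one
dict per candidate, and keeping only rows with TimesPred equal to 15
selects the objects flagged in all 15 predictions. Readers who prefer
a library route can use the CDS table reader in astropy, which takes
the data file together with this ReadMe.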
Acknowledgements:
Afonso Vale, afonso.vale(at)astro.up.pt
(End) Patricia Vannier [CDS] 29-Jul-2025