J/ApJ/811/30 Machine learning metallicity predictions using SDSS (Miller, 2015)
The synthetic-oversampling method: using photometric colors to discover
extremely metal-poor stars.
Miller A.A.
<Astrophys. J., 811, 30 (2015)>
=2015ApJ...811...30M (SIMBAD/NED BibCode)
ADC_Keywords: Abundances, [Fe/H] ; Models ; Photometry, SDSS
Keywords: methods: data analysis; methods: statistical; stars: general;
stars: statistics; stars: fundamental parameters; surveys
Abstract:
Extremely metal-poor (EMP) stars ([Fe/H]≤-3.0dex) provide a unique
window into understanding the first generation of stars and early
chemical enrichment of the universe. EMP stars are exceptionally rare,
however, and the relatively small number of confirmed discoveries
limits our ability to exploit these near-field probes of the first
∼500Myr after the Big Bang. Here, a new method to photometrically
estimate [Fe/H] from only broadband photometric colors is presented. I
show that the method, which utilizes machine-learning algorithms and a
training set of ∼170000 stars with spectroscopically measured [Fe/H],
produces a typical scatter of ∼0.29dex. This performance is similar to
what is achievable via low-resolution spectroscopy, and outperforms
other photometric techniques, while also being more general. I further
show that a slight alteration to the model, wherein synthetic EMP
stars are added to the training set, yields the robust identification
of EMP candidates. In particular, this synthetic-oversampling method
recovers ∼20% of the EMP stars in the training set, at a precision of
∼0.05. Furthermore, ∼65% of the false positives from the model are
very metal-poor stars ([Fe/H]≤-2.0dex). The synthetic-oversampling
method is biased toward the discovery of warm (∼F-type) stars, a
consequence of the targeting bias from the Sloan Digital Sky
Survey/Sloan Extension for Galactic Understanding and Exploration
(SEGUE) survey. This EMP selection method represents a significant
improvement over alternative broadband optical selection techniques.
The models are applied to
>12 million stars, with an expected yield of ∼600 new EMP stars, which
promises to open new avenues for exploring the early universe.
Description:
Photometric colors and spectroscopic [Fe/H] measurements for the
training set sources are selected from SDSS data release 10 (DR10; Ahn
et al. 2014ApJS..211...17A). The selection criteria are designed to
select sources with the most reliable photometric and spectroscopic
measurements. It is important to note that each of these criteria can
be applied to the ∼2.6x10^8 SDSS stars with no spectroscopic
observations, ensuring that these choices do not introduce a
significant bias in the final model predictions. See section 2 for
further explanations.
In addition to building a robust and representative training set, the
choice of machine-learning algorithm is essential for the construction
of a useful model. Three different algorithms are utilized in this
study: K-nearest Neighbors (KNN) regression, the Random Forest
(RF) method, and the Support Vector Machine (SVM) model. See section 3
for further explanations.
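As a rough illustration of the simplest of the three algorithms, the sketch
below implements an unweighted KNN regression in plain Python: the predicted
[Fe/H] of a star is the mean [Fe/H] of its k nearest training-set stars in
color space. This is not the author's implementation. The function name
knn_regress, the choice of colors, the value of k, and the toy training
values are all assumptions made for illustration; the features and
hyperparameters actually used are described in section 3 of the paper. Note
also that the [Fe/H]1 column of table3.dat comes from the SVM-regression
model, not from KNN.

```python
import math

def knn_regress(train_colors, train_feh, query, k=3):
    """Predict [Fe/H] as the mean [Fe/H] of the k nearest training stars.

    Distances are plain Euclidean distances in color space (an assumption
    of this sketch; no feature scaling or distance weighting is applied).
    """
    # Pair every training star's distance to the query with its [Fe/H].
    dists = sorted(
        (math.dist(colors, query), feh)
        for colors, feh in zip(train_colors, train_feh)
    )
    nearest = dists[:k]
    return sum(feh for _, feh in nearest) / len(nearest)

# Toy, made-up training set: two colors per star and a [Fe/H] label.
train_colors = [(1.10, 0.40), (1.05, 0.38), (0.90, 0.30), (0.85, 0.28)]
train_feh = [-0.5, -0.7, -2.0, -2.4]

# The two nearest neighbors of this query are the last two training stars.
prediction = knn_regress(train_colors, train_feh, (0.88, 0.29), k=2)
```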
File Summary:
--------------------------------------------------------------------------------
FileName Lrecl Records Explanations
--------------------------------------------------------------------------------
ReadMe 80 . This file
table3.dat 95 12569529 Final metallicity predictions for field stars
--------------------------------------------------------------------------------
See also:
V/139 : The SDSS Photometric Catalog, Release 9 (Adelman-McCarthy+, 2012)
J/ApJ/807/171 : SkyMapper Survey metal-poor star spectrosc. (Jacobson+, 2015)
J/ApJ/798/122 : SEGUE Stellar Parameters Pipeline abundances (Miller+, 2015)
J/A+A/568/A7 : Model SDSS colors for halo stars (Allende Prieto+, 2014)
J/AJ/147/136 : Stars of very low metal abundance. VI. (Roederer+, 2014)
J/AJ/145/13 : Metal-poor stars from SDSS/SEGUE. I. Abundances (Aoki+, 2013)
J/ApJS/199/30 : Effective temperatures for KIC stars (Pinsonneault+, 2012)
J/MNRAS/414/2602 : Automated classification of HIP variables (Dubath+, 2011)
J/AJ/137/4377 : List of SEGUE plate pairs (Yanny+, 2009)
J/AJ/136/2070 : SEGUE stellar parameter pipeline. III. (Allende Prieto+, 2008)
J/AJ/136/2050 : SEGUE stellar parameter pipeline. II. (Lee+, 2008)
J/A+A/484/721 : HES survey. IV. Candidate metal-poor stars (Christlieb+, 2008)
J/ApJ/652/1585 : Bright metal-poor stars from HES survey (Frebel+, 2006)
J/AJ/103/1987 : Stars of very low metal abundance (Beers+, 1992)
J/AJ/90/2089 : Stars of very low metal abundance. I (Beers+, 1985)
http://www.sdss3.org/ : SDSS-III home page
Byte-by-byte Description of file: table3.dat
--------------------------------------------------------------------------------
Bytes Format Units Label Explanations
--------------------------------------------------------------------------------
1- 4 A4 --- --- [SDSS]
6- 24 A19 --- SDSS SDSS object name (JHHMMSS.ss+DDMMSS.s)
26- 44 I19 --- objID Object ID from the SDSS DR10 PhotoObjAll table
46- 47 I2 h RAh Hour of Right Ascension (J2000)
49- 50 I2 min RAm Minute of Right Ascension (J2000)
52- 56 F5.2 s RAs Second of Right Ascension (J2000)
58 A1 --- DE- Sign of the Declination (J2000)
59- 60 I2 deg DEd Degree of Declination (J2000)
62- 63 I2 arcmin DEm Arcminute of Declination (J2000)
65- 68 F4.1 arcsec DEs Arcsecond of Declination (J2000)
70- 73 I4 K Teff [4500/6988] Photometric Teff (1)
75- 80 F6.3 [-] [Fe/H]1 [-2.5/0.5] Photometric [Fe/H] using Support
Vector Machine (SVM)-regression model
82- 87 F6.3 [-] [Fe/H]2 [-3.6/0.4] Photometric [Fe/H] using
synthetic-oversampling
89- 95 F7.4 --- rho [0.03/28.6] Proximity measure (ρ) of a
given star to the training set (2)
--------------------------------------------------------------------------------
Note (1): After Pinsonneault et al. (2012, Cat. J/ApJS/199/30)
Note (2): ρ represents the mean Euclidean distance between a given source
and its 60-nearest-training-set neighbors. Sources with large ρ
are likely to have unreliable estimates of [Fe/H].
Thresholds on ρ as in table 4:
----------------------------
Percentile ρt
----------------------------
68 0.0843
90 0.1310
95 0.1705
99 0.3883
99.5 0.5774
99.7 0.7737
----------------------------
Note: Each threshold, ρt, is the value for which the listed
percentage of training set sources have ρ≤ρt.
See section 6 for further explanations.
--------------------------------------------------------------------------------
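The byte-by-byte description above can be turned into a small fixed-width
reader. The sketch below is one possible way to do so in plain Python, not an
official CDS tool: it slices a record using the byte ranges listed for
table3.dat, converts the sexagesimal position to decimal degrees, and looks up
the smallest tabulated percentile whose threshold ρt bounds the star's ρ.
The names FIELDS, parse_record and rho_percentile are hypothetical, and the
sample record is made up solely to exercise the byte layout; it is not a row
from the catalog.

```python
# Byte ranges (1-indexed, inclusive) copied from the byte-by-byte description.
FIELDS = {
    "SDSS":    (6, 24, str),
    "objID":   (26, 44, int),
    "RAh":     (46, 47, int),
    "RAm":     (49, 50, int),
    "RAs":     (52, 56, float),
    "DE-":     (58, 58, str),
    "DEd":     (59, 60, int),
    "DEm":     (62, 63, int),
    "DEs":     (65, 68, float),
    "Teff":    (70, 73, int),
    "[Fe/H]1": (75, 80, float),
    "[Fe/H]2": (82, 87, float),
    "rho":     (89, 95, float),
}

# Thresholds on rho from Note (2): (percentile, rho_t).
RHO_THRESHOLDS = [(68, 0.0843), (90, 0.1310), (95, 0.1705),
                  (99, 0.3883), (99.5, 0.5774), (99.7, 0.7737)]

def parse_record(line):
    """Slice one 95-byte record into a dict and add decimal coordinates."""
    rec = {}
    for name, (start, end, cast) in FIELDS.items():
        rec[name] = cast(line[start - 1:end].strip())
    rec["ra_deg"] = 15.0 * (rec["RAh"] + rec["RAm"] / 60.0 + rec["RAs"] / 3600.0)
    # The sign is a separate column, so -00 degrees is handled correctly.
    sign = -1.0 if rec["DE-"] == "-" else 1.0
    rec["dec_deg"] = sign * (rec["DEd"] + rec["DEm"] / 60.0 + rec["DEs"] / 3600.0)
    return rec

def rho_percentile(rho):
    """Smallest tabulated percentile with rho <= rho_t, or None if above all."""
    for pct, rho_t in RHO_THRESHOLDS:
        if rho <= rho_t:
            return pct
    return None

# Made-up record (NOT real catalog data), built field by field so that each
# value lands on the byte positions given in the description.
sample = " ".join([
    "SDSS", "J123456.78+012345.6", "1237650000000000001",
    "12", "34", "56.78", "+01", "23", "45.6",
    "5800", "-1.234", "-1.300", " 0.0900",
])

rec = parse_record(sample)
pct = rho_percentile(rec["rho"])
```

A star for which rho_percentile returns None lies farther from the training
set than 99.7% of the training sources; per Note (2), its [Fe/H] estimates
are likely to be unreliable.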
History:
From electronic version of the journal
(End) Prepared by [AAS], Emmanuelle Perret [CDS] 21-Dec-2015