J/MNRAS/484/834 Unsupervised machine learning to detect anomalies (Giles+, 2019)
Systematic serendipity: a test of unsupervised machine learning as a method for
anomaly detection.
Giles D., Walkowicz L.
<Mon. Not. R. Astron. Soc., 484, 834-849 (2019)>
=2019MNRAS.484..834G (SIMBAD/NED BibCode)
ADC_Keywords: Surveys ; Positional data ; Optical
Keywords: methods: data analysis - surveys -
stars: individual: KIC 8462852
Abstract:
Advances in astronomy are often driven by serendipitous discoveries.
As survey astronomy continues to grow, the size and complexity of
astronomical data bases will increase, and the ability of astronomers
to manually scour data and make such discoveries decreases. In this
work, we introduce a machine learning-based method to identify
anomalies in large data sets to facilitate such discoveries, and apply
this method to long cadence light curves from NASA's Kepler Mission.
Our method clusters data based on density, identifying anomalies as
data that lie outside of dense regions. This work serves as a
proof-of-concept case study and we test our method on four quarters of
the Kepler long cadence light curves. We use Kepler's most notorious
anomaly, Boyajian's star (KIC 8462852), as a rare 'ground truth' for
testing outlier identification to verify that objects of genuine
scientific interest are included among the identified anomalies. We
evaluate the method's ability to identify known anomalies by
identifying unusual behaviour in Boyajian's star; we report the full
list of identified anomalies for these quarters, and present a sample
subset of identified outliers that includes unusual phenomena, objects
that are rare in the Kepler field, and data artefacts. By identifying
<4 per cent of each quarter as outlying data, we demonstrate that this
anomaly detection method can create a more targeted approach in
searching for rare and novel phenomena.
Description:
The data we consider in this study are long-cadence photometric light
curves from Quarters 4, 8, 11, and 16 of NASA's Kepler mission. We
use Data Release 25, in which all Q0-Q17 data were reprocessed with the
updated data pipeline.
The Kepler spacecraft was designed to obtain near-continuous
photometry for stars in a single, star-rich 105deg2 field of view
(FOV) centred at R.A.=19h22m40s and Dec=44°30'00" from 2009 March
to 2013 May. The photometer camera contains 42 CCDs with 2200x1024
pixels, where each pixel covers 4arcsec. However, only pre-selected
stars of interest were downloaded (Batalha et al.
2010ApJ...713L.109B). Every 3 months the Kepler spacecraft rolled by
90deg to re-align its solar panels, and these rolls define epochs
known as 'Quarters'. Each roll places any given star in one of four
different positions on the focal plane depending on season: in this
study Quarters 4, 8, and 16 share the same orientation, while Quarter
11 is in the preceding orientation.
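The orientation statement can be checked with a short sketch. It assumes
exactly one 90deg roll between every pair of consecutive quarters, as the
paragraph above describes. The function name roll_index and its 0-3
numbering are hypothetical labels for illustration, not the official
Kepler season numbers.

```python
def roll_index(quarter):
    """Relative orientation index (0-3).

    Assumption: one 90-degree roll separates consecutive quarters, so the
    orientation repeats every four quarters. The index is an arbitrary
    label, not the mission's own season number.
    """
    return quarter % 4


quarters = [4, 8, 11, 16]
orientation = {q: roll_index(q) for q in quarters}

# Quarters sharing an index share a focal-plane orientation.
same_as_q4 = [q for q in quarters if orientation[q] == orientation[4]]

# "Preceding orientation" = one roll earlier, i.e. index minus one (mod 4).
q11_precedes = orientation[11] == (orientation[4] - 1) % 4
```

Under this assumption Quarters 4, 8, and 16 receive the same index and
Quarter 11 receives the index one roll earlier.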
This work utilizes a proximity clustering approach to identify
outliers, based on Density-Based Spatial Clustering of Applications
with Noise (DBSCAN) (Ester et al. 1996, in Simoudis E., Han J., Fayyad
U., eds, Proceedings of the 2nd International Conference on Knowledge
Discovery and Data Mining. AAAI Press, Palo Alto. p.226). The DBSCAN
algorithm is a nearest neighbour approach with two parameters defining
what constitutes a cluster: the maximum separation (ε) in
feature space between two points to be associated with one another,
and the minimum number of associated neighbours (k) to qualify a point
as a core cluster member.
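The roles of the two DBSCAN parameters can be illustrated with a minimal
sketch. This is not the authors' implementation: it is a brute-force toy
that only assigns each point a role and does not build cluster labels.
It assumes Euclidean distance and counts neighbours excluding the point
itself; the function name classify_points and the toy data are
hypothetical. The role codes -1, 0, and 1 follow Note (1) of table4.dat.

```python
import math


def classify_points(points, eps, k):
    """Return one role per point: -1 outlier, 0 core member, 1 edge member.

    Assumptions (not taken from the paper): Euclidean distance, and a
    point does not count as its own neighbour.
    """
    n = len(points)
    # neighbours[i] holds the indices of all other points within eps of i.
    neighbours = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core member has at least k associated neighbours.
    is_core = [len(nb) >= k for nb in neighbours]

    roles = []
    for i in range(n):
        if is_core[i]:
            roles.append(0)
        elif any(is_core[j] for j in neighbours[i]):
            # Not dense enough itself, but within eps of a core member.
            roles.append(1)
        else:
            roles.append(-1)
    return roles


# Toy 2D feature space: a tight clump of four points, one point on its
# fringe, and one isolated point.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (0.25, 0.1), (5.0, 5.0)]
roles = classify_points(toy, eps=0.2, k=3)
```

In this toy case the four clump points come out as core members, the
fringe point as an edge member, and the isolated point as an outlier.
The pairwise distance loop scales quadratically, so a spatial index
would be needed for catalogue-sized data.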
Across all quarters we considered 149789 objects, of which 8507 unique
objects were identified as outliers representing 5.68 per cent of all
objects considered (list of outliers shown in table 4). A total of
141282 objects, 94.32 per cent of all objects, were identified only as
part of a cluster, either as core cluster members or edge cluster
members. A total of 3584 objects were identified as outliers in every
quarter (2.39 per cent of all objects and 42 per cent of all outliers).
The remaining 4923 objects were transient outliers, identified as an
outlier in at least one quarter and as a cluster member in at least one
other.
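The counts and percentages quoted above are mutually consistent, which
the following sketch recomputes from the four numbers given in the text.
The variable names are chosen here for illustration only.

```python
# Counts quoted in the Description.
total = 149789        # objects considered across all quarters
outliers = 8507       # unique objects flagged as an outlier at least once
clustered = 141282    # objects only ever identified as cluster members
persistent = 3584     # objects flagged as outliers in every quarter

# Transient outliers are the flagged objects that are not persistent.
transient = outliers - persistent

# Percentages, rounded as in the text.
pct_outliers = f"{100 * outliers / total:.2f}"
pct_clustered = f"{100 * clustered / total:.2f}"
pct_persistent_all = f"{100 * persistent / total:.2f}"
pct_persistent_out = f"{100 * persistent / outliers:.0f}"

# Outliers and cluster-only objects partition the sample.
partition_ok = outliers + clustered == total
```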
File Summary:
--------------------------------------------------------------------------------
 FileName    Lrecl  Records   Explanations
--------------------------------------------------------------------------------
ReadMe          80        .   This file
table4.dat      88     8507   List of outliers
--------------------------------------------------------------------------------
See also:
V/133 : Kepler Input Catalog (Kepler Mission Team, 2009)
Byte-by-byte Description of file: table4.dat
--------------------------------------------------------------------------------
   Bytes Format Units    Label   Explanations
--------------------------------------------------------------------------------
   1-  9  I9    ---      KIC     KIC identification number
  11- 12  I2    h        RAh     Right ascension (J2000)
  14- 15  I2    min      RAm     Right ascension (J2000)
  17- 22  F6.3  s        RAs     Right ascension (J2000)
      24  A1    ---      DE-     Declination sign (J2000)
  25- 26  I2    deg      DEd     Declination (J2000)
  28- 29  I2    arcmin   DEm     Declination (J2000)
  31- 35  F5.2  arcsec   DEs     [0/60] Declination (J2000)
  37- 41  I5    K        Teff    ? Effective temperature
  43- 45  I3    K        E_Teff  ? Upper error on Teff
  47- 50  I4    K        e_Teff  ? Lower error on Teff
  52- 57  F6.3  [cm/s2]  logg    ? Surface gravity
  59- 63  F5.3  [cm/s2]  E_logg  ? Upper error on logg
  65- 69  F5.3  [cm/s2]  e_logg  ? Lower error on logg
  71- 76  F6.3  mag      Kepmag  ? Kepler magnitude
  78- 79  I2    ---      Q4      Outlier flag on quarter 4 of the Kepler
                                   mission (1)
  81- 82  I2    ---      Q8      Outlier flag on quarter 8 of the Kepler
                                   mission (1)
  84- 85  I2    ---      Q11     Outlier flag on quarter 11 of the Kepler
                                   mission (1)
  87- 88  I2    ---      Q16     Outlier flag on quarter 16 of the Kepler
                                   mission (1)
--------------------------------------------------------------------------------
Note (1): Flag as follows:
    -1 = the object is an outlier in this quarter
     0 = the object is a core cluster member
     1 = the object is an edge cluster member
--------------------------------------------------------------------------------
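The byte ranges above are enough to read table4.dat with plain string
slicing. The sketch below is one possible reader, not an official tool.
The names FIELDS, parse_record, and is_persistent_outlier are
hypothetical, and the sample record is synthetic (built only to exercise
the column layout; it is not a row from the catalogue). Blank fields are
returned as None, on the assumption that columns marked '?' are left
blank when a value is missing.

```python
# (label, first byte, last byte, converter); bytes are 1-indexed and
# inclusive, exactly as listed in the byte-by-byte description.
FIELDS = [
    ("KIC", 1, 9, int), ("RAh", 11, 12, int), ("RAm", 14, 15, int),
    ("RAs", 17, 22, float), ("DE-", 24, 24, str), ("DEd", 25, 26, int),
    ("DEm", 28, 29, int), ("DEs", 31, 35, float), ("Teff", 37, 41, int),
    ("E_Teff", 43, 45, int), ("e_Teff", 47, 50, int),
    ("logg", 52, 57, float), ("E_logg", 59, 63, float),
    ("e_logg", 65, 69, float), ("Kepmag", 71, 76, float),
    ("Q4", 78, 79, int), ("Q8", 81, 82, int),
    ("Q11", 84, 85, int), ("Q16", 87, 88, int),
]
QUARTER_FLAGS = ("Q4", "Q8", "Q11", "Q16")


def parse_record(line):
    """Split one 88-byte record into a dict keyed by column label."""
    line = line.rstrip("\n").ljust(88)   # pad short lines so slices are safe
    record = {}
    for label, first, last, convert in FIELDS:
        text = line[first - 1:last].strip()
        # Assumption: a blank field means the value is missing.
        record[label] = convert(text) if text else None
    return record


def is_persistent_outlier(record):
    """True if the object is flagged -1 (outlier) in all four quarters."""
    return all(record[q] == -1 for q in QUARTER_FLAGS)


# Synthetic record (NOT catalogue data), formatted to the layout above.
sample = (
    f"{1234567:9d} {19:02d} {22:02d} {40.0:6.3f} +{44:02d} {30:02d} {0.0:5.2f} "
    f"{5800:5d} {150:3d} {150:4d} {4.4:6.3f} {0.1:5.3f} {0.1:5.3f} {12.345:6.3f} "
    f"{-1:2d} {0:2d} {1:2d} {-1:2d}"
)
record = parse_record(sample)
```

To read the real file, the same function can be applied to each line,
for example `[parse_record(l) for l in open("table4.dat")]`.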
History:
From electronic version of the journal
(End) Ana Fiallos [CDS] 16-Aug-2022