Finally Getting Results

This past week we have kept working on finding the ideal cuts for each of the filters, in order to get a data catalogue that strikes the right balance between including as many of the real line emitters as possible and not being full of ‘trash’ data.

In order to do this, we needed to make 3D plots of EW_Cut, Sigma_Cut, and the relative effectiveness (the product of purity and completeness, divided by the same value for a basic cut made beforehand by our supervisor David), and then find the peak in that data. To start with, we weren’t getting peaks because of various things we hadn’t included or considered.

Figure 1: Colour-Magnitude graph for the sources in the filter IA550, showcasing the ‘bulge’ in noisy data increasing from low to high magnitude

One of the things we needed to include was subtracting the ‘false’ values caused by the spread of noise (visible as the ‘bulge’ in Figure 1). These ‘false’ values are sources that are in the right redshift range to be an emitter but do not have the expected Sigma and EW values. We expected a distribution of fake emitters centred on Sigma = 0 and EW = 0. A minimum Sigma and EW cut already eliminates the bulk of these fake emitters, but some of them scatter to Sigma and EW values above our cut, so they have to be accounted for as well if the completeness and purity values are to be reliable. Our plots only include cuts of at least Sigma>3 and EW>100 (the reason is explained below), so for the subtraction we counted how many sources had Sigma<-3 and EW_obs<-100 within the redshift ranges of the different line emitters, and assumed that the spread of ‘fake emitters’ is approximately the same below an EW of 0 as above it. We then subtracted the number of sources found below this negative cut from the number found above the matching positive cut, so that the ‘fake emitters’ are statistically accounted for and the completeness and purity take this spread into consideration.
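As a rough illustration of this subtraction (not our actual script), the sketch below counts sources above a positive cut and below the mirrored negative cut, then subtracts one from the other. The toy catalogue, the function name `corrected_count` and the choice to clip the result at zero are all assumptions made for the example.

```python
# Toy catalogue: (sigma, EW_obs in Angstrom, spectroscopic redshift).
# Every value here is invented for illustration.
sources = [
    (5.2, 310.0, 2.95), (4.1, 150.0, 3.01), (3.4, 120.0, 2.91),
    (6.0, 400.0, 3.05), (3.2, 105.0, 2.99),
    (-3.5, -140.0, 2.97), (-4.2, -210.0, 3.02),
    (1.0, 20.0, 2.93), (-0.8, -15.0, 3.00), (4.5, 200.0, 1.20),
]


def corrected_count(sources, sigma_cut, ew_cut, z_min, z_max):
    """Return (positive count, negative count, noise-corrected count)."""
    # Keep only sources inside the redshift range expected for the line.
    in_range = [(s, ew) for s, ew, z in sources if z_min <= z <= z_max]
    # Sources passing the positive cut: real emitters plus scattered noise.
    n_pos = sum(1 for s, ew in in_range if s > sigma_cut and ew > ew_cut)
    # Sources passing the mirrored negative cut: assumed to be noise only.
    n_neg = sum(1 for s, ew in in_range if s < -sigma_cut and ew < -ew_cut)
    # Assume the noise is symmetric about zero, so subtract the negative tail.
    return n_pos, n_neg, max(n_pos - n_neg, 0)


n_pos, n_neg, n_corrected = corrected_count(sources, 3.0, 100.0, 2.90, 3.08)
print(n_pos, n_neg, n_corrected)
```

The corrected count, rather than the raw positive count, is then what would feed into purity and completeness.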

When plotting the graphs, we only included cuts that were at least Sigma>3 and EW>100. This is because lower cuts don’t remove many sources, so the completeness will be very high. Although the purity will be very low, the overall effectiveness will still be relatively high because of the completeness. We would rather have a lower effectiveness where the purity and the completeness are more balanced. Most of the noise has a sigma of 2.5 or lower, which is why we picked a minimum sigma of 3 for our graphs.

For a lot of this process, we used an ever-improving script to make our work more efficient (we’re now on version 8 part 3!!). The first Python script we wrote had only one goal: to determine at which redshifts we expected to observe each emission line (Lyman-alpha, H-alpha, [OII], [OIII] and CIV). For each filter, it took the centre wavelength and the FWHM, calculated the minimum and maximum wavelengths of the filter, and converted them into the corresponding redshifts using the formula:

Figure 3: Formula for redshift
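For reference, the standard relation between observed and rest-frame wavelength is written out below. Taking the filter edges to be the centre wavelength plus or minus half the FWHM is an assumption about how the script defined them; the symbols are chosen here for illustration.

```latex
z = \frac{\lambda_{\mathrm{obs}}}{\lambda_{\mathrm{rest}}} - 1,
\qquad
z_{\min} = \frac{\lambda_{c} - \mathrm{FWHM}/2}{\lambda_{\mathrm{rest}}} - 1,
\qquad
z_{\max} = \frac{\lambda_{c} + \mathrm{FWHM}/2}{\lambda_{\mathrm{rest}}} - 1
```

Here \lambda_c is the centre wavelength of the filter and \lambda_rest is the rest-frame wavelength of the emission line in question.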

We then decided to improve the script by adding new parts that would also, for a selected filter, calculate purity, completeness, effectiveness (purity x completeness) and relative effectiveness, for a number of combinations of EW and Sigma cuts. We would thus obtain catalogues that would allow us to plot the EW cut, the Sigma cut and the relative effectiveness on the same 3D plot for each filter: this plot should exhibit a clear peak in relative effectiveness for a certain combination of EW cut and Sigma cut.
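The sketch below shows one way such a scan over cuts could look. It is not our actual script: the toy sources, the redshift windows, the reference effectiveness of 0.5 and all function names are invented, and a source is treated as a 'real' emitter simply when its spectroscopic redshift falls inside one of the expected windows.

```python
# Toy catalogue: (sigma, EW_obs in Angstrom, spectroscopic redshift); invented values.
sources = [
    (3.5, 120.0, 2.95), (3.9, 180.0, 3.00), (5.5, 260.0, 0.30), (6.5, 400.0, 3.05),
    (3.2, 110.0, 1.50), (3.8, 130.0, 0.80), (4.2, 150.0, 1.90), (3.1, 105.0, 2.20),
]
# Hypothetical redshift windows in which an emission line falls inside the filter.
windows = [(2.90, 3.08), (0.27, 0.33)]


def in_any_window(z, windows):
    return any(lo <= z <= hi for lo, hi in windows)


def score_cut(sources, windows, sigma_cut, ew_cut):
    """Return (purity, completeness, effectiveness) for one pair of cuts."""
    real_total = sum(1 for _, _, z in sources if in_any_window(z, windows))
    selected = [src for src in sources if src[0] > sigma_cut and src[1] > ew_cut]
    real_selected = sum(1 for _, _, z in selected if in_any_window(z, windows))
    purity = real_selected / len(selected) if selected else 0.0
    completeness = real_selected / real_total if real_total else 0.0
    return purity, completeness, purity * completeness


def scan_cuts(sources, windows, sigma_cuts, ew_cuts, reference_effectiveness):
    """Score every combination of cuts; one row per (Sigma cut, EW cut)."""
    rows = []
    for sigma_cut in sigma_cuts:
        for ew_cut in ew_cuts:
            p, c, eff = score_cut(sources, windows, sigma_cut, ew_cut)
            rows.append((sigma_cut, ew_cut, p, c, eff, eff / reference_effectiveness))
    return rows


rows = scan_cuts(sources, windows, [3.0, 4.0, 5.0], [100.0, 150.0, 200.0], 0.5)
# The peak of the 3D plot is the row with the highest relative effectiveness.
best = max(rows, key=lambda row: row[5])
print(best)
```

Each row corresponds to one point of the 3D plot (Sigma cut, EW cut, relative effectiveness).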

Initially, we used a code structure based on the use of multiple while loops: we had three different increments, and three loops within one another, to check if, for each combination of EW and Sigma cuts, each emission satisfied the said cuts or not. We then used another loop to calculate purity, completeness and effectiveness. We displayed the results with a final loop, in a textual form.

One of the earliest changes we made was to the way the results were displayed: we quickly found out that having data expressed in sentences is not very handy when you later have to analyse or plot it… Therefore, we modified the script so that it would generate a .ascii file containing all the data we needed. This .ascii file could then be opened in Topcat, the software we used to plot our data.
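A minimal sketch of writing results as a plain-text table is given below. It assumes a whitespace-separated layout with a commented header line, which is a common ASCII table convention; the column names, the example rows and the function `write_ascii_table` are made up for the illustration, and the real script may well have used a table library instead.

```python
import io


def write_ascii_table(rows, column_names, out):
    """Write rows as whitespace-separated text with a '#' header line."""
    out.write("# " + " ".join(column_names) + "\n")
    for row in rows:
        out.write(" ".join(f"{value:.4f}" for value in row) + "\n")


# Hypothetical column names and values, for illustration only.
columns = ["Sigma_Cut", "EW_Cut", "Purity", "Completeness", "Effectiveness"]
results = [(3.0, 100.0, 0.5, 1.0, 0.5), (3.0, 150.0, 1.0, 0.75, 0.75)]

# An in-memory buffer stands in for a file opened with open(path, "w").
buffer = io.StringIO()
write_ascii_table(results, columns, buffer)
text = buffer.getvalue()
print(text)
```

One row per combination of cuts keeps the file easy to load as a table and plot column against column.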

Another problem we encountered was the time it took to run the code: for each filter, dozens of minutes were necessary to get the full results. After listening to the advice given by David, our supervisor, we realized that although our code worked and produced the desired catalogues, it was not efficient at all. Therefore, we completely changed the structure of the code. Instead of having three while loops nested inside one another to check if each emission satisfied each combination of cuts and could be counted as an emitter, we directly summed all the emissions corresponding to our criteria.
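The difference between the two approaches can be sketched as follows. This is an illustration rather than our real code: it assumes NumPy arrays are used for the columns, and the values and function names are invented.

```python
import numpy as np

# Invented Sigma and EW_obs values for eight sources.
sigma = np.array([3.5, 3.9, 5.5, 6.5, 3.2, 3.8, 4.2, 3.1])
ew_obs = np.array([120.0, 180.0, 260.0, 400.0, 110.0, 130.0, 150.0, 105.0])


def count_with_loop(sigma, ew_obs, sigma_cut, ew_cut):
    """Slow version: visit every source one at a time."""
    count = 0
    i = 0
    while i < len(sigma):
        if sigma[i] > sigma_cut and ew_obs[i] > ew_cut:
            count += 1
        i += 1
    return count


def count_with_mask(sigma, ew_obs, sigma_cut, ew_cut):
    """Fast version: build a boolean mask and sum it in one step."""
    return int(np.sum((sigma > sigma_cut) & (ew_obs > ew_cut)))


# All combinations of cuts at once, using broadcasting:
# axis 0 = Sigma cuts, axis 1 = EW cuts, axis 2 = sources.
sigma_cuts = np.array([3.0, 4.0, 5.0])
ew_cuts = np.array([100.0, 150.0, 200.0])
passes = (sigma[None, None, :] > sigma_cuts[:, None, None]) & (
    ew_obs[None, None, :] > ew_cuts[None, :, None]
)
counts = passes.sum(axis=2)  # number of selected sources per combination
print(counts)
```

Both functions give the same count for a given cut; the mask version simply lets the array library do the looping.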

So… What’s Next?

After determining the best EW and Sigma cuts per filter, we will average the cuts so that we have a homogeneous cut for every filter and can compare the data from different filters with each other. We will also compare the effectiveness of this cut with the cut used in Sobral et al. 2018, to make sure that changing it is worth losing the ability to compare between fields. After that we will move on to the next steps, narrowing the sample down to Lyman-alpha emitters rather than just any emitter. This will be done in 3 steps:

  1. Use the sources with available spectroscopic redshift to determine which are within the right redshift range to definitely be Ly-alpha emitters
  2. Find an appropriate photometric redshift range that won’t eliminate many line emitters, in a similar way to how we are finding the most appropriate EW and Sigma cuts
  3. Apply the Lyman break technique

More on these steps and more detailed explanations are coming next week, so keep an eye on the blog. For now: keep being curious, keep learning, because we’re trying to do just that.

-Amaia, Louis, Emily

Weeks 3, 4 and 5

Broadly speaking, astronomical catalogues consist of lists of celestial bodies. They have been around for thousands of years: the earliest star catalogues known to this day were made in Ancient Babylon during the second millennium BC.[1] [2] Until the catalogue of the Danish astronomer Tycho Brahe, created in the 16th century, these catalogues were based on observations made with the naked eye. [3] The invention of the very first telescope in 1608 in the Netherlands [4] changed the way we look at space forever, and allowed the development of more complete and detailed astronomical catalogues over the next few centuries. Today, rather than being engraved on clay or printed on paper, they tend to be generated as computer files, but they are still an important tool to study the universe. 

We were given a collection of 19 catalogues as FITS files, each one corresponding to a specific filter. Each catalogue contains thousands of emissions detected across the GOODS-S field, and a number of properties associated with them, such as observed equivalent width or coordinates. 

 

Figure 1: The GOODS fields. Source: Mauduit et al., 2012.

 

Our goal is to identify which ones of these emissions are Lyman-alpha emitters (LAEs), and which ones are not. However, before making such a selection, an intermediate step is needed. Indeed, not all emissions correspond to emission lines, and it is crucial to separate true line emitters from, on the one hand, noise, and, on the other hand, cosmic rays, image defects, and other artefacts. In order to perform such a selection, we use different “cuts”, as explained by Emily in a previous blog post. The first one, the Σ cut (Sigma cut), is used to identify with a high level of confidence the emissions that correspond to noise and that should be eliminated from our list of emitters. The second one, the EW cut (equivalent width cut), allows us to further refine our list of line emitters by excluding various artefacts. For example, if, for a specific filter, we select a Σ cut of 3 and an EW cut of 100, it means that out of all the emissions detected by this filter, we will only consider the ones that have Σ > 3 and EW_obs > 100 (observed equivalent width > 100) to be “candidate line emitters”. Furthermore, to be selected as a candidate line emitter, an emission must have a positive spectroscopic redshift (specz > 0).
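In code, this selection amounts to a three-part condition on each row of the catalogue. The sketch below uses a handful of invented rows; the column names follow the ones used in our plots, and the value -99 standing for 'no spectroscopic redshift' is an assumption made for the example.

```python
# Invented rows; in practice these would be read from the FITS catalogue.
catalogue = [
    {"sigma_NB": 4.8, "EW_obs": 220.0, "specz": 2.97},
    {"sigma_NB": 2.1, "EW_obs": 300.0, "specz": 3.01},   # fails the Sigma cut
    {"sigma_NB": 5.3, "EW_obs": 60.0, "specz": 0.31},    # fails the EW cut
    {"sigma_NB": 6.0, "EW_obs": 150.0, "specz": -99.0},  # no usable specz (assumed placeholder)
    {"sigma_NB": 3.4, "EW_obs": 101.0, "specz": 0.29},
]


def is_candidate(row, sigma_cut=3.0, ew_cut=100.0):
    """True if the row passes the Sigma cut, the EW cut and has specz > 0."""
    return (
        row["sigma_NB"] > sigma_cut
        and row["EW_obs"] > ew_cut
        and row["specz"] > 0
    )


candidates = [row for row in catalogue if is_candidate(row)]
print(len(candidates), "candidate line emitters")
```

Changing `sigma_cut` and `ew_cut` reproduces, in miniature, the kind of experiment we ran by hand in Topcat.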

Using the software Topcat, we selected a few filters and we started experimenting by creating scatter plots and histograms with different Σ and EW cuts, to see how varying these values would affect the selected samples.

 

Figure 2: Scatter plot for IA427, Σ>2. Green dots are the candidate line emitters, while green and red dots together represent all emissions.

 

Figure 3: Count against spectroscopic redshift for IA427, Σ>2 and specz > 0.

 

Figure 4: Scatter plot for IA427, Σ>4. Green dots are the candidate line emitters, while green and red dots together represent all emissions.

 

Figure 5: Count against spectroscopic redshift, for IA427: Σ>4 and specz > 0.

 

Figure 6: Scatter plot for IA445. Red: Emissions before EW cut; green: emissions for EW_0 > 30; pink: emissions for EW_0 > 90.

 

It is also possible to combine both cuts.

 

Figure 7: Scatter plot for IA427, Σ>3. Green dots are the candidate line emitters, while green and red dots together represent all emissions.

 

Figure 8: Scatter plot for IA427, EW_obs > 50. Blue dots are the candidate line emitters, while blue and red dots together represent all emissions.

 

Figure 9: Scatter plot for IA427, Σ>3 and EW_obs > 50. Grey dots are the candidate line emitters, while grey and red dots together represent all emissions.

 

After this period of trials, we had become a bit more familiar with the data and the concepts we were working with. We decided to take a more quantitative approach in order to identify line emitters.

The histograms of selected emitters were expected to look like the following:

 

Figure 10: Typical histogram after selection of line emitters. Credit: David Sobral.

 

As illustrated in Fig. 10, the distribution of the number of sources as a function of spectroscopic redshift (specz) should be discrete, with sources concentrated in a few narrow redshift ranges. Any histogram that shows a rather continuous distribution of sources across redshift does not correspond to a cut that can be considered efficient.

Let us choose IA484 to illustrate how this works in practice. This filter is centred around the wavelength 4840 Å, and has a width of 229 Å, which means that the wavelengths that can potentially be detected by this specific filter stretch from 4734.65 Å to 4963.75 Å. Following the advice of our supervisor David, we started calculating, for each filter, the corresponding redshift of different emission lines (H-alpha, Oxygen III, Oxygen II, Carbon IV, and Lyman-alpha) at these wavelengths, using the formula:

z = (observed wavelength / rest frame wavelength) - 1

Emission line | Lyman-alpha | H-alpha | OII | OIII | CIV
Rest frame wavelength (Å) | 1215.67 | 6562.8 | 3726.0 | 5007 | 1549.48
Min redshift (to achieve an observed wavelength of 4734.65 Å) | 2.90 | -0.24 | 0.27 | -0.05 | 2.06
Max redshift (to achieve an observed wavelength of 4963.75 Å) | 3.08 | -0.29 | 0.33 | -0.01 | 2.20

Table 1: Expected redshifts for different emission lines, for IA484. The H-alpha and OIII columns correspond to negative redshifts, so the H-alpha and OIII lines will not be detected by the IA484 filter. For this filter, the Lyman-alpha line should be between z = 2.90 and 3.08.
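The calculation behind a table like this can be sketched in a few lines of Python. This is an illustration, not the script we actually wrote: the function and variable names are invented, the rest-frame wavelengths are the ones listed in Table 1, and the filter edges are the two wavelengths quoted above for IA484.

```python
# Rest-frame wavelengths in Angstrom, as listed in Table 1.
REST_WAVELENGTHS = {
    "Lyman-alpha": 1215.67,
    "H-alpha": 6562.8,
    "OII": 3726.0,
    "OIII": 5007.0,
    "CIV": 1549.48,
}

# Shortest and longest wavelengths quoted in the text for IA484 (Angstrom).
IA484_EDGES = (4734.65, 4963.75)


def redshift(observed, rest):
    """Redshift needed to move a line from `rest` to `observed`."""
    return observed / rest - 1.0


def redshift_window(lambda_min, lambda_max, rest):
    """Redshifts at which the line sits on the blue and red filter edges."""
    return redshift(lambda_min, rest), redshift(lambda_max, rest)


windows = {
    line: redshift_window(*IA484_EDGES, rest)
    for line, rest in REST_WAVELENGTHS.items()
}
# A line can only be detected if part of its window lies at positive redshift.
detectable = [line for line, (z_lo, z_hi) in windows.items() if z_hi > 0]

for line, (z_lo, z_hi) in windows.items():
    print(f"{line}: {z_lo:.2f} to {z_hi:.2f}")
print("Detectable:", detectable)
```

Running the same function with another filter's edge wavelengths gives the equivalent table for that filter.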

 

We obtained tables such as Table 1, containing values for the minimum and maximum redshifts associated with each emission line. Initially, we had started calculating these values for different filters by hand, but it ended up being a tedious task. We decided to write a Python script to automate the process, which made things quicker and easier. We then plotted new histograms, trying to identify emission lines, and modifying the values of our EW and Σ cuts in order to get a discrete distribution of sources.

 

Figure 11: Count against spectroscopic redshift for IA427, for sigma_NB > 3 AND EW_obs > 50 AND specz > 0.

 

After having created several plots for a number of different cuts, we needed to find a way to determine the best combination of cuts. The two most important factors to take into account were how much noise remained in the sample, and how many actual line emitters had been excluded from it. In order to find out, we calculated two other values: the purity, which is the proportion of real emitters among all the sources remaining after a cut has been performed, and the completeness, which is the proportion of the real emitters present before the cut that are still present after it. We plotted completeness against purity (which we also called “accuracy”) to obtain new graphs.

Figure 12: Completeness against purity (or “accuracy”) in %, for IA709

 

We also normalised the values in order to get another type of plot.

 

Figure 13: Variation of completeness (in orange) and purity (in blue) depending on the EW cut, for normalised values.

 

We modified our Python script so that it would be able to calculate the number of emitters, purity, completeness and effectiveness (purity multiplied by completeness) for a large number of different combinations of EW and Sigma cuts, including for high values that we had not tried before. For a selected filter, the script creates a .ascii file containing a catalogue of emitters that can later be opened in Topcat, which makes it easy to obtain 3D plots to represent EW_Cut, Sigma_Cut, and relative effectiveness (the effectiveness value we calculated divided by the one obtained by our supervisor).

 

Table 2: Sample from the IA856 catalogue

 

Figure 14: 3D plot for IA856

 

During the remainder of the internship, we will try to produce better 3D plots in order to identify which combination of EW and Sigma cut is the most efficient for each filter. The next step will then be to identify Lyman-alpha emitters (LAEs) out of the selected emitters by applying other selection criteria, and performing visual inspections.

 

Louis

 

References:

[1] http://adsabs.harvard.edu/full/1951C%26T....67..153F

[2] https://www.britishmuseum.org/collection/object/W_1899-0610-108

[3] https://www.britannica.com/science/star-catalog

[4] http://www.bo.astro.it/dip/Museum/english/can_int.html

 

“99 Percent of Success is Built on Failure” – Charles F. Kettering

Charles Kettering was an engineer in the first half of the 1900s and you could say he was a very successful scientist, being the holder of 186 patents and the head of research for General Motors for 27 years. You wouldn’t think that he’s a ‘failure’ or ‘full of mistakes’. And yet, this is a real quote by him. In science, including physics, we’re often made to believe that we have to be perfect, that everything has to go right first time, that we need to understand everything as soon as it gets explained to us. However, some of the most successful scientists have a life riddled with mistakes. Even Einstein isn’t free of them, and he’s widely regarded as one of the most influential figures (if not THE most influential) in physics and astronomy. He added the cosmological constant to his equations of general relativity to make them compatible with a static Universe, which was the accepted picture at the time but no longer is. When the Universe was found to be expanding, he scrapped the constant, which itself turned out to be a mistake when we discovered that the Universe is not only expanding, but accelerating.

Fig 1: Albert Einstein (left) [Photo Source: RR] and Charles F. Kettering (right) [Photo source: Hemmings.com]

So, in our internship, we have taken this philosophy of making mistakes to heart and made plenty of them. We are a group of two second years and one first year, so failure is part of the process, and we’ve made our fair share of mistakes and gone down the wrong rabbit hole more than once. In this blog post, we’re going to explain some of our mistakes and how we fixed them or used them as a learning process if they were unfixable.

Starting with our first mistake: when trying to understand how appropriate different EW and Sigma cuts were, we thought it would be useful to make histograms of the spectroscopic redshift and calculate their mean and standard deviation:


Turns out this tells us absolutely nothing, but we spent a good long while making fun little histograms for no real reason. It tells us nothing because the histograms will have a relatively large standard deviation even if the cuts are appropriate: we’re finding emitters at several different redshifts in each filter, so the distribution has multiple peaks and isn’t expected to be a Gaussian.

We then, much to our own surprise, did some things right! We started calculating the redshift ranges in which different emission lines would need to lie to be detected in each filter. But then, surprise surprise, we went wrong again. These redshifts were meant to be used to calculate the purity and completeness of a sample after different cuts. And whilst this is something we did correctly, our understanding of equivalent width cuts was a bit misguided. We were going off the equivalent width values in the Sobral et al 2017 paper, but didn’t realize that we were creating the cuts with observed equivalent width whereas Sobral et al 2017 used rest frame equivalent width. This meant that the tables we painstakingly spent hours and hours making had EW cuts that were so low that they made very little difference.
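To see why the mix-up matters: an observed equivalent width is stretched by a factor of (1 + z) compared with the rest frame value. The short sketch below converts between the two; the function names and the example numbers (a rest frame cut of 25 Å at z = 3) are purely illustrative and are not taken from the paper.

```python
def observed_ew(rest_frame_ew, z):
    """Observed equivalent width for a given rest frame value at redshift z."""
    return rest_frame_ew * (1.0 + z)


def rest_frame_ew(observed_ew_value, z):
    """Rest frame equivalent width for a given observed value at redshift z."""
    return observed_ew_value / (1.0 + z)


# A hypothetical rest frame cut of 25 Angstrom at z = 3 ...
cut_observed = observed_ew(25.0, 3.0)
# ... corresponds to an observed cut four times larger.
print(cut_observed)
```

Applying a rest frame number directly to observed values therefore gives a cut that is far too lenient for high-redshift sources.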

However, whilst we complain and grumble, these mistakes (and a lot of coding errors in Python) have led us to a lot of learning and progress, and we’re excited to see where this internship takes us and just how much we can learn from it.

Week 2

In this project, we are looking to find young, primeval galaxies, so for this week, we were focusing on understanding why and how the data collected from the GOODS-S field is filtered so that we are left with a high proportion of these galaxies. We can identify these galaxies by looking at the wavelengths of light that they emit, and also the strength of the light emitted. These young galaxies form lots of stars which means that they are emitting lots of light from certain elements at specific wavelengths, which are Hydrogen alpha, Oxygen II & III, Carbon IV and Lyman alpha. Finding the strength of the light at these wavelengths will help us identify these galaxies. Also, we have to look very far back in time to see these galaxies so young, and because light travels at a finite speed, we have to look at galaxies very far away. These primeval galaxies are significantly further away than most objects in the night sky, so to identify these objects we need a way of seeing how old the sources are. 

Equivalent width is a way of finding the strength of a certain emission line. This is found by plotting a graph of intensity against wavelength and finding the area under the curve, as shown in figure 1. A rectangle is then made with the same area and a height equal to the continuum level, with the width of this rectangle being the equivalent width. We will pick a minimum value of EW_obs and use this to cut out some of the noise.

Figure 1: (Source: Szdori, 2006) A graph of intensity plotted against wavelength to form a line profile for an absorption line. The rectangle drawn has the same area as the area under the line profile and the width is the equivalent width. For an emission line, the line profile would have a positive intensity.
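As a small worked example, the equivalent width of a made-up emission line can be computed numerically. The sketch assumes the usual definition, the integral of (flux - continuum) / continuum over wavelength, with a sign convention that makes emission lines positive; the triangular line profile and the function name are invented.

```python
def equivalent_width(wavelengths, fluxes, continuum):
    """Trapezoidal integral of (flux - continuum) / continuum over wavelength."""
    total = 0.0
    for i in range(len(wavelengths) - 1):
        left = (fluxes[i] - continuum) / continuum
        right = (fluxes[i + 1] - continuum) / continuum
        total += 0.5 * (left + right) * (wavelengths[i + 1] - wavelengths[i])
    return total


# Invented triangular emission line: the flux peaks at three times the
# continuum at 4850 Angstrom and returns to the continuum 10 Angstrom either side.
wavelengths = [4830.0, 4840.0, 4850.0, 4860.0, 4870.0]
fluxes = [1.0, 1.0, 3.0, 1.0, 1.0]

ew = equivalent_width(wavelengths, fluxes, continuum=1.0)
print(ew)  # width, in Angstrom, of the rectangle with the same area
```

The triangle has a base of 20 Å and a height of twice the continuum, so its area matches a rectangle of continuum height that is 20 Å wide.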

Sigma quantifies the significance of the colour excess, which is the difference in colour detected from a source and the colour emitted from the source. Over long distances, light from a source appears more red, which is due to a combination of redshift and interstellar extinction. 

Redshift occurs when a source of light is moving away from the detector, or vice versa. Because the universe is expanding, distant sources of light are moving away from us, and the further away they are the faster they are moving. This means that the further away a source is, the more it is redshifted. The galaxies that we are looking for are seen as they were a very long time ago: because light travels at a finite speed, the further away a galaxy is, the older the light we receive from it. This means that primeval galaxies have very high redshifts, which is one way of identifying them.

Interstellar extinction is where light is absorbed and scattered in space. Light with a shorter wavelength (i.e. bluer light) is more likely to be scattered, which means that light sources from far away are likely to appear redder. Primeval galaxies are very far away, so extinction makes them appear more red.

Figure 2: (Source: Rieke, 2003) The bluer light is much more likely to be scattered or absorbed than redder light, therefore the star in this scenario appears redder.

The combination of redshift and interstellar extinction means that primeval galaxies are observed to be a lot redder than what we think they emit. Colour excess is the difference between the colour a source is observed to be and the colour of the light emitted from the source, so we expect primeval galaxies to have a high colour excess. Σ (sigma) represents the significance of the difference of the colour excess compared with the mean for the whole sample. Most of the sample is noise, so to find primeval galaxies we need to select data with a significantly higher colour excess compared with the rest of the sample (i.e. a higher sigma). Like with the equivalent width, we will select a minimum value of sigma and use it to cut out some of the noise.

We call the sources that emit the right wavelengths ‘emitters’. The data that we have collected is split up into different wavelength filters, where only photons of a small range of wavelengths are observed. For example, in the filter called IA484, the centre of the filter is 4840 Å, but the range of wavelengths it picks up is 4734.65 to 4963.75 Å. We know the wavelengths that the emitters emit, and we know the range of wavelengths that they would be picked up as in each filter; from this we can calculate the redshift of each emission line. Therefore, we can use the redshift to identify possible emitters. It is not definite that all of the possible emitters are primeval galaxies, so we will do a visual check later on in the project after we have decided on the best cut.

We are selecting the most effective cut by looking at how much noise is eliminated, but also at how many emitters are eliminated along with it. Ideally, we want a cut that gets rid of as much noise as possible without getting rid of too many emitters. The two values we use to quantify this are purity and completeness. Purity is the fraction of real galaxies out of all of the sources in the sample after the cut, and completeness is the fraction of real galaxies in the cut out of all of the real galaxies in the sample before the cut.
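Written as code, the two definitions are just ratios of counts. The numbers in the sketch below are invented to show the trade-off between a lenient and a strict cut, and the function names are chosen for the example.

```python
def purity(n_real_selected, n_selected):
    """Fraction of the sources kept by the cut that are real galaxies."""
    return n_real_selected / n_selected if n_selected else 0.0


def completeness(n_real_selected, n_real_total):
    """Fraction of all the real galaxies that survive the cut."""
    return n_real_selected / n_real_total if n_real_total else 0.0


# Invented counts: 80 real galaxies in the sample before any cut.
# A lenient cut keeps 200 sources, 78 of them real.
lenient = (purity(78, 200), completeness(78, 80))
# A strict cut keeps 50 sources, 40 of them real.
strict = (purity(40, 50), completeness(40, 80))

print("lenient:", lenient)
print("strict:", strict)
```

The lenient cut is very complete but not very pure, while the strict cut is purer but loses half of the real galaxies, which is exactly the balance we are trying to strike.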

We spent some time calculating these values, both by hand and using code. For each combination of EW and sigma cuts, we calculated the ratio of purity and completeness. We are still evaluating whether we should have equal weighting on purity and completeness, or if we should prioritise one over the other. We plotted some graphs comparing the EW cut, the sigma cut, and the ratio of purity and completeness, as shown in figure 3. Because our first draft of code was not very efficient, we plotted a limited number of results. Our aim for the next week is to make our code more efficient so that we can analyse a larger number of possible cuts.

Figure 3: a 3D graph made using Topcat showing the sigma cut, the EW cut, and the ratio of the purity and the completeness for the filter IA651. The highest point on the graph is highlighted as it is the most effective cut out of the cuts selected, which is sigma>3 and EW>250.

Emily Wickens

Time travelling through the Universe to discover and study primeval galaxies – Week 1

During my first week, I started to gather the knowledge I will need to complete my project. My internship revolves around the study of SC4K, a catalogue that contains much information about 3908 galaxies located in the COSMOS field, a region of the sky that has been much studied by astrophysicists.

This project relies on the use of Python, a programming language I am familiar with. However, I had never used Astropy before starting this internship. Astropy is a collection of software packages written in Python and designed to be used in astrophysics projects. It allows us, among many other things, to manipulate astronomical quantities that have values and units, to use astronomical time, dates and coordinate systems, to handle data tables, to create models, and to visualize data. I spent time during this first week learning about the basics of Astropy, mostly using the following sources: the official astropy documentation (https://docs.astropy.org/en/stable/), https://python4astronomers.github.io/, learn.astropy.org, and https://astropy4cambridge.readthedocs.io/en/latest/

I was given the complete SC4K catalogue, as well as the two Python scripts that were used to generate it. I did my best to analyse them and to try to understand how all the modules and functions work, what they do and how they are used. The scripts make use of a program called SExtractor, which piqued my interest, and I managed to find the original paper in which it was published, Bertin and Arnouts, 1996 (https://ui.adsabs.harvard.edu/abs/1996A&AS..117..393B/abstract), to learn more about the mechanisms behind it.

The SC4K catalogue is available as a FITS file, a format mostly used to store scientific data. This file can be opened with Astropy, or with astronomy software such as TOPCAT (Tool for OPerations on Catalogues), which is described as “an interactive graphical viewer and editor for tabular data” by its creator Mark Taylor. Here is a screenshot of the SC4K catalogue as seen in TOPCAT:

In order to have a better understanding of the theory behind the creation of the SC4K catalogue, I started carefully reading the main paper that explains its conception, Sobral et al., 2018 (https://arxiv.org/abs/1712.04451). Reading scientific papers is an arduous art that I just started learning, but I found some very interesting guidance in a paper intended to explain how to efficiently read astronomy papers for people who are not used to it yet: Cooke et al., 2020 (https://arxiv.org/abs/2006.12566). As with many other activities, I am convinced that practice is key, and I will probably have to read plenty of papers before I become more comfortable with them. It’s actually something I’m looking forward to!

But for now let’s go back to Sobral et al., 2018: it presents work carried out with data obtained from the Subaru and Isaac Newton telescopes, located respectively at Mauna Kea in Hawaiʻi and on the island of La Palma in the Canary Islands. The said data was used to identify nearly 4000 galaxies in the COSMOS field using 16 narrow- and medium-band filters, thus slicing the Universe into 16 cosmic periods, from redshift z ~ 2 to z ~ 6. These galaxies are all Ly-α emitters (LAEs), selected among thousands of possible candidates according to strict criteria. Ly-α transitions occur in hydrogen atoms, when an electron goes from the level n = 2 to the level n = 1. Because Ly-α emission lines are very strong, they are excellent tools to detect distant galaxies. David Sobral and his team were even able to create a 3D map of the galaxies they identified:

Source: Sobral et al., 2018
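As a quick check of where the Ly-α line sits, the rest wavelength of the n = 2 to n = 1 transition can be estimated from the Rydberg formula and then shifted to the two ends of the redshift range quoted above. This is only an illustrative sketch: the constant is an approximate value of the Rydberg constant for hydrogen, and the function name is invented.

```python
# Approximate Rydberg constant for hydrogen, in inverse metres.
RYDBERG_HYDROGEN = 1.0967758e7


def transition_wavelength_angstrom(n_lower, n_upper):
    """Wavelength of a hydrogen transition from the Rydberg formula."""
    inverse_wavelength = RYDBERG_HYDROGEN * (1.0 / n_lower**2 - 1.0 / n_upper**2)
    return 1.0e10 / inverse_wavelength  # convert metres to Angstrom


lya_rest = transition_wavelength_angstrom(1, 2)
# Observed wavelength of the line at the two ends of the redshift range.
lya_observed = {z: lya_rest * (1 + z) for z in (2, 6)}

print(f"Rest wavelength: {lya_rest:.1f} Angstrom")
for z, wavelength in lya_observed.items():
    print(f"z = {z}: {wavelength:.0f} Angstrom")
```

The line starts in the far ultraviolet and is shifted into the optical and near-infrared for galaxies at these redshifts, which is why it can be picked up by ground-based filters.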

I also carefully read parts of the thesis of Dr Jorryt Matthee and the thesis of (soon Dr) Sérgio da Graça Santos. They gave me important information about cosmology and extragalactic astrophysics that helped me better understand the research ongoing in the XGAL group. Going through scientific works of hundreds of pages discussing cutting-edge research is both intimidating and fascinating, but I mostly focused on the introductions of these two theses, as they are the parts that will be the most relevant for my project.

This first week was also an opportunity to discover the strange world of remote working. Socializing behind a screen is not an easy task, and it requires skills that I will need to acquire along the way. This summer, the weekly astro lunch and astro tea sessions are held on Zoom, the famous conference app that went viral all across the world in the past few months as more and more countries imposed a lockdown and urged their citizens to work from home when possible. I was very happy to see again the students and professors I had met last year during my first internship at Lancaster University, and to meet new people. In addition to these informal meetings, each Tuesday members of the XGAL group discuss their most recent work together. Finally, on Thursday evenings, the journal club takes place, during which several members from the Observational Astrophysics group present and explain astronomy papers recently published by other researchers.

This first week was full of discoveries for me, from having my first Zoom meetings to learning about the detection of galaxies in the early universe, and I am very excited about the next steps of the project!

Louis Marinho Fernandes