Category Archives: Data

The Periodogram

An antique radio

An antique radio (source)

When dealing with radial velocity data sets, the behaviour of the star is pretty unambiguous for a convenient sampling rate. For example, if a planet orbits a period over the course of a year, and the star is observed weekly, then there’s a clear sinusoidal curve in the data that can be easily picked out and modeled. However the sampling rate is often not convenient, and can even be larger than the period of the planet. Some data sets can essentially be great guessing games, where the periods are in the data but take some considerable effort to find. This effort takes the form of tuning the orbit parameters of the Keplerian fit until it finally lines up with enough data points.

For example, consider this radial velocity data set for 54 Piscium given by Fischer et al.

54 Psc RV

Radial velocity data set for 54 Piscium

It turns out there is a tool we can use to get a feel for what signals might be in the data. This tool, called a periodogram, was discussed earlier (here). Assume that we have a data set of velocities (or any variable, really). The radial velocity, v, is measured at a time t, resulting in a time series data with N data points, {v (tj), j=1,2,…,N0 }. A Fourier transform is performed on the data set as follows.

\displaystyle   \begin{aligned}  P_v(\omega)&=\frac{1}{N}|\text{FT}_v(\omega)|^2\\  &=\frac{1}{N_0}\left|\sum_{j=1}^{N_0}v(t_j)\exp(-\text{i}\omega t_j)\right|^2\\  &=\frac{1}{N_0}\left[\left(\sum_jv_j\cos(\omega t_j)\right)^2+\left(\sum_jv_j\sin(\omega t_j)\right)^2\right]\\  \end{aligned}

where P is the “power” of a given period ω, N is the number of data points, v is the radial velocity at time tj, and i is the imaginary unit.

With the above equation we can compute a basic periodogram for this data set. It is useful because if the data set contains a sinusoidal component with a frequency of ω0, then v (t) and exp(-iωt) are in phase. This causes a sinusoidal signal near ω to significantly contribute to the sums in the equation. Otherwise, values of ω that are far from a real sinusoidal signal in the data randomly fluctuate between positive and negative and therefore more or less cancel out to a small sum. Hence, if a sinusoidal signal of period ω exists in the data set, the value of Pv (ω) will be large. In a graph of periods on the x-axis, the presence of that period in the data set can be graphed as a “power” on the y-axis. Higher power values indicate that the periodicity is more strongly represented in the data.

With the periodogram, while we can technically look at the power of any of an infinite number of frequencies, it would be helpful to narrow our search to a finite set of frequencies. We must decide what frequencies those are. Since the periodogram of the data is symmetric (that is, P (-ω) = P (ω)), all of its information is contained in positive frequencies, n=0,1,2,…,N (=N0/2). Roughly half of the information in the data is discarded when calculating the periodogram because the absolute value part of the equation eliminates any distinction between positive and negative frequencies. As such, we need only to look at

\displaystyle N=N_0/2

evenly spaced frequencies.

It is not necessary to understand how the periodogram works, but what it means is important for this discussion. The periodogram takes your data set and tells you what periodicities are in it. For example, a periodogram of the above 54 Piscium data set would look like the following.

54 Piscium RV periodogram

54 Piscium RV periodogram

A strong peak in the periodogram is found at ~62 days, corresponding to the location of 54 Piscium b. We can fold the data to the period of the planet and see a reasonable fit by a ~Saturn-mass planet in a 0.284 AU orbit with an eccentricity of 0.63, and a longitude of periapsis of 235 degrees.

54 Psc b fit

54 Psc b fit

The periodogram, while useful, suffers a couple of problems. One, which is probably fairly obvious, is that it is noisy. A plethora of small peaks are seen that don’t really apply to anything physically real. The periodogram is noisy even when the actual data itself is only slightly noisy. Furthermore, the noise does not diminish with increasing sample size. This may be a bit counter-intuitive, as any fake signals would seem to be discouraged by adding new data, but it turns out that adding new data also adds new frequencies on top of the old, discouraged ones. As such, the noise is not actually averaged out by adding more data. However while the noise does not decrease with increasing sample-size, the signals in the data do. Specifically, the signal-to-noise ratio of a real signal in the data increases with the square root of the total number of observations.

Another major problem and contribution to noise is the problem of leakage. Signals tend to “leak” into other periodicities. In the 54 Psc periodogram above, you can see smaller peaks – called “sidelobes” – straddling the real signal of the planet. This is caused by other similar signals being sufficiently in phase that they end up having a higher P (ω) value. Such leakage to nearby periods is caused by the finite total interval over which the data is sampled.

Sidelobes are fairly easy to recognise and ignore, but a more dangerous sort of leakage occurs due to the finite size of the interval between measurements, a phenomena called aliasing. In particular, periods that are of integer ratios of the planet’s signal can be represented in the data. For a real signal of length S, aliased signals can exist at S/2, S/3, and so on. When you see a car driving down the road and the tires look like they are actually spinning very slowly, perhaps even in the opposite direction, the sampling rate of your eyes and brain are being fooled by an alias of the real signal – the actual rotation of the tires. Variable sampling rates are usually sufficient to eliminate the aliases that are most likely to be problematic.

As you can see, there are helpful statistical tools to take data sets and extract information out of them, but their interpretation is often unclear or ambiguous and great care must be taken when using them. As to the overreaching point that has been hinted at numerous times on this blog, detecting extrasolar planets is hard.

Advertisements

The Real Ones

On the last post, we looked at recovering a periodic signal from a radial velocity plot and interpreting it as a planet. Now let’s look at a few of the complications involved in this.

A powerful statistical tool used to get an idea of what kind of periodic signals are in your radial velocity data set is a Lomb–Scargle periodogram (the mathematical details for the interested reader may be found here, but it is sufficiently complex to warrant skipping over in the interests of maintaining reader attention and reasonable post length). In the interests of brevity, further references to the Lomb-Scargle periodogram will be shortened to simply “periodogram.”

The purpose of this periodogram is to give an indication of how likely an arbitrary periodicity is in a data set whose data points need not be equally spaced (as is frequently the case in astronomy for a variety of reasons). Periodicities that are strongly represented in the data are assigned a higher “power,” where periodicities that are not present or only weakly present are given a lower power.

Let’s look at an example using a radial velocity data set for BD-08 2823 (source) If we calculate a periodogram for the data set, we come up with this

BD-08 2823 RV Data Periodogram

BD-08 2823 RV Data Periodogram

The dashed line represents a 0.1% false alarm probability (FAP). A clear, obvious peak is seen at 1 day, 230 days and ~700 days, implying that periodicities of 1 day, 230 days, and ~700 days are present in the data. Creating a one-planet model with a Saturn-mass planet at 238 days produces a nice fit. After subtracting this signal from the data, we’re left with the residuals. Now we may run up a periodogram of the residuals and see what’s left in the data.

BD-08 2823 Periodogram of Residuals

BD-08 2823 Periodogram of Residuals

We see three noteworthy things. First and foremost is the emergence of a new peak in the periodogram that was not strongly present before at 5.6 days. We also see that the peak at 1 day remains. Lastly we see that the peak toward 700 days has weakened and moved further out. It would seem to suggest the 700-day signal is perhaps not real, or was an artifact of the 238-day signal.

Why was the 5.6-day signal not present in the first periodogram? The answer may lie in it’s mass: the planet has a mere 14 Earth-masses. It’s RV signal is completely dominated by the Saturn-mass planet. The giant planet forces the shape of the RV diagram and the signal of the second planet is just dragged along, superimposed on the larger signal.

On the radial velocity data plot, the two-planet fit we have come to looks like this:

BD-08 2823 Two-Planet Fit

BD-08 2823 Two-Planet Fit

It is important to realise that the obvious sine curve is not necessarily a bold line, but there is a second periodicity in there going up and down frantically, once every 5.6 days, compared to the Saturn-mass planet, at 237 days.

The fit has a reduced chi-squared of χ2 = 3.2, and a scatter of σO-C = 4.3 m s-1. There’s no obvious structure to the residuals and the scatter is not terribly bad, so any new signals will likely indicate planets of low mass. Let’s check in on the periodogram of the residuals to the two-planet fit and see what may be left in the data.

Periodogram of Residuals to 2-Planet Fit

Periodogram of Residuals to 2-Planet Fit

That signal out toward a thousand days is stubbornly refusing to go away, despite a low χ2. It may either not be real, or it may be indicative of a low-amplitude signal with a rather long period.

Also noteworthy is that the periodicity at one day continues to exist, rather strongly. This periodicity is what’s known as an alias. Because the telescope observes only at night, the observations are roughly evenly spaced – there are (on average) twelve hour gaps between each data point. Therefore a sine curve with a period of 24 hours can be made to fit the data. To illustrate this, consider this (completely made up) data set:

Fake Signal

Fake Signal

There’s no doubt that the data is well-fitted by the sine curve, but there is no real evidence that the periodicity proposed by it arises from a real, physical origin. What’s more, a sine curve with half this period could also equally well fit the data. So could a sine curve with a third of this period, and so on. There are mathematically an infinite number of aliases at ever-shortening periods that can be fit to this data.

Generally, if you observe a system with a frequency of f_o, and there exists a true signal with a frequency of f_t, then aliases will exist at frequencies f_{t+i} * f_o, where i is an integer.

Therefore we see that these aliases are caused by the sampling rate. If we could get data between the data points already available, if we could double our observation frequency, we could break this degeneracy. But the problem for telescopes on Earth is that the star is not actually up in the sky more than half the day, and a given portion of the time it is up could be during daylight hours. Therefore the radial velocity data sets of most stars can be plagued with short-period aliases since there is typically a small window of a few hours to observe any given star. It must be noted that as the seasons change and the stars are in different places in the sky at night, that window of availability will shift around a bit, allowing one some leverage in breaking these degeneracies. Ultimately, telescopes in multiple locations around the world (or one in space) would sufficiently break these degeneracies.

A real example of aliases exist in this example from an Alpha Arietis data set. In this case, the alias is not nearly so straightforward. Two signals of periods 0.445 days and 0.571 days can be modelled to fit the data.

Alpha Arietis RV Alias

Alpha Arietis RV Alias

So which of these two signals correspond to an actual planet? It turns out neither of them do: these radial velocity variations are caused by pulsations on the star – contracting and expansion of the star produces Keplerian-like signals in radial velocity data, too. That’s yet another thing to watch out for. This can be detected with simultaneous photometry of the star. If there is a photometric periodicity that is equivalent to your radial velocity periodicity, avoid claiming a planet at this period as if your academic credibility depends on it.

Additional observations could easily break this degeneracy, provided they are planned at times where the two signals do not overlap.

We see therefore that it is important to keep in mind that a low FAP speaks only to whether or not the signal is real, and not where or what it actually came from. The one-day periodicity is surely present in the data, but it is not of physical origin. It can also be extremely hard to tell whether or not a signal at a given period is actually an alias of another, more real period. There are times when the peak of an alias in the periodogram can be higher than the actual, real period. For reasons that include these, radial velocity fits must be considered fairly preliminary. New data may provide drastic revisions to the orbital periods of proposed planets if signals are exposed to be aliases.

Confusion over aliases have occurred before in literature. HD 156668 b and 55 Cnc e have both had their orbital periods considerably revised after it was realised that their published periods were, in fact, aliases. In the case of 55 Cnc e, the new, de-aliased orbital period ended up being vindicated after transits were detected). The GJ 581 data set, for example, is severely limited by sampling aliases that have spawned controversies over the possible existence of additional planets in that system.

In summary, periodograms are a useful tool to provide the user with a starting point when fitting Keplerian signals to radial velocity data, but they cannot distinguish real signals from aliases. Many observations with a diverse sampling rate are necessary to disentangle aliases from true planetary signals. Ultimately, a cautious approach to fitting signals to radial velocity data works best.

Making Waves

Photo by Mila Zinkova, 2007

Photo by Mila Zinkova, 2007

We looked at how the Doppler effect can tell you about the radial velocity of a star (here), and how you can use this information to detect an orbiting companion (here). Let’s now look a bit more at some of the later parts of this process. You’ve already finished your observations, you’re done with the telescope time and you have a collection of data points. What do you do with them?

The first thing you will want to do is to create a model that fits the data. This is done using the equations in the post referred to in the above paragraph. The model will simply manifest itself as a plot of the radial velocity behaviour you would expect for a planet with a given mass and orbit. The closer your model fits the data, the more likely your model is a good representation of reality. For example, let’s look at radial velocity data of the giant star Iota Draconis (Credit: ESO).

Iota Draconis (HR 810) Radial Velocity Data

Iota Draconis (HR 810) Radial Velocity Data

The individual points are the actual data, whereas the sine wave is the model of a 2.26 MJ planet with a 320-day period. We see that the proposed model fits the data well (being very close to the data), and is thus likely representative of reality. There are a few stray points and the model is not a perfect match to the data, but the data points can be prone to intrinsic noise, such as flares or granulation on the star’s surface, or instrumental noise with the spectrometer.

We see that the radial velocity data set covers several orbits of the planet around the star. As the planet continues to orbit the star (as it will do so for aeons) and the radial velocity data continues to come in (as we would hope would occur, more data is more opportunities to discover something new), the graph could become extremely long and complicated. For example, consider a set of radial velocity data for HD 192263 (source).

Unfolded HD 192263 RV data

Unfolded HD 192263 RV data

What we could do would be to fold the entire data set by a certain period of time, specifically the orbital period (taking the time of the measurement modulo the orbital period). Since the model is periodic, that is, repeating each orbital period, the data points (provided they do have the same periodicity that you’re folding by) will fold up into a nice, neat, singular sine curve.

HD 192263 Folded RV data

HD 192263 Folded RV data

Here it is quite convincing that the model well fits the data. But how do we quantify how good of a fit to the data our model is? A value exists called \chi^2(read “chi squared”).

\displaystyle \chi^2 = \sum \frac{(O-C)^2}{\sigma^2}

That is, calculate the difference between each observed data point and the model (or “observed minus calculated”), divided by \sigma^2, which represents the variance of the data. Then total up each of these. The higher the \chi^2, the poorer the model fits the data. The variance, \sigma^2,can be determined with

\displaystyle \sigma^2 = \frac{1}{N} \sum_{i=1}^{N}(v_i - \bar{v})^2

Which is simply an average of the differences of each point, v, from the average values of the points \bar{v}. To put this in other terms (and this is more of a discussion about general statistics than it is specific to radial velocity data), all of the radial velocity measurements can be averaged together. The average of the differences of each point from that average radial velocity value is the variance of the data, \sigma^2, and is a representation of how scattered the data is. Our \chi^2value is simply the total of each point’s difference from our model, while dividing it by the variance to compensate for highly scattered data. Models that are not rigidly affixed to the data points (high total OC) are given more forgiveness of the data itself has a lot of variance (\sigma^2) anyway. If there is low variance, then only models that are closer to the datapoints (low OC) are treated well, or given a low \chi^2 value.

We can obtain another value called the “reduced chi-squared” which takes into account the number of freely adjustable variables (or, the “degrees of freedom”). This is obtained simply by dividing \chi^2by the degrees of freedom, d.

\displaystyle \chi^2_{red} = \frac{1}{Nd} \sum \frac{(O-C)^2}{\sigma^2} = \frac{\chi^2}{d}

The need to do this stems from the fact that a quantitative analysis of data must be undertaken to seriously distinguish between two models whose goodness of fit to the data may not be as easily discriminated between in our “chi-by-eye” technique of just looking at them.

Large reduced chi-squared values indicate a poor model fit, and a value of 1 is typically sought but can be extremely difficult or impossible to achieve due to instrumental or intrinsic noise that are out of one’s ability to control. Values of the reduced chi squared that are less than 1 are typically a sign that you are “over-fitting” the model to the data. This can result from trying too hard to fit the model to noise, or overestimating the variance.

HD192263 RV Residuals

HD192263 RV Residuals

Above are the residuals to the HD 192263 dataset. This is keeping the x-axis (the time of the measurement) and plotting the O-C on the y-axis, instead of the radial velocity value for that data point. In this example, there is a trend in the data that has been modelled out which could potentially be due to a second orbiting companion that is far enough away that only a very small section of its orbit has been observed.

Other than the trend, there is no apparent structure in the residuals, and so there is no reason to suspect the existence of additional orbiting bodies.

One can calculate the standard deviation, or variance, of the residuals to get a measure of how scattered the residuals are. This is typically represented with \sigma or written as ‘rms’ in literature, and is an algebraically obvious transformation of one of the above equations,

\displaystyle \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N}(v_i - \bar{v})^2}

The radial velocity jitter of a star scales with both its mass and chromospheric activity. A quiet M dwarf may have a jitter as low as 2 to 3 metres per second, whereas a sun-like star may have a jitter on the order of 3 – 5 metres per second. F-type stars have jitters close to 10 metres per second, and the jitter sky-rockets from there, where A-type stars are nearly unusable for using radial velocity to detect extrasolar planets. As of the time of this writing, no planets have ever been discovered around an A-type star through the use of the radial velocity method. So there is an amount of variance in the residuals we should expect to have given the star we’re studying. If the variance is much higher than expected for the star, then it could be evidence of either unusually high stellar chromospheric activity, or as-yet undetected planets with small amplitude signals.

The systems we have looked at thus far throughout this blog in the context of radial velocity have been systems whose radial velocity data graphs are dominated by a single periodicity. Indeed, our own solar system is like this, with Jupiter dominating the sun’s radial velocity profile. But not all planetary systems come so conveniently organised.

I leave you with an image of how daunting this can be and why we are all thankful that computers are here to help us: An image of the 2008 three-planet fit for HD 40307 (since discovered to have three more planets).

HD 40307 Three-Planet Fit

HD 40307 Three-Planet Fit


Update (15 Apr 2013): I have corrected a math error in the equation for variance and standard deviation. In both cases the (v_i-\bar{v}) term should have been squared.