An Introductory SQL Tutorial for VSA Users
In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the VISTA Science Archive (VSA). Sections included in this document are a Primer, aimed at users new both to the VSA and to SQL, and a Reference, which should be of use to more experienced and returning users. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a Nutshell is a good introduction. The document concludes with some examples of cross-querying VISTA and external survey data, which we expect to be a common usage of the archive.
The VSA and its prototype, the SuperCOSMOS Science Archive (SSA), are large databases, so, for test purposes, we have produced the "Personal SSA" (PSSA), which is a small subset of the SSA, containing solely the data in the region of the sky with 184 < RA (deg) < 186 and -1.25 < Dec. (deg) < 1.25, which is the same area of sky as the "Personal SkyServer" produced for the Early Data Release (EDR) of the Sloan Digital Sky Survey (SDSS). The PSSA may be downloaded from here (as a .zip file with installation instructions included) or can be queried using a web interface of the same form as for the full SSA. Similarly, the initial release of the VSA (which contains a limited amount of commissioning data) should be used if you are unfamiliar with SQL but wish to exercise queries on that database. VSA and SSA users are strongly encouraged to use these small subsets for developing and debugging queries: with databases as large as the VSA and full SSA it can take a long time to find out that the query you wrote does not do what you intended!
This VSA Cookbook is based on the SSA version. Currently, the latter provides a more comprehensive introduction to SQL and in particular gives real-world examples of astronomy application queries that were used in the construction of the SSA. Novice users are strongly recommended to familiarise themselves with the PSSA and the information given in the SSA Cookbook before attempting real science with the VSA.
2.1. Relational databases
The VSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – e.g. a celestial object in the case of the VSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, ellipticity, etc. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value (integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked). For ease of use, it is possible to define virtual tables - called views - which are subsets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.
The VSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. For each VISTA survey programme in the VSA (e.g. the VISTA Hemisphere Survey) two major tables called, for example, vhsDetection and vhsSource are available. The columns in *Detection are basically the attributes derived by running the pipeline image analyser over single passband frames, and these single-frame detections are then merged into multi-epoch, multi-colour records for individual celestial objects, which are stored in *Source. In addition to these major tables, there are also a number of metadata tables, which store ancillary information describing the processes involved in obtaining VISTA imaging data, and which enable the provenance of data values in *Source and *Detection to be traced all the way back to a given image. To aid spatial matching of objects within the VSA and between the VSA and external catalogue datasets (such as the 2MASS Point Source Catalogue and SDSS Data Release catalogues), there are also "neighbour" and "cross-neighbour" tables which record pairs of sources within 10 to 30 arcsec of one another, depending on the catalogues joined.
SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systems (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The VSA is currently implemented in Microsoft's SQL Server 2000 DBMS, so VSA users will employ its SQL dialect (known as Transact-SQL, or T-SQL), although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on T-SQL dialect than presented here is available online here.
The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records to be returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.
For security reasons, the VSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which can extract data from it. In SQL terms, this means that only SELECT statements are allowed: N.B. in this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the VSA's SQL dialect is case insensitive by default. There are three basic classes of SELECT statement:
A projection is the retrieval of a set of full columns from a table. To retrieve the nominal RAs and Decs of the centres of all VISTA Hemisphere Survey fields in the VSA, one would type:
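A minimal form of such a projection (a sketch, using the table and column names described in the surrounding text):

```sql
SELECT ra, dec
FROM vhsMergeLog
```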
where vhsMergeLog is the name of the VSA table which records information about VHS fields, and ra and dec are the names of the relevant columns in that table.
A selection is the retrieval of the data values in particular columns for those rows in a table which satisfy certain criteria. So, if one were interested only in fields whose nominal centres lie in a 1 degree strip on the celestial equator, the appropriate SQL query would be:
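One form such a selection could take (a sketch, assuming the strip spans -0.5 < Dec. < +0.5 deg):

```sql
SELECT ra, dec
FROM   vhsMergeLog
WHERE  dec BETWEEN -0.5 AND 0.5
```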
In this example the SQL statement has been split into three lines to emphasise
the SELECT…FROM…WHERE syntax, but this is still one SQL
statement. The SQL Query Form in the VSA interface ignores the
whitespace at the end of each line of text and generates a single query string
from valid multi-line text like this. (Note that this means that users
should not use double hyphens to indicate comments in multi-line
queries, since this will result in all text after the
first comment being ignored.)
while the centres of all other fields could be selected using the following statement:
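A sketch of the complementary selection (again assuming a strip of -0.5 < Dec. < +0.5 deg):

```sql
SELECT ra, dec
FROM   vhsMergeLog
WHERE  (dec < -0.5) OR (dec > 0.5)
```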
The parentheses in these examples have been included for clarity – they are only required when needed to avoid ambiguity, or to over-rule the standard order of precedence amongst operators, outlined in Section 3.3.6 (users should note that the accidental omission of the WHERE clause from a selection turns it not into an invalid query, but into the projection of the columns contained in its SELECT clause, which, for tables as large as the *Source and *Detection tables of the VSA, will return a lot of data). Note the units of spherical coordinates in the VSA: these are almost always decimal degrees for both RA and Dec (a notable exception to this rule is attribute raBase in table Multiframe, where the units are hours for consistency with the FITS files ingested from the processing centre). You can easily convert units in selections or WHERE clauses: e.g. SELECT raBase*15.0 ..., or WHERE (ra/15.0) > 23.0 ... etc.
Note that users should check which attributes in which tables have been indexed in the VSA, since the performance of queries that can make use of them should be significantly better than for those which do not: this information is presented in the Schema Browser as highlighted rows in the lists of table attributes.
A join is the retrieval of data entries from one or more tables in a database matched under some criterion. Extending our example above, a user may be interested in the dates on which the Y band images in a strip were taken. The Multiframe table in the VSA has an attribute called mjdObs, which records the Modified Julian Date of the observation. The *MergeLog and Multiframe tables are linked by having the common attribute multiframeID, which is a unique identifier for each FITS file ingested into the archive. The SQL query retrieving the desired dates here would be:
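A sketch of such a join (the test on ymfID excludes default values, as discussed in the text below; check the Schema Browser for the exact attribute names):

```sql
SELECT mjdObs, ra, dec
FROM   Multiframe, vhsMergeLog
WHERE  multiframeID = ymfID
AND    ymfID > 0
```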
In this query, records in the Multiframe table and the vhsMergeLog table are being joined on condition that their values for the multiframeID attribute are equal. Furthermore, we select only those rows for which there is a Y multiframe identifier, i.e. we exclude any rows where there is no Y multiframe in the merged frame set. Such a situation is indicated by a default value for the attribute (see Section 2.5.). Note also that in general, FITS files produced by the pipeline are multi-extension files containing a paw-print of 4 individual images, and fields in the merge logs are made up from sets of individual images in different passbands and at different epochs. The observation dates of the four paw-print images will be nearly the same, so the results set should contain sets of four repeats of the observation dates matched with the different centres of the device images that comprise the multiframe.
The AS keyword can be used to rename the attributes in the SELECT clause so that their names in the output result set differ from those by which they are known in the database table. For example, a user thinking that the column names referred to in the query above are a little cryptic could rewrite it as follows:
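A sketch of the renamed form of the join above:

```sql
SELECT mjdObs AS yModifiedJulianDate,
       ra     AS fieldCentreRA,
       dec    AS fieldCentreDec
FROM   Multiframe, vhsMergeLog
WHERE  multiframeID = ymfID
AND    ymfID > 0
```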
and the columns returned in the output result set will be headed yModifiedJulianDate, fieldCentreRA, fieldCentreDec. This useful facility can be misused by the unwary. For example, it would be possible to have a query which started "SELECT ra AS dec, dec AS ra", which could confuse the subsequent analysis of the extracted result set. In the particular case of extracting data in VOTable format from the VSA, the assignment of Unified Content Descriptors (UCDs) to columns - i.e. setting the values of the ucd attributes of <FIELD> elements in the metadata section at the start of the VOTable document - is performed on the basis of the column name in the output result set, so a user swapping ra and dec, as in the example above, would end up with the wrong UCD for both of those columns, causing problems were that VOTable to be used in conjunction with other Virtual Observatory tools reliant on UCDs. For the moment, users wishing to have UCDs included in VOTable output from the VSA must not rename table columns, while, more generally, it is left to the user to ensure that any renaming they do perform is sensible (note: UCDs in the SSA and VSA have been initially assigned according to the UCD1 definitions).
It is also possible to write join queries in a way that makes it more explicit that they are joining the two tables, i.e. the example above becomes:
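That is, in explicit join syntax (a sketch):

```sql
SELECT mjdObs, ra, dec
FROM   Multiframe INNER JOIN vhsMergeLog
       ON multiframeID = ymfID
WHERE  ymfID > 0
```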
This is an inner join, meaning it only returns the (mjdObs,ra,dec) triplets for matched rows, but there are other types of join, which return different combinations of data (see the SSA Cookbook for more details).
The SQL Server dialect of SQL allows the construction of nested SELECT statements, in which the WHERE clause includes a subquery which is itself a valid SELECT statement. For example, the join on vhsMergeLog and Multiframe in the previous Section could be rewritten in the following way:
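A sketch of the nested form (note the DISTINCT keyword, discussed below):

```sql
SELECT DISTINCT mjdObs
FROM   Multiframe
WHERE  multiframeID IN
       (SELECT ymfID FROM vhsMergeLog WHERE ymfID > 0)
```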
Note that the ra and dec attributes from the vhsMergeLog table have now been removed from the SELECT clause of the main query. This is because the vhsMergeLog table is only present in the subquery, and its attributes are no longer accessible to the main query. The one exception to this is its ymfID attribute, but that is only accessible by value - i.e. the subquery generates a list of ymfID values and matches between this and the multiframeID column of the Multiframe table are made by use of the IN operator. Note also that now the results set only contains the distinct values of the observation dates, i.e. we get one observation date per multiframe, as opposed to previously where the results sets contained repeated observation dates for the distinct device image centres.
This query could also be written using a second logical operator, ANY, as follows:
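That is (a sketch):

```sql
SELECT DISTINCT mjdObs
FROM   Multiframe
WHERE  multiframeID = ANY
       (SELECT ymfID FROM vhsMergeLog WHERE ymfID > 0)
```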
where the ANY operator is used to match rows in the Multiframe table with any of the rows in the output result set from the subquery which have the same ymfID value. Note that in both these subqueries formulations the list of attributes in the SELECT clause of the subquery must be consistent with the rest of the WHERE clause in the main query, since they are to be matched by value.
These last two queries illustrate the equivalence of IN and = ANY, but care must be taken if the logical operator NOT is added to the outer query, so that it seeks matches with the complement of the set for which matches are sought in the original query. The operator equivalent to NOT IN is not <> ANY, as one might initially expect, but rather <> ALL - where ALL is another logical operator, which evaluates to TRUE only if all of a set of comparisons are TRUE. This is clear when one thinks through the logic of the query, but is perhaps not immediately apparent.
The most common use for subqueries is to express complicated joins in a simpler fashion. Up to 32 levels of nesting are allowed, in principle, although memory restrictions may prevent that from being achieved in practice. Evaluating a complicated join as a nested series of queries would often be much slower, but the database engine should recognise the equivalence of the two forms of the query and recast it in the most efficient form as part of its query optimisation process. So, there should be no difference in the performance of queries submitted as complicated joins or as nested subqueries, and the latter are to be preferred if they help ensure that the query that is executed really is the one that the user wanted to run.
Our discussion to this point has implicitly assumed that the values of the attributes corresponding to each column in every row in a database table are known. This need not always be the case, as a simple example from the VSA illustrates. The *Source tables in the VSA merge information about detections made in the individual frame set images. A very blue object may well not be detected in a K band image, so what should be written for such an object in the column of the *Source table which records, say, the ellipticity of the K band detection? One answer would be a null value, which is a special type of entry to be included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. In designing the VSA we have decided not to use nulls in these cases, but to define default values for use in these situations instead: e.g. in the example above, we would set the K band ellipticity in the *Source table of a celestial object undetected in that band to be a recognisably meaningless value; in the case of floating point numbers, -0.9999995e9. Nulls and default values are semantically different: the query processor in a database management system (DBMS) recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, while, to the query processor, a default value is like any other value, and will be included unless the user explicitly excludes it - e.g. by computing the mean magnitude only for those objects with magnitudes brighter than -0.9999995e9, in this case. Other default values in the VSA include -99999999 for 4- and 8-byte integer attributes, -9999 for 2-byte integers, and 0 for 1-byte (unsigned) integers.
The schema browser generally indicates the default value for many of the table attributes, and they have all been chosen to lie well beyond the range of legitimate values found in the VSA, so it is simple to exclude them:
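For example (a sketch; yAperMag3 is used here as an illustrative magnitude attribute - substitute whichever column you are querying):

```sql
SELECT yAperMag3
FROM   vhsSource
WHERE  yAperMag3 > -0.9999995e9
```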
As a result of defining default values for some columns, there have to be dummy rows in some tables (i.e. rows for which every attribute takes the appropriate default value). The reason for this is illustrated by consideration of the *MergeLog and Multiframe tables in the VSA. If one of the multiframe UIDs (e.g. the Y image attribute, ymfID) in one of the *MergeLog tables has a default value because that particular frame set does not yet have a Y-band image, and if the attribute ymfID references the attribute multiframeID in table Multiframe, then table Multiframe needs an entire row of defaults for multiframeID=-99999999 in order to maintain the referential integrity of the database.
It is the responsibility of the user to ensure that the presence of default values and dummy rows will not corrupt the results of queries run on the VSA, but our decision to use them, rather than nulls greatly simplifies the logic involved in all queries run on the database. The inclusion of null values for an attribute means that an expression involving it can evaluate to TRUE, FALSE or UNKNOWN, and we believe that the simplification for users in avoiding this three-valued logic greatly outweighs the burden of having to remember that default values exist for some columns in the VSA.
Previous sections have described the basic SELECT…FROM…WHERE… structure of an SQL query. This is the basic syntax to be employed for querying the VSA, but there are some additional options in the SELECT clause which users may find useful. Once again, a fuller reference than is presented below is available online here.
SQL offers a number of useful aggregate functions, which can be used for deriving summary information on columns or selections from them.
The meanings of those mathematical aggregate functions which apply only to numeric columns are very straightforward: AVG, MAX, MIN, SUM, STDEV, STDEVP, VAR, and VARP return, respectively, the mean, maximum, minimum, sum, standard deviation, population standard deviation, variance and population variance of all values in the specified column. They can be used in conjunction with a WHERE clause or not, i.e.
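for example (a sketch):

```sql
SELECT MAX(dec) FROM vhsMergeLog
```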
will return the maximum value for the dec column found in the vhsMergeLog table, while
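a restricted form (a sketch, limiting the selection to the southern hemisphere):

```sql
SELECT MAX(dec) FROM vhsMergeLog WHERE dec < 0.0
```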
returns the maximum value found for fields within one hemisphere of the sky.
N.B. Strictly speaking, these functions only apply to non-null values found within the particular column. As discussed above, the VSA contains no null values by design, but uses default values instead. These will not be ignored in the computation of aggregate functions. For example, the SQL query
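(a sketch, using the mjdObs attribute of Multiframe):

```sql
SELECT MIN(mjdObs) FROM Multiframe
```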
returns the value -0.9999995e9, which is clearly nonsensical astronomically, and just illustrates the unintentional selection of the dummy row in the Multiframe table.
There are additional aggregate functions which can be run on columns of all types. The most useful of these is COUNT, which can be used in a number of ways, as shown in the following example. The programmeID attribute in the ProgrammeFrame table identifies the multiframes which belong to that programme in the archive: since the VSA combines data from several programmes (e.g. the UKIDSS subsurveys), there are several different values of programmeID. To find out how many, one could use the following query:
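A sketch of such a count:

```sql
SELECT COUNT(DISTINCT programmeID) FROM ProgrammeFrame
```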
In practice, this is best discovered from querying the VSA's Programme table.
Note that the inclusion of the DISTINCT keyword means that only the
number of distinct programmeID values is returned. If it had been omitted, then
the returned value would be the total number of entries in the programmeID column,
which is simply equal to the number of rows in the table, since there are no
null values for programmeID in the ProgrammeFrame table.
Using COUNT(*) like this is a very good way of ensuring that a query is sensible before getting data returned by running it, and users are strongly encouraged to use this method to develop and debug SQL queries before running them on the full VSA. The performance overhead involved in this two-query process is not as high as it might seem, since, depending on how heavily the database is being used, some of the data from the COUNT(*) query will still be in cache when the query is run a second time to extract the desired attributes. Users should note that COUNT should be replaced by COUNT_BIG if there are likely to be more than 2×10^9 records to be counted: this circumvents an SQL Server bug.
An interesting function specific to SQL Server’s SQL dialect is TOP, which is illustrated as follows. The query
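(a sketch, using the mjdObs attribute of Multiframe:)

```sql
SELECT TOP 10 mjdObs FROM Multiframe
```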
would return the MJDs for ten rows in the Multiframe table. These will generally not be the ten highest (i.e. most recent) values in the table; remember that SQL is a set-based language, so a query yields the set of rows satisfying the criteria specified in the query, but, by default, with no particular ordering within that set. The ten highest values can be obtained using TOP, however, with the addition of an ORDER BY clause, which does impose an ordering on the rows in the result set: i.e.:
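(a sketch:)

```sql
SELECT TOP 10 mjdObs FROM Multiframe ORDER BY mjdObs DESC
```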
Note the presence of the DESC (for descending) keyword in the ORDER BY clause, which is required because the default behaviour is for that clause to list rows in ascending order.
3.1.4 GROUP BY and HAVING
The GROUP BY clause allows aggregate functions to return more than a single value. For example, the user running the "SELECT COUNT(DISTINCT programmeID) FROM ProgrammeFrame" query above might want to know how many VSA multiframes are associated with each programme. That information is returned by the following query:
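A sketch of such a query:

```sql
SELECT programmeID, COUNT(*)
FROM   ProgrammeFrame
GROUP  BY programmeID
```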
where the rows in ProgrammeFrame are grouped by their programmeID value and separate counts are made of the number of rows in each group.
The following mathematical functions are supported by SQL Server's SQL dialect.
Arithmetic functions (such as ABS, CEILING, DEGREES, FLOOR, POWER, RADIANS, and SIGN) return a value having the same data type as the input value, while trigonometric and other functions (such as EXP, LOG, LOG10, SQUARE, and SQRT) cast their input values to float and return a float value; this is probably of no concern to the average VSA user. All mathematical functions, except for RAND, are deterministic functions - i.e. they return the same results each time they are called with a specific set of input values - and RAND is deterministic only when a seed parameter is specified.
An operator in SQL is a symbol specifying an action that is performed on one or more expressions. For the present purposes, their major use is to provide greater flexibility in the possible forms of WHERE clauses of queries and in the columns of the result sets they can produce, which need not simply be columns drawn from the table(s) being queried. There are several classes of operator to consider.
3.3.1 Arithmetic operators
SQL Server's SQL dialect supports five arithmetic operators: the four basic ones - Addition (+), Subtraction (-), Multiplication (*) and Division (/) - plus the Modulo operator, %, which returns the remainder of dividing one integer by another and is used in the form "dividend % divisor". The use of the four basic arithmetic operators follows straightforwardly from their definitions.
3.3.2 Comparison operators
The Boolean comparison operators are used most frequently to filter rows via the WHERE clause of a SQL query. The simplest comparison operators (<, >, =) were used above without introduction, but there are a total of nine comparison operators which can be applied to pairs of expressions in the SQL Server dialect of SQL: = (Equal to); > (Greater than); < (Less than); >= (Greater than or equal to); <= (Less than or equal to); <> (Not equal to); != (Not equal to); !> (Not greater than); and !< (Not less than).
3.3.3 Logical operators
In a similar fashion, we have used a number of the logical operators (e.g. AND, BETWEEN, etc) above without introduction, but the following is the full list of logical operators supported by SQL Server:
The use of some of these operators (e.g. AND, BETWEEN, OR) has been illustrated above, but it is worth making a few comments on the remaining ones.
The LIKE operator is used for pattern matching. This is most commonly used for string matching. For example, a user interested in knowing how many VHS multiframes are present in the VSA could issue the query:
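A sketch of such a query (the attribute name project and the pattern shown are illustrative assumptions; check the Schema Browser for the actual project-identifier attribute and its values):

```sql
SELECT COUNT(*) FROM Multiframe WHERE project LIKE '%vhs%'
```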
Note the use of the percentage sign wildcard character, which matches any string of zero or more characters and so picks up all the project strings defined for the VHS. Several other wildcard characters can be used in conjunction with LIKE; these are described further in the SSA Cookbook, as are further examples of logical operators.
The plus sign, +, is used as a string concatenation operator in the SQL Server dialect of SQL. This is most likely to be of use to VSA users in the formatting of result sets - i.e. in the definition of SELECT clauses.
3.3.5 Unary operators
The SQL Server dialect of SQL defines three unary operators - i.e. operators which have only one operand - although none of these are likely to be of much use to most VSA users. The first of these is the positive unary operator, +, which is used principally in SQL statements not allowed by the VSA query interface. The second, -, is the negative unary operator, which returns the negative value of a numeric expression, as shown in the following query:
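For example (a sketch):

```sql
SELECT -MIN(dec) FROM vhsMergeLog
```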
The final unary operator, ~, the Bitwise NOT operator, converts each bit in a numeric expression of integer data type into its 1s complement (i.e. 0s are changed to 1 and vice versa). This might be of use in queries involving the quality flags in the *Source and *Detection tables.
3.3.6 Operator precedence
The operators described in this subsection have the following descending levels of precedence:
When two operators in an expression have the same operator precedence level, they are evaluated left to right based on their position in the expression.
Archive catalogue tables are automatically cross-matched to themselves and to a number of external survey catalogue datasets held locally, e.g. SDSS Data Releases; FIRST, IRAS, ROSAT and 2MASS catalogues; and legacy photographic catalogues like USNO-B (for a complete list, click on "Browser" on a navigation bar or see below). Rather than prejoining any two datasets to create a static, merged table of what are assumed to be associated sources on the basis of a fixed joining algorithm, and choosing a subset of what are assumed to be useful attributes from each of the two catalogue tables to propagate into that merged table, the philosophy is to create a table of pointers between any two matched datasets. This means, for example, that any externally catalogued source coincident with, or near to, an archive source is readily available, out to some predefined maximum angular search radius, and all sorts of science usages are possible because the exact matching criteria (e.g. maximum allowable displacement, consistency in morphological parameters or classification, or even the number of possible external matches for a given archive source) can be tuned at query time by simple expressions in SQL. Furthermore, all attributes of both datasets are available for querying, because no decision has been made as to which attributes to propagate into a static merged set.

The flip side to this flexibility is the rather opaque syntax for querying cross-matched data in SQL: instead of querying a single merged table, e.g. SELECT * FROM MergedTable WHERE ..., in general you have to query three tables: the two cross-matched tables and the table of pointers, e.g. SELECT * FROM Table1, ExternalDB..Table2, CrossNeighboursTable2 WHERE ..., using the joining techniques discussed above to select out the rows that you require.
Some real-world examples best illustrate how to query cross-matched data in the VSA, but first we discuss the naming of VSA objects (databases, tables and attributes) that pertain to catalogue joining.
In VSA parlance, a table of pointers that associates sources within one table (i.e. that provides a list of all nearby sources for each source within one table) is called a neighbour table, while a table of pointers that associates a set of externally catalogued sources to a table of VISTA sources is called a cross-neighbour table. Cross-neighbour tables are likely to be used the most, but neighbour tables have their uses (for example, you may wish to check the internal consistency of VISTA photometry and astrometry by selecting a set of measurements of the same sources in overlap regions of adjacent frame sets - this is most easily done using a neighbour table).
A list of all available neighbour/cross-neighbour tables is most easily obtained as follows:
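One way (a sketch; this interrogates SQL Server's sysobjects system table, relying on the table-naming convention described below):

```sql
SELECT name
FROM   sysobjects
WHERE  type = 'U'
AND    (name LIKE '%Neighbours' OR name LIKE '%SourceX%')
```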
The naming convention is simple: neighbour tables are named by appending the word Neighbours to the source table that they point to, e.g. vhsSource has a neighbour table called vhsSourceNeighbours; while cross-neighbour tables are named by concatenating the two associated table names with an X, e.g. the cross-neighbour table between the VHS source table and the 2MASS point source catalogue is called vhsSourceXtwomass_psc. Within these tables, there will always be the following three attributes: masterObjID, which points to the source for which neighbours have been created (the central source of the neighbourhood, if you like); slaveObjID, which points to the neighbouring sources (sources lying within the neighbourhood); and finally distanceMins which is the angular separation between the central source and the neighbour, in units of arcminutes. Depending on the external catalogue joined, there may be other attributes in cross-neighbour tables (use the Browser to examine these).
At the time of writing, external catalogue databases held within the VSA consist of:
Alternatively, you can use the VSA Browser to look at all the neighbour tables, their attributes and also the external databases that are available in the VSA, or use the following SQL query:
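A sketch (this lists all databases on the server via the SQL Server system catalogue; the form used by the VSA interface may differ):

```sql
SELECT name FROM master..sysdatabases
```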
Suppose a user wishes to select the identifiers and co-ordinates of all Faint Source Catalogue sources from the ROSAT All-Sky Survey that are within 6 arcsec of a VHS source. The SQL to do this is as follows:
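A sketch of such a query (the external database and attribute names, e.g. ROSAT..rosat_fsc, seqNo and sourceID, are illustrative assumptions; check the Browser for the actual names):

```sql
SELECT fsc.seqNo, fsc.ra, fsc.dec
FROM   vhsSource AS vhs,
       ROSAT..rosat_fsc AS fsc,
       vhsSourceXrosat_fsc AS x
WHERE  x.masterObjID = vhs.sourceID
AND    x.slaveObjID  = fsc.seqNo
AND    x.distanceMins < 0.1
```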
Note: i) the use of table aliases vhs and fsc which are a convenient short-hand; ii) the use of attribute prefixes (e.g. vhs.ra) to distinguish between attributes in different tables that happen to have the same name; iii) the table joining condition in the WHERE clause, which selects associated rows (if you omit this condition, you will get all rows of each table joined to all other rows, i.e. an extremely large and useless dataset!); and finally iv) the specification of a maximum radius of 0.1 arcmin (=6 arcsec) for this query, where the maximum available for ROSAT joins is 30 arcsec. You can see the maximum joining radii (in arcsec) available to you in each neighbour/cross-neighbour table by issuing the following query:
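A sketch of such a query (assuming a metadata table RequiredNeighbours recording the join criterion in degrees; check the Schema Browser for the actual table and attribute names):

```sql
SELECT neighbourTable, joinCriterion*3600.0 AS maxRadiusArcsec
FROM   RequiredNeighbours
```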
When selecting nearby cross-matches, users should note that one or more rows may result for each master object, since there may be more than one slave neighbour within the neighbourhood defined by the maximum join criterion. If you want the nearest match only, the next section explains how to do this.
Suppose a user wishes to select the infrared and optical point source photometry for a sample (say the first 50 rows for speed) in the VHS and SDSS DR2 overlap, taking the nearest SDSS source to each VISTA source as being the most likely match, where the maximum positional tolerance is set to 2 arcsec. The following query will do the job:
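A sketch of such a query follows. The cross-neighbour table name vhsSourceXDR2PhotoObj, the SDSS table name PhotoObj, and the use of mode = 1 to select "primary" SDSS objects are assumptions; check the Browser for the exact names.

```sql
SELECT TOP 50 sourceID, jAperMag3, ksAperMag3, objID, psfMag_g, psfMag_r, psfMag_i
FROM vhsSource, vhsSourceXDR2PhotoObj AS x, PhotoObj
WHERE sourceID = x.masterObjID
  AND objID = x.slaveObjID
  AND mode = 1                          -- "primary" SDSS objects only
  AND x.distanceMins < 2.0/60.0         -- 2 arcsec maximum positional tolerance
  AND x.distanceMins = (SELECT MIN(distanceMins)       -- nearest neighbour only
                        FROM vhsSourceXDR2PhotoObj
                        WHERE masterObjID = x.masterObjID)
```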
Note the following: i) here, we have used table aliases as attribute prefixes only where necessary (i.e. where attributes are not unambiguously identified by their names alone); ii) we select "primary" objects only from the SDSS, as we are interested in a data set with no duplicates; iii) a subquery is used to select the closest neighbour in each case, and the main query selects this nearest object if it is within the specified 2 arcsec maximum radial tolerance.
For VISTA survey tables that have neighbour tables defined, you can examine the neighbours of each source in the table. For example, suppose we want to check the internal consistency of VSA photometry and astrometry by selecting a sample of duplicate source measurements from overlap regions between adjacent frame sets in the VHS. The following query will do the job:
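A sketch of such a query, joining vhsSource to itself through its neighbour table (attribute names such as frameSetID and the aperture magnitudes are assumptions; extend the computed columns for whichever passbands are available):

```sql
SELECT (s1.ra - s2.ra)*COS(RADIANS(s1.dec))*3600.0 AS deltaRaArcsec,
       (s1.dec - s2.dec)*3600.0 AS deltaDecArcsec,
       (s1.jAperMag3 - s2.jAperMag3) AS deltaJ,
       (s1.ksAperMag3 - s2.ksAperMag3) AS deltaKs
FROM vhsSource AS s1, vhsSourceNeighbours AS n, vhsSource AS s2
WHERE s1.sourceID = n.masterObjID
  AND s2.sourceID = n.slaveObjID
  AND n.distanceMins < 0.01            -- 0.6 arcsec pairing tolerance
  AND s1.frameSetID < s2.frameSetID    -- duplicates from different frame sets, counted once
```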
Note that in this case a stringent pairing tolerance of 0.6 arcsec has been set (this corresponds roughly to a 3-sigma cut, given the typical positional errors in VISTA astrometry). We have also used the useful feature of user-defined computed columns to create a tabular dataset of the attributes most useful to us (i.e. differences in position in arcsec and differences in magnitude in each available passband).
5. Refining sample selection using quality bit flags
Within the detection and source tables there are quality bit flags, labelled ppErrBits in the detection tables and *ppErrBits in the source tables, where * is the filter name e.g. yppErrBits. These attributes encode the quality issues associated with a given detection such that detections subject to more serious quality issues have higher values. For a detection with no known quality issues the attribute's value is zero. The bit flag value encodes up to 32 distinct quality issues, which are divided into four different classes according to severity. This allows you to select a sample of sources where the quality is better than a certain decimal threshold value corresponding to each warning classification listed in this table:
So, to select a sample that consists of sources with no known quality issues or just minor, informative, warnings we would add ppErrBits < 256 to the WHERE clause of our SQL query. This is partly how the *PointSource views generate a good quality view of the *Source tables for each programme. The reliable*PointSource view selects only those sources with no known quality issues. To consider the quality information from every passband in your selection, you can merge the source table *ppErrBits attributes as: (ksppErrBits | hppErrBits | jppErrBits) < 256 for as many passbands as exist in that programme's source table. As an example of such source selections in practice, suppose we want to produce a colour magnitude plot for galaxies in the VIDEO source table which have no important quality issues:
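A sketch of such a selection (the classification code mergedClass = 1 for galaxies and the choice of passbands are assumptions; check the Schema Browser for the VIDEO table details):

```sql
-- J-Ks vs Ks colour-magnitude data for VIDEO galaxies
-- with no quality issues beyond informative warnings
SELECT jAperMag3 - ksAperMag3 AS jmksColour, ksAperMag3
FROM videoSource
WHERE mergedClass = 1                      -- galaxies
  AND (jppErrBits | ksppErrBits) < 256     -- no issues beyond informative warnings
```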
Initially one would wish to select just those sources with no known quality issues, and then widen the search to include more and more potentially spurious sources. To further refine our samples we can select or deselect certain quality issues according to their associated bit mask, the full list of which may be found in the glossary entries for the ppErrBits attributes. To select sources marked with a certain quality issue we would add ppErrBits & bitMask != 0 to the WHERE clause of our SQL query, where bitMask is the bit mask value (decimal or hexadecimal) assigned to that quality issue as listed in the ppErrBits glossary entry. Conversely, to exclude a certain quality issue we would add ppErrBits & bitMask = 0 to the WHERE clause of our SQL query.
For example, having initially excluded all sources with important quality issues, we might wish to include a sample that are within the ears of tiles. To do this we just extend the SQL query as follows:
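A sketch of the extended query (built on the same assumed VIDEO galaxy selection as before):

```sql
SELECT jAperMag3 - ksAperMag3 AS jmksColour, ksAperMag3
FROM videoSource
WHERE mergedClass = 1
  AND ((jppErrBits | ksppErrBits) < 256                 -- clean sources, as before...
       OR (jppErrBits | ksppErrBits) & 0x00800000 > 0)  -- ...plus sources in tile ears
```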
where the 0x00800000 is the bit mask, in hexadecimal notation, that represents the "source in underexposed tile ear" quality issue.
Finally, to select a sample of galaxies that do not lie within a dither offset of the stacked J-band frame boundary (and so contains complete data), but can be affected by any other quality issue:
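A sketch of this selection for the J band (again assuming mergedClass = 1 selects galaxies):

```sql
SELECT jAperMag3 - ksAperMag3 AS jmksColour, ksAperMag3
FROM videoSource
WHERE mergedClass = 1
  AND jppErrBits & 0x00400000 = 0   -- not within a dither offset of the J-band frame boundary
```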
where the 0x00400000 is the bit mask, in hexadecimal notation, that represents the "source within a dither offset of the stacked frame boundary" quality issue.
A fully detailed guide to the design of the quality bit flag attribute, ppErrBits, together with the complete list of quality issues with full descriptions and associated bit masks, is linked from the online documentation.
6. Refining sample selection for filter coverage
When making selections from the survey merged source tables *Source, users are reminded to think carefully about their requirements as regards filter coverage in the sample. Because releases of survey data are made prior to full completion of the surveys (obviously to expedite science exploitation), selections will be incomplete in various subtle ways depending on exactly how the sample predicates are specified.
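For example, to require a detection in both J and Ks, one can exclude the default values that the merged source tables use for missing detections (a large negative number; the exact default for each attribute is listed in the Schema Browser, so the -9e8 threshold below is an assumption):

```sql
SELECT sourceID, jAperMag3, ksAperMag3
FROM vhsSource
WHERE jAperMag3 > -9e8     -- exclude default (missing) J magnitudes
  AND ksAperMag3 > -9e8    -- exclude default (missing) Ks magnitudes
```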
This will select the stars which have been classified as variable using the current algorithm in the Schema Browser glossary.
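The selection described above might be sketched as follows (the Variability table name and the variableClass and mergedClass codes are assumptions; check the Schema Browser glossary):

```sql
SELECT s.sourceID, s.ra, s.dec
FROM videoSource AS s, videoVariability AS v
WHERE s.sourceID = v.sourceID
  AND v.variableClass = 1     -- classified as variable by the current algorithm
  AND s.mergedClass = -1      -- star-like
```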
Other selections using the statistics calculated in the Variability table can be used, e.g. selecting bright objects with high rms in J and Ks and large absolute skew in each band.
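A sketch of such a selection (all statistic attribute names, magnitude limits and thresholds here are assumptions for illustration):

```sql
SELECT s.sourceID, v.jMeanMag, v.ksMeanMag
FROM videoSource AS s, videoVariability AS v
WHERE s.sourceID = v.sourceID
  AND v.jMeanMag BETWEEN 12.0 AND 16.0                      -- bright objects
  AND v.jMagRms > 0.1 AND v.ksMagRms > 0.1                  -- high rms in J and Ks
  AND ABS(v.jSkewness) > 1.0 AND ABS(v.ksSkewness) > 1.0    -- large absolute skew in each band
```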
This gives the necessary data for the ks-band light curve for sourceID=446677514563 in VIDEO. Objects with d.ppErrBits>0 are flagged and objects with d.seqNum<0 are non-detections. Objects with x.flag=1 are also assigned to another source and those with x.flag=2 are non-detections within one dither offset of the frame edge.
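A light-curve query of this form might be sketched as follows (the BestMatch table name, the filterID code for Ks, and the mjdObs location in Multiframe are assumptions; check the Browser):

```sql
SELECT m.mjdObs, d.aperMag3, d.aperMag3Err, d.ppErrBits, d.seqNum, x.flag
FROM videoSourceXDetectionBestMatch AS x, videoDetection AS d, Multiframe AS m
WHERE x.sourceID = 446677514563
  AND d.multiframeID = x.multiframeID
  AND d.extNum = x.extNum
  AND d.seqNum = x.seqNum
  AND m.multiframeID = d.multiframeID
  AND d.filterID = 5            -- Ks (filter codes are listed in the Filter table)
ORDER BY m.mjdObs
```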
This gives the necessary data for the ks-band light curve for the first source selected which meets the following criteria: 0<Ks<(KsML-3.); 0<J<(JML-3.); Ksskew>2.; star-like; number of ks-band observations > 10; overall classification is a variable. In this case the error bars are the expected rms of a non-variable object with the mean magnitude of this source.
Some programmes, such as the standard star (CAL) programme or the VMC, observe all filters over a short duration to obtain colours that are correlated and independent of variability. In these data sets, we put the colour information at each epoch into a SynopticSource table and match this to the Source table. This selection is the same as 7.3, except that we have included a selection on the strength of the correlation between the J and Ks bands: hksiWS. The magnitudes come from the vmcSynopticSource table, which is the table of matched filters at a specific epoch, specified in the vmcSynopticMergeLog table by the meanMjdObs attribute. The matching is done using the synFrameSetID and synSeqNum, but users must be aware that a synoptic frameset in SynopticMergeLog may not include all filters, if some are deprecated. The flag is also more complicated; see the Glossary for details.
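The core of such a selection might be sketched as follows (the BestMatch table name and its join attributes are assumptions based on the text above):

```sql
SELECT s.sourceID, l.meanMjdObs, ss.jAperMag3, ss.ksAperMag3
FROM vmcSource AS s, vmcSourceXSynopticSourceBestMatch AS x,
     vmcSynopticSource AS ss, vmcSynopticMergeLog AS l
WHERE s.sourceID = x.masterObjID
  AND ss.synFrameSetID = x.synFrameSetID   -- match on the synoptic frame set...
  AND ss.synSeqNum = x.synSeqNum           -- ...and running number within it
  AND l.synFrameSetID = ss.synFrameSetID   -- epoch (meanMjdObs) from the merge log
ORDER BY l.meanMjdObs
```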
This is similar to 7.4, but the magnitude has been replaced with the H-K colour, so in this case the user can see how the colour varies over time.
It is easy to select detections that are from offset 1, or from extNum 5 of offset 5, or detections that are matched on all six offsets, or from offset 1 only.
In each case, setting the relevant seqNum>0 to select detections in that frame, or seqNum<0 to exclude detections in that frame, is all that is needed. Comparing the photometry between the tiles and pawprints is also very easy: for example, one can match the photometry between the tile and offset 1 only, and only for those detections where both the tile and the pawprint have a detection (tlSeqNum>0 AND o1SeqNum>0) and neither detection is flagged (td.ppErrBits=0 AND o1d.ppErrBits=0). Any attributes in the detection tables can be compared in this way, although it is necessary to match a new detection table for every frame in the table. Since this is done via the primary key (multiframeID, extNum, seqNum), the matching is very fast.

Unfortunately, this table cannot be directly connected to the Source table or to the BestMatch table. This is because the TilePawPrints table contains rows where there is no tile detection but there are pawprint detections, so any match to the Source table (which often has default rows when there is a non-detection in a particular filter) would have multiple choices for a "default" tile detection. To avoid this, add the following constraint on the TilePawPrints table: (tlSeqNum>0 OR tileSetSeqNum<0). To select all pawprint and tile aperMag3 values for all filters in a VMC frameset, use the following query:
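The full query is long; a sketch of its structure for the J-band tile and the first offset is given below (all table and column names here, such as tlmfID, o1mfID, tlExtNum and o1SeqNum, are assumptions based on the text above; the same pattern repeats for the other filters and offsets):

```sql
SELECT td.aperMag3 AS tileMag, o1d.aperMag3 AS offset1Mag
FROM vmcMergeLog AS l, vmcTileSet AS ts, vmcTilePawPrints AS tp,
     vmcDetection AS td, vmcDetection AS o1d
WHERE l.frameSetID = 558345748481
  AND ts.tlmfID = l.jmfID                       -- the J-band tile of this frame set
  AND tp.tileSetID = ts.tileSetID
  AND td.multiframeID = ts.tlmfID               -- tile detection, via the primary key
  AND td.extNum = tp.tlExtNum AND td.seqNum = tp.tlSeqNum
  AND o1d.multiframeID = ts.o1mfID              -- offset-1 detection, via the primary key
  AND o1d.extNum = tp.o1ExtNum AND o1d.seqNum = tp.o1SeqNum
  AND tp.tlSeqNum > 0 AND tp.o1SeqNum > 0       -- detected on both tile and pawprint
  AND td.ppErrBits = 0 AND o1d.ppErrBits = 0    -- unflagged detections only
OPTION (MAXDOP 1)
```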
This looks complicated, but it is similar to the previous query, joining vmcTileSet and vmcTilePawPrints to vmcMergeLog and vmcSource. There are a lot of links to be made, but they are quite repetitive. Adding OPTION (MAXDOP 1) prevents SQL Server from using more than one processor; if it does, each process tends to get in the way of the others. In this case, the data has been selected for one frameSet only, frameSetID=558345748481.
To avoid loading so many tables at once, which can cause MS SQL to launch multiple sub-processes that get in the way of each other, it may be better to break queries like the one above into two parts, using the ENHANCED VERSION of the FREEFORM SQL interface. The first part is to match the detection tables to the TilePawPrints tables. This will be faster if any selection on position is applied here, although it is ESSENTIAL that the default row (tileSetID<0) is also selected. Add in whichever attributes are needed at this stage. In the query below, we have selected two different aperture magnitudes (3 and 9) and the filename for the tiles and each pawprint.
Download the results of this first query as a FITS or VOTable file.
Then using the ENHANCED VERSION, use multiple instances of this table in the following query. In the query below we have compared all the aperMag3 values for tiles and pawprints in all 4 VHS filters. We could have included the aperMag9 values and the filenames too.
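A sketch of the second part is given below. The column names in the saved table (sourceID, tileAperMag3, o1AperMag3) and the filter codes (Y=2, J=3, H=4, Ks=5) are assumptions for illustration:

```sql
-- Compare tile and offset-1 aperMag3 values across the four VHS filters,
-- using four instances of the saved user table, one per filter
SELECT y.ra, y.dec,
       y.tileAperMag3 AS yTile, y.o1AperMag3 AS yOffset1,
       j.tileAperMag3 AS jTile, j.o1AperMag3 AS jOffset1,
       h.tileAperMag3 AS hTile, h.o1AperMag3 AS hOffset1,
       k.tileAperMag3 AS ksTile, k.o1AperMag3 AS ksOffset1
FROM #userTable AS y, #userTable AS j, #userTable AS h, #userTable AS k
WHERE y.filterID = 2 AND j.filterID = 3 AND h.filterID = 4 AND k.filterID = 5
  AND j.sourceID = y.sourceID
  AND h.sourceID = y.sourceID
  AND k.sourceID = y.sourceID
```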
Here each #userTable is an instance of the saved table above, one for each filter. It is important to do any selection in ra/dec here too, especially if you do not query all possible filters, since you may otherwise return some all-default rows.
For the VHS, a single epoch survey, we have also included a simple neighbour table to match the vhsSource to pawprint detections: vhsSourceXPawPrints. Selections such as the following can be used to compare the tile and pawprint photometry:
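A sketch of such a selection (the detection identifier attribute objID is an assumption; check the Browser for the exact join attributes):

```sql
SELECT s.sourceID, x.distanceMins*60.0 AS sepArcsec,
       d.multiframeID, d.extNum, d.filterID, d.aperMag3
FROM vhsSource AS s, vhsSourceXPawPrints AS x, vhsDetection AS d
WHERE s.sourceID = x.masterObjID
  AND d.objID = x.slaveObjID
  AND x.distanceMins < 1.0/60.0   -- pawprint detections within 1 arcsec
```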
This returns all the pawprint detections matched within 1" of each source along with information about which offset and extension and which filter the detection came from.
The query is simplified in
WFAU, Institute for Astronomy,