Oral Abstract

Oral Contribution (O1.2) Mauricio Araya (Universidad Técnica Federico Santa María)

Theme: Data discovery across heterogeneous datasets

Content-aware Data Discovery on VO Catalogs Using Succinct Representations

VO services, and online astronomical archives in general, allow users to discover data resources based on the metadata that each resource provides. Content-aware data discovery is the process of searching for patterns within the content of the resources, for example over the values of astronomical catalogs, and returning how many matches each resource produces. While a combination of existing protocols and services might produce this result, scaling up to a large number of resources while maintaining reasonable query speeds is a challenging problem. We propose using succinct representations to produce compressed intermediate files on which these queries can be performed with low computational complexity. In particular, we focus on tabular data resources (i.e., catalogs), where a content-aware query can be cast as an attribute-retrieval problem. We show that these intermediate files can be computed directly from the VOTable results of TAP services, so a succinct (and compressed) representation of any catalog available over this standard can be obtained. We compare our results with standard SQL queries over a popular DBMS, showing that for most of the queries our approach outperforms the state of the art.
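As a rough illustration of what a content-aware query returns, the sketch below counts, for each resource, how many values of a given attribute fall in a range. It is a minimal sketch under stated assumptions: the catalog names, columns, and values are invented, the function names `build_index` and `count_matches` are hypothetical, and plain sorted arrays stand in for the succinct, compressed structures, which the abstract does not detail.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy catalogs: resource name -> {column name -> values}.
# Real input would be VOTable results from TAP services (not shown here).
catalogs = {
    "catalog_a": {"mag": [12.1, 15.3, 17.8, 14.0], "ra": [10.0, 11.5, 200.2, 15.1]},
    "catalog_b": {"mag": [19.2, 20.5, 18.9]},
    "catalog_c": {"ra": [5.0, 6.0]},
}

def build_index(catalogs):
    """Sort each column once so a range count becomes two binary searches.

    Assumption: this sorted copy is only a stand-in for the intermediate
    file; it is neither succinct nor compressed.
    """
    return {
        name: {col: sorted(values) for col, values in columns.items()}
        for name, columns in catalogs.items()
    }

def count_matches(index, column, low, high):
    """Return {resource: number of values of `column` in [low, high]}."""
    counts = {}
    for name, columns in index.items():
        values = columns.get(column)
        if values is None:
            continue  # this resource does not provide the attribute
        counts[name] = bisect_right(values, high) - bisect_left(values, low)
    return counts

index = build_index(catalogs)
result = count_matches(index, "mag", 14.0, 18.0)
print(result)
```

The query answers "which resources contain matching values, and how many" without returning the rows themselves, which is the shape of result described above; resources lacking the attribute are simply omitted.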