Friday, September 7, 2007

ETL Built-in to the Database Layer?

Sounds like BCP to me...

Lucid is another column-store database that complements Mondrain as a data warehousing tool.

LucidDB is capable of executing extract/transform/load processes directly as pipelined SQL statements, without any external ETL engine required. This is made possible via a number of features:

  • Extraction of external metadata and data via SQL/MED: LucidDB comes with out-of-the-box support for extracting from flat files and JDBC data sources. Additional foreign data wrappers can be plugged in to make any data source look like a schema of relational tables. For example, a wrapper has been written for access to web services; each Salesforce object can be queried as a table, with SQL expressions such as filters pushed down into the webservice call via an optimizer rule.
  • Pipelined Java transformations (a.k.a. UDX): LucidDB supports the creation of "table functions" implemented as Java routines; these functions take one or more cursors as input and produce a new cursor as output, meaning they can be incorporated directly into SQL statements. For example, a data clustering algorithm can be implemented in Java and invoked as a pipelined operator in a SQL query, allowing complex ETL functions such as name and address matching to be executed without the need to materialize intermediate temporary tables. LucidDB's hybrid Java/C++ architecture is optimized for efficient batching of tuples back and forth across the JNI boundary between managed and unmanaged memory. All threads run in a single process and Java virtual machine, and calls across the JNI boundary do not require extra threads, so there is no inter-process communication overhead. For more information, see the UDX HOWTO.
  • Bulk-load via INSERT and UPSERT: no separate bulk-load utility is required; instead, the SQL INSERT statement does the job directly, selecting from a query consisting of SQL operators such as join and minus, Java transformations, and SQL/MED sources. LucidDB also supports the SQL:2003 standard MERGE statement for implementing upsert. All DML operations (including DELETE) are fully recoverable via table-level locking and page-level logging; for efficiency, no record-level logging or locking is needed. Errors due to constraint violations can be written to a reject file without aborting the load.
  • Read/write concurrency with snapshot consistency: readers (e.g. report execution) are never locked out by bulk load, update, or deletion.
  • Sequence support: via the GENERATED clause in surrogate key column definitions.
  • Application library: LucidDB comes with applib, a library of commonly used ETL and analytical functions and transformation, e.g. a calendar table generator.

LucidDB Home Page

No comments: