Interview Preparation mode beta
Funny Facebook Status Funny Facebook Status
Enter your email address

Various SSIS Data Flow Components?

Nice?Vote!

3 Answers

Nice?Vote!
Data Sources:
------------
There are several kinds of predefined source adapters, called Data Flow Sources in the toolbox pane:

*  Use the OLE DB Source adapter to extract data from any source that has an OLE DB provider, including SQL Server, Oracle, and DB2. You can source from a table or view, or from a SQL query or stored procedure. You may source from a query or stored procedure if your source table is huge and you want only a relatively small number of rows. Figure 5.3 depicts a data flow that uses an OLE DB data source.
*  Use the DataReader Source adapter to extract data from a .NET Provider like ADONET:SQL or ADONET:ORACLE. You’ll probably use an OLE DB provider instead, even for sourcing data from SQL Server because it has a nicer user interface. If performance is vital, use the DataReader source adapter, which has slightly better performance than the OLE DB adapter, but only the most basic user interface.
*  The Flat File Source adapter pulls data from a flat file. You could use an OLE DB or ODBC flat file driver, but we prefer the Flat File Source because it has a better UI and handles bad data more flexibly.
*  The Raw File Source adapter pulls data from a raw data file, which is a format unique to Integration Services. Integration Services writes data to and reads data from the raw format very efficiently. Raw files are a great format to use for data staging, for example to store data extracts during the day (or month) until the main transformation logic kicks off. Also use raw files to hold a fast “backup” of the dimension tables or fact rows that will be updated during an ETL operation.
*  The XML Source adapter pulls data from an XML source. The XML source can be either in a file or in a package variable (which, presumably, you set in an earlier task in the package). You can optionally apply an XSD schema definition against the XML data. You should hesitate to take any XML data that doesn’t have a schema! Because XML data is so unstructured, the source adapter lets you break out the data into multiple destinations. These might flow into different tables, or be handled in different ways before being recombined into a single table.
*  You can define a custom source adapter to consume data from a unique source, like a file whose rows have different formats under different conditions. For a single-time use, you could use the Script transform to write the source adapter. To easily re-use the custom adapter, develop it in any VS.NET language and install the custom adapter into the list of data flow transformations, as described in Books Online.
answered 1 year ago by R (19,530 points)
Nice?Vote!
Data Destinations
-----------------
There are several kinds of predefined destination adapters, called Data Flow Destinations. These are available in the same toolbox illustrated in Figure 5.3, simply by scrolling down in the toolbox list.
*  The OLE DB destination adapter will load data into any OLE DB target database or table, like SQL Server, Oracle, DB2, or even Excel. If you are loading a large volume of data into non-SQL Server OLE DB destination, it’s usually more efficient to write the data as a flat file and use the Execute Process control flow task to invoke the target database’s bulk loader.
*  You can use the SQL Server destination adapter to load data into the SQL Server relational database, instead of the OLE DB adapter. There are several differences between the two adapters:
    o The SQL Server adapter can write data only to a SQL Server relational database running on the same machine as Integration Services.
    o The SQL Server adapter requires that data be formatted perfectly; it skips some data checking and conversion steps that the OLE DB adapter performs for you.
    o The SQL Server adapter can perform faster than the OLE DB adapter because it skips those data checking and conversion steps.
    o Unlike the OLE DB adapter, the SQL Server adapter doesn’t let you set a batch size or commit size for bulk loading. In some scenarios, setting these parameters optimally enables the OLE DB adapter to out-perform the SQL Server adapter.
*  Use OLE DB during the early stages of development because it’s more forgiving of data type mismatches. Late in the development cycle, test the performance of the two adapters to determine which you want to use in production.
*  The Flat File and Raw File destination adapters write the data stream to the file system. Use raw files if the data will be consumed only by Integration Services. Flat files are a standard format appropriate for many uses.
*  Use the Data Reader destination adapter to generate a Reporting Services report directly from an Integration Services package. This feature enables an interesting use of Integration Services for real-time DW/BI systems.
*  The Recordset destination adapter populates an ADODB recordset object held in memory.
*  The SQL Mobile destination adapter is appropriate for pushing data down to SQL Server Mobile Edition databases.
*  The Dimension and Partition Processing destination adapters push the pipeline data into an Analysis Services dimension or fact partition, without the usual intermediate step of writing the data to a relational database. In this scenario, the Analysis Services database is populated directly from the ETL data stream. This feature is most interesting for real-time DW/BI systems.
*  You can define a custom destination adapter to write data to a custom target, just as you could develop a custom source adapter. We hope someone will develop a flexible, high-performance Oracle destination.
answered 1 year ago by R (19,530 points)
Nice?Vote!
Data Transformations
--------------------
Between sources and destinations are the data transformation steps. These include:
*  Sort and Aggregate transforms perform high performance sorting and aggregation. The Aggregate transform can generate multiple aggregates in a single step. The Sort transform can optionally remove duplicate rows. Both the Sort and Aggregate transforms are asynchronous, which means they consume all the data in the flow before they generate an output flow. When you think about what they’re doing, this makes perfect sense. The performance of the Sort and Aggregate transforms is excellent, and should meet most applications’ sorting and aggregation requirements. In exceptional circumstances you may need a thirdparty
sort or aggregation utility that integrates with Integration Services.
*  Conditional Split and Multicast transforms create multiple output streams from one data stream. Use Conditional Split to define a condition that divides the data stream into discrete pieces, for example based on a column’s value being greater or less than 100. You can treat the split streams differently and recombine them later, or you can use the Conditional Split to filter the data. The Multicast transform efficiently replicates a stream, perhaps for parallel processing. The Conditional Split transform sends each input row to one and only one output stream; Multicast sends each input row to each output stream.
*  The Union All, Merge, Merge Join, and Lookup transforms all join multiple data streams. Use Union All to combine multiple streams with similar structures,
perhaps from an upstream Conditional Split, or for Customer records from multiple source systems. Merge is very similar to Union All, but it interleaves
sorted rows. If you don’t care about the order of the rows in the output stream, Union All is both more forgiving and more efficient. Merge Join is a lot like a database join: it merges rows based on columns in common, and can perform a left, full, or inner join. The Lookup transform is very useful for decoding a code from the source data into a friendly label using information stored in a database table. You’ll make heavy use of the Lookup transform in performing the surrogate key assignment on your incoming fact data. The operations performed by these four transforms could all be performed using SQL queries. But if the data is in the pipeline it’s easier and almost always more efficient to use these transforms than to write to a table and potentially index that table before performing a SQL query.
*  The Character Map, Copy/Map, Data Conversion, and Derived Column transforms all do basic transformations. Character Map works only on string data, and will perform transformations like changing case, and changing character width for international strings. Copy/Map simply creates a second copy of an existing column in the stream. Data Conversion changes the data type of one or more columns, for example translating a date string to a true date. Finally, the Derived Column transform is where the most interesting transformation logic occurs. Use Integration Services variables and expressions, discussed in the text
that follows, to develop complex transformations on one or more columns, like parsing or concatenating strings, calculating a mathematical expression, or performing a date/time computation. The Derived Column transform dialog box calls the expression editor to help you construct your expressions.
*  The Slowly Changing Dimension transform does a lot more for you than most of the other transforms. It launches a wizard that generates a bunch of objects to manage dimension changes. You’ll use the wizard to specify which columns take a Type 1 change, a Type 2 change, or which can never be changed. The generated objects automatically compare the incoming dimension rows with the existing dimension, propagate a new row for the Type 2 dimension if necessary, update Type 1 attributes, and manage the common slowly changing dimension metadata attributes. You could do all this by hand using more atomic Integration Services objects, but the Slowly Changing Dimension transform is a lot easier.
*  Use the OLE DB Command transform to update or delete rows in a target table, based on the data in the data flow. The OLE DB Command transformation executes a SQL statement for each row in a data flow; the SQL statement is usually parameterized. The OLE DB Command is the only way to execute an UPDATE statement from within the Data Flow. If your ETL process requires a large number of updates—most common for snapshot fact tables—you should consider writing the target rows to a staging table and performing a bulk UPDATE statement from the Execute SQL task in the Control Flow instead.
*  The Row Count and Audit transforms are most useful for developing process metadata about the data flow. Use the Row Count transform to efficiently count the rows that will be loaded to the target, or that flow into the error handler. Audit captures information about the system environment at the time the data flow task is run, including the computer name, user id, and execution start time.
*  You can use the Percentage Sampling and Row Sampling transforms during package testing, to operate against a subset of the source data. You’ll also use sampling to choose a random sample for deep analysis and data mining, and for building your sandbox source system as described later in this chapter.
*  The Pivot and Unpivot transforms will reduce or increase, respectively, the number of rows in the data stream. Use the Unpivot transform to create 12 rows
of data from a single source row that has 12 columns holding a year’s worth of monthly values.
*  Use the Data Mining Model Training and Data Mining Query transforms to incorporate Analysis Services data mining technology into your ETL application.
For example, you may have a Customer Score attribute in the Customer dimension, whose value comes from a data mining model. These transforms will compute that model and return the results into the data flow pipeline.
*  Fuzzy Grouping and Fuzzy Lookup transforms employ fuzzy logic algorithms that were developed by Microsoft Research. These are useful transforms for data
cleansing and de-duplication. Fuzzy Lookup, like the Lookup transform, compares and matches two sets of data, but with a measure of confidence based on semantic similarity. Fuzzy Grouping acts much the same way, but is used to de-duplicate data on fuzzy criteria. Unlike third-party tools that perform name/address matching based on a detailed history of addresses, these fuzzy algorithms are based on semantics. You have to experiment with the settings, but the transforms are valuable for a wide range of data cleansing problems.
*  Term Extraction and Term Lookup transforms employ text-mining algorithms that were also developed by Microsoft Research. Term Extraction is a text-mining component which returns key terms from a document in a text column. Term Lookup matches a document to key terms in a reference table.
*  The File Extractor and File Injector transforms are used primarily to strip out (extract) text and image data from a data flow and put it into a file or files, or to add such data from files into the data flow.
*  The Script component provides a simple mechanism for creating a custom transformation in the Visual Studio for Applications environment.
answered 1 year ago by R (19,530 points)

Related questions