Technology

Sunday, September 23, 2012

Elements of the Dimensional Model

OLTP vs OLAP :
Describe the application that business is looking to build.

Business Processes :
Identify the business processes. Each of the business processes typically becomes a tables in data warehouse called FACT tables.

Fact Tables:> Granularity is at which level of detail the data will be stored for each fact table.
> Trasaction Fact : Transaction records for individual events as they happened
> Periodic Snapshot : Aggregated summary of data.
> Accumulating Snapshot :
> Additive, semi additive, non - additive Fact tables
> Factless Fact tables

Data Warehouse Bus Architecture :
> Define a standard interface for the data warehouse environment, separate data marts can be implemented by different groups at different times. The separate data marts can be plugged together and usefully coexist if they adhere to the standard.
> The team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise. This establishes the data architecture framework.

Data Warehouse Bus Matrix
The Data Warehouse Business Matrix is the tool of choice for documenting and communicating the Dimensional Modeling process so far. Business Processes vs Dimensions

Conformed Dimensions:
Dimension tables are not conformed if the attributes are labeled differently or contain different values. If a customer or product dimension is deployed in a nonconformed manner, then either the separate data marts cannot be used together or, worse, attempts to use them together will produce invalid results.
Flavours:
> Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.
> Sometimes dimensions are needed at a rolled-up level of granularity. Perhaps the roll-up dimension is required because the fact table represents aggregated facts that are associated with aggregated dimensions

Dimensional Authority:The dimension authority has responsibility for defining, maintaining, and publishing a particular dimension or its subsets to all the data mart clients who need it.

SQL-92 - Four Isolation Levels:

Overview
While creating or reusing an existing acess method, it is required to know the degree of isolation supported between processes attempting to access the same data. This is commonly defined by referring to the SQL-92 Isolation levels.

SQL-92 defines four Isolation Levels:

Read Uncommitted
Read Committed
Repeatable Read
Serializable

Isolations levels are stated in terms of three prohibited operation sequences, called Phenomena:

Dirty Read
Non-Repeatable Read
Phantom Read

If an access method does not support Serializable, then it is possible that the data set qualifying for a particular query may change within a transaction because of an update by another transaction. If this can occur, then it is very important for user to know what level is supported, so that they may design their applications accordingly. Phenomena define the ways in which the qualifying data set may change in the transaction.

Dirty Read

This occurs if one transaction can see the results of the actions of another transaction before it commits. Consider the following example:

Transaction 1	Write(a)		Rollback
Transaction 2		Read(a)

If Transaction 2 reads the value written by Transaction 1, then a DIRTY READ has occurred.

Non-Repeatable Read (also called Fuzzy Read)

This occurs if the results of one transactions can be modified or deleted by another transaction before it commits. This is illustrated below:

Transaction 1	Read(a)			Read(a)
Transaction 2		Write/Delete(a)	Commit

If Transaction 1 gets a different result from the each Read, then a NON-REPEATABLE READ has occurred

Phantom Read

This occurs if the results of a query in one transaction can be changed by another transaction before it commits. This is illustrated below:

Transaction 1	Select(criteria)			Select(criteria)
Transaction 2		Update/Create(match to criteria)	Commit

If Transaction 1 gets a different result from each Select, then a PHANTOM READ has occurred

SQL-92 Isolation Levels Defined

The SQL-92 Isolation levels are defined in terms of what Phenomena can occur:

Isolation Level	P1	P2	P3
Read Uncommitted	YES	YES	YES
Read Committed	NO	YES	YES
Repeatable Read	NO	NO	YES
Serializable	NO	NO	NO

Thus if Read Committed is supported you will never experience a DIRTY READ, but you might experience a NON-REPEATABLE READ or a PHANTOM READ.
The above order is in terms of "strength" in that Read Uncommitted is less restrictive, and thus less "strong" or at a "lower level" (of isolation) than Read Committed.

Tuesday, April 27, 2010

DUAL

ORACLE DUAL TABLE;
SQL> SELECT LEVEL FROM DUAL CONNECT BY LEVEL < 4

SQL> SELECT COUNT(*) FROM DUAL;
COUNT(*)
----------
1

SQL> DELETE FROM DUAL;
DELETE FROM DUAL
*
ERROR at line 1:
ORA-01031: insufficient privileges

SQL> INSERT INTO DUAL VALUES ('X');
INSERT INTO DUAL VALUES ('X')
*
ERROR at line 1:ORA-01031: insufficient privileges

CONNECT SYS AS SYSDBA

SQL> INSERT INTO DUAL VALUES ('X');

SQL> COMMIT;

SQL> SELECT COUNT(*) FROM DUAL;
2

SQL> SELECT * FROM DUAL;
D
-
X

SQL> SELECT USER FROM DUAL;
USER
------------------------------
SYS

DUAL is owned by SYS. SYS owns the data dictionary, therefore DUAL is part of the data dictionary. Never update the dd table directly.

Thursday, February 25, 2010

DW Concepts

Data Warehouse Concepts:

Slice :
Lets say there we want to look at year long sales for 3 particular stores(a,b,c). This concept is called slicing. The slicing of the members allows us to focus only on these three members across all other dimensions.

Dice:
The dicing concept means that you put multiple members from a dimension on an axis and then put multiple members from a different dimension on another axis. This allows you to view the interrelationship of members from different dimensions.

Drill-down and drill-up , Roll-down and Roll-up:
Drilling in multidimensional terminology means going from one hierarchy level to another. In other words, drill-down can be defined as the capability to browse through information, following a hierarchical structure.

Drill-across :
Drill-across is a method where you drill from one dimension to another. e.g. earlier we were looking at sales by region of products, now we drill across and now looking at sales by store in a region of products.

Grain:
The grain conveys the level of detail associated with the fact tablemeasurements. Identifying the grain also means deciding on the level of detailyou want to be made available in the dimensional model. The more detailthere is, the lower the level of granularity. The less detail there is, the higherthe level of granularity.

Each fact and dimension table is said to have its own grain or granularity. In other words, each table (either fact or dimension) will have some level of detail associated with it. The grain of the dimensional model is the finest level of detail implied by the joining of the fact and dimension tables.

It is possible to have multiple grains in one fact table. This can be accommodated by adding a column called the granularity flag e.g. you may have data at daily, weekly, monthly level.

Inversion Key Entry : Non Unique Index

Surrogate Keys:
Surrogate keys are keys that are maintained within the data warehouse instead of the natural keys taken from source data systems.

Data tables in various OLTP source systems may use different keys for the same entity. It may also be possible that a single key is being used by different instances of the same entity. This means that different customers might be represented using the same key across different OLTP systems.
Surrogate keys provide the means to maintain data warehouse information when dimensions change.
Natural OLTP system keys may change or be reused in the source data systems. This situation is less likely than others, but some systems have reuse keys belonging to obsolete data or for data that has been purged.
One simple way improve performance
Changes or realignment of the employee identification number should be carried in a separate column in the table, so information about the employee can be reviewed or summarized, regardless of the number of times the employee record appears in the dimension table.

Degenerate dimensions: Degenerate dimensions are dimensions without any attributes. They are not typical dimensions, but often simply a transaction number that is placed inside the fact table.

Conformed dimensions :

Slowly changing dimensions:

Garbage dimensions:
A garbage dimension is a dimension that consists of low-cardinality columns such as codes, indicators, and status flags. The garbage dimension is also referred to as a junk dimension.

Role-playing dimension:

A single dimension which is expressed differently in a fact table using views is
called a role-playing dimension.

Multi-valued dimensions:

Late arriving Dimensions:

Wednesday, February 10, 2010

Interview Questions : Informatica

This discussion is around the Informatica design related interview questions and queries.

~~ Replace a Dynamic lookup cache in a map with a static with optimal performance. We have a dimension table which is in SCD2 with the following structure (id, attr, datetimestamp,status) where :
id : is the key being used from the source itself. Assuming for simplicity of the scenario else use a synthetic key.
attr: change in attribute will result in a new record to be inserted in dim table with status='C' and rest records for this id needs to be marked as 'H'.

Consider the source table with structure (id, attr, datetimestamp). We need to pupulate the dimension from the source.

Design should be such that: For multiple records for the same source key, we should perform updates/inserts only once i.e. if there are multiple records for single id in source, to process first record we first update earlier record as 'H' and then insert new as 'C' and then do the same for other recs with similar id. Rather than following this we issue a single update statement to update existing record as 'H', insert last updated record (with max date) for the id as 'C' and rest all we insert as 'H'.

~~ Maintain SCD2 using 'C' and 'H' flags along with timestamp in the most optimised manner using Informatica as ETL tool.

~~ Transpose rows to cols and cols to rows without using Normalizer.
Assume source in the format (id, subject, marks) and we need to populate a target with structure (id, marks_s1, marks_s2, marks_s3, ...50).
Other way round, we have the source as (id,marks_s1,marks_s2,marks_s3...50) and we need the data in target as (id, subject, marks)

~~ Salary: We have the data as follows (id, name, dept, sal). We need to populate the following target (id,name,dept,sal,d_sal) where id is the id of the person, a unique identifier; name : name of the person ; dept is dept in which person works, sal is the sal he is drawing and d_sal is the difference between the salary of the person and the average salary of the dept in which the person is working.
There are numerous employees working in various depts and at any given time an employee is working only for a single dept.

~~ A big fat table which gets loaded with a flat file provided by business. File got loaded into the table but The file provided by business had 10% wrong records. Now, business provided a new file with ALL the records along with the corrected ones.

Which is the best approach to upload the corrected records?

~~ Many files to be uploaded: Assuming 1000 files to be uploaded all with different structures and data. Design logic such that we can load the data from all these files using a SINGLE mapping with multiple sessions.

~~ Sanity Check: There is a flat file which needs to be checked for sanity. Check can be related to data type of a column, format of data or business checks. We need to perform all checks. Now there can be different scenarios:

We may like to upload a file in case there are no failiures, or upload the file in any case or upload in case not more than 10% records have been rejected.

Similarly we like to capture the errors in the file in different formats in error table such as capture each error as different records with a unique record id, or capture errors for a single line in the file as a single record, or capture all the errors for all the lines of the file as single line in error table with the records

~~ Rejects and Reloads: Describe the rejects and error handling strategy.

~~ Duplicates: There are duplicates in the source file. Either we need to completely skip the duplicates while processing and commit only distinct rows, or we may need to put all the first instance of records in one table and route all the other into another or route all the instances of the records having count more than 1 into one table and all the non duplicate records into another table.

~~ Replicate: There is a table in source. We need to capture all the changes in this and replicate the same in target without using Informatica. What could be the options.

~~ Incremental aggregation: Implement incremental aggregation in the mapping without using session property.

~~ Lookup vs SQL Override : When will you prefer a lookup and when a SQL override. Can there be scenarios where a lookup cant be replaced with a SQL override or vice versa from the functionality perspective.

~~ Connected vs Unconnected Lookup: Describe various factors for considering an unconnected lookup vs a connected in the mapping logic.