Wednesday, January 18, 2012

Data Integrity

Data integrity refers to the data validation (or data/type checking) process. Before storing user-supplied information in a table, the server performs a data integrity check to verify whether the user is supplying valid information.
If the information is valid it is stored in the table; otherwise the server raises an error message.
We can achieve this data integrity in three ways:
1.       Data Types
2.       Constraints
3.       Triggers
Constraints: A constraint is nothing but a condition on a column. If we perform any operation that violates a constraint, the server raises an error message.
Type of Constraints:
NOT NULL:  When we place a NOT NULL constraint on a column, it will not allow NULL values. Entering a value for that column is mandatory.
UNIQUE: When we place a UNIQUE constraint on a column, it will not allow duplicate values, but it does allow a single NULL value.
CHECK: It is used for evaluating range conditions on numeric columns and comparison conditions on character columns.
PRIMARY KEY: It is a combination of UNIQUE + NOT NULL + CLUSTERED INDEX.
This means when we place a PRIMARY KEY constraint on a column, it will not allow duplicate values and it will not accept NULL values; meanwhile, the data in that column is arranged in ascending order due to the CLUSTERED INDEX.
FOREIGN KEY: A FOREIGN KEY must reference a PRIMARY KEY (or UNIQUE) column in another table.
A FOREIGN KEY column can accept duplicate values and NULL values.
A FOREIGN KEY has to take its values from its corresponding PRIMARY KEY column.
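A minimal sketch putting these constraints together; the Dept and Emp tables and their columns are invented for illustration:

CREATE TABLE Dept
(
    DeptID   INT PRIMARY KEY,              -- unique + not null + clustered index
    DeptName VARCHAR(50) NOT NULL UNIQUE   -- mandatory, no duplicates allowed
);

CREATE TABLE Emp
(
    EmpID   INT PRIMARY KEY,
    EmpName VARCHAR(50) NOT NULL,                     -- entering a value is mandatory
    Age     INT CHECK (Age BETWEEN 18 AND 65),        -- range condition on a numeric column
    DeptID  INT FOREIGN KEY REFERENCES Dept (DeptID)  -- must match a Dept row; duplicates and NULLs allowed
);

An INSERT that violates any of these conditions (for example Age = 10) makes the server raise an error instead of storing the row.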

Tuesday, January 17, 2012

SQL Queries - In General

To Search Stored Procedures:
SELECT OBJECT_NAME(id)
FROM syscomments
WHERE [text] LIKE '%<TEXT>%' AND OBJECTPROPERTY(id, 'IsProcedure') = 1
GROUP BY OBJECT_NAME(id)
To Search Tables:

SELECT *
FROM INFORMATION_SCHEMA.TABLES 
WHERE TABLE_NAME like '%<TEXT>%'

Query to see exactly what queries are running in the background (typically long-running queries):
select r.session_id
    ,r.status
    ,substring(qt.text, r.statement_start_offset/2,
        (case when r.statement_end_offset = -1
              then len(convert(nvarchar(max), qt.text)) * 2
              else r.statement_end_offset
         end - r.statement_start_offset)/2) as query_text  -- the statement executing right now
    ,qt.dbid
    ,qt.objectid
    ,r.cpu_time
    ,r.total_elapsed_time
    ,r.reads
    ,r.writes
    ,r.logical_reads
    ,r.scheduler_id
from sys.dm_exec_requests r
cross apply sys.dm_exec_sql_text(r.sql_handle) as qt
where r.session_id > 50        -- skip system sessions
order by r.scheduler_id, r.status, r.session_id;

Wednesday, January 11, 2012

Fundamentals of Data Warehousing

Data Warehousing (DWH)
Data warehousing is a vast field that includes the processes and methodologies involved in creating, populating, querying and maintaining data in a warehouse. The objective of DWH is to maintain historical data and help users make better business decisions with that data.
Data Warehouse:
As per Bill Inmon, "A warehouse is a historical, subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."
By Historical we mean the data is continuously collected from sources and loaded into the warehouse. The previously loaded data is not deleted for a long period of time, which results in historical data building up in the warehouse.
By Subject-Oriented we mean data grouped into a particular business area instead of the business as a whole.
By Integrated we mean collecting and merging data from various sources. These sources could be disparate in nature.
By Time-Variant we mean that all data in the data warehouse is identified with a particular time period.
By Non-Volatile we mean that data loaded in the warehouse is based on business transactions in the past, hence it is not expected to change over time.
Enterprise Data Warehouse (EDW):
Enterprise Data Warehouse (EDW) is a central, normalized repository containing data for multiple subject areas of an organization. EDWs are generally used for feeding data to subject-area-specific data marts.

We must consider building an EDW when:
1. We have clarity on the requirements of multiple business processes.
2. We want a centralized architecture catering to a single version of the truth.
Data Mart:
When we restrict the scope of a data warehouse to a subset of business processes or subject areas, we call it a Data Mart. Data marts can be stand-alone entities, be built from warehouses, or be used to build a warehouse.
1. Stand-Alone Data Marts – These are single-subject-area data warehouses. Stand-alone data marts could lead to many stove-pipe/silo data marts in the future.
2. Data Marts fed from an Enterprise Data Warehouse (Top-Down) – This is the top-down strategy suggested by Bill Inmon. Here we build an enterprise data warehouse (EDW) hosting data for multiple subject areas. From the EDW we build subject-area-specific data marts.
3. Data Marts building a Data Warehouse (Bottom-Up) – This is the bottom-up strategy suggested by Ralph Kimball. Here we build subject-area-specific data marts first, keeping in mind the shared business attributes (called conformed dimensions). Using the conformed dimensions we build a warehouse.
Online Transaction Processing Systems (OLTP):
Online Transaction Processing (OLTP) systems are transactional systems that perform the day-to-day business operations. In other words, OLTP systems are the automated day-to-day business processes. In the world of DWH, they are usually one of the types of source systems from which the day-to-day business transactional data is extracted. OLTP systems contain the current-valued business data that is updated based on business transactions.

Legacy Systems:

Legacy system is the term given to a historical system having very large storage capabilities. Mainframes and the AS/400 are two examples of legacy systems. Applications running on legacy systems are generally OLTP systems. Due to their large storage capabilities, the business transactional data from legacy applications and other OLTP systems is stored on them and maintained for a long period of time. In the world of DWH, legacy systems are often used to extract past historical business data.

Data Profiling:
This is a very important activity, mostly performed while collecting the requirements and studying the system. Data profiling is a process that familiarizes you with the data you will be loading into the warehouse/mart.
In its most basic form, a data profiling process would read the source data and report on the:
Type of data – whether the data is numeric, text, date, etc.
Data statistics – minimum and maximum values, most frequently occurring value.
Data anomalies – junk values, incorrect business codes, invalid characters, values outside a given range.
Using data profiling we can validate business requirements with respect to the business codes and attributes present in the sources, and define exception handling for cases where we get incorrect or inconsistent data.
A lot of Data Profiling tools are available in the market. Some examples are Trillium Software and IBM Websphere Data Integrator’s Profile Stage.
In the simplest form of data profiling, we can even hand-code (program) a data profiling application specific to the project. Programming languages like Perl, Java, PL/SQL and shell scripting can be very handy in creating these applications.
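A basic profile can also be gathered with plain SQL. The sketch below assumes a hypothetical SourceCustomer table with Age and Gender columns:

SELECT COUNT(*)               AS TotalRows,
       COUNT(DISTINCT Gender) AS DistinctGenderCodes,   -- how many representations of gender exist
       MIN(Age)               AS MinAge,
       MAX(Age)               AS MaxAge,
       SUM(CASE WHEN Age NOT BETWEEN 0 AND 120
                THEN 1 ELSE 0 END) AS OutOfRangeAges    -- simple anomaly count
FROM SourceCustomer;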
Data Modeling:
This activity is performed once we have understood the requirements for the data warehouse/mart. Data modeling refers to creating the structure and relationship of the data for the business rules defined in the requirements.
We create the conceptual model, followed by the logical model, followed by the physical model. Here we see a progression in the level of detail as we proceed toward the physical model.
Conceptual Model:
Conceptual Model identifies at a high level the subject areas involved. The objective of creating the conceptual model is to identify the scope of the project in terms of the subject areas involved.
Logical Model:
Logical Model identifies the details of the subject areas identified in the conceptual model, and the relationships between the subject areas. Logical model identifies the entities within the subject area and documents in detail the definition of the entity, attributes, candidate keys, data types and the relationship with other entities.
Physical Model:
Physical Model identifies the table structures, the columns within the tables, primary keys, data types and sizes of the columns, constraints on the tables and the privileges on the tables. Using the physical model we develop the database objects through the DDLs produced in the physical model.
Normalization:
This is a technique used in data modeling that emphasizes avoiding storing the same data element in multiple places. We follow the three rules of normalization, called First Normal Form, Second Normal Form and Third Normal Form, to achieve a normalized data model.
A normalized data model may result in many tables/entities having multiple levels of relationships, for example table1 related to table2, table2 further related to table3, table3 related to table4 and so on.
First Normal Form (1NF) – The attributes of the entity must be atomic and must depend on the key.
Second Normal Form (2NF) – Every attribute must depend on the whole key (no partial dependencies on part of a composite key).
Third Normal Form (3NF) – Every attribute must depend on nothing but the key (no transitive dependencies between non-key attributes).
Theoretically we have further rules called Boyce-Codd Normal Form, Fourth Normal Form and Fifth Normal Form. In practice we don't use the rules beyond 3NF.
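As a small illustration (all table and column names here are invented), a single flat order table can be split into 3NF tables so that customer and product details are stored only once:

-- Before: Orders(OrderID, CustomerName, CustomerCity, ProductCode, ProductName, Qty)
-- After normalizing to 3NF:
CREATE TABLE Customer (CustomerID   INT PRIMARY KEY,
                       CustomerName VARCHAR(50),
                       CustomerCity VARCHAR(50));
CREATE TABLE Product  (ProductCode  CHAR(10) PRIMARY KEY,
                       ProductName  VARCHAR(50));
CREATE TABLE Orders   (OrderID      INT PRIMARY KEY,
                       CustomerID   INT REFERENCES Customer (CustomerID),
                       ProductCode  CHAR(10) REFERENCES Product (ProductCode),
                       Qty          INT);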
Dimensional Modeling:
This is the form of data modeling used in data warehousing where we store business-related attributes in tables called dimension tables and business-related metrics/measures in tables called fact tables. The business metrics are associated with the business attributes by relating the dimension tables to the fact table. The dimension tables are not related to one another.
A fact table could be related to many dimension tables and hence forms the center of the structure when graphically represented. This structure is also called the star schema.
Please note that here we do not emphasize normalizing the data; instead we keep fewer levels of relationships.
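A minimal star schema sketch, using hypothetical sales dimensions and measures (DATETIME is used since the DATE type arrived only in SQL Server 2008):

CREATE TABLE DimProduct (ProductKey   INT PRIMARY KEY,
                         ProductName  VARCHAR(50),
                         Category     VARCHAR(30));
CREATE TABLE DimDate    (DateKey      INT PRIMARY KEY,
                         CalendarDate DATETIME,
                         MonthName    VARCHAR(10));
CREATE TABLE FactSales  (DateKey      INT REFERENCES DimDate (DateKey),
                         ProductKey   INT REFERENCES DimProduct (ProductKey),
                         SalesAmount  MONEY,
                         QtySold      INT);

Note that the two dimension tables relate only to the fact table, never to each other.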
Extract Transform Load (ETL):
These are the back-end processes involved in collecting data from various sources, preparing the data with respect to the requirements, and loading it into the warehouse/mart.
Extraction – This is the process where we select the data from the various source systems and transport it to an ETL server. Source systems could range from one to many in number, and be similar or completely disparate in nature.
Cleansing – Once the data has been extracted, we need to check whether the data is valid and ensure consistency when it comes from disparate sources. We check for valid values by applying ETL data validation rules. The rules could be defined to check for correct data formats (numeric, date, text), data within range, correct business codes, and junk values. We can ensure consistency of data collected from disparate sources by making sure the same business attribute has the same representation. An example would be the gender of a customer represented as “m” and “f” in one source system and “male” and “female” in another. In this case we convert “male” to “m” and “female” to “f” so that the gender coming from the second source system is consistent with that from the first.
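In T-SQL, that gender standardization could look like the following sketch (StagingCustomer and its columns are assumed names):

SELECT CustomerID,
       CASE LOWER(Gender)
            WHEN 'male'   THEN 'm'
            WHEN 'female' THEN 'f'
            ELSE Gender             -- leave already-consistent codes untouched
       END AS Gender
FROM StagingCustomer;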
Transformation – This is the process where we apply the business rules to the source data before loading it to the warehouse/mart. Transformation as the term suggests converts the data based on the business rules. Transformation also adds further business context to the data. The types of transformations we generally use in ETL are sort data, merge data, compare data, generate running numbers, aggregate data, change data type etc.
Loading – Once the source data is transformed by applying business rules, we load it into the warehouse/mart. There are two types of loads we perform in data warehousing.
History Load / Bulk Load – This is the process of loading the historical data accumulated over a period of time into the warehouse. This is usually a one-time activity where we take the historical business data and bulk load it into the warehouse/mart.
Incremental Load – This follows the history load. Based on the availability of current source data, we load the data into the warehouse. This is an ongoing process, performed periodically based on the availability of source data (e.g. daily, weekly, monthly).
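A minimal sketch of an incremental load, assuming a StagingSales table with a LoadDate column feeding the FactSales table sketched earlier (all names hypothetical):

DECLARE @LastLoadDate DATETIME;
SET @LastLoadDate = '20120101';   -- normally read from an ETL control table

INSERT INTO FactSales (DateKey, ProductKey, SalesAmount, QtySold)
SELECT DateKey, ProductKey, SalesAmount, QtySold
FROM StagingSales
WHERE LoadDate > @LastLoadDate;   -- only rows that arrived since the previous run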
Operational Data Store (ODS):
Operational Data Store (ODS) is a layer of repository that resides between the data warehouse and the source systems. The current, detailed data from disparate source systems is integrated and grouped subject-area-wise in the ODS. The ODS is updated more frequently than a data warehouse; the updates happen shortly after a source system's data changes. There are two objectives for an ODS:
1. Source the DWH with detailed, integrated data.
2. Provide the organization with operational integration.
Online Analytical Processing (OLAP):
Online Analytical Processing (OLAP) is the approach involved in providing analysis of multidimensional data. There are various forms of OLAP, namely:
Multidimensional OLAP (MOLAP) – This uses a cube structure to store the business attributes and measures. Conceptually it can be considered a multi-dimensional graph having various axes. Each axis represents a category of business called a dimension (e.g. location, product, time, employee). The data points are the business transactional values, called measures. The coordinates of each data point relate it to the business dimensions. The cubes are generally stored in a proprietary format.
Relational OLAP (ROLAP) – This uses a database to store the business categories as dimension tables (containing textual business attributes and codes) and fact tables to store the business transactional values (numeric measures). The fact tables are connected to the dimension tables through foreign-key relationships.
Hybrid OLAP (HOLAP) – This uses a cube structure to store some business attributes and database tables to hold the other attributes.
Business Intelligence (BI):
Business Intelligence (BI) is a generic term for the processes, methodologies and technologies used for collecting, analyzing and presenting business information.
The objective of BI is to help users make better business decisions. The objectives of DWH and BI go hand-in-hand: enabling users to make better business decisions.
Data marts/warehouses provide easily accessible historical data to the BI tools, which provide end users with forecasting/predictive analytical capabilities and historical business patterns and trends.
BI tools generally come with reporting features like slice-and-dice, pivot-table analysis, visualization, and statistical functions that make the decision-making process even more sophisticated and advanced.
Decision Support Systems (DSS):
A Decision Support System (DSS) is used for management decision-making and reporting. It is basically an automated system that gathers business data and helps management make business decisions.

Data Mining:
Data mining, in the broad sense, deals with extracting relevant information from very large amounts of data and making intelligent decisions based on the information extracted.

Building blocks of Data Warehousing:
If you had to build a data warehouse, you would need the following components:
Source System(s) – A source system is generally a collection of OLTP systems and/or legacy systems that contain the day-to-day business data.
ETL Server – These are generally Symmetric Multi-Processing (SMP) machines or Massively Parallel Processing (MPP) machines.
ETL Tool – These are sophisticated GUI-based applications that enable extracting data from source systems, transforming and cleansing the data, and loading it into databases. The ETL tool resides on the ETL server.
Database Server – Usually the database resides on the ETL server itself. For very large amounts of data we may want to have separate Symmetric Multi-Processing or Massively Parallel Processing machines clustered to exploit the parallel processing of the RDBMS.
Landing Zone – Usually a file system, and sometimes a database, used to land the source data from the various source systems. Generally the data in the landing zone is in raw format.
Staging Area – Usually a file system and a database used to house the transformed and cleansed source data before loading it into the warehouse or mart. Generally we avoid any further data cleansing and transformation after data is loaded into the staging area. The format of the data in the staging area is similar to that in the warehouse or mart.
Data Warehouse – A database housing the normalized (top-down) or conformed star schemas (bottom-up).
Reporting Server – These are generally Symmetric Multi-Processing machines or Massively Parallel Processing machines.
Reporting Tool – A sophisticated GUI-based application that enables users to view, create and schedule reports on the data warehouse.
Metadata Management Tool – Metadata is defined as the description of the data stored in your system. In data warehousing the reporting tool, database and ETL tool each have their own metadata. But if we look at the overall picture, we need metadata defined starting from the source systems through the landing zone, staging area, ETL, warehouse, mart and reports. An integrated metadata definition system covering source, landing, staging, ETL, warehouse, mart and reporting is advisable.


SQL Interview Questions - A Complete Reference

1.   What is SQL?
Ans: SQL stands for ‘Structured Query Language’.
2.   What is Select statement?
Ans: Select statement is used for data retrieval purpose. The Select statement lets you select a set of values from a table in a database and the values selected from the database table would depend on the various conditions that are specified in the SQL query.
3.   How can you compare a part of the name rather than the entire name?
Ans: SELECT * FROM dbo.People WHERE Empname LIKE '%ab%'
This would return a recordset of records whose Empname contains the sequence 'ab'.
4.   What is the use of INSERT statement?
Ans: The INSERT statement lets you insert information into the database.
5.   How do you delete a record from a table?
Ans: By using the DELETE statement with a WHERE clause.
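Example (EmpID is an assumed column here): DELETE FROM Emptable WHERE EmpID = 1001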
6.   How could I get distinct entries in a table?
Ans: The SELECT statement in conjunction with DISTINCT lets you select a set of distinct values from a table in a database. The values selected from the database table would of course depend on the various conditions that are specified in the SQL query.
Example: SELECT DISTINCT Empname FROM Emptable
7.   How to get the results of a Query sorted in any order?
Ans: You can sort the results and return them to your program by using the ORDER BY clause.
Example: SELECT Empname, Age, City FROM Emptable ORDER BY Empname
8.   How can I find the total number of records in a table?
Ans: You could use the COUNT keyword
Example: SELECT COUNT (*) FROM Emp WHERE Age>40
9.   What is a join and explain different types of joins?
Ans: Joins are used in queries to explain how different tables are related and also let you select data from a table depending upon data from another table.
Types of joins: INNER JOINs, OUTER JOINs, CROSS JOINs. OUTER JOINs are further classified as LEFT OUTER JOINS, RIGHT OUTER JOINS and FULL OUTER JOINS.
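Example (a sketch assuming Emp and Dept tables that share a DeptID column):
SELECT e.Empname, d.DeptName
FROM Emp e
INNER JOIN Dept d ON e.DeptID = d.DeptID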
10.  What is a self join?
Ans: A self join is just like any other join, except that two instances of the same table are joined in the query.
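Example (assuming Emptable has an EmpID and a ManagerID column that points back to EmpID):
SELECT e.Empname AS Employee, m.Empname AS Manager
FROM Emptable e
INNER JOIN Emptable m ON e.ManagerID = m.EmpID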

Yet to be continued… will be refreshed daily with 5-10 more questions!!

SQL Server 2005 for Beginners – Microsoft Sql Server

MS SQL Server is a database management system developed and owned by Microsoft Corporation. The SQL Server database system was originally developed and implemented by Sybase Inc. Microsoft licensed this DBMS in 1988 for the OS/2 operating system and began implementing it for the Windows operating system in the early '90s; further development of SQL Server for OS/2 was cancelled.
In April 1994, Microsoft ended its cooperative agreement with Sybase and then came up with MS SQL Server. MS SQL Server runs exclusively under Microsoft operating systems such as Windows 2000/2003 Server and XP.
SQL Server was designed as a client/server DBMS. A client sends one or more requests to the server; the server processes each request and sends the result back to the client.
SQL Server – Versions
1.       SQL Server 4.0
2.       SQL Server 4.2
3.       SQL Server 6.0
4.       SQL Server 6.5
5.       SQL Server 7.0
6.       SQL Server 8.0 or SQL Server 2000
7.       SQL Server 9.0 or SQL Server 2005
8.       SQL Server 10.0 or SQL Server 2008
SQL Server – Editions
1.       Enterprise Edition
2.       Standard Edition
3.       Developer Edition
4.       Workgroup Edition
5.       Mobile Edition or Compact Edition
6.       Express Edition (Free Edition – can be downloaded from www.microsoft.com)
SQL Server – Features
1.       Security – Provides logins and passwords to control who can interact with the data in a database.
2.       Data Redundancy – Less data repetition.
3.       Data Integrity – Data validation.
4.       Backup and Recovery
5.       GUI-Based Software
6.       Low Cost
7.       Easy to Install and Maintain
Requirements to install SQL Server 2005 on your workstation:
1.       Operating System: Windows XP/Windows 2000/NT/Professional, Vista
2.       Primary Memory: 512 MB RAM or above
3.       Secondary Memory: 1GB or above
4.       Processor: 500 MHz or above
5.       Internet explorer: 5.0 or above
6.       Windows Installer: 3.1
7.       .Net Framework 2.0
Data Types in SQL Server:
Data type means the type of data which users provide for a specific column.
In SQL Server, two types of data types are available:
1.       System Defined Data Types
2.       User defined Data Types
System Defined Data Types:
SQL Server has predefined data types called System Defined data types or Built-in data types. These are again categorized into 4 types:
a.      Numeric Data Types
b.      String Data Types
c.      Date Time Data Types
d.      Special Data Types
Numeric Data Types:
These data types are used to provide numeric information for columns. They include:

Data Type        Size
BIT              0 or 1
TINYINT          1 BYTE
SMALLINT         2 BYTES
INT              4 BYTES
BIGINT           8 BYTES
REAL             4 BYTES
FLOAT            8 BYTES
DECIMAL (P, S)   5-17 BYTES


TINYINT, SMALLINT, INT and BIGINT are called Exact Numerics, whereas REAL and FLOAT are called Approximate Numerics.
String Data Types:
These data types are used to provide character information for columns. They include:

Data Type        Size
CHAR [(N)]       1 BYTE (default)
VARCHAR [(N)]    1 BYTE (default)
TEXT             16 BYTES (pointer)
VARCHAR(MAX)     up to 2 GB

CHAR [(N)]: It is a fixed-length data type which by default occupies 1 byte of memory. When we specify the N value it occupies N bytes of memory, where N is an integer. It follows a static memory allocation process.
VARCHAR [(N)]: It is a variable-length data type which by default occupies 1 byte of memory. When we specify the N value it occupies up to N bytes of memory, where N is an integer. It follows a dynamic memory allocation process.
Note: The maximum limit for N is 8000; if a value can be more than 8000 characters we use TEXT or VARCHAR(MAX).
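A small sketch showing these string types in a table definition (the table and column names are invented):

CREATE TABLE DataTypeDemo
(
    Code     CHAR(10),      -- always occupies 10 bytes (static allocation)
    Notes    VARCHAR(100),  -- occupies only as many bytes as actually stored (dynamic allocation)
    LongText VARCHAR(MAX)   -- for values that may exceed 8000 characters
);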
Date Time Data Types:
These data types are used to provide date-oriented information for columns. They include:

Data Type        Size      Range
SMALLDATETIME    4 BYTES   1900-01-01 to 2079-06-06
DATETIME         8 BYTES   1753-01-01 to 9999-12-31

Special Data Types:
These data types are used to provide miscellaneous information for columns. They include:

Data Type         Size
SMALLMONEY        4 BYTES
MONEY             8 BYTES
IMAGE             16 BYTES (pointer)
VARBINARY (MAX)   up to 2 GB
SQL_VARIANT       8016 BYTES

Databases in SQL Server:
In SQL Server there are two types of databases available.
1.   System Defined Databases
2.   User Defined Databases

System Defined Databases: Complete system-level information is stored in the system databases. SQL Server uses the system databases to operate user databases. These are also known as pre-defined databases or built-in databases. They include:
a.      Master Database
b.      Model Database
c.      MSDB Database
d.      Temp Database (tempdb)
e.      Sample Databases
1.       AdventureWorks
2.       AdventureWorksDW
Master Database: The master database is the “database for all databases”. It keeps track of logon accounts, linked servers, system configuration settings and more. It also contains initialization settings for SQL Server.
Model Database: Model is a very special database; anything that you place in the model database is automatically propagated to all new databases. For example, if you add a state table to model, that state table then appears in all new databases. You can include almost any object in the model database.
MSDB Database: The MSDB database is used by the SQL Server Agent service for scheduling jobs and alerts; it also records backup and restore history.
Temp Database: The tempdb database is used to hold temporary objects, which are generated automatically when users start working with SQL Server.
User Defined Databases: SQL Server enables users to create their own databases. Those are called User Defined Databases.
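As a minimal sketch (the database name here is arbitrary), a user-defined database is created with:

CREATE DATABASE SalesDB;

Any objects placed in the model database beforehand will automatically appear in SalesDB.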
Introduction to Management Studio:
The SQL Server developer's and administrator's primary tool for interacting with the system is SQL Server Management Studio.
To open this tool, from the Start Menu go to Programs > Microsoft SQL Server 2005 and click SQL Server Management Studio; it then shows the 'Connect to Server' window.
The following dialog box appears right after choosing SQL Server Management Studio.

1.       For Server type and Server name we need to enter the details.
2.       We have 2 authentication options in the Authentication list box:
a.       Windows Authentication: You will be connected to SQL Server using your Windows account.
b.      SQL Server Authentication: You will be connected to SQL Server using your own SQL Server credentials.
Note: In our setup the login name is sa and the password is sa123.
After connecting to SQL Server, it appears as follows:



Click on New Query (also available under the File menu) to open the Query Editor, in which you can write queries.