Code: A-20

Code: T-19 Subject: DATA WAREHOUSING AND DATA MINING

Time: 3 Hours June 2006 Max. Marks: 100

NOTE: There are 9 Questions in all.

· Question 1 is compulsory and carries 20 marks. Answer to Q. 1. must be written in the space provided for it in the answer book supplied and nowhere else.

· Out of the remaining EIGHT Questions answer any FIVE Questions. Each question carries 16 marks.

· Any required data not explicitly given, may be suitably assumed and stated.

Q.1 Choose the correct or best alternative in the following: (2x10)

a. An example of OLTP is

(A) printing of payslips in payroll system

(B) withdrawal of money at ATM

(C) preparation of reservation chart in railway reservation system.

(D) calculation of Income Tax.

b. Which of the following is true for OLAP

(A) usually requires read-only access.

(B) can be easily performed on any relational database.

(C) works on detailed operational data.

(D) satisfies the needs of operational level tasks in database systems.

c. Fact-constellation schema in a data warehouse is

(A) a normalized star schema

(B) a normalized snowflake schema

(C) collection of star schemata

(D) a subject oriented star schema.

d. Data integration means

(A) combining relevant fields of the record to derive aggregates.

(B) selecting task relevant data from different data sources.

(C) combining task relevant data from different data sources.

(D) normalizing and aggregating data using a well-defined technique.

e. During decision tree induction, pre-pruning leads to

(A) construction of full grown trees.

(B) small and accurate trees.

(C) small and in-accurate trees

(D) oversimplified and accurate trees.

f. Machine learning algorithms are not suitable for data mining because

(A) there is no machine for data mining.

(B) the algorithms are not scalable.

(C) they are not designed to mine association rules.

(D) they cannot discover hidden patterns from data.

g. ETL tools are used for

(A) populating a relational database.

(B) populating a data warehouse.

(C) mining large relational database.

(D) mining large data warehouses.

h. PCA is a technique used for

(A) Mining patterns. (B) Reducing data.

(C) Integrating data. (D) Cleaning data.

i. Whether a loan application is to be accepted or rejected can be decided by the following technique:

(A) Wavelet transform (B) Classification.

(C) Association rule. (D) Normalization.

j. Whether items X and Y should be put on sale together can be decided by the following technique

(A) decision tree (B) association rule

(C) binning (D) sampling

Answer any FIVE Questions out of EIGHT Questions.

Each question carries 16 marks.

Q.2 a. List two problems that arise with naturally evolving architectures. Explain with the help of suitable examples. (8)

b. List the results that are achieved by monitoring the data warehouse environment. (8)

Q.3 a. What are the perceived goals of executive information systems? What are the practical difficulties in the implementation of these systems? (6)

b. With the help of a diagram, show how an EIS is supported by a data warehouse. (6)

c. Explain ‘Event mapping’ with the help of an example. (4)

Q.4 a. What are the problems relating to the storage and use of external data in a data warehouse? How are these problems handled? (6)

b. What do you understand by Metadata? How does it facilitate the use of external data in a data warehouse? (6)

c. Why is archiving of external data an important activity in a data warehouse. (4)

Q.5 a. Why are derived data and DSS data kept out of corporate data model? (4)

b. Explain the difference between a migration plan and a methodology. Why do methodologies fail during implementation? (2+6)

c. Why is a feed back loop important for success of data warehouse implementation? (4)

Q.6 a. List and explain four differences between operational database system and data warehouse. (8)

b. What are advantages of using update-driven approach over query-driven approach for integrating multiple heterogonous information sources? (6)

c. What is a data cube? (2)

Q.7 Consider a data warehouse University which consists of the following dimensions: Student, Course, Semester, Instructor and two measures count and grade. Make assumptions about the key fields for the dimensions.

i. Draw a star schema diagram for the data warehouse.

ii. Using DMQL define the cube and dimensions.

iii. Assume that semester can be rolled up to a year. Draw the snowflake schema corresponding to this.

iv. What is the OLAP operation to list the average grade of CS courses for the year?

v. Specify the SQL query corresponding to (iv) above assuming the implementation in ROLAP. (3+3+2+4+4)

Q.8 a. Explain four methods of handling missing values during data cleaning. (8)

b. What is normalization during data transformation? Differentiate between

the three types of normalization. Give an example for each type. (8)

Q.9 a. Given the following transaction database D, mine for association rules with minimum support = 0.5 and minimum confidence = 0.8 (6)

Tid
1	A B D
2	A C
3	A B C
4	B D E
5	D E
6	A C
7	B D
8	B D E

b. Write a Generic decision tree induction algorithm. (5)

c. Given the following decision tree for the concept buys-computer,

which indicates whether or not a customer is likely to purchase a computer,

extract classification rules. (5)