1. Briefly describe the major components of a data warehouse architecture.
Components of a data warehouse
A data warehouse contains a collection of data used for decision making and business intelligence.
· It is subject-oriented, integrated, time-variant, and non-updateable.
· The three data components in the data warehouse architecture are:
· Operational data
· Reconciled data
· Derived data
Components of the data warehouse architecture:
Operational data:
· This is the data maintained in the operational systems throughout the organization.
Reconciled data:
· This is the data stored in the enterprise data warehouse and in operational data stores.
· It contains current, detailed data and serves as the authoritative source for decision-support applications.
Derived data:
· Derived data is data obtained from the data marts and used by end-user decision-support applications.
· It contains selected, formatted, and aggregated data.
· It is the data stored in each data mart.
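As a rough illustration, the sketch below (in Python) shows how detailed reconciled data might be selected and aggregated into derived data for a data mart; the record fields and the grouping rule are assumptions made up for this example, not a prescribed implementation.

from collections import defaultdict

# Hypothetical reconciled data: current, detailed sales records
# held in the enterprise data warehouse.
reconciled_sales = [
    {"product": "widget", "region": "east", "amount": 120.0},
    {"product": "widget", "region": "west", "amount": 80.0},
    {"product": "gadget", "region": "east", "amount": 45.0},
]

# Derive data for a sales data mart: select, format, and aggregate
# the detailed records by product (an assumed business rule).
derived_totals = defaultdict(float)
for record in reconciled_sales:
    derived_totals[record["product"]] += record["amount"]

print(dict(derived_totals))  # {'widget': 200.0, 'gadget': 45.0}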
Types of metadata in the data warehouse architecture:
There are three types of metadata:
· Operational metadata.
· Enterprise data warehouse (EDW) metadata.
· Data mart metadata.
Operational metadata:
It describes the data in the operational systems that feed the enterprise data warehouse.
It exists in various formats, and its quality is often poor.
Enterprise data warehouse (EDW) metadata:
It describes the data in the reconciled data layer.
It provides the rules for converting the operational data into reconciled data.
It is derived from the enterprise data model.
Data mart metadata:
It describes the data in the derived data layer.
It provides the rules for converting the reconciled data into derived data.
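As a minimal sketch of what such metadata might look like (the table names, fields, and rule text are invented for illustration), an EDW metadata entry could record a conversion rule in Python as:

# Hypothetical EDW metadata entry describing one reconciled-layer
# table and the rule that converts operational data into it.
edw_metadata = {
    "table": "reconciled_sales",
    "source": "operational.order_entry",  # operational system feed
    "transformation": "sum order amounts, standardize currency",
    "refresh": "nightly batch",
}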
2. Explain how the volatility of a data warehouse differs from the volatility of a database for an operational information system.
Data warehouse:
· A data warehouse contains a collection of data used for decision making and business intelligence.
· It is a distinct kind of database in that it focuses on business intelligence, time-variant data, and external data.
· The term data warehouse usually denotes the integration of many different databases across an entire enterprise.
· It is subject-oriented, integrated, time-variant, and non-updateable.
Operational database:
An operational database is a database that is accessed and updated on a regular basis and generally handles the daily transactions of a business.
It is used to manage dynamic data that is modified in real time.
Volatility of a data warehouse versus an operational database:
A key difference between a data warehouse and an operational system is the type of data stored.
A data warehouse is based on the use of periodic data, whereas an operational system is based on the use of transient data.
Transient data is data in which a change to an existing record overwrites the previous record, so the old record is deleted.
Periodic data is data that cannot be overwritten once it is added to the store.
In an operational system the data are therefore highly volatile, whereas a data warehouse stores each change to the data.
Thus the data warehouse and the operational database differ in volatility according to how their data stores handle change.
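The contrast can be sketched in Python (a simplified illustration; the record structures are assumptions for this example only): a transient store overwrites the old value, while a periodic store appends a new time-stamped version.

from datetime import datetime

# Transient data (operational system): an update overwrites the
# previous record, so the old value is lost.
transient_store = {"cust_001": {"balance": 100}}
transient_store["cust_001"] = {"balance": 250}  # old record deleted

# Periodic data (data warehouse): each change is appended with a
# timestamp, so every historical value is preserved.
periodic_store = []
periodic_store.append({"key": "cust_001", "balance": 100,
                       "as_of": datetime(2023, 1, 1)})
periodic_store.append({"key": "cust_001", "balance": 250,
                       "as_of": datetime(2023, 2, 1)})  # history kept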
Chapter 11
What are the key capabilities of NoSQL that extend what SQL can do?
NoSQL is a technology designed for handling big data. It stores and retrieves data, but not based on the relational model.
The key capabilities of NoSQL that extend what SQL can do are:
· It is less concerned with minimizing storage space than SQL was, because storage costs have fallen so much.
· It focuses on flexibility, variety, versatility, agility, and scalability.
· It facilitates “scaling out” instead of “scaling up” as SQL did: its architectural solutions allow a huge number of commodity servers to be added, which is what makes scaling out possible.
· Other parts of the system can keep working efficiently even when a failure occurs in a single component.
· It facilitates a “shared-nothing” architecture, a replication architecture that does not assign separate master and slave roles.
· It provides “schema on read” instead of “schema on write” as SQL did. Schema on read refers to specifying the structure of each individual data item in a collection separately, using languages such as JSON or XML (see the sketch after this list).
· Instead of the ACID (Atomicity, Consistency, Isolation, and Durability) properties used in SQL, it uses the BASE (Basically Available, Soft state, and Eventually consistent) characteristics.
· NoSQL guarantees high availability over consistency, while SQL offers guaranteed consistency but may sacrifice availability in a number of situations.
· Multi-model: NoSQL databases play a significant role in multi-model database applications, since they are capable of handling all kinds of data (structured, semi-structured, and unstructured) and can therefore serve applications that require all of these.
· Easy scalability: the database can be expanded and made larger as requirements grow simply by adding servers.
· Flexibility: it is more flexible than a relational database because its multi-model design allows it to handle multiple forms of data.
· Distributed: the database is distributed in nature and provides global accessibility, meaning it can be used at multiple locations by multiple companies at the same time through their data centers.
· Zero downtime: because it uses a masterless architecture, multiple copies of the same data are kept and managed at different nodes, so if one database node is under maintenance or fails, another node can take over.
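To illustrate “schema on read” (a minimal sketch; the documents and field names are invented for this example), each JSON document in a collection can carry its own structure, and the application interprets that structure only when the data is read:

import json

# Two documents in the same collection with different structures;
# no schema was enforced when they were written.
docs = [
    '{"name": "Alice", "email": "alice@example.com"}',
    '{"name": "Bob", "phones": ["555-0100", "555-0199"]}',
]

# The schema is applied on read: the application decides how to
# interpret whatever fields each document happens to contain.
for raw in docs:
    record = json.loads(raw)
    contact = record.get("email") or record.get("phones", ["n/a"])[0]
    print(record["name"], contact)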
Explain the relationship between Hadoop and MapReduce.
MapReduce is a programming model designed for large-scale parallel processing of data. In other words, MapReduce helps a computer solve a problem by parallelizing the processing of large data stores across an environment that consists of a large number of commodity servers. It works in two phases:
1. Mapper phase: in this phase, raw files are taken as input, and the output keys and output values are separated out of each record.
2. Reducer phase: the output coming from the Mapper phase is taken as the input of the Reducer phase. The data are then grouped by key, and all the output values for each output key are aggregated.
At the end, the output coming from the reducers is sent to the Hadoop Distributed File System (HDFS).
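As a simplified sketch of the two phases (the classic word-count example, written in plain Python rather than against the actual Hadoop API), the mapper emits key-value pairs and the reducer aggregates the values grouped under each key:

from collections import defaultdict

def mapper(line):
    # Mapper phase: take a raw line of input and emit (key, value) pairs.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Reducer phase: aggregate all values that share the same key.
    return key, sum(values)

lines = ["big data needs big tools", "hadoop processes big data"]

# Group the mapper output by key (the framework's shuffle step).
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

for key in groups:
    print(reducer(key, groups[key]))  # e.g. ('big', 3), ('data', 2)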
Hadoop is a batch-processing tool. Hadoop’s essence is in processing very large amounts of data by distributing the data (using HDFS, the Hadoop Distributed File System) and the processing tasks among a large number of low-cost commodity servers.
Hadoop is an open-source implementation of MapReduce that makes the capabilities of this model available to other applications.
In this way, MapReduce and Hadoop are related to each other.