Sr.No | Movie | PooJa's Rater |
1 | Sholay | Wow |
2 | DevD | Wow |
3 | Dasvidaniya | Wow |
4 | The Blue Umbrella | Wow |
5 | Billu Barber | No Comments |
6 | Welcome to Sajjanpur | Wow |
7 | Slumdog Millionare | Ok |
8 | Raiders of the Lost Ark (1981) | Liked |
9 | Indiana Jones and the Last Crusade (1989) | Liked |
10 | Indiana Jones and Kingdom of the Crystal Skull | Liked |
11 | The Mummy | Wow |
12 | The Mummy Returns | Wow |
13 | Ratatouille (2007) | Wow |
14 | WALL-E | Wow |
15 | Happy Feet | Wow |
16 | Finding Nemo | Wow |
17 | The Big Lebowski (1998) | Liked |
18 | Forrest Gump (1994) | Wow |
19 | The Prestige (2006) | Ok |
20 | Le fabuleux destin d'Amélie Poulain (2001) | Wow |
21 | Apocalypse Now (1979) | Wow |
22 | It's a Wonderful Life (1946) | Wow |
23 | The Curious Case of Benjamin Button (2008) | Wow |
24 | In Bruges (2008) | Wow |
25 | The Wrestler (2008) | Wow |
26 | Memento (2000) | Liked |
27 | Rachel Getting Married. | Ok |
28 | Milk | Ok |
29 | Seven Pounds | Liked |
30 | The Pursuit of Happiness | Wow |
31 | Before Sunrise | Wow |
32 | Before Sunset | Wow |
We created this blog so that we can share the information we gather, in our day-2-day lives. Be it the Books which we read or Music we listen to or Movies which we watch. Anything and Everything under the Sun.
Saturday, May 30, 2009
Movies.
Tuesday, May 26, 2009
Glossary of Datawarehousing Terms
Common terms used in DWH arena:
24/7 Operational availability 24 hours a day, 7 days a week.
3NF See Third normal form.
Accumulating snapshot fact table: Type of fact table with multiple dates representing the major milestones of a relatively short-lived process or pipeline. The fact table is revisited and updated as activity occurs. A record is placed in an accumulating snapshot fact table just once, when the item that it represents is first created. Contrast with Periodic snapshot fact table and Transaction-grain fact table.
Activity-based costs: Costs that are reported as a measure of the activity required rather than on an unchanging standard value. See Allocations.
Additive (facts): Measurements in a fact table that can be added across all the dimensions. Ratios and unit prices are not generally additive.
Ad hoc queries: Queries that are formulated by the user on the spur of the moment. The ad hoc attack refers to the difficulty a database has in anticipating the pattern of queries. The more those queries are ad hoc, the more symmetric the database model must be so that all queries look the same. This is one of the strengths of the dimensional modeling approach.
Aggregate navigator Layer of software between the client and the relational data that intercepts the client’s Structured Query Language (SQL) and transforms that SQL, if it can, to use aggregates that are present somewhere in the data warehouse. The aggregate navigator, by definition, shields the user application from needing to know if an aggregate is present. In this sense, an aggregate behaves like an index. Some relational database suppliers have incorporated aggregate navigation capabilities into their database management systems (DBMSs).
Aggregates Physical rows in a database, almost always created by summing other records in the database for the purpose of improving query performance, sometimes referred to as pre-calculated summary data. See Aggregate navigator.
Algorithm: Standard method for computing something; essentially a mathematical recipe.
Analytic application: Prebuilt data access applications intended for less frequent users of the data warehouse. Typically parameter-driven with flexibility to analyze countless permutations. Such an application represents an opportunity to encapsulate the analytic best practices of an organization.
Analytic processing: Using data for analytic purposes to support business decision-making, versus operational processing, where data is used to run the business. Analytic processing often involves trend analysis, period-to-period comparisons, and drilling.
ANSI American National Standards Institute, the recognized standards publishing body for a range of businesses, professions, and industries.
Application constraint (SQL): Portion of the WHERE clause in SQL that defines a constraint on values, usually within a dimension table. To be contrasted with a join constraint.
Architected data marts See Data warehouse bus architecture.
ASCII: American Standard Code for Information Interchange. An 8-bit character set encoding. ASCII can only support 127 characters, which is not enough for international usage. See Extended ASCII and UNICODE.
Asset Item that appears on the balance sheet of a company that represents something owned by the company or something owed to the company by someone else. Bank loans are assets from the bank’s point of view because they are owed to the bank.
Associative table
Atomic data The most detailed granular data captured by a business process. Atomic data must be made available in the data presentation area to respond to unpredictable ad hoc queries.
Attribute A column (field) in a dimension table.
Audit dimension A special dimension that tags each fact table row with operational meta data (for example, data lineage and confidence) when the row is created.
Authentication The step of determining the identity of the requesting client. Single-factor authentication usually is based on a simple password and is the least secure authentication scheme. Two-factor authentication may involve What-You-Know (a password) with What-You-Possess (a plastic card) and is secure enough for banks’ automated teller machines. Other two-factor authentication schemes involve What-You-Know with Who-You-Are, using biometric scanning devices, such as fingerprint-, retina-, or voice-based systems.
Average order backlog Average length of time that orders have been waiting to be fulfilled.
B-tree index: A relational index that is particularly useful for high-cardinality columns. The B-tree index builds a tree of values with a list of row IDs that have the leaf value. B-tree indexes are almost worthless for low-cardinality columns because they take a lot of space and they usually cannot be combined with other indexes at the same time to increase the focus of the constraints. Contrast with Bitmap index.
Behavior score: Figure of merit that is assigned to a customer based on purchase patterns or credit patterns. Also referred to as a segmentation score. Behavior scores can range from simple segmentation labels such as high medium, or low to complex numerical results of a data-mining application.
BI See Business intelligence.
Bitmap index: A relational indexing technique most appropriate for columns with a limited number of potential values (low cardinality). Most optimizers can combine more than one bitmapped index in a single query. Contrast with B-tree index.
Bridge table: A table with a multipart key capturing a many-to-many relationship that can’t be accommodated by the natural granularity of a single fact table or single-dimension table. Serves to bridge between the fact table and the dimension table in order to allow many-valued dimensions or ragged hierarchies. Sometimes referred to as a helper or associative table.
Browser: Personal computer (PC) client software that communicates with Web servers and displays Web content (text, image, audio, video) on the PC. The main function of the browser is to execute the HyperText Markup Language (HTML) program downloaded from the remote Web server.
Bus Originally used in the electrical power industry to refer to the common structure providing power; then used in the computer industry to refer to a standard interface specification. In the data warehouse, the bus refers to the standard interface that allows separate data marts to coexist usefully.
See Data warehouse bus architecture.
Business dimensional lifecycle: A methodology for planning, designing, implementing, and maintaining data warehouses, as described in The Data Warehouse Lifecycle Toolkit (Wiley, 1998).
Business intelligence (BI): A generic term to describe leveraging the organization’s internal and external information assets for making better business decisions.
Business measure: Business performance metric captured by an operational system and represented as a fact in a dimensional model.
Business process Major operational activities or processes supported by a source system, such as orders, from which data can be collected for the analytic purposes of the data warehouse. Choosing the business process is the first of four key steps in the design of a dimensional model.
Byte (B): Unit of measure, consisting of 8 bits of data.
Cache In a Web browser, disk space set aside to store temporary copies of Web objects so that if they are requested again, they need not be fetched from the Web but can be obtained locally. More generally, a cache is a temporary storage space for objects or data expected to be used in the near future.
Cannibalization: Growth of sales of one product causing the slowing of sales of another product. Usually referring to two products made by the same manufacturer.
Cardinality: The number of unique values for a given column in a relational table. Low cardinality refers to a limited number of values, relative to the overall number of rows in the table.
Cartesian product: A set comprised of all the possible combinations from multiple constraints.
Causal (factor or dimension): Something that is thought to be the cause of something else. Causal factors in retail sales usually refer to ads, displays, coupons, and price reductions. A causal dimension describes these causal factors.
Centipede fact table: A fact table with too many dimensions (often more than 20), leading to a schema that resembles a centipede with numerous foreign keys joined to numerous dimension tables. Centipedes typically result when designers attempt to represent hierarchical relationships with a proliferation of separate dimensions rather than nested within a single dimension.
Chart of accounts List of accounts used by the general ledger. A uniform chart of accounts is a chart of accounts applied consistently across an organization.
Churn In a subscription service, the ratio of customers lost to customers gained.
Click and mortar: A hybrid business with both a Web-based and a physically tangible presence. Contrast with Brick and mortar.
Clickstream: The composite body of actions taken by a user at a Web browser. The Clickstream can include both the actual clicks (browser requests) and the server responses to those requests. The Clickstream takes the form of Web server logs, where each Web server log record corresponds to a single page event.
Click-through: The action of arriving at a Uniform Resource Locator (URL; Web page) by clicking on a button or link, usually located on a different Web site.
Composite key: Key in a database table made up of several columns. Same as Concatenated key. The overall key in a typical fact table is a subset of the foreign keys in the fact table. In other words, it usually does not require every foreign key to guarantee uniqueness of a fact table row.
Concatenated key See Composite key.
Conformed dimensions: Dimensions are conformed when they are either exactly the same (including the keys) or one is a perfect subset of the other. Most important, the row headers produced in answer sets from two different conformed dimensions must be able to be matched perfectly.
Conformed facts: Facts from multiple fact tables are conformed when the technical definitions of the facts are equivalent. Conformed facts are allowed to have the same name in separate tables and can be combined and compared mathematically. If facts do not conform, then the different interpretations must be given different names.
Consolidated data mart: Data marts that combine business measurements from multiple business processes. Sometimes called a second-level data mart. Contrast with First-level data mart.
Constraint Phrase in the SQL WHERE clause. A constraint is either a join constraint or an application constraint.
Continuously valued (facts): Numeric measurement that usually is different every time it is measured. Continuously valued measurements should be facts in the fact table as opposed to discrete attributes that belong in a dimension table.
Cookie: A small text file placed on a user’s PC by a Web browser in response to a specific request from a remote Web server. The cookie contents are specified by the Web server and can only be read from Web servers belonging o the domain that is specified in the cookie.
Copybook: Traditional COBOL header file that describes all the columns in an underlying data file.
Core table: The fact table or the dimension table in a heterogeneous product situation that is meant to span all the products at once. Contrast with Custom line-of-business tables. See also heterogeneous products.
Cost based optimizer: Software in a relational database that tries to determine how to process the query by assigning estimated costs to various table lookup alternatives.
CRC See Cyclic redundancy checksum.
Cross-selling The technique of increasing sales by selling a new product line to existing customers. See also Up-selling.
Cube Name for a dimensional structure on a multidimensional or online analytical processing (OLAP) database platform, originally referring to the simple three-dimension case of product, market, and time.
Custom line-of-business table The fact table or the dimension table in a heterogeneous product situation that contains facts or attributes specific to one set of products, where those facts or attributes are incompatible with the other sets of products. Contrast with Core tables. See also Heterogeneous products.
Customer relationship management (CRM) Operational and analytic processes that focus on better understanding and servicing customers in order to maximize mutually beneficial relationships with each customer.
Cyclic redundancy checksum (CRC) An algorithm that is useful for checking two complex items, such as customer records, to see if anything has changed. The CRC can be stored with an existing record, and then the CRC can be computed on an incoming record. If there are any differences, the CRCs will be different. This eliminates the requirement to check each constituent field in the record.
Data access tool A client tool that queries, fetches, or manipulates data stored on a relational database, preferably a dimensional model located in the data presentation area. Contrast with a Data staging tool.
Data extract Process of copying data from an operational system in order to load it into a data warehouse.
Data mart A logical and physical subset of the data warehouse’s presentation area. Originally, data marts were defined as highly aggregated subsets of data, often chosen to answer a specific business question. This definition was unworkable because it led to stovepipe data marts that were inflexible and could not be combined with each other. This first definition has been
replaced, and the data mart is now defined as a flexible set of data, ideally based on the most atomic (granular) data possible to extract from an operational source, and presented in a symmetric (dimensional) model that is most resilient when faced with unexpected user queries. Data marts can be tied together using drill-across techniques when their dimensions are conformed. We say these data marts are connected to the data warehouse bus. In its most simplistic form, a data mart represents data from a single business process.
Data mining A class of undirected queries, often against the most atomic data, that seek to find unexpected patterns in the data. The most valuable results from data mining are clustering, classifying, estimating, predicting, and finding things that occur together. There are many kinds of tools that play a role in data mining. The principal tools include decision trees, neural networks, memory- and cased-based reasoning tools, visualization tools, genetic algorithms, fuzzy logic, and classical statistics. Generally, data mining is a client of the data warehouse.
Data presentation area The place where warehouse data is organized, stored, and available for direct querying by users, data access tools, and other analytical applications. All querying takes place on the data presentation area. The data in the presentation area must be dimensional and atomic (and possibly summarized, as appropriate) and adhere to the data warehouse bus architecture. Typically referred to as a series of integrated data marts. Contrast with the Data staging area.
Data quality assurance: The step during the production data staging process where the data is tested for consistency, completeness, and fitness to publish to the user community.
Data staging area A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. The data staging area is everything in between the source system and the data presentation server. No querying should be done in the data staging area because the data staging area normally is not set up to handle fine-grained security, indexing or aggregations for performance, or broad data integration across multiple data sources. Contrast with the Data presentation area.
Data staging tool A software application typically resident on both the client and the server that assists in the production data extract-transform-load processes. Contrast with Data access tools.
Data stovepipe Occurs when data is available in isolated application specific databases, where little investment has been made to sharing common data, such as customer or product, with other operational systems. Un-architected, stovepipe data marts are disastrous as they merely perpetuate isolated, incompatible views of the organization.
Data warehouse The conglomeration of an organization’s data warehouse staging and presentation areas, where operational data is specifically structured for query and analysis performance and ease-of-use. See
Data warehouse bus architecture The architecture for the data warehouse’s presentation area based on conformed dimensions and facts. Without adherence to the bus architecture, a data mart is a standalone stovepipe application.
Data warehouse bus matrix: Tool used to create, document, and communicate the bus architecture, where the rows of the matrix identify the organization’s business processes and the columns represent the conformed dimensions. The intersection of relevant dimensions applicable to each business process is then marked.
Decision support system (DSS): The original name for data warehousing. In our opinion, it’s still the best name because it’s the business rationale for the data warehouse—using data to make decisions in an organization. See also Business intelligence.
Degenerate dimension: A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table.
Demand side Flow of processes in a business starting with finished goods inventory and progressing through to customer sales. Contrast with the Supply side.
Demographic minidimension See Minidimensions.
Denormalize: Allowing redundancy in a table so that the table can remain flat, rather than snowflaked or normalized, in order to optimize performance and ease-of-use, equivalent to Second normal form (2NF).
Depletions: Same as Shipments. Usually refers to a warehouse drawing down inventory in response to customer orders.
Dimension: An independent entity in a dimensional model that serves as an entry point or as a mechanism for slicing and dicing the additive measures located in the fact table of the dimensional model.
Dimension table: A table in a dimensional model with a single-part primary key and descriptive attribute columns.
Dimensional data warehouse Set of tables for decision support designed as star-joined schemas.
Dimensional modeling: A methodology for logically modeling data for query performance and ease of use that starts from a set of base measurement events. In the relational DBMS environment, a fact table is constructed generally with one record for each discrete measurement. This fact table is then surrounded by a set of dimension tables describing precisely what is known in the context of each measurement record. Because of the characteristic structure of a dimensional model, it is often called a star schema. Dimensional models have proved to be understandable, predictable,
extendable, and highly resistant to the ad hoc attack from groups of business users because of their predictable symmetric nature. Dimensional models are the basis of many DBMS performance enhancements, including powerful indexing approaches and aggregations. Dimensional models are the basis for the incremental and distributed development of data warehouses through the use of conformed dimensions and conformed facts. Dimensional models are also the logical foundation for all OLAP systems.
Directory server: A server, which can be viewed as a little data warehouse, that keeps track of all the users of a system as well as all the resources available on the system, such as database servers, file servers, printers, and communications resources. The industry standard way to communicate with a directory server is the Lightweight Directory Access Protocol (LDAP).
Dirty customer dimension Customer dimension in which the same person can appear multiple times, probably not with exactly the same name spellings or other attributes.
Discrete (dimension attributes) Data, usually textual, that takes on a fixed set of values, such as the flavor of a product. Discrete textual data always should be handled as attributes in a dimension table as opposed to continuously valued numeric data that belongs in a fact table.
Domain (1) A specific range of Internet addresses assigned to a single Internet user. The domain name is a unique text name, often ending in .com, .org, .gov, or .net.
(2) In a dimension, the complete set of legal values from which actual values are derived for an attribute.
Double-barreled joins Multiple parallel joins between a single dimension table and a fact table.
Drill across: The act of requesting similarly labeled data from two or more fact tables in a single report, almost always involving separate queries that are merged together in a second pass by matching row headers.
Drill down: The act of adding a row header or replacing a row header in a report to break down the rows of the answer set more finely.
Drill up The act of removing a row header or replacing a row header in a report to summarize the rows of the answer set. Sometimes called dynamic aggregation.
Dwell time The length of time that a specific Web page is available for viewing on a user’s browser.
Entity-relationship (ER) diagram (ERD): Drawings of boxes and lines to communicate the relationship between tables. Both third normal form (3NF) and dimensional models can be represented as ER diagrams because both consist of joined relational tables. The key difference between the models is the degree of dimension normalization. A dimensional model is a second normal form (2NF) model.
ETL See Extract-transform-load.
Event-tracking table: A fact table, frequently factless, where the dimensions of the table are brought together to describe an event, such as an insurance description of an automobile accident.
Extended ASCII The extension of the American Standard Code for Information Interchange to include European accented characters and other special characters. This encoding uses the high 128 characters in the 8-bit ASCII format. See ASCII and UNICODE.
Extended cost The unit cost multiplied by a quantity to give an additive value.
Extensible Markup Language (XML) A cousin of HTML that provides structured data exchange between parties. XML contains data and metadata but no formatting information. Contrast with HTML. XML is a flexible, strongly hierarchical framework for assigning tags to fields within a document. XML does not specify what the tags should be. It is up to various organizations or industry groups to define and use consistent sets of tags, and this effort is the main gating factor slowing the widespread use of XML.
Extract-transform-load (ETL): Set of processes by which the operational source data is prepared for the data warehouse. The primary processes of the backroom data staging area of the data warehouse, prior to any presentation or querying. Consists of extracting operational data from a source application, transforming it, loading and indexing it, quality-assuring it, and publishing it.
Fact A business performance measurement, typically numeric and additive, that is stored in a fact table.
Fact dimension A special dimension used to identify extremely sparse, dissimilar measurements in a single fact table.
Fact table: In a star schema (dimensional model), the central table with numeric performance measurements characterized by a composite key, each of whose elements is a foreign key drawn from a dimension table.
Factless fact table: A fact table that has no facts but captures certain many-to- many relationships between the dimension keys. Most often used to represent events or provide coverage information that does not appear in other fact tables.
First-level data mart: A data mart that is derived from a single primary source system. Contrast with Consolidated data mart.
Fixed depth hierarchy: A highly predictable hierarchy with a fixed number of levels. Contrast with Ragged hierarchy.
Flat file: A simple data structure, often implemented on a mainframe, which relies on non-relational files, such as IBM VSAM files.
Foreign key (FK): A column in a relational database table whose values are drawn from the values of a primary key in another table. In a star-join schema, the components of a composite fact table key are foreign keys with respect to each of the dimension tables.
General ledger (G/L): Ledger that represents the organization’s assets, liabilities, equity, income, and expense. The G/L remains balanced through offsetting transactions to debit and credit accounts.
Geographic information system (GIS) A hybrid application combining database and mapping technology. Typically, in a GIS, queries can be constructed from maps, and maps can be delivered as a result of a query.
GMROI: Gross margin return on inventory, equal to the number of turns of inventory multiplied by the gross margin percent. A measure of the return on each dollar invested in inventory.
Grain: The meaning of a single row in a fact table. The declaration of the grain of a fact table is the second of four key steps in the design of a dimensional model.
Granularity The level of detail captured in the data warehouse. See Grain.
Greenwich Mean Time (GMT): The local standard time at zero degrees0 longitude, this runs through the Royal Navy Observatory near
Gross profit: The gross revenue less the cost of the goods.
Gross revenue The total revenue paid to a company by its customers. If the gross revenue is calculated before applicable discounts, then the actual amount paid by the customers is called the net revenue.
Helper table
Heterogeneous products: A set of products typically characterized by many incompatible product attributes and measurable facts, a characteristic design challenge in financial service environments. See Core table and Custom line-of-business table.
Hierarchical relationship: A relationship where data rolls up into higher levels of summarization in a series of strict many-to-one relationships. Hierarchies are reflected by additional columns on the atomic dimension table.
Impact report: When reporting with a bridge table, the weighting factor assigned to the multi-valued dimension is ignored. The resulting totals provide a summarization for any case in which the multi-valued dimension was involved, regardless of the extent of the involvement. Contrast with Weighted report.
Implementation bus matrix: A more detailed version of the data warehouse bus matrix where fact tables are identified for each business process, as well as the fact table granularity and measurements.
Index: A data structure associated with a table that is logically ordered by the values of a key and used to improve database performance and query access speed. B-tree indexes are used for high-cardinality fields, and bitmap indexes are used for medium- and low-cardinality fields.
Julian day number A representation of a calendar date as the simple count of days from the beginning of an epoch, such as
Junk dimension An abstract dimension with the decodes for a group of low-cardinality flags and indicators, thereby removing the flags from the fact table.
LDAP Lightweight Directory Access Protocol, a standard currently agreed to by most of the major systems vendors for describing the users of a network and the resources available on a network. See Directory server.
Line item: An individual line of a control document such as an invoice usually identifying a single product within the invoice. Most often used as the grain of the associated fact table.
Logical design: The phase of a database design concerned with identifying the relationships among the data elements. Contrast with Physical design.
Many-to-many relationship: A logical data relationship in which the value of one data element can exist in combination with many values of another data element, and vice versa.
Many-valued dimensions: Normally, a fact table possesses only connections to dimensions representing a single value, such as a single time or a single product. But occasionally, it is valid to connect a fact table record to a dimension representing an open-ended number of values, such as the number of simultaneous diagnoses a patient may have at the moment of a single treatment. In this case we say that the fact table has a many-valued dimension. Also referred to as Multivalued dimensions. Typically handled using a bridge table.
Market basket analysis: A kind of analysis in retail environments that seeks to understand all the products purchased by a customer in a single shopping event. Market basket analysis is an example of affinity grouping that seeks to find things that happen together.
Merchandise hierarchy: A set of attributes in the product dimension that define an ascending many-to-one relationship. Common to all manufacturing and retail environments.
Minidimension: Subsets of a large dimension, such as customer, that are broken off into separate, smaller artificial dimensions to control the explosive growth of a large, rapidly changing dimension. The continuously changing demographic attributes of a customer are often modeled as a separate minidimension.
Mirrored database A physical organization of data where the entire database is duplicated on separate disk drives. Mirrored databases offer a number of performance and administrative advantages.
Modeling applications A sophisticated data warehouse client with analytic capabilities that transform or digest the output from the data warehouse. Modeling applications include forecasting models, behavior scoring models that cluster and classify customer purchase behavior or customer credit behavior, allocation models that take cost data from the data warehouse and spread the costs across product groupings or customer groupings, and most data mining tools.
Most recent indicator: An attribute, typically used in conjunction with type 2 slowly changing dimensions, that indicates the most current profile.
Multidimensional database: Database in which the data is presented in data cubes, as opposed to tables in a relational database platform.
Multidimensional OLAP (MOLAP): Dedicated online analytical processing implementations not dependent on relational databases. Although MOLAP systems do not scale to the sizes that relational databases systems can, they typically offer better performance and more tightly integrated tools than their relational counterparts.
Multipass SQL: Query capability supported by some data access tools in which the results of separate star-schema queries are combined column by column via the conformed dimensions. Not the same thing as a union, which is a row-by-row combination of separate queries.
Multitable join query One of the two characteristic types of queries in a data warehouse environment. Involves the joining of one or more dimension tables to a single fact table. Contrast with Browse queries.
Multivalued dimensions See Many-valued dimensions.
Natural key: The identifier used by the operational systems. Natural keys often have embedded meaning. They may appear as dimension attributes in dimensional models but should not serve as the dimension table primary key, which always should be a surrogate key.
Non-additive (facts): A fact that cannot logically be added between rows. May be numeric and therefore usually must be combined in a computation with other facts before being added across rows. If non-numeric, can only be used in constraints, counts, or groupings.
Normalize: A logical modeling technique that removes data redundancy by separating the data into many discrete entities, each of which becomes a table in a relational DBMS.
ODS See Operational data store.
One-to-many relationship: A logical data relationship in which the value of one data element can exist in combination with many values of another data element, but not vice versa.
Online analytic processing (OLAP): OLAP is a loosely defined set of principles that provide a dimensional framework for decision support. The term OLAP also is used to define a confederation of vendors who offer non-relational, multidimensional database products aimed at decision support. Contrast with Online transaction processing.
Online transaction processing (OLTP) The original description for all the activities and systems associated with entering data reliably into a database. Most frequently used with reference to relational databases, although OLTP can be used generically to describe any transaction-processing environment. Contrast with Online analytic processing.
Operational data store (ODS): A physical set of tables sitting between the operational systems and the data warehouse or a specially administered hot partition of the data warehouse itself. The main reason for an ODS is to provide immediate reporting of operational results if neither the operational system nor the regular data warehouse can provide satisfactory access. Because an ODS is necessarily an extract of the operational data, it also may play the role of source for the data warehouse.
Operational system of record: An operational system for capturing data about a company’s operations and business processes. May not necessarily be a transaction system or a relational system.
Outrigger table: A secondary dimension table attached to a dimension table. An outrigger table is a physical design interpretation of a single logical dimension table. Occurs when a dimension table is snowflaked.
Page event Refers to a Web page or frame downloaded from a Web server to a browser, exclusive of any ancillary content.
Parent-child database: Hierarchical organization of data typically involving a header and set of line items. The dimensional modeling approach strips all the information out of the header (parent) into separate dimensions and leaves the original parent natural key as a degenerate dimension.
Partitioned tables: Tables (and their associated indices) that are managed as physically separate tables but appear logically as a single table. Large fact tables are candidates for partitioning, often by date. Partitioning can improve both query and maintenance performance.
Partitioning of history: The natural correspondence between dimension table entries and fact table rows when a type-2 slowly changing dimension has been implemented. A type 2 slowly changing dimension partitions history because each value of its surrogate key is administered correctly to connect to the correct contemporary span of fact records.
Periodic snapshot fact table: A type of fact table that represents business performance at the end of each regular, predictable time period. Daily snapshots and monthly snapshots are common. Snapshots are required in a number of businesses, such as insurance, where the transaction history is too complicated to be used as the basis for computing snapshots on the fly. A separate record is placed in a periodic snapshot fact table each period regardless of whether any activity has taken place in the underlying account. Contrast with Transaction fact table and Accumulating snapshot
fact table.
Physical design: The phase of a database design following the logical design that identifies the actual database tables and index structures used to implement the logical design.
PK See Primary key.
Portal: A Web site designed to be the first point of entry for visitors to the Web. Portal sites usually feature a wide variety of contents and search capabilities in order to entice visitors to use them. Portals are often selected as browser home pages.
Price-point analysis: The breakdown of product sales by each discrete transaction price. Requires a fact table with fine enough grain to represent each price point separately.
Primary key (PK) A column in a database table that is uniquely different for each row in the table.
Profit-and-loss (P&L) schema The P&L, also known as an income statement, is the classic logical ordering of revenues and costs to represent a progression from gross revenues down to a bottom line that represents net profit. The profitability schema often is called the most powerful dimensional schema because it allows the business to slice and dice revenue, cost, and profit by their primary dimensions, such as customer and product.
Promotion An event, usually planned by marketing, that features one or more causal items such as ads, displays, or price reductions. Also thought of as a deal or sometimes as a contract.
Proxy An alternate Web server that responds to a Web page request in order to reduce the load on a primary Web server or network.
Pseudotransaction: A step needed in some production data extract systems where a non-transactional legacy system is analyzed to see what changed from the previous extract. These changes are then made into artificial (pseudo) transactions in order to be loaded into the data warehouse.
Publishing the right data: The most succinct way to describe the overall responsibility of the data warehouse. The data is right if it satisfies the business’s requirements. The act of publishing is driven ultimately by the business user’s needs.
Pull-down list A user-interface effect in a data access tool that displays a list of options for the user. The most interesting pull-down lists in a data warehouse come from browse queries on a dimension attribute.
Ragged hierarchy: A hierarchy with an unbalanced and arbitrarily deep structure that usually cannot be described in advance of loading the data. Sometimes referred to as a variable-depth hierarchy. Organization charts often are ragged hierarchies.
Real time partitions: A physically separate and specially administered set of tables, apart from the conventional data warehouse, to support more real-time access requirements. See also Operational data store.
Reason code: A field used in conjunction with a transaction dimension to describe why the transaction took place. Reason codes are valuable for returns and cancellations and for describing why something changed.
Redundancy: Storing more than one occurrence of the data.
Referential integrity (RI): Mandatory condition in a data warehouse where all the keys in the fact tables are legitimate foreign keys relative to the dimension tables. In other words, all the fact key components are subsets of the primary keys found in the dimension tables at all times.
Referral: The identity of the previous context of a URL. In other words, if you click on a link in page A and wind up on page B, page B’s Web server sees page A as the referral. Web servers can log referrals automatically, which is a very useful way to see why a visitor came to your Web site.
Relational database management system (RDBMS) Database management system based on the relational model that supports the full range of standard SQL. Uses a series of joined tables with rows and columns to organize and store data.
RI See Referential integrity.
ROI: Return on investment, usually expressed as a rate describing the growth of an investment during its lifetime.
Role-playing dimensions: The situation where a single physical dimension table appears several times in a single fact table. Each of the dimension roles is represented as a separate logical table with unique column names through views.
Roll up: To present higher levels of summarization. See Drill up.
Row header: The non-aggregated components of the SQL select list. Always listed in the SQL group by clause.
Scalability The ability to accommodate future growth requirements.
SCD See Slowly changing dimensions.
Schema The logical or physical design of a set of database tables, indicating the relationship among the tables.
Semantic layer: An interface layer placed between the user and the physical database structure.
Semi-additive (fact): Numeric fact that can be added along some dimensions in a fact table but not others. Inventory levels and balances cannot be added along the time dimension but can be averaged usefully over the time dimension.
Slice and dice: Ability to access a data warehouse through any of its dimensions equally. Slicing and dicing is the process of separating and combining warehouse data in seemingly endless combinations.
Slowly changing dimensions (SCD): The tendency of dimension rows to change gradually or occasionally over time. A type 1 SCD is a dimension whose attributes are overwritten when the value of an attribute changes. A type 2 SCD is a dimension where a new row is created when the value of an attribute changes. A type 3 SCD is a dimension where an alternate old column is created when an attribute changes.
Snowflake: A normalized dimension where a flat, single-table dimension is decomposed into a tree structure with potentially many nesting levels. In dimensional modeling, the fact tables in both a snowflake and star schema would be identical, but the dimensions in a snowflake are presented in third normal form, usually under the guise of space savings and maintainability. Although snowflaking can be regarded as an embellishment to the dimensional model, snowflaking generally compromises user understandability and browsing performance. Space savings typically are insignificant relative to the overall size of the data warehouse. Snowflaked normalized dimension tables may exist in the staging area to facilitate dimension maintenance.
Sparse: A fact table that has relatively few of all the possible combinations of key values. A grocery store product movement database is considered sparse because only 5 to 10 percent of all the key combinations for product, store, and day will be present. An airline’s frequent-flyer database is extremely sparse because very few of the customer, flight number, and day combinations actually appear in the database.
Sparsity failure: A situation that occurs when an aggregate table is created that is not appreciably smaller than the table on which it is based. For instance, if only one SKU in each brand is sold on a given day, then a brand aggregate for a day will be the same size as the base table.
Star-join schema The generic representation of a dimensional model in a relational database in which a fact table with a composite key is joined to a number of dimension tables, each with a single primary key.
Star schema See Star-join schema.
Stock keeping unit (SKU) A standard term in manufacturing and retail environments to describe an individual product.
Subrogation The act of an insurance company selling the rights remaining in a claim, such as the right to sue someone for damages.
Surrogate key: Integer keys that are sequentially assigned as needed in the staging area to populate a dimension table and join to the fact table. In the dimension table, the surrogate key is the primary key. In the fact table, the surrogate key is a foreign key to a specific dimension and may be part of the fact table’s primary key, although this is not required. A surrogate key usually cannot be interpreted by itself. That is, it is not a smart key in any way. Surrogate keys are required in many data warehouse situations to handle slowly changing dimensions, as well as missing or inapplicable data. Also known as artificial keys, integer keys, meaningless keys, nonnatural keys, and synthetic keys.
Syndicated data suppliers: Companies that collect data, clean it, package it, and resell it. A.C. Nielsen and IRI are the principal syndicated data suppliers for grocery and drug store scanner data, and IMS Health and Source Informatics (Walsh
Third normal form (3NF): Database design approach that eliminates redundancy and therefore facilitates insertion of new rows into tables in an OLTP application without introducing excessive data locking problems. Sometimes referred to as normalized.
3NF See Third normal form.
Time stamping Tagging each record with the time the data was processed or stored.
Transaction fact table: Type of fact table in which the fact table granularity is one row for the lowest level of detail captured by a transaction. A record in a transaction fact table is present only if a transaction event actually occurs. Contrast with Periodic snapshot fact table and Accumulating snapshot fact table.
Trending: Analyzing data representing multiple occurrences in a time series.
24/7 Operational availability 24 hours a day, 7 days a week.
Twinkling database: The tendency of a transaction-processing database to constantly be changing the data the user is attempting to query.
Type 1: A slowly changing dimension (SCD) technique where the changed attribute is overwritten.
Type 2 A slowly changing dimension (SCD) technique where a new dimension record with a new surrogate key is created to reflect the change.
Type 3 A slowly changing dimension (SCD) technique where a new column is added to the dimension table to capture the change.
UNICODE The UNICODE worldwide character standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world, including Japanese, Chinese, Arabic, Hebrew, Cyrillic, and many others. In addition, it supports classical and historical texts of many written languages. UNICODE is a 16-bit implementation, which means that 65,535 characters can be supported, unlike ASCII, which can support only 127, or extended ASCII, which supports 255. Release 2.1, the current release of UNICODE, defines 38,887 of the possible characters.
Universal Product Code (UPC) Standard bar-coded value found on most grocery and drug store merchandise.
Universal Resource Locator (URL): The text address of a specific object on the World Wide Web. It usually consists of three parts: a prefix describing the TCP protocol to use to retrieve it (for example, HTTP), a domain name (for example, webcom.com), and a document name (for example, index. html). Such a URL would be formatted as http://www.webcom.com/ index.html.
Value banding (facts) Grouping facts into flexible value bands as specified in a band definition table.
Variable-depth hierarchy See Ragged hierarchy.
Variable-width attribute set The situation where a varied number of dimension attributes are known, depending on the duration of the relationship, such as the case with prospects who evolve into customers.
Virgin territory Portion of disk storage that is unoccupied prior to a data load. In a static database experiencing no in-place updates or inserts and with a primary sort order with time as the leading term in the sort, all data loading takes place in virgin territory.
Weighted report When using a bridge table, the facts in the fact table are multiplied by the bridge table’s weighting factor to appropriately allocate the facts to the multivalued dimension. Contrast with Impact report.
XML See Extensible Markup Language.