1. Information Architecture Principle
For any complex endeavor, there is value
in beginning with a common principle to drive designs, procedures, and
decisions. A credible principle is understandable, robust, complete,
consistent, and stable. When an overarching principle is agreed upon,
conflicting opinions can be objectively measured, and standards can be
decided upon that support the principle.
The Information Architecture Principle encompasses
the three main areas of information management: database design and
development, enterprise data center management, and business
intelligence analysis.
Unpacking this principle reveals several practical
implications. There should be a known inventory of information,
including its location, source, sensitivity, present and future value,
and current owner. Although most organizational information is stored in
IT databases, uninventoried critical data is often scattered
throughout the organization in desktop databases, spreadsheets, scraps
of paper, Post-it notes, and (most dangerous of all) the heads of key
employees.
Just as the value of physical assets varies from
asset to asset and over time, the value of information is also variable
and so must be assessed. Information may be highly valuable to an individual
or department but less valuable to the organization as a whole;
information that is critical today might be meaningless in a month; or
information that may seem insignificant individually might become
critical for organizational planning when aggregated.
If the data is to be made readily available in the
future, then current designs must be flexible enough to avoid locking
the data into a rigid but brittle database.
2. Database Objectives
Based on the Information Architecture
Principle, every database can be architected or evaluated by six
interdependent database objectives. Four of these objectives are
primarily a function of design, development, and implementation: usability, extensibility, data integrity, and performance. Availability and security are more a function of implementation than design.
With sufficient design effort and a clear goal to
meet all six objectives, it is fully possible to design and develop an
elegant database that does just that. No database architecture is going
to be 100 percent perfect, but with an early focus on design and
fundamental principles, you can go a long way toward creating a database
that can grow along with your organization.
You can measure each objective on a continuum. The
data architect is responsible for informing the organization about these
six objectives, including the cost associated with meeting each
objective, the risk of failing to meet it, and the recommended level for
each. It is then the organization's prerogative to prioritize the
objectives against their relative costs.
Usability
The usability of a data store (the
architectural term for a database) involves the completeness of meeting
the organization's requirements; the suitability of the design for its
intended purpose; the effectiveness of the format of data available to
applications; the robustness of the database; and the ease of extracting
information (by programmers and power users). The most common reason
why a database is less than usable is an overly complex or inappropriate
design.
Usability is enabled in the design by ensuring the following:
- A thorough and well-documented understanding of the organizational requirements
- Life-cycle planning of software features
- Selecting the correct meta-pattern (for example, transactional or dimensional) for the data store
- Normalization and correct handling of optional data
- Simplicity of design
- A well-defined abstraction layer (a minimal sketch follows this list)
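As a minimal sketch of that last point, assuming a hypothetical Customer table, applications read through a view and write through a stored procedure, so the physical table can change without breaking client code:

```sql
-- Hypothetical abstraction layer: applications never touch the base table.
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL,
    IsActive     bit NOT NULL DEFAULT 1
);
GO
-- Read path: the view hides physical details and future schema changes.
CREATE VIEW dbo.vCustomer
AS
SELECT CustomerID, CustomerName
FROM dbo.Customer
WHERE IsActive = 1;
GO
-- Write path: the stored procedure is the only supported insert interface.
CREATE PROCEDURE dbo.pCustomer_Add
    @CustomerName nvarchar(100)
AS
INSERT INTO dbo.Customer (CustomerName)
VALUES (@CustomerName);
GO
```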
Extensibility
The Information Architecture Principle
states that information must be readily available today and in the
future, which requires the database to be extensible: able to adapt
easily to new requirements. The concepts of data integrity, performance,
and availability are mature and well understood by the computer science
and IT professions. With enough time and resources, you can design a
data architecture that also meets the objective of extensibility. The
trick is to make sure that your entire organization understands that
this resource investment is not only important, but absolutely necessary
to good data architecture. Many databases fall victim to the curse of
too little time and too few resources; these are usually the ones that
cannot grow and adapt well to new business requirements or
organizational change.
Extensibility is incorporated into the design as follows:
- Normalization and correct handling of optional data.
- Generalization of entities when designing the schema, as sketched below.
- Data-driven designs that not only model the obvious data (for
example, orders and customers), but also enable the organization to
store the behavioral patterns, or process flow.
- A well-defined abstraction layer that decouples the database from
all client access, including client apps, middle tiers, ETL, and
reports.
Extensibility is also closely related to simplicity. Complexity
breeds complexity and inhibits adaptation; a simple solution is easy to
understand and adopt and, ultimately, easy to adjust later.
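As one hedged illustration of generalization, the following sketch (all names hypothetical) models customers and vendors as subtypes of a generalized Contact entity; a new contact role can later be added as another subtype table without restructuring the schema:

```sql
-- Generalized supertype: attributes common to every kind of contact.
CREATE TABLE dbo.Contact (
    ContactID   int IDENTITY PRIMARY KEY,
    ContactName nvarchar(100) NOT NULL,
    ContactType char(1) NOT NULL
        CHECK (ContactType IN ('C', 'V'))   -- 'C' = customer, 'V' = vendor
);
-- Subtype tables hold only the role-specific attributes.
CREATE TABLE dbo.CustomerContact (
    ContactID   int PRIMARY KEY REFERENCES dbo.Contact(ContactID),
    CreditLimit money NOT NULL
);
CREATE TABLE dbo.VendorContact (
    ContactID    int PRIMARY KEY REFERENCES dbo.Contact(ContactID),
    PaymentTerms nvarchar(50) NOT NULL
);
```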
Data Integrity
The ability to ensure that persisted
data can be retrieved without error is central to the Information
Architecture Principle, and it was the first major problem tackled by
the database world. Without data integrity, a query's answer cannot be
guaranteed to be correct; consequently, there's not much point in
availability or performance. Data integrity can be defined in multiple
ways:
- Entity integrity: Involves the structure (primary key and its
attributes) of the entity. If the primary key is unique and all
attributes are scalar and fully dependent on the primary key, then the
integrity of the entity is good. In the physical schema, the table's
primary key enforces entity integrity.
- Domain integrity: Ensures that only valid data is permitted
in the attribute. A domain is a set of possible values for an attribute,
such as integers, bit values, or characters. Nullability (whether a
null value is valid for an attribute) is also a part of domain
integrity. In the physical schema, the data type and nullability of the
column enforce domain integrity.
- Referential integrity: Refers to the domain integrity of
foreign keys. Domain integrity means that if an attribute has a value,
then that value must be in the domain. In the case of the foreign key,
the domain is the list of values in the related primary key. Referential
integrity, therefore, is not an issue of the integrity of the primary
key but of the foreign key.
- Transactional integrity: Ensures that every logical unit of
work, such as inserting 100 rows or updating 1,000 rows, is executed as a
single transaction. The quality of a database product is measured by
its transactions' adherence to the ACID properties: atomic (all or nothing), consistent (the database begins and ends the transaction in a consistent state), isolated (one transaction does not affect another), and durable (once committed, always committed). A sketch illustrating all four forms of integrity follows this list.
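The following minimal sketch, using hypothetical Order and OrderDetail tables, shows how the physical schema can enforce the first three forms of integrity, with a transaction illustrating the fourth:

```sql
CREATE TABLE dbo.[Order] (
    OrderID   int PRIMARY KEY,   -- entity integrity: a unique primary key
    OrderDate date NOT NULL      -- domain integrity: data type and nullability
);
CREATE TABLE dbo.OrderDetail (
    OrderDetailID int PRIMARY KEY,
    OrderID int NOT NULL
        REFERENCES dbo.[Order](OrderID),  -- referential integrity: the FK's
                                          -- domain is the related PK's values
    Quantity int NOT NULL
        CHECK (Quantity > 0)              -- domain integrity: restricted range
);
-- Transactional integrity: both inserts commit or roll back as one unit.
BEGIN TRANSACTION;
    INSERT dbo.[Order] (OrderID, OrderDate) VALUES (1, '2025-01-15');
    INSERT dbo.OrderDetail (OrderDetailID, OrderID, Quantity) VALUES (1, 1, 5);
COMMIT TRANSACTION;
```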
In addition to these four generally accepted
definitions of data integrity, user-defined data integrity should be
considered as well:
- User-defined integrity means that the data meets the organization's
requirements, with simple business rules such as restricting an
attribute to a domain or limiting the list of valid data entries. Check
constraints are commonly used to enforce these rules in the physical schema.
- Complex business rules limit the list of valid data based on some
condition. For example, certain tours may require a medical waiver.
Implementing these rules in the physical schema generally requires
stored procedures or triggers.
- Some data-integrity concerns can't be checked by constraints or
triggers. Invalid, incomplete, or questionable data may pass all the
standard data-integrity checks. For example, an order without any order
detail rows is not a valid order, but no SQL constraint or trigger traps
such an order. The abstraction layer can assist with this problem, and
SQL queries (such as the one sketched after this list) can locate
incomplete orders and help identify other, less measurable
data-integrity issues, including wrong, incomplete, questionable, and
inconsistent data.
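For example, a query along these lines (reusing the hypothetical Order and OrderDetail tables from the earlier sketch) locates orders that pass every declared constraint yet have no detail rows:

```sql
-- Orders with no matching detail rows pass all declared constraints
-- but are not valid business data.
SELECT o.OrderID
FROM dbo.[Order] AS o
LEFT OUTER JOIN dbo.OrderDetail AS d
    ON o.OrderID = d.OrderID
WHERE d.OrderID IS NULL;
```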
Integrity is established in the design by ensuring the following:
- A thorough and well-documented understanding of the organizational requirements
- Normalization and correct handling of optional data
- A well-defined abstraction layer
- Data quality unit testing using a well-defined and understood set of test data
- Metadata and data audit trails documenting the source and veracity of the data, including updates
Performance/Scalability
Presenting readily usable information is
a key aspect of the Information Architecture Principle. Although the
database industry has achieved a high degree of performance, the ability
to scale that performance to large databases is still an area of
competition between database engine vendors.
Performance is enabled in the database design and development by ensuring the following:
- A well-designed schema with normalization and generalization, and correct handling of optional data
- Set-based queries implemented within a well-defined abstraction layer
- A sound indexing strategy, including careful selection of clustered and nonclustered indexes (see the sketch after this list)
- Tight, fast transactions that reduce locking and blocking
- Partitioning, which is useful for advanced scalability
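As a minimal indexing sketch (table and index names hypothetical), a clustered index defines the table's physical row order, while a nonclustered index covers a frequent search path:

```sql
CREATE TABLE dbo.WorkOrder (
    WorkOrderID int NOT NULL,
    ProductID   int NOT NULL,
    DueDate     date NOT NULL
);
-- Clustered index on the primary key: one per table, defines row order.
ALTER TABLE dbo.WorkOrder
    ADD CONSTRAINT PK_WorkOrder PRIMARY KEY CLUSTERED (WorkOrderID);
-- Nonclustered index supporting searches by product and due date.
CREATE NONCLUSTERED INDEX IX_WorkOrder_ProductDue
    ON dbo.WorkOrder (ProductID, DueDate);
```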
Availability
The availability of information refers
to the information's accessibility when required, in terms of uptime,
locations, and the availability of the data for future analysis.
Disaster recovery, redundancy, archiving, and network delivery all
affect availability.
Availability is strengthened by the following:
- Quality, redundant hardware
- SQL Server's high-availability features
- Proper DBA procedures regarding data backup and backup storage (a minimal sketch follows this list)
- Disaster recovery planning
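A minimal backup sketch, assuming a hypothetical Sales database in the full recovery model and hypothetical file paths: a full backup plus periodic log backups supports point-in-time recovery.

```sql
-- Full database backup (typically scheduled nightly).
BACKUP DATABASE Sales
    TO DISK = 'D:\Backup\Sales_Full.bak'
    WITH CHECKSUM, INIT;
-- Transaction log backup (typically scheduled every few minutes).
BACKUP LOG Sales
    TO DISK = 'D:\Backup\Sales_Log.trn'
    WITH CHECKSUM;
```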
Security
The sixth database objective based on
the Information Architecture Principle is security. Any organizational
asset must be secured at a level appropriate to its value and
sensitivity.
Security is enforced by the following:
- Physical security and restricted access of the data center
- Defensively coding against SQL injection
- Appropriate operating system security
- Reducing the surface area of SQL Server to only those services and features required
- Identifying and documenting ownership of the data
- Granting access according to the principle of least privilege, which
is the concept that users should have only the minimum access rights
required to perform necessary functions within the database (see the
sketch after this list)
- Cryptography — data encryption of live databases, backups, and data warehouses
- Metadata and data audit trails documenting the source and veracity of the data, including updates
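To illustrate least privilege, here is a hedged T-SQL sketch (role, schema, and login names hypothetical): the application's database user receives read-only access to a single schema through a role, and nothing more.

```sql
-- A role scoped to the access the application actually needs.
CREATE ROLE SalesReader;
GRANT SELECT ON SCHEMA::Sales TO SalesReader;   -- read-only, one schema
-- Map the application's login to a database user and add it to the role.
CREATE USER SalesApp FOR LOGIN SalesApp;
ALTER ROLE SalesReader ADD MEMBER SalesApp;
-- No INSERT, UPDATE, DELETE, or server-level permissions are granted.
```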
Planning Data Stores
The enterprise data architect helps an
organization plan the most effective use of information throughout the
organization. An organization's data store configuration (see Figure 1) includes multiple types of data stores, each with a specific purpose:
- Operational databases, or online transaction processing (OLTP) databases
collect first-generation transactional data that is essential to the
day-to-day operation of the organization and unique to the organization.
An organization might have an operational data store to serve each unit
or function within it. Regardless of its size, an organization with a
single, focused purpose may have only one operational database.
- For performance, operational stores are tuned for a balance of data
retrieval and updates, so indexes and locking are key concerns. Because
these databases receive first-generation data, they are subject to data
update anomalies and benefit from normalization.
- Caching data stores, sometimes called reporting databases,
are optional read-only copies of all or part of an operational
database. An organization might have multiple caching data stores to
deliver data throughout the organization. Caching data stores might use
SQL Server replication or log shipping to populate the database and are
tuned for high-performance data retrieval.
- Reference data stores are primarily read-only and store
generic data required by the organization but which seldom changes —
similar to the reference section of the library. Examples of reference
data might be unit of measure conversion factors or ISO country codes. A
reference data store is tuned for high-performance data retrieval.
- Data warehouses collect large amounts of data from multiple data stores across the entire enterprise using an extract-transform-load (ETL)
process to convert the data from the various formats and schema into a
common format, designed for ease of data retrieval. Data warehouses also
serve as the archival location, storing historical data and releasing
some of the data load from the operational data stores. The data is also
pre-aggregated, making research and reporting easier, thereby improving
the accessibility of information and reducing errors.
- Because the primary task of a data warehouse is data retrieval and
analysis, the data-integrity concerns presented with an operational data
store don't apply. Data warehouses are designed for fast retrieval and
are not normalized like master data stores. They are generally designed
using a basic star schema or snowflake design (a star-schema sketch
appears after this list). Locks generally aren't an issue, and the
indexing is applied without adversely affecting inserts or updates.
- The analysis process usually involves more than just SQL queries and
uses data cubes that consolidate gigabytes of data into dynamic pivot
tables. Business intelligence (BI) is the combination of the ETL process, the data warehouse data store, and the act of creating and browsing cubes.
- A common data warehouse is essential to ensure that the entire
organization researches the same data set and achieves the same result
for the same query — a critical aspect of the Sarbanes-Oxley Act and
other regulatory requirements.
- Data marts are subsets of the data warehouse with
pre-aggregated data organized specifically to serve the needs of one
organizational group or one data domain.
- Master data store, or master data management (MDM),
refers to the data warehouse that combines the data from throughout the
organization. The primary purpose of the master data store is to provide
a single version of the truth for organizations with a complex set of
data stores and multiple data warehouses.
- Data Quality Services (DQS) refers to the SQL Server instance
feature that consists of three SQL Server catalogs with data-quality
functionality and storage. The purpose of this feature is to enable you
to build a knowledge base to support data quality tasks.
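As a closing illustration of the dimensional designs mentioned above, here is a minimal star-schema sketch (all names hypothetical): a central fact table of measures joined to denormalized dimension tables, queried by aggregating over dimension attributes.

```sql
CREATE TABLE dbo.DimDate (
    DateKey      int PRIMARY KEY,      -- e.g., 20250115
    CalendarDate date NOT NULL,
    CalendarYear smallint NOT NULL
);
CREATE TABLE dbo.DimProduct (
    ProductKey  int PRIMARY KEY,
    ProductName nvarchar(100) NOT NULL,
    Category    nvarchar(50) NOT NULL  -- denormalized into the dimension
);
CREATE TABLE dbo.FactSales (
    DateKey     int NOT NULL REFERENCES dbo.DimDate(DateKey),
    ProductKey  int NOT NULL REFERENCES dbo.DimProduct(ProductKey),
    SalesAmount money NOT NULL,
    Quantity    int NOT NULL
);
-- Typical analytical query: aggregate measures by dimension attributes.
SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON f.DateKey = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category;
```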