R    E   •   D    A    C   •   T    I    O    N
 
 
 
     
 
   

 

CONTENT REPOSITORY CONSIDERATIONS
A technical note looking at the underbelly of Content Management Systems

 

       Abstract:
 
This note provides an overview of Content Repositories, a common component concept underlying content-based applications, their place in the market, and their common attributes.

 

 
Both as good practice and given the state of the market, we should start with a definition: a Content Repository (CR) is an interface and a set of data structures that hold related logical units of content as a single logical corpus for easy, controlled access.

A CR is not equivalent to a Content Management System (CMS). For simplicity's sake, let's use a broad and general definition of the CMS application area: CMS are applications that add value to, or leverage, basic repository operations in a specific domain. By this definition, CMS cover a wide range of user-supporting applications that include many features not properly in the CR space (e.g. workflow, document presentment, document clustering, corpus replication, etc.). A CR is always a core component of any type of CMS.

The goal of this note is only to outline the place of CR in the market and discuss details of the most common attributes of CR. We do not attempt a precise delimitation of CR from CMS at the attribute or domain level. The discussion should show how the CR technology space maps to the markets for CR and provide a preliminary guideline for groups considering acquiring a CR. We do not attempt to canvass the existing product space; however, as much as practical, implementation examples are called out, usually as part of well-known CMS products. The source of the thoughts below is mainly many years' experience working with and developing a number of CMS products for a variety of application domains; very little focused market research supports these notes (but see the end links). Because of the breadth of the subject, the lack of standards, and the endless creativity of the software market, this discussion cannot hope to be comprehensive.

 
DESCRIPTION AND PLACE IN THE MARKET

A CR is the core ingredient of any domain handling items of content, where content is identifiable information in the form of a file. Figure 1 shows the position of a CR in relation to upstream business components and downstream domain components.

 

Figure 1. Content Repository (CR) in relation to Content Management System (CMS) and content domain sub-components.

 
Example CMS domains include:

  • Document management
      Management power over files greater than that provided by a file system. Frequently with permissioning and workflow features.
    Examples: Documentum, Interleaf.
  • Publishing systems
      Workflow and state oriented systems facilitating multiple formats, editions, and reuse.
    Examples: ArborText, PrintCafe Protus.
  • Catalogs
      Browseable front end to inventory systems.
    Examples: ContextMedia's Enterprise Content Catalog, WebLogic Commerce Server.
  • Configuration management
      Version and collection state controlled management systems. Frequently with managed users and locking or transactional writes.
    Examples: Visual SourceSafe, StarTeam.
  • Directories
      Hierarchical schema-bound collections of typed objects with type specific attributes.
    Examples: LDAP compliant systems, Windows registry.
  • Metadata repositories
      Schema-bound collections of standard software object descriptions organized for control, cataloging and reuse and interfaced to development tools.
    Examples: NetBeans Metadata Repository, MetaMatrix.
  • Records management
      Rules-based, typed long-term archive systems.
    Examples: DoD 5015.2 compliant systems, QRMS Quest.
  • Personalized content presentment
      Rules-based systems assembling documents on the fly based on a user profile.
    Examples: ATG Dynamo Personalization Server, LikeMinds.
  • Web site management
      Template and/or edition publishing for web production teams.
    Examples: Interwoven, Expressroom.

A CR provides, at a minimum, a programmatic interface. A CR may also provide command line, network, or graphical interfaces. Since CR are generally not standalone software, a GUI or other sophisticated interface is more likely to be tied to the functionality of the larger system of which the CR is one part.

There is a market for CR that focus on an API as a content store integration mechanism; ATG's repository components are in this category. The Xythos WebDAV server and OctetString's virtual directory products are examples of systems built around a network-interface-oriented CR. Windows Explorer together with the Windows file system is an example of a GUI-oriented system.

Although we do not have definitive research on this point, the leading markets for general CR seem to be:

  • Product line architecture-based system builders
  • CMS product vendors
  • Operating Systems

There do not seem to be major CR-based systems integration, end user, or IT organization markets for standalone CR, despite the basic similarity of CR to portals and other CMS. Further, given the high overlap of CR attributes (discussed below) with CMS functions, and the commoditization-defeating number of options in building an otherwise common component, it is likely that the majority of pure-play CR entrants will continue to be open source frameworks. Vendors of products adding value to the CR concept will either adopt and extend the better open source frameworks or build their own proprietary CR, using the network interface standards as the basic integration avenue.

Much of this note looks in detail at the attributes of CR. The core set of attributes is definitional. All CR have the following attributes:

  • Searchable    (NOTE: hierarchical search sufficient)
  • Create, read, update, delete (CRUD) enabled
  • Basic management information

Because CR play a role in so many application domains, the set of all attributes reasonably associated with CR is considerably broader. Common optional CR attributes include:

  • Advanced search
  • Permissioning
  • Definition and federation of underlying stores
  • Document state management
  • Document attributes (or metadata)
  • Document type definition (or prototypes)
  • Document validation
  • Schema defined content structure
  • Rules defined content structure
  • Multiple persistence implementations
  • Transactional read/write (or journaling)
  • Locking
  • Versioning
  • Expiration
  • Network interface
  • Batch import/export features
  • Multi document views or editions of corpus
  • Metadata extraction
  • Event monitoring
  • Linking and referencing
  • Extended management information

A number of common attributes of CR can also be found in some CMS outside of the CR (e.g. permissioning). The attributes listed above are those that are most closely aligned with the goal of a CR: unified, easy, controlled access to items of content.

There are several architectural styles commonly used in the CR domain. As set out here these styles are not necessarily mutually exclusive. Each style is a basic systems approach to which specific design patterns are matched. Some important stylistic choices are also itemized as attributes. Important issues include:

  • Representation of hierarchy
        Possibilities include:
    • Walkable, addressable document tree (pathed object graph)
      Can request '/x/y/z' from repository; can request 'a' from result; may be able to request 'a/b' from result
    • Walkable document tree (object graph)
      Can request 'x' from repository; can request 'y' from result
    • Association by address not by reference (virtual graph)
      Can request '/x/y/z' from repository; can request '/x/y/z/a' from repository
    • Flat list
  • Representation of documents
        Possibilities include:
    • Facade for all document related metadata (permissions, attributes, state, etc.)
    • Hierarchical object wrapper, with any ancillary services separately obtainable through interfaces taking a document pointer
    • Facade including hierarchical methods
    • Object of content itself
  • Handling of different document types
        Possibilities include:
    • Content is byte[] or stream
    • Content is object
    • Content is particular type of object based on document type
    • Content is a functional object that itself contains the document (e.g. a version set object)
    • Content is string or int reference to actual content
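
To make the first of these choices concrete, the following Python sketch contrasts three of the hierarchy representations listed above. The class names Node, PathedNode, and VirtualGraph and their methods are hypothetical and chosen only for illustration; content is held as bytes, which is just one of the content handling options.

```python
class Node:
    """Walkable document tree (object graph): children are requested by name."""

    def __init__(self, name, content=None):
        self.name = name
        self.content = content
        self.children = {}

    def add(self, child):
        self.children[child.name] = child
        return child

    def child(self, name):
        return self.children[name]


class PathedNode(Node):
    """Walkable, addressable tree (pathed object graph): also resolves paths."""

    def resolve(self, path):
        node = self
        for part in path.strip("/").split("/"):
            if part:  # skip empty segments such as the leading "/"
                node = node.children[part]
        return node


class VirtualGraph:
    """Association by address (virtual graph): a flat map keyed by full path.
    The hierarchy exists only in the addresses, not in object references."""

    def __init__(self):
        self.items = {}

    def put(self, path, content):
        self.items[path] = content

    def get(self, path):
        return self.items[path]


# Pathed object graph: request '/x/y/z' from the root, then 'a' from the result.
root = PathedNode("")
z = root.add(PathedNode("x")).add(PathedNode("y")).add(PathedNode("z"))
z.add(PathedNode("a", b"leaf"))
result = root.resolve("/x/y/z")
print(result.child("a").content, result.resolve("a").content)

# Plain object graph: only one step at a time.
print(root.child("x").child("y").name)

# Virtual graph: every request goes to the repository with a full address.
vg = VirtualGraph()
vg.put("/x/y/z", b"doc")
vg.put("/x/y/z/a", b"leaf")
print(vg.get("/x/y/z/a"))
```

A flat list is the degenerate case of the virtual graph in which the keys carry no path structure at all.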

The remainder of this note looks in detail at the (non-exhaustive) set of the necessary and optional CR attributes, styles and patterns.

 
CONTENT REPOSITORY ATTRIBUTE DETAILS

NOTE: Order of the following items is arbitrary at this time. This outline is a work-in-progress—more details, examples, and discussion will be added with time.

  • Searchable
    • Flat or linked browse
    • Hierarchical index
    • Keyword search
    • Type filtered search
    • Clustering
    • Field search
      • Attribute and/or state search
      • Content metadata search (field-based content index)
    • Document or repository structure aware search
    • Wildcarding
    • Nearest neighbor
    • Stemming, stopping
    • Ranking
    • Template query
    • Common query language
  • Create, read, update, delete (CRUD) enabled
    • Transactional
      • compensating actions
        actions taken to undo the effects of a failed action
      • transactions
        • store-based transactions
        • cross-store transactions
          (e.g. XA transactions)
        • store independent transactions
    • Optimistic/pessimistic updates
    • Caching
      • caching at persistence level vs at the repository level
      • MRU
      • Touch time vs. last use time
      • Refresh time and refresh-all
      • no cache option on repository or item
    • Node options
      • ordered children
      • return null, return new
      • content-or-children, content+children
      • document-and-folder, document-or-folder
      • compound (or virtual) documents
      • timed out and timed in content
    • Limits
      (e.g. number of items, number of bytes, depth)
    • Usage logging
  • Management information
    • Number of bytes
    • Number of documents, folders
    • Number of CRUD function calls
    • Last change timestamp
    • Created timestamp
    • Created by
    • Activity and holdings by user
    • ...
  • Permissioning
    • ACL
      • linked to items
      • linked to functions
    • Roles-based
    • ACL and Roles-based
    • Rules-based
  • Definition and federation of underlying stores
    • Multiple logical roots or single root
      (e.g. Windows filesystem roots vs. the UNIX root)
    • Nested repositories
    • Query (on underlying store) defined repositories
  • Document state management
    • Generated state
    • Application-set state
    • Store-dependent states or CR common states
  • Document attributes (or metadata)
    • Multivalues
    • Standard attributes
    • Mandatory attributes
  • Document type definition (or prototypes)
    • Statically configured
    • Runtime bound
    • Cloning
    • Type hierarchies
  • Document validation
    • Validation of content
    • Validation of attributes and/or state
    • Validation of document within structure
  • Content structure
    • Schema defined content structure
    • Rules defined content structure
  • Persistence implementation
    • Multiple store types
      • separation of content persistence mechanism and other repository infrastructure
        (e.g. doc object knows how to persist in a store vs. doc object knows what persistence service to request persistence from)
      • separation of hierarchical index persistence and content persistence
        (e.g. use of persistence service for all persistence vs. use of one persistence service to persist hierarchy and another to persist content indexed by hierarchy)
    • Store configuration
    • Transient stores
    • Hierarchy-store mapping
      • Pre-defined schema
      • Partially pre-defined schema
        (e.g. required parent id field)
      • Indirection
        (e.g. O/R mapping, LDAP mapping)
      • Pre-defined meta schema
        (e.g. XML DOM, ODBMS primitives or interface sets)
      • Named blobs
        (e.g. file system)
    • Code generated persistence-bound items
  • Transactional read/write (or journaling)
    • See under CRUD attribute above
  • Locking
    • Pessimistic updates
    • Read and write locks
    • Timed locks
    • Shared locks
    • Thread-based locking
  • Versioning
    • Handled natively by store
    • Handled by repository using arbitrary store
    • Handled by federation of stores
  • Expiration
    • See CRUD attributes above
  • Network interface
    • LDAP
    • WebDAV
    • NFS
    • IFS
    • HTTP/XPATH/XML
  • Batch import/export features
  • Multi document views or editions of corpus
    • Default view
    • User default view
    • Only-in-view or in-view+default
    • Query in view
    • Testing views
  • Metadata extraction
    • Generated metadata
  • Event monitoring
    • Choice of events to fire
    • Listeners
    • Triggered events
  • Linking and referencing
    • URL references
    • Refer within, out from, and between repositories
    • GUIDs
    • Cyclical reference
    • Child reference to repository
    • Reference to content at hierarchy without insertion of hierarchy into hierarchy
      (e.g. a child references another child, but children added under the reference belong to the reference's own hierarchy; they are not children of the referenced object)
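
The last linking option above is easy to misread, so here is a minimal Python sketch of one possible reading. The classes Item and Reference are hypothetical; the sketch assumes a reference shares the content of its target while keeping its own, separate set of children, and it does not guard against cyclical references.

```python
class Item:
    """A node holding content and its own children."""

    def __init__(self, name, content=None):
        self.name = name
        self._content = content
        self.children = {}

    def add(self, child):
        self.children[child.name] = child
        return child

    @property
    def content(self):
        return self._content


class Reference(Item):
    """A node whose content is read from a target item.
    Children added here stay in the reference's hierarchy only;
    the target's children are not exposed through the reference."""

    def __init__(self, name, target):
        super().__init__(name)
        self.target = target

    @property
    def content(self):
        # No cycle detection: a reference chain that loops would recurse forever.
        return self.target.content


root = Item("root")
original = root.add(Item("original", b"shared text"))
original.add(Item("appendix", b"belongs to original"))

alias = root.add(Reference("alias", original))
alias.add(Item("note", b"belongs to alias"))

print(alias.content)             # content comes from the referenced item
print(sorted(alias.children))    # only the reference's own children
print(sorted(original.children)) # the target's hierarchy is unchanged
```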

 

End Links:

 

   
   

 

 

   
© 1999-2002, d.kershaw. all rights reserved.