Clever, Crafty Content Profiling of Objects


Content profiling consists of three high-level steps: meta-data gathering, processing and aggregation, and meta-data analysis. The first step transforms the data into a model that supports fast, scalable analysis and stores it. Post-processing resolves issues, such as conflicting values, that arise when data provided by different tools is normalised, and aggregation provides a machine-readable overview of the data. The last step offers the planning expert a service on top of the data: it helps with analysing the subtleties of the objects and with partitioning the content into smaller sets, each fit for a specific preservation action.
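To make the first two steps concrete, here is a minimal sketch in Python. It is an illustration only, not the tool's actual code: the input tuples, the tool names, the "CONFLICT" marker and the function names `gather`, `resolve` and `aggregate` are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical raw tool output: (object id, tool name, property, value).
RAW = [
    ("obj1", "toolA", "format", "PDF"),
    ("obj1", "toolB", "format", "PDF"),
    ("obj2", "toolA", "format", "JPEG"),
    ("obj2", "toolB", "format", "TIFF"),  # the two tools disagree here
    ("obj1", "toolA", "size", 1024),
    ("obj2", "toolA", "size", 2048),
]


def gather(raw):
    """Step 1: store every reported value per object and property."""
    model = defaultdict(lambda: defaultdict(list))
    for obj_id, tool, prop, value in raw:
        model[obj_id][prop].append((tool, value))
    return model


def resolve(model):
    """Step 2a: keep a value only if all tools agree, else flag a conflict."""
    resolved = {}
    for obj_id, props in model.items():
        resolved[obj_id] = {}
        for prop, reports in props.items():
            values = {value for _, value in reports}
            resolved[obj_id][prop] = values.pop() if len(values) == 1 else "CONFLICT"
    return resolved


def aggregate(resolved, prop):
    """Step 2b: distribution of one property over the whole collection."""
    return Counter(rec.get(prop, "UNKNOWN") for rec in resolved.values())


records = resolve(gather(RAW))
format_distribution = aggregate(records, "format")
print(dict(format_distribution))
```

Flagging disagreements instead of silently picking one value is just one possible conflict-resolution policy; a real profiler might instead rank tools by reliability or keep all candidate values.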

Clever, Crafty Content Profiling of Objects (c3po) is a software tool prototype that takes FITS-generated data of a digital collection as input and automatically generates a profile of the content set. It is designed so that meta-data formats originating from other tools can be integrated easily. The tool follows the three-step profiling process described above and provides facilities for data export and further analysis of the content, such as visualisations of the meta-data characteristics and partitioning of the collection into homogeneous sets based on any known characteristic. To support decision making, it also uses different algorithms that choose a small set of sample records (up to 10) based on the size of objects, the distribution of specific characteristics, or other common features. For each chosen partition of the content, a machine-readable profile can be generated that contains aggregations and distributions for many of the properties. The profile optionally contains the set of chosen representative samples, their identifiers within a content repository, and a list of all objects that fall into the partition. A machine-readable content profile conforming to such a format plays an important role in integration with a planning component, content repositories and monitoring systems, and thus in automating the entire cycle of planning and operations.
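The partitioning, sampling and profile generation described here can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not c3po's real algorithms or profile format: the record fields, the evenly spaced size-based sampling strategy, the profile keys and the function names `partition`, `pick_samples` and `build_profile` are all hypothetical.

```python
import json
from collections import Counter, defaultdict

# Hypothetical, already normalised records (identifier, format, size in bytes).
RECORDS = [
    {"id": "a", "format": "PDF", "size": 100},
    {"id": "b", "format": "PDF", "size": 5000},
    {"id": "c", "format": "PDF", "size": 300},
    {"id": "d", "format": "JPEG", "size": 700},
]


def partition(records, prop):
    """Split the collection into homogeneous sets by one property."""
    parts = defaultdict(list)
    for rec in records:
        parts[rec.get(prop, "UNKNOWN")].append(rec)
    return dict(parts)


def pick_samples(records, limit=10):
    """Assumed strategy: samples spread evenly across the size range."""
    ordered = sorted(records, key=lambda r: r["size"])
    if len(ordered) <= limit or limit < 2:
        return ordered[:limit]
    last = len(ordered) - 1
    # Integer arithmetic keeps the chosen indices distinct and in range.
    return [ordered[i * last // (limit - 1)] for i in range(limit)]


def build_profile(name, records, limit=10):
    """Assemble a machine-readable profile (illustrative layout only)."""
    sizes = [r["size"] for r in records]
    return {
        "partition": name,
        "count": len(records),
        "size": {"min": min(sizes), "max": max(sizes), "sum": sum(sizes)},
        "format_distribution": dict(Counter(r["format"] for r in records)),
        "samples": [r["id"] for r in pick_samples(records, limit)],
        "objects": [r["id"] for r in records],
    }


partitions = partition(RECORDS, "format")
profile = build_profile("format=PDF", partitions["PDF"], limit=2)
print(json.dumps(profile, indent=2))
```

Serialising the profile as JSON is a choice made for this sketch; any structured format that a planning component or repository can parse would serve the same purpose.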

Software Releases

An official release is coming soon.
For now, you can take a look at the project on GitHub.


Large-scale content profiling for preservation analysis. iPres 2012, Toronto, Canada [Poster Paper]
C3PO: a content profiling tool for preservation analysis. Blog post at Open Planets Foundation [Post]


