Content Import (4.3.1)


Content Import (version 4.3.1 and later)

Summary: This article describes the importer functionality for versions starting from 4.3.1. There are some updates with each patch version (e.g. 4.3.1.1, 4.3.1.2, etc.); wherever a feature was introduced in a specific patch version of UNIFY, the article points this out explicitly.

Purpose

The purpose of the importer is to add or modify (or both) a collection of items in a data store or a file store.

Types of import

Basic import settings: destination and import type

The importer can accept data in several ways. At the moment, there are four options (one of them used solely for internal purposes):

  • XML files;
  • CSV (Comma-Separated Values) files, also referred to as delimited files;
  • Files without metadata (only applicable for file stores);
  • Bulk editing (the bulk editor is internally connected to the importer; however, this is not visible from the backend/frontend).

The importer behaviour also differs slightly if the import is scheduled. Each of these options is discussed in more detail later on.

Prerequisites

List of import configurations in the backend

It is only possible to run the import from within the backend (Content -> Import Configurations) or from the frontend when the Import Portlet is added to a page. Also, from version 4.3.1 the importer only works with import configurations - it is not possible to select the import type on the fly; each import portlet configuration is bound to a certain import configuration ID. Effectively this means that the administrator (or anybody who has access to the backend) has to create a configuration to be used with the portlet and/or the backend.

In order to import a file (or a set of files), it must first be uploaded into the Unify repository (for example, using the Content -> Upload Files functionality in the backend, or the Upload portlet on the frontend).

Although many new features were added in this version, v4.3.1 import retains backwards compatibility: any XML or CSV file that could be imported before version 4.3.1 must be importable in any 4.3.1 version and produce the same outcome, although in some cases the import configuration settings might need to be adjusted. The same applies to backwards compatibility between different patch versions of 4.3.1 (e.g. 4.3.1.1 and 4.3.1.7).

Configuration settings: Basic

Selection settings of Data and File store

When creating a new configuration or editing an existing one, the following options are presented:

  • Select the import destination (Data Store or File Store).
    • For the Data Store, an additional combo box is shown to select the target data store to import the items into.
    • For the File Store, the relevant file store selection combo box is shown, together with a checkbox specifying whether the importer has to search for derived files (see below). Also, the import type selector will additionally allow selecting "Files without metadata".

WARNING: avoid looking for derived files if the import folder is large (more than 200 files), as this may slow down the import process dramatically; see UNF-1318 for more details.

  • Select the import type:
    • Delimited Files option is for importing data from CSV files.
      • You will also have to select your delimiter from the panel on the right;
      • The Lookup attributes using name rather than XML name checkbox is kept for compatibility with older (pre-4.3.1) versions. For more details, see the notes next to the "The user wants to create a new item (or update existing item) and specify content sections and/or attributes" use case (use case #11).
      • It is possible to select the charset the CSV file is encoded in.

WARNING: make sure your CSV file does not have a Byte Order Mark (BOM) at the very beginning of the file. Some Microsoft Windows editors (notably Notepad) insert it by default. Because Java does not handle BOMs transparently, your CSV file might not be read correctly. See Java bug 4508058 for more details.

    • XML Files option is for importing data from XML files;
    • Files without metadata option is only available if the store type (see import destination) is a file store. This option is the simplified case of importing data into a file store when no metadata is required. A good example would be an import of hundreds of photos (without any descriptions) which will later be arranged into an album.

Configuration settings: Advanced

Advanced settings panel
  • The GUI / CRON EXPRESSION selector denotes whether the import should be executed at a certain time without user intervention (see below for a detailed description of how scheduled import works; an example CRON string is shown after this list). The syntax of the CRON string is explained here and here. The import also has to be marked as active, otherwise it will not be scheduled. The reverse also applies: if a scheduled import is saved with the active checkbox unchecked, it will no longer be executed in the future.
  • The order of files setting is especially relevant for scheduled imports where the same items might be located in many different directories (think of a scenario where an item is created and later edited by referencing specific attributes).
  • The Update behaviour setting appeared in the 4.3.1 release; the previous behaviour corresponds to the Erase content if imported data misses attribute(s) option. The problem with this setting was that anyone wanting to update an item had to mimic the entire content and metadata structure. The other option, Leave contents and only update what is specified (the default since Unify 4.6.3), allows specifying only those attributes one wants to update. Exceptions:
    • The locale attribute cannot be updated and must not differ from the collection schema's locale; the import terminates immediately if the locale is different. It only makes sense to specify the locale when importing into a store which has secondary locales, in order to identify which locale you are targeting.
    • If any of the standard metadata attributes {name, description, keywords} are not specified and the policy is Erase content..., they will not be erased but will retain their previous values.
  • Taxonomy configuration setting is straightforward and is not described here.
  • Stop import process if an attribute is not found in the collection schema checkbox enforces strict validation; a warning is shown if the configured collection schema does not contain a given attribute.
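
As an illustration of the CRON EXPRESSION option mentioned above, the sketch below assumes a Quartz-style six-field expression (seconds, minutes, hours, day of month, month, day of week); consult the syntax references linked above for the exact format your UNIFY version accepts.

0 30 2 * * ?

Combined with the active checkbox being checked, such an expression would run the import every day at 02:30 without user intervention.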

Configuration settings: Content/Metadata attributes

Content/Metadata and Item/User link settings

Content and Metadata attribute settings allow the user to specify how the importer should behave for each attribute. All the methods and their behaviour are the same for both Content and Metadata sections.

  • Read from source is the default option. It assumes the user provides the data in the CSV or XML file (for Files without metadata this can't be specified). An error is shown if the attribute has Exactly one cardinality and the attribute value is not specified during the import process. If the cardinality does not imply that the attribute is mandatory, the attribute value can be skipped. Attributes of auto-increment type should NOT be specified, otherwise an error is shown (valid from version 4.3.1.2, see UNF-1518 for more details).
  • Read from source but secondary items default to primary items value is similar to the above when the collection schema has secondary language(s) set. If selected, one can skip specifying the attribute value, as it will be taken from the relevant attribute of the primary item. An error is shown for this selection if neither the imported item nor the primary item has this value set.
  • Use given value allows setting a static value. In this case, any value coming from the XML/CSV is ignored.
  • Fallback to default value if import source doesn't have this attribute specified checkbox is only shown if the relevant attribute has Exactly one cardinality. If checked, the default value set in the collection schema properties is used (and thus no error is shown if the attribute is not specified in the XML/CSV), but a value passed in the XML/CSV takes precedence over the default value. This feature was added in the 4.3.1.1 release, see UNF-1229 for more details.

Please note that these options are only available if the target collection schema contains content and/or metadata definition(s). Each setting is relevant only for the given attribute.

Configuration settings: Item link sections

Item link import methods are the following:

  • Read from source is the default option. It assumes the user provides the item link information in the CSV or XML file (for Files without metadata this can't be specified). As an item link is not mandatory, a mandatory-value error will never be shown if the item link information is missing.
  • Secondary items inherit links from parent. Still pretty much straightforward: secondary items will be linked to the same items/users their primaries are linked to (those can be both secondary and primary items).
  • Secondary items inherit links from parent but link to localized versions (no link if localized version not found). Available only for item links and similar to the option above, but it tries to find the same locale as the currently imported item has. For example, let's say that the current item's (let's name it XS) primary (XP) is linked to items YP (primary) and WS (secondary). The checks (done to search for the same locale as that of XS), in order of precedence, are:
    • YP's locale
    • YP collection schema's locale
    • locales of secondary items of WS's primary item (since WS is secondary)
    • locales of YP's secondary items (since YP is primary)
  • Secondary items inherit links from parent but link to localized versions (link to primary locale if localized version not found). Same as above, but will set the locale to the primary item's locale if the above method did not work.
  • Choose an item to link all imported items to allows selecting an item from the collection schema at the other end of the item link definition and statically binding all imported items to it. This ignores all linking information provided in the source document for this particular link definition.
  • Link to same items as current user option is described in the Advanced topics section below as well as in the relevant JIRA report (see UNF-30).
  • Ask user to select the item to link to when importing allows the user to select the target item during the import process (both in the backend and the frontend). A popup window is shown for the selection. Source link information, if specified, is ignored. This feature was added in 4.3.1.5, see UNF-1053.

Similar to Content/Metadata, item link options are only available if the collection schema has item link definition(s) configured. Each definition has its own separate setting.

Configuration settings: User link sections

User link import methods are the following:

  • Read from source is the default option. It assumes the user provides the user link information in the CSV or XML file (for Files without metadata this can't be specified). As a user link is not mandatory, a mandatory-value error will never be shown if the user link information is missing.
  • Secondary items inherit user links from parent implies link inheritance from primary items (if secondary items are imported). Source link information, if specified, is ignored.
  • Choose a profile to link all imported items to allows statically choosing the user link profile in the import configuration (the realm used is the one set on the user link definition). Source link information, if specified, is ignored.

Similar to Content/Metadata, user link options are only available if the collection schema has user link definition(s) configured. Each definition has its own separate setting.

Handling of the import files (scheduled VS GUI-triggered)

The process of finding files to import depends on whether the import is scheduled or not.

If the import is not scheduled, the user selects file(s) and/or directory(ies) in the GUI by clicking the + button. It is possible to select multiple files/directories, as well as to clear the selection by clicking the X button next to a selected file/folder. When running the import, UNIFY will recursively search (in the specified directories) for:

  • files with .csv or .txt extension if the import type is Delimited files;
  • files with .xml extension if the import type is XML files;
  • all assets, if the import type is Files without metadata. An asset, at the moment of writing, is a file not matching any of the following patterns (see the worked example after this list): {*-thumbnail.jpg, *-pdf_preview.pdf, *-preview.jpg, *-fullsize.jpg, *-realvideo_lan.rm, *-realaudio_lan.rm, *-preview.wmv, *-preview.mov, *-preview.rm, *-storyboard-*, *-custom.*, *-vyre_service*, *.xml, *.csv, *.txt}.
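
As a worked example (the file names below are hypothetical), consider a selected folder containing four files. A Files without metadata import would treat them as follows:

photo.jpg            -> imported (asset)
photo-thumbnail.jpg  -> skipped (matches *-thumbnail.jpg)
metadata.xml         -> skipped (matches *.xml)
items.csv            -> skipped (matches *.csv)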

When a non-scheduled import is finished, all files remain in their original places and no additional files are created in the source directories. For file stores, the files (not the import XML/CSV files, but the ones being imported into the file store) and derived files can be selectively deleted, depending on the Delete source files checkbox setting in the Advanced settings panel. The option to choose this behaviour via the checkbox only appeared in the 4.3.1.7 release; prior to that, derived files were never deleted whereas the actual source file always was. For more details, see UNF-1721.

If the import is scheduled, the following rules apply:

  • The import root folder (as specified in the configuration) has to exist, or a blank one will be created (with an "incoming" subfolder inside). If the incoming folder is not a directory, the import process will not continue. We will refer to this folder as import_root below;
  • The import_root/incoming folder is scanned for subfolders. Only the subfolders containing a ready_for_import.txt file are matched.
  • Each subfolder is read in a certain order (according to the relevant sort configuration setting); each folder is imported separately. Under the covers, this not only means that the importer submits items in that order, but also that it waits for the indexing thread to finish indexing the contents of the previous subfolder before proceeding to the subsequent one.
  • If the import of a particular subfolder finishes successfully, all of its files are moved to import_root/finished/yyyyMMddHHmm/source_folder. Otherwise (if it failed) the relevant folder is import_root/failed/yyyyMMddHHmm/source_folder.
  • A log file (text format) is created and subsequently moved to the same folder together with the source files. The name of the log file is yyyyMMddHHmm.log, where yyyy represents the year, MM the month, dd the day, HH the hours and mm the minutes.
  • If the importer finds it impossible to move the files, a relevant log entry is created.
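
To illustrate the rules above, here is a sketch of a possible directory layout after one scheduled run; the root path and the batch subfolder names are hypothetical.

import_root/
   incoming/
      batch_02/                      (ignored until a ready_for_import.txt marker appears)
   finished/
      200901301230/
         batch_01/
            items.xml
            ready_for_import.txt
            200901301230.log
   failed/

Here batch_01 contained the ready_for_import.txt marker, was imported successfully by the run of 30 January 2009 at 12:30, and was moved under finished/ together with its log file; a failing batch would end up under failed/ instead.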

Internals of the import process

Important facts which might give an idea of how the import is organised:

  • The import process is deterministic. This means that, given the input data, it should always be possible to know which items will be created and how they will be updated.
  • The import code is organised so that different adaptors (CSV, XML, plain files, etc.) create the same type of data (RawItem), which is later analysed and validated/imported. Validation is done in two stages: syntax and structure. For example, XML first-stage validation is schema (.xsd) validation. CSV first-stage validation is very weak (we can only determine whether a line contains as many values as there are labels). This is why second-stage validation duplicates a lot of checks for XML, but is unavoidable nonetheless.
  • If the content is new, a new content item is created for it. Otherwise (if the content already exists), the existing item is updated. Whether the content is new or not is determined in the following way:
    • Unify ID is found for the item, or
    • Matching attribute is found. The ID and value of the attribute must be specified, the importer then uses this information to locate items that have the required value in the attribute specified. WARNING: if multiple items are found, then they are all updated.
  • If metadata or content attributes are being imported, the importer checks their validity and throws an error if validation fails. The same applies if an attribute is mandatory (has the relevant cardinality) and the value is not specified, or an attribute is not found and strict validation is enabled.
  • Attributes of auto-increment type must not be specified even though they are mandatory, otherwise an error is thrown.
  • For a non-scheduled import, the entire process is wrapped in one database transaction. This basically means that if an error occurs, nothing is imported.
  • For a scheduled import, a separate database transaction is arranged for each import subfolder. Each subfolder is processed regardless of the previous results (even if a transaction failed on a previous subfolder).
  • For scheduled import, history records are added to the database.
  • When running Unify version 4.3.X, the importer stops immediately upon a single error. Starting from version 4.4, the importer accumulates and shows errors until a threshold (currently set to a maximum of 20 warnings and 10 errors) is reached.

Import types: Files without metadata

This type of import is the simplest. It basically imports a set of files without any custom metadata attached. The system will generate the following defaults:

  • the item name is copied from the name of the imported file; the same value is used for the file name information in the file store;
  • the item description will be "Imported from [file_name] by [user] on [dd.MM.yyyy HH:mm]", where the data between square brackets is generated dynamically;
  • the locale is taken from the primary locale of the file store the item is being imported into;
  • the file mimetype is "guessed" according to the file extension.
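
As an illustration of these defaults (the file name, user and date below are hypothetical), importing a file called catalogue.pdf on behalf of user Jane Smith could produce the following item:

item name:    catalogue.pdf
file name:    catalogue.pdf
description:  Imported from catalogue.pdf by Jane Smith on 30.01.2009 14:05
locale:       en (the primary locale of the target file store)
mimetype:     application/pdf (guessed from the .pdf extension)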

Import types: XML files

This type of import is the most straightforward and provides the widest support of all the import features. An XML schema (W3C spec) is used to validate the XML import file. For an example of the schema see Import xsd schema; note that the latest version of the schema can always be found in your Vyre UNIFY installation directory: conf/core/components/content_module/import/import.xsd. Please note that the import will not proceed if the XML cannot be validated against the schema (a relevant error will be shown on the screen and/or written to the import log).

With XML, only one item can be created per file, although it is possible to update many items if a ref-attribute references many items.

If the item name is not specified (or left empty), it will be populated with the file name of the XML file. If the item description is not specified, it will be populated using the following pattern: Imported from [filename] by [user_full_name] on [date], where filename is the current XML file; user_full_name is the full name of the logged in user and the date is the current date in format dd.MM.yyyy HH:mm.

Starting from version 4.3.1, the item type attribute is no longer mandatory nor evaluated: whether the destination is a file store or a data store is now decided per import configuration (where the store is a mandatory setting).

TIP: See Use Cases for a complete set of examples for importing data in XML format.

Import types: Delimited files

Sequence diagram showing CSV header parser mechanism

Any delimited file being imported consists of a file header (labels) and data. The header is the first line, which lists the tags; the data in the lines below must correspond to the tags by position (separated by the delimiter), e.g.:

locale,name,description,keywords
en,Test NameA,Test DescriptionA,KeywordA1 KeywordA2
en,Test NameB,Test DescriptionB,KeywordB1 KeywordB2

As opposed to XML, it is possible to import many items per CSV file, where each item is described on one line. If different items contain different types of information (e.g. one item contains taxonomy information and another doesn't), the user is required to split the information into two different files with the relevant headers and data.

The order of columns within a line (whether a label line or a data line) doesn't matter. However, information in a data line must be in the position corresponding to the relevant label.

If the item name is not specified (or left empty), it will be populated with the file name of the CSV file. If the item description is not specified, it will be populated using the following pattern: Imported from [filename] by [user_full_name] on [date], where filename is the current CSV file; user_full_name is the full name of the logged in user and the date is the current date in format dd.MM.yyyy HH:mm.

Historically, most of the features and syntax were originally developed with the XML format in mind, whereas the CSV version was very skimmed feature-wise. Version 4.3.1 added some missing cases to the CSV importer (however, it is clear that CSV will never reach the richness of XML due to its nature). Being a flat structure, it is not trivial to mimic the XML functionality, especially deeply nested data (such as the item link ref attribute). This is why CSV tags like #item-link-vyreid#number were introduced. The idea behind such a structure is simple: there is a set of attributes which together form a small entity (e.g. an item link requires a link definition id, a mode and possibly referring attributes or a vyre-id). Moreover, many links can be imported together in one batch. The number denotes that an attribute (definition id, mode or anything else) is bound to the same entity. In other words, attributes of the same entity have the same number, and if a mandatory attribute is missing, the importer reports an error.

Let's look at an example. Let's say we want to change a set of items which meet the criteria of two different attributes (attribute1=valueX and attribute2=valueY). What would the CSV header and data line look like?

#locale,#ref-attribute-id#1,#ref-attribute#1,#ref-attribute-id#2,#ref-attribute#2,#name
en,20,ref_attr_20_value,40,ref_attr_40_value,NewName-CSV2

In the example above we assign an identifier number to each tag so that the system knows which attribute value maps to which attribute id; in this example, ref_attr_20_value maps to id 20 and ref_attr_40_value to id 40.

See Use Cases for a complete set of examples for importing data in CSV format.

WARNING(1): When specifying content or metadata attributes, make sure they match the "XML name" of the attribute, not the "Attribute name". The latter is only meant for visualisation purposes; however, pre-4.3.1 CSV import looked up the "attribute name" during the process. If you still want to use the old behaviour, make sure the Lookup attributes using name rather than XML name checkbox is checked.

WARNING(2): make sure your CSV file does not have a Byte Order Mark (BOM) at the very beginning of the file. Some Microsoft Windows editors (notably Notepad) insert it by default. Because Java does not handle BOMs transparently, your CSV file might not be read correctly. See Java bug 4508058 for more details. Make sure the encoding is set to ANSI when saving the file.

Looking for derived files

When importing into a file store, the system can be told to look for derived files. Derived files are files generated from the original file (such as thumbnails, previews, storyboards, etc.). Let's say that we have a file called original.xxx. If this feature is turned on, the system will look for the following files:

  • thumbnail:
    • original-thumbnail.jpg
  • preview:
    • original-fullsize.jpg
    • original-pdf_preview.pdf
    • original-preview.jpg
    • original-preview.wmv
    • original-preview.mov
    • original-preview.rm
    • original-realvideo_lan.rm
    • original-realaudio_lan.rm
  • storyboard:
    • original-storyboard-xxx.jpg
  • custom derived:
    • original-custom.*

If found, the above files will be stored as "manual" derived files. You also have the option of importing derived files straight into a folder for a file service defined in VYRE. To do that, follow these naming rules (still assuming that the original file has the name original.xxx):

  • original-vyre_service${serviceId}_.${service_file_extension}

Where ${serviceId} is the ID of the file service, and ${service_file_extension} is the file extension of the files which the service generates. This also works for storyboard file services, where the storyboard files have to have names in the following format:

  • original-vyre_service${serviceId}_xxx.jpg

The "xxx" can be the index of the storyboard image (001, 002, etc.). If these files are found, then they are imported as derived files, and the Marchena is not told to generate the respective files. This feature is useful when you have the derived files and want to avoid unnecessary overhead of regenerating them, and also when you have thumbnails for files which VYRE does not know how to generate thumbnails for (.zip files, for example).

Handling primary and secondary items

UML Activity Diagram showing how secondary items are handled by the importer

It is possible to import primary and secondary items either in the same XML and/or CSV file or in separate files (or even separate imports). The locale attribute is mandatory for secondary items in order to identify the localisation target during the import (as there can be more than one secondary locale configured for a given collection schema). See the diagram which contains the workflow of handling secondary items (localisations) by the importer. Each primary or secondary item is pushed through the rules shown there.

Use Cases (examples)

NOTE: The use case examples assume that a comma is used as the separator for delimited files.

Use case #1: simple new item import

Scenario: The user is about to import:

  • a new item (for XML);
  • two new items (for CSV).

XML syntax:

<?xml version="1.0" encoding="UTF-8"?>
<item>
   <name>Test NameA</name>
   <description>Test DescriptionA</description>
   <keywords>KeywordA1 KeywordA2</keywords>
   <locale>en</locale>
</item>

CSV syntax:

#locale,#name,#description,#keywords
en,Test NameA,Test DescriptionA,KeywordA1 KeywordA2
en,Test NameB,Test DescriptionB,KeywordB1 KeywordB2

Notes: It is not possible to create many items using only one XML file. Also please note that if the content or metadata sections are empty (in XML), they can be omitted.

Use case #2: simple item update

Scenario: the user wants to update an existing item by specifying its UNIFY id.

XML syntax:

<?xml version="1.0" encoding="UTF-8"?>
<item>
   <overwrite-id>62590</overwrite-id>
   <name>New name - XML</name>
</item>

CSV syntax:

#overwrite-id,#name
62590,New name-CSV

Notes: for both the CSV and XML examples it is assumed that the existing item's id is 62590. You can omit the locale if the item is primary (as in this case).

Use case #3: simple ref-attr usage

Scenario: the user wants to update an existing item by specifying referring attributes.

XML syntax:

<?xml version="1.0" encoding="UTF-8"?>
<item>
   <ref-attribute id="description">desc</ref-attribute>
   <ref-attribute id="20">value_of_attr_20</ref-attribute>
   <locale>en</locale>
   <metadata>
      <item>
         <expiry-date>24.12.2006 18:00</expiry-date>
      </item>
   </metadata>
   <content>
      <item>
         <bicycle_rack>
            <rack>new rack value</rack>
         </bicycle_rack>
         <body>new body value</body>
      </item>
   </content>
</item>

CSV syntax:

#locale,#ref-attribute-id#1,#ref-attribute#1,#ref-attribute-id#2,#ref-attribute#2,#name
en,20,ref_attr_20_value,40,ref_attr_40_value,NewName-CSV2

Notes: In the XML example, we would update all items which have the attribute with id=20 set to the value value_of_attr_20 AND the description being desc. Please note that for legacy reasons the XML importer supports both id="20" and id="att20". In the CSV example we would update all items having attribute [id=20, value=ref_attr_20_value] AND attribute [id=40, value=ref_attr_40_value] to have the new name NewName-CSV2.

TIP(1): a ref-attribute value will match entire phrases which contain the specified words, as opposed to exact values, i.e. "foo" will match "foo bar"; therefore it is always safer to rely on item IDs rather than ref-attributes, unless you are 100% sure it will not affect more items than anticipated.

TIP(2): if no matches for a ref-attribute are found, a new item is created. However, if a ref-attribute is used for creating links and no match is found, a warning occurs and the import process may stop.

Use case #4: taxonomy (un)assignments

Scenario: the user wants to import an item and assign it to some new taxonomy categories (or unassign it from existing ones)

XML syntax:

<?xml version="1.0" encoding="UTF-8"?>
<item>
   <overwrite-id>62596</overwrite-id>
   <categories>    
      <!-- omitting query type defaults to query-type="add" -->
      <category query-type="remove">/first/unassigned</category>
      <category>/third/assigned,/fourth/assigned</category>
      <!-- above's equal to two repeated tags; XML-only feature -->
   </categories>   
</item>

CSV syntax:

#name,#description,#add_taxonomy_category,#add_taxonomy_category
TestName,TestDescription,/first/to_assign,/second/to_be_assigned

Notes: in the CSV example, a new item is imported. In the XML example, an existing item (id=62596) is altered. Please note that if a category of an existing item is mentioned in neither add nor remove, it will be ignored (left assigned or unassigned), regardless of the import mode (erase or leave). Use the CSV label #remove_taxonomy_category to remove a taxonomy category. Note that for CSV, the add or remove taxonomy tags can be repeated many times without an identifier number (tag#number).

Use case #5: basic item linking and unlinking

Scenario A: the user wants to import a new item or update an existing item and link it to another existing item. Scenario B: the user wants to import a new item or update an existing item and unlink it from another existing item (by replacing its links with links to the given items).
XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <overwrite-id>62596</overwrite-id>
   <item-links>
     <item-link definition-id="32">
         <vyre-id>62597</vyre-id>
      </item-link>
   </item-links>
</item>
XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <overwrite-id>62596</overwrite-id>
   <locale>en</locale>
   <item-links>
      <!-- omitting link mode defaults to "add" -->
      <!-- but only for XML, not CSV -->
      <item-link link-mode="replace" definition-id="32">
         <!-- ref-attribute feature is only for XML -->
         <!-- it is not available for CSV -->
         <ref-attribute id="63">testval</ref-attribute>
      </item-link>
   </item-links>
</item>
CSV syntax:
#overwrite-id,#locale,#item-link-defid#1,#item-link-vyreid#1,#item-link-mode#1
62596,en,32,62597,add
CSV syntax:
#overwrite-id,#locale,#item-link-defid#1,#item-link-vyreid#1,#item-link-mode#1
62596,en,32,62598,replace

Notes: In the CSV example (scenario A) we linked an existing item (id=62596) to another item (id=62597) using the item link definition (id=32). In the CSV example (scenario B) we replaced the just-created link with a link to another item (id=62598) using the same link definition. In the XML example (scenario A) we did the same as the first CSV example, and in scenario B we replaced the just-created link with link(s) to item(s) which have a content or metadata attribute (id=63) with the value testval. Please pay attention to the item link configuration settings in the import configuration: link information specified in XML and/or CSV will only matter if the config option is set to Read from source, otherwise the source information will simply be ignored. The item link definition must exist before creating links, as the link definition id is a mandatory parameter. See the note next to the ref-attribute use case about the usage of ref-attributes when creating links.

Use case #6: basic adding/replacing user link

Scenario: the user wants to import new item or update existing item and link it to (or unlink from) user.

XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <overwrite-id>62596</overwrite-id>
   <locale>en</locale>
   <user-links>
      <!-- omitting link mode defaults to "add" -->
      <!-- but only for XML, not CSV -->
      <user-link link-mode="replace" definition-id="63">
         <user-id>admin</user-id>
         <!-- use profile-id tag to reference profile id instead of user id -->
      </user-link>
   </user-links>
</item>
CSV syntax:
#overwrite-id,#locale,#user-link-defid#1,#user-link-profile-id#1,#user-link-mode#1
62596,en,63,1,add

Notes: in the CSV example we linked an existing item (id=62596) to an existing user (profile id=1) using the user link definition (id=63). To link by user id (rather than profile id), use the keyword #user-link-user-id#number. In the XML example we replaced this link with a link to the user whose user id is admin. Please pay attention to the user link configuration settings in the import configuration: link information specified in XML and/or CSV will only matter if the config option is set to Read from source, otherwise the source information will simply be ignored. The user link definition must exist before creating links, as the link definition id is a mandatory parameter.

Use case #7: basic operations with secondary items (1)

Scenario: the user wants to import new secondary item and assign to existing primary item.

XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <primary-id>62596</primary-id>
   <localizations>
      <locale>
         <item>
            <name>Vardas</name>
            <description>Aprašas</description>
            <locale>lt</locale>
         </item>
      </locale>
   </localizations>
</item>
CSV syntax:
#locale,#primary-id,#name,#description
lt,62596,Vardas,Aprašas

Notes: in both the CSV and XML examples, we referenced the primary item (id=62596) and created a secondary item with the lt locale. Please note that the locale of the primary item does not need to be specified.

Use case #8: basic operations with secondary items (2)

Scenario: the user wants to import both new primary and secondary items

XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <name>New Primary Name</name>
   <description>New Primary Description</description>
   <locale>en</locale>
   <localizations>
      <locale>
         <item>
            <name>Vardas</name>
            <description>Aprašas</description>
            <locale>lt</locale>
            <locale>lv</locale>
         </item>
      </locale>
   </localizations>
</item>

Notes: this functionality (as one batch) is not available in CSV. Users should create the primary item first and then follow the above example to create the secondary item(s) if needed. Note a special feature here: when a secondary item contains two or more locales, a corresponding number of secondary items is created, one per locale. It is not possible to specify more than one locale for the primary item, though (if no locale is specified, the schema's locale is assumed). This multi-locale feature can also be used for CSV by specifying multiple #locale keywords (see the sketch below).
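
A minimal CSV sketch of the multi-locale variant mentioned above, assuming that repeating the #locale label behaves like other repeated CSV labels (this exact layout is an assumption rather than a verified example):

#locale,#locale,#primary-id,#name,#description
lt,lv,62596,Vardas,Aprašas

This would create two secondary items (one for the lt locale and one for lv) under the existing primary item 62596.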

Use case #9: basic operations with secondary items (3)

Scenario: the user wants to update existing secondary item.

XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <primary-id>62596</primary-id>
   <localizations>
      <locale>
         <item>
            <overwrite-id>62624</overwrite-id>
            <name>NewSecondaryName2</name>
            <locale>lt</locale>
         </item>
      </locale>
   </localizations>
</item>
CSV syntax:
#locale,#primary-id,#overwrite-id,#name
lt,62596,62624,NewSecondaryName

Notes: in the CSV example, we updated the secondary item (id=62624), which is a localisation of the primary item (id=62596), by changing its name to NewSecondaryName. In the XML example we did the same, except that we set the name to NewSecondaryName2.

Use case #10: importing file into the file store

Scenario: the user wants to create a filestore item (or update existing one).

XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <overwrite-id>123</overwrite-id>
   <name>NewNameForFilestore</name>
   <locale>en</locale>
   <file-info>
      <name>somefile.xxx</name>
   </file-info>
</item>
CSV syntax:
#locale,#name,#filename
en,NameForFilestore,somefile.xxx

Notes: In the CSV example we would create a new filestore item (and in the XML example we would update an existing one). The system would look for derived files based on the somefile template (e.g. somefile-thumbnail.jpg, somefile-fullsize.jpg, etc.). CSV also supports editing an existing item to update derived files by using the overwrite-id tag. Note: the file-info section (or the filename label) is required in order to specify the name of the file being imported into the file store. If custom derived files need to be imported, the Look for derived files checkbox must be checked in the import configuration.

Use case #11: importing nested content/metadata

Scenario: the user wants to create a new item (or update existing item) and specify content and/or metadata sections and/or attributes.

XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<item>
   <name>a name</name>
   <description>a description</description>
   <keywords>some keywords</keywords>
   <locale>en</locale>
   <metadata>
      <item>
         <expiry-date>24.12.2006 18:00</expiry-date>
      </item>
   </metadata>
   <content>
      <item>
         <bicycle_rack>
            <shmack>
               <glack>a glack</glack>
            </shmack>
            <rack>a rack</rack>
         </bicycle_rack>
         <body>bodyval</body>
      </item>
   </content>
</item>
CSV syntax:
#name,#description,#keywords,#locale,$/expiry-date,/bicycle_rack/shmack/glack,/bicycle_rack/rack,/body
a name,a description,some keywords,en,24.12.2006 18:00,a glack,a rack,bodyval

Notes: in the XML example, we've created a new item with certain content and metadata attributes (the structure is straightforward when compared with the collection schema's content and/or metadata structure). In the CSV example, the same data is replicated. Note how expiry-date (which is a metadata attribute) has a prepended $ sign; in the CSV example its value is set to 24.12.2006 18:00.

XML and CSV syntax improvements from earlier versions

The following summarises the differences in syntax between the 4.3 and 4.3.1 import files (XML and CSV). An important thing to note is that the 4.3.1 syntax maintains backwards compatibility for both CSV and XML files. For more details, see CSV file syntax and Use Cases. For XML file syntax, see XML file syntax.

  • Specify the primary item if the imported one is secondary
    • XML, v4.3: primary-id (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: #primary-id
  • Specify the UNIFY ID of the item you want to update
    • XML, v4.3: overwrite-id (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: #overwrite-id
  • Update an item matched by a custom attribute whose id is specified
    • XML, v4.3: ref-attribute, id (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: #ref-attribute#number, #ref-attribute-id#number
  • Update item name
    • XML, v4.3: name (mandatory); v4.3.1+: name (not mandatory)
    • CSV, v4.3: name; v4.3.1+: name or #name
  • Update item description
    • XML, v4.3: description (mandatory); v4.3.1+: description (not mandatory)
    • CSV, v4.3: description; v4.3.1+: description or #description
  • Update item keywords
    • XML, v4.3: keywords (not mandatory); v4.3.1+: same
    • CSV, v4.3: keywords; v4.3.1+: keywords or #keywords
  • Update item content
    • XML, v4.3: content* (not mandatory); v4.3.1+: same
    • CSV, v4.3: the corresponding attribute name had to be specified in the CSV header; v4.3.1+: same*
  • Update item metadata
    • XML, v4.3: metadata* (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: the $ symbol has to be prepended to the CSV header, or the legacy compatibility checkbox must be checked*
  • Specify derived files
    • XML, v4.3: file-info, name, mime-type (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: the importer looks for possible derived files next to the imported CSV file if the relevant import config is set
  • (Un)bind the imported item to an existing taxonomy category
    • XML, v4.3: categories (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: #add_taxonomy_category, #remove_taxonomy_category
  • (Un)link the imported item to another item [if the item link definition config is set to read from source]
    • XML, v4.3: item-links, item-link(definition-id, link-mode), vyre-id, ref-attribute (not mandatory); v4.3.1+: same
    • CSV, v4.3: n/a; v4.3.1+: #item-link-vyreid#number, #item-link-defid#number, #item-link-mode#number={add|replace}
  • (Un)link the imported item to a user [if the user link definition config is set to read from source]
    • XML, v4.3: n/a; v4.3.1+: user-links, user-link(definition-id, link-mode), profile-id, user-id (not mandatory)
    • CSV, v4.3: n/a; v4.3.1+: #user-link-profileid#number, #user-link-user-id#number, #user-link-defid#number, #user-link-mode#number={add|replace}
  • Specify the collection schema the item is to be imported into
    • XML, v4.3: item filestore/datastore (mandatory); v4.3.1+: the attribute is no longer evaluated as the store is specified in the configuration
    • CSV, v4.3: n/a; v4.3.1+: n/a

* 4.3.1.1 allows specifying nested attributes - see the Use Case examples. The legacy compatibility checkbox was introduced in 4.3.1.2.

Important notes

  1. The system no longer supports survey imports. This feature, if required, will be reintroduced in later versions.
  2. The locale in both CSV and XML is not mandatory for primary items, but is mandatory for secondaries (the collection schema can have more than one localisation and the locale tells the importer which localisation is meant to be used; for a primary item the schema's locale is assumed automatically).

Advanced Topics

Tracking the import

Logging and Tracing

Most of the events happening during the import process are not only shown on the screen, but also logged in the Unify log file (vyre.log). If you are experiencing problems, enable debug-level logging for all the importer classes. This can be done in the backend by going to Configuration -> Logging level and selecting the Debug level for the following classes:

  • vyre.content.io.ContentImportConfigurationImporter
  • all classes in packages whose names start with vyre.content.io.importer.

It is also possible to monitor the import duration by setting the log level of vyre.content.io.ContentImportConfigurationImporter to Trace. An example of the relevant log statements is shown below:

2007-09-11 10:01:04,004 [pool-1-thread-4] TRACE vyre.content.io.ContentImportConfigurationImporter - Started: ContentItemImporter
2007-09-11 10:01:16,337 [pool-1-thread-4] TRACE vyre.content.io.ContentImportConfigurationImporter - Finished ContentItemImporter in: 11,968.213 milliseconds.

Tracking the process

Content Import queue showing single finished import process

When an import process is finished, it stays in memory for at least one minute AND until a subsequent import process is started. This is true for versions 4.3.1.1 to 4.3.1.6. Starting from version 4.3.1.7 (see UNF-1816) the importer only stays in memory for a minute after it has finished. Terminating the process immediately is not feasible because the status reporting needs time to complete (the status is retrieved via AJAX and is asynchronous from the actual import); however, keeping the finished process around for longer could result in a memory leak.

Note (1): as the queue page is static, a race condition might occur when stopping an import which has already finished. In such a case, a warning would be shown, but it would obviously be too late to roll back the data.

Note (2): this queue will also include any currently running bulk edits in its list. The import queue feature was added in 4.3.1.6; see UNF-1592 for more details.

Statuses

It is also possible to see all currently running imports together with their statuses. For this, use the Content Module -> Import queue feature. From this screen one should be able to stop an import prematurely. Such a stoppage prevents the data from being committed to the database.

An import, either running or terminated/finished, can have the following statuses:

  • Waiting for another process to finish on the same store - when another import, either scheduled or not, has currently locked the same store this import needs to access;
  • Not started - when the import is not yet started (e.g. searching for the files to import, etc.);
  • Validating - when the importer is validating XML or CSV file for syntactical correctness;
  • Importing - when the importer is converting XML/CSV data into its own internal structures;
  • Committing - when committing the database transaction;
  • Finished successfully - when the process is finished successfully (import is still hanging around waiting to release the resources);
  • Finished successfully (with warnings) - same as above, the only difference being that there were (ignored) warnings when running the import process;
  • Finished successfully (some items might not be imported due to errors) - same as above, the only difference being that some of the items were not committed to the database because of errors in those items;
  • Not finished due to a warning. No items were imported.
  • Not finished due to an error. No items were imported.

Link Import methods: link to same items as current user

This option needs a good illustration; see picture below:

Let's assume we have collection schemas CollectionSchema1 and CollectionSchema2, and item link definition (ItemLinkDefinition1) between them. We also have user link definition UserLinkDefinition between realm RealmConfiguration2 and CollectionSchema2 (it's the same schema, but it is shown twice in the picture). CollectionSchema2 has existing items CollectionSchema2Item1, CollectionSchema2Item2 and CollectionSchema2Item3, which are linked to a user in RealmConfiguration2 (not shown in the picture).

If selected, this option makes newly created items (NewlyImportedItem) link to CollectionSchema2Item1, CollectionSchema2Item2 and CollectionSchema2Item3; or, in generic terms, it links to all items of the given user link definition which match the given item link definition. For example, if a user is linked to an Agency (data store item) and that user uploads a new asset, we should be able to specify that the asset inherits the agency item link from the user. For more information on this feature, check the original JIRA report (UNF-30).

Diagram showing affected relationships when using link to same items as current user link method

Escaping of commas and forward slashes when specifying taxonomy categories (4.3.1.1+ only)

When importing an item, taxonomy paths can contain multiple paths to be added, i.e.

/first/path/to/add,/second/path/to/add

and must be separated by commas. But what happens if a comma is contained in the taxonomy name? In this case, the parser would think you are passing two paths, as in the above example. To get around this problem, comma literals must be escaped using a backslash, e.g.

/a/path/with\,a/comma/name/inside

Such a path would be understood as a single path. The same approach can be used to escape forward slashes ("/") with backslashes, if a name contains any. For example, the name

/taxonomy/category\/with forward/slashes

would be interpreted as having three components: "taxonomy", "category/with forward" and "slashes". Now, what if you just want to have the symbol "\," in a taxonomy name? The same answer applies: the backslash needs to be escaped by another backslash:

\\,taxonomy/name/starting/with/backslash/and/comma

The first component of such a taxonomy path would be "\,taxonomy".

Running more than one import at a time

From Unify version 4.3.1.8, it is not possible to run more than one instance of the importer against the same collection schema at the same time (see UNF-1589). The second and all subsequent importer instances simply sit in the queue, waiting for their counterparts to finish. This includes GUI-triggered imports, scheduled imports and bulk edits that address the same collection schema, as well as any combination of these.

The reason behind this is that the entire import is a single database transaction: if a single item import fails, no other data for that import is written to the database. So when an import is in progress, data to be committed is accumulated in a buffer and a lock is held on the table. If another import behaves in the same manner, it tries to acquire the lock and the operation times out rather quickly, resulting in an exception being thrown. Such exceptions are hard to handle.

Another typical overrun case is when two scheduled imports start to compete with each other. E.g. when the first batch of the first importer finishes, it waits for a special signal from the indexing thread. This signal literally means "all items for import configuration X are now indexed", so the importer can start processing the second batch. If there is another import process running at the same time with the same configuration id, the signal is sent to both importers, resulting in premature action, or both importers end up waiting as the signal is only fired once. This gets even worse if the number of concurrent imports exceeds 2.

Technically, there are no essential reasons preventing more than one importer from running concurrently per se, but due to low demand and difficult implementation specifics this limitation remains in place for now.

Importing XML data with wildcard tags

In order to import data that contains < and/or > characters, it must all be put into CDATA sections, e.g.:

<?xml version="1.0" encoding="utf-8"?>
<item>
...
   <content>
      <item>
         <some_attribute>
            <![CDATA[<p>data inside a tag</p>]]>
         </some_attribute>
      </item>
   </content>
</item>

This feature only works starting with version 4.3.1.7, see UNF-1757 for more details.

Future improvements

  • Introduce the capability to roll back the outcome of a particular import - this should be rather easy as we keep both the XML and database history of the item;
  • Unify file handling for both scheduled and non-scheduled imports;
  • Improve import logging and make a separate file or database table that would contain verbose information about the import process;
  • Introduce a watchdog which would e-mail an import report after the process is finished (relevant for large data sets).
  • At the moment, any kind of error, regardless of its severity, stops the entire process (or part of the process if the import is scheduled). There is an open possibility to introduce different thresholds in the future (e.g. info, warning, error, fatal) so that the user would be able to define which types of errors allow the import to stop and/or continue.
  • Add the capability to allow users to specify which fields to ignore. For example, if you have 1000 items and you want to update field A for some, field B for others and both fields for the rest, you could still do this from one file. This could be done via a special character that is read by the importer as an 'ignore' value.
  • Speed up CSV label parsing a little (cache the CSV labels of a file instead of parsing them for each line).
  • Make sure XML files are deleted if the import failed.
  • Try to find a way for multiple imports to run at the same time.

Questions, comments, issues?

All should be sent to support@vyre.com.

Created by: Mindaugas Žakšauskas, 24/4/2008 3:25PM

Last modified by: --Mindas 16:56, 30 January 2009 (GMT)
