Technical documentation - Import documents

The DocumentConfigs section can contain one or more document import configurations. The attribute "config-name" is the name of the configuration, and the attribute "files-extensions" defines which file types are allowed to be imported.
For example, to import only PDF files, specify files-extensions="*.pdf". To import more than one file type, separate the extensions with a double pipe (two vertical bars), for example "*.pdf||*.doc".
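
As a sketch of the two points above, the fragment below declares two configurations in one DocumentConfigs section. The config names "PdfOnly" and "PdfAndWord" are hypothetical, and the child elements (SingleDirectory, SubDirectory, Transformations) are left out for brevity; a working configuration would presumably need them.

```xml
<DocumentConfigs>
  <!-- Hypothetical config name: accepts PDF files only. -->
  <DocumentsConfig config-name="PdfOnly" files-extensions="*.pdf">
    <!-- SingleDirectory, SubDirectory and Transformations omitted in this sketch. -->
  </DocumentsConfig>
  <!-- Hypothetical config name: accepts PDF and Word files, separated by a double pipe. -->
  <DocumentsConfig config-name="PdfAndWord" files-extensions="*.pdf||*.doc">
    <!-- SingleDirectory, SubDirectory and Transformations omitted in this sketch. -->
  </DocumentsConfig>
</DocumentConfigs>
```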

A document import configuration contains two parts: a configuration for a single directory and a configuration for importing documents from sub-directories.

<SingleDirectory type="Single">

<SubDirectory type="SubDirectories">

Documents import config

<DocumentConfigs>
  <DocumentsConfig config-name="Default" files-extensions="*.*">
    <SingleDirectory type="Single">
      <DocumentIdentification identification-source="Filename" name-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]*)_[A-Za-z0-9\--\s]*" description-regex="" persistent-identity-regex="([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]*_[A-Za-z0-9\--\s]*" identity-regex="([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]*_[A-Za-z0-9\--\s]*" />
      <LangaugeIdentification identification-source="Filename" regex="[A-Za-z0-9\--\s]+_[A-Za-z0-9\--\s]*_([A-Za-z0-9\--\s]*)">
        <Language alias="en-US" culture="en-US" />
      </LangaugeIdentification>
      <SpecificationsIdentification identification-source="Filename" type="type" value-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]*)_[A-Za-z0-9\--\s]*" />
    </SingleDirectory>
    <SubDirectory type="SubDirectories">
      <DocumentIdentification identification-source="Filename" name-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]*)_[A-Za-z0-9\--\s]*" description-regex="" persistent-identity-regex="([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]*_[A-Za-z0-9\--\s]*" identity-regex="([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]*_[A-Za-z0-9\--\s]*" />
      <LangaugeIdentification identification-source="DirectoryName" regex="">
        <Language alias="en-US" culture="en-US" />
        <Language alias="Invariant" culture="" />
      </LangaugeIdentification>
      <SpecificationsIdentification identification-source="Filename" type="type" value-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]*)_[A-Za-z0-9\--\s]*" />
    </SubDirectory>
    <Transformations src-file-extension="" style-name="" />
  </DocumentsConfig>
</DocumentConfigs>

Document identity, language and transformation

Both the single-directory configuration and the sub-directories configuration consist of three parts:

1. Document identification
2. Language identification
3. Document transformation

Documents and document languages are identified from the document filename or the directory name. For both document identification and language identification, a regular expression can be applied to fetch the required string values from the filename or directory name, which can be up to 255 characters long.
In the name-regex example below, filenames are divided into three segments of information (identity, name and language) separated by underscores (_), e.g. identity_name_language. If a filename is "1234_myDoc_sv", then "1234" is interpreted as the persistent identity, "myDoc" is the name of the document, and "sv" is the language, which is mapped to sv-SE. Language mapping is optional. In the document import configuration example, identity and persistent identity use the same segment of the string. The persistent identity ensures that no duplicate document is created: if the document already exists, the same document is updated or replaced. If the filename has no language defined, e.g. "identity_name_", the document is imported with the invariant language.

Documents import config

<DocumentIdentification identification-source="Filename" name-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]+"
  description-regex="" persistent-identity-regex="([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]+_[A-Za-z0-9\--\s]+" />
<LangaugeIdentification identification-source="Filename" regex="[A-Za-z0-9\--\s]+_[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]+)">
  <Language alias="en" culture="en-US" />
  <Language alias="sv" culture="sv-SE" />
</LangaugeIdentification>
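
The + quantifier in the snippet above requires at least one character in each segment, so a filename with an empty language segment, such as "identity_name_", would presumably not match. A possible variant, sketched below, uses * for the segments that may be empty, as the full configuration earlier on this page does, and adds an Invariant language entry like the one shown there for sub-directories. Whether that entry is needed for the fallback to the invariant language is an assumption; the source does not say.

```xml
<!-- Sketch: * instead of + lets the name and language segments be empty. -->
<LangaugeIdentification identification-source="Filename" regex="[A-Za-z0-9\--\s]+_[A-Za-z0-9\--\s]*_([A-Za-z0-9\--\s]*)">
  <Language alias="en" culture="en-US" />
  <Language alias="sv" culture="sv-SE" />
  <!-- Assumption: entry for documents without a language, copied from the sub-directory configuration. -->
  <Language alias="Invariant" culture="" />
</LangaugeIdentification>
```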

Document transformations are linked to the transformation configs section of import.config.

The transformation config below tells the import to apply a transformation style. The transformation style transforms the document file from its original format to the specified format, and the transformed file is then imported.
The source document file itself is not changed; it remains available at the same location as before.

Documents import config

<Transformations src-file-extention=".xml" style-name="htmlDocumentStyle" />

The following configuration tells the import to identify the language from the directory name. A good example is importing the structure shown in the image below.

Documents import config

<SubDirectory type="SubDirectories">
  <DocumentIdentification identification-source="Filename" name-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]+"
    description-regex="" persistent-identity-regex="([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]+_[A-Za-z0-9\--\s]+" />
  <LangaugeIdentification identification-source="DirectoryName" regex="">
    <Language alias="en-US" culture="en-US" />
  </LangaugeIdentification>
  <SpecificationsIdentification identification-source="Filename" type="type" value-regex="[A-Za-z0-9\--\s]+_([A-Za-z0-9\--\s]+)_[A-Za-z0-9\--\s]+" />
</SubDirectory>


The structure above is an example of a document import of type "sub-directories". In this case, each sub-directory represents a language and contains document files. Regular expressions and language mappings can also be used, in the same way as with the import type "single directory".
Documents can be imported by right-clicking the documents repository.
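
As an illustration of the sub-directories import described above, the sketch below assumes a hypothetical layout with one sub-directory per language and a language identification that maps those directory names to cultures. The directory names, the filenames in the comment, and the idea that each alias is compared with the directory name when regex is empty are assumptions, not confirmed by the source.

```xml
<!--
  Hypothetical directory layout (all names are assumptions):
    documents/
      en-US/1234_myDoc_en.pdf
      sv-SE/1234_myDoc_sv.pdf
-->
<LangaugeIdentification identification-source="DirectoryName" regex="">
  <!-- Assumed: each alias is compared with the sub-directory name. -->
  <Language alias="en-US" culture="en-US" />
  <Language alias="sv-SE" culture="sv-SE" />
</LangaugeIdentification>
```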