File Metadata Extraction | Anjana Data Documentation

Introduction

The purpose of this document is to explain which types of files can have metadata extracted, their particularities, and what information is extracted from each.

The names listed for each file type are the names of the attributes that must exist in Anjana (name in the attribute_definition table) in the templates of the objects from which information is to be extracted.

Path separator

The path separator (or extraction path separator) is a configurable parameter. The default is the character “/”. If the names of the data structures to be extracted contain that character, a different separator can be configured so that those structures are extracted correctly. The same value must be configured in Kerno, in Tot and in every plugin in use. The properties to modify are:

In Kerno:
- anjana.tot.extraction.pathSeparator: “/”
In Tot:
- tot.extraction.pathSeparator: “/”
In plugins:
- PowerBI and Tableau: totplugin.pathSeparator: “/”
- Any other plugin: totplugin.connection.[<connectionName>].technology.pathSeparator: “/”
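To illustrate why a different separator helps, here is a minimal Python sketch. The function split_extraction_path and the sample paths are hypothetical and are not part of Anjana; they only show how a structure whose name contains “/” is split incorrectly with the default separator and stays intact with another one.

```python
def split_extraction_path(path: str, separator: str = "/") -> list[str]:
    """Split an extraction path into its non-empty segments.

    Hypothetical helper for illustration; not Anjana's implementation.
    """
    if not separator:
        raise ValueError("separator must not be empty")
    return [segment for segment in path.split(separator) if segment]


# With the default separator, a table named "sales/2024" is split in two.
print(split_extraction_path("/schema/sales/2024"))
# With a separator that does not occur in names, the table name stays intact.
print(split_extraction_path("|schema|sales/2024", separator="|"))
```

Whichever character is chosen, it must match the value configured in Kerno, Tot and the plugins.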

File types

CSV

CSV files are identified by their “.csv” extension. Each column is interpreted as a dataset_field, for which the following information is populated:

- name: the field name
- physical_name: the field name
- fieldDataType: the data type defined for the field (boolean, number, string or date)
- position: the position of the field

The supported separators for CSV extraction are comma (,), semicolon (;) and tab.
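The mapping above can be sketched in Python as follows. This is only an illustration under stated assumptions: the type inference rules, the use of the first data row as the sample, and the zero-based position are choices made for this example, and the source does not describe how Anjana actually infers types or numbers positions.

```python
import csv
import io
from datetime import date

# The three separators the document lists as supported.
SUPPORTED_DELIMITERS = (",", ";", "\t")


def infer_type(value: str) -> str:
    """Classify a sample value as boolean, number, date or string (assumed rules)."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        date.fromisoformat(text)
        return "date"
    except ValueError:
        return "string"


def describe_csv(text: str, delimiter: str = ",") -> list[dict]:
    """Build one dataset_field-like dict per column from the header and first row."""
    if delimiter not in SUPPORTED_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if len(rows) < 2:
        raise ValueError("need a header row and at least one data row")
    header, sample = rows[0], rows[1]
    return [
        {
            "name": column,
            "physical_name": column,
            "fieldDataType": infer_type(sample[index]),
            "position": index,  # zero-based here; an assumption
        }
        for index, column in enumerate(header)
    ]


sample_csv = "id;active;created;label\n1;true;2024-01-31;abc\n"
for field in describe_csv(sample_csv, delimiter=";"):
    print(field)
```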

AVRO

AVRO files are identified by their “.avro” extension. These files can be single or partitioned.

Each column is interpreted as a dataset_field, for which the following information is populated:

- name: the field value
- physical_name: the field name
- defaultValue: the default value defined for the field
- fieldDataType: the data type defined for the field (record, enum, array, map, union, fixed, string, bytes, int, long, float, double, boolean or null)
- position: the position of the field
- description: the field description
- alias: the aliases the field has

In addition to these values present in every field of an AVRO file, extra properties can be added; all included properties will be collected and extracted.
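As an illustration, an Avro schema is a JSON document in which each field can carry name, type, default, doc and aliases, plus arbitrary extra properties. The sketch below reads such a schema with the Python standard library only. The function describe_avro_fields and the extra_properties key are hypothetical names chosen for this example, not Anjana's API; name is filled with the field name here for simplicity, and the zero-based position is likewise an assumption.

```python
import json

# Schema keys that the sketch maps to named attributes; everything else is "extra".
MAPPED_KEYS = {"name", "type", "default", "doc", "aliases"}


def describe_avro_fields(schema_json: str) -> list[dict]:
    """Map each field of an Avro record schema to a dataset_field-like dict."""
    schema = json.loads(schema_json)
    described = []
    for position, field in enumerate(schema["fields"]):
        avro_type = field["type"]
        if isinstance(avro_type, list):      # a JSON array denotes a union
            type_name = "union"
        elif isinstance(avro_type, dict):    # complex type such as array, map or record
            type_name = avro_type["type"]
        else:                                # primitive or named type given as a string
            type_name = avro_type
        described.append({
            "name": field["name"],
            "physical_name": field["name"],
            "defaultValue": field.get("default"),
            "fieldDataType": type_name,
            "position": position,  # zero-based here; an assumption
            "description": field.get("doc"),
            "alias": field.get("aliases", []),
            # Hypothetical key holding every property not mapped above.
            "extra_properties": {k: v for k, v in field.items() if k not in MAPPED_KEYS},
        })
    return described


example_schema = """
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long", "doc": "User id", "pii": false},
  {"name": "email", "type": ["null", "string"], "default": null, "aliases": ["mail"]}
]}
"""
for field in describe_avro_fields(example_schema):
    print(field)
```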

EXCEL

Excel files are identified by their “.xls” and “.xlsx” extensions. Each column is interpreted as a dataset_field, for which the following information is populated:

- name: the field value
- physical_name: the field name
- fieldDataType: the data type defined for the field (string, boolean or number)
- position: the position of the field
- description: the field description

PARQUET

Parquet files are identified by their “.parquet” extension. These files can be single or partitioned.

Each column is interpreted as a dataset_field, for which the following information is populated:

- name: the field value
- physical_name: the field name
- fieldDataType: the data type defined for the field (int64, int32, boolean, binary, float, double, int96 or fixed_len_byte_array)
- position: the position of the field
- nullable: whether the field is nullable
- length: the field length

Only fields of primitive types are extracted.

Organization standards

HADOOP

Allowed directory types:

With a single file at the end

/folder_1_lvl1
    /folder_1_1_lvl2
        file.extension

With several parts of the same file at the end

/folder_1_lvl1
    /fecha=feb
        part.000001.nombre.parquet
        part.000002.nombre.parquet
    /fecha=march
        part.000003.nombre.parquet
        part.000004.nombre.parquet

All files in the same directory whose names have the same number of characters are treated as parts of the same file.
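The grouping rule stated above can be sketched in Python as follows. The function group_parts_by_name_length is a hypothetical helper written for this example; it only restates the rule (same directory, same name length) and does not reproduce Anjana's implementation.

```python
from collections import defaultdict


def group_parts_by_name_length(file_names: list[str]) -> dict[int, list[str]]:
    """Group the file names of one directory by the length of the name."""
    groups: dict[int, list[str]] = defaultdict(list)
    for name in sorted(file_names):
        groups[len(name)].append(name)
    return dict(groups)


directory_listing = [
    "part.000002.nombre.parquet",
    "part.000001.nombre.parquet",
    "other.csv",
]
for length, names in group_parts_by_name_length(directory_listing).items():
    print(length, names)
```

One consequence of this rule is that two unrelated files whose names happen to have the same length would end up in the same group, which is one reason to follow the naming convention consistently.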

Allowed file types:

Parquet
Avro
CSV
Excel

Naming convention

Files

Files must be named part.000000.name.extension when they are partitioned and name.extension when they are complete. The zeros are replaced with the part number to indicate that the file is part X of a larger file. For example, a file with 2 parts would consist of part.000001.name.extension and part.000002.name.extension.
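A minimal Python sketch of this convention, assuming the part number is always six digits as in the examples (the source does not state the width explicitly). The patterns and the function parse_file_name are hypothetical and only illustrate how the two name shapes can be told apart.

```python
import re

# part.<six digits>.<name>.<extension>; the six-digit width is an assumption.
PART_PATTERN = re.compile(r"^part\.(?P<number>\d{6})\.(?P<name>.+)\.(?P<extension>[^.]+)$")
# <name>.<extension> for complete files.
WHOLE_PATTERN = re.compile(r"^(?P<name>.+)\.(?P<extension>[^.]+)$")


def parse_file_name(file_name: str) -> dict:
    """Return name, extension and part number (None for complete files)."""
    match = PART_PATTERN.match(file_name)
    if match:
        return {
            "name": match["name"],
            "extension": match["extension"],
            "part": int(match["number"]),
        }
    match = WHOLE_PATTERN.match(file_name)
    if match:
        return {"name": match["name"], "extension": match["extension"], "part": None}
    raise ValueError(f"file name does not follow the convention: {file_name!r}")


print(parse_file_name("part.000002.nombre.parquet"))
print(parse_file_name("nombre.csv"))
```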

Directories

Directories must follow a pattern indicating the level of each directory, for example:

/folder_1_lvl1
    /folder_1_1_lvl2
        part.000001.nombre.parquet
        part.000002.nombre.parquet
    /folder_1_2_lvl2
        /folder_1_2_1_lvl3
            part.000001.nombre.parquet
            part.000002.nombre.parquet

Delta Lake

Allowed directory types:

With a single file at the end

/folder_1_lvl1
    /folder_1_1_lvl2
        /_delta_log
        file.parquet

Note that this layout contains a Delta Lake-specific folder (_delta_log) whose content is ignored.

With several parts of the same file at the end

/folder_1_lvl1
    /folder_1_1_lvl2
        /_delta_log
            000000.json
        part.000001.nombre.parquet
        part.000002.nombre.parquet

All files in the same directory whose names have the same number of characters are treated as parts of the same file.
Note that this layout contains a Delta Lake-specific folder (_delta_log) whose content is ignored.

With several parts of the same file at the end and with partitions

/folder_1_lvl1
    /folder_1_1_lvl2
        /_delta_log
            000000.json
        /partition
            part.000001.nombre.parquet
            part.000002.nombre.parquet
        part.000001.nombre.parquet
        part.000002.nombre.parquet

All files in the same directory whose names have the same number of characters are treated as parts of the same file.
Note that this layout contains a Delta Lake-specific folder (_delta_log) whose content is ignored.

Allowed file types:

Parquet
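The two rules for Delta Lake (the content of _delta_log is ignored, and only Parquet files are allowed) can be sketched in Python as a filter over a list of paths. The function select_delta_data_files and the sample paths are hypothetical; the sketch works on path strings rather than on a real file system and is not Anjana's implementation.

```python
from pathlib import PurePosixPath

DELTA_LOG_FOLDER = "_delta_log"


def select_delta_data_files(paths: list[str]) -> list[str]:
    """Keep .parquet files and drop everything located inside _delta_log."""
    selected = []
    for raw_path in paths:
        path = PurePosixPath(raw_path)
        if DELTA_LOG_FOLDER in path.parts:
            continue  # Delta Lake transaction log: ignored
        if path.suffix != ".parquet":
            continue  # only Parquet is an allowed file type here
        selected.append(raw_path)
    return selected


listing = [
    "/folder_1_lvl1/folder_1_1_lvl2/_delta_log/000000.json",
    "/folder_1_lvl1/folder_1_1_lvl2/part.000001.nombre.parquet",
    "/folder_1_lvl1/folder_1_1_lvl2/part.000002.nombre.parquet",
]
for data_file in select_delta_data_files(listing):
    print(data_file)
```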

Naming convention

Files

Files must be named part.000000.name.parquet when they are partitioned and name.parquet when they are complete. The zeros are replaced with the part number to indicate that the file is part X of a larger file. For example, a file with 2 parts would consist of part.000001.name.parquet and part.000002.name.parquet.

Directories

Directories must follow a pattern indicating the level of each directory, for example:

/folder_1_lvl1
    /folder_1_1_lvl2
        /_delta_log
            000000.json
        part.000001.nombre.parquet
        part.000002.nombre.parquet
    /folder_1_2_lvl2
        /folder_1_2_1_lvl3
            /_delta_log
                000000.json
            part.000001.nombre.parquet
            part.000002.nombre.parquet

Each directory that contains files must also include the _delta_log folder, as well as another folder named partition that contains files.