Analyse Service

tl;dr

Debug Information

Image: registry.datalab.tuwien.ac.at/dbrepo/analyse-service:1.4.7

  • Ports: 5000/tcp
  • Prometheus: http://<hostname>:5000/metrics
  • Health: http://<hostname>:5000/health
  • Swagger UI: http://<hostname>:5000/swagger-ui/

To access the service directly in Kubernetes (e.g. for debugging), forward the service port to your local machine:

kubectl [-n namespace] port-forward svc/analyse-service 5000:80
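
With the port-forward running, the endpoints listed above become reachable on localhost. The following is a minimal sketch for probing them from Python. The helper names service_urls and is_healthy are hypothetical, and the sketch assumes that a healthy service answers the health endpoint with HTTP 200.

```python
import urllib.request
from urllib.error import URLError


def service_urls(host: str = "localhost", port: int = 5000) -> dict:
    """Build the debug endpoints listed above for a given host and port."""
    base = f"http://{host}:{port}"
    return {
        "metrics": f"{base}/metrics",
        "health": f"{base}/health",
        "swagger": f"{base}/swagger-ui/",
    }


def is_healthy(host: str = "localhost", port: int = 5000, timeout: float = 3.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200 (assumption)."""
    url = service_urls(host, port)["health"]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "not reachable or unhealthy")
```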

Overview

The Analyse Service suggests data types for the User Interface when a table is created from a comma-separated values (CSV) file. It recommends enumerations for columns and returns, for example, a list of potential primary key candidates. The researcher can confirm these suggestions manually. Moreover, the Analyse Service determines basic statistical properties of numerical columns.
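
To illustrate what such suggestions could look like, here is a minimal sketch using Pandas. It is not the service's actual logic: the function names, the max_enum_values threshold and the rules themselves (unique and complete columns as primary key candidates, non-numeric columns with few distinct values as enumeration candidates) are assumptions made for this example.

```python
import io

import pandas as pd


def suggest(df: pd.DataFrame, max_enum_values: int = 5) -> dict:
    """Suggest primary key and enumeration candidates (illustrative heuristic)."""
    primary_keys = []
    enums = []
    for name in df.columns:
        column = df[name]
        distinct = column.nunique(dropna=True)
        # Primary key candidate: no missing values and every value distinct
        if column.notna().all() and column.is_unique:
            primary_keys.append(name)
        # Enumeration candidate: a non-numeric column repeating a few values
        elif not pd.api.types.is_numeric_dtype(column) and 0 < distinct <= max_enum_values:
            enums.append(name)
    return {"primary_keys": primary_keys, "enums": enums}


def numeric_stats(df: pd.DataFrame) -> dict:
    """Basic statistics (count, mean, std, min, quartiles, max) per numeric column."""
    return df.select_dtypes(include="number").describe().to_dict()


# Hypothetical example data
sample = pd.read_csv(io.StringIO(
    "id,colour,weight\n"
    "1,red,1.5\n"
    "2,blue,2.5\n"
    "3,red,1.5\n"
    "4,blue,4.0\n"
))

if __name__ == "__main__":
    print(suggest(sample))
    print(numeric_stats(sample)["weight"])
```

In this sketch the suggestions are only candidates, which matches the workflow above where the researcher confirms them manually.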

Analysis

After the CSV-file has been uploaded into the dbrepo-upload bucket of the Storage Service, the analysis for data types and primary keys proceeds as follows:

  1. Retrieve the CSV-file from the dbrepo-upload bucket of the Storage Service as a data stream (i.e. nothing is stored in the service) with the boto3 client.
  2. When no separator is known, the Analyse Service tries to guess the separator from the first line with csv.Sniffer().sniff(...). This step is skipped when the separator is provided in the HTTP payload: {"separator": ";", ...}
  3. With the separator known (either from step 2 or from the HTTP payload), Pandas guesses the headers and column types by analysing the first 10,000 rows; enums are guessed as well if the HTTP payload contains {"enum": true, ...}. The data type is guessed by a combination of Pandas and heuristics.
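
Steps 1 and 2 can be sketched as follows. This is an illustration under assumptions, not the service's code: the function names, the max_bytes limit, the candidate delimiter list and the object key example.csv are hypothetical. The client argument is expected to behave like a boto3 S3 client, whose get_object(Bucket=..., Key=...) returns a dict with a readable "Body" stream; a small in-memory stand-in is used here so that the example runs without a Storage Service.

```python
import csv
import io


def guess_separator(client, bucket: str, key: str, max_bytes: int = 64 * 1024) -> str:
    """Read the start of an object as a stream and guess its separator."""
    # get_object returns a dict whose "Body" is a file-like stream
    body = client.get_object(Bucket=bucket, Key=key)["Body"]
    head = body.read(max_bytes).decode("utf-8", errors="replace")
    first_line = head.splitlines()[0] if head else ""
    # Raises csv.Error when no separator can be determined
    dialect = csv.Sniffer().sniff(first_line, delimiters=",;\t|")
    return dialect.delimiter


def resolve_separator(payload: dict, client, bucket: str, key: str) -> str:
    """Prefer the separator from the HTTP payload, otherwise sniff it."""
    return payload.get("separator") or guess_separator(client, bucket, key)


class InMemoryClient:
    """Stand-in for a boto3 S3 client, for demonstration only."""

    def __init__(self, content: bytes):
        self._content = content

    def get_object(self, Bucket: str, Key: str) -> dict:
        return {"Body": io.BytesIO(self._content)}


if __name__ == "__main__":
    demo = InMemoryClient(b"id;colour;weight\n1;red;1.5\n")
    print(resolve_separator({}, demo, "dbrepo-upload", "example.csv"))
```

Against a real deployment, the stand-in would be replaced by a client created with boto3.client("s3", ...) and configured for the Storage Service.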

If your datasets are larger than 10,000 rows, increase the number of rows analysed by setting the ANALYSE_NROWS variable to the desired integer.
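
The effect of the row limit can be sketched with Pandas as shown below. The sketch assumes that ANALYSE_NROWS is read from the process environment; the function names and the sample data are hypothetical, and only the Pandas part of the type guessing is shown, without the additional heuristics of the service.

```python
import io
import os

import pandas as pd


def analyse_nrows(default: int = 10_000) -> int:
    """Read ANALYSE_NROWS from the environment, falling back to the default."""
    return int(os.environ.get("ANALYSE_NROWS", default))


def guess_column_types(stream, separator: str) -> dict:
    """Let Pandas infer headers and dtypes from the first ANALYSE_NROWS rows."""
    df = pd.read_csv(stream, sep=separator, nrows=analyse_nrows())
    return {name: str(dtype) for name, dtype in df.dtypes.items()}


# Hypothetical data: the third row holds text in an otherwise numeric column
SAMPLE_CSV = "id;weight\n1;1.5\n2;2.5\nx;3.5\n"

if __name__ == "__main__":
    os.environ["ANALYSE_NROWS"] = "2"
    # Only two rows are inspected, so the text value in row three is not seen
    print(guess_column_types(io.StringIO(SAMPLE_CSV), ";"))
```

The sample shows why the limit matters: a value that only appears after the analysed rows cannot influence the guessed type.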

Examples

See the usage page for examples.

Limitations

Are you missing functionality? Do these limitations affect you?

We welcome contributors to open-source software and strongly encourage you to help us implement what is missing. Get in contact with us; we happily answer requests for collaboration that include a CV and a description of your programming experience!

Security

  1. Credentials for the Storage Service are stored in plaintext environment variables.