Cartolabe-data
Cartolabe-data is the data processing part of the Cartolabe project. It contains utility functions to
retrieve data from the HAL open archive API
extract entities (authors, teams, labs, words) from a set of documents
reduce dimensions and project on a 2D space
create named clusters
identify nearest neighbors for each entity
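To make the last step concrete, here is a minimal sketch of a nearest-neighbor lookup over entities that have already been projected onto a 2D space. It is plain Python with brute-force distances and is not the implementation used by cartolabe-data; the function name, entity names and coordinates are all invented for illustration.

```python
import math


def nearest_neighbors(points, k=2):
    """Return, for each entity, the names of its k closest other entities.

    `points` maps an entity name to its (x, y) position in the 2D projection.
    Every pair is compared (brute force), which only suits a toy example.
    """
    neighbors = {}
    for name, position in points.items():
        # Distance from this entity to every other entity.
        distances = [
            (math.dist(position, other_position), other)
            for other, other_position in points.items()
            if other != name
        ]
        distances.sort()
        neighbors[name] = [other for _, other in distances[:k]]
    return neighbors


# Hypothetical entities and coordinates, for illustration only.
points = {
    "author_a": (0.0, 0.0),
    "author_b": (1.0, 0.0),
    "team_x": (0.0, 2.0),
    "lab_y": (5.0, 5.0),
}
result = nearest_neighbors(points, k=2)
print(result["author_a"])  # ['author_b', 'team_x']
```

Brute force scales quadratically with the number of entities, so a dedicated nearest-neighbor index is the usual choice for real datasets.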
Installation
Note: We recommend the use of a Python virtual env manager like conda or virtualenv.
First clone the source code:
git clone https://gitlab.inria.fr/cartolabe/cartolabe-data.git
cd cartolabe-data
The steps below describe three alternatives: a Conda environment built from the environment.yml file, a Conda environment created manually, or a Python virtual environment.
To create a Conda environment from the `environment.yml` file:
conda env create -f environment.yml
conda activate cartodata-env # activate environment
This creates a Conda environment named cartodata-env and installs the cartolabe-data package.
To create a Conda environment manually:
conda create -n cartodata-env python==3.10.10
conda activate cartodata-env
nmslib cannot be installed with pip. To install nmslib:
conda install -c conda-forge nmslib
Install other dependencies and cartolabe-data:
python3 -m pip install -e .
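To create a Python virtual environment (venv cannot pin a Python version; it uses the interpreter that runs the command, so run it with the Python version you want in the environment):
python3 -m venv cartodata-env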
. cartodata-env/bin/activate # activate environment
After creating the Python virtual environment, install the cartolabe-data package by running the following command from the project root directory:
python3 -m pip install -e .
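Before installing, it can help to confirm that the interpreter really runs inside an isolated environment. The following is a small heuristic sketch, not part of cartolabe-data: the function name is hypothetical, a venv is detected by comparing `sys.prefix` with `sys.base_prefix`, and an active Conda environment is assumed to set the `CONDA_DEFAULT_ENV` variable.

```python
import os
import sys


def in_isolated_environment(prefix=sys.prefix, base_prefix=sys.base_prefix,
                            environ=os.environ):
    """Heuristically tell whether Python runs in a venv or a Conda environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    in_venv = prefix != base_prefix
    # Assumption: an activated Conda environment exports CONDA_DEFAULT_ENV
    # (this is also true for the Conda "base" environment).
    in_conda = bool(environ.get("CONDA_DEFAULT_ENV"))
    return in_venv or in_conda


print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Isolated environment:", in_isolated_environment())
```

Run it with the same interpreter you intend to use for `pip install`; if it reports `False`, activate the environment first.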
Run the tests
To run the tests, install cartolabe-data with the test option (the quotes keep shells such as zsh from interpreting the brackets):
python3 -m pip install -e ".[test]"
pytest
Example Jupyter notebooks
The best way to get started with cartolabe-data is to run through the set of example notebooks in the examples directory.
To run the examples, install cartolabe-data with the examples option (the quotes keep shells such as zsh from interpreting the brackets):
python3 -m pip install -e ".[examples]"
cd examples
jupyter notebook
Docker
It is also possible to run cartolabe-data from the Docker image without cloning or installing it. The only requirement is that Docker is installed on your host.
To run an interactive container from the image:
docker run -it --network=host registry.gitlab.inria.fr/cartolabe/cartolabe-data:latest
From the command line provided by the container, you can execute the CLI commands or start the Jupyter notebooks with the command:
jupyter notebook
Then open the HTTP link printed by Jupyter in your browser.
The notebooks are in the examples directory.
CLI commands
Once installed, the cartolabe-data package provides command-line scripts which can be executed in a terminal.
1. Download data from HAL
fetch-data
The fetch-data command will extract data from the HAL Open Archive. It takes three optional parameters:
-s <str> a research organization to filter publications for; possible values are CNRS, INRIA, LISN and UPS
-f <int> the min publication year; default value is 2000
-t <int> the max publication year; current year if not specified
To fetch articles published by the CNRS between 2010 and 2020, in a terminal with the active environment where you installed the package, run
cartodata fetch-data -s CNRS -f 2010 -t 2020
Output data will be saved to the datas directory in CSV format.
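When the download is driven from a script, the same command can be assembled and validated in Python first. This is a sketch under stated assumptions: `build_fetch_command` and `fetch` are hypothetical helpers, not part of cartolabe-data; only the flags, defaults and organization names come from the option list above.

```python
import datetime
import subprocess

# Organizations listed for the -s option.
ORGANIZATIONS = {"CNRS", "INRIA", "LISN", "UPS"}


def build_fetch_command(organization=None, from_year=2000, to_year=None):
    """Build the argument list for `cartodata fetch-data` (hypothetical helper)."""
    if to_year is None:
        # Documented default: the current year.
        to_year = datetime.date.today().year
    if organization is not None and organization not in ORGANIZATIONS:
        raise ValueError(f"unknown organization: {organization!r}")
    if from_year > to_year:
        raise ValueError("the min year must not be greater than the max year")
    command = ["cartodata", "fetch-data"]
    if organization is not None:
        command += ["-s", organization]
    command += ["-f", str(from_year), "-t", str(to_year)]
    return command


def fetch(organization=None, from_year=2000, to_year=None):
    """Run the command; requires cartolabe-data to be installed."""
    subprocess.run(build_fetch_command(organization, from_year, to_year), check=True)


command = build_fetch_command("CNRS", 2010, 2020)
print(" ".join(command))  # cartodata fetch-data -s CNRS -f 2010 -t 2020
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed download raise an exception instead of passing silently.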
2. Process CSV files
Cartolabe-data provides two possible ways to process CSV files for datasets:
Pipelines: Provides an API to configure a pipeline for datasets. It is also possible to create pipelines using YAML files. Predefined pipeline YAML files are stored in the conf directory, in a subdirectory named after the dataset.
Workflows: Standard workflows for datasets (inria, hal, ups, lri, wiki, debat, software, arxiv, bib) with fixed parameters. These have not been updated recently and might not work correctly.
The recommended way to process datasets is the Pipeline API.
pipeline
The pipeline command uses one of the predefined pipelines defined in the conf directory to produce data (feather or json file) usable by cartolabe-vis. It takes 9 optional parameters:
-d <str> name of the dataset
-i <str> the name of the directory that contains the dataset.yaml and pipeline.yaml files for the dataset; default is conf
-n <str> the name of the CSV file, if it is not specified in the pipeline.yaml file or differs from the one specified there.
-v <str> current version that should be set for the dataset.
-f <bool> boolean value to indicate if the dataset will be reprocessed when it is already processed; default is false.
-s <bool> boolean value to indicate if the projection image for the dataset should be saved; default is true.
-a <bool> boolean value to indicate if aligned pipeline file will be used; default is false.
-p <str> the previous version to align for aligned processing.
-c <str> number of slices for the aligned processing.
mkdir dumps
cartodata pipeline -d lisn
This uses the pipeline.yaml file of the lisn dataset to create the pipeline and process the data.
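The non-aligned options can be mapped onto a small Python helper when the pipeline is launched from a script. This is only a sketch: `build_pipeline_command` and its parameter names are hypothetical, and it assumes that boolean options are passed as `True` or `False`, the form used for `-a` in the aligned example of this section.

```python
def build_pipeline_command(dataset, conf_dir=None, csv_name=None, version=None,
                           force=None, save_plots=None):
    """Build the argument list for `cartodata pipeline` (hypothetical helper).

    Options left at None are omitted so the command-line defaults apply.
    """
    command = ["cartodata", "pipeline", "-d", dataset]
    optional = [
        ("-i", conf_dir),    # directory holding dataset.yaml and pipeline.yaml
        ("-n", csv_name),    # CSV file name
        ("-v", version),     # version to set for the dataset
        ("-f", force),       # reprocess an already processed dataset
        ("-s", save_plots),  # save the projection image
    ]
    for flag, value in optional:
        if value is not None:
            # Assumption: booleans are written as "True" / "False".
            command += [flag, str(value)]
    return command


print(" ".join(build_pipeline_command("lisn")))  # cartodata pipeline -d lisn
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the package is installed and the dumps directory exists.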
To run an aligned pipeline:
cartodata pipeline -d lisn -a True
This uses the pipeline.yaml file of the aligned pipeline for the lisn dataset. If no previous version is specified, the process runs an initial aligned processing that divides the data into the number of slices specified in the dataset.yaml file.
To align new data with previously processed data, specify the previous dataset version to align with:
cartodata pipeline -d lisn -a True -v 2.0.1 -p 2.0.0
In this case, the directory of the previous processing, dumps/lisn/2.0.1, should exist in the workspace.
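A script can check for that directory before launching the aligned run, so that a missing workspace is reported early. The helpers below are hypothetical and not part of cartolabe-data; the `dumps/<dataset>/<version>` layout is inferred from the path named above, and the version to check is whatever the text above requires.

```python
from pathlib import Path


def version_dir(dataset, version, root="dumps"):
    """Path of a processed dataset version, assuming dumps/<dataset>/<version>."""
    return Path(root) / dataset / version


def check_workspace(dataset, version, root="dumps"):
    """Return the version directory, or raise if it does not exist."""
    directory = version_dir(dataset, version, root)
    if not directory.is_dir():
        raise FileNotFoundError(f"expected processing directory {directory}")
    return directory


def build_aligned_command(dataset, version, previous):
    """Argument list for aligning `version` with an earlier `previous` version."""
    return ["cartodata", "pipeline", "-d", dataset, "-a", "True",
            "-v", version, "-p", previous]


print(" ".join(build_aligned_command("lisn", "2.0.1", "2.0.0")))
# cartodata pipeline -d lisn -a True -v 2.0.1 -p 2.0.0
```

Calling `check_workspace("lisn", "2.0.1")` from the project root before running the command turns a missing directory into an immediate, readable error.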
Examples of pipeline usage can be found in the examples directory.
workflow
The workflow command runs one of the predefined workflows to produce data (feather or json file) usable by cartolabe-vis.
It takes one required argument (the name of one of the predefined workflows) and one optional argument (the output directory).
To run the LRI workflow, in a terminal with the active environment where you installed the package, run
mkdir dumps
cartodata workflow -o dumps/lri lri
This will run the set of instructions in the cartodata/workflows/hal module and output the results in the dumps/lri directory.
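The two shell steps above (create dumps, then run the workflow) can be wrapped in Python if workflows are launched from a script. This is a sketch with hypothetical helper names; only the `-o` option, the positional workflow name and the dumps layout come from the text above.

```python
import subprocess
from pathlib import Path


def build_workflow_command(name, output_root="dumps"):
    """Build the argument list for `cartodata workflow` (hypothetical helper)."""
    output_dir = Path(output_root) / name
    # as_posix() keeps forward slashes regardless of the platform.
    return ["cartodata", "workflow", "-o", output_dir.as_posix(), name]


def run_workflow(name, output_root="dumps"):
    """Create the output root, then run the workflow.

    Requires cartolabe-data to be installed.
    """
    Path(output_root).mkdir(exist_ok=True)  # same effect as `mkdir dumps`
    subprocess.run(build_workflow_command(name, output_root), check=True)


print(" ".join(build_workflow_command("lri")))  # cartodata workflow -o dumps/lri lri
```

Unlike a bare `mkdir dumps`, `mkdir(exist_ok=True)` does not fail when the directory already exists, which makes repeated runs simpler.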
Examples of workflow usage can be found in the examples directory.
Note: Some workflows might not be up to date and therefore might not work correctly.
About
Cartolabe is a project developed by Inria & CNRS.
It is licensed under BSD 3-Clause.