9 posts tagged with "Trevas"

Spark 4 in Trevas

· 3 min read
Nicolas Laval
Making Sense - Developer

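We are pleased to announce Trevas 2.4.0, which adds support for Apache Spark 4 through the new vtl-spark4 module.

To migrate a Spark-based client application to Spark 4, depend on fr.insee.trevas:vtl-spark4 in addition to the rest of the Trevas stack. The VTL API and behavior are unchanged; only the Spark integration layer differs.

Spark 3 is not going away. The existing vtl-spark module remains fully maintained in parallel. You can stay on Spark 3 as long as you need; there is no forced migration timeline.

See the 2.4.0 release notes and the GitHub release for the full changelog.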

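Trevas client applications: shaded ANTLR imports

Supporting both Spark 3 and Spark 4 in the same Trevas codebase also led us to improve how ANTLR is packaged: the runtime is now shaded and relocated, so Trevas and Spark no longer compete for the same org.antlr.v4 classes on the classpath.

As of Trevas 2.4.0, this note applies only to client applications that explicitly use the ANTLR API (lexers, token streams, parse trees, listeners, and so on) in their own code, outside Trevas. If your application only calls the Trevas API and never imports or manipulates ANTLR types directly, nothing changes for you.

If you do need to use ANTLR directly, whether you stay on Apache Spark 3 (vtl-spark) or migrate to Spark 4 (vtl-spark4), you must import the runtime from the relocated package namespace: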

import fr.insee.vtl.antlr.runtime.*;
import fr.insee.vtl.antlr.runtime.tree.*;
// … and any other fr.insee.vtl.antlr.* subpackages you need

Previously, code that worked directly with the parser or the ANTLR API typically used the standard ANTLR packages, for example:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

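These imports no longer match the classes Trevas provides at runtime. Trevas shades org.antlr:antlr4-runtime into the vtl-antlr artifact and relocates org.antlr.v4 to fr.insee.vtl.antlr, so that Trevas and Spark can coexist in the same JVM without loading two competing ANTLR runtimes.

What to change

  • Update all org.antlr.v4… imports in your application (and any code generated against Trevas parser types) to the corresponding fr.insee.vtl.antlr… packages.
  • Get the runtime through fr.insee.trevas:vtl-antlr (a transitive dependency of vtl-parser / vtl-engine); do not add a separate org.antlr:antlr4-runtime dependency for Trevas-related parsing.
  • This applies to both the Spark 3 and Spark 4 integrations: they use the same shaded parser stack.

Typical mappings: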

| Before | After |
| --- | --- |
| org.antlr.v4.runtime.CharStreams | fr.insee.vtl.antlr.runtime.CharStreams |
| org.antlr.v4.runtime.CommonTokenStream | fr.insee.vtl.antlr.runtime.CommonTokenStream |
| org.antlr.v4.runtime.tree.ParseTree | fr.insee.vtl.antlr.runtime.tree.ParseTree |
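Because the mapping is a plain package-prefix substitution, import lines can be rewritten mechanically. The following is only a minimal sketch: ImportRelocator is a hypothetical helper, not part of Trevas, and it touches import statements only, so fully qualified ANTLR names used inside method bodies would still need a manual pass.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not shipped with Trevas): rewrites ANTLR import prefixes.
class ImportRelocator {
    static final String OLD_PREFIX = "org.antlr.v4.";
    static final String NEW_PREFIX = "fr.insee.vtl.antlr.";

    // Rewrites one source line if it imports from the standard ANTLR runtime;
    // every other line is returned unchanged.
    static String relocate(String line) {
        String trimmed = line.trim();
        boolean isAntlrImport = trimmed.startsWith("import " + OLD_PREFIX)
                || trimmed.startsWith("import static " + OLD_PREFIX);
        return isAntlrImport ? line.replace(OLD_PREFIX, NEW_PREFIX) : line;
    }

    // Applies the rewrite to every line of a source file.
    static List<String> relocateAll(List<String> lines) {
        return lines.stream().map(ImportRelocator::relocate).collect(Collectors.toList());
    }
}
```

Calling ImportRelocator.relocate on `import org.antlr.v4.runtime.CharStreams;` yields `import fr.insee.vtl.antlr.runtime.CharStreams;`, which matches the first row of the table.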

For more technical details, see the documentation.

Trevas - Version 2.0.0

· 1 min read
Nicolas Laval
Making Sense - Developer

Trevas 2.0.0 is released!

Trevas now builds a DAG (directed acyclic graph) of the VTL instructions and reorders them before execution. Evaluating a VTL script uses this new functionality by default.

Technical documentation is available that describes this feature and how to disable it.
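To give an intuition of what reordering by dependencies means, here is a minimal sketch of a depth-first topological sort over assignments. It is an illustration only and does not claim to reflect how Trevas implements the feature; the StatementReorderSketch and Statement names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not the Trevas implementation.
class StatementReorderSketch {
    // One assignment: the name it defines and the names it reads.
    static final class Statement {
        final String target;
        final List<String> reads;

        Statement(String target, List<String> reads) {
            this.target = target;
            this.reads = reads;
        }
    }

    // Returns the statements in an order where every definition precedes its uses.
    static List<Statement> reorder(List<Statement> statements) {
        Map<String, Statement> byTarget = new LinkedHashMap<>();
        for (Statement s : statements) {
            byTarget.put(s.target, s);
        }
        List<Statement> ordered = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (Statement s : statements) {
            visit(s, byTarget, done, inProgress, ordered);
        }
        return ordered;
    }

    // Depth-first visit: emit the dependencies of a statement before the statement itself.
    private static void visit(Statement s, Map<String, Statement> byTarget,
                              Set<String> done, Set<String> inProgress,
                              List<Statement> ordered) {
        if (done.contains(s.target)) {
            return;
        }
        if (!inProgress.add(s.target)) {
            throw new IllegalStateException("Cyclic dependency on " + s.target);
        }
        for (String name : s.reads) {
            Statement dependency = byTarget.get(name);
            // Names without a defining statement are external inputs (for example ds1).
            if (dependency != null && dependency != s) {
                visit(dependency, byTarget, done, inProgress, ordered);
            }
        }
        inProgress.remove(s.target);
        done.add(s.target);
        ordered.add(s);
    }
}
```

Given three statements written in the order ds_res, ds_mul, ds_sum, where ds_res reads ds_mul and ds_mul reads ds_sum, the sketch returns them as ds_sum, ds_mul, ds_res.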

Trevas - VTL 2.1

· 1 min read
Nicolas Laval
Making Sense - Developer

Trevas 1.7.0 upgrades to version 2.1 of VTL.

This version introduces two new operators:

  • random
  • case

random produces a decimal number between 0 and 1.

case allows for clearer multi-conditional branching, for example:

ds2 := ds1[calc c := case
    when r < 0.2 then "Low"
    when r > 0.8 then "High"
    else "Medium"];
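For readers more familiar with Java, the expression above can be read as an if / else-if chain in which the first matching condition wins and else supplies the fallback. This is only a sketch of that reading: the CaseSketch class is hypothetical and null handling is not modelled.

```java
// Hypothetical Java reading of the VTL case expression (nulls not modelled).
class CaseSketch {
    // Mirrors: case when r < 0.2 then "Low" when r > 0.8 then "High" else "Medium"
    static String classify(double r) {
        if (r < 0.2) {
            return "Low";
        }
        if (r > 0.8) {
            return "High";
        }
        return "Medium";
    }
}
```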

Both operators are already available in Trevas!

The new grammar also provides time operators and includes corrections, without any breaking changes compared to the 2.0 version.

See the coverage section for more details.

Trevas - Provenance

· 4 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.6.0 introduces the VTL Prov module.

This module produces lineage metadata from Trevas, based on two RDF ontologies: PROV-O and SDTH.

SDTH model overview

Adopted model

The vtl-prov module, version 1.6.0, uses the following partial model:

Improvements will come in the coming weeks.

Tools available

Provenance Trevas tools are documented here.

Example

Business use case

Two source datasets are transformed to produce transient datasets and a final permanent one.

Inputs

ds1 & ds2 metadata:

| id | var1 | var2 |
| --- | --- | --- |
| STRING | INTEGER | NUMBER |
| IDENTIFIER | MEASURE | MEASURE |

VTL script

ds_sum := ds1 + ds2;
ds_mul := ds_sum * 3;
ds_res <- ds_mul[filter mod(var1, 2) = 0][calc var_sum := var1 + var2];

RDF model target

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX sdth: <http://rdf-vocabulary.ddialliance.org/sdth#>

# --- Program and steps
<http://example.com/program1> a sdth:Program ;
    a prov:Agent ; # Agent? Or an activity
    rdfs:label "My program 1"@en, "Mon programme 1"@fr ;
    sdth:hasProgramStep <http://example.com/program1/program-step1>,
        <http://example.com/program1/program-step2>,
        <http://example.com/program1/program-step3> .

<http://example.com/program1/program-step1> a sdth:ProgramStep ;
    rdfs:label "Program step 1"@en, "Étape 1"@fr ;
    sdth:hasSourceCode "ds_sum := ds1 + ds2;" ;
    sdth:consumesDataframe <http://example.com/dataset/ds1>,
        <http://example.com/dataset/ds2> ;
    sdth:producesDataframe <http://example.com/dataset/ds_sum> .

<http://example.com/program1/program-step2> a sdth:ProgramStep ;
    rdfs:label "Program step 2"@en, "Étape 2"@fr ;
    sdth:hasSourceCode "ds_mul := ds_sum * 3;" ;
    sdth:consumesDataframe <http://example.com/dataset/ds_sum> ;
    sdth:producesDataframe <http://example.com/dataset/ds_mul> .

<http://example.com/program1/program-step3> a sdth:ProgramStep ;
    rdfs:label "Program step 3"@en, "Étape 3"@fr ;
    sdth:hasSourceCode "ds_res <- ds_mul[filter mod(var1, 2) = 0][calc var_sum := var1 + var2];" ;
    sdth:consumesDataframe <http://example.com/dataset/ds_mul> ;
    sdth:producesDataframe <http://example.com/dataset/ds_res> ;
    sdth:usesVariable <http://example.com/variable/var1>,
        <http://example.com/variable/var2> ;
    sdth:assignsVariable <http://example.com/variable/var_sum> .

# --- Variables
# i think here it's not instances but names we refer to...
<http://example.com/variable/id1> a sdth:VariableInstance ;
    rdfs:label "id1" .
<http://example.com/variable/var1> a sdth:VariableInstance ;
    rdfs:label "var1" .
<http://example.com/variable/var2> a sdth:VariableInstance ;
    rdfs:label "var2" .
<http://example.com/variable/var_sum> a sdth:VariableInstance ;
    rdfs:label "var_sum" .

# --- Data frames
<http://example.com/dataset/ds1> a sdth:DataframeInstance ;
    rdfs:label "ds1" ;
    sdth:hasName "ds1" ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds2> a sdth:DataframeInstance ;
    rdfs:label "ds2" ;
    sdth:hasName "ds2" ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds_sum> a sdth:DataframeInstance ;
    rdfs:label "ds_sum" ;
    sdth:hasName "ds_sum" ;
    sdth:wasDerivedFrom <http://example.com/dataset/ds1>,
        <http://example.com/dataset/ds2> ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds_mul> a sdth:DataframeInstance ;
    rdfs:label "ds_mul" ;
    sdth:hasName "ds_mul" ;
    sdth:wasDerivedFrom <http://example.com/dataset/ds_sum> ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2> .

<http://example.com/dataset/ds_res> a sdth:DataframeInstance ;
    rdfs:label "ds_res" ;
    sdth:wasDerivedFrom <http://example.com/dataset/ds_mul> ;
    sdth:hasVariableInstance <http://example.com/variable/id1>,
        <http://example.com/variable/var1>,
        <http://example.com/variable/var2>,
        <http://example.com/variable/var_sum> .
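To make the shape of the program-step descriptions concrete, the sketch below builds one sdth:ProgramStep block with plain string concatenation. It is not the vtl-prov API: the ProgramStepTurtleSketch class, its method names and the hard-coded dataset base IRI are assumptions made for illustration, and a real application would more likely rely on an RDF library.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: this is not the vtl-prov API, only plain string building.
class ProgramStepTurtleSketch {
    // Assumed base IRI, taken from the example above.
    static final String DATASET_BASE = "http://example.com/dataset/";

    // Escapes backslashes and double quotes so the VTL source fits in a Turtle string literal.
    static String literal(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Renders dataset names as a comma-separated list of IRIs.
    static String iris(List<String> names) {
        return names.stream()
                .map(name -> "<" + DATASET_BASE + name + ">")
                .collect(Collectors.joining(", "));
    }

    // Builds the Turtle description of one sdth:ProgramStep.
    static String programStep(String stepIri, String source,
                              List<String> consumed, List<String> produced) {
        return "<" + stepIri + "> a sdth:ProgramStep ;\n"
                + "    sdth:hasSourceCode " + literal(source) + " ;\n"
                + "    sdth:consumesDataframe " + iris(consumed) + " ;\n"
                + "    sdth:producesDataframe " + iris(produced) + " .\n";
    }
}
```

Calling programStep with the first step's IRI, the source `ds_sum := ds1 + ds2;`, the consumed names ds1 and ds2 and the produced name ds_sum gives a block equivalent to the program-step1 description above, minus the labels.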

Trevas - SDMX

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.4.1 introduces the VTL SDMX module.

This module consumes SDMX metadata sources to instantiate Trevas DataStructures and Datasets.

It can also execute the VTL TransformationSchemes to obtain the resulting persistent datasets.

Overview

VTL SDMX Diagram

Trevas supports the above SDMX message elements. Only the VtlMappingSchemes element is optional.

The elements in box 1 are used to produce Trevas DataStructures, filling VTL components attributes name, role, type, nullable and valuedomain.

The elements in box 2 are used to generate the VTL code (rulesets & transformations).

Tools available

SDMX Trevas tools are documented here.

Troubleshooting

Have a look at this section.

Trevas - Temporal operators

· 3 min read
Hadrien Kohl
Hadrien Kohl Consulting - Developer

Temporal operators in Trevas

Version 1.4.1 of Trevas introduces preliminary support for date and time types and operators.

The specification describes temporal types such as date, time_period, time, and duration. However, Trevas authors find these descriptions unsatisfactory. This blog post outlines our implementation choices and how they differ from the spec.

In the specification, time_period (like the date type) is described as a compound type with a start and an end (or a start and a duration). This complicates the implementation and brings little value to the language, since one can simply operate on a combination of dates, or on a date and a duration, directly. For this reason, we defined an algebra between the temporal types and have not yet implemented time_period.

| result (operators) | date | duration | number |
| --- | --- | --- | --- |
| date | n/a | date (+, -) | n/a |
| duration | date (+, -) | duration (+, -) | duration (*) |
| number | n/a | duration (*) | n/a |

The period_indicator function relies on period-awareness of types that are not yet defined precisely enough, so it is not implemented for now.

Java mapping

The VTL type date is represented internally as the types java.time.Instant, java.time.ZonedDateTime and java.time.OffsetDateTime.

Instant represents a specific moment in time. Note that this type does not include time zone information and is therefore not usable with all the operators. Use ZonedDateTime or OffsetDateTime when time zone or daylight saving information is required.

The VTL type duration is represented internally as the type org.threeten.extra.PeriodDuration from the threeten extra package. It represents a duration using both calendar units (years, months, days) and a temporal amount (hours, minutes, seconds and nanoseconds).
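The algebra and the Instant limitation can be tried out with the JDK alone. In this minimal sketch, java.time.Period and java.time.Duration stand in for the calendar and time parts of PeriodDuration so that the example has no external dependency; the TemporalAlgebraSketch class and its methods are hypothetical and not part of Trevas.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.Period;
import java.time.ZonedDateTime;
import java.time.temporal.UnsupportedTemporalTypeException;

// Hypothetical sketch: Period + Duration stand in for threeten-extra's PeriodDuration.
class TemporalAlgebraSketch {
    // date + duration -> date: add the calendar part first, then the time part.
    static ZonedDateTime addDuration(ZonedDateTime date, Period calendarPart, Duration timePart) {
        return date.plus(calendarPart).plus(timePart);
    }

    // duration * number -> duration (calendar part only in this sketch).
    static Period scale(Period calendarPart, int factor) {
        return calendarPart.multipliedBy(factor);
    }

    // Reports whether a calendar-based amount can be added to an Instant.
    static boolean instantSupports(Period calendarPart) {
        try {
            Instant.EPOCH.plus(calendarPart);
            return true;
        } catch (UnsupportedTemporalTypeException e) {
            // Instant has no calendar, so month or year arithmetic is rejected.
            return false;
        }
    }
}
```

Adding one month to a ZonedDateTime works because the type carries a calendar and a zone, while instantSupports(Period.ofMonths(1)) returns false, which is one reason Instant is not usable with all the operators.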

Function flow_to_stock

The flow_to_stock function converts a data set with flow interpretation into a stock interpretation. This transformation is useful when you want to aggregate flow data (e.g., sales or production rates) into cumulative stock data (e.g., total inventory).

Syntax:

result := flow_to_stock(op)

Parameters:

  • op - The input data set with flow interpretation. The data set must have an identifier of type time, additional identifiers, and at least one measure of type number.

Result:

The function returns a data set with the same structure as the input, but with the values converted to stock interpretation.
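For a single time series, the conversion amounts to a running total. The sketch below shows that idea on a plain list of values assumed to be sorted by time; FlowToStockSketch is a hypothetical class, it ignores identifiers and null values, and it is not the Trevas implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: running total over one time-ordered series.
class FlowToStockSketch {
    // Each output value is the sum of all flows up to and including that period.
    static List<Double> flowToStock(List<Double> flows) {
        List<Double> stocks = new ArrayList<>();
        double runningTotal = 0.0;
        for (double flow : flows) {
            runningTotal += flow;
            stocks.add(runningTotal);
        }
        return stocks;
    }
}
```

With flows 1, 2, 3 the resulting stocks are 1, 3, 6.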

Function stock_to_flow

The stock_to_flow function converts a data set with stock interpretation into a flow interpretation. This transformation is useful when you want to derive flow data from cumulative stock data.

Syntax:

result := stock_to_flow(op)

Parameters:

  • op - The input data set with stock interpretation. The data set must have an identifier of type time, additional identifiers, and at least one measure of type number.

Result:

The function returns a data set with the same structure as the input, but with the values converted to flow interpretation.
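For a single time series, this is the difference between consecutive values. The sketch below assumes a list sorted by time and keeps the first value unchanged; StockToFlowSketch is a hypothetical class that ignores identifiers and null values and is not the Trevas implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: consecutive differences over one time-ordered series.
class StockToFlowSketch {
    // The first flow equals the first stock; later flows are stock minus previous stock.
    static List<Double> stockToFlow(List<Double> stocks) {
        List<Double> flows = new ArrayList<>();
        Double previous = null;
        for (Double stock : stocks) {
            flows.add(previous == null ? stock : stock - previous);
            previous = stock;
        }
        return flows;
    }
}
```

With stocks 1, 3, 6 the resulting flows are 1, 2, 3, which undoes the running total of the flow_to_stock direction.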

Function timeshift

The timeshift function shifts the time identifier of each data point in the data set by a given number of periods. This is useful for analyzing data at different time offsets, such as comparing current values to past values.

Syntax:

result := timeshift(op, shiftNumber)

Parameters:

  • op - The operand data set containing time series.
  • shiftNumber - An integer representing the number of periods to shift. Positive values shift forward in time, while negative values shift backward.

Result:

The function returns a data set with the time identifiers shifted by the specified number of periods.
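As an illustration, the sketch below shifts a monthly series keyed by java.time.YearMonth. The monthly frequency, the TimeshiftSketch class and the map-based representation are assumptions made for the example; Trevas works on data sets, not on maps.

```java
import java.time.YearMonth;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: shifting the time identifier of a monthly series.
class TimeshiftSketch {
    // Moves every key by shiftNumber months; the measure values are untouched.
    static Map<YearMonth, Double> timeshift(Map<YearMonth, Double> series, int shiftNumber) {
        Map<YearMonth, Double> shifted = new TreeMap<>();
        for (Map.Entry<YearMonth, Double> entry : series.entrySet()) {
            shifted.put(entry.getKey().plusMonths(shiftNumber), entry.getValue());
        }
        return shifted;
    }
}
```

A value recorded for 2020-12 ends up under 2021-01 with a shift of 1 and under 2020-11 with a shift of -1.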

Trevas - Java 17

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.2.0 enables Java 17 support.

Java modules handling

Spark does not support Java modules.

Java 17 client applications that embed Trevas in Spark mode must export the required JDK packages to the unnamed module (ALL-UNNAMED) so that Spark can access them.
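One way to check at startup whether the JVM was launched with the expected export is to query the java.lang.Module API. This is only a diagnostic sketch: ModuleExportCheck is a hypothetical class, and it assumes the application runs from the classpath, so that its classes live in the unnamed module.

```java
// Hypothetical diagnostic: checks whether java.base exports a package to this class's module.
class ModuleExportCheck {
    // When loaded from the classpath, this class belongs to the unnamed module,
    // which is the module targeted by ALL-UNNAMED.
    static boolean isExportedToThisModule(String packageName) {
        Module javaBase = Object.class.getModule();
        Module caller = ModuleExportCheck.class.getModule();
        return javaBase.isExported(packageName, caller);
    }
}
```

An application could call isExportedToThisModule("sun.nio.ch") before creating its Spark session and log a clear message when the result is false, instead of failing later with an access error.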

Maven

Add to your pom.xml file, in the build > plugins section:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.11.0</version>
    <configuration>
        <compilerArgs>
            <arg>--add-exports</arg>
            <arg>java.base/sun.nio.ch=ALL-UNNAMED</arg>
        </compilerArgs>
    </configuration>
</plugin>

Docker

ENTRYPOINT ["java", "--add-exports", "java.base/sun.nio.ch=ALL-UNNAMED", "mainClass"]

Trevas - Persistent assignments

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.2.0 adds support for persistent assignments: ds1 <- ds;.

In Trevas, persistent datasets are represented as PersistentDataset.

Handle PersistentDataset

Trevas datasets are represented as Dataset.

After running the Trevas engine, you can use persistent datasets with something like:

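import javax.script.Bindings;
import javax.script.ScriptContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// PersistentDataset and SparkDataset are provided by the Trevas modules.

// Iterate over the engine bindings and pick out the persistent datasets
Bindings engineBindings = engine.getContext().getBindings(ScriptContext.ENGINE_SCOPE);
engineBindings.forEach((k, v) -> {
    if (v instanceof PersistentDataset) {
        // Unwrap the underlying Trevas dataset
        fr.insee.vtl.model.Dataset ds = ((PersistentDataset) v).getDelegate();
        if (ds instanceof SparkDataset) {
            Dataset<Row> sparkDs = ((SparkDataset) ds).getSparkDataset();
            // Do what you want with sparkDs
        }
    }
});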

Trevas - check_hierarchy

· 1 min read
Nicolas Laval
Making Sense - Developer

News

Trevas 1.1.0 includes hierarchical validation, via the define hierarchical ruleset and check_hierarchy operators.

Example

Input

ds1:

| id | Me |
| --- | --- |
| ABC | 12 |
| A | 1 |
| B | 10 |
| C | 1 |
| DEF | 100 |
| E | 99 |
| F | 1 |
| HIJ | 100 |
| H | 99 |
| I | 0 |

VTL script

// Ensure ds1 metadata definition is good
ds1 := ds1[calc identifier id := id, Me := cast(Me, integer)];

// Define hierarchical ruleset
define hierarchical ruleset hr (variable rule Me) is
    My_Rule : ABC = A + B + C errorcode "ABC is not sum of A,B,C" errorlevel 1;
    DEF = D + E + F errorcode "DEF is not sum of D,E,F";
    HIJ : HIJ = H + I - J errorcode "HIJ is not H + I - J" errorlevel 10
end hierarchical ruleset;

// Check hierarchy
ds_all := check_hierarchy(ds1, hr rule id all);
ds_all_measures := check_hierarchy(ds1, hr rule id always_null all_measures);
ds_invalid := check_hierarchy(ds1, hr rule id always_zero invalid);

Outputs

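  • ds_all

| id | ruleid | bool_var | errorcode | errorlevel | imbalance |
| --- | --- | --- | --- | --- | --- |
| ABC | My_Rule | true | null | null | 0 |

  • ds_all_measures

| id | Me | ruleid | bool_var | errorcode | errorlevel | imbalance |
| --- | --- | --- | --- | --- | --- | --- |
| ABC | 12 | My_Rule | true | null | null | 0 |
| DEF | 100 | hr_2 | null | null | null | null |
| HIJ | 100 | HIJ | null | null | null | null |

  • ds_invalid

| id | Me | ruleid | errorcode | errorlevel | imbalance |
| --- | --- | --- | --- | --- | --- |
| HIJ | 100 | HIJ | HIJ is not H + I - J | 10 | 1 |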
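The imbalance values in these outputs can be reproduced by hand. The sketch below is a simplified reading of the example, not the Trevas implementation: HierarchyCheckSketch is a hypothetical class, and the missingAsZero flag stands in for the always_zero (true) and always_null (false) modes as they behave here.

```java
import java.util.Map;

// Hypothetical sketch: imbalance of one hierarchical rule, as read from the example.
class HierarchyCheckSketch {
    // Returns left - (signed sum of the right-hand terms). A missing code counts
    // as 0 when missingAsZero is true, otherwise the whole result is null.
    static Long imbalance(Map<String, Long> values, String left,
                          Map<String, Integer> signedTerms, boolean missingAsZero) {
        Long leftValue = values.get(left);
        if (leftValue == null) {
            if (!missingAsZero) {
                return null;
            }
            leftValue = 0L;
        }
        long sum = 0;
        for (Map.Entry<String, Integer> term : signedTerms.entrySet()) {
            Long value = values.get(term.getKey());
            if (value == null) {
                if (!missingAsZero) {
                    return null;
                }
                value = 0L;
            }
            // The sign is +1 for added terms and -1 for subtracted terms.
            sum += term.getValue() * value;
        }
        return leftValue - sum;
    }
}
```

With the ds1 values, the rule HIJ = H + I - J in zero mode gives 100 - (99 + 0 - 0) = 1, the imbalance shown in ds_invalid, while the DEF rule in null mode gives null because D is missing, as in ds_all_measures.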