This repository was archived by the owner on May 3, 2023. It is now read-only.
Merged
3,182 changes: 0 additions & 3,182 deletions datahub_core/libs/latin_words.py

This file was deleted.

10 changes: 10 additions & 0 deletions datahub_core/models/ctgan/analyse.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# import pandas as pd
# from ctgan import CTGANSynthesizer


# def run(disc, csv_filename=None, input_df=None):

#     # Use `is None`: a DataFrame has no unambiguous truth value.
#     if input_df is None and csv_filename:
#         input_df = pd.read_csv(csv_filename)

#     # 'disc' lists the names of the discrete columns.
#     ctgan = CTGANSynthesizer()
#     ctgan.fit(input_df, disc)
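The stub above would load a CSV and fit a CTGAN model on its discrete columns. As a stdlib-only illustration of the analysis half, here is one way the discrete columns could be detected; `detect_discrete_columns` and its `max_unique` threshold are hypothetical, not part of this project or the `ctgan` package:

```python
import csv
import io

def detect_discrete_columns(rows, max_unique=10):
    """Treat a column as discrete when it has few distinct values.

    `rows` is a list of dicts (e.g. from csv.DictReader); the
    `max_unique` threshold is an illustrative assumption.
    """
    columns = rows[0].keys() if rows else []
    discrete = []
    for col in columns:
        values = {row[col] for row in rows}
        if len(values) <= max_unique:
            discrete.append(col)
    return discrete

# Example: an in-memory CSV stands in for csv_filename.
data = io.StringIO("currency,amount\nUSD,10.5\nEUR,20.1\nUSD,30.7\n")
rows = list(csv.DictReader(data))
discrete = detect_discrete_columns(rows, max_unique=2)  # only 'currency' has <= 2 distinct values
```

The column names detected this way would then be passed to the model as the `disc` argument.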
14 changes: 14 additions & 0 deletions datahub_core/models/ctgan/ctgan.py
@@ -0,0 +1,14 @@

# import json


# class GanModel:
#     def __init__(self, filename, randomstate):
#         self.randomstate = randomstate

#         with open(filename, 'r') as f:
#             data = json.load(f)

#         # NOTE: process_type and rec are helpers not yet defined in this stub.
#         self.type_node = process_type(data, randomstate)

#     def make_one(self):
#         result = {}
#         rec(self.type_node, result, self.randomstate)
#         return result
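Since the stub above is commented out, here is a minimal runnable sketch of the pattern it follows (load a JSON spec, build a node, sample records). The `process_type` and `rec` bodies are hypothetical placeholders standing in for the real helpers, and the spec shape (`"fields"` mapping names to value lists) is an assumption for illustration only:

```python
import json
import random
import tempfile

def process_type(data, randomstate):
    # Hypothetical placeholder: the real helper would build a typed
    # node tree from the spec; here we just keep the field choices.
    return data["fields"]

def rec(type_node, result, randomstate):
    # Hypothetical placeholder: walk the node tree, filling `result`
    # with one sampled value per field.
    for name, choices in type_node.items():
        result[name] = randomstate.choice(choices)

class GanModel:
    def __init__(self, filename, randomstate):
        self.randomstate = randomstate

        with open(filename, 'r') as f:
            data = json.load(f)

        self.type_node = process_type(data, randomstate)

    def make_one(self):
        result = {}
        rec(self.type_node, result, self.randomstate)
        return result

# Example: a tiny JSON spec written to a temp file.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as spec:
    json.dump({"fields": {"currency": ["USD", "EUR", "GBP"]}}, spec)

model = GanModel(spec.name, random.Random(0))
row = model.make_one()
```

Note this is the same load-then-generate shape used by `MarkovModel` below, which is presumably why the stub mirrors its constructor.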
Empty file.
2 changes: 1 addition & 1 deletion datahub_core/models/markov/markov_model.py
Expand Up @@ -4,7 +4,7 @@

class MarkovModel:

def __init__(self, filename, randomstate):
self.randomstate = randomstate

with open(filename, 'r') as f:
Expand Down
124 changes: 0 additions & 124 deletions docs/DELEGATE_ACTION_GROUP/readme.md

This file was deleted.

Empty file added docs/asd.md
Empty file.
@@ -0,0 +1,11 @@
# Title

__Parent:__ [Datahub-DataHelix-Collaboration](../readme.md)

__Discussion:__ [issue-49](https://github.com/finos/datahub/issues/49)

__status:__ DRAFT

## Abstract

DataHub and DataHelix use different underlying technologies (Python vs Java). Using the same underlying language/technology, or enabling interoperability between the two, would make it significantly easier for developers to work on both and take advantage of each other's features.
@@ -0,0 +1,13 @@
# Title

__Parent:__ [Datahub-DataHelix-Collaboration](../readme.md)

__Discussion:__ [issue-51](https://github.com/finos/datahub/issues/51)

__status:__ DRAFT

## Abstract

This outcome is to decide whether we need a single, extensible specification language that can drive an API and possibly a UI.

DataHelix has an established JSON spec for declaring data profiles; the purpose of this profile markup is to declare a high-level spec from which DataHelix's core engine can produce the synthetic set.
@@ -0,0 +1,32 @@
# Decision to integrate DataHelix and DataHub

__Status:__ Draft

## Problem

It's recognized that the DataHelix JSON configuration specification enables the quick creation of data for engineering teams, while the DataHub library-based approach allows engineers to get in and easily extend, customize, and produce more complex models.

DataHub (Python) and DataHelix (Java) are based on different underlying technologies. DataHub chose Python due to its status as the de facto 'data science' language of choice and its wealth of existing libraries. Python was also selected for its portability and interoperability: it can run on nearly any platform with minimal setup requirements.

DataHelix is Java-based; Java is a non-controversial, well-supported choice with a wealth of experienced developers.

## Outcome

* [[Link](./outcomes/single-underlying-technology.md)] Single underlying technology
* [[link](./outcomes/single-underlying-technology.md)] Support of configuration language

## Structure / Skills

| Role |Name |
|------------|---------------|
| Chair | James McLeod |
| DataHub | Paul Groves |
| DataHelix | Andrew Carr |

## Timelines and Constraints

TBD.

## Decisions

TBD.
126 changes: 126 additions & 0 deletions docs/delegated-action-groups/readme.md
@@ -0,0 +1,126 @@
# Delegate Action Groups

Delegate Action Groups are used to officially delegate decision-making responsibility for a specific challenge or problem in an open manner. The mandate for a DAG is outlined in a POST document (Problem, Outcomes, Structure, Timeline).

## Book of DAG

| Status | Chair | Link/Description |
|--------|--------------|--------------------------------------------------------------------------------|
| Draft | James McLeod | [Datahub-datahelix-collaboration](./datahub-datahelix-collaboration/readme.md) |
| Draft | James McLeod | [synthetic-data-architecture](./synthetic-data-architecture/readme.md) |

## Structure

``` bash
├── DELEGATE_ACTION_GROUP
│ ├── post-short-description 1
│ │ ├── POST.md << Master document
│ │ ├── outcomes
│ │ │ ├── outcome-1.md
│ │ │ ├── outcome-2.md
│ │ │ ├── outcome-3.md
│ │ ├── decisions
│ │ │ ├── decision-1.md
│ │ │ ├── decision-2.md
│ │ │ ├── decision-3.md
│ │
│ ├── post-short-description 2
│ │ ├── POST.md << Master document
│ │ ├── outcomes
│ │ │ ├── outcome-1.md
│ │ │ ├── outcome-2.md
│ │ ├── decisions
│ │ │ ├── decision-1.md
│ │ │ ├── decision-2.md
```

## Templates

### POST documents

``` markdown

# TITLE

Status: Draft|In-Progress|Complete

## Problem

Describe the problem in 2/3 paragraphs

## Outcome

* [link] Short description of outcome 1
* [link] Short description of outcome 2
* [link] Short description of outcome 3

## Structure / Skills

- [Chair] - <name>
- [Data Science] - Fred Morris
- [Big Data Expert] - Someone Else
- [Stakeholder 1] - A.Stakeholder
- [Stakeholder 2] - B.Stakeholder
- [Stakeholder 3] - C.Stakeholder

## Timelines and Constraints

Declare what is out of scope, the urgency, and the capacity of teams to contribute.

## Decisions

* [link] Short description of outcome 1 [Approved|InProgress]
* [link] Short description of outcome 2 [Approved|InProgress]
* [link] Short description of outcome 3 [Approved|InProgress]

```

### Objectives

Objectives are written and approved as part of the DAG. While the DAG is in Draft status the objectives are fine-tuned; once all objectives are agreed, the DAG moves from Draft to 'In Progress'. While the DAG is in draft, create an issue for tracking the conversation.

Once an objective is agreed, the finalized objective text from the issue should be transferred to the objective document and the issue can then be closed; the /dag-name/objective/objective-name.md file becomes the final document.
The issue remains closed: re-opening an objective to make a major change moves the entire DAG back to DRAFT status.

```markdown

# Title

[Link] to the GitHub Issue for conversation

[Link] to the associated decision when it's made

## Abstract

2/3 paragraphs on the outcome and what it achieves

```

### Decision

Decisions are made by the working group and correspond with the scope of the DAG and the declared objectives. Decisions are the output of the DAG; once all decisions are approved the DAG is 'complete', and decisions are binding for the future of the project.

Create an issue for tracking the conversation. Once the decision is agreed, the finalized decision text from the issue should be transferred to the decision document and the issue can then be closed; the decision md file becomes the final document.

``` markdown

# Title

Status: In-Progress | Complete

[Link] to the GitHub Issue for conversation

[Link] to the associated objective

## Abstract

2/3 paragraphs on the outcome and what it achieves

## Consequences of the Decision

What is the consequence of this decision being made - will it lead to a specific implementation, resourcing etc.

## Alternatives?

Document any alternative decisions that could have been made

## Decision Outcome

Document the decisions that were made
```
@@ -0,0 +1,11 @@
# Title

__Parent:__ [Synthetic Data Architecture DAG](../readme.md)

__Discussion:__ Blocked by [issue-50](https://github.com/finos/datahub/issues/50)

__status:__ DRAFT

## Abstract

To allow interoperability between separately authored open source (or commercial) synthetic data modules, the elements of the pipeline should have clear specifications that are protocol agnostic (e.g. file, HTTP).
@@ -0,0 +1,22 @@
# Title

__Parent:__ [Synthetic Data Architecture DAG](../readme.md)

__Discussion:__ [issue-50](https://github.com/finos/datahub/issues/50)

__status:__ DRAFT

## Abstract

Synthetic data processes should have a defined set of steps:

- Analyse data - classifiers, identifiers, and discrete values.
  - Financial organizations have common classifiers (curves, tenors, countries, currencies, etc.)
  - Identifiers to public external entities (LEI, ISIN, CUSIP)
  - Identifiers to private internal entities (account codes, trading books)

- Decide on the best 'analysis' module (simple bucketing, GAN)
- Parameterise the model (apply noise, fuzziness, generalize/normalize distributions so as not to leak sensitive data)
- Run the model on the production set
- Use the model-data, and any additional properties, to synthetically produce an artificial set
- Do something with the synthetic dataset
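The steps above can be sketched as a minimal stdlib-only pipeline. Every function here is a hypothetical placeholder for a real module: `analyse` fits a trivial per-field model, `parameterise` blurs its parameters (the "do not leak sensitive data" step), and `synthesize` draws the artificial set:

```python
import random
import statistics

def analyse(production_rows):
    # Step 1: fit a trivial 'model' - the mean and stdev of each numeric field.
    return {
        field: (statistics.mean(col), statistics.pstdev(col))
        for field in production_rows[0]
        for col in [[row[field] for row in production_rows]]
    }

def parameterise(model, noise=0.1, rng=None):
    # Step 3: blur the fitted parameters so the synthetic set does not
    # leak the exact production statistics.
    rng = rng or random.Random()
    return {
        field: (mean * (1 + rng.uniform(-noise, noise)), stdev)
        for field, (mean, stdev) in model.items()
    }

def synthesize(model, n, rng=None):
    # Steps 4/5: draw an artificial set from the (noised) model.
    rng = rng or random.Random()
    return [
        {field: rng.gauss(mean, stdev) for field, (mean, stdev) in model.items()}
        for _ in range(n)
    ]

production = [{"notional": 100.0}, {"notional": 120.0}, {"notional": 80.0}]
model = parameterise(analyse(production), rng=random.Random(0))
synthetic = synthesize(model, 5, rng=random.Random(1))
```

A real pipeline would swap each placeholder for a proper module (e.g. a GAN for `analyse`/`synthesize`), which is exactly the interoperability the protocol-agnostic specifications above are meant to enable.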