This repository was archived by the owner on May 3, 2023. It is now read-only.
Merged
3,182 changes: 0 additions & 3,182 deletions datahub_core/libs/latin_words.py

This file was deleted.

10 changes: 10 additions & 0 deletions datahub_core/models/ctgan/analyse.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# import pandas as pd
# from ctgan import CTGANSynthesizer


# def run(disc, csv_filename=None, input_df=None):

#     # Use `is None`: a DataFrame has no unambiguous truth value.
#     if input_df is None and csv_filename:
#         input_df = pd.read_csv(csv_filename)

#     # 'disc' lists the names of the discrete columns.
#     ctgan = CTGANSynthesizer()
#     ctgan.fit(input_df, disc)
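The stub above would load a CSV and fit a CTGAN model on its discrete columns. As a stdlib-only illustration of the analysis half, here is one way the discrete columns could be detected; `detect_discrete_columns` and its `max_unique` threshold are hypothetical, not part of this project or the `ctgan` package:

```python
import csv
import io

def detect_discrete_columns(rows, max_unique=10):
    """Treat a column as discrete when it has few distinct values.

    `rows` is a list of dicts (e.g. from csv.DictReader); the
    `max_unique` threshold is an illustrative assumption.
    """
    columns = rows[0].keys() if rows else []
    discrete = []
    for col in columns:
        values = {row[col] for row in rows}
        if len(values) <= max_unique:
            discrete.append(col)
    return discrete

# Example: an in-memory CSV stands in for csv_filename.
data = io.StringIO("currency,amount\nUSD,10.5\nEUR,20.1\nUSD,30.7\n")
rows = list(csv.DictReader(data))
discrete = detect_discrete_columns(rows, max_unique=2)  # only 'currency' has <= 2 distinct values
```

The column names detected this way would then be passed to the model as the `disc` argument.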
14 changes: 14 additions & 0 deletions datahub_core/models/ctgan/ctgan.py
@@ -0,0 +1,14 @@

# import json


# class GanModel:
#     def __init__(self, filename, randomstate):
#         self.randomstate = randomstate

#         with open(filename, 'r') as f:
#             data = json.load(f)

#         # NOTE: process_type and rec are helpers not yet defined in this stub.
#         self.type_node = process_type(data, randomstate)

#     def make_one(self):
#         result = {}
#         rec(self.type_node, result, self.randomstate)
#         return result
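Since the stub above is commented out, here is a minimal runnable sketch of the pattern it follows (load a JSON spec, build a node, sample records). The `process_type` and `rec` bodies are hypothetical placeholders standing in for the real helpers, and the spec shape (`"fields"` mapping names to value lists) is an assumption for illustration only:

```python
import json
import random
import tempfile

def process_type(data, randomstate):
    # Hypothetical placeholder: the real helper would build a typed
    # node tree from the spec; here we just keep the field choices.
    return data["fields"]

def rec(type_node, result, randomstate):
    # Hypothetical placeholder: walk the node tree, filling `result`
    # with one sampled value per field.
    for name, choices in type_node.items():
        result[name] = randomstate.choice(choices)

class GanModel:
    def __init__(self, filename, randomstate):
        self.randomstate = randomstate

        with open(filename, 'r') as f:
            data = json.load(f)

        self.type_node = process_type(data, randomstate)

    def make_one(self):
        result = {}
        rec(self.type_node, result, self.randomstate)
        return result

# Example: a tiny JSON spec written to a temp file.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as spec:
    json.dump({"fields": {"currency": ["USD", "EUR", "GBP"]}}, spec)

model = GanModel(spec.name, random.Random(0))
row = model.make_one()
```

Note this is the same load-then-generate shape used by `MarkovModel` below, which is presumably why the stub mirrors its constructor.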
Empty file.
2 changes: 1 addition & 1 deletion datahub_core/models/markov/markov_model.py
Expand Up @@ -4,7 +4,7 @@

class MarkovModel:

def __init__(self, filename, randomstate):
self.randomstate = randomstate

with open(filename, 'r') as f:
Expand Down
124 changes: 0 additions & 124 deletions docs/DELEGATE_ACTION_GROUP/readme.md

This file was deleted.

Empty file added docs/asd.md
Empty file.
@@ -0,0 +1,11 @@
# Title

__Parent:__ [Datahub-DataHelix-Collaboration](../readme.md)

__Discussion:__ [issue-49](https://github.com/finos/datahub/issues/49)

__status:__ DRAFT

## Abstract

DataHub and DataHelix use different underlying technologies (Python vs Java). Using the same underlying language/technology, or enabling interoperability between the two, would make it significantly easier for developers to work on both and take advantage of each other's features.
@@ -0,0 +1,13 @@
# Title

__Parent:__ [Datahub-DataHelix-Collaboration](../readme.md)

__Discussion:__ [issue-51](https://github.com/finos/datahub/issues/51)

__status:__ DRAFT

## Abstract

This outcome is to decide whether we need a single, extensible specification language that can drive an API and possibly a UI.

DataHelix has an established JSON spec for declaring data profiles; the purpose of this profile markup is to declare a high-level spec from which DataHelix's core engine can produce the synthetic set.
@@ -0,0 +1,32 @@
# Decision to integrate DataHelix and DataHub

__Status:__ Draft

## Problem

It's recognized that the DataHelix JSON configuration specification enables the quick creation of data for engineering teams, while the DataHub library-based approach allows engineers to get in and easily extend, customize, and produce more complex models.

DataHub (Python) and DataHelix (Java) are based on different underlying technologies. DataHub chose Python due to its status as the de facto 'data science' language of choice and its wealth of existing libraries. Python was also selected for its portability and interoperability: it can run on nearly any platform with minimal setup requirements.

DataHelix is Java-based; Java is a non-controversial, well-supported choice with a wealth of experienced developers.

## Outcome

* [[Link](./outcomes/single-underlying-technology.md)] Single underlying technology
* [[link](./outcomes/single-underlying-technology.md)] Support of configuration language

## Structure / Skills

| Role |Name |
|------------|---------------|
| Chair | James McLeod |
| DataHub | Paul Groves |
| DataHelix | Andrew Carr |

## Timelines and Constraints

TBD.

## Decisions

TBD.
126 changes: 126 additions & 0 deletions docs/delegated-action-groups/readme.md
@@ -0,0 +1,126 @@
# Delegate Action Groups

Delegate Action Groups are used to officially delegate decision-making responsibility for a specific challenge or problem in an open manner. The mandate for a DAG is outlined in a POST document (Problem, Outcomes, Structure, Timeline).

## Book of DAG

| Status | Chair | Link/Description |
|--------|--------------|--------------------------------------------------------------------------------|
| Draft | James McLeod | [Datahub-datahelix-collaboration](./datahub-datahelix-collaboration/readme.md) |
| Draft | James McLeod | [synthetic-data-architecture](./synthetic-data-architecture/readme.md) |

## Structure

``` bash
├── DELEGATE_ACTION_GROUP
│ ├── post-short-description 1
│ │ ├── POST.md << Master document
│ │ ├── outcomes
│ │ │ ├── outcome-1.md
│ │ │ ├── outcome-2.md
│ │ │ ├── outcome-3.md
│ │ ├── decisions
│ │ │ ├── decision-1.md
│ │ │ ├── decision-2.md
│ │ │ ├── decision-3.md
│ │
│ ├── post-short-description 2
│ │ ├── POST.md << Master document
│ │ ├── outcomes
│ │ │ ├── outcome-1.md
│ │ │ ├── outcome-2.md
│ │ ├── decisions
│ │ │ ├── decision-1.md
│ │ │ ├── decision-2.md
```

## Templates

### POST documents

``` markdown

# TITLE

Status: Draft|In-Progress|Complete

## Problem

Describe the problem in 2/3 paragraphs

## Outcome

* [link] Short description of outcome 1
* [link] Short description of outcome 2
* [link] Short description of outcome 3

## Structure / Skills

- [Chair] - <name>
- [Data Science] - Fred Morris
- [Big Data Expert] - Someone Else
- [Stakeholder 1] - A.Stakeholder
- [Stakeholder 2] - B.Stakeholder
- [Stakeholder 3] - C.Stakeholder

## Timelines and Constraints

Declare what is out of scope, the urgency, and the capacity of teams to contribute.

## Decisions

* [link] Short description of outcome 1 [Approved|InProgress]
* [link] Short description of outcome 2 [Approved|InProgress]
* [link] Short description of outcome 3 [Approved|InProgress]

```

### Objectives

Objectives are written and approved as part of the DAG. While the DAG is in Draft status the objectives are fine-tuned; once all objectives are agreed, the DAG moves from Draft to 'In Progress'. While the DAG is in draft, create an issue for tracking the conversation.

Once an objective is agreed, the finalized objective text from the issue should be transferred to the objective document and the issue can then be closed; the /dag-name/objective/objective-name.md file becomes the final document.
The issue remains closed: re-opening an objective to make a major change moves the entire DAG back to DRAFT status.

```markdown

# Title

[Link] to the GitHub Issue for conversation

[Link] to the associated decision when it's made

## Abstract

2/3 paragraphs on the outcome and what it achieves

```

### Decision

Decisions are made by the working group and correspond with the scope of the DAG and the declared objectives. Decisions are the output of the DAG; once all decisions are approved the DAG is 'complete', and decisions are binding for the future of the project.

Create an issue for tracking the conversation. Once the decision is agreed, the finalized decision text from the issue should be transferred to the decision document and the issue can then be closed; the decision md file becomes the final document.

``` markdown

# Title

Status: In-Progress | Complete

[Link] to the GitHub Issue for conversation

[Link] to the associated objective

## Abstract

2/3 paragraphs on the outcome and what it achieves

## Consequences of the Decision

What is the consequence of this decision being made - will it lead to a specific implementation, resourcing etc.

## Alternatives?

Document any alternative decisions that could have been made

## Decision Outcome

Document the decisions that were made
```
@@ -0,0 +1,11 @@
# Title

__Parent:__ [Synthetic Data Architecture DAG](../readme.md)

__Discussion:__ Blocked by [issue-50](https://github.com/finos/datahub/issues/50)

__status:__ DRAFT

## Abstract

To allow interoperability between separately authored open source (or commercial) synthetic data modules, the elements of the pipeline should have clear specifications that are protocol agnostic (e.g. file, HTTP).
@@ -0,0 +1,22 @@
# Title

__Parent:__ [Synthetic Data Architecture DAG](../readme.md)

__Discussion:__ [issue-50](https://github.com/finos/datahub/issues/50)

__status:__ DRAFT

## Abstract

Synthetic data processes should have a defined set of steps:

- Analyse data - classifiers, identifiers, and discrete values.
  - Financial organizations have common classifiers (curves, tenors, countries, currencies, etc.)
  - Identifiers to public external entities (LEI, ISIN, CUSIP)
  - Identifiers to private internal entities (account codes, trading books)

- Decide on the best 'analysis' module (simple bucketing, GAN)
- Parameterise the model (apply noise, fuzziness, generalize/normalize distributions so as not to leak sensitive data)
- Run the model on the production set
- Use the model-data, and any additional properties, to synthetically produce an artificial set
- Do something with the synthetic dataset
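The steps above can be sketched as a minimal stdlib-only pipeline. Every function here is a hypothetical placeholder for a real module: `analyse` fits a trivial per-field model, `parameterise` blurs its parameters (the "do not leak sensitive data" step), and `synthesize` draws the artificial set:

```python
import random
import statistics

def analyse(production_rows):
    # Step 1: fit a trivial 'model' - the mean and stdev of each numeric field.
    return {
        field: (statistics.mean(col), statistics.pstdev(col))
        for field in production_rows[0]
        for col in [[row[field] for row in production_rows]]
    }

def parameterise(model, noise=0.1, rng=None):
    # Step 3: blur the fitted parameters so the synthetic set does not
    # leak the exact production statistics.
    rng = rng or random.Random()
    return {
        field: (mean * (1 + rng.uniform(-noise, noise)), stdev)
        for field, (mean, stdev) in model.items()
    }

def synthesize(model, n, rng=None):
    # Steps 4/5: draw an artificial set from the (noised) model.
    rng = rng or random.Random()
    return [
        {field: rng.gauss(mean, stdev) for field, (mean, stdev) in model.items()}
        for _ in range(n)
    ]

production = [{"notional": 100.0}, {"notional": 120.0}, {"notional": 80.0}]
model = parameterise(analyse(production), rng=random.Random(0))
synthetic = synthesize(model, 5, rng=random.Random(1))
```

A real pipeline would swap each placeholder for a proper module (e.g. a GAN for `analyse`/`synthesize`), which is exactly the interoperability the protocol-agnostic specifications above are meant to enable.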