author    Nate Sesti <33237525+sestinj@users.noreply.github.com>    2023-10-09 18:37:27 -0700
committer GitHub <noreply@github.com>    2023-10-09 18:37:27 -0700
commit    f09150617ed2454f3074bcf93f53aae5ae637d40 (patch)
tree      5cfe614a64d921dfe58b049f426d67a8b832c71f /server/continuedev/plugins/recipes/AddTransformRecipe
parent    985304a213f620cdff3f8f65f74ed7e3b79be29d (diff)
Preview (#541)
* Strong typing (#533)
* refactor: :recycle: get rid of continuedev.src.continuedev structure
* refactor: :recycle: switching back to server folder
* feat: :sparkles: make config.py imports shorter
* feat: :bookmark: publish as pre-release vscode extension
* refactor: :recycle: refactor and add more completion params to ui
* build: :building_construction: download from preview S3
* fix: :bug: fix paths
* fix: :green_heart: package:pre-release
* ci: :green_heart: more time for tests
* fix: :green_heart: fix build scripts
* fix: :bug: fix import in run.py
* fix: :bookmark: update version to try again
* ci: 💚 Update package.json version [skip ci]
* refactor: :fire: don't check for old extensions version
* fix: :bug: small bug fixes
* fix: :bug: fix config.py import paths
* ci: 💚 Update package.json version [skip ci]
* ci: :green_heart: platform-specific builds test #1
* feat: :green_heart: ship with binary
* fix: :green_heart: fix copy statement to include .exe for windows
* fix: :green_heart: cd extension before packaging
* chore: :loud_sound: count tokens generated
* fix: :green_heart: remove npm_config_arch
* fix: :green_heart: publish as pre-release!
* chore: :bookmark: update version
* perf: :green_heart: hardcode distro paths
* fix: :bug: fix yaml syntax error
* chore: :bookmark: update version
* fix: :green_heart: update permissions and version
* feat: :bug: kill old server if needed
* feat: :lipstick: update marketplace icon for pre-release
* ci: 💚 Update package.json version [skip ci]
* feat: :sparkles: auto-reload for config.py
* feat: :wrench: update default config.py imports
* feat: :sparkles: codelens in config.py
* feat: :sparkles: select model param count from UI
* ci: 💚 Update package.json version [skip ci]
* feat: :sparkles: more model options, ollama error handling
* perf: :zap: don't show server loading immediately
* fix: :bug: fixing small UI details
* ci: 💚 Update package.json version [skip ci]
* feat: :rocket: headers param on LLM class
* fix: :bug: fix headers for openai.py
* feat: :sparkles: highlight code on cmd+shift+L
* ci: 💚 Update package.json version [skip ci]
* feat: :lipstick: sticky top bar in gui.tsx
* fix: :loud_sound: websocket logging and horizontal scrollbar
* ci: 💚 Update package.json version [skip ci]
* feat: :sparkles: allow AzureOpenAI Service through GGML
* ci: 💚 Update package.json version [skip ci]
* fix: :bug: fix automigration
* ci: 💚 Update package.json version [skip ci]
* ci: :green_heart: upload binaries in ci, download apple silicon
* chore: :fire: remove notes
* fix: :green_heart: use curl to download binary
* fix: :green_heart: set permissions on apple silicon binary
* fix: :green_heart: testing
* fix: :green_heart: cleanup file
* fix: :green_heart: fix preview.yaml
* fix: :green_heart: only upload once per binary
* fix: :green_heart: install rosetta
* ci: :green_heart: download binary after tests
* ci: 💚 Update package.json version [skip ci]
* ci: :green_heart: prepare ci for merge to main

---------

Co-authored-by: GitHub Action <action@github.com>
Diffstat (limited to 'server/continuedev/plugins/recipes/AddTransformRecipe')
-rw-r--r--  server/continuedev/plugins/recipes/AddTransformRecipe/README.md              |   9
-rw-r--r--  server/continuedev/plugins/recipes/AddTransformRecipe/dlt_transform_docs.md  | 142
-rw-r--r--  server/continuedev/plugins/recipes/AddTransformRecipe/main.py                |  31
-rw-r--r--  server/continuedev/plugins/recipes/AddTransformRecipe/steps.py               | 106
4 files changed, 288 insertions(+), 0 deletions(-)
diff --git a/server/continuedev/plugins/recipes/AddTransformRecipe/README.md b/server/continuedev/plugins/recipes/AddTransformRecipe/README.md
new file mode 100644
index 00000000..78d603a2
--- /dev/null
+++ b/server/continuedev/plugins/recipes/AddTransformRecipe/README.md
@@ -0,0 +1,9 @@
+# AddTransformRecipe
+
+Uses the Chess.com API example to show how to add map and filter Python transforms to a dlt pipeline.
+
+## Background
+
+- https://dlthub.com/docs/general-usage/resource#filter-transform-and-pivot-data
+- https://dlthub.com/docs/customizations/customizing-pipelines/renaming_columns
+- https://dlthub.com/docs/customizations/customizing-pipelines/pseudonymizing_columns
diff --git a/server/continuedev/plugins/recipes/AddTransformRecipe/dlt_transform_docs.md b/server/continuedev/plugins/recipes/AddTransformRecipe/dlt_transform_docs.md
new file mode 100644
index 00000000..864aea87
--- /dev/null
+++ b/server/continuedev/plugins/recipes/AddTransformRecipe/dlt_transform_docs.md
@@ -0,0 +1,142 @@
+# Customize resources
+
+## Filter, transform and pivot data
+
+You can attach any number of transformations to your resource; they are evaluated on an item-by-item basis. The available transformation types:
+
+- map - transform the data item (`resource.add_map`)
+- filter - filter the data item (`resource.add_filter`)
+- yield map - a map that returns an iterator, so a single row may generate many rows (`resource.add_yield_map`; see the sketch after the example below)
+
+Example: We have a resource that loads a list of users from an API endpoint. We want to customize it so that:
+
+- we remove users with user_id == 'me'
+- we anonymize user data
+
+Here's our resource:
+
+```python
+import dlt
+import requests  # the resource calls an HTTP API below
+
+@dlt.resource(write_disposition='replace')
+def users():
+ ...
+ users = requests.get(...)
+ ...
+ yield users
+```
+
+Here's our script that defines transformations and loads the data.
+
+```python
+from pipedrive import users
+
+def anonymize_user(user_data):
+    # hash_str can be any deterministic string-hashing helper (see the pseudonymizing example below)
+    user_data['user_id'] = hash_str(user_data['user_id'])
+    user_data['user_email'] = hash_str(user_data['user_email'])
+    return user_data
+
+# add the filter and anonymize function to the users resource and enumerate
+for user in users().add_filter(lambda user: user['user_id'] != 'me').add_map(anonymize_user):
+    print(user)
+```
+
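+The third transform type, a yield map (`add_yield_map`), is not demonstrated above. Here is a minimal sketch, assuming a hypothetical `roles` list on each user (not part of the resource above); the transform function returns a generator, so a single input row can produce many output rows:
+
+```python
+def expand_roles(user):
+    # yield one output row per role; a single user may generate many rows
+    for role in user.get('roles', []):
+        yield {'user_id': user['user_id'], 'role': role}
+
+for row in users().add_yield_map(expand_roles):
+    print(row)
+```
+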
+Here is a more complex example of a map transformation:
+
+# Renaming columns
+
+## Renaming columns by replacing the special characters
+
+In the example below, we create a dummy source with special characters in the key names. We then write a function, replace_umlauts_in_dict_keys, that we apply to the resource to modify its output (replacing the German umlauts).
+
+```python
+import dlt
+
+# create a dummy source with umlauts (special characters) in key names
+@dlt.source
+def dummy_source(prefix: str = None):
+    @dlt.resource
+    def dummy_data():
+        for _ in range(100):
+            yield {f'Objekt_{_}': {'Größe': _, 'Äquivalenzprüfung': True}}
+    return dummy_data(),
+
+def replace_umlauts_in_dict_keys(d):
+    # Replace umlauts in dictionary keys with standard characters, recursing into nested dicts.
+    umlaut_map = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss', 'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue'}
+    result = {}
+    for k, v in d.items():
+        new_key = ''.join(umlaut_map.get(c, c) for c in k)
+        if isinstance(v, dict):
+            result[new_key] = replace_umlauts_in_dict_keys(v)
+        else:
+            result[new_key] = v
+    return result
+
+# We can now add the map function to the resource:
+
+# 1. Create an instance of the source so you can edit it.
+data_source = dummy_source()
+
+# 2. Modify this source instance's resource.
+data_source = data_source.dummy_data().add_map(replace_umlauts_in_dict_keys)
+
+# 3. Inspect your result.
+for row in data_source:
+    print(row)
+
+# {'Objekt_0': {'Groesse': 0, 'Aequivalenzpruefung': True}}
+# ...
+```
+
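+Note that transforms attached with `add_map` or `add_filter` run as the resource is iterated during extraction, so the modified keys and rows are what the pipeline normalizes and loads.
+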
+Here is another example of a map transformation:
+
+# Pseudonymizing columns
+
+## Pseudonymizing (or anonymizing) columns
+
+Pseudonymization is a deterministic way to hide personally identifiable information (PII), enabling us to consistently achieve the same mapping. If instead you wish to anonymize, you can delete the data or replace it with a constant. In the example below, we create a dummy source with a PII column called 'name', which we replace with deterministic hashes.
+
+```python
+import dlt
+import hashlib
+
+@dlt.source
+def dummy_source(prefix: str = None):
+ @dlt.resource
+ def dummy_data():
+ for _ in range(3):
+ yield {'id':_, 'name': f'Jane Washington {_}'}
+ return dummy_data(),
+
+def pseudonymize_name(doc):
+    """Pseudonymization is a deterministic type of PII-obscuring.
+    Its role is to allow identifying users by their hash, without revealing the underlying info.
+    """
+    # add a constant salt so the hashes are deterministic but not trivially reversible
+    salt = 'WI@N57%zZrmk#88c'
+    salted_string = doc['name'] + salt
+    sh = hashlib.sha256()
+    sh.update(salted_string.encode())
+    hashed_string = sh.digest().hex()
+    doc['name'] = hashed_string
+    return doc
+
+# run it as is:
+for row in dummy_source().dummy_data().add_map(pseudonymize_name):
+    print(row)
+
+# {'id': 0, 'name': '96259edb2b28b48bebce8278c550e99fbdc4a3fac8189e6b90f183ecff01c442'}
+# {'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'}
+# {'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'}
+
+# Or create an instance of the data source, modify the resource, and run the source:
+
+# 1. Create an instance of the source so you can edit it.
+data_source = dummy_source()
+# 2. Modify this source instance's resource.
+data_source = data_source.dummy_data().add_map(pseudonymize_name)
+# 3. Inspect your result.
+for row in data_source:
+    print(row)
+
+pipeline = dlt.pipeline(pipeline_name='example', destination='bigquery', dataset_name='normalized_data')
+load_info = pipeline.run(data_source)
+```
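+
+As noted above, anonymizing (rather than pseudonymizing) can be as simple as deleting a field or replacing it with a constant; here is a minimal sketch reusing the dummy_source defined above:
+
+```python
+def anonymize_name(doc):
+    # replace the PII value with a constant; unlike the salted hash above,
+    # the result can no longer identify the user at all
+    doc['name'] = 'REDACTED'
+    return doc
+
+for row in dummy_source().dummy_data().add_map(anonymize_name):
+    print(row)
+```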
diff --git a/server/continuedev/plugins/recipes/AddTransformRecipe/main.py b/server/continuedev/plugins/recipes/AddTransformRecipe/main.py
new file mode 100644
index 00000000..583cef1a
--- /dev/null
+++ b/server/continuedev/plugins/recipes/AddTransformRecipe/main.py
@@ -0,0 +1,31 @@
+from textwrap import dedent
+
+from ....core.main import Step
+from ....core.sdk import ContinueSDK
+from ....core.steps import MessageStep, WaitForUserInputStep
+from .steps import AddTransformStep, SetUpChessPipelineStep
+
+
+class AddTransformRecipe(Step):
+ hide: bool = True
+
+ async def run(self, sdk: ContinueSDK):
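+        # Steps chained with `>>` run in sequence; `run_step` returns the
+        # observation of the final step (here, the user's text input).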
+ text_observation = await sdk.run_step(
+ MessageStep(
+ message=dedent(
+ """\
+ This recipe will walk you through the process of adding a transform to a dlt pipeline that uses the chess.com API source. With the help of Continue, you will:
+ - Set up a dlt pipeline for the chess.com API
+ - Add a filter or map transform to the pipeline
+ - Run the pipeline and view the transformed data in a Streamlit app"""
+ ),
+ name="Add transformation to a dlt pipeline",
+ )
+ >> SetUpChessPipelineStep()
+ >> WaitForUserInputStep(
+ prompt="How do you want to transform the Chess.com API data before loading it? For example, you could filter out games that ended in a draw."
+ )
+ )
+ await sdk.run_step(
+ AddTransformStep(transform_description=text_observation.text)
+ )
diff --git a/server/continuedev/plugins/recipes/AddTransformRecipe/steps.py b/server/continuedev/plugins/recipes/AddTransformRecipe/steps.py
new file mode 100644
index 00000000..61638374
--- /dev/null
+++ b/server/continuedev/plugins/recipes/AddTransformRecipe/steps.py
@@ -0,0 +1,106 @@
+import os
+from textwrap import dedent
+
+from ....core.main import Step
+from ....core.sdk import ContinueSDK, Models
+from ....core.steps import MessageStep
+from ....libs.util.paths import find_data_file
+
+AI_ASSISTED_STRING = "(✨ AI-Assisted ✨)"
+
+
+class SetUpChessPipelineStep(Step):
+ hide: bool = True
+ name: str = "Setup Chess.com API dlt Pipeline"
+
+ async def describe(self, models: Models):
+ return "This step will create a new dlt pipeline that loads data from the chess.com API."
+
+ async def run(self, sdk: ContinueSDK):
+ # running commands to get started when creating a new dlt pipeline
+ await sdk.run(
+ [
+ "python3 -m venv .env",
+ "source .env/bin/activate",
+ "pip install dlt",
+ "dlt --non-interactive init chess duckdb",
+ "pip install -r requirements.txt",
+ "pip install pandas streamlit", # Needed for the pipeline show step later
+ ],
+ name="Set up Python environment",
+ description=dedent(
+ """\
+ - Create a Python virtual environment: `python3 -m venv .env`
+ - Activate the virtual environment: `source .env/bin/activate`
+ - Install dlt: `pip install dlt`
+            - Create a new dlt pipeline called "chess" that loads data into a local DuckDB instance: `dlt --non-interactive init chess duckdb`
+            - Install the Python dependencies for the pipeline: `pip install -r requirements.txt`
+            - Install pandas and Streamlit, used later to show the data: `pip install pandas streamlit`"""
+ ),
+ )
+
+
+class AddTransformStep(Step):
+ hide: bool = True
+
+ # e.g. "Use the `python-chess` library to decode the moves in the game data"
+ transform_description: str
+
+ async def run(self, sdk: ContinueSDK):
+ source_name = "chess"
+ filename = f"{source_name}_pipeline.py"
+ abs_filepath = os.path.join(sdk.ide.workspace_directory, filename)
+
+ # Open the file and highlight the function to be edited
+ await sdk.ide.setFileOpen(abs_filepath)
+
+ await sdk.run_step(
+ MessageStep(
+ message=dedent(
+ """\
+ This step will customize your resource function with a transform of your choice:
+ - Add a filter or map transformation depending on your request
+ - Load the data into a local DuckDB instance
+ - Open up a Streamlit app for you to view the data"""
+ ),
+ name="Write transformation function",
+ )
+ )
+
+ with open(find_data_file("dlt_transform_docs.md")) as f:
+ dlt_transform_docs = f.read()
+
+ prompt = dedent(
+ f"""\
+            Task: Write a transform function using the description below, then use `add_map` or `add_filter` from the `dlt` library to attach it to a resource.
+
+ Description: {self.transform_description}
+
+ Here are some docs pages that will help you better understand how to use `dlt`.
+
+ {dlt_transform_docs}"""
+ )
+
+ # edit the pipeline to add a transform function and attach it to a resource
+ await sdk.edit_file(
+ filename=filename,
+ prompt=prompt,
+ name=f"Writing transform function {AI_ASSISTED_STRING}",
+ )
+
+ await sdk.wait_for_user_confirmation(
+ "Press Continue to confirm that the changes are okay before we run the pipeline."
+ )
+
+ # run the pipeline and load the data
+ await sdk.run(
+ f"python3 {filename}",
+ name="Run the pipeline",
+ description=f"Running `python3 {filename}` to load the data into a local DuckDB instance",
+ )
+
+ # run a streamlit app to show the data
+ await sdk.run(
+ f"dlt pipeline {source_name}_pipeline show",
+ name="Show data in a Streamlit app",
+            description=f"Running `dlt pipeline {source_name}_pipeline show` to show the data in a Streamlit app, where you can view and play with the data.",
+ )