Commit ff6b37b

Merge pull request #652 from harikrishnatp/sparql-query-service-approach
Sparql query service approach
2 parents ea65076 + dcf1580 commit ff6b37b

385 files changed: +1091 -787 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -60,3 +60,4 @@ scribe_data_wikidata_dumps_export/*
 query_check_missing_features.json
 query_check_result_dump.json
 query_check_result_sparql.json
+query_check_sparql_service_features.json

.pre-commit-config.yaml

Lines changed: 22 additions & 13 deletions
@@ -1,25 +1,14 @@
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v6.0.0
     hooks:
       - id: trailing-whitespace
       - id: end-of-file-fixer
       - id: check-yaml
       # - id: check-added-large-files
 
-  - repo: https://github.com/tcort/markdown-link-check
-    rev: v3.13.6
-    hooks:
-      - id: markdown-link-check
-        args: [-q]
-
-  - repo: https://github.com/sphinx-contrib/sphinx-lint
-    rev: v1.0.0
-    hooks:
-      - id: sphinx-lint
-
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.8.5
+    rev: v0.14.5
     hooks:
       - id: ruff
         args: [--fix]
@@ -32,3 +21,23 @@ repos:
       - id: numpydoc-validation
         files: ^src/
         exclude: ^(tests/|.*__init__\.py$)
+
+  - repo: https://github.com/sphinx-contrib/sphinx-lint
+    rev: v1.0.0
+    hooks:
+      - id: sphinx-lint
+
+  - repo: https://github.com/tcort/markdown-link-check
+    rev: v3.14.1
+    hooks:
+      - id: markdown-link-check
+        args: [-q]
+
+  - repo: https://github.com/to-sta/spdx-checker-pre-commit
+    rev: 0.1.3
+    hooks:
+      - id: spdx-license-checker
+        name: run spdx-checker license check
+        exclude: ^(?:.*/)?__init__\.py$
+        args: [-l, GPL-3.0-or-later]
+        types_or: [python]

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -12,6 +12,12 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
 
 ## [Upcoming] Scribe-Data 5.x
 
+## Scribe-Data 5.2.0
+
+### ✨ Features
+
+- The SPARQL queries for the Scribe-Data CLI are generated by a process that checks the available data via the Wikidata Query Service ([#617](https://github.com/scribe-org/Scribe-Data/issues/617)).
+
 ### 🐞 Bug Fixes
 
 - The handling of missing language directories in the SQLite conversion process has been dramatically improved to communicate to the user which languages are missing and also alert them that no SQLite databases will be created if no data is available for any of the desired languages.

README.md

Lines changed: 4 additions & 18 deletions
@@ -36,13 +36,13 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz
 - [Environment Setup](#environment-setup)
 - [Featured By](#featured-by)
 
-<a id="Process"></a>
+<a id="process"></a>
 
 # Process [``](#contents)
 
 The CLI commands defined within [scribe_data/cli](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/cli) and the notebooks within the various [scribe_data](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data) directories are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) once they're active.
 
-The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
+The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/generate_autosuggestions.py). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
 
 <a id="installation"></a>
 
@@ -111,7 +111,7 @@ scribe-data total -i
 
 [Wikidata](https://www.wikidata.org/) has lots of [language data](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data) available, but not all of it is useful for all applications. In order to make the functionality of the Scribe-Data `get` requests as simple as possible, we made the decision to always return all data for the given languages and data types. Adding the ability to pass desired forms to the commands seemed cumbersome, and larger Scribe-Data requests should be parsing [Wikidata lexeme dumps](https://dumps.wikimedia.org/wikidatawiki/entities/) as the data source.
 
-Scribe's solution to the get all functionality while preserving the ability to get specific forms is to allow users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the [data_contracts](./data_contracts/) directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.
+Scribe's solution to the get all functionality while preserving the ability to get specific forms is to allow users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the [scribe_data_contracts](./scribe_data_contracts/) directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.
 
 Efficient client application data updates using Scribe-Data follow as such:
 
@@ -275,7 +275,7 @@ See the [contribution guidelines](https://github.com/scribe-org/Scribe-Data/blob
 
 # Featured By [``](#contents)
 
-Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](github.com/scribe-org/scri.be)!
+Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](https://github.com/scribe-org/scri.be)!
 
 ### Organizations
 
@@ -309,20 +309,6 @@ Many thanks to all the [Scribe-Data contributors](https://github.com/scribe-org/
 <img src="https://contrib.rocks/image?repo=scribe-org/Scribe-Data" />
 </a>
 
-### Code and Dependencies
-
-The Scribe community would like to thank all the great software that made Scribe-Data's development possible.
-
-<details><summary><strong>List of referenced posts</strong></summary>
-<p>
-
-- [Building a Recommendation System Using Neural Network Embeddings](https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9) by [WillKoehrsen](https://github.com/WillKoehrsen)
-
-- [Wikipedia Data Science: Working with the World’s Largest Encyclopedia](https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c) by [WillKoehrsen](https://github.com/WillKoehrsen)
-
-</p>
-</details>
-
 ### Wikimedia Communities
 
 <div align="center">

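Note on the README.md hunk at line 114: the paragraph describes data contracts as JSON objects whose keys are the names used in the end application and whose values are identifiers derived from Wikidata lexeme forms. A minimal sketch of that filtering idea follows. The contract contents, the record layout, and the `filter_by_contract` helper are all hypothetical and are not taken from the `scribe_data_contracts` directory or from the Scribe-Data code.

```python
import json

# Hypothetical contract: application-facing names (keys) mapped to
# identifiers derived from Wikidata lexeme forms (values).
contract = json.loads(
    '{"singular": "nominativeSingular", "plural": "nominativePlural"}'
)

# Hypothetical exported record holding every available form for one lexeme.
entry = {
    "lexemeID": "L1",
    "nominativeSingular": "Buch",
    "nominativePlural": "Bücher",
    "genitiveSingular": "Buches",
}


def filter_by_contract(entry: dict, contract: dict) -> dict:
    """Keep only the forms named in the contract, keyed by the application name."""
    return {
        app_key: entry[form_id]
        for app_key, form_id in contract.items()
        if form_id in entry  # skip forms the record does not provide
    }


filtered = filter_by_contract(entry, contract)
print(filtered)
```

Under this assumption, a change in the form identifiers on the Wikidata side only requires editing the contract values; the application keeps reading the same keys.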