The raw and postprocessed datasets can be found here
We have created a new defoe query for extracting automatically articles from the EB. The articles are stored per edition in YAML files.
Here we have the command for running this query for extracting the articles of the first edition, assuming that we are located in the defoe directory.
spark-submit --py-files defoe.zip defoe/run_query.py nls_first_edition.txt nlsArticles defoe.nlsArticles.queries.write_articles_pages_df_yaml queries/write_to_yml.yml -r frances/ results_NLS/results_eb_1_edition] -n 34
Note that for running this query you need configuration file for specifying the operating system and the defoe path for the long_s fix:
We have stored these data stored in NLS_EB/results_NLS/results_eb_<1|2|3|4>_edition.
We have 8 EB editions, meaning that we have 8 extracted YAML files in total!!
Each YAML file has a row per article found within a page (Example), with the following columns (being the most important term , definition and type_page):
- title: title of the book (e.g. Encyclopaedia Britannica)
- edition: edition of the book (e.g Eighth edition, Volume 2, A-Anatomy)
- year: year of publication/edition (e.g. 1853)
- place: place (e.g. Edinburgh)
- archive_filename: directory path of the book (e.g. /home/rosa_filgueira_vicente/datasets/single_EB/193322698/)
- source_text_filename: directory Path of the page (e.g. alto/193403113.34.xml)
- text_unit: unit that represent each ALTO XML. These could be Page or Issue.
- text_unit_id: id of the page (e.g. Page704)
- num_text_unit: number of pages (e.g. 904)
- type_archive: type of archive. Thse could be book or newspapers.
- model: defoe model used for ingesting this dataset (nlsArticles)
- type_page: the page classification that has been done by defoe. These could be Topic, Articles, Mix or Full Page.
- header: the header of the page (e.g. AMERICA)
- term: term that is going to be described (e.g. AMERICA)
- definition: words describing an article / topic/ full page: ( e.g. “AMERICA. being inhabited. The Aleutian ….”)
- num_articles: number of articles per page. In case a page has been classified as Topic or FullPage, the number of articles is 1.
- num_page_words: number of words per page (e.g. 1373)
- num_article_words: number of words of an article (e.g. 1362)
- type_page: Type of Page.
We have detected two types of articles with two different patterns at “page” level:
-
Short articles (named as articles): Usually presented by a TERM in the main text in uppercase, followed by a “,” (e.g. ALARM, ) and then a DESCRIPTION of the TERM (similar to an entry in a dictionary). This description normally is one or two paragraphs, but of course there are exceptions.
- Term: ALARM
- Definition: in the Military Art, denotes either the apprehension of being suddenly attacked, or the notice thereof signified by firing a cannon, firelock, or the like. False alarms are frequently made use of to harass the enemy, by keeping them constantly under arms. , ….
-
Long articles (named as topics): In this is the case, the Encyclopaedia introduces a TERM in the header of a page (which is not the case for the short articles), and then it normally uses several pages to describe that topic (and very often it uses a combination of text, pictures, tables, etc.). For example, the “topic” AMERICA goes from page 677 to 724 (47 pages!)
We have also detected that some pages (e.g. Preface, FrontPage, List of Authors) do not contain articles nor topics. We classify those pages as "Full_Page". And we also have noticed that there are some pages that have a "Mix" of articles and topics - we classify those pages as "Mix".
Therefore a page can be classified (this information is stored in type_page) as:
- Article: If it has several short articles
- Topic: If it has a topic
- Mix: If it has a mix of Articles and Topics
- Full_Page: If it hasnt have Articles nor Topics.
Important: Topic is just the way we named the long articles that expands more than a page. It does not refer to “NLP topic”.
We have realised that those articles/topics need additional postprocess treatments before peforming futher analyses with them. For example, we need to merge articles and topics that are split across pages. We have also noticed that some pages have wrongly been classified as "Topic", since they should be classified as articles. And the first pages very often get confused as topics or articles - they should classified as "Full_Page".
Therefore, we have created Merging_EB_Terms.ipynb, a notebook that cleans each of the files obtained with defoe (applying different cleaning treatments). And it creates a new "clean" version of each them: NLS_EB/results_NLS/results_eb_<1|2|3|4...>edition_updated.
Here we have an example of the results of the 1st edition cleaned.
Furthermore, this notebook also re-arranges the updated information (and drops some metada) to create a NEW dataframe per file/edition, with the following METADATA/COLUMNS/PROPERTIES:
- definition: Definition of a term
- editionNum: 1,2,3,4,5,6,7,8
- editionTitle: Title of the edition
- header: Header of the page's term
- place: Place where the volume was edited (e.g. Edinburgh)
- relatedTerms: Related terms (see X article)
- altoXML: File Path of the XML file from which the term belongs
- term: Term name
- positionPage: Position of ther term in the page
- startsAt: Number page in which the term definition starts
- endsAt: Number page in which the term definition ends
- volumeTitle: Title of the Volume
- typeTerm: Type of term [Topic| Articles]
- year: Year of the edition
- volumeNum: Volume number (e.g. 1)
- letters: leters of the volume (A-B)
- part: Part of the volume (e.g 1)
- supplementTitle: Supplement's Title
- supplementsTo: It suppelements to editions [1, 2, 3....]
- numberOfWords: Number of words per term definition
- numberOfTerms: Number of terms per page
- numberOfPages: Number of pages per volume
- numberOfVolumes: Number of volumes per edition or supplement
- similar_terms: Applying Transformers - "all-mpnet-base-v2" - extract which terms are more similar to others.
- topic_summaritzation : Applying Transformers - "XLNeT"- summarize an topic definition
- sentiment_analysis: Applying Transformers - "siebert/sentiment-roberta-large-english" - classify the definitions between positve and negative.
- spelling_checker : Applying Transformers + neuspell - check the terms definitions and fix errors.
We have a row per TERM. Note, that a TERM can appear several times per edition. That is the case when we have several definitions per term.
ABACUS - Definition: a table strewed over with dust or sand, upon which the ancient mathematicians drew their figures, It also signified a cupboard, or buffet.
---
ABACUS - Definition: in architeflure, signifies the superior part or member of the capital of a column, and serves as a kind of crowning to both. It was originally intended to represent a square tile covering a basket. The form of the abacus is not the same in all orders: in the Tuscan, Doric, and Ionic, it‘is generally square; but in the Corinthian and Compofite, its four sides are arched ir Avards, and embellilhed in the middle withornament, as a rose or other flower, Scammozzi uses abacus for a concave moulding on the capital of the Tuscan pedefial; and Palladio calls the plinth above the echinus, or boultin, in the Tufean and Doric orders, by the same name. See plate I. fig. i. and
---
ABACUS - Definition: is also the name of an ancient instrument for facilitating operations in arithmetic. It is vadoully contrived. That chiefly used in Europe is made by drawing any number of parallel lines at the di(lance of two diameters of one of the counters used in the calculation. A counter placed on.the lowed line, signifies r; on the sd, 10; on the 3d, 100; on the 4th, 1000, &c. In the intermediate spaces, the same counters are eflimated at one Jialf of the value of the line immediately superior, viz. between the id and 2d, 5; between the 2d and 3d, 50, &c. See plate I. fig. 2. A B, where the same number, 1768 for example, is represented under both by different dispositions of the counters.
---
ABACUS - Definition: logijlicus, a right-angled triangle, whose sides forming the right angle contain the numbers from 1 to 60, and its area the fafta of every two of the numbers perpendicularly opposite. This is also called a canon Jk^&cus Pythagvricus, the multiplication-table, or any table of numbers that facilitates operations in arith-
---
THESE METADATA/COLUMNS/PROPERTIES ARE THE ONES THAT WE ARE GOING TO USE FROM NOW ON
VERY IMPORTANT These dataframes are stored as JSON files (using orient="index") in NLS_EB/results_NLS/results_eb_[1|2|3|4 ...]edition<1|2|3|4...>_postprocess_dataframe. Example. See bellow the comand that we used for storing the dataframe corresponding to the 1st Edition.
df.to_json(r'./results_NLS/results_eb_1_edition_postprocess_dataframe', orient="index")
We have also improved our query for extracting all the metadata from the all the editions, supplements and volumes from EB. In this case, we do not enter to extract the metadata at article level.
spark-submit --py-files defoe.zip defoe/run_query.py nls_first_edition.txt nls defoe.nls.queries.metadata_yaml -r frances/ results_NLS/eb_metadata_details.txt -n 34
propierties extracted:
- MMSID: Metadata Management System ID
- editionTitle: Title of the edition
- editionSubTilte: Subtitle of the edition
- editor: Editor (person) of an edition or a supplement
- termsOfAddress: Terms of Address of the editor (e.g. Sir)
- editor_date: Year of Birth - Year of Death
- genre: genre of the editions
- language: language used to write the volumes
- numberOfPages: number of pages of a volume
- physicalDescription: physical description of a edition
- place: place printed of a edition or a supplement
- publisher: publisher (organization or person) of an edition
- referencedBy: books which reference an edition
- shelfLocator: shelf locator of an edition
- subTitle: subtitle of an edition
- volumeTitle: title of a volume
- year: year of print
- volumeId: volume identifier
- metsXML: XML mets file
- permanentURL: URL of a volume
- publisherPersons: list of publishers which are persons
- volumeNum: Number of a volume
- letters: Letters of a volume
- part: Part of a volume
- editionNum: Number of an editior
- supplementTitle: Supplement subTitle
- supplementsTo: List of editions which a supplement supplements to
- numberOfVolumes: Number of volumes per edition or supplement
Here a list of questions that we want to ask to these data (using the EB_Articles Clean Metadata):
(Remember, a term can have more than one definition per edition)
- Give me all the volumes that we have per edition
- Given an edition, give me the years that each volume has been published.
- Given an edition and a volume, give me all the terms
- Given an edition, give me all the terms
- Given a term, give me all editions and volumes that it appears.
- Given a term, give me all the definitions that we have per edition.
- Give the terms that only appear in one edition.
- Give the terms that appears in all editions.
- Given an edition, tell me the terms for which we have more definitions
- Search definitions for a given term and edition.
- Given a term and edition, tell me which terms (based on "related_terms") are related with it.
- Given a term, see how the definition(s) have changed across editions.
Example 2: