DKPro Core 1.11.0
We are pleased to announce the release of DKPro Core 1.11.0, a collection of interoperable software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core
This is a feature release.
Important upgrade notice
- Changed groupIds and artifactIds. The group ID is now org.dkpro.core and the artifact IDs are dkpro-core-...-(asl/gpl)
- Changed package names. All packages now start with org.dkpro.core..., except the packages of UIMA types, which remain unchanged for data compatibility.
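
To illustrate the package rename, here is a minimal sketch of how fully qualified class names in existing code might be migrated. It rests on two assumptions that the notice above does not state: that the previous package prefix was de.tudarmstadt.ukp.dkpro.core, and that UIMA type packages can be recognized by a ".type." segment in their name. The class PackageMigrationSketch and its methods are hypothetical helpers and not part of DKPro Core.

```java
// Hypothetical helper for illustration only; not part of DKPro Core.
class PackageMigrationSketch {
    // ASSUMPTION: package prefix used before 1.11.0 (not stated in the announcement).
    static final String OLD_PREFIX = "de.tudarmstadt.ukp.dkpro.core.";
    // New prefix as given in the upgrade notice.
    static final String NEW_PREFIX = "org.dkpro.core.";

    // ASSUMPTION: UIMA type packages contain a ".type." segment.
    // This is a heuristic and may need adjusting for a real code base.
    static boolean isTypePackage(String qualifiedName) {
        return qualifiedName.contains(".type.");
    }

    // Rewrites component class names to the new prefix and leaves
    // UIMA types and unrelated classes untouched.
    static String migrate(String qualifiedName) {
        if (!qualifiedName.startsWith(OLD_PREFIX) || isTypePackage(qualifiedName)) {
            return qualifiedName;
        }
        return NEW_PREFIX + qualifiedName.substring(OLD_PREFIX.length());
    }

    public static void main(String[] args) {
        // Example names chosen for illustration.
        String[] names = {
            "de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger",
            "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"
        };
        for (String name : names) {
            System.out.println(name + " -> " + migrate(name));
        }
    }
}
```

In this sketch, a component class moves to the new prefix, while a type class keeps its old name, which matches the data compatibility exception described above. A real migration should be checked against the actual imports in the project, since the heuristic is only an approximation.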
Notable changes since DKPro Core 1.10.0
- Changed parts of the brat data conversion code such that it can be more easily used outside a UIMA component
- Changed type mapping such that out-of-tagset types map to the generic type (e.g. an unknown POS tag maps to POS, not to POS_X)
- Changed name of NYTCollectionReader to NitfReader
- Added types to encode XML document structure in CAS
- Added new XmlDocumentReader/Writer components using these types
- Added basic reader for Annotated Gigaword corpus (only reads text so far) (thanks @az79nefy)
- Added basic support for PubAnnotation JSON format
- Added Maui component for keyword assignment
- Added parameter to SfstAnnotator to enable lower-case lookup of first word in a sentence (thanks @rziai)
- Added "order" feature to Token type
- Added support for CoNLL-U document and paragraph IDs (thanks @manuelciosici)
- Added support for CoNLL-U sentence IDs and text
- Added standardized parameter to disable type mapping
- Added support for TCF orthography layer using SofaChangeAnnotations
- Added segmenter for Chinese using jieba (thanks @Horsmann)
- Added MyStem for Russian
- Added links to OpenMinTeD categories in type system documentation
- Added support for reading/writing the CoreNLP CoNLL flavor
- Added parameter to configure the Tika buffer size (useful for large documents)
- Updated to OpenNLP 1.9.1
- Updated to CoreNLP 3.9.2
- Updated to ICU4J 64.2
- Updated to Tika 1.19.1
- Updated to LanguageTool 4.3
- Updated to PDFBox 2.0.12
- Updated IllinoisNLP components
- Updated TreeTagger models/binaries in build.xml script (thanks @tilmanbeck)
- Updated LIF dependencies
- Updated dataset descriptions
- Updated various general dependencies (e.g. Apache Commons etc.)
- Improved robustness of checksum verification for text files used in datasets (e.g. license files)
- Improved error messages in WebAnno TSV3 module
- Fixed crash in WebannoTsv3XWriter when annotations do not start/end at token boundaries
- Fixed bug in WebAnno TSV3 support causing span annotations with slot features to disappear
- Fixed trimming of whitespace in TeiReader
- Fixed bug in NifWriter causing named entity identifier not to be written
- Fixed crash in BratReader when reading discontinuous segments
- Fixed problem in BratWriter when dealing with slot features
- Fixed metadata of CoNLL2012Writer
- Fixed potential problem of datasets being written outside their target directory
- Dropped the GrAF I/O module since the upstream libraries are outdated and no longer maintained
A more detailed overview of the changes in this release can be found here.
Thanks for contributions go to: @az79nefy, @ramonziai, @manuelciosici, @Horsmann, @tilmanbeck
When upgrading, please note that you should not mix different versions of DKPro Core components in your projects, as they may not be compatible with each other.