Prose: Natural Language Processing in pure Go

prose is a comprehensive natural language processing library in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction with advanced features including multilingual processing, position tracking, and confidence scoring.

This is a fork of the original project, github.com/jdkato/prose which was archived by the author some time ago.

What's New in v3.0.0

Major enhancements over the original library include:

Multilingual Support: Automatic language detection and processing for English, Spanish, French, German, and Japanese
Position Tracking: All tokens, entities, and sentences include precise position information in the original text
Confidence Scores: ML predictions include confidence levels for reliability assessment
Enhanced NER: Expanded from 2 to 16 entity types (PERSON, ORG, GPE, MONEY, DATE, TIME, PERCENT, FAC, PRODUCT, EVENT, WORK_OF_ART, LANGUAGE, NORP, LAW, ORDINAL, CARDINAL)
Training API: Complete model training system with cross-validation and performance metrics
Memory Optimization: Token pooling and efficient data structures reduce memory usage by 20-30%
Context Support: All operations support context.Context for cancellation, timeouts, and progress tracking
Rich Metadata: Documents include processing statistics, language detection, and performance metrics
100% Backward Compatible: All existing v2 APIs preserved while adding new functionality

See UPDATES.md for complete implementation details and migration guide.

You can find a summary on the original library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/tsawler/prose/v3

Usage

Overview

package main

import (
    "fmt"
    "log"

    "github.com/tsawler/prose/v3"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))

Multilingual Support

prose v3.0.0 introduces comprehensive multilingual support with automatic language detection and language-specific processing for English, Spanish, French, German, and Japanese.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/tsawler/prose/v3"
)

func main() {
    // Automatic language detection
    multiDoc, err := prose.NewMultilingualDocument("Bonjour le monde! Comment allez-vous?")
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }
    
    lang, confidence := multiDoc.DetectedLanguage()
    fmt.Printf("Detected language: %s (confidence: %.2f)\n", lang, confidence)
    // Detected language: fr (confidence: 0.51)
    
    // Enhanced features with context and metadata
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    
    doc, err := prose.NewDocument("Go is excellent for NLP processing!",
        prose.WithContext(ctx),
        prose.WithLanguage(prose.English),
    )
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }
    
    fmt.Printf("Processing time: %dms\n", doc.Metadata.ProcessingTimeMs)
    fmt.Printf("Token count: %d\n", doc.Metadata.TokenCount)
    
    // Enhanced tokens with position tracking and confidence scores
    for _, tok := range doc.Tokens() {
        fmt.Printf("%-12s POS: %-4s Position: %d-%d Confidence: %.3f\n", 
            tok.Text, tok.Tag, tok.Start, tok.End, tok.Confidence)
        // Go           POS: NNP  Position: 0-2 Confidence: 0.707
        // is           POS: VBZ  Position: 3-5 Confidence: 1.000
        // excellent    POS: JJ   Position: 6-15 Confidence: 1.000
        // ...
    }
}

The new multilingual capabilities include:

Automatic Language Detection: Detects the language of input text with confidence scoring
Language-Specific Processing: Optimized tokenization, stop words, and normalization per language
Context Support: All operations support context.Context for timeouts and cancellation
Enhanced Metadata: Processing statistics, timing, and language information
Position Tracking: Precise character positions for all tokens, entities, and sentences
Confidence Scores: ML predictions include reliability assessments

Tokenizing

prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

Type	Example
Email addresses	`[email protected]`
Hashtags	`#trending`
Mentions	`@tsawler`
URLs	`https://github.com/tsawler/prose`
Emoticons	`:-)`, `>:(`, `o_0`, etc.

package main

import (
    "fmt"
    "log"

    "github.com/tsawler/prose/v3"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@tsawler, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @tsawler NN
        // , ,
        // go VB
        // to TO
        // http://example.com VB
        // thanks NNS
        // : :
        // ) )
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

Name	Language	License	GRS (English)	GRS (Other)	Speed†
Pragmatic Segmenter	Ruby	MIT	98.08% (51/52)	100.00%	3.84 s
prose	Go	MIT	75.00% (39/52)	N/A	0.96 s
TactfulTokenizer	Ruby	GNU GPLv3	65.38% (34/52)	48.57%	46.32 s
OpenNLP	Java	APLv2	59.62% (31/52)	45.71%	1.27 s
Standford CoreNLP	Java	GNU GPLv3	59.62% (31/52)	31.43%	0.92 s
Splitta	Python	APLv2	55.77% (29/52)	37.14%	N/A
Punkt	Python	APLv2	46.15% (24/52)	48.57%	1.79 s
SRX English	Ruby	GNU GPLv3	30.77% (16/52)	28.57%	6.19 s
Scapel	Ruby	GNU GPLv3	28.85% (15/52)	20.00%	0.13 s

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/tsawler/prose/v3"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

Library	Accuracy	5-Run Average (sec)
NLTK	0.893	7.224
`prose`	0.961	2.538

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

TAG	DESCRIPTION
`(`	left round bracket
`)`	right round bracket
`,`	comma
`:`	colon
`.`	period
`''`	closing quotation mark
``	opening quotation mark
`#`	number sign
`$`	currency
`CC`	conjunction, coordinating
`CD`	cardinal number
`DT`	determiner
`EX`	existential there
`FW`	foreign word
`IN`	conjunction, subordinating or preposition
`JJ`	adjective
`JJR`	adjective, comparative
`JJS`	adjective, superlative
`LS`	list item marker
`MD`	verb, modal auxiliary
`NN`	noun, singular or mass
`NNP`	noun, proper singular
`NNPS`	noun, proper plural
`NNS`	noun, plural
`PDT`	predeterminer
`POS`	possessive ending
`PRP`	pronoun, personal
`PRP$`	pronoun, possessive
`RB`	adverb
`RBR`	adverb, comparative
`RBS`	adverb, superlative
`RP`	adverb, particle
`SYM`	symbol
`TO`	infinitival to
`UH`	interjection
`VB`	verb, base form
`VBD`	verb, past tense
`VBG`	verb, gerund or present participle
`VBN`	verb, past participle
`VBP`	verb, non-3rd person singular present
`VBZ`	verb, 3rd person singular present
`WDT`	wh-determiner
`WP`	wh-pronoun, personal
`WP$`	wh-pronoun, possessive
`WRB`	wh-adverb

NER

prose v3.0.0 includes a significantly enhanced named entity recognition system that can identify 16 different entity types including people (PERSON), geographical/political entities (GPE), organizations (ORG), dates (DATE), money (MONEY), and many more. All predictions include confidence scores and precise position information.

package main

import (
    "fmt"

    "github.com/tsawler/prose/v3"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

It is also straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.

Credits

This library is based on the original prose library by Joseph Kato.

Dependencies

gonum - Scientific computing packages for Go
sentences - Sentence segmentation library
stopwords - Multilingual stop words removal library
golang.org/x/text - Go text processing support library

Name		Name	Last commit message	Last commit date
Latest commit History 315 Commits
model		model
scripts		scripts
testdata		testdata
.gitignore		.gitignore
.travis.yml		.travis.yml
EXTERNAL-LEXICON-GUIDE.md		EXTERNAL-LEXICON-GUIDE.md
LICENSE.md		LICENSE.md
Makefile		Makefile
ORIGINAL-LICENSE.md		ORIGINAL-LICENSE.md
README.md		README.md
SENTIMENT-ANALYSIS.md		SENTIMENT-ANALYSIS.md
UPDATES.md		UPDATES.md
data.go		data.go
doc.go		doc.go
document.go		document.go
document_test.go		document_test.go
extract.go		extract.go
extract_test.go		extract_test.go
go.mod		go.mod
go.sum		go.sum
model.go		model.go
model_test.go		model_test.go
multilingual.go		multilingual.go
multilingual_test.go		multilingual_test.go
segment.go		segment.go
segment_test.go		segment_test.go
sentiment.go		sentiment.go
sentiment_features.go		sentiment_features.go
sentiment_lexicon.go		sentiment_lexicon.go
sentiment_test.go		sentiment_test.go
tag.go		tag.go
tag_test.go		tag_test.go
tokenize.go		tokenize.go
tokenize_test.go		tokenize_test.go
training.go		training.go
types.go		types.go
utilities.go		utilities.go
words.go		words.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Prose: Natural Language Processing in pure Go

What's New in v3.0.0

Installation

Usage

Contents

Overview

Multilingual Support

Tokenizing

Segmenting

Tagging

NER

Credits

Dependencies

About

Uh oh!

Releases

Packages

Contributors 10

Uh oh!

Languages

License

tsawler/prose

Folders and files

Latest commit

History

Repository files navigation

Prose: Natural Language Processing in pure Go

What's New in v3.0.0

Installation

Usage

Contents

Overview

Multilingual Support

Tokenizing

Segmenting

Tagging

NER

Credits

Dependencies

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 10

Uh oh!

Languages

Packages