prose is a comprehensive natural language processing library in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction with advanced features including multilingual processing, position tracking, and confidence scoring.
This is a fork of the original project, github.com/jdkato/prose which was archived by the author some time ago.
Major enhancements over the original library include:
- Multilingual Support: Automatic language detection and processing for English, Spanish, French, German, and Japanese
- Position Tracking: All tokens, entities, and sentences include precise position information in the original text
- Confidence Scores: ML predictions include confidence levels for reliability assessment
- Enhanced NER: Expanded from 2 to 16 entity types (PERSON, ORG, GPE, MONEY, DATE, TIME, PERCENT, FAC, PRODUCT, EVENT, WORK_OF_ART, LANGUAGE, NORP, LAW, ORDINAL, CARDINAL)
- Training API: Complete model training system with cross-validation and performance metrics
- Memory Optimization: Token pooling and efficient data structures reduce memory usage by 20-30%
- Context Support: All operations support
context.Contextfor cancellation, timeouts, and progress tracking - Rich Metadata: Documents include processing statistics, language detection, and performance metrics
- 100% Backward Compatible: All existing v2 APIs preserved while adding new functionality
See UPDATES.md for complete implementation details and migration guide.
You can find a summary on the original library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.
$ go get github.com/tsawler/prose/v3package main
import (
"fmt"
"log"
"github.com/tsawler/prose/v3"
)
func main() {
// Create a new document with the default configuration:
doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
if err != nil {
log.Fatal(err)
}
// Iterate over the doc's tokens:
for _, tok := range doc.Tokens() {
fmt.Println(tok.Text, tok.Tag, tok.Label)
// Go NNP B-GPE
// is VBZ O
// an DT O
// ...
}
// Iterate over the doc's named-entities:
for _, ent := range doc.Entities() {
fmt.Println(ent.Text, ent.Label)
// Go GPE
// Google GPE
}
// Iterate over the doc's sentences:
for _, sent := range doc.Sentences() {
fmt.Println(sent.Text)
// Go is an open-source programming language created at Google.
}
}The document-creation process adheres to the following sequence of steps:
tokenization -> POS tagging -> NE extraction
\
segmentation
Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:
doc, err := prose.NewDocument(
"Go is an open-source programming language created at Google.",
prose.WithExtraction(false))prose v3.0.0 introduces comprehensive multilingual support with automatic language detection and language-specific processing for English, Spanish, French, German, and Japanese.
package main
import (
"context"
"fmt"
"time"
"github.com/tsawler/prose/v3"
)
func main() {
// Automatic language detection
multiDoc, err := prose.NewMultilingualDocument("Bonjour le monde! Comment allez-vous?")
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
lang, confidence := multiDoc.DetectedLanguage()
fmt.Printf("Detected language: %s (confidence: %.2f)\n", lang, confidence)
// Detected language: fr (confidence: 0.51)
// Enhanced features with context and metadata
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
doc, err := prose.NewDocument("Go is excellent for NLP processing!",
prose.WithContext(ctx),
prose.WithLanguage(prose.English),
)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("Processing time: %dms\n", doc.Metadata.ProcessingTimeMs)
fmt.Printf("Token count: %d\n", doc.Metadata.TokenCount)
// Enhanced tokens with position tracking and confidence scores
for _, tok := range doc.Tokens() {
fmt.Printf("%-12s POS: %-4s Position: %d-%d Confidence: %.3f\n",
tok.Text, tok.Tag, tok.Start, tok.End, tok.Confidence)
// Go POS: NNP Position: 0-2 Confidence: 0.707
// is POS: VBZ Position: 3-5 Confidence: 1.000
// excellent POS: JJ Position: 6-15 Confidence: 1.000
// ...
}
}The new multilingual capabilities include:
- Automatic Language Detection: Detects the language of input text with confidence scoring
- Language-Specific Processing: Optimized tokenization, stop words, and normalization per language
- Context Support: All operations support
context.Contextfor timeouts and cancellation - Enhanced Metadata: Processing statistics, timing, and language information
- Position Tracking: Precise character positions for all tokens, entities, and sentences
- Confidence Scores: ML predictions include reliability assessments
prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.
| Type | Example |
|---|---|
| Email addresses | [email protected] |
| Hashtags | #trending |
| Mentions | @tsawler |
| URLs | https://github.com/tsawler/prose |
| Emoticons | :-), >:(, o_0, etc. |
package main
import (
"fmt"
"log"
"github.com/tsawler/prose/v3"
)
func main() {
// Create a new document with the default configuration:
doc, err := prose.NewDocument("@tsawler, go to http://example.com thanks :).")
if err != nil {
log.Fatal(err)
}
// Iterate over the doc's tokens:
for _, tok := range doc.Tokens() {
fmt.Println(tok.Text, tok.Tag)
// @tsawler NN
// , ,
// go VB
// to TO
// http://example.com VB
// thanks NNS
// : :
// ) )
// . .
}
}prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.
| Name | Language | License | GRS (English) | GRS (Other) | Speed† |
|---|---|---|---|---|---|
| Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
| prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
| TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
| OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
| Standford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
| Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
| Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
| SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
| Scapel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |
† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while
prosewas timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.
package main
import (
"fmt"
"strings"
"github.com/tsawler/prose/v3"
)
func main() {
// Create a new document with the default configuration:
doc, _ := prose.NewDocument(strings.Join([]string{
"I can see Mt. Fuji from here.",
"St. Michael's Church is on 5th st. near the light."}, " "))
// Iterate over the doc's sentences:
sents := doc.Sentences()
fmt.Println(len(sents)) // 2
for _, sent := range sents {
fmt.Println(sent.Text)
// I can see Mt. Fuji from here.
// St. Michael's Church is on 5th st. near the light.
}
}prose includes a tagger based on Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:
| Library | Accuracy | 5-Run Average (sec) |
|---|---|---|
| NLTK | 0.893 | 7.224 |
prose |
0.961 | 2.538 |
(See scripts/test_model.py for more information.)
The full list of supported POS tags is given below.
| TAG | DESCRIPTION |
|---|---|
( |
left round bracket |
) |
right round bracket |
, |
comma |
: |
colon |
. |
period |
'' |
closing quotation mark |
`` |
opening quotation mark |
# |
number sign |
$ |
currency |
CC |
conjunction, coordinating |
CD |
cardinal number |
DT |
determiner |
EX |
existential there |
FW |
foreign word |
IN |
conjunction, subordinating or preposition |
JJ |
adjective |
JJR |
adjective, comparative |
JJS |
adjective, superlative |
LS |
list item marker |
MD |
verb, modal auxiliary |
NN |
noun, singular or mass |
NNP |
noun, proper singular |
NNPS |
noun, proper plural |
NNS |
noun, plural |
PDT |
predeterminer |
POS |
possessive ending |
PRP |
pronoun, personal |
PRP$ |
pronoun, possessive |
RB |
adverb |
RBR |
adverb, comparative |
RBS |
adverb, superlative |
RP |
adverb, particle |
SYM |
symbol |
TO |
infinitival to |
UH |
interjection |
VB |
verb, base form |
VBD |
verb, past tense |
VBG |
verb, gerund or present participle |
VBN |
verb, past participle |
VBP |
verb, non-3rd person singular present |
VBZ |
verb, 3rd person singular present |
WDT |
wh-determiner |
WP |
wh-pronoun, personal |
WP$ |
wh-pronoun, possessive |
WRB |
wh-adverb |
prose v3.0.0 includes a significantly enhanced named entity recognition system that can identify 16 different entity types including people (PERSON), geographical/political entities (GPE), organizations (ORG), dates (DATE), money (MONEY), and many more. All predictions include confidence scores and precise position information.
package main
import (
"fmt"
"github.com/tsawler/prose/v3"
)
func main() {
doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
for _, ent := range doc.Entities() {
fmt.Println(ent.Text, ent.Label)
// Lebron James PERSON
// Los Angeles GPE
}
}It is also straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.
This library is based on the original prose library by Joseph Kato.
- gonum - Scientific computing packages for Go
- sentences - Sentence segmentation library
- stopwords - Multilingual stop words removal library
- golang.org/x/text - Go text processing support library