Bookmark
Advanced NLP with spaCy · A free online course
https://course.spacy.io/en/, posted Sep '23 by peter in development free language learning nlp toread
spaCy is a modern Python library for industrial-strength Natural Language Processing. In this free and interactive online course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.
Bookmark
Damn Cool Algorithms: Levenshtein Automata
blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata, posted 2022 by peter in development nlp reference text toread
The basic insight behind Levenshtein automata is that it's possible to construct a finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time, where n is the length of the string being tested. Compare this to the standard dynamic programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It's thus immediately apparent that Levenshtein automata provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance - not a bad improvement to start with!
Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There's much more to come, but first let's see what a Levenshtein automaton looks like, and how we can build one.
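As a rough illustration of the idea, here is a minimal Python sketch that simulates such an automaton lazily, using capped rows of the dynamic programming table as states. The class and method names are made up for this note and are not taken from the linked article. Because each step scans the target word, this version does not reach the O(n) bound described above; a precompiled DFA would be needed for that.

```python
from typing import Tuple

State = Tuple[int, ...]


class LevenshteinAutomaton:
    """Lazy simulation: a state is one DP row, capped at max_distance + 1."""

    def __init__(self, target: str, max_distance: int) -> None:
        self.target = target
        self.max_distance = max_distance
        # Any value above max_distance behaves the same, so cap it.
        # This keeps the number of distinct states finite.
        self._cap = max_distance + 1

    def start(self) -> State:
        # Distance from the empty input to each prefix of the target.
        return tuple(min(i, self._cap) for i in range(len(self.target) + 1))

    def step(self, state: State, char: str) -> State:
        # Consume one input character and compute the next DP row.
        new_row = [min(state[0] + 1, self._cap)]
        for i, target_char in enumerate(self.target):
            cost = 0 if target_char == char else 1
            value = min(
                new_row[i] + 1,      # skip a target character
                state[i] + cost,     # match or substitute
                state[i + 1] + 1,    # skip the input character
            )
            new_row.append(min(value, self._cap))
        return tuple(new_row)

    def is_match(self, state: State) -> bool:
        # Accepting state: the whole target is within the allowed distance.
        return state[-1] <= self.max_distance

    def can_match(self, state: State) -> bool:
        # Dead-state check: if every entry is over budget, no suffix can help.
        return min(state) <= self.max_distance

    def accepts(self, word: str) -> bool:
        state = self.start()
        for char in word:
            state = self.step(state, char)
            if not self.can_match(state):
                return False
        return self.is_match(state)


automaton = LevenshteinAutomaton("food", 1)
candidates = ["food", "fod", "foods", "fxxd"]
matches = [word for word in candidates if automaton.accepts(word)]
```

The early exit in accepts() is what makes the automaton useful for pruning: once a prefix is rejected, every word sharing that prefix can be skipped.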
Bookmark
A list of free data matching and record linkage software
https://github.com/J535D165/data-matching-software, posted 2021 by peter in development free list nlp opensource software
This is a list of (Fuzzy) Data Matching software. The software in this list is open source and/or freely available.
The term data matching is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Data matching has two applications: (1) to match data across multiple datasets (linkage) and (2) to match data within a dataset (deduplication). See the Wikipedia page about data matching for more information.
Similar terms: record linkage, data matching, deduplication, fuzzy matching, entity resolution
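To make the deduplication case concrete, here is a small Python sketch that flags likely duplicates within one list using the standard library's difflib. It is only an illustration and is not taken from any of the listed tools; the function name, the 0.85 threshold, and the sample names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def find_duplicates(records: List[str], threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Return index pairs whose normalized strings look alike.

    Compares every pair, so it is only suitable for small lists.
    The threshold is an arbitrary choice for this sketch.
    """
    normalized = [" ".join(record.lower().split()) for record in records]
    pairs = []
    for i, j in combinations(range(len(normalized)), 2):
        similarity = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if similarity >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical sample data.
names = ["John Smith", "Jon Smith", "Alice Jones"]
duplicate_pairs = find_duplicates(names)
```

Linkage works the same way, except that the pairs are drawn from two different datasets instead of one.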
Bookmark
How to calculate the alignment between BERT and spaCy tokens effectively and robustly
https://gist.github.com/tamuhey/af6cbb44a703423556c32798e1e1b704, posted 2021 by peter in development free language nlp opensource software toread
Suppose we want to combine a BERT-based named entity recognition (NER) model with a rule-based NER model built on top of spaCy. Although BERT's NER exhibits extremely high performance, it is usually combined with rule-based approaches for practical purposes. In such cases, what often bothers us is that tokens of spaCy and BERT are different, even if the input sentences are the same. For example, let's say the input sentence is "John Johanson 's house"; BERT tokenizes this sentence like
["john", "johan", "##son", "'", "s", "house"]
and spaCy tokenizes it like
["John", "Johanson", "'s", "house"].
To combine the outputs, we need to calculate the correspondence between the two different token sequences. This correspondence is the "alignment".
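For the example above, a simple alignment can be sketched in Python by comparing character offsets. This is not necessarily the method the linked gist uses, and the function names are made up for this note. It assumes that both token sequences spell the same text once lowercased and stripped of the "##" continuation prefix, so it would fail on inputs where a tokenizer drops or rewrites characters.

```python
from typing import List, Tuple


def normalize(token: str) -> str:
    # Drop the WordPiece continuation prefix and ignore case.
    if token.startswith("##"):
        token = token[2:]
    return token.lower()


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    # Character span of each token in the concatenated text.
    spans = []
    position = 0
    for token in tokens:
        spans.append((position, position + len(token)))
        position += len(token)
    return spans


def align(tokens_a: List[str], tokens_b: List[str]) -> List[List[int]]:
    """For each token in tokens_a, list the overlapping token indices in tokens_b."""
    norm_a = [normalize(t) for t in tokens_a]
    norm_b = [normalize(t) for t in tokens_b]
    if "".join(norm_a) != "".join(norm_b):
        raise ValueError("token sequences do not spell the same text")

    spans_b = char_spans(norm_b)
    alignment = []
    j = 0
    for start, end in char_spans(norm_a):
        # Skip tokens in b that end before this token starts.
        while j < len(spans_b) and spans_b[j][1] <= start:
            j += 1
        # Collect every token in b that begins before this token ends.
        overlapping = []
        k = j
        while k < len(spans_b) and spans_b[k][0] < end:
            overlapping.append(k)
            k += 1
        alignment.append(overlapping)
    return alignment


bert_tokens = ["john", "johan", "##son", "'", "s", "house"]
spacy_tokens = ["John", "Johanson", "'s", "house"]

bert_to_spacy = align(bert_tokens, spacy_tokens)
# [[0], [1], [1], [2], [2], [3]]
spacy_to_bert = align(spacy_tokens, bert_tokens)
# [[0], [1, 2], [3, 4], [5]]
```

The strict equality check is the weak point: a more robust approach would tolerate mismatched characters rather than raise an error.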
Bookmark
LibreTranslate: Free and Open Source Machine Translation API
https://github.com/uav4geo/LibreTranslate, posted 2021 by peter in api free language nlp opensource software
Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it doesn't rely on proprietary providers such as Google or Azure to perform translations.
Bookmark
EleutherAI - GPT-Neo
https://www.eleuther.ai/gpt-neo, posted 2021 by peter in ai free nlp opensource
GPT-Neo is the code name for a series of transformer-based language models loosely styled around the GPT architecture that we plan to train and open source. Our primary goal is to replicate a GPT-3 sized model and open source it to the public, for free.
Bookmark
Apache Tika
https://tika.apache.org/, posted 2020 by peter in free language nlp opensource search software
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Bookmark
Tone Analyzer
https://tone-analyzer-demo.mybluemix.net/, posted 2016 by peter in ai demo language nlp online text writing
This service uses linguistic analysis to detect and interpret emotions, social tendencies, and language style cues found in text.
Bookmark
wooorm/franc
https://github.com/wooorm/franc, posted 2014 by peter in development free language nlp opensource python software
Detect the language of text.
Bookmark
TextBlob: Simplified Text Processing — TextBlob 0.5.0 documentation
https://textblob.readthedocs.org/en/latest/, posted 2013 by peter in development free language nlp python software toread
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and more.