Spark NLP 101: LightPipeline

By Veysel Kocaman, Knowledge Scientist & ML Researcher



That is the second article in a collection through which we’re going to write a separate article for every annotator within the Spark NLP library. You could find all of the articles at this link.

This text is principally constructed on prime of Introduction to Spark NLP: Foundations and Basic Components (Part-I). Please learn that in the first place, if you wish to study extra about Spark NLP and its underlying ideas.

In machine studying, it’s common to run a sequence of algorithms to course of and study from knowledge. This sequence is often known as a Pipeline.

A Pipeline is specified as a sequence of levels, and every stage is both a Transformer or an Estimator. These levels are run so as, and the enter DataFrame is remodeled because it passes by means of every stage. That’s, the info are handed by means of the fitted pipeline so as. Every stage’s rework() technique updates the dataset and passes it to the following stage. With the assistance of Pipelines, we are able to make sure that coaching and take a look at knowledge undergo similar characteristic processing steps.


Every annotator utilized provides a brand new column to a knowledge body that’s fed into the pipeline


Now let’s see how this may be carried out in Spark NLP utilizing Annotators and Transformers. Assume that now we have the next steps that have to be utilized one after the other on an information body.

  • Cut up textual content into sentences
  • Tokenize
  • Normalize
  • Get phrase embeddings

And right here is how we code this pipeline up in Spark NLP.

from import Pipelinedocument_assembler = DocumentAssembler()

sentenceDetector = SentenceDetector()

tokenizer = Tokenizer() 

normalizer = Normalizer()


nlpPipeline = Pipeline(levels=[

pipelineModel = nlpPipeline.match(df)

I’m going to load a dataset after which feed it into this pipeline to allow you to see the way it works.


pattern DataFrame (5452 rows)


After operating the pipeline above, we get a educated pipeline mannequin. Let’s rework the complete DataFrame.

end result = pipelineModel.rework(df)
end result.present()

It took 501 ms to rework the primary 20 rows. If we rework the complete knowledge body, it will take 11 seconds.


end result = pipelineModel.rework(df).acquire()

CPU occasions: consumer 2.01 s, sys: 425 ms, whole: 2.43 s
Wall time: 11 s

It appears to be like good. What if we need to save this pipeline to disk after which deploy it to get runtime transformations on a given line of textual content (one row).

from pyspark.sql import Row

textual content = "How did serfdom develop in and then leave Russia ?"

line_df = spark.createDataFrame(record(map(lambda x: Row(textual content=x), [text])), ["text"])

%time end result = pipelineModel.rework(line_df).acquire()

CPU occasions: consumer 31.1 ms, sys: 7.73 ms, whole: 38.9 ms
Wall time: 515 ms

Reworking a single line of a brief textual content took 515 ms! Almost the identical because it took for remodeling the primary 20 rows. So, it isn’t good. Really, that is what occurs when making an attempt to make use of distributed processing on small knowledge. Distributed processing and cluster computing are primarily helpful for processing a considerable amount of knowledge (aka large knowledge). Utilizing Spark for small knowledge can be like getting in a combat with an ax:-)

Really, as a consequence of its interior mechanism and optimized structure, Spark might nonetheless be helpful for the common dimension of information that may very well be dealt with on a single machine. However on the subject of processing just some strains of textual content, it isn’t advisable until you utilize Spark NLP.

Allow us to make an analogy that can assist you perceive this. Spark is sort of a locomotive racing a bicycle. The bike will win if the load is gentle, it’s faster to speed up and extra agile, however with a heavy load the locomotive may take some time to stand up to hurry, nevertheless it’s going to be sooner ultimately.


Spark is sort of a locomotive racing a bicycle


So, what are we going to do if we need to have a sooner inference time? Right here comes LightPipeline.



LightPipelines are Spark NLP particular Pipelines, equal to Spark ML Pipeline, however meant to take care of smaller quantities of information. They’re helpful working with small datasets, debugging outcomes, or when operating both coaching or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines transformed right into a single machine however the multi-threaded job, turning into greater than 10x occasions sooner for smaller quantities of information (small is relative, however 50ok sentences are roughly a superb most). To make use of them, we merely plug in a educated (fitted) pipeline after which annotate a plain textual content. We do not even must convert the enter textual content to DataFrame so as to feed it right into a pipeline that is accepting DataFrame as an enter within the first place. This characteristic can be fairly helpful on the subject of getting a prediction for a number of strains of textual content from a educated ML mannequin.

from sparknlp.base import LightPipeline


Listed here are the out there strategies in LightPipeline. As you possibly can see, we are able to additionally use an inventory of strings as enter textual content.



LightPipelines are straightforward to create and likewise prevent from coping with Spark Datasets. They’re additionally very quick and, whereas working solely on the motive force node, they execute parallel computation. Let’s see the way it applies to our case described above:

from sparknlp.base import LightPipeline

lightModel = LightPipeline(pipelineModel, parse_embeddings=True)

%time lightModel.annotate("How did serfdom develop in and then leave Russia ?")

CPU occasions: consumer 12.4 ms, sys: 3.81 ms, whole: 16.3 ms
Wall time: 28.3 ms'sentences': ['How did serfdom develop in and then leave Russia ?'],
 'doc': ['How did serfdom develop in and then leave Russia ?'],
 'regular': ['How',
 'token': ['How',
 'embeddings': ['-0.23769 0.59392 0.58697 -0.041788 -0.86803 -0.0051122 -0.4493 -0.027985, ...]

Now it takes 28 ms! Almost 20x sooner than utilizing Spark ML Pipeline.

As you see, annotate return solely the end result attributes. For the reason that embedding array is saved below embeddings attribute of WordEmbeddingsModel annotator, we set parse_embeddings=True to parse the embedding array. In any other case, we might solely get the tokens attribute from embeddings within the output. For extra details about the attributes talked about, see here.

If we need to retrieve totally data of annotations, we are able to additionally use fullAnnotate() to return a dictionary record of complete annotations content material.

end result = lightModel.fullAnnotate("How did serfdom develop in and  
                                 then go away Russia ?")


fullAnnotate() returns the content material and metadata in Annotation sort. In accordance with documentation, the Annotation sort has the next attributes:

annotatorType: String, 
start: Int, 
finish: Int, 
end result: String, (that is what annotate returns)
metadata: Map[String, String], 
embeddings: Array[Float]

So, if we need to get the start and finish of any sentence, we are able to simply write:

end result[0]['sentences'][0].start
>> zero

end result[0]['sentences'][0].finish
>> 49

end result[0]['sentences'][0].end result
>> 'How did serfdom develop in after which go away Russia ?'

You’ll be able to even get metadata for every token with respect to embeddings.

end result[0]['embeddings'][2].metadata

>> 'isOOV': 'false',
 'pieceId': '-1',
 'isWordStart': 'true',
 'token': 'serfdom',
 'sentence': 'zero'

Sadly, we can’t get something from non-Spark NLP annotators by way of LightPipeline. That’s after we use the Spark ML characteristic like word2vec inside a pipeline together with SparkNLP annotators, after which use LightPipelineannotate solely returns the end result from SparkNLP annotations as there isn’t a end result area popping out of Spark ML fashions. So we are able to say that LightPipeline won’t return something from non-Spark NLP annotators. No less than for now. We plan to write down a Spark NLP wrapper for all of the ML fashions in Spark ML quickly. Then we can use LightPipeline for a machine studying use case through which we prepare a mannequin in Spark NLP after which deploy to get sooner runtime predictions.



Spark NLP LightPipelines are Spark ML pipelines transformed right into a single machine however the multi-threaded job, turning into greater than 10x occasions sooner for smaller quantities of information. On this article, we talked about how one can convert your Spark pipelines into Spark NLP LightPipelines to get a sooner response for small knowledge. This is among the coolest options of Spark NLP. You get to benefit from the energy of Spark whereas processing and coaching, after which get sooner inferences by means of LightPipelines as in the event you try this on a single machine.

We hope that you just already learn the earlier articles on our official Medium page, and began to play with Spark NLP. Listed here are the hyperlinks for the opposite articles. Don’t overlook to observe our web page and keep tuned!

Introduction to Spark NLP: Foundations and Basic Components (Part-I)

Introduction to: Spark NLP: Installation and Getting Started (Part-II)

Spark NLP 101 : Document Assembler

** These articles are additionally being revealed on John Snow Labs’ official blog page.

Bio: Veysel Kocaman is a Senior Knowledge Scientist and ML Engineer having a decade lengthy trade expertise. He’s presently engaged on his PhD in CS at Leiden College (NL) and holds an MS diploma in Operations Analysis from Penn State College.

Original. Reposted with permission.


About the Author

Leave a Reply

Your email address will not be published. Required fields are marked *