How to program in Grimport?


To Go Further

 

Parallelism


What is parallelism?

Parallelism (or parallel programming) is a mode of operation where a task is executed simultaneously by several processes of the same computer. It is a complex subject that requires a lot of attention when writing code.
Parallelism is built on the concept of multithreading: having several threads running the same part of a program simultaneously. A thread is a synchronous executor of part of the code; it corresponds to the path followed during the execution of a program. Threads run asynchronously (in parallel) with respect to each other.

To put it differently: when you run a program, it first executes line 1 of the code, which probably contains a function call. While that function executes, the program is paused; it waits for the function to finish, and only when line 1 is done does it move on to line 2. This is the synchronous, procedural model, which executes each line and each function one after another in a logical order. Each individual thread works like this.

But when there are several threads, the first one executes the first line and, again, waits for it to finish before moving on to the second. Meanwhile, a second thread executes line 1 in the same way, and it may even finish that line before thread 1 does: there is no guarantee on the relative timing of the different threads. The same code is thus executed by parallel processes. Of course, some variables still differ between threads, and this is what allows them to work on different tasks (there would be no point in doing the same thing several times in parallel).
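To make this concrete, here is a minimal sketch in plain Java (not Grimport code; the class and names are hypothetical) showing two threads executing the very same lines of code, each synchronously within itself but asynchronously with respect to the other:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TwoThreads {
    // Runs the same Runnable body on two parallel threads and returns
    // how many threads completed it.
    static int runTwice() throws InterruptedException {
        AtomicInteger finished = new AtomicInteger(0);
        Runnable work = () -> {
            // Within one thread, these lines run synchronously, top to bottom.
            String me = Thread.currentThread().getName();
            System.out.println(me + ": line 1");
            System.out.println(me + ": line 2");
            finished.incrementAndGet();
        };
        // Both threads run the SAME code, asynchronously with respect to
        // each other: either one may finish a given line first.
        Thread t1 = new Thread(work, "thread-1");
        Thread t2 = new Thread(work, "thread-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return finished.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads completed: " + runTwice());
    }
}
```

Running this several times can print the four lines in different orders, which is exactly the "no guarantee on relative timing" described above.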

 

Parallelism with Grimport

Parallelism with Grimport is a bit simpler and more assisted than with Java.

With Grimport, parallelism lets you crawl CSV files or websites as fast as possible.
When crawling, you can virtually send several crawlers to a website; they open the different pages and execute the different scripts in parallel.

However, parallelism with Grimport imposes a few requirements:

  • Use the synchronize() function

    It is sometimes necessary for a piece of code to run one thread at a time, especially when a cache is managed or a file is written to disk.
    For example, if two threads each arrive on a different product with the same brand, and that brand is not yet in the cache, both will create an identical brand: the cache is only updated after the creation, so several brands with the same name will have been created before the cache was updated.

    To avoid this we use the synchronize() function. The wait() and notify() functions from Java are not needed.
    synchronize() prevents different threads from executing certain dependent actions at the same time: one thread executes a piece of code while the others wait for it to finish.

  • Variables must be local

    Be careful! Every variable declaration must begin with def, making the variable local. This lets each thread manage its own variables and prevents information from being mixed between threads. For example, one thread can handle one brand while a second thread handles a different brand. If the variable were global, the first thread would end up with the brand of the second.
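The guarantee that synchronize() provides can be sketched in plain Java with a synchronized block (a hedged analogue, not Grimport code; the class, method, and lock names are hypothetical): only one thread at a time may check the cache and create a missing brand, so no duplicate is ever created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class BrandCache {
    private final Map<String, Integer> cache = new HashMap<>();
    private final AtomicInteger created = new AtomicInteger(0);
    private final Object brandLock = new Object(); // plays the role of the "brand" lock key

    // Only one thread at a time may check-then-create, so even if two
    // threads ask for the same missing brand, it is created only once.
    int getOrCreate(String brand) {
        synchronized (brandLock) {
            Integer id = cache.get(brand);
            if (id == null) {
                id = created.incrementAndGet(); // stands in for add_manufacturer
                cache.put(brand, id);           // cache updated before the lock is released
            }
            return id;
        }
    }

    int createdCount() { return created.get(); }
}
```

Without the synchronized block, two threads could both see a null cache entry and both create the brand, which is the duplicate-creation race described above.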

 

Simple parallelism

If we want to do "simple" parallelism, we use the FORPAGE script. In a FORPAGE script, the parallel crawling tasks correspond to the threads.
You can configure the number of threads in the Options tab, by entering the number in "Number of parallel crawling tasks".

Example (FORPAGE Script):

def name=select("h1")
def brand=select(".brand")

def id_brand=null
synchronize("brand", //Only one thread at a time can enter into this block
	{-> //The others wait at this level until the current thread has completely exited the block
		if(brand) {
			id_brand=getManufacturer(brand)
			if(!id_brand) {
				id_brand=functionNow('add_manufacturer',[brand])
				updateManufacturers()
			}
		}
	})

function("update_or_add_product",[
	"name": name,
	"id_manufacturer": id_brand
])
send()

In the code above, the first argument of the synchronize() function, "brand", is the key that identifies the lock, and the second argument is the closure whose execution is synchronized.
When the first thread reaches synchronize(), it checks whether the "brand" lock is held; if it is not, it takes the lock and enters the closure to create the brand.
Meanwhile, the second thread takes care of the second product, and waits at the synchronize() until the first thread finishes its execution.

 

Advanced parallelism

It is also possible to parallelize in the initial and final scripts with more complicated functions and supervised crawls.
To do this, we use the functions parallelize() and asynchronous():

  • parallelize() launches threads in parallel that each execute a closure on elements of a list. Instead of specifying the number of threads in the Options tab (as in the previous example), you pass it directly to the parallelize() function.
  • asynchronous() is a bit like parallelize() but with a different approach: it takes the functions sent to it and queues them, executing them one after the other.
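As a rough analogue in plain Java (a sketch under the assumption that Grimport's exact semantics may differ), parallelize(list, 3, closure) behaves like a fixed pool of 3 workers consuming a list, while asynchronous() behaves like a single-threaded executor that queues tasks and runs them one after the other:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    // parallelize(items, 3, closure) analogue: at most 3 elements
    // are processed at once; a worker takes the next element as soon
    // as it finishes the previous one.
    static int parallelizeLike(List<String> items) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger(0);
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (String item : items) {
            pool.submit(() -> processed.incrementAndGet()); // the "closure"
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    // asynchronous() analogue: tasks are queued and executed strictly
    // one after the other on a single worker thread.
    static int asynchronousLike(List<String> items) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger(0);
        ExecutorService queue = Executors.newSingleThreadExecutor();
        for (String item : items) {
            queue.submit(() -> processed.incrementAndGet());
        }
        queue.shutdown();
        queue.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }
}
```

Both approaches process every element; the difference is how many run concurrently at any moment.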

Example (Initial Script):

products=[ 
	["name":"tshirt","image":"http://website.com/image1.png"], 
	["name":"sandals","image":"http://website.com/image2.png"], 
	["name":"shoes","image":"http://website.com/image3.png"], 
	["name":"coat","image":"http://website.com/image4.png"], 
	["name":"trousers","image":"http://website.com/image5.png"]

]

parallelize(products, 3, {def product->
	def name=get(product, "name")
	def image=get(product, "image")
	function("update_or_add_product",[
		"name": name
	])
	function("add_image", [image])
	send()
})

This function will execute 3 threads at the same time, starting with the first 3 elements of the list. For each one the defined closure will be executed. When one of these threads finishes, the 4th element of the list is launched and so on, so that there are always 3 active threads.

However, the above example would not work, because all the commands issued with function() get mixed up. For example, thread 1 creates a product with update_or_add_product, then thread 2 creates another product with update_or_add_product, then thread 1 associates an image with add_image, and finally thread 2 associates another image with add_image. These 4 commands are placed in the same stack and sent at send() time. The problem is that, with the contextual product id principle, both images end up associated with the product created by thread 2, because they come just after it!

When you want to use several function() calls in supervised crawling, you must therefore create a separate command stack per thread with the functionStack() function and then use sendStack() instead of send().

The previous example becomes:

products=[ 
	["name":"tshirt","image":"http://website.com/image1.png"],
	["name":"sandals","image":"http://website.com/image2.png"],
	["name":"shoes","image":"http://website.com/image3.png"],
	["name":"coat","image":"http://website.com/image4.png"],
	["name":"trousers","image":"http://website.com/image5.png"]

]

parallelize(products, 3, {def product->
	def stack=[]
	def name=get(product, "name")
	def image=get(product, "image")
	functionStack(stack, "update_or_add_product",[
		"name": name
	])
	functionStack(stack, "add_image", [image])
	sendStack(stack)
})
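The idea behind functionStack()/sendStack() can be sketched in plain Java (a hedged analogue, not Grimport code; the class and method names are hypothetical): each thread appends its commands to its own private list and only then publishes the whole list, so a product's commands can never be interleaved with another thread's.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CommandStacks {
    // Each worker thread emits two commands for its own product.
    // Because each thread fills a PRIVATE stack and publishes it in
    // one step (the sendStack analogue), the commands stay grouped
    // no matter how the threads interleave.
    static List<List<String>> perThreadStacks(int threads) throws InterruptedException {
        List<List<String>> sent = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 1; i <= threads; i++) {
            final int id = i;
            workers.add(new Thread(() -> {
                List<String> stack = new ArrayList<>();        // functionStack target
                stack.add("update_or_add_product p" + id);
                stack.add("add_image p" + id);
                sent.add(stack);                                // sendStack(stack)
            }));
        }
        for (Thread t : workers) t.start();
        for (Thread t : workers) t.join();
        return sent;
    }
}
```

With a single shared list instead, the four commands could arrive in any order, which is exactly the image-on-the-wrong-product bug described above.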

The limits of parallelism

Parallelism can sometimes reach its limits. Here are the 4 main limitations of parallelism:

  • The machine on which the program runs limits performance (CPU, RAM, disk writes, etc.).
  • The internet connection limits throughput and adds latency when sending requests.
  • The absorption capacity of the destination servers: when tens of thousands of products have to be created at the same time, the servers can become saturated.
  • The target server from which the information is retrieved may not respond fast enough.

 

 

