How to program in Grimport?

Ask us for help with programming in Grimport

To Go Further

 

Parallelism


What is parallelism?

Parallelism (or parallel programming) is a mode of operation in which a task is executed simultaneously by several execution flows on the same computer. It is a complex subject that demands care when writing code.
Parallelism relies on the concept of multithreading: several threads running the same part of a program simultaneously. A thread is a synchronous executor of part of the code; it corresponds to the path followed during the execution of a program. Threads run asynchronously (in parallel) with respect to each other.

To put it differently: when you run a program, it first executes line 1 of the code, which probably contains a function. The function executes and, during this execution, the program waits. When line 1 is finished, it moves on to line 2. This is the synchronous, procedural model, which executes each line and function one after another in a logical order. Each thread works like this.

But if there are several threads, the first one executes the first line and, again, waits for it to finish before moving on to the second. Meanwhile, the second thread executes line 1 in the same way, and it may even finish that line before thread 1 does (there is no guarantee on how long the different threads take relative to each other). The same code is thus executed by parallel processes. Of course, some variables still differ between the threads, which is what lets them work on different tasks (there is no point in doing the same thing several times in parallel).

 

Parallelism with Grimport

Parallelism with Grimport is a bit simpler and more assisted than with Java.

With Grimport, parallelism lets you crawl CSV files or websites as fast as possible.
When crawling, you can virtually send several crawlers to a website; they open the different pages and execute the different scripts in parallel.

However, parallelism with Grimport has some requirements:

  • Use the synchronize() function

    It is sometimes necessary for some code to run one thread at a time, especially when there is cache management or when a file is written to disk.
    For example, if two threads arrive on different products with the same brand and that brand is not yet in the cache, both will create an identical brand: several brands with the same name will have been created before the cache has been updated.

    To avoid this we use the synchronize() function. We do not need the wait() and notify() functions as in Java.
    This function ensures that certain dependent actions are not executed by different threads at the same time: one thread executes the piece of code while the others wait for it to be done.

  • Variables must be local

    Be careful! All variables must be declared local (def). This lets each thread manage its own variables without mixing information between threads. For example, one thread can handle one brand and a second thread a different brand. If the variable is global, the first thread may end up with the brand of the second.
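    As a minimal sketch of the difference (the selector and the "brand" parameter here are hypothetical, for illustration only):

    //WRONG: without def, "brand" is shared by all threads; another
    //thread may overwrite it between these two lines, so the product
    //can be sent with the brand of a different page
    brand=select(".brand")
    function("update_or_add_product", ["name": select("h1"), "brand": brand])

    //RIGHT: with def, each thread keeps its own copy of "brand"
    def brand=select(".brand")
    function("update_or_add_product", ["name": select("h1"), "brand": brand])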

 

Simple parallelism

If we want to do "simple" parallelism, we use a FORPAGE script. In a FORPAGE script, the several crawling tasks correspond to the threads.
You can configure the number of threads in the Options tab by entering it in "Number of parallel crawling tasks".

Example (FORPAGE Script):

def name=select("h1")
def brand=select(".brand")

def id_brand=null
synchronize("brand", //Only one thread at a time can enter into this block
{-> //The others wait at this level until the current thread has completely exited the block
	if(brand)
	{
		id_brand=getManufacturer(brand)
		if(!id_brand)
		{
			id_brand=functionNow('add_manufacturer',[brand])
			updateManufacturers()
		}
	}
})
function("update_or_add_product",[
	"name": name,
	"id_manufacturer": id_brand
])
send()

In the code above, the first argument of the synchronize() function, "brand", is the lock that guards the synchronization, and the second argument is the closure that will be synchronized.
When the first thread reaches the synchronize(), it checks whether the "brand" lock is held; if it is not, it enters the closure to create the brand.
Meanwhile, the second thread takes care of the second product and waits at the synchronize() until the first thread finishes its execution.
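A sketch of how the lock name can be used (assuming, as the text above implies, that each name designates an independent lock, so blocks synchronized on the same name exclude each other while blocks with different names can run concurrently):

synchronize("brands", {->
	//only one thread at a time in this block
})
synchronize("categories", {->
	//independent of the "brands" lock: one thread can be here
	//while another thread is in the block above
})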

 

Advanced parallelism

It is also possible to parallelize in the initial and final scripts with more complicated functions and supervised crawls.
To do this, we use the functions parallelize() and asynchronous():

  • parallelize() launches threads in parallel that execute a closure on the elements of a list. Instead of specifying the number of threads in the Options tab (as in the previous example), we pass it to the parallelize() function.
  • asynchronous() is similar to parallelize() but takes a different approach: it receives the function calls sent to it and stacks them, dispatching them one after another to a pool of threads.

Example (Initial Script):

products=[ 
	["name":"tshirt","image":"http://website.com/image1.png"], 
	["name":"sandals","image":"http://website.com/image2.png"], 
	["name":"shoes","image":"http://website.com/image3.png"], 
	["name":"coat","image":"http://website.com/image4.png"], 
	["name":"trousers","image":"http://website.com/image5.png"]

]

parallelize(products, 3, {def product->
	def name=get(product, "name")
	def image=get(product, "image")
	function("update_or_add_product",[
		"name": name
	])
	function("add_image", [image])
	send()
})

This function will execute 3 threads at the same time, starting with the first 3 elements of the list. For each one the defined closure will be executed. When one of these threads finishes, the 4th element of the list is launched and so on, so that there are always 3 active threads.

However, the example above would not work, because all the commands run with function() get mixed up. For example, thread 1 creates a product with update_or_add_product, then thread 2 creates another product with update_or_add_product, then thread 1 associates an image with add_image and finally thread 2 associates another image with add_image. These 4 commands are put in the same stack and sent at send() time. The problem is that, with the contextual product id principle, the 2 images will be associated with the product created by thread 2, because they come just after it!

When you want to issue several function() calls in supervised crawling, you must therefore create separate command stacks with the functionStack() function and use sendStack() instead of send().

The previous example becomes:

products=[ 
	["name":"tshirt","image":"http://website.com/image1.png"], 
	["name":"sandals","image":"http://website.com/image2.png"], 
	["name":"shoes","image":"http://website.com/image3.png"], 
	["name":"coat","image":"http://website.com/image4.png"], 
	["name":"trousers","image":"http://website.com/image5.png"]

]

parallelize(products, 3, {def product->
	def stack=[]
	def name=get(product, "name")
	def image=get(product, "image")
	functionStack(stack, "update_or_add_product",[
		"name": name
	])
	functionStack(stack, "add_image", [image])
	sendStack(stack)
})

 

The difference between asynchronous and parallelize

As a general rule, parallelize is recommended when you have a single, already-built list containing a large number of items. parallelize is in fact a substitute for each or for: instead of iterating a list with each, we iterate it with parallelize, and it is usually very intuitive.
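To make the substitution concrete, here is a minimal sketch (the list and the closure body are made up; console() simply prints its argument):

//sequential version with each
["a", "b", "c"].each { def item->
	console(item)
}

//parallel version: same list, same closure, up to 2 items at once
parallelize(["a", "b", "c"], 2, { def item->
	console(item)
})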

asynchronous comes into play in more complex situations: usually when several loops are nested within each other, when it is impossible to determine precisely when the information about the tasks to parallelize will be available, or when the sizes of the sub-elements of a task stack differ greatly.

Let's take this example:

tasks=[ //an array of tasks to do
	["1-1", "1-2", "1-3"],
	["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8", "2-9", "2-10", "2-11"],
]

//each task takes 1 hour!
def operationWhichSpend1Hour(def information)
{
	console(information)
	sleep(1000*3600)
}

//with parallelize
parallelize(tasks, 10, {def informations-> //informations=["1-1", "1-2", "1-3"] at the first iteration
	informations.each
	{ def infos->
		operationWhichSpend1Hour(infos)
	}
})

//with asynchronous
tasks.each
{ def informations->  //informations=["1-1", "1-2", "1-3"] at the first iteration
	informations.each
	{def infos->
		asynchronous(operationWhichSpend1Hour, [infos], 10, "operationWhichSpend1Hour")
	}
}

Here the parallelize solution is not relevant at all. First of all, it is pointless to ask for 10 threads, because the list contains only 2 elements at the first depth. parallelize iterates the elements of the list in parallel, which means it will create 2 threads that each iterate the sub-elements, i.e. the list of 3 strings and the list of 11 strings, in parallel.

The first thread will then take 3 hours and the second 11 hours, so parallelize finishes 11 hours after being launched.

With asynchronous, we iterate the elements of the array and send the tasks to an asynchronous manager. 14 tasks (11+3) are thus sent almost simultaneously. Since the maximum number of threads is fixed at 10, asynchronous takes the first 10 tasks received, which run for 1 hour on 10 parallel threads. As soon as a task finishes, its thread is released and a waiting task is assigned to it. Here they all finish at almost the same time, so after 1 hour the last 4 tasks are assigned to 4 threads and the other 6 threads terminate. These 4 threads finish 1 hour later.

So with this process, asynchronous takes 2 hours while parallelize takes 11 hours.

 

The limits of parallelism

Parallelism can sometimes have limits. Here are its main limitations:

  • The machine on which the program is run limits performances (CPU, RAM, writing to disk, etc.).
  • The internet connection has limitations in terms of throughput or latency in sending requests.
  • The absorption capacity of the different servers. Indeed, when you have to create tens of thousands of products at the same time, the servers can become saturated.
  • The target server from which the information is retrieved does not respond fast enough.

 

 

