distributeMultiServers

distributeMultiServers ( mixed listOrElement , array[string] serverNameList , string _id , bool[default:false] _iterativeMode , map[string=>float] _serverCoefficients ) : void

Distribute data across multiple servers so that one process can be split across several machines.
Note: when sharing information between servers, try to minimise the size of the shared data (e.g. do not share a whole page source). Overly large requests can slow down the process considerably.
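A minimal sketch of this advice (using the same functions as the examples below): distribute lightweight identifiers such as URLs, and let each server download its own pages.

//Avoid: building the full page sources on one machine and distributing them (heavy requests)
//Prefer: distributing only the URLs; each server calls getPage on its own share
distributeMultiServers(["https://www.site.com/page1","https://www.site.com/page2"], ["master", "slave1", "slave2"])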

Example

serverName = "master" //the server name; in this example we have 3 machines: master, slave1 and slave2
//This script is run on all 3 servers; only the server name is changed to identify each machine.

if(equals(serverName, "master")) //Only the master server tells the others what data they have to process
{
urlsToCrawls=["https://www.site.com/page1","https://www.site.com/page2","https://www.site.com/page3","https://www.site.com/page4","https://www.site.com/page5","https://www.site.com/page6","https://www.site.com/page7","https://www.site.com/page8","https://www.site.com/page9"]

distributeMultiServers(urlsToCrawls, ["master", "slave1", "slave2"])
}


//for all servers
//We retrieve the list of links (in practice one third of the original list); each server has a different part to process
serverUrlsToCrawls = waitAndgetMultiServerData(serverName)
serverUrlsToCrawls.each
{ def url->
code = getPage(url)
csv(path("desktop")+"extraction_"+serverName+".csv", ["Title": cleanSelect("h1", code)])
}


Example for iterative mode

INITIAL script:
serverName="master"  //the server name; in this example we have 3 machines: master, slave1 and slave2
setGlobal("serverName")

//for all slaves
if(contains(serverName, "slave"))
{
while(true)
{
finished=getSharedOnlineConfiguration("finished")
urlToCrawl = waitAndgetMultiServerData(serverName, null, true)
if(urlToCrawl)
{
code = getPage(urlToCrawl)
csv(path("desktop")+"extraction_"+serverName+".csv", ["Title": cleanSelect("h1", code)])
}

//If the master has finished and there are no more links to crawl for this slave, we end the script on the slave machine (we could also simply have exited the loop and let the script finish)
if(finished && !urlToCrawl) stopAndClose()
}
}

FORPAGE script:
//the master collects some links
if(equals(serverName, "master"))
{
interestingLink=cleanSelect("a.interesting", null, "href")
if(interestingLink) distributeMultiServers(interestingLink, ["slave1", "slave2"], "crawlThat_8624455", true)
}

FINAL script:
if(equals(serverName, "master"))
{
setSharedOnlineConfiguration("finished", true)
}


See also

getMultiServerData
clearMultiServerData
setSharedOnlineConfiguration
getSharedOnlineConfiguration
getCloudIdServer

Parameters

listOrElement

List of elements to be distributed between the different servers. In iterative mode, pass a single element here instead.

serverNameList

The list of names of the machines that need data. The data of listOrElement will be distributed between these machines.

_id (optional)

An ID for your distribution. If no ID is set, the unique script ID is used. It will be used in getMultiServerData to retrieve your data. This ID must be unique in our database, so we recommend including a random part, e.g. CSV_amazon_63258963. This parameter is useful when you share information between several scripts.
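A hedged sketch of using an explicit ID (assuming getMultiServerData accepts the ID as its second argument, like waitAndgetMultiServerData in the iterative example): the same ID lets a different script retrieve the distributed data.

//Script A: distribute under an explicit ID (the random suffix avoids collisions in the database)
distributeMultiServers(urlsToCrawls, ["master", "slave1", "slave2"], "CSV_amazon_63258963")

//Script B: retrieve the data distributed under that same ID
serverUrlsToCrawls = getMultiServerData(serverName, "CSV_amazon_63258963")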

_iterativeMode (optional)

If false, set a list of elements in listOrElement; this list will be split into sub-lists and each sub-list will be assigned to a server.
If true, set a single element to distribute. This element will be randomly assigned to a server. This is practical when you do not have a pre-established list and need to add elements on the fly.

_serverCoefficients (optional)

Weighting coefficients for servers. For example, if serverNameList is ["server1", "server2"] and you set _serverCoefficients to ["server1":0.1, "server2":0.9], 90% of the elements are assigned to server2 and 10% to server1.
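A minimal sketch of a weighted distribution (null is assumed here to fall back to the default script ID, as described for _id above): with the coefficients below, roughly 90% of urlsToCrawls goes to server2.

//server2 is a more powerful machine, so it receives 9 elements out of 10
distributeMultiServers(urlsToCrawls, ["server1", "server2"], null, false, ["server1":0.1, "server2":0.9])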