There are several parallel programming models. One of the most widely used is the threading model, in which you create threads (lightweight processes) that share memory and resources, such as open files. A thread is an independent flow of execution that runs inside its parent process and shares that process's memory with its creator. So, imagine you need to process 5 files containing call duration records, and you need to sum and join that data. Processing that set of files sequentially is probably slower than writing a parallel program that processes all the files at once. Here is an example written in Python that can help you understand a threaded solution to this problem.
So, with a set of 5 files or more, you can have one or more functions processing the files in parallel, each doing the same task independently without blocking the others. You will probably need to write into a shared data structure such as a dictionary, so you can use locks to avoid race conditions. In the following example, the concurrent function pproc_file() reads and writes the GLOBAL_HASH dictionary from several threads and uses the global lock MAIN_LOCK to guard access to it.
import threading as thr
import os
import csv

GLOBAL_HASH = dict()
MAIN_LOCK = thr.RLock()

def pproc_file(fname):
    """ Process the given file """
    if not os.access(fname, os.R_OK):
        raise Exception("Cannot open %s" % fname)
    try:
        with open(fname) as fdi:
            reader = csv.reader(fdi)
            next(reader)  # skip header
            for cdata in reader:
                if not cdata:
                    continue
                ditm = cdata[1]
                # the whole read-modify-write must happen under the lock,
                # otherwise two threads could overwrite each other's sum
                with MAIN_LOCK:
                    if ditm in GLOBAL_HASH:
                        GLOBAL_HASH[ditm] += int(cdata[2])
                    else:
                        GLOBAL_HASH[ditm] = int(cdata[2])
    except Exception as exc:
        print(repr(exc))
Then each file is handed to this function in its own thread. The threads work almost independently, so no file has to wait for the previous one to finish as it would with sequential processing, and the overall job finishes faster.
def main():
    """ Main Program """
    pfiles = ['phone-0.csv', 'phone-1.csv', 'phone-2.csv',
              'phone-3.csv', 'phone-4.csv']
    thrl = list()
    for myf in pfiles:
        sthr = thr.Thread(target=pproc_file, args=[myf])
        sthr.start()
        thrl.append(sthr)
    for sthr in thrl:
        sthr.join()

if __name__ == '__main__':
    main()
Finally, the main function launches one thread per file to be processed, and then the main thread calls the join() method of each worker, blocking until all of the created threads have finished. Without the join() calls, main() would return before the workers were done (and daemon threads would be terminated as soon as the program exited), so it is better to wait until every file has been read entirely. On Linux, POSIX threads are created with the clone() system call: unlike fork(), which copies the parent process, clone() lets the new task share the memory space of its creator, so a thread is internally a kernel task sharing the address space of its parent process rather than a separate lightweight sub-process.
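To make the effect of join() concrete, here is a minimal, self-contained sketch; the worker function and its sleep are hypothetical stand-ins for real file processing:

import threading
import time

def worker(num):
    """ Stand-in for a real task such as parsing one CSV file """
    time.sleep(0.5)  # simulate slow I/O
    print("worker %d finished" % num)

threads = []
for i in range(3):
    t = threading.Thread(target=worker, args=[i])
    # if these were daemon threads (daemon=True), they would be
    # killed as soon as the main thread exited; the join() calls
    # below guarantee they run to completion first
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # block until each worker is done
print("all workers joined")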
Hi, I’m thinking about a simple parallel parser for metadata in Haskell, where you run n threads over the same set of data, each thread generating metadata to be stored.
Could you please recommend a library to manage these threads?
Thanks in advance.
The Par monad (from the monad-par package) can give you enough control.
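For instance, with Control.Monad.Par you can map the same pure analysis over every piece of input and collect the results deterministically. This is just a minimal sketch; summarize is a hypothetical stand-in for your metadata generator:

import Control.Monad.Par (runPar, parMap)

-- hypothetical metadata: word count and character count of one chunk
summarize :: String -> (Int, Int)
summarize chunk = (length (words chunk), length chunk)

-- run the same analysis over every chunk in parallel;
-- runPar is deterministic, so the result is a plain pure value
parseAll :: [String] -> [(Int, Int)]
parseAll chunks = runPar (parMap summarize chunks)

main :: IO ()
main = print (parseAll ["first chunk of data", "second chunk", "third"])

Remember to compile with -threaded and run with +RTS -N so the runtime can actually spread the work over multiple cores.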